Ruby custom split function slow


I have a large file of mostly space-delimited data I want to parse into a hash. The problem is that it's only *mostly* space-delimited: bracketed groups contain spaces, so a simple String#split isn't going to work.

Here's a simplified example of one of the lines in the file:

field0 field1 [ [field2a] [field2b] ] field3

Everything contained by the outer brackets (including the outer brackets themselves) needs to end up as a single hash member.

I wrote the following function, which works, but is very slow:

# row = String to be split
# fields = Integer indicating expected number of fields
def mysplit (row, fields)

 # Variable to keep track of brackets
 b = 0

 # Variable to keep track of iterations for array index
 i = 0

 rowsplit = Array.new(fields)
 rowsplit[0] = ""
 row.each_char do |byte|

  case byte

   when ' '
    if b == 0
     i += 1
     rowsplit[i] = ""
    else
     rowsplit[i] += byte
    end

   when '['
    b += 1
    rowsplit[i] += byte

   when ']'
    b -= 1
    rowsplit[i] += byte

   else
    rowsplit[i] += byte

  end

 end

 if i != fields - 1
  raise StandardError,
   "Resulting fields do not match expected fields: #{rowsplit}",
   caller
 elsif b != 0
  raise StandardError, "Bracket never closed.", caller
 else
  return rowsplit
 end

end

It takes 36 seconds to run this on a 7 MB, 6,600-line file. It's worth mentioning that my environment runs Ruby 1.8.7, which I have no control over.

Is it possible to make this faster?


There are 2 answers

lacostenycoder

You want .squeeze and .strip

str = "  field0     field1 [ [field2a] [field2b] ] field3"

puts str.squeeze(' ').strip
#=> "field0 field1 [ [field2a] [field2b] ] field3"

squeeze(' ') compresses any run of spaces down to a single space (with no argument, squeeze compresses runs of every repeated character, which would mangle your data). strip removes leading and trailing whitespace from the line.
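A quick illustration of the difference the argument makes:

```ruby
s = "  aa    bb  "
s.squeeze             #=> " a b "   (every repeated character is collapsed)
s.squeeze(' ')        #=> " aa bb " (only runs of spaces are collapsed)
s.squeeze(' ').strip  #=> "aa bb"
```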

From there you should be able to use regex pattern matching to parse each line into the data structure you're trying to create, but I can't help with that without knowing how to parse the data.

You should also try to raise errors sooner; there is no need to iterate over the entire file before discovering a malformed line.

If you know each line will match the pattern in your example:

str.squeeze!(' ')
str.strip!
unless str =~ /\w+ \w+ \[ (?:\[\w+\] )+\] \w+/
  raise StandardError, "String pattern is wrong: #{str}"
end

(Note that the bang methods squeeze! and strip! return nil when the string is unchanged, so don't chain them.)

If the line is valid, you can split as usual:

str.split(' ')
#=>["field0", "field1", "[", "[field2a]", "[field2b]", "]", "field3"] 
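As a sketch of the regex approach (assuming, as in the sample line, that brackets nest at most one level deep), String#scan with an alternation can do the bracket-aware split in a single pass; bracket_split is a made-up name for the example:

```ruby
# Split a line into fields, keeping a balanced outer bracket group
# (at most one level of nesting) together as a single field.
def bracket_split(row)
  # Try a balanced bracket group first, otherwise grab a run of non-spaces.
  row.scan(/\[(?:[^\[\]]|\[[^\[\]]*\])*\]|\S+/)
end

p bracket_split("field0 field1 [ [field2a] [field2b] ] field3")
#=> ["field0", "field1", "[ [field2a] [field2b] ]", "field3"]
```

Because scan walks the string once instead of rebuilding per-character strings, it should be considerably faster than the hand-rolled loop.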
Meier

To really tune your code, you can use the Benchmark module from the standard library to find the bottleneck.

But I expect that the biggest problem in your code is this string concatenation:

rowsplit[i] += byte

which the Ruby interpreter expands to

rowsplit[i] = rowsplit[i] + byte

This creates a new string object for each byte in your input file, so a 7 MB file creates and destroys roughly seven million string objects. You will probably get fast enough by switching to the in-place append method:

rowsplit[i] << byte

Beware that << changes the original object. That is not a problem in your program, but it may be when you use it in other contexts.
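A minimal illustration of that difference (the variable names are made up for the example):

```ruby
a = "ab"
b = a
b << "c"   # appends in place, mutating the object both variables share
# a == "abc", b == "abc"

a = "ab"
b = a
b += "c"   # allocates a brand-new string; a is untouched
# a == "ab", b == "abc"
```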