Parsing a big string in Ruby


I have a file of a few hundred megabytes containing strings:

str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2

where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.

Each line ends with "\n". I have a list of strings, say str2 and str4.

I want to find the corresponding numbers for those strings.

What I'm doing is pretty straightforward (and, probably, not efficient performance-wise):

source_str = read_from_file() # source_str contains all file content of a few hundred Megabyte
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
  index = source_str.index(x)
  if index
    a = source_str[index, x.length] # a contains "str2"

    #?? how do I "select" xx1 and xx2 ??


    # and finally...
    # res << num1
    # res << num2
  end
end

Note that I can't use source_str.split("\n") because it raises ArgumentError: invalid byte sequence in UTF-8, and I can't fix that by modifying the file. The file can't be changed.


There are 2 answers

the Tin Man

If you want to find a line in a text file, which it sounds like you are reading, then read the file line-by-line.

The IO class has the foreach method, which reads a file line by line and makes it easy to pick out the lines that contain the particular string you want to find.

If you had your source input file saved as "foo.txt", you could read it using something like:

str2 = 'some value'
str4 = 'some other value'
numbers = []
File.foreach('foo.txt') do |li|
  numbers.concat(li.split[1, 2]) if li[str2] || li[str4] # keep both numbers from a matching line
end

At the end of the loop numbers should contain the numbers you want.

You say you're getting an encoding error, but you don't tell us which characters are causing it. Without that information we can't really help you fix the problem, except to say that you need to tell Ruby what the file's encoding is. You can do that when the file is opened, by setting the open_args to whatever the encoding should be. Odds are good it is ISO-8859-1 or Windows-1252, since those are very common on Windows machines.
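
As a rough illustration of declaring the encoding at open time, here is a minimal sketch. It assumes the file is ISO-8859-1, which is only a guess about your data; numbers_for and the demo file are made up for the example. File.foreach passes the encoding: option through to the underlying open call.

```ruby
require 'tempfile'

# Hypothetical helper (not from the question): returns the two numbers that
# follow each wanted key, reading the file in the given encoding.
def numbers_for(path, keys, encoding: 'ISO-8859-1')
  numbers = []
  # Assumption: the file really is ISO-8859-1. Every byte is valid in that
  # encoding, so split no longer raises "invalid byte sequence in UTF-8".
  File.foreach(path, encoding: encoding) do |line|
    key, *nums = line.split
    numbers.concat(nums.first(2)) if keys.include?(key)
  end
  numbers
end

# Demo data: the first line contains the byte 0xE9, which is not valid UTF-8 here.
demo = Tempfile.new('foo')
demo.binmode
demo.write("caf\xE9 1 2\nstr2 10 20\nstr4 1000 2000\n".b)
demo.close

p numbers_for(demo.path, %w[str2 str4])
```

If the real encoding turns out to be something else, only the encoding: argument should need to change.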


"I have to find a list of values, iterating through each line doesn't seem sensible because I'd have to iterate for each value over and over again."

We can only work with the examples you give us. Since that wasn't clearly explained in your question you got an answer based on what was initially said.

Ruby's Regexp has the tools necessary to make this work, but doing it well means taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in ruby?" for more information.
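
For a feel of what a single regexp pass looks like, here is a minimal sketch. It uses Ruby's built-in Regexp.union in place of Regexp::Assemble, so the pattern is a plain alternation, not an optimized one. scan_numbers is a made-up name, and the sketch assumes the keys are plain ASCII.

```ruby
# Hypothetical helper: finds the numbers for all keys in one pass over a
# string that is already in memory.
def scan_numbers(source, keys)
  # Regexp.union escapes each key and joins them with "|". This is a plain
  # alternation, not the optimized pattern Regexp::Assemble would produce.
  pattern = /^#{Regexp.union(keys)}[ \t]+(\S+)[ \t]+(\S+)/
  # String#b returns a binary (ASCII-8BIT) copy, so stray non-UTF-8 bytes do
  # not raise. Assumption: the keys themselves are plain ASCII.
  source.b.scan(pattern).flatten
end

text = "str1 1 2\nstr2 10 20\nstr3 100 200\nstr4 1000 2000\n"
p scan_numbers(text, %w[str2 str4])
```

Note that source.b copies the whole string; reading the file with File.binread in the first place would avoid that extra copy.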

Note that this will allow you to scan through a huge string in memory; however, that is still not a good way to process what you are talking about. I'd use a database instead, since databases are designed for this sort of task.

Uri Agassi

You want to avoid reading hundreds of megabytes into memory, as well as scanning them repeatedly. This has the potential of taking forever, while clogging the machine's available memory.

Try to reframe the problem so that you can treat the large input file as a stream. Instead of asking, for each string you want to find, "does it exist in my file?", try asking, for each line in the file, "does it contain a string I am looking for?".

str_to_find = [str2, str4]
numbers = []
File.foreach('foo.txt') do |li|
  key, *nums = li.split # first column is the key, the rest are the numbers
  numbers.concat(nums) if str_to_find.include?(key)
end

Also, read @theTinMan's answer again regarding the file encoding. What he is suggesting is that you may be able to fine-tune the reading of the file to avoid the error, without changing the file itself.

If you have a very large number of items in str_to_find, I'd suggest that you use a Set instead of an Array for better performance:

require 'set' # provides to_set
str_to_find = [str1, str2, ... str5000].to_set
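
Putting the streaming loop and the Set together might look like the sketch below. lookup is a made-up name, reading in binary mode is just one way to sidestep the UTF-8 error, and the early exit is an optional extra that is not part of the answer above.

```ruby
require 'set'

# Hypothetical helper: maps each wanted key to the two numbers on its line.
def lookup(path, wanted)
  wanted = wanted.to_set # hash-based membership test for every line
  found = {}
  # mode: 'rb' reads raw bytes, so invalid UTF-8 sequences do not raise.
  File.foreach(path, mode: 'rb') do |line|
    key, *nums = line.split
    next unless wanted.include?(key)
    found[key] = nums.first(2)
    break if found.size == wanted.size # optional: stop once every key is seen
  end
  found
end
```

Called as lookup('foo.txt', [str2, str4]), it returns a hash keyed by the strings that were found, so each pair of numbers stays attached to its string.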