I have a file of a few hundred megabytes containing strings:
str1 x1 x2\n
str2 xx1 xx2\n
str3 xxx1 xxx2\n
str4 xxxx1 xxxx2\n
str5 xxxxx1 xxxxx2
where x1 and x2 are some numbers. How big the numbers x(...x)1 and x(...x)2 are is unknown.
Each line has a "\n" in it. I have a list of strings, str2 and str4.
I want to find the corresponding numbers for those strings.
What I'm doing is pretty straightforward (and, probably, not efficient performance-wise):
source_str = read_from_file() # source_str contains the whole file, a few hundred megabytes
str_to_find = [str2, str4]
res = []
str_to_find.each do |x|
  index = source_str.index(x)
  if index
    a = source_str[index, x.length] # a contains "str2"
    #?? how do I "select" xx1 and xx2 ??
    # and finally...
    # res << num1
    # res << num2
  end
end
Note that I can't apply source_str.split("\n") due to the error ArgumentError: invalid byte sequence in UTF-8, and I can't fix it by changing the file in any way. The file can't be changed.
If you want to find a line in a text file, which it sounds like you are reading, then read the file line by line.
The IO class has the foreach method, which makes it easy to read a file line by line and to locate the lines that contain the particular string you want to find.
If your source input file were saved as "foo.txt", you could read it with File.foreach, check each line against the strings you are looking for, and append the numbers from the matching lines to an array named numbers.
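A minimal sketch of such a loop follows. The helper name find_numbers is hypothetical, and the sketch assumes every line has the form "key num1 num2" separated by whitespace. It also opens the file in binary mode, which is my assumption rather than part of the answer, so that bytes that are invalid in UTF-8 cannot raise while splitting.

```ruby
require 'set'

# Hypothetical helper: collects num1 and num2 for every line whose first
# field is one of the wanted strings. Numbers are kept as strings because
# their size is unknown.
def find_numbers(path, str_to_find)
  # Compare as binary strings so keys and file fields share one encoding.
  wanted = str_to_find.map(&:b).to_set
  numbers = []
  # mode 'rb' reads raw bytes; no UTF-8 validation happens on split.
  File.foreach(path, mode: 'rb') do |line|
    key, num1, num2 = line.split(' ')
    numbers << num1 << num2 if wanted.include?(key)
  end
  numbers
end
```

Called as find_numbers('foo.txt', ['str2', 'str4']), this reads one line at a time, so the whole file never has to sit in memory.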
At the end of the loop, numbers should contain the numbers you want.
You say you're getting an encoding error, but you don't give us any clue which characters are causing it. Without that information we can't really help you fix the problem, except to say that you need to tell Ruby what the file's encoding is. You can do that when the file is opened: set the encoding in open_args to whatever it should be. Odds are good it should be ISO-8859-1 or Windows-1252, since those are very common on Windows machines.
We can only work with the examples you give us. Since that wasn't clearly explained in your question, you got an answer based on what was initially said.
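To illustrate the point about declaring the encoding when the file is opened, here is a small sketch. The helper name fields_for is hypothetical, and ISO-8859-1 is only a guess at the real encoding; the "ISO-8859-1:UTF-8" form asks Ruby to decode the file as ISO-8859-1 and hand back UTF-8 strings.

```ruby
# Hypothetical helper: returns the fields that follow +key+ on the first
# line starting with it, or nil if no line matches.
# The default encoding is an assumption; replace it with the real one.
def fields_for(path, key, encoding: 'ISO-8859-1')
  # "external:internal" tells Ruby how to decode the bytes on disk and
  # which encoding the resulting strings should have.
  File.foreach(path, encoding: "#{encoding}:UTF-8") do |line|
    fields = line.split(' ')
    return fields.drop(1) if fields.first == key
  end
  nil
end
```

With the encoding declared, line.split no longer sees invalid UTF-8, so the ArgumentError from the question should not occur as long as the guess matches the file.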
Ruby's Regexp has the tools necessary to make this work, but doing it correctly requires taking advantage of Perl's Regexp::Assemble library, since Ruby has nothing close to it. See "Is there an efficient way to perform hundreds of text substitutions in ruby?" for more information.
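As a rough stand-in, and not the optimized pattern that Regexp::Assemble would generate, Ruby's built-in Regexp.union can join the search strings into a plain alternation. The sketch below uses a hypothetical helper, scan_numbers, and assumes ASCII-only search strings, "\n" line endings, and lines of the form "key num1 num2".

```ruby
# Hypothetical helper: scans an in-memory string once and returns the two
# numbers (as strings) from every line that starts with a wanted key.
def scan_numbers(source_str, str_to_find)
  # Plain alternation of the escaped keys, e.g. /str2|str4/.
  keys = Regexp.union(str_to_find)
  # ^ and $ anchor at line boundaries in Ruby, so only whole lines match.
  pattern = /^(?:#{keys})[ \t]+(\S+)[ \t]+(\S+)$/
  # Treat the input as raw bytes so invalid UTF-8 cannot raise. String#b
  # copies the string; File.binread returns binary data directly.
  source_str = source_str.b unless source_str.encoding == Encoding::BINARY
  source_str.scan(pattern).flatten
end
```

For example, scan_numbers(File.binread('foo.txt'), ['str2', 'str4']) makes a single pass over the string instead of one String#index call per search string.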
Note that this will allow you to scan through a huge string in memory; however, that is still not a good way to process what you are talking about. I'd use a database instead, since databases are designed for this sort of task.