Regex causing high CPU load, making Rails unresponsive


I have a Ruby 1.8.7 script to parse iOS localization files:

singleline_comment = /\/\/(.*)$/
multiline_comment = /\/\*(.*?)\*\//m
string_line = /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;\s*/xm

out = decoded_src.scan(/(?:#{singleline_comment}|#{multiline_comment})?\s*?#{string_line}/)

It used to work fine, but today we tested it with an 800 KB file that doesn't have a ; at the end of each line. The result was high CPU load and no response from the Rails server. My assumption is that the regex matched the whole file as a single string in a capturing group, and that blocked the server.

The solution was to add ? (the regex quantifier for 0 or 1 occurrences) after the literal ; character:

  /\s*"(.*?)"\s*=\s*"(.*?)"\s*\;?\s*/xm

Now it works fine again, even with files in the old iOS format, but my fear now is: what if a user submits a malformed file, like one with no closing "? Will my server get blocked again?

And how do I prevent this? Is there any way to limit it to running for, say, five seconds? What can I do to avoid halting my whole Rails application?

1 Answer

the Tin Man

It looks like you're trying to parse an entire configuration as if it were a single string. While that is doable, it's error-prone. Regular expression engines have to do a lot of looking forward and backward, and poorly written patterns can waste a huge amount of CPU time. Sometimes a minor tweak will fix the problem, but the more text being processed, and the more complex the expression, the greater the chance of something going wrong.

From benchmarking different ways of getting at data for my own work, I've learned that anchoring regexp patterns can make a huge difference in speed. If you can't anchor a pattern somehow, then you are going to suffer from the backtracking and greediness of patterns unless you can limit what the engine wants to do by default.

I have to parse a lot of device configurations, but instead of trying to treat them as a single string, I break them down into logical blocks consisting of arrays of lines, and then I can provide logic to extract data from those blocks based on knowledge that blocks contain certain types of information. Small blocks are faster to search, and it's a lot easier to write patterns that can be anchored, providing huge speedups.

Also, don't hesitate to use Ruby's String methods, like split to tear apart lines, and sub-string matching to find lines containing what you want. They're very fast and less likely to induce slowdowns.
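As a sketch of that idea, here is one way a single line of a .strings file could be torn apart with plain String methods instead of one big regexp (the sample line and variable names are mine, not from the question):

```ruby
# Hypothetical line from a .strings file:
line = '"greeting" = "Hello";'

if line.include?('=')                  # fast sub-string test, no regexp
  key, value = line.split('=', 2)      # split only on the first '='
  key   = key.strip.delete('"')        # strip whitespace and quotes
  value = value.strip.chomp(';').delete('"')  # also drop a trailing ';'
end

key   # => "greeting"
value # => "Hello"
```

Each step is a simple, fast String operation, and none of them can backtrack.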

If I had a string like:

config = "name:\n foo\ntype:\n thingie\nlast update:\n tomorrow\n"
chunks = config.split("\n").slice_before(/^\w/).to_a
# => [["name:", " foo"], ["type:", " thingie"], ["last update:", " tomorrow"]]

command_blocks = chunks.map{ |k, v| [k[0..-2], v.strip] }.to_h

command_blocks['name'] # => "foo"
command_blocks['last update'] # => "tomorrow"

slice_before is a very useful method for this sort of task as it lets us define a pattern that is then used to test for breaks in the master array, and group by those. The Enumerable module has lots of useful methods in it, so be sure to look through it.

The same data could be parsed in other ways, too.

Of course, without sample data for what you're trying to do it's difficult to suggest something that works better, but the idea is, break down your input into small manageable chunks and go from there.

One more comment, on how you're defining your patterns.

Instead of using /\/.../ (which leads to what's known as "leaning toothpick syndrome"), use %r, which allows you to choose a different delimiter:

singleline_comment = /\/\/(.*)$/     # => /\/\/(.*)$/
singleline_comment = %r#//(.*)$#     # => /\/\/(.*)$/

multiline_comment = /\/\*(.*?)\*\//m # => /\/\*(.*?)\*\//m
multiline_comment = %r#/\*(.*?)\*/#m # => /\/\*(.*?)\*\//m

The first line in each sample above is how you're doing it, and the second is how I'd do it. They result in identical regexp objects, but the second ones are easier to understand.

You can even have Regexp help you by escaping things for you:

NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*?)'
GREEDY_CAPTURE_NONE_TO_ALL_CHARS = '(.*)'
EOL = '$'

Regexp.new(Regexp.escape('//') + GREEDY_CAPTURE_NONE_TO_ALL_CHARS + EOL) # => /\/\/(.*)$/
Regexp.new(Regexp.escape('/*') + NONGREEDY_CAPTURE_NONE_TO_ALL_CHARS + Regexp.escape('*/'), Regexp::MULTILINE) # => /\/\*(.*?)\*\//m

Doing this you can iteratively build up extremely complex expressions while keeping them relatively easy to maintain.
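For instance, the string_line pattern from your question could be assembled the same way (the constant names below are my own, not from your code):

```ruby
QUOTE     = Regexp.escape('"')  # escaping is a no-op here, but keeps intent clear
CAPTURE   = '(.*?)'             # non-greedy capture
OPT_SPACE = '\s*'               # optional whitespace

string_line = Regexp.new(
  OPT_SPACE + QUOTE + CAPTURE + QUOTE + OPT_SPACE + '=' +
  OPT_SPACE + QUOTE + CAPTURE + QUOTE + OPT_SPACE + ';?' + OPT_SPACE,
  Regexp::MULTILINE
)

m = string_line.match('"key" = "value";')
m[1] # => "key"
m[2] # => "value"
```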

As far as halting your Rails app goes, don't try to process the files in the same Ruby process. Run a separate job that watches for the files, processes them, and stores whatever you're extracting so it can be accessed later as needed. That way your server will continue to respond rather than lock up.

I wouldn't do it in a thread; instead I'd write a separate Ruby script that looks for incoming data and, if nothing is found, sleeps for some interval of time and then looks again. Ruby's sleep method will help with that, or you could use your OS's cron capability.
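A minimal sketch of such a watcher script might look like this (the directory names and the process_pending helper are my own illustration, not a prescribed layout):

```ruby
require 'fileutils'

# Move each pending .strings file out of the incoming directory once handled,
# so it is never processed twice.
def process_pending(incoming, done)
  Dir.glob(File.join(incoming, '*.strings')).each do |path|
    contents = File.read(path)
    # ... parse `contents` here and store the results where Rails can read them ...
    FileUtils.mv(path, File.join(done, File.basename(path)))
  end
end

# In the stand-alone watcher script itself:
# loop do
#   process_pending('/var/app/incoming', '/var/app/processed')
#   sleep 30  # pause before scanning again
# end
```

Because the parsing happens in its own process, a pathological file can burn CPU there without ever blocking a Rails request.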