Regex vulnerable to polynomial runtime

828 views Asked by At

How do I improve this code? SonarQube is highlighting that the regex pattern that could become really slow and produce denial of service. Here's the code:

  // Single quotes
  // Double quotes
  // Control characters (x01 - x1f)
  private static final String SQL_INJECTION_PATTERN = ".*['\"\\x01-\\x1f].*";

  if (myValue.matches(SQL_INJECTION_PATTERN)) { // Here's the problem!
    ...
  }

I need to find if any of those chars is present, to take an action.

2

There are 2 answers

7
rzwitserloot On BEST ANSWER

Just ignore the warning. There aren't many regexes that have potentially exponential performance characteristics. If you do not let your users choose the regex (i.e. imagine you have a search feature where the user enters a regex or something you use as part of a regex in a search-for-this form field and they can submit it, then you'd be vulnerable here) and you don't try to apply such a regex that you wrote yourself to user input (trivially fixed by.. not having any such regexes in your code base)...

it doesn't apply to you.

But you do have a security leak here

You're engaging in an act known as denylisting. An older term for this is 'blacklisting'. You list things that you know are bad, and scan for these. Any input that contains bad things is disallowed. The rest is allowed.

Effectively you're saying: Unless I can show conclusively it is bad, we assume it is good.

This is a broken approach to security.

Allowlisting is the correct approach: Unless you can prove the input is safe, do not run it.

Specifically, for SQL injection pattern, the solution is to let the SQL engine take care of it: Make PreparedStatements, and use .setX to set them - or use a library like JDBI or JOOQ that does this for you. Then there is simply no reason to worry about SQL injection whatsoever.

Note that the actual security leak you do have is of the 'the hacker will p0wn the entire box by gaining admin creds on it' variety. Compared to 'users might be able to DOS your box', it's.. a few million times worse.


EDIT: Here is an example of an allowlist approach in the same vein as the code you have (and the same caveat: Your security scanner may complain about a potential DOS attack venue here, you can ignore it, or.. see later paragraph):


private static final String SQL_INJECTION_PATTERN = ".*[a-zA-Z0-9_ -].*";

  if (!myValue.matches(SQL_INJECTION_PATTERN)) throw new IllegalArgumentException();

This code will check if the input consists solely of a very specific limited range of alphanums (just the ascii chars), and dash, and underscore, and space, and disallows everything that doesn't consist just of those things.

This may still be unsafe - it depends on what you do with it. SQL has keywords, for example. If certain keywords are bad who knows - you'd need to elaborate considerably on what you're doing with user input here.

The central point stands: It is highly likely your entire approach is wrong and you need to be using preparedstatements.

One way to get rid of the warning is to stop using regexes. You can do this kind of scan just as easily with a for loop (and it is faster too, if anything):

for (int i = 0; i < sql.length(); i++) {
  if ((i < 'a' || i > 'z') && (i < 'A' && i > 'Z') && (i < '0' || i > '9') && i != ' ' && i != '_' && i != ' ') throw new IllegalArgumentException();

But, as a general rule, rewriting code solely because you're trying to work around an overly aggressive linting tool is an incredibly perniciously dangerous thing to do to a codebase. Programming is hard enough as is.

4
motto On

The "vulnerability" (such as it is) is not one of those chars, but the leading .*. tldr: you could change it to .*?

Consider an input string with none of the suspect characters, e.g. xxxxxx. In trying to match this string against .*['"\x01-\x1f].*, the regex engine will attempt to match the leading .* greedily, so it tries:


xxxxxx (fails to match because there is no ['"\x01-\x1f] past the end of the input)
xxxxx  (fails to match because the next character is x which is not ['"\x01-\x1f])
xxxx   (ditto)
xxx    (ditto)
xx     (ditto)
x      (ditto)
       (ditto)
  -> no match found

The result of the greedy matching up front is that the engine must backtrack repeatedly, so determining the lack of match takes time superlinear(?) in the length of the input. The regex analyser will be being very conservative when it comes to evaluating the "risk".

The easy workaround is to use a non-greedy quantifier on the leading .*, i.e. .*?['"\x01-\x1f].*. The regex engine does not backtrack, so the input is evaluated in linear time.

On a side-note, this type of "potential ReDoS" pattern is reminiscent to one that was reported in AngularJS's angular.copy a couple of weeks back (and indeed in lodash's clone machinery for RegExps, and probably countless other libraries that use the same quick trick to extract flags from the end of a stringified RegExp).

Unrelated to the mechanics of this, I agree with rzwitserloot that this stuff tends to be a distraction from other more significant vulnerabilities. Maybe the real "denial of service" of each of these reports of a "regular expression denial of service" is the amount of time wasted every time while folks around the world scramble around trying to ascertain whether there's even an associated attack vector.