Someone asked me an interview question: write a function match(s, t) that decides whether a string s is a generalized substring of another string t. More concretely, match should return True if and only if removing some characters from t makes it equal to s. For example, match("abc", "abbbc") is True, because we can remove the two extra b's in t.
Surely the interviewer is expecting some kind of recursive solution, but I'm feeling adventurous and wrote
import re

def match(s, t):
    return re.search('.*?'.join(s), t) is not None
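For reference, here is a runnable version with the question's example plus a negative case (the extra test case and comments are mine):

```python
import re

def match(s, t):
    # s is a generalized substring of t iff the characters of s appear
    # in t in order; '.*?'.join("abc") builds the pattern "a.*?b.*?c".
    # (If s could contain regex metacharacters, you would join
    # map(re.escape, s) instead.)
    return re.search('.*?'.join(s), t) is not None

print(match("abc", "abbbc"))  # True: remove the two extra b's
print(match("abc", "acb"))    # False: the characters appear out of order
```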
This seems to satisfy all the test cases, but then I started wondering whether there is any adversarial input that can bog down this operation (and potentially make our service vulnerable to DoS attacks). See this great article for an extended discussion, but as a quick example, try
re.match("a?"*29+"a"*29, "a"*29)
and it will take several seconds before finding a match. The reason is that re implements a backtracking regex engine:
Consider the regular expression a?ⁿaⁿ, i.e. n optional a's followed by n literal a's. It matches the string aⁿ when the a? are chosen not to match any letters, leaving the entire string to be matched by the aⁿ. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2ⁿ possibilities. Only the very last possibility (choosing zero for all the ?) will lead to a match. The backtracking approach thus requires O(2ⁿ) time, so it will not scale much beyond n = 25.
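You can watch the doubling directly with a small timing sketch (the absolute times are machine-dependent; only the growth rate matters):

```python
import re
import time

# Each additional "a?" roughly doubles the number of ways the engine
# can distribute the optional matches, so the match time roughly
# doubles with every increment of n.
for n in range(15, 20):
    start = time.perf_counter()
    assert re.match("a?" * n + "a" * n, "a" * n) is not None
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```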
Back to the interview question: technically '.*?'.join(s) gives us at least len(s) - 1 choices to make, so there could be adversarial inputs analogous to the "a?"*29+"a"*29 and "a"*29 pair above, but I failed to find any after some trial and error.
Question: what s and t can make match ridiculously slow?
Lazy quantifiers are generally quite good for performance, but AFAIK they do not prevent pathological backtracking behaviour.
This is especially true when the beginning of the regexp matches the beginning of the text, but the match commits too early and fails near the end of the text, requiring a lot of backtracking to "fix" the bad early lazy match at the beginning of the regexp.
In your case, here is an example of pathological input requiring an exponential number of steps:
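For instance (this particular pair is my own construction along the lines the answer describes, not necessarily its original example): take s to be n a's followed by a single b, and t to be 2n a's with no b at all. The engine must try every order-preserving way of assigning the pattern's n a's to the text's 2n a's, roughly C(2n, n) attempts, before it can conclude that the trailing b is unreachable:

```python
import re
import time

def match(s, t):
    return re.search('.*?'.join(s), t) is not None

for n in range(7, 12):
    s = "a" * n + "b"      # pattern becomes a.*?a.*? ... a.*?b
    t = "a" * (2 * n)      # no 'b' anywhere, so the search must fail
    start = time.perf_counter()
    assert match(s, t) is False
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```

Since C(2n, n) grows by a factor of about 4 per unit of n, the time roughly quadruples at each step.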
Here is the number of steps and the Python match time required, based on the value of n:

You can understand and analyse the resulting regexp matching behaviour on this great website (more specifically, backtracking in this case).