Someone asked me an interview question: write a function match(s, t) that decides whether a string s is a generalized substring of another string t. More concretely, match should return True if and only if removing some characters from t makes it equal to s. For example, match("abc", "abbbc") is True, because we can remove the two extra b's in t.
Surely the interviewer is expecting some kind of recursive solution, but I'm feeling adventurous and wrote
import re

def match(s, t):
    return re.search('.*?'.join(s), t) is not None
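For reference, here is a runnable version with the question's example plus a negative case (the extra test case and comments are mine):

```python
import re

def match(s, t):
    # s is a generalized substring of t iff the characters of s appear
    # in t in order; '.*?'.join("abc") builds the pattern "a.*?b.*?c".
    # (If s could contain regex metacharacters, you would join
    # map(re.escape, s) instead.)
    return re.search('.*?'.join(s), t) is not None

print(match("abc", "abbbc"))  # True: remove the two extra b's
print(match("abc", "acb"))    # False: the characters appear out of order
```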
This seems to satisfy all the test cases, but then I started wondering whether there is any adversarial input that can bog down this operation (and potentially make our service vulnerable to DoS attacks). See this great article for an extended discussion, but as a quick example, try
re.match("a?"*29+"a"*29, "a"*29)
and it will take several seconds before finding a match. The reason is that re implements a backtracking regex engine:
Consider the regular expression a?ⁿaⁿ, i.e. n optional a's followed by n literal a's. It matches the string aⁿ when the a? are chosen not to match any letters, leaving the entire string to be matched by the aⁿ. Backtracking regular expression implementations implement the zero-or-one ? by first trying one and then zero. There are n such choices to make, a total of 2ⁿ possibilities. Only the very last possibility (choosing zero for all the ?) will lead to a match. The backtracking approach thus requires O(2ⁿ) time, so it will not scale much beyond n = 25.
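You can watch the doubling directly with a small timing sketch (the absolute times are machine-dependent; only the growth rate matters):

```python
import re
import time

# Each additional "a?" roughly doubles the number of ways the engine
# can distribute the optional matches, so the match time roughly
# doubles with every increment of n.
for n in range(15, 20):
    start = time.perf_counter()
    assert re.match("a?" * n + "a" * n, "a" * n) is not None
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```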
Back to the interview question: technically '.*?'.join(s) gives us at least len(s) - 1 choices to make, so there could be adversarial inputs analogous to the "a?"*29+"a"*29 and "a"*29 pair above, but I failed to find any after some trial and error.
Question: what s and t can make match ridiculously slow?
Lazy quantifiers are generally quite good for performance, but AFAIK they do not prevent pathological backtracking behaviour.
This is especially true when the beginning of the regexp matches the beginning of the text, but the match commits too early and fails near the end of the text, requiring a lot of backtracking to "fix" the bad early lazy match at the beginning of the regexp.
In your case, here is an example of pathological input requiring an exponential number of steps:
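For instance (this particular pair is my own construction along the lines the answer describes, not necessarily its original example): take s to be n a's followed by a single b, and t to be 2n a's with no b at all. The engine must try every order-preserving way of assigning the pattern's n a's to the text's 2n a's, roughly C(2n, n) attempts, before it can conclude that the trailing b is unreachable:

```python
import re
import time

def match(s, t):
    return re.search('.*?'.join(s), t) is not None

for n in range(7, 12):
    s = "a" * n + "b"      # pattern becomes a.*?a.*? ... a.*?b
    t = "a" * (2 * n)      # no 'b' anywhere, so the search must fail
    start = time.perf_counter()
    assert match(s, t) is False
    print(f"n={n}: {time.perf_counter() - start:.3f}s")
```

Since C(2n, n) grows by a factor of about 4 per unit of n, the time roughly quadruples at each step.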
Here is the number of steps and the Python match time required, based on the value of n:

You can understand and analyse the resulting regexp matching behaviour on this great website (more specifically, backtracking in this case).