I'm currently studying implementations of UNIX-style glob pattern matching. Generally, the fnmatch
library is a good reference implementation for this functionality.
Looking at some of the implementations, as well as reading various blogs/tutorials about this, it seems that this algorithm is usually implemented recursively.
Generally, any sort of algorithm that requires "back tracking", or requires returning to an earlier state, nicely lends itself to a recursive solution. This includes things like tree traversal, or parsing nested structures.
But I'm having trouble understanding why glob pattern matching in particular is so often implemented recursively. I get the idea that sometimes back tracking will be necessary, for example if we have a string aabaabxbaab
, and a pattern a*baab
, the characters after the *
will match the first "baab" substring, like aa(baab)xbaab
, and then fail to match the rest of the string. So the algorithm will need to backtrack so that the character match counter starts over, and we can match the second occurrence of baab
, like: aabaabx(baab)
.
Okay, but generally recursion is used when we might require multiple nested levels of backtracking, and I don't see how that would be necessary in this case. It seems we'd only ever have to backtrack one level at a time, when the iterator over the pattern and the iterator over the input string fail to match. At this point, the iterator over the pattern would need to move back to the last "save point", which would either be the beginning of the string, or the last *
that successfully matched something. This doesn't require a stack - just a single save point.
The only complication I can think of is in the event of an "overlapping" match. For example if we have the input string aabaabaab
, and the pattern a*baab
, we would fail to match because the "b" in the last baab
could be part of either the first match or the second match. But it seems this could be solved by simply backtracking the input iterator if the distance between the last pattern iterator save point and the end of the pattern is greater than the distance between the input iterator position and the end of the input string.
So, as far as I'm seeing, it shouldn't be too much of an issue to implement this glob matching algorithm iteratively (assuming a very simple glob matcher, which only uses the *
character to mean "match zero or more characters". Also, the matching strategy would be greedy.)
So, I assume I'm definitely wrong about this, because everyone else does this recursively - so I must be missing something. It's just that I can't see what I'm missing here. So I implemented a simple glob matcher in C++ (that only supports the *
operator), to see if I could figure out what I'm missing. This is an extremely straightforward, simple iterative solution which just uses an inner loop to do the wildcard matching. It also records the indices which the *
character matches in a vector of pairs:
bool match_pattern(const std::string& pattern, const std::string& input,
std::vector<std::pair<std::size_t, std::size_t>>& matches)
{
const char wildcard = '*';
auto pat = std::begin(pattern);
auto pat_end = std::end(pattern);
auto it = std::begin(input);
auto end = std::end(input);
while (it != end && pat != pat_end)
{
const char c = *pat;
if (*it == c)
{
++it;
++pat;
}
else if (c == wildcard)
{
matches.push_back(std::make_pair(std::distance(std::begin(input), it), 0));
++pat;
if (pat == pat_end)
{
matches.back().second = input.size();
return true;
}
auto save = pat;
std::size_t matched_chars = 0;
while (it != end && pat != pat_end)
{
if (*it == *pat)
{
++it;
++pat;
++matched_chars;
if (pat == pat_end && it != end)
{
pat = save;
matched_chars = 0;
// Check for an overlap and back up the input iterator if necessary
//
std::size_t d1 = std::distance(it, end);
std::size_t d2 = std::distance(pat, pat_end);
if (d2 > d1) it -= (d2 - d1);
}
}
else if (*pat == wildcard)
{
break;
}
else
{
if (pat == save) ++it;
pat = save;
matched_chars = 0;
}
}
matches.back().second = std::distance(std::begin(input), it) - matched_chars;
}
else break;
}
return it == end && pat == pat_end;
}
Then I wrote a series of tests to see if I could find some pattern or input string that would require multiple levels of backtracking (and therefore recursion or a stack), but I can't seem to come up with anything.
Here is my test function:
void test(const std::string& input, const std::string& pattern)
{
std::vector<std::pair<std::size_t, std::size_t>> matches;
bool result = match_pattern(pattern, input, matches);
auto match_iter = matches.begin();
std::cout << "INPUT: " << input << std::endl;
std::cout << "PATTERN: " << pattern << std::endl;
std::cout << "INDICES: ";
for (auto& p : matches)
{
std::cout << "(" << p.first << "," << p.second << ") ";
}
std::cout << std::endl;
if (result)
{
std::cout << "MATCH: ";
for (std::size_t idx = 0; idx < input.size(); ++idx)
{
if (match_iter != matches.end())
{
if (idx == match_iter->first) std::cout << '(';
else if (idx == match_iter->second)
{
std::cout << ')';
++match_iter;
}
}
std::cout << input[idx];
}
if (!matches.empty() && matches.back().second == input.size()) std::cout << ")";
std::cout << std::endl;
}
else
{
std::cout << "NO MATCH!" << std::endl;
}
std::cout << std::endl;
}
And my actual tests:
test("aabaabaab", "a*b*ab");
test("aabaabaab", "a*");
test("aabaabaab", "aa*");
test("aabaabaab", "aaba*");
test("/foo/bar/baz/xlig/fig/blig", "/foo/*/blig");
test("/foo/bar/baz/blig/fig/blig", "/foo/*/blig");
test("abcdd", "*d");
test("abcdd", "*d*");
test("aabaabqqbaab", "a*baab");
test("aabaabaab", "a*baab");
So this outputs:
INPUT: aabaabaab
PATTERN: a*b*ab
INDICES: (1,2) (3,7)
MATCH: a(a)b(aaba)ab
INPUT: aabaabaab
PATTERN: a*
INDICES: (1,9)
MATCH: a(abaabaab)
INPUT: aabaabaab
PATTERN: aa*
INDICES: (2,9)
MATCH: aa(baabaab)
INPUT: aabaabaab
PATTERN: aaba*
INDICES: (4,9)
MATCH: aaba(abaab)
INPUT: /foo/bar/baz/xlig/fig/blig
PATTERN: /foo/*/blig
INDICES: (5,21)
MATCH: /foo/(bar/baz/xlig/fig)/blig
INPUT: /foo/bar/baz/blig/fig/blig
PATTERN: /foo/*/blig
INDICES: (5,21)
MATCH: /foo/(bar/baz/blig/fig)/blig
INPUT: abcdd
PATTERN: *d
INDICES: (0,4)
MATCH: (abcd)d
INPUT: abcdd
PATTERN: *d*
INDICES: (0,3) (4,5)
MATCH: (abc)d(d)
INPUT: aabaabqqbaab
PATTERN: a*baab
INDICES: (1,8)
MATCH: a(abaabqq)baab
INPUT: aabaabaab
PATTERN: a*baab
INDICES: (1,5)
MATCH: a(abaa)baab
The parentheses that appear in the output after "MATCH: "
show the substrings that were matched/consumed by each *
character. So, this seems to work fine, and I can't seem to see why it would be necessary to backtrack multiple levels here - at least if we limit our pattern to only allow *
characters, and we assume greedy matching.
So I assume I'm definitely wrong about this, and probably overlooking something simple - can someone help me to see why this algorithm might require multiple levels of backtracking (and therefore recursion or a stack)?
I didn't check all the details of your implementation, but it is certainly true that you can do the match without recursive backtracking.
You can actually do glob matching without backtracking at all by building a simple finite-state machine. You could translate the glob into a regular expression and build a DFA in the normal way, or you could use something very similar to the Aho-Corasick machine; if you tweaked your algorithm a little bit, you'd achieve the same result. (The key is that you don't actually need to backup the input scan; you just need to figure out the correct scan state, which can be precomputed.)
The classic fnmatch implementations are not optimized for speed; they're based on the assumption that patterns and target strings are short. That assumption is usually reasonable, but if you allow untrusted patterns, you're opening yourself up to a DoS attack. And based on that assumption, an algorithm similar to the one you present, which does not require precomputation, is probably faster in the vast majority of use cases than any algorithm which requires precomputing state transition tables while avoiding the exponential blowup with pathological patterns.