Precedence rules for matching groups with regexp

1.8k views Asked by At

Consider the following .NET regular expression:

^(REF)?(.{1,10})-(\d{12})-(\d+)$

It defines four groups, in which I'm interested and which I will analyse separately.

Now, consider an input string for this regexp:

REFmisc03-123456789012-213

It is possible to match it like this:

(REF)(misc03)-(123456789012)-(213)

And it is also possible to match it like this:

()(REFmisc03)-(123456789012)-(213)

Is it documented what way will be preferred by the regexp engine, or is it random?

1

There are 1 answers

0
Ahmad Mageed On BEST ANSWER

It is not random. This boils down to how quantifiers are interpreted by the regex engine and potential backtracking. By quantifier I am referring to the ? in (REF)?. According to MSDN:

Ordinarily, quantifiers are greedy; they cause the regular expression engine to match as many occurrences of particular patterns as possible. Appending the ? character to a quantifier makes it lazy; it causes the regular expression engine to match as few occurrences as possible.

In other words, ? is greedy, and ?? is lazy. Both match zero or one time, but they will have an effect on how the matching is performed.

With regards to backtracking, MSDN mentions:

the regular expression engine tries to fully match optional or alternative subexpressions. When it advances to the next language element in the subexpression and the match is unsuccessful, the regular expression engine can abandon a portion of its successful match and return to an earlier saved state in the interest of matching the regular expression as a whole with the input string. This process of returning to a previous saved state to find a match is known as backtracking.

Another useful resource to learn more about backtracking can be found here: Possessive Quantifiers.

To answer your question directly, we can compare both approaches.

Greedy approach

Original input: REFmisc03-123456789012-213

Usage of (REF)? will match your text with 4 groups (excluding the first group with the entire match) and all groups will be successfully matched:

  1. REF
  2. misc03
  3. 123456789012
  4. 213

This matches your first possible match scenario (loosely defined):

(REF)(misc03)-(123456789012)-(213)

As long as the "misc..." portion is 1-10 characters long, the match will be the same, with all 1-10 characters appearing in the second group. The REF portion will always be matched in the first group.

New input: REF-123456789012-213

The "misc..." portion is absent. Since (REF)? is optional, and (.{1,10}) isn't, the regex engine will use the "REF" input to satisfy the latter (required) portion of the pattern and disregard the former (optional) portion. This will yield the following group values:

  1. "" (empty string, Success property = false)
  2. REF
  3. 123456789012
  4. 213

Lazy approach

Original input: REFmisc03-123456789012-213

By using (REF)??, and keeping the rest of your pattern the same, the quantifier becomes lazy and this returns 4 groups with these values:

  1. "" (empty string, Success property = false)
  2. REFmisc03
  3. 123456789012
  4. 213

This matches your second possible match scenario:

()(REFmisc03)-(123456789012)-(213)

Since the first group is optional with a lazy quantifier, the regex engine is able to disregard it. Since "REFmisc03" is 9 characters long, the engine proceeds to lump "REF" in with "misc03" because they fit into the (.{1,10}) group.

New input: REF-123456789012-213

This behaves similarly to the greedy pattern and the same reasoning applies.

Another new input: REFmisc0345-123456789012-213

In this example the "misc0345" portion is 8 characters long. Although the pattern uses a lazy quantifier it can't fit "REFmisc0345" into the second group because it exceeds the 10 character limit. The regex engine will backtrack and match "REF" in the first group, and "misc0345" in the second group:

  1. REF
  2. misc0345
  3. 123456789012
  4. 213