Consider the following .NET regular expression:
^(REF)?(.{1,10})-(\d{12})-(\d+)$
It defines four groups, in which I'm interested and which I will analyse separately.
Now, consider an input string for this regexp:
REFmisc03-123456789012-213
It is possible to match it like this:
(REF)(misc03)-(123456789012)-(213)
And it is also possible to match it like this:
()(REFmisc03)-(123456789012)-(213)
Is it documented what way will be preferred by the regexp engine, or is it random?
It is not random. This boils down to how quantifiers are interpreted by the regex engine and potential backtracking. By quantifier I am referring to the
?
in(REF)?
. According to MSDN:In other words,
?
is greedy, and??
is lazy. Both match zero or one time, but they will have an effect on how the matching is performed.With regards to backtracking, MSDN mentions:
Another useful resource to learn more about backtracking can be found here: Possessive Quantifiers.
To answer your question directly, we can compare both approaches.
Greedy approach
Original input:
REFmisc03-123456789012-213
Usage of
(REF)?
will match your text with 4 groups (excluding the first group with the entire match) and all groups will be successfully matched:This matches your first possible match scenario (loosely defined):
As long as the
"misc..."
portion is 1-10 characters long, the match will be the same, with all 1-10 characters appearing in the second group. TheREF
portion will always be matched in the first group.New input:
REF-123456789012-213
The
"misc..."
portion is absent. Since(REF)?
is optional, and(.{1,10})
isn't, the regex engine will use the"REF"
input to satisfy the latter (required) portion of the pattern and disregard the former (optional) portion. This will yield the following group values:""
(empty string,Success
property =false
)Lazy approach
Original input:
REFmisc03-123456789012-213
By using
(REF)??
, and keeping the rest of your pattern the same, the quantifier becomes lazy and this returns 4 groups with these values:""
(empty string,Success
property =false
)This matches your second possible match scenario:
Since the first group is optional with a lazy quantifier, the regex engine is able to disregard it. Since
"REFmisc03"
is 9 characters long, the engine proceeds to lump"REF"
in with"misc03"
because they fit into the(.{1,10})
group.New input:
REF-123456789012-213
This behaves similarly to the greedy pattern and the same reasoning applies.
Another new input:
REFmisc0345-123456789012-213
In this example the
"misc0345"
portion is 8 characters long. Although the pattern uses a lazy quantifier it can't fit"REFmisc0345"
into the second group because it exceeds the 10 character limit. The regex engine will backtrack and match"REF"
in the first group, and"misc0345"
in the second group: