I'm trying to learn about Recursion in Regular Expressions, and have a basic understanding of the concepts in the PCRE flavour. I want to break a string:
Geese (Flock) Dogs (Pack)
into:
Full Match: Geese (Flock) Dogs (Pack)
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
I know neither regex quite does this, but I was more curious about why the first pattern works and the second one doesn't.
Pattern 1: ((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Pattern 2: ((.*?)(\(\w{1,}\)))((\g<2>)(\g<3>))*
Also, if, for example, you're dealing with a long string in which a pattern repeats itself, is it possible to keep expanding the full match and incrementally add groups without writing a loop statement separate from the regex?
Full Match: Geese (Flock) Dogs (Pack) Elephants (Herd)
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
Group 7: Elephants (Herd)
Group 8: Elephants
Group 9: (Herd)
The closest I've come is this pattern, but the middle group, Dogs (Pack), becomes Group 0.
((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Mind that recursion levels in PCRE are atomic. Once these patterns find a match they are never re-tried.
See Recursion and Subroutine Calls May or May Not Be Atomic.
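To see what atomicity does to a lazy quantifier, here is a minimal sketch in Python. It is only an approximation of the PCRE behaviour: Python's re module has no \g<n> subroutine calls, so the subpatterns are written out by hand, and the atomic (?>.*?) is emulated with the lookahead-plus-backreference idiom (?=(.*?))\5 so that it does not depend on a recent Python version. The variable names are made up for illustration.

```python
import re

subject = "Geese (Flock) Dogs (Pack)"

# First repetition, shared by both variants (groups 1-3, as in the question).
HEAD = r'((.*?)(\(\w{1,}\)))'

# Analogue of Pattern 1: the lazy .*? in the tail may be extended by backtracking.
backtracking = re.compile(HEAD + r'((.*?)(\(\w{1,}\)))*')

# Analogue of Pattern 2: (?=(.*?))\5 stands in for the atomic (?>.*?).
# The lookahead captures the lazy match (an empty string) and, once it has
# succeeded, the engine never goes back into it to try a longer match.
atomic = re.compile(HEAD + r'((?=(.*?))\5(\(\w{1,}\)))*')

lazy_full = backtracking.match(subject).group(0)
atomic_full = atomic.match(subject).group(0)

print(lazy_full)    # Geese (Flock) Dogs (Pack)
print(atomic_full)  # Geese (Flock)
```

With backtracking allowed, the tail grows .*? until a parenthesised word follows. In the atomic variant the empty lazy match is final, the next character is a space rather than (, and the repeated group matches zero times.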
In your second pattern, at the first recursion level, the subroutine calls behave as atomic groups. That is, \g<2> is then (?>.*?), not .*?. That means that, after the ((.*?)(\(\w{1,}\))) pattern has matched Geese (Flock), the regex engine tries to match with (?>.*?), sees that it is a lazy pattern that does not have to consume any chars, skips it (and will never come back to this pattern), and tries to match with (?>\(\w{1,}\)). As there is no ( immediately after ), the regex returns what it has consumed.
As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex: with repeated captures, only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than the number of capturing groups inside the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.
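The "only the last capture is kept" behaviour is easy to reproduce. The sketch below uses Python's re module, which is not PCRE but treats repeated capturing groups the same way in this respect. The simplified item pattern and the variable names are assumptions made for the example. The second half shows the usual workaround, which does need a loop outside the regex: match one item at a time and collect the results.

```python
import re

subject = "Geese (Flock) Dogs (Pack) Elephants (Herd)"

# One repeated group: every iteration overwrites the previous capture,
# so only the values from the last iteration survive.
repeated = re.match(r'(?:\s*((\w+) (\(\w+\))))+', subject)
last_only = repeated.groups()
print(repeated.group(0))  # Geese (Flock) Dogs (Pack) Elephants (Herd)
print(last_only)          # ('Elephants (Herd)', 'Elephants', '(Herd)')

# Workaround: a pattern for a single item, applied repeatedly by the caller.
item = re.compile(r'((\w+) (\(\w+\)))')
all_items = [m.groups() for m in item.finditer(subject)]
for groups in all_items:
    print(groups)
```

The full match does cover all three items, but the number of groups stays fixed at three no matter how often the group repeats.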