Recursive PCRE search with patterns


This question is about PCRE.

I have seen a recursive search for nested parentheses used with this construct:

\(((?>[^()]+)|(?R))*\)

The problem with this is that, while '[^()]+' can match any character other than the parentheses (including newline), the delimiters themselves have to be single characters, such as braces, brackets, punctuation, or single letters.

What I am trying to do is replace the '(' and ')' characters with ANY kind of pattern (keywords such as 'BEGIN' and 'END', for example).

I have come up with the following construct:

(?xs)  (?# <-- 'xs' ignore whitespace in the search term, and allows '.'
               to match newline )
(?P<pattern1>BEGIN)
(
   (?> (?# <-- "once only" search )
      (
         (?! (?P=pattern1) | (?P<pattern2>END)).
      )+
   )
   | (?R)
)*
END

This will actually work on something that looks like this:

BEGIN <<date>>
  <<something>>
    BEGIN
      <<something>>
    END <<comment>>
    BEGIN <<time>>
      <<more somethings>>
      BEGIN(cause we can)END
      BEGINEND
    END
  <<something else>>
END

This successfully matches any nested BEGIN..END pairs.

I set up named patterns pattern1 and pattern2 for BEGIN and END, respectively. Using pattern1 in the search term works fine. However, I can't use pattern2 at the end of the search: I have to write out 'END'.

Any idea how I can rewrite this regex so I only have to specify the patterns a single time and use them "everywhere" within the code? In other words, so I don't have to write END both in the middle of the search as well as at the very end.


There are 2 answers

Firas Dib (best answer)

To extend @Kobi's answer, see the following regex:

(?xs)
(?(DEFINE)
        (?<pattern1>BEGIN)
        (?<pattern2>END)
)
(?=((?&pattern1)
(?:
   (?> (?# <-- "once only" search )
      (?:
         (?! (?&pattern1) | (?&pattern2)) .
      )+
   )*
   | (?3)
)*
(?&pattern2)
))

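This regex even lets you fetch each individual block, including the nested ones: the whole match sits inside a lookahead, so no text is consumed and the engine retries at every position. Use the 3rd capture group, as the first two group numbers are taken by the named groups in the DEFINE block.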

Demo: http://regex101.com/r/bX8mB6
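
For comparison, here is a minimal Python sketch that collects every nested block without recursion. Python's built-in re module has no (?R) or (?(DEFINE)...) support, so this uses a different technique: a token scan with a stack of open positions. The function name nested_blocks and its parameters are made up for illustration, and the keywords are treated as literal text, not as patterns.

```python
import re

def nested_blocks(text, open_kw="BEGIN", close_kw="END"):
    """Return every balanced open_kw..close_kw block, in the order they close."""
    # One token pattern built from the two keywords, each stated only once.
    # Assumes neither keyword is a prefix of the other.
    token = re.compile("|".join(re.escape(k) for k in (open_kw, close_kw)))
    stack = []    # start offsets of blocks that are still open
    blocks = []
    for m in token.finditer(text):
        if m.group() == open_kw:
            stack.append(m.start())
        elif stack:
            # A closing keyword ends the most recently opened block.
            blocks.append(text[stack.pop():m.end()])
        # A closing keyword with nothing open is ignored.
    return blocks

sample = "BEGIN a BEGIN b END c BEGINEND END"
for block in nested_blocks(sample):
    print(repr(block))
```

Unlike the regex above, which reports blocks in the order they start, this sketch reports them in the order they close, and it silently skips an unclosed BEGIN.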

Kobi

This looks like a good use case for a (?(DEFINE)...) block, which defines named sub-patterns that can then be called from elsewhere in the regex. A Perl example would be:

(?xs)
(?(DEFINE)
        (?<pattern1>BEGIN)
        (?<pattern2>END)
)
(?&pattern1)
(
   (?> (?# <-- "once only" search )
      (
         (?! (?&pattern1) | (?&pattern2)).
      )+
   )
   | (?R)
)*
(?&pattern2)

Example: http://ideone.com/8o9cg

(Please note that I don't really know Perl, and I couldn't get this to work in PHP on any of the online testers.)

See also: http://www.pcre.org/pcre.txt (search for (?(DEFINE); the documentation doesn't appear to be split into separate pages)


A low-tech solution that works on most flavors is to use lookahead at the start of the pattern:

(?=.*?(?P<pattern1>BEGIN))
(?=.*?(?P<pattern2>END))
...
(?P=pattern1) (?# should work - it was captured )
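
As a rough illustration of the lookahead trick, here is a sketch using Python's built-in re module, which supports the (?P<name>...) and (?P=name) syntax. Two caveats: a backreference matches the exact text that was captured rather than re-applying the sub-pattern, so this only suits fixed keywords, and since Python's re has no recursion the example matches a single non-nested block. The helper name first_block and the group name body are assumptions made for the example.

```python
import re

# Each keyword is written once, inside a lookahead, and reused by name.
PATTERN = re.compile(
    r"""
    (?=.*?(?P<pattern1>BEGIN))   # capture the opening keyword
    (?=.*?(?P<pattern2>END))     # capture the closing keyword
    (?P=pattern1)                # backreference to the captured opening text
    (?P<body>.*?)                # shortest possible body (no nesting support)
    (?P=pattern2)                # backreference to the captured closing text
    """,
    re.VERBOSE | re.DOTALL,
)

def first_block(text):
    """Return the first BEGIN..END block in text, or None if there is none."""
    m = PATTERN.search(text)
    return m.group(0) if m else None

print(first_block("x BEGIN a END y"))
```

Both lookaheads must succeed at the position where the match starts, so the closing keyword has to appear somewhere after that position for anything to match.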