Why doesn't this recursive regex capture the entire code block?

270 views Asked by At

I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if statement.

It's almost like the .+? is somehow gobbling up the first {, but it's supposed to be non-greedy so I don't understand why it would.

What is causing it to act this way?

Script:

use strict;
use warnings;

my $text = << "END";
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
END

# Regular expression to capture balanced "{}" groups
my $regex = qr/
    \{              # Match opening brace
        (?:         # Start non-capturing group
            [^{}]++ #     Match non-brace characters without backtracking
            |       #     or
            (?R)    #     Recursively match the entire expression
        )*          # Match 0 or more times
    \}              # Match closing brace
/x;

# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
    print $1;
}

Output:

{
        return x;
    }

Expected Output:

{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}

I know that there is a Text::Balanced module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.

2

There are 2 answers

8
amon On BEST ANSWER

(?R) recurses into the whole pattern – but what is the whole pattern? When you embed the quoted $regex into /int\s.+?($regex)/, the pattern is recompiled and (?R) refers to the new pattern. That's not what you intended.

I'd recommend you use named captures instead so that you can recurse by name. Change the $regex like

/(?<nestedbrace> ... (?&nestedbrace) ...)/

If you want to avoid extra captures, you can use the (?(DEFINE) ...) syntax to declare named regex patterns that can be called later:

my $define_nestedbrace_re = qr/(?(DEFINE)
  (?<nestedbrace ... (?&nestedbrace) ...)
)/x;

Then: /int\s.+?((?&nestedbrace))$define_nestedbrace_re/

That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.

4
anubhava On

You can change your recursive pattern to this one:

/int\s+.*?  (
    \{              # Match opening brace
        (?:         # Start non-capturing group
            [^{}]++ # Match non-brace chars without backtracking
            |       # OR
            (?-1)   # Recursively match the previous group
        )*          # Match 0 or more times
    \}
)/sx
  • Note use of (?-1) instead of (?R) that recurses whole matched pattern.
  • (?-1) is back-reference of previous capturing group.

Updated RegEx Demo