I'm trying to write a recursive regular expression to capture code blocks, but for some reason it seems to not be capturing them properly. I would expect the code below to capture the full body of the function, but instead it only captures the contents of the first if
statement.
It's almost like the .+?
is somehow gobbling up the first {
, but it's supposed to be non-greedy so I don't understand why it would.
What is causing it to act this way?
Script:
use strict;
use warnings;
my $text = << "END";
int max(int x, int y)
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
END
# Regular expression to capture balanced "{}" groups
my $regex = qr/
\{ # Match opening brace
(?: # Start non-capturing group
[^{}]++ # Match non-brace characters without backtracking
| # or
(?R) # Recursively match the entire expression
)* # Match 0 or more times
\} # Match closing brace
/x;
# is ".+?" gobbling up the first "{"?
# What would cause it to do this?
if ($text =~ m/int\s.+?($regex)/s){
print $1;
}
Output:
{
return x;
}
Expected Output:
{
if (x > y)
{
return x;
}
else
{
return y;
}
}
I know that there is a Text::Balanced
module for this purpose, but I am attempting to do this by hand in order to learn more about regular expressions.
(?R)
recurses into the whole pattern – but what is the whole pattern? When you embed the quoted$regex
into/int\s.+?($regex)/
, the pattern is recompiled and(?R)
refers to the new pattern. That's not what you intended.I'd recommend you use named captures instead so that you can recurse by name. Change the
$regex
likeIf you want to avoid extra captures, you can use the
(?(DEFINE) ...)
syntax to declare named regex patterns that can be called later:Then:
/int\s.+?((?&nestedbrace))$define_nestedbrace_re/
That won't create additional captures. However, it is not generally possible to write encapsulated regex fragments. Techniques like preferring named captures instead of numbered captures can help here.