Sub-pattern in regex can't be dereferenced?


I have the following Perl script to extract numbers from a log. It seems that the non-capturing group with ?: isn't working when I define the sub-pattern in a variable. It only works when I leave out the grouping either in the regex pattern or in the sub-pattern in $number.

#!/usr/bin/perl
use strict;
use warnings;

my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
#my $number = '-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?';

open(FILE,"file.dat") or die "Exiting with: $!\n";
while (my $line = <FILE>) {
        if ($line =~ m{x = ($number). y = ($number)}){
        print "\$1= $1\n";
        print "\$2= $2\n";
        print "\$3= $3\n";
        print "\$4= $4\n";
    };
}
close(FILE);

The output for this code looks like:

$1= 12.15
$2= 12.15
$3= 3e-5
$4= 3e-5

for an input of:

asdf x = 12.15. y = 3e-5 yadda

Those doubled outputs aren't desired.

Is this because I use the m{} style instead of the usual m// delimiters? The m{} style is the only one I know for interpolating variables (sub-patterns) into a regex. So far I have only noticed this with the capture variables, but are there possibly other differences, for example with metacharacters?


There are 2 answers

Ibrahim Najjar On BEST ANSWER

The delimiters you use for the regular expression aren't causing any problems but the following is:

(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)
 ^^
Notice that this isn't a non-capturing group. A non-capturing group is written (?: whereas (:? opens a capturing group that starts with an optional colon :

It is probably a typo, but it is what causes the trouble.

Edit: Typo or not, I substituted the variable into the regex and got this:

x = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)). y = ((:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?))
    ^^           first and second group               ^^      ^^    third and fourth group                      ^^

As you can see, the first and second capturing groups capture exactly the same thing, and the same happens for the third and fourth capturing groups.
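
To check both points yourself, you can run the same pattern under two delimiter styles and count the captures. This is only a minimal sketch: the pattern and the sample line are taken from the question, and the names @braces and @slashes are just illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern from the question, typo included: "(:?" opens a capturing group.
my $number = '(:?-?(?:(?:\d+\.?\d*)|(?:\.\d+))(?:[Ee][+-]?\d+)?)';
my $line   = 'asdf x = 12.15. y = 3e-5 yadda';

# In list context a successful match returns every captured group.
my @braces  = $line =~ m{x = ($number). y = ($number)};
my @slashes = $line =~ /x = ($number). y = ($number)/;

print scalar(@braces),  " groups with m{}: @braces\n";
print scalar(@slashes), " groups with //: @slashes\n";
```

Both lists come out identical, with four entries each, so the delimiters make no difference. The extra groups come from the parentheses inside $number.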

CodeGorilla On

You're going to kick yourself...

Your regexp reads out as:

capture {
 maybe-colon
 maybe-minus
 cluster {     (?:(?:\d+\.?\d*)|(?:\.\d+))
  cluster {    (?:\d+\.?\d*)
   1+ digits
   maybe-dot
   0+ digits
  }
  -or-
  cluster {    (?:\.\d+)
   dot
   1+ digits
  }
 }
 maybe cluster {
   E or e
   maybe + or -
   1+ digits
 }             (?:[Ee][+-]?\d+)?
}

... which is what you're looking for.
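
If it helps, the same breakdown can be written directly in Perl with the /x modifier, which lets you spread a pattern over several lines and comment each part. This is a sketch of the question's pattern as posted; compiling it with qr{} and testing it against '3e-5' are my own choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The question's sub-pattern, laid out with /x so each part carries a comment.
my $number = qr{
    (                       # capture (the question has "(:?" here)
      :?                    #   maybe-colon
      -?                    #   maybe-minus
      (?:                   #   cluster
          (?: \d+ \.? \d* ) #     1+ digits, maybe-dot, 0+ digits
        |                   #   -or-
          (?: \. \d+ )      #     dot, 1+ digits
      )
      (?: [Ee] [+-]? \d+ )? #   maybe cluster: E or e, maybe sign, 1+ digits
    )
}x;

# The group inside the sub-pattern fills the first capture variable on its own.
print "matched: $1\n" if '3e-5' =~ /\A$number\z/;
```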

However, when you then do your actual regexp, you do:

$line =~ m{x = ($number). y = ($number)}

(the curly braces are a distraction: once you write the m or s explicitly, you may use almost any non-word character as the delimiter)

This asks Perl to capture whatever the regexp defined in $number matches, which is itself a capture, hence $1 and $2 being the same thing.

Simply remove the capturing parentheses from either $number or the regexp line.
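
One way to apply that, as a sketch: keep $number free of capturing groups and let the match itself do the capturing. Compiling the sub-pattern with qr// and escaping the separator as \. (on the assumption that it is meant to be a literal dot) are my additions, not part of the question; a plain single-quoted string works as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sub-pattern with no capturing group: every group uses (?: ... ).
my $number = qr/-?(?:\d+\.?\d*|\.\d+)(?:[Ee][+-]?\d+)?/;

my $line = 'asdf x = 12.15. y = 3e-5 yadda';

# Only the two pairs of parentheses written here capture anything.
my ($x, $y) = $line =~ m{x = ($number)\. y = ($number)};

if (defined $x) {
    print "x = $x\n";
    print "y = $y\n";
}
```

With the sample line from the question this leaves 12.15 in $x and 3e-5 in $y, with no duplicated groups.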