How can I capture multiple matches for a sub-expression from a single string with Perl?

I currently have the following regular expression:

^\s*(.+)(?:[-\._ ]+)(\d+)\s*[xX]\s*(\d+)

This will match show_3x01_ep. name and retrieve show, 3, 01. I would like to extend this so that multiple episodes can be captured. For example:

 show _3x01_3x02 ep. name

should return:

show, 3, 01, 3, 02

Could someone please explain to me how this might be done?

There are 3 answers

Borodin (BEST ANSWER)

You are expecting too much from your regular expression. The simplest way is to do this in two steps.

Note first of all though that the (.+) which matches show in your example is too general. If you apply the pattern to show _3x01 ep. name then you will get show with a trailing space, because the following [-._ ]+ (there is no need to escape the dot or enclose the character class in (?: ... )) is satisfied with just one character. Applied to show _3x01_3x02 ep. name it is worse: (.+) is greedy, so it captures show _3x01 and only the last pair, 3 and 02, reaches the digit groups.
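A minimal sketch of the trailing-space problem, run against the single-episode name show _3x01 ep. name. The pattern is the one from the question; show_captures is a hypothetical helper added here only to make stray spaces visible by bracketing each capture.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged.
my $original = qr/^\s*(.+)(?:[-\._ ]+)(\d+)\s*[xX]\s*(\d+)/;

# Hypothetical helper: return the captures wrapped in brackets,
# or the string 'no match' when the pattern fails.
sub show_captures {
    my ($string) = @_;
    my @captures = $string =~ $original or return 'no match';
    return join ' ', map { "[$_]" } @captures;
}

print show_captures('show _3x01 ep. name'), "\n";   # [show ] [3] [01]
```

The separator class only has to consume the underscore, so the greedy (.+) keeps the space in front of it.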

The program below does what you ask. It finds the first string of alphabetic characters, and then all pairs of digit strings that are separated by a single x.

use strict;
use warnings;

my $s = 'show _3x01_3x02 ep. name';

if ( my ($prefix) = $s =~ /([a-z]+)/i ) {
  print "$prefix\n";
  print "$1 $2\n" while $s =~ /(\d+)x(\d+)/g;
}

output

show
3 01
3 02
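If a single flat list is wanted, as in the question's show, 3, 01, 3, 02, the same two steps can be wrapped in a sub. This is only a sketch: parse_episodes is a hypothetical name, and it assumes the show name is the first run of ASCII letters, exactly as the program above does.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (show, season, episode, season, episode, ...)
# or an empty list when no alphabetic prefix is found.
sub parse_episodes {
    my ($s) = @_;
    # Step 1: the first run of letters is taken as the show name.
    my ($prefix) = $s =~ /([a-z]+)/i or return;
    # Step 2: in list context, /g returns the captures of every match in order.
    my @pairs = $s =~ /(\d+)x(\d+)/g;
    return ($prefix, @pairs);
}

print join(', ', parse_episodes('show _3x01_3x02 ep. name')), "\n";
# show, 3, 01, 3, 02
```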
Todd A. Jacobs

Use String#scan in Ruby Instead

Your filenames aren't consistent, so you're probably better off scanning for known patterns and then cleaning up. I've already provided a Perl solution, but offer this Ruby solution as an alternative. For example:

str = 'show _3x01_3x02 ep. name'
str.scan(/\A(.*?)(?=\d)|(\d+)x(\d+)/).
    flatten.compact.map { |e| e.gsub(?_, ' ').strip }
#=> ["show", "3", "01", "3", "02"]

There's a lot going on in this one line of code, but it should be easy enough to follow. The code will:

  1. Match everything from the beginning of the string up to the first digit as the show name.
  2. Match all season/episode pairs that it can find.
  3. Return all matches as an array.
  4. Flatten nested arrays created by capture groups, and discard nils.
  5. Replace underscores with spaces in each member of the array.
  6. Strip any surrounding whitespace from each member of the array.
  7. Return the array.

The regular expression itself is Perl-compatible, but the rest of the logic relies on Ruby's String#scan and other internals that may not map directly to Perl. YMMV.

Todd A. Jacobs

Use Perl's g Modifier

You can use Perl's g regex modifier to scan for a pattern more than once in a string. You can then save those matches to a list, and then do something with that list or its individual elements. For example:

$ echo 'show _3x01_3x02 ep.name' |
      perl -ne '@words = ($_ =~ /\A(.*?)(?=\d)|(\d+)x(\d+)/g);
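                # drop the undef and empty captures left by the unused alternative
                @words = grep { defined($_) && length($_) } @words;
                while (my $idx = each @words) {
                    # trim spaces and delete underscores in place
                    $words[$idx] =~ s/^\s+|\s+\b|_//g;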
                };
                print join(", ", @words), "\n"'
show, 3, 01, 3, 02
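For use inside a script rather than on the command line, the same alternation can be wrapped in a sub that runs under strict and warnings. This is a sketch under the same assumptions as the one-liner; scan_name is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the one-liner's logic in a reusable sub.
sub scan_name {
    my ($name) = @_;
    # List-context /g returns all three groups for every match; groups that
    # belong to the alternative that did not match come back as undef.
    my @fields = $name =~ /\A(.*?)(?=\d)|(\d+)x(\d+)/g;
    # Keep only the captures that are defined and non-empty.
    @fields = grep { defined($_) && length($_) } @fields;
    # Same cleanup as the one-liner: leading whitespace, whitespace that
    # sits directly before a word character, and underscores are deleted.
    s/^\s+|\s+\b|_//g for @fields;
    return @fields;
}

print join(', ', scan_name('show _3x01_3x02 ep.name')), "\n";
# show, 3, 01, 3, 02
```

Note that this cleanup deletes underscores outright, whereas the Ruby version turns them into spaces first, so a show name with an internal underscore would come out differently in the two.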