Perl regex without variable length lookbehind?


I'm trying to hyperlink 400 or so keywords in a 50,000 word markdown document.

This is one of several steps in a Perl "build chain", so it would be ideal to do the hyperlinking in Perl as well.

I have a separate file containing all the keywords, mapping each one to the markdown fragment it should be replaced with, like this:

keyword::(keyword)[#heading-to-jump-to]

The above example implies that wherever "keyword" occurs in the source markdown document, it should be replaced by the markdown fragment "(keyword)[#heading-to-jump-to]".

Ignoring keywords that occur as substrings of other keywords, plural/singular forms, and ambiguous keywords, it's reasonably straightforward. But naturally, there are two additional constraints.

I need to match only instances of keyword which are:

  • Not on a line beginning with #
  • Not directly below the heading they would jump to

The plain English meaning of these is: don't match keywords in any headings, and don't replace keywords that are under the heading they would link to.

My Perl script reads the $keyword::$link pairs and then, pair by pair, substitutes them into a regex, and then searches/replaces the document with that regex.

I've written a regex that does the matching (for the cases I've manually tested so far) using Regex Buddy's JGSoft regex implementation. It looks like this:

Frog::(Frog)[#the-frog]
-->    
([Ff]rog'?s?'?)(?=[\.!\?,;: ])(?<!#+ [\w ]*[Ff]rogs?)(?<!#+ the-frog)(?<!#+ the-frog[^#]*)

The problem (or, perhaps, a problem) with this is that it uses variable-length lookbehinds, which are not supported by Perl. So I can't even test this regex on the full document to see if it really works.

I've read a bunch of other posts on how to work around variable-length lookbehinds, but I can't seem to get it right for my particular case. Can any of the resident regex wizards help out with a neater regex that will execute in Perl?


There are 2 answers

3
amon (best answer)

As I see it, your program will have three states:

  1. In a headline.
  2. In a paragraph directly after a headline.
  3. In other paragraphs.

Because this is roughly a regular language, it can be parsed with regexes. But why would we want to do that, considering we would need 400 passes over the text?

It might really be easier to split the file into an array of paragraphs. When we hit a headline, we record all links that can point there. Then in the next paragraph, we substitute all keywords except the forbidden ones. For example:

my %substitutions = ...;
my $kw_regex = ...;
my %forbidden; # holds state

local $/ = ""; # paragraph mode
while (<>) {
  if (/^#/) {
    # it's a headline
    @forbidden{ slugify($_) } = ();  # extract forbidden link(s)
  } else {
    # a paragraph
    s{($kw_regex)}{
      my $keyword = $1;
      my $link = $substitutions{lc $keyword};
      exists $forbidden{$link} ? $keyword : "($keyword)[$link]";
    }eg;
    %forbidden = (); # forbidden links only in 1st paragraph after headline
  }
  print;
}

If headlines are not guaranteed to be separated from their paragraphs by an empty line, then paragraph mode will not work, and you'll have to roll your own.
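If you do end up rolling your own, one option is a line-by-line loop that remembers the current heading. The following is only a minimal sketch under stated assumptions: %link_for stands in for the parsed keyword file, slugify() is a hypothetical helper whose rules may not match your real anchors, and every line up to the next heading counts as being "under" that heading (a broader reading than the first-paragraph rule above). All keywords go into one alternation, so the text is scanned once rather than 400 times.

```perl
use strict;
use warnings;

# Hypothetical keyword => anchor map; the real one would come from the keyword file.
my %link_for = (frog => '#the-frog', toad => '#the-toad');

# One alternation, longest keyword first, so a single pass handles every keyword.
my $kw_regex = join '|', map { quotemeta } sort { length $b <=> length $a } keys %link_for;

# Hypothetical helper: "# The Frog" -> "#the-frog". Real anchor rules may differ.
sub slugify {
    my ($heading) = @_;
    $heading =~ s/^#+\s*//;       # drop the leading # markers
    $heading =~ s/\s+$//;         # drop trailing whitespace and the newline
    $heading =~ s/[^\w\s-]//g;    # drop punctuation
    $heading =~ s/\s+/-/g;        # spaces become hyphens
    return '#' . lc $heading;
}

sub link_keywords {
    my ($text) = @_;
    my $current = '';             # anchor of the heading we are currently under
    my $out = '';
    for my $line (split /^/m, $text) {
        if ($line =~ /^#/) {
            $current = slugify($line);   # heading: remember it, leave it untouched
        } else {
            $line =~ s{\b($kw_regex)\b}{
                my $link = $link_for{lc $1};
                $link eq $current ? $1 : "($1)[$link]";
            }egi;
        }
        $out .= $line;
    }
    return $out;
}

print link_keywords("# The Frog\nA frog is green.\n# The Toad\nA toad is not a frog.\n");
```

With this sample input, only the "frog" under "# The Toad" is turned into a link; the headings and the keywords under their own headings are left alone. Plurals and possessives are not handled here.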

Regexes are awesome, but they are not always an adequate tool.

4
TLP

That is one horrible regex. I would not want to be the poor sucker who is stuck with maintaining it. Also, how did you generate it from your replacement template?

I would suggest something considerably simpler. Use a hash to store the replacements, use word boundaries to prevent partial matches, use the /i modifier to match case-insensitively, and use regular loop logic to avoid replacements on commented lines.

use strict;
use warnings;

my @kw = "keyword::(keyword)[#heading-to-jump-to]";
my %rep = map { /([^:]+)::(.+)/ } @kw;
while (<DATA>) {
    next if /^#/;
    for my $kw (keys %rep) {
        s/\b\Q$kw\E\b/$rep{$kw}/ig;
    }
} continue {
    print;
}

__DATA__
This is a text with keywords. Only the keyword 'keyword' should be replaced.
# Dont replace keyword when in a comment

Output:

This is a text with keywords. Only the (keyword)[#heading-to-jump-to] '(keyword)[#heading-to-jump-to]' should be replaced.
# Dont replace keyword when in a comment

Explanation:

  • Create the hash of replacement keywords with a map statement, which returns a two element list for each keyword::replacement string.
  • With lines that begin with #, skip directly to print
  • For each keyword in the hash, perform a global /g, case insensitive /i substitution on each line. Use word boundary \b to prevent partial matches, and quote meta characters with \Q ... \E. Substitute with the hash value for that keyword.

As with all language processing, this will have some caveats and edge cases that need handling. For example, the word boundary will let foo be replaced in foo-bar, because \b treats the hyphen as a non-word character. As for how to control what not to replace under which heading, you would first have to tell me how to identify a heading.
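If the foo-bar case matters, one possible tweak (a sketch, not something I have tested on your data) is to swap \b for single-character lookarounds that also treat the hyphen as part of a word. They are fixed-width, so Perl accepts them. The %rep contents and the replace_strict name are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical replacement table, same shape as %rep above.
my %rep = (foo => '(foo)[#foo]');

sub replace_strict {
    my ($line) = @_;
    for my $kw (keys %rep) {
        # (?<![\w-]) and (?![\w-]) each look at one character only, so they are
        # fixed-width. They reject a match glued to a letter, digit, _ or -.
        $line =~ s/(?<![\w-])\Q$kw\E(?![\w-])/$rep{$kw}/ig;
    }
    return $line;
}

print replace_strict("foo and foo-bar, but not food.\n");
```

Here only the standalone foo is replaced; foo-bar and food are left as they are. Whether a hyphenated compound should count as a separate word is a decision for your data.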

Update:

If I understand you correctly, what you mean by skipping keywords inside paragraphs with their own heading, is something like this:

#heading-to-jump-to
Here is 'keyword' not replaced

Look up the string #heading-to-jump-to and remove keyword from the replacement list.

You might use a lookup hash with the keys being the heading references, and combine that with the generation of the first hash. Although, in this case I would start being concerned that you can have multiple keywords for each link, e.g. both foo and bar point to #foobar, so #foobar should exclude keywords foo and bar both.

my %rep;
my %heading;

for my $str (@kw) {
    chomp $str;
    my ($kw, $rep) = split /::/, $str, 2;  # split into 2 fields
    $rep{$kw} = $rep;
    my ($heading) = $rep =~ /\[([^]]+)\]/;
    push @{ $heading{$heading} }, $kw;
}

And then instead of simply skipping a line with next, do something like

my @kws = keys %rep;   # default list
while (<DATA>) {
    if (/^(#.+)/) {    # inside heading
        my %exclude = map { $_ => 1 } @{ $heading{$1} };
        @kws = grep { ! $exclude{$_} } @kws;
    } else {
        # not in a heading
        # ...
    }
}

Note that this is just a demonstration of the principle and not intended as working code. As you can see, the tricky part here is knowing when to reset the limited list of @kws and when to use it. You will have to make those decisions, since I do not know your data.
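To make the principle concrete, here is one way the pieces could fit together. It is only a sketch under assumptions of mine: headings look like the #heading-to-jump-to example above, a blank line ends the section a heading governs, and the sample data and the process name are invented. Your real reset rule may well be different.

```perl
use strict;
use warnings;

# Hypothetical data: foo and bar both link to #foobar.
my @kw = ("foo::(foo)[#foobar]", "bar::(bar)[#foobar]");

my (%rep, %heading);
for my $str (@kw) {
    my ($kw, $rep) = split /::/, $str, 2;
    $rep{$kw} = $rep;
    my ($heading) = $rep =~ /\[([^]]+)\]/;   # text between [ and ]
    push @{ $heading{$heading} }, $kw;
}

sub process {
    my (@lines) = @_;
    my @kws = sort keys %rep;            # default list
    my @out;
    for my $line (@lines) {
        if ($line =~ /^(#\S+)/) {        # heading: exclude the keywords linking to it
            my %exclude = map { $_ => 1 } @{ $heading{$1} // [] };
            @kws = grep { !$exclude{$_} } sort keys %rep;
        } elsif ($line =~ /^\s*$/) {     # assumption: a blank line ends the section
            @kws = sort keys %rep;
        } else {
            $line =~ s/\b\Q$_\E\b/$rep{$_}/ig for @kws;
        }
        push @out, $line;
    }
    return @out;
}

print process("#foobar\n", "foo and bar stay plain\n", "\n", "foo is linked here\n");
```

With this input, the line directly under #foobar is left untouched and the foo after the blank line becomes (foo)[#foobar]. Note that the keywords are still applied one at a time, so a keyword that also appears inside an earlier replacement text could be replaced again.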