Is it possible to modify word character class or \b boundary to exclude underscore character?


I need to replace a very big list of predefined patterns. These patterns contain only [a-zA-Z] characters; the underscore is excluded. They may appear in different forms: as a whole word, or as a word preceded and/or followed by an underscore character '_'.

Example: to replace FOO with BAR, I use the following four instructions:

$ cat > /tmp/try.pl
s/\bFOO\b/BAR/g;s/\bFOO_/BAR_/g;s/_FOO\b/_BAR/g;s/_FOO_/_BAR_/g;
$ perl -p /tmp/try.pl 
FOO aaa_FOO FOO_bbb FOO.txt a-FOO-b.txt aaa_FOO_bbb dontchange_FOOQUX_dontchange
BAR aaa_BAR BAR_bbb BAR.txt a-BAR-b.txt aaa_BAR_bbb dontchange_FOOQUX_dontchange

This does exactly what I want, but with thousands of words it takes time. If I could exclude the underscore from the word character class, I think I could use just one instruction:

s/\bFOO\b/BAR/g;

So is there any way to modify Perl's word character class or the \b boundary definition to exclude the underscore character?


There are 3 answers

2
TLP On

You can just combine \b and _ in a capture group (\b|_) and combine the regexes into one:

s/(\b|_)FOO(\b|_)/${1}BAR$2/g;

This keeps the behaviour of your original substitutions, but as ikegami points out in the comments, it fails for input such as _FOO_FOO_, because the underscore consumed by the first match is no longer available to start the second. We can fix that using \K and a lookahead assertion:

s/(?:\b|_)\KFOO(?=\b|_)/BAR/g

This is non-destructive towards our border characters and can therefore match two replacements separated by a single border character, such as in the case of _FOO_FOO_.
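To make the difference concrete, here is a minimal sketch that runs both substitutions over the same string. The input line is a made-up sample based on the question, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample input; _FOO_FOO_ is the overlapping case from the comments.
my $input = "FOO aaa_FOO FOO_bbb _FOO_FOO_ dontchange_FOOQUX_dontchange";

# Version 1: the border characters are captured and consumed by each match.
(my $consuming = $input) =~ s/(\b|_)FOO(\b|_)/${1}BAR$2/g;

# Version 2: \K and the lookahead leave the border characters unconsumed.
(my $lookaround = $input) =~ s/(?:\b|_)\KFOO(?=\b|_)/BAR/g;

print "$consuming\n";
print "$lookaround\n";
```

The first version should leave the second FOO of _FOO_FOO_ untouched, while the second version replaces both.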

2
zdim On

Update

It has been clarified that the words to be replaced are literal strings from a given list (no need to match [a-zA-Z]), so use an alternation built from these words. Further, each of these words needs to be replaced by its own predefined replacement; use a hash for that.

I assume that a word must not be surrounded by anything other than a _ or a word boundary, on either side. Lookarounds can express that.

A test program

use warnings;
use strict;
use feature 'say';

my @words_to_replace = qw(one ones thing nothing clean);
my %replacement = map { $_ => 'NEW.'.$_ } @words_to_replace;

my $re_word = join '|', @words_to_replace;  # no quotemeta; only [a-zA-Z]

my @test = qw(noone ones_ athing _thing nothing. _nothing_ clean);

for (@test) {
    printf "For %-12s: ", "|$_|";

    if ( s{ (?<! [^_\W]) ($re_word) (?! [^_\W]) }{$replacement{$1}}x ) {
        say "matched |$1|, now have |$_|";
    }
    else { say '' }
}

I make up a replacement for each word by prepending NEW. to it. Prints as expected.

The lookarounds specify that a word must not be surrounded by anything other than _ or \W (a non-word character). That nasty triple negation (not anything that is not an underscore or a non-word character) also lets the lookarounds succeed at the very start or end of the string, where there is no character at all.


The alternation built with ("thousands" of) words can be a problem for regex if the obtained pattern is longer than some 32k or so characters. If your lists are indeed so long that $re_word's length exceeds this number, perhaps the most economical way is to break the list up into multiple ones that are small enough, and do the above for each. (Trying to match and replace one word at a time will be much slower.)
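The splitting idea could be sketched roughly as follows. The chunk size of 2 is an arbitrary value chosen only to force several chunks on a tiny list, and the word list, replacement table and test string are reused from the test program above; none of this is tuned for a real workload.

```perl
use strict;
use warnings;

# Word list and replacement table as in the test program above.
my @words       = qw(one ones thing nothing clean);
my %replacement = map { $_ => "NEW.$_" } @words;

# Assumed chunk size, for illustration only; a real list would use a much larger value.
my $chunk_size = 2;

# Build one compiled pattern per chunk of words.
my @remaining = @words;
my @patterns;
while ( my @chunk = splice @remaining, 0, $chunk_size ) {
    my $alt = join '|', map { quotemeta($_) } @chunk;
    push @patterns, qr/(?<![^_\W])($alt)(?![^_\W])/;
}

# Apply every chunk's pattern in turn to the same text.
my $text = "noone ones_ athing _thing nothing. _nothing_ clean";
for my $re (@patterns) {
    $text =~ s/$re/$replacement{$1}/g;
}
print "$text\n";
```

One caveat with this multi-pass design: if a replacement string is itself a word handled by a later chunk, it would be replaced again. That does not happen with the made-up NEW. replacements here, but it is worth checking for a real list.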


The original response (believing that we need match [a-zA-Z] with only possible _ around)

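One way is to use POSIX character classes, where [[:alpha:]] matches [a-zA-Z] on ASCII input (under Unicode rules it also matches non-ASCII letters; the /a modifier restricts it to ASCII)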

It isn't clear to me what a replacement for a generic word is, but once it's given

s/([[:alpha:]]+)/$replacement/;

Another way is to form a pattern just as you like it and use that

my $re_char = qr/[a-zA-Z]/;

s/($re_char+)/$replacement/;

Please clarify how that replacement should work (other than foo-bar language).

If the replacement itself doesn't matter but it need be done only when the matched word is possibly surrounded on either side only by _ then one can use lookarounds to exclude any character other than _

m/(?<! [^_] )( [[:alpha:]]+ ) (?! [^_]) /x;

(Edit: to also allow a word boundary, use [^_\W] instead. See the first part.)

A test program

use warnings;
use strict;
use feature 'say';

my @words = qw(_before _. after_ _both_ none .ahem nah/);

for (@words) { 
    printf "%-8s:\t", $_; 
    if ( m/(?<! [^_] )( [[:alpha:]]+ ) (?! [^_]) /x ) { 
        say $1; 
    }   
    else { say "... no match" }
} 

This matches words ([a-zA-Z]) with an underscore on either or both sides, or with nothing around them, but not words with other characters around them (. and /).

(Edit: to allow for a word boundary along with _, use [^_\W]. See the first part.)

0
ikegami On

You want to exclude a lot more than underscores. \w matches 29,511 characters, a tad more than the 53 you think it matches.

You can use

my %repl = ( FOO => "BAR" );
s{[a-zA-Z]+}{ $repl{$&} // $& }eg

or

s/(?<![a-zA-Z])FOO(?![a-zA-Z])/BAR/g

An explanation of the latter follows, and answers to the title questions follow that.


\b

is equivalent to

(?: (?<!\w)(?=\w)   # At the beginning of a word
|   (?<=\w)(?!\w)   # At the end of a word
)

We want to replace \w with [a-zA-Z].

(?: (?<![a-zA-Z])(?=[a-zA-Z])
|   (?<=[a-zA-Z])(?![a-zA-Z])
)

So

\bFOO\b

would get replaced with

(?: (?<![a-zA-Z])(?=[a-zA-Z])
|   (?<=[a-zA-Z])(?![a-zA-Z])
)
FOO
(?: (?<![a-zA-Z])(?=[a-zA-Z])
|   (?<=[a-zA-Z])(?![a-zA-Z])
)

Yikes! Thankfully, because we know FOO both starts and ends with a character that matches [a-zA-Z], this can be simplified!

(?<![a-zA-Z])FOO(?![a-zA-Z])
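As a quick check, both forms from the top of this answer can be run against the sample line from the question. This is only a sketch: the variable names are illustrative, and the hash version uses a capture group and $1 rather than $&, which should behave the same here.

```perl
use strict;
use warnings;

# Sample line taken from the question.
my $line = "FOO aaa_FOO FOO_bbb FOO.txt a-FOO-b.txt aaa_FOO_bbb dontchange_FOOQUX_dontchange";

# Form 1: look up every run of ASCII letters in a hash, keeping unknown words as they are.
my %repl = ( FOO => "BAR" );
(my $by_hash = $line) =~ s{([a-zA-Z]+)}{ $repl{$1} // $1 }eg;

# Form 2: a single word, delimited by "no ASCII letter on either side".
(my $by_lookaround = $line) =~ s/(?<![a-zA-Z])FOO(?![a-zA-Z])/BAR/g;

print "$by_hash\n";
print "$by_lookaround\n";
```

Both lines printed should match the output of the four original substitutions shown in the question.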

Modifying \w to Exclude Underscore

You can use

[^\W_]    # \w is equivalent to [^\W]

or

(?[ \w - [_] ])   # Experimental

Modifying \b to Exclude Underscore

You can use (?<![^\W_])FOO(?![^\W_]) instead of \bFOO\b as explained above.
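Note that this is not the same test as the [a-zA-Z] version: [^\W_] still treats digits (and non-ASCII word characters) as part of a word. A small sketch on a made-up input line shows the difference; the variable names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical input mixing underscores and digits around FOO.
my $line = "FOO _FOO_ FOO1 2FOO FOOQUX";

# ASCII letters only: digits and underscores both act as delimiters.
(my $letters_only = $line) =~ s/(?<![a-zA-Z])FOO(?![a-zA-Z])/BAR/g;

# \w minus underscore: underscores delimit, but digits do not.
(my $word_minus_underscore = $line) =~ s/(?<![^\W_])FOO(?![^\W_])/BAR/g;

print "$letters_only\n";
print "$word_minus_underscore\n";
```

Which of the two is right depends on whether something like FOO1 should count as containing the word FOO.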