Regular expression to match boundary between different Unicode scripts

Regular expression engines have a concept of "zero width" matches, some of which are useful for finding edges of words:

  • \b - present in most engines to match any boundary between word and non-word characters
  • \< and \> - present in Vim to match only the boundary at the beginning of a word, and at the end of a word, respectively.

A newer concept in some regular expression engines is Unicode classes. One such class is script, which can distinguish Latin, Greek, Cyrillic, etc. These examples are all equivalent and match any character of the Greek writing system:

  • \p{greek}
  • \p{script=greek}
  • \p{script:greek}
  • [:script=greek:]
  • [:script:greek:]

But so far in my reading through sources on regular expressions and Unicode I haven't been able to determine if there is any standard or nonstandard way to achieve a zero-width match where one script ends and another begins.

In the string παν語 there would be a match between the ν and 語 characters, just as \b and \< would match just before the π character.

Now for this example I could hack something together based on looking for \p{Greek} followed by \p{Han}, and I could even hack something together based on all possible combinations of two Unicode script names.

But this wouldn't be a deterministic solution since new scripts are still being added to Unicode with each release. Is there a future-proof way to express this? Or is there a proposal to add it?
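
To make the pairwise hack concrete, here is a minimal Perl sketch. The helper names pairwise_boundary_rx and boundary_offsets are made up for illustration, and the list of script names has to be supplied by hand, which is exactly the weakness described above: a boundary involving a script missing from the list is never found.

```perl
use strict;
use warnings;
use utf8;

# Build one zero-width alternative per ordered pair of distinct scripts:
# "preceded by a character of script A, followed by one of script B".
# (Hypothetical helper; the caller must supply the script names.)
sub pairwise_boundary_rx {
    my @scripts = @_;
    my @alternatives;
    for my $left (@scripts) {
        for my $right (@scripts) {
            next if $left eq $right;
            push @alternatives,
                "(?<=\\p{Script=$left})(?=\\p{Script=$right})";
        }
    }
    my $pattern = join "|", @alternatives;
    return qr/$pattern/;
}

# Collect the character offsets at which the zero-width pattern matches.
sub boundary_offsets {
    my ($text, $rx) = @_;
    my @offsets;
    while ($text =~ /$rx/g) {
        push @offsets, pos($text);
    }
    return @offsets;
}

my $demo_rx = pairwise_boundary_rx(qw(Latin Greek Han));
print join(",", boundary_offsets("παν語", $demo_rx)), "\n";   # prints 3
```

With N script names this produces N*(N-1) alternatives, and it treats spaces, digits and punctuation as belonging to no listed script, so "παν 語" yields no boundary at all.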

tchrist On BEST ANSWER

EDIT: I just noticed you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, since the needed machinations are likely to be really tough in any other language. Plus if you’re doing pattern matching with Unicode, Perl really is the best choice available for that particular kind of work.


When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

my $data = "foo1 and Πππ 語語語 done";

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n"; 
} 

Generates this output:

Got string: 'foo1 and '
Got string: 'Πππ '
Got string: '語語語 '
Got string: 'done'

That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. This is pretty darned close to what I think you actually need.

The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.

My solution uses lexical variables inside of a (??{...}) construct. Turns out that that is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.

First, I decided that you didn’t actually want a run of all the same script type; you wanted a run of all the same script type plus Common and Inherited. Otherwise you will get messed up by punctuation and whitespace and digits for Common, and by combining characters for Inherited. I really don’t think you want those to interrupt your run of “all the same script”, but if you do, it’s easy to stop considering those.
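
A quick way to see why Common and Inherited need this special treatment is to ask Unicode::UCD for the script of a few characters. This is only an illustrative sketch: script_of is a hypothetical wrapper around charscript, the same function the full program uses.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Hypothetical convenience wrapper: script name of a single character.
sub script_of {
    my ($char) = @_;
    return charscript(ord $char);
}

# A letter, a digit, a space, a combining acute accent,
# a Greek capital pi and a Han ideograph.
for my $char ("f", "1", " ", "\x{301}", "\x{3a0}", "\x{8a9e}") {
    printf "U+%04X => %s\n", ord $char, script_of($char);
}
```

The digit and the space should come back as Common and the combining accent as Inherited, so a strict "same script only" run would break at every space, digit, or accent.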

So what we do is look ahead for the first character whose script type is something other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern that matches any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved off. Then we evaluate that new pattern and continue.
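
The same steps can also be sketched as an ordinary loop, without (??{...}). This is not the program shown in this answer, just a rough procedural equivalent under the same assumptions; script_runs and script_boundaries are names invented for the sketch. The second helper turns the runs into the zero-width boundary offsets the question asks about.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Split text into runs of "one script plus Common and Inherited".
sub script_runs {
    my ($text) = @_;
    my @runs;
    my $rest = $text;
    while (length $rest) {
        # Step 1: peek ahead to the first character that is neither
        # Common nor Inherited.
        my ($peek) = $rest =~ /([^\p{Script=Common}\p{Script=Inherited}])/;
        last unless defined $peek;    # only neutral characters in the text

        # Step 2: find out which script that character belongs to.
        my $script = charscript(ord $peek) // 'Unknown';

        # Step 3: build a character class for Common, Inherited and
        # that script, then strip the longest matching prefix.
        my $class = '[\p{Script=Common}\p{Script=Inherited}\p{Script='
                  . $script . '}]';
        $rest =~ s/\A($class*)//;
        push @runs, $1;
    }
    return @runs;
}

# Character offsets where one run ends and the next begins.
sub script_boundaries {
    my ($text) = @_;
    my @runs = script_runs($text);
    pop @runs;                        # no boundary after the final run
    my $offset = 0;
    my @offsets;
    for my $run (@runs) {
        $offset += length $run;
        push @offsets, $offset;
    }
    return @offsets;
}

binmode STDOUT, ':encoding(UTF-8)';
my $sample = "foo1 and Πππ 語語語 done";
print "Got string: '$_'\n" for script_runs($sample);
print "Boundaries at: @{[ script_boundaries($sample) ]}\n";
```

Because the script name is looked up at run time, nothing here has to be updated when a new script is added, as long as the installed Unicode::UCD data knows about it. The trade-off is that the result is a list of offsets rather than a reusable zero-width assertion.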

Hey, I said it was hairy, didn’t I?

In the program I’m about to show, I’ve left in some commented-out debugging statements that show just what it’s doing. If you uncomment them, you get this output for the sample string, which should help you understand the approach:

DEBUG: Got peekahead character f, U+0066
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'foo1 and '
DEBUG: Got peekahead character Π, U+03a0
DEBUG: Scriptname is Greek
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
Got string: 'Πππ '
DEBUG: Got peekahead character 語, U+8a9e
DEBUG: Scriptname is Han
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
Got string: '語語語 '
DEBUG: Got peekahead character d, U+0064
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'done'

And here at last is the big hairy deal:

use v5.17.1;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);
use utf8;

use Unicode::UCD qw(charscript);

# regex to match a string that's all of the
# same Script=XXX type
#
my $rx = qr{
    (?=
       [\p{Script=Common}\p{Script=Inherited}] *
        (?<CAPTURE>
            [^\p{Script=Common}\p{Script=Inherited}]
        )
    )
    (??{
        my $capture = $+{CAPTURE};
   #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
        my $scriptname = charscript(ord $capture);
   #####print "DEBUG: Scriptname is $scriptname\n";
        my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                . $scriptname
                . q(}]*);
   #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
        $run;
    })
}x;


my $data = "foo1 and Πππ 語語語 done";

$| = 1;

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n";
}

Yeah, there oughta be a better way. I don’t think there is—yet.

So for now, enjoy.