The same regex gives different output in Python and PCRE for the same input


I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.

The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

When I use the pattern with a UTF-8 encoded text like abcdeparallel १२४, I get the following output:

>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']

It looks like the pattern is more or less the same in PCRE-flavored regex; I just had to add a /g flag at the end for UTF-8 matching.

However, when I use the pattern with PCRE via the pcre2test tool on macOS, I get very different output:

$ pcre2test -8
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \xe0
 0: \xa5\xa7
 0: \xe0
 0: \xa5\xa8
 0: \xe0
 0: \xa5
 0: \xaa

Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:

>>> "\xe0\xa5\xa7\xe0\xa5\xa8"
'à¥§à¥¨'

Is there a flag or something that I am missing that must be passed to get the same behaviour as the regex package from Python? When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?


There are 2 answers

mudskipper On BEST ANSWER

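The Hindi code points are actually matched, but pcre2test shows them as hex escapes of their UTF-8 bytes. Without UTF mode, PCRE2 treats every byte as a separate character, which is why the numerals end up split across several matches. The bytes themselves are the expected ones: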

>>> "१२४".encode("utf-8")
b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'

According to the pcre2test documentation:

When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.

The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me whether Hindi is supported.

With (*UTF) at the start of the pattern you do get two matches, and the Hindi numerals are then shown as Unicode code points in hex:

re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}
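As a quick check in Python, the byte-mode matches from the question can be joined back together and decoded. This is only a sketch: the list below is copied by hand from the pcre2test output in the question and written as Python bytes literals, and the variable names are my own.

```python
# Matches printed by pcre2test in 8-bit mode without UTF,
# transcribed from the question as bytes literals.
byte_mode_matches = [
    b"abcdeparallel",
    b" \xe0",
    b"\xa5\xa7",
    b"\xe0",
    b"\xa5\xa8",
    b"\xe0",
    b"\xa5",
    b"\xaa",
]

# Concatenate the pieces and compare with the UTF-8 encoding of the subject.
joined = b"".join(byte_mode_matches)
subject = "abcdeparallel १२४"

print(joined == subject.encode("utf-8"))
print(joined.decode("utf-8"))
```

No bytes are lost or changed; the matches are simply cut at byte boundaries instead of code point boundaries.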
sytech On

You just need to decode the bytes using UTF-8, rather than treating them as a string.

>>> "\xe0\xa5\xa7\xe0\xa5\xa8"
'à¥§à¥¨'
>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode('utf-8')
'१२'
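A minimal sketch of the distinction, assuming Python 3: a str literal written with \x escapes holds the code points U+00E0, U+00A5 and so on, not bytes. To recover the intended text from such a string, it first has to be mapped back to bytes. Latin-1 maps code points 0 to 255 onto the same byte values, so it works for that step. The variable names are illustrative only.

```python
# A str whose six code points happen to equal the UTF-8 byte values.
s = "\xe0\xa5\xa7\xe0\xa5\xa8"

# Latin-1 turns each code point below 256 into the byte with the same value.
raw = s.encode("latin-1")

# Decoding those bytes as UTF-8 yields the two Devanagari digits.
recovered = raw.decode("utf-8")

print(len(s), len(raw), len(recovered))
print(recovered)
```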

To get a result that looks more like the Python one, use the utf8_input option (with the 32-bit pcre2test) or put (*UTF) at the beginning of the pattern:

$ pcre2test -32
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g,utf8_input
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}

In Python, you can show these code points as follows:

>>> u"\u0967\u0968\u096a"
'१२४'
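If you want to turn a whole line of pcre2test output back into readable text, a small helper along these lines might do. It is a sketch that only handles the \x{...} escape form shown above; the function name unescape_pcre2test is made up for illustration and is not part of any library.

```python
import re

# Matches escapes such as \x{967} and captures the hex digits.
_HEX_ESCAPE = re.compile(r"\\x\{([0-9a-fA-F]+)\}")


def unescape_pcre2test(text: str) -> str:
    """Replace each \\x{HEX} escape with the code point it names."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# The second match from the pcre2test output, as a raw string.
print(unescape_pcre2test(r" \x{967}\x{968}\x{96a}"))
```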