Python and PCRE regex that are the same give different outputs for the same input

70 views Asked by draklor40 At 11 March 2024 at 21:48

I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.

The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

When I use the pattern with a UTF-8 encoded text like abcdeparallel १२४, I get the following output:

>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']

It looks like the pattern is more or less the same in PCRE-flavored regex as well, with me just having to add a /g flag at the end for UTF-8 matching.

However, when I try the pattern with PCRE via the pcre2test tool on macOS, I get a very different output:

$ pcre2test -8
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \xe0
 0: \xa5\xa7
 0: \xe0
 0: \xa5\xa8
 0: \xe0
 0: \xa5
 0: \xaa

Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:

>>> "\xe0\xa5\xa7\xe0\xa5\xa8"
'à¥§à¥¨'

Is there a flag or something I am missing that must be passed to get the same behaviour as the regex package from Python? When code points are encoded into UTF-8 bytes, wouldn't the library know how to put them back together into the same code points?

There are 2 answers

sytech On 11 March 2024 at 22:19

You just need to decode the bytes using UTF-8, rather than treating them as a string.

>>> "\xe0\xa5\xa7\xe0\xa5\xa8"
'à¥§à¥¨'
>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode('utf-8')
'१२'

To get a result that looks more like the Python one, use the utf8_input option (with the 32-bit pcre2test) or put (*UTF) at the beginning of the pattern:

$ pcre2test -32
PCRE2 version 10.42 2022-12-11
  re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g,utf8_input
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}

In Python, you can show these code points as follows:

>>> u"\u0967\u0968\u096a"
'१२४'
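
If you want to read pcre2test output as text again, a small conversion helper can do it. The following is only a sketch: decode_pcre2test_escapes is a hypothetical name, not part of PCRE2 or Python, and it assumes that \x{hhh} stands for a code point and that a run of \xhh escapes stands for UTF-8 bytes, as in the transcripts above.

```python
import re


def decode_pcre2test_escapes(shown: str) -> str:
    r"""Hypothetical helper: turn pcre2test-style hex escapes back into text.

    \x{hhh} is read as a single code point (16/32-bit or UTF output).
    A run of \xhh escapes is read as UTF-8 bytes (8-bit, non-UTF output).
    """
    # First replace the brace form, one code point per escape.
    shown = re.sub(
        r"\\x\{([0-9a-fA-F]+)\}",
        lambda m: chr(int(m.group(1), 16)),
        shown,
    )

    # Then collect each run of byte escapes and decode the run as UTF-8.
    def decode_run(m: re.Match) -> str:
        hex_pairs = re.findall(r"\\x([0-9a-fA-F]{2})", m.group(0))
        raw = bytes(int(h, 16) for h in hex_pairs)
        # Incomplete sequences become U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")

    return re.sub(r"(?:\\x[0-9a-fA-F]{2})+", decode_run, shown)


print(decode_pcre2test_escapes(r" \x{967}\x{968}\x{96a}"))
print(decode_pcre2test_escapes(r"\xe0\xa5\xa7\xe0\xa5\xa8"))
```

Note that the 8-bit output in the question splits single characters across several match lines (for example " \xe0" followed by "\xa5\xa7"), so those lines would have to be joined before decoding; decoded one at a time they only yield replacement characters.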

mudskipper On 11 March 2024 at 22:35 (accepted answer)

The bytes of the Hindi digits are actually matched, but pcre2test renders them on screen as hex escapes of the individual UTF-8 bytes:

>>> "१२४".encode("utf-8")
b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'

According to the pcre2test documentation:

When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.

The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me if Hindi is supported.

With the (*UTF) flag you do get two matches, and the Hindi numerals are then rendered as Unicode hex escapes:

re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
 0: abcdeparallel
 0:  \x{967}\x{968}\x{96a}
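
To see why the output without (*UTF) is split into so many small matches, it may help to look at each UTF-8 byte as if it were a character of its own. The sketch below uses only Python's standard unicodedata module; classify_as_single_byte_chars is a hypothetical helper, and the assumption is that in 8-bit non-UTF mode PCRE2 looks up \p{L} and \p{N} for each byte value as the code point with the same number (0 to 255).

```python
import unicodedata

text = "१२४"
utf8_bytes = text.encode("utf-8")


def classify_as_single_byte_chars(data: bytes):
    """Hypothetical helper: treat every byte as its own code point (0-255)."""
    return [
        (f"\\x{b:02x}", chr(b), unicodedata.category(chr(b)))
        for b in data
    ]


# One row per byte: escape as pcre2test shows it, the character that byte
# value denotes on its own, and that character's general category.
rows = classify_as_single_byte_chars(utf8_bytes)
for escape, char, category in rows:
    print(escape, char, category)

# Decoded as UTF-8, the same nine bytes are three decimal digits.
decoded_categories = [unicodedata.category(c) for c in utf8_bytes.decode("utf-8")]
print(decoded_categories)
```

Under that assumption, \xe0 and \xaa count as letters and are picked up by the \p{L}+ branch, while \xa5, \xa7 and \xa8 are symbols or punctuation and fall to the [^\s\p{L}\p{N}]+ branch, which lines up with the eight matches in the question. Once the bytes are decoded as UTF-8, all three characters are digits, so the \p{N}+ branch takes them in one match.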
