Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
Python regex matching Unicode properties
21k views. Asked by ThomasH. There are 6 answers.
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].
People would thank you. :)
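A minimal sketch of such a preprocessor in Python 3 follows. It is one possible approach, not an existing library: the names expand_properties and category_class_body are made up for illustration, and the character sets are derived from the standard unicodedata module rather than typed in by hand, so they reflect whatever Unicode version the running interpreter ships. It only understands general-category names (L, Lu, Zs, M, ...), and it does not handle a \p{...} token that already sits inside a [...] class.

```python
import re
import sys
import unicodedata
from functools import lru_cache


def _escape(cp):
    # Spell a code point as an escape that the re module understands.
    return '\\u%04X' % cp if cp <= 0xFFFF else '\\U%08X' % cp


@lru_cache(maxsize=None)
def category_class_body(prefix):
    """Return the inside of a character class covering every code point
    whose general category starts with `prefix` (e.g. 'L' or 'Zs')."""
    ranges, start, prev = [], None, None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    if not ranges:
        raise ValueError('unknown Unicode category: %r' % prefix)
    return ''.join(
        _escape(a) if a == b else _escape(a) + '-' + _escape(b)
        for a, b in ranges
    )


# Consume every backslash escape as a unit, so that an escaped backslash
# followed by "p{L}" is left alone; only \p{..} and \P{..} are rewritten.
_TOKEN = re.compile(r'\\(?:([pP])\{([A-Za-z]{1,2})\}|.)', re.DOTALL)


def expand_properties(pattern):
    """Replace \\p{X} / \\P{X} tokens with explicit character classes."""
    def repl(match):
        if match.group(1) is None:
            return match.group(0)  # some other escape, e.g. \d
        negate = '^' if match.group(1) == 'P' else ''
        return '[' + negate + category_class_body(match.group(2)) + ']'
    return _TOKEN.sub(repl, pattern)


# Usage: an upper-case letter followed by one or more lower-case letters.
word = re.compile(expand_properties(r'\p{Lu}\p{Ll}+'))
print(bool(word.fullmatch('\u00c9mile')))  # True
```

Scanning the whole code space once per category is slow, which is why the result is cached; a real implementation would probably precompute the tables.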
The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
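As a rough illustration, assuming the third-party regex package is installed (pip install regex), the two properties from the question can be used directly. The helper name find_category is hypothetical and exists only for this example; the import is guarded so the snippet still loads where the package is missing.

```python
try:
    import regex  # third-party package, installed with: pip install regex
except ImportError:  # keep the sketch importable where the package is absent
    regex = None


def find_category(text, category):
    """Return every character of `text` in the given general category.

    `category` is a Unicode general-category name such as 'Ll' or 'Zs'.
    """
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    return regex.findall(r'\p{%s}' % category, text)


if regex is not None:
    # Lower-case letters, then space separators (U+00A0 is a no-break space).
    print(find_category('aB\u00e7D', 'Ll'))
    print(find_category('a b\u00a0c', 'Zs'))
```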
Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v.5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.