Python regex matching Unicode properties

Question

Python regex matching Unicode properties

21k views Asked by ThomasH At 02 December 2009 at 13:25

Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or p{Zs} for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

Original Q&A

There are 6 answers

Jonathan Feinberg On 02 December 2009 at 14:26

You're right that Unicode property classes are not supported by the Python regex parser.

If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F], and \P{M} would become [^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F].

People would thank you. :)

tzot On 05 December 2009 at 15:24

Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} should be covered by '(?u)\s'. The (?u), as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s means any spacing character.

zellyn On 12 November 2010 at 00:23

You can painstakingly use unicodedata on each character:

import unicodedata

def strip_accents(x):
    return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')

ronnix On 30 November 2010 at 16:37

The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.

mgibsonbr On 04 March 2012 at 05:12

Speaking of homegrown solutions, some time ago I wrote a small program to do just that - convert a unicode category written as \p{...} into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L, Zs), and is restricted to the BMP. I'm posting it here in case someone find it useful (although that Oniguruma really seems a better option).

Example usage:

>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'ÁñÇ_1+2').group(0)
ÁñÇ_1
>>>

Here's the source. There is also a JavaScript version, using the same data.

**joeforker** · Accepted Answer · 2009-12-02T22:22:09+00:00

joeforker On 02 December 2009 at 22:22 BEST ANSWER

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

TechQA.

Python regex matching Unicode properties

There are 6 answers

Related Questions in PYTHON

Related Questions in REGEX

Related Questions in UNICODE

Related Questions in UCD

Related Questions in CHARACTER-PROPERTIES

Popular Questions

Trending Questions