Often one wants to list all characters in a given Unicode category. For example:
- List all Unicode whitespace (see "How can I get all whitespaces in UTF-8 in Python?")
- Characters with the property Alphabetic
It is possible to produce this list by iterating over all Unicode code-points and testing for the desired category (Python 3):
import unicodedata
[c for c in map(chr, range(0x110000)) if unicodedata.category(c) in ('Ll',)]
or using regexes,
import re
re.findall(r'\s', ''.join(map(chr, range(0x110000))))
But these methods are slow. Is there a way to look up a list of characters in the category without having to iterate over all of them?
Related question for Perl: How do I get a list of all Unicode characters that have a given property?
If you need to do this often, it's easy enough to build yourself a re-usable map:
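A minimal sketch of such a map, assuming Python 3 and only the standard library. The name `unicode_category` is illustrative; the one-time scan over all code points is still paid here, but only once:

```python
import sys
import unicodedata
from collections import defaultdict

# Map each general category (e.g. 'Ll', 'Zs') to the list of characters in it.
unicode_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_category[unicodedata.category(c)].append(c)
```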
And from there on out use that map to translate back to a series of characters for a given category:
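As a sketch of that lookup, the block below rebuilds the map so that it runs on its own; `build_category_map`, `lowercase_letters` and `separators` are hypothetical names chosen for illustration:

```python
import sys
import unicodedata
from collections import defaultdict

def build_category_map():
    """Group every code point by its Unicode general category."""
    categories = defaultdict(list)
    for c in map(chr, range(sys.maxunicode + 1)):
        categories[unicodedata.category(c)].append(c)
    return categories

unicode_category = build_category_map()

# A single dictionary lookup replaces the full scan over all code points.
lowercase_letters = unicode_category['Ll']

# Several categories can be combined, e.g. the three separator categories.
separators = [c for cat in ('Zs', 'Zl', 'Zp') for c in unicode_category[cat]]
```

Note that general categories do not line up exactly with what `\s` matches in a regex: tab and newline, for instance, are in category `Cc`, not in any of the `Z*` categories.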
If this is too costly for start-up time, consider dumping that structure to a file; loading this mapping from a JSON file or other quick-to-parse-to-dict format should not be too painful.
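One possible way to cache the map as JSON is sketched below. The function names and the file name `unicode_category.json` are assumptions made for illustration, and a temporary directory stands in for a real cache location. Characters are stored as lists of one-character strings rather than joined strings, so that adjacent surrogate code points are not merged into a single character when the file is read back:

```python
import json
import os
import sys
import tempfile
import unicodedata
from collections import defaultdict

def build_category_map():
    """Group every code point by its Unicode general category."""
    categories = defaultdict(list)
    for c in map(chr, range(sys.maxunicode + 1)):
        categories[unicodedata.category(c)].append(c)
    return categories

def save_category_map(categories, path):
    # json.dump escapes non-ASCII characters by default (ensure_ascii=True),
    # so the resulting file is plain ASCII.
    with open(path, 'w', encoding='ascii') as f:
        json.dump(categories, f)

def load_category_map(path):
    with open(path, encoding='ascii') as f:
        return json.load(f)

# Demonstration only: a real program would use a fixed cache path instead.
with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'unicode_category.json')
    save_category_map(build_category_map(), cache_path)
    restored = load_category_map(cache_path)
```

Whether loading the file is actually faster than rebuilding the map depends on the machine and the Python version, so it is worth timing both before committing to a cache.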
Once you have the mapping, looking up a category is a single constant-time dictionary access.