perluniprops lists the Unicode properties of the version of Unicode it supports. For Perl 5.32.1, that's Unicode 13.0.0.
You can obtain a list of the characters that match a category using Unicode::Tussle's unichars
.
unichars '\p{Close_Punctuation}'
And the help:
$ unichars --help
Usage:
unichars [*options*] *criterion* ...
Each criterion is either a square-bracketed character class, a regex
starting with a backslash, or an arbitrary Perl expression. See the
EXAMPLES section below.
OPTIONS:
Selection Options:
--bmp include the Basic Multilingual Plane (plane 0) [DEFAULT]
--smp include the Supplementary Multilingual Plane (plane 1)
--astral -a include planes above the BMP (planes 1-15)
--unnamed -u include various unnamed characters (see DESCRIPTION)
--locale -l specify the locale used for UCA functions
Display Options:
--category -c include the general category (GC=)
--script -s include the script name (SC=)
--block -b include the block name (BLK=)
--bidi -B include the bidi class (BC=)
--combining -C include the canonical combining class (CCC=)
--numeric -n include the numeric value (NV=)
--casefold -f include the casefold status
--decimal -d include the decimal representation of the code point
Miscellaneous Options:
--version -v print version information and exit
--help -h this message
--man -m full manpage
--debug -d show debugging of criteria and examined code point span
Special Functions:
$_ is the current code point
ord is the current code point's ordinal
NAME is charname::viacode(ord)
NUM is Unicode::UCD::num(ord), not code point number
CF is casefold->{status}
NFD, NFC, NFKD, NFKC, FCD, FCC (normalization)
UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)
Singleton, Exclusion, NonStDecomp, Comp_Ex
checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE
Other than reading the list of categories from the webpage, is there a way to programmatically get all the possible \p{...}
categories?
From the comments, I believe you are trying to port a Perl program using
\p
regex properties to Python. You don't need a list of all categories (whatever that means); you just need to know what Code Points each of the property used by the program matches.Now, you could get the list of Code Points from the Unicode database. But a much simpler solution is to use Python's regex module instead of the re module. This will give you access to the same Unicode-defined properties that Perl exposes.
The latest version of the regex module even uses Unicode 13.0.0 just like the latest Perl.
Note that the program uses
\p{IsAlnum}
, a long way of writing\p{Alnum}
.\p{Alnum}
is not a standard Unicode property, but a Perl extension. It's the union of Unicode properties\p{Alpha}
and\p{Nd}
. I don't know know if the regex module definesAlnum
identically, but it probably does.