Is there a way to list all categories in perluniprops?


perluniprops lists the Unicode properties of the version of Unicode it supports. For Perl 5.32.1, that's Unicode 13.0.0.

You can obtain a list of the characters that match a category using Unicode::Tussle's unichars.

$ unichars '\p{Close_Punctuation}'

And the help:

$ unichars --help
Usage:
    unichars [*options*] *criterion* ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

    OPTIONS:

     Selection Options:

        --bmp           include the Basic Multilingual Plane (plane 0) [DEFAULT]
        --smp           include the Supplementary Multilingual Plane (plane 1)
        --astral    -a  include planes above the BMP (planes 1-15)
        --unnamed   -u  include various unnamed characters (see DESCRIPTION)
        --locale    -l  specify the locale used for UCA functions

     Display Options:

        --category  -c  include the general category (GC=)
        --script    -s  include the script name (SC=)
        --block     -b  include the block name (BLK=)
        --bidi      -B  include the bidi class (BC=)
        --combining -C  include the canonical combining class (CCC=)
        --numeric   -n  include the numeric value (NV=)
        --casefold  -f  include the casefold status
        --decimal   -d  include the decimal representation of the code point

     Miscellaneous Options:

        --version   -v  print version information and exit
        --help      -h  this message
        --man       -m  full manpage
        --debug     -d  show debugging of criteria and examined code point span

     Special Functions:

         $_    is the current code point
         ord   is the current code point's ordinal

         NAME is charname::viacode(ord)
         NUM is Unicode::UCD::num(ord), not code point number
         CF is casefold->{status}
         NFD, NFC, NFKD, NFKC, FCD, FCC  (normalization)
         UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)

         Singleton, Exclusion, NonStDecomp, Comp_Ex
         checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
         NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE

Other than reading the list of categories from the webpage, is there a way to programmatically get all the possible \p{...} categories?

ikegami (best answer)

From the comments, I believe you are trying to port a Perl program that uses \p regex properties to Python. You don't need a list of all categories (whatever that means); you just need to know which Code Points each property used by the program matches.
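As a rough illustration of what "which Code Points a property matches" means, here is a minimal sketch that uses only Python's standard unicodedata module. It covers General_Category values only (for example Pe, the short name of Close_Punctuation), and the helper name code_points_in_category is made up for this example. Note that unicodedata follows the Unicode version bundled with your Python build, which may differ from the one your Perl uses.

```python
import sys
import unicodedata


def code_points_in_category(gc):
    """Return every code point whose General_Category equals gc (e.g. "Pe")."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == gc
    ]


# Pe is the short name of Close_Punctuation.
close_punct = code_points_in_category("Pe")

# Show which Unicode version this Python build ships, and how many matches there are.
print(unicodedata.unidata_version, len(close_punct))

# Print the first few matches in the usual U+XXXX notation.
for cp in close_punct[:5]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

This only answers the question for general categories; scripts, blocks and binary properties such as Alphabetic are not exposed by unicodedata.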

Now, you could get the list of Code Points from the Unicode database. But a much simpler solution is to use Python's regex module instead of the re module. This will give you access to the same Unicode-defined properties that Perl exposes.

The latest version of the regex module even uses Unicode 13.0.0 just like the latest Perl.
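A minimal sketch of what that looks like, assuming the third-party regex module has been installed (pip install regex). The function name find_close_punctuation is only for illustration, and the import guard is there so the snippet still loads on a machine without the module.

```python
try:
    import regex  # third-party module, not part of the standard library
except ImportError:
    regex = None  # lets the snippet load even when regex is not installed


def find_close_punctuation(text):
    """Return every Close_Punctuation (Pe) character found in text."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is required")
    # \p{Pe} is the same property spelling Perl accepts.
    return regex.findall(r"\p{Pe}", text)


if regex is not None:
    print(find_close_punctuation("f(x) [y] {z}"))
```

The standard re module does not understand \p{...} at all, which is why the port needs regex (or a hand-built character class) rather than re.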


Note that the program uses \p{IsAlnum}, a long way of writing \p{Alnum}. \p{Alnum} is not a standard Unicode property, but a Perl extension. It's the union of the Unicode properties \p{Alpha} and \p{Nd}. I don't know whether the regex module defines Alnum identically, but it probably does.
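If you would rather not depend on how the regex module defines Alnum, one option is to spell out the union yourself. The sketch below assumes the regex module is installed; ALNUM_PATTERN and alnum_chars are names invented for this example.

```python
try:
    import regex  # third-party module, not part of the standard library
except ImportError:
    regex = None  # lets the snippet load even when regex is not installed

# Perl's \p{Alnum}, written out explicitly as Alphabetic plus Nd (decimal digits).
ALNUM_PATTERN = r"[\p{Alphabetic}\p{Nd}]"


def alnum_chars(text):
    """Return the characters of text that are Alphabetic or decimal digits."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is required")
    return regex.findall(ALNUM_PATTERN, text)


if regex is not None:
    # Underscore and "!" are neither Alphabetic nor Nd, so they are skipped.
    print(alnum_chars("a1_é٣!"))
```

Writing the class out this way keeps the Python port tied to the Unicode properties themselves rather than to a library-specific alias.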