How to get all Unicode characters from specific categories?

Question

2.1k views Asked by AudioBubble At 31 March 2017 at 22:19

How can I get, for example, a code point pattern like x-y\uxxxx\Uxxxxxxxxx for the Connector Punctuation (Pc) category, for scanning ECMAScript 3/JavaScript identifiers?

Original question

I need help verifying whether a character (code point) is valid in an ECMA-262 (3rd edition, section 7.6) identifier, for a lexical scanner.

Syntax quote

Identifier ::

IdentifierName but not ReservedWord

IdentifierName ::

IdentifierStart

IdentifierName IdentifierPart

IdentifierStart ::

UnicodeLetter

$

_

~~\ UnicodeEscapeSequence~~ # no need to check this

IdentifierPart ::

IdentifierStart

UnicodeCombiningMark

UnicodeDigit

UnicodeConnectorPunctuation

UnicodeLetter ::

any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::

any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::

any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::

any character in the Unicode category “Connector punctuation (Pc)”

As you can see, it takes any character of certain categories.

I need all of these possible characters, so my first step was to search for "Connector punctuation" in this Unicode 5.0 chart, but I got 0 matches, so I believe I'm going about it the wrong way. Could someone help me?

There are 2 answers

Hydroper On 28 September 2022 at 11:19

I'm the OP. I'm now using another approach for determining the Unicode General Category. I made a tool that converts the UnicodeData.txt file into highly optimized binary files: https://github.com/matheusdiasdesouzads/unicode-general-category/tree/master/data and a library for working with General Categories: https://github.com/matheusdiasdesouzads/unicode-general-category/tree/master/language-specific/javascript-nodejs

let cat = GeneralCategory.from(0x41);
cat.toString(); // 'Lu'
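
For readers who want to see the general idea without the library, here is a minimal sketch of how General Category ranges could be derived from text in the UnicodeData.txt format. It is not the tool linked above: `SAMPLE`, `parseUnicodeData` and `categoryOf` are hypothetical names, the sample holds only a handful of illustrative lines, and a real run would read the full file instead.

```javascript
// A few lines in the UnicodeData.txt format. Fields are separated by ";":
// field 0 is the code point in hex, field 1 the name, field 2 the category.
// Large blocks are stored as a "<..., First>" / "<..., Last>" pair of lines.
const SAMPLE = [
  "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
  "005F;LOW LINE;Pc;0;ON;;;;;N;SPACING UNDERSCORE;;;;",
  "203F;UNDERTIE;Pc;0;ON;;;;;N;;;;;",
  "2040;CHARACTER TIE;Pc;0;ON;;;;;N;;;;;",
  "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
  "9FFF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
].join("\n");

// Turn the text into sorted { first, last, category } ranges,
// merging adjacent code points that share a category.
function parseUnicodeData(text) {
  const ranges = [];
  let pendingFirst = null;
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    const [hex, name, category] = line.split(";");
    const cp = parseInt(hex, 16);
    if (name.endsWith(", First>")) {
      pendingFirst = cp; // remember the start, wait for the "Last" line
      continue;
    }
    const isLast = name.endsWith(", Last>") && pendingFirst !== null;
    const first = isLast ? pendingFirst : cp;
    pendingFirst = null;
    const prev = ranges[ranges.length - 1];
    if (prev && prev.category === category && prev.last + 1 === first) {
      prev.last = cp;
    } else {
      ranges.push({ first, last: cp, category });
    }
  }
  return ranges;
}

// Binary search over the sorted ranges. Code points that are not
// listed are reported as "Cn" (unassigned).
function categoryOf(ranges, cp) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const r = ranges[mid];
    if (cp < r.first) hi = mid - 1;
    else if (cp > r.last) lo = mid + 1;
    else return r.category;
  }
  return "Cn";
}

const ranges = parseUnicodeData(SAMPLE);
console.log(categoryOf(ranges, 0x41));
console.log(categoryOf(ranges, 0x2040));
```

The binary search relies on the input being sorted by code point, which is how UnicodeData.txt is laid out.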

**CharlotteBuff** · Accepted Answer · 2017-04-02T00:24:46+00:00

Unicode offers an online tool for determining sets of characters (the UnicodeSet utility). It uses a regular-expression-like set notation in which property-value pairs are enclosed in [: and :].

For all characters in Unicode 5 you want to do [:age=5.0:].

The rest are "general categories" (gc). So for example [:age=5.0:]&[:gc=Lu:] will find all uppercase letters in Unicode 5 (gc=L will find all letters in general).
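
If you would rather produce the list locally, the same kind of query can be approximated in JavaScript with Unicode property escapes (ES2018 and later). The sketch below scans every code point and collapses the matches into ranges, then prints them as a character-class body. `rangesForCategory`, `escapeCodePoint` and `toClassBody` are hypothetical helper names, and the result reflects whatever Unicode version your engine ships, not Unicode 5.0 specifically.

```javascript
// Collect [first, last] ranges of code points whose General Category
// matches `category` (for example "Pc" or "Lu").
// Assumption: `category` is a valid value, otherwise RegExp throws.
function rangesForCategory(category) {
  const re = new RegExp("^\\p{gc=" + category + "}$", "u");
  const ranges = [];
  let start = -1;
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    const matches = re.test(String.fromCodePoint(cp));
    if (matches && start < 0) {
      start = cp; // a new range begins
    } else if (!matches && start >= 0) {
      ranges.push([start, cp - 1]); // the range ended one step back
      start = -1;
    }
  }
  if (start >= 0) ranges.push([start, 0x10ffff]);
  return ranges;
}

// \uXXXX for the BMP, \u{X...} above it (the latter needs the "u" flag).
function escapeCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase();
  return cp <= 0xffff ? "\\u" + hex.padStart(4, "0") : "\\u{" + hex + "}";
}

// Join ranges into text that can be pasted inside [...] in a regex.
function toClassBody(ranges) {
  return ranges
    .map(([first, last]) =>
      first === last
        ? escapeCodePoint(first)
        : escapeCodePoint(first) + "-" + escapeCodePoint(last)
    )
    .join("");
}

console.log(toClassBody(rangesForCategory("Pc")));
```

An ECMAScript 3 engine only understands the four-digit `\uXXXX` form, so code points above U+FFFF would have to be handled separately there, for example as surrogate pairs.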

For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].
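
For a scanner written in modern JavaScript, the two sets above translate fairly directly into regular expressions with property escapes. This is a sketch under two assumptions: the engine supports the `u` flag with `\p{...}`, and it is acceptable that the engine's current Unicode version is used instead of the `age=5.0` restriction. The function names are hypothetical.

```javascript
// IdentifierStart: UnicodeLetter (Lu, Ll, Lt, Lm, Lo, Nl), "$" or "_".
// \p{L} covers Lu, Ll, Lt, Lm and Lo together.
const ID_START = /^[\p{L}\p{Nl}$_]$/u;

// IdentifierPart: IdentifierStart plus Mn, Mc, Nd and Pc.
const ID_PART = /^[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}$_]$/u;

function isIdentifierStart(codePoint) {
  return ID_START.test(String.fromCodePoint(codePoint));
}

function isIdentifierPart(codePoint) {
  return ID_PART.test(String.fromCodePoint(codePoint));
}

// Checks the IdentifierName shape only. Reserved words and
// \uXXXX escape sequences are deliberately not handled here.
function isIdentifierName(text) {
  const chars = Array.from(text); // splits by code point, not UTF-16 unit
  if (chars.length === 0) return false;
  if (!ID_START.test(chars[0])) return false;
  return chars.slice(1).every((ch) => ID_PART.test(ch));
}

console.log(isIdentifierStart(0x41));
console.log(isIdentifierName("caf\u00E9_1"));
console.log(isIdentifierName("1abc"));
```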

Unicode also has properties called ID_Start and ID_Continue, but they don't contain exactly the same characters as your specification.
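
A few sample characters illustrate the mismatch. The sketch below compares the set from the ECMA-262 3rd edition grammar with the `ID_Start` and `ID_Continue` properties as exposed by JavaScript property escapes. `compareStart` is a hypothetical helper, and the results depend on the engine's Unicode version.

```javascript
// IdentifierStart as defined by the grammar quoted in the question.
const SPEC_START = /^[\p{L}\p{Nl}$_]$/u;
// Unicode's derived properties.
const UNICODE_ID_START = /^\p{ID_Start}$/u;
const UNICODE_ID_CONTINUE = /^\p{ID_Continue}$/u;

function compareStart(ch) {
  return {
    spec: SPEC_START.test(ch),
    unicode: UNICODE_ID_START.test(ch),
  };
}

// "$" and "_" are allowed by the grammar but are not in ID_Start.
console.log(compareStart("$"));
console.log(compareStart("_"));
// U+2118 SCRIPT CAPITAL P is a symbol that ID_Start includes
// for compatibility, although it is not a letter.
console.log(compareStart("\u2118"));
// ID_Continue contains "_" (category Pc) but still not "$".
console.log(UNICODE_ID_CONTINUE.test("_"), UNICODE_ID_CONTINUE.test("$"));
```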

Here is also an overview of all Unicode character properties.
