How to get a code point pattern (for example, x-y\uxxxx\Uxxxxxxxx) for the Connector Punctuation (Pc) category, for scanning ECMAScript 3/JavaScript identifiers?
Original question
I need help verifying whether a character (code point) is valid in an ECMA-262 (3rd edition, section 7.6) identifier, for a lexical scanner.
Syntax quote
Identifier ::
    IdentifierName but not ReservedWord

IdentifierName ::
    IdentifierStart
    IdentifierName IdentifierPart

IdentifierStart ::
    UnicodeLetter
    $
    _
    \ UnicodeEscapeSequence    # no need to check this

IdentifierPart ::
    IdentifierStart
    UnicodeCombiningMark
    UnicodeDigit
    UnicodeConnectorPunctuation

UnicodeLetter ::
    any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::
    any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::
    any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::
    any character in the Unicode category “Connector punctuation (Pc)”
As you can see, it takes any character of certain categories.
I need the full set of these characters, so my first step was to look for "Connector punctuation" on this Unicode 5.0 chart, but I got 0 matches, so I believe I'm going about it the wrong way. Could someone help me?
Unicode offers this tool for determining sets of characters. It uses regular expressions with property-value pairs enclosed in [::]. For all characters in Unicode 5 you want [:age=5.0:]. The rest are "general categories" (gc). So, for example, [:age=5.0:]&[:gc=Lu:] will find all uppercase letters in Unicode 5 (gc=L will find all letters in general).
For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].
Unicode also has properties called ID_Start and ID_Continue, but they don't include the same characters as your specification.
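If the scanner can query general categories at run time, the two set expressions can also be mirrored as membership tests instead of a precomputed character list. The following is only a minimal sketch in Python, assuming the standard `unicodedata` module is acceptable; the names `is_identifier_start` and `is_identifier_part` are made up for illustration. Note the important difference from `[:age=5.0:]`: `unicodedata` uses whatever Unicode version the Python build ships with (see `unicodedata.unidata_version`), so code points assigned after Unicode 5.0 will also be accepted.

```python
import unicodedata

# General categories named in ECMA-262 3rd edition, section 7.6.
# UnicodeLetter: Lu, Ll, Lt, Lm, Lo, Nl
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# IdentifierPart adds combining marks, decimal digits and connector punctuation.
PART_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier_start(ch: str) -> bool:
    """True if the single character ch is a UnicodeLetter, '$' or '_'."""
    return ch in ("$", "_") or unicodedata.category(ch) in START_CATEGORIES


def is_identifier_part(ch: str) -> bool:
    """True if the single character ch may appear after the first character."""
    return ch in ("$", "_") or unicodedata.category(ch) in PART_CATEGORIES
```

Unicode escape sequences (`\uXXXX`) inside identifiers are deliberately not handled here, matching the "no need to check this" note in the question.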
Here is also an overview of all Unicode character properties.
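For the literal pattern asked for in the title, the code points of one general category can be collapsed into ranges and printed with `\uXXXX` / `\UXXXXXXXX` escapes. This is a rough sketch rather than a definitive implementation; `category_ranges`, `escape` and `range_pattern` are hypothetical helper names, and as with any use of Python's `unicodedata`, the result reflects the Unicode version bundled with the interpreter, not specifically Unicode 5.0.

```python
import sys
import unicodedata


def category_ranges(category):
    """Return inclusive (first, last) code point ranges for one general category."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp  # a new run of matching code points begins
        elif start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def escape(cp):
    """Format a code point as \\uXXXX (BMP) or \\UXXXXXXXX (beyond the BMP)."""
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


def range_pattern(ranges):
    """Join ranges into a character-class body such as \\u005F\\u203F-\\u2040."""
    return "".join(
        escape(a) if a == b else f"{escape(a)}-{escape(b)}" for a, b in ranges
    )


if __name__ == "__main__":
    # Connector punctuation, as needed for UnicodeConnectorPunctuation.
    print(range_pattern(category_ranges("Pc")))
```

The same helpers work for the other categories in the grammar (for example `"Nd"` or `"Mn"`); the output would still need to be adapted to the escape syntax of whatever language the scanner is written in.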