We're trying to use JavaCC to parse source code that is encoded in UTF-8 (the text is in Japanese). In our JavaCC grammar, we have a declaration like this:
< #LETTER:
[
"\u0024",
"\u0041"-"\u005a",
"\u005f",
"\u0061"-"\u007a",
"\u00c0"-"\u00d6",
"\u00d8"-"\u00f6",
"\u00f8"-"\u00ff",
"\u0100"-"\u1fff",
"\u3040"-"\u318f",
"\u3300"-"\u337f",
"\u3400"-"\u3d2d",
"\u4e00"-"\u9fff",
"\uf900"-"\ufaff"
]
>
When the parser encounters a string like "日建フェンス工業", it fails because of the 業 character. If I remove that character, it works as expected. The code point of 業 is "\u696d", and as you can see in the declaration, it should fall within the range "\u4e00"-"\u9fff".
Any suggestions?
PS: If we rewrote this grammar using ANTLR, what would it look like?
Thank you so much.
There is nothing wrong with your token fragment and nothing wrong with JavaCC. The problem lies elsewhere.
Here is a JavaCC specification made by copying and pasting your problem code into JavaCC.
And here is the output from the resulting Java program
As you can see, the generated tokenizer has no trouble seeing \u696d as a LETTER.
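To double-check the ranges independently of JavaCC, here is a minimal plain-Java sketch. It copies the ranges from the LETTER fragment in the question and tests every character of the sample string against them. The class name LetterRangeCheck and its helper methods are made up for this illustration; they are not part of JavaCC. The second half of main rests on an assumption that the question does not confirm: that the input might be decoded with a charset other than UTF-8 before it reaches the tokenizer. It only shows what such a mismatch would look like.

```java
import java.nio.charset.StandardCharsets;

public class LetterRangeCheck {
    // Inclusive code unit ranges copied from the LETTER fragment in the question.
    private static final int[][] RANGES = {
        {0x0024, 0x0024}, {0x0041, 0x005a}, {0x005f, 0x005f}, {0x0061, 0x007a},
        {0x00c0, 0x00d6}, {0x00d8, 0x00f6}, {0x00f8, 0x00ff}, {0x0100, 0x1fff},
        {0x3040, 0x318f}, {0x3300, 0x337f}, {0x3400, 0x3d2d}, {0x4e00, 0x9fff},
        {0xf900, 0xfaff}
    };

    // True if c falls inside any of the ranges above.
    static boolean isLetter(char c) {
        for (int[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Index of the first char that is not a letter, or -1 if every char is one.
    static int firstNonLetter(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isLetter(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The sample string from the question, written with Unicode escapes so
        // that the encoding of this source file cannot affect the result.
        String text = "\u65e5\u5efa\u30d5\u30a7\u30f3\u30b9\u5de5\u696d";
        System.out.println("U+696D is a letter: " + isLetter('\u696d'));
        System.out.println("first non-letter index: " + firstNonLetter(text));

        // Assumption for illustration only: the UTF-8 bytes get decoded with
        // the wrong charset somewhere before tokenizing.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("first non-letter index after wrong decoding: "
                + firstNonLetter(misdecoded));
    }
}
```

If the first index printed is -1 but your parser still rejects the input, it is worth inspecting the characters the tokenizer actually receives, for example by checking how the Reader or InputStream passed to the generated parser is constructed.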