How to handle the Unicode dot in a table-driven FSM?

Tools like "lex" and "flex", as far as I know, handle byte input only, that is, single-byte character sets such as ASCII. The FSM state transition tables generated by these tools stay small as a result, because the alphabet has only 256 possible symbols.

I am trying to figure out how to implement . (any character) or a negated class [^...] in a regular expression evaluator when the alphabet is Unicode, say UTF-8 encoded input. Are there any known techniques for keeping the transition tables manageable in this case? Giving every state an entry for every possible character (Unicode has more than a million code points) is of course unreasonable.
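To make the question concrete, here is one direction I am considering: keep the automaton byte-based and expand the dot into the byte sequences of well-formed UTF-8, so every state still has only 256 columns. This is only a minimal sketch under my own assumptions. The state names and the `matches_dot` helper are made up for illustration, the byte ranges are taken from the well-formed UTF-8 byte sequences table in the Unicode Standard, and unlike most regex dialects this dot also accepts a newline.

```python
# States of a byte-level DFA that accepts exactly one UTF-8 encoded character.
# All names here are hypothetical; this is not how lex or flex do it.
START, ACCEPT, TAIL1, TAIL2, TAIL3, E0, ED, F0, F4, DEAD = range(10)

# (state, first byte, last byte, next state)
# TAILn means "n more continuation bytes (0x80..0xBF) are required".
# E0, ED, F0 and F4 are lead bytes whose second byte has a narrower range,
# which rules out overlong forms, surrogates and values above U+10FFFF.
RANGES = [
    (START, 0x00, 0x7F, ACCEPT),
    (START, 0xC2, 0xDF, TAIL1),
    (START, 0xE0, 0xE0, E0),
    (START, 0xE1, 0xEC, TAIL2),
    (START, 0xED, 0xED, ED),
    (START, 0xEE, 0xEF, TAIL2),
    (START, 0xF0, 0xF0, F0),
    (START, 0xF1, 0xF3, TAIL3),
    (START, 0xF4, 0xF4, F4),
    (TAIL1, 0x80, 0xBF, ACCEPT),
    (TAIL2, 0x80, 0xBF, TAIL1),
    (TAIL3, 0x80, 0xBF, TAIL2),
    (E0, 0xA0, 0xBF, TAIL1),
    (ED, 0x80, 0x9F, TAIL1),
    (F0, 0x90, 0xBF, TAIL2),
    (F4, 0x80, 0x8F, TAIL2),
]


def build_table():
    """Build a 10 x 256 transition table; unlisted bytes go to DEAD."""
    table = [[DEAD] * 256 for _ in range(10)]
    for state, lo, hi, nxt in RANGES:
        for byte in range(lo, hi + 1):
            table[state][byte] = nxt
    return table


TABLE = build_table()


def matches_dot(data: bytes) -> bool:
    """Return True if data is exactly one well-formed UTF-8 character."""
    state = START
    for byte in data:
        state = TABLE[state][byte]
    return state == ACCEPT


if __name__ == "__main__":
    for sample in ("a", "é", "€", "😀", "ab"):
        print(repr(sample), matches_dot(sample.encode("utf-8")))
```

With this layout the dot costs ten states of 256 entries each instead of one row with an entry per code point. I assume a [^...] class could be handled the same way by splitting the allowed code point ranges into byte-sequence ranges, but I have not worked that out.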

Any ideas?

There are 0 answers