Using lex to tokenize without failing


I'm interested in using lex to tokenize my input string, but I do not want it to be possible to "fail". Instead, I want to have some type of DEFAULT or TEXT token, which would contain all the non-matching characters between recognized tokens.

Anyone have experience with something like this?


There are 2 answers

Answer by user207421:

To expand on @Chris Dodd's answer, the final rule in any lex script should be:

. return yytext[0];

and don't write single-character rules like "+" return PLUS;. Instead, use the special characters you recognize directly in the grammar, e.g. term: term '+' factor;.

This practice:

  • saves you a lot of lex rules
  • makes your grammar much more readable
  • returns illegal characters as tokens to the parser, where you can handle them however you like, or not at all, in which case you get the benefit of yacc's error recovery.
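A minimal lex specification following this advice might look like the sketch below. The NUMBER and IDENT token names, and the y.tab.h header generated by yacc, are assumptions for illustration; only the final catch-all rule is the point:

```
%{
#include "y.tab.h"   /* hypothetical yacc-generated token definitions */
%}
%%
[0-9]+          { return NUMBER; }
[A-Za-z_]+      { return IDENT; }
[ \t]           ;   /* skip whitespace */
\n              ;   /* "." does not match newline, so handle it explicitly */
.               { return yytext[0]; }   /* catch-all: return the character itself */
%%
```

With this scheme the grammar refers to punctuation directly, as in term: term '+' factor;, and any character no other rule matches still reaches the parser as a token rather than triggering a scanner failure.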
Answer by Chris Dodd:

Use the pattern . at the end of all your lex rules to match any character that isn't matched by any other rule. You may also need a \n rule to match newlines, since a newline is the only character that . doesn't match.

If you want to combine adjacent non-matching characters into a single token, that is harder, and is more easily done in the parser.
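To illustrate the idea outside of lex, here is a small Python sketch (not lex itself) that coalesces runs of unmatched characters between recognized tokens into a single TEXT token. The token names and patterns are assumptions for the example:

```python
import re

# Hypothetical token patterns; anything they don't match becomes a TEXT run.
TOKEN_SPECS = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("PLUS",   r"\+"),
]

def tokenize(s):
    # One master regex with a named group per token kind.
    master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPECS))
    tokens, pos = [], 0
    for m in master.finditer(s):
        if m.start() > pos:
            # Unmatched characters between tokens become one TEXT token.
            tokens.append(("TEXT", s[pos:m.start()]))
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(s):
        # Trailing unmatched characters also become a TEXT token.
        tokens.append(("TEXT", s[pos:]))
    return tokens
```

For example, tokenize("a + ??7") yields IDENT, TEXT, PLUS, TEXT, NUMBER, with the adjacent unmatched characters " ??" combined into a single TEXT token rather than failing.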