I know this question has been asked in more or less the same terms before, but none of the answers are working for me:
grammar Problem;
top: (IDENT | INT)*;
IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];
WS: [ \t\r\n]+ -> skip;
When I give this input to the parser:
0xFF ZZ123
followed by a newline and Ctrl-D, it gets parsed as:
(top 0xFF ZZ123)
Which is the intended behaviour.
However when I give this input to the parser:
0xFFZZ123
followed by a newline and Ctrl-D, it gets parsed as:
(top 0xFF ZZ123)
which is not at all intended. I would like this to trigger a lexer error, because the input should be treated as a misspelled HEX_INT.
If I disable whitespace skipping, I still get the same lexer behaviour (a single group of chars parsed as two tokens), however since WS tokens are now reported to the parser, I get the following error:
0XFFZZ123
line 1:9 extraneous input '\n' expecting {<EOF>, IDENT, INT}
(top 0XFF ZZ123 \n)
And in addition I cannot type space separated tokens anymore (normal since top does not mention WS):
0XFF ZZ123
line 1:4 extraneous input ' ' expecting {<EOF>, IDENT, INT}
(top 0XFF ZZ123)
I have tried to fix the grammar by disabling whitespace skipping and changing the top rule to:
top: WS* (IDENT | INT) (WS+ (IDENT|INT))* WS*;
However if I enter the following stream to the parser,
0xFF ZZ123 0XFFZZ123
I get this error:
line 1:20 extraneous input 'ZZ123' expecting {<EOF>, WS}
(top 0xFF ZZ123 0xFF ZZ123 \n)
Here you can still see that the last input token has been split into 0xFF and ZZ123, whereas I would really like to trigger a lexing error instead of having to handle whitespace explicitly in the parser.
So what combination of tricks do I have to use to obtain the desired behaviour?
You can write a token that accepts erroneous tokens like 0XFFZZ123, and place it just before WS. For example:
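A minimal sketch of such a rule, assuming the error token should swallow any run of non-whitespace characters. The character set of ERROR_TOKEN is an assumption made for illustration; the remaining rules are copied from the question unchanged:

```antlr
grammar Problem;

top: (IDENT | INT)*;

IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];

// Assumed catch-all: any maximal run of non-whitespace characters.
// It is defined after IDENT and INT, so it only wins when it matches
// strictly more characters than they do.
ERROR_TOKEN: ~[ \t\r\n]+;

WS: [ \t\r\n]+ -> skip;
```

Because ERROR_TOKEN stops at whitespace, it can never match more than a well-formed IDENT or INT that is followed by a space, a newline, or the end of input.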
Here is what happens. If you input 0xFF ZZ123, ERROR_TOKEN matches exactly the same text as INT and IDENT, so the tie is broken by position: INT and IDENT win because they are defined first. If you input 0XFFZZ123, ERROR_TOKEN wins because it matches the longest text (length has priority over position). Since ERROR_TOKEN is not part of the rule "top", the parser reports an error for it.
I hope this solves the problem.