antlr4: perplexed about whitespace handling


I know this question has been asked in more or less the same terms before, but none of the answers are working for me:

grammar Problem;
top: (IDENT | INT)*;
IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];
WS: [ \t\r\n]+ -> skip;

When I give this input to the parser:

0xFF ZZ123

followed by a newline and Ctrl-D, it gets parsed as:

(top 0xFF ZZ123)

Which is the intended behaviour.

However when I give this input to the parser:

0xFFZZ123

followed by a newline and Ctrl-D, it gets parsed as:

(top 0xFF ZZ123)

which is not what I want at all. I would like this input to trigger a lexer error, since it is really a misspelled HEX_INT.

If I disable whitespace skipping, I still get the same lexer behaviour (a single group of chars parsed as two tokens), however since WS tokens are now reported to the parser, I get the following error:

0XFFZZ123
line 1:9 extraneous input '\n' expecting {<EOF>, IDENT, INT}
(top 0XFF ZZ123 \n)

And in addition I cannot type space separated tokens anymore (normal since top does not mention WS):

0XFF ZZ123
line 1:4 extraneous input ' ' expecting {<EOF>, IDENT, INT}
(top 0XFF   ZZ123)

I have tried to fix the grammar by disabling whitespace skipping and changing the top rule to:

top: WS* (IDENT | INT) (WS+ (IDENT|INT))* WS*;

However if I enter the following stream to the parser,

0xFF ZZ123 0XFFZZ123

I get this error:

line 1:20 extraneous input 'ZZ123' expecting {<EOF>, WS}
(top     0xFF   ZZ123     0xFF ZZ123 \n)

Here you can still see that the last input token has been split into 0xFF and ZZ123, whereas I would really like to trigger a lexing error instead of having to handle whitespace explicitly in the parser.

So what combination of tricks do I have to use to obtain the desired behaviour?

There is 1 answer

Andy

You can add a catch-all lexer rule that matches erroneous input such as 0XFFZZ123, and place it after IDENT and INT (for example, just before WS):

grammar SandBox;
top: (IDENT | INT)*;
IDENT: (ALPHA|'_') (ALPHA|DIGIT|'_')*;
INT: DEC_INT | HEX_INT;
DEC_INT: (ZERO | (NZERO_DIGIT DIGIT*));
HEX_INT: ZERO X HEX+;
ZERO: '0';
NZERO_DIGIT: '1'..'9';
DIGIT: '0'..'9';
ALPHA: [a-zA-Z];
HEX: [0-9a-fA-F];
X: [xX];

ERROR_TOKEN: (~[ \t\r\n])+;

WS: [ \t\r\n]+ -> skip;

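What happens is the following. If you input 0xFF ZZ123, then INT and IDENT win because of their position: ERROR_TOKEN matches exactly the same text, and when two rules match the same number of characters, the rule defined first wins. If you input 0XFFZZ123, then ERROR_TOKEN wins because it matches the longest text (length has priority over position). Since ERROR_TOKEN is not used in the "top" rule, the parser reports an error when it receives that token. Strictly speaking this is a parser error rather than a lexer error, but the malformed input is rejected as a whole instead of being silently split in two.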
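To make the "longest match first, then rule order" behaviour concrete, here is a minimal Python sketch that imitates it with the standard `re` module. It is only a simulation, not ANTLR itself: the `RULES` table and the `tokenize` function are hypothetical names made up for this illustration, and the helper rules (DIGIT, HEX, and so on) are folded into the regular expressions as if they were fragments.

```python
import re

# Lexer rules in grammar order; on a tie the earlier rule wins.
# Assumption: helper rules (ALPHA, DIGIT, HEX, ...) are inlined here.
RULES = [
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    # The hex alternative is listed first because Python's "|" is ordered.
    ("INT", re.compile(r"0[xX][0-9a-fA-F]+|0|[1-9][0-9]*")),
    ("ERROR_TOKEN", re.compile(r"[^ \t\r\n]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    """Return (rule_name, lexeme) pairs, skipping WS."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only: equal lengths keep the earlier rule.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        # ERROR_TOKEN and WS together cover every character,
        # so some rule always matches at least one character here.
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("0xFF ZZ123"))  # INT and IDENT win the ties
print(tokenize("0xFFZZ123"))   # ERROR_TOKEN wins on length
```

With the space, each chunk is matched equally well by ERROR_TOKEN and by INT or IDENT, so the earlier rule is kept. Without the space, only ERROR_TOKEN can consume all nine characters, so it is chosen even though it is defined later.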

I hope this solves the problem.