I would like to use Lark to generate a standalone parser for a small rational language of mine. This needs to be a LALR(1) parser.
It should accept the following input:
(lorem) consectetur adipiscing elit
(lorem) [ipsum dolor sit amet] consectetur adipiscing elit
My best guess for the grammar (note: I am a complete beginner in parsing):
start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail "\n"
head : /\w+/
option : TEXT
tail: TEXT
TEXT : /[^\[\]\n]+/
%ignore /[ \t]+/
This works with Lark's Earley parser, but fails with LALR(1) (you can test that on https://www.lark-parser.org/ide/).
More precisely, LALR(1) accepts the first lorem-line, but fails on the second one with:
(lorem) [ipsum dolor sit amet] consectetur adipi
        ^
Expected one of:
* NEWLINE
Previous tokens: Token('TEXT', ' ')
(Obviously, if I remove the ? in the definition of line, it fails on the first one and succeeds on the second one.)
Ok, let's replace the definition of TEXT by:
TEXT : /[^ \[][^\[\]\n]*/
Now it gives the expected result, both with LALR(1) and Earley. I thought specifying %ignore /[ \t]+/ would have made this unnecessary.
Is there a better way to write this grammar?
You have an ambiguity between the TEXT terminal and the %ignore terminal. Lark does not necessarily guarantee how this behaves. However, in general it will prefer using the terminal that is not ignored to actually make progress while parsing.
You need to make sure this ambiguity does not exist, which you are doing with your changed definition of TEXT.
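To see the overlap concretely, here is a small sketch that uses only Python's re module, not Lark itself. It checks what each pattern from the question matches right after the ")" in the second input line. The names TEXT_ORIGINAL, TEXT_FIXED, IGNORED and matched_text are made up for this illustration, and the snippet only compares the regular expressions; it does not reproduce how Lark's lexer picks between them.

```python
import re

# The two TEXT definitions from the question, plus the %ignore pattern.
# (Hypothetical names, chosen for this illustration only.)
TEXT_ORIGINAL = re.compile(r"[^\[\]\n]+")
TEXT_FIXED = re.compile(r"[^ \[][^\[\]\n]*")
IGNORED = re.compile(r"[ \t]+")

line = "(lorem) [ipsum dolor sit amet] consectetur adipiscing elit\n"
pos = line.index(")") + 1  # the character at this position is a space


def matched_text(pattern, text, start):
    """Return what `pattern` matches at `start`, or None if it does not match there."""
    m = pattern.match(text, start)
    return m.group(0) if m else None


# Both the original TEXT and the ignored pattern can consume the space,
# which is the ambiguity described above.
overlap_original = matched_text(TEXT_ORIGINAL, line, pos)
ignored = matched_text(IGNORED, line, pos)

# The fixed TEXT cannot start with a space, so only the ignored pattern applies.
overlap_fixed = matched_text(TEXT_FIXED, line, pos)

# The fixed TEXT still accepts a leading tab, which the ignored pattern also matches.
tab_fixed = matched_text(TEXT_FIXED, "\tfoo", 0)

print(repr(overlap_original))  # ' '
print(repr(ignored))           # ' '
print(repr(overlap_fixed))     # None
print(repr(tab_fixed))         # '\tfoo'
```

The single-space match of the original TEXT is the same Token('TEXT', ' ') that shows up in the error message. The last check suggests that the changed TEXT only removes the overlap for spaces; if tabs can appear in the input, excluding \t from the first character class as well might be worth considering, though I have not tested that in Lark.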