Accept an optional substring with Lark's LALR(1) parser

Question

Asked by Aristide on 18 May 2023 at 16:46

I would like to use Lark to generate a standalone parser for a small regular language of mine. This needs to be an LALR(1) parser.

It should accept the following input:

(lorem) consectetur adipiscing elit

(lorem) [ipsum dolor sit amet] consectetur adipiscing elit

My best guess for the grammar (note: I am a complete beginner in parsing):

start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail "\n"
head : /\w+/
option : TEXT
tail: TEXT
TEXT : /[^\[\]\n]+/

%ignore /[ \t]+/

This works with Lark's Earley parser, but fails with LALR(1) (you can test that on https://www.lark-parser.org/ide/).

More precisely, LALR(1) accepts the first lorem-line, but fails on the second one with:

(lorem) [ipsum dolor sit amet] consectetur adipi
        ^
Expected one of: 
    * NEWLINE

Previous tokens: Token('TEXT', ' ')

(Obviously, if I remove the ? from the definition of line, it fails on the first one and succeeds on the second one.)

OK, let's replace the definition of TEXT with:

TEXT : /[^ \[][^\[\]\n]*/

Now it gives the expected result with both LALR(1) and Earley. I thought specifying %ignore /[ \t]+/ would have made this change unnecessary.

Is there a better way to write this grammar?

There are 2 answers

**MegaIng** · Answer 1 · 2023-05-18T19:19:43+00:00

You have an ambiguity between the TEXT terminal and the %ignore terminal: both can match the space that follows ")". Lark does not guarantee how this is resolved. However, in general it will prefer the terminal that is not ignored, so that parsing actually makes progress.

You need to make sure this ambiguity does not exist, which is what your changed definition of TEXT does.
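
The overlap can be checked outside Lark with plain Python regular expressions. This is only a minimal sketch: the patterns are copied from the question, the helper name matching_terminals is made up for illustration, and it does not reproduce Lark's lexer. It just lists which patterns can match at the space right after ")".

```python
import re

# Patterns copied from the question; the constant names are illustrative only.
IGNORE = r"[ \t]+"
TEXT_ORIGINAL = r"[^\[\]\n]+"
TEXT_REVISED = r"[^ \[][^\[\]\n]*"


def matching_terminals(patterns, text, pos):
    """Return the names of all patterns that match `text` starting at `pos`."""
    return [name for name, rx in patterns.items() if re.compile(rx).match(text, pos)]


line = "(lorem) [ipsum dolor sit amet] consectetur adipiscing elit\n"
pos = line.index(")") + 1  # index of the space right after ")"

before = matching_terminals({"IGNORE": IGNORE, "TEXT": TEXT_ORIGINAL}, line, pos)
after = matching_terminals({"IGNORE": IGNORE, "TEXT": TEXT_REVISED}, line, pos)

print(before)  # ['IGNORE', 'TEXT']  -> two candidates for the same space
print(after)   # ['IGNORE']          -> only the ignored terminal can match
```

With the original TEXT, the match at that position is the single space, which is consistent with the Token('TEXT', ' ') reported in the error message.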

**Aristide** · Answer 2 · 2023-05-19T10:47:55+00:00

Answering my own question.

For some reason, ignoring /[ \t]+/ is not equivalent to ignoring (" "|/\t/)+ (which is defined as WS_INLINE in common.lark).

Replacing the former expression with the latter in my first version was enough to make it accept the input. It also makes it possible to write a version that I think is slightly better (no more explicit negated class):

start : (blank_lines | line)*
blank_lines : /^([ \t]*\n)+/m
line : "(" head ")" ("[" option "]")? tail _NL
head : /\w+/
option : /.+(?=\])/
tail : /(?!\[).+/

%import common.NEWLINE -> _NL
%import common.WS_INLINE
%ignore WS_INLINE
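
To see what the two lookaround patterns do on their own, here is a small sketch using Python's re module, assuming Lark's default regex engine behaves the same way. The sample strings are made up for illustration, and this only tests the regexes in isolation, not the grammar as a whole.

```python
import re

# Regexes copied from the `option` and `tail` rules above.
OPTION = re.compile(r".+(?=\])")
TAIL = re.compile(r"(?!\[).+")

# `tail` cannot start on "[", so it does not swallow the optional block.
tail_at_bracket = TAIL.match("[ipsum dolor sit amet] consectetur")
tail_at_word = TAIL.match("consectetur adipiscing elit")
print(tail_at_bracket)       # None
print(tail_at_word.group())  # consectetur adipiscing elit

# `option` stops just before "]" without consuming it.
option_simple = OPTION.match("ipsum dolor sit amet] consectetur").group()
print(option_simple)         # ipsum dolor sit amet

# Caveat 1: `.+` is greedy, so the lookahead binds to the LAST "]" on the line.
option_greedy = OPTION.match("ipsum] dolor] sit").group()
print(option_greedy)         # ipsum] dolor

# Caveat 2: the negative lookahead only inspects the first character, so the
# pattern still matches when it is started on a space that precedes "[".
tail_after_space = TAIL.match(" [ipsum dolor sit amet] consectetur")
print(tail_after_space is not None)  # True
```

The second caveat suggests that this version still relies on the lexer consuming the whitespace before tail is tried, and the first one that a "]" inside the tail text would change where option ends.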
