Why is my example to parse bold text with Lark not working?

Question

Why is my example to parse bold text with Lark not working?

50 views Asked by Jonas At 04 January 2024 at 15:58

I defined the following grammar:

from lark import Lark

grammar = r'''
    start: ( bold_text | text)
    bold_text.2: "**" text "**"
    text.1: /.+/
'''

parser = Lark(grammar, start = "start", debug=True, ambiguity="explicit")

text = r"""**test**"""

parsed = parser.parse(text)
print(parsed.pretty())

For "**test**", this parses "text" and not "bold_text" as expected. Lark does not even show an ambiguity. What am I doing wrong? Any help would be greatly appreciated. Thank you.

Original Q&A

There are 2 answers

EarthAndMoon On 24 January 2024 at 18:48

from lark import Lark, Transformer, v_args

grammar = r"""
start: (bold_text | text_ )

text_: text
bold_text: "**" text "**"
text.1: (/./)+

"""

parser = Lark(grammar, start="start", debug=True)

text = r"""**test**"""

@v_args(inline=True)
class important(Transformer):
    def text(self,*chars):
        return "".join(chars)

parsed = important().transform(parser.parse(text))
print(parsed.pretty())

Returns:

start
  bold_text     test

Maybe this is not the best, but it should work.

**Erez** · Accepted Answer · 2024-01-25T03:43:04+00:00

Lark is using greedy regular expressions as the infrastructure of its lexer, as is common for parsers.

The odd behavior you're seeing happens because when bold_text matches the text rule, its regexp also catches the two stars at the end, because they are matched by .

A way to fix your grammar would be to do this instead:

start: bold_text | text
bold_text.2: "**" text "**"
text.1: /[^*]+/

That does add some difficulties, since there are situations when you might want to allow * as part of the text. But there are tricks to work around it.

You can even use this solution with LALR, which will run much faster.

An "easier" solution would be to enable the complete-Earley feature of Lark, which isn't affected by the greediness of regexps (but it also runs much slower):

grammar = r'''
    start: ( bold_text | text)
    bold_text.2: "**" text "**"
    text.1: /.+/
'''

parser = Lark(grammar, start = "start", debug=True, ambiguity="explicit", lexer="dynamic_complete")

And the result is -

_ambig
  start
    bold_text
      text      test
  start
    text        **test**

But using complete-Earley for very large texts will prove to performance-intensive, especially if you use catch-all regexps like .+.

Having said all that, Lark is better suited for parsing structured languages like Python than for free-form text like markup. You might want to consider writing a specialized parser (aka "hand-written parser") for this task.

TechQA.

Why is my example to parse bold text with Lark not working?

There are 2 answers

Related Questions in LARK-PARSER

Popular Questions

Trending Questions