Python lrparsing module; cannot parse simple recursive grammar

540 views Asked by At
expr = Ref('expr')
block = '{' + Repeat(expr) + '}'
expr = block | Token(re='[0-9]')
START = expr

Here's a grammar using the lrparsing module for Python. The module reports no conflicts in the grammar.

It fails to parse the string {{0}} with the error lrparsing.ParseError: line 1 column 5: Got '}' when expecting __end_of_input__ while trying to match block in state 11

The stack state step by step is:

shift  '{'; ItemSet:5='{'
shift  '{'; ItemSet:5='{' ItemSet:5='{'
shift  /[0-9]/; ItemSet:4=/[0-9]/ ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:4=/[0-9]/ -- ItemSet:7=expr ItemSet:5='{' ItemSet:5='{'
reduce '}'; ItemSet:7=expr -- ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'
shift  '}'; ItemSet:11='}' ItemSet:9=expr ItemSet:5='{' ItemSet:5='{'

Which afaik means it's shifting {{0, then upon seeing } reducing 0 to expr, then reducing on } again without having shifted it first, which confuses the bajeezus out of me.

Is this a bug, or am I doing something infinitely simple and stupid?

If it's my grammar, how would I refactor it to satisfy my eternally steamy and passionate desires? If it's a bug, could someone direct me to a python module that has syntax most similar to lrparsing that does work?

EDIT: Refactoring as:

blocks = Ref('blocks')
block = Ref('block')
expr = Ref('expr')
blocks = blocks + block | THIS*0 # THIS*0 is the idiomatic way of getting the empty string
block = '{' + expr + '}'
expr = blocks | Token(re='[0-9]')
START = expr

Allows for proper parsing. The question for me now, is... why? I feel like lrparsing would've complained to me earlier about any parsing problems... Does Repeat simply not work the way I expect it to?

3

There are 3 answers

2
mknecht On BEST ANSWER

lrparsing has a bug; it does not consider the recursive repeats correctly.

Your actual problem can be solved by simple recursion, as you did in your extended edit, though with a bit less clutter.

block = Ref('block')
block = '{' + block + '}' | Token(re='[0-9]')
START = block

Note also, that your original grammar would have allowed input such as {{0{1}}}. (The reason being that the repeatable part can be expanded to either simple numbers or expr again.) Considering your second grammar, you probably didn't want that anyways.

I did get some work done with pyparsing, but the syntax is considerably different. Analogous example:

from pyparsing import Forward, Literal, nums, oneOf, Word

l = lambda c: Literal(c).suppress()
block = Forward()
block << (Word(nums, exact=1) ^ l('{') + block + l('}'))
print(block.parseString("{{0}}"))

Output:

['0']

Hope that helps.

0
Russell Stuart On

Lrparsing author here. As Serge said it was a bug, and is fixed in 1.0.8. This only happened because Serge reported it on Source Forge's but tracker - otherwise I would not have known. Thank you Serge.

The comments about it possibly being a bug in Repeat() hints at not understanding what lrparsing does. Lrparsing is a rather complex beast. It lets you enter grammar in a way I hope is natural for a Python programmer. It then compiles into something a LR(1) parser generator can understand, which is a series of productions. Then it generates an LR(1) parsing table from those productions. And finally it feeds your input language and the parsing table to an LR(1) parser to generate the parse tree. For what it's worth, the bug was in the part that generates the parsing table.

Debugging such a series of transformations would be near impossible for me I if I could not see what each step produces. Accordingly lrparsing has a repr_xxxx() function that displays the output of each step. The first transformation is parsing your grammar. The result is displayed by repr_grammar():

<G> = START + __end_of_input__
START = expr
block = '{' + expr * () + '}'
expr = block | /[0-9]/

Which looks very similar to the original grammar presented in the question. The next step is to compile those rules in productions, which is what an LR(1) parser generator can understand. These are printed by repr_productions():

<G> = START __end_of_input__
START = expr
block = '{' '}'
block = '{' block.Sequence.Repeat '}'
block.Sequence.Repeat = expr
block.Sequence.Repeat = block.Sequence.Repeat expr
expr = block
expr = /[0-9]/

The block.Sequence.Repeat is a new Nonterminal lrparsing introduced in order to handle the Repeat(). Those productions look like a faithful representation of the original grammar to me.

Lrparsing goes out of it's way to hide the nonterminals it introduces like block.Sequence.Repeat. For example they won't appear in the output parse tree. That means there is no need for an lrparsing user to care about them - except for 2 cases. Those 2 cases are error recovery and trying to understand the log output of the parse engine. The former is a complex technique most won't attempt. But some here looked at the latter in order to try and understand what lrparsing was doing. The log won't make much sense unless you can see the productions the LR(1) parser is trying to recognise. But if you had seen them, you would have known there wasn't a bug in Repeat().

You can also dump the generated LR(1) parse table. If you really want to understand how a LR(1) parser works, that is what you should be trying to grok. Unless you happen to find parsing a deeply interesting topic I don't recommend it.

1
Serge Monkewitz On

I reported your issue to the lrparsing author when I ran across this page - it was indeed a bug, and it has been fixed in version 1.0.8. In case it's useful in the future, the lrparsing bug tracker can be found here.