Parsing a CFG with alternatives

502 views Asked by At

I have a fairly simple language represented as a CFG.

S → A z
A → A y A
  | A x A
  | A w
  | v

Since there's left-recursion, a recursive descent parser isn't going to cut it. However, I also need to find every possible interpretation: given v x v y v z, I need my parser to find both (v x (v y v)) z and ((v x v) y v) z.

What options do I have? Shift-reduce with added backtracking to find all possibilities seems good, but I've heard that adding backtracking to a shift-reduce parser can give it exponential time complexity. This CFG is small enough that it shouldn't matter, but I'm going to need to scale this up to significantly larger grammars (with thousands of terminals).

2

There are 2 answers

0
rici On

The Earley algorithm (and its variants, such as GLR) can be implemented to create a parser which works in cubic time. Since there can be an exponential number of possible parses, that complexity is only to build a "parse forest", which is a DAG containing all possible parses. Actually enumerating the parses will take time proportional to the number of possibilities.

Note that when we talk about time complexity here, we're talking in terms of the length of the input string, not the size of the grammar. The size if the grammar has an impact, of course -- usually linear, but it depends on how you measure the size -- but the assumption is that a parser has been built for a particular grammar (which might require preprocessing dependent on the grammar size.)

The Wikipedia article linked above has a list of implementations in various languages, some of them intended for production and others simply to demonstrate the algorithm. Note that the GLR parser produced by bison is not cubic; in pathological cases, it can be exponential. Also, it does not build a parse tree (or forest); that is left to the user, and naive algorithms which don't share storage could require exponential space and time. (Nonetheless, it is quite usable for real-world grammars.)

0
Erez On

There are several classes of context-free grammars. There are deterministic grammars, which can be parsed with a shift-reduce parser. There are non-deterministic grammars, for which the solution is often a dynamic lookahead, or backtracking. And there are ambiguous grammars, like the one you described, which are best solved by algorithms especially designed with ambiguity in mind.

Two such algorithms, are Earley (as @rici pointed out), and CYK. They are designed to return all the possible derivations, as you require, and can be used to create a SPPF (Shared Packed Parse Forest), which is a very efficient structure for highly ambiguous grammars (whether yours will fit this description or not, I cannot say of course).

If you want to experiment with it, you can try my parsing library for python Lark, which implements both Earley and CYK, and can give you a list of all the possible derivations for your input (However, SPPF support is still work-in-progress).