I know and use bison/yacc. But in the parsing world, there's a lot of buzz around packrat parsing.
What is it? Is it worth studying?
Pyparsing is a pure-Python parsing library that supports packrat parsing, so you can see how it is implemented. Pyparsing uses a memoizing technique to save previous parse attempts for a particular grammar expression at a particular location in the input text. If the grammar involves retrying that same expression at that location, it skips the expensive parsing logic and just returns the results or exception from the memoizing cache.
There is more info here at the FAQ page of the pyparsing wiki, which also includes links back to Bryan Ford's original thesis on packrat parsing.
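To make the caching idea concrete, here is a minimal standard-library sketch of the technique described above. It is not pyparsing's actual code: the names memoize_expression, ParseError and integer are made up for illustration, and the cache simply stores either the result or the exception for each (text, location) pair.

```python
class ParseError(Exception):
    """Raised when an expression does not match at a location (hypothetical)."""


def memoize_expression(parse_fn):
    """Cache the outcome (result or exception) of parse_fn per (text, loc)."""
    cache = {}

    def wrapper(text, loc):
        key = (text, loc)
        if key not in cache:
            try:
                cache[key] = (True, parse_fn(text, loc))
            except ParseError as exc:
                cache[key] = (False, exc)
        ok, value = cache[key]
        if not ok:
            raise value          # replay the cached failure
        return value             # replay the cached result

    return wrapper


calls = []                       # records every real parse attempt


@memoize_expression
def integer(text, loc):
    """Match one or more ASCII digits starting at loc."""
    calls.append(loc)
    end = loc
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == loc:
        raise ParseError(f"expected a digit at position {loc}")
    return text[loc:end], end


first = integer("123+4", 0)
second = integer("123+4", 0)     # answered from the cache, no new parse attempt
```

After both calls, calls holds a single entry: the second request at the same location never reached the parsing logic.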
At a high level:
Packrat parsers make use of parsing expression grammars (PEGs) rather than traditional context-free grammars (CFGs).
Because they use PEGs rather than CFGs, packrat parsers are typically easier to set up and maintain than traditional LR parsers.
Due to how they use memoization, packrat parsers typically use more memory at runtime than "classical" parsers like LALR(1) and LR(1) parsers.
Like classical LR parsers, packrat parsers run in linear time.
In that sense, you can think of a packrat parser as a simplicity/memory tradeoff with LR-family parsers. Packrat parsers require less theoretical understanding of the parser's inner workings than LR-family parsers, but use more resources at runtime. If you're in an environment where memory is plentiful and you just want to throw a simple parser together, packrat parsing might be a good choice. If you're on a memory-constrained system or want to get maximum performance, it's probably worth investing in an LR-family parser.
The rest of this answer gives a slightly more detailed overview of packrat parsers and PEGs.
Many traditional parsers (and many modern parsers) make use of context-free grammars. A context-free grammar consists of a series of rules like the ones shown here:
E -> E * E | E + E | (E) | N
N -> D | DN
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
For example, the top line says that the nonterminal E can be replaced either with E * E, or E + E, or (E), or with N. The second line says that N can be replaced with either D or DN. The last line says that D can be replaced with any single digit.
If you start with the string E and follow the rules from the above grammar, you can generate any mathematical expression using +, *, parentheses, and numbers written as strings of digits.
Context-free grammars are a compact way to represent a collection of strings. They have a rich and well-understood theory. However, they have two main drawbacks. The first one is that, by itself, a CFG defines a collection of strings, but doesn't tell you how to check whether a particular string is generated by the grammar. This means that whether a particular CFG will lend itself to a nice parser depends on the particulars of how the parser works, so the grammar author may need to familiarize themselves with the internal workings of their parser generator to understand what restrictions are placed on the sorts of grammar structures that can arise. For example, LL(1) parsers don't allow for left-recursion and require left-factoring, while LALR(1) parsers require some understanding of the parsing algorithm to eliminate shift/reduce and reduce/reduce conflicts.
The second, larger problem is that grammars can be ambiguous. For example, the above grammar generates the string 2 + 3 * 4, but does so in two ways. In one way, we essentially get the grouping 2 + (3 * 4), which is what's intended. The other one gives us (2 + 3) * 4, which is not what's meant. This means that grammar authors either need to ensure that the grammar is unambiguous or need to introduce precedence declarations auxiliary to the grammar to tell the parser how to resolve the conflicts. This can be a bit of a hassle.
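Here is a small sketch of that ambiguity. It brute-forces every way the rules E -> E * E | E + E | N can group a token list; the function name groupings_of is made up, and parentheses in the input are left out to keep it short.

```python
def groupings_of(tokens):
    """Return every grouping allowed by E -> E * E | E + E | N (no parentheses)."""
    if len(tokens) == 1:
        # Base case: a lone token must be a number.
        return [tokens[0]] if tokens[0].isdigit() else []
    results = []
    for i, tok in enumerate(tokens):
        if tok in ("+", "*"):
            # Treat this operator as the top-level one and recurse on both sides.
            for left in groupings_of(tokens[:i]):
                for right in groupings_of(tokens[i + 1:]):
                    results.append(f"({left} {tok} {right})")
    return results


groupings = groupings_of(["2", "+", "3", "*", "4"])
print(groupings)  # ['(2 + (3 * 4))', '((2 + 3) * 4)']
```

Two different groupings come back for the same string, and they evaluate to 14 and 20 respectively, which is exactly why the grammar needs to be rewritten or given precedence declarations.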
Packrat parsers make use of an alternative to context-free grammars called parsing expression grammars (PEGs). Parsing expression grammars in some ways resemble CFGs - they describe a collection of strings by saying how to assemble those strings from (potentially recursive) smaller parts. In other ways, they're like regular expressions: they involve simpler statements combined together by a small collection of operations that describe larger structures.
For example, here's a simple PEG for the same sort of arithmetic expressions given above:
E -> F + E / F
F -> T * F / T
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
To see what this says, let's look at the first line. Like a CFG, this line expresses a choice between two options: you can either replace E with F + E or with F. However, unlike a regular CFG, there is a specific ordering to these choices. Specifically, this PEG can be read as "first, try replacing E with F + E. If that works, great! And if that doesn't work, try replacing E with F. And if that works, great! And otherwise, we tried everything and it didn't work, so give up."
In that sense, PEGs directly encode into the grammar structure itself how the parsing is to be done. Whereas a CFG more abstractly says "an E may be replaced with any of the following," a PEG specifically says "to parse an E, first try this, then this, then this, etc." As a result, for any given string that a PEG can parse, the PEG can parse it exactly one way, since it stops trying options once the first parse is found.
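As a rough illustration of "first try this, then this," the arithmetic PEG above can be transcribed almost line for line into plain recursive functions. This is a sketch under a few assumptions: the function names are made up, input has no whitespace, T's digit alternative is read as one or more digits, and there is no memoization yet.

```python
def parse_e(text, pos):
    """E -> F + E / F. Returns (tree, end) or None."""
    # First choice: F + E.
    f = parse_f(text, pos)
    if f and text.startswith("+", f[1]):
        e = parse_e(text, f[1] + 1)
        if e:
            return ("+", f[0], e[0]), e[1]
    # Only if the first choice failed: F on its own.
    return parse_f(text, pos)


def parse_f(text, pos):
    """F -> T * F / T. Returns (tree, end) or None."""
    t = parse_t(text, pos)
    if t and text.startswith("*", t[1]):
        f = parse_f(text, t[1] + 1)
        if f:
            return ("*", t[0], f[0]), f[1]
    return parse_t(text, pos)


def parse_t(text, pos):
    """T -> one or more digits / (E). Returns (tree, end) or None."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end > pos:
        return int(text[pos:end]), end
    if text.startswith("(", pos):
        e = parse_e(text, pos + 1)
        if e and text.startswith(")", e[1]):
            return e[0], e[1] + 1
    return None


tree, end = parse_e("2+3*4", 0)
print(tree)  # ('+', 2, ('*', 3, 4))
```

There is only ever one result: the parser commits to the first alternative that succeeds, so 2+3*4 comes out grouped as 2 + (3 * 4) and no second tree is considered.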
PEGs, like CFGs, can take some time to get the hang of. For example, CFGs in the abstract - and many CFG parsing techniques - have no problem with left recursion. For example, this CFG can be parsed with an LR(1) parser:
E -> E + F | F
F -> F * T | T
T -> (E) | N
N -> ND | D
D -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
However, the following PEG can't be parsed by a packrat parser (though later improvements to PEG parsing can correct this):
E -> E + F / F
F -> F * T / T
T -> (E) / D+
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Let's take a look at that first line. The first line says "to parse an E, first try reading an E, then a +, then an F. And if that fails, try reading an F." So how would it then go about trying out that first option? The first step would be to try parsing an E, which would work by first trying to parse an E, and now we're caught in an infinite loop. Oops. This is called left recursion and also shows up in CFGs when working with LL-family parsers.
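You can watch this happen with a literal transcription of that first rule. This is only a sketch: parse_f here is a made-up stand-in that matches a single digit, which is enough to show the problem.

```python
def parse_f(text, pos):
    """Stand-in for F: match a single digit, return the end position or None."""
    if pos < len(text) and text[pos] in "0123456789":
        return pos + 1
    return None


def parse_e(text, pos):
    """E -> E + F / F, transcribed literally."""
    # First choice: E + F. Its first step is E at the *same* position,
    # so no input is consumed before we recurse.
    end = parse_e(text, pos)
    if end is not None and text.startswith("+", end):
        after = parse_f(text, end + 1)
        if after is not None:
            return after
    # Second choice: F. Never reached.
    return parse_f(text, pos)


try:
    parse_e("1+2", 0)
    outcome = "parsed"
except RecursionError:
    outcome = "infinite recursion"

print(outcome)  # infinite recursion
```

Python cuts the loop short with a RecursionError; in a language without a recursion limit this would just overflow the stack.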
Another issue that comes up when designing PEGs is the need to get the ordered choices right. If you're coming from the Land of Context-Free Grammars, where choices are unordered, it's really easy to accidentally mess up a PEG. For example, consider this PEG:
E -> F / F + E
F -> T / T * F
T -> D+ / (E)
D -> 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
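Now, what happens if you try to parse the string 2 * 3 + 4? Well:
We try parsing an E, which first tries parsing an F.
We try parsing an F, which first tries parsing a T.
We try parsing a T, which first tries reading a run of digits. This succeeds in reading 2.
That means we've successfully read a T, so we've successfully read an F, so we've successfully read an E.
But the parse stopped after the 2, leaving * 3 + 4 unread, so parsing the full string fails.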
The issue here is that we first tried parsing F before F + E, and similarly first tried parsing T before parsing T * F. As a result, we essentially bit off less than we could check, because we tried reading a shorter expression before a longer one.
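Here is a sketch that runs both orderings side by side. The tiny interpreter below is made up for illustration: a grammar is a dict from nonterminal to an ordered list of alternatives, "digits" is a built-in stand-in for D+, and the input is written without spaces.

```python
def match(grammar, symbol, text, pos):
    """Return the end position of `symbol` matched at `pos`, or None."""
    if symbol == "digits":                      # built-in stand-in for D+
        end = pos
        while end < len(text) and text[end] in "0123456789":
            end += 1
        return end if end > pos else None
    if symbol not in grammar:                   # literal terminal
        return pos + len(symbol) if text.startswith(symbol, pos) else None
    for alternative in grammar[symbol]:         # ordered choice
        end = pos
        for part in alternative:
            end = match(grammar, part, text, end)
            if end is None:
                break                           # this alternative failed
        else:
            return end                          # first success wins
    return None


SHORT_FIRST = {                                 # the mis-ordered PEG
    "E": [["F"], ["F", "+", "E"]],
    "F": [["T"], ["T", "*", "F"]],
    "T": [["digits"], ["(", "E", ")"]],
}
LONG_FIRST = {                                  # longer alternatives first
    "E": [["F", "+", "E"], ["F"]],
    "F": [["T", "*", "F"], ["T"]],
    "T": [["digits"], ["(", "E", ")"]],
}

text = "2*3+4"
short_end = match(SHORT_FIRST, "E", text, 0)
long_end = match(LONG_FIRST, "E", text, 0)
print(short_end, long_end)  # 1 5
```

With the short alternatives first, E stops after one character (just the 2). With the longer alternatives first, it consumes all five characters.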
Whether you find CFGs, with attending ambiguities and precedence declarations, easier or harder than PEGs, with attending choice orderings, is mostly a matter of personal preference. But many people report finding PEGs a bit easier to work with than CFGs because they more mechanically map onto what the parser should do. Rather than saying "here's an abstract description of the strings I want," you get to say "here's the order in which I'd like you to try things," which is a bit closer to how parsing often works.
Compared with the algorithms to build LR or LL parsing tables, the algorithm used by a packrat parser is conceptually quite simple. At a high level, a packrat parser begins with the start symbol, then tries the ordered choices, one at a time, in sequence until it finds one that works. As it works through those choices, it may find that it needs to match another nonterminal, in which case it recursively tries matching that nonterminal on the rest of the string. If a particular choice fails, the parser backtracks and then tries the next production.
Matching any one individual production isn't that hard. If you see a terminal, either it matches the next available terminal or it doesn't. If it does, great! Match it and move on. If not, report an error. If you see a nonterminal, then (recursively) match that nonterminal, and if it succeeds pick up with the rest of the search at the point after where the nonterminal finished matching.
This means that, more generally, the packrat parser works by trying to solve problems of the following form:
Given some position in the string and a nonterminal, determine how much of the string that nonterminal matches starting at that position (or report that it doesn't match at all.)
Here, notice that there's no ambiguity about what's meant by "how much of the string the nonterminal matches." Unlike a traditional CFG where a nonterminal might match at a given position in several different lengths, the ordered choices used in PEGs ensure that if there's some match starting at a given point, then there's exactly one match starting at that point.
If you've studied dynamic programming, you might realize that these subproblems might overlap one another. In fact, in a PEG with k nonterminals and a string of length n, there are only Θ(kn) possible distinct subproblems: one for each combination of a starting position and a nonterminal. This means that, in principle, you could use dynamic programming to precompute a table of all possible position/nonterminal parse matches and have a very fast parser. Packrat parsing essentially does this, but using memoization rather than dynamic programming. This means that it won't necessarily try filling all table entries, just the ones that it actually encounters in the course of parsing the grammar.
Since each table entry can be filled in in constant time once the entries it depends on are known (for each nonterminal, there are only finitely many productions to try for a fixed PEG), the parser ends up running in linear time, matching the speed of an LR parser.
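Putting the pieces together, here is a minimal packrat recognizer for the arithmetic PEG, with the memo table keyed by (nonterminal, position). It is a sketch, not a reference implementation: the names GRAMMAR and packrat_parse are made up, D+ is spelled out as an extra rule N -> D N / D, input has no whitespace, and it only reports success or failure rather than building a tree.

```python
GRAMMAR = {
    "E": [["F", "+", "E"], ["F"]],
    "F": [["T", "*", "F"], ["T"]],
    "T": [["N"], ["(", "E", ")"]],
    "N": [["D", "N"], ["D"]],              # D+ written as a recursive rule
    "D": [[d] for d in "0123456789"],
}


def packrat_parse(text, start="E"):
    """Return (matched_whole_input, memo_table)."""
    table = {}  # (nonterminal, position) -> end position, or None on failure

    def match(symbol, pos):
        if symbol not in GRAMMAR:                      # terminal
            return pos + len(symbol) if text.startswith(symbol, pos) else None
        key = (symbol, pos)
        if key not in table:                           # solve each subproblem once
            result = None
            for alternative in GRAMMAR[symbol]:        # ordered choice
                end = pos
                for part in alternative:
                    end = match(part, end)
                    if end is None:
                        break
                else:
                    result = end                       # first success wins
                    break
            table[key] = result
        return table[key]

    return match(start, 0) == len(text), table


ok, table = packrat_parse("2*(3+4)")
print(ok)  # True
print(len(table) <= len(GRAMMAR) * (len("2*(3+4)") + 1))  # True
```

The table can never hold more than one entry per nonterminal per position, which is the Θ(kn) bound from above, and only the entries the parse actually asks for get filled in.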
The drawback with this approach is the amount of memory used. Specifically, the memoization table may record multiple entries per position in the input string, requiring memory usage proportional to both the size of the PEG and the length of the input string. Contrast this with LL or LR parsing, which only needs memory proportional to the size of the parsing stack, which is typically much smaller than the length of the full string.
That being said, the worse memory usage is offset by not needing to learn the internal workings of the packrat parser. You can just read up on PEGs and take things from there.
Hope this helps!
Packrat parsing is a way of providing asymptotically better performance for parsing expression grammars (PEGs); specifically for PEGs, linear time parsing can be guaranteed.
Essentially, packrat parsing just means caching whether each sub-expression matches at a given position in the string the first time it is tested. If the current attempt to fit the string into an expression fails, attempts to fit other possible expressions can reuse the known pass/fail results of sub-expressions at the positions where they have already been tested.
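A quick experiment shows how much that caching can matter. The toy PEG below (made up for this example) is S -> '(' S ')' 'x' / '(' S ')' / 'a'. When the first alternative fails at the missing x, the second alternative asks for the same inner S at the same position again. The sketch counts how often the body of S actually runs, with and without functools.lru_cache as the cache.

```python
from functools import lru_cache


def count_calls(text, memoize):
    """Parse text with S -> '(' S ')' 'x' / '(' S ')' / 'a'; return (matched, calls)."""
    calls = 0

    def s(pos):
        nonlocal calls
        calls += 1
        # Alternative 1: '(' S ')' 'x'
        if text.startswith("(", pos):
            end = s(pos + 1)
            if end is not None and text.startswith(")x", end):
                return end + 2
        # Alternative 2: '(' S ')'  (asks for S at pos + 1 a second time)
        if text.startswith("(", pos):
            end = s(pos + 1)
            if end is not None and text.startswith(")", end):
                return end + 1
        # Alternative 3: 'a'
        if text.startswith("a", pos):
            return pos + 1
        return None

    if memoize:
        # Rebinding s makes the recursive calls go through the cache as well.
        s = lru_cache(maxsize=None)(s)
    return s(0) == len(text), calls


text = "(" * 10 + "a" + ")" * 10
plain = count_calls(text, memoize=False)
cached = count_calls(text, memoize=True)
print(plain)   # (True, 2047)
print(cached)  # (True, 11)
```

Without the cache the work doubles with every level of nesting; with it, S runs once per position.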