LALR parsers and look-ahead

2.8k views Asked by At

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.

I have gotten myself confused on what look-ahead means.

If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?

    P :== a E

Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.

    P :== a b E
      |   a b F

I have additional confusion in that I don't think the latter P-productions really happen in when building a LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.

I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.

I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.

I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.

1

There are 1 answers

0
rici On

The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.

To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)

This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.

It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.

Here's a very simple example; a comma separated list:

1. Start -> List
2. List  -> ELEMENT
3. List  -> List ',' ELEMENT

Let's suppose the input is:

ELEMENT , ELEMENT , ELEMENT

At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:

queue                   remaining input                 action
----------------------  ---------------------------     -----
                        ELEMENT , ELEMENT , ELEMENT     SHIFT

At the next step, the parser decides to reduce using production 2:

ELEMENT                 , ELEMENT , ELEMENT             REDUCE 2

Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:

List                    , ELEMENT , ELEMENT             SHIFT
List ,                  ELEMENT , ELEMENT               SHIFT
List , ELEMENT          , ELEMENT                       REDUCE 3
List                    , ELEMENT                       SHIFT
List ,                  ELEMENT                         SHIFT
List , ELEMENT          --                              REDUCE 3

Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:

List                    --                              REDUCE 1
Start                   --                              ACCEPT

and the parse is successful.

That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?

As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ,, because , is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).

For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which lead to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.