retrieve the grammar rules from the generated parsing tables

825 views Asked by At

I have a quite old C corporate parser code that was generated from an ancient Yacc and uses the yyact, yypact, yypgo, yyr1, yyr2, yytoks, yyexca, yychk, yydef tables (but no yyreds though) and the original grammar source is lost. That legacy piece of code need revamping but I cannot afford to recode it from scratch.

Could it be possible to mechanically retrieve / regenerate the parsing rules by deduction of the parsing tables in order to reconstruct the grammar?

Example with a little expression parser that I can process with the same ancient Yacc:

yytabelem yyexca[] ={
-1, 1,
    0, -1,
    -2, 0,
-1, 21,
    261, 0,
    -2, 8,
    };
yytabelem yyact[]={

    13,     9,    10,    11,    12,    23,     8,    22,    13,     9,
    10,    11,    12,     9,    10,    11,    12,     1,     2,    11,
    12,     6,     7,     4,     3,     0,    16,     5,     0,    14,
    15,     0,     0,     0,    17,    18,    19,    20,    21,     0,
     0,    24 };
yytabelem yypact[]={

  -248, -1000,  -236,  -261,  -236,  -236, -1000, -1000,  -248,  -236,
  -236,  -236,  -236,  -236,  -253, -1000,  -263,  -245,  -245, -1000,
 -1000,  -249, -1000,  -248, -1000 };
yytabelem yypgo[]={

     0,    17,    24 };
yytabelem yyr1[]={

     0,     1,     1,     1,     2,     2,     2,     2,     2,     2,
     2,     2,     2 };
yytabelem yyr2[]={

     0,     8,    12,     0,     6,     6,     6,     6,     6,     6,
     4,     2,     2 };
yytabelem yychk[]={

 -1000,    -1,   266,    -2,   259,   263,   257,   258,   267,   262,
   263,   264,   265,   261,    -2,    -2,    -1,    -2,    -2,    -2,
    -2,    -2,   260,   268,    -1 };
yytabelem yydef[]={

     3,    -2,     0,     0,     0,     0,    11,    12,     3,     0,
     0,     0,     0,     0,     0,    10,     1,     4,     5,     6,
     7,    -2,     9,     3,     2 };

yytoktype yytoks[] =
{
    "NAME", 257,
    "NUMBER",   258,
    "LPAREN",   259,
    "RPAREN",   260,
    "EQUAL",    261,
    "PLUS", 262,
    "MINUS",    263,
    "TIMES",    264,
    "DIVIDE",   265,
    "IF",   266,
    "THEN", 267,
    "ELSE", 268,
    "LOW",  269,
    "UMINUS",   270,
    "-unknown-",    -1  /* ends search */
};

/* am getting this table in my example, 
but it is not present in the studied parser :^( */
char * yyreds[] =
{
    "-no such reduction-",
    "stmt : IF exp THEN stmt",
    "stmt : IF exp THEN stmt ELSE stmt",
    "stmt : /* empty */",
    "exp : exp PLUS exp",
    "exp : exp MINUS exp",
    "exp : exp TIMES exp",
    "exp : exp DIVIDE exp",
    "exp : exp EQUAL exp",
    "exp : LPAREN exp RPAREN",
    "exp : MINUS exp",
    "exp : NAME",
    "exp : NUMBER",
};

I am looking to retrieve

stmt    : IF exp THEN stmt
        | IF exp THEN stmt ELSE stmt
        | /*others*/
        ;

exp     : exp PLUS exp
        | exp MINUS exp
        | exp TIMES exp
        | exp DIVIDE exp
        | exp EQUAL exp
        | LPAREN exp RPAREN
        | MINUS exp
        | NAME
        | NUMBER
        ;

Edit: I have stripped down the generated parser of my example for clarity, but to help some analysis i have published the whole generated code as a gist. Please not that for some unknown reason there is no yyreds table in the parser I am trying to study / change. I suppose it would not have been fun :^S

1

There are 1 answers

5
Chris Dodd On

An interesting problem. Just from matching the tables to the grammar, it seems that yyr1 and yyr2 give you the "outline" of the rules -- yyr1 is the symbol on the left side of each rule, while yyr2 is 2x the number of symbols on the right side. You also have the names of all the terminals in a convenient table. But the names of the non-terminals are lost.

To figure out which symbols go on the rhs of each rule, you'll need to reconstruct the state machine from the tables, which likely involves reading and understanding the code in the y.tab.c file that actually does the parsing. Some of the tables (looks like yypact, yychk and yydef) are indexed by state number. It seems likely that yyact is indexed by yypact[state] + token. But those are only guesses. You need to look at the parsing code and understand how its using the tables to encode possible shifts, reduces, and gotos.

Once you have the state machine, you can backtrack from the states containing reductions of specific rules through the states that have shifts and gotos of that rule. A shift into a reduction state means the last symbol on the rhs of that rule is the token shifted. A goto into a reduction state means the last symbol on the rhs is symbol for the goto. The second-to-last symbol comes from the shift/goto to the state that does the shift/goto to the reduction state, and so on.

edit

As I surmised, yypact is the 'primary action' for a state. If the value is YYFLAG (-1000), this is a reduce-only state (no shifts). Otherwise it is a potential shift state and yyact[yypact[state] + token] gives you the potential state to shift to. If yypact[state] + token is out of range for the yyact table, or the token doesn't match the entry symbol for that state, then there's no shift on that token.

yychk is the entry symbol for each state -- a positive number means you shift to that state on that token, while a negative means you goto that state on that non-terminal.

yydef is the reduction for that state -- a positive number means reduce that rule, while 0 means no reduction, and -2 means two or more possible reductions. yyexca is the table of reductions for those states with more than one reduction. The pair -1 state means the following entries are for the given state; following pairs of token rule mean that for lookahead token it should reduce rule. A -2 for token is a wildcard (end of the list), while a 0 for the rule means no rule to reduce (an error instead), and -1 means accept the input.

The yypgo table is the gotos for a symbol -- you go to state yyact[yypgo[sym] + state + 1] if that's in range for yyact and yyact[yypgo[sym]] otherwise.

So to reconstruct rules, look at the yydef and yyexca tables to see which states reduce each rule, and go backwards to see how the state is reached.

For example, rule #1. From the yyr1 and yyr2 tables, we know its of the form S1: X X X X -- non-terminal #1 on the lhs and 4 symbols on the rhs. Its reduced in state 16 (from the yydef table, and the accessing symbol for state 16 (from yychk) is -1. So its:

S1: ?? ?? ?? S1

You get into state 16 from yyact[26], and yypgo[1] == 17, so that means the goto is coming from state 8 (26 == yypgo[1] + 8 + 1. The accessing symbol of state 8 is 267 (THEN) so now we have:

S1: ?? ?? THEN S1

You get into state 8 from yyact[6], so the previous state has yypact[state] == -261 which is state 3. yychk[3] == -2, so we have:

S1: ?? S2 THEN S1

You get into state 3 from yyact[24], and yypgo[2] == 24 so any state might goto 3 here. So we're now kind of stuck for this rule; to figure out what the first symbol is, we need to work our way forward from state 0 (the start state) to reconstruct the state machine.

edit

This code will decode the state machine from the table format above and print out all the shift/reduce/goto actions in each state:

#define ALEN(A)         (sizeof(A)/sizeof(A[0]))
for (int state = 0; state < ALEN(yypact); state++) {
    printf("state %d:\n", state);
    for (int i = 0; i < ALEN(yyact); i++) {
        int sym = yychk[yyact[i]];
        if (sym > 0 && i == yypact[state] + sym)
            printf("\ttoken %d shift state %d\n", sym, yyact[i]);
        if (sym < 0 && -sym < ALEN(yypgo) && 
            (i == yypgo[-sym] || i == yypgo[-sym] + state + 1))
            printf("\tsymbol %d goto state %d\n", -sym, yyact[i]); } 
    if (yydef[state] > 0)
        printf("\tdefault reduce rule %d\n", yydef[state]);
    if (yydef[state] < 0) {
        for (int i = 0; i < ALEN(yyexca); i+= 2) {
            if (yyexca[i] == -1 && yyexca[i+1] == state) {
                for (int j = i+2; j < ALEN(yyexca) && yyexca[j] != -1; j += 2) {
                    if (yyexca[j] < 0) printf ("\tdefault ");
                    else printf("\ttoken %d ", yyexca[j]);
                    if (yyexca[j+1] < 0) printf("accept\n");
                    else if(yyexca[j+1] == 0) printf("error\n");
                    else printf("reduce rule %d\n", yyexca[j+1]); } } } } }

It will produce output like:

state 0:
        symbol 1 goto state 1
        token 266 shift state 2
        symbol 2 goto state 3
        default reduce rule 3
state 1:
        symbol 1 goto state 1
        symbol 2 goto state 3
        token 0 accept
        default error
state 2:
        symbol 1 goto state 1
        token 257 shift state 6
        token 258 shift state 7
        token 259 shift state 4
        symbol 2 goto state 3
        token 263 shift state 5
state 3:
        token 261 shift state 13
        token 262 shift state 9
        token 263 shift state 10
        token 264 shift state 11
        token 265 shift state 12
        token 267 shift state 8
        symbol 1 goto state 1
        symbol 2 goto state 3
..etc

which should be helpful for reconstructing the grammar.