I have a quite old C corporate parser code that was generated from an ancient Yacc and uses the yyact, yypact, yypgo, yyr1, yyr2, yytoks, yyexca, yychk, yydef tables (but no yyreds though) and the original grammar source is lost. That legacy piece of code need revamping but I cannot afford to recode it from scratch.
Could it be possible to mechanically retrieve / regenerate the parsing rules by deduction of the parsing tables in order to reconstruct the grammar?
Example with a little expression parser that I can process with the same ancient Yacc:
yytabelem yyexca[] ={
-1, 1,
0, -1,
-2, 0,
-1, 21,
261, 0,
-2, 8,
};
yytabelem yyact[]={
13, 9, 10, 11, 12, 23, 8, 22, 13, 9,
10, 11, 12, 9, 10, 11, 12, 1, 2, 11,
12, 6, 7, 4, 3, 0, 16, 5, 0, 14,
15, 0, 0, 0, 17, 18, 19, 20, 21, 0,
0, 24 };
yytabelem yypact[]={
-248, -1000, -236, -261, -236, -236, -1000, -1000, -248, -236,
-236, -236, -236, -236, -253, -1000, -263, -245, -245, -1000,
-1000, -249, -1000, -248, -1000 };
yytabelem yypgo[]={
0, 17, 24 };
yytabelem yyr1[]={
0, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2 };
yytabelem yyr2[]={
0, 8, 12, 0, 6, 6, 6, 6, 6, 6,
4, 2, 2 };
yytabelem yychk[]={
-1000, -1, 266, -2, 259, 263, 257, 258, 267, 262,
263, 264, 265, 261, -2, -2, -1, -2, -2, -2,
-2, -2, 260, 268, -1 };
yytabelem yydef[]={
3, -2, 0, 0, 0, 0, 11, 12, 3, 0,
0, 0, 0, 0, 0, 10, 1, 4, 5, 6,
7, -2, 9, 3, 2 };
yytoktype yytoks[] =
{
"NAME", 257,
"NUMBER", 258,
"LPAREN", 259,
"RPAREN", 260,
"EQUAL", 261,
"PLUS", 262,
"MINUS", 263,
"TIMES", 264,
"DIVIDE", 265,
"IF", 266,
"THEN", 267,
"ELSE", 268,
"LOW", 269,
"UMINUS", 270,
"-unknown-", -1 /* ends search */
};
/* am getting this table in my example,
but it is not present in the studied parser :^( */
char * yyreds[] =
{
"-no such reduction-",
"stmt : IF exp THEN stmt",
"stmt : IF exp THEN stmt ELSE stmt",
"stmt : /* empty */",
"exp : exp PLUS exp",
"exp : exp MINUS exp",
"exp : exp TIMES exp",
"exp : exp DIVIDE exp",
"exp : exp EQUAL exp",
"exp : LPAREN exp RPAREN",
"exp : MINUS exp",
"exp : NAME",
"exp : NUMBER",
};
I am looking to retrieve
stmt : IF exp THEN stmt
| IF exp THEN stmt ELSE stmt
| /*others*/
;
exp : exp PLUS exp
| exp MINUS exp
| exp TIMES exp
| exp DIVIDE exp
| exp EQUAL exp
| LPAREN exp RPAREN
| MINUS exp
| NAME
| NUMBER
;
Edit: I have stripped down the generated parser of my example for clarity, but to help some analysis i have published the whole generated code as a gist. Please not that for some unknown reason there is no yyreds table in the parser I am trying to study / change. I suppose it would not have been fun :^S
An interesting problem. Just from matching the tables to the grammar, it seems that
yyr1andyyr2give you the "outline" of the rules --yyr1is the symbol on the left side of each rule, whileyyr2is 2x the number of symbols on the right side. You also have the names of all the terminals in a convenient table. But the names of the non-terminals are lost.To figure out which symbols go on the rhs of each rule, you'll need to reconstruct the state machine from the tables, which likely involves reading and understanding the code in the y.tab.c file that actually does the parsing. Some of the tables (looks like
yypact,yychkandyydef) are indexed by state number. It seems likely thatyyactis indexed byyypact[state] + token. But those are only guesses. You need to look at the parsing code and understand how its using the tables to encode possible shifts, reduces, and gotos.Once you have the state machine, you can backtrack from the states containing reductions of specific rules through the states that have shifts and gotos of that rule. A shift into a reduction state means the last symbol on the rhs of that rule is the token shifted. A goto into a reduction state means the last symbol on the rhs is symbol for the goto. The second-to-last symbol comes from the shift/goto to the state that does the shift/goto to the reduction state, and so on.
edit
As I surmised, yypact is the 'primary action' for a state. If the value is
YYFLAG(-1000), this is a reduce-only state (no shifts). Otherwise it is a potential shift state andyyact[yypact[state] + token]gives you the potential state to shift to. Ifyypact[state] + tokenis out of range for theyyacttable, or the token doesn't match the entry symbol for that state, then there's no shift on that token.yychkis the entry symbol for each state -- a positive number means you shift to that state on that token, while a negative means you goto that state on that non-terminal.yydefis the reduction for that state -- a positive number means reduce that rule, while 0 means no reduction, and -2 means two or more possible reductions.yyexcais the table of reductions for those states with more than one reduction. The pair-1 statemeans the following entries are for the given state; following pairs oftoken rulemean that for lookaheadtokenit should reducerule. A-2for token is a wildcard (end of the list), while a0for the rule means no rule to reduce (an error instead), and-1means accept the input.The
yypgotable is the gotos for a symbol -- you go to stateyyact[yypgo[sym] + state + 1]if that's in range foryyactandyyact[yypgo[sym]]otherwise.So to reconstruct rules, look at the
yydefandyyexcatables to see which states reduce each rule, and go backwards to see how the state is reached.For example, rule #1. From the
yyr1andyyr2tables, we know its of the formS1: X X X X-- non-terminal #1 on the lhs and 4 symbols on the rhs. Its reduced in state 16 (from theyydeftable, and the accessing symbol for state 16 (fromyychk) is -1. So its:You get into state 16 from
yyact[26], andyypgo[1] == 17, so that means the goto is coming from state 8 (26 ==yypgo[1] + 8 + 1. The accessing symbol of state 8 is 267 (THEN) so now we have:You get into state 8 from
yyact[6], so the previous state hasyypact[state] == -261which is state 3.yychk[3] == -2, so we have:You get into state 3 from
yyact[24], andyypgo[2] == 24so any state might goto 3 here. So we're now kind of stuck for this rule; to figure out what the first symbol is, we need to work our way forward from state 0 (the start state) to reconstruct the state machine.edit
This code will decode the state machine from the table format above and print out all the shift/reduce/goto actions in each state:
It will produce output like:
which should be helpful for reconstructing the grammar.