I have a quite old C corporate parser code that was generated from an ancient Yacc and uses the yyact
, yypact
, yypgo
, yyr1
, yyr2
, yytoks
, yyexca
, yychk
, yydef
tables (but no yyreds
though) and the original grammar source is lost. That legacy piece of code need revamping but I cannot afford to recode it from scratch.
Could it be possible to mechanically retrieve / regenerate the parsing rules by deduction of the parsing tables in order to reconstruct the grammar?
Example with a little expression parser that I can process with the same ancient Yacc:
yytabelem yyexca[] ={
-1, 1,
0, -1,
-2, 0,
-1, 21,
261, 0,
-2, 8,
};
yytabelem yyact[]={
13, 9, 10, 11, 12, 23, 8, 22, 13, 9,
10, 11, 12, 9, 10, 11, 12, 1, 2, 11,
12, 6, 7, 4, 3, 0, 16, 5, 0, 14,
15, 0, 0, 0, 17, 18, 19, 20, 21, 0,
0, 24 };
yytabelem yypact[]={
-248, -1000, -236, -261, -236, -236, -1000, -1000, -248, -236,
-236, -236, -236, -236, -253, -1000, -263, -245, -245, -1000,
-1000, -249, -1000, -248, -1000 };
yytabelem yypgo[]={
0, 17, 24 };
yytabelem yyr1[]={
0, 1, 1, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2 };
yytabelem yyr2[]={
0, 8, 12, 0, 6, 6, 6, 6, 6, 6,
4, 2, 2 };
yytabelem yychk[]={
-1000, -1, 266, -2, 259, 263, 257, 258, 267, 262,
263, 264, 265, 261, -2, -2, -1, -2, -2, -2,
-2, -2, 260, 268, -1 };
yytabelem yydef[]={
3, -2, 0, 0, 0, 0, 11, 12, 3, 0,
0, 0, 0, 0, 0, 10, 1, 4, 5, 6,
7, -2, 9, 3, 2 };
yytoktype yytoks[] =
{
"NAME", 257,
"NUMBER", 258,
"LPAREN", 259,
"RPAREN", 260,
"EQUAL", 261,
"PLUS", 262,
"MINUS", 263,
"TIMES", 264,
"DIVIDE", 265,
"IF", 266,
"THEN", 267,
"ELSE", 268,
"LOW", 269,
"UMINUS", 270,
"-unknown-", -1 /* ends search */
};
/* am getting this table in my example,
but it is not present in the studied parser :^( */
char * yyreds[] =
{
"-no such reduction-",
"stmt : IF exp THEN stmt",
"stmt : IF exp THEN stmt ELSE stmt",
"stmt : /* empty */",
"exp : exp PLUS exp",
"exp : exp MINUS exp",
"exp : exp TIMES exp",
"exp : exp DIVIDE exp",
"exp : exp EQUAL exp",
"exp : LPAREN exp RPAREN",
"exp : MINUS exp",
"exp : NAME",
"exp : NUMBER",
};
I am looking to retrieve
stmt : IF exp THEN stmt
| IF exp THEN stmt ELSE stmt
| /*others*/
;
exp : exp PLUS exp
| exp MINUS exp
| exp TIMES exp
| exp DIVIDE exp
| exp EQUAL exp
| LPAREN exp RPAREN
| MINUS exp
| NAME
| NUMBER
;
Edit: I have stripped down the generated parser of my example for clarity, but to help some analysis i have published the whole generated code as a gist. Please not that for some unknown reason there is no yyreds
table in the parser I am trying to study / change. I suppose it would not have been fun :^S
An interesting problem. Just from matching the tables to the grammar, it seems that
yyr1
andyyr2
give you the "outline" of the rules --yyr1
is the symbol on the left side of each rule, whileyyr2
is 2x the number of symbols on the right side. You also have the names of all the terminals in a convenient table. But the names of the non-terminals are lost.To figure out which symbols go on the rhs of each rule, you'll need to reconstruct the state machine from the tables, which likely involves reading and understanding the code in the y.tab.c file that actually does the parsing. Some of the tables (looks like
yypact
,yychk
andyydef
) are indexed by state number. It seems likely thatyyact
is indexed byyypact[state] + token
. But those are only guesses. You need to look at the parsing code and understand how its using the tables to encode possible shifts, reduces, and gotos.Once you have the state machine, you can backtrack from the states containing reductions of specific rules through the states that have shifts and gotos of that rule. A shift into a reduction state means the last symbol on the rhs of that rule is the token shifted. A goto into a reduction state means the last symbol on the rhs is symbol for the goto. The second-to-last symbol comes from the shift/goto to the state that does the shift/goto to the reduction state, and so on.
edit
As I surmised, yypact is the 'primary action' for a state. If the value is
YYFLAG
(-1000), this is a reduce-only state (no shifts). Otherwise it is a potential shift state andyyact[yypact[state] + token]
gives you the potential state to shift to. Ifyypact[state] + token
is out of range for theyyact
table, or the token doesn't match the entry symbol for that state, then there's no shift on that token.yychk
is the entry symbol for each state -- a positive number means you shift to that state on that token, while a negative means you goto that state on that non-terminal.yydef
is the reduction for that state -- a positive number means reduce that rule, while 0 means no reduction, and -2 means two or more possible reductions.yyexca
is the table of reductions for those states with more than one reduction. The pair-1 state
means the following entries are for the given state; following pairs oftoken rule
mean that for lookaheadtoken
it should reducerule
. A-2
for token is a wildcard (end of the list), while a0
for the rule means no rule to reduce (an error instead), and-1
means accept the input.The
yypgo
table is the gotos for a symbol -- you go to stateyyact[yypgo[sym] + state + 1]
if that's in range foryyact
andyyact[yypgo[sym]]
otherwise.So to reconstruct rules, look at the
yydef
andyyexca
tables to see which states reduce each rule, and go backwards to see how the state is reached.For example, rule #1. From the
yyr1
andyyr2
tables, we know its of the formS1: X X X X
-- non-terminal #1 on the lhs and 4 symbols on the rhs. Its reduced in state 16 (from theyydef
table, and the accessing symbol for state 16 (fromyychk
) is -1. So its:You get into state 16 from
yyact[26]
, andyypgo[1] == 17
, so that means the goto is coming from state 8 (26 ==yypgo[1] + 8 + 1
. The accessing symbol of state 8 is 267 (THEN
) so now we have:You get into state 8 from
yyact[6]
, so the previous state hasyypact[state] == -261
which is state 3.yychk[3] == -2
, so we have:You get into state 3 from
yyact[24]
, andyypgo[2] == 24
so any state might goto 3 here. So we're now kind of stuck for this rule; to figure out what the first symbol is, we need to work our way forward from state 0 (the start state) to reconstruct the state machine.edit
This code will decode the state machine from the table format above and print out all the shift/reduce/goto actions in each state:
It will produce output like:
which should be helpful for reconstructing the grammar.