Bison/Flex print value of terminal from alternative

1.2k views Asked by At

I have written a simple grammar:

operations :
    /* empty */
    | operations operation ';'
    | operations operation_id ';'
    ;

operation :
    NUM operator NUM
    {
        printf("%d\n%d\n",$1, $3);
    }
    ;

operation_id :
    WORD operator WORD
    {
        printf("%s\n%s\n%s\n",$1, $3, $<string>2);
    }
    ;

operator :
    '+' | '-' | '*' | '/'
    {
        $<string>$ = strdup(yytext);
    }
    ;

As you can see, I have defined an operator that recognizes one of 4 symbols. Now, I want to print this symbol in operation_id. Problem is, that logic in operator works only for last symbol in alternative. So if I write a/b; it prints ab/ and that's cool. But for other operations, eg. a+b; it prints aba. What am I doing wrong?

*I ommited new lines symbols in example output.

1

There are 1 answers

0
rici On

This non-terminal from your grammar is just plain wrong.

operator :
    '+' | '-' | '*' | '/' { $<string>$ = strdup(yytext); }
    ;

First, in yacc/bison, each production has an action. That rule has four productions, of which only the last has an associated action. It would be clearer to write it like this:

operator : '+' 
         | '-'
         | '*'
         | '/' { $<string>$ = strdup(yytext); }
         ;

which makes it a bit more obvious that the action only applies to the reduction from the token '/'.

The action itself is incorrect as well. yytext should never be used outside of a lexer action, because its value isn't reliable; it will be the value at the time the most recent lexer action was taken, but since the parser usually (but not always) reads one token ahead, it will usually (but not always) be the string associated with the next token. That's why the usual advice is to make a copy of yytext, but the idea is to copy it in the lexer rule, assigning the copy to the appropriate member of yylval so that the parser can use the semantic value of the token.

You should avoid the use of $<type>$ =. A non-terminal can only have one type, and it should be declared in the prologue to the bison file:

 %type <string> operator

Finally, you will find that it is very rarely useful to have a non-terminal which recognizes different operators, because the different operators are syntactically different. In a more complete expression grammar, you'd need to distinguish between a + b * c, which is the sum of a and the product of b and c, and a * b + c, which is the sum of c and the product of a and b. That can be done by using different non-terminals for the sum and product syntaxes, or by using different productions for an expression non-terminal and disambiguating with precedence rules, but in both cases you will not be able to use an operator non-terminal which produces + and * indiscriminately.

For what its worth, here is the explanation of why a+b results in the output of aba:

  1. The production operator : '+' has no explicit action, so it ends up using the default action, which is $$ = $1.

  2. However, the lexer rule which returns '+' (presumably -- I'm guessing here) never sets yylval. So yylval still has the value it was last assigned.

  3. Presumably (another guess), the lexer rule which produces WORD correctly sets yylval.string = strdup(yytext);. So the semantic value of the '+' token is the semantic value of the previous WORD token, which is to say a pointer to the string "a".

  4. So when the rule

    operation_id :
        WORD operator WORD
        {
            printf("%s\n%s\n%s\n",$1, $3, $<string>2);
        }
        ;
    

executes, $1 and $2 both have the value "a" (two pointers to the same string), and $3 has the value "b".

Clearly, it is semantically incorrect for $2 to have the value "a", but there is another error waiting to occur. As written, your parser leaks memory because you never free() any of the strings created by strdup. That's not very satisfactory, and at some point you will want to fix the actions so that semantic values are freed when they are no longer required. At that point, you will discover that having two semantic values pointing at the same block of allocated memory makes it highly likely that free() will be called twice on the same memory block, which is Undefined Behaviour (and likely to produce very difficult-to-diagnose bugs).