non- terminal symbol and symbol rules explain and especially is the expr reserved?

1.8k views Asked by At
 %{
 #include <stdio.h>
 #include <stdlib.h>
 void yyerror(const char*);
 %}
 %token WORD
 %token EOL
 %%
 input:
 /* empty */
 |
 input line
 ;
 line:
 EOL { exit(1); }
 | words EOL { printf("correct!\n"); }
 ;
 words:
 words WORD
 | WORD
 ;
 %%
 void yyerror(const char *str)
 {
 fprintf(stderr," error: %s\n",str);
 }
 main()
 {
 yyparse();
 }

Can someone help me understand non-terminal symbols and symbol rules like line input words expr .Is expr a reserved word??

1

There are 1 answers

5
Brian Tompsett - 汤莱恩 On BEST ANSWER

It looks like you are a complete beginner with using parser generating tools like bison. Although you have a small working example program you are wanting to extend it but do not fully understand how it works or understand some of the terms used in the documentation you are using. I'll to explain these points in a tutorial ending with code which extends your example with expressions.

To make the tutorial more readable some of the explanations are hidden until you hover with a mouse. This enables anyone to only read the bits they need and makes the page less of a wall of text.

Overview of Bison

The Bison Manual is a good source of reference for questions on using bison. The first thing to note is the overall layout of bison code which is like this:


%{

Prologue

%}

Bison declarations

%%

Grammar rules

%%

Epilogue


The first section is the Prologue which is used to put the declarations and imports for the generated code (which is usually in C.

The second section are the bison declarations which introduce the terminal symbols used in the grammar, and optionally indicate the start rule for the grammar.

The third section contains the (context free) grammar rules and actions to be taken when a rule is matched. These rules are converted into a parser in the output language (usually C) by the bison tool. This is usually the main body of any bison file.

The fourth and last section contains the epilogue. This is used to place any code that should be copied into the output file. It usually contains any code for functions required in the rule actions used earlier.

Grammar Rules

Understanding the grammar rules used by bison involves some computer science study. It is often one of the areas that programmers who have not studied computer science have trouble with and therefore sometimes is in need of further explanation. Grammar rules are used to define what sequences of symbols are valid in a language. Parsing is the process of matching input to those rules. Bison's job is to build a parser that implements those rules. The rules specify sequences of input entities known as tokens which are defined in the bison declaration section. The tokens and other input symbols are known as terminal symbols in the grammar. The rules are given names, and the names for the rules are called the non-terminals of the grammar. The rules have the form of:

rule: sequence of TERMINALS and non terminals ;

By convention TERMINAL names are capitalised and non terminal names are not.

An example:

declaration : ID ":" ID ;

This rule defines the non-terminal declaration to match the sequence of tokens ID, ":", ID in that order. Note that we can have strings and the names of tokens in that rule.

Those rules can have actions added to them, which define code to be executed when the rule match occurs. Actions are put in the { ... } pair at the end of each rule, like this:

declaration : ID ":" ID { printf("Matched a declaration\n"); }
            ;

For advanced usage we can also extract information from the tokens to include in our action:

declaration : ID ":" ID { printf("%s declared of type %s\n",getSymbol($1),getSymbol($3)); }
            ;

Now we have covered the basics of what is a terminal and non-terminal symbol, let's use that to decode the example program you used...

Annotated Example

%{

The %{ is the beginning of the pair `%{ ... }% which mark code to be copied into the preamble of the C code generated by bison. It permits declarations and imports to be inserted.

 #include <stdio.h>
 #include <stdlib.h>
 void yyerror(const char*);

We import the standard C libraries so we can do printing . We declare that we use yyerror
The %} ends the preamble

 %}
 %token WORD
 %token EOL
 %%

This is the bison declaration section indicating that we import tokens called WORD and EOL from the lexical analyser. The lexical analyser code is defined elsewhere, so we have no idea what sequences of characters these tokens represent. We can guess from their names though!
The %% marks the end of the declarations and the start of the rules section The first rule encountered after the %% is considered to be the start of the grammar, unless the %start directive is used in the declaration section to name another start.

 input:
 /* empty */

This defines a rule (or non-terminal) called input.
The rule consists of a comment only, which actually means an empty string or a an empty file is a valid match for the rule called input. This means the parser is allowed to parse nothing! If this was missed out, there would be no way of specifying what a line was. As in the following rule line is always defined in terms of line leaving no way out.

 |
 input line

The stick symbol | indicates or or an alternative rule for the non-terminal called input. The alternative rule is input line. This is identical to writing:
input : input line

 ;

The semicolon ; marks the end of the grammar rule (non-terminal) for line. Knowledge of context free grammars will will show us that this rule matches an optional sequence of lines of any length. What defines a line is defined by the non-terminal 'line' that follows.

 line:
 EOL { exit(1); }
 | words EOL { printf("correct!\n"); }
 ;

This defines the non-terminal line to be an unlimited sequence of lines, each one optionally containing words and terminated by the token/terminal EOL. When a line is matched the string correct is printed. An line containing no words (empty) will end the parse and the parser will exit.

 words:
 words WORD
 | WORD
 ;

This defines the non-terminal called words, which is an unlimited sequence of the token/terminal called WORD.

 %%

This marks the end of the bison rules section. All the code after this point is just copied unchanged into the parser code file by bison. It contains the error function called by the parser (yyerror) and the main program that calls the parser.

 void yyerror(const char *str)
 {
 fprintf(stderr," error: %s\n",str);
 }
 main()
 {
 yyparse();
 }

Extending the parser

You asked about expr expecting it to be some command or built-in feature of bison I suspect. It is not clear what you are asking, as you never clarified that point. However, in my experience, many students are given a simple parser, like the one you illustrated, in some class notes, and asked, by way of an exercise, to extend it to recognise lines containing arithmetic expressions. That is what I am going to assume, and explain.

Let's imagine that we wish to change (or expand) the parser to match simple expressions like WORD + WORD instead of lines of words.

We could redefine the rule line like this:

 line:
 EOL { exit(1); }
 | expr EOL { printf("correct!\n"); }
 ;

and the rule expr like this:

expr :
     WORD '+' expr
     | WORD
     ;

We could even make this into a simple calculator, if our lexical analyser matched NUMBER instead of WORD. (The lexical analyser is define elsewhere). We could now do things like:

 line:
 EOL { exit(1); }
 | expr EOL { printf("expression value is = %d\n",$1); }
 ;

expr :
     NUMBER '+' expr { $$ = $1 + $3 ; }
     | NUMBER    {$$ = $1 )
     ;

Further Reading

If you want to go further, I have several hours of (unexciting) YouTube video tutorials on building simple parsers that support my classes. They might be helpful.