direct-coded vs table-driven lexer?

6.7k views Asked by At

I'm new in compiler construction world , I want to know what are the differences between direct-coded vs table-driven lexer analyzer ?

Please use simple source code example if it's possible.

Thanks.

Edit :

in Engineering a Compiler book, the author divided lexers into three(3) types: table-driven,direct coded, and hand coded.

2

There are 2 answers

5
Martin Törnwall On BEST ANSWER

I shall assume that you by "direct-coded" mean a hand-written lexer rather than one produced as the output of a lexer generator. In that case... (See below.)

... a table-driven lexer is a (usually automatically generated) program that uses some kind of lookup table to determine which action to take. Consider the finite automaton that corresponds to the regular expression ab*a (intentionally not minimized):

DFA for ab*a

If we limited ourselves to only considering characters 'a' and 'b', we could encode it in a lookup table like so:

#define REJECT -1

/* This table encodes the transitions for a given state and character. */
const int transitions[][] = {
    /* In state 0, if we see an a then go to state 1 (the 1).
     * Otherwise, reject input.
     */
    { /*a*/  1,  /*b*/  REJECT },
    { /*a*/  2,  /*b*/  3      },
    { /*a*/ -1,  /*b*/ -1      }, /* Could put anything here. */
    { /*a*/  2,  /*b*/  3      }
};

/* This table determines, for each state, whether it is an accepting state. */
const int accept[] = { 0, 0, 1, 0 };

Now we just need a driver that actually uses the table:

int scan(void) {
    char ch;
    int state = 0;

    while (!accept[state]) {
        ch = getchar() - 'a'; /* Adjust so that a => 0, b => 1. */
        if (transitions[state][ch] == REJECT) {
            fprintf(stderr, "invalid token!\n");
            return 0; /* Fail. */
        } else {
            state = transitions[state][ch];
        }
    }
    return 1; /* Success! */
}

Of course, in a real lexer every token would have a corresponding accepting state, and you would have to somehow modify the accept table to contain the token identifier. I want to stress two things, though:

  1. A table-driven lexer doesn't necessarily operate on the level of DFA states.
  2. I don't recommend writing table-driven lexers by hand as it is a tedious and error-prone process.

A hand-written (for lack of a better name) lexer doesn't usually use a lookup table. Suppose we want a lexer for a simple Lisp-like language that has parentheses, identifiers and decimal integers:

enum token {
    ERROR,
    LPAREN,
    RPAREN,
    IDENT,
    NUMBER
};

enum token scan(void) {
    /* Consume all leading whitespace. */
    char ch = first_nonblank();
    if (ch == '(') return LPAREN;
    else if (ch == ')') return RPAREN;
    else if (isalpha(ch)) return ident();
    else if (isdigit(ch)) return number();
    else {
        printf("invalid token!\n");
        return ERROR;
    }
}

char first_nonblank(void) {
    char ch;
    do {
        ch = getchar();
    } while (isspace(ch));
    return ch;
}

enum token ident(void) {
    char ch;
    do {
        ch = getchar();
    } while (isalpha(ch));
    ungetc(ch, stdin); /* Put back the first non-alphabetic character. */
    return IDENT;
}

enum token number(void) {
    char ch;
    do {
        ch = getchar();
    } while (isdigit(ch));
    ungetc(ch, stdin); /* Put back the first non-digit. */
    return NUMBER;
}

Like the table-driven lexer example, this one is not complete. For one thing, it needs some kind of buffering to store the characters that are part of a token like IDENT and NUMBER. For another, it doesn't handle EOF particularly gracefully. But hopefully you get the gist of it.


Edit: Based on the definition in Engineering a Compiler, a direct-coded lexer is basically a hybrid of the two. Rather than using a table, we use code labels to represent states. Let's see how that would look using the same DFA as before.

int scan(void) {
    char ch;

state0:
    ch = getchar();
    if (ch == 'a') goto state1;
    else { error(); return 0; }

state1:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3;
    else { error(); return 0; }

state2:
    return 1; /* Accept! */

state3:
    ch = getchar();
    if (ch == 'a') goto state2;
    else if (ch == 'b') goto state3; /* Loop. */
    else { error(); return 0; }
}

In my personal experience the "best" approach to writing lexers is the hand-written approach I outlined above. It works on every platform, in every language, you need not learn ancient tools like lex, and, perhaps most importantly, you have the flexilbility to extend the lexer with features that are difficult or impossible to implement with a tool. For example, perhaps you want your lexer to understand Python-esque block indentaiton, or maybe you need to dynamically extend certain token classes.

0
Radoslaw Jocz On

Look at my lexer generator, it is very simple and easy to uderstand, it generates DFA direct code automata, as nested switch instructions. I used such approach for my projects, firstly was hand written and later generated by using this tool. The approach is based on my experience by studying this topic by reading several books and studying the implementations of the more advanced parser generators. There is a project on github - rmjocz/langgen