How to get the cursor location during parsing?


I made a minimal example for the Packcc parser generator. Here, the parser has to recognize float or integer numbers. I am trying to print the location of each detected number. For simplicity there is no line/column count, just the value returned by "ftell".

%auxil "FILE*" # The type sent to "pcc_create" for access in "ftell".

test <- line+
        /
        _ EOL+

line <- num _ EOL

num <-  [0-9]+'.'[0-9]+     {printf("Float at %li\n", ftell(auxil));}
        /
        [0-9]+              {printf("Integer at %li\n", ftell(auxil));}

_ <- [ \t]*

EOL <- '\n' / '\r\n' / '\r'

%%

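int main(void)
{
    // Try to open the input file.
    FILE* file = fopen("test.txt", "r");
    if(file == NULL) {
        puts("File not found");
    }
    else {
        // The generated parser seems to read with getchar, so point stdin at the file.
        // (Assigning to stdin is not portable; it only works on some C libraries.)
        stdin = file;
        // Parse.
        pcc_context_t *ctx = pcc_create(file);
        while(pcc_parse(ctx, NULL));
        pcc_destroy(ctx);
        fclose(file);
    }
    return 0;
}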

The file to parse can be

2.0
42

The command can be

packcc test.peg && cc test.c && ./a.out

The problem is that the reported cursor value is always the end of the file, whatever the position of the number.


There are 2 answers

Ploumploum (best answer)

Positions can be retrieved through special variables. In the example above, "ftell(auxil)" must be replaced by "$0s" or "$0e": $0s is the position where the matched pattern begins, and $0e is the position where it ends.

https://github.com/arithy/packcc/blob/master/README.md
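Applied to the "num" rule from the question, the change might look like the sketch below. It assumes that $0s expands to a value of type size_t (hence the "%zu" format); if your packcc version defines it differently, cast the value and adjust the format accordingly.

```
num <-  [0-9]+'.'[0-9]+     {printf("Float at %zu\n", $0s);}
        /
        [0-9]+              {printf("Integer at %zu\n", $0s);}
```

With this change the "%auxil" line is no longer needed for reporting positions, although it is harmless to keep.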

rici

Without looking more closely at the generated code, it would seem that the parser insists on reading the entire text into memory before executing any of the actions. That seems unnecessary for this grammar, and it is certainly not the way a typical generated lexical scanner would work. It's particularly odd since it seems like the generated scanner uses getchar to read one byte at a time, which is not very efficient if you are planning to read the entire file.

To be fair, you wouldn't be able to use ftell in a flex-generated scanner either, unless you forced the scanner into interactive mode. (The original AT&T lex, which also reads one character at a time, would give you a reasonable value from ftell. But you're unlikely to find a scanner built with it anymore.)

Flex would give you the wrong answer because it deliberately reads its input in chunks the size of its buffer, usually 8k. That's a lot more efficient than character-at-a-time reading. But it doesn't work for interactive environments -- for example, where you are parsing directly from user input -- because you don't want to read beyond the end of the line the user typed.
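The effect is easy to reproduce in plain C, without any generator. The sketch below is only an illustration: scan_number_offsets is a made-up helper, not part of packcc or flex, and the 8k buffer simply mirrors the size mentioned above. It reads the stream in one chunk, as a buffering scanner would, and then records the offset of each number inside the buffer. The value from ftell, taken after the read, is already past everything that was buffered, so the only usable positions are the ones the scanner tracks itself.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper for illustration only.
 * Reads up to 8192 bytes from fp in a single fread call, stores in
 * *stream_pos what ftell reports afterwards, and writes the buffer offset
 * of each number (a run of digits and dots) into offsets[].
 * Returns the number of offsets stored (at most max_tokens). */
size_t scan_number_offsets(FILE *fp, size_t *offsets, size_t max_tokens,
                           long *stream_pos)
{
    char buf[8192];
    size_t len = fread(buf, 1, sizeof buf, fp); /* one chunked read */
    size_t count = 0;
    size_t i = 0;

    /* The stream position is now at the end of the chunk, not at a token. */
    *stream_pos = ftell(fp);

    while (i < len && count < max_tokens) {
        if (isdigit((unsigned char)buf[i])) {
            offsets[count++] = i; /* position tracked by the scanner itself */
            while (i < len &&
                   (isdigit((unsigned char)buf[i]) || buf[i] == '.')) {
                i++;
            }
        } else {
            i++;
        }
    }
    return count;
}
```

For the two-line sample file from the question, the tracked offsets point at the start of each number, while the ftell value is the size of the file, which matches the behaviour the question describes.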

You'll have to ask whoever maintains packcc what their intended approach for maintaining source position is. It's possible that they have something built in.