TinyPG doesn't properly parse this grammar, bug or bad grammar?

I need to parse a simple language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser.

Things had been going pretty well until I ran into this construct in the language (this is a simplified version, but it does show the problem):

EOF               -> @"^\s*$";
[Skip] WHITESPACE -> @"\s+";
LIST              -> "LIST";
END               -> "END";
IDENTIFIER        -> @"[a-zA-Z_][a-zA-Z0-9_]*";
Expr              -> LIST IDENTIFIER+ END;
Start             -> (Expr)+ EOF;

The resulting parser cannot parse this:

LIST foo BAR Baz END

because it greedily lexes END as an IDENTIFIER, instead of properly as the END keyword.

So, here are my questions:

  1. Is this grammar ambiguous or wrong for LL(1) parsing? Or is this a bug in TinyPG?

  2. Is there any way to redesign the grammar such that TinyPG will properly parse the example line?

  3. Are there any other suggestions for a simple parser that outputs code in C# and doesn't require additional libraries? I've looked at LLLPG and ANTLR4, but found them much more troublesome than TinyPG.

There is 1 answer

Theodor Solbjørg

You might be the same person I answered on GitHub, since the issue looks identical, but here it is again for people who find this through Google.

Here is an example from the Simple-CIL-compiler project. The identifier has to match any single word except the keywords listed, which means you have to build the excluded tokens into the IDENTIFIER pattern itself:

IDENTIFIER -> @"[a-zA-Z_][a-zA-Z0-9_]*(?<!(^)(end|else|do|while|for|true|false|return|to|incby|global|or|and|not|write|readnum|readstr|call))(?!\w)";
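
For the grammar in the question, the excluded words would presumably be just LIST and END. As a quick way to sanity-check the idea outside TinyPG, here is a minimal Python sketch. It is not the pattern above: Python's `re` rejects variable-width lookbehind, so the sketch swaps in a negative lookahead that excludes whole keywords instead. The names `KEYWORDS`, `is_identifier`, and `identifiers_in` are made up for illustration, matching is case-sensitive, and whether the lookahead form behaves the same inside TinyPG's scanner is an assumption you would need to test.

```python
import re

# Hypothetical keyword list, taken from the question's grammar.
KEYWORDS = ("LIST", "END")

# Negative lookahead: reject the match if a whole keyword starts here,
# i.e. a keyword that is not followed by another word character.
IDENTIFIER = re.compile(
    r"(?!(?:%s)(?!\w))[a-zA-Z_][a-zA-Z0-9_]*" % "|".join(KEYWORDS)
)


def is_identifier(word):
    """Return True if the whole word matches the IDENTIFIER pattern."""
    return IDENTIFIER.fullmatch(word) is not None


def identifiers_in(line):
    """Split on whitespace and keep only the words that lex as identifiers."""
    return [w for w in line.split() if is_identifier(w)]
```

With the sample line from the question, `identifiers_in("LIST foo BAR Baz END")` keeps `foo`, `BAR`, and `Baz` and drops both keywords, while a longer word such as `ENDING` still counts as an identifier.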

Hope that helps.
