BNF grammar that has sections with no ending?


I need to parse a simple proprietary language that I didn't design, so I can't change the language. I need the results in C#, so I've been using TinyPG because it's so easy to use, and doesn't require external libraries to run the parser. TinyPG generates a simple LL(1) parser.

The problem I'm currently having is related to how the language divides the file into sections. It has sections for different kinds of variables, setting their initial values, equation definitions, etc. I only care about sections that declare variables, so I would like to just ignore the rest. I don't know all the rules for the other sections, and don't want to have to figure them out. They may be treated as comments.

Here's a code example:

PARAMETER
    Density             AS REAL
    CrossSectionalArea  AS REAL

SET # Parameter values
    T101.FO                 := "SimpleEventFOI::dummy";
    T101.CrossSectionalArea := 1    ; # m2

EQUATION
    OutSingleInt = SingleInt;
    OutArrayInt = ArrayInt;

I care about the PARAMETER and SET sections, but not the EQUATION section. As you can see, the problem is that these sections have no END markers, so I can't figure out how to tell the grammar that a section ends when a different keyword appears, while that same keyword may also start a new section. In my attempts, the keyword that starts the new section gets consumed to close off the old one.

There are many more sections than I have listed here; some I care about and some I don't. They seem to fall into two types: those that look like PARAMETER, whose statements have no terminating semicolon, and those that look like EQUATION, whose statements do. The language is neither case-sensitive nor whitespace-sensitive, and the sections can appear in any order (e.g. SET, EQUATION, PARAMETER). Aside from comments, the whole thing could be written on one line.
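
To make the desired behaviour concrete, here is a minimal hand-written sketch in Python (not TinyPG, and not the real language's full rules). It peeks at the next token and ends a section when that token is a section keyword, without consuming it. The keyword set, `tokenize`, and `split_sections` are hypothetical names made up for the illustration, and the tokenizer assumes `#` never appears inside a string.

```python
import re

# Hypothetical keyword set; the real language has more section keywords.
SECTION_KEYWORDS = {"PARAMETER", "SET", "EQUATION"}

# A token is a quoted string, a word, ":=", or any other single non-space character.
TOKEN_RE = re.compile(r'"[^"]*"|\w+|:=|\S')


def tokenize(text):
    """Drop '#' comments line by line, then split the rest into tokens."""
    tokens = []
    for line in text.splitlines():
        code = line.split("#", 1)[0]  # assumption: '#' never occurs inside a string
        tokens.extend(TOKEN_RE.findall(code))
    return tokens


def split_sections(tokens):
    """Group tokens into (keyword, body) pairs using one token of lookahead."""
    sections = []
    pos = 0
    while pos < len(tokens):
        keyword = tokens[pos].upper()  # keywords are compared case-insensitively
        if keyword not in SECTION_KEYWORDS:
            raise ValueError(f"expected a section keyword, got {tokens[pos]!r}")
        pos += 1  # consume the keyword that opens this section
        body = []
        # Peek at the next token: a keyword ends the body but is NOT consumed,
        # so the outer loop sees it as the start of the next section.
        while pos < len(tokens) and tokens[pos].upper() not in SECTION_KEYWORDS:
            body.append(tokens[pos])
            pos += 1
        sections.append((keyword, body))
    return sections
```

Sections that are not of interest can then be dropped by keyword, whatever order they arrive in.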

Currently I'm getting around this by using a regular expression to find the sections I'm interested in and feeding only those to the parser. However, I'm having trouble coming up with a regular expression that works in all cases without accidentally picking up keywords in comments. I may end up just expanding this workaround to solve its issues, but it would be nicer to solve the problem directly in the grammar. It's possible this just isn't an LL(1) language.
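
One way to harden the regex workaround, sketched in Python under assumptions: strip comments first (leaving `#` inside quoted strings alone), then cut sections with a lookahead so the next keyword is not swallowed. The names `strip_comments` and `find_sections` and the three-keyword list are illustrative only. A keyword inside a string literal or directly after a dot (such as a hypothetical `T101.SET`) would still be mistaken for a section start.

```python
import re

# Hypothetical keyword list; extend it with the language's other section names.
KEYWORDS = ("PARAMETER", "SET", "EQUATION")

# Either a quoted string (kept) or a '#' comment running to end of line (dropped).
STRING_OR_COMMENT = re.compile(r'("[^"]*")|#[^\r\n]*')


def strip_comments(text):
    """Remove '#' comments but leave any '#' inside a quoted string alone."""
    return STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", text)


# A section is a keyword followed, lazily, by everything up to the next
# keyword or the end of input. The lookahead leaves that keyword unconsumed.
_KW = "|".join(KEYWORDS)
SECTION = re.compile(
    rf"\b({_KW})\b(.*?)(?=\b(?:{_KW})\b|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def find_sections(text, wanted=("PARAMETER", "SET")):
    """Return (keyword, body) for each wanted section, in source order."""
    clean = strip_comments(text)
    return [
        (m.group(1).upper(), m.group(2).strip())
        for m in SECTION.finditer(clean)
        if m.group(1).upper() in wanted
    ]
```

Stripping comments before matching is what keeps a keyword mentioned in a comment from opening a spurious section.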


1 answer

Paul Chen answered:

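I tried the following TinyPG grammar, and it parses your example. It looks like TinyPG cannot distinguish keywords from identifiers on its own, so I hacked the ID token a little: a negative lookahead keeps it from matching the section keywords.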

//Tiny Parser Generator v1.3
//Copyright © Herre Kuijpers 2008-2012

<% @TinyPG Namespace="Test" %>

PARAMETER   -> @"PARAMETER";
SET         -> @"SET";
EQUATION    -> @"EQUATION";

AS          -> @"AS";

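// Identifier: a letter followed by any number of word characters, excluding the section keywords.
ID          -> @"\b(?!(PARAMETER|SET|EQUATION)\b)([a-zA-Z]\w*)";
DOT         -> @"\.";
EQ          -> @":=";
// Integer literal of any length, or a double-quoted string.
EXPR        -> @"\d+|""[^""]*""";
END         -> @";";

// Whitespace and # comments (including empty ones) are skipped.
[Skip] WS   -> @"\s+|#[^\r\n]*";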

EQDECL      -> @"\b(?!(PARAMETER|SET|EQUATION)\b)([^#;]+)";
Equations   -> EQUATION (EQDECL END)*;

Parameters  -> PARAMETER ParamDecl*;
ParamDecl   -> ID AS ID;

Sets        -> SET SetDecl*;
SetDecl     -> FullId EQ EXPR END;
FullId      -> ID DOT ID;

Section     -> Equations | Parameters | Sets;

Start       -> Section*;
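
The keyword-excluding lookahead in the ID token can be tried in isolation. The following is a small Python check of a variant of that pattern, not part of the TinyPG grammar. `is_identifier` is a hypothetical helper, and the `IGNORECASE` flag is an added assumption based on the language being case-insensitive.

```python
import re

# Same idea as the ID token: refuse to match when a reserved word, followed by
# a word boundary, sits at the current position.
# IGNORECASE is an assumption added for this sketch.
ID = re.compile(r"\b(?!(?:PARAMETER|SET|EQUATION)\b)[a-zA-Z]\w*", re.IGNORECASE)


def is_identifier(word):
    """True if the whole word is an identifier and not a section keyword."""
    return ID.fullmatch(word) is not None
```

Because the lookahead requires a word boundary after the keyword, a name that merely starts with a keyword (for example `Settings`) is still accepted as an identifier.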