Generating a parser for expressions

405 views Asked by At

I am trying to generate a javascript parser for the ebnf grammar described in this Microsoft article. The ebnf specified in the article does not work when I use it as its written, so I have tried to simplify it to make it work with the REx parser generator.

The goal is in Javascript to be able to parse and evaluate expressions like these to True or False:

  • AttributeA > 2 - The value of AttributeA is greater than 2
  • HasCategory(Assembly) - The node has Category Assembly
  • Assembly - The node has Category Assembly
  • HasValue(AttributeA) - The attribute AttributeA has a value. Its not undefined.
  • AttributeA < AttributeB - The value of attribute AttributeA is less than the value of attribute Attribute B
  • IsReference - The value of the attribute IsReference is True
  • AttributeA + 2 > 5 and AttributeB - 5 != 7
  • AttributeA * 1.25 >= 500

I am using the REx parser generator online here: https://www.bottlecaps.de/rex/. If you have suggestions for other parser generators that produce JavaScript I would appreciate some links to where I can find them.

The issue I'm struggling with is the definition of the MethodCall. I have tried a lot of different definitions but all fail. When I remove the MethodCall and MethodArgs definition the REx parser generator produces a parser.

So I would appreciate any help to crack this problem a lot.

Below is the grammar as far as I have been able to get.

Expression
    ::= BinaryExpression | MethodCall | "(" Expression ")" | Number
BinaryExpression
    ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
RelationalExpression
    ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
AdditiveExpression
    ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
MultiplicativeExpression
    ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
UnaryExpression
    ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier

MethodCall
    ::= Identifier "(" MethodArgs* ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs

Identifier
    ::= Letter ( Letter | Digit | "_" )*
Number
    ::= Digit ('.' Digit) | (Digit)*

<?TOKENS?>

Letter
    ::= [A-Za-z]
Digit
    ::= [0-9]

Here are some of the different versions for the MethodCall definition I have tried but all of them fail.

MethodCall
    ::= Identifier "(" MethodArgs? ")"
MethodArgs
    ::= Expression ("," MethodArgs)*

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= ( Expression ("," MethodArgs)* )?

MethodCall
    ::= MethodName "(" MethodArgs? ")"
MethodName
    ::= Identifier
MethodArgs
    ::= Expression ("," MethodArgs)*

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression ("," MethodArgs)* | ""

MethodCall
    ::= Identifier MethodArgs
MethodArgs
    ::= "(" (Expression ("," MethodArgs)* | "")  ")"

MethodCall
    ::= Identifier "(" MethodArgs ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs | ""

MethodCall
    ::= Identifier "(" Expression ")"

I have tried to get inspiration from a number of other languages to see how they do it, but with no luck yet so far:

Update

Just wanted to let you know how this turned out. I struggled to get the EBNF grammar to do what I needed it to do, so I took a look at Nearley like @rici suggested and converted my grammar into the Nearley grammar syntax and got the tooling to work. It was just a much better choice for my project. The documentation is very good, the tooling is also great and the error messages is very helpful. So a huge thanks to @rici for suggesting Nearley.

Below is the grammar I have implemented. I have tested with the following inputs:

'2 + 4', '2 + 4 - 6', '(2 + 4)', '!true', '!(true)', 'hasCategory(test)', 'hasCategory(test,test2)', 'hasCategory( test , test2 )', 'hasCategory(test,test2, test3)', 'IsReference', 'IsReference()', '2 * 4', '(2 / 4)', '2 * 4 + 2', '(2 / 4) + 2', '2 > 4', '2 >= 2', '2 = 4', '2 == 2', '2 != 4', '2 !== 2', '(2 * 4 + 2) > 4', '(2 * 4 + 2) > (4 + 10)', 'true', 'true or false', 'true || false', 'true and false', 'true && false', '(true or false)', '!(true or false)', '2 != 1+1', '2 != (1+1)', '2 != (1+2)', '(2 > 2 or (2 != 1+1))',

@builtin "whitespace.ne" # `_` means arbitrary amount of whitespace
@builtin "number.ne"     # `int`, `decimal`, and `percentage` number primitives
@builtin "postprocessors.ne"

@{%
function methodCall(nameId, argsId = -1) {
  return function(data) {
      return {
          type: 'methodCall',
          name: data[nameId],
          args: argsId == -1 ? [] : data[argsId]
      };
    }
}

function value() {
  return function(data) {
      return {
          type: 'value',
          value: data[0]
      };
    }
}
%}
expression -> 
    methodCall {% id %}
  | relationalExpression {% value() %}
  | booleanExpression  {% value() %}
  | _ identifier _  {% methodCall(1) %}

booleanExpression ->
      parentheses {% id %}
    | parentheses _ "and"i _ parentheses {% d => d[0] && d[4] %}
    | parentheses _ "&&" _ parentheses {% d => d[0] && d[4] %}
    | parentheses _ "or"i _ parentheses {% d => d[0] || d[4] %}
    | parentheses _ "||" _ parentheses {% d => d[0] || d[4] %}

parentheses ->
    _ "(" relationalExpression ")" _ {% nth(2) %}
  |  _ "(" booleanExpression ")" _ {% nth(2) %}
  | unaryExpression {% id %}

relationalExpression -> 
      _ additiveExpression _ {% nth(1) %}
    | relationalExpression _ "=" _ additiveExpression {% d => d[0] == d[4] %}
    | relationalExpression _ "==" _ additiveExpression {% d => d[0] == d[4] %}
    | relationalExpression _ "!=" _ additiveExpression {% d => d[0] != d[4] %}
    | relationalExpression _ "!==" _ additiveExpression {% d => d[0] != d[4] %}
    | relationalExpression _ "<" _ additiveExpression {% d => d[0] < d[4] %}
    | relationalExpression _ ">" _ additiveExpression {% d => d[0] > d[4] %}
    | relationalExpression _ "<=" _ additiveExpression {% d => d[0] <= d[4] %}
    | relationalExpression _ ">=" _ additiveExpression {% d => d[0] >= d[4] %}

additiveExpression -> 
    _ multiplicativeExpression _ {% nth(1) %}
  | additiveExpression _ "+" _ multiplicativeExpression {% d => d[0] + d[4] %}
  | additiveExpression _ "-" _ multiplicativeExpression {% d => d[0] - d[4] %}

multiplicativeExpression ->
    _ parentheses _  {% nth(1) %}
  | parentheses _ "*" _ parentheses {% d => d[0] * d[4] %}
  | parentheses _ "/" _ parentheses {% d => d[0] / d[4] %}
  | parentheses _ "%" _ parentheses {% d => d[0] % d[4] %}

unaryExpression ->  
    _ "!" _ expression _ {% d => !d[3] %}
  | _ decimal _ {% nth(1) %}
  | _ unsigned_int _ {% nth(1) %}
  | _ boolean _ {% nth(1) %}
  | _ identifier _ {% nth(1) %}

methodCall -> 
      identifier "(" methodArgs ")" {% methodCall(0, 2) %}
    | identifier "(" _ ")" {% methodCall(0) %}

methodArgs ->
    _ identifier _  {% d => [d[1]] %}
  | _ identifier _ "," _ methodArgs _ {% d => [d[1]].concat(d[5]) %}

boolean ->
    "true"i {% () => true %} 
  | "false"i {% () => false %}

identifier -> 
  [A-Za-z0-9_]:+ {% (data, l, reject) => {
    var ident = data[0].join('');
    if (ident.toLowerCase() === 'true' || ident.toLowerCase() === 'false') {
      return reject;
    } else {
      return ident;
    }
  }
   %}
1

There are 1 answers

1
kaby76 On BEST ANSWER

There are a few problems with your grammar, but it's mostly fine.

  • 'and' and 'or' conflict with Identifier. So, subtract those string literals from Identifier in its rule.
  • Number was missing parentheses. It should be Number ::= Digit ( ('.' Digit) | (Digit)* )
  • You are missing the EOF rule. Almost every parser generator I know requires a bottom/EOF rule to force consumption of the entire input. I added the rule "Input".
  • Make sure to click the "configure" box, then "backtracking". Your grammar is ambiguous, which is fine, but requires you to tell the parser generator to handle that.

Parser generators have a slightly different syntax for "EBNF", which is what REx takes. REx adds a <?TOKENS?> string to denote the boundary between parser and lexer rules. Microsoft says the grammar is "BNF" but it's not because it uses the Kleene operator <Identifier> ::= [^. ]*, an EBNF construct. It also fudges the definition of <Literal> and <Number> with prose.

I haven't tested the generated parser, but it seems like a straightforward recursive descent implementation. The parser generators that I am familiar with, and that are popular, are listed in the conversion page. (I'm writing converters for all of them and many more.)

Try this:

Input ::= Expression EOF

Expression
    ::= BinaryExpression | MethodCall | "(" Expression ")" | Number

BinaryExpression
    ::= RelationalExpression ( ( '=' | '!=' ) RelationalExpression )*
RelationalExpression
    ::= AdditiveExpression ( ( '<' | '>' | '<=' | '>=' | 'and' | 'or' ) AdditiveExpression )*
AdditiveExpression
    ::= MultiplicativeExpression ( ( '+' | '-' ) MultiplicativeExpression )*
MultiplicativeExpression
    ::= UnaryExpression ( ( '*' | '/' | '%' ) UnaryExpression )*
UnaryExpression
    ::= "!" Identifier | "+" Identifier | "-" Identifier | Identifier

MethodCall
    ::= Identifier "(" MethodArgs* ")"
MethodArgs
    ::= Expression | Expression "," MethodArgs

<?TOKENS?>

Identifier
    ::= ( ( Letter ( Letter | Digit | "_" )* ) - ( 'and' | 'or' ) )
Number ::= Digit ( ('.' Digit) | (Digit)* )

Letter
    ::= [A-Za-z]
Digit
    ::= [0-9]

EOF      ::= $