Building a Solidity parser in Rust, hitting an `expression can not fail` and `recursion` error

445 views Asked by At

I am building a Solidity parser with Rust. I am using the Pest Parser crate and am setting up my grammar.pest file to be very similar to the Solidity repo's Lexer/Parser.

I am hitting two errors. The first is an error saying:

            |
          41 | callArgumentList = {LParen ~ ((expression ~ (Comma ~ expression)*)? | LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace) ~ RParen␊
             |                               ^-----------------------------------^
             |
             = expression cannot fail; following choices cannot be reached

This error is showing for this rule defined in my grammar.pest.

callArgumentList = {LParen ~ ((expression ~ (Comma ~ expression)*)? | LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace) ~ RParen
    }

The above syntax is trying to replicate this rule in the Solidity official parser.

I believe that I have my syntax to match the Solidity repo's syntax for this rule but I am not sure what is going on here.

Secondly, I am getting a recursive error.

        --> 131:6
              |
          131 |     (expression ~ LBrack ~ expression? ~ RBrack)␊
              |      ^--------^
              |
              = rule expression is left-recursive (expression -> expression); pest::prec_climber might be useful in this case
    

This second error is due to the grammar for an expression, however it must be recursive. Here is the rule from the Solidity Repo.

Below is my grammar.pest implementation.

expression = {
    (expression ~ LBrack ~ expression? ~ RBrack)
    | (expression ~ LBrack ~ expression? ~ Colon ~ expression? ~ RBrack)
    | (expression ~ Period ~ (identifier | Address))
    | (expression ~ LBrack ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrack)
    | (expression ~ callArgumentList)
    | (Payable ~ callArgumentList)
    | (Type ~ LParen ~ typeName ~ RParen)
    | ((Inc | Dec | Not | BitNot | Delete | Sub) ~ expression)
    | (expression ~ (Inc | Dec))
    | (expression ~ Exp ~ expression)
    | (expression ~ (Mul | Div | Mod) ~ expression)
    | (expression ~ (Add | Sub) ~ expression)
    | (expression ~ (Shl | Sar | Shr) ~ expression)
    | (expression ~ BitAnd ~ expression)
    | (expression ~ BitXor ~ expression)
    | (expression ~ BitOr ~ expression)
    | (expression ~ (LessThan | GreaterThan | LessThanOrEqual | GreaterThanOrEqual) ~ expression)
    | (expression ~ (Equal | NotEqual) ~ expression)
    | (expression ~ And ~ expression)
    | (expression ~ Or ~ expression)
    | (expression ~ Conditional ~ expression ~ Colon ~ expression)
    | (expression ~ assignOp ~ expression)
    | (New ~ typeName)
    | (tupleExpression)
    | (inlineArrayExpression)
}

Can someone help me figure out how to fix these two errors?

2

There are 2 answers

1
kaby76 On BEST ANSWER

The easiest fix for direct left recursion is to do the usual refactoring for operator precedence by inserting rules and ordering according to precedence. For example (with many alts removed to simplify):

expression = { array }
array = { arraycolon ~ ( LBrack ~ expression? ~ RBrack ) * }
arraycolon = { qualified ~ ( LBrack ~ expression? ~ Colon ~ expression? ~ RBrack ) * }
qualified = { arraycomma ~ ( Period ~ identifier ) * }
arraycomma = { multexpr ~ ( LBrack ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrack ) * }
multexpr = { addexpr ~ ( (Mul | Div | Mod) ~ expression ) * }
addexpr = { andexpr ~ ( (Add | Sub) ~ expression ) * }
andexpr = { orexpr ~ ( And ~ expression ) * }
orexpr = { identifier ~ ( Or ~ expression ) * }
namedArgument = { identifier ~ Colon ~ expression }
LBrack = @{ "[" }
RBrack = @{ "]" }
Colon = @{ ":" }
Period = @{ "." }
Comma = @{ "," }
Mul = @{ "*" }
Div = @{ "/" }
Mod = @{ "%" }
Add = @{ "+" }
Sub = @{ "-" }
And = @{ "&&" }
Or = @{ "||" }
identifier = { (ASCII_ALPHANUMERIC | "_" | "-")+ }
WHITESPACE = _{ " " | "\t" | NEWLINE }
COMMENT    = _{ "#" ~ (!NEWLINE ~ ANY)* }

You can also do it using the usual refactoring that is in the Dragon book. Or, you can even use the native Pest feature to specify operator precedence.

The other problem is caused because the ?-operator is in the wrong place in ( expression ~ ( Comma expression ) * )?. A similar problem is discussed here. The solution is a refactoring to remove ?-operator, the reintroduce it in the correct place:

x : l (a? | b) r;
   ==> (eliminate ?-operator, useless parentheses)
x : l ( | a | b) r;
   ==> (add ?-operator)
x : l ( a | b )? r ;

Using this and regrouping, a possible solution is:

callArgumentList = { LParen ~ ( ( expression ~ ( Comma ~ expression ) * ) | ( LBrace ~ ( namedArgument ~ ( Comma ~ namedArgument ) * ) ? ~ RBrace ) )? ~ RParen }
0
rici On

The linked Solidity grammar is for Antlr, not Pest, and it uses a different parsing methodology. So there's no reason to expect the grammar to work unmodified in Pest.

Evidently, Antlr has different limitations. Unlike Pest, Antlr allows some left recursion by translating it into semantic predicates, a feature which doesn't appear to exist in Pest. Instead, PEST offers precedence climbing, which can be used to parse left-associative operators in algebraic expressions. That feature doesn't seem to be fully documented, but there are some examples which can be used as a model.

I'm not sure how Antlr handles the callArgumentList production, but since Antlr can backtrack, there are various possibilities. PEG grammars don't backtrack after success, and ? and * expressions always succeed because they can successfully match nothing. This is explained in the Pest book. So the error message is correct; the second alternative will never be used because the first one will always succeed, possibly by matching nothing. The intention is that the entire argument list be optional, so the optionality operator is misplaced. It should be (reformatted to avoid horizontal scrolling)

callArgumentList = 
    {
      LParen ~
        (
          expression ~ (Comma ~ expression)*
        | 
          LBrace ~ (namedArgument ~ (Comma ~ namedArgument)*)? ~ RBrace
        )? ~
      RParen
    }

The difference is subtle -- all I did was to move the first ? up one level.

I don't know if that will actually work, though, because it's possible that the braced mapping can match expression. In that case, it might be necessary to put the alternatives in the opposite order to get a correct parse. (And that's not guaranteed to work, either. It really depends on the rest of the grammar.)