Unable to encode precedence of block rule over statement rule in tree-sitter

172 views Asked by At

I am trying to encode simple grammar which covers both plain statements and also statements enclosed with a block. Block has special keyword for it. I have specified block rule precedence over zero, but tree-sitter still doesn't match it. Even it reports error, i.e. other rules don't match. But nevertheless it doesn't want to match block! Why and how to fix?

The code:

area = pi*r^2;

block {
    r=12;
}

tree-sitter matches entire sequence block { r=12; as a statement, despite the fact that curle brackets disallowed in statements. So it reports an error, but doesn't want to match block rule, although it is applicable.

Grammar:

module.exports = grammar({
    name: 'test',

    rules: {
        source_file: $ => seq(
            repeat(choice($.block, $.statement_with_semicolon)),
            optional($.statement_without_semicolon)
        ),

        block: $ => prec(1, seq(
            "block",
            "{",
            repeat( $.statement_with_semicolon ),
            optional( $.statement_without_semicolon),
            "}",
            optional(";")
        )),

        statement_without_semicolon: $ => $.token_chain,

        statement_with_semicolon: $ => seq(
            $.token_chain,
            ";"
        ),

        token_chain: $ => repeat1(
            $.token
        ),

        token: $ => choice(
            $.alphanumeric,
            $.punctuation
        ),

        alphanumeric: $ => /[a-zA-Zα-ωΑ-Ωа-яА-Я0-9]+/,

        punctuation: $ => /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];]+/
    }
});

Output:

>tree-sitter parse example-file
(source_file [0, 0] - [4, 1]
  (statement_with_semicolon [0, 0] - [0, 14]
    (token_chain [0, 0] - [0, 13]
      (token [0, 0] - [0, 4]
        (alphanumeric [0, 0] - [0, 4]))
      (token [0, 4] - [0, 7]
        (punctuation [0, 4] - [0, 7]))
      (token [0, 7] - [0, 9]
        (alphanumeric [0, 7] - [0, 9]))
      (token [0, 9] - [0, 10]
        (punctuation [0, 9] - [0, 10]))
      (token [0, 10] - [0, 11]
        (alphanumeric [0, 10] - [0, 11]))
      (token [0, 11] - [0, 12]
        (punctuation [0, 11] - [0, 12]))
      (token [0, 12] - [0, 13]
        (alphanumeric [0, 12] - [0, 13]))))
  (statement_with_semicolon [0, 14] - [3, 9]
    (token_chain [0, 14] - [3, 8]
      (token [0, 14] - [2, 0]
        (punctuation [0, 14] - [2, 0]))
      (token [2, 0] - [2, 5]
        (alphanumeric [2, 0] - [2, 5]))
      (token [2, 5] - [2, 6]
        (punctuation [2, 5] - [2, 6]))
      (ERROR [2, 6] - [2, 7])
      (token [2, 7] - [3, 4]
        (punctuation [2, 7] - [3, 4]))
      (token [3, 4] - [3, 5]
        (alphanumeric [3, 4] - [3, 5]))
      (token [3, 5] - [3, 6]
        (punctuation [3, 5] - [3, 6]))
      (token [3, 6] - [3, 8]
        (alphanumeric [3, 6] - [3, 8]))))
  (statement_without_semicolon [3, 9] - [4, 0]
    (token_chain [3, 9] - [4, 0]
      (token [3, 9] - [4, 0]
        (punctuation [3, 9] - [4, 0]))))
  (ERROR [4, 0] - [4, 1]))
example-file    0 ms    (ERROR [2, 6] - [2, 7])
1

There are 1 answers

0
ahelwer On

Your issue is that your punctuation regex matches newline characters \n and \r, which you can see here:

  (statement_with_semicolon [0, 14] - [3, 9]
    (token_chain [0, 14] - [3, 8]
      (punctuation [0, 14] - [2, 0]))

See how it matches the end of the zeroth line and the blank first line? By the time the parser gets to block it thinks block is just another token in statement_with_semicolon matching alphanumeric. You can fix this immediate issue by changing your punctuation definition to:

punctuation: $ => /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];\n\r]+/

However this likely won't be the last issue of this type you run into, so you might want to rewrite your grammar to be more explicit about the punctuation it accepts, and where. Defining the set of valid operators, for example.

This also answers your other question.