I am trying to write a simple grammar that covers both plain statements and statements enclosed in a block. A block is introduced by a special keyword. I have given the block rule a precedence above zero, but tree-sitter still doesn't match it. It even reports an error, i.e. the other rules don't match either, yet it still refuses to match the block rule. Why, and how can I fix it?
The code:
area = pi*r^2;
block {
r=12;
}
tree-sitter matches the entire sequence `block { r=12;` as a statement, even though curly brackets are disallowed in statements. So it reports an error, yet it still does not apply the block rule, although that rule is applicable.
Grammar:
module.exports = grammar({
  name: 'test',
  rules: {
    source_file: $ => seq(
      repeat(choice($.block, $.statement_with_semicolon)),
      optional($.statement_without_semicolon)
    ),
    block: $ => prec(1, seq(
      "block",
      "{",
      repeat($.statement_with_semicolon),
      optional($.statement_without_semicolon),
      "}",
      optional(";")
    )),
    statement_without_semicolon: $ => $.token_chain,
    statement_with_semicolon: $ => seq(
      $.token_chain,
      ";"
    ),
    token_chain: $ => repeat1(
      $.token
    ),
    token: $ => choice(
      $.alphanumeric,
      $.punctuation
    ),
    alphanumeric: $ => /[a-zA-Zα-ωΑ-Ωа-яА-Я0-9]+/,
    punctuation: $ => /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];]+/
  }
});
Output:
>tree-sitter parse example-file
(source_file [0, 0] - [4, 1]
(statement_with_semicolon [0, 0] - [0, 14]
(token_chain [0, 0] - [0, 13]
(token [0, 0] - [0, 4]
(alphanumeric [0, 0] - [0, 4]))
(token [0, 4] - [0, 7]
(punctuation [0, 4] - [0, 7]))
(token [0, 7] - [0, 9]
(alphanumeric [0, 7] - [0, 9]))
(token [0, 9] - [0, 10]
(punctuation [0, 9] - [0, 10]))
(token [0, 10] - [0, 11]
(alphanumeric [0, 10] - [0, 11]))
(token [0, 11] - [0, 12]
(punctuation [0, 11] - [0, 12]))
(token [0, 12] - [0, 13]
(alphanumeric [0, 12] - [0, 13]))))
(statement_with_semicolon [0, 14] - [3, 9]
(token_chain [0, 14] - [3, 8]
(token [0, 14] - [2, 0]
(punctuation [0, 14] - [2, 0]))
(token [2, 0] - [2, 5]
(alphanumeric [2, 0] - [2, 5]))
(token [2, 5] - [2, 6]
(punctuation [2, 5] - [2, 6]))
(ERROR [2, 6] - [2, 7])
(token [2, 7] - [3, 4]
(punctuation [2, 7] - [3, 4]))
(token [3, 4] - [3, 5]
(alphanumeric [3, 4] - [3, 5]))
(token [3, 5] - [3, 6]
(punctuation [3, 5] - [3, 6]))
(token [3, 6] - [3, 8]
(alphanumeric [3, 6] - [3, 8]))))
(statement_without_semicolon [3, 9] - [4, 0]
(token_chain [3, 9] - [4, 0]
(token [3, 9] - [4, 0]
(punctuation [3, 9] - [4, 0]))))
(ERROR [4, 0] - [4, 1]))
example-file 0 ms (ERROR [2, 6] - [2, 7])
Your issue is that your `punctuation` regex matches the newline characters `\n` and `\r`. You can see this in the parse output above: the `punctuation` token spanning [0, 14] - [2, 0] covers the end of the zeroth line and the blank first line. By the time the parser gets to `block`, it is already inside a `token_chain`, so it treats `block` as just another token in `statement_with_semicolon`, matching `alphanumeric`. You can fix this immediate issue by changing your `punctuation` definition so that it no longer matches newline characters.
However, this likely won't be the last issue of this type you run into, so you might want to rewrite your grammar to be more explicit about the punctuation it accepts, and where. Defining the set of valid operators, for example.
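The immediate fix can be sketched as follows. This is an assumption about what the changed definition looks like, not a quote from the answer: it simply adds `\n` and `\r` to the excluded characters, so that in `grammar.js` the rule would read `punctuation: $ => /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];\n\r]+/`. The snippet below checks both character classes in plain Node; `matchAtStart` and the variable names are made up for illustration, and the input string is the text after the first `;`, including the blank line that the parse positions imply.

```javascript
// Character class from the question: anything that is not a letter, digit,
// quote, bracket or ';' counts as punctuation, which includes all whitespace.
const punctuationOld = /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];]+/;

// Assumed fix: additionally exclude \n and \r from the class.
const punctuationNew = /[^a-zA-Zα-ωΑ-Ωа-яА-Я0-9"{}\(\)\[\];\n\r]+/;

// Hypothetical helper: return the match only if it starts at index 0,
// roughly what a lexer positioned at the start of `text` would see.
function matchAtStart(re, text) {
  const m = re.exec(text);
  return m !== null && m.index === 0 ? m[0] : null;
}

// Text that follows "area = pi*r^2;" in the example file.
const afterFirstStatement = "\n\nblock {\n    r=12;\n}";

const oldMatch = matchAtStart(punctuationOld, afterFirstStatement);
const newMatch = matchAtStart(punctuationNew, afterFirstStatement);

console.log(JSON.stringify(oldMatch)); // "\n\n": the newlines become a punctuation token
console.log(JSON.stringify(newMatch)); // null: no punctuation token starts here
```

Note that spaces and tabs are still matched by the changed class, which is one reason the broader advice about being explicit still applies.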
This also answers your other question.
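As a rough illustration of the suggestion to define the set of valid operators, here is a minimal sketch. The `OPERATORS` list is purely hypothetical (only `=`, `*` and `^` actually appear in the example input), and `escapeRegex` is a helper invented for this example. In `grammar.js` the same idea could be expressed as a rule such as `operator: $ => choice(...OPERATORS)`, since `choice` accepts string literals; the plain-Node version below just shows that an explicit list accepts the intended symbols and nothing else.

```javascript
// Hypothetical operator set; the real list depends on the language being parsed.
const OPERATORS = ["=", "+", "-", "*", "/", "^"];

// Escape regex metacharacters so that each operator is matched literally.
const escapeRegex = s => s.replace(/[.*+?^${}()|[\]\\\/-]/g, "\\$&");

// Anchored alternation of all operators: matches a string only if the whole
// string is exactly one of the listed operators.
const operatorPattern = new RegExp(
  "^(?:" + OPERATORS.map(escapeRegex).join("|") + ")$"
);

const isOperator = text => operatorPattern.test(text);

console.log(isOperator("^"));  // true
console.log(isOperator("\n")); // false: whitespace is never an operator
console.log(isOperator("{"));  // false: braces stay reserved for blocks
```

With an explicit list like this, whitespace and brackets can never be swallowed into a statement by accident, at the cost of having to enumerate every operator the language allows.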