TatSu: square brackets are being ignored in the grammar


TatSu sometimes ignores the square bracket characters ([, ], or a mix of the two) and sometimes recognizes them, as the example below shows. I'm experimenting with TatSu 5.10.1, Python 3.11.6, and Linux 6.5.7, in case that is relevant.

I aim to render a subset of Markdown, but I'll start with a simplified grammar to discuss the issue.

(I'm using a unit separator as a rare character since other ways to disable whitespace handling were more confusing. If there's a more straightforward and reliable way to tell TatSu to recognize the whitespace as characters it should treat as a part of the text, that'll be useful to know, too.)

@@grammar::Markdown

@@whitespace :: /[␟]/

start = pieces $ ;

text = text:/[a-z]+/ ;

pieces = {text}*
    ;

This test code leads TatSu to ignore the [] instead of failing with an error. If I set markdown_str to something else, like () or {}, TatSu fails. Individual square brackets, [ or ], don't raise an exception either.

import tatsu

with open("./grammar.txt", "r") as grammar_file:
    grammar = grammar_file.read()

class MarkdownSemantics:

    def pieces(self, ast):
        return ''.join(ast)

parser = tatsu.compile(grammar)

markdown_str = "[]"
ast = parser.parse(markdown_str, semantics=MarkdownSemantics())
print(ast)

I suspect this is a bug, as I don't see what's so special about the square bracket characters. They are not defined as part of the whitespace to be ignored, and similar characters such as () or {} are rejected as expected.

At the same time, I am told here that it's about learning parsing principles. Is my EBNF above allowing [ or ] to pass?

There are 2 answers

dnn On BEST ANSWER

Your example code does not work as posted: the semantics class expects the argument to pieces() to be a list of strings, but it is not.

Anyhow, the issue is with your whitespace definition. Contrary to what the documentation says, the @@whitespace directive in the grammar definition is interpreted as a list of characters to skip over between tokens (at least this is how I read the TatSu source code). Therefore, your grammar definition skips over [ and ].

To disable white space handling, you can assign None or False to the @@whitespace directive:

@@whitespace :: None
Apalala On

The problem is in the @@whitespace definition (there's a weird character there, but I don't think it's that).

The grammar works if you use this regexp instead:

@@whitespace :: /\s+/

It seems that TatSu is incorrectly escaping the original regex:

'[ ]'
re.compile('(?m)[\\[\\ \\]]+', re.MULTILINE)
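The effect of that escaping can be reproduced with the standard library alone. The sketch below does not call TatSu; it assumes, based on the compiled pattern shown above, that the whitespace text is escaped character by character and then wrapped in a character class. The variable names are illustrative only.

```python
import re

# The text given to @@whitespace in the question's grammar
# (U+241F is the unit separator symbol used there).
whitespace = "[\u241f]"

# Assumption: the text is treated as a set of characters, so every
# character is escaped and the result is wrapped in a character class.
as_char_set = re.compile("[" + re.escape(whitespace) + "]+")

# What the grammar author intended: the text used as a regex as-is.
as_regex = re.compile(whitespace)

# Under the assumed escaping, "[" and "]" become skippable whitespace.
print(as_char_set.fullmatch("[]") is not None)

# Used as a plain regex, only the separator character itself matches.
print(as_regex.fullmatch("[]") is not None)
print(as_regex.fullmatch("\u241f") is not None)
```

If the escaped form is what ends up being used, an input of [] consists entirely of "whitespace", which would explain why the parse succeeds with nothing left to reject, while () and {} still fail.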