I found the following (simplified) grammar on the internet as I was looking for a solution to a problem where I had to parse syntax similar to Markdown.
grammar Markdown;
parse : stat+;
stat : bold
| text
| WS
;
text : TEXT|SPACE;
bold : ('**'stat*'**');
TEXT : [a-zA-Z0-9]+;
SPACE : ' ';
WS : [\t\r\n]+;
What I want to achieve is that ANTLR4 does a kind of shortest match first on a sentence that looks like **bold1** not bold **bold2**.
This would mean that bold1 is bold, "not bold" is not, and bold2 is bold again.
However, because ANTLR4 matches the longest possible input first, it parses the example as two nested bolds, which is wrong.
I have already thought about multiple very complicated solutions to this problem which I actually don't want to use. Is there a simple solution?
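To make the intended result concrete, here is a minimal Python sketch (plain string handling, not ANTLR) of the non-nested pairing I am after. The function name `split_bold` is hypothetical, and the sketch assumes that markers never nest and are always balanced.

```python
def split_bold(sentence, marker="**"):
    """Pair each marker with the nearest following one (no nesting)."""
    parts = sentence.split(marker)
    # A balanced (even) number of markers yields an odd number of parts.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced bold markers")
    # Odd-indexed parts sit between an opening and a closing marker.
    return [(text, i % 2 == 1) for i, text in enumerate(parts) if text]


print(split_bold("**bold1** not bold **bold2**"))
# [('bold1', True), (' not bold ', False), ('bold2', True)]
```

Each tuple holds a piece of text and whether it should be bold, which is the structure I would like the parse tree to reflect.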
UPDATE:
Apparently I simplified the example too much.
Here is an expanded grammar that no longer recognizes Markdown but illustrates the problem I have.
It is important to note that stat can also be a variable which is just a placeholder for any keyword my language can possibly contain.
grammar StyleParser;
parse : styled_stat+;
styled_stat : italic
| bold
| underline
| stat
;
stat : variable
| text
;
variable: VARIABLE;
text : TEXT|SPACE;
italic : ITALIC (stat | italic_bold | italic_underline)* ITALIC;
italic_bold: BOLD (stat | italic_bold_underline)* BOLD;
italic_bold_underline: UNDERLINE stat* UNDERLINE;
italic_underline: UNDERLINE (stat | italic_underline_bold)* UNDERLINE;
italic_underline_bold: BOLD stat* BOLD;
bold : BOLD (stat | bold_italic | bold_underline)* BOLD;
bold_italic: ITALIC (stat | bold_italic_underline)* ITALIC;
bold_italic_underline: UNDERLINE stat* UNDERLINE;
bold_underline: UNDERLINE (stat | bold_underline_italic)* UNDERLINE;
bold_underline_italic: ITALIC stat* ITALIC;
underline : UNDERLINE (stat | underline_bold | underline_italic)* UNDERLINE;
underline_italic: ITALIC (stat | underline_italic_bold)* ITALIC;
underline_italic_bold: BOLD stat* BOLD;
underline_bold: BOLD (stat | underline_bold_italic)* BOLD;
underline_bold_italic: ITALIC stat* ITALIC;
SPACE : ' ';
VARIABLE : 'VAR';
TEXT : [a-zA-Z0-9]+;
ITALIC: '//';
BOLD: '==';
UNDERLINE: '__';
With this grammar I cannot nest the same style, but I can nest different styles. For example, it parses ==bold1 __underline //italic//__== not __underline__ //italic// bold ==VAR== correctly.
The problem is that the number of rules grows exponentially with the number of styles you introduce, and I want to avoid this.
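To put a number on that growth, a small Python sketch can count the rules under the assumption that every ordered chain of distinct styles needs its own rule, as in the grammar above (`italic`, `italic_bold`, `italic_bold_underline`, and so on). The helper name `count_rules` is hypothetical.

```python
from itertools import permutations


def count_rules(styles):
    """Count style rules when each nesting order gets its own rule."""
    styles = list(styles)
    # One rule per ordered chain of distinct styles, e.g.
    # ("italic",), ("italic", "bold"), ("italic", "bold", "underline").
    return sum(
        1
        for depth in range(1, len(styles) + 1)
        for _ in permutations(styles, depth)
    )


print(count_rules(["italic", "bold", "underline"]))  # 15, as in the grammar above
print(count_rules(range(4)))                         # 64 with a fourth style
```

Three styles already need 15 style rules, and a fourth would bring the count to 64.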
Parsing Markdown is just non-trivial. One approach is outlined below.
For the general case of a Markdown-styled WORD, defined as some string of text exclusive of qualifying attributes, the parser definition is built around a WORD together with the left and right attributes that qualify it.
In the lexer, define all attributes as left attributes by default, and reserve tokens for right attributes and for WORD.
In the lexer superclass, override token emission and decide whether (a) a left attribute should really be reassigned as a right attribute, and (b) a CHAR should be accumulated into the current WORD or added to a new WORD instance.
Now a tree walker can evaluate the sequences of words and handle the potentially multiple overlapping, nested attributes.
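The idea can be sketched in plain Python rather than in an ANTLR lexer superclass. Everything named here (`MARKS`, `lex`, `walk`) is hypothetical, and the rule used to tell left from right (a mark is a right attribute when its style is already open) is an assumption for illustration. For simplicity, VAR is treated as an ordinary WORD.

```python
MARKS = {"==": "bold", "//": "italic", "__": "underline"}


def lex(text):
    """Emit (kind, value) tokens; kind is LEFT, RIGHT, or WORD."""
    tokens, open_styles, word = [], set(), []

    def flush():
        # Close the WORD being accumulated, if any.
        if word:
            tokens.append(("WORD", "".join(word)))
            word.clear()

    i = 0
    while i < len(text):
        style = MARKS.get(text[i:i + 2])
        if style is not None:
            flush()
            if style in open_styles:
                # Style already open: reassign the default LEFT as a RIGHT.
                open_styles.remove(style)
                tokens.append(("RIGHT", style))
            else:
                open_styles.add(style)
                tokens.append(("LEFT", style))
            i += 2
        elif text[i].isspace():
            flush()  # whitespace ends the current WORD instance
            i += 1
        else:
            word.append(text[i])  # accumulate CHAR into the current WORD
            i += 1
    flush()
    return tokens


def walk(tokens):
    """Pair every WORD with the styles active at that point."""
    active, result = set(), []
    for kind, value in tokens:
        if kind == "LEFT":
            active.add(value)
        elif kind == "RIGHT":
            active.discard(value)
        else:
            result.append((value, sorted(active)))
    return result


for word, styles in walk(lex("==bold1 __underline //italic//__== not ==VAR==")):
    print(word, styles)
```

Because the walker only tracks which styles are open, adding a style means adding one entry to `MARKS` instead of a new family of nested rules.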