ANTLR grammar for triple-quoted strings


I am trying to update an ANTLR grammar so that it follows this spec:

https://github.com/facebook/graphql/pull/327/files

In logical terms it is defined as:

StringValue ::
   - `"` StringCharacter* `"`
   - `"""` MultiLineStringCharacter* `"""`

StringCharacter ::
  - SourceCharacter but not `"` or \ or LineTerminator
  - \u EscapedUnicode
  - \ EscapedCharacter

MultiLineStringCharacter ::
  - SourceCharacter but not `"""` or `\"""`
  - `\"""`

(Note: the above is logical notation, not ANTLR syntax.)

I tried the following in ANTLR 4, but it won't recognize more than one character inside a triple-quoted string:

string : triplequotedstring | StringValue ;

triplequotedstring: '"""' triplequotedstringpart?  '"""';

triplequotedstringpart : EscapedTripleQuote* | SourceCharacter*;

EscapedTripleQuote : '\\"""';

SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];

StringValue: '"' (~(["\\\n\r\u2028\u2029])|EscapedChar)* '"';

With these rules it will recognize '"""a"""', but as soon as I add more characters it fails.

For example, '"""abc"""' won't parse, and the IntelliJ plugin for ANTLR says:

line 1:14 extraneous input 'abc' expecting {'"""', '\\"""', SourceCharacter}

How do I handle triple-quoted strings in ANTLR with '\"""' escaping?


There are 2 answers

Bart Kiers

Some of your parser rules should really be lexer rules. And SourceCharacter should probably be a fragment.

Also, instead of EscapedTripleQuote* | SourceCharacter*, you probably want ( EscapedTripleQuote | SourceCharacter )*. The first matches aaa... or bbb..., while you probably meant to match aababbba...
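
The difference between the two forms can be checked outside ANTLR. As a rough analogy only, the sketch below uses java.util.regex, with a standing in for EscapedTripleQuote and b for SourceCharacter. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: a regex analogy for the two grammar shapes, not ANTLR code.
class AlternationDemo {
    // Analogue of `EscapedTripleQuote* | SourceCharacter*`: only a's, or only b's.
    static final Pattern STAR_INSIDE = Pattern.compile("a*|b*");
    // Analogue of `( EscapedTripleQuote | SourceCharacter )*`: any mix of a's and b's.
    static final Pattern STAR_OUTSIDE = Pattern.compile("(?:a|b)*");

    static boolean starInside(String s) {
        return STAR_INSIDE.matcher(s).matches();
    }

    static boolean starOutside(String s) {
        return STAR_OUTSIDE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(starInside("aaa"));       // true
        System.out.println(starInside("aababbba"));  // false
        System.out.println(starOutside("aababbba")); // true
    }
}
```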

Try something like this instead:

string
 : Triplequotedstring 
 | StringValue 
 ;

Triplequotedstring
 : '"""' TriplequotedstringPart*? '"""'
 ;

StringValue
 : '"' ( ~["\\\n\r\u2028\u2029] | EscapedChar )* '"'
 ;

// Fragments never become a token of their own: they are only used inside other lexer rules
fragment TriplequotedstringPart : EscapedTripleQuote | SourceCharacter;
fragment EscapedTripleQuote : '\\"""';
fragment SourceCharacter :[\u0009\u000A\u000D\u0020-\uFFFF];
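
To sanity-check what such a lexer rule is expected to accept, it can help to compare against a small hand-written scanner. The sketch below is not ANTLR output. It is a plain Java reading of the spec's MultiLineStringCharacter rule, and the class and method names are hypothetical.

```java
// Illustration only: a hand-written scanner for the spec's multi-line rule.
class TripleQuoteScanner {
    static final String QUOTES = "\"\"\"";           // """
    static final String ESCAPED_QUOTES = "\\\"\"\""; // \"""

    /**
     * Returns the index just past the closing """ of a triple-quoted string
     * starting at `start`, or -1 if none starts there or it is unterminated.
     */
    static int scan(String src, int start) {
        if (!src.startsWith(QUOTES, start)) {
            return -1;
        }
        int i = start + QUOTES.length();
        while (i < src.length()) {
            if (src.startsWith(ESCAPED_QUOTES, i)) {
                i += ESCAPED_QUOTES.length(); // \""" is content, not a terminator
            } else if (src.startsWith(QUOTES, i)) {
                return i + QUOTES.length();   // first unescaped """ closes the string
            } else {
                i++;                          // any other character is content
            }
        }
        return -1; // ran off the end without finding a closing """
    }

    public static void main(String[] args) {
        System.out.println(scan("\"\"\"abc\"\"\"", 0));              // 9
        System.out.println(scan("\"\"\"a\\\"\"\"b\"\"\" tail", 0)); // 12
        System.out.println(scan("\"\"\"abc", 0));                    // -1
    }
}
```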
Alex Zerntev

Triple quoted strings are often used to allow multi-line strings and unescaped characters inside a string. Assuming that you are skipping spaces and linebreaks, parsing triple quotes can be quite tricky, because there are some corner cases like:

  • Since triple-quoted strings can span multiple lines, error reporting has to account for that. If you define the whole triple-quoted string as a single lexer token that includes line breaks, the line (and column) numbers will be wrong.
  • In case of """"""" (one double quote surrounded by triple quotes) the parsed result should be a string literal with content: "

To cope with the above issues, a grammar with lexer modes can be used:

Lexer:

START_TRIPLE_QUOTE: '"""' -> pushMode(INSIDE_TRIPLE_QUOTE);

mode INSIDE_TRIPLE_QUOTE;
TRIPLE_QUOTED_STRING_CONTENT : '"' '"'? ~["]  // Match one or two quotes followed by a non-quote
                             | ~["]           // Match any character that is not a quote
                             ;
TRIPLE_QUOTE_END_2: '"""""' -> popMode;
TRIPLE_QUOTE_END_1: '""""' -> popMode;
TRIPLE_QUOTE_END_0: '"""' -> popMode;

Parser:

triple_string_literal: START_TRIPLE_QUOTE (TRIPLE_QUOTED_STRING_CONTENT)*
                              (TRIPLE_QUOTE_END_2
                              | TRIPLE_QUOTE_END_1
                              | TRIPLE_QUOTE_END_0);

And in your Listener/Visitor:

TripleQuotedStringConst(ctx.getText().substring(3, ctx.getText().length() - 3))
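
The substring step can be tried on its own, without the ANTLR runtime. The helper below is a minimal sketch with hypothetical names. The unescape method is an extra assumption: it only makes sense if the grammar in use accepts the spec's \""" escape, which the lexer rules above are not shown to do.

```java
// Illustration only: hypothetical helper names, independent of the ANTLR runtime.
class TripleQuoteText {
    static final String QUOTES = "\"\"\""; // """

    /** Strips the opening and closing """ from the matched text. */
    static String stripDelimiters(String matched) {
        if (matched.length() < 2 * QUOTES.length()
                || !matched.startsWith(QUOTES)
                || !matched.endsWith(QUOTES)) {
            throw new IllegalArgumentException("not a triple-quoted string: " + matched);
        }
        return matched.substring(QUOTES.length(), matched.length() - QUOTES.length());
    }

    /** Assumption: turns the spec's \""" escape into a literal """. */
    static String unescape(String content) {
        return content.replace("\\\"\"\"", QUOTES);
    }

    public static void main(String[] args) {
        // Seven quotes: one double quote surrounded by triple quotes.
        System.out.println(stripDelimiters("\"\"\"\"\"\"\""));  // "
        System.out.println(stripDelimiters("\"\"\"abc\"\"\"")); // abc
    }
}
```

Stripping exactly three characters from each end is what lets the seven-quote input come out as a single double quote.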

For reference, here is an article that I wrote on this: https://medium.com/@alexzerntev/parsing-multi-line-triple-quoted-strings-with-antlr4-ceca41cdeadb