ANTLR4 - parse function-like structures in regular text


I'm experimenting with a grammar that should match function-like structures inside regular text. These functions start with a dollar sign, accept text arguments surrounded by apostrophes, and allow nesting of other functions.

I was able to achieve this under more constrained conditions, where every piece of text has to be surrounded by apostrophes and concatenation with the '+' character is available, but I want to redesign it to work without this constraint.

I came up with this grammar:

grammar Functions;

fragment DIGIT : [0-9];
fragment LETTER : [A-Za-z];

FUNCTION_NAME : '$' LETTER (LETTER | DIGIT)+;

APOSTROPHE : '\'';
LEFT_PARENTHESIS  : '(';
RIGHT_PARENTHESIS : ')';

ESCAPE_CHARACTER: '\\' [$()\\'];
TEXT  : '\'' ~[\r\n']* '\'';

PLAIN_TEXT : . -> skip;

start : subString*;

subString
    : function
    | ESCAPE_CHARACTER
    | LEFT_PARENTHESIS
    | RIGHT_PARENTHESIS
    | APOSTROPHE
    | TEXT
    ;

function
    : FUNCTION_NAME LEFT_PARENTHESIS param? RIGHT_PARENTHESIS
    ;

param
    : function
    | TEXT
    ;

But the following example does not work well:

Text $func('A') 'text $func2()'

That is because 'text $func2()' is matched as a single TEXT token. I therefore came up with an escaping feature, so writing \' instead of ' solves the problem.

However, I'd like characters outside the function context to be treated as regular characters. Because of this 'context', I'm starting to think that I've reached the limits of context-free grammars, but I don't have enough practical experience to confirm that.

Is it possible to meet my requirements using ANTLR4?


There are 2 answers

Bart Kiers On BEST ANSWER

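This could work. It uses lexical modes, so that parentheses and quoted parameters are only tokenized inside a function call and everything else is skipped: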

FunctionsLexer.g4

lexer grammar FunctionsLexer;

FUNCTION_NAME : '$' LETTER (LETTER | DIGIT)* -> pushMode(InFunction);
PLAIN_TEXT : . -> skip;

mode InFunction;

FUNCTION_NAME_NESTED
 : '$' LETTER (LETTER | DIGIT)* -> type(FUNCTION_NAME), pushMode(InFunction)
 ;

PARAM : '\'' ~['$]* '\'';
LEFT_PARENTHESIS  : '(';
RIGHT_PARENTHESIS : ')' -> popMode;

fragment DIGIT : [0-9];
fragment LETTER : [A-Za-z];

FunctionsParser.g4

parser grammar FunctionsParser;

options {
  tokenVocab=FunctionsLexer;
}

start
 : subString* EOF
 ;

subString
 : function
 ;

function
 : FUNCTION_NAME LEFT_PARENTHESIS param? RIGHT_PARENTHESIS
 ;

param
 : function
 | PARAM
 ;

The input Text $func('A') 'text $func2()' BLA $fun3($fun4('...')) produces 15 tokens:

  1    FUNCTION_NAME                  '$func'
  2    LEFT_PARENTHESIS               '('
  3    PARAM                          '\'A\''
  4    RIGHT_PARENTHESIS              ')'
  5    FUNCTION_NAME                  '$func2'
  6    LEFT_PARENTHESIS               '('
  7    RIGHT_PARENTHESIS              ')'
  8    FUNCTION_NAME                  '$fun3'
  9    LEFT_PARENTHESIS               '('
  10   FUNCTION_NAME                  '$fun4'
  11   LEFT_PARENTHESIS               '('
  12   PARAM                          '\'...\''
  13   RIGHT_PARENTHESIS              ')'
  14   RIGHT_PARENTHESIS              ')'
  15   EOF                            '<EOF>'

and the start rule produces the following parse tree:

'- start
   |- subString
   |  '- function
   |     |- '$func' (FUNCTION_NAME)
   |     |- '(' (LEFT_PARENTHESIS)
   |     |- param
   |     |  '- '\'A\'' (PARAM)
   |     '- ')' (RIGHT_PARENTHESIS)
   |- subString
   |  '- function
   |     |- '$func2' (FUNCTION_NAME)
   |     |- '(' (LEFT_PARENTHESIS)
   |     '- ')' (RIGHT_PARENTHESIS)
   |- subString
   |  '- function
   |     |- '$fun3' (FUNCTION_NAME)
   |     |- '(' (LEFT_PARENTHESIS)
   |     |- param
   |     |  '- function
   |     |     |- '$fun4' (FUNCTION_NAME)
   |     |     |- '(' (LEFT_PARENTHESIS)
   |     |     |- param
   |     |     |  '- '\'...\'' (PARAM)
   |     |     '- ')' (RIGHT_PARENTHESIS)
   |     '- ')' (RIGHT_PARENTHESIS)
   '- '<EOF>'
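
To make the mode switching easier to follow, here is a rough plain-Python imitation of the lexer above. It is only a sketch and does not use the ANTLR runtime: the `tokenize` function and the `depth` counter are hypothetical names, and `depth` stands in for ANTLR's mode stack (0 means the default mode, anything higher means InFunction). Unlike ANTLR, which would report a token recognition error, the sketch silently skips characters it does not recognize inside a function.

```python
import re

# Same character classes as the FUNCTION_NAME and PARAM lexer rules.
NAME = re.compile(r"\$[A-Za-z][A-Za-z0-9]*")
PARAM = re.compile(r"'[^'$]*'")

def tokenize(text):
    """Imitate the two-mode lexer: plain text is skipped, and parentheses
    and quoted parameters are only tokens inside a function call."""
    tokens = []
    depth = 0  # stand-in for the mode stack; 0 = default mode
    pos = 0
    while pos < len(text):
        m = NAME.match(text, pos)
        if m:
            # FUNCTION_NAME / FUNCTION_NAME_NESTED -> pushMode(InFunction)
            tokens.append(("FUNCTION_NAME", m.group()))
            depth += 1
            pos = m.end()
            continue
        if depth > 0:
            m = PARAM.match(text, pos)
            if m:
                tokens.append(("PARAM", m.group()))
                pos = m.end()
                continue
            if text[pos] == "(":
                tokens.append(("LEFT_PARENTHESIS", "("))
            elif text[pos] == ")":
                # RIGHT_PARENTHESIS -> popMode
                tokens.append(("RIGHT_PARENTHESIS", ")"))
                depth -= 1
            # Anything else inside a function is skipped in this sketch.
        # In the default mode every other character is PLAIN_TEXT -> skip.
        pos += 1
    tokens.append(("EOF", "<EOF>"))
    return tokens

if __name__ == "__main__":
    sample = "Text $func('A') 'text $func2()' BLA $fun3($fun4('...'))"
    for number, (kind, value) in enumerate(tokenize(sample), start=1):
        print(number, kind, value)
```

Run on the example input, the sketch yields the same sequence of token types as the table above, because the apostrophes around 'text $func2()' are seen in the default mode and skipped like any other plain text.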
Octavian Theodor On

I think your grammar can be simplified, as you only need to match "function calls" and don't care about anything else.

Here's a simpler version. I think it matches your basic requirements (functions with, at most, one argument, which can be a function call or a quoted string).

grammar Functions;

start
    : function*
    ;

function
    : FUNCTION_NAME LEFT_PARENTHESIS param? RIGHT_PARENTHESIS
    ;

param
    : PARAM
    | function
    ;

FUNCTION_NAME : '$' LETTER (LETTER | DIGIT)*;
fragment DIGIT : [0-9];
fragment LETTER : [A-Za-z];

PARAM
    : '\'' ~['$]* '\''
    ;

LEFT_PARENTHESIS  : '(';
RIGHT_PARENTHESIS : ')';

PLAIN_TEXT : . -> skip;

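The main trick is that a quoted-string PARAM token stops at either the first quote or the first $ (the character that begins a function). In the latter case PARAM does not match at all, so the opening quote is skipped as PLAIN_TEXT and the $ can start a FUNCTION_NAME token.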
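
The effect of excluding $ from the quoted string can be checked with ordinary regular expressions. This is only an illustration in Python, not ANTLR: `TEXT` and `PARAM` below are regex transcriptions of the question's TEXT rule and this answer's PARAM rule, and `match_at` is a hypothetical helper.

```python
import re

# Question's rule:  TEXT  : '\'' ~[\r\n']* '\'';
TEXT = re.compile(r"'[^\r\n']*'")
# This answer's rule:  PARAM : '\'' ~['$]* '\'';
PARAM = re.compile(r"'[^'$]*'")

def match_at(pattern, text, pos=0):
    """Return the text matched by pattern at pos, or None if it does not match."""
    m = pattern.match(text, pos)
    return m.group() if m else None

if __name__ == "__main__":
    quoted = "'text $func2()'"
    # TEXT swallows the whole quoted span, hiding the function call.
    print(match_at(TEXT, quoted))
    # PARAM gives up at the $, leaving it free to start a FUNCTION_NAME.
    print(match_at(PARAM, quoted))
    # A quoted string without $ is still a PARAM.
    print(match_at(PARAM, "'A'"))
```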