ANTLR 4 - How to reduce prediction time for IF statements with optional END-IF (COBOL grammar)

758 views Asked by At

I've been working on a COBOL grammar and found that the generated parser runs very slow for programs having deeply nested IF statements.

I think it's better to look at a sample code first, to explain one of COBOL's "annoying" feature - one DOT (full stop) can close any level of unclosed nested IF statements. Here it is:

IF condition-1
    S1
    IF condition-2
        S2
        (more nested IF conditions)
           IF condition-n
               Sn
  (0~n "END-IF"s )
.

The text between parenthesis are pseudo code. The key point is that the "END-IF"s are optional before a DOT which closes a so-called "sentence" in COBOL. To allow the "END-IF"s to be optional, I've written the grammar this way:

sentence:
  statement_list DOT
  ;
statement_list:
  statement+
  ;
statement:
  if_statement
  | ...   //other statements
  ;
if_statement:
  IF condition THEN? statement_list
  (ELSE statement_list)?
  END_IF?
  ;

This grammar turn out to be slow for the style of code like the sample above. When debugging I found the parser is busy predicting inside if_statement. This is expected as there are ambiguities in the grammar. For the sample code, "IF condition-2 ..." can be recognized as either a child or brother of "IF condition-1 ...", because there can be an omitted "END-IF" right after S1. The inner most "IF condition-n" can have up to n choices. Similarly, every the END-IF before the DOT have alternatives over the "IF" they pair with if their total number is not n.

I tried to cut off alternatives by adding semantic predicates, like below:

statement_list:
  statement+  {_input.LA(1)==END_IF || _input.LA(1)==ELSE || _input.LA(1)==DOT}?
  ;  
if_statement:
  IF condition THEN? statement_list
   ( {_input.LA(1)!=ELSE}? | ELSE statement_list)
   ( {_input.LA(1)!=END-IF}? | END_IF) 
   ;

This reduced parse time of large programs by 2/3, but does not reduce it to be linear time, as deep prediction over if_statement and statement_list is still observed. A closer look into antlr 4 runtime seems to show that semantic predicates are not used at all during prediction.

Does anyone know of a trick to make this "optional END-IF" unambiguous, or make it predict fast?

Actually there is also ambiguity for the statements before "ELSE", just like those before END-IF.

Note: Paring the sample with 30 levels of nested ELSE-IF uses 3 seconds on my machine. And I'm facing large programs that have many occurrence of such deep nesting and uses more than 10 minutes to parse. One of them even goes stack overflow when predicting.

Another note: I'm using antlr 4.2.

0

There are 0 answers