How to implement SEPARATE island grammar in ANTLR4 with correct line numbers and char index?


I've been developing a COBOL grammar with support for embedded SQL statements. For anyone who's not familiar with COBOL, here is an example.

MOVE A TO B.
EXEC SQL
    SELECT C INTO :E
    FROM T WHERE ID=1
END-EXEC
MOVE F TO G

The code between "EXEC SQL" and "END-EXEC" uses a (specially augmented) SQL syntax, which makes it a perfect example of an island grammar.

I know this can be implemented with lexer modes in ANTLR4. But I have another requirement: the SQL grammar should be kept separate from the COBOL grammar, so that it can be reused when SQL is embedded in other languages such as PL/I, without copy-and-paste programming.

So what I did was use a simple lexer mode to capture everything between "EXEC SQL" and "END-EXEC", extract the SQL code as a String, and pass it to a separate SQL lexer (and parser).

This worked OK, with one drawback: the line numbers and char indexes of tokens recognized by the SQL parser are counted from the start of the extracted SQL string instead of from the start of the original COBOL program. When tracing back to the source code, e.g. when reporting parse errors, the line numbers turn out to be misleading.

So the question is: is there a simpler way in ANTLR 4 to implement island grammars separately (with both lexer and parser separated), while still preserving correct line numbers and char indexes in the tokens generated for the island part?

Update: I found that ANTLR 4 has a grammar import feature, and my colleague told me we had already tried it and failed. The problem is that lexer modes in imported grammars are not well supported and cause compile errors. This issue is being tracked here.

There is 1 answer

GRosenberg

To expand on Bill's comment: when instantiating your SQL lexer and parser, pass in the line offset of the beginning of the EXEC block. Implement a custom SQL token that reports its line number as that offset plus the line number relative to the SQL text. Have your SQL TokenFactory inject the offset as a constant into each token it generates.
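
The arithmetic behind that offset can be sketched in plain Java, independent of the ANTLR runtime. The class name IslandOffset and its methods are hypothetical, made up for this illustration, and the sketch assumes ANTLR's usual conventions (1-based lines, 0-based columns and char indexes). It also remaps the column and char index, which goes slightly beyond the line offset described above. In a real setup the same calculation would presumably live in the create method of a custom TokenFactory installed on the SQL lexer.

```java
// Hypothetical helper: maps positions reported relative to the extracted
// SQL text back to positions in the host (COBOL) source.
final class IslandOffset {
    final int startLine;   // 1-based host line where the island text begins
    final int startColumn; // 0-based host column where the island text begins
    final int startIndex;  // 0-based host char index where the island text begins

    IslandOffset(int startLine, int startColumn, int startIndex) {
        this.startLine = startLine;
        this.startColumn = startColumn;
        this.startIndex = startIndex;
    }

    // Derive line and column of the island start by scanning the host source.
    static IslandOffset locate(String hostSource, int islandStartIndex) {
        int line = 1;
        int column = 0;
        for (int i = 0; i < islandStartIndex; i++) {
            if (hostSource.charAt(i) == '\n') {
                line++;
                column = 0;
            } else {
                column++;
            }
        }
        return new IslandOffset(line, column, islandStartIndex);
    }

    // Island lines are 1-based, so island line 1 is the host start line.
    int hostLine(int islandLine) {
        return startLine + islandLine - 1;
    }

    // Only the first island line is shifted by the start column.
    int hostColumn(int islandLine, int islandColumn) {
        return islandLine == 1 ? startColumn + islandColumn : islandColumn;
    }

    int hostIndex(int islandIndex) {
        return startIndex + islandIndex;
    }

    public static void main(String[] args) {
        String host = "MOVE A TO B.\n"
                + "EXEC SQL\n"
                + "    SELECT C FROM T WHERE ID=1\n"
                + "    INTO :E\n"
                + "END-EXEC\n"
                + "MOVE F TO G\n";
        // The island text starts right after the "EXEC SQL" keyword.
        int islandStart = host.indexOf("EXEC SQL") + "EXEC SQL".length();
        IslandOffset off = locate(host, islandStart);
        // Inside the island text, SELECT sits on island line 2, column 4, index 5.
        System.out.println("line " + off.hostLine(2)
                + ", column " + off.hostColumn(2, 4)
                + ", index " + off.hostIndex(5));
        // prints: line 3, column 4, index 26
    }
}
```

The column only needs adjusting on the first island line, because every later line of the extracted text starts at column 0 in the host source as well.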

Update

Using modes to implement an idiomatic island grammar, with or without using includes (which work quite well for me at least), is the most natural approach.

Barring that, you can launch a separate SQL block parser from an action in the lexer or parser, from an override of the lexer's emit() method (or related methods), or from a visitor walking the base grammar's parse tree.

Only you can balance which is acceptable, desirable or necessary in any given circumstance.

For example, if the parse tree evaluation provides a value used in the dynamic execution of an SQL exec block or, conversely, depends on the values returned by such an execution, you are essentially forced to use a symbol table and defer starting the SQL executions until a walker runs. Of course, you can then cache each of the distinct generated SQL parse trees and reinitialize their symbol tables with instance-specific data, so they can be reused without reparsing the SQL blocks.
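
As a rough sketch of that caching idea, assuming the SQL text itself is a good enough cache key: the class IslandParseCache and its methods are hypothetical names, and the parser is passed in as a plain function so the example does not depend on any generated ANTLR classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: parses each distinct SQL block text once and
// hands back the same result for every later occurrence.
final class IslandParseCache<T> {
    private final Map<String, T> trees = new HashMap<>();
    private final Function<String, T> parser; // stands in for the real SQL parser
    private int parseCount = 0;

    IslandParseCache(Function<String, T> parser) {
        this.parser = parser;
    }

    // Returns the cached result for this SQL text, parsing it on first use.
    T treeFor(String sqlText) {
        T tree = trees.get(sqlText);
        if (tree == null) {
            tree = parser.apply(sqlText);
            parseCount++;
            trees.put(sqlText, tree);
        }
        return tree;
    }

    // Number of times the underlying parser was actually invoked.
    int parseCount() {
        return parseCount;
    }
}
```

Note that positions stored in a cached tree are relative to the SQL text, so any host line or index offset has to be applied per occurrence rather than baked into the cached tree.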

It just depends on what your real requirements are.