We're using the ObjectiveC preprocessor parser and lexer grammars to parse preprocessor directives in C code, such as #define, #include, and #ifndef. The relevant portions of the grammar are below (shortened for brevity).
file: text* EOF ;
text
    : code
    | '#' directive (NEW_LINE | EOF)
    ;
code: CODE+ ;
directive
    : 'include'
    | 'define' SYMBOL
    | 'ifndef'
    ;
SHARP: '#' ;
CODE: ~[#'"/]+ ;
The code we're parsing is Pro*C, Oracle's dialect of C; it's essentially C99 with embedded SQL. We parse and evaluate the directives without any problem; anything that isn't a directive (C + SQL) is swallowed up by CODE as arbitrary text for later processing.
#define DEBUG /* directive */
/* below is all <CODE> */
int main(){
    int x;
    EXEC SQL Select * from PRODUCTS;
}
Problem
The parser fails on a handful of source files whose SQL statements contain temp table references of the form TABLE#temp_table. That is the main issue, because # is also the prefix for every directive (#define, #include, and so on).
#define DEBUG
int main(){
    int x;
    EXEC SQL Select * from PRODUCTS#stage;
}                                  ^------------- OH NO!
[@0,0:0='#',<1>,1:0]
[@1,1:7='define ',<7>,1:1]
[@2,8:12='DEBUG',<32>,1:8]
[@3,13:13='\n',<35>,1:13]
[@4,14:67='int x;\nint main(){\n    EXEC SQL Select * from PRODUCTS',<2>,2:0] <CODE>
[@5,68:68='#',<1>,4:35] <SHARP> , should be <CODE>
[@6,69:73='stage',<32>,4:36] <SYMBOL>, should be <CODE>
[@7,75:75='\n',<35>,4:42]
[@8,76:77='}\n',<2>,5:0]
[@9,78:77='<EOF>',<-1>,6:0]
In the above example, the lexer reads everything from int x to PRODUCTS as CODE and then switches to lexing a directive, because the CODE rule excludes #. We don't want that: the entire main() function, from int main() to the closing }, should be lexed as CODE.
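To see why the split happens, the CODE character class can be tried in isolation. The sketch below is plain Java using java.util.regex, not ANTLR; the class name CodeRuleDemo and the method codeRuns are made up for illustration, and the regex [^#'"/]+ is assumed to be equivalent to the ~[#'"/]+ set in the CODE rule. It only shows that no CODE match can span a #; it does not reproduce the SHARP or SYMBOL tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeRuleDemo {
    // Assumed equivalent of the CODE lexer rule: ~[#'"/]+
    private static final Pattern CODE = Pattern.compile("[^#'\"/]+");

    /** Returns the successive runs of text that the CODE character class can match. */
    static List<String> codeRuns(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = CODE.matcher(input);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String src = "int x;\nEXEC SQL Select * from PRODUCTS#stage;\n";
        for (String run : codeRuns(src)) {
            // Escape line breaks so each run prints on one line.
            System.out.println("[" + run.replace("\n", "\\n") + "]");
        }
    }
}
```

For this input the first run ends at PRODUCTS and the second begins at stage, which mirrors where the CODE token stops in the dump above.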
Attempts
I tried using a parser predicate to look ahead and check whether the token after # starts with a directive name (define, if, etc.) and, if so, match that rule. I don't have much experience with predicates, but it wasn't long before I realized this was a dead end, since the real issue is in the lexer, not the parser.
text
: code
| {_input.LT(2).getText().startsWith("define")}? '#' directive (NEW_LINE | EOF)
;
I tried doing the same in the lexer, but I couldn't find a practical way to look ahead on the CharStream without consuming input.
Modes?
This problem feels similar to how the XML lexer grammar uses modes to handle the text between < > tags, so perhaps modes, not predicates, are the right solution here?
UPDATE 1
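Thanks to @kaby76, I adapted a predicate from the PHP lexer grammar that checks the character immediately preceding #. If there is none (start of input) or it is a line break (\r or \n), the # begins a directive and the lexer switches to DIRECTIVE_MODE; otherwise the # is part of CODE, so the action calls more() to keep reading.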
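@lexer::members {
    // Called right after '#' has been matched, so LA(-1) is the '#' itself
    // and LA(-2) is the character just before it.
    private void directiveORcode(int pos) {
        if (this._input.LA(pos) <= 0 ||      // no previous character: start of input
            this._input.LA(pos) == '\r' ||
            this._input.LA(pos) == '\n')     // '#' is the first character on its line
            this.mode(DIRECTIVE_MODE);       // lex what follows as a directive
        else
            this.more();                     // keep '#' and continue building the token
    }
}
SHARP: '#' {this.directiveORcode(-2);} ;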
This produces the following tokens; @4 and @5 now show the correct token type CODE, very nice.
[@0,0:0='#',<1>,1:0]
[@1,1:7='define ',<7>,1:1]
[@2,8:12='DEBUG',<32>,1:8]
[@3,13:13='\n',<35>,1:13]
[@4,14:51='int x;\nEXEC SQL SELECT * from PRODUCTS',<2>,2:0]
[@5,52:59='#stage;\n',<2>,3:31]
[@6,60:59='<EOF>',<-1>,4:0]
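The line-start rule can also be exercised outside ANTLR. The following is a minimal sketch in plain Java, assuming the check is exactly "start of input, or preceded by \r or \n"; the class SharpClassifier and its methods startsDirective and directiveOffsets are hypothetical names for illustration and are not part of the grammar or the ANTLR runtime.

```java
import java.util.ArrayList;
import java.util.List;

class SharpClassifier {
    /**
     * Mirrors the check described above: a '#' at index i starts a directive
     * only if it is the first character of the input or directly follows a
     * line break.
     */
    static boolean startsDirective(String input, int i) {
        if (input.charAt(i) != '#') {
            throw new IllegalArgumentException("no '#' at index " + i);
        }
        if (i == 0) {
            return true; // nothing before '#': start of input
        }
        char prev = input.charAt(i - 1);
        return prev == '\r' || prev == '\n';
    }

    /** Returns the index of every '#' that would be treated as a directive. */
    static List<Integer> directiveOffsets(String input) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '#' && startsDirective(input, i)) {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        String src = "#define DEBUG\nint x;\nEXEC SQL SELECT * from PRODUCTS#stage;\n";
        System.out.println(directiveOffsets(src));
    }
}
```

With this input only the # at offset 0 is reported as a directive; the # at offset 52 (the one in PRODUCTS#stage) is left as code, matching token @5 in the dump. One caveat of the rule as written: an indented directive such as "  #define" would also be classified as code, because the character before # is a space rather than a line break. If the sources contain indented directives, the check would need to skip back over spaces and tabs first.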