JFlex: How can I let yytext continue during matching

I am trying to write a lexer for an IntelliJ language plugin. The JFlex manual has an example that lexes string literals, but it uses a StringBuffer, appending each lexed part to build up a single string. The problem I have with this method is that it creates a copy of the characters being read, and I don't know how to integrate that example with IntelliJ. In IntelliJ one always returns an IElementType, and the associated text is then taken from yytext() using the functions getTokenStart() and getTokenEnd(), so that the start and end of the whole token map directly onto the input string.

So I want to be able to return a token whose associated yytext() spans the whole text since the last time another token was returned. For example, in the string literal example, I would read \", which marks the start of the literal, and switch to state STRING; when I read \" again, I would switch back to another state and return the string literal token. At that point I want yytext() to contain the whole string literal.

Is this possible with JFlex? If not, what is the recommended way to pass the content of a StringBuffer to the IntelliJ API after matching a token that spans multiple actions?

Answer by lsf37:

You could write a regular expression that matches the entire String literal so that you get it in one yytext() call, but this match would contain escape sequences unprocessed.

From the JFlex Java example:

<STRING> {
  \"                             { yybegin(YYINITIAL); return symbol(STRING_LITERAL, string.toString()); }

  {StringCharacter}+             { string.append( yytext() ); }

  /* escape sequences */
  "\\b"                          { string.append( '\b' ); }
  "\\t"                          { string.append( '\t' ); }
  "\\n"                          { string.append( '\n' ); }
  "\\f"                          { string.append( '\f' ); }
  "\\r"                          { string.append( '\r' ); }
  "\\\""                         { string.append( '\"' ); }
  "\\'"                          { string.append( '\'' ); }
  "\\\\"                         { string.append( '\\' ); }
  \\[0-3]?{OctDigit}?{OctDigit}  { char val = (char) Integer.parseInt(yytext().substring(1),8);
                                           string.append( val ); }

  /* error cases */
  \\.                            { throw new RuntimeException("Illegal escape sequence \""+yytext()+"\""); }
  {LineTerminator}               { throw new RuntimeException("Unterminated string at end of line"); }
}

This code doesn't just match escape sequences like "\\t"; it also turns them into the single character '\t'. You could match the whole string with a single expression like this:

\" ({StringCharacter} | \\[0-3]?{OctDigit}?{OctDigit} | "\\b" | "\\t" | .. | "\\\\") * \"

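but yytext will then contain the unprocessed two-character sequence \t (a backslash followed by t, which is "\\t" when written as a Java string literal) instead of the single tab character '\t'.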

If that is acceptable, then that's the easy solution. If the token is supposed to be an actual substring of the input, then it sounds like this is what you want.
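If the raw text is what the lexer hands over, the processed value can still be recovered later from that raw text when it is needed. The following is a minimal sketch of such a decoder, mirroring the escape rules of the STRING state quoted above. The class name StringLiteralDecoder and the method decode are hypothetical and not part of JFlex or the IntelliJ API; the input is assumed to be the complete literal including both quote characters.

```java
/** Hypothetical helper: turns the raw text of a string literal into its value. */
final class StringLiteralDecoder {
    private StringLiteralDecoder() {}

    /** Decodes raw literal text, assumed to include the surrounding quotes. */
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int end = raw.length() - 1;   // index of the closing quote
        int i = 1;                    // skip the opening quote
        while (i < end) {
            char c = raw.charAt(i++);
            if (c != '\\') {          // ordinary character: copy as-is
                out.append(c);
                continue;
            }
            if (i >= end) {
                throw new IllegalArgumentException("Dangling backslash");
            }
            char e = raw.charAt(i++);
            switch (e) {
                case 'b': out.append('\b'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'f': out.append('\f'); break;
                case 'r': out.append('\r'); break;
                case '"': out.append('"'); break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default:
                    if (e < '0' || e > '7') {
                        throw new IllegalArgumentException("Illegal escape sequence \\" + e);
                    }
                    // Octal escape: three digits only if the first is 0-3, else at most two.
                    int value = e - '0';
                    int maxDigits = (e <= '3') ? 3 : 2;
                    int digits = 1;
                    while (digits < maxDigits && i < end
                            && raw.charAt(i) >= '0' && raw.charAt(i) <= '7') {
                        value = value * 8 + (raw.charAt(i++) - '0');
                        digits++;
                    }
                    out.append((char) value);
            }
        }
        return out.toString();
    }
}
```

With this approach the lexer never copies characters; decoding happens only in the places that actually need the value of the literal.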

If it's not, you'll need something more complicated, for instance an intermediate interface function that is not yytext(), but that returns the StringBuffer content when the last match was a string match (a flag you could set in the string action), and otherwise returns yytext().
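A rough sketch of such an intermediate function is shown here as a standalone class, so that it can be read and tried outside a lexer. All names (TokenTextTracker, beginString, endString, endOtherToken, tokenText) are hypothetical; in a real specification the fields and methods would more likely sit in the user code section (%{ ... %}) and call yytext() directly rather than take it as a parameter.

```java
/** Hypothetical helper that chooses between the string buffer and yytext. */
final class TokenTextTracker {
    private final StringBuilder string = new StringBuilder();
    private boolean lastMatchWasString = false;

    /** Call from the action that matches the opening quote. */
    void beginString() {
        string.setLength(0);
    }

    /** Call from the STRING-state actions, as the quoted example does with string.append(...). */
    void append(CharSequence part) {
        string.append(part);
    }

    void append(char c) {
        string.append(c);
    }

    /** Call from the closing-quote action, just before returning the token. */
    void endString() {
        lastMatchWasString = true;
    }

    /** Call from every other action that returns a token. */
    void endOtherToken() {
        lastMatchWasString = false;
    }

    /** Returns the buffer content after a string match, otherwise the given yytext. */
    CharSequence tokenText(CharSequence yytext) {
        return lastMatchWasString ? string.toString() : yytext;
    }
}
```

Callers would then ask tokenText(yytext()) instead of yytext(). Note that this only changes which text is reported; it does not change the start and end offsets of the last match.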