Simple Antlr3 Token parsing


While I'm somewhat comforted by the number of questions regarding ANTLR grammars (it's not just me trying to shave this yak-shaped thing), I haven't found a question or answer that comes close to helping with my issue.

I'm using ANTLR 3.3 with a combined lexer/parser grammar.

I'm using gUnit to help prove the grammar, and some jUnit tests; this is where the fun begins.

I have a simple config file I want to parse:

identifier foobar {
port=8080
stub plusone.google.com {
        status-code = 206
        header = []
        body = []
  }
 }

I'm having trouble parsing the "identifier" (foobar in this example). Valid names I want to allow are:

foobar
foo-bar
foo_bar
foobar2
foo-bar2
foo_bar2
3foobar
_foo-bar3

and so on. A valid name can therefore use the characters 'a'..'z', 'A'..'Z', '0'..'9', '_' and '-'.

The grammar I've arrived at is this (note this isn't the full grammar, just the portion pertinent to this question):

fragment HYPHEN : '-' ;

fragment UNDERSCORE : '_' ;

fragment DIGIT  : '0'..'9' ;

fragment LETTER : 'a'..'z' |'A'..'Z' ;

fragment NUMBER : DIGIT+ ;

fragment WORD : LETTER+  ;

IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

and the corresponding gUnit test

IDENTIFIER:
"foobar" OK
"foo_bar" OK
"foo-bar" OK
"foobar1" OK
"foobar12" OK
"foo-bar2" OK
"foo_bar2" OK
"foo-bar-2" OK
"foo-bar_2" OK
"5foobar" OK
"f_2-a" OK
"aA0_" OK
// no "funny chars"
"foo@bar" FAIL
// not with whitespace
"foo bar" FAIL

Running the gUnit tests only fails for "5foobar". I've managed to parse the difficult stuff, and yet the seemingly simple task of parsing an identifier has beaten me.

Can anyone point me to where I'm going wrong? How can I match without being greedy?

Many thanks in advance.

-- UPDATE --

I changed the grammar as per Bart's answer, to this:

IDENTIFIER : ('0'..'9'| 'a'..'z'|'A'..'Z' | '_'|'-') ('_'|'-'|'a'..'z'|'A'..'Z'|'0'..'9')* ;

This fixed the failing gUnit tests, but broke an unrelated jUnit test that covers the "port" parameter. The following grammar deals with the "port=8080" element of the config snippet above:

configurationStatement[MiddlemanConfiguration config]
    :   PORT EQ port=NUMBER { config.setConfigurationPort(Integer.parseInt(port.getText())); }
    |   def=proxyDefinition { config.add(def); }
    ;

The message I get is:

mismatched input '8080' expecting NUMBER

where NUMBER is defined as NUMBER : ('0'..'9')+ ;

Moving the NUMBER rule above the IDENTIFIER rule fixed this issue: both rules can match 8080, and when that happens the lexer favours whichever rule is defined first.
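The ordering effect can be mimicked with a toy model in plain Java. This is only a sketch, assuming that when several rules match the same text the rule defined first wins. It is not ANTLR's actual lexer algorithm, and the class and method names (RuleOrderSketch, firstMatchingRule, rules) are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class RuleOrderSketch {

    // Returns the name of the first rule, in definition order, that matches
    // the whole text, or null if no rule matches. (Assumed tie-breaking model.)
    static String firstMatchingRule(Map<String, Pattern> rulesInOrder, String text) {
        for (Map.Entry<String, Pattern> rule : rulesInOrder.entrySet()) {
            if (rule.getValue().matcher(text).matches()) {
                return rule.getKey();
            }
        }
        return null;
    }

    // Builds the two rules as regular expressions, in the requested order.
    // LinkedHashMap keeps insertion order, standing in for grammar order.
    static Map<String, Pattern> rules(boolean numberFirst) {
        Pattern number = Pattern.compile("[0-9]+");
        Pattern identifier = Pattern.compile("[0-9a-zA-Z_-][0-9a-zA-Z_-]*");
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (numberFirst) {
            rules.put("NUMBER", number);
            rules.put("IDENTIFIER", identifier);
        } else {
            rules.put("IDENTIFIER", identifier);
            rules.put("NUMBER", number);
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println("IDENTIFIER first: " + firstMatchingRule(rules(false), "8080"));
        System.out.println("NUMBER first:     " + firstMatchingRule(rules(true), "8080"));
    }
}
```

With IDENTIFIER defined first, 8080 is classified as an IDENTIFIER, so a parser rule expecting NUMBER never sees one. With NUMBER first, pure digit runs become NUMBER tokens while names such as foobar still fall through to IDENTIFIER.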

Bart Kiers (best answer):
IDENTIFIER : DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*;

is equivalent to:

IDENTIFIER 
 : DIGIT 
 | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

So, an IDENTIFIER is either a single DIGIT, or starts with a LETTER followed by (LETTER | DIGIT | HYPHEN | UNDERSCORE)*.

You probably meant:

IDENTIFIER 
 : (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
 ;

However, that also allows for 3---3 as being a valid IDENTIFIER, is that correct?
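The precedence point can be checked outside ANTLR, because alternation binds loosest in Java regular expressions as well. The sketch below transcribes the two rules by hand into java.util.regex patterns. The class and method names are invented for illustration, and the regexes only approximate the lexer rules (they test a whole string rather than tokenizing a stream):

```java
import java.util.regex.Pattern;

final class IdentifierRules {

    // Hand transcription of: DIGIT | LETTER (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    // The '|' splits the whole expression, so the first branch is one digit only.
    static final Pattern ORIGINAL =
            Pattern.compile("[0-9]|[a-zA-Z][a-zA-Z0-9_-]*");

    // Hand transcription of:
    // (DIGIT | LETTER | UNDERSCORE) (LETTER | DIGIT | HYPHEN | UNDERSCORE)*
    static final Pattern CORRECTED =
            Pattern.compile("[0-9a-zA-Z_][a-zA-Z0-9_-]*");

    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    static boolean matchesCorrected(String s) {
        return CORRECTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"foobar", "5", "5foobar", "_foo-bar3", "3---3", "foo@bar"};
        for (String s : samples) {
            System.out.println(s
                    + " original=" + matchesOriginal(s)
                    + " corrected=" + matchesCorrected(s));
        }
    }
}
```

Under these assumptions the original pattern accepts 5 and foobar but rejects 5foobar, while the parenthesized version accepts 5foobar and _foo-bar3, and also 3---3.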