ANTLR on a noisy data stream

392 views Asked by At

I'm very new in the ANTLR world and I'm trying to figure out how can I use this parsing tool to interpret a set of "noisy" string. What I would like to achieve is the following.

let's take for example this phrase : It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV

What I would like to extract is CAT, SLEEPING and SOFA and have a grammar that match easily the following pattern : SUBJECT - VERB - INDIRECT OBJECT... where I could define

VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';

etc.. I don't want to ends up with a permanent "NoViableException" as I can't describe all the possibilities around the language structure. I just want to tear apart useless words and just keep the one that are interesting.

It's more like if I had a tokeniser and asked the parser "Ok, read the stream until you find a SUBJECT, then ignore the rest until you find a VERB, etc.."

I need to extract an organized structure in an un-organized set... For example, I would like to be able to interpret (I'm not judging the pertinence of this utterly basic and incorrect view of 'english grammar')
SUBJECT - VERB - INDIRECT OBJECT
INDIRECT OBJECT - SUBJECT - VERB

so I will parse sentences like

It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV

or

It's 10PM and, on the SOFA in front of the TV, the Lazy CAT is currently SLEEPING heavily

1

There are 1 answers

1
Bart Kiers On BEST ANSWER

You could create only a couple of lexer rules (the ones you posted, for example), and as a last lexer rule, you could match any character and skip() it:

VERB            : 'SLEEPING' | 'WALKING';
SUBJECT         : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY             : . {skip();};

The order is important here: the lexer tries to match tokens from top to bottom, so if it can't match any of the tokens VERB, SUBJECT or INDIRECT_OBJECT, it "falls through" to the ANY rule and skips this token. You can then use these parser rules to filter your input stream:

parse
  :  sentenceParts+ EOF
  ;

sentenceParts
  :  SUBJECT VERB INDIRECT_OBJECT
  ;  

which will parse the input text:

It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. The DOG is WALKING on the SOFA.

as follows:

alt text