How to identify an n-gram before tokenization in Stanford CoreNLP?


I am trying to use the CoreNLP annotation pipeline with default settings, all the way from tokenization through NER tagging. I observed that the tokenizer splits, say, "Vice President" into two individual tokens {Vice, President}, so the NER tags come out as {O, TITLE} instead of a single token {Vice President} tagged {TITLE}. How can I get the tokenizer to treat "Vice President" as one token, so that the NER tagger identifies titles appropriately?


1 Answer

Gabor Angeli

What properties are you using to get TITLE as an NER tag? TITLE is not one of the standard tags, and if you're using the TokensRegexNER annotator (e.g., via the kbp annotator), multi-word titles like 'vice president' should be picked up. It works on corenlp.run, at least.
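For reference, a minimal properties configuration for such a pipeline might look like the following. This is a sketch assuming a recent CoreNLP release, where the ner annotator applies the fine-grained TokensRegexNER rules (which produce tags like TITLE) by default:

```properties
# Default pipeline through NER. The fine-grained rules that emit
# tags such as TITLE are applied by the ner annotator as long as
# ner.applyFineGrained is left at its default of true.
annotators = tokenize,ssplit,pos,lemma,ner
ner.applyFineGrained = true
```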

In general, it's not the tokenizer's job to collapse NER spans into a single mention. The tokenizer should separate 'vice' and 'president' into different tokens, both of which should be marked TITLE by an appropriate NER annotator.

You may be interested in the entitymentions annotator, which groups contiguous NER tags into NER mentions -- this would give you 'vice president' as a single mention, rather than two tokens both marked as TITLE. These mentions can be retrieved using the mentions annotation on a sentence CoreMap, or using the List<String> mention(String nerTag) or List<String> mentions() functions in the simple API.
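To make the grouping step concrete, here is a self-contained sketch of the logic the entitymentions annotator performs: runs of contiguous tokens sharing the same non-"O" NER tag are collapsed into single mention strings. The class and method names are illustrative only, not part of the CoreNLP API:

```java
import java.util.ArrayList;
import java.util.List;

public class MentionGrouper {

  // Collapse runs of contiguous tokens that share a non-"O" NER tag
  // into single mention strings (a simplified model of what the
  // entitymentions annotator does).
  public static List<String> groupMentions(List<String> tokens, List<String> tags) {
    List<String> mentions = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    String prevTag = "O";
    for (int i = 0; i < tokens.size(); i++) {
      String tag = tags.get(i);
      if (!tag.equals("O") && tag.equals(prevTag)) {
        // Same tag as the previous token: extend the current mention.
        current.append(' ').append(tokens.get(i));
      } else {
        // Tag changed: flush any open mention, then maybe start a new one.
        if (current.length() > 0) mentions.add(current.toString());
        current.setLength(0);
        if (!tag.equals("O")) current.append(tokens.get(i));
      }
      prevTag = tag;
    }
    if (current.length() > 0) mentions.add(current.toString());
    return mentions;
  }

  public static void main(String[] args) {
    List<String> tokens = List.of("Joe", "Biden", "was", "Vice", "President", ".");
    List<String> tags   = List.of("PERSON", "PERSON", "O", "TITLE", "TITLE", "O");
    System.out.println(groupMentions(tokens, tags)); // [Joe Biden, Vice President]
  }
}
```

So with both 'Vice' and 'President' correctly tagged TITLE, this grouping step is what yields 'Vice President' as one mention, without any change to the tokenizer.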