Extending Stanford NER with new terms


We need to add terms to the named entity extraction tables/model in Stanford NER and can't figure out how. Use case: we need to build up a set of IED terms over time and want the Stanford pipeline to extract those terms when they are found in text files.

Looking to see if this is something someone has done before.


There are 2 answers

Angel Chang

Please take a look at http://nlp.stanford.edu/software/regexner/ to see how to use it. It allows you to specify a file of mappings of phrases to entity types. When you want to update the mappings, you update the file and rerun the Stanford pipeline.
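
For a concrete sketch (the phrases and the IED_TERM label here are just made-up placeholders), the mapping file is plain text with one phrase and its entity type per line, separated by a tab:

VBIED\tIED_TERM
sticky bomb\tIED_TERM

You point the pipeline at that file with the regexner.mapping property; when the term list grows, you edit the file and rerun, with no model retraining.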

If you are interested in how to actually learn patterns for the terms over time, you can take a look at our pattern learning system: http://nlp.stanford.edu/software/patternslearning.shtml

StanfordNLPHelp

Could you specify the tags you want to apply?

To use RegexNER, all you have to do is build a file with one entry per line of the form:

TEXT_PATTERN\tTAG

You would put all of the things you want in your custom dictionary into a file, say custom_dictionary.txt.

I am assuming that by IED you mean https://en.wikipedia.org/wiki/Improvised_explosive_device?

So your file might look like:

VBIED\tIED_TERM
sticky bombs\tIED_TERM
RCIED\tIED_TERM
New Country\tLOCATION
New Person\tPERSON

(Note: that is one entry per line, with a real tab character, written here as \t, between the text pattern and the tag; there should not be blank lines between entries.)

If you then run this command:

java -mx1g -cp '*' edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators 'tokenize,ssplit,pos,lemma,regexner,ner' -file sample_input.txt -regexner.mapping custom_dictionary.txt

you will tag sample_input.txt.
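
If you would rather call the pipeline from Java instead of the command line, a minimal sketch along these lines should do the same thing (the class name and the sample sentence are just placeholders for illustration):

import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class CustomNerDemo {
  public static void main(String[] args) {
    // Same configuration as the command line above, set up programmatically.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,regexner,ner");
    props.setProperty("regexner.mapping", "custom_dictionary.txt");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Annotate a sample sentence instead of a file.
    Annotation doc = new Annotation("The patrol found a VBIED near the checkpoint.");
    pipeline.annotate(doc);

    // Print each token with the NER label the pipeline assigned to it.
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.println(token.word() + "\t"
            + token.get(CoreAnnotations.NamedEntityTagAnnotation.class));
      }
    }
  }
}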

Updating is then just a matter of editing custom_dictionary.txt.

One thing to look out for: it matters whether you put "ner" or "regexner" first in your list of annotators.

If your highest priority is tagging with your specialized terms (for instance IED_TERM), I would run regexner first in the pipeline, since there are some tricky issues with how the taggers overwrite each other.
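
Concretely, the difference is just the order in the -annotators list. The command above already uses the order I would recommend here:

-annotators 'tokenize,ssplit,pos,lemma,regexner,ner'

versus the other way around:

-annotators 'tokenize,ssplit,pos,lemma,ner,regexner'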