Do I need to retrain BERT for NER to create new labels?

I am very new to natural language processing and was thinking about working on named entity recognition (NER). A friend of mine who works with NLP advised me to check out BERT, which I did. When reading the documentation and checking out the CoNLL-2003 data set, I noticed that the only labels are person, organization, location, miscellaneous and outside. What if, instead of outside, I want the model to recognize date, time, and other labels? I understand that I would need a dataset labelled as such. Assuming I have one, do I need to retrain BERT from scratch, or can I fine-tune the existing model without restarting the whole process?

Best answer, by Kyle F. Hartzenberg

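Yes, you need a model trained on the specific labels you require: a model fine-tuned on CoNLL-2003 can only predict the CoNLL-2003 labels. You do not have to pretrain BERT from scratch, though. Fine-tuning a pretrained model on a dataset annotated with your labels is enough. The OntoNotes dataset may be better suited to what you are trying to do, as it includes the 18 entity types listed below (see the OntoNotes 5.0 Release Notes for further info).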

The Flair models flair/ner-english-ontonotes-large and flair/ner-english-ontonotes-fast, both hosted on the Hugging Face Hub, are trained on this dataset and will likely produce results closer to what you want. As a demo (make sure to pip install flair first):

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")  # load tagger
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")  # example sentence
tagger.predict(sentence)  # predict NER tags

# Print sentence and NER spans
print(sentence)
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

# Output
# Span [2,3]: "September 1st"   [− Labels: DATE (1.0)]
# Span [4]: "George"   [− Labels: PERSON (1.0)]
# Span [6,7]: "1 dollar"   [− Labels: MONEY (1.0)]
# Span [10,11,12]: "Game of Thrones"   [− Labels: WORK_OF_ART (1.0)]
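
If you end up building your own training set, span annotations like the ones printed above are usually stored as one BIO tag per token (B- for the first token of an entity, I- for the rest, O for everything else). The following is only a sketch of that conversion: `spans_to_bio` is a hypothetical helper, not part of Flair, and it assumes the 1-based token positions shown in the output above.

```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[List[int], str]]) -> List[str]:
    """Convert (token positions, label) spans into one BIO tag per token.

    Positions are assumed to be 1-based, like the span indices in the demo output.
    """
    tags = ["O"] * num_tokens  # every token starts as "outside"
    for positions, label in spans:
        for i, pos in enumerate(sorted(positions)):
            prefix = "B" if i == 0 else "I"  # first token of a span gets B-, the rest I-
            tags[pos - 1] = f"{prefix}-{label}"
    return tags


# Tokens of the demo sentence (the final period is its own token)
tokens = ["On", "September", "1st", "George", "won", "1", "dollar",
          "while", "watching", "Game", "of", "Thrones", "."]

# Spans copied by hand from the demo output above
spans = [([2, 3], "DATE"), ([4], "PERSON"), ([6, 7], "MONEY"),
         ([10, 11, 12], "WORK_OF_ART")]

bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

A token/tag listing of this kind is roughly the shape that CoNLL-style NER training files take, so the same idea applies if you annotate your own date and time examples.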

OntoNotes 5.0 Named Entities

  1. PERSON (People, including fictional)
  2. NORP (Nationalities or religious or political groups)
  3. FACILITY (Buildings, airports, highways, bridges, etc.)
  4. ORGANIZATION (Companies, agencies, institutions, etc.)
  5. GPE (Countries, cities, states)
  6. LOCATION (Non-GPE locations, mountain ranges, bodies of water)
  7. PRODUCT (Vehicles, weapons, foods, etc. (Not services))
  8. EVENT (Named hurricanes, battles, wars, sports events, etc.)
  9. WORK OF ART (Titles of books, songs, etc.)
  10. LAW (Named documents made into laws)
  11. LANGUAGE (Any named language)
  12. DATE (Absolute or relative dates or periods)
  13. TIME (Times smaller than a day)
  14. PERCENT (Percentage (including “%”))
  15. MONEY (Monetary values, including unit)
  16. QUANTITY (Measurements, as of weight or distance)
  17. ORDINAL (“first”, “second”)
  18. CARDINAL (Numerals that do not fall under another type)
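
If you would rather fine-tune a pretrained model yourself than use a ready-made tagger, the label set is what determines the size of the classification layer. The sketch below builds a BIO label map for the 18 types above. It assumes the short tag strings used in the OntoNotes annotations (FAC, ORG, LOC and WORK_OF_ART in place of the longer names in the list), and the variable names are only illustrative.

```python
# Short tag strings for the 18 OntoNotes 5.0 entity types (assumed spelling:
# FAC, ORG, LOC and WORK_OF_ART abbreviate the longer names in the list above)
ENTITY_TYPES = [
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT", "MONEY",
    "QUANTITY", "ORDINAL", "CARDINAL",
]

# One "O" label plus a B- and an I- label for every entity type
labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

# Mappings between label strings and the integer ids a classifier predicts
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(len(labels))  # 37 = 1 + 2 * 18
```

With Hugging Face Transformers, for example, `len(labels)`, `id2label` and `label2id` could then be passed as `num_labels`, `id2label` and `label2id` when loading a pretrained checkpoint for token classification. Only the new classification layer starts from random weights, so the model is fine-tuned rather than trained from scratch.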