How to join subwords produced by the named entity recognition task in Hugging Face Transformers?

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")
tokenizer = AutoTokenizer.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")

ner_model = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)

sequence = "In this issue of Eurosurveillance, we are publishing two articles on different aspects of the newly emerged 2019-nCoV. One is a research article by Corman et al. on the development of a diagnostic methodology based on RT-PCR of the E and RdRp genes, without the need for virus material; the assays were validated in five international laboratories。"

ner_model(sequence)

[{'entity_group': 'B', 'score': 0.9881901144981384, 'word': 'E'},
 {'entity_group': 'B', 'score': 0.9853595495223999, 'word': 'Rd'},
 {'entity_group': 'I', 'score': 0.9730346202850342, 'word': '##Rp genes'}]

In the output, the subword is split off with "##". Please show me how to remove the "##" and join 'Rd' and 'Rp genes' into a single entity.


items = ner_model(sequence)
entities = []
for item in items:
    word = item['word']
    if word.startswith('##') and entities:
        # Continuation piece: drop the leading '##' and glue it onto the previous entity.
        entities[-1] += word[2:]
    else:
        entities.append(word)
print(entities)

There is 1 answer

Rub On

The Hugging Face course has an example that does exactly what you want.

(I modified a few things to get rid of warnings.) Python code:

from transformers import pipeline

# ner_pipe = pipeline("ner", grouped_entities=True, use_fast=True)  # 1.33Gb

# UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, 
# defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead.

ner_pipe = pipeline("ner", aggregation_strategy="simple", use_fast=True)  # 1.33Gb


sequence = """Where do you want to meet?
In Paris.
And where will we have lunch?
At Cafe Central which is near the Elysium Park"""

ner_pipe(sequence)

Output

[{'end': 35,
  'entity_group': 'LOC',
  'score': 0.9997283,
  'start': 30,
  'word': 'Paris'},
 {'end': 82,
  'entity_group': 'LOC',
  'score': 0.8921441,
  'start': 70,
  'word': 'Cafe Central'},
 {'end': 113,
  'entity_group': 'LOC',
  'score': 0.9675388,
  'start': 101,
  'word': 'Elysium Park'}]
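
Each group in this output carries 'start' and 'end' character offsets, so the entity text can also be sliced straight out of the original string instead of being rebuilt from the 'word' field. That sidesteps any "##" markers entirely. A minimal sketch, assuming the output has the shape shown above (the helper name `surface_text` is made up for illustration, and the scores are omitted for brevity):

```python
sequence = """Where do you want to meet?
In Paris.
And where will we have lunch?
At Cafe Central which is near the Elysium Park"""

# Groups copied from the output above, without the scores.
groups = [
    {"entity_group": "LOC", "start": 30, "end": 35, "word": "Paris"},
    {"entity_group": "LOC", "start": 70, "end": 82, "word": "Cafe Central"},
    {"entity_group": "LOC", "start": 101, "end": 113, "word": "Elysium Park"},
]


def surface_text(text, group):
    """Return the exact substring of `text` covered by one entity group."""
    return text[group["start"]:group["end"]]


spans = [surface_text(sequence, g) for g in groups]
print(spans)
```

This only works when the pipeline actually returns offsets; the output in the question has no 'start' or 'end' keys, so it would need a different approach.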

More info on aggregation_strategy: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TokenClassificationPipeline
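
If the aggregated output still contains pieces that start with "##" (as in the question, where 'Rd' and '##Rp genes' came back as separate groups), they can be merged in a post-processing step. The sketch below is only one way to do it and is not part of the transformers API: `merge_subword_groups` is a hypothetical helper, and keeping the first piece's label and the lowest score are arbitrary choices made for illustration.

```python
def merge_subword_groups(items, prefix="##"):
    """Merge groups whose 'word' starts with `prefix` into the preceding group."""
    merged = []
    for item in items:
        word = item["word"]
        if word.startswith(prefix) and merged:
            prev = merged[-1]
            # Glue the continuation onto the previous word without the prefix.
            prev["word"] += word[len(prefix):]
            # Assumption: report the most pessimistic score of the merged pieces.
            prev["score"] = min(prev["score"], item["score"])
            # Extend the character span when offsets are available.
            if "end" in item:
                prev["end"] = item["end"]
        else:
            # Copy so the caller's dictionaries are left untouched.
            merged.append(dict(item))
    return merged


# Output copied from the question.
groups = [
    {"entity_group": "B", "score": 0.9881901144981384, "word": "E"},
    {"entity_group": "B", "score": 0.9853595495223999, "word": "Rd"},
    {"entity_group": "I", "score": 0.9730346202850342, "word": "##Rp genes"},
]

merged = merge_subword_groups(groups)
print([g["word"] for g in merged])
```

A piece that starts with "##" but has no preceding group is kept unchanged, prefix included, since there is nothing to attach it to.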