How to join subwords produced by the named entity recognition task in Hugging Face Transformers?

Question

Asked by Ivan Lee on 29 September 2020 at 06:08 · 886 views

from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")
tokenizer = AutoTokenizer.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")

ner_model = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)

sequence = "In this issue of Eurosurveillance, we are publishing two articles on different aspects of the newly emerged 2019-nCoV. One is a research article by Corman et al. on the development of a diagnostic methodology based on RT-PCR of the E and RdRp genes, without the need for virus material; the assays were validated in five international laboratories。"

ner_model(sequence)

[{'entity_group': 'B', 'score': 0.9881901144981384, 'word': 'E'},
 {'entity_group': 'B', 'score': 0.9853595495223999, 'word': 'Rd'},
 {'entity_group': 'I', 'score': 0.9730346202850342, 'word': '##Rp genes'}]

In the output, the subword continuation is marked with "##". Please show me how to remove the "##" and join 'Rd' and 'Rp genes' into a single entity.

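items = ner_model(sequence)
entities = []
for item in items:
    word = item['word']
    if word.startswith('##') and entities:
        # Continuation piece: drop the '##' marker and glue it onto the previous entry.
        entities[-1] += word[2:]
    else:
        # First piece of a new entry (or nothing to attach to yet).
        entities.append(word)
print(entities)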

There is 1 answer

**Rub** · Answer 1 · 2021-08-15T17:46:17+00:00

The Hugging Face course has an example that does exactly what you want.

I modified it slightly because of the warnings it raised. Python code:

from transformers import pipeline

# ner_pipe = pipeline("ner", grouped_entities=True, use_fast=True)  # 1.33Gb

# UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, 
# defaulted to `aggregation_strategy="AggregationStrategy.SIMPLE"` instead.

ner_pipe = pipeline("ner", aggregation_strategy="simple", use_fast=True)  # 1.33Gb


sequence = """Where do you want to meet?
In Paris.
And where will we have lunch?
At Cafe Central which is near the Elysium Park"""

ner_pipe(sequence)

Output

[{'end': 35,
  'entity_group': 'LOC',
  'score': 0.9997283,
  'start': 30,
  'word': 'Paris'},
 {'end': 82,
  'entity_group': 'LOC',
  'score': 0.8921441,
  'start': 70,
  'word': 'Cafe Central'},
 {'end': 113,
  'entity_group': 'LOC',
  'score': 0.9675388,
  'start': 101,
  'word': 'Elysium Park'}]
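
Each aggregated entity in the output above also carries `start` and `end` character offsets, so one way to sidestep "##" markers altogether is to slice the original string instead of reading `word`. The following is a minimal sketch, not part of the course example: it hard-codes the output shown above so that it runs without downloading a model, and `surface_text` is a hypothetical helper name. It assumes your pipeline returns offsets as it does here.

```python
# The input text and the aggregated output, copied from the example above.
sequence = """Where do you want to meet?
In Paris.
And where will we have lunch?
At Cafe Central which is near the Elysium Park"""

entities = [
    {"end": 35, "entity_group": "LOC", "score": 0.9997283, "start": 30, "word": "Paris"},
    {"end": 82, "entity_group": "LOC", "score": 0.8921441, "start": 70, "word": "Cafe Central"},
    {"end": 113, "entity_group": "LOC", "score": 0.9675388, "start": 101, "word": "Elysium Park"},
]


def surface_text(text, entity):
    """Return the exact characters of `text` that the entity covers (hypothetical helper)."""
    return text[entity["start"]:entity["end"]]


# Slicing the source string never yields '##', whatever the tokenizer did internally.
spans = [surface_text(sequence, entity) for entity in entities]
print(spans)
```

With a real pipeline you would replace the hard-coded `entities` list with the result of `ner_pipe(sequence)`.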

More information on aggregation_strategy: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TokenClassificationPipeline
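
The model in the question appears to emit bare 'B' and 'I' labels rather than labels such as 'B-LOC', so the pipeline's own grouping may still leave 'Rd' and '##Rp genes' as separate items. In that case a small post-processing step can join them. The sketch below is an assumption about what the asker wants, not a Transformers API: `merge_subword_groups` is a hypothetical function that attaches any group starting with "##", or labelled 'I', to the group before it. It runs on the output copied from the question, so no model is needed.

```python
def merge_subword_groups(groups):
    """Join '##' continuation pieces and 'I' groups onto the preceding group.

    Hypothetical helper; assumes each group has 'entity_group', 'score' and
    'word' keys and that the labels are the bare strings 'B' and 'I'.
    """
    merged = []
    for group in groups:
        word = group["word"]
        continues_word = word.startswith("##")
        continues_entity = group["entity_group"] == "I"
        if merged and (continues_word or continues_entity):
            previous = merged[-1]
            if continues_word:
                # Same word: strip the marker and append without a space.
                previous["word"] += word[2:]
            else:
                # New word inside the same entity: separate with a space.
                previous["word"] += " " + word
            # Keep the lowest score as a conservative confidence for the merged entity.
            previous["score"] = min(previous["score"], group["score"])
        else:
            # Start a new entity; copy the dict so the input list is left untouched.
            merged.append({
                "entity_group": group["entity_group"],
                "score": group["score"],
                "word": word[2:] if continues_word else word,
            })
    return merged


# Output copied from the question.
items = [
    {"entity_group": "B", "score": 0.9881901144981384, "word": "E"},
    {"entity_group": "B", "score": 0.9853595495223999, "word": "Rd"},
    {"entity_group": "I", "score": 0.9730346202850342, "word": "##Rp genes"},
]

merged = merge_subword_groups(items)
print([group["word"] for group in merged])
```

Taking the minimum score is only one possible choice; averaging the scores would be equally reasonable, depending on how you plan to use them.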
