from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")
tokenizer = AutoTokenizer.from_pretrained("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs")
ner_model = pipeline('ner', model=model, tokenizer=tokenizer, grouped_entities=True)
sequence = "In this issue of Eurosurveillance, we are publishing two articles on different aspects of the newly emerged 2019-nCoV. One is a research article by Corman et al. on the development of a diagnostic methodology based on RT-PCR of the E and RdRp genes, without the need for virus material; the assays were validated in five international laboratories。"
ner_model(sequence)
[{'entity_group': 'B', 'score': 0.9881901144981384, 'word': 'E'},
{'entity_group': 'B', 'score': 0.9853595495223999, 'word': 'Rd'},
{'entity_group': 'I', 'score': 0.9730346202850342, 'word': '##Rp genes'}]
In the output, the subword is split off with "##". Please show me how to remove the "##" and join 'Rd' and '##Rp genes' into a single entity.
items = ner_model(sequence)
entities = []
for item in items:
    word = item['word']
    # A leading '##' marks a WordPiece continuation: glue it onto the previous entity.
    if word.startswith('##') and entities:
        word = entities.pop() + word[2:]
    entities.append(word)
print(entities)
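If you also want to keep the score and label instead of only the strings, the same idea can be sketched as below. merge_wordpieces is a hypothetical helper name, and taking the minimum score and the first piece's entity_group for the merged entity is my assumption, not something the pipeline does for you. The sample items are copied from the output in the question.

```python
def merge_wordpieces(items):
    """Merge results whose 'word' starts with '##' into the previous result."""
    merged = []
    for item in items:
        word = item["word"]
        if word.startswith("##") and merged:
            prev = merged[-1]
            prev["word"] += word[2:]
            # Assumption: keep the lowest score as a conservative confidence.
            prev["score"] = min(prev["score"], item["score"])
        else:
            # Copy so the caller's dicts are not mutated; drop a stray leading '##'.
            clean = word[2:] if word.startswith("##") else word
            merged.append({**item, "word": clean})
    return merged


items = [
    {"entity_group": "B", "score": 0.9881901144981384, "word": "E"},
    {"entity_group": "B", "score": 0.9853595495223999, "word": "Rd"},
    {"entity_group": "I", "score": 0.9730346202850342, "word": "##Rp genes"},
]
print([m["word"] for m in merge_wordpieces(items)])  # ['E', 'RdRp genes']
```

The merged 'RdRp genes' entry keeps entity_group 'B' from its first piece.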
The Hugging Face course has an example that does exactly what you want.
(I modified a few things because of warnings.) Python code:
Output
More info on aggregation_strategy: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TokenClassificationPipeline
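A minimal sketch of the aggregation_strategy approach, assuming the same checkpoint directory as in the question. build_ner_pipeline and VALID_STRATEGIES are hypothetical names for illustration; the strategy values are the ones listed in the TokenClassificationPipeline docs. With "first", "average" or "max" the pipeline aggregates at word level, so pieces like 'Rd' and '##Rp' should come back as one word. These strategies need a fast tokenizer, which AutoTokenizer returns by default when one is available. Whether 'genes' lands in the same group depends on the model's labels, so no expected output is shown.

```python
VALID_STRATEGIES = ("none", "simple", "first", "average", "max")


def build_ner_pipeline(model_dir, strategy="first"):
    """Build a token-classification pipeline that merges subword pieces."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown aggregation_strategy: {strategy!r}")
    # Imported lazily so the argument check above runs before anything heavy loads.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForTokenClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # aggregation_strategy replaces the older grouped_entities flag.
    return pipeline(
        "ner",
        model=model,
        tokenizer=tokenizer,
        aggregation_strategy=strategy,
    )
```

Usage would be ner_model = build_ner_pipeline("/nfs/storages/bio_corpus/ner/BC2GM/ner_outputs") followed by ner_model(sequence).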