Loading saved NER back into HuggingFace pipeline?

I am researching Hugging Face's transfer-learning functionality (specifically for named entity recognition). To preface, I am fairly new to transformer architectures. I briefly walked through the example on their website:

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

What I would like to do is save and run this locally without having to download the "ner" model every time (which is over 1 GB in size). In their documentation, I see that you can save the pipeline using the "pipeline.save_pretrained()" function to a local folder. The results of this are various files which I am storing into a specific folder.

My question is: after saving, how can I load this model back into a script and continue classifying as in the example above? The output of "pipeline.save_pretrained()" is multiple files.

Here is what I have tried so far:

1: Following the documentation about pipeline

pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')

The error I got was: 'str' object has no attribute "config"

2: Following HuggingFace example on ner:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

This yields an error: list index out of range

I also tried printing just predictions, but that gives only the predicted class indices, not the tokens in text form paired with their entities.

Any help would be much appreciated!

Answer by ClaudiaR (best answer)

Loading a model like this has always worked for me:

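from transformers import pipeline

# Folder that pipeline.save_pretrained() wrote its files to
model_folder = "path to folder following .save_pretrained()"
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)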
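
The call above can be wrapped into a small round-trip sketch: save once, then reload from disk in later runs. The helper names check_saved_folder, load_ner_pipeline and load_ner_pipeline_from_objects are hypothetical and not part of transformers. The class-based variant shows what the first attempt in the question was probably missing: TokenClassificationPipeline expects loaded model and tokenizer objects, not file names such as "pytorch_model.bin".

```python
from pathlib import Path


def check_saved_folder(model_folder):
    """Return the folder as a Path if it contains a config.json file."""
    folder = Path(model_folder)
    # save_pretrained() writes config.json next to the weights and tokenizer files
    if not (folder / "config.json").is_file():
        raise FileNotFoundError(f"no config.json found in {folder}")
    return folder


def load_ner_pipeline(model_folder):
    """Rebuild the pipeline from the folder itself, not from individual files."""
    # Lazy import: the folder check above does not need transformers
    from transformers import pipeline

    folder = str(check_saved_folder(model_folder))
    return pipeline("token-classification", model=folder, tokenizer=folder)


def load_ner_pipeline_from_objects(model_folder):
    """Alternative via the pipeline class, which takes loaded objects."""
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        TokenClassificationPipeline,
    )

    folder = str(check_saved_folder(model_folder))
    model = AutoModelForTokenClassification.from_pretrained(folder)
    tokenizer = AutoTokenizer.from_pretrained(folder)
    return TokenClassificationPipeline(model=model, tokenizer=tokenizer)
```

Typical use would be to run pipeline("ner").save_pretrained("ner_model") once while online (the folder name "ner_model" is made up for illustration), then call load_ner_pipeline("ner_model") in later scripts and pass text to the returned object as in the question's example.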

See the Hugging Face pipelines documentation for further examples of how to use pipelines.
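
On the "list index out of range" error from the second attempt in the question: one possible cause, which this thread does not confirm, is that the saved model predicts class ids that the hard-coded label_list does not cover. A minimal sketch that avoids hard-coding the labels reads the id2label mapping that transformers stores in the saved config.json. The helper names load_id2label and label_predictions are hypothetical.

```python
import json
from pathlib import Path


def load_id2label(model_folder):
    """Read the id -> label mapping stored in the saved model's config.json."""
    config = json.loads((Path(model_folder) / "config.json").read_text())
    # JSON object keys are strings, so convert them back to integer class ids
    return {int(idx): label for idx, label in config["id2label"].items()}


def label_predictions(tokens, prediction_ids, id2label):
    """Pair each token with its label name, without assuming a fixed label count."""
    return [(token, id2label[pred]) for token, pred in zip(tokens, prediction_ids)]
```

With the question's variables this would be label_predictions(tokens, predictions[0].tolist(), load_id2label(folder)), where folder is the saved directory. The same mapping is also available as model.config.id2label on the loaded model.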