I have the following code:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
ner = pipeline('ner', model=model, tokenizer=tokenizer)  # separate name so the imported pipeline() is not shadowed
sentence = "some example sentence here"
results = ner(sentence)
This works fine, but instead of a str, I want to pass a list of tokens. How do I do that?
The reason is that my sentences are already tokenized, and a simple " ".join() does not reproduce the original sentence correctly. For example, isn't has been tokenized into is and n't, but a simple " ".join() will produce is n't.
I assume the original data was tokenized with NLTK, so try the NLTK detokenizer: