How to use a transformer in Hugging Face without tokenization?


I have the following code:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
model = AutoModelForTokenClassification.from_pretrained("sagorsarker/codeswitch-spaeng-lid-lince")
nlp = pipeline('ner', model=model, tokenizer=tokenizer)  # renamed so the pipeline function isn't shadowed
sentence = "some example sentence here"
results = nlp(sentence)

This works fine. But instead of a str, I want to pass a list of tokens. How do I do that?

The reason I want to do this is that my sentences are already tokenized, and a simple " ".join() does not reproduce the original sentence correctly. For example, isn't has been tokenized into is and n't, but a simple " ".join() produces is n't.
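
To illustrate the mismatch (a small made-up example, not from my actual data), joining pre-tokenized text with spaces breaks contractions apart:

toks = ["He", "is", "n't", "here", "."]
print(" ".join(toks))
# "He is n't here ."  -- not the original sentence "He isn't here."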


1 Answer

Answer by joe32140

I assume the original data was tokenized by NLTK, so try the NLTK detokenizer:

from nltk.tokenize.treebank import TreebankWordDetokenizer
toks = ['hello', ',', 'i', 'ca', "n't", 'feel', 'my', 'feet', '!', 'Help', '!', '!']
twd = TreebankWordDetokenizer()
twd.detokenize(toks)
# "hello, i can't feel my feet! Help!!"