Iterating through Huggingface tokenizer with remainder

478 views Asked by Mittenchops At 05 October 2020 at 22:05

Transformer models have maximum token limits. If I want to split my text into substrings that each fit within that limit, what is the generally accepted way?

Because of how special characters are treated, the tokenizer does not map its tokens one-to-one onto whitespace-separated words, so its output is not something I can simply loop over by word. Naively:

subst = " ".join(mytext.split(" ")[0:MAX_LEN])

would let me loop through chunks with something like:

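words = mytext.split(" ")
START = 0
substr = []
tokens = []
# loop while START < len(words) so the final partial chunk (the remainder) is kept
while START < len(words):
  # append: assigning to substr[i] on an empty list raises IndexError
  substr.append(" ".join(words[START:START+MAX_LEN]))
  START = START + MAX_LEN
  # tokenize the chunk just built (the original passed an undefined name, text)
  tokens.append(tokenizer(substr[-1]))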

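However, the number of words in " ".join(mytext.split(" ")[0:MAX_LEN]) is not equal to the number of tokens given by tokenizer(text), so a chunk of MAX_LEN words can still exceed the model's token limit.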

You can see the difference below:

>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> mytext = "This is a long sentence. " * 2000 # about 10k words, roughly 12k tokens

>>> len(mytext.split(" "))
10001

>>> encoded_input = tokenizer(mytext) 
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors

Is there a function argument to tokenizer that handles this or, if none is available, what is the generally accepted iteration procedure for longer documents?
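
For reference, one way to iterate with a remainder is to chunk at the token level rather than the word level. The following is only a minimal sketch under stated assumptions, not an established procedure: chunk_token_ids and its num_special parameter are hypothetical names, the ids list is a stand-in for real token ids, and the default of 2 reserved slots assumes the model wraps each chunk in one start and one end special token.

```python
from typing import Iterator, List, Sequence


def chunk_token_ids(
    ids: Sequence[int], max_len: int, num_special: int = 2
) -> Iterator[List[int]]:
    """Yield consecutive windows of token ids, including the final remainder.

    num_special is an assumed count of special tokens (e.g. <s> and </s>)
    that are added back to each chunk later, so each window holds at most
    max_len - num_special ids.
    """
    window = max_len - num_special
    if window <= 0:
        raise ValueError("max_len must exceed num_special")
    # range() stops at len(ids), so the last slice is the shorter remainder
    for start in range(0, len(ids), window):
        yield list(ids[start:start + window])


# Stand-in for real token ids. With a Hugging Face tokenizer these would
# typically come from tokenizer(mytext, add_special_tokens=False)["input_ids"].
ids = list(range(12001))
chunks = list(chunk_token_ids(ids, max_len=4096))
print([len(c) for c in chunks])
```

Each chunk then needs its special tokens added back before it goes to the model. The tokenizer call also accepts truncation=True with max_length, plus return_overflowing_tokens=True, but how the overflow is returned may differ between slow and fast tokenizers, so that route is not sketched here.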
