Iterating through a Hugging Face tokenizer with remainder


Transformer models have maximum token limits. If I want to split my text into substrings that fit within that limit, what is the generally accepted way?

Because of how the tokenizer treats special characters, its tokens do not map one-to-one onto whitespace-separated words, so looping over words does not work directly. Naively:

subst = " ".join(mytext.split(" ")[0:MAX_LEN])

would let me loop through chunks with something like:

words = mytext.split(" ")
substr = []
START = 0
while START < len(words):
  # take the next MAX_LEN words; the last chunk holds the remainder
  chunk = " ".join(words[START:START+MAX_LEN])
  substr.append(chunk)
  tokens = tokenizer(chunk)
  START = START + MAX_LEN

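However, the number of words in " ".join(mytext.split(" ")[0:MAX_LEN]) is not equal to the number of tokens that tokenizer(text) produces, so a chunk of MAX_LEN words can still exceed the model's limit.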

You can see the difference below:

>>> from transformers import LongformerTokenizer
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> mytext = "This is a long sentence. " * 2000 # about 10k whitespace-separated words

>>> len(mytext.split(" "))
10001

>>> encoded_input = tokenizer(mytext) 
Token indices sequence length is longer than the specified maximum sequence length for this model (12003 > 4096). Running this sequence through the model will result in indexing errors

Is there a tokenizer argument for this or, if none is available, what is the generally accepted iteration procedure for longer documents?
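To make the kind of iteration I have in mind concrete, here is a minimal sketch that chunks in token space rather than word space, including the final shorter remainder. The helper name chunk_token_ids and the n_special parameter are hypothetical, not part of transformers. The sketch assumes the ids come from something like tokenizer(mytext, add_special_tokens=False)["input_ids"], and that two positions per chunk are reserved for special tokens such as <s> and </s>. A plain list of integers stands in for real token ids so the snippet runs on its own.

```python
from typing import Iterator, List


def chunk_token_ids(ids: List[int], max_len: int, n_special: int = 2) -> Iterator[List[int]]:
    """Yield consecutive windows of token ids, including the final shorter remainder.

    Hypothetical helper: n_special positions are reserved in each window
    for special tokens that would be added back before running the model.
    """
    window = max_len - n_special
    if window <= 0:
        raise ValueError("max_len must be larger than n_special")
    for start in range(0, len(ids), window):
        yield ids[start:start + window]


# Stand-in for tokenizer(mytext, add_special_tokens=False)["input_ids"]
ids = list(range(10))
chunks = list(chunk_token_ids(ids, max_len=6))
print([len(c) for c in chunks])  # [4, 4, 2]
```

Every chunk then stays within max_len once the special tokens are added back, and no tokens are dropped at the end. I don't know whether this is the accepted approach or whether the tokenizer already offers an argument that does the same thing.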
