Using MBart50TokenizerFast tokenizer with multiple sentences


I am trying to use MBart50TokenizerFast with facebook/mbart-large-50-many-to-one-mmt on GPU, and to translate multiple sentences in one call (the sentences cannot be combined into one string). Here is my code (based on https://stackoverflow.com/a/62688252/194742):

tokenizer.src_lang = source_lang
inputs = tokenizer([title, ftext], return_tensors="pt").to(device)
outputs = model.generate(**inputs).to(device)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
translated_title = translations[0]
translated_ftext = translations[1]

This mostly follows the example given on the page, except that I am trying to include multiple sentences in one go. Here is the error message I get:

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
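The error occurs because the two sentences tokenize to different numbers of token ids, so the two rows cannot be stacked into one rectangular tensor. Padding (what `padding=True` does) right-pads every row to the length of the longest one. A minimal stdlib-only sketch of the idea, where the token ids are made up for illustration (the pad id 1 matches mBART's convention, but treat the specific ids as placeholders):

```python
# Two "sentences" tokenized to different lengths (ids are illustrative).
title_ids = [250004, 87, 2387, 2]        # 4 tokens
ftext_ids = [250004, 87, 70, 142, 9, 2]  # 6 tokens

# A ragged list like [title_ids, ftext_ids] cannot become one tensor.
# padding=True right-pads every row with the pad token id so the
# batch becomes rectangular.
PAD_ID = 1  # mBART's pad token id

rows = [title_ids, ftext_ids]
max_len = max(len(row) for row in rows)
batch = [row + [PAD_ID] * (max_len - len(row)) for row in rows]

# Every row now has the same length, so it can be turned into a tensor.
print(batch)
```

The tokenizer also produces a matching `attention_mask` so the model ignores the pad positions; that part is handled automatically when you pass `padding=True`.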

The code does work with this line:

inputs = tokenizer(title, return_tensors="pt").to(device)

What is the correct way to use multiple sentences? Thanks for any pointers.

1 Answer

Answer from Samik R:

It looks like I had to enable padding and truncation, as suggested in the error message. The final code that works:

tokenizer.src_lang = source_lang
inputs = tokenizer([title, ftext], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
outputs = model.generate(**inputs, max_length=512)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
translated_title = translations[0]
translated_ftext = translations[1]
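For completeness, the pieces above can be assembled into one self-contained script. This is a sketch, assuming the `transformers` and `torch` packages are installed; the source language code and the example sentences are placeholders, and the model weights are downloaded on first run:

```python
# Sketch of the full batched-translation pipeline (assumptions: transformers
# and torch installed; "hi_IN" and the sentences are example values).
import torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "facebook/mbart-large-50-many-to-one-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

tokenizer.src_lang = "hi_IN"  # example source language
sentences = ["संयुक्त राष्ट्र के प्रमुख का कहना है", "सीरिया में कोई सैन्य समाधान नहीं है"]

# padding=True and truncation=True make the batch rectangular, which is
# what the original error message was asking for.
inputs = tokenizer(sentences, return_tensors="pt",
                   padding=True, truncation=True, max_length=512).to(device)
outputs = model.generate(**inputs, max_length=512)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(translations)
```

Since this model translates many languages into English, no target-language forcing is needed at generation time; `translations` comes back in the same order as `sentences`.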