from torchtext.datasets import WikiText2, IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tkzer = get_tokenizer('basic_english')

# WikiText2: this works
tr_iter = WikiText2(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter), specials=['<unk>'])

# IMDB: this raises the error shown below
tr_iter_imdb = IMDB(split='train')
vocabulary = build_vocab_from_iterator(map(tkzer, tr_iter_imdb), specials=['<unk>'])
The WikiText2 code runs fine, but with IMDB I get the following error when running build_vocab_from_iterator:
'tuple' object has no attribute 'lower'
Can someone please help me understand why this happens? I assume the IMDB data is structured differently from WikiText2. If so, how can I build a vocab for the IMDB dataset?
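IMDB() yields tuples containing an int (the label) and a str (the text), so map(tkzer, tr_iter_imdb) passes the whole tuple to the tokenizer, which fails when it calls .lower() on it. I suggest that you check that the text in the tuple is what you want, and then update your map function to something like: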
map(lambda x: tkzer(x[1]), tr_iter_imdb)
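
If the lambda feels hard to read, the same idea can be written as a small generator. This is only a sketch: yield_tokens, samples and simple_tokenizer are made-up names for illustration, the two sample reviews are stand-in data rather than real IMDB entries, and it assumes each item is a (label, text) pair in that order, which is worth checking against your torchtext version.

```python
def yield_tokens(data_iter, tokenizer):
    """Yield one token list per sample, tokenizing only the text field."""
    # Assumption: every item is a (label, text) pair, as described above.
    for _label, text in data_iter:
        yield tokenizer(text)


# Stand-in data shaped like the IMDB items (hypothetical, not real reviews).
samples = [(1, "A dull film"), (2, "Great acting and great pacing")]

# Stand-in tokenizer: like 'basic_english', it calls .lower() on its input,
# so it would fail in the same way if handed a whole tuple.
simple_tokenizer = lambda s: s.lower().split()

token_lists = list(yield_tokens(samples, simple_tokenizer))
print(token_lists)
# → [['a', 'dull', 'film'], ['great', 'acting', 'and', 'great', 'pacing']]
```

With the names from the question, the call would then presumably be build_vocab_from_iterator(yield_tokens(tr_iter_imdb, tkzer), specials=['<unk>']).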