I am trying to train a word2vec model using gensim. This is the line I am using:
model = Word2Vec(training_texts, size=50, window=5, min_count=1, workers=4, max_vocab_size=20000)
where training_texts is a list of lists of strings representing words. The corpus I am using has 8,924,372 sentences, 141,985,244 total words, and 1,531,477 unique words. After training, only 15,642 words are present in the model:
len(list(model.wv.vocab))
# returns 15642
Shouldn't the model have 20,000 words, as specified by max_vocab_size? Why is it missing most of the training words?
Thanks!!
You can look at the unique words it discovered via model.wv.vocab.keys() or model.wv.index2entity. Are they the words you expected? Can you list a word that you are sure you provided in training_texts that isn't there?

Note that training_texts should be a sequence of lists of string tokens. If you only provide a sequence of plain strings, it will see each character of each string as a "word", and only model those single-character "words". (With texts in Latin-alphabet languages, this usually means just a few dozen "words", but if your texts include characters from other languages, I suppose you could wind up with 15,642 unique single-character words.)
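You can see this pitfall without gensim at all: iterating over a plain string yields individual characters, which is exactly what the vocabulary scan would see. A minimal sketch (the variable names and the scan_vocab helper are hypothetical, just mimicking how each element of a "sentence" is treated as one token):

```python
# Correct input shape: a sequence of lists of string tokens.
tokenized_texts = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Wrong input shape: a sequence of plain strings.
untokenized_texts = ["the cat sat", "the dog ran"]

def scan_vocab(sentences):
    """Mimic a vocabulary scan: collect every 'token' encountered."""
    vocab = set()
    for sentence in sentences:
        for token in sentence:  # iterating a str yields characters!
            vocab.add(token)
    return vocab

print(sorted(scan_vocab(tokenized_texts)))
# word-level vocabulary: ['cat', 'dog', 'ran', 'sat', 'the']

print(sorted(scan_vocab(untokenized_texts)))
# single-character "words", including the space character:
# [' ', 'a', 'c', 'd', 'e', 'g', 'h', 'n', 'o', 'r', 's', 't']
```

So if len(model.wv.vocab) is suspiciously small and the keys are single characters, the fix is to tokenize each text into a list of words before passing it to Word2Vec.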