I am trying to train a word2vec model using gensim. This is the line I am using:
model = Word2Vec(training_texts, size=50, window=5, min_count=1, workers=4, max_vocab_size=20000)
where training_texts is a list of lists of strings representing words. The corpus I am using has 8,924,372 sentences with 141,985,244 words and 1,531,477 unique words. After training, only 15,642 words are present in the model:
len(list(model.wv.vocab))
# returns 15642
Shouldn't the model have 20,000 words, as specified by max_vocab_size? Why is it missing most of the training words?
Thanks!!
You can look at the unique words it discovered via model.wv.vocab.keys() or model.wv.index2entity. Are they the words you expected? Can you list a word that you are sure you provided in training_texts that isn't there?
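For example, a minimal inspection sketch (assuming gensim 3.x, where model.wv.vocab is a dict and index2entity lists words from most to least frequent; 'expected_word' is a placeholder for any token you know appears in your data):

print(len(model.wv.vocab))                 # size of the learned vocabulary
print(model.wv.index2entity[:25])          # the 25 most frequent surviving words
print('expected_word' in model.wv.vocab)   # check one specific token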
Note that training_texts should be a sequence of lists of string tokens. If you are only providing a sequence of plain strings, the model will see each character of each string as a word, and will only model those single-character "words". (With texts in Latin-alphabet languages, that usually means just a few dozen "words", but if your texts include other languages' characters I suppose you could wind up with 15,642 unique single-character words.)
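A quick way to see the difference (a minimal sketch with made-up toy sentences, again assuming the gensim 3.x API with the size parameter):

from gensim.models import Word2Vec

# Correct: a sequence of lists of string tokens
good_texts = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# Wrong: a sequence of plain strings; iterating a string yields its
# characters, so each character is treated as a "word"
bad_texts = ["the cat sat", "the dog ran"]

model = Word2Vec(bad_texts, size=50, min_count=1)
print(sorted(model.wv.vocab))   # only single characters: [' ', 'a', 'c', 'd', ...]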