I am running the gensim word2vec code on a corpus of resumes (stopwords removed) to identify words with similar context in the corpus, starting from a list of pre-defined keywords.
Despite several iterations with the input parameters, stopword removal, etc., the similar-context words do not make sense at all (in terms of distance or context). For example, correlation and matrix occur in the same window several times, yet matrix does not fall in the most_similar results for correlation.
Following are the details of the system and code: gensim 2.3.0, running on Python 2.7 Anaconda. Training resumes: 55,418 sentences. Average words per sentence: 3-4 words (post stopword removal). Code:
import multiprocessing
import gensim

# Training hyperparameters
size = 50        # dimensionality of the word vectors
window = 10      # maximum distance between current and context word
min_count = 5    # ignore words with total frequency lower than this
iter = 50        # number of training epochs
sample = 0.001   # threshold for downsampling high-frequency words
workers = multiprocessing.cpu_count()
sg = 1           # use skip-gram rather than CBOW

# Promote frequent bigrams, then trigrams, to single tokens
bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)

model = gensim.models.Word2Vec(sentences=trigram[sentences], size=size, alpha=0.005, window=window, min_count=min_count, max_vocab_size=None, sample=sample, seed=1, workers=workers, min_alpha=0.0001, sg=sg, hs=1, negative=0, cbow_mean=1, iter=iter)
model.wv.most_similar('correlation')
Out[20]:
[(u'rankings', 0.5009744167327881),
(u'salesmen', 0.4948525130748749),
(u'hackathon', 0.47931140661239624),
(u'sachin', 0.46358123421669006),
(u'surveys', 0.4472047984600067),
(u'anova', 0.44710394740104675),
(u'bass', 0.4449636936187744),
(u'goethe', 0.4413239061832428),
(u'sold', 0.43735259771347046),
(u'exceptional', 0.4313117265701294)]
I am lost as to why the results are so random. Is there any way to check the accuracy of word2vec?
Also, is there an alternative to word2vec for the most_similar() function? I read about GloVe but was not able to install the package.
Any information in this regard would be helpful.
Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you should see incremental progress, taking noticeable time, over the expected number of texts and the expected number of iterations.)
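For example, a minimal logging setup (the standard recipe using Python's logging module), placed before the training code:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

With that in place, gensim reports vocabulary-building and training progress at INFO level; if the training progress lines are absent or finish almost instantly, no real training happened.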
You may be hitting this open bug issue in Phrases, where requesting the phrase-promotion (as with trigram[sentences]) only offers a single iteration, rather than the multiply-iterable collection object that Word2Vec needs. Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support single iteration, only the vocabulary will be discovered – training will end instantly, and the model will appear untrained.
As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that – which will also save some redundant processing.
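A minimal sketch of that workaround, assuming sentences and trigram are as defined in the question (the filename phrased_corpus.txt is just an illustrative choice):

import io
import gensim

# Option 1: materialize the phrased corpus as an in-memory list,
# which can be iterated over as many times as Word2Vec needs
phrased_corpus = list(trigram[sentences])

# Option 2: write the phrased corpus to disk once, then stream it back
# with LineSentence, which re-opens the file on every iteration
with io.open('phrased_corpus.txt', 'w', encoding='utf8') as fout:
    for tokens in trigram[sentences]:
        fout.write(u' '.join(tokens) + u'\n')
phrased_corpus = gensim.models.word2vec.LineSentence('phrased_corpus.txt')

# Either way, pass the restartable object to Word2Vec instead of
# trigram[sentences]
model = gensim.models.Word2Vec(sentences=phrased_corpus, size=size, window=window, min_count=min_count, sample=sample, workers=workers, sg=sg, iter=iter)

This also avoids re-running the Phrases transformation on every training pass.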