Why is Word2Vec's most_similar() function giving senseless results after training?


I am running gensim's word2vec code on a corpus of resumes (stopwords removed) to identify, for a list of pre-defined keywords, the words that appear in similar contexts in the corpus.

Despite several iterations over the input parameters, stopword removal, etc., the most-similar words make no sense at all (in terms of distance or context). For example, correlation and matrix occur in the same window several times, yet matrix does not appear in the most_similar() results for correlation.

Following are the details of the system and code:

    gensim 2.3.0, running on Python 2.7 (Anaconda)
    Training data: 55,418 resume sentences
    Average words per sentence: 3-4 (after stopword removal)

Code:

    import multiprocessing
    import gensim

    size = 50
    window = 10
    min_count = 5
    iter = 50
    sample = 0.001
    workers = multiprocessing.cpu_count()
    sg = 1

    bigram = gensim.models.Phrases(sentences, min_count=10, threshold=5.0)
    trigram = gensim.models.Phrases(bigram[sentences], min_count=10, threshold=5.0)

    model = gensim.models.Word2Vec(sentences=trigram[sentences], size=size, alpha=0.005,
                                   window=window, min_count=min_count, max_vocab_size=None,
                                   sample=sample, seed=1, workers=workers, min_alpha=0.0001,
                                   sg=sg, hs=1, negative=0, cbow_mean=1, iter=iter)

model.wv.most_similar('correlation')
Out[20]: 
[(u'rankings', 0.5009744167327881),
 (u'salesmen', 0.4948525130748749),
 (u'hackathon', 0.47931140661239624),
 (u'sachin', 0.46358123421669006),
 (u'surveys', 0.4472047984600067),
 (u'anova', 0.44710394740104675),
 (u'bass', 0.4449636936187744),
 (u'goethe', 0.4413239061832428),
 (u'sold', 0.43735259771347046),
 (u'exceptional', 0.4313117265701294)]

I am lost as to why the results are so random. Is there any way to check the accuracy of a word2vec model?

Also, is there an alternative to word2vec's most_similar() function? I read about GloVe but was not able to install the package.

Any information in this regard would be helpful.


1 Answer

Answered by gojomo (accepted):

Enable INFO-level logging and make sure that it indicates real training is happening. (That is, you see incremental progress taking time over the expected number of texts, over the expected number of iterations.)

You may be hitting this open bug in Phrases, where requesting the phrase-promotion (as with trigram[sentences]) only offers a single iteration, rather than the multiply-iterable collection object that Word2Vec needs.

Word2Vec needs to pass over the corpus once for vocabulary-discovery, then iter times again for training. If sentences or the phrasing-wrappers only support single-iteration, only the vocabulary will be discovered – training will end instantly, and the model will appear untrained.
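The failure mode can be seen with any one-shot generator (a plain-Python sketch, not gensim's actual Phrases class): the first pass, standing in for vocabulary discovery, consumes the stream and leaves nothing for the training passes:

```python
# A one-shot generator, standing in for a single-iteration phrase stream.
stream = (sentence for sentence in [["correlation", "matrix"], ["anova"]])

vocab_pass = list(stream)     # first pass sees all sentences
training_pass = list(stream)  # generator is already exhausted

print(vocab_pass)     # [['correlation', 'matrix'], ['anova']]
print(training_pass)  # []
```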

As you'll see in that issue, a workaround is to perform the Phrases-transformation and save the results into an in-memory list (if it fits) or to a separate text corpus on disk (that's already been phrase-combined). Then, use a truly restartable iterable on that – which will also save some redundant processing.
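A minimal sketch of the in-memory variant of that workaround, where `phrase_stream` is a hypothetical stand-in for the trigram[sentences] wrapper: materialize the transformed corpus into a list once, then hand the list to Word2Vec:

```python
def phrase_stream(sentences):
    """Hypothetical stand-in for trigram[sentences]: a one-shot transform."""
    for sentence in sentences:
        yield sentence  # real code would merge detected bigrams/trigrams here

raw = [["correlation", "matrix"], ["anova", "regression"]]

# Materialize once: the resulting list is freely re-iterable, so the
# vocabulary pass and every training epoch all see the full corpus --
# and the phrase transformation runs only once instead of per pass.
phrased_corpus = list(phrase_stream(raw))

# model = gensim.models.Word2Vec(sentences=phrased_corpus, ...)
```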