How to use gensim's WmdSimilarity with word embeddings stored as a numpy.ndarray

Using a Word2vec (skip-gram) model in TensorFlow, I wrote code to obtain word embeddings from a document set. The final embeddings are in numpy.ndarray format.

Now, to find similar documents, I need to use the WMD (Word Mover's Distance) algorithm.

(I don't have much knowledge of gensim.) gensim.similarities.WmdSimilarity() seems to require the embeddings to be in the KeyedVectors data type. What can I do to implement WMD in my code? I have a tight deadline and can't spend much time writing WMD from scratch.

There is 1 answer

aneesh joshi

If you're looking for the words most similar to a given word, use

my_gensim_word2vec_model.wv.most_similar('king')

Here my_gensim_word2vec_model is a gensim Word2Vec model, not your own TensorFlow model.

If you want the words most similar to a group of words:

my_gensim_word2vec_model.wv.most_similar(positive=['king', 'queen', 'rabbit'])

Check the gensim docs for details.
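
The question's embeddings live in a numpy.ndarray rather than in a gensim model. One possible bridge, sketched here, is to write them out in the plain-text word2vec format, which gensim can read back as KeyedVectors. The names vocab, final_embeddings and save_word2vec_text are hypothetical stand-ins, and the sketch assumes that row i of the array is the vector for vocab[i] and that no token contains whitespace.

```python
import os
import tempfile

import numpy as np


def save_word2vec_text(path, words, embeddings):
    """Write embeddings in the plain-text word2vec format.

    First line: "<vocab_size> <vector_size>". Then one line per word:
    the token followed by its space-separated vector components.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 2 or len(words) != embeddings.shape[0]:
        raise ValueError("expected exactly one embedding row per word")
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{embeddings.shape[0]} {embeddings.shape[1]}\n")
        for word, row in zip(words, embeddings):
            values = " ".join(f"{float(x):.6f}" for x in row)
            f.write(f"{word} {values}\n")


# Hypothetical stand-ins for the TensorFlow vocabulary and embedding matrix.
vocab = ["king", "queen", "rabbit"]
final_embeddings = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

path = os.path.join(tempfile.gettempdir(), "embeddings.txt")
save_word2vec_text(path, vocab, final_embeddings)
```

A file like this can then be loaded with KeyedVectors.load_word2vec_format(path, binary=False), and the resulting object passed to WmdSimilarity together with a tokenized corpus. Depending on the gensim version, WMD may also need an extra package (pyemd in older releases, POT in newer ones).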

If you're looking for similarity between sentences or documents, you're better off using doc2vec, which gives a vector for every vocabulary word and for every document.

Or take the average of the word vectors of all words in the sentence/document to get a vector for that document. Then compute the cosine similarity between the averaged vectors of the two sentences to be compared.

For example:

"Hello World" -> (Vec("Hello") + Vec("World")) / 2 -> vec1
"Hi there" -> (Vec("Hi") + Vec("there")) / 2 -> vec2
Similarity("Hello World", "Hi there") = CosineSimilarity(vec1, vec2)
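
The averaging approach above can be sketched in a few lines of numpy. This is only an illustration: toy_embeddings holds made-up 3-dimensional vectors, and document_vector and cosine_similarity are hypothetical helper names, not gensim functions. In practice the lookup would use your own trained embeddings.

```python
import numpy as np


def document_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    return np.mean(vectors, axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "hello": np.array([1.0, 0.0, 0.0]),
    "world": np.array([0.0, 1.0, 0.0]),
    "hi": np.array([0.9, 0.1, 0.0]),
    "there": np.array([0.0, 0.8, 0.2]),
}

vec1 = document_vector("hello world".split(), toy_embeddings)
vec2 = document_vector("hi there".split(), toy_embeddings)
similarity = cosine_similarity(vec1, vec2)
print(similarity)
```

Averaging ignores word order and weights every word equally, so it is a quick baseline rather than a replacement for WMD or doc2vec.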

(Your question is unclear. What is the document set? What is your task?) Hope this helps.