According to the WMD paper, it is inspired by the word2vec model and uses the word2vec vector space for moving document 1 towards document 2 (in the context of the Earth Mover's Distance metric). From the paper:
Assume we are provided with a word2vec embedding matrix
X ∈ Rd×n for a finite size vocabulary of n words. The
ith column, xi ∈ Rd, represents the embedding of the ith
word in d-dimensional space. We assume text documents
are represented as normalized bag-of-words (nBOW) vectors,
d ∈ Rn. To be precise, if word i appears ci times in
the document, we denote di = ci / Σj=1..n cj. An nBOW vector
d is naturally very sparse as most words will not appear in
any given document. (We remove stop words, which are
generally category independent.)
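If I read that formula correctly, each entry is the word's count divided by the total count of all in-vocabulary words in the document. Here is a tiny sketch of what I think di looks like (the vocabulary and tokens are made up purely for illustration):

from collections import Counter

# Toy vocabulary of n = 5 words (the order defines the vector indices).
vocab = ['obama', 'president', 'speaks', 'media', 'illinois']

def nbow_vector(tokens):
    # Normalized bag-of-words: d_i = c_i / sum over j of c_j.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

# 'press' is not in the vocabulary, so it contributes nothing.
print(nbow_vector(['obama', 'speaks', 'speaks', 'press']))
# -> [0.333..., 0.0, 0.666..., 0.0, 0.0]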
I understand the concept from the paper; however, I couldn't work out from the Gensim code how WMD uses the word2vec embedding space.
Can someone explain it in a simple way? Does it calculate the word vectors in a different way? I couldn't see where the word2vec embedding matrix is used in this code:
# Imports used by this excerpt (taken from gensim's source):
from gensim.corpora import Dictionary
from numpy import zeros, double, sqrt, sum as np_sum
from pyemd import emd

def wmdistance(self, document1, document2):
    # Remove out-of-vocabulary words.
    len_pre_oov1 = len(document1)  # only used for logging in the full source
    len_pre_oov2 = len(document2)
    document1 = [token for token in document1 if token in self]
    document2 = [token for token in document2 if token in self]

    dictionary = Dictionary(documents=[document1, document2])
    vocab_len = len(dictionary)

    # Sets for faster look-up.
    docset1 = set(document1)
    docset2 = set(document2)

    # Compute distance matrix.
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

    def nbow(document):
        d = zeros(vocab_len, dtype=double)
        nbow = dictionary.doc2bow(document)  # Word frequencies.
        doc_len = len(document)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d

    # Compute nBOW representation of documents.
    d1 = nbow(document1)
    d2 = nbow(document2)

    # Compute WMD.
    return emd(d1, d2, distance_matrix)
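For reference, this is roughly how I'm calling it (the vectors file name and the sentences are just placeholders):

from gensim.models import KeyedVectors

# Any pre-trained vectors in word2vec format would do here.
model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

sentence1 = 'obama speaks to the media in illinois'.lower().split()
sentence2 = 'the president greets the press in chicago'.lower().split()

# Lower distance means the two texts are closer in word2vec space.
print(model.wmdistance(sentence1, sentence2))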
For the purposes of WMD, a text is considered a bunch of 'piles' of meaning. Those piles are placed at the coordinates of the text's words – and that's why the WMD calculation depends on a set of word-vectors from another source. Those vectors position the text's piles.
The WMD is then the minimal amount of work needed to shift one text's piles to match another text's piles. And the measure of the work needed to shift from one pile to another is the Euclidean distance between those piles' coordinates.
You could just try a naive shifting of the piles: shift the first word from text A to the first word from text B, and so forth. But that's unlikely to be the cheapest shifting – which would likely try to match nearer words, to send the 'meaning' along the shortest possible paths. So actually calculating the WMD is an iterative optimization problem – significantly more expensive than a simple Euclidean distance or cosine distance between two points.
That optimization is done inside the emd() call in the code you excerpt. But what that optimization requires is the pairwise distances between all words in text A and all words in text B – because those are all the candidate paths across which meaning-weight might be shifted. You can see those pairwise distances calculated in the code that populates the distance_matrix, using the word-vectors already loaded in the model and accessible via self[t1], self[t2], etc.
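To make that concrete, here is a small sketch with a hand-built distance matrix and two hand-built nBOW vectors – the numbers are invented purely for illustration. pyemd's emd() then finds the cheapest way to move the weight of one vector onto the other:

import numpy as np
from pyemd import emd

# Three-word shared vocabulary: ['obama', 'president', 'speaks'].
# Text A puts its weight on 'obama' and 'speaks';
# text B puts its weight on 'president' and 'speaks'.
d1 = np.array([0.5, 0.0, 0.5])
d2 = np.array([0.0, 0.5, 0.5])

# Invented pairwise Euclidean distances between the word vectors.
distance_matrix = np.array([
    [0.0, 0.4, 1.2],   # obama -> president is cheap, obama -> speaks is expensive
    [0.4, 0.0, 1.1],
    [1.2, 1.1, 0.0],
])

# The cheapest plan shifts the 0.5 weight on 'obama' to 'president'
# (cost 0.5 * 0.4 = 0.2) rather than to 'speaks'; 'speaks' needs no move.
print(emd(d1, d2, distance_matrix))  # -> 0.2

A naive first-word-to-first-word shifting could pick a much costlier path; the optimizer inside emd() is what guarantees the minimal total cost.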