I have a Word2Vec model with a lot of word vectors. I can access a word vector like so.
word_vectors = gensim.models.Word2Vec.load(wordspace_path)
print(word_vectors['boy'])
Output
[ -5.48055351e-01 1.08748421e-01 -3.50534245e-02 -9.02988110e-03...]
Now I have a proper vector representation that I want to replace the word_vectors['boy'] with.
word_vectors['boy'] = [ -7.48055351e-01 3.08748421e-01 -2.50534245e-02 -10.02988110e-03...]
But the following error is thrown
TypeError: 'Word2Vec' object does not support item assignment
Is there any way or workaround to do this, that is, to manipulate word vectors manually once the model is trained? Is it possible on platforms other than Gensim?
Since word2vec vectors are typically only created by the iterative training process, then accessed, the gensim Word2Vec object does not support direct assignment of new values by its word indexes.

However, as it is in Python, all its internal structures are fully viewable/tamperable by you, and as it is open-source, you can view exactly how it does all of its existing functionality, and use that as a model for how to do new things.
Specifically, the raw word-vectors are (in recent versions of gensim) stored in a property of the Word2Vec object called wv, and this wv property is an instance of KeyedVectors. If you examine its source code, you can see that accesses of word-vectors by string key (e.g. 'boy'), including those by []-indexing implemented by the __getitem__() method, go through its method word_vec(). You can view the source of that method either in your local installation, or at GitHub: https://github.com/RaRe-Technologies/gensim/blob/c2201664d5ae03af8d90fb5ff514ffa48a6f305a/gensim/models/keyedvectors.py#L265
There you'll see the word is actually converted to an integer index (via self.vocab[word].index), then used to access an internal syn0 or syn0norm array (depending on whether the user is accessing the raw or unit-normalized vector). If you look elsewhere where these are set up, or simply examine them in your own console/code (as if by word_vectors.wv.syn0), you'll see these are numpy arrays, which do support direct assignment by index.

So, you can directly tamper with their values by integer index, as if by assigning your new vector to word_vectors.wv.syn0[word_vectors.wv.vocab['boy'].index].
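As a minimal sketch of that lookup-then-assign pattern, the snippet below uses a toy stand-in rather than a real gensim model: vocab is a plain dict mapping each word to an integer row, and syn0 is a small numpy array. The names mirror the attributes described above, but the helper, the values and the 4-dimensional size are made up for illustration.

```python
import numpy as np

# Toy stand-ins (assumptions, not gensim objects):
# a word -> row-index mapping and a matrix holding one vector per row.
vocab = {"boy": 0, "girl": 1}
syn0 = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8]], dtype=np.float32)


def word_vec(word):
    """Look a vector up the way the answer describes: word -> index -> row."""
    return syn0[vocab[word]]


# Replacement vector; it must have the same dimensionality as the existing rows.
new_vector = np.array([-0.7, 0.3, -0.02, -0.01], dtype=np.float32)

# numpy arrays support item assignment, so the row can be overwritten in place.
syn0[vocab["boy"]] = new_vector
```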
And then, future accesses of word_vectors.wv['boy'] will return your updated values.

Notes:
• If you want syn0norm to be updated, to have the proper unit-normed vectors (as are used in most_similar() and other operations), it'd likely be best to modify syn0 first, then discard and recalculate syn0norm, by setting word_vectors.wv.syn0norm to None and then calling word_vectors.wv.init_sims().
• Adding new words would require more involved object-tampering, because it will require growing the syn0 (replacing it with a larger array), and updating the vocab dict.
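
Both notes can be sketched on the same kind of toy stand-in (a plain vocab dict and a numpy syn0 array, not a real gensim model; every name and value here is an illustrative assumption). The first part rebuilds unit-length rows after an edit, which is roughly what recalculating syn0norm amounts to. The second part grows the array by one row and registers the new word. A real KeyedVectors object keeps additional bookkeeping, such as an index-to-word list and per-word counts, so treat this as an outline of the idea rather than a complete recipe.

```python
import numpy as np

# Toy stand-ins (assumptions, not gensim objects).
vocab = {"boy": 0, "girl": 1}
syn0 = np.array([[3.0, 4.0],
                 [1.0, 0.0]], dtype=np.float32)


def unit_normalize(matrix):
    """Return a copy of matrix with every row scaled to unit L2 length."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


# 1. Edit a raw vector first, then rebuild the normalized copy from scratch.
syn0[vocab["boy"]] = [0.0, 2.0]
syn0norm = unit_normalize(syn0)

# 2. Add a new word: record its row index, then replace syn0 with a larger
#    array that has the new vector appended as the last row.
new_word = "child"
new_vector = np.array([6.0, 8.0], dtype=np.float32)
vocab[new_word] = len(syn0)
syn0 = np.vstack([syn0, new_vector])

# The normalized copy is stale again after growing, so recompute it.
syn0norm = unit_normalize(syn0)
```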