I want to adapt existing word vectors (e.g. from spaCy) and retrain them on a rather limited set of domain-specific data. The problem is that I can't find a way to take the already trained vectors and adapt them to my new data. I have used gensim so far, but it doesn't seem to work as I expected.
Below is the code I used with gensim, but I would also be grateful for any hints involving something other than gensim.
# illustrative example, I am using data from a textbook for the real application
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    # print the training loss after each epoch
    def on_epoch_end(self, model):
        print(model.get_latest_training_loss())

# glove_vectors: pre-trained GloVe vectors, loaded beforehand as a KeyedVectors object
training_data = [['This', 'is', 'an', 'example'], ['for', 'new', 'training', 'data']]

# build a word2vec model on your dataset
base_model = Word2Vec(size=300, min_count=1)
base_model.build_vocab(training_data)
total_examples = base_model.corpus_count

# add GloVe's vocabulary & weights
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)
#base_model.build_vocab([list(glove_vectors.index_to_key)], update=True)  # gensim 4.x equivalent

# already trained spacy vectors of dim=300
base_model.intersect_word2vec_format('spacy_vecs.txt', binary=False, lockf=1.0)

# train on your data
print("Running", base_model.epochs, "iterations")
base_model.train(training_data, total_examples=total_examples, epochs=100, compute_loss=True, callbacks=[callback()])

base_model_wv = base_model.wv
base_model.wv.save_word2vec_format('retrained_vectors.txt', binary=False)
Checking out the word vectors afterwards doesn't yield results that make sense, so something must be going wrong here.
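(By "checking out" I mean simple nearest-neighbour lookups on the saved vectors, roughly along these lines:)

from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format('retrained_vectors.txt', binary=False)
# the neighbours returned here look essentially random
print(vecs.most_similar('example', topn=5))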
This is done in Python 2.7, since the newer gensim versions don't seem to support this feature anymore.
On the one hand, you should be able to get the .intersect_word2vec_format() method working in the latest Gensim with the workaround mentioned in the open issue #3094 about its bug.

But on the other hand, that method is an experimental, advanced feature with no good guides to its use. And more generally, fine-tuning existing vectors using Gensim is not a well-supported operation: there's no standard approach or best practices. I've never seen a good write-up demonstrating a reliable way to do it. (I've seen a bunch of bad write-ups on toy data that ignore or hand-wave away potential problems, and rarely check deeply whether what they're trying is even helping.)
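If you do attempt it in Gensim 4.x anyway, the workaround is roughly of the following shape. This is only a sketch of my understanding of that issue (it assumes the failure is the full-size per-word lockf array that 4.x no longer pre-allocates), so check the issue #3094 thread for the current details before relying on it:

import numpy as np
from gensim.models import Word2Vec

model = Word2Vec(vector_size=300, min_count=1)   # gensim 4.x parameter names
model.build_vocab(training_data)                 # your own corpus, as in the question
# 4.x only allocates a minimal lockf array, which intersect_word2vec_format()
# can't index per word; re-create a full-size one (assumption based on issue #3094)
model.wv.vectors_lockf = np.ones(len(model.wv), dtype=np.float32)
model.wv.intersect_word2vec_format('spacy_vecs.txt', binary=False, lockf=1.0)

Even with that in place, whether the subsequent incremental training actually improves your vectors is a separate, harder question.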
It'd require significant fresh R&D, and possibly some additions & refactoring of the related Gensim classes, to make it a reliable process.
Thus I recommend against this approach unless it's absolutely required, or it's an area where you want to do some deep, original experimentation & development, and you can already research, or intuit, all the difficult tradeoffs involved. (That is: beyond what can be provided in an SO answer.)
I suspect for almost everyone who wants to do this, it'd be better to expand their training corpus instead, with other text content that uses similar word senses plus the extra words needed. For example, instead of trying to improvise a procedure for grafting your domain words into someone else's word-model trained on Wikipedia, mix your text with Wikipedia texts, & train a new model. (You could potentially overweight your limited texts by repeating them, randomly strewn through the corpus.)
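A minimal sketch of that corpus-mixing idea, using gensim 4.x parameter names; the loader functions and the repetition factor are just placeholders to swap for your own corpus handling:

import random
from gensim.models import Word2Vec

general_corpus = load_general_corpus()   # hypothetical loader: large compatible text (e.g. Wikipedia) as lists of tokens
domain_corpus = load_domain_corpus()     # hypothetical loader: your limited domain texts as lists of tokens

# overweight the small domain corpus by repeating it, then strew it randomly through the mix
mixed_corpus = general_corpus + domain_corpus * 20   # the repetition factor is a knob to experiment with
random.shuffle(mixed_corpus)

model = Word2Vec(sentences=mixed_corpus, vector_size=300, min_count=5, workers=4, epochs=5)
model.wv.save_word2vec_format('domain_vectors.txt', binary=False)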
This should lead to a more straightforward process, with less propensity for error or need for novel experiments, albeit with more training time.