I'm new to DL and NLP, and recently started using a pre-trained fastText embedding model (cc.en.300.bin) through gensim.
I would like to be able to calculate vectors for out-of-vocabulary words myself, by splitting the word into n-grams and looking up the vector for every n-gram.
I could not find a way to export the n-gram vectors that are part of the model. I realize they are hashed, but perhaps there's a way (not necessarily using gensim) to get them?
Any insight will be appreciated!
You can see exactly how the `gensim` code creates FastText word-vectors for out-of-vocabulary words by examining the source code of its `FastTextKeyedVectors` class `word_vec()` method directly:

https://github.com/RaRe-Technologies/gensim/blob/3aeee4dc460be84ee4831bf55ca4320757c72e7b/gensim/models/keyedvectors.py#L2069
(Note that this source code in `gensim`'s `develop` branch may reflect recent FastText fixes that wouldn't match what your installed package, up through `gensim` version 3.7.1, does; you may want to consult your installed package's local source code, or wait for these fixes to be in an official release.)

Because Python doesn't protect any part of the relevant objects from external access (with things like enforced 'private' designations), you can perform the exact same operations from outside the class.
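For example, here's a minimal sketch of reading those internals from outside the class, assuming a `gensim` 3.x install where native Facebook `.bin` models are loaded with `FastText.load_fasttext_format()`; the attribute names `min_n`, `max_n`, `bucket`, and `vectors_ngrams` come from `FastTextKeyedVectors` and may shift between versions:

```python
from gensim.models import FastText

# gensim 3.x classmethod for loading native Facebook .bin models; newer
# releases document gensim.models.fasttext.load_facebook_model() instead.
model = FastText.load_fasttext_format('cc.en.300.bin')
wv = model.wv  # a FastTextKeyedVectors instance

# The same internals word_vec() reads are open to you:
print(wv.min_n, wv.max_n)        # character n-gram length range used in training
print(wv.bucket)                 # number of hash buckets for n-gram vectors
print(wv.vectors_ngrams.shape)   # the stored n-gram vectors themselves

# An out-of-vocabulary lookup runs through the n-gram path internally:
vec = wv['quasimodality']        # calls word_vec() under the hood
```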
Note particularly that, in the current code (which matches the behavior of Facebook's original implementation), n-gram vectors will be pulled from buckets of the hashed `ngram_weights` structure whether or not your n-grams were truly present in the training data. In cases where those n-grams were known and meaningful in the training data, that should help the OOV vector a bit. In cases where an n-gram instead retrieves an arbitrary other vector, such randomness shouldn't hurt much.
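If you want to do the arithmetic entirely by hand, here's a minimal sketch of that same lookup, assuming the fixed behavior described above (every n-gram hashed straight into the full bucket table, then averaged). The `ft_hash` and `compute_ngrams` helpers here are hand-rolled reconstructions of Facebook's FNV-1a hash and `<`/`>`-wrapped n-gram extraction, not gensim's own functions:

```python
import numpy as np

def ft_hash(ngram):
    """FNV-1a variant used by Facebook's fastText (each byte cast to int8)."""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        if byte > 127:
            byte -= 256  # mimic fastText's uint32_t(int8_t(c)) sign extension
        h = (h ^ (byte & 0xFFFFFFFF)) & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def compute_ngrams(word, min_n, max_n):
    """All character n-grams of '<word>', boundary markers included."""
    extended = '<' + word + '>'
    return [extended[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(extended) - n + 1)]

def oov_vector(wv, word):
    """Average the bucketed n-gram vectors, as the fixed word_vec() does."""
    ngrams = compute_ngrams(word, wv.min_n, wv.max_n)
    vec = np.zeros(wv.vector_size, dtype=np.float32)
    for ngram in ngrams:
        vec += wv.vectors_ngrams[ft_hash(ngram) % wv.bucket]
    return vec / max(1, len(ngrams))
```

With a fix-era `gensim`, `oov_vector(model.wv, word)` should closely match `model.wv[word]` for an OOV word; with 3.7.1 and earlier it generally won't, since those releases stored vectors only for n-gram buckets actually seen in training and computed the hash differently for some inputs.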