I'm new to DL and NLP, and recently started using a pre-trained fastText embedding model (cc.en.300.bin) through gensim.
I would like to be able to calculate vectors for out-of-vocabulary words myself, by splitting the word into n-grams and looking up the vector for every n-gram.
I could not find a way to export the n-gram vectors that are part of the model. I realize they are hashed, but perhaps there's a way (not necessarily using gensim) to get them?
Any insight will be appreciated!
You can look at exactly how the `gensim` code creates FastText word-vectors for out-of-vocabulary words by examining the source code for the `word_vec()` method of its `FastTextKeyedVectors` class directly:

https://github.com/RaRe-Technologies/gensim/blob/3aeee4dc460be84ee4831bf55ca4320757c72e7b/gensim/models/keyedvectors.py#L2069
(Note that this source code in `gensim`'s `develop` branch may reflect recent FastText fixes that wouldn't match what your installed package, up through `gensim` version 3.7.1, does; you may want to consult your installed package's local source code, or wait for these fixes to appear in an official release.)

Because Python doesn't protect any part of the relevant objects from external access (with things like enforced 'private' designations), you can perform the exact same operations from outside the class.
Note particularly that, in the current code (which matches the behavior of Facebook's original implementation), n-gram vectors will be pulled from the buckets of the hashtable `ngram_weights` structure whether or not your current n-grams were truly present in the training data. In the cases where those n-grams were known and meaningful in the training data, that should help the OOV vector a bit. In the cases where it's getting an arbitrary other vector instead, such randomness shouldn't hurt much.
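To make that concrete, under the same assumed internals as the sketch above, even a junk n-gram that never appeared in training still hashes to a valid bucket and returns some deterministic vector:

```python
# A nonsense n-gram still maps to some bucket, so the lookup never fails;
# it just returns whatever (possibly arbitrary) vector lives in that row.
junk_row = _ft_hash('zqxjv') % kv.bucket
print(kv.vectors_ngrams[junk_row][:5])
```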