I'm quite new to NLP. I'm trying to use a model trained with Gensim in DL4J. I'm saving the model with
w2v_model.wv.save_word2vec_format("path/to/w2v_model.bin", binary=True)
and afterwards I'm loading it with
Word2Vec w2vModel = WordVectorSerializer.readWord2VecModel("path/to/w2v_model.bin");
The model works well except for the handling of out-of-vocabulary (OOV) words. In Gensim, it seems to calculate vectors for OOV words based on the word's n-grams, but in DL4J, it provides an empty vector for them.
My questions are:
- Is there a way to export the n-gram weights along with the model from Gensim so that DL4J can use them?
- If exporting the n-gram weights is not possible, is there a method to reconstruct them on the DL4J side to achieve similar results for OOV words as in Gensim?
Any guidance or suggestions would be greatly appreciated.
The core original word2vec algorithm – and the `Word2Vec` model class in Gensim – has no ability to synthesize vectors for OOV words using character n-grams. That's only a feature of FastText models (and the `FastText` model class in Gensim) – so if you're seeing that working in Gensim, your `w2v_model` variable may actually hold a Gensim `FastText` object.

Further, the plain {word, vector}-per-line format saved by Gensim's `.save_word2vec_format()` (whether `binary=False` or `binary=True`) doesn't save any subword n-grams, even if used on a `FastText` object. (It just saves the full-word vectors for in-vocabulary words.)

Gensim's `FastText` can save models in the full raw model format also used by Facebook's original FastText implementation – see `FastText.save_facebook_model()`. But to bring that to a Java environment, you'd need to find a true FastText implementation that also reads that format. I don't see any evidence that the `Word2Vec` class in DL4J supports FastText features or loads FastText models.

There is an `org.deeplearning4j.models.fasttext.FastText` class – which seems to wrap the Facebook native C++ FastText implementation via another library, `com.github.jfasttext.JFastText`. That is, it's not a true Java implementation, but it makes the model accessible to Java code.

I have no idea of the completeness/reliability of this approach; it's a little fishy to me that a class (`JFastText`) not from a GitHub engineer is named via a `com.github` path, but presumably the `deeplearning4j` maintainers know what they're doing, and this may be your best option for loading a fully-capable (character-n-gram features) FastText model for use in DL4J.