Key 'xxx' is not present in Word2vec model


I ran into a problem when using the pretrained model w2v_512.model.

The error is "Key 'xxx' is not present".

I think this may be because the word 'xxx' cannot be converted to an embedding by w2v_512.model, since the model did not see this word during pre-training.

I want to know how to solve this. Would it help to use BERT embeddings instead? If so, how do I use BERT to get an embedding?

I would appreciate it if anybody could answer!


1 Answer

Answer from gojomo:

A set of word2vec vectors can only provide vectors for words that were included at training-time.
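For context, the "Key 'xxx' is not present" message matches the KeyError that gensim's KeyedVectors raise when you look up an out-of-vocabulary word. A minimal sketch, assuming w2v_512.model is a gensim Word2Vec model saved with .save() (if it's a bare KeyedVectors file, the loading call differs):

```python
from gensim.models import Word2Vec

# Assumption: w2v_512.model was saved via gensim's Word2Vec.save();
# if it is a saved KeyedVectors object, use KeyedVectors.load() instead.
model = Word2Vec.load("w2v_512.model")
kv = model.wv  # the KeyedVectors holding word -> vector lookups

word = "xxx"
if word in kv.key_to_index:   # membership check avoids the KeyError
    vector = kv[word]
else:
    print(f"'{word}' was not in the training vocabulary; no vector available")
```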

You could:

  • Ignore the absent words in your analysis. If your pretrained vectors have generally-good coverage of common words for your problem domain, the missing words are likely to be rare, and often ignoring them entirely doesn't hurt much. (The membership check in the sketch above is the usual way to skip them.)
  • Switch to a different set of pretrained word2vec vectors that includes the words you need, if you can find one.
  • Train your own word2vec model, on texts with enough varied in-context uses of all words you need, so that you control exactly what words have embeddings.
  • Use an alternate word2vec variant like FastText, which will synthesize guess-vectors for unknown words based on fragments of the word (see the sketch after this list). You might be able to find a suitable pretrained FastText model, or train one yourself. This tactic works best in languages where words of similar meanings share the same roots/prefixes/suffixes, and the synthesized vectors may not be great, but they're often better than nothing.
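
As a rough illustration of the FastText option, here is a sketch using gensim's FastText class. The tiny corpus and the parameters are placeholders just to keep it self-contained; a pretrained FastText model loaded via gensim behaves the same way for lookups:

```python
from gensim.models import FastText

# Placeholder corpus only, so the example runs on its own;
# in practice you'd train on a large corpus or load pretrained vectors.
sentences = [
    ["the", "quick", "brown", "fox", "jumps"],
    ["a", "slow", "brown", "dog", "sleeps"],
]

ft = FastText(sentences=sentences, vector_size=32, min_count=1, epochs=10)

# FastText builds vectors from character n-grams, so even a word that never
# appeared in training gets a synthesized vector instead of a KeyError.
print("brownish" in ft.wv.key_to_index)   # False: not in the vocabulary
vec = ft.wv["brownish"]                   # still returns a vector
print(vec.shape)                          # (32,)
```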

I believe that BERT models also understand words built up from subword tokens, a bit like FastText, so they could offer an embedding for arbitrary words. You could try that and see if it works for you. But the quality of any such embedding will still depend on how well the model was trained around that word and similar words. So you should always check how well the results are working for your goals: the mere fact that a model can return an embedding isn't enough to be sure that embedding is worth using.
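
To answer the "how to use BERT" part of the question: one common approach is the Hugging Face transformers library, where a word is split into subword tokens and their hidden states are pooled into one vector. A minimal sketch, assuming the bert-base-uncased checkpoint (any BERT model would work similarly); note that BERT embeddings are contextual, so embedding a word inside a real sentence usually gives better results than embedding it in isolation:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(word: str) -> torch.Tensor:
    # BERT tokenizes unknown words into subword pieces, so any string gets tokens.
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state: (1, seq_len, hidden_size); drop [CLS] and [SEP],
    # then average the remaining subword vectors into a single embedding.
    subword_states = outputs.last_hidden_state[0, 1:-1]
    return subword_states.mean(dim=0)

vec = word_embedding("xxx")
print(vec.shape)  # torch.Size([768]) for bert-base-uncased
```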