KeyError: word not in vocabulary" in word2vec

923 views Asked by At

I wanted to convert some Japanese word to vector so that I can train the model for prediction. For that I downloaded pretrained models from Here.

import gensim
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim import models
from janome.tokenizer import Tokenizer

w2v_model = models.KeyedVectors.load_word2vec_format(w2v_models_path)
t = Tokenizer()
# I am testing for some random string
sentence = "社名公開求人住宅手当・家賃補助制度がある企業在宅勤務・リモートワーク可能な求人テレワークコロナに負けるな!積極採用中の企業特集リモートワーク可能なWebデザイナー求人"

tokens = [x.surface for x in t.tokenize(sentence)]
vectors = [w2v_model[v] for v in tokens]

In the last line, I am getting KeyError: "word 'テレワークコロナ' not in vocabulary" Is there anything wrong here?

1

There are 1 answers

0
gojomo On

If you get a "not in vocabulary" error, you can trust that the token (word/key) that you've requested isn't in that KeyedVectors model.

You can see the full list of words known to your model (in the order they are stored) with w2v_model.key_to_index. (Or, just quick peek at some range of 20 in the middle as sanity-check with Python ranged-access like w2v_model.key_to_index[500:520].)

Are you sure 'テレワークコロナ' (and any other string giving the same error) is a legitimate, common Japanese word? Might the tokenizer be failing in some way? Are most of the words the tokenizer returns in the model?

It looks like the site you've linked has just copied those sets-of-word-vectors from Facebook's FastText work (https://fasttext.cc/docs/en/crawl-vectors.html). And, you're just using the plain text "word2vec_format" lists of vectors, so you only have the exact words in that file, and not the full FastText model - which also models word-fragments, and can thus 'guess' vectors for unknown words. (These guesses aren't very good – like working out a word's possible meaning from word-roots – but are usually better than nothing.)

I don't know if that approach works well for Japanese, but you could try it. If you instead grab the .bin (rather than text) file, and load it using Gensim's FastText support – specifically the load_facebook_vectors() method. You'll then get a special kind of KeyedVectors (FastTextKeyedVectors) that will give you such guesses for unknown words, which might help for your purposes (or not).