Find most similar words for OOV word


I am looking for the most similar words for out-of-vocabulary (OOV) words using gensim. Something like this:

    def get_word_vec(self, model, word):
        try:
            if word not in model.wv.vocab:
                mostSimWord = model.wv.similar_by_word(word)
                print(mostSimWord)
            else:
                print(word)
        except Exception as ex:
            print(ex)

Is there a way to achieve this? Options other than gensim are also welcome.

There is 1 answer

gojomo On

If you train a FastText model instead of a Word2Vec model, it inherently learns vectors for word-fragments (of configurable size ranges) in addition to full words.
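To make "word-fragments" concrete, here is a simplified sketch of how a word can be split into character n-grams in the FastText style. The helper name `char_ngrams` is hypothetical, and the real implementation additionally hashes the fragments into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style (simplified).

    The word is wrapped in '<' and '>' so that fragments at the start or
    end of the word are distinct from fragments in the middle.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word such as a typo still shares many of these fragments with known words, which is what lets a subword model synthesize a vector for it.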

In languages like English & many others (but not all), unknown words are often typos, alternate forms, or related in terms of roots and suffixes to known words. Thus, having vectors for subwords, then using those to tally up a good guess vector for an unknown word, can work well enough to be worth trying. It is usually better than ignoring such words, or using a totally random or origin-point vector.

There's no built-in way to try to extract such relationships from an existing set of word-vectors that isn't FastText/subword-based – but it'd be theoretically possible. You could compute edit distances to, or counts-of-shared-subwords with, all known words, & create a guess-vector by weighted combination of the N-closest words. (This might work really well with typos & rarer alternate spellings, but not as much for truly-absent novel words.)
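The weighted-combination idea could be sketched roughly as follows. This is not an existing gensim feature: `guess_vector` is a hypothetical helper, the vectors are made-up toy values, and `difflib.SequenceMatcher.ratio()` is swapped in as a convenient stand-in for a true edit distance.

```python
from difflib import SequenceMatcher


def guess_vector(word, known_vectors, n_closest=3):
    """Guess a vector for `word` from the most similarly spelled known words.

    known_vectors maps each known word to a list of floats (same length).
    Returns (guess, neighbours): the similarity-weighted average vector and
    the known words it was built from.
    """
    # Score every known word by spelling similarity (0.0 to 1.0).
    scored = sorted(
        ((SequenceMatcher(None, word, known).ratio(), known)
         for known in known_vectors),
        reverse=True,
    )[:n_closest]

    total = sum(score for score, _ in scored)
    if total == 0:
        raise ValueError(f"no known word resembles {word!r}")

    dim = len(next(iter(known_vectors.values())))
    guess = [0.0] * dim
    for score, known in scored:
        weight = score / total  # weights sum to 1
        for i, value in enumerate(known_vectors[known]):
            guess[i] += weight * value
    return guess, [known for _, known in scored]


# Made-up 2-d vectors, for illustration only.
known = {
    "banana": [1.0, 0.0],
    "bandana": [0.8, 0.2],
    "keyboard": [0.0, 1.0],
}
guess, neighbours = guess_vector("bananna", known, n_closest=2)
print(neighbours)
print(guess)
```

With a trained model, `known_vectors` could be built from the existing word-vectors, though scoring against every known word is a linear scan of the vocabulary and may be slow for large models.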