My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".
I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.
Here is a sample of my document data:
[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
['If',
'you',
'feel',
'a',
'presence',
'standing',
'over',
'you',
'while',
'you',
'sleep',
'do'],
['NOT', 'open', 'your', 'eyes'],
['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
['This',
'may',
'sound',
'a',
'bit',
'like',
'the',
'show',
'Bird',
'Box',
'from',
'Netflix']]
I simply train data like this:
from gensim.models.fasttext import FastText
model = FastText(sentences_cleaned)
Consequently, I want to find the similarity between say, "rule" and this document.
model.wv.most_similar("rule")
However, fastText gives me this:
[('the', 0.1334390938282013),
('they', 0.12790171802043915),
('in', 0.12731242179870605),
('not', 0.12656228244304657),
('and', 0.11071767657995224),
('of', 0.08563747256994247),
('I', 0.06609072536230087),
('that', 0.05195673555135727),
('The', 0.002402491867542267),
('my', -0.009009800851345062)]
Obviously, it must have "rule" as the top similarity since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but it doesn't work either.
Was my input format correct? I've seen lots of documents are using .cor or .bin format and I don't know what are those.
Thanks for any reply!
model.wv.most_similar('rule')
asks for that's model's set-of-word-vectors (.wv
) to return the words most-similar to'rule'
. That is, you've provided neither any document (multiple words) as a query, nor is there any way for theFastText
model to return either a document itself, or a name of any documents. Only words, as it has done.While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.
Those don't look like words very-much like
'rule'
, but you'll only get good results fromFastText
(and similar word2vec-algorithms) if you train them with lots of varied data showing many subtly-contrasting realistic uses of the relevant words.How many texts, with how many words, are in your
sentences_cleaned
data? (How many uses of'rule'
and related words?)In any real
FastText
/Word2Vec
/etc model, trained with asequate data/parameters, no single sentence (like your 1st sentence) can tell you much about what the results "should" be. That only emerged from the full rich dataset.