What is the input format for fastText, and why doesn't my model give me meaningful similar-word output?


My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".

I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.

Here is a sample of my document data:

[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you',
  'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show',
  'Bird', 'Box', 'from', 'Netflix']]

I simply train data like this:

from gensim.models.fasttext import FastText

model = FastText(sentences_cleaned)

Consequently, I want to find the similarity between say, "rule" and this document.

model.wv.most_similar("rule")

However, fastText gives me this:

[('the', 0.1334390938282013),
 ('they', 0.12790171802043915),
 ('in', 0.12731242179870605),
 ('not', 0.12656228244304657),
 ('and', 0.11071767657995224),
 ('of', 0.08563747256994247),
 ('I', 0.06609072536230087),
 ('that', 0.05195673555135727),
 ('The', 0.002402491867542267),
 ('my', -0.009009800851345062)]

Obviously, "rule" should be at the top of the similarity list, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that didn't work either.

Was my input format correct? I've seen that lots of documents use .cor or .bin formats, and I don't know what those are.

Thanks for any reply!

1 Answer

Answered by gojomo:

model.wv.most_similar('rule') asks the model's set of word-vectors (.wv) to return the words most similar to 'rule'. That is, you haven't provided any document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself or the name of a document. Only words, as it has done.

While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.

Those results don't look very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.

How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)

In any real FastText/Word2Vec/etc. model, trained with adequate data and parameters, no single sentence (like your 1st sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
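A quick way to answer that question yourself is to count sentences, tokens, and occurrences of the word of interest before training. A sketch over the question's own five sample sentences (plain stdlib, no gensim needed):

```python
from collections import Counter

# The five sample sentences from the question
corpus = [
    ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
    ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you',
     'while', 'you', 'sleep', 'do'],
    ['NOT', 'open', 'your', 'eyes'],
    ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
    ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show',
     'Bird', 'Box', 'from', 'Netflix'],
]

n_sentences = len(corpus)
n_tokens = sum(len(sentence) for sentence in corpus)
counts = Counter(word.lower() for sentence in corpus for word in sentence)

print(n_sentences, n_tokens, counts['rule'])  # 5 43 1 -- far too little data
```

Note also that gensim's default min_count of 5 would discard a word like 'rule' that appears only once, so on a corpus this small most_similar() is comparing vectors that are essentially untrained.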