What is the input format for fastText, and why doesn't my model give me meaningful similar-word output?


My goal is to find similarities between a word and a document. For example, I want to find the similarity between "new" and a document, for simplicity, say "Hello World!".

I used word2vec from gensim, but the problem is it does not find the similarity for an unseen word. Thus, I tried to use fastText from gensim as it can find similarity for words that are out of vocabulary.

Here is a sample of my document data:

[['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
 ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you',
  'while', 'you', 'sleep', 'do'],
 ['NOT', 'open', 'your', 'eyes'],
 ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
 ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show',
  'Bird', 'Box', 'from', 'Netflix']]

I simply train data like this:

from gensim.models.fasttext import FastText

model = FastText(sentences_cleaned)

Consequently, I want to find the similarity between say, "rule" and this document.

model.wv.most_similar("rule")

However, fastText gives me this:

[('the', 0.1334390938282013),
 ('they', 0.12790171802043915),
 ('in', 0.12731242179870605),
 ('not', 0.12656228244304657),
 ('and', 0.11071767657995224),
 ('of', 0.08563747256994247),
 ('I', 0.06609072536230087),
 ('that', 0.05195673555135727),
 ('The', 0.002402491867542267),
 ('my', -0.009009800851345062)]

Obviously, "rule" should be at the top of the similarity list, since the word "rule" appears in the first sentence of the document. I also tried stemming/lemmatization, but that didn't work either.

Was my input format correct? I've seen that lots of documents use .cor or .bin formats, and I don't know what those are.

Thanks for any reply!

1 Answer

Answered by gojomo:

model.wv.most_similar('rule') asks the model's set of word-vectors (.wv) to return the words most similar to 'rule'. That is, you haven't provided any document (multiple words) as a query, nor is there any way for the FastText model to return either a document itself or the name of a document. Only words, as it has done.

While FastText trains on texts – lists of word-tokens – it only models words/subwords. So it's unclear what you expected instead: the answer is of the proper form.

Those results don't look very much like 'rule', but you'll only get good results from FastText (and similar word2vec algorithms) if you train them with lots of varied data showing many subtly contrasting, realistic uses of the relevant words.

How many texts, with how many words, are in your sentences_cleaned data? (How many uses of 'rule' and related words?)

In any real FastText/Word2Vec/etc. model, trained with adequate data and parameters, no single sentence (like your 1st sentence) can tell you much about what the results "should" be. Those only emerge from the full, rich dataset.
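A quick way to answer that question yourself is to count sentences, tokens, and occurrences of the word of interest before training. A sketch over the question's own five sample sentences (plain stdlib, no gensim needed):

```python
from collections import Counter

# The five sample sentences from the question
corpus = [
    ['This', 'is', 'the', 'only', 'rule', 'of', 'our', 'household'],
    ['If', 'you', 'feel', 'a', 'presence', 'standing', 'over', 'you',
     'while', 'you', 'sleep', 'do'],
    ['NOT', 'open', 'your', 'eyes'],
    ['Ignore', 'it', 'and', 'try', 'to', 'fall', 'asleep'],
    ['This', 'may', 'sound', 'a', 'bit', 'like', 'the', 'show',
     'Bird', 'Box', 'from', 'Netflix'],
]

n_sentences = len(corpus)
n_tokens = sum(len(sentence) for sentence in corpus)
counts = Counter(word.lower() for sentence in corpus for word in sentence)

print(n_sentences, n_tokens, counts['rule'])  # 5 43 1 -- far too little data
```

Note also that gensim's default min_count of 5 would discard a word like 'rule' that appears only once, so on a corpus this small most_similar() is comparing vectors that are essentially untrained.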