Cosine similarity with embeddings: do we have to embed the whole sentence/text?


Imagine I have some code like the following. I am using the encode function to create embeddings, and from these I would then calculate a cosine similarity score; after all, the model I have selected is geared towards cosine similarity (as opposed to dot-product similarity).

My question is: do you always embed the entire string as it is, or would/could you perform cleaning on the two strings before you encode them? Strip stopwords. Maybe keep only nouns or entities. Is this a thing, or would the discontinuity/non-grammatical nature of the resulting strings hurt us?

from sentence_transformers import SentenceTransformer, util
model_name = 'sentence-transformers/multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)
phrase1 = 'Some arbitrarily long string from a book'
phrase2 = "This is a another arbitrarily long string from the same book'    
emb1 = model.encode(phrase1)
emb2 = model.encode(phrase2)
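
For completeness, the scoring step I am referring to is roughly this (util.cos_sim comes from the same library):

score = util.cos_sim(emb1, emb2).item()  # cosine similarity as a single float
print(f"cosine similarity: {score:.3f}")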

The cosine similarity scores I get are not spread out that well; there isn't enough separation between good matches and bad matches.


There are 2 answers

cronoik

My question is: do you always embed the entire string as it is, or would/could you perform cleaning on the two strings before you encode them? Strip stopwords. Maybe keep only nouns or entities. Is this a thing, or would the discontinuity/non-grammatical nature of the resulting strings hurt us?

Intuitively, you might think so. The TSDAE paper investigated the influence of different POS tags on determining the similarity of two sentences and found that nouns are the most relevant across different approaches (see the figure below).

[Figure from the TSDAE paper: relevance of different POS tags for sentence similarity]

But that does not mean you can simply remove the less influential POS types to improve your results. The model you are using was trained on complete, grammatically correct sentences; passing incomplete sentences to it might confuse the model and decrease performance.
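
If you want to verify this on your own data, here is a minimal sketch that compares the score for full sentences against a nouns-only version (the spaCy dependency, its en_core_web_sm model, and the example sentences are my own assumptions, not part of the question):

import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")

def nouns_only(text):
    # keep only noun and proper-noun tokens, joined back into a pseudo-sentence
    return " ".join(tok.text for tok in nlp(text) if tok.pos_ in ("NOUN", "PROPN"))

a = "The old sailor told a long story about the sea."
b = "An elderly mariner recounted his voyages across the ocean."
full_score = util.cos_sim(model.encode(a), model.encode(b)).item()
noun_score = util.cos_sim(model.encode(nouns_only(a)), model.encode(nouns_only(b))).item()
print(f"full sentences: {full_score:.3f}, nouns only: {noun_score:.3f}")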

The only real things you can do are:

  • Experiment with different pre-trained models (a different training dataset or objective can have a huge impact on your data).
  • Fine-tune your own sentence-transformer (check the training section of their homepage); a rough sketch of what that looks like follows below.
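
A minimal fine-tuning sketch, following the classic sentence-transformers fit API (the labelled pairs below are purely hypothetical placeholders; real training needs a proper dataset):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")

# hypothetical labelled pairs: (text_a, text_b) with a target cosine similarity in [0, 1]
train_examples = [
    InputExample(texts=["passage about the sea", "passage about the ocean"], label=0.9),
    InputExample(texts=["passage about the sea", "recipe for apple pie"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# one short run just to show the call; tune epochs/warmup on real data
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)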
Kinjal

Since you are using sentence embeddings, encoding the whole sentence makes more sense.

An alternative approach to increase separation: if you have an idea of the categories that most of the texts fall into, you can use a zero-shot classifier to score each text against every category. You can then keep refining the categories in a semi-supervised fashion.
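
A rough sketch of that idea with the Hugging Face zero-shot pipeline (the model choice and the candidate categories here are just placeholders to illustrate the call):

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# hypothetical categories; refine them as you inspect the scores
candidate_labels = ["adventure", "romance", "politics", "science"]

result = classifier("Some arbitrarily long string from a book", candidate_labels=candidate_labels)
print(list(zip(result["labels"], result["scores"])))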