Imagine I have some code like this. I am using the encode function to create embeddings, and from these I then calculate a cosine similarity score; after all, the model I have selected is geared towards cosine similarity (as opposed to dot-product similarity).
My question is: do you always embed the entire string as it is, or would/could you clean the two strings before encoding them? Remove stopwords, maybe keep only nouns or entities. Is this a thing, or would the discontinuity and possible non-grammaticality of the resulting strings hurt us?
from sentence_transformers import SentenceTransformer, util
model_name = 'sentence-transformers/multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)
phrase1 = 'Some arbitrarily long string from a book'
phrase2 = "This is a another arbitrarily long string from the same book'
emb1 = model.encode(phrase1)
emb2 = model.encode(phrase2)
The cosine similarity scores I get are not spread out very well; there isn't enough separation between good matches and bad matches.
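For concreteness, the kind of cleaning I have in mind would look something like this. This is only a rough sketch reusing the model and phrases above; the NLTK stopword list is just one possible choice I am assuming here:

from nltk.corpus import stopwords
from sentence_transformers import util

# nltk.download('stopwords')  # one-time download of the stopword list
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Drop stopwords but keep the original word order; the result is no longer grammatical
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

clean_emb1 = model.encode(remove_stopwords(phrase1))
clean_emb2 = model.encode(remove_stopwords(phrase2))
clean_score = util.cos_sim(clean_emb1, clean_emb2)  # cosine similarity of the cleaned strings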
Intuitively, you could think that. The TSDAE paper investigated how much different POS tags contribute to determining the similarity of two sentences, and found that nouns are the most relevant across the different approaches compared (see the figure in the paper).
But that does not mean you can simply remove the less influential POS types to improve your results. The model you are using was trained on complete, grammatically correct sentences; passing incomplete sentences to it might confuse the model and decrease performance.
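You can check this on your own data rather than guessing. A rough sketch of the comparison, assuming spaCy with its small English model (en_core_web_sm) for POS tagging and entity recognition, and reusing the two phrases from your question:

import spacy
from sentence_transformers import SentenceTransformer, util

nlp = spacy.load('en_core_web_sm')
model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

def keep_nouns_and_entities(text):
    # Keep only nouns, proper nouns, and tokens that are part of a named entity;
    # the output is a non-grammatical bag of salient words
    doc = nlp(text)
    return ' '.join(t.text for t in doc if t.pos_ in ('NOUN', 'PROPN') or t.ent_type_)

def cosine(a, b):
    # Encode both strings and return their cosine similarity as a float
    return util.cos_sim(model.encode(a), model.encode(b)).item()

raw_score = cosine(phrase1, phrase2)
filtered_score = cosine(keep_nouns_and_entities(phrase1), keep_nouns_and_entities(phrase2))
print(raw_score, filtered_score)

Run this over a set of pairs you know to be good and bad matches; if the filtered scores do not separate them better than the raw ones, you are seeing exactly the effect described above.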
The only real things you can do are: