Sentence Similarity between a phrase with 2-3 words and documents with multiple sentences

Question

Asked by Naveen Reddy Marthala on 14 December 2023 at 15:09 · 171 views

What I want to achieve: I have thousands of documents (descriptions of incidents), and I would like to find the ones that match a phrase or are similar to the words in it. For example, for the input phrase "electric vehicle", I would like to find all documents that discuss anything happening with any type of electric vehicle or conveyance. The documents in the corpus might not contain the word "vehicle" but may name a specific vehicle type such as "scooter", "bicycle", or "hoverboard", and they may contain the word "electrical" or even something like "lithium battery of a ". So, from an input phrase like "an electric vehicle", "an electric automobile", or "vehicle powered by a lithium-ion battery", I need to find all documents with related mentions of that term. However, I don't want to capture documents that mention "automobile" or "scooter" without any mention of "electric" or "lithium-ion". In short, from a phrase of 1 to 4 words, I must find matching documents, each of which contains 2 to 100 words across 1 to 7 sentences.

The list of input phrases (used to find matching documents) will vary, so I suppose something like Siamese networks or training a classification model is not an option. The number of documents will also keep growing daily, and each document is independent of the others.

Here's what I have done so far: I used sentence-transformers (trying the pre-trained models multi-qa-mpnet-base-dot-v1, all-MiniLM-L12-v2, all-MiniLM-L6-v2, and all-mpnet-base-v2) to get normalized embeddings for all the documents and for my input phrase. I then computed the cosine similarity between the input phrase's embedding and every document embedding, and took the top 20 documents with the highest values.
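The ranking step described above can be sketched roughly as follows. This is a minimal, dependency-free illustration rather than the asker's actual code: the names `normalize` and `top_k_by_cosine` are made up for this sketch, and the embeddings are assumed to have been computed already by whichever model is in use.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_cosine(
    query_vec: Vector, doc_vecs: Sequence[Vector], k: int = 20
) -> List[Tuple[int, float]]:
    """Return (document index, cosine similarity) pairs, best match first."""
    q = normalize(query_vec)
    scored = []
    for idx, doc_vec in enumerate(doc_vecs):
        d = normalize(doc_vec)
        # For unit-length vectors, the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With sentence-transformers, the input vectors would typically come from something like `SentenceTransformer("all-mpnet-base-v2").encode(documents, normalize_embeddings=True)`, with the same model used to encode the query phrase.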

The matched documents were barely relevant. For example, for the input phrase "an electrical vehicle", the documents with the highest cosine similarity contain nothing but the word "electrical". Next come documents with only "vehicle", then documents with only "electrical vehicle" (or a few more words, or the same two words in different forms), followed by slightly longer documents that mention only "vehicle" without "electrical", or vice versa. I presume this is because the input phrase has so few words.

How do I counter this and find documents that actually mention all the words in my input phrase instead of just using one word to find the matching documents?

There is 1 answer

**petezurich** · Accepted Answer · 2023-12-15T14:20:03+00:00

In general, your approach so far seems sensible, and you should be able to get more relevant search results. I suggest these improvements:

- Use models for asymmetric semantic search. Have a look at the Sentence Transformers documentation on this topic. There you'll find a selection of MS MARCO models trained specifically for matching short queries (e.g. a few words) against longer text passages (e.g. your documents). The pretrained models you are using at the moment seem suboptimal for this.
- Have a look at hybrid search. Combining lexical with semantic search might yield better results for your use case. Weaviate has a nice implementation that you can quickly set up and try out.
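To illustrate the hybrid idea without any particular search engine, here is a minimal sketch that merges a lexical ranking with a semantic ranking using reciprocal rank fusion (RRF). This is one common fusion technique, not necessarily what Weaviate does internally. The lexical scorer is a deliberately simple term-overlap stand-in for something like BM25, and the function names, the constant `k=60`, and the example data are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_ranking(query: str, documents: Sequence[str]) -> List[int]:
    """Rank document indices by the fraction of query terms they contain.

    Exact token matching only: "electrical" does not match "electric".
    """
    q_terms = set(tokenize(query))
    scored = []
    for idx, doc in enumerate(documents):
        overlap = len(q_terms & set(tokenize(doc)))
        scored.append((idx, overlap / len(q_terms) if q_terms else 0.0))
    # sort() is stable, so tied documents keep their input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [idx for idx, _ in scored]


def reciprocal_rank_fusion(rankings: Sequence[Sequence[int]], k: int = 60) -> List[int]:
    """Fuse several rankings: each document scores sum(1 / (k + rank))."""
    fused: Dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


documents = [
    "electrical wiring fault",
    "scooter caught fire",
    "electric scooter battery fire",
]
# Hypothetical semantic ranking, e.g. produced by an embedding model.
semantic = [0, 2, 1]
lexical = lexical_ranking("electric scooter", documents)
hybrid = reciprocal_rank_fusion([semantic, lexical])
```

In this toy example the semantic ranking puts the "electrical"-only document first, but the document that mentions both query terms ranks highly in both lists and therefore ends up on top after fusion.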

You could also provide a minimal reproducible example. This might help to give more detailed recommendations.
