Semantic search with pretrained BERT models giving irrelevant results with high similarity


I'm trying to build a semantic search system and have experimented with multiple pretrained models from the SentenceTransformers library: LaBSE, MS MARCO models, etc. The system works well at returning relevant documents first with high similarity scores, but the issue is that documents which are not relevant also come back with relatively high scores. This makes it difficult to determine a cutoff threshold for what is relevant and what isn't.

For computing vector similarity I have experimented with Elasticsearch approximate kNN and FAISS, with similar results from both. I have also checked exact cosine similarities with scikit-learn (a minimal sketch is given after the examples below).

My corpus generally has sentences of 15-30 words, and the input sentence is under 10 words long. Examples are given below:

Corpus text 1: <brand_name> is a Fashion House. We design, manufacture and retail men's and women's apparel
Input sentence 1: men's fashion
Cosine similarity 1: 0.21

Corpus text 2: is an app for pizza delivery
Input sentence 2: personal loan
Cosine similarity 2: 0.16
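
For reference, here is a minimal sketch of how the similarities above are computed (using LaBSE, one of the models mentioned; exact scores will vary by model and version):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# One of the models mentioned above; any SentenceTransformers model works here.
model = SentenceTransformer("sentence-transformers/LaBSE")

corpus = [
    "<brand_name> is a Fashion House. We design, manufacture and retail men's and women's apparel",
    "is an app for pizza delivery",
]
queries = ["men's fashion", "personal loan"]

# Exact cosine similarity between each query and each corpus sentence.
corpus_emb = model.encode(corpus)
query_emb = model.encode(queries)
print(cosine_similarity(query_emb, corpus_emb))
```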

Please suggest pretrained models that might be good for this purpose.

I have experimented with many pretrained models, such as LaBSE and ms-marco-roberta-base-v3 from SentenceTransformers, but I see the same behaviour in all of them. I expect embeddings of dissimilar sentences to have lower cosine similarity.


1 Answer

petezurich (Best Answer):

If you haven't already done so, have a look at the distinction between symmetric and asymmetric semantic search and the models trained specifically for each:

https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search

From what I understand of your use case, you might get better results with asymmetric search.
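
As a minimal sketch, assuming the msmarco-distilbert-base-tas-b checkpoint (one of the asymmetric MS MARCO bi-encoders from the SBERT docs, trained for dot-product rather than cosine scoring):

```python
from sentence_transformers import SentenceTransformer, util

# An asymmetric model trained on MS MARCO (short queries vs. longer passages).
# Note: this model family is tuned for dot-product scoring, not cosine similarity.
model = SentenceTransformer("msmarco-distilbert-base-tas-b")

corpus = [
    "<brand_name> is a Fashion House. We design, manufacture and retail men's and women's apparel",
    "is an app for pizza delivery",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("men's fashion", convert_to_tensor=True)

# Rank corpus entries by dot product, as recommended for this model family.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2, score_function=util.dot_score)
for hit in hits[0]:
    print(hit["corpus_id"], hit["score"])
```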

Reranking can help a lot too. See this:

https://www.sbert.net/examples/applications/retrieve_rerank/README.html
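
A minimal retrieve-and-rerank sketch with a cross-encoder (assuming the cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint): cross-encoder scores for irrelevant pairs tend to be much lower than bi-encoder cosine similarities, which makes choosing a cutoff easier.

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, document) pairs jointly instead of
# comparing independently computed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "men's fashion"
candidates = [
    "<brand_name> is a Fashion House. We design, manufacture and retail men's and women's apparel",
    "is an app for pizza delivery",
]
scores = reranker.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```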

Also, you might want to have a look at Weaviate. For their vector search they have implemented an autocut function:

https://weaviate.io/developers/weaviate/search/similarity#autocut

Autocut takes a positive integer parameter N, looks at the distance between each result and the query, and stops returning results after the Nth "jump" in distance. For example, if the distances for six objects returned by nearText were [0.1899, 0.1901, 0.191, 0.21, 0.215, 0.23] then autocut: 1 would return the first three objects, autocut: 2 would return all but the last object, and autocut: 3 would return all objects.
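
To illustrate the idea only (this is a toy heuristic, not Weaviate's actual jump-detection algorithm), you could cut after the Nth gap that is noticeably larger than the typical gap:

```python
from statistics import median

def autocut(distances, n_jumps, factor=2.0):
    """Toy autocut: return the prefix of `distances` before the n-th 'jump'.

    A 'jump' here is a gap between consecutive distances larger than
    `factor` times the median gap. Illustrative only; Weaviate's exact
    heuristic may differ.
    """
    gaps = [b - a for a, b in zip(distances, distances[1:])]
    if not gaps:
        return distances
    threshold = factor * median(gaps)
    jumps_seen = 0
    for i, gap in enumerate(gaps):
        if gap > threshold:
            jumps_seen += 1
            if jumps_seen == n_jumps:
                return distances[: i + 1]
    return distances

dists = [0.1899, 0.1901, 0.191, 0.21, 0.215, 0.23]
print(autocut(dists, 1))  # first three objects
print(autocut(dists, 2))  # all but the last object
print(autocut(dists, 3))  # all objects
```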

Weaviate also has a nice hybrid search implementation (combining vector and lexical search) that might help you as well.
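
One common fusion strategy behind hybrid search is reciprocal rank fusion (RRF). A minimal sketch, with hypothetical result lists standing in for your vector and BM25 rankings:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids via RRF.

    Each document's score is the sum of 1 / (k + rank) over the lists
    it appears in; k=60 is the constant commonly used in practice.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from vector search and BM25 lexical search.
vector_hits = ["doc_a", "doc_c", "doc_b"]
bm25_hits = ["doc_b", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```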