I'm trying to build a semantic search system and have experimented with several pretrained models from the SentenceTransformers library: LaBSE, MS-MARCO, etc. The system works well at returning relevant documents first with high probability, but the issue is that documents which are not relevant also come back with relatively high similarity scores. As a result, it has become difficult to determine a cutoff threshold for what is relevant and what isn't.
To compute vector similarity I have experimented with Elasticsearch approximate kNN and FAISS, with similar results in both. I have also checked the exact cosine similarities with scikit-learn.
My corpus generally consists of sentences of 15-30 words, and the input sentence is fewer than 10 words long. Examples are given below:
Corpus text 1: <brand_name> is a Fashion House. We design, manufacture and retail men's and women's apparel
Input sentence 1: men's fashion
Cosine similarity 1: 0.21
Corpus text 2: is an app for pizza delivery
Input sentence 2: personal loan
Cosine similarity 2: 0.16
Please suggest pretrained models that might be good for this purpose.
I have experimented with many pretrained models, such as LaBSE and ms-marco-roberta-base-v3 from SentenceTransformers, but I see the same behaviour in all of them. I expected embeddings of dissimilar sentences to have a much lower cosine similarity.
If you haven't already done so, have a look at the distinction between symmetric and asymmetric semantic search and the respective models trained specifically for each:
https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search
From what I understand of your use case (short queries matched against longer corpus sentences), you might get better results with asymmetric search.
Reranking can help a lot too. See this:
https://www.sbert.net/examples/applications/retrieve_rerank/README.html
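To make the retrieve-and-rerank idea concrete, here is a minimal sketch. It assumes a bi-encoder object with an `encode(list_of_texts)` method and a cross-encoder object with a `predict(list_of_pairs)` method, which is how `SentenceTransformer` and `CrossEncoder` from sentence-transformers behave. The function name `retrieve_and_rerank`, the `min_score` parameter, and the model names in the comments are my own choices for illustration, not something prescribed by the linked page.

```python
import math
from typing import Any, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_and_rerank(
    query: str,
    corpus: Sequence[str],
    bi_encoder: Any,      # assumed to expose .encode(list[str]) -> sequence of vectors
    cross_encoder: Any,   # assumed to expose .predict(list[(query, doc)]) -> sequence of scores
    top_k: int = 10,
    min_score: Optional[float] = None,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap candidate retrieval with the bi-encoder and cosine similarity.
    query_vec = bi_encoder.encode([query])[0]
    corpus_vecs = bi_encoder.encode(list(corpus))
    sims = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    candidates = [i for i, _ in sims[:top_k]]
    if not candidates:
        return []

    # Stage 2: score each (query, document) pair jointly with the cross-encoder.
    pairs = [(query, corpus[i]) for i in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(
        ((corpus[i], float(s)) for i, s in zip(candidates, scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # Optional cutoff on the cross-encoder score rather than on cosine similarity.
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked


# Possible wiring with real models (model names are only an example choice):
# from sentence_transformers import SentenceTransformer, CrossEncoder
# bi = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
# ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# hits = retrieve_and_rerank("men's fashion", corpus, bi, ce, top_k=20)
```

The point of the second stage is that cross-encoder scores usually separate relevant from irrelevant pairs more clearly than raw cosine similarities, so a threshold is easier to pick there. The score scale depends on the model, so any `min_score` value would have to be tuned on a handful of labelled queries.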
Also, you might want to have a look at Weaviate. For its vector search it implements an AutoCut function:
https://weaviate.io/developers/weaviate/search/similarity#autocut
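If you want to try the same idea without switching databases, a gap-based cutoff is easy to prototype on top of the scores you already get from FAISS or Elasticsearch. The sketch below is a simplified stand-in for the concept, not Weaviate's actual implementation; `cut_at_largest_gap` and `min_gap` are hypothetical names I made up for this example.

```python
from typing import List, Sequence, Tuple


def cut_at_largest_gap(
    scored: Sequence[Tuple[str, float]],
    min_gap: float = 0.0,
) -> List[Tuple[str, float]]:
    """Keep only the results ranked above the largest drop in score.

    `scored` holds (document, similarity) pairs, higher similarity meaning
    more relevant. If the largest drop is smaller than `min_gap`, there is
    no clear discontinuity and every result is returned.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if len(ranked) < 2:
        return ranked

    # Drop in score between each result and the next one in the ranking.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)

    if gaps[cut] < min_gap:
        return ranked
    return ranked[: cut + 1]


# Illustrative scores only (not measured with any model):
results = [("doc_a", 0.62), ("doc_b", 0.58), ("doc_c", 0.21), ("doc_d", 0.16)]
kept = cut_at_largest_gap(results)  # keeps doc_a and doc_b
```

Keep in mind that a relative cutoff like this only helps when at least one relevant document exists. If even the top hit is irrelevant, as in your pizza delivery example, it will still be returned, so you may want to combine the gap check with an absolute minimum score.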
Weaviate also has a nice hybrid search implementation (combining vector and lexical search) that might help you as well.