LangChain FAISS | Any solutions or alternatives for similarity search on vector DBs over short, slightly repetitive alphanumeric identifiers?


I am using LangChain to search a vector database of cell lines whose entries look like this:

ID: 253F1

AC: CVCL_B513

SY: NA

OX: NCBI_TaxID=9606; ! Homo sapiens (Human)

CA: Induced pluripotent stem cell

There are easily tens of thousands of these entries in a text file that I store as a vector DB.
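For reference, here is a minimal sketch of how entries in this shape could be grouped into one record per cell line before indexing. It assumes every line has the form `KEY: value` and that each entry starts with an `ID:` line, as in the sample above; `parse_records` is a hypothetical helper name, not part of LangChain or FAISS.

```python
from typing import Dict, Iterable, List


def parse_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group 'KEY: value' lines into one dict per entry.

    Assumption: every entry starts with an 'ID:' line, as in the sample.
    """
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and anything that is not 'KEY: value'
        # Split on the first colon only, so values may contain colons.
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        if key == "ID" and current:
            records.append(current)  # a new 'ID:' line closes the previous entry
            current = {}
        current[key] = value
    if current:
        records.append(current)  # keep the last entry
    return records


sample = """ID: 253F1
AC: CVCL_B513
SY: NA
OX: NCBI_TaxID=9606; ! Homo sapiens (Human)
CA: Induced pluripotent stem cell
"""
records = parse_records(sample.splitlines())
print(records[0]["ID"], records[0]["AC"])
```

Keeping the fields separate like this makes it possible to treat the identifier fields (`ID`, `AC`) differently from the descriptive ones (`CA`).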

I find that a similarity search on a descriptive field such as "Induced pluripotent stem cell" always returns relevant documents. However, if I search for an identifier such as 253F1 or CVCL_B513, it is about a coin flip whether the search returns relevant documents.

I need this kind of search rather than a direct word match because the input query can arrive in varying forms, e.g. 253-F1, 253.F1, or 253f1, and this has to scale over thousands of entries.
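To make the variation concrete, the following is a rough sketch of one way such variants could be collapsed to a single key before any lookup. It assumes the variants differ only in letter case and punctuation (as in the examples above), which may not hold for every entry; `normalize_id` and `build_id_index` are hypothetical helper names, and the one-record list stands in for the real data.

```python
import re
from typing import Dict, List, Optional


def normalize_id(text: str) -> str:
    """Lowercase and drop every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_id_index(records: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Map the normalized ID and AC of each record to the record itself."""
    index: Dict[str, Dict[str, str]] = {}
    for record in records:
        for field in ("ID", "AC"):
            if field in record:
                # Note: two records with the same normalized key would collide;
                # the later one wins in this sketch.
                index[normalize_id(record[field])] = record
    return index


def lookup(index: Dict[str, Dict[str, str]], query: str) -> Optional[Dict[str, str]]:
    """Return the record whose normalized ID or AC equals the normalized query."""
    return index.get(normalize_id(query))


# Illustrative data only: a single record taken from the sample entry.
cell_lines = [{"ID": "253F1", "AC": "CVCL_B513", "CA": "Induced pluripotent stem cell"}]
id_index = build_id_index(cell_lines)

for query in ("253-F1", "253.F1", "253f1", "cvcl_b513"):
    hit = lookup(id_index, query)
    print(query, "->", hit["ID"] if hit else None)
```

A lookup like this could run before the vector search, with the similarity search kept as the fallback for descriptive queries.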

Is there an alternative approach for these short queries, something likely to give better results? I have tried building a vector DB with FAISS and running similarity search on it, but I fear that, given the nature of the data, too many entries appear similar to one another.
