I'm trying to create a vector database in python using LangChain for retrieval augmentation with a large language model. Currently, I'm using NCBI Statpearls (a corpus of medical data) and for testing purposes have only initialized the vector database with a single article on artery occlusion. Instead of chunking by tokens, I've chunked by paragraph and also added information about the title and section name to each chunk for context.
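For context, each chunk is built roughly like this (a simplified sketch; the helper name is illustrative, not my exact parsing code):

```python
def make_chunk(article_title, section_name, text):
    """Prefix each paragraph with its article title and section name
    so the embedded chunk carries document-level context."""
    return (
        f"ARTICLE TITLE: {article_title}\n"
        f"SECTION NAME: {section_name}\n"
        f"{text}"
    )

chunk = make_chunk(
    "Chronic Total Occlusion of the Coronary Artery",
    "Etiology",
    "Risk factors for CTO lesion in patients are as below...",
)
print(chunk.splitlines()[0])
```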
However, the database is often retrieving results irrelevant to my queries. For example, when I search using just the term 'end stage kidney disease' (of which there is 1 mention in the text), the database returns as the first result:
ARTICLE TITLE: Chronic Total Occlusion of the Coronary Artery
SECTION NAME: History and Physical
The history should also include risk factors for cardiovascular disease (diabetes, tobacco abuse, hypertension, hyperlipidemia) and non-cardiac causes of the patient's symptoms, including pulmonary embolism, aortic dissection, pneumothorax, esophageal rupture or perforating peptic ulcer. Physical examination in these patients should include complete auscultation of the heart and lung sounds together with assessment for heart failure signs including jugular venous distention, Kussmaul sign, hepatojugular reflex, ascites, and peripheral edema.
Note how there's no mention of kidney disease. The second result doesn't mention it either:
ARTICLE TITLE: Chronic Total Occlusion of the Coronary Artery
SECTION NAME: Prognosis
In addition to causing symptoms, CTOs have correlations with a worse overall prognosis, with higher rates of death and non-fatal adverse cardiovascular events in several populations. Patients with CTOs tend to be older and have more comorbidities and more significant impairment of left ventricular function. Furthermore, patients with non-revascularized CTOs have higher mortality and a higher risk of major adverse cardiovascular events in comparison to patients with multivessel coronary artery disease who are completely revascularized.
Only in the third result does it return a passage mentioning kidney disease:
ARTICLE TITLE: Chronic Total Occlusion of the Coronary Artery
SECTION NAME: Etiology
Risk factors for CTO lesion in patients are as below
Known coronary artery disease or history of myocardial infarction
Excessive tobacco use
High LDL cholesterol, low HDL cholesterol
Diabetes
Sedentary lifestyle
Hypertension
Family history of premature disease
End-stage kidney disease <-----
Obesity
Postmenopausal women
I've tried FAISS with similar results, but my current implementation uses LanceDB, and is essentially the same as LangChain's example on their website. The model is text-embedding-ada-002:
<!-- language:python-->
embeddings = langchain.embeddings.OpenAIEmbeddings(deployment_id='Embedding', chunk_size=1)
db = langchain_lancedb.from_documents(list_derived_chunks + p_derived_chunks, embeddings, connection=table)
docs = db.similarity_search('end stage kidney disease', k=3)
for doc in docs:
    print(doc.page_content)
    print('================')
Where list_derived_chunks and p_derived_chunks are chunks extracted from lists in the article and paragraphs in the article, respectively. This was done via some XML parsing code which seems to work well.
Could anyone provide insights into what I can do to improve retrieval performance? Maybe I'm conceptualizing vector databases wrongly, or it just needs a fine-tuned embedding model to work well? Thanks :)
I tried several embedding models, and tried both LanceDB and FAISS as the vector database. I expected the list containing end-stage kidney disease to be returned first, as it seemed the most relevant, but it was the third result in the vector database search.
There are two common approaches to this sort of textual information retrieval: semantic search and full text search (overview).
It sounds like you are expecting, or intuitively thinking in terms of, full text search. Full text search indices are built only from the existing corpus of data. They learn which words are common in that corpus and give those words lower weight. You have fairly unique (i.e. infrequent in the full corpus) keywords ("kidney disease") and expect your top results to contain those keywords. Since you're using LanceDB, you might get results matching your intuition by using its full text search capabilities.
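To see why full text search would surface the Etiology passage first, here is a toy IDF-weighted scorer, a from-scratch illustration of the weighting idea, not LanceDB's actual ranking code:

```python
import math

def idf_scores(corpus):
    """Inverse document frequency: terms that appear in few documents
    get high weight, terms that appear everywhere get low weight."""
    n = len(corpus)
    vocab = {w for doc in corpus for w in doc.lower().split()}
    return {
        w: math.log(n / sum(w in doc.lower().split() for doc in corpus))
        for w in vocab
    }

def score(query, doc, idf):
    """Sum the IDF weight of each query term present in the document."""
    words = doc.lower().split()
    return sum(idf.get(t, 0.0) for t in query.lower().split() if t in words)

corpus = [
    "risk factors include diabetes hypertension and kidney disease",
    "physical examination of the heart and lung sounds",
    "higher rates of death and cardiovascular events",
]
idf = idf_scores(corpus)
scores = [score("kidney disease", d, idf) for d in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best)  # index of the passage that literally contains "kidney disease"
```

Because "kidney" and "disease" are rare in this tiny corpus, the passage containing them scores far above the others, which matches the ranking you were expecting.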
With semantic search you are more at the mercy of the model. You mention you're using text-embedding-ada-002, a model designed to capture the semantic intent of English sentences and trained on a very broad corpus. My guess is that, from the perspective of the text-embedding-ada-002 model, the prompt "end stage kidney disease" is being embedded into a semantic concept like (this is a made-up example) "text related to diseases using a clinical tone".
You might want to investigate a concept known as "fine-tuning". You should be able to take the initial text-embedding-ada-002 model and fine-tune it on the NCBI StatPearls corpus. If done correctly, this should (in very hand-waving terms) teach the model all about the nuances behind the broad concept "text related to diseases using a clinical tone".
Obligatory disclaimer: I currently work for LanceDB, and this answer may be biased as a result.
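As a toy illustration of the difference: similarity_search ranks by vector similarity (typically cosine), so a passage with no keyword overlap can still score highest if its embedding points in a similar direction to the query's. The vectors below are made up for the sake of the example; real text-embedding-ada-002 embeddings are 1536-dimensional:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d "embeddings" standing in for real model output.
query = [0.9, 0.1, 0.2]             # "end stage kidney disease"
clinical_passage = [0.8, 0.3, 0.1]  # clinical tone, no keyword overlap
keyword_passage = [0.4, 0.1, 0.9]   # contains the literal keywords

print(cosine(query, clinical_passage))
print(cosine(query, keyword_passage))
```

With these invented vectors, the keyword-free clinical passage outranks the one containing the literal keywords, which is the kind of behavior you observed.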