I'm building a QA bot that stores documents from a CSV file in a vector store created with LangChain + Chroma. I'm using the PaLM model to answer questions from that vector store. Here's my code:
import csv
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import pandas as pd
hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
with open("COHORT.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    i = 0
    # Iterate through the rows in the CSV file and print each row
    for row in reader:
        s1 = f"The person is having Condition {row[0]}, Condition start date is {row[13]} with year of birth {row[14]}, the person's ethnicity is {row[18]} and race is {row[19]}, the person is a {row[20]} and uses the drug {row[21]} for the treatment"
        print(s1)
        collection_name = f"chatbot2_batch{i}"
        print(collection_name)
        # Create Chroma vector store from the batch
        Vector_db = Chroma.from_texts(
            collection_name=collection_name, texts=s1, embedding=hf_embed, persist_directory="kai3"
        )
        Vector_db.persist()
        pdf_vector_db_path = "kai3"
        db = Chroma(
            collection_name="chatbot2",
            embedding_function=hf_embed,
            persist_directory=pdf_vector_db_path,
        )
        Vector_db.persist()
        i += 1
METHOD 1
from langchain.llms import GooglePalm
from langchain.chains import RetrievalQA

llm = GooglePalm(temperature=0.1, key="XXXXXX")
# Get the retriever from the Chroma vector store
retriever = db.as_retriever()
# Create a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, input_key="question")
# Retrieve the answer from the vector store
answer = qa_chain("WHAT's the most used drug?")
# Print the answer
print(answer)
When I try this method, the answer comes from the PaLM model's pretrained knowledge and not from the vector store.
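A quick way to check whether anything is actually coming back from the store is to print what the retriever returns for the question (a sanity check using the retriever from METHOD 1):

# Sanity check: inspect the documents the retriever returns for the question.
docs = retriever.get_relevant_documents("WHAT's the most used drug?")
print(len(docs), "documents retrieved")
for d in docs:
    print(d.page_content)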
METHOD 2
# Create a RetrievalQA chain directly with the Chroma vector store
qa_chain1 = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=db, input_key="question"
)
# Retrieve the answer from the vector store
answer = qa_chain1("WHAT's the most used drug?")
# Print the answer
print(answer)
When I try this, I get the following error:
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
<ipython-input-33-6d935d969703> in <cell line: 1>()
----> 1 qa_chain1 = RetrievalQA.from_chain_type(llm=llm,
2 chain_type="stuff",
3 retriever=db,
4 input_key="question",
5 )
2 frames
/usr/local/lib/python3.10/dist-packages/pydantic/main.cpython-310-x86_64-linux-gnu.so in pydantic.main.BaseModel.__init__()
ValidationError: 1 validation error for RetrievalQA
retriever
value is not a valid dict (type=type_error.dict)
You have a few problems in your code.
This is not a question that can be answered through retrieval. This is a question that requires aggregation of the entire dataset.
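For "the most used drug" specifically, a plain pandas aggregation over the CSV is the more direct tool. A minimal sketch, assuming the drug sits in column 21 as in your f-string and the file has no header row:

import pandas as pd

# Count how often each drug appears across the whole dataset.
# Column 21 is the drug column per the f-string in the question;
# drop header=None if the file does have a header row.
df = pd.read_csv("COHORT.csv", header=None)
drug_counts = df[21].value_counts()
print(drug_counts.idxmax())  # the most used drug
print(drug_counts.head())    # top drugs with their counts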
Ignoring that for a second, and using another question, like
What drug is used to treat face cancer
(or some other condition in your dataset), your code is (inside of a loop!) creating a DB from a single text in Vector_db, but then you are using db for retrieval, which is empty, so the LLM is generating an answer for you from its internal knowledge.
Consider this:
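Here is a minimal sketch of the ingestion and querying done in one pass. It keeps your column indices and the legacy langchain imports you already use, and assumes google_api_key is how you pass the PaLM key; adjust the key handling and the question to your setup:

import csv

from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import GooglePalm
from langchain.vectorstores import Chroma

hf_embed = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Build ONE list with a text per row, instead of a new single-text collection per row.
texts = []
with open("COHORT.csv", newline="") as csvfile:
    for row in csv.reader(csvfile):
        texts.append(
            f"The person is having Condition {row[0]}, Condition start date is {row[13]} "
            f"with year of birth {row[14]}, the person's ethnicity is {row[18]} and race is "
            f"{row[19]}, the person is a {row[20]} and uses the drug {row[21]} for the treatment"
        )

# One persisted collection holding all rows, under the SAME name you query later.
db = Chroma.from_texts(
    texts=texts,
    embedding=hf_embed,
    collection_name="chatbot2",
    persist_directory="kai3",
)
db.persist()

llm = GooglePalm(temperature=0.1, google_api_key="XXXXXX")

# RetrievalQA expects a retriever, not the vector store itself;
# that is what the ValidationError in METHOD 2 complains about.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    input_key="question",
)

print(qa_chain("What drug is used to treat face cancer?"))

The two important changes are that every row goes into a single collection under the same name you later open, and that RetrievalQA gets db.as_retriever() rather than the Chroma object itself.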