I am using LangChain, Chroma, and OpenAI to build a Q&A program that uses a CSV document as its knowledge base.
The CSV file is available here: https://1drv.ms/u/s!Asflam6BEzhjgbkdegCGfZ7FI4O1Og?e=2X6ior
Here is the code that creates the embeddings:
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores.chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter as RCTS
file_path = "Test.csv"
# CSVLoader returns one Document per CSV row
csv_loader = CSVLoader(file_path)
doc_pages = csv_loader.load()
print(f"Extracted {file_path} with {len(doc_pages)} pages...")
splitter = RCTS(chunk_size = 3000, chunk_overlap = 300)
splitted_docs = splitter.split_documents(doc_pages)
embedding = OpenAIEmbeddings()
persist_directory = "docs_t/chroma/"
vectordb = Chroma.from_documents(
    documents=splitted_docs,
    embedding=embedding,
    persist_directory=persist_directory
)
vectordb.persist()
print(vectordb._collection.count())
Here is the testing code:
result = vectordb.similarity_search("what is the Support Item Name for 01_003_0107_1_1", k=3)
for r in result:
    print(r.page_content, end="\n\n")
However, this testing code returns rows that are not relevant to the query.
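For comparison, the result I expect is the single row whose item number matches exactly. Below is a minimal sketch of that lookup using only Python's built-in `csv` module. The column name `Support Item Number`, the helper `find_item_name`, and the sample rows are hypothetical and are not copied from Test.csv; only `Support Item Name` and the ID come from my query.

```python
import csv
import io

# Hypothetical sample data: the column names and rows below are made up
# for illustration and are not copied from Test.csv.
SAMPLE_CSV = """Support Item Number,Support Item Name
01_003_0107_1_1,Example item name
01_004_0107_1_1,Another example item
"""


def find_item_name(csv_text, item_number):
    """Return the 'Support Item Name' of the row whose number matches exactly."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["Support Item Number"].strip() == item_number:
            return row["Support Item Name"].strip()
    return None  # no row with that item number


print(find_item_name(SAMPLE_CSV, "01_003_0107_1_1"))
```

This is the behavior I would like the similarity search to approximate: the row for the requested ID should be the top hit.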
Which part of my code causes the similarity search to return irrelevant results?