I am using LangChain's ParentDocumentRetriever. Following mostly the code from their documentation, I created an instance of ParentDocumentRetriever using BGE-large embeddings, the NLTK text splitter, and Chroma:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import NLTKTextSplitter
from langchain.vectorstores import Chroma

embedding_function = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5', cache_folder=hf_embed_path)

# This text splitter is used to create the child documents
child_splitter = NLTKTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child",
)

# The storage layer for the parent documents
store = InMemoryStore()

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)
retriever.add_documents(docs, ids=None)
I added documents to it so that I can query by matching against the small chunks but get back the full document:

matching_docs = retriever.get_relevant_documents(query_text)
The Chroma collection 'full_documents' was persisted in ./chroma_db_child. I can read the collection back and query it, and I get back the chunks, which is what is expected:
vector_db = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child",
)
matching_doc = vector_db.max_marginal_relevance_search('whatever', 3)
len(matching_doc)
>> 3
One thing I can't figure out is how to persist the whole structure. The code above uses store = InMemoryStore(), which means that once execution stops, the parent documents are gone. Is there a way, perhaps using something other than InMemoryStore(), to create a ParentDocumentRetriever that persists both the full documents and the chunks, so that I can restore them later without having to repeat the retriever.add_documents(docs, ids=None) step?
I had the same problem and found the solution here: https://github.com/langchain-ai/langchain/issues/9345
You need to use the create_kv_docstore() function like this:
You will end up with two folders: the Chroma db "db" with the child chunks and the "data" folder with the parent documents.
I think it is also possible to save the documents in a Redis db or in Azure Blob Storage (https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container), but I am not sure.