ChromaDB and HuggingFace cannot process large files

I am trying to process 1000+ page PDFs using HuggingFace embeddings and ChromaDB. Whenever I try to upload a large file, however, I get the error below. I don't know whether ChromaDB can handle files that large, so I'd like to know what my options are with ChromaDB, or whether I need to switch to a different database. Any help resolving this issue would be appreciated!

 File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 613, in from_documents
    return cls.from_texts(
           ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 577, in from_texts
    chroma_collection.add_texts(texts=texts, metadatas=metadatas, ids=ids)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 205, in add_texts
    [embeddings[idx] for idx in non_empty_ids] if embeddings else None
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/langchain/vectorstores/chroma.py", line 205, in <listcomp>
    [embeddings[idx] for idx in non_empty_ids] if embeddings else None
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 0

Python Code

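from langchain.document_loaders import OnlinePDFLoader
from langchain.embeddings import HuggingFaceHubEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# access_token and document (the PDF location) are defined elsewhere
embeddings = HuggingFaceHubEmbeddings(huggingfacehub_api_token=access_token)
loader = OnlinePDFLoader(document)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
db = Chroma.from_documents(texts, embeddings, persist_directory="./chroma_db")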

1 answer

Jon M

I don't know whether the file is too big for Chroma; my files are always smaller. However, a chunk size of 300 is quite small and is likely to compromise your ability to search with enough document context later.

You might want to increase that to at least 512. If the number of chunks is what causes the problem for Chroma, a larger chunk size might avoid it, and you would get better search results as well.
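To get a feel for how much the chunk size changes the number of chunks, here is a rough back-of-the-envelope sketch. It assumes idealized fixed-width chunks, which is only an approximation of what CharacterTextSplitter does (it splits on a separator, so real counts will differ). The helper name estimate_chunk_count and the figure of about 3,000 characters per page are assumptions made up for illustration, not measurements from your PDF.

```python
def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Estimate how many fixed-width chunks a text of total_chars characters yields.

    Hypothetical helper: an idealized model, not what any splitter is
    guaranteed to produce.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if total_chars <= 0:
        return 0
    if total_chars <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - chunk_overlap) characters.
    step = chunk_size - chunk_overlap
    remaining = total_chars - chunk_size
    return 1 + -(-remaining // step)  # integer ceiling division


if __name__ == "__main__":
    # Assumption: a 1000-page PDF with roughly 3,000 characters per page.
    assumed_total_chars = 1000 * 3000
    for size in (300, 512, 1000):
        print(f"chunk_size={size}: about {estimate_chunk_count(assumed_total_chars, size)} chunks")
```

Under those assumptions, going from 300 to 512 drops the estimate from 10,000 chunks to 5,860, so it is a cheap thing to try before switching databases.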