Dumping embeddings into a FAISS DB in LangChain causes RAM to explode


I have a list of texts stored as pickled chunk files and am loading them with pickle. The texts are around 400K paragraphs, and when I load them all into RAM to build the FAISS DB, memory explodes. Is there a way to do this incrementally? Below is my current minimal working code:

import os
import pickle
from dotenv import load_dotenv
from tqdm import tqdm
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_KEY")


def get_all_docs_from_chunks():
    """Creates one list of all the chunks (RAM explodes here)."""
    all_docs = []
    chunks_files = sorted(
        [f for f in os.listdir("chunks") if f.endswith(".pkl")],
        key=lambda x: int(x.split(".")[0]),
    )

    for chunk_file in tqdm(chunks_files, desc="Loading Chunks"):
        with open(f'chunks/{chunk_file}', 'rb') as file:
            all_docs.extend(pickle.load(file))
    return all_docs


if __name__ == '__main__':
    all_split_docs = get_all_docs_from_chunks()  # code gets stuck here and eventually stops around iteration 3000/400K
    print("--got all docs--")
    embeddings = OpenAIEmbeddings()

    filename = "faiss_openai_embeddings"
    print("-- making embeddings --")
    db = FAISS.from_documents(all_split_docs, embeddings)
    db.save_local(filename)
    print("-- embeddings saved --")


Any help would be highly appreciated. I am OK with taking, say, 100 chunks at a time, making their embeddings, and then updating the index further, but I can't find LangChain FAISS documentation that does this. A sketch of what I have in mind is below.
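For reference, this is roughly the incremental approach I am imagining (untested sketch; it assumes each .pkl file holds a list of LangChain Document objects, as in my code above, and that the FAISS vector store's add_documents method can be used to append to an existing index):

import os
import pickle
from dotenv import load_dotenv
from tqdm import tqdm
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_KEY")


def iter_chunk_files():
    """Yield the documents from one pickle file at a time, so only a
    single chunk file is ever held in memory."""
    chunks_files = sorted(
        [f for f in os.listdir("chunks") if f.endswith(".pkl")],
        key=lambda x: int(x.split(".")[0]),
    )
    for chunk_file in chunks_files:
        with open(f'chunks/{chunk_file}', 'rb') as file:
            yield pickle.load(file)


if __name__ == '__main__':
    embeddings = OpenAIEmbeddings()
    db = None

    for docs in tqdm(iter_chunk_files(), desc="Indexing Chunks"):
        if db is None:
            # Build the index from the first batch of documents.
            db = FAISS.from_documents(docs, embeddings)
        else:
            # Embed and append each subsequent batch to the existing index
            # (assumption: add_documents is the right call for this).
            db.add_documents(docs)

    db.save_local("faiss_openai_embeddings")

If partial progress matters, I assume I could also call db.save_local periodically inside the loop, or build one small index per chunk file and combine them with merge_from, but I am not sure which is the intended pattern.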

1 Answer

Rishab Jain:

There's a logger call in the embeddings function; if you comment it out, execution will run in seconds.