I have a list of texts stored on disk that I load with pickle. The texts amount to around 400K paragraphs, and when I load them all into RAM to build a FAISS DB, the memory explodes. Is there a way to do this incrementally? Below is my current minimal working code:
import os
import pickle

from dotenv import load_dotenv
from tqdm import tqdm
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_KEY")


def get_all_docs_from_chunks():
    """Creates one list of all the chunks (RAM explodes here)."""
    all_docs = []
    chunks_files = sorted(
        [f for f in os.listdir("chunks") if f.endswith(".pkl")],
        key=lambda x: int(x.split(".")[0]),
    )
    for chunk_file in tqdm(chunks_files, desc="Loading Chunks"):
        with open(f"chunks/{chunk_file}", "rb") as file:
            all_docs.extend(pickle.load(file))
    return all_docs


if __name__ == "__main__":
    all_split_docs = get_all_docs_from_chunks()  # code gets stuck here and eventually dies around iteration 3000/400K
    print("--got all docs--")
    embeddings = OpenAIEmbeddings()
    filename = "faiss_openai_embeddings"
    print("-- making embeddings --")
    db = FAISS.from_documents(all_split_docs, embeddings)
    db.save_local(filename)
    print("-- embeddings saved --")
Any help would be highly appreciated. I am fine with loading, say, 100 chunks, making their embeddings, and then updating the index with the next batch, but I can't find LangChain FAISS documentation that shows how to do this.
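One possible approach, sketched under the assumption that the chunks/ layout and pickle format from the code above hold: build the index from the first chunk file with FAISS.from_documents, then extend it with add_documents() one file at a time, so only one pickle's worth of documents is in RAM at any point.

import os
import pickle

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embeddings = OpenAIEmbeddings()
db = None

chunks_files = sorted(
    [f for f in os.listdir("chunks") if f.endswith(".pkl")],
    key=lambda x: int(x.split(".")[0]),
)

for chunk_file in chunks_files:
    with open(f"chunks/{chunk_file}", "rb") as file:
        docs = pickle.load(file)  # only this file's documents are in memory
    if db is None:
        # create the index from the first batch
        db = FAISS.from_documents(docs, embeddings)
    else:
        # embed and append each subsequent batch to the existing index
        db.add_documents(docs)
    del docs  # let the batch be garbage-collected before loading the next file

db.save_local("faiss_openai_embeddings")

Alternatively, LangChain's FAISS wrapper has a merge_from() method, so you could build a small index per chunk file and fold each one into a running index; the add_documents() route above avoids the extra intermediate indexes.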
There's a logger call in the embeddings function; if you comment that out, execution runs in seconds.