I have the following code, which loads my PDF file, generates embeddings, and stores them in a vector DB. I can then use it to perform searches.
The issue is that every time I run it, the embeddings are regenerated and stored in the DB alongside the ones already created.
I'm trying to figure out how to load an existing vector DB into LangChain, rather than recreating it every time the app runs.
load it
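# Import paths as in older LangChain releases; newer versions may differ.
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import GooglePalmEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import DocArrayHnswSearch

def load_embeddings(store, file):
    # delete the dir
    # shutil.rmtree(store) # I have to delete it or it just loads double data
    loader = PyPDFLoader(file)
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    pages = loader.load_and_split(text_splitter)
    return DocArrayHnswSearch.from_documents(
        pages, GooglePalmEmbeddings(), work_dir=store + "/", n_dim=768
    )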
use it
db = load_embeddings("linda_store", "linda.pdf")
embeddings = GooglePalmEmbeddings()
query = "Have I worked with Oauth?"
embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
for i in range(len(docs)):
    print(i, docs[i])
issue
This works fine, but if I run it again it just loads the file into the vector DB again. I want it to use the DB once I have created it, not create it again.
I can't seem to find a method for loading it. I tried:
db = DocArrayHnswSearch.load("hnswlib_store/", embeddings)
But that's a no-go.
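Your load_embeddings function is recreating the database every time you call it. Here's why:
1. You're loading from PyPDFLoader every time.
2. from_documents(documents, embedding, **kwargs) embeds the documents it is given and adds them to the store, so each run appends another copy of the same chunks.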
Instead, you can try this:
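A minimal sketch of this approach, assuming the older langchain.embeddings and langchain.vectorstores import paths, the linda_store/ directory from the question, and 1536 dimensions for OpenAI's default embedding model. The helper name open_vector_store and the optional vector_store parameter are illustrative choices, not part of LangChain:

```python
WORK_DIR = "linda_store/"  # assumption: the store directory from the question
N_DIM = 1536  # assumption: vector size of OpenAI's text-embedding-ada-002


def open_vector_store(work_dir=WORK_DIR, n_dim=N_DIM):
    """Open the store in work_dir without passing in any documents."""
    # Imported lazily so this module can be imported without LangChain installed.
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import DocArrayHnswSearch

    # from_params takes the embedding, work directory, and dimension only.
    return DocArrayHnswSearch.from_params(OpenAIEmbeddings(), work_dir, n_dim)


def query_vector_store(query, vector_store=None):
    """Return the stored documents most similar to query."""
    if vector_store is None:
        vector_store = open_vector_store()
    return vector_store.similarity_search(query)
```

Passing vector_store explicitly lets you open the store once and reuse it across queries instead of reopening it on every call.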
I am using OpenAIEmbeddings() here, but the same code should apply to GooglePalmEmbeddings(); just make sure you update the value of the dimension.
1. DocArrayHnswSearch.from_params
We're using DocArrayHnswSearch.from_params instead to load embeddings from the store (see here). This method does not expect the documents.
2. We're using our vector_store to perform similarity search
As you can see from the query_vector_store(query: str) function above, we're not re-loading the documents from the PDF loader every time. Instead, we're just passing in our embeddings, work directory, and dimensions.
3. Usage
You can use the method as such: query_vector_store('YOUR_QUERY').
If you print the results with the for loop from your question, you'll see the documents sorted by most similar.
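The store still has to be populated once before it can be reopened. Here is a rough sketch of a build-once guard, assuming that a non-empty work directory means the store was already built (a heuristic, not something LangChain guarantees). store_is_populated and load_or_create are hypothetical helper names, and build and open_existing are callables you supply:

```python
from pathlib import Path


def store_is_populated(work_dir):
    """Heuristic: the directory exists and contains at least one entry."""
    path = Path(work_dir)
    return path.is_dir() and any(path.iterdir())


def load_or_create(work_dir, build, open_existing):
    """Call open_existing() if the store looks populated, otherwise build()."""
    if store_is_populated(work_dir):
        return open_existing()
    return build()


# Hypothetical usage with the question's names:
# db = load_or_create(
#     "linda_store",
#     build=lambda: load_embeddings("linda_store", "linda.pdf"),
#     open_existing=lambda: DocArrayHnswSearch.from_params(
#         GooglePalmEmbeddings(), "linda_store/", 768
#     ),
# )
```

With a guard like this, the PDF should only be embedded on the first run; later runs would take the open_existing path.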
I hope this helps!