I have a set of 1000 documents (plain text) and one user query. I want to retrieve the top k documents that are most relevant to the user query using the Python library LangChain. Specifically, I want the system to identify the top k sentences that are the closest match to the user query, and then return the documents that contain these sentences. How can I do that?
The following code identifies the top k documents that are the closest match to the user query with Haystack. How can I change it so that, instead, it identifies the top k sentences that are the closest match to the user query with LangChain and returns the documents that contain these sentences? (I have put a rough sketch of the direction I have in mind at the bottom of this post.)
# Note: Most of the code is from https://haystack.deepset.ai/tutorials/07_rag_generator
import logging
logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
import pandas as pd
from haystack.utils import fetch_archive_from_http
# Download sample
doc_dir = "data/tutorial7/"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/small_generator_dataset.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)
# Create dataframe with columns "title" and "text"
# df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",")
df = pd.read_csv(f"{doc_dir}/small_generator_dataset.csv", sep=",", nrows=10)  # only the first 10 rows while experimenting
# Minimal cleaning
df.fillna(value="", inplace=True)
print(df.head())
from haystack import Document
# Use data to initialize Document objects
titles = list(df["title"].values)
texts = list(df["text"].values)
documents = []
for title, text in zip(titles, texts):
    documents.append(Document(content=text, meta={"name": title or ""}))
from haystack.document_stores import FAISSDocumentStore
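# "Flat" below selects an exact (brute-force) FAISS index; other FAISS index
# factory strings (e.g. "HNSW") would select approximate indexes instead.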
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)
from haystack.nodes import DensePassageRetriever
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
    use_gpu=True,
    embed_title=True,
)
# Delete existing documents in documents store
document_store.delete_documents()
# Write documents to document store
document_store.write_documents(documents)
# Add documents embeddings to index
document_store.update_embeddings(retriever=retriever)
from haystack import Pipeline
pipeline = Pipeline()
pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])
from haystack.utils import print_answers
QUESTIONS = [
    "who got the first nobel prize in physics",
    "when is the next deadpool movie being released",
]
for question in QUESTIONS:
    res = pipeline.run(query=question, params={"Retriever": {"top_k": 5}})
    print(res)
    # print_answers(res, details="all")
To run the code:
conda create -y --name haystacktest python==3.9
conda activate haystacktest
pip install --upgrade pip
pip install farm-haystack
conda install pytorch -c pytorch
pip install sentence_transformers
pip install farm-haystack[colab,faiss]==1.17.2
For example, I wonder if there is a way to amend the FAISS indexing strategy to better support sentence-level retrieval.
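In case it helps, here is the rough direction I have in mind with LangChain. It is an untested sketch: the NLTK sentence splitter, the sentence-transformers model name, and the toy raw_docs data are placeholders I chose for illustration, not things I know to be required.
import nltk
from langchain.docstore.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
nltk.download("punkt")  # tokenizer data for sentence splitting
# Placeholder for my 1000 plain-text documents: a list of (doc_id, text) pairs.
raw_docs = [
    ("doc-0", "First document. It contains several sentences."),
    ("doc-1", "Second document. More sentences live here."),
]
# Split every document into sentences and keep a pointer back to the parent document.
sentence_docs = []
for doc_id, text in raw_docs:
    for sentence in nltk.sent_tokenize(text):
        sentence_docs.append(Document(page_content=sentence, metadata={"doc_id": doc_id}))
# Index the sentences (not the whole documents) in a FAISS vector store.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(sentence_docs, embeddings)
# Retrieve the top k sentences and map them back to their parent documents.
k = 5
query = "who got the first nobel prize in physics"
top_sentences = vectorstore.similarity_search(query, k=k)
doc_ids = []
for sent in top_sentences:
    if sent.metadata["doc_id"] not in doc_ids:  # deduplicate, keep ranking order
        doc_ids.append(sent.metadata["doc_id"])
matching_docs = [text for doc_id, text in raw_docs if doc_id in doc_ids]
print(doc_ids)        # ids of the documents containing the top k sentences
print(matching_docs)  # full texts of those documents
Is this a sensible way to do sentence-level retrieval in LangChain, or is there a built-in splitter/retriever combination that already maps matched sentences back to their source documents?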