I have developed a small app based on LangChain and Streamlit, where the user can ask questions about PDF files. The code is below:
from dotenv import load_dotenv
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
def main():
    load_dotenv()
    st.set_page_config(page_title="Ask your PDF")
    st.header("Ask your PDF ")

    # upload file
    pdf = st.file_uploader("Upload your PDF", type="pdf")

    # extract the text
    if pdf is not None:
        pdf_reader = PdfReader(pdf)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()

        # split into chunks
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=500,
            chunk_overlap=100,
            length_function=len
        )
        chunks = text_splitter.split_text(text)

        # create embeddings
        embeddings = OpenAIEmbeddings()
        knowledge_base = FAISS.from_texts(chunks, embeddings)

        # show user input
        user_question = st.text_input("Ask a question about your PDF:")
        if user_question:
            docs = knowledge_base.similarity_search(user_question)

            llm = OpenAI()
            chain = load_qa_chain(llm)
            with get_openai_callback() as cb:
                response = chain.run(input_documents=docs, question=user_question)
                print(cb)

            st.write(response)


if __name__ == '__main__':
    main()
Can someone suggest how I can retrieve or render the page of the PDF from which the answer or information was extracted? I came across this but wasn't able to implement it properly.
Here is a simple approach.
Once we get the response, we compare it with the content of each page, which we saved separately while reading the PDF. The idea is to find which page has the highest similarity to the response. It can be page 1, page 2, etc.
Sort the pages by similarity score and take the page with the highest score.
Now generate an image for every page using the pdf2image library. We are going to show the content of the page as an image. Other methods would also work, since we already have the content of the page, but in this approach I show the image via the Streamlit image widget. Now that we have a list of images, get the index that corresponds to the page that we want to show.
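The image step could be sketched roughly as follows. This is a minimal sketch, not necessarily the code from the original answer: the helper names `page_number_to_index` and `show_pdf_page` are made up for illustration, and it assumes pdf2image (with the poppler utilities it depends on) and Streamlit are installed.

```python
def page_number_to_index(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to a 0-based list index, with bounds checking."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def show_pdf_page(pdf_bytes: bytes, page_number: int) -> None:
    """Render one page of the uploaded PDF as an image in the Streamlit app.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Imported lazily so the index helper above works without these packages.
    import streamlit as st
    from pdf2image import convert_from_bytes  # requires poppler on the system

    images = convert_from_bytes(pdf_bytes)  # one PIL image per page, in page order
    index = page_number_to_index(page_number, len(images))
    st.image(images[index], caption=f"Page {page_number}")
```

In the app from the question, this would be called as something like `show_pdf_page(pdf.getvalue(), best_page)`, where `pdf` is the object returned by `st.file_uploader` and `best_page` is the 1-based page number chosen by the similarity step.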
Here is the code to get the similarity score between page content and response.
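As a rough illustration of the scoring and sorting steps, here is a dependency-free sketch. Note the assumptions: it swaps in a plain bag-of-words cosine similarity as a stand-in for the embedding-based similarity the answer refers to, `page_texts` is assumed to be a list holding the extracted text of each page (for example `[page.extract_text() for page in pdf_reader.pages]`), and the function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(page_texts: list[str], response: str) -> list[tuple[int, float]]:
    """Return (page_number, score) pairs, 1-based, sorted by descending score."""
    response_vec = tokenize(response)
    scores = [
        (number, cosine_similarity(tokenize(text), response_vec))
        for number, text in enumerate(page_texts, start=1)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up page contents, purely for demonstration.
pages = [
    "Introduction to the company and its history.",
    "Revenue grew by ten percent in the last fiscal year.",
    "Contact details and office locations.",
]
ranking = rank_pages(pages, "The revenue grew ten percent last year.")
best_page, best_score = ranking[0]
print(best_page, round(best_score, 3))
```

Word overlap only works when the response reuses the page's wording; an embedding model is more robust when the LLM paraphrases, which is why the answer uses the OpenAI API or sentence-transformers for this step.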
Sample output
You can download a sample PDF from my Google Drive.
Full code
Use similarity from the OpenAI API.
Use sentence-transformers for similarity.
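For the sentence-transformers variant, a minimal sketch might look like the following. The model name `all-MiniLM-L6-v2` is an assumption (any sentence-transformers model should do), and the function names are illustrative rather than taken from the original answer. The encoder is passed in as a callable so that the ranking logic can be tried without downloading a model.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Encoder = Callable[[list], list]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two dense vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar_page(page_texts: Sequence[str], response: str, encode: Encoder) -> int:
    """Return the 1-based number of the page whose embedding is closest to the response."""
    if not page_texts:
        raise ValueError("page_texts is empty")
    # Encode all pages and the response in one batch; the response is last.
    *page_vecs, response_vec = encode(list(page_texts) + [response])
    scores = [cosine(vec, response_vec) for vec in page_vecs]
    return scores.index(max(scores)) + 1


def sentence_transformer_encoder(model_name: str = "all-MiniLM-L6-v2") -> Encoder:
    """Build an encoder backed by sentence-transformers (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    # encode() returns a 2-D NumPy array by default; convert it to plain lists.
    return lambda texts: model.encode(texts).tolist()
```

In the app this would be used as `most_similar_page(page_texts, response, sentence_transformer_encoder())`. Building the encoder once and reusing it avoids reloading the model on every question.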
Solution 2
This uses PyMuPDF to save both the text and an image of each page. The uploaded file can come from anywhere, not necessarily from the location of the Streamlit script, because while saving the text of each PDF page we also save the page images as bytes.
It also uses sentence-transformers to measure the similarity of two text strings, which is what we need to compare page content with the response.
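The PyMuPDF part of this solution could be sketched as below. This is a sketch under assumptions, not the original full code: the function names and the dictionary layout are illustrative, and it assumes the PyMuPDF package (imported as `fitz`) is installed.

```python
def extract_pages(pdf_bytes: bytes) -> list[dict]:
    """Return one dict per page with its 1-based number, text, and PNG image bytes."""
    import fitz  # PyMuPDF; imported lazily so the lookup helper works without it

    pages = []
    # Opening from a byte stream means the PDF never has to exist on disk.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for number, page in enumerate(doc, start=1):
            pages.append({
                "page": number,
                "text": page.get_text(),
                "image": page.get_pixmap().tobytes("png"),
            })
    return pages


def image_for_page(pages: list[dict], page_number: int) -> bytes:
    """Look up the stored PNG bytes for a 1-based page number."""
    for entry in pages:
        if entry["page"] == page_number:
            return entry["image"]
    raise KeyError(f"no page {page_number}")
```

With the uploaded file from the question, `pages = extract_pages(pdf.getvalue())` collects everything in one pass, the `"text"` values feed the similarity comparison, and `st.image(image_for_page(pages, best_page))` shows the matching page.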
Full code