how access the content of pdf file using page number in langchain javascript

386 views Asked by At

I'm making pdf assistant chatbot using next js , langchain and pincone DB. I'm able to get the content perfectly but when I'm trying to access the page content pdf says it can't access the page content. I need to access the page content using page number

To do this I have extracted all pages of the pdf using pdfjs and put metadata in each page as page number and use another llmchain to results and still facing the same problem

this is vector-store.ts

 try {
    const embeddings = new OpenAIEmbeddings();
    const index = client.Index(env.PINECONE_INDEX_NAME);

    // Add metadata to each document
    const docsWithMetadata = docs.map((doc, index) => {
      return {
        ...doc,
        metadata: {
          page_number: index, // Assuming page numbers start from 1
        },
      };
    });

    // Embed the PDF documents and store them in the vectorStore
    await PineconeStore.fromDocuments(docsWithMetadata, embeddings, {
      pineconeIndex: index,
      textKey: "text",
    });
  } catch (error) {
    console.log("error ", error);
    throw new Error("Failed to load your docs !");
  }
}
export async function getVectorStore(client: PineconeClient, pageFilter: string) {
  try {
    const embeddings = new OpenAIEmbeddings();
    const index = client.Index(env.PINECONE_INDEX_NAME);
    const pageFilterArray = JSON.parse(pageFilter);
    const filter = pageFilterArray.length
    ? { page_number: { $in: pageFilterArray } }
    : undefined;

    console.log("The metadata filter sent ", filter);

    const vectorStore = await PineconeStore.fromExistingIndex(embeddings, {
      pineconeIndex: index,
      textKey: "text",
      filter,
    });

    return vectorStore;
  } catch (error) {
    console.log("error ", error);
    throw new Error("Something went wrong while getting vector store !");
  }
}

this is my prompt-template

// Creates a standalone question from the chat-history and the current question
//@ts-nocheck
import {allPages, data} from "../scripts/output"

export const STANDALONE_QUESTION_TEMPLATE = `Below is a summary of the conversation so far, and a new question asked by the user that needs to be answered by searching in a knowledge base.
Generate a search query based on the conversation and the new question.

Chat History:
{chat_history}

Question:
{question}

Search query:`;

// Actual question you ask the chat and send the response to client
export const QA_TEMPLATE = `You are an enthusiastic AI coding assistant.

Answer with the facts listed in the context below. If there isn't enough information below, say you don't know.

{context}

Question: {question}
Helpful answer in markdown`;


export const METADATA_GENERATOR_TEMPLATE =`
You are an expert text classifier. Your job is to generate an array of strings that are within the "context" that best match with the Input question.
First check for text-similarity match, if not fallback to semantic matching. 


Context:
"""
${data}
"""

Question: {question}
Helpful answer in markdown:
Examples for QA on page page number:
Input: Can you summarize the document on page number?
Output: ["page number"]
`

this is data.ts

export const data = [
  {
    metadata: { page_number: 1 },
    pageContent:
      "Installation Guide  Atlas Bio Series Access Control Panels  www.zktecousa.com  Ver 2.0 ",
  },
  {
    metadata: { page_number: 2 },
    pageContent:
      "2 Atlas Bio Series Access Control Panels Installation Guide  CONTENTS  What’s in the Box   .......................................................................   3 Optional Accessories   
  },
  {
    metadata: { page_number: 3 },
    pageContent:
      "3 Atlas Bio Series Access Control Panels Installation Guide  What’s in the Box  4 Diodes 4 Screws & Anchors   1 Screwdriver  COMM RUN GND PWR NC COM NO SEN GND BUT NC COM NO SEN GND BUT NC 
  },
  {
    metadata: { page_number: 4 },
    pageContent:
      "4 Atlas Bio Series Access Control Panels Installation Guide  Optional Accessories  Wiegand 
  },
  {
    metadata: { page_number: 5 },
    pageContent:
      "5 Atlas Bio Series Access Control Panels Installation Guide  Safety Precautions The following precautions are to keep user’s safe and prevent any damage. Please read carefully 
  },

this is my langchain.ts

export async function callChain({ question, chatHistory }: callChainArgs) {
  try {
    const sanitizedQuestion = question.trim().replaceAll("\n", " ");
    console.log(sanitizedQuestion, "Question");

    const pineconeClient = await getPineconeClient();



    const { stream, handlers } = LangChainStream({
      experimental_streamData: true,
    });
    const data = new experimental_StreamData();

    const prompt = PromptTemplate.fromTemplate(METADATA_GENERATOR_TEMPLATE);
    const chainA = new LLMChain({ llm: nonStreamingModel, prompt });

    // The result is an object with a `text` property.
    const metadata = await chainA.call({ question: sanitizedQuestion });

    console.log("The resA ", metadata);

    const vectorStore = await getVectorStore(pineconeClient, metadata.text);


    // Create the ConversationalRetrievalQAChain with the specified models and options
    const chain = ConversationalRetrievalQAChain.fromLLM(
      streamingModel,
      vectorStore.asRetriever(),
      {
        qaTemplate: QA_TEMPLATE,
        questionGeneratorTemplate: STANDALONE_QUESTION_TEMPLATE,
        returnSourceDocuments: true,
        memory: new BufferMemory({
          memoryKey: "chat_history",
          inputKey: "question", // The key for the input to the chain
          outputKey: "text", // The key for the final conversational output of the chain
          returnMessages: true, // If using with a chat model (e.g. gpt-3.5 or gpt-4)
        }),
        questionGeneratorChainOptions: {
          llm: nonStreamingModel,
        },
      } 
    );

    // Call the chain with the sanitized question and chat history
    chain
      .call(
        {
          question: sanitizedQuestion,
          chat_history: chatHistory,
        },
        [handlers]
      )
      .then(async (res) => {
        // Extract the source documents from the response
        const sourceDocuments = res?.sourceDocuments;

        // Get the page content of the first two documents
        const firstTwoDocuments = sourceDocuments.slice(0, 2);
        const pageContents = firstTwoDocuments.map(
          ({ pageContent }: { pageContent: string }) => pageContent
        );

        // Append the page contents to the experimental_StreamData object
        data.append({
          sources: pageContents,
        });

        // Close the experimental_StreamData object
        data.close();
      });

    // Return the streaming text response
    return new StreamingTextResponse(stream, {}, data);
  } catch (e) {
    console.error(e);
    throw new Error("Call chain method failed to execute successfully!!");
  }
}

I feel like most probably I'm going wrong with my prompt template or lanchain

0

There are 0 answers