Query with my own data using langchain and pinecone

I want to use LangChain to give my own context to an OpenAI GPT LLM and query my data through it. First, I'm using langchainjs to load documents from a given file path and split them into chunks. The split documents are then fed into the Pinecone database via the Pinecone library to get a vector store. That store is used to create an LLM QA chain, which I then use to query my data.

This is my current implementation:

main.js

import path from "path";

import { Document } from "langchain/document";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { PDFLoader } from "langchain/document_loaders/fs/pdf";
import { CharacterTextSplitter } from "langchain/text_splitter";

import { PineconeClient } from "@pinecone-database/pinecone";

import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";

import { OpenAI } from "langchain/llms/openai";
import { VectorDBQAChain } from "langchain/chains";

const openAIApiKey = process.env.OPEN_AI_API_KEY;

async function main(filePath) {
  // create document array
  const docs = [
    new Document({
      pageContent: `Filepath: ${filePath}`, // Document needs pageContent
      metadata: { name: `Filepath: ${filePath}` },
    }),
  ];

  // initialize loader
  const Loader = path.extname(filePath) === ".pdf" ? PDFLoader : TextLoader;

  const loader = new Loader(filePath);

  // load and split the docs
  const loadedAndSplitted = await loader.loadAndSplit();

  // push the split docs into the array
  docs.push(...loadedAndSplitted);

  // create splitter
  const textSplitter = new CharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 0,
  });

  // use the splitter to split the docs into chunks
  const splittedDocs = await textSplitter.splitDocuments(docs);

  // initialize the pinecone client and connect to the existing index
  const client = new PineconeClient();
  await client.init({
    apiKey: process.env.PINECONE_API_KEY,
    environment: process.env.PINECONE_ENVIRONMENT,
  });
  const pineconeIndex = client.Index(process.env.PINECONE_INDEX);

  // create openai embedding
  const embeddings = new OpenAIEmbeddings({ openAIApiKey });

  // create a pinecone store using the split docs and the pinecone index
  const pineconeStore = await PineconeStore.fromDocuments(
    splittedDocs,
    embeddings,
    {
      pineconeIndex,
      namespace: "my-pinecode-index",
    }
  );

  // initialize openai model
  const model = new OpenAI({
    openAIApiKey,
    modelName: "gpt-3.5-turbo",
  });

  // create a vector chain using the llm model and the pinecone store
  const chain = VectorDBQAChain.fromLLM(model, pineconeStore, {
    k: 1,
    returnSourceDocuments: true,
  });

  // use the chain to query my data
  const response = await chain.call({
    query: "Explain about the contents of the pdf file I provided.", // question is based on the file i provided
  });

  console.log(`\nResponse: ${response.text}`); 
}
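
// run the pipeline; the file path below is just a placeholder for my real file
main("./docs/sample.pdf").catch(console.error);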

Note: My Pinecone index has a dimension of 1536 because I got an error saying Vector dimension 1536 does not match the dimension of the index 1000 whenever I used a different dimension size.
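For reference, 1536 is the output dimension of OpenAI's text-embedding-ada-002, the default model behind OpenAIEmbeddings, so the index has to be created with that dimension. A minimal sketch of creating such an index with the legacy Pinecone client (the index name is a placeholder, and I'm assuming the legacy createIndex request shape here):

// sketch: create an index whose dimension matches the embedding model
await client.createIndex({
  createRequest: {
    name: "my-index", // placeholder name
    dimension: 1536, // text-embedding-ada-002 output size
    metric: "cosine",
  },
});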

The responses I get are totally unexpected. Sometimes it answers me if I ask a normal, non-trivial question, but often it's as if the model didn't get the context about my data at all. It just denies knowing even the simplest things. I got the basic idea of the implementation from the langchainjs documentation.
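To check whether the retrieval step itself is the problem, the store can also be queried directly, bypassing the LLM entirely; a minimal sketch using the pineconeStore from above:

// sketch: inspect the raw chunks the vector store returns for a query
const matches = await pineconeStore.similaritySearch(
  "Explain about the contents of the pdf file I provided.",
  4 // retrieve a few chunks instead of just one
);
console.log(matches.map((doc) => doc.pageContent));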

I tried swapping the GPT model for the text-davinci models, changing the chunk size, and recreating the Pinecone store, but that didn't help either.
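For example, a sketch of the kind of splitter variation I tried (the exact numbers varied):

// sketch: smaller chunks with some overlap so sentences aren't cut off
const altSplitter = new CharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 100,
});
const altChunks = await altSplitter.splitDocuments(docs);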

Can anyone please help me figure out what I'm doing wrong here, or suggest what I should be doing?

Any help is appreciated. Thank you.
