Cost of OpenAI usage in a RAG (retrieval-augmented generation) pipeline (LangChain, FAISS, OpenAI)

I am not sure how to estimate the complete cost of OpenAI usage in the RAG pipeline I am building. I want to estimate the token usage and the associated cost up front, before the calls are made. Here are some code snippets:

from langchain.callbacks import get_openai_callback
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.vectorstores import FAISS

model_name = 'text-embedding-ada-002'

embeddings = OpenAIEmbeddings(model=model_name, openai_api_key=openai_api_key)

def create_and_load_faiss_index(chunks, embeddings, index_path):
    try:
        # Create a FAISS index from documents
        db = FAISS.from_documents(chunks, embeddings)

        # Save the FAISS index locally
        db.save_local(index_path)

        # Load the FAISS index from the saved location
        db = FAISS.load_local(index_path, embeddings)

        return db

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None

db = create_and_load_faiss_index(chunks, embeddings, index_path)
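
FAISS.from_documents is where the embedding calls (and therefore the ada-002 cost) happen. The best idea I have so far for the embedding phase is to pre-count the tokens with tiktoken before building the index. A minimal sketch, assuming chunks is a list of LangChain Document objects and an ada-002 price of $0.0001 per 1K tokens (the price is my assumption and should be checked against OpenAI's pricing page):

import tiktoken

# Pre-estimate the embedding cost before FAISS.from_documents makes the API calls.
encoding = tiktoken.encoding_for_model(model_name)  # resolves to cl100k_base for ada-002
embedding_tokens = sum(len(encoding.encode(c.page_content)) for c in chunks)
print(f"Embedding tokens: {embedding_tokens}, "
      f"estimated cost: ${embedding_tokens / 1000 * 0.0001:.6f}")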

retriever = db.as_retriever()

template = """…"""

prompt_template = ChatPromptTemplate.from_template(template=template)
print(prompt_template)

llm = ChatOpenAI(model_name="gpt-4", temperature=0)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_template
    | llm
    | StrOutputParser()
)

query = f"…"

openai_output = rag_chain.invoke(query)

Track OpenAI usage (embeddings AND RAG): To estimate the usage up front, I need to calculate the cost of calling "text-embedding-ada-002" in the embedding phase and "gpt-4" in the RAG phase. For the gpt-4 call in the RAG phase I came across get_openai_callback(). Is there an equivalent for the embedding phase, or is counting the tokens myself (as in the tiktoken sketch above) the only option? Also, since gpt-4 gets its context from the FAISS index, am I right that the retrieval step itself does not consume any OpenAI tokens?

with get_openai_callback() as cb:
    openai_output = rag_chain.invoke(query)
    print(cb)

Which returns something like:

Tokens Used: 37
    Prompt Tokens: 4
    Completion Tokens: 33
Successful Requests: 1
Total Cost (USD): $7.2e-05
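
As far as I can tell, get_openai_callback() only records the calls made through the chat model, so the embedding usage from OpenAIEmbeddings never shows up in cb. To also estimate the gpt-4 prompt before invoking the chain, one option is to rebuild the prompt from the retrieved documents and count its tokens. A sketch under the assumptions that the template uses {context} and {question} placeholders and that gpt-4 input pricing is $0.03 per 1K tokens (both worth double-checking); the exact string the chain sends may differ slightly from this reconstruction:

import tiktoken

# Reconstruct (approximately) the prompt the chain will send to gpt-4.
docs = retriever.get_relevant_documents(query)
context = "\n\n".join(d.page_content for d in docs)
prompt_text = prompt_template.format(context=context, question=query)

encoding = tiktoken.encoding_for_model("gpt-4")
prompt_tokens = len(encoding.encode(prompt_text))
# Assumed gpt-4 input price of $0.03 per 1K tokens; completion tokens are
# only known after the call, so this covers the input side only.
print(f"Estimated prompt tokens: {prompt_tokens}, "
      f"estimated input cost: ${prompt_tokens / 1000 * 0.03:.4f}")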

There is 1 answer:

harufumi.abe answered:

In my case the result was not the same: I couldn't reduce the prompt tokens (on the contrary, they increased slightly, as did the response time), but the embedded prompt returned better answers.

Case 1

Non-embedded prompt:

"prompt_tokens": 3295,
"completion_tokens": 347,
"openai_process_time": 4.253575,

Embedded prompt (answered better):

"prompt_tokens": 3602,
"completion_tokens": 686,
"openai_process_time": 8.553565,

Case 2

Non-embedded prompt:

"prompt_tokens": 3355,
"completion_tokens": 347,
"openai_process_time": 4.67733,

Embedded prompt (answered better):

"prompt_tokens": 3669,
"completion_tokens": 583,
"openai_process_time": 7.52354,

Those results were with GPT-3.5, but GPT-4 did not reduce the prompt tokens either.