I am trying to run inference with a local LLM.
I need to use multiple GPUs (8x Quadro RTX 8000), so I tried LangChain with vLLM, because when I used LangChain with a Hugging Face pipeline on multiple GPUs I ran into many errors (and did not have time to fix them).
Using a Hugging Face repo model with vLLM works fine, but when I changed the Hugging Face model_id to a local model path, vLLM looked the model up on the Hugging Face Hub and I got the error "does not appear to have a file named config.json. Checkout huggingface repo/None for available files". It seems vLLM tried to find my local model path in the Hugging Face repository, where it obviously does not exist, hence the error.
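The failure does not depend on the retrieval pipeline; it can be reproduced with just the LLM initialization. A minimal sketch (the repo id and local path below are placeholders, not my actual model):

from langchain.llms import VLLM

# works: a Hugging Face Hub repo id (placeholder id)
llm = VLLM(model="some-org/some-model", tensor_parallel_size=2)

# fails: a local directory path (placeholder path) -> raises the
# "does not appear to have a file named config.json" error described above
llm = VLLM(model="/home/account/somewhere/models/model", tensor_parallel_size=2)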
Here is the relevant part of my source code.
from fastapi import FastAPI, Request, Form
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
import os
from time import time
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains import RetrievalQA
import torch
from langchain.llms import VLLM
# load the local vector store and build a compression retriever
embedding_id = "intfloat/multilingual-e5-large"
embeddings = HuggingFaceEmbeddings(model_name=embedding_id)
docsearch = FAISS.load_local("./faiss_db_{}".format(embedding_id), embeddings)
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.80)
compression_retriever = ContextualCompressionRetriever(base_compressor=embeddings_filter,
                                                       base_retriever=docsearch.as_retriever())
llm = VLLM(model="/home/account/somewhere/models/model",  # local model path
           tensor_parallel_size=2,
           trust_remote_code=True,
           max_new_tokens=2048,
           top_k=50,
           top_p=0.01,
           temperature=0.01,
           repetition_penalty=1.5,
           stop=stop_word  # stop words defined elsewhere
           )
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=compression_retriever)
st = time()
prompt = "questions"
response = qa.run(query=prompt)
et = time()
print(prompt)
print('>', response)
print('>', et-st, 'sec consumed. ')
Is there a way to use a local model with LangChain + vLLM? Or any other way to run inference on multiple GPUs with LangChain?