Performing LLM inference locally with Python (LangChain / AutoGen / AutoMemGPT) using an LLM hosted on RunPod webui (TheBloke LLMs)

I am running ehartford_dolphin-2.1-mistral-7b on an RTX A6000 machine on RunPod with the template TheBloke LLMs Text Generation WebUI.

I have two options: running the webui on RunPod, or running the HuggingFace Text Generation Inference template on RunPod.

Option 1. RunPod WebUI

I can successfully load the model in the textgen webui on RunPod from the Chat tab. I now want to access it from my Python code and run inference. Ideally, I would integrate it into LangChain and create a LangChain LLM object (a minimal sketch of what I mean follows the list below).

  • I enabled the openai and api extensions in the RunPod webui on the Settings tab
  • I currently have ports 7860, 5001 and 5000 enabled
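
With those settings, the openai extension should expose an OpenAI-compatible API on port 5001 of the pod's proxy URL. A minimal sketch of what I am aiming for (untested; the base URL is the same RunPod proxy URL as in the config further down, and the model name and generation parameters are just placeholders):

from langchain.llms import OpenAI

# Point LangChain's OpenAI wrapper at the webui's OpenAI-compatible endpoint.
# The webui does not validate the API key, so any dummy value works.
llm = OpenAI(
    openai_api_base="https://0ciol64iqvewdn-5001.proxy.runpod.net/v1",
    openai_api_key="NULL",
    model_name="ehartford_dolphin-2.1-mistral-7b",
    temperature=0.1,
    max_tokens=512,
)

print(llm("Write a function to print the numbers 1 to 10"))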

Using AutoMemGPT

I found this Python code using AutoMemGPT to access the webui endpoint:

import os
import autogen
import memgpt.autogen.memgpt_agent as memgpt_autogen
import memgpt.autogen.interface as autogen_interface
import memgpt.agent as agent       
import memgpt.system as system
import memgpt.utils as utils 
import memgpt.presets as presets
import memgpt.constants as constants 
import memgpt.personas.personas as personas
import memgpt.humans.humans as humans
from memgpt.persistence_manager import InMemoryStateManager, InMemoryStateManagerWithPreloadedArchivalMemory, InMemoryStateManagerWithEmbeddings, InMemoryStateManagerWithFaiss
import openai

config_list = [
    {
        "api_type": "open_ai",
        "api_base": "https://0ciol64iqvewdn-5001.proxy.runpod.net/v1",
        "api_key": "NULL",
    },
]

llm_config = {"config_list": config_list, "seed": 42}

# If USE_MEMGPT is False, then this example will be the same as the official AutoGen repo
# (https://github.com/microsoft/autogen/blob/main/notebook/agentchat_groupchat.ipynb)
# If USE_MEMGPT is True, then we swap out the "coder" agent with a MemGPT agent

USE_MEMGPT = True

## api keys for the memGPT
openai.api_base="https://0ciol64iqvewdn-5001.proxy.runpod.net/v1"
openai.api_key="NULL"


# The user agent
user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={"last_n_messages": 2, "work_dir": "groupchat"},
    human_input_mode="TERMINATE",  # needed?
    default_auto_reply="You are going to figure all out by your own. "
    "Work by yourself, the user won't reply until you output `TERMINATE` to end the conversation.",
)


interface = autogen_interface.AutoGenInterface()
persistence_manager=InMemoryStateManager()
persona = "I am a 10x engineer, trained in Python. I was the first engineer at Uber."
human = "Im a team manager at this company"
memgpt_agent=presets.use_preset(presets.DEFAULT_PRESET, model='gpt-4', persona=persona, human=human, interface=interface, persistence_manager=persistence_manager, agent_config=llm_config)


if not USE_MEMGPT:
    # In the AutoGen example, we create an AssistantAgent to play the role of the coder
    coder = autogen.AssistantAgent(
        name="Coder",
        llm_config=llm_config,
        system_message=f"I am a 10x engineer, trained in Python. I was the first engineer at Uber",
        human_input_mode="TERMINATE",
    )

else:
    # In our example, we swap this AutoGen agent with a MemGPT agent
    # This MemGPT agent will have all the benefits of MemGPT, ie persistent memory, etc.
    print("\nMemGPT Agent at work\n")
    coder = memgpt_autogen.MemGPTAgent(
        name="MemGPT_coder",
        agent=memgpt_agent,
    )


# Begin the group chat with a message from the user
user_proxy.initiate_chat(
    coder,
    message="Write a Function to print Numbers 1 to 10"
    )

Error


ModuleNotFoundError                       Traceback (most recent call last)
Cell In[2], line 10
      8 import memgpt.presets as presets
      9 import memgpt.constants as constants
---> 10 import memgpt.personas.personas as personas
     11 import memgpt.humans.humans as humans
     12 from memgpt.persistence_manager import InMemoryStateManager, InMemoryStateManagerWithPreloadedArchivalMemory, InMemoryStateManagerWithEmbeddings, InMemoryStateManagerWithFaiss

ModuleNotFoundError: No module named 'memgpt.personas.personas'

What I tried to solve this error

  • pip install --upgrade pymemgpt -- does not change the error
  • pip install pymemgpt==0.1.3 -- I get openai version conflicts
  • pip install -e . after cloning the MemGPT repository -- another error (a quick check of what the installed package actually exposes is sketched below)
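
Since the failing import is memgpt.personas.personas, a small debugging sketch that lists what the pip-installed memgpt package actually exposes might help pin down which version still has that module layout:

import pkgutil

import memgpt

# Print the installed version (if exposed) and the package's top-level submodules
print(getattr(memgpt, "__version__", "no __version__ attribute"))
for mod in pkgutil.iter_modules(memgpt.__path__):
    print(mod.name)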

What I need

  • I always get version conflicts between openai, llama-index, pymemgpt, pyautogen and numpy, so a known-working set of versions for this code would be great; otherwise, any advice is welcome (the snippet below prints the currently installed versions for comparison)
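
This is the snippet mentioned above; it only reports the currently installed versions of the packages that keep conflicting (assuming these are the right PyPI names), so a working combination can be pinned once one is found:

from importlib import metadata

# Report which versions of the conflicting packages are currently installed
for pkg in ("openai", "llama-index", "pymemgpt", "pyautogen", "numpy"):
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")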

Option 2. Using HuggingFace Text Generation Inference

So instead of loading the TheBloke LLMs template that runs the webui on RunPod, I followed a guide that uses the HuggingFace Text Generation Inference template.

Current code

import runpod

gpu_count = 1

pod = runpod.create_pod(
    name="Llama-7b-chat",
    image_name="ghcr.io/huggingface/text-generation-inference:0.9.4",
    gpu_type_id="NVIDIA RTX A4500",
    data_center_id="EU-RO-1",
    cloud_type="SECURE",
    docker_args="--model-id TheBloke/Llama-2-7b-chat-fp16",
    gpu_count=gpu_count,
    volume_in_gb=50,
    container_disk_in_gb=5,
    ports="80/http,29500/http",
    volume_mount_path="/data",
)
pod

from langchain.llms import HuggingFaceTextGenInference

inference_server_url = f'https://{pod["id"]}-80.proxy.runpod.net'
llm = HuggingFaceTextGenInference(
    inference_server_url=inference_server_url,
    max_new_tokens=1000,
    top_k=10,
    top_p=0.95,
    typical_p=0.95,
    temperature=0.1,
    repetition_penalty=1.03,
)
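
Once the pod reports it is running, the object behaves like any other LangChain LLM, e.g.:

# Simple smoke test against the TGI endpoint
print(llm("Write a Python function that prints the numbers 1 to 10."))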

It works well with Llama 2, but I cannot make it work with other LLMs that need a lot of configuring in the webui before running, for example Falcon or Mixtral, where I have to change several parameters in the webui manually.

What I need

  • A way to run this code with any LLM by setting model parameters, settings, etc. programmatically instead of in the RunPod webui (a rough sketch of the idea is below)
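
What I imagine is something like the sketch below: keep per-model launch settings in a dict and build the TGI docker_args string programmatically, so nothing has to be configured in the webui. The flag names are standard text-generation-inference launcher flags, but the per-model values here are guesses and would need to be checked against each model card:

import runpod

# Hypothetical per-model launch settings; the flag values are assumptions
MODEL_CONFIGS = {
    "TheBloke/Llama-2-7b-chat-fp16": {},
    "tiiuae/falcon-7b-instruct": {
        "--trust-remote-code": None,      # boolean flag, passed without a value
        "--max-input-length": "2000",
        "--max-total-tokens": "2048",
    },
}

def build_docker_args(model_id: str) -> str:
    # Build the docker_args string for the TGI container from the config dict
    args = [f"--model-id {model_id}"]
    for flag, value in MODEL_CONFIGS.get(model_id, {}).items():
        args.append(flag if value is None else f"{flag} {value}")
    return " ".join(args)

pod = runpod.create_pod(
    name="tgi-falcon-7b-instruct",
    image_name="ghcr.io/huggingface/text-generation-inference:0.9.4",
    gpu_type_id="NVIDIA RTX A4500",
    data_center_id="EU-RO-1",
    cloud_type="SECURE",
    docker_args=build_docker_args("tiiuae/falcon-7b-instruct"),
    gpu_count=1,
    volume_in_gb=50,
    container_disk_in_gb=5,
    ports="80/http,29500/http",
    volume_mount_path="/data",
)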