Hugging Face model deployment

My question is about how one deploys a Hugging Face model. I recently downloaded the Falcon 7B Instruct model and ran it in Colab, but when I load the model and ask it to generate text, it takes about 40 seconds to produce an output. How are these models deployed in production so that they respond with low latency? I am new to MLOps and just want to explore. Also, what would it cost to deploy such a model? And what if many users are using the model simultaneously; how would I handle that? I would greatly appreciate a response.

The code I am using is from https://huggingface.co/tiiuae/falcon-7b-instruct.

Also, I am saving the model weights locally in Google Drive.
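For reference, this is roughly the save/load pattern I am describing (the Drive path is just a placeholder):

from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

drive.mount("/content/drive")
save_dir = "/content/drive/MyDrive/falcon-7b-instruct"  # placeholder path

# First session: download from the Hub once and save the weights to Drive
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Later sessions: load straight from Drive instead of re-downloading
model = AutoModelForCausalLM.from_pretrained(save_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(save_dir)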

There are 2 answers

SilentCloud
  • I was just wondering how we deploy these models in production then so that it gives us output with low latency.

You can download the model and use it locally to avoid any latency related to the Internet connection. Note that your input has to be processed, so it is normal for a response to take some time. To make it as quick as possible, run the model on GPUs (typically dozens of times faster than CPUs).
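As a rough sketch (the dtype and device settings are assumptions you may need to adjust for your hardware), loading the model on a GPU in half precision is usually what brings generation from tens of seconds down to a few seconds:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision: less memory, faster matmuls
    device_map="auto",           # place the weights on the available GPU
    trust_remote_code=True,
)

inputs = tokenizer("Write a haiku about MLOps.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))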

  • Also, what will be the charges of deploying that model?

The model weights themselves are free to download and use; you only pay for the compute (e.g. GPU instances) you run them on.

  • What if many users are simultaneously using this model?

I have never experienced issues of this kind, but if you want to avoid any problems, once again you can download the model and load it locally.
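If you do need to serve several users at once, the usual pattern is to load the model once and put it behind an HTTP endpoint so every request shares the same copy of the weights. A minimal sketch (the model ID and port are assumptions; a production setup would add batching or a dedicated inference server such as text-generation-inference):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once at startup; all requests reuse the same weights
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device_map="auto",
    trust_remote_code=True,
)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens, do_sample=True)
    return {"generated_text": out[0]["generated_text"]}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000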

Khaliladib11

It depends on where you want to deploy your model. It is really easy to do on AWS SageMaker:

import json

import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

PROFILE_NAME = ""
ENDPOINT_NAME = ""
ROLE = ""

boto_session = boto3.Session(profile_name=PROFILE_NAME, region_name="us-east-1")
sagemaker_session = Session(boto_session=boto_session)

# get the huggingface llm image
llm_img = get_huggingface_llm_image_uri(
    backend="huggingface",
    session=sagemaker_session,
    version="0.9.3"
)

# define the model deployment configuration
deploy_config = {
    'HF_MODEL_ID': "tiiuae/falcon-7b", # model_id from hf.co/models
    'SM_NUM_GPUS': json.dumps(1), # Number of GPU used per replica
    'MAX_INPUT_LENGTH': json.dumps(3072),  # Max length of input text
    'MAX_TOTAL_TOKENS': json.dumps(4096),  # Max length of the generation (including input text)
    'MAX_BATCH_TOTAL_TOKENS': json.dumps(8192),  # Limits the number of tokens that can be processed in parallel during the generation
    # ,'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=ROLE,
    image_uri=llm_img,
    env=deploy_config,
    sagemaker_session=sagemaker_session
)

# Deploy model to an endpoint
instance_type = "ml.g5.2xlarge"
health_check_timeout = 300

llm_endpoint = llm_model.deploy(
    endpoint_name=ENDPOINT_NAME,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout, # give the endpoint 5 minutes to load the model
)
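Once the endpoint is up, you can call it through the returned predictor. A hedged usage sketch (the payload format is what the Hugging Face LLM container expects; the parameter values are only illustrative):

response = llm_endpoint.predict({
    "inputs": "Explain what MLOps is in one paragraph.",
    "parameters": {
        "max_new_tokens": 128,
        "temperature": 0.7,
        "do_sample": True,
    },
})
print(response[0]["generated_text"])

# You are billed per hour while the ml.g5.2xlarge instance is running,
# so delete the endpoint when you are done:
# llm_endpoint.delete_endpoint()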