How to handle multiple parallel requests in FastAPI for an ML model API


I have built an ML model API using FastAPI, and the API mostly relies on GPU computation. I want the API to serve at least some number of parallel requests. To achieve this, I made all the endpoint functions def instead of async def so that the requests can be handled concurrently (as mentioned here, here and here). Currently, a single request takes 3 seconds to return its output, but when three parallel requests are made, all users receive their output after 9 seconds. All users get the output at the same time, but the latency grows with the number of requests. What I actually want is for every user to get the output in about 3 seconds.

I have tried several approaches, such as ThreadPoolExecutor (here), ProcessPoolExecutor (here), asyncio (here), and run_in_threadpool (here), but none of them worked for me (a simplified sketch of the run_in_threadpool attempt is shown after the code below).

This is how my API code looks with a simple def:

from fastapi import Depends, FastAPI, File, UploadFile, Response
import torch
import uvicorn


class Model_loading():
    def __init__(self):
        # load the model once at startup so every request reuses the same instance
        self.model = torch.load('model.pth')


app = FastAPI()
model_instance = Model_loading()


def gpu_based_processing(x):
    # ---- doing some GPU-based computation ----
    return result


@app.post('/model-testing')
def my_function(file: UploadFile = File(...)):
    # ---- doing some initial preprocessing ----
    output = gpu_based_processing(x)
    return Response(content=output, media_type="image/jpg")
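
For reference, the run_in_threadpool variant I tried looked roughly like this. It is only a simplified sketch: it reuses the gpu_based_processing function from the code above, the endpoint path is a placeholder, and the preprocessing / x placeholder stands in for my actual pipeline:

from fastapi.concurrency import run_in_threadpool


@app.post('/model-testing-threadpool')
async def my_function_threadpool(file: UploadFile = File(...)):
    # ---- doing some initial preprocessing ----
    # offload the blocking GPU call to the threadpool so the event loop stays free
    output = await run_in_threadpool(gpu_based_processing, x)
    return Response(content=output, media_type="image/jpg")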

Additionally, I have observed that making 20 parallel requests to the above API results in a CUDA out-of-memory error, so it cannot handle even 20 concurrent requests. The way I issue the parallel requests is sketched below. How can I address the CUDA memory issue and handle multiple parallel requests at the same time?
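
This is roughly how I fire the parallel requests from the client side (the URL, image path, and request count are just placeholders for my actual test setup):

import concurrent.futures
import requests

URL = "http://localhost:8000/model-testing"  # placeholder for my actual host/port


def send_request(path):
    # each worker posts one image to the /model-testing endpoint
    with open(path, "rb") as f:
        return requests.post(URL, files={"file": f})


with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    # 20 concurrent requests; this is where the CUDA out-of-memory error shows up
    responses = list(pool.map(send_request, ["test.jpg"] * 20))

print([resp.status_code for resp in responses])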
