Deploy an LLM in production with FastAPI


I'm trying to deploy an LLM with memory in production using FastAPI. The problem is that when two or more people make requests at the same time, the answers get crossed over and overlap, so one requester receives another requester's answer. Any idea how I can correctly deploy an LLM so it can be consumed by multiple users? I cannot find any good tutorial or documentation about this. The idea is also to allow users to control the generation parameters, like temperature, etc.
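
The endpoint itself isn't included below, but roughly it wraps the generator in a StreamingResponse and exposes the generation parameters in the request body (the /generate route and the GenerateRequest schema here are only illustrative, not the actual code):

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

# Illustrative request schema: lets each caller pass their own session and generation parameters.
class GenerateRequest(BaseModel):
    user_id: str
    session_id: str
    question: str
    prompt: str
    clean_memory: bool = False
    max_new_tokens: int = 256
    temperature: float = 0.1
    top_k: int = 50
    top_p: float = 0.95
    typical_p: float = 1.0
    repetition_penalty: float = 1.2

@app.post("/generate")
def generate(req: GenerateRequest):
    gen = get_answersf7bsft(req.user_id, req.session_id, req.question, req.prompt,
                            clean_memory=req.clean_memory,
                            max_new_tokens_=req.max_new_tokens,
                            temperature_=req.temperature,
                            top_k_=req.top_k,
                            top_p_=req.top_p,
                            typical_p_=req.typical_p,
                            repetition_penalty_=req.repetition_penalty)
    # Stream the tokens back to the caller as they are generated.
    return StreamingResponse(gen, media_type="text/plain")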

here is part of the code:

import time
from threading import Thread

def get_answersf7bsft(user_id, session_id, question, prompt, 
                    clean_memory=False,
                    max_new_tokens_=256, 
                    temperature_=0.1,
                    top_k_=50,
                    top_p_=.95, 
                    typical_p_=1.00, 
                    repetition_penalty_=1.2):
    # Streams the model's answer token by token; the memory keeps the user's last 4 interactions.
    # If we set clean_memory=True, the memory is cleared and only the last interaction is kept.
    print("Comienza get_answersf7bsft")
    start = time.time()
    global memory

    if clean_memory:
       memory = {}
    
    # Add the new question to the memory and build the conversation from the prompt
    update_memory(question,memory)
    conversation = conv_gen(prompt,memory)
    
    #print(conversation)
    inputs = tokenizer(conversation, return_tensors="pt").to("cuda")["input_ids"]
    
    # Build the generation parameters for the model and attach the streamer
    # do_sample=True makes it possible to control temperature, top_p, top_k, typical_p and repetition_penalty
    generation_kwargs = dict(input_ids=inputs, 
                             pad_token_id=tokenizer.eos_token_id,
                             streamer=streamer,
                             do_sample=True,
                             max_new_tokens=max_new_tokens_,  
                             temperature=temperature_, 
                             top_k=top_k_, 
                             top_p=top_p_, 
                             typical_p=typical_p_,
                             repetition_penalty=repetition_penalty_,
                             bad_words_ids=[[5150], [12453]]) 
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    answer = ""
    
    # Start the streaming process
    thread.start()
    for chunk in streamer:
        chunk = chunk.replace("<|endoftext|>", "")
        answer = answer + chunk
        yield chunk
    print("Finaliza respuesta")

    # Save the answer in the memory
    memory[len(memory)-1][1] = answer
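
My guess (only an assumption) is that the cross-over comes from memory and streamer being module-level globals shared by every request. A rough sketch of a per-session variant, assuming the same global tokenizer, model, update_memory and conv_gen helpers, and creating one TextIteratorStreamer per request so concurrent users never share it, would be something like:

from threading import Thread
from transformers import TextIteratorStreamer

# Hypothetical per-session store instead of one global memory dict.
session_memories = {}  # session_id -> that user's conversation memory

def get_answers_per_session(session_id, question, prompt, **generation_params):
    memory = session_memories.setdefault(session_id, {})
    update_memory(question, memory)
    conversation = conv_gen(prompt, memory)

    inputs = tokenizer(conversation, return_tensors="pt").to("cuda")["input_ids"]

    # One streamer per request, so concurrent requests never read each other's tokens.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                    skip_special_tokens=True)
    generation_kwargs = dict(input_ids=inputs,
                             pad_token_id=tokenizer.eos_token_id,
                             streamer=streamer,
                             do_sample=True,
                             **generation_params)
    Thread(target=model.generate, kwargs=generation_kwargs).start()

    answer = ""
    for chunk in streamer:
        answer += chunk
        yield chunk

    # Save the answer for this session only.
    memory[len(memory) - 1][1] = answer

Each request would then pass its own temperature, top_k, etc. through generation_params from the request body.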

There are 0 answers