I'm trying to deploy an LLM with memory to production using FastAPI. The problem is that when two or more people make a request at the same time, the answers get crossed and overlap, so one requester receives the answer meant for another. Any idea how I can deploy an LLM correctly so it can be consumed by multiple users? I can't find any good tutorial or documentation about this. The idea is also to let users control the generation parameters (temperature, etc.).
Here is part of the code (the FastAPI endpoint that consumes it is sketched further down):
def get_answersf7bsft(user_id, session_id, question, prompt,
                      clean_memory=False,
                      max_new_tokens_=256,
                      temperature_=0.1,
                      top_k_=50,
                      top_p_=.95,
                      typical_p_=1.00,
                      repetition_penalty_=1.2):
    # Streams the model's answer token by token; the memory keeps the user's last 4 interactions.
    # If clean_memory=True, the memory is wiped and only the latest interaction is kept.
    print("Starting get_answersf7bsft")
    start = time.time()
    global memory
    if clean_memory:
        memory = {}
    # Add the new question to the memory and build the conversation from the prompt
    update_memory(question, memory)
    conversation = conv_gen(prompt, memory)
    #print(conversation)
    inputs = tokenizer(conversation, return_tensors="pt").to("cuda")["input_ids"]
    # Build the generation parameters and plug them into the streamer.
    # do_sample=True enables temperature, top_p, top_k, typical_p and repetition_penalty
    generation_kwargs = dict(input_ids=inputs,
                             pad_token_id=tokenizer.eos_token_id,
                             streamer=streamer,
                             do_sample=True,
                             max_new_tokens=max_new_tokens_,
                             temperature=temperature_,
                             top_k=top_k_,
                             top_p=top_p_,
                             typical_p=typical_p_,
                             repetition_penalty=repetition_penalty_,
                             bad_words_ids=[[5150], [12453]])
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    answer = ""
    # Start the streaming process
    thread.start()
    for token in streamer:
        token = token.replace("<|endoftext|>", "")
        answer = answer + token
        yield token
    print("Answer finished")
    # Save the answer in the memory
    memory[len(memory)-1][1] = answer
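For reference, the generator is exposed through a FastAPI streaming endpoint roughly like the sketch below. This is a simplified, assumed version (the AskRequest model and the /ask route are illustrative, the real endpoint also does validation); I'm showing it only so the request/streaming flow is clear. Note that model, tokenizer, streamer and memory live at module level and are shared by every request.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    # Per-request generation parameters the user is allowed to tune
    user_id: str
    session_id: str
    question: str
    prompt: str
    clean_memory: bool = False
    max_new_tokens: int = 256
    temperature: float = 0.1
    top_k: int = 50
    top_p: float = 0.95
    typical_p: float = 1.0
    repetition_penalty: float = 1.2

@app.post("/ask")
def ask(req: AskRequest):
    # Stream the model's tokens back to the client as they are generated
    gen = get_answersf7bsft(req.user_id, req.session_id, req.question, req.prompt,
                            clean_memory=req.clean_memory,
                            max_new_tokens_=req.max_new_tokens,
                            temperature_=req.temperature,
                            top_k_=req.top_k,
                            top_p_=req.top_p,
                            typical_p_=req.typical_p,
                            repetition_penalty_=req.repetition_penalty)
    return StreamingResponse(gen, media_type="text/plain")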