I am building a RAG system on Azure Databricks and having trouble evaluating the pyfunc models we are saving to MLflow. The predict method of the model class outputs a pandas DataFrame with three columns, answers, sources and prompts, for auditability:

    return pd.DataFrame({'answers': answers, 'sources': sources, 'prompts': prompts})
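For context, the wrapper is roughly the following (a simplified sketch: the real retrieval and generation logic is stubbed out, and the "questions" input column name is illustrative):

    import mlflow.pyfunc
    import pandas as pd

    class RagModel(mlflow.pyfunc.PythonModel):
        def _retrieve(self, question):
            # Stand-in for the vector-search / retrieval step
            return ["doc-1", "doc-2"]

        def _generate(self, prompt):
            # Stand-in for the LLM call
            return "stub answer"

        def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
            answers, sources, prompts = [], [], []
            for question in model_input["questions"]:
                docs = self._retrieve(question)
                prompt = f"Context: {docs}\nQuestion: {question}"
                answers.append(self._generate(prompt))
                sources.append(docs)
                prompts.append(prompt)
            # Three columns are returned so every answer stays auditable
            return pd.DataFrame({'answers': answers, 'sources': sources, 'prompts': prompts})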
However, I am running into issues when calling mlflow.evaluate() on these model versions.

Issue: this model will be used as a chatbot, so latency and response size are key metrics to evaluate. As such, we specify latency and token_count as extra metrics, which results in the following error:

    ValueError: cannot reindex on an axis with duplicate labels
Evaluation code:

    import mlflow

    evaluation_results = mlflow.evaluate(
        model=f'models:/{model_name}/{model_version}',
        data=data,
        predictions="answers",
        extra_metrics=[
            mlflow.metrics.latency(),
            mlflow.metrics.token_count(),
        ],
    )
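For reference, data is a pandas DataFrame of evaluation questions along these lines (the real questions are elided; the "questions" column name matches what the model's predict expects in this simplified repro):

    import pandas as pd

    # Simplified evaluation set; the actual questions are omitted.
    data = pd.DataFrame({
        "questions": [
            "example question 1",
            "example question 2",
        ]
    })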
I am using mlflow==2.8.0. My goal is to be able to compare answers, sources, prompts, latency, and token count across experiment runs in the MLflow evaluation UI.