Evaluation metrics for multiple correct answers in a QA system


I'm building a QA system and I have my own data for this task. My problem is that one question can have two or more correct answers. For example:

Question: "What does A have to do?"

Correct answers:

  • "A has to clean the floor"
  • "A has to hang up the laundry"

My QA model returns the k best answers. However, k does not always equal the number of correct answers, and some of the k answers may be incorrect.

Most public datasets, such as SQuAD and TriviaQA, pair each question with a single answer. In my case, a question can have multiple answers. So, what kind of evaluation metric should I use? Can I use the F1 score?


1 answer

Jindřich

The evaluation metric should always depend on how the system you are developing will be used. The F1 score is certainly a reasonable statistic: it combines precision and recall, so it tells you a lot about how the correct and wrong answers are distributed.
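
As a rough illustration of F1 for a question with several correct answers, here is a minimal sketch. It assumes answers are compared by exact match after lowercasing and whitespace cleanup; the helper names `normalize` and `set_f1` and the example strings are made up for this sketch, and a token-overlap or semantic match may suit your data better.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def set_f1(predicted, gold):
    """Return (precision, recall, F1) for one question.

    Assumption: an answer counts as correct only if it exactly matches
    a gold answer after normalization.
    """
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: two gold answers, three predicted answers (k = 3).
gold = ["clean the floor", "hang up the laundry"]
predicted = ["Clean the floor", "cook dinner", "walk the dog"]
precision, recall, f1 = set_f1(predicted, gold)
```

Averaging the per-question F1 over the whole test set gives a single number. Because precision is computed over the k returned answers and recall over the gold answers, the score stays meaningful when k differs from the number of correct answers.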

If you are going to present a single best answer from your system, you should also measure the 1-best accuracy. If you are going to present multiple answers, you should measure the precision at n, i.e., the proportion of correct answers among the n best answers. Its counterpart, recall at n, is the proportion of all correct answers that appear among the n best answers.
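
Both measures are easy to compute from ranked outputs. The following is a sketch under the assumption that each system output is a list of answer strings sorted from best to worst and that correctness is decided by exact string match; the function names and the example data are hypothetical.

```python
def precision_at_n(ranked, gold, n):
    """Proportion of the n best answers that are correct."""
    if n <= 0:
        raise ValueError("n must be positive")
    gold_set = set(gold)
    top = ranked[:n]
    return sum(1 for answer in top if answer in gold_set) / n


def one_best_accuracy(all_ranked, all_gold):
    """Share of questions whose top-ranked answer is one of the gold answers."""
    hits = sum(
        1
        for ranked, gold in zip(all_ranked, all_gold)
        if ranked and ranked[0] in set(gold)
    )
    return hits / len(all_ranked)


# Hypothetical outputs for two questions, best answer first.
all_ranked = [
    ["clean the floor", "cook dinner", "hang up the laundry"],
    ["paris", "lyon"],
]
all_gold = [
    ["clean the floor", "hang up the laundry"],
    ["lyon"],
]

p_at_2 = precision_at_n(all_ranked[0], all_gold[0], 2)
accuracy = one_best_accuracy(all_ranked, all_gold)
```

Note that the sketch divides by n even when the system returns fewer than n answers, so missing answers count against the score; dividing by the number of returned answers instead is another common convention.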

If you are not sure what a suitable number of answers to present is, you might want to plot the ROC curve and compute the AUC score.
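
For the AUC, a small sketch follows. It assumes your model assigns a confidence score to every candidate answer and that you have labeled each candidate as correct (1) or incorrect (0); the function name `roc_auc` and the numbers are illustrative only. It uses the fact that the area under the ROC curve equals the probability that a randomly chosen correct answer is scored higher than a randomly chosen incorrect one, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Quadratic in the number of candidates, which is fine for a sketch;
    a library implementation is preferable for large evaluation sets.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical candidate answers pooled over the test set.
scores = [0.9, 0.8, 0.4, 0.3]  # model confidence per candidate
labels = [1, 0, 1, 0]          # 1 = correct answer, 0 = incorrect answer
auc = roc_auc(scores, labels)
```

If scikit-learn is available, `sklearn.metrics.roc_auc_score(labels, scores)` computes the same quantity and `sklearn.metrics.roc_curve` gives the points needed for the plot.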