I'm building a QA system and I have my own data for this task. My problem is that one question can have two or more correct answers. For example:
Question: "What does A have to do?"
Correct answers:
- "A have to clean the floor"
- "A have to hang up the laundry"
My QA model can return the k best answers. However, in some cases not only is k unequal to the number of correct answers, but some of the k answers are also incorrect.
Most public datasets like SQuAD and TriviaQA pair each question with a single answer. In my case, a question can have multiple answers. So, what kind of evaluation metric should I use? Can I use the F1 score?
The evaluation metric should always depend on how the system you are developing will be used. The F1 score is certainly a reasonable statistic that tells you a lot about the distribution of correct and wrong answers.
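As a minimal sketch, per-question F1 could look like the following, assuming predicted and gold answers are compared by exact string match (SQuAD-style token-overlap F1 is a common softer alternative); the function and variable names are illustrative:

```python
# Minimal sketch: set-based precision/recall/F1 for one question with
# multiple gold answers. Exact string match is assumed; swap in token
# overlap or fuzzy matching if your answers vary in surface form.

def f1_for_question(predicted, gold):
    """Return (precision, recall, F1) for one question."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: the system returned k=3 answers, two of which are correct.
predicted = ["A has to clean the floor",
             "A has to hang up the laundry",
             "A has to walk the dog"]
gold = ["A has to clean the floor", "A has to hang up the laundry"]
print(f1_for_question(predicted, gold))  # approx. (0.67, 1.0, 0.8)
```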
If you are going to present a single best answer from your system, you should also measure 1-best accuracy. If you are going to present multiple answers, you should measure precision at n, i.e., the proportion of correct answers among the n best answers (strictly speaking it is closer to recall, but folks in information retrieval call it precision).
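A sketch of both metrics, assuming each dataset item holds the model's ranked answers together with the gold answers; the data layout here is an assumption for illustration, not a fixed format:

```python
# Sketch of 1-best accuracy and precision at n. Each dataset item is
# assumed to be (ranked_answers, gold_answers), with ranked_answers
# sorted from most to least confident.

def one_best_accuracy(dataset):
    """Fraction of questions whose top-ranked answer is a correct one."""
    hits = sum(1 for ranked, gold in dataset if ranked[0] in set(gold))
    return hits / len(dataset)

def precision_at_n(dataset, n):
    """Mean proportion of correct answers among each question's n best."""
    per_question = [
        sum(answer in set(gold) for answer in ranked[:n]) / min(n, len(ranked))
        for ranked, gold in dataset
    ]
    return sum(per_question) / len(per_question)
```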
If you are not sure how many answers to present, you might want to plot the ROC curve and compute the AUC score.
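One way to set that up, sketched below, is to treat every candidate answer as one scored instance, labeled by whether it matches a gold answer, and pass the labels and confidence scores to scikit-learn's roc_curve and roc_auc_score; the scores and labels here are made up for illustration:

```python
# Sketch: label each candidate answer 1 if it matches a gold answer and
# 0 otherwise, scored by the model's confidence. The ROC curve then shows
# the trade-off as you vary the confidence cutoff that decides how many
# answers to present.
from sklearn.metrics import roc_auc_score, roc_curve

confidences = [0.92, 0.81, 0.40, 0.35, 0.10]  # model confidence per answer
is_correct  = [1,    1,    0,    1,    0]     # 1 = matches a gold answer

fpr, tpr, thresholds = roc_curve(is_correct, confidences)
print("AUC:", roc_auc_score(is_correct, confidences))
# Pick the threshold (and thus the number of answers shown) that gives an
# acceptable true-positive / false-positive trade-off on held-out data.
```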