I'm building a QA system and I have my own data for this task. My problem is that one question can have two or more correct answers. For example:
Question: "What does A have to do?"
Correct answers:
- "A have to clean the floor"
- "A have to hang up the laundry"
My QA model can return the k best answers. However, in some cases not only is k unequal to the number of correct answers, but some of the k answers are also incorrect.
Most public datasets like SQuAD and TriviaQA pair each question with a single answer. In my case, a question can have multiple answers. So, what kind of evaluation metric should I use? Can I use the F1 score?
The evaluation metric should always depend on how the system you are developing will be used. The F1 score is certainly a reasonable statistic that tells you a lot about the distribution of correct and wrong answers.
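As a minimal sketch, per-question F1 could look like the following, assuming predicted and gold answers are compared by exact string match (SQuAD-style token-overlap F1 is a common softer alternative); the function and variable names are illustrative:

```python
# Minimal sketch: set-based precision/recall/F1 for one question with
# multiple gold answers. Exact string match is assumed; swap in token
# overlap or fuzzy matching if your answers vary in surface form.

def f1_for_question(predicted, gold):
    """Return (precision, recall, F1) for one question."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: the system returned k=3 answers, two of which are correct.
predicted = ["A has to clean the floor",
             "A has to hang up the laundry",
             "A has to walk the dog"]
gold = ["A has to clean the floor", "A has to hang up the laundry"]
print(f1_for_question(predicted, gold))  # approx. (0.67, 1.0, 0.8)
```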
If you are going to present a single best answer from your system, you should also measure 1-best accuracy. If you are going to present multiple answers, you should measure precision at n, i.e., the proportion of correct answers among the n best answers (strictly speaking it is closer to recall, but folks in information retrieval call it precision).
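A sketch of both metrics, assuming each dataset item holds the model's ranked answers together with the gold answers; the data layout here is an assumption for illustration, not a fixed format:

```python
# Sketch of 1-best accuracy and precision at n. Each dataset item is
# assumed to be (ranked_answers, gold_answers), with ranked_answers
# sorted from most to least confident.

def one_best_accuracy(dataset):
    """Fraction of questions whose top-ranked answer is a correct one."""
    hits = sum(1 for ranked, gold in dataset if ranked[0] in set(gold))
    return hits / len(dataset)

def precision_at_n(dataset, n):
    """Mean proportion of correct answers among each question's n best."""
    per_question = [
        sum(answer in set(gold) for answer in ranked[:n]) / min(n, len(ranked))
        for ranked, gold in dataset
    ]
    return sum(per_question) / len(per_question)
```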
If you are not sure how many answers to present, you might want to plot the ROC curve and compute the AUC score.
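One way to set that up, sketched below, is to treat every candidate answer as one scored instance, labeled by whether it matches a gold answer, and pass the labels and confidence scores to scikit-learn's roc_curve and roc_auc_score; the scores and labels here are made up for illustration:

```python
# Sketch: label each candidate answer 1 if it matches a gold answer and
# 0 otherwise, scored by the model's confidence. The ROC curve then shows
# the trade-off as you vary the confidence cutoff that decides how many
# answers to present.
from sklearn.metrics import roc_auc_score, roc_curve

confidences = [0.92, 0.81, 0.40, 0.35, 0.10]  # model confidence per answer
is_correct  = [1,    1,    0,    1,    0]     # 1 = matches a gold answer

fpr, tpr, thresholds = roc_curve(is_correct, confidences)
print("AUC:", roc_auc_score(is_correct, confidences))
# Pick the threshold (and thus the number of answers shown) that gives an
# acceptable true-positive / false-positive trade-off on held-out data.
```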