I'm building a QA system. My problem is that one question may have multiple answers, and those answers are located at different positions in the context. For example:
Question: What does Chris have to do?
Context: ....Chris has to wash dishes....(more text)....Chris has to do his homework....
Correct answers:
- wash dishes
- do homework
Once I have extracted the candidate answers for a question, I use a clustering algorithm to deduplicate them and keep only the "separate" answers. Therefore, I need a dataset with pairs of one question and many answers, like the example above, to evaluate my clustering algorithm and sentence embedding model.
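For concreteness, the deduplication step could be sketched roughly as below. This is only an illustration under assumptions: the token-overlap (Jaccard) similarity is a stand-in for the sentence embedding model, the greedy grouping is one simple clustering choice among many, and the names `dedupe_answers` and `threshold` as well as the candidate list are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for embedding cosine similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedupe_answers(candidates, threshold=0.5):
    """Greedy clustering: a candidate joins the first cluster whose
    representative (its first member) is similar enough, otherwise it
    starts a new cluster. Returns one representative per cluster."""
    clusters = []
    for cand in candidates:
        for cluster in clusters:
            if jaccard(cand, cluster[0]) >= threshold:
                cluster.append(cand)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([cand])
    return [cluster[0] for cluster in clusters]


# Hypothetical extractor output with near-duplicates.
candidates = ["wash dishes", "wash the dishes", "do his homework", "do homework"]
distinct = dedupe_answers(candidates)
print(distinct)
```

A dataset with several non-duplicated gold answers per question would let the representatives returned here be compared against the gold set, which is what the evaluation needs.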
Is there any public dataset that provides pairs of one question and multiple correct, non-duplicated answers? I tried MS MARCO, but most of the multiple answers in that dataset are duplicates.
I was looking for something similar: question answering techniques or datasets with multiple non-redundant answers.
This is the dataset: https://github.com/mingzhu0527/MASHQA
and the paper: https://www.aclweb.org/anthology/2020.findings-emnlp.342.pdf
However, this paper frames QA as a sentence classification task: for each sentence in the context, the task is to tell whether it answers the query or not.
So if your multiple answers are just phrases rather than whole sentences, I wouldn't recommend going with this.
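To make the sentence-level framing concrete, here is a minimal sketch of how a context could be turned into per-sentence binary labels. It is not the MASHQA data format or the paper's method: the regex sentence splitter, the substring match against gold answers, the function names, and the filler sentence in the example context are all assumptions made for illustration.

```python
import re


def split_sentences(context):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", context)
    return [p.strip() for p in parts if p.strip()]


def label_sentences(context, gold_answers):
    """Return (sentence, label) pairs; label is 1 when the sentence
    contains any gold answer as a case-insensitive substring, else 0."""
    labeled = []
    for sentence in split_sentences(context):
        lowered = sentence.lower()
        hit = any(answer.lower() in lowered for answer in gold_answers)
        labeled.append((sentence, int(hit)))
    return labeled


# Hypothetical context built from the example in the question.
context = "Chris has to wash dishes. It is raining today. Chris has to do his homework."
labeled = label_sentences(context, ["wash dishes", "do his homework"])
for sentence, label in labeled:
    print(label, sentence)
```

Note that the unit being labeled is the whole sentence, so a phrase-level answer such as "wash dishes" is only recovered indirectly, which is why this framing fits poorly when the answers are short phrases.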