I loaded the deepset/roberta-base-squad2 model, evaluated it on SQuAD 2.0, and got really poor performance on no-answer questions:
{
"exact": 42.103933294028465,
"f1": 45.67169337842289,
"total": 11873,
"HasAns_exact": 84.2948717948718,
"HasAns_f1": 91.44062339440198,
"HasAns_total": 5928,
"NoAns_exact": 0.0336417157275021,
"NoAns_f1": 0.0336417157275021,
"NoAns_total": 5945
}
Here's what I did:
- Load the model with the transformers pipeline
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"
model = pipeline('question-answering', model=model_name, tokenizer=model_name)
- Save predictions to a JSON file
prediction = model({'context': context, 'question': question})
predictions_dict[id] = prediction['answer']
- Evaluate using the official SQuAD 2.0 evaluation script (the full loop is sketched below)
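In case the details matter, here is roughly the full loop I ran; the file names dev-v2.0.json and predictions.json are just what I used locally:

import json
from transformers import pipeline

model_name = "deepset/roberta-base-squad2"
model = pipeline('question-answering', model=model_name, tokenizer=model_name)

# Official SQuAD 2.0 dev set
with open('dev-v2.0.json') as f:
    squad = json.load(f)

predictions_dict = {}
for article in squad['data']:
    for paragraph in article['paragraphs']:
        context = paragraph['context']
        for qa in paragraph['qas']:
            prediction = model({'context': context, 'question': qa['question']})
            predictions_dict[qa['id']] = prediction['answer']

# Predictions file for the official evaluation script
with open('predictions.json', 'w') as f:
    json.dump(predictions_dict, f)

and then I ran the official script as:

python evaluate-v2.0.py dev-v2.0.json predictions.json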
I'm quite a beginner and am unsure where I might have gone wrong.
I'm wondering whether I should add an extra binary classifier to decide answerability. However, the model card on Hugging Face indicates that deepset/roberta-base-squad2 is already a fine-tuned model, so I assume that simply loading and using it should suffice.
Alternatively, should I also save the prediction scores in step 2, pass them to the evaluation script via the --na-prob-file param, and set a --na-prob-thresh? If so, what would be an appropriate threshold? I'm trying to find the threshold that reproduces the performance metrics reported on Hugging Face.
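To make the idea concrete, this is what I have in mind for step 2; I'm only guessing that 1 - score is a reasonable stand-in for a no-answer probability, so please correct me if that's the wrong quantity:

# na_probs collects one "no-answer probability" per question id
na_probs = {}

# inside the same loop over the dev set as in the sketch above:
prediction = model({'context': context, 'question': qa['question']})
predictions_dict[qa['id']] = prediction['answer']
na_probs[qa['id']] = 1.0 - prediction['score']  # my guess; possibly the wrong quantity

# after the loop:
with open('na_probs.json', 'w') as f:
    json.dump(na_probs, f)

and then run something like:

python evaluate-v2.0.py dev-v2.0.json predictions.json --na-prob-file na_probs.json --na-prob-thresh 0.5

where 0.5 is just a placeholder; finding the right value is exactly my question.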
I've tried searching for papers, documentation, and issues related to this but haven't found anything conclusive. I feel like I might be missing some basic understanding here. Could anyone offer some guidance?