Cannot reproduce the performance of deepset/roberta-base-squad2 on SQuAD 2.0 due to no-answer questions


I evaluated deepset/roberta-base-squad2 on SQuAD 2.0 and got very poor performance on the no-answer questions:

{
"exact": 42.103933294028465,
"f1": 45.67169337842289,
"total": 11873,
"HasAns_exact": 84.2948717948718,
"HasAns_f1": 91.44062339440198,
"HasAns_total": 5928,
"NoAns_exact": 0.0336417157275021,
"NoAns_f1": 0.0336417157275021,
"NoAns_total": 5945
}

Here's what I did:

  1. Load the model from transformers:

     from transformers import pipeline

     model_name = "deepset/roberta-base-squad2"
     model = pipeline('question-answering', model=model_name, tokenizer=model_name)

  2. Save the predictions to a JSON file:

     prediction = model({'context': context, 'question': question})
     predictions_dict[id] = prediction['answer']

  3. Evaluate with SQuAD 2.0's official evaluation script (a fuller sketch of all three steps follows below).
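
For reference, here is a minimal sketch of what the whole loop looks like (simplified; it assumes the official dev set file dev-v2.0.json and walks the standard SQuAD JSON layout):

    import json
    from transformers import pipeline

    model_name = "deepset/roberta-base-squad2"
    model = pipeline('question-answering', model=model_name, tokenizer=model_name)

    # Official SQuAD 2.0 dev set (file name assumed to be dev-v2.0.json).
    with open('dev-v2.0.json') as f:
        squad = json.load(f)

    predictions_dict = {}
    for article in squad['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                prediction = model({'context': context, 'question': qa['question']})
                predictions_dict[qa['id']] = prediction['answer']

    # Predictions file for the official evaluation script.
    with open('predictions.json', 'w') as f:
        json.dump(predictions_dict, f)

I then run the official script (downloaded as evaluate-v2.0.py from the SQuAD website) as: python evaluate-v2.0.py dev-v2.0.json predictions.json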

I'm quite a beginner and am unsure where I might have gone wrong.

I'm wondering whether I should add an extra binary classifier for answerability. However, the model card on Hugging Face indicates that deepset/roberta-base-squad2 is already fine-tuned on SQuAD 2.0, so I assume that simply loading and using it should suffice.

Alternatively, should I also save the prediction scores in step 2, pass them to the evaluation script via the --na-prob-file param, and then set a --na-prob-thresh? If so, what would be the appropriate threshold? I'm trying to find the threshold that replicates the performance metrics reported on Hugging Face.
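
If that is the right direction, step 2 would look roughly like the sketch below. Note that using 1.0 - prediction['score'] as the no-answer probability is only my guess at a usable value (it is not documented for this model), and examples stands in for the (id, context, question) triples pulled from the dev set:

    import json

    predictions_dict = {}
    na_probs = {}
    for qa_id, context, question in examples:  # examples: hypothetical iterable of dev-set triples
        prediction = model({'context': context, 'question': question})
        predictions_dict[qa_id] = prediction['answer']
        # Guessed no-answer probability; not something from the model card.
        na_probs[qa_id] = 1.0 - prediction['score']

    with open('predictions.json', 'w') as f:
        json.dump(predictions_dict, f)
    with open('na_probs.json', 'w') as f:
        json.dump(na_probs, f)

The evaluation would then be something like: python evaluate-v2.0.py dev-v2.0.json predictions.json --na-prob-file na_probs.json --na-prob-thresh 0.5, where 0.5 is just a placeholder for exactly the threshold I don't know how to choose.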

I've tried searching for papers, documentation, and issues related to this but haven't found anything conclusive. I feel like I might be missing some basic understanding here. Could anyone offer some guidance?
