KeyError: 'answers' when using the BioASQ dataset with Huggingface Transformers


I am using run_squad.py (https://github.com/huggingface/transformers/blob/master/examples/run_squad.py) from Huggingface Transformers for fine-tuning on the BioASQ Question Answering dataset.

I have converted the TensorFlow weights provided by the authors of BioBERT (https://github.com/dmis-lab/bioasq-biobert) to PyTorch, as discussed here: https://github.com/huggingface/transformers/issues/312.
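For context, the conversion can be done along the lines of that issue (e.g. with the convert_bert_original_tf_checkpoint_to_pytorch.py script in the transformers repo, or with a few lines of Python). The sketch below is only illustrative; the paths are placeholders, not my exact ones:

import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

# Placeholder paths for the BioBERT TensorFlow checkpoint and its config
tf_checkpoint = "biobert_v1.1_pubmed/model.ckpt-1000000"
config_file = "biobert_v1.1_pubmed/bert_config.json"
dump_path = "BioBERT-PyTorch/pytorch_model.bin"

config = BertConfig.from_json_file(config_file)
model = BertForPreTraining(config)
# Loads the TF variables into the PyTorch model; TF variables without a
# matching parameter are skipped. Requires TensorFlow to be installed.
load_tf_weights_in_bert(model, config, tf_checkpoint)
torch.save(model.state_dict(), dump_path)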

Further, I am using the preprocessed BioASQ data (https://github.com/dmis-lab/bioasq-biobert), which has been converted to the SQuAD format. However, when I run the run_squad.py script with the parameters below

 --model_type bert \
  --model_name_or_path /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/BioBERTv1.1-SQuADv1.1-Factoid-PyTorch/ \
  --do_train \
  --do_eval \
  --save_steps 1000 \
  --train_file $data/BioASQ-train-factoid-6b.json \
  --predict_file $data/BioASQ-test-factoid-6b-1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/QA_output_squad/BioASQ-factoid-6b/BioASQ-factoid-6b-1-issue-23mar/


I get the following error:

03/23/2020 12:53:12 - INFO - transformers.modeling_utils -   loading weights file /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/QA_output_squad/BioASQ-factoid-6b/BioASQ-factoid-6b-1-issue-23mar/pytorch_model.bin
03/23/2020 12:53:15 - INFO - __main__ -   Creating features from dataset file at .

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_squad.py", line 856, in <module>
    main()
  File "run_squad.py", line 845, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 299, in evaluate
    dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)
  File "run_squad.py", line 475, in load_and_cache_examples
    examples = processor.get_dev_examples(args.data_dir, filename=args.predict_file)
  File "/scratch/oe7/uk1594/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 522, in get_dev_examples
    return self._create_examples(input_data, "dev")
  File "/scratch/oe7/uk1594/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 549, in _create_examples
    answers = qa["answers"]
KeyError: 'answers'


I really appreciate your help. Thanks a lot for your guidance.

The evaluation dataset looks like this:

{
  "version": "BioASQ6b", 
  "data": [
    {
      "title": "BioASQ6b", 
      "paragraphs": [
        {
          "context": "emMAW: computing minimal absent words in external memory. Motivation: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes",
          "qas": [
            {
              "question": "Which algorithm is available for computing minimal absent words using external memory?", 
              "id": "5a6a3335b750ff4455000025_000"
            }
          ]
        }
      ]
    }
  ]
}
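If I read the processor correctly, get_dev_examples reads qa["answers"] for every entry, so each qas item seems to need at least an empty answers list. Below is a small hypothetical patch (file names are placeholders) that would presumably get past the KeyError, although any evaluation metrics would be meaningless because the gold answers are still missing:

import json

# Placeholder file names for the BioASQ test file and the patched copy
with open("BioASQ-test-factoid-6b-1.json") as f:
    data = json.load(f)

# Give every qas entry an empty "answers" list so the SQuAD dev
# processor does not raise KeyError: 'answers'
for entry in data["data"]:
    for paragraph in entry["paragraphs"]:
        for qa in paragraph["qas"]:
            qa.setdefault("answers", [])

with open("BioASQ-test-factoid-6b-1-patched.json", "w") as f:
    json.dump(data, f)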




1 Answer

Abdullah Bashir

The BioASQ evaluation files are test files that don't contain answers; they are only used for predictions. For evaluation during training, you can use a portion of the training files.
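A rough sketch of such a split, assuming the training file uses the usual SQuAD layout shown above (file names and the 90/10 ratio are arbitrary placeholders), might look like this:

import json
import random

# Placeholder path to the SQuAD-format BioASQ training file
with open("BioASQ-train-factoid-6b.json") as f:
    squad = json.load(f)

# Collect (title, paragraph) pairs so the split still works when
# everything sits under a single "data" entry, as in the BioASQ files
paragraphs = [(entry["title"], p)
              for entry in squad["data"]
              for p in entry["paragraphs"]]
random.seed(0)
random.shuffle(paragraphs)
cut = int(0.9 * len(paragraphs))   # arbitrary 90/10 split

def to_squad(pairs, version):
    return {"version": version,
            "data": [{"title": t, "paragraphs": [p]} for t, p in pairs]}

version = squad.get("version", "BioASQ")
with open("BioASQ-train-split.json", "w") as f:   # placeholder output names
    json.dump(to_squad(paragraphs[:cut], version), f)
with open("BioASQ-dev-split.json", "w") as f:
    json.dump(to_squad(paragraphs[cut:], version), f)

The dev portion produced this way still has the "answers" field, so it can be passed to --predict_file with --do_eval without hitting the KeyError.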