Trying to train the AllenNLP coreference resolution model on OntoNotes: getting CUDA out of memory


I'm trying to train AllenNLP's coreference model on a 16GB GPU, using this config file: https://github.com/allenai/allennlp-models/blob/main/training_config/coref/coref_spanbert_large.jsonnet

I created train, test, and dev files using this script: https://github.com/allenai/allennlp/blob/master/scripts/compile_coref_data.sh

I got CUDA out of memory almost instantly, so I tried changing "spans_per_word" and "max_antecedents" to lower values. With spans_per_word set to 0.1 instead of 0.4, I could run a bit longer, but not nearly a full epoch. Is a 16GB GPU not enough? Or are there other parameters I could try changing?
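For reference, this is roughly how I changed those values, passing overrides on the command line with allennlp train's -o/--overrides flag instead of editing the config file (the config path, serialization directory, and the exact numbers below are just placeholders/examples, not recommended settings):

    # Sketch: lower spans_per_word and max_antecedents via --overrides
    # (paths and values here are placeholders)
    allennlp train coref_spanbert_large.jsonnet \
        -s /path/to/serialization_dir \
        -o '{"model": {"spans_per_word": 0.1, "max_antecedents": 50}}'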

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/allennlp/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/__main__.py", line 34, in run
    main(prog="allennlp")
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 119, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 119, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 178, in train_model_from_file
    file_friendly_logging=file_friendly_logging,
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 242, in train_model
    file_friendly_logging=file_friendly_logging,
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 466, in _train_worker
    metrics = train_loop.run()
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/commands/train.py", line 528, in run
    return self.trainer.train()
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 740, in train
    metrics, epoch = self._try_train()
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 772, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/allennlp/training/trainer.py", line 523, in _train_epoch
    loss.backward()
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/ubuntu/anaconda3/envs/allennlp/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.33 GiB (GPU 0; 14.76 GiB total capacity; 11.69 GiB already allocated; 639.75 MiB free; 13.09 GiB reserved in total by PyTorch)


1 Answer

Answered by Dirk Groeneveld

16GB is on the low end for that model.

When this model receives a long document, it splits the text into multiple shorter sequences of 512 word pieces each and runs them all at the same time. That way you can end up with many sequences in memory at once, even when the batch size is 1.

Try setting max_sentences to a lower value (the default in that config is 110) and see if that works.
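As a sketch, the same --overrides mechanism as above should work for this, assuming max_sentences is read from the dataset_reader section of the linked config (the value 50 and the paths below are only placeholders to experiment with):

    # Sketch: cap how many sentences of each document the reader keeps
    # (paths and the value 50 are placeholders; lower it further if you still run out of memory)
    allennlp train coref_spanbert_large.jsonnet \
        -s /path/to/serialization_dir \
        -o '{"dataset_reader": {"max_sentences": 50}}'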