I'm training a YOLOv7 model on a custom dataset. I can train it on a small dataset of around 300 images with the following command:
python train.py --workers 4 --device 0 --batch-size 2 --data academy2/cfg.yaml --img 640 640 --cfg academy2/yolov7.yaml --weights yolov7.pt --name yolov7-academy2 --hyp academy2/hyp.scratch.p5.yaml
However, once I start training on a larger dataset, I get a CUDA out-of-memory error:
     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     0/299     10.7G   0.08955   0.09489   0.07693    0.2614       172       640:   0%|▎                                                                                                                 | 5/1887 [00:17<1:52:11,  3.58s/it]
Traceback (most recent call last):
  File "train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 363, in train
    loss, loss_items = compute_loss_ota(pred, targets.to(device), imgs)  # loss scaled by batch_size
  File "C:\Users\aaa\Documents\Python\yolov7\utils\loss.py", line 585, in __call__
    bs, as_, gjs, gis, targets, anchors = self.build_targets(p, targets, imgs)
  File "C:\Users\aaa\Documents\Python\yolov7\utils\loss.py", line 733, in build_targets
    torch.log(y/(1-y)) , gt_cls_per_image, reduction="none"
  File "C:\Users\aaa\anaconda3\envs\yolov7_04072023\lib\site-packages\torch\nn\functional.py", line 3132, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 6.07 GiB (GPU 0; 11.00 GiB total capacity; 37.90 GiB already allocated; 0 bytes free; 39.60 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I'm confused about how a larger dataset can have such an impact. On the larger dataset, I was only able to run training with batch_size = 1 and img_size = 320. My GPU is an RTX 2080 Ti (11 GiB, per the error message above).
Another observation concerns the number of classes. The model trains without problems when n_classes = 1 (even with batch_size = 4 and img_size = 1024), but when I increase the number of classes I get the same error. I understand that more classes make the network larger, but only in the output layer, so the overall number of parameters stays almost the same. Why does the number of classes have such a strong impact on memory?
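To make the question more concrete, here is a rough back-of-the-envelope sketch of how I would expect the tensor on the failing line to scale. It assumes that the input to binary_cross_entropy_with_logits inside build_targets is a dense float32 tensor of shape (num_gt, num_candidates, num_classes); that shape, the helper name, and the sample numbers below are my assumptions for illustration, not measurements from my run.

```python
def pairwise_cls_cost_bytes(num_gt, num_candidates, num_classes, bytes_per_element=4):
    """Bytes needed for one dense tensor of shape
    (num_gt, num_candidates, num_classes); float32 by default."""
    return num_gt * num_candidates * num_classes * bytes_per_element


GIB = 1024 ** 3

# Hypothetical values: 100 labelled boxes in an image and 5000 candidate
# predictions. Only num_classes changes between the rows.
for nc in (1, 10, 80):
    size = pairwise_cls_cost_bytes(num_gt=100, num_candidates=5000, num_classes=nc)
    print(f"num_classes={nc:3d}: {size / GIB:.3f} GiB per tensor")
```

If that assumption holds, the size grows with the product of labels per image, candidate predictions, and classes rather than with the parameter count, and the expression on that line would create several temporaries of this size at once. I'd appreciate confirmation of whether this is what is actually happening.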
Thanks