Error while fine-tuning YOLOv7-tiny model on Google Colab Pro

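I am trying to fine-tune a YOLOv7-tiny PyTorch model on my custom dataset, which contains about 40k images, on Google Colab. Training fails with the error below during the first epoch; according to the traceback, it is raised in the validation pass that runs after the training batches finish.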

Command

!python train.py --batch 16 --cfg cfg/training/yolov7-tiny.yaml --epochs 30 --data data/custom_data.yaml --weights 'yolov7-tiny.pt' --device 'cpu' --hyp /content/yolov7/data/hyp.scratch.tiny.yaml

Error

/content/yolov7
2023-11-23 09:41:50.546886: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-23 09:41:50.546968: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-23 09:41:50.547001: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-23 09:41:50.556142: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-11-23 09:41:51.761396: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
YOLOR  v0.1-128-ga207844 torch 2.1.0+cu118 CPU

Namespace(weights='yolov7-tiny.pt', cfg='cfg/training/yolov7-tiny.yaml', data='data/custom_data.yaml', hyp='/content/yolov7/data/hyp.scratch.tiny.yaml', epochs=30, batch_size=16, img_size=[640, 640], rect=False, resume=False, nosave=False, notest=False, noautoanchor=False, evolve=False, bucket='', cache_images=False, image_weights=False, device='cpu', multi_scale=False, single_cls=False, adam=False, sync_bn=False, local_rank=-1, workers=8, project='runs/train', entity=None, name='exp', exist_ok=False, quad=False, linear_lr=False, label_smoothing=0.0, upload_dataset=False, bbox_interval=-1, save_period=-1, artifact_alias='latest', freeze=[0], v5_metric=False, world_size=1, global_rank=-1, save_dir='runs/train/exp2', total_batch_size=16)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.05, copy_paste=0.0, paste_in=0.05, loss_ota=1
wandb: Install Weights & Biases for YOLOR logging with 'pip install wandb' (recommended)

                 from  n    params  module                                  arguments                     
  0                -1  1       928  models.common.Conv                      [3, 32, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
  2                -1  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  3                -2  1      2112  models.common.Conv                      [64, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  4                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  5                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  6  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
  7                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
  8                -1  1         0  models.common.MP                        []                            
  9                -1  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 10                -2  1      4224  models.common.Conv                      [64, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 11                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 12                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 13  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 15                -1  1         0  models.common.MP                        []                            
 16                -1  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 17                -2  1     16640  models.common.Conv                      [128, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 19                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 20  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 21                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 22                -1  1         0  models.common.MP                        []                            
 23                -1  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 24                -2  1     66048  models.common.Conv                      [256, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 25                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 26                -1  1    590336  models.common.Conv                      [256, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 27  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 28                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 29                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 30                -2  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 31                -1  1         0  models.common.SP                        [5]                           
 32                -2  1         0  models.common.SP                        [9]                           
 33                -3  1         0  models.common.SP                        [13]                          
 34  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 35                -1  1    262656  models.common.Conv                      [1024, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 36          [-1, -7]  1         0  models.common.Concat                    [1]                           
 37                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 38                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 39                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 40                21  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 41          [-1, -2]  1         0  models.common.Concat                    [1]                           
 42                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 43                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 44                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 45                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 46  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 47                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 48                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 49                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 50                14  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 51          [-1, -2]  1         0  models.common.Concat                    [1]                           
 52                -1  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 53                -2  1      4160  models.common.Conv                      [128, 32, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 54                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 55                -1  1      9280  models.common.Conv                      [32, 32, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 56  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 57                -1  1      8320  models.common.Conv                      [128, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 58                -1  1     73984  models.common.Conv                      [64, 128, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
 59          [-1, 47]  1         0  models.common.Concat                    [1]                           
 60                -1  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 61                -2  1     16512  models.common.Conv                      [256, 64, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 62                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 63                -1  1     36992  models.common.Conv                      [64, 64, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 64  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 65                -1  1     33024  models.common.Conv                      [256, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 66                -1  1    295424  models.common.Conv                      [128, 256, 3, 2, None, 1, LeakyReLU(negative_slope=0.1)]
 67          [-1, 37]  1         0  models.common.Concat                    [1]                           
 68                -1  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 69                -2  1     65792  models.common.Conv                      [512, 128, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 70                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 71                -1  1    147712  models.common.Conv                      [128, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 72  [-1, -2, -3, -4]  1         0  models.common.Concat                    [1]                           
 73                -1  1    131584  models.common.Conv                      [512, 256, 1, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 74                57  1     73984  models.common.Conv                      [64, 128, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 75                65  1    295424  models.common.Conv                      [128, 256, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 76                73  1   1180672  models.common.Conv                      [256, 512, 3, 1, None, 1, LeakyReLU(negative_slope=0.1)]
 77      [74, 75, 76]  1     82076  models.yolo.IDetect                     [25, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [128, 256, 512]]
/usr/local/lib/python3.10/dist-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Model Summary: 263 layers, 6079932 parameters, 6079932 gradients, 13.4 GFLOPS

Transferred 330/344 items from yolov7-tiny.pt
Scaled weight_decay = 0.0005
Optimizer groups: 58 .bias, 58 conv.weight, 61 other
train: Scanning '/content/drive/MyDrive/combined_data_14sep/combined_data_14sep/train/labels.cache' images and labels... 28588 found, 1 missing, 6 empty, 2 corrupted: 100% 28590/28590 [00:00<?, ?it/s]
val: Scanning '/content/drive/MyDrive/combined_data_14sep/combined_data_14sep/val/labels.cache' images and labels... 6163 found, 2 missing, 0 empty, 1 corrupted: 100% 6165/6165 [00:00<?, ?it/s]

autoanchor: Analyzing anchors... anchors/target = 4.01, Best Possible Recall (BPR) = 0.9997
Image sizes 640 train, 640 test
Using 8 dataloader workers
Logging results to runs/train/exp2
Starting training for 30 epochs...

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
      0/29        0G   0.07406   0.02084   0.09498    0.1899        44       640:   2% 42/1787 [14:15<3:01:29,  6.24s/it]libpng warning: iCCP: known incorrect sRGB profile
      0/29        0G   0.06949   0.01645   0.08452    0.1705        35       640:   8% 150/1787 [25:10<2:49:18,  6.21s/it]libpng warning: iCCP: known incorrect sRGB profile
      0/29        0G   0.06428   0.01585   0.07459    0.1547        34       640:  17% 295/1787 [39:58<2:37:10,  6.32s/it]libpng warning: iCCP: known incorrect sRGB profile
      0/29        0G    0.0612   0.01587   0.06808    0.1452        53       640:  24% 425/1787 [53:20<2:15:37,  5.97s/it]libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: cHRM chunk does not match sRGB
      0/29        0G   0.06106    0.0159   0.06779    0.1447        55       640:  24% 433/1787 [54:10<2:20:43,  6.24s/it]libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: cHRM chunk does not match sRGB
      0/29        0G    0.0596   0.01599   0.06443      0.14        66       640:  28% 500/1787 [1:00:59<2:16:03,  6.34s/it]libpng warning: iCCP: known incorrect sRGB profile
      0/29        0G   0.05856   0.01605   0.06269    0.1373        65       640:  31% 555/1787 [1:06:34<2:12:26,  6.45s/it]libpng warning: iCCP: profile 'ICC Profile': 0h: PCS illuminant is not D50
      0/29        0G    0.0559   0.01591   0.05811    0.1299        57       640:  41% 735/1787 [1:24:59<1:46:40,  6.08s/it]libpng warning: iCCP: profile 'ICC Profile': 0h: PCS illuminant is not D50
      0/29        0G   0.05285   0.01518   0.04852    0.1166        62       640:  66% 1184/1787 [2:12:21<1:06:22,  6.60s/it]libpng warning: iCCP: profile 'ICC Profile': 0h: PCS illuminant is not D50
      0/29        0G   0.05234   0.01525   0.04795    0.1155        44       640:  71% 1268/1787 [2:22:00<1:00:09,  6.95s/it]libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: cHRM chunk does not match sRGB
      0/29        0G    0.0511   0.01518     0.045    0.1113        78       640:  88% 1574/1787 [2:55:58<24:06,  6.79s/it]libpng warning: iCCP: known incorrect sRGB profile
      0/29        0G   0.05095   0.01517   0.04469    0.1108        52       640:  90% 1611/1787 [3:00:03<21:35,  7.36s/it]libpng warning: iCCP: known incorrect sRGB profile
libpng warning: iCCP: cHRM chunk does not match sRGB
      0/29        0G   0.05036   0.01521   0.04345     0.109        41       640: 100% 1787/1787 [3:19:39<00:00,  6.70s/it]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95:  73% 141/193 [11:14<04:08,  4.78s/it]
Traceback (most recent call last):
  File "/content/yolov7/train.py", line 616, in <module>
    train(hyp, opt, device, tb_writer)
  File "/content/yolov7/train.py", line 415, in train
    results, maps, times = test.test(data_dict,
  File "/content/yolov7/test.py", line 119, in test
    loss += compute_loss([x.float() for x in train_out], targets)[1][:3]  # box, obj, cls
  File "/content/yolov7/utils/loss.py", line 477, in __call__
    t[range(n), tcls[i]] = self.cp
IndexError: index 5283828224 is out of bounds for dimension 1 with size 25

I'm seeking assistance in identifying and resolving this issue. I'm not entirely certain where the root cause lies, but I'm open to suggestions and guidance from experienced individuals.
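
One check that might help narrow this down: the traceback indexes `tcls[i]` into a dimension of size 25 (the class count passed to `IDetect`), and the failure happens inside `test.py`, so a validation label with a class id outside 0..24 is a plausible cause. The sketch below is not part of the YOLOv7 repo. It assumes the labels are plain-text files in the usual YOLO layout (one `class x_center y_center width height` row per object), and `find_bad_class_ids` is a hypothetical helper name.

```python
from pathlib import Path


def find_bad_class_ids(label_dir, num_classes):
    """Return (file name, line number, first token) for every label row
    whose class id is not an integer in [0, num_classes)."""
    bad = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        lines = path.read_text().splitlines()
        for line_no, line in enumerate(lines, start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            try:
                cls = float(parts[0])
            except ValueError:
                # First token is not numeric at all
                bad.append((path.name, line_no, parts[0]))
                continue
            # Flag fractional, NaN/inf, negative, or too-large class ids
            if not cls.is_integer() or not 0 <= cls < num_classes:
                bad.append((path.name, line_no, parts[0]))
    return bad
```

It could be called as `find_bad_class_ids(val_label_dir, 25)`, where `val_label_dir` is the folder holding the validation `.txt` labels (presumably the `val/labels` directory next to the `labels.cache` file shown in the log, though that location is a guess). If a label file is corrected, it is probably also worth deleting the matching `labels.cache` so the dataset gets rescanned rather than read from the old cache.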
