I get the above error with PyTorch, with the following assertion:
/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0]
Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
I have seen the other solutions to this problem, which describe how it might be due to the labels not being in the range 0 to num_classes-1, etc. However, I have ensured that this holds in my case, and the error occurs while computing the hinge loss as follows:
diff_hinge_loss+= F.hinge_embedding_loss( neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)
Training runs normally at first; however, after a certain number of epochs, I get a CUDA runtime error at the hinge loss computation.
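For reference, the condition in the assertion above can be mirrored on the host to look for offending indices before they reach the GPU. This is only a minimal sketch: the helper name `find_out_of_bounds` and the sample labels are hypothetical and not part of my training script, and it assumes the indices are available as plain Python integers.

```python
def find_out_of_bounds(indices, size):
    """Return (position, index) pairs that violate -size <= index < size.

    This mirrors the kernel's condition: index >= -sizes[i] && index < sizes[i].
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not (-size <= idx < size)
    ]


# Hypothetical example: labels for a 10-class problem, one of them invalid.
labels = [0, 3, 9, 10, -1]
bad = find_out_of_bounds(labels, size=10)
print(bad)  # [(3, 10)]
```

Note that -1 passes because the assertion allows negative indices down to -size. For an index tensor, the values could be passed as `index_tensor.cpu().tolist()`.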
Full Error Trace:
/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0]
Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Traceback (most recent call last):
File "ours-vision.py", line 1106, in <module>
penalty_erm, penalty_irm, penalty_ws, penalty_same_ctr, penalty_diff_ctr = train( train_dataset, data_match_tensor, label_match_tensor, phi, opt, opt_ws, scheduler, epoch, base_domain_idx, bool_erm, bool_ws, bool_ctr )
File "ours-vision.py", line 688, in train
diff_hinge_loss+= F.hinge_embedding_loss( neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)
RuntimeError: CUDA error: device-side assert triggered