CUDA error 59: Device-side assert triggered


I get the above error with PyTorch, along with the following assertion:

/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0]

Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
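The condition in this assertion requires every index along dimension i to lie in the half-open range [-sizes[i], sizes[i]). A minimal sketch of the same check in plain Python, which may help when inspecting index values on the CPU before they reach the GPU; the helper name `find_out_of_bounds` and the sample values are hypothetical and not part of the original code:

```python
def find_out_of_bounds(indices, size):
    """Return (position, value) pairs for indices that fail
    -size <= index < size, the condition stated in the assertion."""
    return [(pos, idx) for pos, idx in enumerate(indices)
            if not (-size <= idx < size)]

# Hypothetical example: a dimension of size 4 accepts indices -4..3.
bad = find_out_of_bounds([0, 3, -4, 4, -5], size=4)
print(bad)  # [(3, 4), (4, -5)]
```

For a 1-D integer index tensor, the values could be passed in as `index_tensor.cpu().tolist()`, where `index_tensor` is a placeholder name.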

I have seen other solutions to this problem that attribute it to labels not being in the range [0, num_classes-1], among other causes. I have ensured that this is not the issue in my case; the error occurs while computing the hinge loss as follows:

diff_hinge_loss += F.hinge_embedding_loss(neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)

Training runs normally at first, but after a certain number of epochs I get a CUDA runtime error at the hinge loss computation.

Full Error Trace:

/opt/conda/conda-bld/pytorch_1565272269120/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [1,0,0], thread: [127,0,0] 
Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Traceback (most recent call last):

  File "ours-vision.py", line 1106, in <module>
    penalty_erm, penalty_irm, penalty_ws, penalty_same_ctr, penalty_diff_ctr = train( train_dataset, data_match_tensor, label_match_tensor, phi, opt, opt_ws, scheduler, epoch, base_domain_idx, bool_erm, bool_ws, bool_ctr )

  File "ours-vision.py", line 688, in train
    diff_hinge_loss+=  F.hinge_embedding_loss( neg_dist - pos_dist, torch.tensor(-1).to(cuda), args.diff_margin, reduction='sum').to(cuda)

RuntimeError: CUDA error: device-side assert triggered
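CUDA kernels are launched asynchronously, so the Python line shown in a traceback like this one may not be the operation that actually triggered the assert. One common way to narrow it down is to set the `CUDA_LAUNCH_BLOCKING` environment variable before any CUDA work starts. A minimal sketch, assuming it is placed at the very top of the training script:

```python
import os

# Assumption: this runs before the first CUDA call in the process.
# With blocking launches, each kernel finishes before Python continues,
# so the traceback should point at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Running the failing batch on the CPU is another option, since an out-of-range index there is raised as an ordinary Python exception at the offending line.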
