Is there a reason why NaN values appear when there are no NaN values in the model parameters?


I want to train the model with FP32 and perform inference with FP16.

For other networks (e.g., ResNet), FP16 inference worked.

But EDSR (super-resolution) with FP16 did not.

The differences I found are:

  1. ReLU with inplace=True in EDSR
  2. PixelShuffle in EDSR
  3. No batchnorm in EDSR (see the overflow sketch after this list)

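My guess for difference 3: without batchnorm, nothing rescales the activations between layers, so intermediate values could simply exceed the FP16 range. A minimal, generic sketch (not EDSR code) of how an overflow turns into NaN even when no weight contains a NaN:

import torch

# FP16 can only represent magnitudes up to 65504
print(torch.finfo(torch.float16).max)   # 65504.0

x = torch.tensor([70000.0]).half()      # overflows to inf
print(x)                                # tensor([inf], dtype=torch.float16)
print(x - x)                            # inf - inf = nan
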
I am using CUDA 11.3, Python 3.8.12, PyTorch 1.12.1, and cuDNN 8.7.0. Are there any functions that do not support FP16 in convolutional neural networks?

GPU: RTX A6000

My process looks like this:

net_half = net.half()       # convert all parameters and buffers to FP16
net_half.eval()
input_half = input.half()   # convert the input to FP16 as well

with torch.no_grad():
    output_half = net_half(input_half)

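A possible alternative to converting the whole model with .half() would be autocast, which runs convolutions in FP16 but keeps numerically sensitive ops in FP32. A sketch (the model and input stay FP32 here):

net.eval()

with torch.no_grad():
    # ops inside this context are dispatched to FP16 or FP32 per-op
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = net(input)
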
I checked that there are no NaNs in the model parameters or the input:

torch.stack([torch.isnan(p).any() for p in net_half.parameters()]).any()
torch.isnan(input_half).any()

Both give False.
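
Of course, finite parameters and input do not rule out an overflow inside the network: an intermediate activation can hit inf and then turn into NaN. A sketch with forward hooks that could locate the first module whose output is non-finite (the names are whatever net_half happens to contain):

def report_nonfinite(name):
    def hook(module, inputs, output):
        # flag any module whose output contains inf or nan
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output in {name} ({module.__class__.__name__})")
    return hook

for name, module in net_half.named_modules():
    module.register_forward_hook(report_nonfinite(name))

with torch.no_grad():
    _ = net_half(input_half)

The first line printed would show where the values first blow up.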

And I tested the basic operations used in EDSR in isolation:

import torch
import torch.nn as nn

device = torch.device("cuda")
Ny, Nx = 128, 128  # example spatial size; in my case these come from the input image

x = torch.randn(1, 4, Ny // 2, Nx // 2)

test_block1 = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.ReLU(True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1, bias=True),
    nn.PixelShuffle(2),
    nn.ReLU(True),
    nn.Conv2d(64, 4, kernel_size=3, padding=1),
)

x = x.half().to(device)
test_block1 = test_block1.half().to(device)
with torch.no_grad():
    y = test_block1(x)

print(y)

It does not give any NaN values either.

I don't know why, but the model produced valid results at epoch 1 and NaN values at epoch 4.
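
Since epoch 1 works and epoch 4 does not, one thing I could check is whether the weight magnitudes grow across epochs toward the FP16 limit. A quick sketch (the checkpoint file names are placeholders for my own, and I assume a flat state_dict of tensors):

for epoch in (1, 4):
    state = torch.load(f"checkpoint_epoch{epoch}.pth")  # placeholder path
    largest = max(p.abs().max().item() for p in state.values())
    print(epoch, largest, "FP16 max:", torch.finfo(torch.float16).max)

Even if the weights stay small, the activations they produce could still overflow, so the hook check above is probably the more direct test.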
