Is there a reason why NaN values appear when there are no NaN values in the model parameters?

I want to train the model in FP32 and run inference in FP16.

For other networks (e.g., ResNet), FP16 inference worked.

But for EDSR (super resolution), FP16 did not work.

The differences I found are the following (a small sketch illustrating them comes after this list):

  1. ReLU with inplace=True in EDSR
  2. PixelShuffle in EDSR
  3. No batchnorm in EDSR
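
For context, here is a minimal sketch of an EDSR-style block showing all three points; the layer widths are illustrative, not the exact EDSR configuration:

import torch.nn as nn

# EDSR-style residual block: in-place ReLU (point 1) and no batchnorm (point 3)
class ResBlock(nn.Module):
    def __init__(self, n_feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1),
        )

    def forward(self, x):
        # Without batchnorm, repeated residual additions let activations grow freely
        return x + self.body(x)

# EDSR-style upsampler: PixelShuffle (point 2) trades channels for 2x spatial resolution
upsampler = nn.Sequential(
    nn.Conv2d(64, 64 * 4, 3, padding=1),
    nn.PixelShuffle(2),
)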

I am using CUDA 11.3, Python 3.8.12, PyTorch 1.12.1, and cuDNN 8.7.0. Are there any operations in convolutional neural networks that do not support FP16?

GPU: RTX A6000

My process looks like this:

net_half = net.half()        # convert the FP32 model to FP16
net_half.eval()
input_half = input.half()    # convert the input tensor to FP16

with torch.no_grad():
    output_half = net_half(input_half)

I checked that there are no NaNs in the model parameters or in the input:

torch.stack([torch.isnan(p).any() for p in net_half.parameters()]).any()
torch.isnan(input_half).any()

Both give False.
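
Note that FP16 overflows above torch.finfo(torch.float16).max (65504), and an overflow first appears as Inf, which can later turn into NaN through operations like Inf - Inf or 0 * Inf, so I could run the same checks for Inf as well:

import torch

print(torch.finfo(torch.float16).max)  # 65504.0, the largest finite FP16 value

# Same checks as above, but for Inf instead of NaN
torch.stack([torch.isinf(p).any() for p in net_half.parameters()]).any()
torch.isinf(input_half).any()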

I also checked the basic operations used in EDSR:

import torch
import torch.nn as nn

device = torch.device('cuda')
# Ny and Nx are the input height and width (defined elsewhere)
x = torch.randn(1, 4, Ny // 2, Nx // 2)

test_block1 = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.ReLU(True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=True),
    nn.Conv2d(64, 64 * 4, kernel_size=3, padding=1, bias=True),
    nn.PixelShuffle(2),
    nn.ReLU(True),
    nn.Conv2d(64, 4, kernel_size=3, padding=1),
)
x = x.half().to(device)
test_block1 = test_block1.half().to(device)
with torch.no_grad():
    y = test_block1(x)

print(y)

This does not produce any NaN values.
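
Since this standalone block works but the full network does not, one way to narrow it down (a minimal sketch, using net_half and input_half from above) is to register forward hooks and report the first layer whose output contains NaN or Inf:

import torch

def check_output(name):
    def hook(module, args, output):
        # Flag any module whose FP16 output already contains NaN or Inf
        if isinstance(output, torch.Tensor) and \
                (torch.isnan(output).any() or torch.isinf(output).any()):
            print(f"{name} ({type(module).__name__}): NaN/Inf in output, "
                  f"max |input| = {args[0].abs().max().item():.4g}")
    return hook

# Hook every leaf module so the first offending layer gets reported
handles = [m.register_forward_hook(check_output(n))
           for n, m in net_half.named_modules()
           if len(list(m.children())) == 0]

with torch.no_grad():
    net_half(input_half)

for h in handles:
    h.remove()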

I don't know why, but the weights saved at epoch 1 gave valid results, while the weights saved at epoch 4 gave NaN values.
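
One thing I could compare (a sketch, with placeholder checkpoint filenames) is whether the weights simply grew between those epochs, since larger weights produce larger intermediate activations, and anything above the FP16 maximum of 65504 overflows:

import torch

fp16_max = torch.finfo(torch.float16).max  # 65504.0

# 'epoch1.pth' and 'epoch4.pth' are placeholder checkpoint paths
for path in ['epoch1.pth', 'epoch4.pth']:
    state = torch.load(path, map_location='cpu')
    largest = max(p.abs().max().item()
                  for p in state.values()
                  if torch.is_tensor(p) and p.is_floating_point())
    print(f'{path}: largest |weight| = {largest:.4g} (FP16 max = {fp16_max})')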
