I use the following snippet of code to show the scale when using PyTorch's Automatic Mixed Precision package (amp):
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=65536.0, growth_interval=1)
print(scaler.get_scale())  # called once per training iteration
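For context, the print runs once per optimizer step inside the standard amp training loop, roughly like this (a minimal sketch; model, criterion, optimizer, and loader are placeholders for my actual setup):

for inputs, targets in loader:                    # placeholder DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)  # placeholder model and loss
    scaler.scale(loss).backward()                 # backward pass on the scaled loss
    scaler.step(optimizer)                        # skips the step if grads contain inf/NaN
    scaler.update()                               # a skipped step halves the scale (backoff_factor=0.5)
    print(scaler.get_scale())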
and this is the output that I get:
...
65536.0
32768.0
16384.0
8192.0
4096.0
...
1e-xxx
...
0
0
0
After this step every loss value becomes NaN, while the scale stays at 0. As far as I understand, GradScaler only backs off the scale when the gradients contain inf/NaN values, so it looks like every single step is overflowing. What's wrong with my loss function or training data?
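In case it helps narrow this down, this is the kind of check I plan to add (a sketch reusing the placeholder names from the loop above):

with torch.cuda.amp.autocast():
    loss = criterion(model(inputs), targets)
if not torch.isfinite(loss):                      # is the loss itself already non-finite?
    print("non-finite loss before scaling:", loss)

scaler.scale(loss).backward()
scaler.unscale_(optimizer)                        # bring grads back to their true range
for name, p in model.named_parameters():
    if p.grad is not None and not torch.isfinite(p.grad).all():
        print("non-finite grad in", name)
scaler.step(optimizer)                            # step() detects grads were already unscaled
scaler.update()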