Validation loss becomes NaN while training on TPU but is fine on GPU


I encountered a rather strange problem on Google Colab when training with GPU and TPU. I use a custom loss, and it works fine with my TFRecord dataset on GPU, but it gives NaN as the validation loss if I switch to TPU. There is no other specific error. Also, an older validation TFRecord dataset works fine on TPU, which made me think there may be something specific to the data. The NaN shows up in model.evaluate(...) as well, since it occurs on the validation set.

Any idea how best to debug this on the TPU? More details can be provided upon request.


There are 2 answers

kawingkelvin:

My issue could be related to https://github.com/tensorflow/tensorflow/issues/41635 (although that one was seen even with a non-custom loss function). In my case, I don't see the NaN with an out-of-the-box loss function, but I hit it when I use a custom loss. The custom loss doesn't seem to be the main cause, as it works on both CPU and GPU with any dataset.

Anyway, I followed the issue's tip and dropped the last batch (its size is smaller than batch_size), and the NaN is no longer seen. While this fixed the problem, I still don't have a clear answer on the root cause.
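For reference, here is a minimal sketch of how that fix looks in a tf.data input pipeline. The feature spec, image preprocessing, and batch size are placeholders, since the question doesn't show the real pipeline; the relevant part is drop_remainder=True on batch().

```python
import tensorflow as tf

BATCH_SIZE = 128  # assumed global batch size

# Example feature spec; the real one depends on how the TFRecords were written.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, parsed["label"]

def make_dataset(tfrecord_files, training=False):
    ds = tf.data.TFRecordDataset(tfrecord_files)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.shuffle(10_000)
    # drop_remainder=True discards the final partial batch, which is what
    # made the NaN validation loss disappear on TPU in my case.
    ds = ds.batch(BATCH_SIZE, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)
```

TPUs compile graphs with static shapes, which is presumably why the odd-sized final batch is a common source of trouble there.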

Dhruv kejriwal:

This is what fixed the problem for me. Add these arguments to model.fit():

steps_per_epoch = train.shape[0] // batch_size
validation_steps = validate.shape[0] // batch_size

Integer division leaves out the final partial batch, so it is never processed and the NaN no longer appears.
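For completeness, a minimal self-contained sketch of how those arguments plug into model.fit(). The model, data shapes, and batch size are made up for illustration, with train and validate replaced by dummy NumPy arrays.

```python
import numpy as np
import tensorflow as tf

batch_size = 128  # assumed batch size

# Dummy arrays standing in for the real train/validate data.
x_train = np.random.rand(1000, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(1000, 1)).astype("float32")
x_val = np.random.rand(300, 20).astype("float32")
y_val = np.random.randint(0, 2, size=(300, 1)).astype("float32")

# Integer division leaves out the final partial batch.
steps_per_epoch = x_train.shape[0] // batch_size
validation_steps = x_val.shape[0] // batch_size

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=5,
    validation_data=(x_val, y_val),
    steps_per_epoch=steps_per_epoch,    # skip the partial training batch
    validation_steps=validation_steps,  # skip the partial validation batch
)
```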