I am building a BERT binary classifier on SageMaker using PyTorch.
Previously, when I ran the model with the batch size set to 16, it ran successfully. However, after I stopped SageMaker yesterday and restarted it this morning, I can no longer run the model with a batch size of 16. I am able to run it with a batch size of 8.
As expected, the model does not produce the same result with the smaller batch size. I didn't change anything else in between, and all other settings are the same. The only exception is that I changed the SageMaker volume size from 30 GB to 200 GB.
Does anyone know what may cause this problem? I really want to reproduce the result with batch size 16.
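If the failure turns out to be a GPU out-of-memory error (an assumption, since the error message isn't shown here), one common workaround is gradient accumulation: process the 16 examples as two micro-batches of 8 and take a single optimizer step. The sketch below illustrates the idea with a small `nn.Linear` model as a stand-in for BERT; the function name `train_step_accumulated` and all parameter names are hypothetical. With a mean-reduced loss, the accumulated gradient matches the full-batch gradient up to floating-point error, but layers such as dropout still draw different random masks, so results may not be bit-identical to the original batch-size-16 run.

```python
import torch
from torch import nn


def train_step_accumulated(model, optimizer, loss_fn, inputs, labels,
                           micro_batch_size=8):
    """Take one optimizer step over a full batch, fed in smaller micro-batches.

    Hypothetical helper for illustration; assumes loss_fn uses mean reduction.
    """
    optimizer.zero_grad()
    total = inputs.size(0)
    total_loss = 0.0
    for start in range(0, total, micro_batch_size):
        x = inputs[start:start + micro_batch_size]
        y = labels[start:start + micro_batch_size]
        loss = loss_fn(model(x), y)  # mean loss over this micro-batch
        # Weight by the micro-batch's share of the full batch so the summed
        # gradients equal the gradient of the mean loss over all examples.
        scaled = loss * (x.size(0) / total)
        scaled.backward()  # gradients accumulate in .grad across iterations
        total_loss += scaled.item()
    optimizer.step()
    return total_loss


# Toy usage: a linear layer stands in for the BERT classifier (assumption).
torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(16, 4)            # effective batch size of 16
labels = torch.randint(0, 2, (16,))
loss = train_step_accumulated(model, optimizer, loss_fn, inputs, labels,
                              micro_batch_size=8)
print(f"loss over effective batch of 16: {loss:.4f}")
```

Only one micro-batch of activations is held in memory at a time, so peak GPU memory is close to that of a batch-size-8 run while the update corresponds to a batch of 16.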
Any answers will help. Thank you in advance!