I'm trying to train ArcFace with reference to a GitHub implementation.
As far as I know, ArcFace requires more than 200 training epochs on CASIA-WebFace with a large batch size.
Within 100 epochs of training, I had to stop the run for a while because I needed the GPU for other tasks, and I saved checkpoints of the model (ResNet) and the margin. Before it was stopped, the loss was between 0.3 and 1.0, and training accuracy had reached 80-95%.
When I resume ArcFace training by loading the checkpoint files with load_state_dict, the first batch looks normal, but then the loss suddenly increases sharply and the accuracy drops very low. How did this happen? Having no other option, I continued training anyway, but the loss does not seem to be decreasing well, even though the model has already been trained for over 100 epochs...
When I searched for similar issues, people said the problem was that the optimizer state was not saved (the reference GitHub page didn't save the optimizer, so neither did I). Is that true?
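For reference, my save/resume code looks roughly like the sketch below (simplified; `backbone` and `margin` are placeholder modules standing in for my actual ResNet and ArcFace margin layer):

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Placeholder modules standing in for my actual ResNet backbone and ArcFace margin layer
backbone = nn.Linear(128, 512)
margin = nn.Linear(512, 10572)
params = list(backbone.parameters()) + list(margin.parameters())
optimizer = optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# What I save when I stop training (no optimizer / scheduler state)
torch.save(backbone.state_dict(), 'backbone.pth')
torch.save(margin.state_dict(), 'margin.pth')

# What I do when I resume: only the weights are restored,
# the optimizer and scheduler are re-created from scratch
backbone.load_state_dict(torch.load('backbone.pth'))
margin.load_state_dict(torch.load('margin.pth'))
```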
If you look at this line, you are decaying the learning rate of each parameter group by gamma, so by the time you reached the 100th epoch your learning rate had already been reduced. Moreover, you did not save your optimizer state when saving your model.
This made your code start again from the initial lr, i.e. 0.1, after resuming training, and that is what spiked your loss.
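A rough sketch of the fix (the names here, such as `backbone`, `margin`, and `checkpoint.pth`, are assumptions; adapt them to your training script): save the optimizer and scheduler state_dicts together with the model weights, and restore all of them before resuming, so the learning rate continues from its decayed value instead of 0.1.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Placeholders for your ResNet backbone and margin head; swap in your real modules
backbone, margin = nn.Linear(128, 512), nn.Linear(512, 10572)
params = list(backbone.parameters()) + list(margin.parameters())
optimizer = optim.SGD(params, lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Saving: bundle model, optimizer and scheduler state into one checkpoint
torch.save({
    'epoch': 100,
    'backbone': backbone.state_dict(),
    'margin': margin.state_dict(),
    'optimizer': optimizer.state_dict(),   # momentum buffers + current lr per param group
    'scheduler': scheduler.state_dict(),   # how many decay steps have already happened
}, 'checkpoint.pth')

# Resuming: restore everything, not just the model weights
ckpt = torch.load('checkpoint.pth')
backbone.load_state_dict(ckpt['backbone'])
margin.load_state_dict(ckpt['margin'])
optimizer.load_state_dict(ckpt['optimizer'])
scheduler.load_state_dict(ckpt['scheduler'])
start_epoch = ckpt['epoch'] + 1
# optimizer.param_groups[0]['lr'] now continues from the decayed value, not 0.1
```

If you only have your old checkpoints without the optimizer state, you can at least set the learning rate manually to the value it should have at epoch 100 before continuing, which should avoid the loss spike.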
Vote if you find this useful.