Keras: Can resuming training from a loaded checkpoint decrease the accuracy?

My Keras model saves a checkpoint every time training reaches a new best result.

However, my internet connection dropped. When I loaded my last checkpoint and resumed training from the last saved epoch (using initial_epoch), the accuracy dropped from 89.1 (the value of the loaded model) to 83.6 in the first epoch of the new run. Is this normal behavior when resuming training? When my connection dropped, training was already at epoch 30. Up to that point there had been no drop in accuracy, but also no significant improvement, so no new checkpoint had been generated, which forced me to go back a few epochs.

Thanks in advance for the help.

There is 1 answer

Timbus Calin (best answer)

The problem with saving a model and resuming training is that the new run does not retain the history of the previous one: if you resume from a model trained up to epoch N, then at epoch N+1 the earlier metrics are no longer known.

Scenario:

You are training a model for 30 epochs. At epoch 15, you have an accuracy of 88% (say you save your model according to the best validation accuracy). Unfortunately, something happens and your training crashes. However, since you trained with checkpoints, you have the resulting model obtained at epoch 15, before your program crashed.

If you resume training from epoch 15, the previous validation accuracies will not be remembered anywhere, because as far as the checkpointing logic is concerned you are training "from scratch" again. If you get a validation accuracy of 84% at epoch 16, your best model (with 88% accuracy) will be overwritten by the epoch 16 model, because there is no saved or internal record of the earlier training and validation accuracies. Under the hood, in the new run, 84% is compared to -inf, so the epoch 16 model is saved.
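To make the comparison concrete, here is a minimal pure-Python sketch of the "save only on improvement" bookkeeping. It is a toy stand-in, not Keras's actual implementation: the class name `BestTracker` and its methods are hypothetical, and the numbers are the ones from the scenario.

```python
import math


class BestTracker:
    """Toy stand-in (hypothetical) for 'save only the best model' bookkeeping."""

    def __init__(self, initial_best=-math.inf):
        # A freshly created tracker knows nothing about earlier runs.
        self.best = initial_best
        self.saved_epochs = []

    def on_epoch_end(self, epoch, val_accuracy):
        # "Save" (overwrite the checkpoint) only if the monitored value improves.
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.saved_epochs.append(epoch)
            return True
        return False


# Resumed run with a fresh tracker: 0.84 beats -inf, so the 0.88 checkpoint
# from the first run would be overwritten by the epoch 16 model.
resumed = BestTracker()
overwritten = resumed.on_epoch_end(16, 0.84)

# Resumed run seeded with the previous best: 0.84 does not beat 0.88,
# so the earlier checkpoint is kept.
seeded = BestTracker(initial_best=0.88)
kept = not seeded.on_epoch_end(16, 0.84)
```

With a fresh tracker the worse model replaces the better one; with the seeded tracker nothing is saved until a later epoch exceeds 0.88.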

The solution is either to retrain from scratch or to initialise the validation accuracies of the second run with the values from the previous run (entered manually or obtained from a callback). That way, the best accuracy that Keras compares against under the hood at the end of each epoch is 88% (in the scenario above), not -inf.
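One way this could look in practice is sketched below, under some assumptions. The helper names `best_from_history` and `make_seeded_checkpoint`, the history values, and the file name are illustrative and not from the original post. The history is assumed to be a dict of per-epoch lists, the shape Keras exposes as `History.history`. The sketch also assumes a recent Keras version in which `ModelCheckpoint` accepts `initial_value_threshold`; on older versions, a commonly used alternative is to set the callback's `best` attribute after creating it.

```python
import math


def best_from_history(history, monitor="val_accuracy"):
    """Return the best (highest) monitored value from a previous run's history."""
    # `history` maps metric names to per-epoch lists of values.
    return max(history.get(monitor, []), default=-math.inf)


def make_seeded_checkpoint(filepath, previous_best):
    """Build a checkpoint callback that only saves models beating previous_best."""
    # Imported lazily so best_from_history works without Keras installed.
    import keras

    return keras.callbacks.ModelCheckpoint(
        filepath,
        monitor="val_accuracy",
        mode="max",
        save_best_only=True,
        # Assumption: available in recent Keras versions only.
        initial_value_threshold=previous_best,
    )


# Illustrative values matching the scenario (best validation accuracy 88%).
previous_history = {"val_accuracy": [0.80, 0.85, 0.88]}
previous_best = best_from_history(previous_history)
```

The resumed run would then pass `make_seeded_checkpoint("best_model.keras", previous_best)` in the `callbacks` list of `model.fit(...)` together with `initial_epoch`, so that an 84% epoch no longer overwrites the 88% checkpoint.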