Why is there a problem when loading saved weights on a model


I'm trying to modify a classifier model with various techniques (dropout, an autoencoder, etc.) to analyse what gets the best results. To compare runs from the same starting point, I am using the save_weights and load_weights methods.

The first time I launch my model, it works fine. However, after loading the weights, fit doesn't do anything: the loss stagnates during the entire training.

I know I must be doing something wrong, but I don't know what. At first I thought it was a vanishing-gradient issue, since I first encountered the problem with the autoencoded dataset. But after many tweaks and tries, I believe the issue lies in the weight loading. See for yourself (this is obviously after a runtime restart):

# Classifier

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()

# 4 hidden layers of 50 units, softmax output over the 10 digit classes
model.add(Dense(50, activation='relu', input_dim=x.shape[1]))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

model.save_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneStart')

First training run (loading the initial weights, then fit; yes, I know the initial weights are already in place at this point, but I left the line in to prove that it doesn't cause a problem here):

model.load_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneStart')

model.fit(x, y_train, epochs=10, batch_size=20, validation_split=0.15)

model.save_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneNormal')

Results:

Train on 35700 samples, validate on 6300 samples
Epoch 1/10
35700/35700 [==============================] - 5s 128us/step - loss: 1.0875 - acc: 0.8036 - val_loss: 0.3275 - val_acc: 0.9067
Epoch 2/10
35700/35700 [==============================] - 4s 120us/step - loss: 0.2792 - acc: 0.9201 - val_loss: 0.3186 - val_acc: 0.9079
Epoch 3/10
35700/35700 [==============================] - 4s 122us/step - loss: 0.2255 - acc: 0.9357 - val_loss: 0.1918 - val_acc: 0.9444
Epoch 4/10
35700/35700 [==============================] - 4s 121us/step - loss: 0.1777 - acc: 0.9499 - val_loss: 0.1977 - val_acc: 0.9465
Epoch 5/10
35700/35700 [==============================] - 4s 121us/step - loss: 0.1530 - acc: 0.9549 - val_loss: 0.1718 - val_acc: 0.9478
Epoch 6/10
35700/35700 [==============================] - 4s 121us/step - loss: 0.1402 - acc: 0.9595 - val_loss: 0.1847 - val_acc: 0.9510
Epoch 7/10
35700/35700 [==============================] - 4s 122us/step - loss: 0.1236 - acc: 0.9637 - val_loss: 0.1675 - val_acc: 0.9546
Epoch 8/10
35700/35700 [==============================] - 4s 121us/step - loss: 0.1160 - acc: 0.9660 - val_loss: 0.1776 - val_acc: 0.9586
Epoch 9/10
35700/35700 [==============================] - 4s 120us/step - loss: 0.1109 - acc: 0.9683 - val_loss: 0.1928 - val_acc: 0.9492
Epoch 10/10
35700/35700 [==============================] - 4s 120us/step - loss: 0.1040 - acc: 0.9701 - val_loss: 0.1749 - val_acc: 0.9570
WARNING:tensorflow:This model was compiled with a Keras optimizer (<tensorflow.python.keras.optimizers.Adam object at 0x7fb76ca35080>) but is being saved in TensorFlow format with `save_weights`. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.

Consider using a TensorFlow optimizer from `tf.train`.

Second training run (loading the initial weights, then fit):

model.load_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneStart')

model.fit(x, y_train, epochs=10, batch_size=20, validation_split=0.15)

model.save_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneNormal')

Results:

Train on 35700 samples, validate on 6300 samples
Epoch 1/10
35700/35700 [==============================] - 4s 121us/step - loss: 14.4847 - acc: 0.1011 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 2/10
35700/35700 [==============================] - 4s 122us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 3/10
35700/35700 [==============================] - 4s 120us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 4/10
35700/35700 [==============================] - 4s 121us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 5/10
35700/35700 [==============================] - 4s 121us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 6/10
35700/35700 [==============================] - 4s 121us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 7/10
35700/35700 [==============================] - 4s 122us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 8/10
35700/35700 [==============================] - 4s 121us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 9/10
35700/35700 [==============================] - 4s 122us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
Epoch 10/10
35700/35700 [==============================] - 5s 130us/step - loss: 14.5018 - acc: 0.1003 - val_loss: 14.5907 - val_acc: 0.0948
WARNING:tensorflow:This model was compiled with a Keras optimizer (<tensorflow.python.keras.optimizers.Adam object at 0x7fb76ca35080>) but is being saved in TensorFlow format with `save_weights`. The model's weights will be saved, but unlike with TensorFlow optimizers in the TensorFlow format the optimizer's state will not be saved.

Consider using a TensorFlow optimizer from `tf.train`.

Thanks in advance for your help :)

PS: Here's the data for reference, but I really don't think it's the problem. It's an MNIST-like dataset provided by Google on Kaggle (I believe it is exactly MNIST, just without all the samples):

import pandas as pd
from keras.utils import np_utils

# Training CSV: first column is the label, the remaining 784 are pixel values
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/IA/Kaggle TP1/train.csv')
data = df.values
data.shape        # (42000, 785)

y = data[:, 0]                            # digit labels
y_train = np_utils.to_categorical(y, 10)  # one-hot encode the 10 classes
x = data[:, 1:]                           # raw pixels

1 Answer

ixeption (best answer):

To restart training for a model that has already been used with the fit() function, you have to recompile it:

model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics = ['acc'])

The reason is that the model has an optimizer assigned, which is already in some state. This state reflects the training progress, so if you do not recompile the model, training will continue from that state. If your model got stuck during the first training, it will almost certainly continue to be stuck (learning rate too low, etc.).

compile() defines the loss function, the optimizer and the metrics; it has nothing to do with the weights assigned to the layers.
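
For completeness, here is a minimal sketch of the corrected reload-and-retrain sequence, reusing the paths and hyperparameters from the question; the key addition is the compile() call between load_weights() and fit(). Passing the optimizer as the string 'adam' makes Keras create a fresh Adam instance, so its state (iteration count, moment estimates) is reset rather than carried over:

# Reload the initial weights saved right after the model was built
model.load_weights('/content/drive/My Drive/Colab Notebooks/Weights/KagTPOneStart')

# Recompile to attach a fresh optimizer with clean state
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

# Training now starts from scratch instead of resuming a stuck optimizer state
model.fit(x, y_train, epochs=10, batch_size=20, validation_split=0.15)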