I was wondering whether it is safe to ignore a warning in a model where a custom GradientTape is used to calculate some gradients with respect to the input features (but not with respect to the loss against the labels).
I was trying to implement TANGOS regularization (https://openreview.net/pdf?id=n6H86gW8u0d) in TensorFlow and Keras, which requires the gradient of the output of each layer with respect to the features fed into the model. I want the regularization to be applied only to certain layers (for example, batch norm layers are excluded), but as I understand it the whole forward pass must be done inside the tape block. This is the code inside the call() method (I use subclassing), where features_in is the input to the model and the layers have previously been created and stored in the self.Dense_layers list:
features_tensor = tf.convert_to_tensor(features_in, dtype=tf.float32)
layers_results = [features_tensor]
with tf.GradientTape(persistent=True) as my_tape:
    my_tape.watch(features_tensor)
    for i in range(len(self.Dense_layers)):
        layers_results.append(self.Dense_layers[i](layers_results[-1]))
output = layers_results[-1]
Next, I use a dictionary to store the gradients, using the names of the layers as keys:
gradients = {}
for i in range(len(self.Dense_layers)):
    if isinstance(self.Dense_layers[i], keras.layers.Dense):  # filter out activation, batch norm layers, etc.
        jacobian = my_tape.jacobian(layers_results[i + 1], features_tensor)
        # jacobian is a 4th-rank tensor of dims: n_batch, layer_out, n_batch, features_in,
        # so it is transposed to: layer_out, features_in, n_batch, n_batch and
        # only the diagonal part (same index in the last two n_batch dimensions) is kept
        jacobian = tf.linalg.diag_part(tf.transpose(jacobian, [1, 3, 2, 0]))
        jacobian = tf.transpose(jacobian, [2, 0, 1])  # dimensions: n_batch, layer_out, features_in
        gradients[self.Dense_layers[i].name] = jacobian
del my_tape
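(For reference, I believe the transpose/diag_part extraction above could also be written with the tape's batch_jacobian method, which, if I read the documentation correctly, returns the per-sample Jacobian directly; I have not checked that both versions give identical values:)

    # possible alternative to the transpose/diag_part extraction above (unverified):
    # batch_jacobian should return shape (n_batch, layer_out, features_in) directly
    jacobian = my_tape.batch_jacobian(layers_results[i + 1], features_tensor)
    gradients[self.Dense_layers[i].name] = jacobian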
And finally, this dictionary is used to calculate the penalties described in the paper for each of the affected layers.
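In case it helps, this is a rough sketch of how I intend to compute those penalties from the gradients dictionary. It follows my own reading of the paper (a specialization term as the mean L1 norm of each neuron's attribution, and an orthogonalization term as the mean absolute cosine similarity between pairs of neuron attributions), and lambda_spec / lambda_orth are just my own hyperparameter names:

    # my own sketch of the TANGOS penalties; the exact formulas should be
    # checked against the paper
    lambda_spec, lambda_orth = 1.0, 0.01  # hypothetical hyperparameter values
    total_penalty = 0.0
    for layer_name, jac in gradients.items():  # jac: (n_batch, layer_out, features_in)
        # specialization: mean L1 norm of each neuron's attribution vector
        spec = tf.reduce_mean(tf.reduce_sum(tf.abs(jac), axis=-1))
        # orthogonalization: mean absolute cosine similarity between different neurons
        normed = tf.math.l2_normalize(jac, axis=-1)        # (n_batch, layer_out, features_in)
        cos = tf.matmul(normed, normed, transpose_b=True)  # (n_batch, layer_out, layer_out)
        n_hidden = tf.shape(cos)[-1]
        off_diag = tf.abs(cos) * (1.0 - tf.eye(n_hidden))  # zero out self-similarities
        n_pairs = tf.cast(tf.shape(jac)[0] * n_hidden * (n_hidden - 1), tf.float32)
        orth = tf.reduce_sum(off_diag) / n_pairs
        total_penalty += lambda_spec * spec + lambda_orth * orth
    # total_penalty is not yet added to the loss (see below)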
When I run the code (before attempting to add the calculated penalties to the loss; right now I am just debugging and the computed gradients are discarded) I get a lot of warnings complaining about missing gradients:
WARNING:tensorflow:Gradients do not exist for variables ['model_12/e_mlp-52/kernel:0', (...)] when minimizing the loss. If you're using model.compile(), did you forget to provide a loss argument?
I think the variables in that list correspond to all the layers whose gradients I am not extracting for the loss calculation (and obviously I have provided the loss argument).
Is it safe to ignore the warnings? Apparently the model still optimizes its weights and the loss decreases over several epochs, I would say at a rate comparable to before I introduced the tape... And is there any way to get rid of the warning (in case the answer to the previous question is "yes", of course)?
Thanks,
Luis