I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore.
I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just attempts to instantiate a new net, then to fill it with weights from a previous net that was saved and checkpointed.
I'm assigning a unique(ish) id number to a particular instantiation of a net by summing all the weights of the net. I compare these id numbers both at the creation of the net and after I've attempted to recover the checkpointed net.
def main(opt):
    # Initialize pix2pix GAN using arguments input from command line
    p2p = Pix2Pix(vars(opt))
    print(opt)
    # print sum of initial weights for net
    print("Init Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))
    # Create or read from model checkpoints
    checkpoint = tf.train.Checkpoint(generator_optimizer=p2p.generator_optimizer,
                                     discriminator_optimizer=p2p.discriminator_optimizer,
                                     generator=p2p.generator,
                                     discriminator=p2p.discriminator)
    # print sum of weights from checkpoint, to ensure it has access
    # to relevant regions of p2p
    print("Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))
    # Recover Checkpointed net
    checkpoint.restore(tf.train.latest_checkpoint(opt.weights)).expect_partial()
    # print sum of weights for p2p & checkpoint after attempting to restore saved net
    print("Restore Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))
    print("Restored Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))
    print("Done.")

if __name__ == '__main__':
    opt = parse_opt()
    main(opt)
The output I got when I ran this code was as follows:
Namespace(channels='1', data='data', img_size=256, output='output', weights='weights/ckpt-40.data-00000-of-00001')
## These are the input arguments, the images have only 1 channel (they're gray scale)
## The directory with data is ./data, the images are 256x256
## The output directory is ./output
## The checkpointed net is stored in ./weights/ckpt-40.data-00000-of-00001
## Sums of nets' weights
Init Model Weights: 11047.206374436617
Checkpoint Weights: 11047.206374436617
Restore Model Weights: 11047.206374436617
Restored Checkpoint Weights: 11047.206374436617
Done.
There is no change in the sum of the net's weights before and after recovering the checkpointed version, although p2p and checkpoint do seem to have access to the same locations in memory.
Why am I not recovering the saved net?
The problem arose because tf.train.latest_checkpoint needs the directory in which the checkpointed net is stored, not a specific file (or what I took to be the specific file, ./weights/ckpt-40.data-00000-of-00001).
When it is not given a directory that contains a checkpoint, it returns None, and tf.train.Checkpoint.restore(None) silently proceeds to the next line of code without updating the net or throwing an error. The fix was to give it the directory with the relevant checkpoint files (./weights), rather than just the file I believed to be relevant.
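For anyone hitting the same silent no-op, here is a minimal sketch of a guard I would now put around the restore call. The helper names checkpoint_dir_from and restore_latest are hypothetical (they are not part of TensorFlow or the tutorial), and the sketch assumes the standard TF2 checkpoint file naming (ckpt-N.index, ckpt-N.data-XXXXX-of-XXXXX). It maps a shard or index file path back to its directory, and raises an error if tf.train.latest_checkpoint finds nothing instead of passing None on to restore.

```python
import os
import re

# Assumed TF2 checkpoint file suffixes, e.g. "ckpt-40.index" or
# "ckpt-40.data-00000-of-00001".
_CKPT_FILE_SUFFIX = re.compile(r"\.(?:index|data-\d+-of-\d+)$")


def checkpoint_dir_from(path):
    """Hypothetical helper: map a checkpoint file path to its directory.

    Paths that do not look like a checkpoint shard/index file are
    returned unchanged and treated as a directory.
    """
    if _CKPT_FILE_SUFFIX.search(path):
        return os.path.dirname(path) or "."
    return path


def restore_latest(checkpoint, path):
    """Hypothetical helper: restore the latest checkpoint or fail loudly."""
    # Imported here so the path helper above works without TensorFlow.
    import tensorflow as tf

    ckpt_dir = checkpoint_dir_from(path)
    latest = tf.train.latest_checkpoint(ckpt_dir)
    if latest is None:
        # Without this check, checkpoint.restore(None) would do nothing.
        raise FileNotFoundError(f"No checkpoint found in {ckpt_dir!r}")
    # expect_partial() is kept from the original code; it only silences
    # warnings about checkpoint values that are not used.
    return checkpoint.restore(latest).expect_partial()
```

In main, the restore line would then read restore_latest(checkpoint, opt.weights). Note that tf.train.latest_checkpoint relies on the small "checkpoint" state file in that directory; if only the ckpt-40.* files are available, passing the prefix (for example weights/ckpt-40) straight to checkpoint.restore should also work.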