I am following this variational autoencoder tutorial: https://keras.io/examples/generative/vae/.
I know the VAE's loss function consists of a reconstruction loss, which compares the original image with its reconstruction, plus a KL loss. However, I'm a bit confused about whether the reconstruction loss should be computed over the entire image (sum of squared differences) or per pixel (average of squared differences). My understanding is that the reconstruction loss should be per pixel (MSE), but the example code I am following multiplies it by 28 x 28, the MNIST image dimensions. Is that correct? Furthermore, my assumption is that this makes the reconstruction loss term significantly larger than the KL loss, and I'm not sure we want that.
I tried removing the multiplication by 28 x 28, but this resulted in extremely poor reconstructions: essentially all the reconstructions looked the same regardless of the input. Can I use a lambda parameter to capture the tradeoff between the KL divergence and the reconstruction loss (sketched after the snippet below), or is that incorrect because the loss has a precise derivation (as opposed to just adding a regularization penalty)?
# Reconstruction term: per-pixel binary cross-entropy averaged over the batch,
# then scaled up by the number of pixels (28 * 28).
reconstruction_loss = tf.reduce_mean(
    keras.losses.binary_crossentropy(data, reconstruction)
)
reconstruction_loss *= 28 * 28
# KL term: analytic KL divergence between N(z_mean, exp(z_log_var)) and N(0, 1).
kl_loss = 1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var)
kl_loss = tf.reduce_mean(kl_loss)
kl_loss *= -0.5
total_loss = reconstruction_loss + kl_loss
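Concretely, this is the kind of change I have in mind (my own sketch, continuing the snippet above; lambda is just a weight I would pick by hand, not something from the tutorial):

# Hypothetical modification: keep the reconstruction loss per pixel (no 28 * 28
# scaling) and weight the KL term by a hand-picked lambda instead.
reconstruction_loss = tf.reduce_mean(
    keras.losses.binary_crossentropy(data, reconstruction)
)
lam = 0.001  # illustrative value only; it would need tuning
total_loss = reconstruction_loss + lam * kl_loss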
It isn't strictly necessary to multiply by the number of pixels. However, whether you do so will affect how the fitting algorithm behaves with respect to the other hyperparameters: your lambda parameter and the learning rate. In essence, if you want to remove the multiplication by 28 x 28 but retain the same fitting behavior, you should divide lambda by 28 x 28 and multiply your learning rate by 28 x 28. You were already approaching this idea in your question; the piece you were missing is the adjustment to the learning rate.
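To make the bookkeeping concrete, here is a small self-contained sketch (my own illustration; the function name, dummy tensors, and shapes are made up, not taken from the tutorial) checking that the tutorial's weighting and a per-pixel weighting with lambda = 1 / (28 x 28) on the KL term differ only by the overall constant 28 x 28. For plain SGD, that constant can be folded exactly into the learning rate.

import tensorflow as tf
from tensorflow import keras

P = 28 * 28  # number of MNIST pixels

def vae_losses(data, reconstruction, z_mean, z_log_var, lam):
    # Per-pixel reconstruction term and analytic KL term, as in the snippet above.
    recon = tf.reduce_mean(keras.losses.binary_crossentropy(data, reconstruction))
    kl = -0.5 * tf.reduce_mean(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
    loss_tutorial = P * recon + kl      # reconstruction scaled by 28 * 28
    loss_per_pixel = recon + lam * kl   # per-pixel reconstruction, weighted KL
    return loss_tutorial, loss_per_pixel

# Dummy tensors just to check the algebra numerically.
data = tf.random.uniform((8, 28, 28))
reconstruction = tf.random.uniform((8, 28, 28))
z_mean = tf.random.normal((8, 2))
z_log_var = tf.random.normal((8, 2))

loss_tutorial, loss_per_pixel = vae_losses(data, reconstruction, z_mean, z_log_var, lam=1.0 / P)
# loss_per_pixel * P equals loss_tutorial (up to float error), so dividing lambda by
# 28 * 28 and multiplying the learning rate by 28 * 28 reproduces the original updates.
tf.debugging.assert_near(loss_tutorial, loss_per_pixel * P, rtol=1e-5)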