In the CNTK implementation of the Adam optimizer, how do the parameters alpha, beta1, beta2 and epsilon relate to learning rate and momentum?

I am using the adam_sgd learner to train a neural network, and I am having trouble matching the arguments of the function to the parameters reported in the Adam paper. More specifically, how do the paper's parameters alpha, beta1, beta2 and epsilon relate to the learning rate and momentum arguments in the CNTK implementation of Adam?
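For context, here is a minimal sketch of the call I am trying to understand, assuming the CNTK 2.x Python API (where adam_sgd was renamed to cntk.learners.adam). The mapping in the comments is my guess at how the arguments line up with the paper, which is exactly what I would like confirmed:

```python
import cntk as C

# Toy model and loss, purely for illustration.
x = C.input_variable(2)
y = C.input_variable(1)
model = C.layers.Dense(1)(x)
loss = C.squared_error(model, y)

# Presumed mapping to the Adam paper (my assumption, not confirmed):
#   lr                -> alpha   (the step size)
#   momentum          -> beta1   (decay rate of the first-moment estimate)
#   variance_momentum -> beta2   (decay rate of the second-moment estimate)
#   epsilon           -> epsilon (numerical-stability constant)
learner = C.learners.adam(
    model.parameters,
    lr=C.learning_parameter_schedule(0.001),       # alpha?
    momentum=C.momentum_schedule(0.9),             # beta1?
    variance_momentum=C.momentum_schedule(0.999),  # beta2?
    epsilon=1e-8,
    unit_gain=False,  # True enables CNTK's unit-gain variant, which rescales the update
)
trainer = C.Trainer(model, (loss, None), [learner])
```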