I am using the adam_sgd optimiser to train a neural network, and I am having trouble mapping the function's arguments to the parameters reported in the Adam paper. More specifically, how do the parameters alpha, beta1, beta2, and epsilon relate to the learning rate and momentum in the CNTK implementation of Adam?


1 Answer

Sayan Pathak:
  • Alpha is the learning rate (the lr argument)
  • Beta1 is the momentum parameter
  • Beta2 is the variance_momentum parameter

See the sketch below for how these fit together.
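For concreteness, here is a minimal sketch of that mapping when constructing the learner. It assumes the CNTK 2.x Python API (cntk.learners.adam; earlier betas exposed the same argument mapping through adam_sgd, and the exact schedule helper names vary slightly across point releases). The toy model and the schedule values are illustrative only. Note that in the 2.x API the paper's epsilon is also exposed directly as the epsilon argument.

    import cntk as C

    # Toy model: a single dense layer, purely for illustration.
    x = C.input_variable(100)
    z = C.layers.Dense(10)(x)

    # alpha in the paper -> the learning-rate schedule passed as lr
    lr = C.learning_parameter_schedule(0.001)
    # beta1 -> momentum, beta2 -> variance_momentum
    beta1 = C.momentum_schedule(0.9)
    beta2 = C.momentum_schedule(0.999)

    learner = C.adam(
        z.parameters,
        lr=lr,
        momentum=beta1,
        variance_momentum=beta2,
        epsilon=1e-8,  # the paper's epsilon, exposed directly in CNTK 2.x
    )

The learner can then be handed to a Trainer as usual; the point is only that each Adam hyperparameter from the paper has a one-to-one named argument in the CNTK call.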