Does Stochastic Gradient Descent even work with TensorFlow?


I designed an MLP, fully connected, with two hidden layers and one output layer. I get a nice learning curve if I use batch or mini-batch gradient descent.

But I get a flat line when performing stochastic gradient descent (the violet curve in the learning-curve plot).

What did I get wrong?

As I understand it, I'm doing stochastic gradient descent with TensorFlow if I feed just one training example per train step, like:

X = tf.placeholder("float", [None, amountInput], name="Input")          # input batch, 10 components per example
Y = tf.placeholder("float", [None, amountOutput], name="TeachingInput")  # target batch, 20 components per example
...
# one training step fed with a single example (batch size 1)
m, i = sess.run([merged, train_op], feed_dict={X: [input], Y: [label]})

Here input is a 10-component vector and label is a 20-component vector.

For testing I run 1000 iterations; each iteration feeds one of 50 prepared training examples. I expected an overfitted network, but as you can see, it doesn't learn :(
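For clarity, a rough sketch of my test loop (train_examples and train_labels stand in for my 50 prepared pairs; X, Y, merged and train_op are the tensors/ops defined above):

# rough sketch: 1000 steps, one of the 50 prepared examples per step
for step in range(1000):
    idx = step % 50                        # cycle through the 50 prepared examples
    input = train_examples[idx]            # 10-component input vector
    label = train_labels[idx]              # 20-component target vector
    m, i = sess.run([merged, train_op],
                    feed_dict={X: [input], Y: [label]})   # "batch" of size 1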

Because the network will run in an online-learning environment, mini-batch or batch gradient descent isn't an option.

Thanks for any hints.

1 Answer

nessuno · BEST ANSWER

The batch size influences the effective learning rate.

If you think about the update formula for a single parameter, you'll see that it's updated by averaging the gradients computed for that parameter over every element in the input batch.

This means that if you're working with a batch of size n, your "real" learning rate per single parameter is roughly learning_rate / n.
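To make that concrete, a toy sketch with made-up numbers (plain Python, just to show the scaling):

# under batch GD each example's gradient is divided by the batch size
learning_rate = 1e-4
n = 128
grad_i = 0.5                                       # gradient from one example (made up)
contribution_batch = learning_rate * grad_i / n    # what this example adds in one batch step
contribution_sgd   = learning_rate * grad_i        # what it adds in one pure SGD step
print(contribution_sgd / contribution_batch)       # 128.0 -> the SGD step is n times larger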

Thus, if the model trained without issues with batches of size n, that's because the learning rate was appropriate for that batch size.

If you use pure stochastic gradient descent, you have to lower the learning rate (usually by a factor of some power of 10).

So, for example, if your learning rate was 1e-4 with a batch size of 128, try a learning rate of 1e-4 / 128.0 and see if the network learns (it should).
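With TensorFlow's plain gradient descent optimizer that would look something like this (assuming your loss tensor is called loss):

# scale the learning rate down by the batch size you used before
batch_size = 128
learning_rate = 1e-4 / batch_size
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
train_op = optimizer.minimize(loss)    # loss is your network's loss tensor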