I am using TensorFlow in an online learning setting. My cost function is implemented as:
cost = tf.sqrt(tf.reduce_mean(tf.square(tf.sub(Y, output))))
The optimization step is defined like this:
train_op = tf.train.GradientDescentOptimizer(0.0001).minimize(cost, name="GradientDescent")
And I run stochastic gradient descent like this:
m, i = sess.run([merged, train_op], feed_dict={X: input_batch, Y: label_batch})
where input_batch and label_batch each contain only a single vector.
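Since each batch holds only one example (N = 1), the RMSE cost above reduces to the absolute error of that single example:

$$\text{cost} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(Y_i - \text{output}_i\bigr)^2} \;\xrightarrow{\;N=1\;}\; \lvert Y - \text{output}\rvert$$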
So how can I interpret a cost curve like this:

Is this good progress for a stochastic approach? Why does the gap get bigger?
I train the network 50,000 times with the same 50 training examples, cycling through them one at a time, so each example is used for training about 1,000 times and recurs every 50th step.
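For clarity, the training loop looks roughly like this (simplified sketch; data and labels stand in for my 50 stored input vectors and their targets, everything else is defined as above):

for step in range(50000):
    idx = step % 50                          # cycle through the 50 training examples
    input_batch = data[idx:idx + 1]          # a "batch" of exactly one input vector
    label_batch = labels[idx:idx + 1]        # and its matching label
    m, i = sess.run([merged, train_op],
                    feed_dict={X: input_batch, Y: label_batch})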
I have already tried changing the learning rate by a factor of 10 in both directions. This question is related to my other question: Does Stochastic Gradient Descent even work with TensorFlow?
Thanks for any hints.