Minibatching in Stochastic Gradient Descent and in Q-Learning


Background (may be skipped):

In training neural networks, stochastic gradient descent (SGD) is usually used: instead of computing the network's error on all members of the training set and then updating the weights by gradient descent (which means waiting a long time before each weight update), each update uses a minibatch of members, and the resulting error is treated as an unbiased estimate of the true error.
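For concreteness, here is a minimal sketch of minibatch SGD (PyTorch is assumed; the model, data, and batch size are purely illustrative, not from the question):

```python
import torch

# Toy data: 1,000 examples with 10 features (illustrative only).
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size = 32

for step in range(100):
    # Draw one minibatch; its mean loss is an unbiased estimate
    # of the mean loss over the whole training set.
    idx = torch.randint(0, X.shape[0], (batch_size,))
    loss = torch.nn.functional.mse_loss(model(X[idx]), y[idx])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()  # exactly one weight update per minibatch
```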

In reinforcement learning, Q-learning is sometimes implemented with a neural network (as in deep Q-learning), together with experience replay: instead of updating the weights using only the agent's most recent (state, action, reward) transition, update using a minibatch of randomly sampled past transitions, so that there is no correlation between subsequent updates.
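A minimal sketch of the replay-buffer side, with illustrative names (store and sample_minibatch are hypothetical helpers, not from any particular library):

```python
import random
from collections import deque

# A bounded buffer of past transitions; old ones are evicted first.
buffer = deque(maxlen=100_000)

def store(transition):
    # transition = (state, action, reward, next_state)
    buffer.append(transition)

def sample_minibatch(batch_size=32):
    # Uniform random sampling from old transitions breaks the
    # temporal correlation between consecutive updates.
    return random.sample(list(buffer), batch_size)
```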

The Question:

Is the following assertion correct: when minibatching in SGD, one weight update is performed per minibatch as a whole, whereas when minibatching in Q-learning, one weight update is performed per member of the minibatch?

One more thing:

I think this question is more suitable for Cross Validated, since it is a conceptual question about machine learning and has nothing to do with programming, but judging by the questions tagged reinforcement-learning on Stack Overflow, it seems accepted to ask it here, and I am likely to get more responses.

1 answer

Best answer (by Lior):

The answer is no. The Q-network's parameters can be updated all at once, using every example in the minibatch. Denote the members of the minibatch by (s1, a1, r1, s'1), (s2, a2, r2, s'2), ... Then the loss is estimated relative to the current Q-network:

L = (Q(s1, a1) - (r1 + max_a Q(s'1, a)))^2 + (Q(s2, a2) - (r2 + max_a Q(s'2, a)))^2 + ...

This is an estimate of the true loss, which is an expectation over all (s, a, r, s'). In this way, updating the parameters of Q is analogous to a single SGD step on a minibatch (see the sketch after the notes below).

Notes:

  • the expression above could also include a discount factor multiplying each max term.
  • the estimate is biased, since it does not contain a term accounting for the variance due to s', but this does not change the direction of the gradient.
  • sometimes the second Q-network in each squared term is not the current Q but a frozen past copy of Q (a target network, as in DQN; a related idea underlies double Q-learning).
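Putting this together, here is a minimal sketch (PyTorch assumed; q_learning_loss and the tensor names are hypothetical) of computing the loss L above for a whole minibatch, using the current network for the targets as in the basic form of the answer:

```python
import torch

def q_learning_loss(q_net, states, actions, rewards, next_states, gamma=0.99):
    # Q(s_i, a_i) for every transition in the minibatch, in one pass.
    # states: [B, obs_dim], actions: [B] (int64), rewards: [B].
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets r_i + gamma * max_a Q(s'_i, a), computed without gradient
    # so that the gradient flows only through Q(s_i, a_i).
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values

    # One summed loss over the minibatch -> one gradient step for all
    # members at once, just as in minibatch SGD.
    return ((q_sa - targets) ** 2).sum()
```

Calling loss.backward() followed by optimizer.step() on this single summed loss performs exactly one weight update for the entire minibatch, which is why the assertion in the question is incorrect.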