Background (may be skipped):
When training neural networks, stochastic gradient descent (SGD) is usually used: instead of computing the network's error on the whole training set and only then updating the weights by gradient descent (which means waiting a long time before each weight update), each update uses a minibatch of training examples, and the resulting error is treated as an unbiased estimate of the true error.
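For concreteness, here is a minimal sketch of that minibatch-SGD loop (the model, data, and hyperparameters are placeholders for illustration, not anything specific from the question):

```python
import torch

# Placeholder model, loss, and data for illustration only.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(1000, 10)  # training inputs
y = torch.randn(1000, 1)   # training targets
batch_size = 32

for epoch in range(5):
    perm = torch.randperm(len(X))
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]           # pick one minibatch
        loss = loss_fn(model(X[idx]), y[idx])  # error on the minibatch only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                       # one weight update per minibatch
```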
In reinforcement learning, Q-learning is sometimes implemented with a neural network (as in deep Q-learning), and experience replay is used: instead of updating the weights from the agent's most recent (state, action, reward, next state) transition, the update uses a minibatch of randomly sampled old transitions, so that consecutive updates are not correlated.
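A sketch of the experience-replay part (the buffer size and the exact fields stored are assumptions for illustration):

```python
import random
from collections import deque

# Replay buffer: transitions are stored as the agent experiences them.
replay_buffer = deque(maxlen=100_000)

def store(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=32):
    # A uniformly random sample of old transitions, not the most recent ones,
    # so consecutive updates are decorrelated.
    return random.sample(replay_buffer, batch_size)
```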
The Question:
Is the following assertion correct: when minibatching in SGD, one weight update is performed per whole minibatch, while when minibatching in Q-learning, one weight update is performed per member of the minibatch?
One more thing:
I think this question is more suitable for Cross Validated, as it is a conceptual machine-learning question and has nothing to do with programming, but judging by the questions tagged reinforcement-learning on Stack Overflow, asking it here seems to be the norm, and I am likely to get more responses.
The answer is no: the Q-network's parameters are updated once per minibatch, using all of its examples at the same time. Denote the members of the minibatch by (s1, a1, r1, s'1), (s2, a2, r2, s'2), ... The loss is then estimated relative to the current Q-network:
L = (Q(s1,a1) - (r1 + γ max_a' Q(s'1,a')))^2 + (Q(s2,a2) - (r2 + γ max_a' Q(s'2,a')))^2 + ...

where γ is the discount factor and the max is taken over the actions a' available in the next state.
This is an estimate of the true loss, which is an expectation over all transitions (s, a, r, s'). In this way, updating the parameters of Q works just like minibatch SGD: the whole minibatch yields a single loss value and a single gradient step.
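As a concrete illustration, here is a sketch of such an update in PyTorch. The network architecture, the optimizer, and the discount factor gamma are assumptions for illustration, not part of the question; the point is only that the whole minibatch produces one loss and one parameter update.

```python
import torch

# Assumed Q-network (4-dim states, 2 actions), optimizer, and discount factor.
q_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def q_update(states, actions, rewards, next_states):
    # states: (B, 4) float, actions: (B,) int64, rewards: (B,) float, next_states: (B, 4) float
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # Q(s_i, a_i)
    with torch.no_grad():
        targets = rewards + gamma * q_net(next_states).max(dim=1).values  # r_i + gamma * max_a' Q(s'_i, a')
    loss = ((q_sa - targets) ** 2).sum()  # the summed squared errors from the formula above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                      # ONE weight update for the whole minibatch
```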
Notes: