Is there some trick to getting 1-step TD (temporal difference) prediction to converge with a neural net? The network is a simple feed-forward network using ReLU activations. I've got the network working for Q-learning in the following way:
gamma = 0.9

# Predicted values at the next time step, one prediction per possible action
q0 = model.predict(X0[times+1])
q1 = model.predict(X1[times+1])
q2 = model.predict(X2[times+1])

# Greedy value over actions: min, because the negated rewards below act as costs
q_Opt = np.min(np.concatenate((q0,q1,q2),axis=1),axis=1)

# Rewards are negative, so negate them to treat them as positive costs
target = -np.array(rewards)[times] + gamma * q_Opt
Where X0, X1, and X2 are MNIST image features with the actions 0, 1, and 2 concatenated onto them, respectively (a rough sketch of that construction is below). This method converges.
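For concreteness, the inputs are built roughly like this (a simplified sketch; the dummy data and exact preprocessing are placeholders, not my real pipeline):

import numpy as np

# Stand-in for the real MNIST data; in practice these are the actual images
mnist_images = np.random.rand(1000, 28, 28)

n = mnist_images.shape[0]
features = mnist_images.reshape(n, -1)  # flatten each image to a 784-dim feature vector

# Append a constant action id to every row, producing one input array per action
X0 = np.concatenate([features, np.full((n, 1), 0.0)], axis=1)  # action 0 -> shape (n, 785)
X1 = np.concatenate([features, np.full((n, 1), 1.0)], axis=1)  # action 1
X2 = np.concatenate([features, np.full((n, 1), 2.0)], axis=1)  # action 2

What I'm trying that doesn't work: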
# What I'm trying that doesn't work
v_hat_next = model.predict(X[times+1])                    # predicted value of the next state
target = -np.array(rewards)[times] + gamma * v_hat_next   # TD(0) target with negated rewards
history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)
This method doesn't converge at all and in fact gives identical state values for every state. Any idea what I'm doing wrong? Is there some trick to setting up the target? The target is supposed to be $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, and I thought that's what I've done here.
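To be explicit about what I mean by the target: I'm aiming for the standard semi-gradient TD(0) update (in Sutton & Barto notation), where the first two terms inside the brackets are exactly the target I'm computing above:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right] \nabla_{\mathbf{w}}\,\hat{v}(S_t, \mathbf{w})$$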