I have implemented Q-Learning as described in:
http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf
To approximate Q(S,A), I use a neural network with the following structure:
- Activation: sigmoid
- Inputs: number of inputs + 1 for the action neurons (all inputs scaled to 0-1)
- Outputs: a single output, the Q-value
- N number of M Hidden Layers.
- Exploration method: random action when 0 < rand() < propExplore
At each learning iteration, I calculate a Q-Target value using the Q-learning update formula, then calculate an error using
error = QTarget - LastQValueReturnedFromNN
and backpropagate the error through the neural network.
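The error computation above can be sketched as follows. This is only a minimal sketch that assumes the standard one-step Q-learning target, r + gamma * max over a' of Q(s', a'); the function names, the `gamma` value and the example numbers are hypothetical and not taken from the linked paper.

```python
def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """One-step Q-learning target (assumed form): r + gamma * max_a' Q(s', a')."""
    if terminal:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def td_error(target, last_q_value):
    """Error that is backpropagated: QTarget - LastQValueReturnedFromNN."""
    return target - last_q_value


# With a single-output network, next_q_values would come from one forward
# pass per candidate action in the next state (hypothetical values here).
target = q_target(reward=1.0, next_q_values=[0.2, 0.5], gamma=0.9)
error = td_error(target, last_q_value=0.4)
```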
Q1. Am I on the right track? I have seen some papers that implement a NN with one output neuron for each action.
Q2. My reward function returns a number between -1 and 1. Is it OK to return a number between -1 and 1 when the activation function is sigmoid (range 0 to 1)?
Q3. From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training for XOR, sometimes it learns it after 2k iterations, and sometimes it won't learn even after 40k-50k iterations.
Q1. It is more efficient to put all the action neurons in the output layer, one output per action. A single forward pass will then give you all the Q-values for that state. In addition, the neural network will be able to generalize much better.
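A minimal sketch of that layout, in plain Python with no libraries: the state is the only input and the network returns one Q-value per action. The layer sizes, the tanh hidden activation, the weight initialization and all function names are assumptions for illustration.

```python
import math
import random


def init_layer(n_out, n_in, rng):
    """Small random weights; each row has one extra entry for the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(state, w_hidden, w_out):
    """Return one Q-value per action from a single forward pass."""
    x = list(state) + [1.0]                      # append bias input
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    h = hidden + [1.0]                           # append bias for output layer
    # Linear output layer: Q-values are not squashed into (0, 1).
    return [sum(w * v for w, v in zip(row, h)) for row in w_out]


def greedy_action(q_values):
    """Index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
n_state, n_hidden, n_actions = 2, 4, 3           # illustrative sizes
w_hidden = init_layer(n_hidden, n_state, rng)
w_out = init_layer(n_actions, n_hidden, rng)

q_values = forward([0.0, 1.0], w_hidden, w_out)  # all actions at once
best = greedy_action(q_values)
```

With the single-output design from the question, the same greedy choice would need one forward pass per action.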
Q2. Sigmoid is typically used for classification. While you can use sigmoid in other layers, I would not use it in the last one: its output is confined to (0, 1), so it can never represent a negative Q-value.
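To make the range mismatch concrete, here is a small sketch. It compares the sigmoid output range with the range the Q-values can take when rewards lie in [-1, 1]; the discount factor 0.9 is an assumed value, and `q_value_bound` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def q_value_bound(max_abs_reward, gamma):
    """Largest possible |Q| when every reward has magnitude <= max_abs_reward."""
    return max_abs_reward / (1.0 - gamma)


# A sigmoid output stays strictly inside (0, 1) ...
outputs = [sigmoid(x) for x in range(-20, 21)]

# ... while Q-values for rewards in [-1, 1] and gamma = 0.9 (assumed) can
# lie anywhere in [-10, 10].
bound = q_value_bound(max_abs_reward=1.0, gamma=0.9)
```

A linear output neuron has no such restriction, which is why it is the usual choice for regression targets like Q-values.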
Q3. Well, Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). It addresses two important issues. First, it decorrelates the training data by using memory replay; stochastic gradient descent does not cope well with training data that is presented in order. Second, it bootstraps using old weights, which reduces non-stationarity of the targets.
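Both ideas can be sketched in a few lines. This is a simplified illustration, not DeepMind's implementation: the class and function names, the buffer capacity and the sync interval are all hypothetical, and the weights are represented as plain nested lists.

```python
import copy
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)     # oldest entries drop out
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the temporal order of the transitions.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def maybe_sync(online_weights, target_weights, step, sync_every=100):
    """Return a fresh copy of the online weights every sync_every steps."""
    if step % sync_every == 0:
        return copy.deepcopy(online_weights)
    return target_weights


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
```

During training, the Q-target for each sampled transition would be computed with the frozen target weights, while only the online weights are updated by backpropagation.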