Questions about Q-Learning using Neural Networks

Asked by Hamza Yerlikaya on 07 December 2014 at 08:27

I have implemented Q-Learning as described in:

http://web.cs.swarthmore.edu/~meeden/cs81/s12/papers/MarkStevePaper.pdf

To approximate Q(S,A), I use a neural network structured as follows:

Activation: sigmoid
Inputs: number of state inputs + 1 for the action neurons (all inputs scaled to 0-1)
Outputs: single output, the Q-value
Hidden layers: N number of M hidden layers
Exploration method: random, taken when 0 < rand() < propExplore

At each learning iteration, using the following formula,

[Image: Q-Target formula, not reproduced in this copy]

I calculate a Q-Target value and then calculate an error using:

error = QTarget - LastQValueReturnedFromNN

and backpropagate the error through the neural network.
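
The formula image is missing from this copy, so the following is only a minimal sketch of the step described above. It assumes the formula is the standard one-step Q-learning target, r + gamma * max Q(s', a'). The function names, the discount value `gamma=0.9`, and the sample numbers are illustrative assumptions, not taken from the question.

```python
def q_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Assumed standard one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if terminal:
        # No future value to bootstrap from at the end of an episode.
        return reward
    return reward + gamma * max(next_q_values)


def td_error(target, last_q_value):
    # Same form as in the question: error = QTarget - LastQValueReturnedFromNN
    return target - last_q_value


# Hypothetical numbers: the Q-values the network returns for each action in the next state.
target = q_target(reward=0.5, next_q_values=[0.2, 0.7, -0.1], gamma=0.9)
error = td_error(target, last_q_value=0.4)
```

With a single-output network as described, `next_q_values` would have to be collected by running one forward pass per candidate action.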

Q1. Am I on the right track? I have seen some papers that implement a NN with one output neuron for each action.

Q2. My reward function returns a number between -1 and 1. Is it OK to return a number between -1 and 1 when the activation function is a sigmoid (range 0 to 1)?

Q3. From my understanding of this method, given enough training instances it should be guaranteed to find an optimal policy, right? When training for XOR, sometimes it learns it after 2k iterations, and sometimes it won't learn even after 40k to 50k iterations.

There is 1 answer

**Juan Leni** · Accepted Answer · 2016-02-27T08:22:39+00:00

Q1. It is more efficient to put all the action neurons in the output layer, one output per action. A single forward pass will then give you the Q-values of all actions for that state. In addition, the neural network will be able to generalize much better.
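
A minimal sketch of that layout, in plain Python so it runs without extra libraries. The layer sizes, the weight initialisation range, and the helper names (`make_layer`, `forward`) are assumptions chosen for illustration; the output layer is linear here, which is one common choice rather than something the question specifies.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    # One row of weights per neuron; the last entry of each row is the bias.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(state, hidden, output):
    # Sigmoid hidden layer (a constant 1.0 is appended to feed the bias weight).
    h = [sigmoid(sum(w * x for w, x in zip(row, state + [1.0]))) for row in hidden]
    # Linear output layer: one unbounded Q-value per action.
    return [sum(w * x for w, x in zip(row, h + [1.0])) for row in output]


rng = random.Random(0)
n_state, n_hidden, n_actions = 4, 8, 3  # illustrative sizes
hidden = make_layer(n_state, n_hidden, rng)
output = make_layer(n_hidden, n_actions, rng)

state = [0.1, 0.9, 0.3, 0.5]  # inputs scaled to 0-1, as in the question
q_values = forward(state, hidden, output)  # all Q-values from one pass
best_action = max(range(n_actions), key=lambda a: q_values[a])
```

Only the state goes in as input; the action is selected by indexing the output list, so greedy action selection needs one forward pass instead of one per action.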

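Q2. Sigmoid is typically used for classification. While you can use sigmoid in the other layers, I would not use it in the last one. A sigmoid output is confined to the range 0 to 1, so it cannot represent the negative Q-values that rewards between -1 and 1 can produce; a linear output does not have this limit.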
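
To make the range mismatch concrete, here is a small sketch. It assumes a discounted, infinite-horizon setting with discount `gamma` (the value 0.9 is an assumption, not from the question), where the return is bounded by the geometric series r / (1 - gamma).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def q_bounds(r_min, r_max, gamma):
    # Bound on the discounted return: sum over t of gamma^t * r = r / (1 - gamma).
    return r_min / (1.0 - gamma), r_max / (1.0 - gamma)


# Rewards in [-1, 1] as in the question; gamma = 0.9 is an assumed discount.
q_lo, q_hi = q_bounds(-1.0, 1.0, gamma=0.9)

# A sigmoid output stays strictly between 0 and 1, whatever its input is.
sig_lo, sig_hi = sigmoid(-20.0), sigmoid(20.0)

# True when some reachable Q-target lies outside what a sigmoid output can produce.
target_out_of_reach = q_lo < 0.0 or q_hi > 1.0
```

Under these assumptions the Q-values can span roughly -10 to 10, while the sigmoid output never leaves the interval between 0 and 1, so the error on many targets can never be driven to zero.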

Q3. Well, Q-learning with neural networks is famous for not always converging. Have a look at DQN (DeepMind). It addresses two important issues. First, it decorrelates the training data by using memory replay (experience replay); stochastic gradient descent does not like training data that arrives in order. Second, it bootstraps using old weights, which reduces non-stationarity.
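
A rough sketch of those two ideas, not DeepMind's implementation. To stay self-contained it uses a linear Q-function as a stand-in for the neural network, and a made-up toy environment. The class and function names, buffer capacity, batch size, learning rate, discount, and sync interval are all illustrative assumptions.

```python
import copy
import math
import random
from collections import deque


class ReplayBuffer:
    """Stores past transitions and returns them in random order."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.items.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal order of the data.
        return self.rng.sample(list(self.items), min(batch_size, len(self.items)))


def q_values(weights, state):
    # Linear stand-in for the network: one weight row per action.
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def train_step(online, target, batch, gamma=0.9, lr=0.1):
    for state, action, reward, next_state, done in batch:
        # Bootstrap from the frozen (old) weights, not the ones being updated.
        bootstrap = 0.0 if done else gamma * max(q_values(target, next_state))
        error = reward + bootstrap - q_values(online, state)[action]
        # Semi-gradient update for a linear model: w += lr * error * x.
        online[action] = [w + lr * error * x for w, x in zip(online[action], state)]


online = [[0.0, 0.0], [0.0, 0.0]]  # 2 actions x 2 features
target = copy.deepcopy(online)     # old weights used for bootstrapping
buffer = ReplayBuffer(capacity=1000)

env_rng = random.Random(1)  # toy environment: action 1 pays +1, action 0 pays -1
for step in range(200):
    state = [env_rng.random(), 1.0]
    action = env_rng.randrange(2)
    reward = 1.0 if action == 1 else -1.0
    next_state = [env_rng.random(), 1.0]
    done = env_rng.random() < 0.2
    buffer.add(state, action, reward, next_state, done)
    train_step(online, target, buffer.sample(16))
    if step % 50 == 0:
        target = copy.deepcopy(online)  # refresh the old weights only occasionally
```

Because `target` changes only every 50 steps, the regression targets stay fixed between syncs instead of shifting after every weight update.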
