I'm trying to find optimal policy in environment with continuous states (dim. = 20) and discrete actions (3 possible actions). And there is a specific moment: for optimal policy one action (call it "action 0") should be chosen much more frequently than other two (~100 times more often; this two action more risky).
I've tried Q-learning with NN value-function approximation. Results were rather bad: NN learns always choose "action 0". I think that policy gradient methods (on NN weights) may help, but don't understand how to use them on discrete actions.
Could you give some advise what to try? (maybe algorithms, papers to read). What are the state-of-the-art RL algorithms when state space is continuous and action space is discrete?
Thanks.
Applying Q-learning in continuous (states and/or actions) spaces is not a trivial task. This is especially true when trying to combine Q-learning with a global function approximator such as a NN (I understand that you refer to the common multilayer perceptron and the backpropagation algorithm). You can read more in the Rich Sutton's page. A better (or at least more easy) solution is to use local approximators such as for example Radial Basis Function networks (there is a good explanation of why in Section 4.1 of this paper).
On the other hand, the dimensionality of your state space maybe is too high to use local approximators. Thus, my recommendation is to use other algorithms instead of Q-learning. A very competitive algorithm for continuous states and discrete actions is Fitted Q Iteration, which usually is combined with tree methods to approximate the Q-function.
Finally, a common practice when the number of actions is low, as in your case, it is to use an independent approximator for each action, i.e., instead of a unique approximator that takes as input the state-action pair and return a Q value, using three approximators, one per action, that take as input only the state. You can find an example of this in Example 3.1 of the book Reinforcement Learning and Dynamic Programming Using Function Approximators