I am trying to solve a control problem with DDPG. The problem is simple enough that I can do value function iteration on its discretized version, so I have the "perfect" solution to compare my results with. But I want to solve the problem with DDPG, in the hope of applying RL to harder versions of it.
Some details about the problem:
- The control space is [0,1], the state space has dimension 2
- This is a stochastic environment, transitions between states are not deterministic
- There is a non-constant reward in every period, so sparse rewards should not be a problem
- Value function iteration takes just 10 minutes or so, again, it's quite a simple control problem
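For concreteness, the environment has an interface like the stub below (the transition and reward here are placeholders; the real dynamics are problem-specific and not shown):

```python
import numpy as np

class ControlEnv:
    """Hypothetical stand-in: 2-dimensional state, scalar action in [0, 1],
    stochastic transitions, and a non-constant per-period reward."""

    def reset(self):
        self.state = np.zeros(2)
        return self.state

    def step(self, action):
        # Placeholder stochastic transition and reward; the real ones differ.
        noise = 0.1 * np.random.randn(2)
        self.state = self.state + np.array([action, -action], dtype=float) + noise
        reward = float(action - self.state @ self.state)
        return self.state, reward
```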
What is the issue:
My agent always eventually converges to a degenerate policy, with the action being always either 1 or 0. At some point during training it can get somewhat close to the right policy, but it never gets really close.
Similarly, I usually fail to get the shape of the Q-function right.
What I tried to do:
- I put a sigmoid layer at the end of the actor network (naturally, as the control space is [0,1])
- I have target networks both for policy and critic
- I clip my actions so they are within [0.01, 0.99] (the perfect solution always lies within these boundaries)
- I tried adding an artificial penalty to the reward for actions close to 0 and 1. The algorithm then converged to something else, but again not to anything good.
- I tried both random uniform exploration and adding small normal noise. I either decrease the exploration rate over time or hold it constant but small
- To check my code, I ran the following experiment. I first fix the policy to be the "perfect" one and update only the critic. In this case, I manage to learn the Q-network pretty well (the shape too). Then I freeze the critic and update only the actor with the DDPG update rule, and I get pretty close to the perfect policy. But when I start updating the actor and critic simultaneously, they again diverge to something degenerate (a simplified sketch of this coupled update is shown after the parameter list below).
- I experimented a lot with my hyperparameters; currently they are the following:
> Optimizer: Adam. Learning rates: 0.001 for actor, 0.01 for critic. Batch size = 50,
> memory size = 10,000. Standard deviation of normal exploration noise = 0.02.
> Weight for soft updates of target networks: 0.01.
> Sizes of hidden layers for actor and critic: [8,16,16,8].
> Length of simulation: from 1,000 to 1,000,000.
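For concreteness, here is a simplified PyTorch sketch of how I wire these pieces together (the sigmoid actor head, the clipped noisy actions, the coupled actor/critic updates and the soft target updates), using the hyperparameters above. The replay buffer and environment are omitted, and the discount factor is an assumption since I did not list it:

```python
import torch
import torch.nn as nn

# Dimensions and hyperparameters as listed above.
STATE_DIM, ACTION_DIM = 2, 1
HIDDEN = [8, 16, 16, 8]
ACTOR_LR, CRITIC_LR = 1e-3, 1e-2
TAU = 0.01          # weight for soft updates of the target networks
NOISE_STD = 0.02    # std of the normal exploration noise
GAMMA = 0.99        # discount factor (assumed; not listed above)

def mlp(sizes, out_activation=None):
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    if out_activation is not None:
        layers.append(out_activation)
    return nn.Sequential(*layers)

# The actor ends with a sigmoid so that actions land in (0, 1).
actor = mlp([STATE_DIM] + HIDDEN + [ACTION_DIM], nn.Sigmoid())
critic = mlp([STATE_DIM + ACTION_DIM] + HIDDEN + [1])
actor_target = mlp([STATE_DIM] + HIDDEN + [ACTION_DIM], nn.Sigmoid())
critic_target = mlp([STATE_DIM + ACTION_DIM] + HIDDEN + [1])
actor_target.load_state_dict(actor.state_dict())
critic_target.load_state_dict(critic.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=ACTOR_LR)
critic_opt = torch.optim.Adam(critic.parameters(), lr=CRITIC_LR)

def select_action(state):
    """Deterministic action plus small Gaussian noise, clipped to [0.01, 0.99]."""
    with torch.no_grad():
        action = actor(state) + NOISE_STD * torch.randn(ACTION_DIM)
    return action.clamp(0.01, 0.99)

def ddpg_update(state, action, reward, next_state):
    """One coupled critic/actor update on a sampled mini-batch."""
    # Critic: regress Q(s, a) onto the bootstrapped target.
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q = reward + GAMMA * critic_target(torch.cat([next_state, next_action], dim=-1))
    q = critic(torch.cat([state, action], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend Q(s, actor(s)) by minimizing its negative.
    actor_loss = -critic(torch.cat([state, actor(state)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft updates of the target networks.
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - TAU).add_(TAU * p.data)
```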
I would be very grateful for any advice, thanks!
I have faced this problem. Changing the model, the networks, the number of episodes, and the other parameters such as the learning rate to fit your environment may solve it, but in some environments DDPG still does not perform well. So I switched to A2C, which shows better results.
This is the DDPG model and the parameters with which I got better results:
Parameters
To select the action, use one of these methods:
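For example, two standard options are adding Gaussian noise to the deterministic actor output, or using an Ornstein-Uhlenbeck process as in the original DDPG paper. The sketch below is a generic illustration, not the exact code from my implementation:

```python
import numpy as np

def gaussian_action(mu, sigma=0.1, low=0.0, high=1.0):
    """Option 1: deterministic actor output plus Gaussian noise, clipped to the action range."""
    return np.clip(mu + sigma * np.random.randn(), low, high)

class OUNoise:
    """Option 2: Ornstein-Uhlenbeck noise, as used in the original DDPG paper."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(size, mu, dtype=float)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

# Usage: action = np.clip(actor_output + ou_noise.sample(), 0.0, 1.0)
```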
You can find different papers that discuss this problem. For example, the papers [1-5] below point out some shortcomings of DDPG and show why the algorithm can fail to converge:
1. Cai, Q., Filos-Ratsikas, A., Tang, P., & Zhang, Y. (2018, April). Reinforcement Mechanism Design for e-commerce. In Proceedings of the 2018 World Wide Web Conference (pp. 1339-1348).
2. Iqbal, S., & Sha, F. (2019, May). Actor-attention-critic for multi-agent reinforcement learning. In International Conference on Machine Learning (pp. 2961-2970). PMLR.
3. Foerster, J., Nardelli, N., Farquhar, G., Afouras, T., Torr, P. H., Kohli, P., & Whiteson, S. (2017, August). Stabilising experience replay for deep multi-agent reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (pp. 1146-1155). JMLR.org.
4. Hou, Y., & Zhang, Y. (2019). Improving DDPG via Prioritized Experience Replay.
5. Wang, Y., & Zhang, Z. (2019, November). Experience Selection in Multi-agent Deep Reinforcement Learning. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 864-870). IEEE.