I'm new to reinforcement learning and Q-learning, and I'm trying to understand the concepts and implement them. Most of the material I have found uses CNN layers to process image input. I would rather start with something simpler than that, so I use a grid world.
Here is what I have implemented so far. I built an environment as an MDP with a 5x5 grid, a fixed agent position (A) and a fixed target position (T). A start state could look like this:
-----
---T-
-----
-----
A----
Currently I represent my state as a 1-dimensional vector of length 25 (5x5), with a 1 at the position where the agent is and 0 everywhere else, so the state above is represented as the vector
[1, 0, 0, ..., 0]
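For reference, this one-hot encoding can be sketched in a few lines of NumPy. The row-major indexing here (row * size + col, counting from the top-left) is an assumption; your own grid may index positions differently:

```python
import numpy as np

def encode_state(agent_pos, size=5):
    """One-hot encode the agent's (row, col) position on a size x size grid.

    Assumes row-major indexing from the top-left corner.
    """
    state = np.zeros(size * size, dtype=np.float32)
    state[agent_pos[0] * size + agent_pos[1]] = 1.0
    return state

# Agent in the bottom-left corner (row 4, col 0), as in the grid above:
# a length-25 vector with a single 1 at index 4*5 + 0 = 20.
s = encode_state((4, 0))
```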
I have successfully implemented solutions with a Q-table and with a simple NN with no hidden layers.
Now I want to go a little further and make the task harder by randomizing the target position each episode. Since there is now no correlation between my current state representation and the actions, my agent acts randomly. To solve this, I first need to adjust my state representation to contain some information such as the distance to the target, the direction, or both. The problem is that I don't know how to represent my state now. I have come up with some ideas:
- [x, y, distance_T]
- [distance_T]
- two 5x5 vectors, one for the agent's position, one for the target's position:
  [1, 0, 0, ..., 0], [0, 0, ..., 1, 0, ..., 0]
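The first and last of these ideas could be sketched as follows. The Manhattan distance and the row-major one-hot layout are my assumptions; any other distance metric or indexing would work the same way:

```python
import numpy as np

def distance_state(agent, target):
    """Idea 1: [x, y, distance_T] with Manhattan distance (an assumption)."""
    dist = abs(agent[0] - target[0]) + abs(agent[1] - target[1])
    return np.array([agent[0], agent[1], dist], dtype=np.float32)

def two_grid_state(agent, target, size=5):
    """Idea 3: two one-hot 5x5 grids, concatenated into one length-50 vector."""
    a = np.zeros(size * size, dtype=np.float32)
    t = np.zeros(size * size, dtype=np.float32)
    a[agent[0] * size + agent[1]] = 1.0
    t[target[0] * size + target[1]] = 1.0
    return np.concatenate([a, t])
```

Concatenating the two grids keeps the input a flat vector, so it plugs into the same no-hidden-layer network; they could equally be stacked as two channels for a CNN later.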
I know that even if I figure out the state representation, my current model will not be able to solve the problem and I will need to move on to hidden layers, experience replay, a frozen target network and so on, but for now I only want to verify that the model fails.
In conclusion, I want to ask how to represent such a state as an input for a neural network. If there are any sources of information, articles, papers etc. that I have missed, feel free to post them.
Thank you in advance.
In Reinforcement Learning there is no single right state representation, but there are wrong state representations. That is to say, Q-learning and other RL techniques make certain assumptions about the state representation.
It is assumed that the states are states of a Markov Decision Process (MDP). In an MDP, everything you need to know to 'predict' what happens next (even in a probabilistic sense) is available in the current state. In other words, the agent must not need memory of past states to make a decision.
It is very rarely the case in real life that you have a true Markov decision process. But you often have something close, which has been empirically shown to be enough for RL algorithms.
As a "state designer" you want to create a state that makes your task as close as possible to an MDP. In your specific case, if your state is only the distance, there is very little information with which to predict the next state, i.e. the next distance. Something like the current distance, the previous distance and the previous action is a better state, as it gives you a sense of direction. You could also make your state the distance plus the direction in which the target lies.
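A distance-plus-direction state could look like the sketch below. Using the sign of the row/column offsets as the direction signal is one possible choice, not the only one:

```python
import numpy as np

def distance_direction_state(agent, target):
    """State = [Manhattan distance, sign of row offset, sign of col offset].

    np.sign gives -1/0/+1, a coarse 'which way is the target' signal.
    """
    dy = target[0] - agent[0]
    dx = target[1] - agent[1]
    dist = abs(dy) + abs(dx)
    return np.array([dist, np.sign(dy), np.sign(dx)], dtype=np.float32)
```

For the example grid (agent at the bottom-left, target at row 1, col 3) this yields distance 6, direction up (-1) and right (+1).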
Your last suggestion of two matrices is the one I like most, because it describes the whole state of the task without giving away the actual goal of the task. It also maps well to convolutional networks.
The distance approach will probably converge faster, but I consider it a bit like cheating because you practically tell the agent what it needs to look for. In more complicated cases this will rarely be possible.