I have a modeling question. Apologies in advance: I am new to reinforcement learning.
Suppose we have a game in the style of Pac-Man. The agent can see three circles (left-front, center-front, and right-front) and must eat the dots it encounters; skipping a dot incurs a larger penalty. Dots appear randomly and have different weights, either positive or negative. I want to find the optimal score (the sum of the weights of the dots eaten) and/or the optimal length of a chain of dots over which the agent would still score positive.
I want to train a Q-learning model for this (though I doubt it is the correct approach). Next I plan to try a policy-based method, because the value-based model gave me a rather linear solution in a stochastic state space (there is only one decision per state that the agent can change).
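For reference, this is the kind of tabular Q-learning update I have in mind. It is only a minimal sketch: the table shape (6 observations by 3 actions), the function name `q_update`, and the values of `alpha` and `gamma` are placeholders I chose for illustration, not tuned settings.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step.

    Q      : 2-D array indexed as Q[state, action]
    s, a   : current state index and chosen action index
    r      : reward received for (s, a)
    s_next : index of the state observed afterwards
    done   : True if the episode ended on this step
    alpha, gamma : learning rate and discount (placeholder values)
    """
    # Bootstrapped target: the reward alone at a terminal step,
    # otherwise reward plus the discounted best value of the next state.
    target = r if done else r + gamma * np.max(Q[s_next])
    # Move the current estimate a fraction alpha toward the target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical sizes: 6 observations (3 positions x 2 signs), 3 actions.
Q = np.zeros((6, 3))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
```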
I don't know whether this problem is solvable even in theory.
The dots appear on the fly in a random circle next to the agent. Say the "next states"
[+/-1,0,0], [0,+/-1,0], [0,0,+/-1] are all equally likely. I have trouble posing the question the right way and deciding on a terminal state.
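To make the setup concrete, here is a minimal sketch of the environment as I currently picture it. Several things in it are assumptions I made up for illustration: the skip penalty of -2, the fixed episode length `HORIZON` used as a stand-in for a terminal state, and all function names. The observation (position, weight) is just a compact encoding of the vectors above.

```python
import random

SKIP_PENALTY = -2.0  # assumption: skipping costs more than eating a -1 dot
HORIZON = 20         # assumption: episode ends after a fixed number of dots

def sample_state(rng):
    """Draw one of the six equally likely observations.

    Returns (position, weight): position is 0 (left), 1 (center) or
    2 (right); weight is +1 or -1. For example, [0, -1, 0] becomes (1, -1).
    """
    return rng.randrange(3), rng.choice((-1, 1))

def step(state, action, t, rng):
    """Apply an action (0, 1 or 2 = the circle the agent moves into)."""
    position, weight = state
    # Eating the dot yields its weight; moving elsewhere counts as a skip.
    reward = float(weight) if action == position else SKIP_PENALTY
    # The next dot is drawn independently of the action taken.
    next_state = sample_state(rng)
    done = (t + 1) >= HORIZON
    return next_state, reward, done

def run_episode(policy, seed=0):
    """Play one episode with the given policy and return the total score."""
    rng = random.Random(seed)
    state = sample_state(rng)
    total = 0.0
    for t in range(HORIZON):
        state, reward, done = step(state, policy(state), t, rng)
        total += reward
        if done:
            break
    return total

def always_eat(state):
    """Baseline policy: always move into the circle that holds the dot."""
    return state[0]

score = run_episode(always_eat, seed=1)
```

Note that in this sketch the next dot does not depend on the action, so the only thing the agent controls at each step is whether it eats or skips the current dot.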
Can you guide me?