I am trying to implement Q-learning. The general algorithm from here is as below
In the statement
I don't understand whether I should implement the above statement of the original pseudo-code recursively, for all next states that the current state/action can lead to, taking the max every time,
or just take the maximum value for the next state (reached with the current action) from the state-action Q-value table?
Thanks in advance.
All the formula says is that on step t+1 you update the state-action value by using the state-action value from step t and the maximum of the values over all the actions for the current state.
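To make that concrete, here is a minimal sketch of a single tabular update in Python. It assumes the usual form of the rule, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a)), which may differ in detail from the pseudo-code in the question. The names `q_update`, `alpha`, `gamma` and the toy states are hypothetical, chosen only for illustration. The point is that the max is one lookup over the actions of the state you arrived in, not a recursive search through future states.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q[(state, action)].

    alpha (learning rate) and gamma (discount factor) are assumed defaults.
    """
    # Single table lookup: best value currently stored for the state we arrived in.
    best_next = max(Q[(next_state, a)] for a in actions)
    # Move the old estimate towards reward + discounted best next value.
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q[(state, action)]


# Toy example: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0  # pretend this was learned earlier

new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(new_value)
```

Because the table already stores an estimate for every next-state action, the future is accounted for through those stored values, and repeating this update over many steps is what propagates rewards backwards.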