I was trying to understand the proof why policy improvement theorem can be applied on epsilon-greedy policy.
The proof starts with the mathematical definition -
I am confused on the very first line of the proof.
This equation is the Bellman expectation equation for Q(s,a), while V(s) and Q(s,a) follow the relation -
So how can we ever derive the first line of the proof?
The optimal control problem was first introduced in the 1950s. The problem was to design a controller to maximize or minimize an objective function. Richard Bellman approached this optimal control problem by introducing the Bellman Equation:
Where the value is equivalent to the discounted sum of the rewards. If we take the first time step out, we get the following:
Subsequently, classic reinforcement learning is based on the Markov Decision Process, and assumes all state transitions are known. Thus the equation becomes the following:
That is, the summation is equivalent to the summation of all possible transitions from the that state, multiplied by the reward for achieving the new state.
The above equations are written in the value form. Sometimes, we want the value to also be a function of the action, thus creating the action-value. The conversion of the above equation to the action value form is:
The biggest issue with this equation is that in real life, the transitional probabilities are in fact not known. It is impossible to know the transitional probabilities of every single state unless the problem is extremely simple. To solve this problem, we usually just take the max of the future discounted portion. That is, we assume we behave optimally in the future, rather than taking the average of all possible scenarios.
However, the environment can be heavily stochastic in a real scenario. Therefore, the best estimate of the action-value function in any state is simply an estimate. And the post probabilistic case is the expected value. Thus, giving you:
The reward notation is t+1 in your equation. This is mainly because of different interpretations. The proof above still holds for your notation. It is simply saying you won't know your reward until you get to your next sampling time.