Confusion in understanding Q(s,a) formula for Reinforcement Learning MDP?


I was trying to understand the proof of why the policy improvement theorem can be applied to an epsilon-greedy policy.

The proof starts with the mathematical definition below.

[Image: the epsilon-greedy policy improvement proof from Sutton and Barto, Reinforcement Learning: An Introduction; its first line is $q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a)$]

I am confused by the very first line of the proof. Here is what I already know:

$$q_\pi(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\ A_t = a \right]$$

This equation is the Bellman expectation equation for Q(s,a), while V(s) and Q(s,a) are related by:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

So how can we ever derive the first line of the proof?

1 Answer

Answer from Rui Nian:

The optimal control problem was first introduced in the 1950s. The problem was to design a controller to maximize or minimize an objective function. Richard Bellman approached this optimal control problem by introducing the Bellman Equation:

$$V(s_t) = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$

Here the value is the discounted sum of the rewards. If we take the first time step out, we get the following:

$$V(s_t) = r_t + \gamma \left( r_{t+1} + \gamma\, r_{t+2} + \cdots \right) = r_t + \gamma\, V(s_{t+1})$$
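
As a quick sanity check of that decomposition, here is a small Python sketch; the reward sequence and discount factor are made up purely for illustration:

```python
# Numeric check of the recursion above; the rewards and discount are made up.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]            # r_t, r_{t+1}, r_{t+2}, r_{t+3}

# Value as the full discounted sum: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
v_full = sum(gamma ** k * r for k, r in enumerate(rewards))

# Same value with the first time step taken out: r_t + gamma * (sum from t+1 on)
v_tail = sum(gamma ** k * r for k, r in enumerate(rewards[1:]))
v_recursive = rewards[0] + gamma * v_tail

print(v_full, v_recursive)                # both are approximately 6.265
```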

Classic reinforcement learning is based on the Markov Decision Process and assumes that all state transition probabilities are known. The equation then becomes the following:

$$V(s) = \sum_{s'} P(s' \mid s, a)\left[ r(s, a, s') + \gamma\, V(s') \right]$$

That is, the value is a sum over all possible transitions out of that state, with each term weighted by its transition probability and consisting of the reward for reaching the new state plus the discounted value of that new state.
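
To make that backup concrete, here is a minimal Python sketch; the two-state MDP, its transition probabilities, and its rewards are all invented for illustration:

```python
# Hypothetical two-state MDP with a single action, used only to illustrate the
# backup V(s) = sum over s' of P(s'|s) * [r(s, s') + gamma * V(s')].
gamma = 0.9
P = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.5, "B": 0.5}}   # P[s][s2]: transition probability
R = {"A": {"A": 0.0, "B": 1.0}, "B": {"A": 2.0, "B": 0.0}}   # R[s][s2]: reward on arrival

V = {"A": 0.0, "B": 0.0}
for _ in range(200):   # repeat the backup until the values stop changing
    V = {s: sum(P[s][s2] * (R[s][s2] + gamma * V[s2]) for s2 in P[s]) for s in P}

print(V)               # the fixed point of the backup above
```

Iterating the backup like this converges to the fixed point of the equation above.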

The above equations are written in the value form. Sometimes we want the value to also be a function of the action, giving the action-value function. Converting the above equation to the action-value form gives:

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\left[ r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q(s', a') \right]$$
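
Continuing the invented example, here is a sketch of that action-value backup under a fixed policy (a two-state, two-action MDP, with all numbers made up); note that V(s) then falls out as the policy-weighted average of Q(s, a), which is exactly the relation quoted in the question:

```python
# Hypothetical two-state, two-action MDP; every number here is made up.
gamma = 0.9
states, actions = ["A", "B"], ["x", "y"]

# P[(s, a)][s2]: transition probability, R[(s, a)][s2]: reward on arrival
P = {("A", "x"): {"A": 0.9, "B": 0.1}, ("A", "y"): {"A": 0.1, "B": 0.9},
     ("B", "x"): {"A": 0.5, "B": 0.5}, ("B", "y"): {"A": 0.0, "B": 1.0}}
R = {("A", "x"): {"A": 0.0, "B": 1.0}, ("A", "y"): {"A": 0.0, "B": 1.0},
     ("B", "x"): {"A": 2.0, "B": 0.0}, ("B", "y"): {"A": 2.0, "B": 0.0}}
pi = {s: {"x": 0.5, "y": 0.5} for s in states}   # a fixed (uniform) policy

# Iterate the action-value backup until it settles at its fixed point.
Q = {(s, a): 0.0 for s in states for a in actions}
for _ in range(200):
    Q = {(s, a): sum(P[(s, a)][s2] * (R[(s, a)][s2]
                     + gamma * sum(pi[s2][a2] * Q[(s2, a2)] for a2 in actions))
                     for s2 in P[(s, a)])
         for s in states for a in actions}

# The relation from the question: V(s) is the policy-weighted average of Q(s, a).
V = {s: sum(pi[s][a] * Q[(s, a)] for a in actions) for s in states}
print(Q)
print(V)
```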

The biggest issue with this equation is that in real life, the transition probabilities are usually not known. It is impossible to know the transition probabilities of every single state unless the problem is extremely simple. To get around this, we usually just take the max over the future discounted portion. That is, we assume we behave optimally in the future, rather than averaging over all possible scenarios.

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$
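
The difference between averaging over what the policy might do next and assuming optimal behaviour from the next step onward is easy to see numerically; the reward, next-state action values, and policy probabilities below are made up:

```python
# Made-up numbers comparing the two backups for a single state-action pair.
gamma = 0.9
r = 1.0                                   # immediate reward
q_next = {"x": 2.0, "y": 5.0}             # Q(s', a') for each possible next action
pi_next = {"x": 0.5, "y": 0.5}            # a policy's action probabilities at s'

expected_backup = r + gamma * sum(pi_next[a] * q_next[a] for a in q_next)
greedy_backup = r + gamma * max(q_next.values())

print(expected_backup)                    # about 4.15: average over the policy
print(greedy_backup)                      # about 5.5: assume the best next action
```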

However, in a real scenario the environment can be heavily stochastic. Therefore, the best we can do for the action-value in any state is an estimate, and the natural probabilistic estimate is the expected value. This gives you:

$$Q(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \right]$$
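
Without a transition model, that expectation is typically estimated from sampled transitions, in the spirit of tabular Q-learning. The step() function and every number in this sketch are invented for illustration:

```python
import random

# Estimate the expected target by averaging sampled targets.
gamma = 0.9

def step(state, action):
    """Hypothetical stochastic environment: returns (reward, next_state)."""
    return (1.0, "s1") if random.random() < 0.7 else (0.0, "s2")

# Fixed, made-up action values for the two possible next states.
Q_next = {("s1", "a0"): 2.0, ("s1", "a1"): 3.0,
          ("s2", "a0"): 0.5, ("s2", "a1"): 1.0}

q_estimate, n = 0.0, 0
for _ in range(20000):
    reward, s_next = step("s0", "a0")
    # Sampled target: reward plus discounted best action value at the next state.
    target = reward + gamma * max(Q_next[(s_next, a)] for a in ("a0", "a1"))
    # Running average of the sampled targets estimates the expected value.
    n += 1
    q_estimate += (target - q_estimate) / n

print(q_estimate)   # close to 0.7*(1 + 0.9*3) + 0.3*(0 + 0.9*1) = 2.86
```

The running average converges to the expected target, which is exactly the expectation written above.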

The reward notation is t+1 in your equation; this is mainly a difference in interpretation. The derivation above still holds for your notation: it is simply saying that you will not know your reward until you reach the next sampling time.