How is Q learning different from value iteration in reinforcement learning? I know Q learning is model-free and training samples are transitions (s,a,s',r). But since we know the transitions and the reward for every transition in Q learning, is it not the same as model-based learning where we know the reward for a state and action pair, and the transitions for every action from a state (be it stochastic or deterministic)? I do not understand the difference.

3 Answers

seaotternerd On Best Solutions

You are 100% right that if we knew the transition probabilities and reward for every transition in Q-learning, it would be pretty unclear why we would use it instead of model-based learning or how it would even be fundamentally different. After all, transition probabilities and rewards are the two components of the model used in value iteration - if you have them, you have a model.

The key is that, in Q-learning, the agent does not know state transition probabilities or rewards. The agent only discovers that there is a reward for going from one state to another via a given action when it does so and receives a reward. Similarly, it only figures out what transitions are available from a given state by ending up in that state and looking at its options. If state transitions are stochastic, it learns the probability of transitioning between states by observing how frequently different transitions occur.

A possible source of confusion here is that you, as the programmer, might know exactly how rewards and state transitions are set up. In fact, when you're first designing a system, odds are that you do as this is pretty important to debugging and verifying that your approach works. But you never tell the agent any of this - instead you force it to learn on its own through trial and error. This is important if you want to create an agent that is capable of entering a new situation that you don't have any prior knowledge about and figuring out what to do. Alternately, if you don't care about the agent's ability to learn on its own, Q-learning might also be necessary if the state-space is too large to repeatedly enumerate. Having the agent explore without any starting knowledge can be more computationally tractable.

Taru On

Value iteration is used when you have transition probabilities, that means when you know the probability of getting from state x into state x' with action a. In contrast, you might have a black box which allows you to simulate it, but you're not actually given the probability. So you are model-free. This is when you apply Q learning.

Also what is learned is different. With value iteration, you learn the expected cost when you are given a state x. With q-learning, you get the expected discounted cost when you are in state x and apply action a.

Here are the algorithms:

I'm currently writing down quite a bit about reinforcement learning for an exam. You might also be interested in my lecture notes. However, they are mostly in German.

ZhaoqunZhong On

I don't think the accepted answer catched the essence of the difference. To quote the newest version of Richard Sutton's book:

" Having q∗ makes choosing optimal actions even easier. With q∗, the agent does not even have to do a one-step-ahead search: for any state s, it can simply find any actionthat maximizes q∗(s; a). The action-value function effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return as a value that is locally and immediately available for each state{action pair. Hence, at the cost of representing a function of state{action pairs, instead of just of states, the optimal action value function allows optimal actions to be selected without having to know anything about possible successor states and their values, that is, without having to know anything about the environment’s dynamics. "

Usually in real problems, the agent doesn't know the world(or the so called transformation) dynamics, but we definitely know the rewards, because those are what the environment gives back during the interaction, and the reward function is actually defined by us.

The real difference between q-learning and normal value iteration is that: After you have V*, you still need to do one step action lookahead to subsequent states to identify the optimal action for that state. And this lookahead requires the transition dynamic after the action. But if you have q*, the optimal plan is just choosing a from the max q(s,a) pair.