In a perfect-information environment, where we can know the resulting state after an action (like playing chess), is there any reason to use Q-learning rather than TD (temporal difference) learning?
As far as I understand, TD learning tries to learn the state value V(s), while Q-learning learns the state-action value Q(s, a). Does that mean Q-learning learns more slowly, since there are more state-action pairs than states alone?
Q-Learning is a TD (temporal difference) learning method.
I think what you mean to compare is TD(0) (learning V) versus Q-learning (learning Q).
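To make the distinction concrete, the standard tabular update rules are (with $\alpha$ the learning rate and $\gamma$ the discount factor):

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$$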
I would say it depends on whether your actions are deterministic. Even if you have the transition function, acting greedily under TD(0) can be expensive: at every step you must compute the expected value of each available action by summing over its possible next states. In Q-learning, that one-step lookahead is already summarized in the Q-value, so action selection is just an argmax over the stored values.
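Here is a minimal sketch of that difference in action selection, assuming a small tabular setting. The names `transition_probs`, `reward`, and the state/action encodings are hypothetical, purely to illustrate the extra expectation TD(0) needs at decision time:

```python
import numpy as np

# Assumed tabular setup (illustrative only):
#   transition_probs[s][a] is a dict {next_state: probability}
#   reward(s, a, s_next) returns the immediate reward
#   V is a 1-D array indexed by state, Q is a 2-D array indexed by [state, action]

def greedy_action_from_V(s, V, actions, transition_probs, reward, gamma=0.99):
    """Acting greedily with only V(s) needs a one-step lookahead through the
    model: an expectation over next states for every candidate action."""
    best_a, best_value = None, -np.inf
    for a in actions:
        expected = sum(
            p * (reward(s, a, s_next) + gamma * V[s_next])
            for s_next, p in transition_probs[s][a].items()
        )
        if expected > best_value:
            best_a, best_value = a, expected
    return best_a

def greedy_action_from_Q(s, Q):
    """With Q(s, a) the lookahead is already baked into the stored values:
    just take the argmax over the actions for state s."""
    return int(np.argmax(Q[s]))
```

If the environment is deterministic, each inner sum collapses to a single term, so the cost difference shrinks; with stochastic transitions (or many actions), the V-based selection does noticeably more work per step.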