Why does my implementation of TD(0) not work?


I am trying to implement TD(0), among other RL policy evaluation techniques.

I have also implemented the dynamic programming approach for the case where a model of the world is given, as well as first-visit (FV) and every-visit (EV) Monte Carlo for the model-free case. However, my TD(0) implementation does not converge to the right state values in its value_function. The other approaches work, so I don't think the problem is in my data structures. I also let TD(0) run on pre-sampled episodes, but no matter how many episodes I use and how simple the state and action spaces are, it does not yield the same results as MC. Here is my implementation:

def td_0(
    episodes: list[Episode],
    states: set[State],
    reward_function: RewardFunction,
    gamma: float = 0.9,
    alpha: float = 0.1,
):
    value_function = ValueFunction({s: 0 for s in states})

    # Episodes are lists of [state, action, reward, state, action, reward, ... , state]
    for epi in episodes:
        state_index = 0
        while state_index < len(epi):
            state = epi[state_index]
            if state_index == len(epi) - 1:  # terminal state: its value is just its reward
                value_function.set_value(state, reward_function.get_reward(state))
                break

            next_state = epi[state_index + 3]
            # compute new value
            new_value = value_function.get_value(state) + alpha * (
                reward_function.get_reward(state)
                + gamma * value_function.get_value(next_state)
                - value_function.get_value(state)
            )

            value_function.set_value(state, new_value)

            # advance to the next state entry in the flat episode list
            state_index += 3

    return value_function
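
For reference, the update rule I am trying to implement is the standard tabular TD(0) update, written with my reward convention, where the reward depends only on the state being left:

V(s) <- V(s) + alpha * (r(s) + gamma * V(s') - V(s))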

Or is it simply a question of how to set the learning rate? I keep alpha fixed at 0.1 for every update, and I wonder whether TD(0) needs a decaying step size to settle down; a sketch of what I mean is below.
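
To be clear, this is only a sketch, not code from my project: it is the same loop as td_0 above (reusing the same Episode, State, RewardFunction and ValueFunction classes), but with a per-state step size alpha = 1 / N(s) instead of the constant 0.1. The visit counter and the 1 / N(s) schedule are only my assumption of how such a decay is usually done.

from collections import defaultdict

def td_0_decaying_alpha(
    episodes: list[Episode],
    states: set[State],
    reward_function: RewardFunction,
    gamma: float = 0.9,
):
    value_function = ValueFunction({s: 0 for s in states})
    visits = defaultdict(int)  # N(s): number of updates applied to each state

    # Episodes are lists of [state, action, reward, state, action, reward, ..., state]
    for epi in episodes:
        state_index = 0
        while state_index < len(epi):
            state = epi[state_index]
            if state_index == len(epi) - 1:  # terminal state: its value is just its reward
                value_function.set_value(state, reward_function.get_reward(state))
                break

            next_state = epi[state_index + 3]
            visits[state] += 1
            alpha = 1 / visits[state]  # step size shrinks with every update of this state

            td_target = (
                reward_function.get_reward(state)
                + gamma * value_function.get_value(next_state)
            )
            new_value = value_function.get_value(state) + alpha * (
                td_target - value_function.get_value(state)
            )
            value_function.set_value(state, new_value)

            state_index += 3  # advance to the next state entry in the flat episode list

    return value_function

For example, I use the following model to sample the episodes: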

# define some states
s1 = State(1)
s2 = State(2)
s3 = State(3)
states = {s1, s2, s3}

# define some actions
a1 = Action(1)
a2 = Action(2)
actions = {a1, a2}

# dynamics model syntax: (current_state, action, next_state, probability)
dynamics_model = Dynamics(
    [
        (s1, a1, s2, 1),
        (s1, a2, s3, 1),
    ]
)

# policy syntax: (state, action, probability)
pi = Policy(
    [
        (s1, a2, 0.5),
        (s1, a1, 0.5),
    ]
)

# reward function syntax: {state: reward}
reward_function = RewardFunction({s1: 1, s2: -1, s3: 1})

It is a very simple example with terminal states s2 and s3, and it is clear that the value function should converge to V(s1) = 1, V(s2) = -1, V(s3) = 1, which my implementations of MC and DP do. So what is wrong with my TD(0)?
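
Just to spell out where the 1 for s1 comes from, here is the one-step Bellman backup under pi as plain arithmetic (nothing here uses my classes):

# expected one-step target for s1: r(s1) + gamma * E[V(next state)]
gamma = 0.9
v_s2, v_s3 = -1.0, 1.0  # terminal values equal their rewards
v_s1 = 1 + gamma * (0.5 * v_s2 + 0.5 * v_s3)
print(v_s1)  # 1.0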

Here is the output of two runs:

Nr. Episodes: 10000
Dynamic Programming
{State_1: 1.0, State_2: -1.0, State_3: 1.0}
First visit Monte Carlo
{State_1: 1.0134, State_2: -1.0, State_3: 1.0}
Every visit Monte Carlo
{State_1: 1.0134, State_2: -1.0, State_3: 1.0}
Temporal Difference (0)
{State_1: 1.5552992453591237, State_2: -1, State_3: 1}

Nr. Episodes: 10000
Dynamic Programming
{State_1: 1.0, State_2: -1.0, State_3: 1.0}
First visit Monte Carlo
{State_1: 1.0114, State_2: -1.0, State_3: 1.0}
Every visit Monte Carlo
{State_1: 1.0114, State_2: -1.0, State_3: 1.0}
Temporal Difference (0)
{State_1: 0.07098345716518928, State_2: -1, State_3: 1}