How to solve a GridWorld problem using dynamic programming with NumPy


I am trying to solve a GridWorld problem using NumPy. It consists of a 3x3 maze with terminal states at (3,2) and (3,3), carrying rewards of -1 and +1 respectively. Through dynamic programming, I want to determine the value of each state. The robot can move in the four cardinal directions; it takes the intended action with probability 0.8 and slips to its left or right with probability 0.1 each.
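As far as I understand it, the quantity I need for each state and action is the expectation over the intended move and the two perpendicular slips, and dynamic programming then takes the best action in every state (this is only my reading of the problem; a_left and a_right denote the two perpendicular moves, and s'_x is the cell reached by move x, staying in place when the move would leave the grid):

Q(s, a) = 0.8 · [r(s, a) + γ · V(s'_a)] + 0.1 · [r(s, a_left) + γ · V(s'_a_left)] + 0.1 · [r(s, a_right) + γ · V(s'_a_right)]

V(s) = max_a Q(s, a)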

I have some doubts because we are given three traces for the experiment (my own Python encoding of them follows right after):

Trace 1: <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>

Trace 2: <(1,1), 0> → E → <(1,2), 0> → E → <(2,2), 0> → N → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>

Trace 3: <(1,1), 0> → E → <(2,1), 0> → N → <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
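Just for my own bookkeeping, I encode each trace in Python as a list of (state, reward, action) steps, where the action is the one attempted from that state and None marks the terminal entry (this is only my own representation, not part of the assignment):

trace1 = [((1, 1), 0, "E"), ((1, 2), 0, "E"), ((1, 3), 0, "S"), ((2, 3), 0, "S"), ((3, 3), 1, None)]
# Traces 2 and 3 are encoded in exactly the same way.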

I'm not sure whether these traces are relevant for the code implementation. However, they have been useful for making a manual approximation of the state values with the Bellman equation.
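To make explicit what I compute by hand for each candidate action, here is a minimal sketch of the single-action expected backup under the 0.8 / 0.1 / 0.1 model (the step helper is hypothetical and just stands for "apply one deterministic move and return the next cell and its reward"):

def expectedBackup(V, state, action, step, gamma=0.9):
    # Perpendicular "slip" moves, same convention as leftAction / rightAction in my code below
    leftAction = [-action[1], action[0]]
    rightAction = [action[1], -action[0]]

    value = 0.0
    for move, prob in ((action, 0.8), (leftAction, 0.1), (rightAction, 0.1)):
        nextState, reward = step(state, move)  # hypothetical deterministic transition
        value += prob * (reward + gamma * V[nextState[0], nextState[1]])
    return value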

I have written the code below, but it produces the state values shown after it, which I think are not correct (the movement policy seems right, but the value numbers do not look okay to me). I cannot see what I am doing wrong.

import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

# Parameters defining the grid, final states, and possible actions in the environment:
dimension = 3  # Dimension of our 3x3 grid
finalStates = [[2, 1], [2, 2]]  # Terminal states (3,2) and (3,3) in 0-indexed (row, col) coordinates
actions = [[-1, 0], [1, 0], [0, 1], [0, -1]]  # Actions: move up, down, right, left

# Parameters for calculating the value function:
gamma = 0.9  # Discount factor
rewardValue = 0  # Reward for normal movement
numIterations = 1000  # Number of algorithm iterations

# Function that returns the next state and reward when taking an action from a state
def calculateStateReward(initialState, action):
    # If our initial state is a final state, there is no change or reward
    if initialState in finalStates:
        return initialState, 0

    # If it moves, apply the reward
    reward = rewardValue
    # Calculate the new state
    finalState = np.array(initialState) + np.array(action)
    
    if -1 in finalState or 3 in finalState:  # If we go out of the maze, stay in the initial state
        finalState = initialState
        
    # After moving, check whether the new state is terminal and return its reward
    if (finalState == np.array([2, 1])).all():    # Terminal state (3,2) -> reward -1
        return finalState, -1

    if (finalState == np.array([2, 2])).all():    # Terminal state (3,3) -> reward +1
        return finalState, +1

    return finalState, reward

# Initialize cell values
cellValues = np.zeros((dimension, dimension))
cellValues[2, 1] = -1  # Terminal state (3,2)
cellValues[2, 2] = 1  # Terminal state (3,3)

states = [[i, j] for i in range(dimension) for j in range(dimension)]  # Positions in the grid

deltas = []
# Execute the value iteration
for iteration in range(numIterations):
    
    copyCellValues = np.copy(cellValues)
    deltaState = []
    
    for state in states:
        
        if state in finalStates:  # Skip final states
            continue
        
        weightedRewards = 0
        
        # Calculate the expected value using the Bellman equation
        for action in actions:
            
            # Calculate the expected value for the desired action
            finalState, reward = calculateStateReward(state, action)
            
            if -1 in finalState or 3 in finalState:
                continue
            
            desiredValue = 0.8 * (reward + gamma * copyCellValues[finalState[0], finalState[1]])
            
            # Calculate the expected value for the action to the left of the agent
            leftAction = [-action[1], action[0]]
            leftState, leftReward = calculateStateReward(state, leftAction)
            
            if -1 in leftState or 3 in leftState:
                continue
            
            leftValue = 0.1 * (leftReward + gamma * copyCellValues[leftState[0], leftState[1]])
            
            # Calculate the expected value for the action to the right of the agent
            rightAction = [action[1], -action[0]]
            rightState, rightReward = calculateStateReward(state, rightAction)
            
            if -1 in rightState or 3 in rightState:
                continue
            
            rightValue = 0.1 * (rightReward + gamma * copyCellValues[rightState[0], rightState[1]])
            
            # Combine the values with probabilities
            weightedRewards += desiredValue + leftValue + rightValue
            
        deltaState.append(np.abs(copyCellValues[state[0], state[1]] - weightedRewards))
        copyCellValues[state[0], state[1]] = weightedRewards
        
    deltas.append(deltaState)
    cellValues = copyCellValues
    
    if iteration in [0, 1, 2, 9, 99, numIterations - 1]:
        print("Iteration {}".format(iteration + 1))
        print(cellValues)


Iteration 1
[[ 0.    0.    0.  ]
 [ 0.   -1.9   0.19]
 [-1.9  -1.    1.  ]]
Iteration 2
[[ 0.     -1.71   -1.368 ]
 [-3.42   -6.346  -4.8716]
 [-8.398  -1.      1.    ]]
Iteration 3
[[ -4.617    -12.6369   -18.22005 ]
 [-20.5029   -36.11026  -51.381719]
 [-35.46901   -1.         1.      ]]
Iteration 10
[[-3.48617826e+06 -1.03834636e+07 -2.09657485e+07]
 [-8.88495296e+06 -2.34290181e+07 -4.60427294e+07]
 [-1.09129728e+07 -1.00000000e+00  1.00000000e+00]]
Iteration 100
[[-2.53277516e+81 -7.59166758e+81 -1.53767710e+82]
 [-6.44869368e+81 -1.71037521e+82 -3.36998977e+82]
 [-7.89775648e+81 -1.00000000e+00  1.00000000e+00]]
Iteration 1000
[[-inf -inf -inf]
 [-inf -inf -inf]
 [-inf  -1.   1.]]
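For completeness, this is how I look at the per-sweep changes collected in deltas (reusing the numpy/matplotlib imports from the code above); instead of shrinking towards zero, the curves grow without bound, matching the diverging values:

deltaArray = np.array(deltas)   # shape: (numIterations, number of non-terminal states)
plt.plot(deltaArray)
plt.yscale("log")               # log scale, since the changes keep growing
plt.xlabel("Sweep")
plt.ylabel("|new value - old value| per state")
plt.title("Per-state value change per sweep")
plt.show()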


There are 0 answers