I am trying to solve a GridWorld problem using numpy. It consists of a 3x3 maze with terminal states at (3,2) and (3,3), whose rewards are -1 and +1 respectively. Through dynamic programming, I want to determine the value of each state. The robot can move in the four cardinal directions; it takes the intended action with probability 0.8 and slips to the left or right of that direction with probability 0.1 each.
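To make sure I am reading the stochastic dynamics correctly, here is a minimal sketch of the slip model as I understand it (the slip_outcomes helper and the (row, col) action encoding are just my own illustration, not something given in the exercise):

# The intended move happens with probability 0.8; the agent slips 90 degrees
# to the left or to the right of that move with probability 0.1 each.
def slip_outcomes(action):
    # action is a (d_row, d_col) pair, e.g. (0, 1) for East, with rows growing downwards
    left = (-action[1], action[0])
    right = (action[1], -action[0])
    return [(0.8, tuple(action)), (0.1, left), (0.1, right)]

print(slip_outcomes((0, 1)))  # East: [(0.8, (0, 1)), (0.1, (-1, 0)), (0.1, (1, 0))]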
I have some doubts because we are given three traces from the experiment:
Trace 1: <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
Trace 2: <(1,1), 0> → E → <(1,2), 0> → E → <(2,2), 0> → N → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
Trace 3: <(1,1), 0> → E → <(2,1), 0> → N → <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>
I'm not sure whether these traces are relevant for the code implementation, but they have been useful for making a manual approximation of the state values with the Bellman equation.
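For reference, this is the kind of one-step Bellman backup I was doing by hand (a minimal sketch with made-up placeholder values in V; I am assuming the +1 is collected on the step that enters (3,3), which is what the traces suggest):

gamma = 0.9
V = {(3, 3): 0.0, (2, 3): 0.0, (2, 2): 0.0}  # placeholder values from some earlier sweep

# Backup for state (2,3) taking action S: facing South, "left" is East (off the
# grid here, so the agent stays in (2,3)) and "right" is West, i.e. cell (2,2).
q_south = (0.8 * (+1 + gamma * V[(3, 3)])    # intended move: enter terminal (3,3), reward +1
           + 0.1 * (0 + gamma * V[(2, 3)])   # slip East: off the grid, stay in (2,3), reward 0
           + 0.1 * (0 + gamma * V[(2, 2)]))  # slip West: move to (2,2), reward 0
print(q_south)  # 0.8 with these all-zero placeholders

# The new value of (2,3) is then the maximum of such Q-values over the four actions.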
I have written the following code, but it produces the state values shown after the code, which I think are not correct (the movement policy it implies seems right, but the value numbers do not look okay to me). I cannot see what I am doing wrong.
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
# Parameters defining the grid, final states, and possible actions in the environment:
dimension = 3 # Dimension of our 3x3 grid
finalStates = [[2, 1], [2, 2]] # Final states (0-indexed versions of (3,2) and (3,3))
actions = [[-1, 0], [1, 0], [0, 1], [0, -1]] # Actions: move up, down, right, left
# Parameters for calculating the value function:
gamma = 0.9 # Discount factor
rewardValue = 0 # Reward for normal movement
numIterations = 1000 # Number of algorithm iterations
# Function that returns the next state and reward when taking an action from a state
def calculateStateReward(initialState, action):
    # If our initial state is a final state, there is no change or reward
    if initialState in finalStates:
        return initialState, 0
    # If it moves, apply the reward
    reward = rewardValue
    # Calculate the new state
    finalState = np.array(initialState) + np.array(action)
    if -1 in finalState or 3 in finalState:  # If we go out of the maze, stay in the initial state
        finalState = initialState
    # Assign rewards for final states: after moving, check whether the new state
    # is a final state and return the associated reward
    if (finalState == np.array([2, 1])).all():
        return finalState, -1
    if (finalState == np.array([2, 2])).all():
        return finalState, +1
    return finalState, reward
# Initialize cell values
cellValues = np.zeros((dimension, dimension))
cellValues[2, 1] = -1 # Terminal state (3,2)
cellValues[2, 2] = 1 # Terminal state (3,3)
states = [[i, j] for i in range(dimension) for j in range(dimension)] # Positions in the grid
deltas = []
# Execute the value iteration
for iteration in range(numIterations):
    copyCellValues = np.copy(cellValues)
    deltaState = []
    for state in states:
        if state in finalStates:  # Skip final states
            continue
        weightedRewards = 0
        # Calculate the expected value using the Bellman equation
        for action in actions:
            # Calculate the expected value for the desired action
            finalState, reward = calculateStateReward(state, action)
            if -1 in finalState or 3 in finalState:
                continue
            desiredValue = 0.8 * (reward + gamma * copyCellValues[finalState[0], finalState[1]])
            # Calculate the expected value for the action to the left of the agent
            leftAction = [-action[1], action[0]]
            leftState, leftReward = calculateStateReward(state, leftAction)
            if -1 in leftState or 3 in leftState:
                continue
            leftValue = 0.1 * (leftReward + gamma * copyCellValues[leftState[0], leftState[1]])
            # Calculate the expected value for the action to the right of the agent
            rightAction = [action[1], -action[0]]
            rightState, rightReward = calculateStateReward(state, rightAction)
            if -1 in rightState or 3 in rightState:
                continue
            rightValue = 0.1 * (rightReward + gamma * copyCellValues[rightState[0], rightState[1]])
            # Combine the values with probabilities
            weightedRewards += desiredValue + leftValue + rightValue
        deltaState.append(np.abs(copyCellValues[state[0], state[1]] - weightedRewards))
        copyCellValues[state[0], state[1]] = weightedRewards
    deltas.append(deltaState)
    cellValues = copyCellValues
    if iteration in [0, 1, 2, 9, 99, numIterations - 1]:
        print("Iteration {}".format(iteration + 1))
        print(cellValues)
This is the output it prints:

Iteration 1
[[ 0. 0. 0. ]
[ 0. -1.9 0.19]
[-1.9 -1. 1. ]]
Iteration 2
[[ 0. -1.71 -1.368 ]
[-3.42 -6.346 -4.8716]
[-8.398 -1. 1. ]]
Iteration 3
[[ -4.617 -12.6369 -18.22005 ]
[-20.5029 -36.11026 -51.381719]
[-35.46901 -1. 1. ]]
Iteration 10
[[-3.48617826e+06 -1.03834636e+07 -2.09657485e+07]
[-8.88495296e+06 -2.34290181e+07 -4.60427294e+07]
[-1.09129728e+07 -1.00000000e+00 1.00000000e+00]]
Iteration 100
[[-2.53277516e+81 -7.59166758e+81 -1.53767710e+82]
[-6.44869368e+81 -1.71037521e+82 -3.36998977e+82]
[-7.89775648e+81 -1.00000000e+00 1.00000000e+00]]
Iteration 1000
[[-inf -inf -inf]
[-inf -inf -inf]
[-inf -1. 1.]]
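In case it is relevant, this is how I intended to use the deltas list afterwards, just to watch convergence (a minimal plotting sketch, not part of the problem itself):

# Plot the largest per-sweep value change to see whether the sweeps converge.
deltaMax = [max(d) for d in deltas if d]
plt.plot(deltaMax)
plt.xlabel("Iteration")
plt.ylabel("Max change in state values")
plt.show()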