How can I use the given trials in a GridWorld problem solved with dynamic programming?


I am trying to solve a GridWorld problem with numpy. The environment is a 3x3 maze with terminal states at (3,2) and (3,3), which give rewards of -1 and +1 respectively. Using dynamic programming, I want to determine the value of each state. The robot can move in the four directions; it takes the intended action with probability 0.8 and slips to the cell to the left or to the right of the intended direction with probability 0.1 each.
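
In other words, what I want to compute for every non-terminal state s is the standard value-iteration (Bellman optimality) backup below; the notation left(a) and right(a) for the two actions perpendicular to a is mine:

$$V_{k+1}(s) = \max_{a} \Big[\, 0.8\,\big(R(s,a) + \gamma V_k(s_a)\big) + 0.1\,\big(R(s,\mathrm{left}(a)) + \gamma V_k(s_{\mathrm{left}(a)})\big) + 0.1\,\big(R(s,\mathrm{right}(a)) + \gamma V_k(s_{\mathrm{right}(a)})\big) \,\Big]$$

where s_a is the cell reached from s when the intended move succeeds (or s itself if the move would leave the grid), R is the reward received on that transition, and gamma = 0.9.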

I have some doubts because the exercise also provides three trials (traces) for the experiment:

Trace 1:
<(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>

Trace 2:
<(1,1), 0> → E → <(1,2), 0> → E → <(2,2), 0> → N → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>

Trace 3:
<(1,1), 0> → E → <(2,1), 0> → N → <(1,1), 0> → E → <(1,2), 0> → E → <(1,3), 0> → S → <(2,3), 0> → S → <(3,3), 1>

I'm not sure whether these trials are actually needed for the code implementation, but they have been useful for making a manual approximation of the state values with the Bellman equation.
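
As a reference for that manual calculation, what I did by hand amounts to accumulating the discounted rewards backwards along each trace. A minimal sketch of that computation (this is only the Monte-Carlo-style return observed along one trace, not the DP update itself, and discountedReturns is just a name I'm using here):

gamma = 0.9

# Trace 1 from above, written as (state, reward) pairs
trace1 = [((1, 1), 0), ((1, 2), 0), ((1, 3), 0), ((2, 3), 0), ((3, 3), 1)]

def discountedReturns(trace, gamma):
    """Discounted return observed from each state of a single trace."""
    returns = []
    g = 0.0
    for state, reward in reversed(trace):  # accumulate backwards from the terminal state
        g = reward + gamma * g
        returns.append((state, g))
    returns.reverse()
    return returns

print(discountedReturns(trace1, gamma))
# Along this trace the start state (1,1) sees a return of 0.9**4 * 1 = 0.6561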

I have created a function calculateStateReward(initialState, action). By drawing probability = random.uniform(0, 1), it decides whether the robot performs the desired action and moves to the adjacent cell, or instead slips to its right or left. The function also assigns the rewards of the terminal states, handles the boundaries of the board, and returns the resulting state and the reward. What I need to know is how to implement the dynamic-programming iteration algorithm using the provided trials.
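
For clarity, the calculateStateReward in the full code below is the deterministic variant used by the value-iteration sweep. The sampling behaviour described above would look roughly like this sketch (sampleStateReward is just an illustrative name, and it relies on the deterministic calculateStateReward defined below):

import random

# Sketch: sample one stochastic transition, assuming the deterministic
# calculateStateReward(state, action) from the code below is available.
def sampleStateReward(initialState, action):
    p = random.uniform(0, 1)
    if p < 0.8:
        chosen = action                    # intended move (probability 0.8)
    elif p < 0.9:
        chosen = [-action[1], action[0]]   # slip to the left of the intended move (0.1)
    else:
        chosen = [action[1], -action[0]]   # slip to the right of the intended move (0.1)
    return calculateStateReward(initialState, chosen)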

import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

# Parameters that define the grid, final states, and possible actions in the environment:
dimension = 3  # dimension of the 3x3 grid
finalStates = [[2, 1], [2, 2]]  # Terminal states (3,2) and (3,3) of the statement, as 0-indexed [row, col]
actions = [[-1, 0], [1, 0], [0, 1], [0, -1]]  # Actions: move up, down, right, left

# Parameters to calculate the value function:
gamma = 0.9  # Discount factor
rewardValue = 0  # Reward for normal movement
numIterations = 1000 # Iterations of the algorithm

# Function that returns the next state and reward when executing an action from a state
def calculateStateReward(initialState, action):
    # If the initial state is a final state, there is no change or reward
    if initialState in finalStates:
        return initialState, 0

    reward = rewardValue
    # Calculate the new state
    finalState = np.array(initialState) + np.array(action)
    if -1 in finalState or dimension in finalState:  # If the move would leave the grid, the robot stays in the initial state
        finalState = initialState

    # After moving, check whether the new state is terminal and return its reward
    if np.array_equal(finalState, [2, 1]):   # terminal (3,2)
        return list(finalState), -1

    if np.array_equal(finalState, [2, 2]):   # terminal (3,3)
        return list(finalState), +1

    return list(finalState), reward

# Initialize cell values
cellValues = np.zeros((3, 3))
cellValues[2, 1] = -1  # Terminal state (3,2)
cellValues[2, 2] = 1  # Terminal state (3,3)

states = [[i, j] for i in range(dimension) for j in range(dimension)]  # positions in the grid

# Run the value iteration
for iteration in range(numIterations):
    copyCellValues = np.copy(cellValues)
    delta = 0
    for state in states:
        if state in finalStates:  # Skip terminal states
            continue
        
        # Calculate expected value using the Bellman equation
        actionValues = []
        for action in actions:
            # Calculate expected value for the desired action
            finalState, reward = calculateStateReward(state, action)
            desiredValue = reward + gamma * copyCellValues[finalState[0], finalState[1]]
            
            # Calculate expected value for the left action
            leftAction = [-action[1], action[0]]
            leftState, leftReward = calculateStateReward(state, leftAction)
            leftValue = leftReward + gamma * copyCellValues[leftState[0], leftState[1]]
            
            # Calculate expected value for the right action
            rightAction = [action[1], -action[0]]
            rightState, rightReward = calculateStateReward(state, rightAction)
            rightValue = rightReward + gamma * copyCellValues[rightState[0], rightState[1]]
            
            # Combine values with probabilities
            expectedValue = 0.8 * desiredValue + 0.1 * leftValue + 0.1 * rightValue
            actionValues.append(expectedValue)

        # Update the state value to the maximum expected action value.
        # Note: copyCellValues is updated in place, so later states in this same
        # sweep already see the new values (an in-place, Gauss-Seidel style sweep).
        maxValue = max(actionValues)
        delta = max(delta, np.abs(maxValue - copyCellValues[state[0], state[1]]))
        copyCellValues[state[0], state[1]] = maxValue
        
    # Commit this sweep's values before printing and checking for convergence
    cellValues = copyCellValues

    if iteration in [0, 1, 99, 999]:
        print(f"Iteration {iteration + 1}:")
        print(cellValues)
        print("")

    # Check for convergence
    if delta < 0.01:
        break

# Print the final state values
print("Final State Values:")
print(cellValues)

This is the final output I get:

Final State Values:
[[ 1.1566795   1.34480169  1.54979606]
 [ 1.03620988  1.22269468  1.79122109]
 [ 0.61021615 -1.          1.        ]]
