How can I get SARSA code for a gridworld model in R?


I have a problem in my study case. I am interested in reinforcement learning for a gridworld model. The model is a maze of 7x7 fields that the agent moves through. There are four directions: up, down, left and right (or N, E, S, W), so there are at most 4^49 deterministic policies (four possible actions in each of the 49 fields). Many can be excluded by using the immediate punishment given for bumping into a wall, and employing in addition an inhibition-of-return principle usually makes even fewer actions admissible. Many policies differ only in the part after the goal or are equivalent.

▼ States: the fields of the 7x7 grid, with obstacles
▼ Reward: r = 1 if s = G, r = 0 for any other admissible move, and r = -100 otherwise (bumping into a wall)
▼ Initialisation: Q0(a,s) ~ N(0, 0.01) (one way to write this in R is sketched below)
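For example, this initialisation could be sketched in R as follows (the 49-by-4 layout is just one possible choice, and 0.01 is treated here as the standard deviation):

    # Q-table sketch: one row per field (49 in total), one column per action.
    # Q0(a, s) ~ N(0, 0.01), taking 0.01 as the standard deviation.
    actions <- c("N", "S", "E", "W")
    Q <- matrix(rnorm(49 * length(actions), mean = 0, sd = 0.01),
                nrow = 49, ncol = length(actions),
                dimnames = list(NULL, actions))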

To solve this model, I wrote the R code below, but it is not working properly.

Model: 7x7 grid. S: start state, G: terminal state, O: accessible state, X: wall

 [O,O,G,X,O,O,S]
 [O,X,O,X,O,X,X]
 [O,X,O,X,O,O,O]
 [O,X,O,X,O,X,O]
 [O,X,O,O,O,X,O]
 [O,X,O,X,O,X,O]
 [O,O,O,X,O,O,O]

So I want to know how I can correct the code below for this gridworld model, and how I can solve this model with SARSA.

actions <- c("N", "S", "E", "W")

x <- 1:7
y <- 1:7

# rewards[row, col]: 1 at the goal G, -100 at the walls X, 0 elsewhere
rewards <- matrix(c(
     0,    0,    1, -100,    0,    0,    0,
     0, -100,    0, -100,    0, -100, -100,
     0, -100,    0, -100,    0,    0,    0,
     0, -100,    0, -100,    0, -100,    0,
     0, -100,    0,    0,    0, -100,    0,
     0, -100,    0, -100,    0, -100,    0,
     0,    0,    0, -100,    0,    0,    0
), nrow = 7, byrow = TRUE)

 values <- rewards # initial values

 states <- expand.grid(x=x, y=y)

 # Transition probability
 transition <- list("N" = c("N" = 0.8, "S" = 0, "E" = 0.1, "W" = 0.1), 
         "S"= c("S" = 0.8, "N" = 0, "E" = 0.1, "W" = 0.1),
         "E"= c("E" = 0.8, "W" = 0, "S" = 0.1, "N" = 0.1),
         "W"= c("W" = 0.8, "E" = 0, "S" = 0.1, "N" = 0.1))

 # The value of an action (e.g. move north means y + 1)
 action.values <- list("N" = c("x" = 0, "y" = 1), 
         "S" = c("x" = 0, "y" = -1),
         "E" = c("x" = 1, "y" = 0),
         "W" = c("x" = -1, "y" = 0))

 # act() function serves to move the robot through states based on an action
 act <- function(action, state) {
     action.value <- action.values[[action]]
     new.state <- state
         if(state["x"] == 1 && state["y"] == 7 || (state["x"] == 1 && state["y"] == 3))
         return(state)
     #
     new.x = state["x"] + action.value["x"]
     new.y = state["y"] + action.value["y"]
     # Constrained by edge of grid
     new.state["x"] <- min(x[length(x)], max(x[1], new.x))
     new.state["y"] <- min(y[length(y)], max(y[1], new.y))
     #
     if(is.na(rewards[new.state["y"], new.state["x"]]))
         new.state <- state
     #
     return(new.state)
 }


 rewards

 bellman.update <- function(action, state, values, gamma=1) {
     state.transition.prob <- transition[[action]]
     q <- rep(0, length(state.transition.prob))
     for(i in 1:length(state.transition.prob)) {        
         new.state <- act(names(state.transition.prob)[i], state) 
         q[i] <- state.transition.prob[i] *
             (rewards[state["y"], state["x"]] + gamma * values[new.state["y"], new.state["x"]])
     }
     sum(q)
 }

 value.iteration <- function(states, actions, rewards, values, gamma, niter, n) {
     for (j in 1:niter) {
         for (i in 1:nrow(states)) {
             state <- unlist(states[i,])
             if(i %in% c(7, 15)) next # terminal states
             q.values <- as.numeric(lapply(actions, bellman.update, state=state, values=values, gamma=gamma))
             values[state["y"], state["x"]] <- max(q.values)
         }
     }
     return(values)
 }

 final.values <- value.iteration(states=states, actions=actions, rewards=rewards, values=values, gamma=0.99, niter=100, n=10)

 final.values

1 Answer

Answer by Lucas Borsatto:

The problem is that your punishment is much larger than the reward. The agent probably prefers to throw itself into a wall rather than try to reach the reward. That happens because the state-action values converge to very low numbers, even lower than -100, depending on the rewards of the actions.
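As a rough illustration of this effect (a toy sketch with made-up numbers, not your code): if the agent keeps taking an action that bumps it into a wall, it collects r = -100 while staying in the same state, so the update drives that state-action value toward r / (1 - gamma), which is far below -100 when gamma is close to 1.

    # Toy sketch: repeatedly bumping a wall (r = -100) while staying in place.
    # The update converges to r / (1 - gamma) = -100 / (1 - 0.99) = -10000.
    q <- 0; alpha <- 0.1; gamma <- 0.99
    for (k in 1:10000) q <- q + alpha * (-100 + gamma * q - q)
    q  # roughly -10000, far below -100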

Here is a model that I made, simulating value iteration (which gives the values that SARSA should converge to):

(image from the original answer: the example grid with the ball and blocks, and the table of converged state values)

The table of values shows the state values of the model in the picture, but it is inverted (I haven't fixed that yet).

In this case, I set the reward and punishment values to be very similar to your model: -15 is a neutral state (a wall), 1.0 is the ball, and -100 are the blocks. The agent gets 0.0 for each action, and the transition probabilities are the same too.

The agent must reach the ball, but as you can see, the state values converged to very low numbers. Notice that the states neighboring the ball have even lower values, so the agent prefers never to reach its goal.

To solve your problem, try to decrease the punishment.
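If you then want to solve the maze with SARSA itself, a minimal sketch in R could look like the following. It is only an illustration, not a verified solution: it assumes deterministic moves (no 0.8/0.1/0.1 transition noise), (row, column) indexing, epsilon-greedy action selection, a wall punishment reduced to -1 as suggested above, and made-up values for alpha, gamma, epsilon and the number of episodes.

    # Minimal SARSA sketch for the 7x7 maze (assumptions: deterministic moves,
    # epsilon-greedy exploration, wall punishment reduced to -1, illustrative
    # values for alpha, gamma, epsilon and n.episodes).
    set.seed(1)

    grid <- matrix(c(
      "O","O","G","X","O","O","S",
      "O","X","O","X","O","X","X",
      "O","X","O","X","O","O","O",
      "O","X","O","X","O","X","O",
      "O","X","O","O","O","X","O",
      "O","X","O","X","O","X","O",
      "O","O","O","X","O","O","O"
    ), nrow = 7, byrow = TRUE)

    actions <- c("N", "S", "E", "W")
    moves   <- list(N = c(-1, 0), S = c(1, 0), E = c(0, 1), W = c(0, -1))  # (row, col) offsets

    start <- which(grid == "S", arr.ind = TRUE)[1, ]   # row 1, col 7
    goal  <- which(grid == "G", arr.ind = TRUE)[1, ]   # row 1, col 3

    # Q0(a, s) ~ N(0, 0.01): one row per cell (1..49, column-major), one column per action
    Q <- matrix(rnorm(49 * 4, 0, 0.01), nrow = 49, dimnames = list(NULL, actions))
    cell <- function(s) (s[2] - 1) * 7 + s[1]

    # One environment step: bumping the border or a wall keeps the agent in place
    step <- function(s, a) {
      s2 <- s + moves[[a]]
      if (s2[1] < 1 || s2[1] > 7 || s2[2] < 1 || s2[2] > 7 || grid[s2[1], s2[2]] == "X")
        return(list(s = s, r = -1, done = FALSE))   # reduced punishment instead of -100
      if (all(s2 == goal))
        return(list(s = s2, r = 1, done = TRUE))
      list(s = s2, r = 0, done = FALSE)
    }

    eps.greedy <- function(s, eps) {
      if (runif(1) < eps) sample(actions, 1) else actions[which.max(Q[cell(s), ])]
    }

    alpha <- 0.1; gamma <- 0.95; eps <- 0.1; n.episodes <- 2000

    for (ep in 1:n.episodes) {
      s <- start
      a <- eps.greedy(s, eps)
      repeat {
        out <- step(s, a)
        a2  <- eps.greedy(out$s, eps)
        target <- out$r + if (out$done) 0 else gamma * Q[cell(out$s), a2]
        Q[cell(s), a] <- Q[cell(s), a] + alpha * (target - Q[cell(s), a])
        if (out$done) break
        s <- out$s
        a <- a2
      }
    }

    matrix(round(apply(Q, 1, max), 2), nrow = 7)      # learned state values, same layout as the maze
    matrix(actions[apply(Q, 1, which.max)], nrow = 7) # greedy action per cell

After training, the greedy policy read off the Q-table should steer each accessible cell toward G; if the printed values still come out uniformly very negative, the punishment is probably still dominating the +1 goal reward. Wall cells are never visited, so their rows simply keep their random initial values.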