Incorporating Transition Probabilities in SARSA


I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations of DP models (the sheer amount of time and space they require). The hope is that this will reduce computation time (similar research currently takes quite a few hours) and that the smaller memory footprint will allow adding more complexity to the model.

We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?

Should we simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use explicit probabilities - or perhaps I've been reading the wrong books.
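To be concrete, by "select the next state according to the probabilities themselves" I mean something like the rough sketch below (the P[s][a][s'] table layout is just how I happen to store the probabilities):

```cpp
// Sketch: draw s' according to the explicit transition probabilities.
// P is assumed to be indexed as P[s][a][s'], with each P[s][a] summing to 1.
#include <random>
#include <vector>

int sampleNextState(const std::vector<std::vector<std::vector<double>>>& P,
                    int s, int a, std::mt19937& rng) {
    // discrete_distribution picks index i with probability proportional to P[s][a][i]
    std::discrete_distribution<int> next(P[s][a].begin(), P[s][a].end());
    return next(rng);
}
```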

PS: Is there a way of knowing whether the algorithm is properly implemented? This is my first time working with SARSA.


There are 2 answers

Pablo EM (Best Answer)

The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the former assumes the environment's dynamics are known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be noted that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).

Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):

A central challenge in the DP and RL fields is that, in their original form (i.e., tabular form), DP and RL algorithms cannot be implemented for general problems. They can only be implemented when the state and action spaces consist of a finite number of discrete elements, because (among other reasons) they require the exact representation of value functions or policies, which is generally impossible for state spaces with an infinite number of elements (or too costly when the number of states is very high).

Even when the states and actions take finitely many values, the cost of representing value functions and policies grows exponentially with the number of state variables (and action variables, for Q-functions). This problem is called the curse of dimensionality, and makes the classical DP and RL algorithms impractical when there are many state and action variables. To cope with these problems, versions of the classical algorithms that approximately represent value functions and/or policies must be used. Since most problems of practical interest have large or continuous state and action spaces, approximation is essential in DP and RL.

In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on DP or RL. In the case of RL, the transition probabilities are simply used to sample the next state given an action.
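To make that concrete, a tabular SARSA(lambda) backup could look like the rough sketch below (accumulating traces; the table names and layout are illustrative assumptions, not a prescribed implementation). Note that the transition matrix never appears in the update itself; it only determines which next state the agent observes.

```cpp
// Sketch of one tabular SARSA(lambda) backup with accumulating traces.
// Q and E are |S| x |A| tables; (s, a, r, s2, a2) is the observed transition.
#include <cstddef>
#include <vector>

void sarsaLambdaUpdate(std::vector<std::vector<double>>& Q,
                       std::vector<std::vector<double>>& E,
                       int s, int a, double r, int s2, int a2,
                       double alpha, double gamma, double lambda) {
    const double delta = r + gamma * Q[s2][a2] - Q[s][a];  // TD error
    E[s][a] += 1.0;                                        // accumulating trace
    for (std::size_t i = 0; i < Q.size(); ++i) {
        for (std::size_t j = 0; j < Q[i].size(); ++j) {
            Q[i][j] += alpha * delta * E[i][j];
            E[i][j] *= gamma * lambda;                     // decay all traces
        }
    }
}
```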

Is it better to use DP or RL? Actually, I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems safer, but maybe a big part of your state space is irrelevant to finding an optimal policy. In such a case, sampling a set of trajectories (RL) may be computationally more effective. In any case, if both methods are correctly applied, they should achieve a similar solution.

NOTE: when employing function approximation, the convergence properties are more fragile, and it is not rare to diverge during the iteration process, especially when a nonlinear approximator (such as an artificial neural network) is combined with RL.

Juan Leni

If you have access to the transition probabilities, I would suggest not using methods based on Q-values. These require additional sampling in order to extract information that you already have.

It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.
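In case it helps, modified policy iteration alternates a truncated policy evaluation (a few sweeps of the Bellman expectation backup) with a greedy improvement step, both of which use the transition probabilities directly. A rough sketch under the assumption that rewards are stored as expected values R[s][a] (names and layout are illustrative, not a definitive implementation):

```cpp
// Sketch of modified policy iteration with an explicit model.
// P[s][a][s'] are transition probabilities, R[s][a] expected rewards,
// gamma the discount factor, m the number of evaluation sweeps per iteration.
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Tensor = std::vector<Matrix>;

void modifiedPolicyIteration(const Tensor& P, const Matrix& R, double gamma,
                             int m, int iterations,
                             std::vector<int>& policy, std::vector<double>& V) {
    const std::size_t nS = P.size(), nA = P[0].size();
    policy.assign(nS, 0);
    V.assign(nS, 0.0);

    for (int it = 0; it < iterations; ++it) {
        // Truncated policy evaluation: m sweeps of the Bellman expectation backup.
        for (int sweep = 0; sweep < m; ++sweep) {
            for (std::size_t s = 0; s < nS; ++s) {
                const int a = policy[s];
                double v = R[s][a];
                for (std::size_t s2 = 0; s2 < nS; ++s2)
                    v += gamma * P[s][a][s2] * V[s2];
                V[s] = v;
            }
        }
        // Greedy policy improvement with respect to the current V.
        for (std::size_t s = 0; s < nS; ++s) {
            double bestQ = std::numeric_limits<double>::lowest();
            for (std::size_t a = 0; a < nA; ++a) {
                double q = R[s][a];
                for (std::size_t s2 = 0; s2 < nS; ++s2)
                    q += gamma * P[s][a][s2] * V[s2];
                if (q > bestQ) { bestQ = q; policy[s] = static_cast<int>(a); }
            }
        }
    }
}
```

With m = 1 this is essentially value iteration, and with a large m it approaches standard policy iteration, so m trades off the cost per iteration against the number of iterations needed.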