Why do we weight recent rewards higher in non-stationary reinforcement learning?


The book 'Reinforcement Learning: An Introduction' by Sutton and Barto says the following about non-stationary RL problems:

"we often encounter reinforcement learning problems that are effectively nonstationary. In such cases, it makes sense to weight recent rewards more heavily than long-past ones. " (see here -https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node20.html)
I am not entirely convinced by this. For example, an explorer agent whose task is to find the exit of a maze might fail precisely because of a wrong choice it made in the distant past.
Could you explain, in simple terms, why it makes sense to weight recent rewards more heavily?


There are 2 answers

Don Reba

If the problem is non-stationary, then past experience becomes increasingly out of date and should be given lower weight. That way, if an explorer made a mistake in the distant past, the mistake is gradually overwritten by more recent experience.
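Concretely, this is what the constant step-size update from that same Sutton & Barto section does: with Q <- Q + alpha * (R - Q), the i-th most recent reward gets weight alpha * (1 - alpha)^i, so old experience decays exponentially, whereas the 1/n sample average weights every reward equally. Here is a minimal sketch (the reward stream and the numbers below are made up purely for illustration):

```python
import numpy as np

def sample_average(rewards):
    """Incremental sample average: step size 1/n, all rewards weighted equally."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        q += (r - q) / n
    return q

def recency_weighted(rewards, alpha=0.1):
    """Constant step size: exponential recency weighting of rewards."""
    q = 0.0
    for r in rewards:
        q += alpha * (r - q)
    return q

# The reward distribution shifts halfway through (a nonstationary problem).
rng = np.random.default_rng(0)
rewards = np.concatenate([rng.normal(0.0, 1.0, 500),   # old regime, mean 0
                          rng.normal(5.0, 1.0, 500)])  # new regime, mean 5

print(sample_average(rewards))    # ~2.5: dragged down by the stale old regime
print(recency_weighted(rewards))  # ~5.0: tracks the current regime
```

The sample average never forgets the old regime, so its estimate sits between the two means; the recency-weighted estimate converges to the value of the environment as it is now.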

Simon

The text explicitly refers to nonstationary problems. In such problems, the characteristics of the MDP change: for example, the environment can change, so the transition matrix or the reward function may be different. In that case, a reward collected in the past may no longer be significant.

In your example, the MDP is stationary, because the maze never changes, so your statement is correct. If, for example, the exit of the maze changed according to some law you do not know, then it would make sense to weight recent rewards more heavily (for example, if the reward is the Manhattan distance from the agent's position to the exit).
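To make that concrete, here is a toy sketch of such a problem (everything below is made up for illustration): a 5-armed bandit whose rewarding arm, the "exit", moves every 200 steps according to a rule the agent does not know. An epsilon-greedy agent with a constant step size keeps tracking the moving exit, while one using sample averages keeps trusting long-past rewards.

```python
import numpy as np

n_arms, steps, eps = 5, 2000, 0.1

def run(alpha=None):
    """Epsilon-greedy on a bandit whose best arm ("exit") moves every 200 steps.

    alpha=None uses the 1/n sample-average step size; a float uses a constant step size.
    """
    rng = np.random.default_rng(1)     # same seed so both agents face the same exit moves
    q = np.zeros(n_arms)               # value estimates
    counts = np.zeros(n_arms)          # pull counts (for the sample-average step size)
    best, total = 0, 0.0
    for t in range(steps):
        if t % 200 == 0:
            best = rng.integers(n_arms)                # the exit moves
        a = rng.integers(n_arms) if rng.random() < eps else int(np.argmax(q))
        r = 1.0 if a == best else 0.0                  # reward only at the exit
        total += r
        counts[a] += 1
        step = alpha if alpha is not None else 1.0 / counts[a]
        q[a] += step * (r - q[a])
    return total / steps

print("sample average (1/n):", run())           # typically a lower hit rate
print("constant alpha = 0.1:", run(alpha=0.1))  # typically a higher hit rate
```

With the sample average, an arm that was good hundreds of steps ago keeps a high estimate long after the exit has moved; with a constant step size, that stale estimate decays within a few dozen pulls and the agent switches to the new exit.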

In general, dealing with nonstationary MDPs is very hard, because you usually do not know how the characteristics change (in the example above, you do not know how the exit location moves). Conversely, if you do know the law governing how the environment changes, you should include it in the MDP model.