I'm implementing the REINFORCE with baseline algorithm, but I have a doubt about the discounted reward function.
I implemented the discount reward function like this:
import numpy as np

def disc_r(rewards):
    r = np.zeros_like(rewards)
    tsteps = range(len(rewards))  # time steps
    sum_reward = 0
    # accumulate the return backwards over the episode
    for i in reversed(tsteps):
        sum_reward = rewards[i] + gamma*sum_reward  # gamma is defined globally
        r[i] = sum_reward
        print(r[i])
    return r - np.mean(r)  # subtract the mean of the returns
Therefore, for example, for a discount factor gamma = 0.1
and rewards = [1,2,3,4]
it gives:
r = [1.234, 2.34, 3.4, 4.0]
which is correct according to the expression of the return G:
The return is the sum of discounted rewards, accumulated backwards through the episode: G = discount_factor * G + reward
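Working backwards through rewards = [1,2,3,4] with gamma = 0.1:

G_3 = 4
G_2 = 3 + 0.1 * 4    = 3.4
G_1 = 2 + 0.1 * 3.4  = 2.34
G_0 = 1 + 0.1 * 2.34 = 1.234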
However, and here is my question: I found this article from Towards Data Science https://towardsdatascience.com/learning-reinforcement-learning-reinforce-with-pytorch-5e8ad7fc7da0 where they define this same function as follows:
def discount_rewards(rewards, gamma=0.99):
    r = np.array([gamma**i * rewards[i] for i in range(len(rewards))])
    # Reverse the array direction for cumsum and then revert back to the original order
    r = r[::-1].cumsum()[::-1]
    print(r)
    return r - r.mean()
Computing it for the same gamma = 0.1
and rewards = [1,2,3,4]
it gives:
r = [1.234, 0.234, 0.034, 0.004]
But I don't see the logic here; it doesn't seem to follow the recursion for G...
Does anyone know what is going on with this second function and why it could also be correct (or in which cases it might be)?
I can confirm that the second function is incorrect: it discounts every reward back to time step 0 and takes a reverse cumulative sum, but it never rescales each entry back to its own time step, so it does not compute the return G. A corrected version of it, which keeps the numpy vectorization and is more efficient than your first function, would be something along these lines:
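def discount_rewards(rewards, gamma=0.99):
    # Sketch of a fix: discount to t = 0, take the reverse cumulative sum,
    # then divide by gamma**t so each entry equals the return from its own time step.
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(len(rewards))
    r = rewards * gamma**t        # rewards discounted back to time step 0
    r = r[::-1].cumsum()[::-1]    # reverse cumulative sum, reversed back
    return r / gamma**t           # now r[i] = G_i

For rewards = [1,2,3,4] and gamma = 0.1 this gives [1.234, 2.34, 3.4, 4.0], the same values your first function prints. One caveat: gamma**t shrinks towards zero, so for very long episodes dividing by it can become numerically unstable, and the explicit backward loop is safer in that case.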
Also, it seems to me that your first function is not fully correct either. Why are you subtracting the mean in the return statement? That subtraction is not part of the return G, so if disc_r is meant to compute G you should return r directly (and apply any baseline separately). And be careful to initialize
r = np.zeros_like(rewards, dtype=float)
otherwise, if rewards contains integers, numpy creates an integer array and floors the discounted results.
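For example:

>>> np.zeros_like([1, 2, 3, 4])
array([0, 0, 0, 0])          # integer dtype: assigned returns get floored
>>> np.zeros_like([1, 2, 3, 4], dtype=float)
array([0., 0., 0., 0.])      # float dtype: the discounted values are preserved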