I would like to understand how gamma affects the learnt policy. I cannot tell whether the final reward is discounted linearly or exponentially.
I would expect the final reward to be something like
R = sum_i gamma^i * rew_i
but I cannot find this in the main code. Thank you.
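
To make the exponential form I have in mind concrete, here is a minimal sketch of that sum. It is only an illustration: `discounted_return` is a hypothetical helper name, it is not taken from the repository, and it assumes the index i starts at 0 so that the first reward is not discounted.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i], with i starting at 0 (assumed)."""
    total = 0.0
    # Walk backwards through the episode: each step adds the current reward
    # to the already-accumulated future return, discounted once by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode with a reward of 1.0 at each of three steps.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25 = 1.75
```

With gamma = 1 this reduces to the plain sum of rewards, and with gamma = 0 only the first reward counts, which is the behaviour I would expect from an exponential discount.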