I'm implementing a tabular Q-Learning algorithm in Python and have run into a conceptual question about whether I need an 'action_history' variable and, if so, how to use it.
As I understand it, the Q-Learning update uses only the current transition: the state, the action taken, the reward received, and the next state. What I'm unsure about is whether I also need to keep track of all previous state-action pairs of the episode (an 'action_history') so that the updated Q-value can be propagated back to them.
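For reference, this is the one-step update I'm currently doing after each transition (a minimal sketch; `alpha`, `gamma`, and the `Q` table are just my own names):

```python
from collections import defaultdict

# Q-table mapping (state, action) pairs to value estimates; all names are mine.
Q = defaultdict(float)
alpha = 0.1   # learning rate
gamma = 0.99  # discount factor

def q_update(state, action, reward, next_state, actions):
    """One-step tabular Q-learning update for a single transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
```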
To elaborate, my main questions are:
Is it necessary or beneficial to maintain an 'action_history' in tabular Q-Learning? Would propagating the updated Q-value back through all past actions of an episode actually improve learning, or does it contradict the fundamental principles of Q-Learning?
If an 'action_history' is indeed useful or required, how is the Q-value typically updated across multiple past actions? I'm unsure of the mechanism for distributing the update across previous state-action pairs. Is there a standard approach or formula for this?
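To make the second question concrete, the kind of mechanism I imagined looks roughly like the sketch below. The recency weighting via `decay` is purely a guess on my part, and I don't know whether anything like this is standard:

```python
def q_update_with_history(action_history, reward, next_state, actions, decay=0.9):
    """Hypothetical: apply the latest TD error to every (state, action)
    pair visited earlier in the episode, weighted by recency."""
    state, action = action_history[-1]      # most recent transition
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    weight = 1.0
    for s, a in reversed(action_history):   # walk the episode backwards
        Q[(s, a)] += alpha * weight * td_error
        weight *= decay                      # older actions receive less of the update
```

Here `action_history` would be a list of (state, action) tuples that I append to at every step of the episode and clear when the episode ends. Is this the right idea, or is it unnecessary (or even harmful) in plain tabular Q-Learning?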