I have a reinforcement learning problem which, for this purpose, can be substituted by a multi-armed bandit. There are various reinforcement learning techniques applicable to this problem, two of them being:
- Epsilon greedy, where the best action is selected with probability p=1-epsilon and with p=epsilon a random action is selected instead.
- Boltzmann Exploration (Softmax), where the probability of selecting an action is obtained by computing the softmax over the action values of all available actions (see the formula below) and then choosing the action to be played, for example by roulette wheel selection.
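For concreteness, the selection probabilities I mean are the usual softmax over the estimated action values Q(a); the temperature tau is my own addition (with tau = 1 this is the plain softmax):

$$P(a_i) \;=\; \frac{e^{Q(a_i)/\tau}}{\sum_{j=1}^{n} e^{Q(a_j)/\tau}}$$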
Side note on an efficient epsilon-greedy implementation (can be skipped):
An efficient implementation of the epsilon-greedy strategy can be achieved using a heap and an array. In every step, if a random action should be selected, we just draw a random number and look up the index in the array of actions in O(1). If the best action should be selected, we just look at the top of the heap in O(1) and, after the value update, reinsert the element in O(log(n)), yielding overall O(log(n)) complexity per action step.
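A minimal sketch of that heap + array idea, assuming an incremental-mean value update and lazy removal of stale heap entries (the class and method names are mine, not part of the question):

```python
import heapq
import random


class EpsilonGreedy:
    def __init__(self, n_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.values = [0.0] * n_actions   # current value estimate per action
        self.counts = [0] * n_actions     # pulls per action (for the running mean)
        # Max-heap emulated with negated values: entries are (-value, action).
        # Stale entries are tolerated and skipped lazily when they reach the top.
        self.heap = [(0.0, a) for a in range(n_actions)]
        heapq.heapify(self.heap)

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.values))   # O(1) random action
        # O(1) peek at the best action; pop entries whose stored value is outdated.
        while -self.heap[0][0] != self.values[self.heap[0][1]]:
            heapq.heappop(self.heap)
        return self.heap[0][1]

    def update(self, action, reward):
        # Incremental mean as the value estimate (assumption on my part).
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
        heapq.heappush(self.heap, (-self.values[action], action))  # O(log n)
```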
Now, I'm trying to apply Boltzmann Exploration, but I'm struggling to make the algorithm fast enough. The issue is two-fold:
- After an action is played, its value is updated. This changes the denominator of the softmax and therefore the selection probabilities of all actions. That means we have to update all probabilities, making each step's time complexity O(n).
- Selecting an action, for example by roulette wheel selection, means iterating over the actions until the cumulative probability of the visited actions exceeds the randomly generated number. That is another O(n) in each step of the algorithm.
Of course, we can combine the two steps so that we pass over all actions just once.
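For reference, a minimal sketch of this straightforward O(n)-per-step implementation (the temperature parameter and the incremental-mean update are my own assumptions):

```python
import math
import random


class BoltzmannNaive:
    def __init__(self, n_actions, temperature=1.0):
        self.tau = temperature
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions

    def select(self):
        # Unnormalized softmax weights and their sum (the softmax denominator),
        # followed by the roulette wheel draw: O(n) overall.
        weights = [math.exp(v / self.tau) for v in self.values]
        total = sum(weights)
        r = random.random() * total
        acc = 0.0
        for action, w in enumerate(weights):
            acc += w
            if r <= acc:
                return action
        return len(weights) - 1   # guard against floating-point rounding

    def update(self, action, reward):
        # Incremental mean as the value estimate (assumption on my part).
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]
```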
The issue is that such algorithms are mostly studied for their theoretical properties, but when they are actually implemented, the code is usually not very efficient. Since I'm no expert in algorithm design, my question is: is there a data structure, a way of implementing the algorithm, or both, that would let me improve the implementation of Boltzmann Exploration?
Side note:
Actually, all algorithms for the multi-armed bandit besides epsilon-greedy (UCB, Thompson sampling, ...) seem to share this increased time complexity (the cost of a step is at least O(n)). For some reason, the comparisons of other stateless reinforcement learning algorithms to epsilon-greedy in the literature are done in terms of convergence in number of steps, regret after a number of steps and so on, but they never consider the achievable implementation speed, which in practice can make epsilon-greedy converge much faster in wall-clock time.
Basic idea (but the improvement is only a constant factor of the overall complexity):
One idea I had is to forget about the normalization in the softmax. At the first step, we would randomly order all actions and calculate the value of each action (the numerator in the softmax) as the sum of its own numerator and the numerators of the preceding actions in this random ordering. (Basically, create the values for the roulette wheel selection in advance.) Then we would place these values as nodes into a binary tree, each node representing the start of one action's interval, within which that action should be selected. Then we would draw a number from the range (0, sum of numerators). We would locate the action in O(log(n)), and once it is found, we would only have to update the values to the right of that node in the tree, but this is still O(n). I think this can be improved further by only noting the number that was selected and updating the following numbers in the tree only once they would affect the result of a roulette draw, but this should not improve the speed dramatically.
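A minimal sketch of this basic idea, with the cumulative unnormalized weights kept in a flat array instead of an explicit tree (the random initial ordering is omitted since it does not affect the sampling; class and method names are mine). The lookup is O(log(n)) via binary search, while an update still has to shift every cumulative value to the right of the changed action, which is the O(n) part mentioned above:

```python
import bisect
import itertools
import math
import random


class CumulativeBoltzmann:
    def __init__(self, n_actions, temperature=1.0):
        self.tau = temperature
        self.values = [0.0] * n_actions
        self.weights = [math.exp(0.0 / temperature)] * n_actions
        # cum[i] = weights[0] + ... + weights[i]; plays the role of the tree nodes above.
        self.cum = list(itertools.accumulate(self.weights))

    def select(self):
        r = random.random() * self.cum[-1]        # draw from (0, sum of numerators)
        return bisect.bisect_right(self.cum, r)   # O(log n) roulette lookup

    def update(self, action, new_value):
        # Assumes the caller supplies the new value estimate for the played action.
        old_w = self.weights[action]
        self.values[action] = new_value
        self.weights[action] = math.exp(new_value / self.tau)
        delta = self.weights[action] - old_w
        for i in range(action, len(self.cum)):     # O(n): shift the whole suffix
            self.cum[i] += delta
```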