Counterintuitive results on multi-armed bandit exercise


I am working through chapter 2, section 7, of Sutton & Barto's Reinforcement Learning: An Introduction, which deals with gradient methods in the multi-armed bandit problem. (I realize that the 2nd edition is a draft and the sections seem to move around a little, but my file has section 2.7 titled "Gradient Bandits".) I have managed to use the methods in sections 2.3-2.5 without issue, but the gradient method consistently gives me perplexing results. I will walk through my code and show an example.

Just initializing everything here:

import random
import math
import numpy as np, numpy.random

# number of arms (k) and step-size (alpha)
k = 10
alpha = 0.1

# initialize preference function (H), and reward distribution (R)
H = {i: 0 for i in range(k)}
R = {i: [random.uniform(-100,100), 1] for i in range(k)}

I'm using stationary reward distributions, and I am using dictionaries to represent those distributions. I am assuming each reward is described by a Gaussian, so I am mapping actions to rewards with the following function:

def getReward(action, rewardDistribution):
  # sample a reward for this action from N(mean, standard deviation)
  return random.gauss(rewardDistribution[action][0], rewardDistribution[action][1])

The so-called "preference function" H, which is used to determine the action probabilities, is also given by a dictionary. I am spreading the reward means across a very wide range: each reward is drawn from a Gaussian with standard deviation 1 whose mean lies somewhere between -100 and 100. I am doing this because my intuition tells me it should make it harder for the algorithm to settle on a sub-optimal choice, but I am finding that the opposite is occurring.

This code selects my actions at each iteration:

def selectAction(policy):
  return np.random.choice(list(policy.keys()), p=list(policy.values()))

And next is the code that runs the iterations of the algorithm. Note that pi is the policy and is initialized to give probability 1/k to each action.

avgReward = 0
for i in range(100000):
  # soft-max over the preferences gives the action probabilities
  pi = {i: math.exp(H[i])/sum([math.exp(H[j]) for j in range(k)]) for i in range(k)}
  A = selectAction(pi)
  R_A = getReward(A, R)
  # incremental average of all rewards seen so far, used as the baseline
  avgReward += (R_A - avgReward)/(i + 1)
  # gradient bandit preference update
  H = {i: H[i] + alpha*(R_A - avgReward)*((i == A) - pi[i]) for i in range(k)}
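
For reference, this loop is meant to implement the soft-max policy and the preference update from that section of the book:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}, \qquad H_{t+1}(a) = H_t(a) + \alpha\,\bigl(R_t - \bar{R}_t\bigr)\,\bigl(\mathbb{1}\{a = A_t\} - \pi_t(a)\bigr),$$

where $\bar{R}_t$ is the running average of the rewards, used as the baseline.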

Notice I'm running 100,000 iterations, which to me seems like it should be overkill. This is my first attempt at this problem, so my intuition may be off, but I tried to set this up to make it easy for the algorithm to find the optimal choice. So what I expect is that the process will converge on the action whose distribution has the highest expected value and will keep hitting it as the iterations go by. But when I print out results for each possible action of the bandit, this is what I see:

for i in range(k):
  print("Expected reward: " + str(R[i][0]) + " | Selection probability: " + str(pi[i]) + " | Preference: " + str(H[i]))

Expected reward: -50.62506110888989 | Selection probability: 3.617077909489526e-13 | Preference: -7.82992533515
Expected reward: 11.866419726345484 | Selection probability: 1.2337498052271344e-10 | Preference: -1.99777839484
Expected reward: 75.41139657867947 | Selection probability: 1.5933517476231946e-09 | Preference: 0.560588358966
Expected reward: -72.44467653824414 | Selection probability: 3.4267025247257986e-13 | Preference: -7.88399339198
Expected reward: -43.466561447399 | Selection probability: 1.5933517476231946e-09 | Preference: 0.560588358966
Expected reward: -75.99171566420297 | Selection probability: 1.5933517476231946e-09 | Preference: 0.560588358966
Expected reward: -82.11920932060593 | Selection probability: 3.120658098513757e-13 | Preference: -7.97754791911
Expected reward: 95.00643386364632 | Selection probability: 1.5933517476231946e-09 | Preference: 0.560588358966
Expected reward: 31.384022070017835 | Selection probability: 1.2605442916195123e-08 | Preference: 2.62887724114
Expected reward: 49.83925652065625 | Selection probability: 0.9999999808967586 | Preference: 20.8180143641

The last action has an expected reward of 49.8, and the bandit selects it virtually every time. This is the 3rd-best of the 10 options, but it ignores an option that has an expected reward of 75.4 and another one that has an expected reward of 95.0.

So, my question: why is this bandit missing the optimal choice? This is just one example; it happens on a pretty consistent basis when I run the program. Is my intuition off with respect to what I should expect the bandit to do, or have I coded this algorithm incorrectly?

Accepted answer by Dennis Soemers:

The problem is that many arms (or actions; I'm using "arms" since that is the most common terminology in MAB problems) never get played even once with your current setup. You can easily verify this by printing, for every arm, how often it was selected.
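
For instance, here is a minimal sketch of that check. It reuses k, alpha, R, selectAction, and getReward from your question, resets H, and adds a counts dictionary purely for instrumentation:

# fresh preferences for this check, plus a counter per arm
H = {i: 0 for i in range(k)}
counts = {i: 0 for i in range(k)}

avgReward = 0
for t in range(100000):
  # same loop as in the question, with the selection recorded
  pi = {i: math.exp(H[i])/sum([math.exp(H[j]) for j in range(k)]) for i in range(k)}
  A = selectAction(pi)
  counts[A] += 1
  R_A = getReward(A, R)
  avgReward += (R_A - avgReward)/(t + 1)
  H = {i: H[i] + alpha*(R_A - avgReward)*((i == A) - pi[i]) for i in range(k)}

for i in range(k):
  print("Arm " + str(i) + " was selected " + str(counts[i]) + " times")

With your original settings, most of these counts will typically turn out to be zero.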

This happens because your rewards have a rather high absolute value. With alpha = 0.1 and rewards on the order of ±100, a single preference update can shift an entry of H by roughly 10, which is enough for the soft-max to push the probabilities of the arms that were not selected down to essentially zero. In the literature on MAB problems, rewards are often assumed to lie in [0, 1] or [-1, 1]. This is not strictly necessary (although it is for some proofs related to the theoretical performance of algorithms... but that's probably not interesting for you right now). Anyway, there are a few ways in which you could fix the problem:

1) Initialize the list of preferences (H) to high values, instead of 0s. This has a similar effect to the optimistic initialization of epsilon-greedy that is described earlier in the book, in that it motivates the algorithm to do a bit more exploration earlier on.

2) Drastically reduce the value of the learning rate alpha. Try something more like 0.00001, instead of 0.1. The effect of that change is that the preference values in H move away from 0 at a much smaller rate, so the probabilities in pi also move away from the initial 1/k more slowly.

3) Re-scale the reward values to lie in, for example, [-1, 1] (this would also require an appropriate reduction of the standard deviation of the reward distributions, if you don't want the problem to become more complicated). A sketch of suggestions 2 and 3 follows below.
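
As a concrete illustration of the last two suggestions, here is a minimal sketch of how the setup code from your question might change. The specific constants (alpha = 0.00001, means in [-1, 1], standard deviation 0.01) are illustrative assumptions, not tuned values:

# suggestion 2: a much smaller step size, so preferences drift away from 0 slowly
alpha = 0.00001

# suggestion 3: reward means drawn from [-1, 1] instead of [-100, 100], with the
# standard deviation shrunk by the same factor of 100 so the arms stay roughly
# as distinguishable relative to the noise as before
R = {i: [random.uniform(-1, 1), 0.01] for i in range(k)}

Either change on its own should already give every arm a reasonable chance of being tried before the probabilities concentrate.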