Finding an optimal action score function for Multi-Armed Bandit Problem

42 views Asked by At

Considering a multi-armed bandit problem where there are :

C: number of machines
T: timesteps
m(i) , v(i) = mean and variance of i-th machine's reward distribution
A(i) = Selected Machine index at timestep i
R(i) = Received Reward at timestep i : ~ N(m(A(i)),v(A(i))

(all indexes start from 1)

If we decide to follow this policy:

  1. For steps i = 1 to C selecting machine i in other words A(i) = i for i <= C
  2. For steps i > C selecting machine which has the highest score (Z is score function) : Z(j, i , R(1 to i-1), A(1 to i-1)) where j is machine's index and i is timestep

So we would have our expectation for the total reward as : enter image description here

And also we have:

enter image description here so we have this objective of finding unknown function Z which maximizes E[R_total]: enter image description here

(I purned out first term in E[R_total] since it is a constant)

I know there are well-known techniques like UCB, epsilon-greedy ,... however I think there should be an optimal Z function even if we dont have any information about mean and variance of reward distributions, because at least we know that they are normally distributed in this case.

My questions are :

  1. Is there any well known technique to solve such problems ? where we don't have any specific form for the function which we are trying to find ?
  2. Is my assumption correct about existence of an optimal Z function ?
0

There are 0 answers