Finding an optimal action score function for Multi-Armed Bandit Problem

42 views Asked by MohammadAli Zeraatkar At 10 June 2023 at 10:39

Considering a multi-armed bandit problem where there are :

C: number of machines
T: timesteps
m(i) , v(i) = mean and variance of i-th machine's reward distribution
A(i) = Selected Machine index at timestep i
R(i) = Received Reward at timestep i : ~ N(m(A(i)),v(A(i))

(all indexes start from 1)

If we decide to follow this policy:

For steps i = 1 to C selecting machine i in other words A(i) = i for i <= C
For steps i > C selecting machine which has the highest score (Z is score function) : Z(j, i , R(1 to i-1), A(1 to i-1)) where j is machine's index and i is timestep

So we would have our expectation for the total reward as :

And also we have:

so we have this objective of finding unknown function Z which maximizes E[R_total]:

(I purned out first term in E[R_total] since it is a constant)

I know there are well-known techniques like UCB, epsilon-greedy ,... however I think there should be an optimal Z function even if we dont have any information about mean and variance of reward distributions, because at least we know that they are normally distributed in this case.

My questions are :

Is there any well known technique to solve such problems ? where we don't have any specific form for the function which we are trying to find ?
Is my assumption correct about existence of an optimal Z function ?

Original Q&A

TechQA.

Finding an optimal action score function for Multi-Armed Bandit Problem

There are 0 answers

Related Questions in OPTIMIZATION

Related Questions in PROBABILITY

Related Questions in REINFORCEMENT-LEARNING

Related Questions in EXPECTATIONS

Popular Questions

Trending Questions