Abstract:
SAMWMIX, a Stochastic Multi-Armed Bandit (SMAB) algorithm which obtains a regret of $O(\log T)$, where $T$ is the number of steps in the time horizon, has been proposed in the literature.
A blind-SAMWMIX, which incorporates an input parameter, has better empirical performance but obtains a regret of the order $O(\log^{1+2\alpha} T)$. The current work proposes an efficient version of SAMWMIX which not only obtains a regret of $O(\log T)$ but also exhibits better performance. A proof for the same is
given in this work. The proposed effSAMWMIX algorithm is
compared with the KL-UCB and Thompson Sampling (TS) algorithms over rewards drawn from Exponential, Poisson, Bernoulli, Triangular, and Truncated Normal distributions, as well as a synthetic distribution designed to stress-test SMAB algorithms with closely spaced reward means. It is shown that effSAMWMIX performs better than both KL-UCB and TS in terms of both regret and execution time.