| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Algorithm IMED | 1/2 | https://en.wikipedia.org/wiki/Algorithm_IMED | reference | science, encyclopedia | 2026-05-05T14:37:18.993972+00:00 | kb-cron |
In multi-armed bandit problems, IMED (Indexed Minimum Empirical Divergence) is an algorithm developed in 2015 by Junya Honda and Akimichi Takemura. It is the first algorithm proved to be asymptotically optimal with respect to the problem-dependent Lai–Robbins lower bound for distributions on $(-\infty, 1]$.
== Multi-armed bandit problem ==
The multi-armed bandit problem is a sequential game in which a player chooses, at each turn, one of $K$ actions (arms). Behind every arm $a$ there is an unknown distribution $\nu_a$ that lies in a set $\mathcal{D}$ known to the player (for example, $\mathcal{D}$ can be the set of Gaussian distributions or the set of Bernoulli distributions). At each turn $t$ the player chooses (pulls) an arm $a_t$ and then receives an observation $X_t$ drawn from the distribution $\nu_{a_t}$.
=== Regret minimization ===
The goal is to minimize the regret at time $T$, which is defined as

$$R_T := \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)]$$
where
* $\mu_a := \mathbb{E}[\nu_a]$ is the mean of arm $a$,
* $\mu^* := \max_a \mu_a$ is the highest mean,
* $\Delta_a := \mu^* - \mu_a$ is the gap between the highest mean and the mean of arm $a$,
* $N_a(t)$ is the number of pulls of arm $a$ up to turn $t$.
The player has to find an algorithm that chooses at each turn $t$ which arm to pull, based on the previous actions and observations $(a_s, X_s)_{s<t}$, so as to minimize the regret $R_T$. This is a trade-off between exploration, to find the best arm (the arm with the highest mean), and exploitation, to play as often as possible the arm currently believed to be the best.
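As a small illustration of the regret definition, the following Python sketch computes the regret from a list of arm means and a list of pull counts. The function name `regret` and the example numbers are hypothetical. Realized pull counts stand in for the expectations $\mathbb{E}[N_a(T)]$, so the value is an estimate from a single run, not the exact expected regret.

```python
def regret(means, pulls):
    """Sum over arms of (gap to the best mean) * (number of pulls).

    `means` are the true arm means mu_a; `pulls` are observed counts N_a(T),
    used here in place of their expectations (an assumption of this sketch).
    """
    best = max(means)  # mu^*
    return sum((best - mu) * n for mu, n in zip(means, pulls))


# Hypothetical example: three arms, 100 pulls in total.
example_means = [0.9, 0.8, 0.5]
example_pulls = [80, 15, 5]
print(regret(example_means, example_pulls))  # gaps are 0, 0.1 and 0.4
```

Pulls of the best arm contribute nothing to the sum, so only the time spent on suboptimal arms is penalized.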
== Applications ==
Multi-armed bandit algorithms are used in a variety of fields, including clinical trials, recommender systems, telecommunications, and agriculture.
== Algorithm IMED ==
The algorithm computes an index for each arm at each turn and then pulls the arm with the smallest index. The index is the sum of two terms. The first is the cost of transporting the empirical distribution of the arm to a distribution under which the arm is optimal. The second is the cost of having been pulled too much. Formally, the index $I_a(t)$ of an arm $a$ at turn $t$ is defined as follows:

$$I_a(t) := N_a(t)\, \mathcal{K}_{\inf}(\hat{\nu}_a(t), \hat{\mu}^*(t)) + \ln(N_a(t))$$

where
* $\mathcal{K}_{\inf}(\nu, \mu) := \inf\left\{ \mathrm{KL}(\nu, \tilde{\nu}) \ \middle|\ \tilde{\nu} \in \mathcal{P}((-\infty, 1]),\ \mathbb{E}[\tilde{\nu}] > \mu \right\}$,
* $\mathrm{KL}$ is the Kullback–Leibler divergence,
* $\mathcal{P}((-\infty, 1])$ is the set of distributions on $(-\infty, 1]$,
* $\hat{\nu}_a(t)$ is the empirical distribution of arm $a$ at turn $t$,
* $\hat{\mu}^*(t)$ is the highest empirical mean at turn $t$.
Remark: for every arm $a$ that satisfies $\hat{\mu}_a(t) = \hat{\mu}^*(t)$, we have $\mathcal{K}_{\inf}(\hat{\nu}_a(t), \hat{\mu}^*(t)) = 0$, so its index is equal to $\ln(N_a(t))$.
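To make the index concrete, here is a minimal Python sketch for the special case of Bernoulli rewards. This case is an assumption of the example, not part of the definition above. For a Bernoulli empirical distribution with mean $p < \mu$, the infimum reduces to the binary Kullback–Leibler divergence between $p$ and $\mu$. The helper names `bernoulli_kl` and `imed_index` are hypothetical.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q).

    q is clipped away from 0 and 1 so the logarithms stay finite; this is a
    numerical convenience of the sketch (the true value can be infinite).
    """
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def imed_index(n_pulls, mean, best_mean):
    """IMED index N * K_inf + ln(N), assuming Bernoulli rewards."""
    # K_inf is zero for an arm whose empirical mean attains the maximum.
    k_inf = bernoulli_kl(mean, best_mean) if mean < best_mean else 0.0
    return n_pulls * k_inf + math.log(n_pulls)


# Hypothetical example: two arms, each pulled 10 times.
print(imed_index(10, 0.5, 0.5))  # empirically best arm: index is ln(10)
print(imed_index(10, 0.2, 0.5))  # worse arm: larger index, so less likely to be pulled
```

The second call shows how the divergence term grows with the number of pulls, which is what eventually stops the algorithm from pulling an arm that looks clearly suboptimal.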
=== Pseudocode ===
 // n[i]: number of pulls of arm i so far
 for each arm i do:
     n[i] ← 0; nu[i] ← None; mu[i] ← None
 // initialization: pull each arm once
 for t from 1 to K do:
     select arm t
     observe reward r
     n[t] ← n[t] + 1
     nu[t] ← update empirical distribution
     mu[t] ← update empirical mean
 // main loop: pull the arm with the smallest index
 for t from K+1 to T do:
     mu* ← highest mu
     for each arm i do:
         scoreK[i] ← n[i] × K_inf(nu[i], mu*)
         scoreN[i] ← ln(n[i])
         index[i] ← scoreK[i] + scoreN[i]
     select arm a with smallest index[a]
     observe reward r
     n[a] ← n[a] + 1
     nu[a] ← update empirical distribution
     mu[a] ← update empirical mean
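The pseudocode above can be sketched as a short runnable simulation. The version below is only an illustration under a strong simplifying assumption: rewards are Bernoulli, so the empirical distribution is summarized by its mean and $\mathcal{K}_{\inf}$ is replaced by the binary Kullback–Leibler divergence. The names `run_imed` and `bernoulli_kl`, the arm means, and the seed are hypothetical choices for the example.

```python
import math
import random


def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p), Bernoulli(q)), with q clipped to keep logs finite."""
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def run_imed(means, horizon, seed=0):
    """Simulate IMED on Bernoulli arms and return the pull counts per arm."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # n[i] in the pseudocode
    sums = [0.0] * k   # running reward sums, used for the empirical means

    def pull(arm):
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward

    # Initialization: pull each arm once.
    for arm in range(k):
        pull(arm)

    # Main loop: pull the arm with the smallest index.
    for _ in range(k, horizon):
        mu = [sums[a] / counts[a] for a in range(k)]
        mu_star = max(mu)
        index = [
            counts[a] * (bernoulli_kl(mu[a], mu_star) if mu[a] < mu_star else 0.0)
            + math.log(counts[a])
            for a in range(k)
        ]
        pull(min(range(k), key=index.__getitem__))
    return counts


if __name__ == "__main__":
    print(run_imed([0.9, 0.5, 0.2], horizon=2000))
```

Ties between equal indexes are broken in favour of the lowest arm number here, which is an arbitrary choice of the sketch.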
== Theoretical results ==
In the multi-armed bandit problem, the Lai–Robbins lower bound is an asymptotic lower bound on the regret. IMED is the first algorithm that matches this lower bound to first order for distributions on $(-\infty, 1]$. If the distributions are also bounded, it also matches the second-order term of the lower bound, and it is the first algorithm to do so.
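For orientation, the first-order lower bound is commonly stated as below. This is a standard textbook formulation written in this article's notation, given here as an assumption about the intended statement. It applies to algorithms that are consistent, meaning their regret grows more slowly than any power of $T$ on every problem instance.

```latex
\liminf_{T \to \infty} \frac{R_T}{\ln T}
  \;\geq\; \sum_{a \,:\, \Delta_a > 0}
  \frac{\Delta_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)}
```

Matching the bound to first order means that the regret of the algorithm, divided by $\ln T$, converges to the right-hand side.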