---
title: "Best arm identification"
chunk: 2/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---

In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level $\delta$ while minimizing the expected number of samples. Any such algorithm requires two key components:

- Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_{\delta}$ and returns an arm $\hat{a}_{\tau_{\delta}}$ such that $\mathbb{P}(\hat{a}_{\tau_{\delta}} \neq a^{\star}) \leq \delta$ and $\mathbb{P}(\tau_{\delta} < +\infty) = 1$.
- Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_{t}$ based on all previous observations $(a_{s}, X_{s})_{s<t}$.

=== Algorithm structure ===

The general structure of a fixed-confidence BAI algorithm can be described as follows:

- Input: Distribution family $\mathcal{D}$, confidence level $\delta$
- Initialize:
    - For each arm $i$: set $N_{i} \leftarrow 1$, $\hat{\nu}_{i} \leftarrow \text{None}$
    - Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
- while $\text{stop} == \text{False}$ do:
    - $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    - Observe reward $X \sim \nu_{a}$
    - Update: $N_{a} \leftarrow N_{a} + 1$
    - Update empirical distribution $\hat{\nu}_{a_{\tau}}$
    - $\tau \leftarrow \tau + 1$
    - $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
- return $\hat{a}_{\tau}^{\star} \leftarrow \arg\max_{a} \hat{\mu}_{a}$

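The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular paper: rewards are assumed to lie in $[0, 1]$, each empirical distribution is summarized by its running mean, every arm is pulled once up front so that the means are defined, the sampling rule is plain uniform allocation, and the stopping rule is a simple Hoeffding-style confidence-interval test with a union bound over arms and sample counts. The names `radius`, `sampling_rule`, `stopping_rule`, `fixed_confidence_bai`, `bernoulli`, and the `max_pulls` safety cap are all hypothetical.

```python
import math
import random


def radius(n, num_arms, delta):
    """Hoeffding-style confidence radius after n pulls (rewards assumed in [0, 1])."""
    return math.sqrt(math.log(4 * num_arms * n * n / delta) / (2 * n))


def sampling_rule(counts, means):
    """Illustrative rule (assumption): pull the least-sampled arm."""
    return min(range(len(counts)), key=lambda i: counts[i])


def stopping_rule(delta, counts, means):
    """Stop once the leader's lower bound clears every other arm's upper bound."""
    k = len(counts)
    best = max(range(k), key=lambda i: means[i])
    lower = means[best] - radius(counts[best], k, delta)
    return all(
        means[i] + radius(counts[i], k, delta) < lower
        for i in range(k)
        if i != best
    )


def fixed_confidence_bai(arms, delta, max_pulls=100_000):
    """Generic loop: sample, update, test the stopping rule, then recommend."""
    k = len(arms)
    # Assumption: pull each arm once so that every empirical mean is defined.
    means = [arm() for arm in arms]
    counts = [1] * k
    tau = k
    stop = stopping_rule(delta, counts, means)
    while not stop and tau < max_pulls:  # max_pulls is a safety cap, not in the text
        a = sampling_rule(counts, means)
        x = arms[a]()
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]  # running-mean update
        tau += 1
        stop = stopping_rule(delta, counts, means)
    recommendation = max(range(k), key=lambda i: means[i])
    return recommendation, tau, stop


def bernoulli(p, rng):
    """Return a zero-argument function that draws a Bernoulli(p) reward."""
    return lambda: 1.0 if rng.random() < p else 0.0


rng = random.Random(0)
arms = [bernoulli(p, rng) for p in (0.2, 0.5, 0.8)]
best, pulls, stopped = fixed_confidence_bai(arms, delta=0.05)
print("recommended arm:", best, "pulls used:", pulls, "stopped by rule:", stopped)
```

Uniform allocation is generally far from the optimal sampling proportions; it is used here only because it keeps the structure of the loop visible.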
=== Lower bound ===

The minimal expected number of pulls required to reach the confidence level $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, the bound gives the smallest achievable value of $\mathbb{E}[\tau_{\delta}]$.

To state the lower bound, we first define $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This quantity is equivalent to $\ln(1/\delta)$ as $\delta$ tends to $0$.

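Writing out the Bernoulli divergence gives the closed form $\operatorname{kl}(\delta, 1-\delta) = (1-2\delta)\ln\frac{1-\delta}{\delta}$, which makes the equivalence with $\ln(1/\delta)$ easy to check numerically. The short sketch below does so; the function name `bernoulli_kl` is an illustrative choice.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# Compare kl(delta, 1 - delta) with ln(1 / delta) as delta shrinks.
for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    kl = bernoulli_kl(delta, 1 - delta)
    log_term = math.log(1 / delta)
    print(f"delta={delta:g}  kl={kl:.4f}  ln(1/delta)={log_term:.4f}  "
          f"ratio={kl / log_term:.4f}")
```

The printed ratio approaches 1 from below as $\delta$ decreases, so for moderate values of $\delta$ the two quantities can still differ noticeably.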
For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_{\delta}] \geq C^{\star} \, \operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^{\star}$ depends only on $\nu$. An optimal sampling rule $\omega^{*}$ is associated with this optimal constant.

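To give a feel for the size of this bound, the sketch below evaluates its right-hand side for several values of $\delta$. The text does not give a formula for $C^{\star}$; as an assumption for illustration, the code uses the standard special case of two Gaussian arms with a common known standard deviation $\sigma$ and gap $\Delta$ between the means, for which the constant is $8\sigma^{2}/\Delta^{2}$. The function names are hypothetical.

```python
import math


def kl_delta(delta):
    """kl(delta, 1 - delta) for Bernoulli distributions, 0 < delta < 1/2."""
    return (1 - 2 * delta) * math.log((1 - delta) / delta)


def lower_bound(c_star, delta):
    """Right-hand side of the bound: C* times kl(delta, 1 - delta)."""
    return c_star * kl_delta(delta)


def c_star_two_gaussians(mu1, mu2, sigma):
    """Assumed special case: two Gaussian arms sharing a known std sigma."""
    gap = abs(mu1 - mu2)
    return 8 * sigma ** 2 / gap ** 2


c_star = c_star_two_gaussians(0.5, 0.0, sigma=1.0)
for delta in (0.1, 0.01, 0.001):
    print(f"delta={delta:g}  lower bound on E[tau]: {lower_bound(c_star, delta):.1f}")
```

Halving the gap multiplies this constant by four, while dividing $\delta$ by ten only adds a roughly constant number of pulls, which is why the gap usually dominates the cost.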
=== Asymptotically optimal algorithms ===

Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_{\delta}]}{\ln(1/\delta)} = C^{\star}.$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm, which tracks the optimal sampling rule $\omega^{*}$ from the lower bound: it estimates $\omega^{*}$ from the empirical distribution and chooses the next arm to play according to this estimated optimal sampling rule.

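The tracking step can be illustrated with a small sketch. It assumes the target proportions are supplied as a fixed vector `target_weights`; in Track-and-Stop they would be re-estimated at every round from the empirical distribution, a step omitted here. The forced-exploration threshold $\sqrt{t} - K/2$ follows the direct-tracking variant described in the literature, and the function name `d_tracking` is a hypothetical label.

```python
import math


def d_tracking(counts, target_weights):
    """Pick the next arm so that pull proportions track target_weights."""
    t = sum(counts)
    k = len(counts)
    # Forced exploration: arms with fewer than sqrt(t) - k/2 pulls go first.
    starved = [a for a in range(k) if counts[a] < math.sqrt(t) - k / 2]
    if starved:
        return min(starved, key=lambda a: counts[a])
    # Otherwise pull the arm whose count lags most behind t * w_a.
    return max(range(k), key=lambda a: t * target_weights[a] - counts[a])


# Assumption: fixed target weights standing in for the estimated optimal rule.
weights = [0.5, 0.3, 0.2]
counts = [0, 0, 0]
for _ in range(1000):
    counts[d_tracking(counts, weights)] += 1
print("empirical proportions:", [c / sum(counts) for c in counts])
```

With fixed weights the empirical proportions stay close to the targets; the forced-exploration branch matters mainly when the estimated weights put almost no mass on some arm.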
== Fixed horizon ==

In the fixed-horizon setting, the total number of samples $T$ is specified in advance, contrasting with the fixed-confidence setting where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon, it returns a recommendation $\hat{a}_{T}$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_{T} \neq a^{\star})$.

This setting is particularly relevant when computational or experimental resources are strictly limited, such as in clinical trials where we want to determine which of $K$ treatments is the best and patient enrollment is fixed. Each arm corresponds to one of the $K$ treatments; giving a treatment to a patient yields one observation from that treatment's outcome distribution, and each patient corresponds to a round $t$. The total number of patients is the horizon $T$.

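As a concrete baseline for this setting, the sketch below spends the whole budget $T$ by uniform (round-robin) allocation and recommends the arm with the highest empirical mean, then estimates the probability of error by Monte Carlo simulation. Uniform allocation is chosen only for simplicity and is an assumption of this example, as are the Bernoulli arms and the names `uniform_allocation_bai` and `estimate_error`.

```python
import random


def uniform_allocation_bai(arms, horizon):
    """Spend the budget round-robin, then recommend the empirical best arm."""
    k = len(arms)
    if horizon < k:
        raise ValueError("horizon must allow at least one pull per arm")
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        a = t % k  # round-robin: every arm gets about horizon / k pulls
        sums[a] += arms[a]()
        counts[a] += 1
    means = [sums[i] / counts[i] for i in range(k)]
    return max(range(k), key=lambda i: means[i])


def estimate_error(probs, horizon, runs, seed=0):
    """Monte Carlo estimate of P(recommendation != best arm), Bernoulli arms."""
    rng = random.Random(seed)
    arms = [(lambda p=p: 1.0 if rng.random() < p else 0.0) for p in probs]
    best = max(range(len(probs)), key=lambda i: probs[i])
    errors = sum(uniform_allocation_bai(arms, horizon) != best for _ in range(runs))
    return errors / runs


for horizon in (30, 120, 480):
    err = estimate_error([0.4, 0.5, 0.6], horizon, runs=1000)
    print(f"T={horizon}  estimated error probability: {err:.3f}")
```

Adaptive sampling rules aim to reach a lower error probability than this baseline for the same budget by moving pulls away from arms that are clearly suboptimal.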
=== Algorithm structure ===

A typical fixed-horizon BAI algorithm proceeds as follows:

- Input: Distribution family $\mathcal{D}$, horizon $T$
- Initialize:
    - For each arm $i$: pull once and set $N_{i} \leftarrow 1$, $\hat{\nu}_{i} \leftarrow$ initial empirical distribution
- for $t$ from $K+1$ to $T$ do:
    - $a_{t} \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    - Observe reward $X_{t} \sim \nu_{a_{t}}$