| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Best arm identification | 2/3 | https://en.wikipedia.org/wiki/Best_arm_identification | reference | science, encyclopedia | 2026-05-05T14:37:26.259834+00:00 | kb-cron |
In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed risk $\delta$, that is, with confidence at least $1-\delta$, while minimizing the expected number of samples. Any such algorithm requires two key components:

Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_\delta$ and returns an arm $\hat{a}_{\tau_\delta}$ such that $\mathbb{P}(\hat{a}_{\tau_\delta} \neq a^\star) \leq \delta$ and $\mathbb{P}(\tau_\delta < +\infty) = 1$.

Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_t$ based on all previous observations $(a_s, X_s)_{s<t}$.
=== Algorithm structure ===
The general structure of a fixed-confidence BAI algorithm can be described as follows:

Input: Distribution family $\mathcal{D}$, confidence level $\delta$
Initialize: For each arm $i$: set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow \text{None}$
Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
while $\text{stop} == \text{False}$ do:
    $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    Observe reward $X \sim \nu_a$
    Update: $N_a \leftarrow N_a + 1$
    Update empirical distribution $\hat{\nu}_{a_\tau}$
    $\tau \leftarrow \tau + 1$
    $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
return $\hat{a}_\tau^\star \leftarrow \arg\max_a \hat{\mu}_a$
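The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the `max_rounds` safeguard, and the initial pull of every arm are assumptions made here. The two placeholder rules (least-sampled-arm sampling and a Hoeffding-style stopping test for rewards in [0, 1]) are simple stand-ins and are not the asymptotically optimal rules discussed in this article.

```python
import math
import random


def empirical_means(sums, counts):
    return [s / n for s, n in zip(sums, counts)]


def run_fixed_confidence(arms, delta, sampling_rule, stopping_rule, max_rounds=100_000):
    """Generic fixed-confidence loop: keep sampling until the stopping rule fires."""
    k = len(arms)
    counts = [0] * k   # N_i: number of pulls of arm i
    sums = [0.0] * k   # running reward sums (enough to track empirical means)
    # Assumption: pull each arm once so that every empirical mean is defined.
    for a in range(k):
        sums[a] += arms[a]()
        counts[a] += 1
    tau = k
    stop = False
    while not stop and tau < max_rounds:  # max_rounds is a safeguard, not part of the scheme
        a = sampling_rule(delta, counts, empirical_means(sums, counts))
        sums[a] += arms[a]()              # observe a reward from arm a
        counts[a] += 1
        tau += 1
        stop = stopping_rule(delta, counts, empirical_means(sums, counts))
    means = empirical_means(sums, counts)
    return max(range(k), key=lambda i: means[i]), tau


def least_sampled(delta, counts, means):
    """Placeholder sampling rule: pull the arm with the fewest samples."""
    return min(range(len(counts)), key=lambda i: counts[i])


def hoeffding_stop(delta, counts, means):
    """Placeholder stopping rule: stop once the leader's lower confidence
    bound exceeds the upper confidence bound of every other arm."""
    k = len(counts)
    radius = [math.sqrt(math.log(4 * k * n * n / delta) / (2 * n)) for n in counts]
    leader = max(range(k), key=lambda i: means[i])
    return all(means[leader] - radius[leader] > means[i] + radius[i]
               for i in range(k) if i != leader)


rng = random.Random(0)
# Hypothetical Bernoulli arms with means 0.2, 0.5 and 0.8.
arms = [lambda p=p: 1.0 if rng.random() < p else 0.0 for p in (0.2, 0.5, 0.8)]
best, tau = run_fixed_confidence(arms, 0.05, least_sampled, hoeffding_stop)
print(best, tau)
```

Passing the two rules as functions mirrors the separation between Sampling_rule and Stopping_rule in the pseudocode, so either can be swapped without touching the loop.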
=== Lower bound ===
The minimal expected number of pulls needed to reach the confidence level $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, it gives the minimum achievable value of $\mathbb{E}[\tau_\delta]$. To state the lower bound, we first need the function $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This quantity is equivalent to $\ln(1/\delta)$ as $\delta$ tends to $0$. For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_\delta] \geq C^\star \operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^\star$ depends only on $\nu$. An optimal sampling rule $\omega^*$ is associated with this optimal constant.
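To make the role of $\operatorname{kl}(\delta, 1-\delta)$ concrete, the short sketch below evaluates it from the standard Bernoulli formula $\operatorname{kl}(p, q) = p\ln(p/q) + (1-p)\ln((1-p)/(1-q))$ and compares it with $\ln(1/\delta)$. The function names are chosen here for illustration, and `c_star` is a user-supplied placeholder, since computing $C^\star$ for a given instance is not covered in this section.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def sample_complexity_lower_bound(c_star, delta):
    """Right-hand side of the bound: C* times kl(delta, 1 - delta).

    c_star is assumed to be known; it is a placeholder input here.
    """
    return c_star * kl_bernoulli(delta, 1 - delta)


# The ratio kl(delta, 1 - delta) / ln(1 / delta) approaches 1 as delta shrinks.
for delta in (0.1, 0.01, 1e-4, 1e-8):
    kl = kl_bernoulli(delta, 1 - delta)
    print(f"delta={delta:g}  kl={kl:.4f}  ratio={kl / math.log(1 / delta):.4f}")
```

For moderate values of $\delta$ the ratio is noticeably below 1, so the bound with $\operatorname{kl}(\delta, 1-\delta)$ and its $\ln(1/\delta)$ approximation should not be treated as interchangeable outside the small-$\delta$ regime.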
=== Asymptotically optimal algorithms ===
Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\ln(1/\delta)} = C^\star.$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm. It tracks the optimal sampling rule $\omega^*$ from the lower bound: the rule is estimated from the empirical distribution, and the next arm to play is chosen according to this estimate.
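The tracking idea can be illustrated with a simplified step that steers the empirical pull proportions toward a given weight vector. This is only a sketch under assumptions: the weights are taken as already computed (estimating $\omega^*$ from data is the hard part and is not shown), and the forced-exploration threshold $\sqrt{t} - K/2$ is one choice used in the literature rather than something specified in this article. The function name `tracking_step` is hypothetical.

```python
import math


def tracking_step(counts, weights):
    """Choose the next arm so that pull proportions follow `weights`."""
    t = sum(counts)
    k = len(counts)
    # Forced exploration (assumed threshold): make sure no arm is starved,
    # so that plug-in estimates of the weights remain meaningful.
    threshold = math.sqrt(t) - k / 2
    starved = [i for i in range(k) if counts[i] < threshold]
    if starved:
        return min(starved, key=lambda i: counts[i])
    # Otherwise pull the arm that lags furthest behind its target count t * w_i.
    return max(range(k), key=lambda i: t * weights[i] - counts[i])


counts = [1, 1, 1]
weights = [0.5, 0.3, 0.2]  # hypothetical target proportions, fixed for the demo
for _ in range(997):
    counts[tracking_step(counts, weights)] += 1
proportions = [n / sum(counts) for n in counts]
print(counts, proportions)
```

In an actual Track-and-Stop run the weights would be recomputed from the current empirical estimates at each round instead of being held fixed as in this demo.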
== Fixed horizon ==
In the fixed-horizon setting, the total number of samples $T$ is specified in advance, in contrast with the fixed-confidence setting, where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon it returns a recommendation $\hat{a}_T$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_T \neq a^\star)$. This setting is particularly relevant when computational or experimental resources are strictly limited, for example in a clinical trial that must determine which of $K$ treatments is best when patient enrollment is fixed. Each arm corresponds to one of the $K$ treatments, each patient corresponds to a round $t$, and treating a patient yields one observation from the distribution of the chosen treatment. The total number of patients is the horizon $T$.
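As a rough illustration of the fixed-horizon objective, the sketch below spends a budget of $T$ pulls with a plain round-robin allocation, recommends the arm with the highest empirical mean, and estimates the error probability $\mathbb{P}(\hat{a}_T \neq a^\star)$ by Monte Carlo simulation. Uniform allocation is a baseline chosen here for simplicity, not an algorithm described in this article, and the Bernoulli arms and all function names are assumptions.

```python
import random


def uniform_allocation(arms, horizon, rng):
    """Spend the budget round-robin, then recommend the empirical best arm."""
    k = len(arms)
    if horizon < k:
        raise ValueError("horizon must allow at least one pull per arm")
    counts = [0] * k
    sums = [0.0] * k
    for t in range(horizon):
        a = t % k                     # assumed sampling rule: round-robin
        sums[a] += arms[a](rng)       # one observation from the chosen arm
        counts[a] += 1
    return max(range(k), key=lambda i: sums[i] / counts[i])


def bernoulli(p):
    """Hypothetical arm: reward 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


def error_probability(means, horizon, runs=2000, seed=0):
    """Monte Carlo estimate of the probability of recommending a wrong arm."""
    rng = random.Random(seed)
    arms = [bernoulli(p) for p in means]
    best = max(range(len(means)), key=lambda i: means[i])
    errors = sum(uniform_allocation(arms, horizon, rng) != best for _ in range(runs))
    return errors / runs


for horizon in (15, 60, 300):
    print(horizon, error_probability([0.3, 0.5, 0.7], horizon))
```

Running it for increasing horizons shows how the estimated error probability shrinks as the budget grows, which is the quantity a fixed-horizon algorithm tries to make as small as possible for a given $T$.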
=== Algorithm structure ===
A typical fixed-horizon BAI algorithm proceeds as follows:

Input: Distribution family $\mathcal{D}$, horizon $T$
Initialize: For each arm $i$: pull once and set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow$ initial empirical distribution
for $t$ from $K+1$ to $T$ do:
    $a_t \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    Observe reward $X_t \sim \nu_{a_t}$