---
title: "Best arm identification"
chunk: 2/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---
In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with error probability at most $\delta$ (that is, with confidence at least $1-\delta$) while minimizing the expected number of samples. Any such algorithm requires two key components:

- Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_\delta$ and returns an arm $\hat{a}_{\tau_\delta}$ such that $\mathbb{P}(\hat{a}_{\tau_\delta} \neq a^\star) \leq \delta$ and $\mathbb{P}(\tau_\delta < +\infty) = 1$.
- Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_t$ based on all previous observations $(a_s, X_s)_{s<t}$.
=== Algorithm structure ===
The general structure of a fixed-confidence BAI algorithm can be described as follows:
- Input: distribution family $\mathcal{D}$, confidence level $\delta$
- Initialize:
    - For each arm $i$: set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow \text{None}$
    - Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
- while $\text{stop} == \text{False}$ do:
    - $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    - Observe reward $X \sim \nu_a$
    - Update: $N_a \leftarrow N_a + 1$
    - Update empirical distribution $\hat{\nu}_{a_\tau}$
    - $\tau \leftarrow \tau + 1$
    - $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
- return $\hat{a}^\star_\tau \leftarrow \arg\max_a \hat{\mu}_a$
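The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: rewards are assumed to lie in $[0, 1]$, pull counts start at zero because no arm has been sampled before the loop, and all names (`fixed_confidence_bai`, `uniform_sampling_rule`, `hoeffding_stopping_rule`, `max_pulls`) are hypothetical. The two placeholder rules are deliberately simple: uniform sampling and a union-bound Hoeffding test. They are not the Track-and-Stop rules and are generally far from sample-optimal.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Arm = Callable[[], float]  # hypothetical interface: each call returns one reward


def uniform_sampling_rule(counts: Sequence[int], sums: Sequence[float], delta: float) -> int:
    """Placeholder sampling rule: pull the least-sampled arm (round-robin)."""
    return min(range(len(counts)), key=lambda i: counts[i])


def hoeffding_stopping_rule(counts: Sequence[int], sums: Sequence[float], delta: float) -> bool:
    """Placeholder stopping rule, assuming rewards in [0, 1].

    Stops once the empirical leader's lower confidence bound exceeds every
    other arm's upper confidence bound. The radius comes from Hoeffding's
    inequality with a union bound over arms and sample sizes.
    """
    if min(counts) == 0:
        return False  # every arm needs at least one observation
    k = len(counts)
    means = [s / n for s, n in zip(sums, counts)]
    radii = [math.sqrt(math.log(4 * k * n * n / delta) / (2 * n)) for n in counts]
    leader = max(range(k), key=lambda i: means[i])
    return all(
        means[leader] - radii[leader] > means[i] + radii[i]
        for i in range(k)
        if i != leader
    )


def fixed_confidence_bai(
    arms: List[Arm],
    delta: float,
    sampling_rule=uniform_sampling_rule,
    stopping_rule=hoeffding_stopping_rule,
    max_pulls: int = 100_000,  # safety cap for this sketch, not part of the scheme
) -> Tuple[int, int]:
    """Run the generic loop; return (recommended arm index, number of pulls)."""
    k = len(arms)
    counts = [0] * k   # N_a: number of pulls of each arm
    sums = [0.0] * k   # running reward sums, enough to recover empirical means
    tau = 0
    stop = False
    while not stop and tau < max_pulls:
        a = sampling_rule(counts, sums, delta)   # choose the next arm
        x = arms[a]()                            # observe a reward from arm a
        counts[a] += 1
        sums[a] += x
        tau += 1
        stop = stopping_rule(counts, sums, delta)
    means = [sums[i] / counts[i] if counts[i] else float("-inf") for i in range(k)]
    best = max(range(k), key=lambda i: means[i])
    return best, tau


if __name__ == "__main__":
    random.seed(0)
    # Three Bernoulli arms with illustrative means 0.3, 0.5 and 0.7.
    bernoulli_arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.7)]
    print(fixed_confidence_bai(bernoulli_arms, delta=0.05))
```

Passing the sampling and stopping rules as arguments mirrors the pseudocode, where both are interchangeable components; a more sample-efficient algorithm would replace both placeholders while keeping the loop unchanged.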
=== Lower bound ===
The minimal expected number of pulls needed to reach a confidence level of $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, it gives the smallest achievable value of $\mathbb{E}[\tau_\delta]$.

To state the lower bound, we first need to define the function $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This is equivalent to $\ln(1/\delta)$ when $\delta$ tends to $0$.
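The equivalence with $\ln(1/\delta)$ can be checked directly from the standard formula for the divergence between two Bernoulli distributions. The derivation below is a short worked sketch; the symbols $p$ and $q$ are introduced here only for illustration.

```latex
\begin{aligned}
\operatorname{kl}(p, q) &= p \ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}, \\
\operatorname{kl}(\delta, 1-\delta)
  &= \delta \ln\frac{\delta}{1-\delta} + (1-\delta)\ln\frac{1-\delta}{\delta}
   = (1-2\delta)\ln\frac{1-\delta}{\delta}, \\
\ln\frac{1-\delta}{\delta} &= \ln\frac{1}{\delta} + \ln(1-\delta).
\end{aligned}
```

As $\delta \to 0$, the factor $1-2\delta$ tends to $1$ and $\ln(1-\delta)$ tends to $0$, so the ratio $\operatorname{kl}(\delta, 1-\delta) / \ln(1/\delta)$ tends to $1$.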
For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_\delta] \geq C^\star \, \operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^\star$ only depends on $\nu$. An optimal sampling rule $\omega^*$ is associated with this optimal constant.
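To get a feel for the size of this bound, the following sketch evaluates $C^\star \operatorname{kl}(\delta, 1-\delta)$ numerically and compares it with $C^\star \ln(1/\delta)$. The value of `c_star` is made up for illustration; computing the true constant for a given instance $\nu$ is a separate optimization problem that this sketch does not attempt. The function names are hypothetical.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def sample_complexity_lower_bound(c_star: float, delta: float) -> float:
    """Evaluate the bound C* . kl(delta, 1 - delta) for a given constant C*."""
    return c_star * bernoulli_kl(delta, 1 - delta)


if __name__ == "__main__":
    c_star = 50.0  # hypothetical problem-dependent constant, not derived from any instance
    for delta in (0.1, 0.01, 0.001):
        bound = sample_complexity_lower_bound(c_star, delta)
        approx = c_star * math.log(1 / delta)
        print(f"delta={delta}: bound={bound:.1f}, C* ln(1/delta)={approx:.1f}")
```

For $\delta < 1/2$ the exact bound is always slightly smaller than $C^\star \ln(1/\delta)$, and the gap shrinks in relative terms as $\delta$ decreases.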
=== Asymptotically optimal algorithms ===
Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\ln(1/\delta)} = C^\star.$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm, which consists of tracking the optimal sampling rule $\omega^*$ in the lower bound by using the empirical distribution and choosing the next arm to play according to this estimated optimal sampling rule.
== Fixed horizon ==
In the fixed-horizon setting, the total number of samples $T$ is specified in advance, in contrast with the fixed-confidence setting, where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon, it returns a recommendation $\hat{a}_T$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_T \neq a^\star)$.

This setting is particularly relevant when computational or experimental resources are strictly limited, such as in clinical trials where patient enrollment is fixed and the aim is to determine which of $K$ treatments is the best. Each arm corresponds to one of the $K$ treatments; giving a treatment to a patient yields one observation from that treatment's outcome distribution, and each patient corresponds to a round $t$. The total number of patients is the horizon $T$.
=== Algorithm structure ===
A typical fixed-horizon BAI algorithm proceeds as follows:
- Input: distribution family $\mathcal{D}$, horizon $T$
- Initialize:
    - For each arm $i$: pull once and set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow$ initial empirical distribution
- for $t$ from $K+1$ to $T$ do:
    - $a_t \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    - Observe reward $X_t \sim \nu_{a_t}$