| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Best arm identification | 3/3 | https://en.wikipedia.org/wiki/Best_arm_identification | reference | science, encyclopedia | 2026-05-05T14:37:26.259834+00:00 | kb-cron |
Update: $N_{a_t} \leftarrow N_{a_t} + 1$
Update empirical distribution $\hat{\nu}_{a_t}$
return $\hat{a}_T^\star \leftarrow \arg\max_a \hat{\mu}_a$
Unlike the fixed-confidence setting, there is no stopping rule because we stop at time $T$. The algorithm is based only on a sampling rule.
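The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `fixed_budget_bai` and `round_robin` are hypothetical, the round-robin (uniform) sampling rule is just one possible choice and is not prescribed by the text, and the empirical distribution of each arm is summarized by a running sum, which is enough to compute the empirical mean $\hat{\mu}_a$.

```python
import random


def round_robin(t, counts, sums):
    """Hypothetical sampling rule: pull arms in cyclic order (uniform allocation)."""
    return t % len(counts)


def fixed_budget_bai(arms, budget, sampling_rule, rng):
    """Run `budget` pulls, then return the index of the arm with the best empirical mean.

    `arms` is a list of callables taking a random.Random and returning a reward.
    There is no stopping rule: the loop always ends after exactly `budget` pulls.
    """
    k = len(arms)
    counts = [0] * k   # N_a: number of pulls of each arm
    sums = [0.0] * k   # running reward sums, a summary of the empirical distribution
    for t in range(budget):
        a = sampling_rule(t, counts, sums)  # choose a_t
        reward = arms[a](rng)               # observe a reward from arm a_t
        counts[a] += 1                      # N_{a_t} <- N_{a_t} + 1
        sums[a] += reward                   # update the empirical summary of arm a_t
    # Arms that were never pulled get -inf so they cannot be recommended.
    means = [sums[a] / counts[a] if counts[a] else float("-inf") for a in range(k)]
    return max(range(k), key=lambda a: means[a])


if __name__ == "__main__":
    # Assumed example instance: three Gaussian arms with unit variance.
    arms = [lambda rng, m=m: rng.gauss(m, 1.0) for m in (0.2, 0.8, 0.5)]
    print(fixed_budget_bai(arms, budget=300, sampling_rule=round_robin, rng=random.Random(0)))
```

Passing the sampling rule as an argument keeps the fixed-horizon structure visible: swapping in a different rule changes which arm is pulled at each step, while the budget and the final recommendation step stay the same.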
=== Lower bound ===
The lower bound in the fixed-horizon setting gives the best confidence level we can reach with a given number of turns $T$. It is expressed as an asymptotic result when $T$ is large.

Lower bound theorem: For any algorithm, for any instance $\nu$, there exists a constant $H(\nu)$ that depends only on $\nu$ such that the probability of error satisfies

$$\lim_{T \to +\infty} \mathbb{P}(\hat{a}_T \notin \mathcal{A}^\star) \geq \exp\left(-T H(\nu)\right)$$
This result shows that the error probability can decay at most exponentially fast in the number of turns $T$, at a rate governed by $H(\nu)$.
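To get a feel for how the error probability shrinks as $T$ grows, one can estimate it by simulation. The sketch below is not part of the theorem and does not compute $H(\nu)$; it assumes Gaussian arms with unit variance, a uniform (round-robin) sampling rule, and a hypothetical helper named `error_probability`.

```python
import random


def error_probability(means, budget, trials, seed=0):
    """Monte Carlo estimate of P(returned arm is not the best arm).

    Assumptions: Gaussian rewards with unit variance, round-robin sampling,
    and budget >= number of arms so that every arm is pulled at least once.
    """
    rng = random.Random(seed)
    k = len(means)
    best = max(range(k), key=lambda a: means[a])
    errors = 0
    for _ in range(trials):
        sums = [0.0] * k
        counts = [0] * k
        for t in range(budget):
            a = t % k                         # uniform allocation over arms
            sums[a] += rng.gauss(means[a], 1.0)
            counts[a] += 1
        guess = max(range(k), key=lambda a: sums[a] / counts[a])
        if guess != best:
            errors += 1
    return errors / trials


if __name__ == "__main__":
    # Print the estimated error for increasing budgets on a two-armed instance.
    for T in (4, 16, 64):
        print(T, error_probability([0.5, 0.0], T, trials=2000))
```

On an instance like this, the estimates should drop quickly as the budget increases, which is consistent with an exponential decay in $T$. The exact values depend on the seed and on the number of trials.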
=== Simple regret ===
An alternative performance metric for fixed-horizon BAI is the simple regret, defined as

$$r_T := \mathbb{E}\left[\mu^\star - \mu_{\hat{a}_T^\star}\right],$$
which measures the expected suboptimality of the returned arm. While $\mathbb{P}(\hat{a}_T^\star \neq a^\star)$ treats all mistakes with the same cost, the simple regret $r_T$ accounts for the gap between the optimal mean $\mu^\star$ and the mean $\mu_{\hat{a}_T^\star}$ of the arm that the algorithm returns as optimal. This distinction is important in applications where the cost of choosing a suboptimal arm depends on how far it is from optimal.
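The difference between the two metrics is easiest to see on an instance with a near-optimal arm. The following sketch estimates both quantities by simulation. It assumes Gaussian arms with unit variance and round-robin sampling, and the helper name `evaluate` and the example means are illustrative choices, not taken from the text.

```python
import random


def evaluate(means, budget, trials, seed=0):
    """Estimate (error probability, simple regret) by Monte Carlo simulation.

    Assumptions: Gaussian rewards with unit variance, round-robin sampling,
    and budget >= number of arms so that every arm is pulled at least once.
    """
    rng = random.Random(seed)
    k = len(means)
    mu_star = max(means)
    mistakes = 0
    total_gap = 0.0
    for _ in range(trials):
        sums = [0.0] * k
        counts = [0] * k
        for t in range(budget):
            a = t % k
            sums[a] += rng.gauss(means[a], 1.0)
            counts[a] += 1
        guess = max(range(k), key=lambda a: sums[a] / counts[a])
        gap = mu_star - means[guess]   # suboptimality of the returned arm
        total_gap += gap
        if gap > 0:
            mistakes += 1              # every mistake counts the same here
    return mistakes / trials, total_gap / trials


if __name__ == "__main__":
    # Arm 1 is nearly as good as arm 0, so confusing them is frequent but cheap.
    err, regret = evaluate([1.0, 0.95, 0.0], budget=30, trials=1000)
    print("error probability:", err)
    print("simple regret:", regret)
```

With means such as `[1.0, 0.95, 0.0]`, most mistakes come from picking the near-optimal arm, so the error probability can be large while the simple regret stays small. Because every gap in this instance is at most 1, the simple regret cannot exceed the error probability.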
== See also ==
* Multi-armed bandit
* Design of experiments
* Concentration inequality