---
title: "Best arm identification"
chunk: 3/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---
Update: $N_{a_t} \leftarrow N_{a_t} + 1$
Update empirical distribution $\hat{\nu}_{a_t}$
return $\hat{a}_T^{\star} \leftarrow \arg\max_a \hat{\mu}_a$
Unlike the fixed-confidence setting, there is no stopping rule because we stop at time $T$. The algorithm is based only on a sampling rule.
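The loop above can be sketched in Python. This is a minimal illustration, not the algorithm from the article: the sampling rule is not shown in this section, so a round-robin rule is assumed here, and the names `round_robin_bai` and `bernoulli` are hypothetical. The empirical distribution of each arm is summarised by a pull count and a reward sum, which is enough to recover the empirical mean $\hat{\mu}_a$.

```python
import random


def round_robin_bai(arms, budget, seed=0):
    """Fixed-horizon best arm identification with an ASSUMED round-robin sampling rule.

    `arms` is a list of callables; each takes a random.Random and returns a reward.
    """
    rng = random.Random(seed)
    n_arms = len(arms)
    counts = [0] * n_arms   # N_a: number of pulls of each arm
    sums = [0.0] * n_arms   # running reward sums (summary of the empirical distribution)

    for t in range(budget):            # no stopping rule: the loop ends at time T = budget
        a_t = t % n_arms               # sampling rule (assumption: cycle through the arms)
        reward = arms[a_t](rng)
        counts[a_t] += 1               # N_{a_t} <- N_{a_t} + 1
        sums[a_t] += reward            # update the empirical summary of arm a_t

    # Empirical means; an arm that was never pulled cannot be recommended.
    means = [s / n if n > 0 else float("-inf") for s, n in zip(sums, counts)]
    best = max(range(n_arms), key=lambda a: means[a])   # arg max of the empirical means
    return best, counts, means


def bernoulli(p):
    """Hypothetical helper: a Bernoulli arm with success probability p."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
best, counts, means = round_robin_bai(arms, budget=3000)
print(best, counts)
```

With a budget that is a multiple of the number of arms, round-robin gives every arm the same number of pulls. Adaptive sampling rules would instead shift pulls toward arms that are hard to tell apart.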
=== Lower bound ===
The lower bound in the fixed-horizon setting gives the best confidence level that can be reached with a given number of turns $T$. It is expressed as an asymptotic result for large $T$.
Lower bound theorem: For any algorithm and any instance $\nu$, there exists a constant $H(\nu)$ that depends only on $\nu$ such that the probability of error satisfies

$$\lim_{T \to +\infty} \mathbb{P}(\hat{a}_T \notin \mathcal{A}^{\star}) \geq \exp\left(-T H(\nu)\right)$$
This result shows that the error probability cannot decay faster than exponentially in the number of turns $T$, at a rate governed by $H(\nu)$.
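To get a feel for the scale involved, the bound can be inverted: if the error probability is at least $\exp(-T H(\nu))$, then reaching an error level $\delta$ requires roughly $T \geq \ln(1/\delta) / H(\nu)$. The sketch below applies this reading at finite $T$, which is a heuristic since the theorem is asymptotic. The function names and the value `H = 0.05` are hypothetical and chosen only for illustration.

```python
import math


def error_lower_bound(T, H):
    """Value of exp(-T * H), read heuristically as a floor on the error probability."""
    return math.exp(-T * H)


def min_budget(delta, H):
    """Smallest integer T with exp(-T * H) <= delta, i.e. T >= ln(1/delta) / H."""
    return math.ceil(math.log(1.0 / delta) / H)


H = 0.05        # hypothetical instance-dependent constant H(nu)
delta = 0.01    # target error probability
T_min = min_budget(delta, H)
print(T_min, error_lower_bound(T_min, H))
```

Under this reading, no algorithm can guarantee error below `delta` on that instance with fewer than `T_min` turns. The budget actually needed by a given algorithm may be larger.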
=== Simple regret ===
An alternative performance metric for fixed-horizon BAI is the simple regret, defined as

$$r_T := \mathbb{E}\left[\mu^{\star} - \mu_{\hat{a}_T^{\star}}\right],$$

which measures the expected suboptimality of the returned arm.
While $\mathbb{P}(\hat{a}_T^{\star} \neq a^{\star})$ treats all mistakes with the same cost, the simple regret $r_T$ accounts for the gap between the optimal mean $\mu^{\star}$ and the mean $\mu_{\hat{a}_T^{\star}}$ of the arm that the algorithm returns as optimal. This distinction is important in applications where the cost of choosing a suboptimal arm depends on how far it is from optimal.
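The difference between the two metrics can be seen in a small numerical sketch. The arm means and the recommendation probabilities below are made-up values, and the function names are hypothetical. Both recommendation profiles are wrong 30% of the time, but one errs toward a near-optimal arm and the other toward a poor arm.

```python
def error_probability(means, recommend_probs):
    """Probability that the recommended arm does not have the highest mean."""
    best = max(means)
    return sum(p for mu, p in zip(means, recommend_probs) if mu < best)


def simple_regret(means, recommend_probs):
    """Expected gap between the best mean and the mean of the recommended arm."""
    best = max(means)
    return sum(p * (best - mu) for mu, p in zip(means, recommend_probs))


means = [0.9, 0.8, 0.1]        # illustrative arm means; arm 0 is optimal
near_miss = [0.7, 0.3, 0.0]    # mistakes go to the arm with mean 0.8
far_miss = [0.7, 0.0, 0.3]     # mistakes go to the arm with mean 0.1

for name, probs in [("near_miss", near_miss), ("far_miss", far_miss)]:
    print(name, error_probability(means, probs), simple_regret(means, probs))
```

Both profiles have the same error probability, while the simple regret of `far_miss` is eight times larger because its mistakes land on an arm with a gap of 0.8 instead of 0.1.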
== See also ==
Multi-armed bandit
Design of experiments
Concentration inequality