kb/data/en.wikipedia.org/wiki/Best_arm_identification-1.md

---
title: "Best arm identification"
chunk: 2/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---

In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence parameter $\delta$, meaning that it errs with probability at most $\delta$, while minimizing the expected number of samples. Any such algorithm requires two key components:

Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_\delta$ and returns an arm $\hat{a}_{\tau_\delta}$ such that $\mathbb{P}(\hat{a}_{\tau_\delta} \neq a^\star) \leq \delta$ and $\mathbb{P}(\tau_\delta < +\infty) = 1$.

Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_t$ based on all previous observations $(a_s, X_s)_{s<t}$.

=== Algorithm structure ===
The general structure of a fixed-confidence BAI algorithm can be described as follows:

Input: Distribution family $\mathcal{D}$, confidence level $\delta$
Initialize:
    For each arm $i$: set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow \text{None}$
    Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
while $\text{stop} == \text{False}$ do:
    $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    Observe reward $X \sim \nu_a$
    Update: $N_a \leftarrow N_a + 1$
    Update empirical distribution $\hat{\nu}_{a_\tau}$
    $\tau \leftarrow \tau + 1$
    $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
return $\hat{a}_\tau^\star \leftarrow \arg\max_a \hat{\mu}_a$
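
The loop above can be sketched in Python. This is only an illustration under stated assumptions, not a method taken from the article: rewards are assumed to lie in $[0, 1]$, the sampling rule is plain uniform exploration, and the stopping rule compares Hoeffding confidence intervals whose radius is chosen so that a union bound over arms and sample counts keeps the error probability below $\delta$. The names `radius`, `sampling_rule`, `stopping_rule`, `fixed_confidence_bai`, `pull` and `max_pulls` are all hypothetical. Unlike the pseudocode, the counts start at zero here so that they match the number of observations.

```python
import math
import random
from typing import Callable, List, Tuple


def radius(n: int, k: int, delta: float) -> float:
    """Hoeffding confidence radius after n pulls of one arm (rewards assumed in [0, 1])."""
    return math.sqrt(math.log(4 * k * n * n / delta) / (2 * n))


def sampling_rule(counts: List[int]) -> int:
    """Uniform exploration: pull the least-sampled arm (lowest index on ties)."""
    return min(range(len(counts)), key=lambda i: counts[i])


def stopping_rule(counts: List[int], means: List[float], delta: float) -> bool:
    """Stop once the leader's lower bound exceeds every other arm's upper bound."""
    k = len(counts)
    if min(counts) == 0:
        return False  # every arm needs at least one observation
    leader = max(range(k), key=lambda i: means[i])
    lower = means[leader] - radius(counts[leader], k, delta)
    return all(
        means[i] + radius(counts[i], k, delta) < lower
        for i in range(k)
        if i != leader
    )


def fixed_confidence_bai(
    pull: Callable[[int], float], k: int, delta: float, max_pulls: int = 1_000_000
) -> Tuple[int, int]:
    """Run the sample/update/stop loop; return (recommended arm, number of pulls)."""
    counts = [0] * k
    means = [0.0] * k
    tau = 0
    stop = False
    # max_pulls is a practical safeguard only; the guarantee assumes it is never hit.
    while not stop and tau < max_pulls:
        a = sampling_rule(counts)
        x = pull(a)
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]  # running empirical mean
        tau += 1
        stop = stopping_rule(counts, means, delta)
    best = max(range(k), key=lambda i: means[i])
    return best, tau


if __name__ == "__main__":
    rng = random.Random(0)
    true_means = [0.2, 0.5, 0.8]  # illustrative Bernoulli arms
    arm, pulls = fixed_confidence_bai(
        lambda a: float(rng.random() < true_means[a]), k=3, delta=0.1
    )
    print(arm, pulls)
```

Uniform exploration is generally far from the optimal sample complexity; it is used here only because it keeps the sampling rule to one line.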


=== Lower bound ===

The minimal expected number of pulls needed to reach the confidence level $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, it gives the smallest achievable value of $\mathbb{E}[\tau_\delta]$.
To state the lower bound, we first need the function $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This is equivalent to $\ln(1/\delta)$ as $\delta$ tends to $0$.
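
As a quick numerical check of this equivalence, the Bernoulli divergence can be evaluated directly. The sketch below uses the standard definition of the KL divergence between two Bernoulli distributions, which for this pair simplifies to $(1-2\delta)\ln\frac{1-\delta}{\delta}$. The function name `kl_bernoulli` is an illustrative choice.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for p and q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


if __name__ == "__main__":
    # The ratio kl(delta, 1 - delta) / ln(1 / delta) should approach 1 as delta shrinks.
    for delta in (1e-1, 1e-3, 1e-6, 1e-9):
        value = kl_bernoulli(delta, 1 - delta)
        ratio = value / math.log(1 / delta)
        print(f"delta={delta:g}  kl={value:.4f}  ratio={ratio:.4f}")
```
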
For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_\delta] \geq C^\star \, \operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^\star$ depends only on $\nu$. An optimal sampling rule $\omega^*$ is associated with this optimal constant.
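
To get a feel for the magnitude of the bound, here is a worked example with a purely hypothetical constant $C^\star = 100$ (not derived from any particular instance) and $\delta = 0.05$:

$$
\operatorname{kl}(0.05,\, 0.95) = (1 - 2 \cdot 0.05)\,\ln\frac{0.95}{0.05} = 0.9 \ln 19 \approx 2.65,
\qquad
\mathbb{E}[\tau_{0.05}] \geq 100 \times 2.65 = 265 .
$$

For comparison, $\ln(1/0.05) = \ln 20 \approx 3.00$, so at this value of $\delta$ the approximation $\ln(1/\delta)$ is somewhat larger than the exact factor.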

=== Asymptotically optimal algorithms ===
Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\ln(1/\delta)} = C^\star.$$


The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm. It tracks the optimal sampling rule $\omega^*$ from the lower bound: at each round, it estimates $\omega^*$ from the empirical distribution and chooses the next arm to play according to this estimate.
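
One way the tracking step is commonly described in the Track-and-Stop literature is "D-tracking": force exploration of any arm whose pull count falls below $\sqrt{t} - K/2$, and otherwise pull the arm whose count lags furthest behind its target $t \cdot \hat{\omega}_a$. The sketch below assumes that rule and takes the estimated proportions `w_hat` as an input. Computing them from the empirical distribution requires solving an optimization problem that is not shown here. The function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def d_tracking_step(counts: Sequence[int], w_hat: Sequence[float]) -> int:
    """Choose the next arm from the pull counts and the estimated optimal proportions.

    counts[a] is the number of pulls of arm a so far; w_hat[a] is the current
    estimate of the optimal proportion for arm a (assumed to be supplied by the caller).
    """
    k = len(counts)
    t = sum(counts)
    threshold = math.sqrt(t) - k / 2
    # Forced exploration: arms sampled too rarely are pulled first.
    under_sampled = [a for a in range(k) if counts[a] < threshold]
    if under_sampled:
        return min(under_sampled, key=lambda a: counts[a])
    # Otherwise track the target allocation t * w_hat.
    return max(range(k), key=lambda a: t * w_hat[a] - counts[a])
```

For example, with counts `[30, 10]` and `w_hat = [0.5, 0.5]`, no arm is under-sampled and arm 1 is chosen because it lags its target of 20 pulls.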

== Fixed horizon ==
In the fixed-horizon setting, the total number of samples $T$ is specified in advance, in contrast with the fixed-confidence setting, where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon, it returns a recommendation $\hat{a}_T$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_T \neq a^\star)$.
This setting is particularly relevant when computational or experimental resources are strictly limited, such as in a clinical trial that aims to find out which of $K$ treatments is the best while patient enrollment is fixed. Each arm corresponds to one of the $K$ treatments, giving a treatment to a patient yields one observation from that treatment's outcome distribution, and each patient corresponds to a round $t$. The total number of patients is the horizon $T$.

=== Algorithm structure ===
A typical fixed-horizon BAI algorithm proceeds as follows:

Input: Distribution family $\mathcal{D}$, horizon $T$
Initialize:
    For each arm $i$: pull once and set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow$ initial empirical distribution
for $t$ from $K+1$ to $T$ do:
    $a_t \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    Observe reward $X_t \sim \nu_{a_t}$