| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Best arm identification | 2/3 | https://en.wikipedia.org/wiki/Best_arm_identification | reference | science, encyclopedia | 2026-05-05T14:37:26.259834+00:00 | kb-cron |

In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level $1-\delta$ (that is, with error probability at most $\delta$) while minimizing the expected number of samples. Any such algorithm requires two key components:

- Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_{\delta}$ and returns an arm $\hat{a}_{\tau_{\delta}}$ such that $\mathbb{P}(\hat{a}_{\tau_{\delta}} \neq a^{\star}) \leq \delta$ and $\mathbb{P}(\tau_{\delta} < +\infty) = 1$.
- Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_{t}$ based on all previous observations $(a_{s}, X_{s})_{s<t}$.

=== Algorithm structure ===

The general structure of a fixed-confidence BAI algorithm can be described as follows:

- Input: distribution family $\mathcal{D}$, confidence level $\delta$
- Initialize: for each arm $i$, set $N_{i} \leftarrow 1$, $\hat{\nu}_{i} \leftarrow \text{None}$
- Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
- while $\text{stop} == \text{False}$ do:
    - $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    - Observe reward $X \sim \nu_{a}$
    - Update: $N_{a} \leftarrow N_{a} + 1$
    - Update empirical distribution $\hat{\nu}_{a_{\tau}}$
    - $\tau \leftarrow \tau + 1$
    - $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
- return $\hat{a}_{\tau}^{\star} \leftarrow \arg\max_{a} \hat{\mu}_{a}$
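
The loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `fixed_confidence_bai`, `least_sampled`, and `hoeffding_stop` are hypothetical, rewards are assumed to lie in $[0, 1]$, each arm is pulled once up front so that the empirical means are defined, and the two plug-in rules are simple stand-ins (round-robin sampling and a Hoeffding-style confidence-interval test), not the rules of any particular published algorithm. Empirical means stand in for the empirical distributions $\hat{\nu}$.

```python
import math
import random
from typing import Callable, List, Tuple


def least_sampled(delta: float, counts: List[int], means: List[float]) -> int:
    """Hypothetical sampling rule: pull the arm with the fewest samples."""
    return min(range(len(counts)), key=lambda i: counts[i])


def hoeffding_stop(delta: float, counts: List[int], means: List[float]) -> bool:
    """Hypothetical stopping rule (rewards assumed in [0, 1]): stop once the
    leader's lower confidence bound exceeds every other arm's upper bound."""
    k = len(counts)
    radius = [math.sqrt(math.log(4 * k * n * n / delta) / (2 * n)) for n in counts]
    leader = max(range(k), key=lambda i: means[i])
    return all(
        means[leader] - radius[leader] > means[i] + radius[i]
        for i in range(k)
        if i != leader
    )


def fixed_confidence_bai(
    arms: List[Callable[[], float]],
    delta: float,
    sampling_rule: Callable[[float, List[int], List[float]], int],
    stopping_rule: Callable[[float, List[int], List[float]], bool],
    max_rounds: int = 100_000,
) -> Tuple[int, int]:
    """Return (recommended arm, total number of samples drawn)."""
    k = len(arms)
    # Assumption: one initial pull per arm, so N_i = 1 and the means exist.
    counts = [1] * k
    means = [arm() for arm in arms]
    tau, stop = 0, False
    # max_rounds is a safety cap for the sketch, not part of the structure above.
    while not stop and tau < max_rounds:
        a = sampling_rule(delta, counts, means)
        x = arms[a]()                            # observe reward X ~ nu_a
        counts[a] += 1                           # N_a <- N_a + 1
        means[a] += (x - means[a]) / counts[a]   # update the running mean of arm a
        tau += 1
        stop = stopping_rule(delta, counts, means)
    best = max(range(k), key=lambda i: means[i])
    return best, sum(counts)


rng = random.Random(0)
# Three Bernoulli arms with (assumed) means 0.2, 0.5 and 0.8.
arms = [lambda p=p: float(rng.random() < p) for p in (0.2, 0.5, 0.8)]
best, samples = fixed_confidence_bai(arms, 0.05, least_sampled, hoeffding_stop)
print(best, samples)
```

Passing the sampling and stopping rules as functions mirrors the pseudocode, where both are left abstract; swapping in a different rule does not change the loop.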

=== Lower bound ===

The minimal expected number of pulls needed to reach the confidence level $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, this lower bound gives the minimum possible value of $\mathbb{E}[\tau_{\delta}]$. To state the lower bound, we first need to define the function $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This is equivalent to $\ln(1/\delta)$ when $\delta$ tends to $0$. For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_{\delta}] \geq C^{\star}\,\operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^{\star}$ only depends on $\nu$. An optimal sampling rule $\omega^{*}$ is associated with this optimal constant.
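
The factor $\operatorname{kl}(\delta, 1-\delta)$ is easy to evaluate numerically. For Bernoulli distributions it has the closed form $(1-2\delta)\ln((1-\delta)/\delta)$, which makes the equivalence with $\ln(1/\delta)$ visible. The short sketch below (the function name `kl_bernoulli` is our own) prints the ratio of the two quantities for decreasing $\delta$:

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    kl = kl_bernoulli(delta, 1 - delta)
    log_term = math.log(1 / delta)
    # The ratio approaches 1 as delta tends to 0.
    print(f"delta={delta:g}  kl={kl:.4f}  ln(1/delta)={log_term:.4f}  ratio={kl / log_term:.4f}")
```

For moderate values of $\delta$ the two quantities differ noticeably, so the lower bound stated with $\operatorname{kl}(\delta, 1-\delta)$ is not interchangeable with $\ln(1/\delta)$ outside the small-$\delta$ regime.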

=== Asymptotically optimal algorithms ===

Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_{\delta}]}{\ln(1/\delta)} = C^{\star}.$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm, which consists of tracking the optimal sampling rule $\omega^{*}$ in the lower bound by using the empirical distribution and choosing the next arm to play according to this estimated optimal sampling rule.
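
The tracking idea can be illustrated in isolation. The sketch below is not the full Track-and-Stop algorithm: computing the estimated optimal proportions from the empirical distribution is omitted, and `target` is a hypothetical input standing in for that estimate. The forced-exploration threshold $\sqrt{t} - K/2$ is an assumption modelled on the direct-tracking variant; the function name `tracking_step` is our own.

```python
import math
from typing import List, Sequence


def tracking_step(counts: Sequence[int], target: Sequence[float]) -> int:
    """Pick the next arm so that the pull counts follow the target proportions.

    counts[a] is the number of pulls of arm a so far; target[a] is the
    (estimated) optimal fraction of pulls for arm a and should sum to 1.
    """
    t = sum(counts)
    k = len(counts)
    # Forced exploration (assumed threshold): serve badly under-sampled arms first.
    threshold = math.sqrt(t) - k / 2
    starved = [a for a in range(k) if counts[a] < threshold]
    if starved:
        return min(starved, key=lambda a: counts[a])
    # Otherwise pull the arm that lags furthest behind its target share.
    return max(range(k), key=lambda a: t * target[a] - counts[a])


# Usage: with a fixed target, the empirical proportions converge to it.
target = [0.5, 0.3, 0.2]
counts: List[int] = [0, 0, 0]
for _ in range(1000):
    counts[tracking_step(counts, target)] += 1
print([c / 1000 for c in counts])
```

In an actual run the target would be re-estimated at every round from the current empirical distribution, so the tracked proportions change as the estimates improve.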

== Fixed horizon ==

In the fixed-horizon setting, the total number of samples $T$ is specified in advance, in contrast with the fixed-confidence setting, where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon, it returns a recommendation $\hat{a}_{T}$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_{T} \neq a^{\star})$. This setting is particularly relevant when computational or experimental resources are strictly limited, such as in a clinical trial that must determine which of $K$ treatments is the best while patient enrollment is fixed. Each arm corresponds to one of the $K$ treatments: giving a treatment to a patient yields one observation from that treatment's distribution, and each patient corresponds to a round $t$. The total number of patients is the horizon $T$.
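
To make the objective concrete, the error probability of a given strategy under a fixed budget can be estimated by simulation. The sketch below uses uniform (round-robin) allocation purely as a baseline, with Bernoulli rewards as an assumed stand-in for treatment outcomes; the function names and the chosen means are illustrative and not taken from any specific method.

```python
import random
from typing import Sequence


def uniform_allocation(means: Sequence[float], horizon: int, rng: random.Random) -> int:
    """Spend the budget round-robin over the arms and recommend the best empirical mean."""
    k = len(means)
    if horizon < k:
        raise ValueError("horizon must allow at least one pull per arm")
    sums = [0.0] * k
    counts = [0] * k
    for t in range(horizon):
        a = t % k                                   # one patient (round) per iteration
        sums[a] += float(rng.random() < means[a])   # Bernoulli outcome of treatment a
        counts[a] += 1
    return max(range(k), key=lambda a: sums[a] / counts[a])


def error_probability(means: Sequence[float], horizon: int,
                      runs: int = 500, seed: int = 0) -> float:
    """Monte Carlo estimate of P(recommended arm != best arm)."""
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda a: means[a])
    errors = sum(uniform_allocation(means, horizon, rng) != best for _ in range(runs))
    return errors / runs


means = (0.3, 0.5, 0.7)
for horizon in (30, 150, 600):
    print(horizon, error_probability(means, horizon))
```

The estimated error probability shrinks as the horizon grows; fixed-horizon algorithms aim to make it as small as possible for the given $T$ by allocating samples more carefully than this baseline.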

=== Algorithm structure ===

A typical fixed-horizon BAI algorithm proceeds as follows:

- Input: distribution family $\mathcal{D}$, horizon $T$
- Initialize: for each arm $i$, pull once and set $N_{i} \leftarrow 1$, $\hat{\nu}_{i} \leftarrow$ initial empirical distribution
- for $t$ from $K+1$ to $T$ do:
    - $a_{t} \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    - Observe reward $X_{t} \sim \nu_{a_{t}}$