---
title: "Best arm identification"
chunk: 2/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---

In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level $1-\delta$ while minimizing the expected number of samples. Any such algorithm requires two key components:

Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time $\tau_\delta$ and returns an arm $\hat{a}_{\tau_\delta}$ such that $\mathbb{P}(\hat{a}_{\tau_\delta} \neq a^\star) \leq \delta$ and $\mathbb{P}(\tau_\delta < +\infty) = 1$.
Sampling rule: A policy $\pi$ that, at each round $t$, selects the next arm to sample $a_t$ based on all previous observations $(a_s, X_s)_{s<t}$.

=== Algorithm structure ===
The general structure of a fixed-confidence BAI algorithm can be described as follows:

Input: Distribution family $\mathcal{D}$, confidence parameter $\delta$
Initialize: 
    For each arm $i$: set $N_i \leftarrow 0$, $\hat{\nu}_i \leftarrow \text{None}$
    Set $\tau \leftarrow 0$, $\text{stop} \leftarrow \text{False}$
while $\text{stop} == \text{False}$ do:
    $a \leftarrow$ Sampling_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
    Observe reward $X \sim \nu_a$
    Update: $N_a \leftarrow N_a + 1$
    Update empirical distribution $\hat{\nu}_{a_\tau}$
    $\tau \leftarrow \tau + 1$
    $\text{stop} \leftarrow$ Stopping_rule($\mathcal{D}$, $\delta$, $N$, $\hat{\nu}$)
return $\hat{a}_\tau^\star \leftarrow \arg\max_a \hat{\mu}_a$
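
The loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: rewards are assumed to lie in $[0, 1]$, the sampling rule is a plain round-robin, the stopping rule compares Hoeffding confidence intervals with a union bound, and only empirical means are tracked rather than full empirical distributions. The names `sampling_rule`, `radius`, `stopping_rule`, `fixed_confidence_bai` and the `max_pulls` safety cap are hypothetical. Counters start at zero here because no arm has been pulled before the loop begins.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def sampling_rule(counts: Sequence[int]) -> int:
    """Hypothetical sampling rule: pull the arm sampled least so far."""
    return min(range(len(counts)), key=lambda i: counts[i])


def radius(n: int, k: int, delta: float) -> float:
    """Hoeffding confidence radius, assuming rewards in [0, 1].

    The log term applies a union bound over the k arms and all counts n.
    """
    return math.sqrt(math.log(4 * k * n * n / delta) / (2 * n))


def stopping_rule(counts: Sequence[int], means: Sequence[float], delta: float) -> bool:
    """Hypothetical stopping rule: stop once the leader's lower confidence
    bound exceeds every other arm's upper confidence bound."""
    k = len(counts)
    if min(counts) == 0:
        return False  # every arm needs at least one sample first
    leader = max(range(k), key=lambda i: means[i])
    lower = means[leader] - radius(counts[leader], k, delta)
    return all(
        means[i] + radius(counts[i], k, delta) < lower
        for i in range(k)
        if i != leader
    )


def fixed_confidence_bai(
    arms: Sequence[Callable[[], float]], delta: float, max_pulls: int = 100_000
) -> Tuple[int, int]:
    """Run the sample/update/stop loop; return (recommended arm, number of pulls)."""
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    tau = 0
    stop = False
    while not stop and tau < max_pulls:  # max_pulls is a safety cap, not part of the model
        a = sampling_rule(counts)
        x = arms[a]()                            # observe a reward from arm a
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]   # running empirical mean
        tau += 1
        stop = stopping_rule(counts, means, delta)
    best = max(range(k), key=lambda i: means[i])
    return best, tau


if __name__ == "__main__":
    rng = random.Random(0)
    # Three Bernoulli arms with hypothetical means 0.3, 0.5 and 0.8.
    bernoulli_arms = [lambda p=p: float(rng.random() < p) for p in (0.3, 0.5, 0.8)]
    print(fixed_confidence_bai(bernoulli_arms, delta=0.05))
```

Swapping in a different `sampling_rule` or `stopping_rule` changes the algorithm without touching the loop, which is the point of separating the two components.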
  

=== Lower bound ===

The minimal expected number of pulls needed to reach the confidence level $1-\delta$ was determined in 2016. For a given instance $\nu$ and a fixed $\delta$, this lower bound gives the smallest possible value of $\mathbb{E}[\tau_\delta]$.
To state the lower bound, we first need the quantity $\operatorname{kl}(\delta, 1-\delta)$, the Kullback–Leibler divergence between two Bernoulli distributions with means $\delta$ and $1-\delta$. This is equivalent to $\ln(1/\delta)$ as $\delta$ tends to $0$.
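
The equivalence with $\ln(1/\delta)$ can be checked by expanding the Bernoulli Kullback–Leibler divergence directly. The short derivation below uses only the standard formula for the divergence between two Bernoulli distributions:

```latex
\begin{aligned}
\operatorname{kl}(\delta, 1-\delta)
  &= \delta \ln\frac{\delta}{1-\delta} + (1-\delta)\ln\frac{1-\delta}{\delta} \\
  &= (1-2\delta)\,\ln\frac{1-\delta}{\delta} \\
  &= (1-2\delta)\,\bigl[\ln(1/\delta) + \ln(1-\delta)\bigr]
\end{aligned}
```

As $\delta \to 0$, the factor $1-2\delta$ tends to $1$ and $\ln(1-\delta)$ tends to $0$, so the ratio $\operatorname{kl}(\delta, 1-\delta) / \ln(1/\delta)$ tends to $1$.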
For any algorithm satisfying the $\delta$-correctness constraint, the expected sample complexity satisfies

$$\mathbb{E}[\tau_\delta] \geq C^\star \, \operatorname{kl}(\delta, 1-\delta)$$

where the problem-dependent constant $C^\star$ depends only on $\nu$. An optimal sampling rule $\omega^*$ is associated with this optimal constant.
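
To get a feel for the bound, the right-hand side $C^\star \operatorname{kl}(\delta, 1-\delta)$ can be evaluated numerically. The sketch below assumes a value for $C^\star$ is already available; the number used in the demo is purely hypothetical, since the true constant depends on the instance $\nu$. The function names are illustrative.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def sample_complexity_lower_bound(c_star: float, delta: float) -> float:
    """Right-hand side C* * kl(delta, 1 - delta) of the lower bound."""
    return c_star * bernoulli_kl(delta, 1 - delta)


if __name__ == "__main__":
    c_star = 50.0  # hypothetical constant, chosen only for illustration
    for delta in (1e-1, 1e-3, 1e-6):
        bound = sample_complexity_lower_bound(c_star, delta)
        ratio = bernoulli_kl(delta, 1 - delta) / math.log(1 / delta)
        print(f"delta={delta:g}  bound={bound:.1f}  kl/ln(1/delta)={ratio:.4f}")
```

The printed ratio approaches 1 as `delta` shrinks, which is why the bound is often read as $C^\star \ln(1/\delta)$ for small $\delta$.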

=== Asymptotically optimal algorithms ===
Since $\operatorname{kl}(\delta, 1-\delta) \sim \ln(1/\delta)$ as $\delta \to 0$, an algorithm is called asymptotically optimal if

$$\lim_{\delta \to 0} \frac{\mathbb{E}[\tau_\delta]}{\ln(1/\delta)} = C^\star.$$

The first algorithm proposed to achieve asymptotic optimality is the Track-and-Stop algorithm, which tracks the optimal sampling rule $\omega^*$ from the lower bound: it estimates $\omega^*$ from the empirical distribution and chooses the next arm to play according to this estimate.
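
The tracking idea can be illustrated in isolation. The sketch below is a simplification and not the full Track-and-Stop algorithm: it takes the target proportions `weights` as a fixed input, whereas the actual algorithm re-estimates $\omega^*$ from the empirical distribution at every round, a step that is not shown here. The forced-exploration threshold $\sqrt{t} - K/2$ follows one commonly described tracking variant and should be treated as an assumption, as should the function names.

```python
import math
from typing import List, Sequence


def tracking_step(counts: Sequence[int], weights: Sequence[float]) -> int:
    """Choose the next arm so that pull counts follow the target proportions."""
    k = len(counts)
    t = sum(counts)
    # Forced exploration (assumed threshold): arms that lag far behind get priority.
    threshold = math.sqrt(t) - k / 2
    starved = [a for a in range(k) if counts[a] < threshold]
    if starved:
        return min(starved, key=lambda a: counts[a])
    # Otherwise pull the arm whose count is furthest below its target t * w_a.
    return max(range(k), key=lambda a: t * weights[a] - counts[a])


def track(weights: Sequence[float], rounds: int) -> List[int]:
    """Apply the tracking step repeatedly and return the final pull counts."""
    counts = [0] * len(weights)
    for _ in range(rounds):
        counts[tracking_step(counts, weights)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical target proportions for a three-armed instance.
    print(track([0.5, 0.3, 0.2], rounds=1000))
```

With fixed weights, the empirical proportions `counts[a] / t` stay close to `weights[a]`, which is the behaviour the tracking step is meant to enforce.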

== Fixed horizon ==
In the fixed-horizon setting, the total number of samples $T$ is specified in advance, in contrast with the fixed-confidence setting, where sampling continues until a confidence criterion is met. The algorithm must select an arm at each round $t \in \{1, \ldots, T\}$, and at the end of the horizon it returns a recommendation $\hat{a}_T$. The objective is to minimize the probability of error $\mathbb{P}(\hat{a}_T \neq a^\star)$.
This setting is particularly relevant when computational or experimental resources are strictly limited, such as in a clinical trial that must determine which of $K$ treatments is best while patient enrollment is fixed. Each arm corresponds to one of the $K$ treatments; giving it to a patient yields one observation from that treatment's outcome distribution, and each patient corresponds to a round $t$. The total number of patients is the horizon $T$.
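
The fixed-horizon protocol described above can be sketched in Python. This is a minimal illustration under assumptions: the sampling rule is a uniform (round-robin) allocation, chosen only because it is the simplest valid rule, and the recommendation is the arm with the highest empirical mean. The names `uniform_sampling_rule` and `fixed_horizon_bai` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def uniform_sampling_rule(counts: Sequence[int]) -> int:
    """Hypothetical sampling rule: pull the arm sampled least so far."""
    return min(range(len(counts)), key=lambda i: counts[i])


def fixed_horizon_bai(
    arms: Sequence[Callable[[], float]], horizon: int
) -> Tuple[int, List[int]]:
    """Spend exactly `horizon` pulls, then recommend the empirically best arm."""
    k = len(arms)
    if horizon < k:
        raise ValueError("horizon must allow at least one pull per arm")
    counts = [0] * k
    means = [0.0] * k

    def pull(a: int) -> None:
        x = arms[a]()                            # observe a reward from arm a
        counts[a] += 1
        means[a] += (x - means[a]) / counts[a]   # running empirical mean

    for a in range(k):           # rounds 1..K: pull every arm once
        pull(a)
    for _ in range(k, horizon):  # rounds K+1..T: follow the sampling rule
        pull(uniform_sampling_rule(counts))

    recommendation = max(range(k), key=lambda i: means[i])
    return recommendation, counts
```

In the clinical-trial reading, each call to `arms[a]()` is one patient receiving treatment `a`, and `horizon` is the fixed enrollment size.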
  

=== Algorithm structure ===
A typical fixed-horizon BAI algorithm proceeds as follows:

Input: Distribution family $\mathcal{D}$, horizon $T$
Initialize: 
    For each arm $i$: pull once and set $N_i \leftarrow 1$, $\hat{\nu}_i \leftarrow$ initial empirical distribution
for $t$ from $K+1$ to $T$ do:
    $a_t \leftarrow$ Sampling_rule($\mathcal{D}$, $T$, $N$, $\hat{\nu}$)
    Observe reward $X_t \sim \nu_{a_t}$