---
title: "Best arm identification"
chunk: 3/3
source: "https://en.wikipedia.org/wiki/Best_arm_identification"
category: "reference"
tags: "science, encyclopedia"
date_saved: "2026-05-05T14:37:26.259834+00:00"
instance: "kb-cron"
---

    Update: $N_{a_t} \leftarrow N_{a_t} + 1$

    Update empirical distribution $\hat{\nu}_{a_t}$

return $\hat{a}_T^{\star} \leftarrow \arg\max_a \hat{\mu}_a$

Unlike the fixed-confidence setting, there is no stopping rule, because the algorithm always stops at time $T$. The algorithm is therefore based only on a sampling rule.
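
The loop described above can be sketched in Python. This is a minimal illustration, not a reference implementation. The round-robin sampling rule, the function name `fixed_budget_bai`, and the `bernoulli` helper are all assumptions made for the example, and the sketch tracks only the empirical mean of each arm rather than its full empirical distribution.

```python
import random


def fixed_budget_bai(arms, horizon, seed=0):
    """Sample for exactly `horizon` turns, then recommend the empirical best arm.

    `arms` is a list of callables taking a random.Random and returning a reward.
    Round-robin sampling is an assumed rule, chosen only for illustration.
    """
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k      # N_a: number of pulls of arm a
    sums = [0.0] * k      # running reward sums, used for the empirical means

    for t in range(horizon):      # no stopping rule: the budget is fixed
        a_t = t % k               # assumed sampling rule (round-robin)
        reward = arms[a_t](rng)
        counts[a_t] += 1          # N_{a_t} <- N_{a_t} + 1
        sums[a_t] += reward       # update the empirical statistics of arm a_t

    # Arms that were never pulled get -inf so they cannot be recommended.
    means = [s / n if n > 0 else float("-inf") for s, n in zip(sums, counts)]
    best = max(range(k), key=lambda a: means[a])   # arg max of empirical means
    return best, means, counts


def bernoulli(p):
    """Hypothetical helper: an arm returning 1.0 with probability p, else 0.0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
best, means, counts = fixed_budget_bai(arms, horizon=3000)
print(best, counts)
```

Any other sampling rule can be substituted by changing the single line that chooses `a_t`. The final recommendation step stays the same.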

=== Lower bound ===
The lower bound in the fixed-horizon setting gives the best confidence level that any algorithm can reach with a given number of turns $T$. It is expressed as an asymptotic result, valid when $T$ is large.
Lower bound theorem: For any algorithm, for any instance $\nu$, there exists a constant $H(\nu)$ that depends only on $\nu$ such that the probability of error satisfies

$$\lim_{T \to +\infty} \mathbb{P}\left(\hat{a}_T \notin \mathcal{A}^{\star}\right) \geq \exp\left(-T H(\nu)\right)$$

This result shows that the error probability cannot decay faster than exponentially in the number of turns $T$, with a rate governed by $H(\nu)$.
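
As a rough worked example, the bound can be inverted to ask how many turns are needed before a target error level $\delta$ is even possible: if the error is at least $\exp(-T H(\nu))$, then reaching $\delta$ requires $T \geq \log(1/\delta) / H(\nu)$. The sketch below assumes the bound holds at finite $T$, which the asymptotic statement does not guarantee. The function name `min_horizon` and the numerical value used for $H(\nu)$ are hypothetical.

```python
import math


def min_horizon(delta, h):
    """Smallest integer T with exp(-T * h) <= delta, i.e. T >= log(1/delta) / h.

    `h` stands in for the instance-dependent constant H(nu); its value here is
    an assumption, since computing H(nu) depends on the bandit model.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if h <= 0.0:
        raise ValueError("h must be positive")
    return math.ceil(math.log(1.0 / delta) / h)


# With a hypothetical H(nu) = 0.01, an error level of 5% needs about 300 turns.
print(min_horizon(0.05, 0.01))
```

Because the dependence on $\delta$ is logarithmic, halving the target error level adds only about $\log(2) / H(\nu)$ turns to this necessary budget.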

=== Simple regret ===
An alternative performance metric for fixed-horizon BAI is the simple regret, defined as

$$r_T := \mathbb{E}\left[\mu^{\star} - \mu_{\hat{a}_T^{\star}}\right],$$

which measures the expected suboptimality of the returned arm.
While $\mathbb{P}(\hat{a}_T^{\star} \neq a^{\star})$ treats all mistakes with the same cost, the simple regret $r_T$ accounts for the gap between the optimal mean $\mu^{\star}$ and the mean $\mu_{\hat{a}_T^{\star}}$ of the arm that the algorithm returns as optimal. This distinction is important in applications where the cost of choosing a suboptimal arm depends on how far it is from optimal.
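
The difference between the two metrics can be made concrete with a small calculation. In the sketch below, the arm means and the two recommendation distributions (the probability that each arm is returned) are made-up numbers, and the function names are hypothetical. Both scenarios have the same error probability but very different simple regret.

```python
def error_probability(means, rec_probs):
    """Probability that the returned arm is not an arm with the largest mean."""
    best = max(means)
    return sum(p for m, p in zip(means, rec_probs) if m < best)


def simple_regret(means, rec_probs):
    """Expected gap between the best mean and the mean of the returned arm."""
    best = max(means)
    return sum(p * (best - m) for m, p in zip(means, rec_probs))


means = [0.9, 0.85, 0.1]        # illustrative arm means; arm 0 is optimal
near_miss = [0.8, 0.2, 0.0]     # mistakes land on the nearly optimal arm
far_miss = [0.8, 0.0, 0.2]      # mistakes land on the clearly bad arm

for name, probs in [("near miss", near_miss), ("far miss", far_miss)]:
    print(name, error_probability(means, probs), simple_regret(means, probs))
```

Both scenarios are wrong 20% of the time, yet the simple regret is about 0.01 in the first case and about 0.16 in the second.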

== See also ==
Multi-armed bandit
Design of experiments
Concentration inequality
