
| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Best arm identification | 3/3 | https://en.wikipedia.org/wiki/Best_arm_identification | reference | science, encyclopedia | 2026-05-05T14:37:26.259834+00:00 | kb-cron |
Update: $N_{a_t} \leftarrow N_{a_t} + 1$


Update empirical distribution $\hat{\nu}_{a_t}$

return $\hat{a}_T^{\star} \leftarrow \arg\max_a \hat{\mu}_a$

Unlike the fixed-confidence setting, there is no stopping rule, because the algorithm always stops at time $T$. The algorithm therefore relies only on a sampling rule, together with the final recommendation of the empirically best arm.
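The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions. The sampling rule is not given in this section, so uniform round-robin allocation ($a_t = t \bmod K$) is used purely as a stand-in. The code also keeps only running sums rather than the full empirical distribution $\hat{\nu}_{a_t}$, because the recommendation only needs empirical means. The names `fixed_horizon_bai` and `make_bernoulli_arm` are hypothetical.

```python
import random
from typing import Callable, List


def fixed_horizon_bai(arms: List[Callable[[], float]], T: int) -> int:
    """Fixed-horizon best-arm identification with an assumed round-robin sampling rule.

    Runs exactly T rounds (no stopping rule) and recommends the arm with the
    highest empirical mean. `arms[a]()` draws one reward from arm a.
    """
    K = len(arms)
    counts = [0] * K    # N_a: number of pulls of each arm
    sums = [0.0] * K    # reward totals; enough to recover the empirical means
    for t in range(T):
        a_t = t % K                # sampling rule (assumption: round-robin)
        reward = arms[a_t]()       # observe a reward drawn from nu_{a_t}
        counts[a_t] += 1           # N_{a_t} <- N_{a_t} + 1
        sums[a_t] += reward        # update the empirical mean of arm a_t
    # Unpulled arms (possible when T < K) get -inf, so a pulled arm is always preferred.
    means = [s / n if n > 0 else float("-inf") for s, n in zip(sums, counts)]
    # Recommendation: argmax of empirical means; max() keeps the first index on ties.
    return max(range(K), key=lambda a: means[a])


def make_bernoulli_arm(p: float, rng: random.Random) -> Callable[[], float]:
    """Hypothetical helper: a Bernoulli(p) arm driven by a seeded RNG."""
    return lambda: 1.0 if rng.random() < p else 0.0


rng = random.Random(0)
arms = [make_bernoulli_arm(p, rng) for p in (0.3, 0.5, 0.7)]
best = fixed_horizon_bai(arms, T=3000)
print("recommended arm:", best)
```

The budget $T$ is the only thing that ends the loop. Swapping in a different sampling rule only changes the line that chooses `a_t`.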

=== Lower bound ===

The lower bound in the fixed-horizon setting gives the best confidence level that can be reached with a given number of turns $T$. It is stated as an asymptotic result for large $T$.

Lower bound theorem: for any algorithm and any instance $\nu$, there exists a constant $H(\nu)$ that depends only on $\nu$ such that the probability of error satisfies

$$\liminf_{T\to+\infty} \frac{1}{T}\log \mathbb{P}\left(\hat{a}_T^{\star} \notin \mathcal{A}^{\star}\right) \geq -H(\nu),$$

that is, for large $T$ the error probability is, up to subexponential factors, at least $\exp(-T H(\nu))$.

This result shows that the error probability cannot decay faster than exponentially in the number of turns $T$, with an exponential rate of at most $H(\nu)$.
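Exponential decay can be seen concretely in one case where the error probability has a closed form. This sketch does not compute the constant $H(\nu)$, which this section does not define. It assumes two Gaussian arms with unit variance and mean gap $\Delta$, sampled uniformly ($T/2$ pulls each, $T$ even). The difference of empirical means is then $\mathcal{N}(\Delta, 4/T)$, so the error probability of recommending the empirically best arm is $\Phi(-\Delta\sqrt{T}/2)$. The quantity $-\frac{1}{T}\log \mathbb{P}(\text{error})$ approaches $\Delta^2/8$ for this particular algorithm. The function name is hypothetical.

```python
import math


def uniform_two_gaussian_error(delta: float, T: int) -> float:
    """Exact error probability of uniform allocation on two unit-variance Gaussian arms.

    Assumes each arm is pulled T/2 times (T even). The empirical-mean difference is
    N(delta, 4/T), so P(error) = Phi(-delta * sqrt(T) / 2).
    """
    x = delta * math.sqrt(T) / 2.0
    return 0.5 * math.erfc(x / math.sqrt(2.0))  # Phi(-x) = erfc(x / sqrt(2)) / 2


delta = 0.5
for T in (100, 1_000, 10_000):
    p = uniform_two_gaussian_error(delta, T)
    rate = -math.log(p) / T  # empirical exponential rate at horizon T
    print(f"T={T:>6}  P(error)={p:.3e}  -log(P)/T={rate:.5f}")
print(f"limit delta**2 / 8 = {delta**2 / 8:.5f}")
```

The printed rate decreases toward $\Delta^2/8$ as $T$ grows. The gap to the limit comes from the polynomial prefactor of the Gaussian tail, which vanishes after dividing by $T$.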

=== Simple regret ===

An alternative performance metric for fixed-horizon BAI is the simple regret, defined as

$$r_T := \mathbb{E}\left[\mu^{\star} - \mu_{\hat{a}_T^{\star}}\right],$$

which measures the expected suboptimality of the returned arm. The error probability $\mathbb{P}(\hat{a}_T^{\star} \notin \mathcal{A}^{\star})$ assigns the same cost to every mistake. The simple regret $r_T$ instead accounts for the gap between the optimal mean $\mu^{\star}$ and the mean $\mu_{\hat{a}_T^{\star}}$ of the arm the algorithm recommends. This distinction matters in applications where the cost of choosing a suboptimal arm depends on how far its mean is from the optimal one.
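The difference between the two metrics can be made concrete. Suppose the distribution of the recommended arm $\hat{a}_T^{\star}$ is known. Then the error probability sums the mass on suboptimal arms, while $r_T = \sum_a \mathbb{P}(\hat{a}_T^{\star} = a)\,(\mu^{\star} - \mu_a)$ weights each mistake by its gap. The arm means and recommendation distributions below are made-up numbers for illustration, and the function names are hypothetical.

```python
from typing import Sequence


def error_probability(means: Sequence[float], rec_probs: Sequence[float]) -> float:
    """P(recommended arm is not optimal), given P(recommend arm a) for each arm a."""
    best = max(means)
    return sum(p for mu, p in zip(means, rec_probs) if mu < best)


def simple_regret(means: Sequence[float], rec_probs: Sequence[float]) -> float:
    """r_T = E[mu_star - mu_{recommended arm}] = sum_a P(rec = a) * (mu_star - mu_a)."""
    best = max(means)
    return sum(p * (best - mu) for mu, p in zip(means, rec_probs))


means = [1.0, 0.9, 0.0]          # hypothetical arm means; arm 0 is optimal
rec_a = [0.8, 0.2, 0.0]          # algorithm A errs toward a near-optimal arm
rec_b = [0.8, 0.0, 0.2]          # algorithm B errs toward a very bad arm
for name, rec in (("A", rec_a), ("B", rec_b)):
    print(name, error_probability(means, rec), simple_regret(means, rec))
```

Both hypothetical algorithms have error probability $0.2$. Algorithm A has simple regret about $0.02$ and algorithm B about $0.2$. In general, $r_T$ lies between the smallest and largest suboptimality gap times the error probability.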

== See also ==

- Multi-armed bandit
- Design of experiments
- Concentration inequality
