kb/Fisher_information-6.md at a246fdfd3454927233a3597be9e2ea1731ffbb48

turtle89431 712b063c02 Scrape wikipedia-science: 6045 new, 3188 updated, 9503 total (kb-cron)

2026-05-05 02:51:10 -07:00

12 KiB

Raw Blame History

title	chunk	source	category	tags	date_saved	instance
Fisher information	7/8	https://en.wikipedia.org/wiki/Fisher_information	reference	science, encyclopedia	2026-05-05T09:50:15.726073+00:00	kb-cron

where $Z_{\varepsilon}$ is a Gaussian variable with covariance matrix $\varepsilon I$. The name "surface area" is apt because the entropy power $e^{H(X)}$ is the volume of the "effective support set", so $S(X)$ is the "derivative" of the volume of the effective support set, much like the Minkowski–Steiner formula. The remainder of the proof uses the entropy power inequality, which is like the Brunn–Minkowski inequality. The trace of the Fisher information matrix is found to be a factor of $S(X)$.

== Applications ==

=== Optimal design of experiments ===

Fisher information is widely used in optimal experimental design. Because of the reciprocity of estimator variance and Fisher information, minimizing the variance corresponds to maximizing the information.

When the linear (or linearized) statistical model has several parameters, the mean of the parameter estimator is a vector and its variance is a matrix. The inverse of the variance matrix is called the "information matrix". Because the variance of the estimator of a parameter vector is a matrix, the problem of "minimizing the variance" is complicated. Using statistical theory, statisticians compress the information matrix using real-valued summary statistics; being real-valued functions, these "information criteria" can be maximized.

Traditionally, statisticians have evaluated estimators and designs by considering some summary statistic of the covariance matrix (of an unbiased estimator), usually with positive real values (like the determinant or matrix trace). Working with positive real numbers brings several advantages. If the estimator of a single parameter has a positive variance, then the variance and the Fisher information are both positive real numbers; hence they are members of the convex cone of nonnegative real numbers (whose nonzero members have reciprocals in this same cone).

For several parameters, the covariance matrices and information matrices are elements of the convex cone of nonnegative-definite symmetric matrices in a partially ordered vector space, under the Loewner (Löwner) order. This cone is closed under matrix addition and under multiplication by positive real numbers, and the inverse of any of its nonsingular members lies in the same cone. An exposition of matrix theory and Loewner order appears in Pukelsheim.

The traditional optimality criteria are the information matrix's invariants, in the sense of invariant theory; algebraically, the traditional optimality criteria are functionals of the eigenvalues of the (Fisher) information matrix (see optimal design).
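
To make the idea of real-valued information criteria concrete, the following is a minimal sketch that compares two hypothetical four-point designs for a straight-line model y = b0 + b1*x with unit noise variance, for which the information matrix is XᵀX. The function names, the design points, and the choice of the determinant (D), trace of the inverse (A), and smallest eigenvalue (E) as summaries are illustrative assumptions, not taken from the text above.

```python
import math

def information_matrix(xs):
    """Information matrix X^T X for y = b0 + b1*x with unit noise variance."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return [[n, sx], [sx, sxx]]

def criteria(m):
    """Return (D, A, E) summaries of a symmetric 2x2 information matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    det = a * d - b * b
    trace = a + d
    # Smallest eigenvalue of a symmetric 2x2 matrix in closed form.
    lam_min = (trace - math.sqrt((a - d) ** 2 + 4 * b * b)) / 2
    d_crit = det          # D-criterion: larger determinant is better
    a_crit = trace / det  # A-criterion: trace of the inverse, smaller is better
    e_crit = lam_min      # E-criterion: larger smallest eigenvalue is better
    return d_crit, a_crit, e_crit

# Two hypothetical designs on the interval [-1, 1].
endpoints = [-1.0, -1.0, 1.0, 1.0]
spread = [-1.0, -0.5, 0.5, 1.0]

d_end, a_end, e_end = criteria(information_matrix(endpoints))
d_spr, a_spr, e_spr = criteria(information_matrix(spread))
```

In this toy case the endpoint design scores better on all three summaries, which matches the intuition that placing observations at the extremes of the range gives the most information about the slope. All three summaries depend only on the eigenvalues of the information matrix.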

=== Jeffreys prior in Bayesian statistics ===

In Bayesian statistics, the Fisher information is used to calculate the Jeffreys prior, which is a standard, non-informative prior for continuous distribution parameters.
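
As a small worked example, consider a single Bernoulli observation with success probability θ, whose Fisher information is 1/(θ(1−θ)). The Jeffreys prior is proportional to the square root of the Fisher information, which here gives the Beta(1/2, 1/2) density. The sketch below assumes this Bernoulli model; the function names are hypothetical.

```python
import math

def bernoulli_fisher_information(theta):
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_prior_unnormalized(theta):
    """Jeffreys prior is proportional to the square root of the Fisher information."""
    return math.sqrt(bernoulli_fisher_information(theta))

# Normalizing constant: the Beta function B(1/2, 1/2) = Gamma(1/2)^2 / Gamma(1) = pi.
NORMALIZER = math.gamma(0.5) ** 2 / math.gamma(1.0)

def jeffreys_prior(theta):
    """Normalized Jeffreys prior, i.e. the Beta(1/2, 1/2) density on (0, 1)."""
    return jeffreys_prior_unnormalized(theta) / NORMALIZER
```

The density is lowest at θ = 1/2 and grows without bound toward 0 and 1, so the functions above should only be evaluated strictly inside the open interval (0, 1).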

=== Computational neuroscience ===

The Fisher information has been used to find bounds on the accuracy of neural codes. In that case, X typically represents the joint responses of many neurons encoding a low-dimensional variable θ (such as a stimulus parameter). In particular, the role of correlations in the noise of the neural responses has been studied.
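
A minimal sketch of such a bound follows, under simplifying assumptions that are not part of the text above: the neurons fire independently with Poisson statistics (so noise correlations are ignored), and each has a hypothetical Gaussian tuning curve. Under those assumptions the Fisher information about θ is the sum over neurons of f_i'(θ)² / f_i(θ), and its reciprocal is the Cramér–Rao lower bound on the variance of any unbiased decoder.

```python
import math

def tuning_curve(theta, center, peak=10.0, width=1.0):
    """Hypothetical Gaussian tuning curve: mean spike count for stimulus theta."""
    return peak * math.exp(-0.5 * ((theta - center) / width) ** 2)

def tuning_slope(theta, center, peak=10.0, width=1.0):
    """Derivative of the tuning curve with respect to theta."""
    return -(theta - center) / width ** 2 * tuning_curve(theta, center, peak, width)

def population_fisher_information(theta, centers):
    """Fisher information for independent Poisson neurons: sum of f'^2 / f."""
    return sum(
        tuning_slope(theta, c) ** 2 / tuning_curve(theta, c)
        for c in centers
    )

# Nine neurons with preferred stimuli evenly spaced from -2 to 2.
centers = [-2.0 + 0.5 * k for k in range(9)]
info = population_fisher_information(0.0, centers)
cramer_rao_bound = 1.0 / info  # lower bound on the variance of an unbiased decoder
```

In this model a neuron whose preferred stimulus equals θ contributes nothing, because its tuning curve is flat there; the information comes from neurons on the flanks of their tuning curves.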

=== Epidemiology ===

Fisher information was used to study how informative different data sources are for estimation of the reproduction number of SARS-CoV-2.

=== Machine learning ===

The Fisher information is used in machine learning techniques such as elastic weight consolidation, which reduces catastrophic forgetting in artificial neural networks. Fisher information can be used as an alternative to the Hessian of the loss function in second-order gradient descent network training.
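
The role of Fisher information in elastic weight consolidation can be sketched as a quadratic penalty that anchors each parameter to its value after an earlier task, weighted by a diagonal Fisher estimate. The code below is an illustrative sketch, not a reference implementation: the function names, the toy gradients, and the default `lam` are assumptions, and the Fisher diagonal is approximated by the mean squared per-example gradient of the log-likelihood.

```python
def diagonal_fisher(per_example_grads):
    """Estimate the Fisher diagonal as the mean squared log-likelihood gradient."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def ewc_penalty(params, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for f, p, q in zip(fisher_diag, params, old_params)
    )

# Hypothetical gradients from the earlier task, one row per example.
grads = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads)  # only the first weight carried information

# Moving the informative weight is penalized; moving the other one is free.
penalty_first = ewc_penalty([1.0, 0.0], [0.0, 0.0], fisher)
penalty_second = ewc_penalty([0.0, 1.0], [0.0, 0.0], fisher)
```

In training, a penalty of this form would be added to the loss of the new task, so that parameters with high Fisher information for the old task are changed the least.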

=== Color discrimination ===

Using a Fisher information metric, da Fonseca et al. investigated the degree to which MacAdam ellipses (color discrimination ellipses) can be derived from the response functions of the retinal photoreceptors.

== Relation to relative entropy ==

Fisher information is related to relative entropy. The relative entropy, or Kullback–Leibler divergence, between two distributions $p$ and $q$ can be written as

$$KL(p:q)=\int p(x)\log \frac{p(x)}{q(x)}\,dx.$$

Now, consider a family of probability distributions $f(x;\theta)$ parametrized by $\theta \in \Theta$. Then the Kullback–Leibler divergence between two distributions in the family can be written as

$$D(\theta,\theta')=KL(f(\cdot\,;\theta):f(\cdot\,;\theta'))=\int f(x;\theta)\log \frac{f(x;\theta)}{f(x;\theta')}\,dx.$$

If $\theta$ is fixed, then the relative entropy between two distributions of the same family is minimized at $\theta'=\theta$. For $\theta'$ close to $\theta$, one may expand the previous expression in a series up to second order:

$$D(\theta,\theta')=\frac{1}{2}(\theta'-\theta)^{\mathsf{T}}\left(\frac{\partial^{2}}{\partial\theta'_{i}\,\partial\theta'_{j}}D(\theta,\theta')\right)_{\theta'=\theta}(\theta'-\theta)+o\left((\theta'-\theta)^{2}\right)$$

But the second-order derivative can be written as

$$\left(\frac{\partial^{2}}{\partial\theta'_{i}\,\partial\theta'_{j}}D(\theta,\theta')\right)_{\theta'=\theta}=-\int f(x;\theta)\left(\frac{\partial^{2}}{\partial\theta'_{i}\,\partial\theta'_{j}}\log(f(x;\theta'))\right)_{\theta'=\theta}\,dx=[\mathcal{I}(\theta)]_{i,j}.$$
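
The second-order relationship between relative entropy and Fisher information can be checked numerically. The sketch below assumes a one-parameter Bernoulli family, for which the integral becomes a sum over x in {0, 1} and the Fisher information is 1/(θ(1−θ)); the function names and the particular values of θ and the step size are illustrative choices.

```python
import math

def bernoulli_kl(theta, theta_prime):
    """Relative entropy D(theta, theta') between two Bernoulli distributions."""
    return (
        theta * math.log(theta / theta_prime)
        + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - theta_prime))
    )

def bernoulli_fisher_information(theta):
    """Fisher information of one Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))

def quadratic_approximation(theta, theta_prime):
    """Second-order term: (1/2) * I(theta) * (theta' - theta)^2."""
    return 0.5 * bernoulli_fisher_information(theta) * (theta_prime - theta) ** 2

theta = 0.3
delta = 1e-3
exact = bernoulli_kl(theta, theta + delta)
approx = quadratic_approximation(theta, theta + delta)
ratio = exact / approx  # approaches 1 as delta shrinks
```

The divergence vanishes at θ′ = θ, and for small steps the ratio of the exact divergence to the quadratic term stays close to 1, with the gap shrinking as the step size decreases.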
