| title | chunk | source | category | tags | date_saved | instance |
|---|---|---|---|---|---|---|
| Fisher information | 4/8 | https://en.wikipedia.org/wiki/Fisher_information | reference | science, encyclopedia | 2026-05-05T09:50:15.726073+00:00 | kb-cron |
=== Estimate θ from X ~ Bern (√θ) ===
As another toy example, consider a random variable $X$ with possible outcomes 0 and 1, with probabilities $p_0 = 1 - \sqrt{\theta}$ and $p_1 = \sqrt{\theta}$, respectively, for some $\theta \in [0, 1]$. Our goal is estimating $\theta$ from observations of $X$. The Fisher information reads in this case
$$
\begin{aligned}
\mathcal{I}(\theta) &= \mathrm{E}\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2} \,\Bigg|\, \theta\right] \\
&= (1-\sqrt{\theta})\left(\frac{-1}{2\sqrt{\theta}\,(1-\sqrt{\theta})}\right)^{2} + \sqrt{\theta}\left(\frac{1}{2\theta}\right)^{2} \\
&= \frac{1}{4\theta}\left(\frac{1}{1-\sqrt{\theta}} + \frac{1}{\sqrt{\theta}}\right).
\end{aligned}
$$
This expression can also be derived directly from the reparametrization formula given below. More generally, for any sufficiently regular function $f$ such that $f(\theta) \in [0, 1]$, the Fisher information to retrieve $\theta$ from $X \sim \operatorname{Bern}(f(\theta))$ is similarly computed to be
$$
\mathcal{I}(\theta) = f'(\theta)^{2}\left(\frac{1}{1-f(\theta)} + \frac{1}{f(\theta)}\right).
$$
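The general formula can be checked numerically. The following is a minimal sketch, not part of the source: the function names are hypothetical, and the score is approximated by central differences rather than derived analytically. It sums the squared score over the two outcomes of X ~ Bern(f(θ)) and compares the result with the formula above and, for f(θ) = √θ, with the closed form of the toy example.

```python
import math

def bernoulli_fisher_information(f, theta, h=1e-6):
    """Fisher information about theta carried by X ~ Bern(f(theta)).

    Computes E[(d/dtheta log P(X; theta))^2] by summing over the two
    outcomes, with the score approximated by central differences.
    """
    def log_pmf(x, t):
        p1 = f(t)
        return math.log(p1 if x == 1 else 1.0 - p1)

    info = 0.0
    for x in (0, 1):
        prob = math.exp(log_pmf(x, theta))
        score = (log_pmf(x, theta + h) - log_pmf(x, theta - h)) / (2.0 * h)
        info += prob * score ** 2
    return info

def general_formula(f, f_prime, theta):
    """f'(theta)^2 * (1/(1 - f(theta)) + 1/f(theta)), as stated in the text."""
    p = f(theta)
    return f_prime(theta) ** 2 * (1.0 / (1.0 - p) + 1.0 / p)

def sqrt_closed_form(theta):
    """Closed form from the text for f(theta) = sqrt(theta)."""
    r = math.sqrt(theta)
    return (1.0 / (4.0 * theta)) * (1.0 / (1.0 - r) + 1.0 / r)

# Compare the three computations at a few interior values of theta.
for theta in (0.1, 0.25, 0.5, 0.9):
    numeric = bernoulli_fisher_information(math.sqrt, theta)
    formula = general_formula(math.sqrt, lambda t: 0.5 / math.sqrt(t), theta)
    print(theta, numeric, formula, sqrt_closed_form(theta))
```

With the identity map f(θ) = θ, the same formula reduces to 1/(θ(1 − θ)), the Fisher information of an ordinary Bernoulli trial.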
== Matrix form ==
When there are N parameters, so that θ is an N × 1 vector
$$
\theta = \begin{bmatrix}\theta_{1} & \theta_{2} & \dots & \theta_{N}\end{bmatrix}^{\textsf{T}},
$$
the Fisher information takes the form of an N × N matrix. This matrix is called the Fisher information matrix (FIM) and has typical element
$$
\bigl[\mathcal{I}(\theta)\bigr]_{i,j} = \operatorname{E}\left[\left.\left(\frac{\partial}{\partial\theta_{i}}\log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_{j}}\log f(X;\theta)\right)\,\right|\,\theta\right].
$$
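As an illustration of the typical element, here is a small sketch for a three-outcome categorical model. The model and all names are assumptions chosen for illustration, not taken from the source. The parameters θ = (t1, t2) are the probabilities of outcomes 0 and 1, outcome 2 has probability 1 − t1 − t2, and the expectation is an exact sum over the three outcomes. Scores are approximated by central differences.

```python
import math

def log_pmf(x, theta):
    """log f(x; theta) for a three-outcome categorical model (illustrative).

    theta = (t1, t2) are the probabilities of outcomes 0 and 1;
    outcome 2 has probability 1 - t1 - t2.
    """
    t1, t2 = theta
    return math.log((t1, t2, 1.0 - t1 - t2)[x])

def score(x, theta, h=1e-6):
    """Central-difference gradient of log f(x; theta) with respect to theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        up[i] += h
        down = list(theta)
        down[i] -= h
        grad.append((log_pmf(x, up) - log_pmf(x, down)) / (2.0 * h))
    return grad

def fisher_information_matrix(theta):
    """[I(theta)]_{ij} = sum over x of f(x; theta) * s_i(x) * s_j(x)."""
    n = len(theta)
    fim = [[0.0] * n for _ in range(n)]
    for x in range(3):
        p = math.exp(log_pmf(x, theta))
        s = score(x, theta)
        for i in range(n):
            for j in range(n):
                fim[i][j] += p * s[i] * s[j]
    return fim

theta = (0.2, 0.3)
fim = fisher_information_matrix(theta)
for row in fim:
    print(row)
```

For this model the entries can also be worked out by hand: the diagonal entries are 1/t_i + 1/(1 − t1 − t2) and the off-diagonal entry is 1/(1 − t1 − t2), which the numerical matrix should reproduce up to finite-difference error.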
The FIM is an N × N positive semidefinite matrix. If it is positive definite, then it defines a Riemannian metric on the N-dimensional parameter space. The field of information geometry uses this to connect Fisher information to differential geometry, and in that context, this metric is known as the Fisher information metric. Under certain regularity conditions, the Fisher information matrix may also be written as
$$
\bigl[\mathcal{I}(\theta)\bigr]_{i,j} = -\operatorname{E}\left[\left.\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\log f(X;\theta)\,\right|\,\theta\right].
$$
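A short sketch of why the score-product form and the Hessian form agree, assuming that differentiation with respect to θ and integration over x can be interchanged (one instance of the regularity conditions mentioned above):

$$
\begin{aligned}
\frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\log f(X;\theta)
&= \frac{1}{f(X;\theta)}\frac{\partial^{2} f(X;\theta)}{\partial\theta_{i}\,\partial\theta_{j}}
 - \left(\frac{\partial}{\partial\theta_{i}}\log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_{j}}\log f(X;\theta)\right), \\
\operatorname{E}\left[\left.\frac{1}{f(X;\theta)}\frac{\partial^{2} f(X;\theta)}{\partial\theta_{i}\,\partial\theta_{j}}\,\right|\,\theta\right]
&= \int \frac{\partial^{2} f(x;\theta)}{\partial\theta_{i}\,\partial\theta_{j}}\,dx
 = \frac{\partial^{2}}{\partial\theta_{i}\,\partial\theta_{j}}\int f(x;\theta)\,dx = 0.
\end{aligned}
$$

Taking the expectation of the first line and using the second, the term involving the second derivative of f vanishes, so minus the expected Hessian of log f equals the expected product of the scores.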
The result is interesting in several ways:
* It is equal to the Hessian of the relative entropy between f(·; θ) and f(·; θ′), taken with respect to θ′ and evaluated at θ′ = θ.
* It can be used as a Riemannian metric for defining Fisher-Rao geometry when it is positive-definite.
* It can be understood as a metric induced from the Euclidean metric, after appropriate change of variable.
* In its complex-valued form, it is the Fubini–Study metric.
* It is the key part of the proof of Wilks' theorem, which allows confidence region estimates for maximum likelihood estimation (for those conditions for which it applies) without needing the Likelihood Principle.

In cases where the analytical calculations of the FIM above are difficult, it is possible to form an average of easy Monte Carlo estimates of the Hessian of the negative log-likelihood function as an estimate of the FIM. The estimates may be based on values of the negative log-likelihood function or the gradient of the negative log-likelihood function; no analytical calculation of the Hessian of the negative log-likelihood function is needed.
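The Monte Carlo idea can be sketched as follows. This is an illustrative version under stated assumptions, not the specific scheme any reference describes: the model is a normal distribution with θ = (μ, σ), data are simulated at the true θ, and each Hessian is obtained by plain central differences of negative log-likelihood values. All function names are hypothetical. The estimate is per observation, because the negative log-likelihood is averaged over the sample.

```python
import math
import random

def avg_negative_log_likelihood(theta, data):
    """Average negative log-likelihood of N(mu, sigma^2) data, theta = (mu, sigma)."""
    mu, sigma = theta
    total = 0.0
    for x in data:
        z = (x - mu) / sigma
        total += 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * z * z
    return total / len(data)

def hessian_from_values(fn, theta, h=1e-3):
    """Central-difference Hessian of fn at theta, using only values of fn."""
    n = len(theta)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            def shifted(di, dj):
                point = list(theta)
                point[i] += di * h
                point[j] += dj * h
                return fn(point)
            value = (shifted(1, 1) - shifted(1, -1)
                     - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
            hess[i][j] = hess[j][i] = value
    return hess

def monte_carlo_fim(theta, n_replicates=20, n_samples=2000, seed=0):
    """Average the Hessians of simulated negative log-likelihoods at theta."""
    rng = random.Random(seed)
    mu, sigma = theta
    n = len(theta)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_replicates):
        # Simulate one data set at the true parameter value.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        hess = hessian_from_values(
            lambda t: avg_negative_log_likelihood(t, data), theta)
        for i in range(n):
            for j in range(n):
                total[i][j] += hess[i][j] / n_replicates
    return total

# The per-observation FIM of this model is known to be
# diag(1/sigma^2, 2/sigma^2), which gives a reference for the estimate.
estimate = monte_carlo_fim((1.0, 2.0))
print(estimate)
```

Only function values of the negative log-likelihood are used, so the same structure applies to models whose Hessian is awkward to derive by hand. The step size h and the number of replicates are tuning choices made here for illustration.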
=== Information orthogonal parameters ===
We say that two parameter component vectors $\theta_1$ and $\theta_2$ are information orthogonal if the Fisher information matrix is block diagonal, with these components in separate blocks. Orthogonal parameters are easy to deal with in the sense that their maximum likelihood estimates are asymptotically uncorrelated. When considering how to analyse a statistical model, the modeller is advised to invest some time searching for an orthogonal parametrization of the model, in particular when the parameter of interest is one-dimensional, but the nuisance parameter can have any dimension.
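To make the definition concrete, the sketch below compares two parametrizations of the normal family. Both the example and the helper names are assumptions for illustration and do not come from the source. In the (μ, σ) parametrization the off-diagonal FIM entry vanishes, so μ and σ are information orthogonal. In the alternative (a, σ) parametrization with μ = a·σ, the off-diagonal entry equals a/σ, so the parameters are not orthogonal unless a = 0. The FIM is approximated with a midpoint rule over x and central-difference scores.

```python
import math

def fim_by_quadrature(log_pdf, theta, lo, hi, steps=20000, h=1e-5):
    """Approximate E[s_i s_j] with a midpoint rule over [lo, hi].

    Scores s_i are central-difference derivatives of log_pdf in theta.
    """
    n = len(theta)
    fim = [[0.0] * n for _ in range(n)]
    dx = (hi - lo) / steps
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        weight = math.exp(log_pdf(x, theta)) * dx
        s = []
        for i in range(n):
            up = list(theta)
            up[i] += h
            down = list(theta)
            down[i] -= h
            s.append((log_pdf(x, up) - log_pdf(x, down)) / (2.0 * h))
        for i in range(n):
            for j in range(n):
                fim[i][j] += weight * s[i] * s[j]
    return fim

def log_pdf_mean_sd(x, theta):
    """Normal log-density parametrized by (mu, sigma)."""
    mu, sigma = theta
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def log_pdf_ratio_sd(x, theta):
    """Same family, parametrized by (a, sigma) with mu = a * sigma."""
    a, sigma = theta
    return log_pdf_mean_sd(x, (a * sigma, sigma))

mu, sigma = 1.0, 2.0
lo, hi = mu - 12.0 * sigma, mu + 12.0 * sigma
fim_mean_sd = fim_by_quadrature(log_pdf_mean_sd, (mu, sigma), lo, hi)
fim_ratio_sd = fim_by_quadrature(log_pdf_ratio_sd, (mu / sigma, sigma), lo, hi)
print(fim_mean_sd[0][1])   # off-diagonal entry in the (mu, sigma) parametrization
print(fim_ratio_sd[0][1])  # off-diagonal entry in the (a, sigma) parametrization
```

The same distribution is described in both cases. Only the choice of coordinates changes whether the FIM is block diagonal, which is why searching for an orthogonal parametrization is a modelling decision rather than a property of the data.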
=== Singular statistical model ===
If the Fisher information matrix is positive definite for all θ, then the corresponding statistical model is said to be regular; otherwise, the statistical model is said to be singular. Examples of singular statistical models include the following: normal mixtures, binomial mixtures, multinomial mixtures, Bayesian networks, neural networks, radial basis functions, hidden Markov models, stochastic context-free grammars, reduced rank regressions, Boltzmann machines. In machine learning, if a statistical model is devised so that it extracts hidden structure from a random phenomenon, then it naturally becomes singular.
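A minimal worked example of a singular model, chosen here for illustration and not taken from the source: let $X \sim \mathcal{N}(ab, 1)$ with parameter $\theta = (a, b)$, a toy stand-in for a model whose output depends on a product of weights. Since $\log f(x; a, b) = -\tfrac{1}{2}(x - ab)^{2} + \text{const}$,

$$
\begin{aligned}
\frac{\partial}{\partial a}\log f(x;a,b) &= b\,(x-ab), \qquad
\frac{\partial}{\partial b}\log f(x;a,b) = a\,(x-ab), \\
\mathcal{I}(a,b) &= \operatorname{E}\left[(X-ab)^{2}\right]\begin{bmatrix} b^{2} & ab \\ ab & a^{2} \end{bmatrix}
= \begin{bmatrix} b^{2} & ab \\ ab & a^{2} \end{bmatrix}, \qquad
\det \mathcal{I}(a,b) = 0.
\end{aligned}
$$

The determinant vanishes for every $(a, b)$, so the matrix is never positive definite and the model is singular in the sense defined above. The distribution depends on the parameters only through the product $ab$, and the direction $(a, -b)^{\textsf{T}}$, along which $ab$ is unchanged to first order, lies in the null space of $\mathcal{I}(a,b)$.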