
title	chunk	source	category	tags	date_saved	instance
Information theory	4/7	https://en.wikipedia.org/wiki/Information_theory	reference	science, encyclopedia	2026-05-05T03:56:37.735412+00:00	kb-cron

=== Mutual information (transinformation) ===

Mutual information measures the amount of information that can be obtained about one random variable by observing another. It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of $X$ relative to $Y$ is given by:

$$I(X;Y) = \mathbb{E}_{X,Y}[SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

where SI (Specific mutual Information) is the pointwise mutual information. A basic property of the mutual information is that:

$$I(X;Y) = H(X) - H(X \mid Y).$$

That is, knowing $Y$, we can save an average of $I(X;Y)$ bits in encoding $X$ compared to not knowing $Y$. Mutual information is symmetric:

$$I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).$$
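The identities above can be checked numerically. The following is a minimal sketch, assuming a small hypothetical joint table `joint[i][j] = p(x_i, y_j)` and base-2 logarithms; the helper names (`marginals`, `entropy`, `conditional_entropy_x_given_y`, `mutual_information`) are illustrative and not taken from any library.

```python
from math import log2

def marginals(joint):
    """Return (p_x, p_y) for a joint table joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def conditional_entropy_x_given_y(joint):
    """H(X|Y) as the p(y)-weighted average of the entropies of p(X | Y=y)."""
    _, p_y = marginals(joint)
    total = 0.0
    for j, py in enumerate(p_y):
        if py > 0:
            total += py * entropy([row[j] / py for row in joint])
    return total

def mutual_information(joint):
    """I(X;Y) in bits, computed from the defining sum over p(x, y)."""
    p_x, p_y = marginals(joint)
    return sum(
        p_xy * log2(p_xy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p_xy in enumerate(row)
        if p_xy > 0
    )

# Hypothetical joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_x, p_y = marginals(joint)
h_xy = entropy([p for row in joint for p in row])

i_def = mutual_information(joint)                           # defining sum
i_cond = entropy(p_x) - conditional_entropy_x_given_y(joint)  # H(X) - H(X|Y)
i_sym = entropy(p_x) + entropy(p_y) - h_xy                  # H(X) + H(Y) - H(X,Y)
i_swapped = mutual_information([list(c) for c in zip(*joint)])  # I(Y;X)

print(i_def, i_cond, i_sym, i_swapped)
```

All four printed values should agree up to floating-point rounding (about 0.278 bits for this table).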

Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of $X$ given the value of $Y$ and the prior distribution on $X$:

$$I(X;Y) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\big(p(X \mid Y=y) \,\|\, p(X)\big)\right].$$

In other words, this is a measure of how much, on average, the probability distribution on $X$ will change if we are given the value of $Y$. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

$$I(X;Y) = D_{\mathrm{KL}}\big(p(X,Y) \,\|\, p(X)\,p(Y)\big).$$
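As a rough illustration, the two Kullback–Leibler forms of mutual information can be computed side by side and compared. This sketch assumes a hypothetical 2×2 joint table with strictly positive marginals; `kl_divergence` is an illustrative helper, not a library function.

```python
from math import log2

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two aligned lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint distribution p(x, y), indexed as joint[x][y].
joint = [[0.30, 0.10],
         [0.05, 0.55]]
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# Form 1: expectation over y of D_KL( p(X | Y=y) || p(X) ).
mi_posterior = sum(
    p_y[j] * kl_divergence([row[j] / p_y[j] for row in joint], p_x)
    for j in range(len(p_y))
)

# Form 2: D_KL( p(X,Y) || p(X) p(Y) ), with both tables flattened
# in the same row-major order.
flat_joint = [p for row in joint for p in row]
flat_product = [px * py for px in p_x for py in p_y]
mi_joint = kl_divergence(flat_joint, flat_product)

print(mi_posterior, mi_joint)
```

Both values should coincide up to floating-point rounding, since the two expressions are algebraic rearrangements of the same sum.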

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and it has a well-specified asymptotic distribution.
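One concrete form of this relationship, given here as background rather than taken from the text above: for a contingency table of $N$ observations, the G-statistic of the log-likelihood ratio test of independence equals $2N$ times the mutual information of the empirical distribution, measured in nats. The sketch below checks that identity on a hypothetical table of counts; the function names are illustrative.

```python
from math import log

def g_statistic(counts):
    """G = 2 * sum O * ln(O / E), with E = row_total * col_total / N."""
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    return 2 * sum(
        o * log(o * n / (row_tot[i] * col_tot[j]))
        for i, row in enumerate(counts)
        for j, o in enumerate(row)
        if o > 0
    )

def empirical_mi_nats(counts):
    """Mutual information (natural log) of the empirical joint distribution."""
    n = sum(sum(row) for row in counts)
    p_x = [sum(row) / n for row in counts]
    p_y = [sum(col) / n for col in zip(*counts)]
    return sum(
        (o / n) * log((o / n) / (p_x[i] * p_y[j]))
        for i, row in enumerate(counts)
        for j, o in enumerate(row)
        if o > 0
    )

# Hypothetical 2x2 contingency table of observed counts.
counts = [[30, 10],
          [15, 45]]
n = sum(sum(row) for row in counts)

g = g_statistic(counts)
mi = empirical_mi_nats(counts)
print(g, 2 * n * mi)  # the two numbers should match up to rounding
```

Under the independence hypothesis, G is asymptotically chi-squared distributed with $(r-1)(c-1)$ degrees of freedom for an $r \times c$ table, which is what lets mutual information act as a test statistic.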

=== Kullback–Leibler divergence (information gain) ===

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution $p(X)$, and an arbitrary probability distribution $q(X)$. If we compress data in a manner that assumes $q(X)$ is the distribution underlying some data when, in reality, $p(X)$ is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. It is thus defined as:

$$D_{\mathrm{KL}}\big(p(X) \,\|\, q(X)\big) = \sum_{x \in X} -p(x) \log q(x) \;-\; \sum_{x \in X} -p(x) \log p(x) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.$$
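The definition reads as "cross-entropy minus entropy": the first sum is the average code length when data from $p$ is coded for $q$, and the second is the best achievable average length. A minimal sketch, assuming two hypothetical distributions over three symbols and base-2 logarithms (the function names are illustrative):

```python
from math import log2

def cross_entropy(p, q):
    """Average code length in bits when data drawn from p is coded for q.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is
    infinite and log2 would raise a ValueError here.
    """
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy in bits, i.e. the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q) in bits, from the single-sum form of the definition."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # hypothetical "true" distribution
q = [0.25, 0.25, 0.5]   # hypothetical assumed distribution

extra_bits = cross_entropy(p, q) - entropy(p)   # 1.75 - 1.5 bits
print(extra_bits, kl_divergence(p, q))          # both are 0.25 bits per symbol
```

With these numbers, a code designed for `q` spends 1.75 bits per symbol on average where 1.5 would suffice, so the penalty is 0.25 bits per symbol.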

Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number $X$ is about to be drawn randomly from a discrete set with probability distribution $p(x)$. If Alice knows the true distribution $p(x)$, while Bob believes (has a prior) that the distribution is $q(x)$, then Bob will be more surprised than Alice, on average, upon seeing the value of $X$. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him.
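The Alice-and-Bob reading can be sketched directly as a difference of expected surprisals, which also makes the lack of symmetry visible. The distributions below are hypothetical (Bob holds a uniform prior), and `expected_surprisal` is an illustrative helper.

```python
from math import log2

def expected_surprisal(true_dist, believed_dist):
    """Expected surprisal in bits of an observer who believes believed_dist
    while outcomes are actually drawn from true_dist."""
    return -sum(
        t * log2(b) for t, b in zip(true_dist, believed_dist) if t > 0
    )

p = [0.7, 0.2, 0.1]    # true distribution, known to Alice
q = [1/3, 1/3, 1/3]    # Bob's prior (assumed uniform for this example)

alice = expected_surprisal(p, p)   # entropy of p
bob = expected_surprisal(p, q)     # cross-entropy of p relative to q
kl_pq = bob - alice                # D_KL(p || q): Bob's unnecessary surprise

# Swapping the roles (truth is q, the mistaken prior is p) gives a
# different number, which is why the divergence is not a metric.
kl_qp = expected_surprisal(q, p) - expected_surprisal(q, q)

print(kl_pq, kl_qp)
```

Both differences are non-negative, but they are not equal to each other, in line with the asymmetry noted above.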

=== Directed Information ===

Directed information, $I(X^{n} \to Y^{n})$, is an information theory measure that quantifies the information flow from the random process $X^{n} = \{X_{1}, X_{2}, \dots, X_{n}\}$ to the random process $Y^{n} = \{Y_{1}, Y_{2}, \dots, Y_{n}\}$. The term directed information was coined by James Massey and is defined as:
