kb/Analysis_of_variance-5.md at bbc06cd6bdadfdebc18ee39c5df36a320b379042

turtle89431 292594baa5 Scrape wikipedia-science: 5966 new, 3181 updated, 9417 total (kb-cron)

2026-05-05 02:49:05 -07:00


title	chunk	source	category	tags	date_saved	instance
Analysis of variance	6/7	https://en.wikipedia.org/wiki/Analysis_of_variance	reference	science, encyclopedia	2026-05-05T09:48:53.349210+00:00	kb-cron

== Cautions ==

Balanced experiments (those with an equal sample size for each treatment) are relatively easy to interpret; unbalanced experiments offer more complexity. For single-factor (one-way) ANOVA, the adjustment for unbalanced data is easy, but the unbalanced analysis lacks both robustness and power. For more complex designs the lack of balance leads to further complications. "The orthogonality property of main effects and interactions present in balanced data does not carry over to the unbalanced case. This means that the usual analysis of variance techniques do not apply. Consequently, the analysis of unbalanced factorials is much more difficult than that for balanced designs." In the general case, "The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered."

ANOVA is (in part) a test of statistical significance. The American Psychological Association (and many other organisations) holds the view that simply reporting statistical significance is insufficient and that reporting confidence bounds is preferred.
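
The order dependence quoted above for unbalanced data can be illustrated with a small numerical sketch. The data, the helper names (`one_hot`, `rss`, `sequential_ss`) and the choice of sequential sums of squares computed by least squares are all assumptions made for illustration; they are not taken from the article.

```python
import numpy as np

def one_hot(levels, n_levels):
    """Indicator columns for a factor whose levels are coded 1..n_levels."""
    return (np.asarray(levels)[:, None] == np.arange(1, n_levels + 1)).astype(float)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def sequential_ss(first, second, y):
    """Sums of squares when `first` enters after the intercept, then `second`."""
    ones = np.ones((len(y), 1))
    rss0 = rss(ones, y)
    rss1 = rss(np.hstack([ones, first]), y)
    rss2 = rss(np.hstack([ones, first, second]), y)
    return rss0 - rss1, rss1 - rss2

# Hypothetical unbalanced 2x2 layout with cell counts 4, 1, 1, 4
a = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2])
b = np.array([1, 1, 1, 1, 2, 1, 2, 2, 2, 2])
y = np.array([3.1, 2.9, 3.4, 3.0, 5.2, 4.1, 6.0, 6.3, 5.8, 6.1])
A, B = one_hot(a, 2), one_hot(b, 2)

ss_a_first, ss_b_second = sequential_ss(A, B, y)   # A considered before B
ss_b_first, ss_a_second = sequential_ss(B, A, y)   # B considered before A
print("SS(A) entered first :", ss_a_first)
print("SS(A) entered second:", ss_a_second)
```

In this sketch the two orders attribute very different sums of squares to factor A, although the total explained by both factors together is the same. With equal cell counts the two values would coincide.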

== Generalizations ==

ANOVA is considered to be a special case of linear regression which in turn is a special case of the general linear model. All consider the observations to be the sum of a model (fit) and a residual (error) to be minimized. The Kruskal-Wallis test and the Friedman test are nonparametric tests which do not rely on an assumption of normality.
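
As a rough check of the claim that ANOVA is a special case of linear regression, the following sketch computes the one-way F-statistic twice: once from between-group and within-group sums of squares, and once from a least-squares fit on dummy variables. The data are hypothetical, and the baseline-group dummy coding is one assumed choice among several equivalent ones.

```python
import numpy as np

# Hypothetical data: three treatment groups of unequal size
groups = [np.array([4.1, 5.0, 4.6, 4.8]),
          np.array([5.9, 6.3, 6.1]),
          np.array([5.2, 4.9, 5.5, 5.1, 5.3])]
y = np.concatenate(groups)
n, k = len(y), len(groups)

# Classical one-way ANOVA: between- and within-group sums of squares
grand_mean = y.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression: intercept plus dummies for groups 2..k (group 1 is the baseline)
labels = np.repeat(np.arange(k), [len(g) for g in groups])
X = np.column_stack([np.ones(n)] + [(labels == j).astype(float) for j in range(1, k)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
ss_resid = resid @ resid
ss_model = ((y - grand_mean) ** 2).sum() - ss_resid
f_regression = (ss_model / (k - 1)) / (ss_resid / (n - k))

print(f_anova, f_regression)  # the two values agree up to rounding error
```

Under this coding the fitted intercept is the mean of the baseline group and each remaining coefficient is a difference of group means, so the fit (model) and residual (error) parts match the ANOVA decomposition.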

=== Connection to linear regression ===

Below we make clear the connection between multi-way ANOVA and linear regression. Linearly re-order the data so that the $k$-th observation is associated with a response $y_k$ and factors $Z_{k,b}$, where $b \in \{1, 2, \ldots, B\}$ denotes the different factors and $B$ is the total number of factors. In one-way ANOVA $B = 1$ and in two-way ANOVA $B = 2$. Furthermore, we assume the $b$-th factor has $I_b$ levels, namely $\{1, 2, \ldots, I_b\}$. Now, we can one-hot encode the factors into the $\sum_{b=1}^{B} I_b$-dimensional vector $v_k$.

The one-hot encoding function $g_b : \{1, 2, \ldots, I_b\} \mapsto \{0, 1\}^{I_b}$ is defined such that the $i$-th entry of $g_b(Z_{k,b})$ is

$$g_b(Z_{k,b})_i = \begin{cases} 1 & \text{if } i = Z_{k,b} \\ 0 & \text{otherwise} \end{cases}$$

The vector $v_k$ is the concatenation of all of the above vectors for all $b$. Thus, $v_k = [g_1(Z_{k,1}), g_2(Z_{k,2}), \ldots, g_B(Z_{k,B})]$. In order to obtain a fully general $B$-way interaction ANOVA we must also concatenate every additional interaction term in the vector $v_k$ and then add an intercept term. Let that vector be $X_k$.

With this notation in place, we now have the exact connection with linear regression. We simply regress response $y_k$ against the vector $X_k$. However, there is a concern about identifiability. In order to overcome such issues we assume that the sum of the parameters within each set of interactions is equal to zero. From here, one can use F-statistics or other methods to determine the relevance of the individual factors.
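
The encoding described above can be sketched in a few lines of NumPy. The function names `one_hot` and `main_effects`, the three-factor layout and the rank check are illustrative assumptions; the sketch covers only the main-effect part $v_k$ plus an intercept, without interaction terms.

```python
from itertools import product

import numpy as np

def one_hot(level, n_levels):
    """g_b: map a level in {1, ..., n_levels} to a 0/1 indicator vector."""
    vec = np.zeros(n_levels, dtype=int)
    vec[level - 1] = 1
    return vec

def main_effects(z, n_levels):
    """v_k: concatenate the one-hot encodings of every factor level in z."""
    return np.concatenate([one_hot(zb, Ib) for zb, Ib in zip(z, n_levels)])

n_levels = (2, 3, 2)                    # hypothetical I_1, I_2, I_3, so B = 3
v = main_effects((2, 1, 2), n_levels)   # Z_{k,1} = 2, Z_{k,2} = 1, Z_{k,3} = 2
print(v)                                # [0 1 1 0 0 0 1]

# One row per cell of the full factorial grid, with an intercept column appended
grid = product(*(range(1, Ib + 1) for Ib in n_levels))
X = np.array([np.append(main_effects(z, n_levels), 1) for z in grid])
rank = np.linalg.matrix_rank(X)
print(X.shape, rank)                    # (12, 8) 5
```

The design matrix has 8 columns but rank 5, which is one way to see the identifiability concern: the indicator columns of each factor sum to the intercept column. Imposing one sum-to-zero constraint per factor removes exactly the three redundant directions in this sketch.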

==== Example ====

We can consider the 2-way interaction example where we assume that the first factor has 2 levels and the second factor has 3 levels. Define $a_i = 1$ if $Z_{k,1} = i$ and $b_i = 1$ if $Z_{k,2} = i$ (and 0 otherwise), i.e. $a$ is the one-hot encoding of the first factor and $b$ is the one-hot encoding of the second factor. With that,

$$X_k = [a_1, a_2, b_1, b_2, b_3, a_1 \times b_1, a_1 \times b_2, a_1 \times b_3, a_2 \times b_1, a_2 \times b_2, a_2 \times b_3, 1]$$

where the last term is an intercept term. For a more concrete example suppose that

$$\begin{aligned} Z_{k,1} &= 2 \\ Z_{k,2} &= 1 \end{aligned}$$

Then,

$$X_k = [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]$$
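
The concrete vector above can be reproduced with a short sketch. The function name `design_row` and its default level counts are assumptions chosen to match this example; the interaction terms are formed as the flattened outer product of the two one-hot vectors, which yields the same ordering as in the formula for $X_k$.

```python
import numpy as np

def design_row(z1, z2, n1=2, n2=3):
    """Build X_k for a two-way layout with interactions and a trailing intercept."""
    a = np.eye(n1, dtype=int)[z1 - 1]       # one-hot encoding of the first factor
    b = np.eye(n2, dtype=int)[z2 - 1]       # one-hot encoding of the second factor
    interactions = np.outer(a, b).ravel()   # a1*b1, a1*b2, a1*b3, a2*b1, a2*b2, a2*b3
    return np.concatenate([a, b, interactions, [1]])

x = design_row(2, 1)                        # Z_{k,1} = 2, Z_{k,2} = 1
print(x.tolist())                           # [0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

Exactly one main-effect indicator per factor and one interaction indicator are set, alongside the intercept, so each row of the design matrix contains four ones in this two-way case.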

== See also ==

== Footnotes ==

== Notes ==
