kb/data/en.wikipedia.org/wiki/Coefficient_of_determination-3.md


| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Coefficient of determination | 4/6 | https://en.wikipedia.org/wiki/Coefficient_of_determination | reference | science, encyclopedia | 2026-05-05T07:23:31.318214+00:00 | kb-cron |

This equation corresponds to the ordinary least squares regression model with two regressors. The prediction is shown as the blue vector in the figure on the right. Geometrically, it is the projection of the true value onto a larger model space in $\mathbb{R}^{2}$ (without intercept). Notably, the values of $\beta_{0}$ and $\beta_{1}$ are not the same as in the equation for the smaller model space as long as $X_{1}$ and $X_{2}$ are not zero vectors. Therefore, the equations are expected to yield different predictions (i.e., the blue vector is expected to be different from the red vector). The least squares regression criterion ensures that the residual is minimized. In the figure, the blue line representing the residual is orthogonal to the model space in $\mathbb{R}^{2}$, giving the minimal distance from the space. The smaller model space is a subspace of the larger one, so the residual of the smaller model is guaranteed to be at least as large. Comparing the red and blue lines in the figure, the blue line is orthogonal to the space, and any other line would be longer than the blue one. Since $R^{2} = 1 - SS_{\text{res}}/SS_{\text{tot}}$ and $SS_{\text{tot}}$ does not depend on the model, a smaller value of $SS_{\text{res}}$ leads to a larger value of $R^{2}$, meaning that adding regressors inflates $R^{2}$.
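The inflation effect can be checked numerically. The following is a minimal sketch, not a reference implementation: it fits two nested least squares models on synthetic data and compares their $R^{2}$ values. The helper names (`solve_linear_system`, `r_squared`), the simulated data, and the choice to include an intercept in both models are assumptions made for illustration only.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(columns, y):
    """Fit OLS with an intercept on the given regressor columns; return R^2."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(design[i][a] * design[i][b] for i in range(n))
            for b in range(k)] for a in range(k)]
    xty = [sum(design[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data (hypothetical): y depends on x1 only.
rng = random.Random(0)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
noise_regressor = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction
y = [2.0 + 3.0 * a + rng.gauss(0, 1) for a in x1]

r2_small = r_squared([x1], y)                   # smaller model space
r2_large = r_squared([x1, noise_regressor], y)  # larger, nested model space
print(r2_small, r2_large)
```

Because the smaller model's column space is contained in the larger one's, `r2_large` should never fall below `r2_small` (up to floating-point error), even though the added regressor carries no information about `y`.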

=== Caveats ===

$R^{2}$ does not indicate whether:

* the independent variables are a cause of the changes in the dependent variable;
* omitted-variable bias exists;
* the correct regression was used;
* the most appropriate set of independent variables has been chosen;
* there is collinearity present in the data on the explanatory variables;
* the model might be improved by using transformed versions of the existing set of independent variables;
* there are enough data points to make a solid conclusion;
* there are a few outliers in an otherwise good sample.
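One of these caveats (whether the correct regression was used) can be illustrated with a small sketch. The data below are constructed, not observed: `y` is exactly `x` squared, yet a straight-line fit still gives a high $R^{2}$. The function name `simple_linear_r_squared` is hypothetical.

```python
def simple_linear_r_squared(x, y):
    """Fit y = intercept + slope * x by least squares; return (R^2, residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot, residuals

# Constructed example: a purely quadratic relationship with no noise.
x = list(range(11))
y = [xi ** 2 for xi in x]

r2, residuals = simple_linear_r_squared(x, y)
print(round(r2, 4))                    # high (about 0.93) despite the wrong functional form
print([round(r) for r in residuals])   # positive at both ends, negative in the middle
```

The residuals form a clear U shape, which the single number $R^{2}$ does not reveal; inspecting residuals is one common way to detect this kind of misspecification.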

== Extensions ==

=== Adjusted $R^{2}$ ===

The use of an adjusted $R^{2}$ (one common notation is $\bar{R}^{2}$, pronounced "R bar squared"; another is $R_{\text{a}}^{2}$ or $R_{\text{adj}}^{2}$) is an attempt to account for the phenomenon of $R^{2}$ automatically increasing when extra explanatory variables are added to the model. There are many different ways of adjusting. By far the most widely used, to the point that it is typically just referred to as adjusted $R^{2}$, is the correction proposed by Mordecai Ezekiel. The adjusted $R^{2}$ is defined as

$$\bar{R}^{2} = 1 - \frac{SS_{\text{res}}/\text{df}_{\text{res}}}{SS_{\text{tot}}/\text{df}_{\text{tot}}}$$

where $\text{df}_{\text{res}}$ is the degrees of freedom of the estimate of the population variance around the model, and $\text{df}_{\text{tot}}$ is the degrees of freedom of the estimate of the population variance around the mean. $\text{df}_{\text{res}}$ is given in terms of the sample size $n$ and the number of variables $p$ in the model: $\text{df}_{\text{res}} = n - p - 1$. $\text{df}_{\text{tot}}$ is given in the same way, but with $p$ being zero for the mean (i.e., $\text{df}_{\text{tot}} = n - 1$). Inserting the degrees of freedom and using the definition of $R^{2}$, it can be rewritten as:

$$\bar{R}^{2} = 1 - (1 - R^{2})\,\frac{n - 1}{n - p - 1}$$

where $p$ is the total number of explanatory variables in the model (excluding the intercept), and $n$ is the sample size. The adjusted $R^{2}$ can be negative, and its value is always less than or equal to that of $R^{2}$. Unlike $R^{2}$, the adjusted $R^{2}$ increases only when the increase in $R^{2}$ (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance. If a set of explanatory variables with a predetermined hierarchy of importance is introduced into a regression one at a time, with the adjusted $R^{2}$ computed each time, the level at which the adjusted $R^{2}$ reaches a maximum and decreases afterward identifies the regression with the ideal combination of best fit without excess or unnecessary terms.
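The rewritten formula translates directly into code. The sketch below is illustrative only: the function name `adjusted_r_squared` and the input values are made up for the example and do not come from any dataset.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    r2: ordinary R^2 of the fitted model
    n:  sample size
    p:  number of explanatory variables, excluding the intercept
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for positive residual degrees of freedom")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical inputs chosen only to exercise the formula.
moderate_fit = adjusted_r_squared(0.5, n=20, p=3)   # 1 - 0.5 * 19/16 = 0.40625
weak_fit = adjusted_r_squared(0.05, n=10, p=3)      # 1 - 0.95 * 9/6 = -0.425 (negative)
no_regressors = adjusted_r_squared(0.3, n=10, p=0)  # the factor (n-1)/(n-p-1) is 1 here

print(moderate_fit, weak_fit, no_regressors)
```

The second call shows how the adjusted value can drop below zero when a weak fit is combined with few observations per regressor, and the third shows that the adjustment vanishes when $p = 0$.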