kb/data/en.wikipedia.org/wiki/Coefficient_of_determination-2.md


| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Coefficient of determination | 3/6 | https://en.wikipedia.org/wiki/Coefficient_of_determination | reference | science, encyclopedia | 2026-05-05T07:23:31.318214+00:00 | kb-cron |

where, for the $i$th case, $Y_i$ is the response variable, $X_{i,1}, \dots, X_{i,p}$ are $p$ regressors, and $\varepsilon_i$ is a mean zero error term. The quantities $\beta_0, \dots, \beta_p$
are unknown coefficients, whose values are estimated by least squares. The coefficient of determination $R^2$ is a measure of the global fit of the model. Specifically, $R^2$ is an element of $[0, 1]$ and represents the proportion of variability in $Y_i$ that may be attributed to some linear combination of the regressors (explanatory variables) in $X$. $R^2$ is often interpreted as the proportion of response variation "explained" by the regressors in the model. Thus, $R^2 = 1$ indicates that the fitted model explains all variability in $y$, while $R^2 = 0$ indicates no linear relationship between the response variable and the regressors (for straight-line regression, this means that the fitted line is constant, with slope 0 and intercept $\bar{y}$).

An interior value such as $R^2 = 0.7$ may be interpreted as follows: "Seventy percent of the variance in the response variable can be explained by the explanatory variables. The remaining thirty percent can be attributed to unknown, lurking variables or inherent variability."

A caution that applies to $R^2$, as to other statistical descriptions of correlation and association, is that "correlation does not imply causation". In other words, while correlations may sometimes provide valuable clues in uncovering causal relationships among variables, a non-zero estimated correlation between two variables is not, on its own, evidence that changing the value of one variable would result in changes in the values of other variables. For example, the practice of carrying matches (or a lighter) is correlated with incidence of lung cancer, but carrying matches does not cause cancer (in the standard sense of "cause").

In the case of a single regressor, fitted by least squares, $R^2$ is the square of the Pearson product-moment correlation coefficient relating the regressor and the response variable. More generally, $R^2$ is the square of the correlation between the constructed predictor and the response variable. With more than one regressor, $R^2$ can be referred to as the coefficient of multiple determination.
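The single-regressor identity above can be checked numerically. The following is a minimal sketch, assuming the standard definition $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$; the data values and the helper names `fit_line`, `r_squared`, and `pearson_r` are made up for illustration and are not part of the article.

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for a single regressor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_tot for observed y and fitted values."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Illustrative (made-up) data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]

r2 = r_squared(y, fitted)
r = pearson_r(x, y)
print(r2, r ** 2)  # the two values agree up to rounding error

# A constant fit at the mean of y explains none of the variability.
mean_fit = [sum(y) / len(y)] * len(y)
print(r_squared(y, mean_fit))
```

The last line illustrates the $R^2 = 0$ case described above: a line with slope 0 and intercept $\bar{y}$ leaves $SS_{\text{res}}$ equal to $SS_{\text{tot}}$.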

=== Inflation of $R^2$ ===

In least squares regression using typical data, $R^2$ is at least weakly increasing with an increase in the number of regressors in the model. Because adding regressors never lowers $R^2$, $R^2$ alone cannot be used as a meaningful comparison of models with very different numbers of independent variables. For a meaningful comparison between two models, an F-test can be performed on the residual sum of squares, similar to the F-tests in Granger causality, though this is not always appropriate. As a reminder of this, some authors denote $R^2$ by $R_q^2$, where $q$ is the number of columns in $X$ (the number of explanators including the constant).

To demonstrate this property, first recall that the objective of least squares linear regression is

$$\min_b SS_{\text{res}}(b) \Rightarrow \min_b \sum_i (y_i - X_i b)^2$$

where $X_i$ is a row vector of values of explanatory variables for case $i$ and $b$ is a column vector of coefficients of the respective elements of $X_i$. The optimal value of the objective is weakly smaller as more explanatory variables are added, and hence additional columns of $X$ (the explanatory data matrix whose $i$th row is $X_i$) are added, because a less constrained minimization attains an optimal cost that is weakly smaller than a more constrained one. Given the previous conclusion and noting that $SS_{\text{tot}}$ depends only on $y$, the non-decreasing property of $R^2$ follows directly from the definition above.

The intuitive reason that using an additional explanatory variable cannot lower the $R^2$ is this: minimizing $SS_{\text{res}}$ is equivalent to maximizing $R^2$. When the extra variable is included, the data always have the option of giving it an estimated coefficient of zero, leaving the predicted values and the $R^2$ unchanged. The only way that the optimization problem will give a non-zero coefficient is if doing so improves the $R^2$.

The above gives an analytical explanation of the inflation of $R^2$. Next, an example based on ordinary least squares from a geometric perspective is shown below.
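The analytical argument can also be checked numerically before turning to the geometric example. The sketch below is illustrative only: the synthetic data, the random seed, and the helper names `solve` and `least_squares_ss_res` are assumptions, and the normal equations are solved by plain Gaussian elimination rather than by any particular library routine. It fits a model with one regressor, then adds a second regressor that is unrelated to the response by construction, and compares the two values of $R^2$. It also computes the usual F statistic for comparing nested models from the two residual sums of squares.

```python
import random


def solve(a, rhs):
    """Solve the square system a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [v] for row, v in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def least_squares_ss_res(rows, y):
    """Minimize sum_i (y_i - X_i b)^2 via the normal equations; return the minimum."""
    q = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(q)] for j in range(q)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(q)]
    coef = solve(xtx, xty)
    return sum(
        (yi - sum(cj * xj for cj, xj in zip(coef, r))) ** 2
        for r, yi in zip(rows, y)
    )


# Synthetic data (assumed for illustration): y depends on x1 only.
random.seed(0)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # regressor unrelated to y
y = [1.0 + 2.0 * v + random.gauss(0, 1) for v in x1]

y_bar = sum(y) / n
ss_tot = sum((yi - y_bar) ** 2 for yi in y)  # depends only on y

small = [[1.0, v] for v in x1]                     # q = 2 columns
large = [[1.0, v, e] for v, e in zip(x1, noise)]   # q = 3 columns

ss_res_small = least_squares_ss_res(small, y)
ss_res_large = least_squares_ss_res(large, y)
r2_small = 1 - ss_res_small / ss_tot
r2_large = 1 - ss_res_large / ss_tot
print(r2_small, r2_large)  # r2_large is never below r2_small

# F statistic for the nested comparison (one added column).
q_small, q_large = 2, 3
f_stat = ((ss_res_small - ss_res_large) / (q_large - q_small)) / (
    ss_res_large / (n - q_large)
)
print(f_stat)
```

Because the smaller model is the larger one with the extra coefficient constrained to zero, the larger model's minimum $SS_{\text{res}}$ cannot exceed the smaller model's, which is what the comparison of the two printed $R^2$ values reflects.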

A simple case to be considered first:

$$Y = \beta_0 + \beta_1 \cdot X_1 + \varepsilon$$

This equation describes the ordinary least squares regression model with one regressor. The prediction is shown as the red vector in the figure on the right. Geometrically, it is the projection of the true value onto a model space in $\mathbb{R}$ (without intercept). The residual is shown as the red line.
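The projection view can be illustrated with a few lines of arithmetic. This is a minimal sketch with made-up data, assuming a single regressor and no intercept so that the model space is the one-dimensional span of the regressor; the helper `dot` is hypothetical and not taken from the article.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Illustrative (made-up) data: one regressor, no intercept.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]

b = dot(x, y) / dot(x, x)                 # least-squares coefficient
fitted = [b * xi for xi in x]             # projection of y onto span{x}
residual = [yi - fi for yi, fi in zip(y, fitted)]

# The residual is orthogonal to the model space (value near zero).
print(dot(residual, x))

# Pythagorean decomposition: |y|^2 = |fitted|^2 + |residual|^2.
print(dot(y, y), dot(fitted, fitted) + dot(residual, residual))
```

Orthogonality of the residual to the regressor is what makes the fitted vector the closest point in the model space to the observed values.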

$$Y = \beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + \varepsilon$$