kb/data/en.wikipedia.org/wiki/Coefficient_of_determination-0.md


title chunk source category tags date_saved instance
Coefficient of determination 1/6 https://en.wikipedia.org/wiki/Coefficient_of_determination reference science, encyclopedia 2026-05-05T07:23:31.318214+00:00 kb-cron

In statistics, the coefficient of determination, denoted $R^2$ or $r^2$ and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s). It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

There are several definitions of $R^2$ that are only sometimes equivalent. In simple linear regression (which includes an intercept), $r^2$ is simply the square of the sample correlation coefficient ($r$) between the observed outcomes and the observed predictor values. If additional regressors are included, $R^2$ is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination normally ranges from 0 to 1.

Cases where $R^2$ is negative can arise when the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data. Even if a model-fitting procedure has been used, $R^2$ may still be negative, for example when linear regression is conducted without including an intercept, or when a non-linear function is used to fit the data. In cases where negative values arise, the mean of the data provides a better fit to the outcomes than do the fitted function values, according to this particular criterion.

The coefficient of determination can be more intuitively informative than MAE, MAPE, MSE, and RMSE in regression analysis evaluation, as the former can be expressed as a percentage, whereas the latter measures have arbitrary ranges. It also proved more robust for poor fits compared to SMAPE on certain test datasets.

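
As a rough illustration of two of the claims above, the following sketch fits a straight line with an intercept and compares the resulting coefficient of determination with the squared sample correlation, then scores a set of predictions that were not fitted to the data. The data are made up, and the helper names (`fit_line`, `r_squared`, `pearson_r`) are assumptions of this example, not functions from any library.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x (intercept included)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def r_squared(y, f):
    """1 - SS_res / SS_tot for observed values y and predictions f."""
    my = mean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def pearson_r(x, y):
    """Sample correlation coefficient between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Simple linear regression with an intercept: R^2 matches r^2.
a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
r2_fit = r_squared(y, fitted)
r = pearson_r(x, y)

# Predictions that were NOT obtained by fitting to these data.
unfitted = [10.0, 8.0, 6.0, 4.0, 2.0]
r2_unfitted = r_squared(y, unfitted)

print(r2_fit, r ** 2, r2_unfitted)
```

Under these assumptions, `r2_fit` and `r ** 2` agree up to floating-point error, while `r2_unfitted` comes out negative because the mean of `y` predicts the outcomes better than the unfitted values do.
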
When evaluating the goodness-of-fit of simulated ($Y_{\text{pred}}$) versus measured ($Y_{\text{obs}}$) values, it is not appropriate to base this on the $R^2$ of the linear regression (i.e., $Y_{\text{obs}} = m \cdot Y_{\text{pred}} + b$). That $R^2$ quantifies the degree of any linear correlation between $Y_{\text{obs}}$ and $Y_{\text{pred}}$, while for the goodness-of-fit evaluation only one specific linear relationship should be taken into consideration: $Y_{\text{obs}} = 1 \cdot Y_{\text{pred}} + 0$ (i.e., the 1:1 line).
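
The distinction can be made concrete with a small sketch. The simulated and measured values below are invented so that the measured values are an exact linear function of the simulated ones while lying far from the 1:1 line; the variable and function names are assumptions of this example.

```python
def mean(xs):
    return sum(xs) / len(xs)

def r_squared(y_obs, y_hat):
    """1 - SS_res / SS_tot, with y_hat used directly as the prediction."""
    m = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_hat))
    ss_tot = sum((o - m) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot

# Invented values: y_obs = 2 * y_pred + 10, i.e. perfectly linear but biased.
y_pred = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [12.0, 14.0, 16.0, 18.0, 20.0]

# (a) R^2 of the regression y_obs = m * y_pred + b (least squares, with intercept).
mp, mo = mean(y_pred), mean(y_obs)
slope = (sum((p - mp) * (o - mo) for p, o in zip(y_pred, y_obs))
         / sum((p - mp) ** 2 for p in y_pred))
intercept = mo - slope * mp
regression_line = [intercept + slope * p for p in y_pred]
r2_regression = r_squared(y_obs, regression_line)

# (b) R^2 against the 1:1 line, i.e. using y_pred itself as the prediction.
r2_one_to_one = r_squared(y_obs, y_pred)

print(r2_regression, r2_one_to_one)
```

Here the regression-based value is 1 even though every simulated value is off by at least 11 units, whereas the value computed against the 1:1 line is strongly negative, which is the behaviour the paragraph above warns about.
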

== Definitions ==

A data set has $n$ values marked $y_1, \ldots, y_n$ (collectively known as $y_i$ or as a vector $y = [y_1, \ldots, y_n]^T$), each associated with a fitted (or modeled, or predicted) value $f_1, \ldots, f_n$ (known as $f_i$, or sometimes $\hat{y}_i$, as a vector $f$). Define the residuals as $e_i = y_i - f_i$ (forming a vector $e$). If $\bar{y}$ is the mean of the observed data:

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

then the variability of the data set can be measured with two sums of squares formulas:

The sum of squares of residuals, also called the residual sum of squares:

$$SS_{\text{res}} = \sum_i (y_i - f_i)^2 = \sum_i e_i^2$$

The total sum of squares (proportional to the variance of the data):

$$SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$$

The most general definition of the coefficient of determination is

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

In the best case, the modeled values exactly match the observed values, which results in

$$SS_{\text{res}} = 0$$

and $R^2 = 1$. A baseline model, which always predicts $\bar{y}$, will have $R^2 = 0$.
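
The general definition translates directly into a few lines of code. The following is a minimal sketch in plain Python with invented numbers; the function name `r_squared` and the example models are assumptions made for illustration.

```python
def r_squared(y, f):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))  # residual sum of squares
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)           # total sum of squares
    return 1 - ss_res / ss_tot

# Invented observations.
y = [3.0, 5.0, 7.0, 9.0]
y_bar = sum(y) / len(y)

perfect = list(y)                 # modeled values match the data exactly
baseline = [y_bar] * len(y)       # always predicts the mean of the data
imperfect = [2.5, 5.5, 6.5, 9.5]  # close to the data, but not exact

r2_perfect = r_squared(y, perfect)
r2_baseline = r_squared(y, baseline)
r2_imperfect = r_squared(y, imperfect)

print(r2_perfect, r2_baseline, r2_imperfect)
```

The perfect model gives 1 and the mean-only baseline gives 0, matching the two reference points described above; the imperfect model falls in between. Note that the function divides by the total sum of squares, so it is undefined when all observed values are identical.
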

=== Relation to unexplained variance ===

In a general form, $R^2$ can be seen to be related to the fraction of variance unexplained (FVU), since the second term compares the unexplained variance (variance of the model's errors) with the total variance (of the data):

$$R^2 = 1 - \text{FVU}$$

=== As explained variance ===

A larger value of $R^2$ implies a more successful regression model. Suppose $R^2 = 0.49$. This implies that 49% of the variability of the dependent variable in the data set has been accounted for, and the remaining 51% of the variability is still unaccounted for. For regression models, the regression sum of squares, also called the explained sum of squares, is defined as

$$SS_{\text{reg}} = \sum_i (f_i - \bar{y})^2$$

In some cases, as in simple linear regression, the total sum of squares equals the sum of the two other sums of squares defined above:

$$SS_{\text{res}} + SS_{\text{reg}} = SS_{\text{tot}}$$
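
The partition can be checked numerically. The sketch below, in plain Python with made-up data, computes the three sums of squares for a least-squares line with an intercept and for a least-squares line forced through the origin; the function name `sums_of_squares` and the data are assumptions of this example.

```python
def sums_of_squares(y, f):
    """Return (SS_res, SS_reg, SS_tot) for observations y and fitted values f."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    ss_reg = sum((fi - y_bar) ** 2 for fi in f)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return ss_res, ss_reg, ss_tot

# Made-up data with a clearly non-zero intercept.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [11.0, 12.0, 14.0, 15.0, 18.0]

# Least-squares line with an intercept: f = a + b * x.
mx, my = sum(x) / len(x), sum(y) / len(y)
b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
     / sum((xi - mx) ** 2 for xi in x))
a = my - b * mx
with_intercept = [a + b * xi for xi in x]

# Least-squares line through the origin: f = b0 * x (no intercept).
b0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
no_intercept = [b0 * xi for xi in x]

res1, reg1, tot1 = sums_of_squares(y, with_intercept)
res2, reg2, tot2 = sums_of_squares(y, no_intercept)

gap_with_intercept = res1 + reg1 - tot1  # approximately 0: the partition holds
gap_no_intercept = res2 + reg2 - tot2    # far from 0: the partition fails

r2_from_res = 1 - res1 / tot1            # general definition
r2_from_reg = reg1 / tot1                # explained-variance form

print(gap_with_intercept, gap_no_intercept, r2_from_res, r2_from_reg)
```

For the fit with an intercept the two expressions for the coefficient of determination agree; for the fit through the origin the sums of squares do not add up, so only the general definition based on the residual sum of squares remains meaningful (and it is negative for these data).
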

See Partitioning in the general OLS model for a derivation of this result for one case where the relation holds. When this relation does hold, the above definition of $R^2$ is equivalent to

$$R^2 = \frac{SS_{\text{reg}}}{SS_{\text{tot}}} = \frac{SS_{\text{reg}}/n}{SS_{\text{tot}}/n}$$

where $n$ is the number of observations (cases) on the variables. In this form $R^2$ is expressed as the ratio of the explained variance (variance of the model's predictions, which is $SS_{\text{reg}}/n$) to the total variance (sample variance of the dependent variable, which is $SS_{\text{tot}}/n$). This partition of the sum of squares holds for instance when the model values $f_i$ have been obtained by linear regression. A milder sufficient condition reads as follows: The model has the form

$$f_i = \widehat{\alpha} + \widehat{\beta} q_i$$