kb/data/en.wikipedia.org/wiki/Coefficient_of_determination-1.md


| title | chunk | source | category | tags | date_saved | instance |
| --- | --- | --- | --- | --- | --- | --- |
| Coefficient of determination | 2/6 | https://en.wikipedia.org/wiki/Coefficient_of_determination | reference | science, encyclopedia | 2026-05-05T07:23:31.318214+00:00 | kb-cron |

where the $q_i$ are arbitrary values that may or may not depend on $i$ or on other free parameters (the common choice $q_i = x_i$ is just one special case), and the coefficient estimates $\widehat{\alpha}$ and $\widehat{\beta}$ are obtained by minimizing the residual sum of squares. This set of conditions is an important one and it has a number of implications for the properties of the fitted residuals and the modelled values. In particular, under these conditions:

$$\bar{f} = \bar{y}.$$
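One way to see why the means agree is sketched below. It assumes the model form $f_i = \widehat{\alpha} + \widehat{\beta} q_i$ that the sentence above refers to; that form is stated before this excerpt, so it is an assumption here, as is the sample size $n$. Setting the derivative of the residual sum of squares with respect to the intercept to zero at the minimum gives

$$
\frac{\partial}{\partial \alpha} \sum_{i=1}^{n} \left(y_i - \alpha - \beta q_i\right)^2 \Bigg|_{\widehat{\alpha},\widehat{\beta}}
= -2 \sum_{i=1}^{n} \left(y_i - \widehat{\alpha} - \widehat{\beta} q_i\right)
= -2 \sum_{i=1}^{n} \left(y_i - f_i\right) = 0,
$$

so $\sum_i f_i = \sum_i y_i$, and dividing by $n$ yields $\bar{f} = \bar{y}$. The argument relies only on the intercept being fitted freely, which is why the $q_i$ can be arbitrary.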

=== As squared correlation coefficient ===

In linear least squares multiple regression (with fitted intercept and slope), $R^2$ equals $\rho^{2}(y,f)$, the square of the Pearson correlation coefficient between the observed $y$ and modeled (predicted) $f$ data values of the dependent variable.

In a linear least squares regression with a single explanator (with fitted intercept and slope), this is also equal to $\rho^{2}(y,x)$, the squared Pearson correlation coefficient between the dependent variable $y$ and explanatory variable $x$.

It should not be confused with the correlation coefficient between two coefficient estimates, defined as

$$\rho_{\widehat{\alpha},\widehat{\beta}} = \frac{\operatorname{cov}\left(\widehat{\alpha},\widehat{\beta}\right)}{\sigma_{\widehat{\alpha}}\,\sigma_{\widehat{\beta}}},$$

where the covariance between two coefficient estimates, as well as their standard deviations, are obtained from the covariance matrix of the coefficient estimates, $(X^{T}X)^{-1}$.

Under more general modeling conditions, where the predicted values might be generated from a model different from linear least squares regression, an $R^2$ value can be calculated as the square of the correlation coefficient between the original $y$ and modeled $f$ data values. In this case, the value is not directly a measure of how good the modeled values are, but rather a measure of how good a predictor might be constructed from the modeled values (by creating a revised predictor of the form $\alpha + \beta f_i$). According to Everitt, this usage is specifically the definition of the term "coefficient of determination": the square of the correlation between two (general) variables.
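The relationships in this section can be checked numerically. The following is a minimal sketch, not a reference implementation: it assumes the usual definition $R^2 = 1 - SS_\text{res}/SS_\text{tot}$, uses synthetic data, and the helper names `r_squared` and `pearson_sq` are hypothetical. It also builds a deliberately miscalibrated predictor `g` to illustrate the last point: the squared correlation stays the same even though `g` itself predicts poorly.

```python
import numpy as np


def r_squared(y, f):
    """R^2 computed as 1 - SS_res / SS_tot (assumed definition)."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def pearson_sq(a, b):
    """Squared Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1] ** 2


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.5 + 2.0 * x + rng.normal(scale=0.7, size=200)

# Ordinary least squares with fitted intercept and slope.
# np.polyfit returns coefficients from highest degree down: [slope, intercept].
beta_hat, alpha_hat = np.polyfit(x, y, 1)
f = alpha_hat + beta_hat * x

r2 = r_squared(y, f)
rho2_yf = pearson_sq(y, f)  # squared correlation of observed and fitted values
rho2_yx = pearson_sq(y, x)  # squared correlation of y and the single explanator

# A miscalibrated predictor that is NOT a least-squares fit to y.
g = 10.0 + 0.5 * f
rho2_yg = pearson_sq(y, g)  # unchanged by the affine rescaling of f
r2_g = r_squared(y, g)      # 1 - SS_res/SS_tot for g itself

print(r2, rho2_yf, rho2_yx)
print(rho2_yg, r2_g)
```

For the least-squares fit the three quantities coincide up to floating-point error. For `g`, the squared correlation matches that of `f`, while `1 - SS_res/SS_tot` is far lower, which is why the correlation-based value describes the best predictor of the form $\alpha + \beta g_i$ rather than `g` itself.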

== Interpretation ==

$R^2$ is a measure of the goodness of fit of a model. In regression, the $R^2$ coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points. An $R^2$ of 1 indicates that the regression predictions perfectly fit the data.

Values of $R^2$ outside the range 0 to 1 occur when the model fits the data worse than the worst possible least-squares predictor (equivalent to a horizontal hyperplane at a height equal to the mean of the observed data). This occurs when a wrong model was chosen, or nonsensical constraints were applied by mistake. If equation 1 of Kvålseth is used (this is the equation used most often), $R^2$ can be less than zero. If equation 2 of Kvålseth is used, $R^2$ can be greater than one.

In many instances where $R^2$ is used, the predictors are calculated by ordinary least-squares regression: that is, by minimizing $SS_\text{res}$. In this case, $R^2$ never decreases as variables are added to the model (it is monotone non-decreasing in the number of variables included). This illustrates a drawback to one possible use of $R^2$, where one might keep adding variables (kitchen sink regression) to increase the $R^2$ value. For example, if one is trying to predict the sales of a model of car from the car's gas mileage, price, and engine power, one can include probably irrelevant factors such as the first letter of the model's name or the height of the lead engineer designing the car, because the $R^2$ will never decrease as variables are added and will likely increase due to chance alone.

This leads to the alternative approach of looking at the adjusted $R^2$. The interpretation of this statistic is almost the same as that of $R^2$, but it penalizes the statistic as extra variables are included in the model.

For cases other than fitting by ordinary least squares, the $R^2$ statistic can be calculated as above and may still be a useful measure. If fitting is by weighted least squares or generalized least squares, alternative versions of $R^2$ can be calculated appropriate to those statistical frameworks, while the "raw" $R^2$ may still be useful if it is more easily interpreted. Values for $R^2$ can be calculated for any type of predictive model, which need not have a statistical basis.
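The behaviour described in this section can be illustrated with a short numerical sketch. Everything here is an assumption made for illustration: the data are synthetic, the helper names (`r_squared`, `ols_fit`, `adjusted_r_squared`) are hypothetical, $R^2$ is taken as $1 - SS_\text{res}/SS_\text{tot}$, and the adjusted $R^2$ uses the commonly quoted form $1 - (1 - R^2)(n - 1)/(n - p - 1)$, which is not stated in this excerpt.

```python
import numpy as np


def r_squared(y, f):
    """R^2 computed as 1 - SS_res / SS_tot (assumed definition)."""
    ss_res = np.sum((y - f) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def ols_fit(X, y):
    """Fitted values from ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef


def adjusted_r_squared(r2, n, p):
    """Commonly used adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 3))       # stand-ins for relevant predictors
y = 2.0 + X @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)
noise = rng.normal(size=(n, 5))   # irrelevant predictors, pure noise

r2_values, adj_values = [], []
for k in range(6):
    # Add the first k noise columns to the three relevant predictors.
    Xk = np.column_stack([X, noise[:, :k]])
    r2 = r_squared(y, ols_fit(Xk, y))
    r2_values.append(r2)
    adj_values.append(adjusted_r_squared(r2, n, Xk.shape[1]))

# A predictor worse than the mean: a constant offset from the sample mean.
bad = np.full(n, y.mean() + 5.0)
r2_bad = r_squared(y, bad)

print(r2_values)
print(adj_values)
print(r2_bad)
```

With nested least-squares fits, the plain $R^2$ values form a non-decreasing sequence as noise columns are added, while the adjusted values are free to fall. The constant predictor `bad` gives a negative $R^2$, since its residual sum of squares exceeds the total sum of squares.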

=== In a multiple linear model ===

Consider a linear model with more than a single explanatory variable, of the form

$$Y_{i} = \beta_{0} + \sum_{j=1}^{p} \beta_{j} X_{i,j} + \varepsilon_{i},$$