What does R² mean in statistics?
The phrase “what does r squared meanl” is a misspelling of “what does R squared mean.” In statistics, R² is the coefficient of determination: a summary of how much of the variation in a response variable a regression model explains, relative to an intercept-only baseline.
Core meaning in regression output
In simple linear regression (and, more generally, least-squares regression with an intercept), R² is interpreted as the proportion of total variability in the observed response values that is accounted for by the fitted model. Values closer to 1 indicate the model explains a larger share of the variation; values closer to 0 indicate the model explains little beyond the mean of the response.
R² is commonly reported as a percentage. For example, R² = 0.83 corresponds to about 83% of the variation in the response being explained by the model, with the remaining variation attributed to residual error and unmodeled factors.
Variance decomposition definition
Let \(y_1,\dots,y_n\) be observed responses, \(\hat{y}_1,\dots,\hat{y}_n\) the fitted (predicted) responses from a least-squares line, and \(\bar{y}\) the sample mean: \[ \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i. \] Three sums of squares formalize “total variation,” “explained variation,” and “unexplained variation”: \[ \mathrm{SST}=\sum_{i=1}^{n}(y_i-\bar{y})^2,\qquad \mathrm{SSR}=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2,\qquad \mathrm{SSE}=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2. \] With an intercept in the model, the decomposition \[ \mathrm{SST}=\mathrm{SSR}+\mathrm{SSE} \] holds, and the coefficient of determination is \[ R^2=\frac{\mathrm{SSR}}{\mathrm{SST}}=1-\frac{\mathrm{SSE}}{\mathrm{SST}}. \]
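The reason the decomposition holds can be sketched briefly for a least-squares line with an intercept. The symbols \(e_i\), \(b_0\), and \(b_1\) are introduced only for this note: \(e_i=y_i-\hat{y}_i\) is the residual and \(\hat{y}_i=b_0+b_1x_i\) is the fitted line. Expanding the total sum of squares gives \[ \mathrm{SST}=\sum_{i=1}^{n}\bigl[(y_i-\hat{y}_i)+(\hat{y}_i-\bar{y})\bigr]^2=\mathrm{SSE}+\mathrm{SSR}+2\sum_{i=1}^{n}e_i(\hat{y}_i-\bar{y}). \] The least-squares conditions are \(\sum_{i=1}^{n}e_i=0\) (from the intercept) and \(\sum_{i=1}^{n}e_ix_i=0\) (from the slope), so \[ \sum_{i=1}^{n}e_i\hat{y}_i=b_0\sum_{i=1}^{n}e_i+b_1\sum_{i=1}^{n}e_ix_i=0,\qquad \sum_{i=1}^{n}e_i\bar{y}=\bar{y}\sum_{i=1}^{n}e_i=0. \] The cross-term therefore vanishes, leaving \(\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE}\). Without an intercept, \(\sum_{i=1}^{n}e_i=0\) is not guaranteed and the identity can fail.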
Connection to correlation in simple linear regression
In simple linear regression with an intercept, R² equals the square of the Pearson correlation coefficient \(r\) between \(x\) and \(y\): \[ R^2=r^2. \] The sign of the linear association is carried by \(r\) (positive or negative slope), while R² is nonnegative and measures strength of fit in the least-squares sense.
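The identity can be checked numerically. The following is a minimal sketch in plain Python, using four hypothetical points with a negative slope (they are illustrative and not taken from any dataset), so that the sign of \(r\) and the nonnegativity of R² are both visible.

```python
import math

# Hypothetical data with a decreasing trend (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 3.0, 2.0, 2.0]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Centered sums of squares and cross-products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # this is SST
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Pearson correlation coefficient; its sign matches the slope.
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line with intercept, then R^2 = 1 - SSE/SST.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
r_squared = 1 - sse / s_yy

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared:.4f}")
# prints: r = -0.9129, r^2 = 0.8333, R^2 = 0.8333
```

Here \(r\) is negative because the fitted slope is negative, yet \(r^2\) and R² agree and are positive.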
Interpretation boundaries and common pitfalls
| Interpretation statement | Statistical meaning | Clarifying note |
|---|---|---|
| “R² is the percent of variability explained.” | R² is a proportion of \(\mathrm{SST}\) explained by the model’s fitted values. | The interpretation assumes a model with an intercept and is tied to the chosen response variable; it says nothing about individual prediction errors. |
| “High R² means accurate predictions.” | High R² indicates a strong reduction in squared error versus the mean-only baseline. | Large errors can still occur for particular cases; prediction performance is better assessed with residual plots, RMSE, and validation. |
| “Low R² means no relationship.” | Low R² indicates the linear model explains little of the response variation. | Nonlinear patterns, restricted ranges of \(x\), or high measurement noise can produce low R² even when a meaningful relationship exists. |
| “R² implies causation.” | R² is descriptive of fit, not causal structure. | Causal claims require study design, controls, and assumptions beyond regression output. |
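The “Low R² means no relationship” row can be illustrated with a small sketch. The data below are hypothetical: \(y\) is determined exactly by \(x\) through \(y=x^2\), but on a range of \(x\) symmetric about zero a straight line captures none of it.

```python
# Hypothetical data: y is an exact function of x, but not a linear one.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [xi ** 2 for xi in x]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form least-squares line with intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
r_squared = 1 - sse / sst

print(f"slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

The fitted line is flat at the mean of \(y\), so R² is zero even though \(x\) determines \(y\) completely; a residual plot would reveal the curvature that R² alone hides.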
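Negative R² values can appear when a model is fitted without an intercept or when predictions are assessed out of sample. In these settings the decomposition \(\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE}\) need not hold, so R² computed as \(1-\mathrm{SSE}/\mathrm{SST}\) can fall below zero; the fitted model then performs worse (in squared error) than predicting the mean of the observed responses.
Worked example (SST, SSE, SSR, and R²)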
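As a rough illustration of the no-intercept case, the sketch below fits a line forced through the origin to hypothetical data that sit far from the origin, then computes R² as \(1-\mathrm{SSE}/\mathrm{SST}\). The data values are invented for this example.

```python
# Hypothetical data with a large offset from the origin.
x = [1.0, 2.0, 3.0]
y = [10.0, 11.0, 12.0]

# Least-squares slope for a line forced through the origin (no intercept).
slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_hat = [slope * xi for xi in x]

# Baseline: squared error of predicting the mean of y for every case.
y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)

# Squared error of the through-origin fit.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

r_squared = 1 - sse / sst
print(f"slope = {slope:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}, R^2 = {r_squared:.3f}")
```

Because the through-origin line has a larger squared error than the mean-only baseline (SSE exceeds SST), the computed R² is negative.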
Consider four paired observations \((x_i,y_i)\): \((1,2)\), \((2,2)\), \((3,3)\), \((4,5)\). The least-squares fitted line (with intercept) is \(\hat{y}=0.5+1.0x\), producing fitted values \(\hat{y}_i\). The mean response is \(\bar{y}=(2+2+3+5)/4=3\).
| \(i\) | \(x_i\) | \(y_i\) | \(\hat{y}_i\) | \(y_i-\bar{y}\) | \((y_i-\bar{y})^2\) | \(y_i-\hat{y}_i\) | \((y_i-\hat{y}_i)^2\) | \(\hat{y}_i-\bar{y}\) | \((\hat{y}_i-\bar{y})^2\) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 1.5 | \(-1\) | 1 | 0.5 | 0.25 | \(-1.5\) | 2.25 |
| 2 | 2 | 2 | 2.5 | \(-1\) | 1 | \(-0.5\) | 0.25 | \(-0.5\) | 0.25 |
| 3 | 3 | 3 | 3.5 | 0 | 0 | \(-0.5\) | 0.25 | 0.5 | 0.25 |
| 4 | 4 | 5 | 4.5 | 2 | 4 | 0.5 | 0.25 | 1.5 | 2.25 |
Summing the squared columns yields \[ \mathrm{SST}=1+1+0+4=6,\qquad \mathrm{SSE}=0.25+0.25+0.25+0.25=1,\qquad \mathrm{SSR}=2.25+0.25+0.25+2.25=5. \] The decomposition \(\mathrm{SST}=\mathrm{SSR}+\mathrm{SSE}\) holds since \(6=5+1\). Therefore, \[ R^2=1-\frac{\mathrm{SSE}}{\mathrm{SST}}=1-\frac{1}{6}=\frac{5}{6}\approx 0.8333. \] About 83.33% of the total variation in \(y\) (around \(\bar{y}=3\)) is explained by the fitted line.
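The arithmetic above can be reproduced with a short script. This is a minimal sketch in plain Python that uses the closed-form least-squares formulas for a single predictor; the variable names are chosen for this example.

```python
import math

# Data from the worked example.
x = [1, 2, 3, 4]
y = [2, 2, 3, 5]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares slope and intercept for one predictor.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# The three sums of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)

r_squared = 1 - sse / sst

print(f"fit: y_hat = {intercept:.1f} + {slope:.1f} x")
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"R^2 = {r_squared:.4f}")
# prints:
# fit: y_hat = 0.5 + 1.0 x
# SST = 6.00, SSE = 1.00, SSR = 5.00
# R^2 = 0.8333
```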