Linear regression (trendline) + correlation — Theory
1) What linear regression answers
In many biology labs, you collect paired measurements \((x_i, y_i)\): for example,
dose vs response, time vs concentration, temperature vs enzyme activity, or body length vs mass.
Simple linear regression summarizes the relationship by fitting a straight line that best predicts \(y\) from \(x\).
What you get
- A trendline \(\hat{y} = a + b\cdot x\).
- Strength and direction of association via correlation \(r\).
- How much variation is explained via \(r^2\).
- Predictions: \(\hat{y}(x_0)\) for a chosen \(x_0\).
When it makes sense
- The scatter plot suggests a roughly straight-line pattern.
- Residuals do not show obvious structure such as curvature or a funnel shape (see section 6).
- You are not extrapolating far beyond the observed \(x\)-range.
2) The model and the fitted line
The regression line used in this calculator is written as:
\[
\hat{y} = a + b\cdot x
\]
- Slope \(b\): change in the predicted value \(\hat{y}\) for a 1-unit increase in \(x\), in units of \(y\) per unit of \(x\).
- Intercept \(a\): predicted value when \(x = 0\) (sometimes meaningful, sometimes not).
The “hat” in \(\hat{y}\) means “predicted y”. The actual observed value is \(y\),
and the difference \(y-\hat{y}\) is the residual.
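As a small illustration of prediction and residual, the sketch below evaluates a line at one point. The coefficient values, the single observation, and the function names `predict` and `residual` are made up for this example and are not taken from the calculator.

```python
def predict(a, b, x):
    """Predicted value y-hat = a + b*x for intercept a and slope b."""
    return a + b * x


def residual(a, b, x, y):
    """Residual = observed y minus predicted y-hat."""
    return y - predict(a, b, x)


# Hypothetical coefficients and one hypothetical observation (x, y).
a, b = 0.05, 1.99
x_obs, y_obs = 4.0, 7.8

print(predict(a, b, x_obs))          # about 8.01
print(residual(a, b, x_obs, y_obs))  # about -0.21 (point lies below the line)
```

A negative residual means the observed point lies below the line; a positive residual means it lies above.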
3) Least squares fitting (how \(a\) and \(b\) are chosen)
The fitted line is the one that minimizes the sum of squared vertical errors:
\[
\mathrm{SSE} = \sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2
\qquad \text{where } \hat{y}_i = a + b\cdot x_i
\]
To compute \(a\) and \(b\), we first define sample means and centered sums:
\[
\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i,
\qquad
\bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i
\]
\[
S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2,
\qquad
S_{xy}=\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
\]
Then the least-squares slope and intercept are:
\[
b=\frac{S_{xy}}{S_{xx}},
\qquad
a=\bar{y}-b\cdot \bar{x}
\]
If all \(x\) values are identical, then \(S_{xx}=0\) and the slope is undefined (division by zero).
You need at least two distinct \(x\) values to fit a line.
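The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the calculator's actual code; the function name `fit_line` and the sample data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    if min(xs) == max(xs):
        # All x values identical: S_xx = 0, so the slope is undefined.
        raise ValueError("x values must not all be identical")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Made-up illustrative data (not real measurements).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(round(a, 4), round(b, 4))  # intercept about 0.05, slope about 1.99
```

For these points \(\bar{x}=3\), \(\bar{y}=6.02\), \(S_{xx}=10\), and \(S_{xy}=19.9\), so \(b=1.99\) and \(a=6.02-1.99\cdot 3=0.05\).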
4) Correlation \(r\) and determination \(r^2\)
Correlation measures the strength and direction of the linear relationship between \(x\) and \(y\).
It is computed using:
\[
r=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}
\qquad \text{where } \quad
S_{yy}=\sum_{i=1}^{n}(y_i-\bar{y})^2
\]
- \(r\in[-1,1]\); it is undefined if \(S_{xx}=0\) or \(S_{yy}=0\) (no variation in \(x\) or in \(y\)).
- \(r>0\): increasing trend. \(r<0\): decreasing trend.
- \(|r|\) close to 1 indicates a strong linear pattern; close to 0 indicates weak linear association.
The coefficient of determination is the square of \(r\):
\[
r^2 = \frac{S_{xy}^2}{S_{xx}S_{yy}} = 1-\frac{\mathrm{SSE}}{S_{yy}}
\]
In simple linear regression, \(r^2\) is the fraction of the variability in \(y\) (measured by \(S_{yy}\))
that is explained by the linear model in \(x\). A high \(r^2\) does not establish causation.
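A minimal sketch of the correlation calculation, assuming plain Python lists as input. The function name `correlation` and the sample data are illustrative assumptions, not part of the calculator.

```python
import math


def correlation(xs, ys):
    """Pearson correlation r = S_xy / sqrt(S_xx * S_yy)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two points")
    if min(xs) == max(xs) or min(ys) == max(ys):
        # No variation in x or in y: S_xx or S_yy is 0 and r is undefined.
        raise ValueError("r is undefined without variation in both x and y")

    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up illustrative data (not real measurements).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(xs, ys)
print(round(r, 4), round(r ** 2, 4))  # r about 0.9987, r^2 about 0.9973
```

Here \(S_{xy}=19.9\), \(S_{xx}=10\), and \(S_{yy}=39.708\), so \(r^2 = 19.9^2/397.08 \approx 0.9973\): about 99.7% of the variation in \(y\) is accounted for by the line.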
5) Residuals, SSE, and RMSE
The residual for observation \(i\) is:
\[
e_i = y_i - \hat{y}_i
\]
The calculator reports the sum of squared errors (SSE) and the root-mean-square error (RMSE), a measure of typical residual size:
\[
\mathrm{SSE}=\sum_{i=1}^{n} e_i^2
\]
\[
\mathrm{RMSE}=\sqrt{\frac{\mathrm{SSE}}{n-2}}
\]
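The \(n-2\) appears because two parameters \((a,b)\) are estimated from the data, so this RMSE requires \(n\ge 3\).
With this divisor it is also known as the residual standard error.
RMSE is best viewed as a “typical” vertical deviation from the line, in the units of \(y\).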
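The two formulas can be sketched as follows, given an already fitted line. The function name `sse_and_rmse` and the data are assumptions for illustration; the coefficients \(a=0.05\), \(b=1.99\) are the least-squares values for these made-up points.

```python
import math


def sse_and_rmse(xs, ys, a, b):
    """Return (SSE, RMSE) for the line y-hat = a + b*x, using n - 2 in the RMSE."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n <= 2:
        raise ValueError("RMSE with n - 2 requires at least three points")

    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sse = sum(e * e for e in residuals)
    rmse = math.sqrt(sse / (n - 2))
    return sse, rmse


# Made-up illustrative data and the least-squares line fitted to it.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
sse, rmse = sse_and_rmse(xs, ys, a=0.05, b=1.99)
print(round(sse, 4), round(rmse, 4))  # SSE about 0.107, RMSE about 0.1889
```

The residuals here are approximately 0.06, -0.13, 0.18, -0.21, and 0.10, so an RMSE near 0.19 is a reasonable summary of their typical size.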
6) Interpreting the graphs in the calculator
Scatter plot + trendline
- Points show observed \((x,y)\).
- The line shows \(\hat{y}=a+b\cdot x\).
- The \(x_0\) marker shows the prediction \(\hat{y}(x_0)\).
- Optional residual segments show each \(y_i-\hat{y}_i\) vertically.
Residual plot
- Plots \((x_i, e_i)\) where \(e_i=y_i-\hat{y}_i\).
- Good fit: residuals scattered around 0 with no systematic curve.
- Curvature suggests a nonlinear relationship.
- Funnel shape suggests changing variance.
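To see what a systematic curve in the residuals looks like numerically, the sketch below fits a straight line to data that follow \(y=x^2\) exactly. The data and the helper name `line_residuals` are made up for illustration.

```python
def line_residuals(xs, ys):
    """Fit y-hat = a + b*x by least squares and return the residuals e_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Deliberately curved, made-up data: y = x^2.
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]
print(line_residuals(xs, ys))  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. This U-shaped pattern is the signature of curvature that a straight line cannot capture, even though the residuals still sum to zero.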
The calculator’s interactivity (hover, zoom, pan, click-to-highlight) is meant to help you
connect each plotted point with its numerical row in the table.
7) Common pitfalls (important in lab reporting)
- Correlation is not causation. A strong \(r\) does not prove that changes in \(x\) cause changes in \(y\).
- Outliers can dominate the slope. Always inspect the scatter plot and residuals.
- Extrapolation risk. Predicting far beyond the observed \(x\)-range can be misleading.
- Hidden groups. If data come from different conditions (e.g., different subjects or batches), a single line may be inappropriate.
- Nonlinear relationships. If the pattern is curved, consider transformations or nonlinear models.
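The outlier pitfall is easy to demonstrate. In the sketch below, changing a single point in a made-up data set more than doubles the fitted slope; the data and the helper name `slope` are assumptions for illustration only.

```python
def slope(xs, ys):
    """Least-squares slope b = S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx


xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]    # exactly y = 2x
ys_outlier = [2, 4, 6, 8, 25]  # same data, last point replaced by an outlier

print(slope(xs, ys_clean))    # 2.0
print(slope(xs, ys_outlier))  # 5.0
```

The outlier sits at the largest \(x\), far from \(\bar{x}\), which gives it high leverage over the slope. A similar deviation near the middle of the \(x\)-range would shift the line mostly up or down rather than tilt it.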
For future upgrades (beyond this basic calculator), confidence or prediction intervals
can be added using the standard error of the regression and the \(t\) distribution.