Sampling distributions: the key bridge to inference
In statistical inference, we rarely observe the entire population. Instead, we take a sample and compute a statistic
(like a sample mean or a sample proportion). The sampling distribution is the probability distribution of that
statistic over repeated samples of the same size \(n\).
1) Sampling distribution of the sample mean \(\bar X\)
Suppose \(X_1,\dots,X_n\) are i.i.d. observations from a population with mean \(\mu\) and variance \(\sigma^2\).
The sample mean is:
\[
\bar X=\frac{1}{n}\sum_{i=1}^{n}X_i
\]
Two results are fundamental (and are exact, not approximations):
\[
\begin{aligned}
E[\bar X] &= \mu \\
Var(\bar X) &= \frac{\sigma^2}{n}
\end{aligned}
\]
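As a brief sketch of where these come from, using only linearity of expectation and, for the variance, independence of the \(X_i\):
\[
\begin{aligned}
E[\bar X] &= \frac{1}{n}\sum_{i=1}^{n}E[X_i]=\frac{1}{n}\cdot n\mu=\mu \\
Var(\bar X) &= \frac{1}{n^2}\sum_{i=1}^{n}Var(X_i)=\frac{1}{n^2}\cdot n\sigma^2=\frac{\sigma^2}{n}
\end{aligned}
\]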
The standard deviation of the sampling distribution of \(\bar X\) is called the standard error:
\[
SE(\bar X)=\sqrt{Var(\bar X)}=\frac{\sigma}{\sqrt{n}}
\]
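A minimal simulation sketch of these results, using only the Python standard library. The Exponential(1) population (for which \(\mu=1\) and \(\sigma=1\)), the function names, and the sample sizes are illustrative assumptions, not anything prescribed by the text above.

```python
import math
import random
import statistics


def simulate_sample_means(n, reps, rate=1.0, seed=0):
    """Draw `reps` samples of size `n` from an Exponential(rate) population
    and return the list of their sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]


def theoretical_se(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


if __name__ == "__main__":
    n, reps = 25, 5000
    means = simulate_sample_means(n, reps)
    # For Exponential(rate=1): mu = 1 and sigma = 1 (assumed population).
    print("mean of sample means:", round(statistics.fmean(means), 3))
    print("sd of sample means:  ", round(statistics.stdev(means), 3))
    print("theoretical SE:      ", theoretical_se(1.0, n))
```

With enough repetitions, the first printed value should sit near \(\mu\) and the second near \(\sigma/\sqrt{n}\), up to simulation noise.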
2) Sampling distribution of the sample proportion \(\hat p\)
For a binary outcome (success/failure) with population success probability \(p\), define \(I_i\in\{0,1\}\) as the indicator
of success on trial \(i\). Then the sample proportion is:
\[
\hat p=\frac{1}{n}\sum_{i=1}^{n} I_i
\]
Because \(E[I_i]=p\) and \(Var(I_i)=p(1-p)\), we get:
\[
\begin{aligned}
E[\hat p] &= p \\
Var(\hat p) &= \frac{p(1-p)}{n} \\
SE(\hat p) &= \sqrt{\frac{p(1-p)}{n}}
\end{aligned}
\]
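The same idea can be sketched for proportions. The helper names below and the values \(p=0.3\), \(n=50\) are hypothetical, chosen only to illustrate the formulas above.

```python
import math
import random
import statistics


def se_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


def simulate_p_hats(p, n, reps, seed=0):
    """Return `reps` simulated sample proportions,
    each computed from n independent Bernoulli(p) trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]


if __name__ == "__main__":
    p, n = 0.3, 50  # illustrative values
    p_hats = simulate_p_hats(p, n, reps=5000)
    print("mean of p_hat:  ", round(statistics.fmean(p_hats), 4))
    print("sd of p_hat:    ", round(statistics.stdev(p_hats), 4))
    print("theoretical SE: ", round(se_proportion(p, n), 4))
```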
3) The Central Limit Theorem (CLT) connection
The mean and variance formulas above hold for every sample size \(n\) (given i.i.d. observations with finite variance). The CLT explains the shape:
for large \(n\), the sampling distribution of \(\bar X\) becomes approximately normal even if the population is not normal
(assuming finite variance).
\[
\bar X \;\dot\sim\; \mathcal N\!\left(\mu,\;\frac{\sigma^2}{n}\right)
\]
For proportions, a common rule of thumb is that the normal approximation is reasonable when:
\[
np \ge 10 \quad\text{and}\quad n(1-p)\ge 10.
\]
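One way to see what the rule of thumb protects against is to compare the normal approximation with the exact binomial probability. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the threshold of 10 is the rule of thumb quoted above, and no continuity correction is applied.

```python
import math
from statistics import NormalDist


def normal_approx_ok(n, p, threshold=10):
    """Rule-of-thumb check: np >= threshold and n(1-p) >= threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


def exact_binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(p_hat <= k/n), using
    p_hat ~ N(p, p(1-p)/n) and no continuity correction."""
    se = math.sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=se).cdf(k / n)


if __name__ == "__main__":
    # (n, p, k) triples are illustrative: one satisfies the rule, one does not.
    for n, p, k in [(200, 0.5, 100), (20, 0.05, 0)]:
        print(
            f"n={n}, p={p}, rule satisfied: {normal_approx_ok(n, p)}, "
            f"exact={exact_binomial_cdf(k, n, p):.4f}, "
            f"normal={normal_approx_cdf(k, n, p):.4f}"
        )
```

In the second case (\(np=1\)), the two probabilities differ substantially, which is the situation the rule of thumb is meant to flag.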
4) Why this matters in hypothesis testing
Hypothesis tests compare an observed statistic to what we would expect if a null hypothesis were true, and confidence intervals use the same spread to give a range of plausible parameter values.
Both depend on the sampling distribution:
- smaller \(SE\) means more precise estimates (larger \(n\) shrinks the spread),
- normal approximations let us compute \(z\)-scores and \(p\)-values,
- simulation helps visualize how repeated samples vary around the population parameter.
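To make the \(z\)-score and \(p\)-value step concrete, here is a small sketch of a two-sided one-sample \(z\)-test. It assumes \(\sigma\) is known and that the normal approximation is adequate; the function name and the numbers in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_z_test(x_bar, mu0, sigma, n):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known.

    Returns (z, p_value), where z = (x_bar - mu0) / (sigma / sqrt(n)).
    """
    se = sigma / math.sqrt(n)
    z = (x_bar - mu0) / se
    # Two-sided p-value: probability of a |Z| at least this large under H0.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical data: observed mean 52, null mean 50, sigma 10, n 100.
    z, p_value = one_sample_z_test(x_bar=52, mu0=50, sigma=10, n=100)
    print(f"z = {z:.2f}, p = {p_value:.4f}")  # z = 2.00, p is about 0.0455
```

Here \(SE=10/\sqrt{100}=1\), so the observed mean lies two standard errors above the null value.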
5) Practical notes (and “university level” extensions)
- Medians and other statistics: for many estimators (like the sample median), the mean/variance formulas are more complex.
  As \(n\to\infty\), many statistics still have an approximately normal sampling distribution under general conditions.
- Finite population correction: if sampling without replacement from a finite population of size \(N\), the variance of \(\bar X\) is multiplied
  by the factor \(\frac{N-n}{N-1}\approx 1-\frac{n}{N}\), which matters when \(n\) is not negligible relative to \(N\).
- Simulation vs theory: this calculator simulates sampling variability and overlays the theoretical normal curve when appropriate.
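The finite population correction noted above can be checked with a short simulation. This sketch uses the exact factor \(\frac{N-n}{N-1}\) (close to \(1-\frac{n}{N}\) when \(N\) is large), with \(\sigma\) taken as the population standard deviation computed with divisor \(N\). The population \(1,\dots,100\), the sample size, and the function names are illustrative assumptions.

```python
import math
import random
import statistics


def fpc_standard_error(sigma, n, N):
    """SE of the sample mean when drawing n units without replacement
    from a population of size N; sigma is the population SD (divisor N)."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if N == 1:
        return 0.0
    return (sigma / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))


def simulate_without_replacement(population, n, reps, seed=0):
    """Return `reps` sample means, each from n units drawn without replacement."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]


if __name__ == "__main__":
    population = list(range(1, 101))  # illustrative population, N = 100
    n = 40
    sigma = statistics.pstdev(population)
    means = simulate_without_replacement(population, n, reps=5000)
    print("simulated sd of sample means:", round(statistics.stdev(means), 3))
    print("SE with correction:          ", round(fpc_standard_error(sigma, n, 100), 3))
    print("SE without correction:       ", round(sigma / math.sqrt(n), 3))
```

The simulated spread should track the corrected value rather than \(\sigma/\sqrt{n}\), since 40% of the population is sampled.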