Sample size for power (hypothesis tests)
In study planning, you often want enough data to reliably detect a meaningful effect.
A power analysis answers: “What sample size \(n\) do I need so my test detects an effect of size \(d\)
with probability (power) at least \(1-\beta\), while controlling false positives at level \(\alpha\)?”
1) Key concepts: \(\alpha\), \(\beta\), and power
- \(\alpha\) (Type I error): probability of rejecting \(H_0\) when \(H_0\) is true (false positive).
- \(\beta\) (Type II error): probability of failing to reject \(H_0\) when the alternative is true (false negative).
- Power: \(1-\beta\), the probability the test detects the effect (rejects \(H_0\)) when the effect is present.
\[
\text{Power} = P(\text{Reject }H_0 \mid H_1\text{ true}) = 1-\beta.
\]
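To make these definitions concrete, the following is a minimal simulation sketch. It assumes a two-sided one-sample z-test with known \(\sigma=1\) and \(H_0:\mu=0\); the function name `rejection_rate` and the chosen values (\(n=25\), a true mean of 0.5, 4000 repetitions) are illustrative assumptions, not part of the calculator.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n=25, alpha=0.05, reps=4000, seed=0):
    """Fraction of simulated two-sided one-sample z-tests (known sigma = 1,
    H0: mean = 0) that reject H0 when the data come from N(true_mean, 1)."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # (mean - 0) / (sigma / sqrt(n)), sigma = 1
        if abs(z) > crit:
            rejections += 1
    return rejections / reps

# H0 true: the rejection rate estimates alpha (Type I error).
type_i_rate = rejection_rate(true_mean=0.0)
# H1 true: the rejection rate estimates power; its complement estimates beta.
power_est = rejection_rate(true_mean=0.5)

print(f"estimated alpha: {type_i_rate:.3f}")
print(f"estimated power: {power_est:.3f}, estimated beta: {1 - power_est:.3f}")
```

With \(H_0\) true the rejection rate should land near \(\alpha=0.05\); with the effect present it estimates the power, and one minus that value estimates \(\beta\).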
2) Effect size for means: Cohen’s \(d\)
For mean-based tests, a common standardized effect size is Cohen’s \(d\), which measures the mean difference
in standard deviation units:
\[
\text{One-sample:}\quad d=\frac{|\mu_1-\mu_0|}{\sigma},
\qquad
\text{Two-sample:}\quad d=\frac{|\mu_1-\mu_2|}{\sigma}.
\]
Interpreting \(d\) depends on context, but a common rule of thumb is:
small \(\approx 0.2\), medium \(\approx 0.5\), large \(\approx 0.8\).
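As a small illustration of the definition, here is a sketch that computes \(d\) from two means and a common standard deviation. The helper name `cohens_d` and the example numbers are hypothetical.

```python
def cohens_d(mean_a, mean_b, sigma):
    """Standardized mean difference |mean_a - mean_b| / sigma.

    For the one-sample case, pass the hypothesized mean mu_0 as mean_b.
    sigma is the (assumed common) standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return abs(mean_a - mean_b) / sigma

# Hypothetical example: means of 105 and 100 with a standard deviation of 10.
d_example = cohens_d(105, 100, 10)
print(d_example)  # 0.5, a "medium" effect under the rule of thumb
```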
3) How sample size affects power
Increasing \(n\) reduces the standard error, so under the alternative the distribution of the test statistic shifts further away from its null distribution.
In planning, we often use a normal approximation in which the test statistic under \(H_1\) behaves like:
\[
T \approx N(\delta,1),
\]
where \(\delta\) is the noncentrality (mean shift in standard-error units). For equal-variance designs:
\[
\delta(n)=
\begin{cases}
d\sqrt{n}, & \text{one-sample mean}\\[6pt]
d\sqrt{\dfrac{n}{2}}, & \text{two-sample means (equal }n\text{ per group)}
\end{cases}
\]
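The two cases of \(\delta(n)\) can be sketched as a single function. The name `noncentrality` and the `design` argument are illustrative choices, not part of the original tool.

```python
import math

def noncentrality(d, n, design="one-sample"):
    """Mean shift of the test statistic in standard-error units.

    design="one-sample": delta = d * sqrt(n)
    design="two-sample": delta = d * sqrt(n / 2), with n observations per group
    """
    if design == "one-sample":
        return d * math.sqrt(n)
    if design == "two-sample":
        return d * math.sqrt(n / 2)
    raise ValueError("design must be 'one-sample' or 'two-sample'")

# Quadrupling n doubles delta, for either design.
for n in (16, 64, 256):
    print(n, noncentrality(0.5, n), noncentrality(0.5, n, "two-sample"))
```

Because \(\delta\) grows with \(\sqrt{n}\), doubling the shift requires four times the sample size.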
4) Critical values: one-sided vs two-sided
The significance level \(\alpha\) determines a rejection threshold \(c\):
\[
c=
\begin{cases}
z_{1-\alpha/2}, & \text{two-sided test}\\[6pt]
z_{1-\alpha}, & \text{one-sided test}
\end{cases}
\]
Here \(z_q\) means the \(q\)-quantile of the standard normal distribution.
For a t-test, the critical value is instead a \(t\)-quantile at the same probability level, with degrees of freedom
\(df=n-1\) (one-sample) or \(df=2n-2\) (two-sample, equal group sizes).
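The normal critical values can be computed with Python's standard library, as in the sketch below. The function name `critical_value` is an assumption for illustration. A \(t\)-quantile is not available in the standard library and would need an external package, so it is not shown here.

```python
from statistics import NormalDist

def critical_value(alpha, two_sided=True):
    """Standard normal rejection threshold c for significance level alpha."""
    q = 1 - alpha / 2 if two_sided else 1 - alpha
    return NormalDist().inv_cdf(q)  # the q-quantile z_q

c_two = critical_value(0.05)                   # z_{0.975}, about 1.96
c_one = critical_value(0.05, two_sided=False)  # z_{0.95}, about 1.645
print(round(c_two, 3), round(c_one, 3))
```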
5) Power formulas (normal approximation)
Using \(T \approx N(\delta,1)\), power can be written in closed form using the standard normal CDF \(\Phi(\cdot)\).
Two-sided
\[
\mathrm{Power}(n)
=P(|N(\delta,1)|>c)
=\bigl(1-\Phi(c-\delta)\bigr)+\Phi(-c-\delta).
\]
One-sided (right-tailed)
\[
\mathrm{Power}(n)=P(N(\delta,1)>c)=1-\Phi(c-\delta).
\]
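One-sided (left-tailed; the effect lies in the negative direction, so \(\delta\) enters with a negative sign, e.g. \(\delta=-d\sqrt{n}\))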
\[
\mathrm{Power}(n)=P(N(\delta,1)<-c)=\Phi(-c-\delta).
\]
In practice, two-sided tests are common unless you have a strong reason to justify a one-sided direction before data collection.
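The three power formulas above can be sketched in a few lines using the standard normal CDF from Python's standard library. The function name `power_normal` and its `tail` argument are illustrative assumptions. Note that `delta` is treated as a signed shift, so a left-tailed test expects a negative value.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: cdf is Phi, inv_cdf gives z_q

def power_normal(delta, alpha=0.05, tail="two-sided"):
    """Approximate power when the test statistic is N(delta, 1) under H1.

    delta is signed: use a negative delta for a left-tailed test."""
    if tail == "two-sided":
        c = _STD.inv_cdf(1 - alpha / 2)
        return (1 - _STD.cdf(c - delta)) + _STD.cdf(-c - delta)
    c = _STD.inv_cdf(1 - alpha)
    if tail == "right":
        return 1 - _STD.cdf(c - delta)
    if tail == "left":
        return _STD.cdf(-c - delta)
    raise ValueError("tail must be 'two-sided', 'right', or 'left'")

print(round(power_normal(0.0), 3))   # no effect: power equals alpha
print(round(power_normal(2.8), 3))   # close to 0.80
print(round(power_normal(2.8, tail="right"), 3))
```

At \(\delta=0\) the two-sided formula returns \(\alpha\), which is a quick sanity check: with no effect, the only rejections are false positives.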
6) Solving for the required \(n\)
A classic planning shortcut uses the idea that you need \(\delta\) to exceed the critical value by an amount linked to the target power:
\[
\delta \approx z_{1-\alpha^{*}}+z_{1-\beta},
\qquad
\alpha^{*}=
\begin{cases}
\alpha/2, & \text{two-sided}\\
\alpha, & \text{one-sided}
\end{cases}
\]
Since \(1-\beta\) is the target power, we have \(z_{1-\beta}=z_{\text{power}}\).
Substituting \(\delta(n)\) gives a closed-form estimate for \(n\):
\[
n \approx
\begin{cases}
\left(\dfrac{z_{1-\alpha^{*}}+z_{\text{power}}}{d}\right)^2, & \text{one-sample}\\[10pt]
2\left(\dfrac{z_{1-\alpha^{*}}+z_{\text{power}}}{d}\right)^2, & \text{two-sample (per group)}
\end{cases}
\]
Because real studies need an integer \(n\), calculators typically round up and then re-check the achieved power.
This tool does exactly that: it finds the smallest integer \(n\) (up to a search limit) such that power \(\ge\) target.
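The closed-form estimate and the "smallest integer \(n\)" search can be sketched as follows. This is not the calculator's actual code: the function names, the default search limit `n_max`, and the use of the normal approximation for the power check are all assumptions made for illustration.

```python
import math
from statistics import NormalDist

STD = NormalDist()

def approx_power(d, n, alpha=0.05, two_sided=True, two_sample=False):
    """Normal-approximation power for a mean test with effect size d."""
    delta = d * math.sqrt(n / 2 if two_sample else n)
    if two_sided:
        c = STD.inv_cdf(1 - alpha / 2)
        return (1 - STD.cdf(c - delta)) + STD.cdf(-c - delta)
    c = STD.inv_cdf(1 - alpha)
    return 1 - STD.cdf(c - delta)

def closed_form_n(d, alpha=0.05, power=0.80, two_sided=True, two_sample=False):
    """Unrounded n from the z-sum shortcut (per group if two_sample)."""
    alpha_star = alpha / 2 if two_sided else alpha
    z_sum = STD.inv_cdf(1 - alpha_star) + STD.inv_cdf(power)
    n = (z_sum / d) ** 2
    return 2 * n if two_sample else n

def smallest_n(d, alpha=0.05, power=0.80, two_sided=True, two_sample=False,
               n_max=100_000):
    """Smallest integer n with approximate power >= target, or None if the
    search limit n_max (a hypothetical default) is reached first."""
    for n in range(1, n_max + 1):
        if approx_power(d, n, alpha, two_sided, two_sample) >= power:
            return n
    return None

# One-sample test, d = 0.3, two-sided alpha = 0.05, target power 0.80.
estimate = closed_form_n(0.3)
n_found = smallest_n(0.3)
print(f"closed form: {estimate:.2f}, smallest integer n: {n_found}")
```

A linear search is used here for clarity. Since power increases with \(n\) for a fixed effect size, a bisection search would find the same answer faster when the search limit is large.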
7) Worked example (the 64-per-group rule-of-thumb)
Suppose you plan a two-sample comparison of means with:
\(\alpha=0.05\), target power \(=0.80\), and \(d=0.5\) (a medium effect).
\[
z_{1-\alpha/2}=z_{0.975}\approx 1.96,\qquad
z_{\text{power}}=z_{0.80}\approx 0.84.
\]
\[
n \approx 2\left(\frac{1.96+0.84}{0.5}\right)^2
=2\left(\frac{2.80}{0.5}\right)^2
=2(5.6)^2
=2(31.36)
\approx 62.7
\Rightarrow \text{round up to } n=63 \text{ per group.}
\]
The normal approximation therefore gives 63 per group. Exact t-test power, based on the noncentral \(t\) distribution, requires 64 per group for the same inputs, which is where the rule of thumb comes from.
Other conventions (continuity corrections, rounded quantiles) can also shift the result slightly,
which is why the calculator re-checks the achieved power at the rounded integer \(n\).
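The arithmetic in this example can be re-checked with unrounded quantiles, as in the sketch below. It uses the normal approximation only, so it reproduces the value near 63 rather than the t-based 64; the variable and function names are illustrative.

```python
import math
from statistics import NormalDist

std = NormalDist()
alpha, target_power, d = 0.05, 0.80, 0.5

z_alpha = std.inv_cdf(1 - alpha / 2)   # about 1.96
z_power = std.inv_cdf(target_power)    # about 0.84

# Two-sample closed form, per group, before rounding.
n_unrounded = 2 * ((z_alpha + z_power) / d) ** 2
n_per_group = math.ceil(n_unrounded)

def achieved_power(n):
    """Two-sided normal-approximation power with n observations per group."""
    delta = d * math.sqrt(n / 2)
    return (1 - std.cdf(z_alpha - delta)) + std.cdf(-z_alpha - delta)

print(f"unrounded n per group: {n_unrounded:.2f}")
print(f"rounded up: {n_per_group}")
print(f"power at n={n_per_group}: {achieved_power(n_per_group):.4f}")
print(f"power at n={n_per_group - 1}: {achieved_power(n_per_group - 1):.4f}")
```

The re-check confirms that the rounded-up \(n\) meets the target under this approximation, while one fewer observation per group falls just short.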
8) Practical notes and limitations
- Assumptions: independent observations, approximately normal errors (or large \(n\) via CLT), and equal variance for two-sample pooling.
- Dropout / missing data: increase planned \(n\) to compensate (e.g., \(n_{\text{planned}} \approx n/(1-\text{dropout rate})\)).
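- Multiple testing: if you test many endpoints and correct for multiplicity (e.g., Bonferroni, which tests each of \(m\) endpoints at \(\alpha/m\)), the per-test \(\alpha\) is smaller, which increases the required \(n\).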
- t-test power: exact t-test power uses a noncentral t distribution; many planning tools use a close normal approximation (especially for moderate/large \(n\)).
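The dropout and multiple-testing adjustments above can be sketched as follows. The 15% dropout rate, the three endpoints, and the use of a Bonferroni correction are assumptions chosen for illustration; the sample sizes come from the two-sided normal-approximation shortcut.

```python
import math
from statistics import NormalDist

def n_two_sample(d, alpha=0.05, power=0.80):
    """Per-group n from the two-sided z-sum shortcut, rounded up."""
    std = NormalDist()
    z_sum = std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)
    return math.ceil(2 * (z_sum / d) ** 2)

def inflate_for_dropout(n, dropout_rate):
    """Planned n so that about n participants remain after dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return math.ceil(n / (1 - dropout_rate))

base_n = n_two_sample(0.5)                       # single endpoint
planned_n = inflate_for_dropout(base_n, 0.15)    # assumed 15% dropout
bonferroni_n = n_two_sample(0.5, alpha=0.05 / 3) # assumed 3 endpoints

print(f"base n per group: {base_n}")
print(f"with 15% dropout: {planned_n}")
print(f"with Bonferroni over 3 endpoints: {bonferroni_n}")
```

The two adjustments compound: when both apply, the dropout inflation would be applied to the multiplicity-adjusted \(n\).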
9) University extensions
Beyond mean tests, power analysis extends to:
- Proportions (risk differences, odds ratios) using normal/score/Wald approximations.
- ANOVA/regression with \(R^2\), \(f^2\), or noncentral F distributions.
- Sequential designs and group-sequential testing (alpha spending).