Mann–Whitney U (Wilcoxon rank-sum) test
The Mann–Whitney U test is a non-parametric method for comparing two independent samples.
It is commonly used when the normality assumption is questionable or when outliers make a mean-based test unreliable.
Instead of comparing means directly, it compares the relative ranks of observations across the two groups.
What question does it answer?
In its most common interpretation (a “shift” model), the test asks whether one sample tends to produce larger values than the other.
In more general terms, it tests whether the two samples come from the same continuous distribution:
\[
H_0:\ F_1(x)=F_2(x)\ \text{for all }x
\qquad\text{vs.}\qquad
H_1:\ F_1(x)\ne F_2(x)\ \text{for some }x\ \text{(two-sided)}
\]
You can also test against one-sided alternatives (“Sample 1 tends smaller” or “Sample 1 tends larger”).
Step 1: Pool and rank
Combine the two samples into one pooled list of size \(N=n_1+n_2\), then sort the pooled values.
Assign ranks \(1,2,\dots,N\). If there are ties (equal values), assign the average of the ranks that the tied values would occupy.
Let \(R_1\) be the sum of ranks belonging to Sample 1:
\[
R_1 = \sum_{\text{obs in Sample 1}} \text{rank}.
\]
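The pooling and ranking step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `average_ranks` and `rank_sum_sample1` and the sample data are hypothetical, chosen only for this example.

```python
def average_ranks(values):
    """Return 1-based ranks for `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of equal values starting at i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_sample1(sample1, sample2):
    """Pool both samples, rank the pooled list, and return R_1."""
    ranks = average_ranks(list(sample1) + list(sample2))
    return sum(ranks[: len(sample1)])


# Hypothetical data: the value 3 appears in both samples, so it gets rank 3.5.
sample1 = [1, 3, 5]
sample2 = [2, 3, 4]
r1 = rank_sum_sample1(sample1, sample2)  # 1 + 3.5 + 6 = 10.5
```

A quick sanity check is that the two rank sums together always equal \(N(N+1)/2\), here \(10.5 + 10.5 = 21\).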
Step 2: Compute \(U_1\) and \(U_2\)
The Mann–Whitney statistic for Sample 1 is:
\[
U_1 = R_1 - \frac{n_1(n_1+1)}{2}.
\]
The corresponding statistic for Sample 2 is:
\[
U_2 = n_1n_2 - U_1.
\]
Intuition: \(U_1\) is closely related to “how often an observation from Sample 1 exceeds an observation from Sample 2” (with ties contributing half).
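The pairwise intuition can be made concrete by counting directly. The sketch below assumes the convention in which \(U_1\) counts the pairs where the Sample 1 observation is the larger one, with each tie counting one half; the function name `u_by_pair_counting` and the data are hypothetical.

```python
def u_by_pair_counting(sample1, sample2):
    """Return (U_1, U_2): U_1 counts pairs with x > y, ties contribute 0.5."""
    u1 = 0.0
    for x in sample1:
        for y in sample2:
            if x > y:
                u1 += 1.0
            elif x == y:
                u1 += 0.5
    # Every one of the n1 * n2 pairs is credited to exactly one of U_1, U_2.
    u2 = len(sample1) * len(sample2) - u1
    return u1, u2


# Hypothetical data: 9 pairs in total, one of them tied (3 vs 3).
sample1 = [1, 3, 5]
sample2 = [2, 3, 4]
u1, u2 = u_by_pair_counting(sample1, sample2)  # (4.5, 4.5)
```

Pair counting takes \(n_1 n_2\) comparisons, so it is mainly useful as a cross-check; the rank-based route needs only one sort of the pooled data.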
Step 3: Normal approximation and tie correction
For moderate or large sample sizes, the distribution of \(U\) under \(H_0\) is well approximated by a normal distribution with mean:
\[
\mu_U=\frac{n_1n_2}{2}.
\]
Without ties, the variance is:
\[
\operatorname{Var}(U)=\frac{n_1n_2(N+1)}{12}.
\]
With ties, the variance is reduced by a correction term. Let \(t\) be the number of observations sharing a given value, with the sum taken over every group of tied values;
then:
\[
\operatorname{Var}(U)=\frac{n_1n_2}{12}\left[(N+1)-\frac{\sum (t^3-t)}{N(N-1)}\right].
\]
The test statistic (z-score) is:
\[
z=\frac{U-\mu_U}{\sigma_U},\qquad \sigma_U=\sqrt{\operatorname{Var}(U)}.
\]
Many implementations apply a continuity correction (moving \(U\) by \(0.5\) toward \(\mu_U\) before dividing by \(\sigma_U\)) to account for approximating a discrete distribution with a continuous one.
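The mean, the tie-corrected variance, and the z-score can be combined as follows. This is a sketch under stated assumptions, not the calculator's actual code: the function names are hypothetical, and the continuity correction is taken to shrink \(|U-\mu_U|\) by 0.5 without crossing zero.

```python
import math
from collections import Counter


def u_variance(sample1, sample2):
    """Variance of U under H0, including the tie correction."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    tie_sizes = Counter(list(sample1) + list(sample2)).values()
    # Untied values have t = 1 and contribute t**3 - t = 0.
    tie_term = sum(t**3 - t for t in tie_sizes)
    return n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))


def z_score(u, sample1, sample2, continuity=True):
    """Normal-approximation z for an observed U (either U_1 or U_2)."""
    mu = len(sample1) * len(sample2) / 2
    sigma = math.sqrt(u_variance(sample1, sample2))
    diff = u - mu
    if continuity:
        # Assumed convention: move U by 0.5 toward the mean, never past it.
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    return diff / sigma
```

For the hypothetical samples `[1, 3, 5]` and `[2, 3, 4]`, the single tied pair gives \(\sum(t^3-t)=6\), so the variance drops from \(5.25\) to \(0.75\times(7-0.2)=5.1\).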
p-values for different alternatives
- Two-sided: \(p = 2\Phi(-|z|)\)
- Sample 1 tends smaller: \(p = \Phi(z)\) (small \(U_1\) corresponds to negative \(z\))
- Sample 1 tends larger: \(p = 1-\Phi(z)\)
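These three cases translate directly into code. In the sketch below, \(\Phi\) is computed with the standard library's `math.erfc`; the labels `"two-sided"`, `"less"` and `"greater"` are hypothetical names for the three alternatives, not something defined in this article.

```python
import math


def phi(z):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def p_value(z, alternative="two-sided"):
    """p-value for a z computed from U_1 under the chosen alternative."""
    if alternative == "two-sided":
        return 2 * phi(-abs(z))
    if alternative == "less":      # Sample 1 tends smaller
        return phi(z)
    if alternative == "greater":   # Sample 1 tends larger
        return phi(-z)             # equals 1 - phi(z), with less rounding error
    raise ValueError(f"unknown alternative: {alternative!r}")
```

Note that the two one-sided p-values for the same \(z\) always sum to 1, and the two-sided value is twice the smaller of them.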
Effect size: rank-biserial correlation
Alongside significance, it is useful to report an effect size. A common choice is the rank-biserial correlation:
\[
r_{rb}=\frac{U_1-U_2}{n_1n_2}=\frac{2U_1}{n_1n_2}-1.
\]
Values near \(+1\) indicate Sample 1 tends larger; values near \(-1\) indicate Sample 1 tends smaller; values near \(0\) indicate little separation.
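As a small illustration, the effect size needs only \(U_1\) and the two sample sizes. The function name `rank_biserial` is hypothetical.

```python
def rank_biserial(u1, n1, n2):
    """Rank-biserial correlation from U_1 and the two sample sizes."""
    u2 = n1 * n2 - u1
    return (u1 - u2) / (n1 * n2)


# With n1 = n2 = 3 there are 9 pairs, so U_1 ranges from 0 to 9.
complete_separation = rank_biserial(9, 3, 3)   # 1.0
no_separation = rank_biserial(4.5, 3, 3)       # 0.0
```

Because \(U_1/(n_1n_2)\) is the share of pairs credited to Sample 1, \(r_{rb}\) can be read as that share minus the share credited to Sample 2.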
When should you prefer an exact test?
For very small samples (e.g., \(n_1\) and \(n_2\) both under ~10), exact p-values based on the exact distribution of rank sums
can be preferable. This calculator uses a normal approximation (with tie correction) for speed and broad applicability.
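For intuition about what an exact test does, the sketch below enumerates every way of assigning \(n_1\) of the \(N\) ranks to Sample 1, each equally likely under \(H_0\). It assumes no ties and is only practical for small samples; the function name `exact_p_two_sided` is hypothetical, and this is not the method the calculator uses.

```python
from itertools import combinations
from math import comb


def exact_p_two_sided(r1_observed, n1, n2):
    """Exact two-sided p-value for the rank sum R_1, assuming no ties."""
    n = n1 + n2
    mean_r1 = n1 * (n + 1) / 2  # expected rank sum of Sample 1 under H0
    observed_dev = abs(r1_observed - mean_r1)
    extreme = 0
    # Each choice of n1 ranks out of 1..N is one equally likely outcome.
    for ranks in combinations(range(1, n + 1), n1):
        if abs(sum(ranks) - mean_r1) >= observed_dev:
            extreme += 1
    return extreme / comb(n, n1)


# Sample 1 holds the three smallest of six values: R_1 = 1 + 2 + 3 = 6.
p = exact_p_two_sided(6, 3, 3)  # 2 of the 20 assignments are this extreme
```

With \(n_1=n_2=3\) the smallest attainable two-sided p-value is \(2/20=0.1\), which shows why very small samples cannot reach conventional significance thresholds no matter how separated the groups are.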
University extension: Kruskal–Wallis
The Kruskal–Wallis test generalizes the rank-sum idea to more than two groups. It also works by ranking pooled data and comparing
group rank sums, and it is often used as a non-parametric alternative to one-way ANOVA.
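To show how the rank-sum idea carries over, here is a minimal sketch of the Kruskal–Wallis statistic in its standard no-ties form, \(H=\frac{12}{N(N+1)}\sum_j R_j^2/n_j-3(N+1)\). That formula is not given in this article and is stated here as an assumption of the example; the function name `kruskal_wallis_h` and the data are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of groups, assuming no tied values."""
    pooled = sorted(v for g in groups for v in g)
    # Position in the sorted pooled list is the rank (valid only without ties).
    rank_of = {v: i + 1 for i, v in enumerate(pooled)}
    n = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    total = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


# Three fully separated hypothetical groups.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

With exactly two groups and no ties, \(H\) equals the square of the Mann–Whitney \(z\) computed without a continuity correction, which is one way to see that the two tests are built on the same idea.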