The Mann–Whitney U test (also called the Wilcoxon rank-sum test) is a nonparametric method for comparing two independent samples. It works with the ranks of the pooled observations and does not assume normality.
When the test is appropriate
- Two independent samples (no pairing or repeated measurements between groups).
- Outcome is at least ordinal (ranks make sense) and often continuous.
- Primary goal: detect a systematic shift between groups; under similar-shape distributions, this is often interpreted as a difference in medians.
Hypotheses and test idea
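Let Group 1 have sample size \(n_1\) and Group 2 have sample size \(n_2\), with total \(N=n_1+n_2\). The null hypothesis \(H_0\) states that the two populations have the same distribution; the two-sided alternative states that values from one population tend to be larger than values from the other. The test ranks all \(N\) observations together (rank 1 for the smallest, rank \(N\) for the largest) and then compares how large the ranks tend to be in one group versus the other.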
Core statistics: rank sum \(R_1\) (or \(W\)) and Mann–Whitney \(U\)
Step 1: Pool and rank (tie rule)
Combine both samples and assign ranks 1 through \(N\). If ties occur, assign each tied value the average of the ranks it would have occupied.
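The tie rule can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `average_ranks` is hypothetical and not taken from any particular package.

```python
def average_ranks(values):
    """Return ranks 1..N for `values`, giving tied values their average rank."""
    # Sort indices by value so that equal values sit next to each other.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Positions i..j (0-based) would occupy ranks i+1..j+1; average them.
        avg = (i + 1 + j + 1) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


print(average_ranks([3, 1, 4, 1, 5]))  # [3.0, 1.5, 4.0, 1.5, 5.0]
```

The two tied values of 1 would have occupied ranks 1 and 2, so each receives 1.5. The ranks still sum to \(N(N+1)/2\).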
Step 2: Compute rank sums
Let \(R_1\) be the sum of ranks for Group 1 (often called \(W\), the Wilcoxon rank-sum statistic). Similarly define \(R_2\).
Step 3: Convert to \(U\)
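The Mann–Whitney statistics are:
\[
U_1 = R_1 - \frac{n_1(n_1+1)}{2}, \qquad U_2 = R_2 - \frac{n_2(n_2+1)}{2}, \qquad U_1 + U_2 = n_1 n_2 .
\]
Under this convention, \(U_1\) counts the pairs in which the Group 1 observation exceeds the Group 2 observation (a tied pair counts one half). Under \(H_0\), both statistics are expected to equal \(n_1 n_2/2\).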
The smaller of \(U_1\) and \(U_2\) is often used as \(U_{\min}\) for a two-sided test because it measures how far the rank allocation deviates from balance.
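As a rough sketch of Step 3, the conversion from rank sums to \(U\) can be written as below. It assumes the convention \(U_k = R_k - n_k(n_k+1)/2\); some texts and software label the statistics the other way round. The function name `u_from_rank_sums` is made up for this illustration.

```python
def u_from_rank_sums(r1, r2, n1, n2):
    """Convert rank sums to (U1, U2, U_min), assuming U_k = R_k - n_k(n_k+1)/2."""
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    # Sanity check: the two statistics always add up to n1 * n2.
    if abs(u1 + u2 - n1 * n2) > 1e-9:
        raise ValueError("rank sums are inconsistent with n1 and n2")
    return u1, u2, min(u1, u2)


# Hypothetical rank sums for n1 = 3, n2 = 4 (all ranks sum to 7 * 8 / 2 = 28).
print(u_from_rank_sums(15, 13, 3, 4))  # (9.0, 3.0, 3.0)
```

The built-in check is useful in practice: if \(U_1 + U_2 \ne n_1 n_2\), a rank was mis-assigned or a rank sum was mis-added.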
Worked example (with full ranking)
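Consider two independent samples, each of size 5: Group A = {10, 12, 14, 15, 18} and Group B = {8, 9, 11, 13, 16}. Pooling and ranking the ten observations (there are no ties) gives: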
| Observation | Group | Pooled order | Rank |
|---|---|---|---|
| 8 | B | 1st | 1 |
| 9 | B | 2nd | 2 |
| 10 | A | 3rd | 3 |
| 11 | B | 4th | 4 |
| 12 | A | 5th | 5 |
| 13 | B | 6th | 6 |
| 14 | A | 7th | 7 |
| 15 | A | 8th | 8 |
| 16 | B | 9th | 9 |
| 18 | A | 10th | 10 |
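
Reading the ranks off the table gives \(R_A = 3+5+7+8+10 = 33\) and \(R_B = 1+2+4+6+9 = 22\). With the convention \(U = R - n(n+1)/2\), this yields \(U_A = 33 - 15 = 18\), \(U_B = 22 - 15 = 7\), and \(U_{\min} = 7\). The short sketch below reproduces these numbers; it relies on the fact that this example has no ties, so it is not a general-purpose ranking routine.

```python
group_a = [10, 12, 14, 15, 18]
group_b = [8, 9, 11, 13, 16]

# Pool the observations. The example has no ties, so each value's rank is
# simply its 1-based position in the sorted pooled list.
pooled = sorted(group_a + group_b)
rank_of = {value: position for position, value in enumerate(pooled, start=1)}

r_a = sum(rank_of[x] for x in group_a)  # 3 + 5 + 7 + 8 + 10
r_b = sum(rank_of[x] for x in group_b)  # 1 + 2 + 4 + 6 + 9

n_a, n_b = len(group_a), len(group_b)
u_a = r_a - n_a * (n_a + 1) // 2  # n(n+1) is always even, so // is exact
u_b = r_b - n_b * (n_b + 1) // 2
u_min = min(u_a, u_b)

print(r_a, r_b, u_a, u_b, u_min)  # 33 22 18 7 7
```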
From \(U\) to a p-value (normal approximation)
For moderate to large samples, \(U\) is commonly standardized to a \(z\)-score under \(H_0\). The mean and (no-ties) standard deviation are:
\[
\mu_U = \frac{n_1 n_2}{2}, \qquad \sigma_U = \sqrt{\frac{n_1 n_2 (N+1)}{12}} .
\]
With ties, the variance is reduced. If tie groups have sizes \(t_1,t_2,\dots\), a common correction is:
\[
\sigma_U = \sqrt{\frac{n_1 n_2}{12}\left[(N+1) - \sum_k \frac{t_k^3 - t_k}{N(N-1)}\right]} .
\]
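A minimal sketch of the tie-corrected standard deviation follows. It assumes the commonly used form \(\sigma_U^2 = \frac{n_1 n_2}{12}\left[(N+1) - \sum_k (t_k^3 - t_k)/(N(N-1))\right]\); the function name `sigma_u` and its `tie_sizes` parameter are illustrative only.

```python
import math


def sigma_u(n1, n2, tie_sizes=()):
    """Standard deviation of U under H0, with an optional tie correction.

    `tie_sizes` lists the size of each group of tied values. Groups of
    size 1 contribute nothing, so they may be included or omitted.
    """
    n = n1 + n2
    tie_term = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
    return math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))


print(sigma_u(5, 5))          # no ties
print(sigma_u(5, 5, [2, 3]))  # one pair and one triple of tied values: smaller
```

With no ties the correction term vanishes and the result reduces to \(\sqrt{n_1 n_2 (N+1)/12}\).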
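Applying the no-ties approximation to the example (\(n_1 = n_2 = 5\), \(U_{\min} = 7\)), with a continuity correction of \(+0.5\) because \(U_{\min}\) lies below \(\mu_U\):
\[
\mu_U = \frac{5\cdot 5}{2} = 12.5, \qquad \sigma_U = \sqrt{\frac{5\cdot 5\cdot 11}{12}} \approx 4.79, \qquad z = \frac{U_{\min} - \mu_U + 0.5}{\sigma_U} = \frac{7 - 12.5 + 0.5}{4.79} \approx -1.04 .
\]
A two-sided p-value is obtained as \(p = 2\cdot P(Z \le -|z|)\), which gives \(p \approx 0.30\) here. The conclusion depends on the chosen significance level \(\alpha\) (commonly 0.05); at \(\alpha = 0.05\) this example does not reject \(H_0\).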
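The standardization and two-sided p-value can be sketched with the standard library alone. This assumes the no-ties variance and a continuity correction that moves \(U\) half a unit toward \(\mu_U\); the function name `normal_approx_p` is hypothetical.

```python
import math


def normal_approx_p(u, n1, n2, continuity=True):
    """Return (z, two-sided p) for U using the no-ties normal approximation."""
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    diff = u - mu
    if continuity and diff != 0:
        # Move U half a unit toward the mean before standardizing.
        diff -= math.copysign(0.5, diff)
    z = diff / sigma
    # 2 * P(Z <= -|z|) equals erfc(|z| / sqrt(2)) for a standard normal Z.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


z, p = normal_approx_p(7, 5, 5)
print(round(z, 2), round(p, 2))  # roughly -1.04 and 0.3
```

With only five observations per group the normal approximation is rough, so an exact p-value from statistical software is usually preferable at this sample size.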
Interpretation of the decision
- If \(p \le \alpha\): evidence that the two independent populations differ in location/distribution (often described as one group tending to have larger values).
- If \(p > \alpha\): insufficient evidence to claim a difference; this does not prove the distributions are identical.
Effect size (recommended alongside the p-value)
Common-language effect size
A useful probability interpretation is:
\[
\hat{A} = \frac{U_1}{n_1 n_2} .
\]
\(\hat{A}\) estimates the probability that a randomly chosen observation from Group 1 exceeds a randomly chosen observation from Group 2 (with a standard tie convention depending on software).
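Rank-biserial correlation
One common form of the rank-biserial correlation rescales the same information to the interval \([-1, 1]\):
\[
r_{rb} = \frac{U_1 - U_2}{n_1 n_2} = 2\hat{A} - 1 .
\]
For the example, taking Group A as Group 1 (\(U_1 = 18\), \(U_2 = 7\), \(n_1 n_2 = 25\)):
\[
\hat{A} = \frac{18}{25} = 0.72, \qquad r_{rb} = \frac{18 - 7}{25} = 0.44 .
\]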
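Both effect sizes follow directly from \(U_1\), as the sketch below illustrates. It assumes \(U_1\) counts the pairs in which Group 1 exceeds Group 2 (ties counted as one half) and uses the form \(r_{rb} = (U_1 - U_2)/(n_1 n_2)\); sign conventions differ between packages, and `effect_sizes` is a made-up name.

```python
def effect_sizes(u1, n1, n2):
    """Return (A_hat, r_rb) from U1, assuming U1 counts Group 1 'wins'."""
    a_hat = u1 / (n1 * n2)
    u2 = n1 * n2 - u1
    r_rb = (u1 - u2) / (n1 * n2)  # equivalently 2 * a_hat - 1
    return a_hat, r_rb


print(effect_sizes(18, 5, 5))  # (0.72, 0.44)
```

An \(\hat{A}\) of 0.5 (equivalently \(r_{rb} = 0\)) indicates no tendency for either group to have larger values.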
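Visualization: pooled order with group membership
Reading the group labels of the worked example from rank 1 to rank 10 gives the sequence B, B, A, B, A, B, A, A, B, A: Group A's observations sit mostly toward the upper ranks.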
Common pitfalls and reporting checklist
- Independence: paired data require the Wilcoxon signed-rank test, not the rank-sum/Mann–Whitney U.
- Ties: use average ranks and apply a tie correction when using a normal approximation.
- Interpretation: the test detects distributional differences; “median difference” is most defensible under a shift model with similarly shaped distributions.
- Report: \(n_1,n_2\), the statistic (\(U\) or \(W\)), p-value (exact or approximate), and an effect size (such as \(\hat{A}\) or \(r_{rb}\)).