t-Test vs Mann-Whitney U: Which Should You Use?
A data analyst runs a Shapiro-Wilk test on their A/B test data, gets p = 0.03, and switches from a t-test to the Mann-Whitney U. The result changes from significant (p = 0.041) to not significant (p = 0.067). They conclude the effect is not real. But here is the problem: with n = 45 per group, the t-test was perfectly valid despite the failed normality test, and the Mann-Whitney had lower power to detect the true effect. The analyst made the wrong call by mechanically following the "test for normality first" recipe.
Choosing between the t-test and the Mann-Whitney U is one of the most common decisions in applied statistics, and it is frequently made poorly. The textbook rule -- "use Mann-Whitney when data is non-normal" -- oversimplifies the decision. This guide covers what each test actually does, when the t-test is robust enough despite non-normality, when Mann-Whitney is genuinely necessary, and how the tests compare on statistical power.
What Each Test Does
The Independent Samples t-Test
The t-test compares the means of two independent groups. It calculates the difference between group means, divides by the standard error of that difference, and checks whether the resulting t-statistic is large enough to be unlikely under the null hypothesis of equal population means.
The t-test assumes that the sampling distribution of the mean is approximately normal. This is guaranteed when the data itself is normal, but the Central Limit Theorem ensures it is approximately true for non-normal data as long as the sample size is moderate (roughly n >= 30 per group).
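You can see the Central Limit Theorem at work with a quick simulation. Here is a minimal sketch (the exponential distribution and the sample sizes are arbitrary illustrative choices): means of n = 30 draws from a heavily skewed distribution are far closer to symmetric than the raw data.

```python
# Sketch: the CLT in action. Means of n = 30 draws from a skewed
# (exponential) distribution are much closer to normal than the raw data.
# Distribution and sample sizes here are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

raw = rng.exponential(scale=1.0, size=100_000)            # heavily right-skewed
means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

print(f"skew of raw data:     {stats.skew(raw):.2f}")     # near 2 for exponential
print(f"skew of sample means: {stats.skew(means):.2f}")   # much closer to 0
```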
The Mann-Whitney U Test
The Mann-Whitney U (also called the Wilcoxon rank-sum test) works on ranks rather than raw values. It pools all observations from both groups, ranks them from smallest to largest, and tests whether one group tends to have higher ranks than the other.
Because it uses ranks, the Mann-Whitney is not affected by the shape of the underlying distribution. It does not assume normality. However, it also discards information about the magnitude of differences between values -- the gap between ranks 1 and 2 is the same whether the actual values differ by 0.1 or 1000.
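Both tests are a single call in SciPy. A minimal sketch on simulated data (the group arrays, means, and sample sizes below are placeholders; substitute your own):

```python
# Sketch: running both tests on the same two samples with SciPy.
# The data is simulated; swap in your own arrays.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=45)
group_b = rng.normal(loc=11.0, scale=2.0, size=45)

# t-test compares means (equal_var=False gives Welch's version, a safe default)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney compares rank distributions
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney: U = {u_stat:.0f}, p = {u_p:.4f}")
```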
Side-by-Side Comparison
| Feature | t-Test | Mann-Whitney U |
|---|---|---|
| Tests for | Difference in means | Stochastic dominance (P(X > Y) != 0.5) |
| Data type | Continuous (interval/ratio) | Ordinal or continuous |
| Normality assumption | Yes (but robust with n >= 30) | No |
| Equal variance assumption | Yes (Welch's t-test relaxes this) | Similar shape assumed for median interpretation |
| Power (normal data) | Higher (uses full metric information) | ~95% of t-test power |
| Power (heavy-tailed data) | Can be lower due to inflated variance | Often higher (robust to outliers) |
| Handles outliers | Sensitive (outliers inflate variance) | Robust (ranks cap outlier influence) |
| Handles ties | Not applicable | Uses correction for ties (common with ordinal data) |
| Effect size | Cohen's d (mean difference in SD units) | Rank-biserial correlation or common language effect size |
| Sample size needed | Fewer observations for same power (normal data) | ~5% more observations for same power (normal data) |
When the t-Test Is Robust Enough
The t-test is more robust than most practitioners realize. Simulation studies (Lumley et al., 2002; Fagerland, 2012) have shown that the t-test maintains accurate Type I error rates under a wide range of non-normal distributions when sample sizes are adequate.
The t-test works well even with non-normal data when:
- n >= 30 per group. The Central Limit Theorem ensures the sampling distribution of the mean is approximately normal, regardless of the data distribution. With n = 30, even exponential and uniform distributions produce valid t-test results.
- Distributions are symmetric but non-normal. Symmetric distributions (uniform, bimodal symmetric, light-tailed) pose minimal problems for the t-test even with small samples (n >= 15).
- Sample sizes are equal. Equal group sizes make the t-test robust to both non-normality and unequal variances simultaneously. With unequal sizes plus non-normality, use Welch's t-test.
- You care about mean differences. If the business question is about average performance (average revenue, mean response time), the t-test directly answers that question. Mann-Whitney answers a different question (stochastic dominance).
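You can verify this robustness yourself by simulation. The sketch below (simulation count and exponential distribution are illustrative choices) draws both groups from the same strongly skewed distribution with n = 30 and checks that the t-test's false-positive rate stays near the nominal 5%.

```python
# Sketch: checking the t-test's Type I error rate by simulation when the
# data is exponential (strongly non-normal) but n = 30 per group.
# Under the null (identical distributions) the rejection rate should sit
# near the nominal 5%. The simulation count is an arbitrary choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims, alpha = 30, 5_000, 0.05

rejections = 0
for _ in range(sims):
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    rejections += p < alpha

print(f"empirical Type I error: {rejections / sims:.3f}")  # close to 0.05
```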
When Mann-Whitney Is Necessary
There are genuine scenarios where the Mann-Whitney is the correct choice:
- Ordinal data. Likert-scale responses (1-5 ratings), satisfaction rankings, or severity grades are ordinal. Means of ordinal data are not well-defined, so the t-test's interpretation breaks down. Mann-Whitney compares rank distributions directly.
- Small samples with severe skew. With n < 15 per group and heavily skewed data (income, response times, purchase amounts), the sampling distribution of the mean is not yet approximately normal. The t-test may produce inflated Type I error rates.
- Outlier-dominated data. If a few extreme values dominate the variance, the t-test loses power because the standard error inflates. Mann-Whitney ranks cap the influence of outliers. Revenue data with a few whale customers is a classic example.
- The research question is about ranks, not means. "Does treatment A tend to produce better outcomes than treatment B?" is a rank question. If you care about superiority rather than the size of the average difference, Mann-Whitney directly answers your question.
- Data is bounded or truncated. Floor and ceiling effects (e.g., a pain scale of 0-10 where many patients score 0 or 10) create non-normal distributions that violate t-test assumptions even with moderate samples.
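For the ordinal case, here is a minimal sketch on made-up Likert data. The response counts below are hypothetical; note that SciPy's `mannwhitneyu` applies the tie correction automatically, which matters because ordinal data is full of ties.

```python
# Sketch: Mann-Whitney on ordinal Likert responses (1-5), where a mean is
# hard to justify. SciPy's mannwhitneyu applies a tie correction
# automatically. The response counts below are made up for illustration.
import numpy as np
from scipy import stats

# Hypothetical satisfaction ratings (1-5) for two product versions, n = 40 each
version_a = np.repeat([1, 2, 3, 4, 5], [2, 5, 10, 15, 8])
version_b = np.repeat([1, 2, 3, 4, 5], [6, 12, 11, 8, 3])

u_stat, p_value = stats.mannwhitneyu(version_a, version_b,
                                     alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```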
Power Comparison
When data is truly normal, the t-test is the most powerful test for detecting mean differences. The Mann-Whitney achieves approximately 95.5% of the t-test's power -- the asymptotic relative efficiency (ARE) of the Mann-Whitney relative to the t-test is 3/pi ≈ 0.955. In practical terms, if the t-test needs 100 observations per group, Mann-Whitney needs about 105 for the same power.
However, this relationship reverses for heavy-tailed distributions:
- Logistic distribution: Mann-Whitney has ARE = 1.097 (about 10% more powerful than the t-test)
- Double exponential: ARE = 1.5 (50% more powerful)
- Contaminated normal (5% outliers): ARE can exceed 2.0
The practical takeaway: for clean, roughly symmetric data, the t-test wins on power. For data with outliers or heavy tails, Mann-Whitney can substantially outperform the t-test because outliers inflate the t-test's variance estimate without affecting ranks.
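The contaminated-normal case is easy to reproduce by simulation. The sketch below (mixture weights, shift size, and simulation count are all illustrative choices) plants a true mean shift in data that is 95% clean and 5% wild, then counts how often each test detects it.

```python
# Sketch: comparing power by simulation on a contaminated normal
# (95% N(0,1), 5% N(0,10)) with a true mean shift. All parameters are
# illustrative. Outliers inflate the t-test's variance; ranks cap them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, sims, shift = 40, 2_000, 0.6

def contaminated(size):
    base = rng.normal(0, 1, size)
    mask = rng.random(size) < 0.05       # 5% of points come from N(0, 10)
    base[mask] = rng.normal(0, 10, mask.sum())
    return base

t_wins = mw_wins = 0
for _ in range(sims):
    a = contaminated(n)
    b = contaminated(n) + shift          # true difference between groups
    t_wins += stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05
    mw_wins += stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < 0.05

print(f"t-test power:       {t_wins / sims:.2f}")
print(f"Mann-Whitney power: {mw_wins / sims:.2f}")
```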
Worked Example: A/B Test on Session Duration
An e-commerce site tests two checkout flows. Dependent variable: time to complete checkout (seconds). n = 40 per group.
The Data Problem
Session durations are right-skewed: most users finish in 30-90 seconds, but some abandon and return later (producing values of 300-900 seconds). The Shapiro-Wilk test rejects normality (p < 0.001).
t-Test Result
Group A mean: 78.3s (SD = 62.1)
Group B mean: 64.1s (SD = 55.8)
t(78) = 1.07, p = 0.287
Cohen's d = 0.24
Mann-Whitney Result
Group A median: 52s (IQR: 35-89)
Group B median: 41s (IQR: 28-72)
U = 604, p = 0.041
Rank-biserial r = 0.245
Here the Mann-Whitney detects a significant difference that the t-test misses. Why? The handful of extreme session durations (users who left and returned) inflate the standard deviations, making the t-test's denominator large. The Mann-Whitney, working on ranks, is unaffected by the magnitude of those outliers. For this type of data, Mann-Whitney is the better choice.
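An analysis in the same spirit can be sketched in a few lines. The data below is generated (lognormal session times plus a few hypothetical "left and came back" outliers), not the article's actual dataset, so the exact statistics will differ; the point is the workflow, including both effect sizes.

```python
# Sketch: simulated checkout times in the spirit of the example above --
# lognormal sessions plus a few extreme "returner" outliers. Generated
# data, so the numbers will not match the article's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 40

def sessions(typical_secs):
    base = rng.lognormal(mean=np.log(typical_secs), sigma=0.6, size=n)
    base[:3] += rng.uniform(300, 900, size=3)   # a few extreme returners
    return base

flow_a = sessions(52)
flow_b = sessions(41)

t_res = stats.ttest_ind(flow_a, flow_b, equal_var=False)
u_res = stats.mannwhitneyu(flow_a, flow_b, alternative="two-sided")

# Effect sizes: Cohen's d (pooled SD) and rank-biserial correlation
# (one common formula, r = 1 - 2U / (n1 * n2); sign depends on group order)
pooled_sd = np.sqrt((flow_a.var(ddof=1) + flow_b.var(ddof=1)) / 2)
cohens_d = (flow_a.mean() - flow_b.mean()) / pooled_sd
rank_biserial = 1 - 2 * u_res.statistic / (n * n)

print(f"t-test:       p = {t_res.pvalue:.3f}, d = {cohens_d:.2f}")
print(f"Mann-Whitney: p = {u_res.pvalue:.3f}, r = {rank_biserial:.2f}")
```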
Decision Guide
Use the t-test when:
- Data is continuous (interval or ratio scale)
- n >= 30 per group, regardless of distribution shape
- n >= 15 per group with roughly symmetric data
- You want to estimate the magnitude of the mean difference
- Equal (or similar) sample sizes
- No extreme outliers dominating variance
Use Mann-Whitney when:
- Data is ordinal (Likert scales, rankings)
- Small samples (n < 15) with severely skewed data
- Extreme outliers are present and meaningful (not errors)
- Data is bounded with floor/ceiling effects
- You care about rank ordering rather than mean differences
- Distribution shapes differ substantially between groups
Consider alternatives when:
- You have paired/matched data (use paired t-test or Wilcoxon signed-rank)
- You have 3+ groups (use ANOVA or Kruskal-Wallis)
- You want to control for covariates (use regression or ANCOVA)
- The outcome is binary (use chi-square or Fisher's exact test)
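The main branches of the guide above can be sketched as a rough helper function. The thresholds (n >= 30, n >= 15, a skewness cutoff of 2) mirror this guide's rules of thumb; they are heuristics, not hard laws, and the function name and cutoff are our own choices.

```python
# Sketch: the decision guide as a rough heuristic. Thresholds mirror the
# rules of thumb in this guide; they are not hard statistical laws.
import numpy as np
from scipy import stats

def suggest_test(a, b, ordinal=False, skew_cutoff=2.0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = min(len(a), len(b))
    if ordinal:
        return "mann-whitney"            # means of ordinal data are ill-defined
    if n >= 30:
        return "welch-t"                 # CLT makes the t-test robust here
    max_skew = max(abs(stats.skew(a)), abs(stats.skew(b)))
    if n >= 15 and max_skew < skew_cutoff:
        return "welch-t"                 # roughly symmetric, moderate n
    return "mann-whitney"                # small n and/or severe skew

rng = np.random.default_rng(5)
print(suggest_test(rng.exponential(size=10), rng.exponential(size=10)))
```

A function like this is only a starting point; it cannot see floor effects, meaningful outliers, or what your business question actually asks about.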
Run Both Tests in Seconds
MCP Analytics automatically checks your data distribution and runs the appropriate test -- t-test, Welch's t-test, or Mann-Whitney U -- with effect sizes, confidence intervals, and assumption diagnostics. Upload your data and get results without writing code.
Frequently Asked Questions
Is the Mann-Whitney U test the same as the Wilcoxon rank-sum test?
Yes, they are mathematically equivalent. Different software uses different names: SciPy has both mannwhitneyu() and ranksums(), R uses wilcox.test(). Do not confuse the Wilcoxon rank-sum (for independent samples) with the Wilcoxon signed-rank test (for paired samples).
How large does my sample need to be for the t-test to be robust?
With n >= 30 per group, the t-test is robust for most distributions encountered in practice. For moderately skewed data, n >= 15 is often sufficient. For severely skewed distributions (exponential, log-normal) or heavy-tailed distributions, you may need n >= 50 or should use Mann-Whitney.
Does the Mann-Whitney test compare medians?
Mann-Whitney tests stochastic dominance: whether one group tends to produce larger values than the other. It is commonly described as a test of medians, but this interpretation is only valid when both groups have the same distribution shape. When shapes differ, Mann-Whitney can reject the null even when medians are equal.
Should I test for normality before choosing a test?
Running a normality test as a gatekeeper is common but statistically questionable. With small samples, normality tests lack power. With large samples, they reject for trivial deviations. Instead, examine your data visually (histogram, Q-Q plot), consider the variable type (ordinal? bounded?), and use domain knowledge about the expected distribution.
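Both failure modes are easy to demonstrate. In the sketch below (distributions and sample sizes are illustrative choices), a t distribution with 20 degrees of freedom is nearly indistinguishable from normal in practice, yet at n = 5,000 Shapiro-Wilk will usually flag it; meanwhile a tiny, genuinely skewed lognormal sample can slip through.

```python
# Sketch: why Shapiro-Wilk is a poor gatekeeper. At large n it tends to
# reject trivial deviations (a t distribution with 20 df is practically
# normal); at small n it can miss real skew. Distributions and sizes here
# are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

nearly_normal = stats.t.rvs(df=20, size=5_000, random_state=rng)
stat, p = stats.shapiro(nearly_normal)
print(f"Shapiro-Wilk, n=5000, near-normal data: p = {p:.4f}")

small_skewed = rng.lognormal(size=12)        # tiny but genuinely skewed sample
stat2, p2 = stats.shapiro(small_skewed)
print(f"Shapiro-Wilk, n=12, lognormal data:     p = {p2:.4f}")
```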