t-Test vs Mann-Whitney U: Which Should You Use?

A data analyst runs a Shapiro-Wilk test on their A/B test data, gets p = 0.03, and switches from a t-test to the Mann-Whitney U. The result changes from significant (p = 0.041) to not significant (p = 0.067). They conclude the effect is not real. But here is the problem: with n = 45 per group, the t-test was perfectly valid despite the failed normality test, and the Mann-Whitney had lower power to detect the true effect. The analyst made the wrong call by mechanically following the "test for normality first" recipe.

Choosing between the t-test and the Mann-Whitney U is one of the most common decisions in applied statistics, and it is frequently made poorly. The textbook rule -- "use Mann-Whitney when data is non-normal" -- oversimplifies the decision. This guide covers what each test actually does, when the t-test is robust enough despite non-normality, when Mann-Whitney is genuinely necessary, and how the tests compare on statistical power.

What Each Test Does

The Independent Samples t-Test

The t-test compares the means of two independent groups. It calculates the difference between group means, divides by the standard error of that difference, and checks whether the resulting t-statistic is large enough to be unlikely under the null hypothesis of equal population means.
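That computation can be sketched in a few lines of pure Python (a standard-library illustration only; the function name t_statistic is ours, not a library API -- in practice you would call scipy.stats.ttest_ind):

```python
import math
from statistics import mean, variance  # variance() is the sample (n-1) variance

def t_statistic(a, b):
    """Student's t for two independent samples, pooled variance."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    # Standard error of the difference between the two means
    se = math.sqrt(sp2 * (1 / na + 1 / nb))
    t = (mean(a) - mean(b)) / se
    df = na + nb - 2  # degrees of freedom for the pooled test
    return t, df
```

The returned t is then compared against the t distribution with df degrees of freedom to get a p-value.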

The t-test assumes that the sampling distribution of the mean is approximately normal. This is guaranteed when the data itself is normal, but the Central Limit Theorem ensures it is approximately true for non-normal data as long as the sample size is moderate (roughly n >= 30 per group).

The Mann-Whitney U Test

The Mann-Whitney U (also called the Wilcoxon rank-sum test) works on ranks rather than raw values. It pools all observations from both groups, ranks them from smallest to largest, and tests whether one group tends to have higher ranks than the other.

Because it uses ranks, the Mann-Whitney is not affected by the shape of the underlying distribution. It does not assume normality. However, it also discards information about the magnitude of differences between values -- the gap between ranks 1 and 2 is the same whether the actual values differ by 0.1 or 1000.
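The rank mechanics are easy to make concrete. A minimal pure-Python sketch (the function name mann_whitney_u is illustrative; real analyses should use scipy.stats.mannwhitneyu, which also computes p-values):

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U from pooled ranks; ties get the average (mid) rank."""
    combined = sorted(a + b)
    # Map each value to its average rank in the pooled sample (1-based)
    avg_rank = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Values at positions i+1 .. j are tied; they share the mean rank
        avg_rank[combined[i]] = (i + 1 + j) / 2
        i = j
    r_a = sum(avg_rank[v] for v in a)            # rank sum of group A
    u_a = r_a - len(a) * (len(a) + 1) / 2        # U for group A
    u_b = len(a) * len(b) - u_a                  # U for group B
    return min(u_a, u_b)
```

Note that only the ordering of the pooled values enters the statistic -- replacing a value with any other value that keeps the same rank leaves U unchanged.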

Side-by-Side Comparison

Feature | t-Test | Mann-Whitney U
Tests for | Difference in means | Stochastic dominance (P(X > Y) != 0.5)
Data type | Continuous (interval/ratio) | Ordinal or continuous
Normality assumption | Yes (but robust with n >= 30) | No
Equal variance assumption | Yes (Welch's t-test relaxes this) | Similar shapes assumed for median interpretation
Power (normal data) | Higher (uses full metric information) | ~95% of t-test power
Power (heavy-tailed data) | Can be lower due to inflated variance | Often higher (robust to outliers)
Handles outliers | Sensitive (outliers inflate variance) | Robust (ranks cap outlier influence)
Handles ties | Not applicable | Uses a tie correction (common with ordinal data)
Effect size | Cohen's d (mean difference in SD units) | Rank-biserial correlation or common language effect size
Sample size needed | Fewer observations for same power (normal data) | 5-15% more observations for same power (normal data)

When the t-Test Is Robust Enough

The t-test is more robust than most practitioners realize. Simulation studies (Lumley et al., 2002; Fagerland, 2012) have shown that the t-test maintains accurate Type I error rates under a wide range of non-normal distributions when sample sizes are adequate.

The t-test works well even with non-normal data when:

- Sample sizes are moderate (n >= 30 per group), so the Central Limit Theorem applies
- Skewness is mild to moderate and similar in both groups
- There are no extreme outliers inflating the variance estimate
- The research question is genuinely about means

Key insight: The Shapiro-Wilk normality test should not be used as a gatekeeper for the t-test. With small samples, it lacks power to detect non-normality. With large samples (where the t-test is robust anyway), it rejects normality for trivial deviations. Visual inspection with a Q-Q plot is more informative.

When Mann-Whitney Is Necessary

There are genuine scenarios where the Mann-Whitney is the correct choice:

- The outcome is ordinal (e.g., satisfaction ratings), so means are not meaningful
- Samples are small and clearly non-normal, so the Central Limit Theorem cannot be relied on
- The data have heavy tails or outliers that would inflate the t-test's variance estimate
- The question is whether one group tends to produce larger values, not whether means differ

Common misconception: "Mann-Whitney tests medians." This is only true when both groups have the same distribution shape. If the distributions differ in shape or spread, Mann-Whitney can reject the null hypothesis even when the medians are identical. It actually tests whether a randomly selected observation from group A is equally likely to be greater than or less than a randomly selected observation from group B.
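That probability can be estimated directly from the data by counting pairs; the result equals U / (n1 * n2), the common language effect size mentioned in the comparison table. A minimal sketch (function name is ours, for illustration):

```python
def prob_greater(a, b):
    """Estimate P(X > Y) + 0.5 * P(X == Y): the probability that a random
    observation from group A exceeds one from group B (ties count half)."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)
    return wins / (len(a) * len(b))
```

A value near 0.5 means neither group tends to dominate; the Mann-Whitney null hypothesis is that this probability equals 0.5.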

Power Comparison

When data is truly normal, the t-test is the most powerful test for detecting mean differences. The Mann-Whitney achieves approximately 95.5% of the t-test's power -- the asymptotic relative efficiency (ARE) of the Mann-Whitney relative to the t-test is 3/pi = 0.955. In practical terms, if the t-test needs 100 observations per group, Mann-Whitney needs about 105 for the same power.
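The arithmetic behind that sample-size claim is a one-liner:

```python
import math

# Asymptotic relative efficiency of Mann-Whitney vs. the t-test
# under normality: 3/pi
are = 3 / math.pi                    # ~0.955
n_t = 100                            # observations per group for the t-test
n_mw = math.ceil(n_t / are)          # observations Mann-Whitney needs
                                     # for the same power
```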

However, this relationship reverses for heavy-tailed distributions:

- For the Laplace (double-exponential) distribution, the ARE of Mann-Whitney relative to the t-test is 1.5
- The ARE never falls below 0.864 for any continuous distribution, but it has no upper bound: for sufficiently heavy tails, Mann-Whitney can be arbitrarily more efficient

The practical takeaway: for clean, roughly symmetric data, the t-test wins on power. For data with outliers or heavy tails, Mann-Whitney can substantially outperform the t-test because outliers inflate the t-test's variance estimate without affecting ranks.
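A toy example (all numbers invented for illustration) makes the mechanism concrete: a single extreme value collapses the t statistic but barely moves U.

```python
import math
from statistics import mean, variance

def t_stat(a, b):
    """Pooled-variance Student's t statistic."""
    sp2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) \
        / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / len(a) + 1 / len(b)))

def u_stat(a, b):
    """Mann-Whitney U for group A (data here is tie-free)."""
    combined = sorted(a + b)
    rank_sum = sum(combined.index(v) + 1 for v in a)
    return rank_sum - len(a) * (len(a) + 1) / 2

a = [50, 52, 54, 56, 58]
b = [60, 62, 64, 66, 68]

t_clean, u_clean = t_stat(a, b), u_stat(a, b)
t_outlier, u_outlier = t_stat(a + [500], b), u_stat(a + [500], b)
# The outlier inflates the pooled variance, shrinking |t| dramatically,
# while its rank is capped at "largest" -- U moves by at most len(b)
```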

Worked Example: A/B Test on Session Duration

An e-commerce site tests two checkout flows. Dependent variable: time to complete checkout (seconds). N = 40 per group.

The Data Problem

Session durations are right-skewed: most users finish in 30-90 seconds, but some abandon and return later (producing values of 300-900 seconds). The Shapiro-Wilk test rejects normality (p < 0.001).

t-Test Result

Group A mean: 78.3s (SD = 62.1)
Group B mean: 64.1s (SD = 55.8)
t(78) = 1.07, p = 0.287
Cohen's d = 0.24

Mann-Whitney Result

Group A median: 52s (IQR: 35-89)
Group B median: 41s (IQR: 28-72)
U = 604, p = 0.041
Rank-biserial r = 0.245

Here the Mann-Whitney detects a significant difference that the t-test misses. Why? The handful of extreme session durations (users who left and returned) inflate the standard deviations, making the t-test's denominator large. The Mann-Whitney, working on ranks, is unaffected by the magnitude of those outliers. For this type of data, Mann-Whitney is the better choice.

Decision Guide

Use the t-test when:

- Data are continuous and roughly symmetric, or sample sizes are moderate (n >= 30 per group)
- The research question is about means
- You want the most powerful test for normal-ish data and an interpretable effect size (Cohen's d)

Use Mann-Whitney when:

- The outcome is ordinal
- Samples are small and clearly non-normal
- The data contain heavy tails or outliers
- You want to test whether one group tends to produce larger values

Consider alternatives when:

- Variances are clearly unequal: Welch's t-test relaxes the equal-variance assumption
- Observations are paired rather than independent: use the paired t-test or the Wilcoxon signed-rank test
- Data are positive and right-skewed (durations, incomes): a log transformation followed by a t-test is often effective

Run Both Tests in Seconds

MCP Analytics automatically checks your data distribution and runs the appropriate test -- t-test, Welch's t-test, or Mann-Whitney U -- with effect sizes, confidence intervals, and assumption diagnostics. Upload your data and get results without writing code.

Start free | See pricing

Frequently Asked Questions

Is the Mann-Whitney U test the same as the Wilcoxon rank-sum test?

Yes, they are mathematically equivalent. Different software uses different names: SciPy has both mannwhitneyu() and ranksums(), R uses wilcox.test(). Do not confuse the Wilcoxon rank-sum (for independent samples) with the Wilcoxon signed-rank test (for paired samples).
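The equivalence is easy to check in SciPy (recent versions with the method argument; the sample values below are made up and tie-free):

```python
from scipy.stats import mannwhitneyu, ranksums

a = [1.2, 3.4, 5.6, 7.8, 9.1, 11.3]
b = [2.1, 4.3, 6.5, 8.7, 10.9, 12.2]

# Same test under two names: with the normal approximation and no
# continuity correction, the p-values agree for tie-free data
u_res = mannwhitneyu(a, b, alternative="two-sided",
                     method="asymptotic", use_continuity=False)
w_res = ranksums(a, b, alternative="two-sided")
```

Note that mannwhitneyu reports the U statistic while ranksums reports a z statistic, so the statistics differ even when the p-values match.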

How large does my sample need to be for the t-test to be robust?

With n >= 30 per group, the t-test is robust for most distributions encountered in practice. For moderately skewed data, n >= 15 is often sufficient. For severely skewed distributions (exponential, log-normal) or heavy-tailed distributions, you may need n >= 50 or should use Mann-Whitney.

Does Mann-Whitney test medians or mean ranks?

Mann-Whitney tests stochastic dominance: whether one group tends to produce larger values than the other. It is commonly described as a test of medians, but this interpretation is only valid when both groups have the same distribution shape. When shapes differ, Mann-Whitney can reject the null even when medians are equal.

Should I run a normality test before choosing?

Running a normality test as a gatekeeper is common but statistically questionable. With small samples, normality tests lack power. With large samples, they reject for trivial deviations. Instead, examine your data visually (histogram, Q-Q plot), consider the variable type (ordinal? bounded?), and use domain knowledge about the expected distribution.