t-Test vs Mann-Whitney U: Which Should You Use?
A data analyst runs a Shapiro-Wilk test on their A/B test data, gets p = 0.03, and switches from a t-test to the Mann-Whitney U. The result changes from significant (p = 0.041) to not significant (p = 0.067). They conclude the effect is not real. But here is the problem: with n = 45 per group, the t-test was perfectly valid despite the failed normality test, and the Mann-Whitney had lower power to detect the true effect. The analyst made the wrong call by mechanically following the "test for normality first" recipe.
Choosing between the t-test and the Mann-Whitney U is one of the most common decisions in applied statistics, and it is frequently made poorly. The textbook rule -- "use Mann-Whitney when data is non-normal" -- oversimplifies the decision. This guide covers what each test actually does, when the t-test is robust enough despite non-normality, when Mann-Whitney is genuinely necessary, and how the tests compare on statistical power.
What Each Test Does
The Independent Samples t-Test
The t-test compares the means of two independent groups. It calculates the difference between group means, divides by the standard error of that difference, and checks whether the resulting t-statistic is large enough to be unlikely under the null hypothesis of equal population means.
The t-test assumes that the sampling distribution of the mean is approximately normal. This is guaranteed when the data itself is normal, but the Central Limit Theorem ensures it is approximately true for non-normal data as long as the sample size is moderate (roughly n >= 30 per group).
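You can see the Central Limit Theorem at work with a quick simulation. Here is a minimal sketch (the exponential distribution and the sample sizes are arbitrary illustrative choices): means of n = 30 draws from a heavily skewed distribution are far closer to symmetric than the raw data.

```python
# Sketch: the CLT in action. Means of n = 30 draws from a skewed
# (exponential) distribution are much closer to normal than the raw data.
# Distribution and sample sizes here are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

raw = rng.exponential(scale=1.0, size=100_000)            # heavily right-skewed
means = rng.exponential(scale=1.0, size=(10_000, 30)).mean(axis=1)

print(f"skew of raw data:     {stats.skew(raw):.2f}")     # near 2 for exponential
print(f"skew of sample means: {stats.skew(means):.2f}")   # much closer to 0
```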
The Mann-Whitney U Test
The Mann-Whitney U (also called the Wilcoxon rank-sum test) works on ranks rather than raw values. It pools all observations from both groups, ranks them from smallest to largest, and tests whether one group tends to have higher ranks than the other.
Because it uses ranks, the Mann-Whitney is not affected by the shape of the underlying distribution. It does not assume normality. However, it also discards information about the magnitude of differences between values -- the gap between ranks 1 and 2 is the same whether the actual values differ by 0.1 or 1000.
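Both tests are a single call in SciPy. A minimal sketch on simulated data (the group arrays, means, and sample sizes below are placeholders; substitute your own):

```python
# Sketch: running both tests on the same two samples with SciPy.
# The data is simulated; swap in your own arrays.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=45)
group_b = rng.normal(loc=11.0, scale=2.0, size=45)

# t-test compares means (equal_var=False gives Welch's version, a safe default)
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney compares rank distributions
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney: U = {u_stat:.0f}, p = {u_p:.4f}")
```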
Side-by-Side Comparison
| Feature | t-Test | Mann-Whitney U |
|---|---|---|
| Tests for | Difference in means | Stochastic dominance (P(X > Y) != 0.5) |
| Data type | Continuous (interval/ratio) | Ordinal or continuous |
| Normality assumption | Yes (but robust with n >= 30) | No |
| Equal variance assumption | Yes (Welch's t-test relaxes this) | Similar shape assumed for median interpretation |
| Power (normal data) | Higher (uses full metric information) | ~95% of t-test power |
| Power (heavy-tailed data) | Can be lower due to inflated variance | Often higher (robust to outliers) |
| Handles outliers | Sensitive (outliers inflate variance) | Robust (ranks cap outlier influence) |
| Handles ties | Not applicable | Uses correction for ties (common with ordinal data) |
| Effect size | Cohen's d (mean difference in SD units) | Rank-biserial correlation or common language effect size |
| Sample size needed | Fewer observations for same power (normal data) | ~5% more observations for same power (normal data) |
When the t-Test Is Robust Enough
The t-test is more robust than most practitioners realize. Simulation studies (Lumley et al., 2002; Fagerland, 2012) have shown that the t-test maintains accurate Type I error rates under a wide range of non-normal distributions when sample sizes are adequate.
The t-test works well even with non-normal data when:
- n >= 30 per group. The Central Limit Theorem ensures the sampling distribution of the mean is approximately normal, regardless of the data distribution. With n = 30, even exponential and uniform distributions produce valid t-test results.
- Distributions are symmetric but non-normal. Symmetric distributions (uniform, bimodal symmetric, light-tailed) pose minimal problems for the t-test even with small samples (n >= 15).
- Sample sizes are equal. Equal group sizes make the t-test robust to both non-normality and unequal variances simultaneously. With unequal sizes plus non-normality, use Welch's t-test.
- You care about mean differences. If the business question is about average performance (average revenue, mean response time), the t-test directly answers that question. Mann-Whitney answers a different question (stochastic dominance).
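You can verify this robustness yourself by simulation. The sketch below (simulation count and exponential distribution are illustrative choices) draws both groups from the same strongly skewed distribution with n = 30 and checks that the t-test's false-positive rate stays near the nominal 5%.

```python
# Sketch: checking the t-test's Type I error rate by simulation when the
# data is exponential (strongly non-normal) but n = 30 per group.
# Under the null (identical distributions) the rejection rate should sit
# near the nominal 5%. The simulation count is an arbitrary choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims, alpha = 30, 5_000, 0.05

rejections = 0
for _ in range(sims):
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)
    _, p = stats.ttest_ind(a, b, equal_var=False)
    rejections += p < alpha

print(f"empirical Type I error: {rejections / sims:.3f}")  # close to 0.05
```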
When Mann-Whitney Is Necessary
There are genuine scenarios where the Mann-Whitney is the correct choice:
- Ordinal data. Likert-scale responses (1-5 ratings), satisfaction rankings, or severity grades are ordinal. Means of ordinal data are not well-defined, so the t-test's interpretation breaks down. Mann-Whitney compares rank distributions directly.
- Small samples with severe skew. With n < 15 per group and heavily skewed data (income, response times, purchase amounts), the sampling distribution of the mean is not yet approximately normal. The t-test may produce inflated Type I error rates.
- Outlier-dominated data. If a few extreme values dominate the variance, the t-test loses power because the standard error inflates. Mann-Whitney ranks cap the influence of outliers. Revenue data with a few whale customers is a classic example.
- The research question is about ranks, not means. "Does treatment A tend to produce better outcomes than treatment B?" is a rank question. If you care about superiority rather than the size of the average difference, Mann-Whitney directly answers your question.
- Data is bounded or truncated. Floor and ceiling effects (e.g., a pain scale of 0-10 where many patients score 0 or 10) create non-normal distributions that violate t-test assumptions even with moderate samples.
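For the ordinal case, here is a minimal sketch on made-up Likert data. The response counts below are hypothetical; note that SciPy's `mannwhitneyu` applies the tie correction automatically, which matters because ordinal data is full of ties.

```python
# Sketch: Mann-Whitney on ordinal Likert responses (1-5), where a mean is
# hard to justify. SciPy's mannwhitneyu applies a tie correction
# automatically. The response counts below are made up for illustration.
import numpy as np
from scipy import stats

# Hypothetical satisfaction ratings (1-5) for two product versions, n = 40 each
version_a = np.repeat([1, 2, 3, 4, 5], [2, 5, 10, 15, 8])
version_b = np.repeat([1, 2, 3, 4, 5], [6, 12, 11, 8, 3])

u_stat, p_value = stats.mannwhitneyu(version_a, version_b,
                                     alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.4f}")
```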
Power Comparison
When data is truly normal, the t-test is the most powerful test for detecting mean differences. The Mann-Whitney achieves approximately 95.5% of the t-test's power -- the asymptotic relative efficiency (ARE) of the Mann-Whitney relative to the t-test is 3/pi ≈ 0.955. In practical terms, if the t-test needs 100 observations per group, Mann-Whitney needs about 105 for the same power.
However, this relationship reverses for heavy-tailed distributions:
- Logistic distribution: Mann-Whitney has ARE = 1.097 (about 10% more powerful than the t-test)
- Double exponential: ARE = 1.5 (50% more powerful)
- Contaminated normal (5% outliers): ARE can exceed 2.0
The practical takeaway: for clean, roughly symmetric data, the t-test wins on power. For data with outliers or heavy tails, Mann-Whitney can substantially outperform the t-test because outliers inflate the t-test's variance estimate without affecting ranks.
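The contaminated-normal case is easy to reproduce by simulation. The sketch below (mixture weights, shift size, and simulation count are all illustrative choices) plants a true mean shift in data that is 95% clean and 5% wild, then counts how often each test detects it.

```python
# Sketch: comparing power by simulation on a contaminated normal
# (95% N(0,1), 5% N(0,10)) with a true mean shift. All parameters are
# illustrative. Outliers inflate the t-test's variance; ranks cap them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, sims, shift = 40, 2_000, 0.6

def contaminated(size):
    base = rng.normal(0, 1, size)
    mask = rng.random(size) < 0.05       # 5% of points come from N(0, 10)
    base[mask] = rng.normal(0, 10, mask.sum())
    return base

t_wins = mw_wins = 0
for _ in range(sims):
    a = contaminated(n)
    b = contaminated(n) + shift          # true difference between groups
    t_wins += stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05
    mw_wins += stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < 0.05

print(f"t-test power:       {t_wins / sims:.2f}")
print(f"Mann-Whitney power: {mw_wins / sims:.2f}")
```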
Worked Example: A/B Test on Session Duration
An e-commerce site tests two checkout flows. Dependent variable: time to complete checkout (seconds). n = 40 per group.
The Data Problem
Session durations are right-skewed: most users finish in 30-90 seconds, but some abandon and return later (producing values of 300-900 seconds). The Shapiro-Wilk test rejects normality (p < 0.001).
t-Test Result
Group A mean: 78.3s (SD = 62.1)
Group B mean: 64.1s (SD = 55.8)
t(78) = 1.07, p = 0.287
Cohen's d = 0.24
Mann-Whitney Result
Group A median: 52s (IQR: 35-89)
Group B median: 41s (IQR: 28-72)
U = 604, p = 0.041
Rank-biserial r = 0.245
Here the Mann-Whitney detects a significant difference that the t-test misses. Why? The handful of extreme session durations (users who left and returned) inflate the standard deviations, making the t-test's denominator large. The Mann-Whitney, working on ranks, is unaffected by the magnitude of those outliers. For this type of data, Mann-Whitney is the better choice.
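An analysis in the same spirit can be sketched in a few lines. The data below is generated (lognormal session times plus a few hypothetical "left and came back" outliers), not the article's actual dataset, so the exact statistics will differ; the point is the workflow, including both effect sizes.

```python
# Sketch: simulated checkout times in the spirit of the example above --
# lognormal sessions plus a few extreme "returner" outliers. Generated
# data, so the numbers will not match the article's.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 40

def sessions(typical_secs):
    base = rng.lognormal(mean=np.log(typical_secs), sigma=0.6, size=n)
    base[:3] += rng.uniform(300, 900, size=3)   # a few extreme returners
    return base

flow_a = sessions(52)
flow_b = sessions(41)

t_res = stats.ttest_ind(flow_a, flow_b, equal_var=False)
u_res = stats.mannwhitneyu(flow_a, flow_b, alternative="two-sided")

# Effect sizes: Cohen's d (pooled SD) and rank-biserial correlation
# (one common formula, r = 1 - 2U / (n1 * n2); sign depends on group order)
pooled_sd = np.sqrt((flow_a.var(ddof=1) + flow_b.var(ddof=1)) / 2)
cohens_d = (flow_a.mean() - flow_b.mean()) / pooled_sd
rank_biserial = 1 - 2 * u_res.statistic / (n * n)

print(f"t-test:       p = {t_res.pvalue:.3f}, d = {cohens_d:.2f}")
print(f"Mann-Whitney: p = {u_res.pvalue:.3f}, r = {rank_biserial:.2f}")
```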
Decision Guide
Use the t-test when:
- Data is continuous (interval or ratio scale)
- n >= 30 per group, regardless of distribution shape
- n >= 15 per group with roughly symmetric data
- You want to estimate the magnitude of the mean difference
- Equal (or similar) sample sizes
- No extreme outliers dominating variance
Use Mann-Whitney when:
- Data is ordinal (Likert scales, rankings)
- Small samples (n < 15) with severely skewed data
- Extreme outliers are present and meaningful (not errors)
- Data is bounded with floor/ceiling effects
- You care about rank ordering rather than mean differences
- Distribution shapes differ substantially between groups
Consider alternatives when:
- You have paired/matched data (use paired t-test or Wilcoxon signed-rank)
- You have 3+ groups (use ANOVA or Kruskal-Wallis)
- You want to control for covariates (use regression or ANCOVA)
- The outcome is binary (use chi-square or Fisher's exact test)
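The main branches of the guide above can be sketched as a rough helper function. The thresholds (n >= 30, n >= 15, a skewness cutoff of 2) mirror this guide's rules of thumb; they are heuristics, not hard laws, and the function name and cutoff are our own choices.

```python
# Sketch: the decision guide as a rough heuristic. Thresholds mirror the
# rules of thumb in this guide; they are not hard statistical laws.
import numpy as np
from scipy import stats

def suggest_test(a, b, ordinal=False, skew_cutoff=2.0):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = min(len(a), len(b))
    if ordinal:
        return "mann-whitney"            # means of ordinal data are ill-defined
    if n >= 30:
        return "welch-t"                 # CLT makes the t-test robust here
    max_skew = max(abs(stats.skew(a)), abs(stats.skew(b)))
    if n >= 15 and max_skew < skew_cutoff:
        return "welch-t"                 # roughly symmetric, moderate n
    return "mann-whitney"                # small n and/or severe skew

rng = np.random.default_rng(5)
print(suggest_test(rng.exponential(size=10), rng.exponential(size=10)))
```

A function like this is only a starting point; it cannot see floor effects, meaningful outliers, or what your business question actually asks about.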
Run Both Tests in Seconds
MCP Analytics automatically checks your data distribution and runs the appropriate test -- t-test, Welch's t-test, or Mann-Whitney U -- with effect sizes, confidence intervals, and assumption diagnostics. Upload your data and get results without writing code.
Frequently Asked Questions
Is the Mann-Whitney U test the same as the Wilcoxon rank-sum test?
Yes, they are mathematically equivalent. Different software uses different names: SciPy has both mannwhitneyu() and ranksums(), R uses wilcox.test(). Do not confuse the Wilcoxon rank-sum (for independent samples) with the Wilcoxon signed-rank test (for paired samples).
How large does my sample need to be for the t-test to be robust?
With n >= 30 per group, the t-test is robust for most distributions encountered in practice. For moderately skewed data, n >= 15 is often sufficient. For severely skewed distributions (exponential, log-normal) or heavy-tailed distributions, you may need n >= 50 or should use Mann-Whitney.
Does the Mann-Whitney test compare medians?
Mann-Whitney tests stochastic dominance: whether one group tends to produce larger values than the other. It is commonly described as a test of medians, but this interpretation is only valid when both groups have the same distribution shape. When shapes differ, Mann-Whitney can reject the null even when medians are equal.
Should I test for normality before choosing a test?
Running a normality test as a gatekeeper is common but statistically questionable. With small samples, normality tests lack power. With large samples, they reject for trivial deviations. Instead, examine your data visually (histogram, Q-Q plot), consider the variable type (ordinal? bounded?), and use domain knowledge about the expected distribution.
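Both failure modes are easy to demonstrate. In the sketch below (distributions and sample sizes are illustrative choices), a t distribution with 20 degrees of freedom is nearly indistinguishable from normal in practice, yet at n = 5,000 Shapiro-Wilk will usually flag it; meanwhile a tiny, genuinely skewed lognormal sample can slip through.

```python
# Sketch: why Shapiro-Wilk is a poor gatekeeper. At large n it tends to
# reject trivial deviations (a t distribution with 20 df is practically
# normal); at small n it can miss real skew. Distributions and sizes here
# are illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

nearly_normal = stats.t.rvs(df=20, size=5_000, random_state=rng)
stat, p = stats.shapiro(nearly_normal)
print(f"Shapiro-Wilk, n=5000, near-normal data: p = {p:.4f}")

small_skewed = rng.lognormal(size=12)        # tiny but genuinely skewed sample
stat2, p2 = stats.shapiro(small_skewed)
print(f"Shapiro-Wilk, n=12, lognormal data:     p = {p2:.4f}")
```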