You have two groups and a number you care about — satisfaction scores, salaries, response times, conversion metrics. But the data is skewed, full of outliers, or measured on an ordinal scale. A t-test assumes normality; the Mann-Whitney U test does not. It compares ranks instead of means, making it the go-to test when your data breaks the rules that parametric tests require. Upload a CSV and get results in under 60 seconds.
What Is the Mann-Whitney U Test?
The Mann-Whitney U test is the non-parametric alternative to the independent samples t-test. Where the t-test compares the means of two groups and assumes the data follows a bell curve, the Mann-Whitney compares the ranks of observations and makes no assumption about the shape of the distribution. It answers a simple question: if you pick a random observation from Group A and a random observation from Group B, is one more likely to be larger than the other?
Here is how it works intuitively. Take all observations from both groups and line them up from smallest to largest. Assign each observation a rank — 1 for the smallest, 2 for the next, and so on. Now add up the ranks for each group separately. If Group A's values tend to be higher, Group A will have larger ranks and a larger rank sum. The U statistic measures how much one group's ranks dominate the other's. A very large or very small U relative to what you would expect by chance means the groups are genuinely different.
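The rank-sum arithmetic described above can be sketched in a few lines of R with toy data. In practice `wilcox.test()` does all of this for you; the values here are purely illustrative:

```r
# Toy data: four observations per group (illustrative values).
a <- c(12, 15, 9, 18)   # Group A
b <- c(7, 11, 13, 8)    # Group B

ranks <- rank(c(a, b))                          # rank all 8 values together
r_a   <- sum(ranks[seq_along(a)])               # Group A's rank sum (23 here)
u_a   <- r_a - length(a) * (length(a) + 1) / 2  # U statistic for Group A (13)

# Base R's wilcox.test() reports the same value, labeled W:
w <- unname(wilcox.test(a, b)$statistic)
```

Because Group A holds most of the high ranks here (23 of the maximum possible 26), its U statistic sits well above the midpoint, which is exactly the kind of imbalance the test measures.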
Because the test works with ranks rather than raw values, it is naturally resistant to outliers and skew. A single extreme value that would drag a mean (and wreck a t-test) gets the same rank treatment as any other observation — it just gets the highest rank. This makes the Mann-Whitney especially valuable for real-world business data, which is almost never perfectly normal.
When to Use the Mann-Whitney U Test
The most common scenario is comparing two groups when your data is not normally distributed. Customer satisfaction scores on a 1-to-5 scale are ordinal — the distance between "satisfied" and "very satisfied" is not necessarily the same as between "neutral" and "satisfied." Running a t-test on ordinal data is technically questionable; running a Mann-Whitney is clean. You rank the scores and compare whether one group tends to rate higher than the other.
Salary comparisons are another textbook case. Salary data is almost always right-skewed — most people cluster in a range, but a few high earners stretch the tail. Comparing mean salaries between two departments will be heavily influenced by those outliers. The Mann-Whitney compares the typical salary (via ranks) without letting a few executives distort the picture.
In medical and clinical research, the Mann-Whitney is standard for comparing treatment outcomes when sample sizes are small. With only 15 patients per group, you cannot reliably check whether the data is normal. The Mann-Whitney sidesteps the question entirely — it works regardless of the distribution shape, making it the safer choice when you have limited data.
A/B testing with non-normal metrics is another strong fit. Time-on-page, revenue per session, and support ticket resolution times are typically right-skewed with a long tail. If you are comparing a new checkout flow against the old one and your metric is revenue per visitor, the Mann-Whitney will give you a more honest comparison than a t-test that is being pulled around by a handful of large orders.
What Data Do You Need?
You need a CSV with two columns: a group column that identifies which of the two groups each observation belongs to (like "Treatment" vs. "Control", "Plan A" vs. "Plan B", or "Before" vs. "After"), and a measurement column with the numeric or ordinal values you want to compare (like satisfaction score, salary, response time, or conversion value).
The group column must contain exactly two distinct values. If you have three or more groups, you need the Kruskal-Wallis test instead — it is the non-parametric extension of this test for multiple groups, just as ANOVA extends the t-test.
There is no strict minimum sample size, but aim for at least five observations per group. The test works with very small samples (it was designed for exactly that), but statistical power — your ability to detect a real difference — increases with more data. With fewer than five per group, even a large real difference may not reach significance.
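In R terms, a minimal data frame with the expected shape looks like this. The column names are illustrative; your CSV can use any names:

```r
# Hypothetical two-column layout: one group label, one measurement.
df <- data.frame(
  group = c("Treatment", "Treatment", "Treatment",
            "Control", "Control", "Control"),
  score = c(4, 5, 3, 2, 4, 3)
)

# The group column must contain exactly two distinct values.
stopifnot(length(unique(df$group)) == 2)
```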
Module Parameters
The module accepts three optional parameters that control the test behavior:
- Alternative hypothesis — choose "two.sided" (default, tests for any difference), "less" (Group A tends to be lower), or "greater" (Group A tends to be higher). Use a one-sided test only when you have a strong prior expectation about the direction.
- Confidence level — the confidence level used for the reported interval around the group difference, defaulting to 0.95 (the conventional 95%). Adjust if your domain requires stricter thresholds (e.g., 0.99 for clinical work).
- Continuity correction — applies a continuity correction to the normal approximation of the U statistic, enabled by default. This matters mainly for small samples where the normal approximation is less accurate.
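These three parameters map directly onto arguments of R's `wilcox.test()`. A sketch with simulated skewed data (the variable names are illustrative):

```r
set.seed(1)
df <- data.frame(
  group = rep(c("A", "B"), each = 20),
  value = c(rexp(20, rate = 1), rexp(20, rate = 0.5))  # skewed; B tends higher
)

res <- wilcox.test(value ~ group, data = df,
                   alternative = "two.sided",  # or "less" / "greater"
                   conf.int    = TRUE,
                   conf.level  = 0.95,         # confidence level for the interval
                   correct     = TRUE)         # continuity correction
res$p.value
```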
How to Read the Report
The report is structured into ten sections, each designed to answer a specific question about your data and results.
Analysis Overview
The overview card sets the stage: how many observations are in each group, what measurement is being compared, and which test variant is being run (one-sided vs. two-sided, with or without continuity correction). This is your sanity check — make sure the groups and sample sizes match what you expect before diving into results.
Data Preprocessing
Before running the test, the module examines your data for issues that could affect results. It reports how many rows were used, whether any were excluded (missing values, non-numeric entries), and whether the data needed any cleaning. If a significant number of rows were dropped, this section tells you why.
Executive Summary
The TL;DR. This card gives you the bottom line in plain language: is there a statistically significant difference between the two groups? It states the U statistic, the p-value, the effect size, and what they mean in context. If you only read one section of the report, read this one.
Distribution Comparison
Overlapping density plots or histograms show the shape of each group's data. This is where you can visually assess whether the distributions overlap heavily (suggesting similar groups) or are shifted apart (suggesting a real difference). You can also see skewness, bimodality, or outliers that might influence interpretation. The Mann-Whitney does not require the distributions to have the same shape, but knowing the shapes helps you understand what "different" means — is it a location shift, a spread difference, or something more complex?
Box Plot Comparison
Side-by-side box plots show the median, interquartile range (25th to 75th percentile), whiskers, and outliers for each group. Box plots are the fastest way to visually compare two groups. If the boxes do not overlap, the difference is likely significant. If the medians are close but one group has a much wider spread, the groups may differ in variability rather than central tendency. Outliers appear as individual points beyond the whiskers — the Mann-Whitney handles these gracefully because it works with ranks.
Rank Distribution
This section shows how the ranks are distributed across the two groups. Since the Mann-Whitney works by ranking all observations together, this visualization makes the test's logic transparent. If one group dominates the high ranks, you can see it directly. The rank sum for each group is reported alongside the expected rank sum under the null hypothesis of no difference.
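Under the null hypothesis every rank is equally likely to land in either group, so each group's expected rank sum is its size times the average rank, (N + 1) / 2. A quick check in R with hypothetical group sizes:

```r
n_a <- 15
n_b <- 20
N <- n_a + n_b

expected_r_a <- n_a * (N + 1) / 2  # 15 * 36 / 2 = 270
expected_r_b <- n_b * (N + 1) / 2  # 20 * 36 / 2 = 360

# Sanity check: the two expected sums together cover all ranks 1..N.
stopifnot(expected_r_a + expected_r_b == N * (N + 1) / 2)
```

An observed rank sum far from its expected value is what drives the U statistic toward significance.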
Test Results
The core statistical output. This card reports the U statistic (or W statistic, which is mathematically equivalent), the p-value, and the test conclusion at your chosen significance level. The p-value is the probability of observing a U statistic this extreme if the two groups were actually drawn from the same distribution. A p-value below 0.05 (or your chosen threshold) means you can reject the null hypothesis — the groups differ. The card also reports whether the exact p-value or the normal approximation was used, which depends on sample size and whether there are ties in the data.
Effect Size
Statistical significance tells you that a difference exists; effect size tells you whether it matters. The report calculates the rank-biserial correlation (r), which ranges from -1 to +1. Values near 0 mean virtually no difference in ranks between groups. Values above 0.3 suggest a medium effect; above 0.5, a large effect. A statistically significant result with an effect size of 0.05 means the difference is real but negligible in practice — probably not worth changing your strategy over. A large effect size, even with a borderline p-value, suggests a meaningful difference worth investigating further.
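One common formula derives the rank-biserial correlation directly from the U statistic: r = 2U / (n1 n2) − 1, where U belongs to the first group. The module's exact implementation may differ in sign convention, but a sketch looks like this:

```r
a <- c(12, 15, 9, 18)
b <- c(7, 11, 13, 8)

u    <- unname(wilcox.test(a, b)$statistic)  # U (reported as W) for group a
r_rb <- 2 * u / (length(a) * length(b)) - 1  # rank-biserial correlation

# r_rb is 0.625 here: group a tends to outrank group b by a large margin.
```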
Summary Statistics
Descriptive statistics for each group: count, mean, median, standard deviation, min, max, and quartiles. While the Mann-Whitney is a rank-based test and does not formally compare means or medians, these statistics give you the practical context. If Group A has a median satisfaction score of 4.2 and Group B has 3.6, that tells you the magnitude of the difference in the original units — something the U statistic alone does not convey.
Normality Diagnostics
Even though the Mann-Whitney does not require normality, the report runs normality checks (Shapiro-Wilk test) on each group. This serves two purposes. First, it validates your choice of test — if both groups pass the normality check, you might get tighter confidence intervals by switching to a t-test. Second, it documents the distributional characteristics of your data for anyone reviewing the analysis. If the Shapiro-Wilk p-value is below 0.05, the data is significantly non-normal, confirming that the Mann-Whitney was the right call.
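The diagnostic itself is a single call per group in R. A sketch with a simulated right-skewed sample:

```r
set.seed(42)
skewed <- rexp(30)       # right-skewed sample, so non-normal by construction
sw <- shapiro.test(skewed)
sw$p.value               # a small value flags significant non-normality
```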
Real-World Examples
Customer satisfaction ratings. You survey customers on two support channels — live chat and email — using a 1-to-5 satisfaction scale. The data is ordinal and cannot be normally distributed (it only takes five values). The Mann-Whitney tests whether chat customers tend to give higher ratings than email customers, without pretending the scale is continuous.
Salary comparison across departments. You want to know if the engineering department pays more than marketing. Salary distributions are right-skewed, and your sample is only 40 people per department. The Mann-Whitney compares the typical salary without being distorted by the CTO's compensation package.
Medical treatment outcomes. A pilot study with 12 patients per arm measures pain reduction on a subjective 0-10 scale. With 12 observations, you cannot meaningfully test for normality. The Mann-Whitney gives you a valid comparison without requiring distributional assumptions that small samples cannot verify.
A/B test with revenue data. You are testing a new pricing page and measuring revenue per visitor. Most visitors generate $0 (they do not buy), some generate $20-100, and a few generate $500+. This distribution is massively right-skewed with a spike at zero. The Mann-Whitney compares the groups fairly, while a t-test on this data would be dominated by the few large purchases.
When to Use Something Else
If your data is approximately normally distributed and you have a reasonable sample size (say, 20+ per group), use a t-test instead. The t-test is more powerful than the Mann-Whitney when its assumptions are met — it is better at detecting real differences because it uses all the information in the data, not just the ranks. The normality diagnostics in the report tell you whether this is the case.
If your two groups are paired — the same subjects measured twice, like before-and-after a treatment — you need the Wilcoxon signed-rank test, not the Mann-Whitney. The Mann-Whitney assumes independent groups. Using it on paired data throws away the pairing information and loses statistical power.
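In R the switch is a single argument: `paired = TRUE` turns `wilcox.test()` into the signed-rank test. Hypothetical before/after scores:

```r
before <- c(6, 8, 7, 5, 9, 6)
after  <- c(4, 7, 5, 4, 6, 3)

# paired = TRUE pairs observations element-wise and tests the differences.
res <- wilcox.test(before, after, paired = TRUE)
res$p.value
```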
If you have three or more groups, the Mann-Whitney cannot help directly. Use the Kruskal-Wallis test — it extends the Mann-Whitney logic to any number of groups, just as ANOVA extends the t-test. Running multiple pairwise Mann-Whitney tests without correction inflates your false positive rate.
If you want maximum flexibility and are willing to trade simplicity for power, consider a permutation test. Permutation tests make even fewer assumptions than the Mann-Whitney and can test any statistic (mean, median, ratio), not just ranks. They are computationally intensive but conceptually straightforward.
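A minimal permutation test on the difference in means might look like the sketch below. This is an illustration of the idea, not code the module runs:

```r
set.seed(1)
a <- rexp(30, rate = 1)
b <- rexp(30, rate = 0.6)

obs    <- mean(a) - mean(b)
pooled <- c(a, b)

# Reshuffle group labels many times and record the difference in means
# that arises from chance alone.
perm <- replicate(5000, {
  idx <- sample(length(pooled), length(a))
  mean(pooled[idx]) - mean(pooled[-idx])
})

p_val <- mean(abs(perm) >= abs(obs))  # two-sided permutation p-value
```

Swapping `mean` for `median` (or any other statistic) is a one-line change, which is the flexibility the permutation approach buys you.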
The R Code Behind the Analysis
Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. This is not AI-generated code that changes every run. The same data produces the same analysis every time.
The analysis uses wilcox.test() from base R — the standard implementation of the Mann-Whitney U test (also known as the Wilcoxon rank-sum test). This is the same function used in academic research, textbooks, and peer-reviewed publications. The report also calculates the rank-biserial correlation for effect size, runs shapiro.test() for normality diagnostics, and generates distribution and box plot comparisons using ggplot2. Every step is visible in the code tab of your report, so you or a statistician can verify exactly what was done.
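A condensed sketch of that pipeline, using simulated data (the report's actual code, shown in full in the code tab, will differ in details):

```r
set.seed(7)
df <- data.frame(
  group = rep(c("A", "B"), each = 25),
  value = c(rexp(25, rate = 1), rexp(25, rate = 0.5))
)

# Core test, with a confidence interval for the location shift.
res <- wilcox.test(value ~ group, data = df, conf.int = TRUE)

# Effect size: rank-biserial correlation from the reported W statistic.
n_a <- 25
n_b <- 25
r_rb <- 2 * unname(res$statistic) / (n_a * n_b) - 1

# Normality diagnostics, one Shapiro-Wilk p-value per group.
sw_p <- tapply(df$value, df$group, function(v) shapiro.test(v)$p.value)

# Box plot comparison (only if ggplot2 is available).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(df, aes(group, value)) + geom_boxplot())
}
```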