Statistical Tests: Which Hypothesis Test Should You Use?

You have data and a question. Revenue went up after a campaign -- was it real or noise? Two customer segments show different churn rates -- is the gap statistically meaningful? Choosing the wrong test can produce a confident-looking p-value that means nothing, or miss a genuine effect hiding in your data. This guide maps every common scenario to the right statistical test so you stop guessing and start deciding.

Why Choosing the Right Test Matters

Every hypothesis test balances two risks. A Type I error (false positive) occurs when you declare a difference real when it is actually random variation -- you roll out a pricing change that had no true effect. A Type II error (false negative) happens when you miss a real difference -- you abandon a winning campaign because the test lacked statistical power to detect its impact.

The test you choose determines how well you control these risks. Parametric tests like the t-test assume your data follows a normal distribution and offer maximum statistical power when those assumptions hold. Use them on non-normal data, however, and your p-values become unreliable -- you either see effects that are not there or miss ones that are. Nonparametric alternatives like the Mann-Whitney U test make fewer assumptions at the cost of slightly less power, but they produce trustworthy results across a wider range of data shapes.

Matching the test to the data also matters for the number of groups you compare. A t-test handles two groups. Comparing three or more groups with repeated t-tests compounds your false-positive rate -- with 10 groups and 45 pairwise comparisons, you have roughly a 90% chance of at least one spurious finding at alpha = 0.05. ANOVA and its nonparametric counterparts exist precisely to solve this problem.
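The inflation is easy to quantify: across m independent tests, the family-wise error rate is 1 - (1 - alpha)^m. A quick check of the 45-comparison figure:

```python
# Family-wise error rate: probability of at least one false positive
# across m independent tests, each run at significance level alpha
alpha = 0.05
m = 10 * 9 // 2  # 45 pairwise comparisons among 10 groups
fwer = 1 - (1 - alpha) ** m
print(f"FWER across {m} tests: {fwer:.0%}")
```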

Rule of Thumb

If your data is continuous and roughly normal, use parametric tests (t-test, ANOVA). If it is ordinal, heavily skewed, or has outliers, use nonparametric tests (Mann-Whitney, Kruskal-Wallis). If it is categorical (counts), use Chi-Square or Fisher's Exact. When in doubt, run both -- if they agree, report the parametric result; if they disagree, trust the nonparametric.


Quick Comparison: Every Test at a Glance

| Test | Data Type | Groups | Assumptions |
| --- | --- | --- | --- |
| T-Test | Continuous | 2 (independent or paired) | Normality, equal variance (Welch's relaxes this) |
| ANOVA | Continuous | 3+ | Normality, homogeneity of variance |
| Chi-Square Test | Categorical (counts) | 2+ categories | Expected frequency ≥ 5 per cell |
| Mann-Whitney U | Ordinal / non-normal continuous | 2 (independent) | Independent samples, similar shape |
| Wilcoxon Signed-Rank | Ordinal / non-normal continuous | 2 (paired) | Paired observations, symmetric differences |
| Kruskal-Wallis | Ordinal / non-normal continuous | 3+ | Independent samples |
| Fisher's Exact Test | Categorical (counts) | 2×2 (small samples) | None (exact calculation) |
| McNemar's Test | Categorical (paired) | 2 (before/after) | Paired binary outcomes |
| Kolmogorov-Smirnov | Continuous | 1 or 2 distributions | Continuous data |
| Bonferroni Correction | Any (p-value adjustment) | Multiple tests | Conservative; controls family-wise error |
| Holm-Bonferroni | Any (p-value adjustment) | Multiple tests | Less conservative than Bonferroni; step-down |
| Benjamini-Hochberg | Any (p-value adjustment) | Multiple tests | Controls false discovery rate, not FWER |
| Power Analysis | Any (design planning) | Pre-test | Requires effect size estimate |

When to Use Each Test

T-Test

Use the t-test when comparing means between exactly two groups on a continuous, approximately normal outcome. The independent-samples t-test compares two separate groups (treatment vs. control), while the paired t-test compares the same subjects measured twice (before vs. after). Welch's t-test is the safer default because it does not assume equal variances. If your sample is large (n > 30 per group), the t-test is robust to moderate non-normality thanks to the central limit theorem. Full t-test guide →
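A minimal sketch with SciPy (the revenue numbers are made up for illustration); passing `equal_var=False` selects Welch's version:

```python
from scipy import stats

# Revenue per user for two independent groups (illustrative numbers)
control = [12.1, 9.8, 11.4, 10.2, 13.0, 9.5, 11.9, 10.7]
treatment = [13.2, 11.9, 12.8, 14.1, 11.5, 13.7, 12.2, 12.9]

# Welch's t-test: equal_var=False drops the equal-variance assumption
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

For the paired case, `stats.ttest_rel(before, after)` takes two equal-length arrays measured on the same subjects.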

ANOVA

Use one-way ANOVA when comparing means across three or more independent groups -- for example, testing whether conversion rates differ across four landing page variants. ANOVA tells you whether at least one group differs; follow up with post-hoc tests (Tukey HSD, Bonferroni) to find which pairs. Two-way ANOVA adds a second factor and tests for interaction effects. If normality or equal-variance assumptions fail, switch to Kruskal-Wallis. Full ANOVA guide →
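A sketch with SciPy (the group scores are made-up illustrative numbers):

```python
from scipy import stats

# Scores for four landing page variants (illustrative numbers)
a = [2.1, 2.4, 2.2, 2.6, 2.3]
b = [2.2, 2.5, 2.4, 2.3, 2.6]
c = [3.1, 3.4, 3.0, 3.3, 3.2]
d = [2.3, 2.2, 2.5, 2.4, 2.1]

# One-way ANOVA: is at least one group mean different?
f_stat, p_value = stats.f_oneway(a, b, c, d)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant F only says that some pair differs; recent SciPy versions also provide `stats.tukey_hsd(a, b, c, d)` for the post-hoc step.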

Chi-Square Test

Use the Chi-Square test of independence when both your variables are categorical and you want to know if they are associated. Common applications include testing whether customer segment (new vs. returning) is related to purchase category, or whether conversion rates differ across traffic sources. Requires expected cell counts of at least 5; if any cell falls below this, use Fisher's Exact Test instead. Full Chi-Square guide →
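A sketch with SciPy (the counts are made up); the returned expected-frequency table is where you verify the ≥ 5 rule:

```python
from scipy.stats import chi2_contingency

# Rows: new vs returning customers; columns: converted vs not (illustrative)
table = [[120, 380],
         [ 90, 410]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
print(expected)  # check every expected count is at least 5
```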

Mann-Whitney U Test

Use the Mann-Whitney U test as the nonparametric alternative to the independent-samples t-test. It compares the rank distributions of two independent groups and works well with ordinal data, skewed distributions, or when outliers make the t-test unreliable. It tests whether one group tends to have larger values than the other. Common in A/B testing on revenue data (which is almost always right-skewed) and customer satisfaction scores. Full Mann-Whitney guide →
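A sketch with SciPy (made-up order values, deliberately right-skewed); the large outliers would inflate a t-test's variance, but ranks tame them:

```python
from scipy.stats import mannwhitneyu

# Order values for two variants; heavy right skew from a few large orders
variant_a = [12, 15, 14, 300, 18, 13, 16, 17, 250, 14]
variant_b = [22, 25, 24, 28, 26, 400, 23, 27, 350, 29]

u_stat, p_value = mannwhitneyu(variant_a, variant_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```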

Wilcoxon Signed-Rank Test

Use the Wilcoxon signed-rank test as the nonparametric alternative to the paired t-test. It compares two related measurements -- before and after an intervention, or matched pairs -- when the differences are not normally distributed. Particularly useful for small samples where normality is hard to verify, or when working with ordinal rating scales. Full Wilcoxon guide →
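A sketch with SciPy (made-up ratings); `wilcoxon` takes the two paired measurements directly:

```python
from scipy.stats import wilcoxon

# Satisfaction ratings for the same 8 users before and after a redesign
before = [3, 4, 2, 5, 3, 4, 3, 2]
after = [4, 5, 4, 5, 4, 5, 4, 4]

# Tests whether the paired differences are centered at zero
stat, p_value = wilcoxon(before, after)
print(f"W = {stat:.1f}, p = {p_value:.4f}")
```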

Kruskal-Wallis Test

Use the Kruskal-Wallis test as the nonparametric alternative to one-way ANOVA. It compares rank distributions across three or more independent groups when normality assumptions are violated. Follow up a significant result with Dunn's post-hoc test to identify which pairs differ. Common in comparing customer satisfaction scores, response times, or order values across multiple segments. Full Kruskal-Wallis guide →
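A sketch with SciPy (made-up response times, skewed by outliers); SciPy itself has no Dunn's test, but the third-party scikit-posthocs package provides one for the follow-up:

```python
from scipy.stats import kruskal

# Response times in minutes for three support teams (illustrative, skewed)
team_a = [1.2, 1.5, 1.1, 9.0, 1.3, 1.4]
team_b = [2.1, 2.4, 2.2, 2.6, 12.0, 2.3]
team_c = [4.5, 4.8, 5.1, 4.2, 4.9, 20.0]

h_stat, p_value = kruskal(team_a, team_b, team_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```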

Fisher's Exact Test

Use Fisher's Exact Test for 2x2 contingency tables when sample sizes are small (any expected cell count below 5). Unlike the Chi-Square test, Fisher's computes the exact probability rather than relying on an approximation, so it is always valid regardless of sample size. Use it for rare events, small A/B tests, or medical/safety data where precision matters more than computational convenience. Full Fisher's Exact guide →
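A sketch with SciPy (made-up pilot-test counts, too small for Chi-Square):

```python
from scipy.stats import fisher_exact

# Small pilot test: [converted, did not convert] per variant (illustrative)
table = [[8, 2],
         [1, 9]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.4f}")
```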

McNemar's Test

Use McNemar's test when you have paired binary outcomes -- the same subjects measured on a yes/no variable at two time points. It tests whether the proportion of "yes" changed between measurements. Common applications include before/after surveys ("Did you purchase? Yes/No"), diagnostic test comparisons, and matched case-control studies. Full McNemar's guide →
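A sketch using statsmodels (illustrative counts); note the test only uses the discordant cells, i.e. the subjects whose answer changed:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired yes/no outcomes for the same users:
# rows = "before" (yes, no), columns = "after" (yes, no)
table = [[30, 5],
         [18, 47]]

# exact=True runs a binomial test on the 5 + 18 discordant pairs
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p = {result.pvalue:.4f}")
```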

Kolmogorov-Smirnov Test

Use the Kolmogorov-Smirnov (KS) test to compare an observed distribution against a theoretical distribution (one-sample) or to compare two observed distributions (two-sample). It is sensitive to differences in both location and shape. In business analytics, the KS test is widely used to validate model calibration, detect data drift, and check normality assumptions before applying parametric tests. Full KS test guide →
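A sketch of both variants with SciPy, on synthetic data (seeded so the run is reproducible):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_sample = rng.exponential(scale=1.0, size=500)

# One-sample: does the sample match a standard normal distribution?
stat_1s, p_1s = stats.kstest(normal_sample, "norm")
# Two-sample: were the two samples drawn from the same distribution?
stat_2s, p_2s = stats.ks_2samp(normal_sample, skewed_sample)
print(f"one-sample p = {p_1s:.3f}, two-sample p = {p_2s:.2e}")
```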

Power Analysis

Use power analysis before you run any test to determine how many observations you need to detect a meaningful effect. It links four quantities: sample size, effect size, significance level (alpha), and statistical power (1 - beta). Skipping this step is the most common reason tests return inconclusive results -- the experiment simply did not have enough data. Run power analysis during experiment design, not after. Full power analysis guide →
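A sketch with statsmodels: fix any three of the four quantities and solve for the fourth, here the sample size:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group to detect a small effect (Cohen's d = 0.2)
# at alpha = 0.05 with 80% power, using an independent-samples t-test
n_per_group = TTestIndPower().solve_power(
    effect_size=0.2, alpha=0.05, power=0.8
)
print(f"required n per group: {n_per_group:.0f}")
```

Because required n scales with 1/d², halving the detectable effect size roughly quadruples the sample you need.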


Decision Flowchart: Choosing the Right Test

Start at the top and follow the branches based on your data characteristics.

What type of outcome variable?

Continuous → Is the data roughly normal?
    Yes → 2 groups: T-Test | 3+ groups: ANOVA
    No → 2 groups: Mann-Whitney U Test | 3+ groups: Kruskal-Wallis Test

Ordinal → How many groups?
    2 groups: Wilcoxon Signed-Rank Test | 3+ groups: Friedman Test

Categorical (counts) → Expected cell counts ≥ 5?
    Yes → Chi-Square Test
    No → Fisher's Exact Test

Paired data? → Use the paired variant of each test.
Multiple comparisons? → See the correction methods below.

Multiple Comparisons: When and How to Correct

Every time you run a hypothesis test at alpha = 0.05, you accept a 5% chance of a false positive. Run 20 tests on the same dataset and you expect one spurious "significant" result even if no real effects exist. Multiple comparison corrections control this inflation so your findings remain trustworthy.

When You Must Correct

Correction is mandatory when you test multiple hypotheses on the same dataset and report any result as significant. This includes: post-hoc pairwise comparisons after ANOVA, testing multiple outcome metrics in a single experiment, subgroup analyses that were not pre-specified, and running the same test across many segments (e.g., testing conversion lift in each of 15 countries).

Bonferroni Correction

The simplest and most conservative approach: divide your significance threshold by the number of tests. With 10 comparisons at alpha = 0.05, each individual test must reach p < 0.005 to be declared significant. Easy to implement and explain, but overly strict when you have many tests -- it dramatically increases the risk of missing real effects. Best for small numbers of planned comparisons where false positives are costly. Full Bonferroni guide →
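The procedure fits in a few lines of plain Python (the p-values are made up for illustration):

```python
# Ten p-values from tests run on the same dataset (illustrative)
p_values = [0.001, 0.004, 0.012, 0.020, 0.030, 0.041, 0.048, 0.150, 0.320, 0.600]

alpha = 0.05
threshold = alpha / len(p_values)  # Bonferroni: 0.05 / 10 = 0.005
significant = [p for p in p_values if p < threshold]
print(f"threshold = {threshold}, survivors: {significant}")
```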

Holm-Bonferroni Method

A step-down procedure that is uniformly more powerful than Bonferroni while still controlling the family-wise error rate. Sort p-values from smallest to largest and compare each to a progressively less strict threshold. It detects more real effects than Bonferroni under the same Type I error guarantee, making it the recommended default for most applications. There is no reason to use plain Bonferroni when Holm-Bonferroni is available. Full Holm-Bonferroni guide →
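A sketch with statsmodels' `multipletests` (illustrative p-values); on this particular set Holm keeps three findings where plain Bonferroni's 0.005 cutoff would keep only the smallest:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.0052, 0.006, 0.012, 0.030, 0.041, 0.048, 0.150, 0.320, 0.600]

# Holm: compare the k-th smallest p-value to alpha / (m - k + 1)
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(f"rejected: {sum(reject)} of {len(p_values)}")
```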

Benjamini-Hochberg Procedure

Controls the false discovery rate (FDR) rather than the family-wise error rate, making it less conservative and more suitable for exploratory analyses with many tests. Instead of guaranteeing that no false positive slips through, it controls the expected proportion of false positives among rejected hypotheses. Use it when you are screening many variables for potential effects (e.g., which of 200 product features predict churn) and can tolerate some false leads in exchange for detecting more real effects. Full Benjamini-Hochberg guide →
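A sketch with statsmodels (illustrative p-values from a screening run):

```python
from statsmodels.stats.multitest import multipletests

# Screening many hypotheses: control the false discovery rate at 10%
p_values = [0.001, 0.008, 0.012, 0.020, 0.035, 0.045, 0.060, 0.150, 0.320, 0.600]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.10, method="fdr_bh")
print(f"discoveries: {sum(reject)} of {len(p_values)}")
```

On this list Benjamini-Hochberg flags seven discoveries; a Bonferroni cutoff at the same alpha would keep only the two smallest.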

Which Correction Should You Use?

Few planned comparisons, high stakes: Bonferroni or Holm-Bonferroni. Many comparisons, exploratory: Benjamini-Hochberg. Always prefer Holm-Bonferroni over Bonferroni -- it is strictly more powerful with the same error control. Use Benjamini-Hochberg when you are screening for discoveries and can follow up with confirmatory tests.


Beyond Frequentist Testing

Classical hypothesis tests answer a narrow question: "Is this result unlikely under the null hypothesis?" Two complementary approaches provide richer answers for business decision-making.

Bayesian A/B Testing

Instead of a binary significant/not-significant verdict, Bayesian A/B testing gives you a probability that one variant beats another -- for example, "there is a 94% probability that the new checkout flow increases revenue." This maps directly to business risk decisions and does not require fixed sample sizes, making it ideal for continuous experimentation programs. Full Bayesian A/B testing guide →
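For binary conversion outcomes the computation is simple: with a Beta prior, each variant's posterior conversion rate is also Beta, and Monte Carlo sampling gives the probability one variant beats the other. A sketch with NumPy (made-up counts, uniform Beta(1, 1) priors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Conversions and trials for each variant (illustrative counts)
a_conv, a_n = 120, 1000
b_conv, b_n = 150, 1000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior per variant
a_post = rng.beta(1 + a_conv, 1 + a_n - a_conv, size=100_000)
b_post = rng.beta(1 + b_conv, 1 + b_n - b_conv, size=100_000)

prob_b_beats_a = float((b_post > a_post).mean())
print(f"P(B beats A) = {prob_b_beats_a:.1%}")
```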

Causal Impact Analysis

When you cannot run a randomized experiment -- a policy change affects all customers simultaneously, or you need to measure the impact of an external event -- Causal Impact uses Bayesian structural time series to construct a synthetic counterfactual. It estimates what would have happened without the intervention and quantifies the causal effect with credible intervals. Essential for measuring marketing campaigns, product launches, and operational changes that lack a clean control group.


See Statistical Tests in Action — View a live hypothesis testing report built from real data.
View Sample Report

Hypothesis Testing for Marketing & Business

A/B tests, campaign comparisons, and conversion rate optimization all rely on the statistical tests above. Whether you're testing a new checkout flow or comparing ad creatives, hypothesis testing tells you if the difference is real or just noise.

Run the Right Test on Your Data

Upload a CSV, pick a hypothesis test, and get a publication-ready report with effect sizes, confidence intervals, and plain-language interpretation -- no code required.

Start Free Trial