When you run many hypothesis tests at once, some will come back significant by pure chance. Run 20 t-tests at the 5% level and you can expect one false positive even if nothing real is happening. Holm-Bonferroni controls the family-wise error rate, keeping the chance of even one false positive across the whole batch at the level you set, so the results you keep are unlikely to be statistical flukes. Upload a CSV of your p-values and get corrected results, method comparisons, and interactive visualizations in under 60 seconds.
The Multiple Testing Problem
Every hypothesis test carries a risk of a false positive. At the standard 5% significance level, there is a 1-in-20 chance that you declare a result significant when nothing real is happening. That is fine for a single test. But when you run many tests simultaneously, those odds compound. With 20 independent tests, the probability that at least one gives a false positive climbs to 64%. With 50 tests, it reaches 92%. Run 100 tests and you are virtually guaranteed at least one spurious finding.
This is the multiple testing problem, and it shows up everywhere. A marketing team runs 20 A/B tests on different email subject lines against a control — even if none of the variants is actually better, the team will likely see at least one "winner." A genomics researcher tests thousands of genes for differential expression and gets hundreds of hits, most of which are noise. A survey analyst compares satisfaction scores across every combination of department, seniority level, and gender, generating dozens of pairwise comparisons where some will be significant by chance alone.
The family-wise error rate (FWER) is the probability of making at least one false positive across the entire family of tests. Without correction, it grows rapidly: for m independent tests, FWER = 1 - (1 - alpha)^m. Holm-Bonferroni is a procedure that adjusts your significance thresholds so the FWER stays at the level you set (typically 5%) no matter how many tests you run.
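That formula is easy to check directly. A quick sketch in Python (illustrative only; the tool's own analysis runs in R, and `fwer` is just a hypothetical helper name):

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive across m independent tests:
    FWER = 1 - (1 - alpha)^m."""
    return 1 - (1 - alpha) ** m

for m in (1, 20, 50, 100):
    print(f"{m:4d} tests -> uncorrected FWER = {fwer(m):.0%}")
# 20 tests give roughly a 64% chance of at least one false positive,
# 50 tests roughly 92%, and 100 tests over 99%.
```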
What Holm-Bonferroni Does
Holm-Bonferroni is a step-down procedure. It sorts your p-values from smallest to largest, then compares each one to a progressively relaxed threshold. The first (smallest) p-value is compared to alpha/m — the same strict threshold that Bonferroni uses for all tests. But the second smallest is compared to alpha/(m-1), the third to alpha/(m-2), and so on. The procedure stops rejecting the first time a p-value fails to clear its threshold, and all remaining tests are also retained (not rejected).
This step-down logic is what makes Holm more powerful than the classical Bonferroni correction. Bonferroni applies the strictest threshold — alpha/m — to every test, regardless of rank. That means it treats the most significant result and the least significant result with equal suspicion. Holm starts just as strict but relaxes the bar as it works through the sorted list, which means it can reject tests that Bonferroni would miss. Critically, Holm still controls the FWER at exactly the same level as Bonferroni. You get more power without sacrificing error control.
For example, suppose you run 20 t-tests simultaneously and set alpha at 0.05. Bonferroni compares every p-value to 0.05/20 = 0.0025. Any p-value above 0.0025 is declared not significant, even if it is 0.003. Holm starts at 0.0025 for the smallest p-value, but by the time it reaches the 10th-ranked p-value, the threshold has relaxed to 0.05/11 = 0.0045. That difference can mean two or three additional genuine discoveries that Bonferroni would have thrown away.
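The step-down rule described above fits in a few lines of Python. This is a minimal sketch of the standard Holm procedure, not the tool's own implementation, and `holm_reject` is a hypothetical helper name:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm step-down: sort p-values ascending, compare the k-th smallest
    to alpha/(m - k + 1), and stop rejecting at the first failure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):          # k = 0 for the smallest p-value
        if pvalues[i] <= alpha / (m - k):  # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                          # retain this test and all larger p-values
    return reject

# 20 tests: nine very small p-values, one at 0.003, ten clear nulls.
p = [0.0001] * 9 + [0.003] + [0.5] * 10
print(sum(holm_reject(p)))                      # Holm rejects 10
print(sum(1 for x in p if x <= 0.05 / len(p)))  # Bonferroni rejects only 9
```

The p-value of 0.003 fails Bonferroni's uniform 0.0025 threshold but clears Holm's relaxed 10th-rank threshold of 0.05/11, which is exactly the divergence the paragraph above describes.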
Real-World Examples
Consider a concrete scenario: a marketing team tests 15 email subject line variants against a control. Each variant gets a chi-square test of conversion rate difference. Without any correction, 12 of the 15 variants come back significant at the 5% level. But with 15 simultaneous tests, the uncorrected FWER is 54%: it is more likely than not that at least one of those 12 "winners" is a false positive. After applying Holm-Bonferroni, only 8 of the 12 survive. Those 8 represent genuinely better subject lines. The other 4 were likely statistical noise, and rolling them out would have wasted email list segments on variants that do not actually outperform the control.
In gene expression studies, the stakes are even higher. A researcher might test 10,000 genes for differential expression between healthy and diseased tissue. At alpha = 0.05 with no correction, 500 genes would be expected to appear significant by chance alone — even if the disease has no effect on gene expression. Holm-Bonferroni (or the related Benjamini-Hochberg FDR procedure, for studies this large) separates the genuine biological signals from the thousands of false leads that would otherwise consume years of follow-up research.
Survey analysis produces the same problem at a smaller scale. An HR team compares employee engagement scores across 5 departments, 3 seniority levels, and 2 genders. That generates 84 pairwise t-tests. Without correction, 58 come back significant — but at that scale, many are noise. After Holm-Bonferroni, 36 remain. These 36 represent real disparities that justify HR attention and resource allocation. The 22 that dropped out were differences that appeared real in isolation but could not survive the scrutiny of multiple comparison correction.
What Data Do You Need?
You need a CSV with at least two columns: a label identifying each test (like a variant name, gene ID, or group comparison description) and the raw p-value from the original hypothesis test. The p-values must be numbers between 0 and 1. You need at least two tests — a single test does not need multiple comparison correction.
Optional but recommended columns include effect size (such as Cohen's d, a log odds ratio, or a relative risk), sample size per group, a test family column (if you want to correct within subgroups rather than across all tests), and a column identifying the type of statistical test used. Effect sizes are particularly valuable because they let the report distinguish between results that are statistically significant and results that are practically meaningful — a distinction that matters for any real decision.
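As an illustration, a minimal input file might look like the fragment below. The column names are only examples of the required label and p-value columns plus two optional ones; check the tool's upload screen for the exact headers it detects:

```
test_label,p_value,effect_size,n_per_group
variant_a_subject_line,0.0031,0.42,1250
variant_b_subject_line,0.0478,0.18,1250
variant_c_subject_line,0.8120,0.03,1190
```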
The tool accepts up to 5,000 tests on the free tier and up to 50,000 on professional plans. Execution typically takes 5 to 15 seconds regardless of test count.
How to Read the Report
The report contains a series of layout cards organized into sections that walk you from the big picture down to individual test results. Here is what each one tells you.
Analysis Overview — The report opens with two side-by-side cards: a data overview showing your input characteristics (number of tests, p-value range, which optional columns were detected) and key metrics summarizing rejections across all correction methods. This gives you the headline numbers immediately: how many tests survived Holm, how many survived Bonferroni, and how those compare to uncorrected results.
Executive Summary — A plain-language summary of the key findings: which correction method was applied, how many tests were rejected, and what the practical implications are. This card is designed to be shared with stakeholders who need the conclusion without the statistical detail.
Holm Step-Down Procedure — The signature visualization. Your p-values are sorted from smallest to largest and displayed as a lollipop chart, with a staircase threshold line overlaid showing the Holm sequential thresholds. Tests below the staircase are rejected (significant); tests above it are retained. The point where the staircase crosses the p-values is exactly where the Holm procedure stops rejecting. This chart makes the sequential logic visible and intuitive.
Method Comparison: Adjusted P-Values — A scatter plot showing raw p-values on one axis and adjusted p-values on the other, with one series per correction method (Holm, Bonferroni, Hochberg, Benjamini-Hochberg). This reveals how aggressively each method inflates the p-values and where the methods diverge. Tests that cluster near the diagonal are barely affected by correction; tests flung far from the diagonal are the ones where correction changed the conclusion.
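For readers who want to see where those adjusted values come from, here is a minimal, self-contained Python sketch of the four adjustment formulas. The `adjust` helper is hypothetical; the tool itself uses R's p.adjust(), which these formulas mirror:

```python
def adjust(pvalues, method):
    """Adjusted p-values for common corrections (minimal sketch).

    bonferroni : m * p, capped at 1
    holm       : running max of (m - k) * p over p sorted ascending
    hochberg   : running min of (m - k) * p over p sorted descending
    bh         : running min of m * p / rank over p sorted descending
    """
    m = len(pvalues)
    if method == "bonferroni":
        return [min(1.0, m * p) for p in pvalues]
    order = sorted(range(m), key=lambda i: pvalues[i])
    adj = [0.0] * m
    if method == "holm":
        running = 0.0
        for k, i in enumerate(order):
            running = max(running, (m - k) * pvalues[i])
            adj[i] = min(1.0, running)
    elif method == "hochberg":
        running = 1.0
        for k in range(m - 1, -1, -1):
            i = order[k]
            running = min(running, (m - k) * pvalues[i])
            adj[i] = running
    elif method == "bh":
        running = 1.0
        for k in range(m - 1, -1, -1):
            i = order[k]
            running = min(running, m * pvalues[i] / (k + 1))
            adj[i] = running
    return adj

p = [0.001, 0.008, 0.039, 0.041]
for method in ("bonferroni", "holm", "hochberg", "bh"):
    print(method, [round(x, 3) for x in adjust(p, method)])
# At alpha = 0.05, Bonferroni and Holm keep only the two smallest p-values,
# while Hochberg and BH keep all four: the methods diverge on borderline tests.
```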
Significance Decision Matrix — A heatmap with tests as rows and correction methods as columns, colored by reject or retain. This is the fastest way to see where the methods agree and where they disagree. A test that survives all methods is rock-solid. A test that only survives Benjamini-Hochberg but not Holm is significant under FDR control but not under the stricter FWER control — and that distinction matters for how much confidence you place in the result.
Method Comparison: Rejection Counts — A grouped bar chart showing the total number of rejections per correction method. The ordering is predictable: Bonferroni (most conservative) rejects the fewest, then Holm, then Hochberg, then Benjamini-Hochberg (least conservative). The gaps between bars tell you how much power you gain or lose by choosing one method over another.
Effect Size Analysis — A forest plot showing effect sizes (Cohen's d) with confidence intervals, sorted by significance. Tests with large effects and narrow confidence intervals are your strongest findings. Tests with tiny effects that happen to be significant get flagged — they passed the statistical bar but may not clear the practical significance bar for your domain. This card is available when you include an effect size column in your data.
FWER Accumulation Curve — A line chart showing how the probability of at least one false positive grows as you add more tests without correction. This is the motivational chart — it shows why correction is necessary. At 14 tests, the uncorrected FWER crosses 50%. At 59 tests, it exceeds 95%. The chart makes the case for correction more powerfully than any verbal explanation.
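Both crossing points are easy to verify from the FWER formula. A short Python check (`first_m_exceeding` is a hypothetical helper name, not part of the report):

```python
def first_m_exceeding(target, alpha=0.05):
    """Smallest number of independent tests m with 1 - (1 - alpha)^m > target."""
    m = 1
    while 1 - (1 - alpha) ** m <= target:
        m += 1
    return m

print(first_m_exceeding(0.50))  # 14: uncorrected FWER first tops 50%
print(first_m_exceeding(0.95))  # 59: uncorrected FWER first tops 95%
```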
Detailed Results Table — The complete numerical output: every test with its raw p-value, adjusted p-values under each method, effect size, and the reject/retain decision for each method. This is the table you export for your records or attach to a report. Every number in every chart traces back to this table.
Methodology Summary — A reference table comparing the correction methods used: what each one controls (FWER vs. FDR), what assumptions it makes (independence, positive dependence), and how it ranks in statistical power. This card is useful for justifying your choice of method in a write-up or presentation.
Holm vs. Bonferroni vs. Benjamini-Hochberg
The three most common correction methods differ in what they control and how aggressively they adjust. Bonferroni is the simplest: divide alpha by the number of tests and apply that threshold uniformly. It controls the FWER but is the most conservative — it throws away the most genuine results along with the false positives. Holm uses the same starting threshold but relaxes it step by step, which means it always rejects at least as many tests as Bonferroni and often more. Both Bonferroni and Holm control the FWER at exactly the same level.
Benjamini-Hochberg (BH) controls something different: the false discovery rate (FDR), which is the expected proportion of rejected tests that are false positives. FDR control is less strict than FWER control. If you reject 100 tests at FDR = 5%, you expect about 5 of those to be false positives — but you do not know which 5. For exploratory studies with many tests (genomics, large-scale A/B testing), FDR control is often the right tradeoff because it retains more genuine discoveries. For confirmatory studies where any single false positive is costly (clinical trials, regulatory submissions), FWER control via Holm is the standard.
The report includes all three methods side by side so you can see exactly how the choice of method affects your conclusions. In many cases, the methods agree on the strongest results and only diverge on borderline tests — which is where understanding the tradeoff matters most.
When to Use Something Else
If you have only a single hypothesis test, you do not need multiple comparison correction at all. Use a t-test, chi-square test, or whichever test fits your data.
If you have raw data rather than pre-computed p-values, start with the appropriate analysis module. Run ANOVA to compare means across groups, a Kruskal-Wallis test for non-parametric group comparisons, or a t-test for two-group comparisons. Those modules produce p-values that you can then feed into this correction tool.
If you are running thousands of tests — common in genomics, proteomics, or large-scale feature screening — Holm-Bonferroni may be too conservative. With 10,000 tests, Holm will reject very few unless the effects are enormous. In that case, Benjamini-Hochberg FDR control (included in this report for comparison) or more advanced methods like Storey's q-value are better choices. The report flags when your test count is large enough that FDR control may be more appropriate.
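A stylized example makes the power gap vivid. The construction below is artificial (50 strong signals at p = 1e-4 planted among 9,950 nulls), and the helper names are hypothetical, but the arithmetic follows the standard Holm and BH rules:

```python
def holm_count(pvalues, alpha=0.05):
    """Number of rejections under the Holm step-down procedure."""
    m, count = len(pvalues), 0
    for k, p in enumerate(sorted(pvalues)):
        if p <= alpha / (m - k):
            count += 1
        else:
            break
    return count

def bh_count(pvalues, alpha=0.05):
    """Number of rejections under Benjamini-Hochberg FDR control:
    the largest rank k with p_(k) <= k * alpha / m."""
    m, best = len(pvalues), 0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k * alpha / m:
            best = k
    return best

signals = [1e-4] * 50                               # genuine effects
nulls = [0.1 + 0.9 * i / 9949 for i in range(9950)]  # nulls spread over [0.1, 1]
p = signals + nulls
print(holm_count(p))  # 0  -- 1e-4 fails Holm's starting threshold 0.05/10000
print(bh_count(p))    # 50 -- every genuine signal clears the BH line
```

At 10,000 tests, Holm's first threshold is 5e-6, so even a p-value of 0.0001 is retained, while BH's rank-scaled line recovers all 50 planted signals.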
If your tests are strongly correlated (for example, you are comparing multiple related outcomes from the same subjects), Holm-Bonferroni still controls the FWER, since it makes no assumptions about the dependence structure, but it can be needlessly conservative because it ignores the correlation. Permutation-based methods like Westfall-Young account for the correlation structure directly and can recover that lost power, so consider them when your outcomes are highly correlated.
The R Code Behind the Analysis
Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. This is not AI-generated code that changes every run. The same data produces the same analysis every time.
The analysis uses p.adjust() from base R with method="holm" for the primary correction, plus method="bonferroni", method="hochberg", and method="BH" for the comparison methods. These are the same functions used in academic research, clinical trial submissions, and peer-reviewed publications — the same implementations referenced in statistics textbooks worldwide. No custom algorithms, no black boxes. The step-down procedure visualization, FWER accumulation curve, and decision heatmap are built with Plotly for interactivity. Every step is visible in the code tab of your report, so you or a statistician can verify exactly what was done and reproduce it independently.