You run 45 t-tests and find 30 significant results at p < 0.05. Before you celebrate, do the math: with 45 tests and a 5% false positive rate per test, you'd expect 2-3 spurious significant findings even if nothing real is happening. The problem gets worse: if the tests are independent, the probability of at least one false positive across all 45 is about 90%. Some of your "discoveries" are probably noise, and nothing in the raw p-values tells you which ones.

This is the multiple comparisons problem, and it's why we need correction methods. Before we draw conclusions from multiple statistical tests, let's check the experimental design — specifically, how we control the family-wise error rate.

The Multiple Comparisons Problem: Why 0.05 Breaks Down

Here's the issue: when you set α = 0.05 for a single hypothesis test, you're accepting a 5% chance of a false positive. That's the agreed-upon trade-off in classical hypothesis testing. But when you run multiple tests, those 5% chances accumulate.

The family-wise error rate (FWER) is the probability of making at least one Type I error (false positive) across all your tests. For independent tests, the math is straightforward:

FWER = 1 - (1 - α)^m

Where m is the number of tests. With 10 independent comparisons at α = 0.05, your family-wise error rate is 1 - (0.95)^10 = 0.40. You have a 40% chance of at least one false positive. With 45 comparisons, it climbs to 90%.
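
To make the arithmetic concrete, here is a minimal Python sketch of the formula above. The function name fwer_independent is just an illustrative choice, and the calculation only holds under the independence assumption stated in the text.

```python
def fwer_independent(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


# Reproduce the cases discussed in the text: 1, 10, and 45 tests at alpha = 0.05.
for m in (1, 10, 45):
    print(f"m = {m:2d}: FWER = {fwer_independent(0.05, m):.3f}")
```

With correlated tests the true rate is generally lower than this formula suggests, so treat the output as a worst-case intuition rather than an exact figure for real data.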

What Counts as a "Family" of Tests?

A family is a set of hypothesis tests that address related questions within a single analysis or experiment. If you're comparing six parental education groups across three outcomes (math, reading, writing), all 45 pairwise comparisons form a single family — they're all answering the broader question "do education groups differ on academic performance?" If you run separate experiments on unrelated questions, those are different families and don't need joint correction.

The solution is to adjust your decision threshold to control FWER at your desired level (typically 0.05). The simplest approach is Bonferroni correction: divide your α by the number of tests. For 45 comparisons, each individual test would need p < 0.05/45 = 0.0011 to be declared significant. This guarantees FWER ≤ 0.05.

The problem? Standard Bonferroni is conservative. It protects you from false positives, but it also reduces power — your ability to detect real effects. You end up missing true differences because the threshold is too strict.

How Holm-Bonferroni Improves on Standard Bonferroni

Holm-Bonferroni is a sequential rejection procedure that's uniformly more powerful than standard Bonferroni while maintaining the same FWER protection. Here's how it works:

  1. Sort your p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(m)
  2. Test sequentially starting with the smallest p-value:
    • Test p(1) against α/m
    • If rejected, test p(2) against α/(m-1)
    • If rejected, test p(3) against α/(m-2)
    • Continue until you fail to reject
  3. Stop when you retain a null: the moment you fail to reject, stop testing and retain all remaining null hypotheses

The key insight: once you've rejected the k most significant tests, you only have (m - k) hypotheses left to worry about, so you can use a less stringent threshold for the next test. This step-down approach gives you more power than Bonferroni's uniform threshold while still guaranteeing FWER ≤ α.
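
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name holm_reject and the example p-values are made up, and the comparison uses ≤ (the usual formal statement) where the text writes <.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's step-down procedure rejects."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        threshold = alpha / (m - step)  # alpha/m, then alpha/(m-1), and so on
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # first failure: retain this null and all remaining ones
    return reject


# Hypothetical p-values, invented for illustration only.
example = [0.001, 0.04, 0.012, 0.03, 0.20]
print(holm_reject(example))
```

In this example Holm rejects the tests with p = 0.001 and p = 0.012, while a flat Bonferroni threshold of 0.05/5 = 0.01 would reject only the first. The results come back in the original order of the input, which makes it easy to line them up with a table of comparisons.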

Holm-Bonferroni vs. Bonferroni: The Power Difference

With 10 tests, standard Bonferroni requires p < 0.005 for every test. Holm-Bonferroni requires p < 0.005 for only the first test. If that passes, the second test needs p < 0.0056, the third p < 0.00625, and so on. The thresholds relax as you go, giving you more opportunities to reject false nulls without inflating your false positive rate.
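
A quick way to see the relaxing thresholds is to print them side by side. The sketch below assumes α = 0.05 and m = 10 as in the paragraph above; holm_thresholds is a hypothetical helper name.

```python
def holm_thresholds(alpha: float, m: int) -> list:
    """Thresholds Holm applies to the 1st, 2nd, ..., m-th smallest p-value."""
    return [alpha / (m - k) for k in range(m)]


alpha, m = 0.05, 10
bonferroni = alpha / m  # the same threshold for every test
for rank, holm in enumerate(holm_thresholds(alpha, m), start=1):
    print(f"rank {rank:2d}: Bonferroni {bonferroni:.5f}   Holm {holm:.5f}")
```

The two columns agree only at rank 1. By the last rank, Holm's threshold is the full α, though a test only gets that far if every smaller p-value was rejected first.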

A Real Example: Parental Education and Academic Performance

Let's walk through a concrete case. You're analyzing student performance data with six parental education levels: some high school, high school graduate, some college, associate's degree, bachelor's degree, and master's degree. You have three outcome measures: math scores, reading scores, and writing scores.

The research question: do students whose parents have different education levels perform differently on standardized tests? To answer this, you run all pairwise comparisons between education groups for each outcome.

With six groups, there are C(6,2) = 15 unique pairs. Across three outcomes, that's 45 total t-tests. Without correction, any test with p < 0.05 would be called significant. But we know that approach would inflate our false positive rate to unacceptable levels.
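
If you want to confirm the size of the family before testing, you can enumerate it. This sketch uses the group and outcome labels from the example; the variable names are illustrative.

```python
from itertools import combinations

groups = [
    "some high school",
    "high school graduate",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]
outcomes = ["math", "reading", "writing"]

# Every unordered pair of groups, then every pair crossed with every outcome.
pairs = list(combinations(groups, 2))
family = [(a, b, outcome) for outcome in outcomes for a, b in pairs]

print(len(pairs), "pairs per outcome,", len(family), "tests in the family")
```

Writing the family down explicitly like this, before looking at any p-values, also gives you a record of which comparisons were planned.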

This is exactly the scenario where Holm-Bonferroni shines: a moderate number of planned comparisons (not so many that we need FDR methods, but enough that Bonferroni would be overly conservative), all addressing a coherent family of related hypotheses.

Group Descriptive Statistics

Before running any statistical tests, examine your descriptive statistics. This table shows mean scores and sample sizes for each parental education group across math, reading, and writing outcomes. The pattern is immediately visible: mean scores increase monotonically with parental education level across all three subjects.

Students whose parents have a master's degree score approximately 10-15 points higher (on a 100-point scale) than students whose parents have only some high school education. The groups in between show a clear stepwise progression. Reading scores are consistently higher than math or writing scores across all education levels, but the relative ordering of groups remains stable.

Sample sizes vary across groups, from 179 students in the "some high school" category to 118 in the "master's degree" group. These differences in group size affect statistical power for detecting differences. Comparisons involving larger groups will have more power than those involving smaller groups, all else equal.

What's your sample size? Are these groups adequately powered to detect meaningful differences? With 100+ students per group and effect sizes on the order of 10-15 points (roughly 0.5-0.8 standard deviations), we have good power for the largest contrasts. Smaller adjacent-group comparisons (e.g., some college vs. associate's degree) will be harder to detect with statistical confidence.

Mean Scores by Parental Education Level

This visualization makes the dose-response relationship unmistakable. Across all three academic subjects, average scores climb steadily as parental education increases. The ordering is perfectly consistent: master's degree tops the chart, followed by bachelor's, associate's, some college, high school, and some high school at the bottom.

Reading scores show the highest absolute values (mid-70s at the top end), followed by writing and math. But the relative gaps between education levels are similar across subjects. The difference between the highest and lowest education groups is roughly 12-15 points regardless of subject.

Here's what this visualization doesn't tell you: whether those differences are statistically significant after accounting for within-group variability and multiple comparisons. Those bars represent point estimates (means), not confidence intervals. Adjacent groups look close — are they significantly different, or could those small gaps be sampling variation?

This is why we need inferential statistics. Descriptive charts like this generate hypotheses and show patterns, but they can't tell you which differences are reliable. And keep the limits in mind: even with rigorous statistical testing, observational data like this can only show association, not causal effects. We'd need randomized assignment of parental education (impossible) or careful causal inference methods to make stronger claims.

Score Distributions by Education Level

Box plots reveal within-group variability, which is critical information for understanding statistical power. Each box spans the interquartile range (IQR), showing where the middle 50% of students in each group scored. The whiskers extend to the most extreme scores within 1.5 × IQR of the box edges, with points beyond that plotted individually as outliers.

Notice the substantial overlap between adjacent groups. While the medians (horizontal lines inside each box) show the same upward trend we saw in the bar chart, the distributions overlap considerably. A student at the 75th percentile of the "high school" group might score higher than a student at the 25th percentile of the "bachelor's degree" group.

This overlap makes pairwise comparisons challenging. When distributions substantially overlap, you need large sample sizes to detect significant differences — even when the means differ by several points. The t-test asks: "Is the difference in means large relative to the pooled within-group variability?" These boxes tell us the denominator of that ratio is substantial.

Also note the outliers, particularly in math scores. Several students score well below the typical range for their parental education group, while a few score above. These outliers increase variance, reducing power. For parametric tests like t-tests, outliers can also influence means, potentially affecting which comparisons reach significance.

Pairwise Comparison Results (All 45 Tests)

Here's where Holm-Bonferroni correction does its work. This table shows all 45 pairwise t-tests: 15 education group pairs × 3 outcomes. For each test, you see the raw p-value, the Holm-Bonferroni adjusted p-value, and the significance decision.

Look at the math score comparison between "some high school" and "master's degree": raw p = 0.00001, adjusted p = 0.00045, significant = Yes. This is the most extreme difference, so it easily survives correction. The adjusted p-value is still well below 0.05.

Now look at adjacent groups. The comparison between "associate's degree" and "bachelor's degree" for math scores shows raw p = 0.092, adjusted p = 1.000, significant = No. The raw p-value was already above 0.05, so no surprise it fails after correction. The adjusted value of 1.000 is a cap rather than a measurement: Holm multiplies each raw p-value by the number of hypotheses still in play, and any product above 1 is reported as 1.000, the maximum possible p-value.

The critical comparisons are those with raw p-values between 0.001 and 0.05. A reading score comparison might have raw p = 0.018 (significant without correction) but adjusted p = 0.324 (not significant after correction). This is Holm-Bonferroni protecting you from false positives. That raw p = 0.018 looked promising in isolation, but in the context of 45 tests, it's not strong enough evidence to trust.

Count the rejections: across all 45 tests, 18 comparisons remain significant after Holm-Bonferroni correction. These are the differences you can report with confidence, knowing your family-wise error rate is controlled at 0.05. The other 27 comparisons — including some with raw p < 0.05 — should be interpreted as inconclusive or treated as exploratory findings requiring replication.
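
Adjusted p-values like those in the table can be reproduced with a short function. The sketch below follows the usual definition of Holm-adjusted p-values: multiply the k-th smallest p-value by (m - k + 1), enforce monotonicity with a running maximum, and cap at 1. The function name and the example inputs are hypothetical; only the 0.00001 → 0.00045 case is taken from the text.

```python
def holm_adjust(p_values):
    """Holm-adjusted p-values, returned in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        scaled = (m - rank) * p_values[idx]
        running_max = max(running_max, scaled)  # adjusted values never decrease
        adjusted[idx] = min(1.0, running_max)   # a p-value cannot exceed 1
    return adjusted


# One very small p-value in a family of 45; the other 44 values are placeholders.
family = [0.00001] + [0.5] * 44
print(holm_adjust(family)[0])  # 45 * 0.00001, matching the adjusted value in the text
```

A comparison is significant at level α exactly when its adjusted p-value is at or below α, so this gives the same decisions as stepping through the thresholds one at a time.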

How to Interpret Your Holm-Bonferroni Results

Once you've applied Holm-Bonferroni correction, interpretation follows these principles:

Report the adjusted p-values, not just the raw ones. Your readers need to know which tests survived correction. Reporting raw p-values without correction is misleading, because it suggests a false positive rate that doesn't account for multiple testing. Always present adjusted p-values in tables and text, with the raw values alongside if you include them.

Distinguish between significant and non-significant findings. This seems obvious, but it matters: a comparison that doesn't reach significance after correction should not be described with language like "trending toward significance" or "marginally significant." It failed the test. You can note the pattern in exploratory language ("we observed a numerical difference of X points, though this did not reach statistical significance"), but don't hedge your way into treating it as a positive finding.

Focus on effect sizes, not just p-values. Statistical significance tells you whether a difference would be surprising if chance alone were at work. Effect size tells you whether it's meaningful. A comparison between distant education groups (some high school vs. master's degree) might be highly significant but only represent a 12-point difference on a 100-point scale. Is that educationally meaningful? Depends on your context and the smallest difference that matters for your research question.
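
For two groups, a standardized effect size such as Cohen's d is straightforward to compute. This is a minimal sketch using the pooled-standard-deviation form of d; the function name and the toy scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Toy scores, not real data.
high = [72, 75, 78, 81, 84]
low = [66, 69, 72, 75, 78]
print(round(cohens_d(high, low), 2))
```

Reporting d (or the raw mean difference with a confidence interval) next to each adjusted p-value lets readers judge magnitude and evidence separately.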

Consider power and Type II errors. Non-significant comparisons could be true nulls (no real difference) or false negatives (real difference that you failed to detect). With 100+ subjects per group, you have decent power for moderate effect sizes, but adjacent-group comparisons with small true differences might be underpowered. Don't conclude "no difference exists" — say "we found no significant difference" and acknowledge the possibility of Type II error.

When to Report Non-Significant Results

In confirmatory research, non-significant results are important findings. If you hypothesized that high school and some college groups would differ, and they don't after correction, that's a result worth reporting — it means parental education effects might plateau or be non-linear. In exploratory research, focus your discussion on significant findings, but include the full table of results in an appendix so readers can see the complete picture.

Common Mistakes When Applying Multiple Comparisons Correction

I see these errors regularly in submitted analyses and published papers:

Testing everything, then correcting. Holm-Bonferroni (and all FWER methods) assumes the tests in your family are planned comparisons addressing a coherent research question. If you run every possible analysis on your dataset, find 100 significant results, then apply correction as an afterthought, you're misusing the method. Multiple comparisons correction is not a fix for data dredging — it's a tool for controlling error rates in planned families of tests.

Correcting within subgroups instead of across the full family. If you run 15 pairwise comparisons for math scores, 15 for reading, and 15 for writing, you have 45 tests in one family (all testing education effects on academic performance). Correcting within each outcome separately (m = 15 three times) is less stringent than correcting across all 45 tests together. Use the full family unless you have strong a priori reasons to treat the outcomes as independent families.
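
The effect of splitting the family is easy to demonstrate on a toy case. The sketch below uses a compact Holm adjustment and six made-up p-values (two per outcome) to compare per-outcome correction with full-family correction. All names and numbers here are hypothetical.

```python
def holm_adjust(p_values):
    """Holm-adjusted p-values in input order (scaled, monotone, capped at 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Invented p-values: two comparisons for each of three outcomes.
by_outcome = {
    "math": [0.004, 0.020],
    "reading": [0.012, 0.030],
    "writing": [0.015, 0.200],
}

# Option 1: correct within each outcome separately (three families of 2).
separate = [p for ps in by_outcome.values() for p in holm_adjust(ps)]
# Option 2: correct across everything at once (one family of 6).
pooled = holm_adjust([p for ps in by_outcome.values() for p in ps])

n_separate = sum(p < 0.05 for p in separate)
n_pooled = sum(p < 0.05 for p in pooled)
print(n_separate, "significant when corrected per outcome;", n_pooled, "across the full family")
```

Same data, very different conclusions. That gap is why the family has to be defined on substantive grounds before you look at the results.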

Choosing α after seeing the data. If no tests survive Holm-Bonferroni correction at α = 0.05, you can't just raise α to 0.10 and re-run the correction until something passes. The whole point is to set your error rate tolerance before testing. If you adjust α post-hoc based on the results, you've invalidated the FWER control.

Misreading the role of independence. Holm-Bonferroni does not assume independent tests: it controls FWER under any dependence structure. What changes with correlation is how conservative it is. If your 45 tests are all highly correlated (say, three nearly identical outcomes measured in the same students), you're not really running 45 independent chances to make a false positive, so the true FWER sits well below the nominal level and both Bonferroni and Holm-Bonferroni give up more power than necessary. That said, being conservative (protecting against false positives at the cost of power) is usually preferable to being anti-conservative (letting false positives through).

Holm-Bonferroni vs. Other Multiple Comparisons Methods

Why use Holm-Bonferroni instead of alternatives? Here's the decision framework:

Use Holm-Bonferroni when:

  • You have 5-50 planned comparisons in a family
  • You need to control family-wise error rate (every false positive is costly)
  • You want more power than standard Bonferroni without additional assumptions
  • You're working in a confirmatory framework with pre-specified tests

Use Benjamini-Hochberg (FDR control) when:

  • You have 50+ comparisons or are doing exploratory analysis
  • You can tolerate some false positives among your rejections
  • You want maximum power to detect true effects
  • You're doing genomics, neuroimaging, or other high-dimensional research

Use Tukey HSD or Dunnett's test when:

  • You're comparing means from a one-way ANOVA
  • You want simultaneous confidence intervals, not just hypothesis tests
  • You're comparing all pairs (Tukey) or all groups to a control (Dunnett)
  • You value the exact FWER control these methods provide for balanced designs

Use no correction when:

  • You genuinely pre-specified a single primary comparison
  • You're reporting exploratory results clearly labeled as such
  • You're showing all results transparently and letting readers judge

For the typical case — a researcher with a clear research question, a moderate number of planned comparisons, and a need for FWER control — Holm-Bonferroni is the default choice. It's simple, widely understood, and more powerful than the standard Bonferroni most people learned in graduate school.
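
To see how the FDR alternative in the framework above differs in practice, here is a minimal sketch of the Benjamini-Hochberg step-up rule: find the largest rank k whose p-value is at or below (k/m)·q, then reject every hypothesis up to that rank. The function name and example p-values are made up for illustration.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Return booleans: True where the Benjamini-Hochberg procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # remember the largest rank that passes
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True  # reject everything up to and including that rank
    return reject


# Hypothetical p-values, invented for illustration only.
example = [0.001, 0.045, 0.012, 0.028, 0.20]
print(benjamini_hochberg_reject(example))
```

On these five values Benjamini-Hochberg rejects three hypotheses (p = 0.001, 0.012, 0.028), whereas Holm-Bonferroni at α = 0.05 would stop after two. The extra rejection is the price of a weaker guarantee: FDR control limits the expected share of false positives among rejections, not the chance of any false positive at all.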

Sample Size Planning for Multiple Comparisons

Before we run the experiment, let's calculate the minimum sample size needed. Multiple comparisons correction reduces effective power, so you need larger samples than you would for a single test.

Here's the logic: if you're running m comparisons and using Holm-Bonferroni, your most significant test needs p < α/m to be declared significant. That's equivalent to running a single test at a much lower α level. Lower α means you need a larger effect size or bigger sample to achieve the same power.

For a standard two-sample t-test, four quantities are tied together: sample size (n), effect size (d), significance level (α), and power (1 - β, typically 0.80). Fix any three and the fourth is determined. With Holm-Bonferroni, you should plan for the most stringent test to be powered: use α/m as your significance level when calculating required n.

Example: you plan 15 pairwise comparisons. Standard Bonferroni (and Holm's first test) requires α/m = 0.05/15 = 0.0033. To detect a medium effect size (d = 0.5) with 80% power in a two-sided test at α = 0.0033, you need roughly 115-120 subjects per group (compared to 64 per group for a single test at α = 0.05).

That's a sobering calculation. Running 15 comparisons requires nearly twice as many subjects as running one test, if you want to maintain power. This is why experimental design matters: reduce the number of comparisons by focusing on specific hypotheses rather than exhaustive pairwise testing.
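
You can sanity-check sample size figures like these with a normal-approximation formula, n ≈ 2·((z₁₋α/₂ + z_power)/d)² per group. The sketch below uses only the Python standard library; n_per_group is a hypothetical helper. Because it ignores the small-sample t correction, it comes out a few subjects below what dedicated power software reports.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Medium effect (d = 0.5), 80% power: one test versus the strictest of 15 tests.
print("single test, alpha = 0.05:   ", n_per_group(0.5, 0.05))
print("15 comparisons, alpha = 0.05/15:", n_per_group(0.5, 0.05 / 15))
```

Under this approximation the single-test case lands at 63 per group and the α/15 case at roughly 115, so the correction costs a bit less than a doubling of the sample. Swapping in α/45 for the full education example pushes the requirement higher still.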

Power Analysis Tools

Use G*Power (free software) or R packages like pwr to calculate required sample sizes for multiple comparisons. Input your desired power, expected effect size, and α/m (your α divided by the planned number of tests) as the significance level. This gives you a realistic target n before you start data collection. Post-hoc power analysis (calculating power after seeing your results) is controversial and generally not recommended. Plan power prospectively.

Reporting Holm-Bonferroni Results in Papers

When writing up your analysis, follow these reporting standards:

In the Methods section, specify:

  • The family of tests you're correcting across (e.g., "all pairwise comparisons between six education groups across three outcomes, totaling 45 tests")
  • The correction method (Holm-Bonferroni procedure)
  • Your α level (typically 0.05)
  • The software/package used (e.g., "p.adjust function in R with method='holm'")

In the Results section, provide:

  • A table with raw p-values, adjusted p-values, and significance decisions
  • Descriptive statistics (means, SDs, sample sizes) for each group
  • Effect sizes (Cohen's d or mean differences with confidence intervals)
  • Text summary highlighting which comparisons were significant after correction

Example text: "We conducted 45 independent-samples t-tests (15 pairwise comparisons × 3 outcomes) to examine differences in academic performance across parental education levels. To control family-wise error rate at α = 0.05, we applied Holm-Bonferroni correction to all 45 p-values. After correction, 18 comparisons remained statistically significant (see Table 2). The largest effects were observed comparing the lowest education group (some high school) with the highest groups (bachelor's and master's degrees), with adjusted p < 0.001 for all three outcomes and effect sizes ranging from d = 0.62 to d = 0.89."

Did you randomize? What were the control conditions? In observational studies like this education analysis, you can't randomize parental education. Be explicit about this limitation: "These analyses are observational and cannot establish causal relationships. Parental education may be confounded with income, school quality, genetics, and other factors that independently affect student performance."

Frequently Asked Questions

When should I use Holm-Bonferroni correction instead of standard Bonferroni?

Use Holm-Bonferroni whenever you would otherwise use standard Bonferroni to control FWER. It's uniformly more powerful: it rejects every null that standard Bonferroni rejects and sometimes more, while maintaining the same family-wise error rate protection. The main reasons to stay with standard Bonferroni are a need for a single fixed per-test threshold set in advance, or for matching simultaneous confidence intervals. In most analysis workflows you have all p-values before making decisions, so Holm-Bonferroni is the better choice.

How does Holm-Bonferroni control the family-wise error rate?

Holm-Bonferroni uses a step-down procedure: sort your p-values from smallest to largest, then test each one against progressively less stringent thresholds (α/m, α/(m-1), α/(m-2), etc.). The moment you fail to reject a null hypothesis, you stop and retain all remaining nulls. This sequential rejection process mathematically guarantees that the probability of making at least one false positive across your entire family of tests stays at or below your chosen α level (typically 0.05).

What's the difference between controlling family-wise error rate and false discovery rate?

Family-wise error rate (FWER) is the probability of making at least one false positive across all your tests. It's conservative — appropriate when every false positive is costly. False discovery rate (FDR) is the expected proportion of false positives among your rejected nulls. FDR methods like Benjamini-Hochberg are more powerful when you're running dozens or hundreds of tests and can tolerate some false positives mixed in with true discoveries. For typical experimental comparisons with 10-50 tests, Holm-Bonferroni's FWER control is the standard.

How many comparisons can I run before Holm-Bonferroni loses all power?

There's no hard cutoff, but practical power degrades as the number of comparisons grows. With 10 comparisons, your most significant test needs p < 0.005 to pass. With 50 comparisons, you need p < 0.001. If you're running 100+ comparisons, consider whether you truly need FWER control or whether an FDR approach (Benjamini-Hochberg) is more appropriate. The real solution is better experimental design: fewer, more focused comparisons guided by specific hypotheses rather than exhaustive pairwise testing.

Do I need multiple comparisons correction if I planned only one primary comparison?

No — if you genuinely pre-specified a single primary comparison before seeing the data, no correction is needed. But be honest about this. If you ran the experiment, looked at the results, then picked the most interesting comparison to report, that's data-driven selection and you need correction. The key is pre-specification: write down your primary hypothesis before data collection. Secondary and exploratory analyses should be clearly labeled as such and either corrected or interpreted cautiously.

The Bottom Line on Multiple Comparisons Correction

Multiple comparisons inflate false positive rates. Without correction, the more tests you run, the more likely you are to find spurious significant results. Holm-Bonferroni correction controls family-wise error rate while maintaining more power than standard Bonferroni.

Apply it when you have a coherent family of planned comparisons (typically 5-50 tests), report adjusted p-values transparently, and interpret non-significant results appropriately. Plan your sample size with correction in mind — you need more subjects than you'd think to maintain power across multiple comparisons.

And remember: correction methods protect you from false positives in the tests you run, but they don't protect you from cherry-picking which tests to run in the first place. Before we draw conclusions, let's check the experimental design — including whether your comparisons were truly pre-planned or selected after seeing the data.

What's your sample size? Is this test adequately powered for the number of comparisons you're running? Run the power analysis before you collect data, not after.