Causal Inference Methods: Complete Guide to Measuring What Actually Works

Correlation is not causation. Everyone knows this. Very few people know what to do about it. When your marketing campaign runs at the same time revenue spikes, did the campaign cause the spike, or was it seasonal demand? When customers who use your premium feature churn less, is the feature reducing churn, or do loyal customers just happen to use more features?

Causal inference methods answer these questions rigorously. They are the difference between "these two things moved together" and "this thing caused that thing." This guide covers nine methods, from the gold standard (randomized A/B tests) to quasi-experimental methods for situations where you cannot randomize. Each section tells you when a method works, when it fails, and links to a full deep-dive article.

Quick Comparison

| Method | Best For | Requires Randomization? |
| --- | --- | --- |
| A/B Testing | Digital experiments with full control over assignment | Yes |
| Bayesian A/B Testing | A/B tests with early stopping or unequal groups | Yes |
| Difference-in-Differences | Policy changes, feature rollouts to subgroups | No |
| Propensity Score Matching | Observational data where treatment is self-selected | No |
| Instrumental Variables | Endogeneity problems, omitted variable bias | No |
| Regression Discontinuity | Treatment assigned by a threshold (score, date, age) | No |
| Causal Impact | Time series interventions (launch, campaign, policy) | No |
| Synthetic Control | Single-unit interventions with comparison units | No |
| ANCOVA | Treatment effects with pre-treatment covariate adjustment | Either |

Method Deep-Dives

1. A/B Testing (Frequentist)

The question it answers: Did variant B actually outperform variant A, or was it random noise?

A/B testing is the gold standard for causal inference because randomization eliminates confounders by design. You randomly assign users to control and treatment, measure the outcome, and run a statistical significance test. If you can randomize, always start here. The method is straightforward, well-understood, and produces the most defensible results. The catch: you need enough traffic or conversions for statistical power, and you need to resist peeking at results before the test reaches its planned sample size.

Use when: You control the user experience and can randomly assign users to different versions (website, email, pricing, features).

Read the full A/B testing guide →
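As a minimal sketch, the significance test behind a conversion-rate A/B test is a pooled two-proportion z-test. The helper below is a hypothetical illustration (not any particular library's API), using only the standard library:

```python
from math import sqrt, erf

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test for a conversion-rate A/B test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the standard normal CDF: Phi(x) = (1 + erf(x / sqrt(2))) / 2
    p_value = 2 * (1 - (1 + erf(abs(z) / sqrt(2))) / 2)
    return z, p_value

# 20.0% vs 25.0% conversion on 1,000 users per arm
z, p_value = two_proportion_ztest(200, 1000, 250, 1000)
```

The pre-registered sample size matters precisely because this p-value is only valid at the planned stopping point; checking it repeatedly inflates the false-positive rate.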

2. Bayesian A/B Testing

The question it answers: What is the probability that B is better than A, and by how much?

Bayesian A/B testing gives you something frequentist tests cannot: a direct probability statement. Instead of "we reject the null hypothesis at p < 0.05," you get "there is a 94% probability that B outperforms A by at least 3%." This is more intuitive for decision-makers. Bayesian methods also handle early stopping naturally -- you can check results at any time without inflating error rates. The tradeoff is that you need to specify a prior, and computation is more involved.

Use when: You want to stop tests early, have unequal group sizes, or need probability statements your stakeholders can actually understand.

Read the full Bayesian A/B testing guide →
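For conversion rates, the standard Bayesian model is Beta-Binomial, and P(B > A) can be estimated by Monte Carlo sampling from each arm's posterior. A minimal sketch, assuming uniform Beta(1, 1) priors:

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    wins = 0
    for _ in range(draws):
        # each arm's posterior is Beta(1 + conversions, 1 + non-conversions)
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

prob = prob_b_beats_a(200, 1000, 250, 1000)
```

The same sampling loop extends naturally to statements like "B beats A by at least 3%": just compare `rate_b > rate_a * 1.03` inside the loop.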

3. Difference-in-Differences (DiD)

The question it answers: What was the effect of a change that only affected some groups?

DiD is the workhorse of quasi-experimental methods. When you roll out a feature to one region but not another, change pricing for one segment, or launch a campaign in some markets, DiD measures the causal effect by comparing the change in outcomes for the treated group versus the control group. The key assumption is parallel trends: absent the treatment, both groups would have followed the same trajectory. DiD is powerful because it does not require randomization, only a plausible comparison group.

Use when: You have before-and-after data for both a treated group and a comparable untreated group.

Read the full DiD guide →
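In its simplest 2x2 form (two groups, two periods), the DiD estimate is just arithmetic on group means. A toy sketch with hypothetical numbers:

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classic 2x2 difference-in-differences on group means:
    (change in treated group) minus (change in control group)."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# treated region grew 100 -> 130; control region grew 100 -> 110,
# so 10 of the treated group's 30-point gain is attributed to the trend
effect = did_estimate(100, 130, 100, 110)
```

The subtraction of the control group's change is exactly what encodes the parallel-trends assumption: it treats the control group's trajectory as what the treated group would have done absent treatment.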

4. Propensity Score Matching

The question it answers: What would have happened to these treated individuals if they had not been treated?

When treatment is self-selected (customers who chose to use a feature, users who opted into a program), you cannot simply compare treated to untreated because the groups differ systematically. Propensity score matching creates an artificial control group by matching each treated individual to an untreated individual with similar characteristics. This approximates what randomization would have achieved. The method requires rich covariates -- the more you know about what drives selection, the better the match.

Use when: You have observational data, treatment was not randomized, and you have rich data about the factors that influenced who received treatment.

Read the full propensity score matching guide →
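In practice the propensity scores are fitted first (typically with a logistic regression of treatment on covariates); the matching step itself is simple. A minimal sketch that assumes scores are already computed, with hypothetical toy data:

```python
def nearest_neighbor_att(treated, control):
    """1:1 nearest-neighbor matching on the propensity score, with
    replacement. Each element is a (propensity_score, outcome) pair;
    returns the average treatment effect on the treated (ATT)."""
    diffs = []
    for score_t, outcome_t in treated:
        # match each treated unit to the control unit with the closest score
        _, outcome_c = min(control, key=lambda c: abs(c[0] - score_t))
        diffs.append(outcome_t - outcome_c)
    return sum(diffs) / len(diffs)

treated = [(0.80, 10.0), (0.60, 8.0)]
control = [(0.79, 7.0), (0.61, 6.0), (0.20, 1.0)]
att = nearest_neighbor_att(treated, control)
```

Note how the low-score control unit (0.20) is never used: matching discards controls that look nothing like the treated population, which is the point.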

5. Instrumental Variables (IV)

The question it answers: What is the true causal effect when there are unmeasured confounders?

IV is the method you reach for when you suspect your estimate is biased by something you cannot measure. It uses a third variable (the "instrument") that affects the treatment but has no direct effect on the outcome. Classic examples: distance to a college as an instrument for education's effect on earnings, or weather as an instrument for foot traffic's effect on sales. Finding a valid instrument is the hard part -- if your instrument violates the exclusion restriction, the estimate can be even more biased than plain OLS.

Use when: You suspect omitted variable bias and can find a credible instrument. This is an advanced method -- use it when simpler alternatives (DiD, matching) are not feasible.

Read the full IV guide →
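For the simplest case of a binary instrument, the IV estimate reduces to the Wald estimator: the instrument's effect on the outcome divided by its effect on the treatment. A toy sketch (hypothetical data; real applications use two-stage least squares with standard errors):

```python
def wald_iv_estimate(z, x, y):
    """Wald IV estimator for a binary instrument z:
    (E[y | z=1] - E[y | z=0]) / (E[x | z=1] - E[x | z=0])."""
    def group_mean(values, flag):
        picked = [v for zi, v in zip(z, values) if zi == flag]
        return sum(picked) / len(picked)
    return ((group_mean(y, 1) - group_mean(y, 0))
            / (group_mean(x, 1) - group_mean(x, 0)))

instrument = [0, 0, 1, 1]
treatment = [1.0, 2.0, 3.0, 4.0]
outcome = [2.0, 4.0, 6.0, 8.0]
effect = wald_iv_estimate(instrument, treatment, outcome)
```

The denominator is also a diagnostic: if the instrument barely moves the treatment (a "weak instrument"), the ratio blows up and the estimate becomes unstable.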

6. Regression Discontinuity (RD)

The question it answers: What happens right at the threshold where treatment kicks in?

RD exploits situations where treatment is assigned based on a cutoff: customers who spend above $100 get a loyalty discount, students scoring above 70 get into the honors program, ads shown to users over age 25. Near the cutoff, assignment is almost random -- a customer at $101 is nearly identical to one at $99. RD estimates the causal effect by comparing outcomes just above and just below the threshold. It is one of the most credible quasi-experimental methods because the identifying assumption is hard to violate.

Use when: Treatment is assigned by a numeric threshold and you have dense data near the cutoff.

Read the full RD guide →
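The core comparison can be sketched as a difference in means within a bandwidth around the cutoff. This naive version (with hypothetical data) conveys the idea; production RD analyses fit local linear regressions on each side instead of raw means:

```python
def rd_estimate(data, cutoff, bandwidth):
    """Naive sharp-RD estimate: difference in mean outcomes just above
    vs just below the cutoff, within the bandwidth."""
    above = [y for x, y in data if cutoff <= x < cutoff + bandwidth]
    below = [y for x, y in data if cutoff - bandwidth <= x < cutoff]
    return sum(above) / len(above) - sum(below) / len(below)

# spend (running variable) vs a later outcome; discount kicks in at $100
data = [(98, 5.0), (99, 6.0), (101, 10.0), (102, 11.0)]
jump = rd_estimate(data, cutoff=100, bandwidth=5)
```

The bandwidth is the key tuning choice: narrower means more credible comparisons but fewer observations, which is why dense data near the cutoff is a prerequisite.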

7. Causal Impact Analysis

The question it answers: What would have happened to this time series if the intervention had never occurred?

Causal Impact (developed by Google) builds a Bayesian structural time series model of what would have happened without the intervention, then compares that counterfactual to what actually happened. It is designed for marketing interventions: you launched a campaign on March 1, and you want to know if the revenue lift was caused by the campaign or was going to happen anyway. The method requires control series (related time series unaffected by the intervention) to build the counterfactual.

Use when: You have a time series interrupted by an event (campaign launch, policy change, feature release) and comparable control series.

Read the full Causal Impact guide →
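The real CausalImpact package fits a Bayesian structural time series; as a much simpler sketch of the same idea, the toy function below (a hypothetical helper, not the library's API) fits an ordinary least squares line from one control series on the pre-period, projects it forward, and sums the gap between actual and projected values:

```python
def cumulative_lift(y, control, intervention_index):
    """Toy counterfactual: fit y ~ a + b * control on the pre-period by
    OLS, project the post-period, and sum actual minus projected."""
    pre_y, pre_x = y[:intervention_index], control[:intervention_index]
    n = len(pre_x)
    mean_x, mean_y = sum(pre_x) / n, sum(pre_y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(pre_x, pre_y))
         / sum((xi - mean_x) ** 2 for xi in pre_x))
    a = mean_y - b * mean_x
    post_actual = y[intervention_index:]
    post_counterfactual = [a + b * xi for xi in control[intervention_index:]]
    return sum(act - cf for act, cf in zip(post_actual, post_counterfactual))

revenue = [2.0, 4.0, 6.0, 8.0, 13.0, 15.0]  # campaign launches at index 4
control = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # comparable unaffected market
lift = cumulative_lift(revenue, control, intervention_index=4)
```

The full method improves on this sketch by modeling trend and seasonality explicitly and by reporting credible intervals around the counterfactual, not just a point estimate.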

8. Synthetic Control

The question it answers: What would have happened to this single treated unit without the intervention?

Synthetic control is designed for the common case where you only have one treated unit: one state passed a law, one store got renovated, one country changed policy. It constructs a weighted combination of untreated units that best matches the treated unit's pre-intervention trajectory. The synthetic version serves as the counterfactual. The method is transparent -- you can see exactly which comparison units contribute and how much -- and produces visual results that are easy to explain to non-technical stakeholders.

Use when: You have a single treated unit and a pool of comparable untreated units with pre-intervention data.

Read the full synthetic control guide →
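The weight-fitting step can be illustrated in miniature. The hypothetical sketch below grid-searches a single convex weight over two donor units; real synthetic control solves a constrained optimization over many donors (non-negative weights summing to one), but the objective is the same pre-intervention fit:

```python
def best_donor_weight(treated_pre, donor_a_pre, donor_b_pre, steps=101):
    """Grid-search the convex weight w on two donor series that
    minimizes squared pre-intervention error against the treated unit."""
    best_w, best_sse = 0.0, float("inf")
    for i in range(steps):
        w = i / (steps - 1)
        sse = sum((t - (w * a + (1 - w) * b)) ** 2
                  for t, a, b in zip(treated_pre, donor_a_pre, donor_b_pre))
        if sse < best_sse:
            best_w, best_sse = w, sse
    return best_w

# treated unit sits exactly halfway between the two donors pre-intervention
w = best_donor_weight([2.0, 4.0, 6.0], [1.0, 3.0, 5.0], [3.0, 5.0, 7.0])
```

Applying the fitted weights to the donors' post-intervention values yields the counterfactual series, and the treatment effect is the gap between it and the treated unit's actual outcomes.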

9. ANCOVA (Analysis of Covariance)

The question it answers: What is the treatment effect after adjusting for pre-treatment differences?

ANCOVA combines ANOVA with regression to estimate treatment effects while controlling for covariates. In A/B testing, ANCOVA with pre-treatment outcome as a covariate (often called CUPED in tech) can dramatically reduce variance and increase statistical power. This means you can detect the same effect with fewer observations, or detect smaller effects with the same sample size. ANCOVA works for both randomized experiments and observational studies.

Use when: You have pre-treatment data on the outcome variable and want more precise treatment effect estimates, or when groups differ on important covariates.

Read the full ANCOVA guide →
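The CUPED adjustment itself is one line of algebra: subtract theta times the centered pre-treatment covariate from each outcome, where theta is the regression slope of outcome on covariate. A minimal sketch with hypothetical data:

```python
def cuped_adjust(y, x):
    """CUPED variance reduction: y_adj = y - theta * (x - mean(x)),
    where x is the pre-treatment covariate and theta = cov(x, y) / var(x).
    Group means are preserved in expectation, but variance shrinks."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    var = sum((xi - mean_x) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

pre = [1.0, 2.0, 3.0, 4.0]   # pre-treatment metric for each user
post = [2.0, 4.0, 6.0, 8.0]  # observed outcome
adjusted = cuped_adjust(post, pre)
```

In this toy example the covariate explains all of the outcome's variance, so the adjusted values collapse to a constant; in real experiments the variance reduction is proportional to the squared correlation between pre- and post-treatment metrics.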

Decision Guide: Which Method Do You Need?

Can you randomize?
│
├─ YES: You can assign users randomly
│   ├─ Want frequentist significance test?
│   │   └─ A/B Testing
│   ├─ Want probability of B > A, or early stopping?
│   │   └─ Bayesian A/B Testing
│   └─ Want to reduce variance with pre-treatment data?
│       └─ ANCOVA / CUPED
│
└─ NO: Treatment was not randomly assigned
    ├─ Do you have before/after data + a comparison group?
    │   └─ Difference-in-Differences
    ├─ Is treatment assigned by a numeric threshold?
    │   └─ Regression Discontinuity
    ├─ Do you have rich covariates about who got treated?
    │   └─ Propensity Score Matching
    ├─ Is this a time series with an intervention point?
    │   ├─ Multiple comparison series available?
    │   │   └─ Causal Impact
    │   └─ Single treated unit + pool of comparison units?
    │       └─ Synthetic Control
    └─ Do you have a credible instrument?
        └─ Instrumental Variables

How These Methods Relate

Causal inference methods sit on a spectrum from strongest to weakest internal validity:

  1. Randomized experiments (A/B testing, Bayesian A/B) -- the gold standard. Randomization eliminates confounders by design.
  2. Quasi-experimental methods (DiD, RD, synthetic control, causal impact) -- exploit natural variation or policy changes to approximate randomization. Strong when assumptions hold.
  3. Matching and adjustment methods (propensity scores, ANCOVA, IV) -- use statistical techniques to adjust for confounders. Quality depends entirely on having the right covariates or instruments.

In practice, most businesses start with A/B testing for digital experiments and use DiD or Causal Impact for interventions that cannot be randomized (pricing changes, market launches, policy shifts). Propensity score matching fills the gap when you have observational data and rich customer profiles. The other methods are specialized tools for specific situations.
