You have two groups and a number you care about. Did the new checkout flow actually increase order values, or did you just get lucky? Is there a real salary gap between departments, or is the difference within normal variation? The independent samples t-test gives you a clear, statistically grounded answer. Upload a CSV and find out in under 60 seconds.
What Is a T-Test?
The independent samples t-test answers one question: is the difference between two group averages real, or just random noise? It is the most widely used statistical test in both business and research, and for good reason — it is simple, well-understood, and gives you a clear yes-or-no answer backed by a probability.
Here is the intuition. Suppose you run an A/B test on your pricing page. Version A showed a price of $49 and version B showed $39. After a month, the average order value for Version A customers was $127 and for Version B customers it was $134. That is a $7 difference. But is it a real effect of the lower price point, or could you see a $7 gap just from normal customer-to-customer variation even if the two pages performed identically? The t-test calculates exactly how unlikely that gap would be if there were no real difference. If the probability is low enough (typically below 5%), you conclude the difference is real.
Mathematically, the t-test computes a t-statistic: the difference between the two group means divided by the standard error of that difference. The standard error accounts for both the sample sizes and how much variation exists within each group. A large t-statistic means the gap between groups is large relative to the noise, and the corresponding p-value tells you the probability of seeing a gap that large (or larger) by chance alone.
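This calculation is easy to reproduce in a few lines of base R. A minimal sketch, using simulated order values for the two pricing-page variants described above:

```r
# Simulated data standing in for the A/B example (not real results)
set.seed(42)
a <- rnorm(40, mean = 127, sd = 25)  # Version A customers
b <- rnorm(40, mean = 134, sd = 25)  # Version B customers

# Welch t-statistic: mean difference divided by its standard error,
# where the standard error combines each group's variance and sample size
se <- sqrt(var(a) / length(a) + var(b) / length(b))
t_manual <- (mean(a) - mean(b)) / se

# t.test() uses Welch's correction by default and reports the same statistic
res <- t.test(a, b)
all.equal(unname(res$statistic), t_manual)
```

The manual computation and `t.test()` agree exactly, which is a useful sanity check when auditing any report's numbers.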
When to Use the T-Test
Use the t-test whenever you have exactly two groups and a numeric outcome you want to compare. The classic scenarios span every domain:
A/B testing. You split website traffic between two variants and measure revenue per visitor, time on page, or another numeric outcome. The t-test tells you whether the winning variant genuinely outperforms the loser, or whether the observed difference is within the range of normal fluctuation. This is the foundation of data-driven product decisions at companies of every size.
Clinical trials. A treatment group receives a drug and a control group receives a placebo. The t-test determines whether the treatment group's outcomes (blood pressure reduction, symptom improvement, recovery time) are significantly better than the control group's. Regulatory agencies expect this level of statistical rigor.
HR and compensation analysis. Compare average salaries between two departments, genders, or office locations. The t-test quantifies whether an observed pay gap is statistically significant or falls within normal variation given sample sizes and salary dispersion. This is critical for pay equity audits and compliance reporting.
Marketing campaigns. You ran Campaign A (email-only) and Campaign B (email plus retargeting). Compare the response rate, average order value, or customer acquisition cost between the two campaigns. The t-test prevents you from declaring a winner based on a difference that could easily be noise.
The key constraint is exactly two groups. If you have three or more groups (e.g., four ad variants, five sales regions), use ANOVA instead. Running multiple t-tests across many pairs inflates your false positive rate — ANOVA handles this correctly in a single test.
What Data Do You Need?
You need a CSV file with at least two columns:
group_var — a categorical column that identifies which group each observation belongs to. This column must have exactly two unique values. Examples: "treatment" and "control", "Version A" and "Version B", "Male" and "Female", "Before" and "After". If your column has three or more unique values, the tool will flag it and you should use ANOVA instead.
measurement — a numeric column containing the outcome you want to compare. Examples: order value in dollars, test score, response time in seconds, satisfaction rating on a 1-10 scale. This must be a continuous (or at least interval-scale) numeric variable, not a category.
For reliable results, aim for at least 30 observations per group. The tool requires a minimum of 5 per group to run, but with fewer than 30 the test has limited statistical power — meaning it may fail to detect real differences. With very small samples (under 15 per group), the normality assumption becomes more important and the report's normality diagnostics deserve close attention.
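Base R's `power.t.test()` makes this power tradeoff concrete. A sketch (it assumes Student's t rather than Welch's, which is close enough for planning purposes):

```r
# Power to detect a medium effect (Cohen's d = 0.5) with 30 per group --
# roughly 0.47, so real differences of this size are missed about half the time
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)$power

# Sample size per group needed for 80% power at the same effect size (~64)
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)$n
```

The takeaway: 30 per group is a floor for the test to behave well, not a guarantee of detecting modest effects.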
The tool uses Welch's t-test by default, which does not assume equal variances between the two groups. This is the safer and more modern approach. If you know your groups have equal variances (or want to test it), the module_parameters include a var_equal option that switches to the classic Student's t-test. In practice, Welch's test performs as well or better than Student's in nearly all cases, so the default is almost always the right choice. The report includes a variance equality test (F-test) so you can verify this for yourself.
Additional module_parameters let you set the significance_level (default 0.05), confidence_level (default 0.95), and alternative hypothesis direction ("two.sided", "less", or "greater"). For most analyses, the defaults are appropriate.
How to Read the Report
The report contains nine sections, each serving a specific purpose. Here is what to look for in each one.
Analysis Overview. This section sets the stage: which groups are being compared, how many observations are in each, and what measurement is being tested. Verify that the groups and sample sizes match your expectations. If you see unexpected group labels, your group column may contain data quality issues (typos, extra whitespace, mixed casing).
Data Preprocessing. Shows what the tool did to prepare your data before running the test — handling missing values, identifying and flagging outliers, verifying the two-group constraint. If rows were removed, this section explains why. Check this if your sample sizes are smaller than expected.
Executive Summary. The plain-English bottom line. This section tells you whether the difference between groups is statistically significant, how large it is in practical terms, and what it means for your decision. If you only read one section, read this one. It synthesizes the t-statistic, p-value, and effect size into an actionable conclusion.
Distribution Comparison. Overlapping histograms or density plots showing how the two groups' values are distributed. This is your first visual check. If the two distributions overlap heavily, the groups are similar. If they are clearly separated, the difference is likely significant and meaningful. Look for skewness, bimodality, or outliers that might affect the test's reliability.
Box Plot Comparison. Side-by-side box plots showing the median, interquartile range (25th to 75th percentile), and outliers for each group. Box plots make it easy to compare central tendency and spread at a glance. Non-overlapping boxes with separated medians suggest a real difference. Outliers (individual dots beyond the whiskers) can pull the mean away from the median — if you see many outliers, the Mann-Whitney test may be more appropriate.
Normality Diagnostics (QQ Plot). Quantile-quantile plots and Shapiro-Wilk test results for each group. The QQ plot charts each group's data against what you would expect from a perfect normal distribution. Points falling along the diagonal line indicate normality. Deviations at the tails (S-curves or hooks) indicate skewness or heavy tails. A Shapiro-Wilk p-value below 0.05 formally rejects normality. For samples over 30, the t-test is robust to moderate non-normality thanks to the Central Limit Theorem. For smaller samples with clear normality violations, consider the Mann-Whitney U test.
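You can run the same diagnostics yourself on any group's values. A minimal sketch on simulated (genuinely normal) data:

```r
set.seed(7)
x <- rnorm(40, mean = 50, sd = 10)   # one group's measurements (simulated)

# Shapiro-Wilk: p-values below 0.05 reject normality
shapiro.test(x)$p.value

# QQ plot: points hugging the reference line indicate normality
qqnorm(x)
qqline(x)
```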
Effect Size. Cohen's d quantifies how large the difference is in standardized units (standard deviations). This is arguably more important than the p-value because it tells you whether the difference matters in practice, not just whether it is statistically detectable. The standard benchmarks: d = 0.2 is a small effect (the difference exists but is subtle), d = 0.5 is a medium effect (noticeable and potentially meaningful), and d = 0.8 or higher is a large effect (the groups are clearly different). A statistically significant result (p < 0.05) with a tiny effect size (d < 0.2) means the difference is real but probably too small to act on. Conversely, a non-significant result with a medium or large effect size suggests you may need more data.
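The report computes Cohen's d via the effsize package, but the formula itself is simply the mean difference divided by the pooled standard deviation. A minimal hand-rolled sketch, with data simulated so the true effect is near the medium benchmark:

```r
# Cohen's d: mean difference in units of the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  (mean(x) - mean(y)) / sqrt(pooled_var)
}

set.seed(3)
a <- rnorm(50, mean = 108, sd = 15)
b <- rnorm(50, mean = 100, sd = 15)  # true standardized difference: 8/15 ~ 0.53
cohens_d(a, b)                       # a medium-sized effect, give or take noise
```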
Test Results. The core statistical output. The t-statistic tells you how many standard errors apart the two group means are. The degrees of freedom (adjusted for Welch's correction) indicate the effective sample size. The p-value is the probability of observing a t-statistic this extreme if the two groups were actually identical. A p-value below your significance level (typically 0.05) means you reject the null hypothesis and conclude the groups differ. The confidence interval for the mean difference shows the range of plausible values for the true difference — if it does not include zero, the difference is significant at your chosen confidence level.
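All of these quantities live on the object `t.test()` returns, so they are easy to pull out programmatically. A sketch with simulated data:

```r
set.seed(11)
a <- rnorm(40, mean = 127, sd = 25)
b <- rnorm(40, mean = 134, sd = 25)

res <- t.test(a, b)   # Welch's test
res$statistic         # t-statistic
res$parameter         # Welch-adjusted degrees of freedom
res$p.value           # p-value
res$conf.int          # 95% CI for the difference in means

# Duality: the two-sided test is significant at the 5% level exactly
# when the 95% confidence interval excludes zero
(res$p.value < 0.05) == !(res$conf.int[1] < 0 && res$conf.int[2] > 0)
```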
Summary Statistics. Group-level descriptive statistics: mean, standard deviation, median, minimum, maximum, and sample size for each group. Use this to ground the abstract test results in concrete numbers. The mean difference and its direction are immediately visible here. If the standard deviations are very different between groups (e.g., one is three times the other), Welch's correction is doing important work — the classic Student's t-test would be unreliable.
When to Use Something Else
The t-test is the right tool for two-group, numeric-outcome comparisons when the data is approximately normal. But several common situations call for a different approach.
If your data is heavily skewed, has extreme outliers, or is measured on an ordinal scale (like a 1-5 Likert rating), use the Mann-Whitney U test. It works on ranks rather than means, so it compares the groups' typical values without distributional assumptions. The t-test report's normality diagnostics will flag when this is the better choice. Try it free.
If you have three or more groups, use ANOVA. Running separate t-tests for every pair of groups inflates your false positive rate. With four groups, that is six pairwise comparisons and a 26% chance of at least one false positive. ANOVA handles all groups in a single test with proper error control, and its post-hoc tests (Tukey HSD) identify which specific pairs differ. Try it free.
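The 26% figure comes from compounding the 5% per-test false positive rate across all six pairwise comparisons (treating the tests as independent, which is a simplification):

```r
k <- choose(4, 2)     # four groups -> 6 pairwise t-tests
1 - (1 - 0.05)^k      # probability of at least one false positive: ~0.265
```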
If the same subjects are measured twice (before and after an intervention, or under two conditions), you need a paired t-test, not the independent samples version. The paired test accounts for within-subject correlation and is substantially more powerful in this design. The independent samples t-test treats each observation as coming from a different person, which wastes information and reduces sensitivity when the data is actually paired.
If your outcome variable is categorical rather than numeric — for example, comparing pass/fail rates or yes/no responses between two groups — use a chi-square test. The t-test requires a numeric measurement; it cannot analyze proportions or counts in a contingency table. Try it free.
If your data has more than two groups and is non-normal, the Kruskal-Wallis test is the non-parametric analogue of ANOVA. It handles three or more groups without assuming normality. Try it free.
The R Code Behind the Analysis
Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. This is not AI-generated code that changes every run. The same data produces the same analysis every time.
The analysis uses t.test() from base R with Welch's correction enabled by default (var.equal = FALSE). This is the same function used in university statistics courses, textbooks, and peer-reviewed research. The effect size is computed using Cohen's d from the effsize package. Normality is assessed with shapiro.test() (Shapiro-Wilk test) and QQ plots, while variance equality is checked using var.test() (F-test for ratio of variances). If you set var_equal = TRUE in the module parameters, the analysis switches to the classic Student's t-test (var.equal = TRUE), but the default Welch variant is recommended for almost all real-world data. Every step is visible in the code tab of your report, so you or a statistician can verify exactly what was done.
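Put together, the pipeline described above looks roughly like this. The column names follow this article's conventions and the data frame is simulated for illustration; the report's actual code is in its code tab:

```r
set.seed(99)
df <- data.frame(
  group_var   = rep(c("treatment", "control"), each = 40),
  measurement = c(rnorm(40, mean = 134, sd = 25),
                  rnorm(40, mean = 127, sd = 25))
)

# 1. Normality per group (Shapiro-Wilk)
by(df$measurement, df$group_var, function(x) shapiro.test(x)$p.value)

# 2. Variance equality (F-test for the ratio of variances)
var.test(measurement ~ group_var, data = df)

# 3. Welch's t-test (set var.equal = TRUE for classic Student's)
t.test(measurement ~ group_var, data = df, var.equal = FALSE)
```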