Analysis overview and configuration
| Parameter | Value |
|---|---|
| batch_assignments | auto |
| content_type_regex | `/articles/\|/tutorials/\|/blogs/\|/whitepapers/` |
| min_impressions | 10 |
| win_threshold | 0.0 |
This analysis evaluates a portfolio of 12 active experiments to identify winning hypotheses and success patterns. The objective is to determine which treatments outperform controls and which characteristics (impression volume, ranking position, content type) correlate with experiment success. Understanding these patterns enables more efficient hypothesis testing and resource allocation.
The portfolio shows early-stage results with limited statistical confidence. Only one experiment demonstrates clear success, while most remain underpowered. The position-based pattern (positions 11-20 outperforming 4-10) is directional at best given how few experiments are testable.
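As a rough illustration, a selection pass driven by the configuration above might look like the following sketch. The field names (`page_path`, `impressions`) and the helper functions are assumptions for illustration, not the pipeline's actual schema.

```python
import re

# Configuration values taken from the table above.
CONTENT_TYPE_REGEX = re.compile(r"/articles/|/tutorials/|/blogs/|/whitepapers/")
MIN_IMPRESSIONS = 10
WIN_THRESHOLD = 0.0  # any positive adjusted lift counts toward a win

def content_type(page_path: str) -> str:
    """Label a page by the first matching path segment, else 'other'."""
    m = CONTENT_TYPE_REGEX.search(page_path)
    return m.group(0).strip("/") if m else "other"

def eligible(impressions: int) -> bool:
    """Pages below the impression floor are excluded from verdicts."""
    return impressions >= MIN_IMPRESSIONS

print(content_type("/blogs/how-to-test-titles"))  # -> "blogs"
print(eligible(7))                                # -> False
```

Every row in the per-experiment table below carries content_type "other", consistent with none of the analyzed pages matching this regex.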
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 19 |
| Final Rows | 19 |
| Rows Removed | 0 |
| Retention Rate | 100% |
This section documents the data preprocessing pipeline for the experiment analysis, showing that all 19 observations (12 experiment verdicts, 1 batch summary, and 6 pattern analyses) were retained without removal. Perfect retention indicates either minimal data quality issues or that no aggressive filtering was applied, which is critical for maintaining statistical validity in A/B testing analysis, where sample size directly impacts power and significance detection.
The 100% retention rate reflects a conservative preprocessing approach appropriate for experiment analysis, where each trial represents a distinct business decision point. However, the absence of any data cleaning raises questions about how missing p-values (75% of cases marked "insufficient") and extreme values (raw lift ranging from -100% to +732%) were handled. The lack of train/test splitting is expected, since this is a retrospective analysis of completed experiments rather than a predictive modeling task.
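A flag-don't-drop pass of the kind implied here might look like this sketch; the column names (`p_value`, `adjusted_lift_pct`) and the ±100% extreme-value cutoff are assumptions, not the pipeline's documented behavior.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Annotate quality issues without removing rows (100% retention)."""
    out = df.copy()
    # Rather than dropping rows with missing p-values or extreme lifts,
    # mark them so the verdict step can classify them as 'insufficient'.
    out["flag_missing_p"] = out["p_value"].isna()
    out["flag_extreme_lift"] = out["adjusted_lift_pct"].abs() >= 100
    assert len(out) == len(df)  # nothing dropped
    return out
```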
| Finding | Value |
|---|---|
| Total Experiments | 12 (3 testable) |
| Program Win Rate | 33.3% |
| Industry Benchmark | 15% (SearchPilot) |
| Winners (Promote) | 1 experiment |
| Projected Monthly Click Uplift | 9 clicks |
| Estimated Monthly Value | $44 (at $5 CPC) |
This analysis evaluates a portfolio of 12 SEO experiments to determine whether the testing program is delivering measurable business value. The assessment synthesizes win rates, effect sizes, and ROI projections to inform deployment decisions and resource allocation for the broader optimization initiative.
The program demonstrates promise, with a win rate well above industry norms and one experiment showing exceptional lift. However, the portfolio remains largely underpowered: 75% of experiments cannot yet support confident decisions. The median adjusted lift of -50% reflects the high proportion of inconclusive cases rather than true negative performance. This suggests the testing infrastructure itself is sound; most experiments simply need more traffic or longer run times before they can support confident verdicts.
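The headline figures can be reproduced from the findings table alone; the only assumption is the stated $5 CPC:

```python
# Back-of-envelope check of the headline numbers above.
testable, winners = 3, 1
win_rate = winners / testable                 # 33.3%, vs ~15% benchmark
monthly_click_uplift = 8.891                  # projected clicks from exp-016
monthly_value = monthly_click_uplift * 5.0    # assumed $5 CPC -> ~$44/month
print(f"win rate {win_rate:.1%}, value ${monthly_value:.0f}/month")
```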
Per-experiment win/loss/neutral status with CTR lift and significance
| experiment_id | control_impressions | control_clicks | control_ctr | control_adj_ctr | treatment_impressions | treatment_clicks | treatment_ctr | treatment_adj_ctr | raw_lift_pct | adjusted_lift_pct | verdict | content_type | impression_bucket | position_bucket | batch_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exp-007 | 1693 | 0 | 0 | 0 | 41 | 1 | 0.0244 | 0.7644 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-009 | 1639 | 0 | 0 | 0 | 262 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-010 | 1567 | 1 | 0.0006 | 0.0459 | 422 | 1 | 0.0024 | 0.1261 | 271.3 | 174.7 | neutral | other | 500-2000 | 4-10 | all_experiments |
| exp-012 | 1303 | 2 | 0.0015 | 0.3483 | 282 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-015 | 1331 | 3 | 0.0023 | 0.3679 | 176 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-016 | 1082 | 2 | 0.0018 | 0.4034 | 195 | 3 | 0.0154 | 1.923 | 732.3 | 376.6 | winning | other | 500-2000 | 11-20 | all_experiments |
| exp-017 | 919 | 1 | 0.0011 | 0.1589 | 291 | 1 | 0.0034 | 0.3765 | 215.8 | 136.9 | neutral | other | 500-2000 | 11-20 | all_experiments |
| exp-027 | 287 | 43 | 0.1498 | 6.055 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 100-500 | 4-10 | all_experiments |
| exp-028 | 2080 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-029 | 3259 | 3 | 0.0009 | 0.3009 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-030 | 1727 | 2 | 0.0012 | 0.0874 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-031 | 2782 | 5 | 0.0018 | 0.107 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 4-10 | all_experiments |
This section identifies which experiments demonstrate statistically significant improvements (winning), declines (losing), or inconclusive results (neutral/insufficient). Of 12 experiments, only 3 have adequate statistical power to draw reliable conclusions, making this a critical filter for distinguishing real effects from noise in the testing portfolio.
The data reveals a portfolio heavily constrained by sample size rather than treatment quality. The single winning experiment shows substantial effect magnitude, but the high proportion of insufficient verdicts (9/12) indicates most experiments cannot yet distinguish signal from noise. This pattern suggests the testing infrastructure may be underpowered for the baseline click rates observed (mean control CTR ≈ 0.01), requiring either longer run times or higher-traffic segments to achieve reliable conclusions.
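The report does not document the exact verdict rule, but a two-proportion z-test with guards for empty arms is consistent with every row in the table above. The thresholds below (10 impressions and at least one click per arm) are inferred, not confirmed:

```python
from math import sqrt
from scipy.stats import norm

def verdict(c_imp, c_clk, t_imp, t_clk, min_impressions=10, alpha=0.05):
    """Classify an experiment as winning/losing/neutral/insufficient."""
    # An arm with too little traffic or zero clicks leaves the z-test
    # undefined or hopelessly underpowered -> 'insufficient'.
    if min(c_imp, t_imp) < min_impressions or min(c_clk, t_clk) < 1:
        return "insufficient"
    p_pool = (c_clk + t_clk) / (c_imp + t_imp)
    se = sqrt(p_pool * (1 - p_pool) * (1 / c_imp + 1 / t_imp))
    z = (t_clk / t_imp - c_clk / c_imp) / se
    if 2 * norm.sf(abs(z)) >= alpha:
        return "neutral"
    return "winning" if z > 0 else "losing"

print(verdict(1082, 2, 195, 3))   # exp-016 -> "winning"
print(verdict(1567, 1, 422, 1))   # exp-010 -> "neutral"
print(verdict(1693, 0, 41, 1))    # exp-007 -> "insufficient"
```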
Win rate and average effect size comparison across hypothesis batches
This section evaluates hypothesis batch performance by comparing win rates and average effect sizes across experiment groups. It identifies which hypothesis types (e.g., title framing, intent matching) are generating the strongest positive results, enabling prioritization of high-performing hypotheses for scaling and resource allocation.
The single batch ("all_experiments") demonstrates strong performance relative to industry standards, with one clear winner and two neutral results among testable experiments. However, the high proportion of insufficient experiments (9 of 12) suggests most hypotheses lack adequate sample size or effect magnitude for confident conclusions. The exceptional 376.65% average lift is computed over winners only, so it reflects the single winning experiment's impact rather than a portfolio-wide effect.
This analysis treats all 12 experiments as a single batch, so the cross-batch comparison of hypothesis types described above is not yet possible; batch-level contrasts will only become meaningful once experiments are assigned to distinct hypothesis groups.
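Once multiple batches exist, the rollup could be as simple as the following sketch, assuming a DataFrame shaped like the per-experiment table above; note that the average lift is taken over winners only, matching how the 376.65% figure reads here.

```python
import pandas as pd

def batch_summary(verdicts: pd.DataFrame) -> pd.DataFrame:
    """Per-batch testable count, win rate, and mean winning lift."""
    testable = verdicts[verdicts["verdict"] != "insufficient"]
    winners = testable[testable["verdict"] == "winning"]
    return pd.DataFrame({
        "testable": testable.groupby("batch_name").size(),
        "win_rate": testable.groupby("batch_name")["verdict"]
                            .apply(lambda v: (v == "winning").mean()),
        "avg_winning_lift_pct": winners.groupby("batch_name")
                                       ["adjusted_lift_pct"].mean(),
    })
```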
Win rate segmented by content type, impression level, and position bucket
This section identifies which page characteristics—traffic volume, ranking position, and content type—correlate with successful title experiments. By segmenting the 12 experiments across impression buckets and position ranges, the analysis reveals whether certain page types respond more favorably to title changes, enabling future experiments to be targeted at high-opportunity segments.
The data suggests a weak but directional pattern: pages ranking in positions 11-20 achieved the only position-based win, while mid-traffic pages (500-2000 impressions) showed marginal success. Conversely, higher-ranking pages (positions 4-10) and very high-traffic pages (2000+ impressions) produced no wins, though with so few testable experiments these segment differences should be read as hypotheses rather than conclusions.
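The segmentation itself reduces to a grouped win rate over testable experiments; a sketch against the same assumed verdicts DataFrame:

```python
import pandas as pd

def segment_win_rates(verdicts: pd.DataFrame, dim: str) -> pd.Series:
    """Win rate per segment, counting only testable experiments."""
    testable = verdicts[verdicts["verdict"] != "insufficient"]
    return (testable.groupby(dim)["verdict"]
                    .apply(lambda v: (v == "winning").mean()))

# e.g. segment_win_rates(df, "position_bucket") contrasts 4-10 vs 11-20;
# "impression_bucket" and "content_type" are the other dimensions above.
```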
Distribution of position-adjusted CTR lifts across all experiments
This section evaluates whether successful experiments deliver substantial impact or marginal gains. Understanding effect size distribution reveals the magnitude of wins relative to losses, helping assess whether the experimental portfolio is generating transformative improvements or incremental gains. This directly informs the value proposition of the testing program.
The portfolio exhibits polarized results: one transformative win offset by predominantly neutral or negative experiments. The absence of moderate wins (5-10% range) suggests either hypothesis diversity with high variance or insufficient statistical power to detect smaller effects. The extreme positive outlier (exp-016) represents a genuine breakthrough, but 75% insufficient verdicts indicate most experiments lack conclusive evidence. This distribution reflects early-stage testing, where a few large effects surface before the program has enough data to resolve moderate ones.
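Bucketing the adjusted lifts from the per-experiment table makes the polarized shape explicit; the bin edges here are illustrative choices, not the report's own bins:

```python
import pandas as pd

# Adjusted lifts (%) copied from the per-experiment verdict table.
lifts = pd.Series([0, 0, 174.7, -100, -100, 376.6, 136.9,
                   -100, 0, -100, -100, -100])
bins = [float("-inf"), -10, 0, 5, 10, float("inf")]
labels = ["< -10%", "-10..0%", "0..5%", "5..10%", "> 10%"]
print(pd.cut(lifts, bins=bins, labels=labels).value_counts().sort_index())
# -> 6 strongly negative, 3 at zero, 0 moderate wins, 3 large positives
```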
Projected click uplift if winning experiments are promoted to production
| experiment_id | adjusted_lift_pct | monthly_impressions_estimate | projected_monthly_click_uplift |
|---|---|---|---|
| exp-016 | 376.6 | 1277 | 8.891 |
This section quantifies the business impact of promoting winning experiments to production by projecting incremental click volume and associated revenue. It translates experimental lift percentages into actionable monthly and annual metrics, enabling stakeholders to understand the tangible value generated by the single winning experiment (exp-016) among the 3 testable experiments in this batch.
The winning experiment demonstrates substantial lift (376.65%), but the modest absolute gain (roughly 9 clicks per month) reflects the relatively small impression volume (1,277) and low baseline click rates observed across the batch. The $44 monthly value represents incremental revenue from a single high-performing variant. This projection assumes consistent traffic patterns and a sustained treatment effect post-launch.
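The projection is reproducible from the verdict table; the formula below is an inference from the numbers shown, not documented methodology:

```python
# extra clicks = monthly impressions x control CTR x adjusted lift
monthly_impressions = 1277          # exp-016 combined traffic estimate
control_ctr = 2 / 1082              # 2 clicks over 1082 control impressions
lift = 3.7665                       # +376.65% adjusted lift as a fraction
extra_clicks = monthly_impressions * control_ctr * lift
print(f"{extra_clicks:.3f} extra clicks/month")     # ~8.891, as tabled
print(f"~${extra_clicks * 5:.0f}/month at $5 CPC")  # ~$44
```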
Power analysis and recommendations for experiments needing more data
| experiment_id | current_impressions | power_estimate | impressions_needed_80pct | estimated_days_to_significance | recommended_action |
|---|---|---|---|---|---|
| exp-007 | 1734 | 0.79 | 3468 | 14 | Run 14 more days |
| exp-009 | 1901 | 0.79 | 3802 | 14 | Run 14 more days |
| exp-012 | 1585 | 0.79 | 3170 | 14 | Run 14 more days |
| exp-015 | 1507 | 0.79 | 3014 | 14 | Run 14 more days |
| exp-027 | 287 | 0.79 | 574 | 14 | Run 14 more days |
| exp-028 | 2080 | 0.79 | 4160 | 14 | Run 14 more days |
| exp-029 | 3259 | 0.79 | 6518 | 14 | Run 14 more days |
| exp-030 | 1727 | 0.79 | 3454 | 14 | Run 14 more days |
| exp-031 | 2782 | 0.79 | 5564 | 14 | Run 14 more days |
This section identifies which experiments lack sufficient statistical power to draw reliable conclusions. Nine of the twelve experiments currently operate at 79% power—below the 80% threshold needed for confident decision-making. Understanding data sufficiency is critical because premature conclusions from underpowered tests risk false negatives, while extending tests unnecessarily delays business decisions.
The impressions needed for 80% power are exactly double the current count in every row, so the uniform 14-day recommendation follows directly: at a steady traffic rate, each test needs roughly as long again as it has already run. These experiments sit at the margin of statistical reliability; the additional data should push them above the 80% power threshold and enable defensible conclusions about treatment effects.
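For comparison, a standard two-proportion power calculation (normal approximation via statsmodels) gives a sense of the per-arm traffic such tests require; the baseline CTR and target effect below are illustrative assumptions, not the pipeline's actual inputs:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.002               # typical control CTR in this portfolio
target_ctr = 0.004                 # detect a doubling of CTR
effect = proportion_effectsize(target_ctr, baseline_ctr)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                         power=0.80, ratio=1.0)
print(f"~{n_per_arm:,.0f} impressions per arm for 80% power")
```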