Portfolio Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| batch_assignments | auto |
| content_type_regex | /articles/\|/tutorials/\|/blogs/\|/whitepapers/ |
| min_impressions | 10 |
| win_threshold | 0.0 |
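The content_type_regex parameter suggests pages are bucketed by URL path. A minimal sketch of how such classification might work (the function name and the "other" fallback label are assumptions, though "other" matches the content_type values seen later in the verdict table):

```python
import re

# Alternatives taken from the content_type_regex parameter above.
CONTENT_TYPE_PATTERN = re.compile(r"/(articles|tutorials|blogs|whitepapers)/")

def classify_content_type(url_path: str) -> str:
    """Return the matched content type, or 'other' when no pattern matches."""
    match = CONTENT_TYPE_PATTERN.search(url_path)
    return match.group(1) if match else "other"

print(classify_content_type("/articles/seo-titles"))  # articles
print(classify_content_type("/pricing"))              # other
```

Every experiment in this report falls into the "other" bucket, which is consistent with a fallback like this firing when none of the listed paths match.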
Purpose
This analysis evaluates a portfolio of 12 active experiments to identify winning hypotheses and success patterns. The objective is to determine which treatments outperform controls and what characteristics—impression volume, ad position, content type—correlate with experiment success. Understanding these patterns enables more efficient hypothesis testing and resource allocation.
Key Findings
- Win Rate: 33.33% (1 winning experiment out of 3 testable) with 75% of experiments marked insufficient for statistical conclusion
- Winning Experiment (exp-016): 376.65% adjusted lift with projected 8.89 monthly click uplift from ~1,277 impressions
- Position Pattern: Ads in positions 11-20 show 16.67% win rate vs. 0% for positions 4-10, suggesting placement matters
- Impression Bucket Pattern: Mid-volume experiments (500-2,000 impressions) achieve 12.5% win rate; low and high volumes show 0%
- Data Maturity Issue: 9 of 12 experiments need ~14 additional days to reach 80% statistical power
Interpretation
The portfolio shows early-stage results with limited statistical confidence. Only one experiment demonstrates clear success, while most remain underpowered. The position-based pattern (positions 11-20 outperforming positions 4-10) is directional rather than conclusive and should be revisited once the underpowered experiments mature.
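The headline win rate counts only testable verdicts (winning, losing, or neutral) in the denominator, while the insufficiency share is taken over all 12 experiments. A sketch of that arithmetic, using the verdicts from the table later in this report:

```python
# Verdicts per experiment, as reported in the Experiment Verdicts table.
verdicts = {
    "exp-007": "insufficient", "exp-009": "insufficient", "exp-010": "neutral",
    "exp-012": "insufficient", "exp-015": "insufficient", "exp-016": "winning",
    "exp-017": "neutral", "exp-027": "insufficient", "exp-028": "insufficient",
    "exp-029": "insufficient", "exp-030": "insufficient", "exp-031": "insufficient",
}

# Win rate over testable experiments only; insufficiency over all experiments.
testable = [v for v in verdicts.values() if v != "insufficient"]
win_rate = 100 * testable.count("winning") / len(testable)
insufficient_share = 100 * list(verdicts.values()).count("insufficient") / len(verdicts)

print(f"win rate: {win_rate:.2f}% of {len(testable)} testable")  # 33.33% of 3
print(f"insufficient: {insufficient_share:.0f}%")                # 75%
```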
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for the experiment analysis, showing that all 19 observations (12 experiment verdicts, 1 batch summary, 6 pattern analyses, and supporting datasets) were retained without removal. Perfect retention indicates either minimal data quality issues or that no aggressive filtering was applied, which is critical for maintaining statistical validity in A/B testing analysis where sample size directly impacts power and significance detection.
Key Findings
- Retention Rate: 100% (19/19 rows) - No observations were excluded during preprocessing, preserving the full experimental dataset
- Rows Removed: 0 - No filtering, deduplication, or outlier removal occurred
- Train/Test Split: Not applicable - This is descriptive analysis of completed experiments, not predictive modeling
- Data Completeness: All 12 experiments retained despite 75% missing p-values, suggesting missing values were not treated as grounds for exclusion
Interpretation
The 100% retention rate reflects a conservative preprocessing approach appropriate for experiment analysis, where each trial represents a distinct business decision point. However, the absence of any data cleaning raises questions about how missing p-values (75% of cases marked "insufficient") and extreme values (raw lift ranging -100% to +732%) were handled. The lack of train/test splitting is expected since this is retrospective analysis of completed experiments rather than predictive modeling.
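A retention audit like the one described reduces to a pair of invariant checks. A sketch using the counts reported in this section (the variable names are illustrative):

```python
# Counts reported in this section.
rows_in = 19
rows_removed = 0
rows_out = rows_in - rows_removed

retention_rate = rows_out / rows_in
assert retention_rate == 1.0, "preprocessing dropped rows unexpectedly"

# The component counts reported above must add up to the retained total.
components = {"experiment_verdicts": 12, "batch_summary": 1, "pattern_analyses": 6}
assert sum(components.values()) == rows_out

print(f"retained {rows_out}/{rows_in} rows ({retention_rate:.0%})")  # 19/19 (100%)
```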
Executive Summary
Executive summary with key findings and recommendations
| Finding | Value |
|---|---|
| Total Experiments | 12 (3 testable) |
| Program Win Rate | 33.3% |
| Industry Benchmark | 15% (SearchPilot) |
| Winners (Promote) | 1 experiment |
| Projected Monthly Click Uplift | 9 clicks |
| Estimated Monthly Value | $44 (at $5 CPC) |
Program Health:
• Analyzed 12 experiments (7 with treatment data, 5 control-only)
• Overall win rate: 33.3% — Above 15% industry benchmark ✓
• 1 winning, 0 losing, 2 neutral, 9 insufficient data
Effect Sizes:
• Winners average +376.6% position-adjusted CTR lift
• Median effect size: -50.0%
• Largest positive lift: +376.6%
ROI Projection:
• 9 additional monthly clicks if all winners promoted
• Estimated value: $44/month at $5 CPC
Recommendations:
• Promote: 1 winning experiment immediately
• Monitor: 2 neutral experiments (may need more time)
• End Early: 0 losing experiments
• Data Collection: 9 experiments need 14 more days on average
Purpose
This analysis evaluates a portfolio of 12 SEO experiments to determine whether the testing program is delivering measurable business value. The assessment synthesizes win rates, effect sizes, and ROI projections to inform deployment decisions and resource allocation for the broader optimization initiative.
Key Findings
- Overall Win Rate: 33.3% — Exceeds the 15% industry benchmark, though with only 3 testable experiments the comparison carries wide uncertainty
- Winning Effect Size: +376.6% adjusted CTR lift in the single confirmed winner, demonstrating substantial impact when treatments succeed
- Statistical Maturity: 75% of experiments (9 of 12) lack sufficient data; only 3 experiments have testable verdicts, limiting confidence in portfolio-level conclusions
- Projected Monthly Value: 8.9 additional clicks from promotion of winning experiments; modest absolute impact but positive directional signal
- Data Collection Timeline: Insufficient experiments require approximately 14 additional days to reach 80% statistical power
Interpretation
The program demonstrates promise with a win rate well above industry norms and one experiment showing exceptional lift. However, the portfolio remains largely underpowered—75% of experiments cannot yet support confident decisions. The median adjusted lift of -50% reflects the high proportion of inconclusive cases rather than true negative performance. This suggests the testing infrastructure is sound but needs more data before portfolio-level conclusions can be drawn with confidence.
Experiment Verdicts
Per-experiment win/loss/neutral status with CTR lift and significance
| experiment_id | control_impressions | control_clicks | control_ctr | control_adj_ctr | treatment_impressions | treatment_clicks | treatment_ctr | treatment_adj_ctr | raw_lift_pct | adjusted_lift_pct | verdict | content_type | impression_bucket | position_bucket | batch_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exp-007 | 1693 | 0 | 0 | 0 | 41 | 1 | 0.0244 | 0.7644 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-009 | 1639 | 0 | 0 | 0 | 262 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-010 | 1567 | 1 | 0.0006 | 0.0459 | 422 | 1 | 0.0024 | 0.1261 | 271.3 | 174.7 | neutral | other | 500-2000 | 4-10 | all_experiments |
| exp-012 | 1303 | 2 | 0.0015 | 0.3483 | 282 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-015 | 1331 | 3 | 0.0023 | 0.3679 | 176 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-016 | 1082 | 2 | 0.0018 | 0.4034 | 195 | 3 | 0.0154 | 1.923 | 732.3 | 376.6 | winning | other | 500-2000 | 11-20 | all_experiments |
| exp-017 | 919 | 1 | 0.0011 | 0.1589 | 291 | 1 | 0.0034 | 0.3765 | 215.8 | 136.9 | neutral | other | 500-2000 | 11-20 | all_experiments |
| exp-027 | 287 | 43 | 0.1498 | 6.055 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 100-500 | 4-10 | all_experiments |
| exp-028 | 2080 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-029 | 3259 | 3 | 0.0009 | 0.3009 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-030 | 1727 | 2 | 0.0012 | 0.0874 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-031 | 2782 | 5 | 0.0018 | 0.107 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 4-10 | all_experiments |
Purpose
This section identifies which experiments demonstrate statistically significant improvements (winning), declines (losing), or inconclusive results (neutral/insufficient). Of 12 experiments, only 3 have adequate statistical power to draw reliable conclusions, making this a critical filter for distinguishing real effects from noise in the testing portfolio.
Key Findings
- Win Rate: 1 of 3 testable experiments (33.3%) shows positive, significant lift—indicating modest success in the overall testing program
- Adjusted Lift %: The winning experiment (exp-016) demonstrates 376.65% position-adjusted CTR improvement, isolating the title treatment effect from ranking confounds
- Insufficient Data: 75% of experiments lack statistical power, with a median p-value of 0.9, reflecting low click volumes relative to variance
- No Losses: Zero experiments show statistically significant negative effects, suggesting treatments are not harmful
Interpretation
The data reveals a portfolio heavily constrained by sample size rather than treatment quality. The single winning experiment shows substantial effect magnitude, but the high proportion of insufficient verdicts (9/12) indicates most experiments cannot yet distinguish signal from noise. This pattern suggests the testing infrastructure may be underpowered for the baseline click rates observed (mean control CTR = 0.01), requiring either longer run times or higher-traffic segments to achieve reliable conclusions.
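The raw and adjusted lift columns are consistent with the standard relative-lift formula applied to the raw and position-adjusted CTRs respectively. A sketch reproducing exp-016's row (inputs taken from the table; the tiny gap between 376.7% and the reported 376.6% is rounding of the tabulated adjusted CTRs):

```python
def relative_lift_pct(control_ctr: float, treatment_ctr: float) -> float:
    """Percent change of treatment over control; undefined when control CTR is 0."""
    if control_ctr == 0:
        raise ValueError("control CTR is zero; lift is undefined")
    return 100 * (treatment_ctr - control_ctr) / control_ctr

# exp-016: control 2 clicks / 1,082 impressions, treatment 3 clicks / 195 impressions
raw = relative_lift_pct(2 / 1082, 3 / 195)
# Position-adjusted CTRs as reported in the table for exp-016.
adjusted = relative_lift_pct(0.4034, 1.923)

print(f"raw lift: {raw:.1f}%")            # 732.3%
print(f"adjusted lift: {adjusted:.1f}%")  # 376.7%
```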
Batch Comparison
Win rate and average effect size comparison across hypothesis batches
Purpose
This section evaluates hypothesis batch performance by comparing win rates and average effect sizes across experiment groups. It identifies which hypothesis types (e.g., title framing, intent matching) are generating the strongest positive results, enabling prioritization of high-performing hypotheses for scaling and resource allocation.
Key Findings
- Win Rate: 33.3% — Exceeds the 15% industry benchmark, suggesting above-average hypothesis quality, though the testable sample is small
- Average Adjusted Lift: 376.65% — The mean CTR improvement across winning experiments (here, a single winner), demonstrating substantial effect sizes when experiments succeed
- Testable Experiments: 3 of 12 — Only 25% of experiments produced conclusive verdicts; 75% remain insufficient, limiting batch-level conclusions
- Zero Losses: No experiments showed statistically significant negative lift, reducing downside risk
Interpretation
The single batch ("all_experiments") demonstrates strong performance relative to industry standards, with one clear winner and two neutral results among testable experiments. However, the high proportion of insufficient experiments (9 of 12) suggests most hypotheses lack adequate sample size or effect magnitude for confident conclusions. The exceptional 376.65% average lift reflects the winning experiment's substantial impact, though this represents only one successful case within a larger portfolio of underpowered tests.
Context
This analysis treats all 12 experiments as a single batch ("all_experiments"), so no cross-batch comparison is possible until experiments are assigned to distinct hypothesis batches.
Success Patterns
Win rate segmented by content type, impression level, and position bucket
Purpose
This section identifies which page characteristics—traffic volume, ranking position, and content type—correlate with successful title experiments. By segmenting the 12 experiments across impression buckets and position ranges, the analysis reveals whether certain page types respond more favorably to title changes, enabling future experiments to be targeted at high-opportunity segments.
Key Findings
- Position Bucket 11-20 Win Rate: 16.67% (1 win from 6 experiments) — the highest-performing segment, suggesting lower-ranked pages may be more responsive to title optimization
- Impression Bucket 500-2000 Win Rate: 12.5% (1 win from 8 experiments) — moderate traffic pages show modest success, representing the largest tested segment
- Position Bucket 4-10 Win Rate: 0% (0 wins from 6 experiments) — higher-ranked pages show no successful outcomes despite equal sample size
- Overall Pattern: Win rates remain below 20% across all segments, indicating limited predictive power at current sample sizes
Interpretation
The data suggests a weak but directional pattern: pages ranking in positions 11-20 achieved the only position-based win, while mid-traffic pages (500-2000 impressions) showed marginal success. Conversely, higher-ranking pages (4-10) and very low- or high-traffic pages produced no wins, though the small sample size in every segment limits confidence in these patterns.
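The segment win rates above follow from grouping the verdict table by bucket, with all 12 experiments (including insufficient ones) in each segment's denominator. A sketch for the position buckets (rows abbreviated from the verdict table):

```python
from collections import defaultdict

# (experiment_id, position_bucket, verdict) from the Experiment Verdicts table.
rows = [
    ("exp-007", "4-10", "insufficient"),  ("exp-009", "4-10", "insufficient"),
    ("exp-010", "4-10", "neutral"),       ("exp-012", "11-20", "insufficient"),
    ("exp-015", "11-20", "insufficient"), ("exp-016", "11-20", "winning"),
    ("exp-017", "11-20", "neutral"),      ("exp-027", "4-10", "insufficient"),
    ("exp-028", "11-20", "insufficient"), ("exp-029", "11-20", "insufficient"),
    ("exp-030", "4-10", "insufficient"),  ("exp-031", "4-10", "insufficient"),
]

wins, totals = defaultdict(int), defaultdict(int)
for _, bucket, verdict in rows:
    totals[bucket] += 1
    wins[bucket] += verdict == "winning"

for bucket in sorted(totals):
    rate = 100 * wins[bucket] / totals[bucket]
    print(f"positions {bucket}: {rate:.2f}% ({wins[bucket]}/{totals[bucket]})")
```

Note this segment-level denominator differs from the portfolio win rate, which divides by testable experiments only; hence 16.67% (1/6) here versus 33.3% (1/3) at portfolio level.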
Effect Size Distribution
Distribution of position-adjusted CTR lifts across all experiments
Purpose
This section evaluates whether successful experiments deliver substantial impact or marginal gains. Understanding effect size distribution reveals the magnitude of wins relative to losses, helping assess whether the experimental portfolio is generating transformative improvements or incremental gains. This directly informs the value proposition of the testing program.
Key Findings
- Median Effect Size: -50% — The typical experiment shows negative or zero lift, indicating most tests underperform control
- Maximum Positive Lift: +376.6% — The single winning experiment (exp-016) demonstrates exceptionally large impact, far exceeding the "large win" threshold (>10%)
- Effect Distribution: Highly skewed with extreme variance (SD=150.91%) — Results cluster at -100%, 0%, or +174-376%, showing no moderate wins
- Winner vs. Rest Gap: The winner's +376.6% far exceeds every other outcome; with zero statistically significant losers, results split between one large win and inconclusive cases rather than forming a spectrum
Interpretation
The portfolio exhibits polarized results: one transformative win offset by predominantly neutral or negative experiments. The absence of moderate wins (5-10% range) suggests either hypothesis diversity with high variance or insufficient statistical power to detect smaller effects. The extreme positive outlier (exp-016) represents a genuine breakthrough, but 75% insufficient verdicts indicate most experiments lack conclusive evidence. This distribution reflects early-stage testing where sample sizes have not yet accumulated enough to resolve moderate effects.
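The median and spread figures can be reproduced directly from the adjusted-lift column of the verdict table. A sketch (sample standard deviation assumed; using the rounded table values gives ~150.9% versus the reported 150.91%, a gap attributable to rounding of the inputs):

```python
import statistics

# adjusted_lift_pct values from the Experiment Verdicts table, in row order.
adjusted_lifts = [0, 0, 174.7, -100, -100, 376.6, 136.9, -100, 0, -100, -100, -100]

median = statistics.median(adjusted_lifts)  # midpoint of -100 and 0 -> -50.0
spread = statistics.stdev(adjusted_lifts)   # sample SD, ~150.9 with rounded inputs

print(f"median adjusted lift: {median}%")   # -50.0%
print(f"sample SD: {spread:.1f}%")
```

The -50% median is an artifact of the even-length sample straddling the -100% cluster and the zeros, which is why the Interpretation treats it as reflecting inconclusive cases rather than true negative performance.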
ROI Projection
Projected click uplift if winning experiments are promoted to production
| experiment_id | adjusted_lift_pct | monthly_impressions_estimate | projected_monthly_click_uplift |
|---|---|---|---|
| exp-016 | 376.6 | 1277 | 8.891 |
Purpose
This section quantifies the business impact of promoting winning experiments to production by projecting incremental click volume and associated revenue. It translates experimental lift percentages into actionable monthly and annual metrics, enabling stakeholders to understand the tangible value generated from the 1 winning experiment (exp-016) identified across the 3 testable experiments in this batch.
Key Findings
- Projected Monthly Click Uplift: 8.89 clicks — derived from exp-016's 376.65% adjusted lift applied to 1,277 monthly impressions
- Projected Annual Click Uplift: 106.69 clicks — annualized monthly projection showing sustained impact over 12 months
- Estimated Monthly Revenue Value: $44.45 — calculated at $5 cost-per-click, representing direct traffic value from the winning variant
- Win Rate Context: Only 1 of 3 testable experiments achieved statistical significance, limiting the overall uplift pool
Interpretation
The winning experiment demonstrates substantial lift (376.65%), but the modest absolute click gains (9 monthly) reflect the relatively small impression volume (1,277) and low baseline click rates observed across the batch. The $44 monthly value represents incremental revenue from a single high-performing variant. This projection assumes consistent traffic patterns and sustained treatment effect post-launch.
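One plausible reconstruction of the projection arithmetic, using exp-016's raw control CTR as the baseline (the baseline choice is an assumption; the report does not document it, but this reproduces the stated figures to within rounding):

```python
# Inputs from the ROI Projection table and exp-016's verdict row.
monthly_impressions = 1277
baseline_ctr = 2 / 1082      # exp-016 control: 2 clicks / 1,082 impressions
adjusted_lift = 3.766        # +376.6% expressed as a multiplier on baseline clicks
cpc_usd = 5.0

baseline_clicks = monthly_impressions * baseline_ctr
monthly_uplift = baseline_clicks * adjusted_lift   # ~8.89 clicks
annual_uplift = monthly_uplift * 12                # ~106.7 clicks
monthly_value = monthly_uplift * cpc_usd           # ~$44.45

print(f"monthly uplift: {monthly_uplift:.2f} clicks")
print(f"annual uplift: {annual_uplift:.2f} clicks")
print(f"monthly value: ${monthly_value:.2f}")
```

The small difference from the reported 106.69 annual clicks comes from rounding the lift to one decimal place here.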
Data Sufficiency
Power analysis and recommendations for experiments needing more data
| experiment_id | current_impressions | power_estimate | impressions_needed_80pct | estimated_days_to_significance | recommended_action |
|---|---|---|---|---|---|
| exp-007 | 1734 | 0.79 | 3468 | 14 | Run 14 more days |
| exp-009 | 1901 | 0.79 | 3802 | 14 | Run 14 more days |
| exp-012 | 1585 | 0.79 | 3170 | 14 | Run 14 more days |
| exp-015 | 1507 | 0.79 | 3014 | 14 | Run 14 more days |
| exp-027 | 287 | 0.79 | 574 | 14 | Run 14 more days |
| exp-028 | 2080 | 0.79 | 4160 | 14 | Run 14 more days |
| exp-029 | 3259 | 0.79 | 6518 | 14 | Run 14 more days |
| exp-030 | 1727 | 0.79 | 3454 | 14 | Run 14 more days |
| exp-031 | 2782 | 0.79 | 5564 | 14 | Run 14 more days |
Purpose
This section identifies which experiments lack sufficient statistical power to draw reliable conclusions. Nine of the twelve experiments currently operate at 79% power—below the 80% threshold needed for confident decision-making. Understanding data sufficiency is critical because premature conclusions from underpowered tests risk false negatives, while extending tests unnecessarily delays business decisions.
Key Findings
- Current Power Estimate: 0.79 across all insufficient experiments—just below the 80% target threshold, indicating borderline adequacy for statistical inference
- Impressions Needed: Targets average 3,747 total impressions for 80% power (range: 574–6,518), roughly 2x current collection—about 1,874 additional impressions per experiment on average
- Timeline to Significance: Uniform 14-day extension needed across all nine experiments, suggesting consistent traffic patterns and effect sizes
- Sample Size Variation: Current impressions range from 287 to 3,259, with smaller experiments (exp-027) requiring proportionally less additional data
Interpretation
The consistent 14-day recommendation across heterogeneous sample sizes indicates the power calculation accounts for both current impressions and expected traffic velocity. These experiments sit at the margin of statistical reliability; additional data collection will push them above the 80% power threshold, enabling defensible conclusions about treatment effects. The uniform timeline suggests traffic distribution is stable across experiment conditions.
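The uniform 14-day recommendation is consistent with a simple model in which each experiment has already run for 14 days, so doubling its impressions at a constant daily rate takes another 14 days regardless of size. This model is an assumption, not the report's documented method, but it reproduces the table exactly:

```python
import math

def days_to_target(current_impressions: int, target_impressions: int,
                   days_run_so_far: int = 14) -> int:
    """Days until the impression target, assuming traffic continues at the
    observed daily rate (current impressions / days run so far)."""
    remaining = target_impressions - current_impressions
    # remaining / (current / days) rearranged to avoid floating-point drift
    return math.ceil(remaining * days_run_so_far / current_impressions)

# exp-027: 287 impressions so far, 574 needed for 80% power
print(days_to_target(287, 574))    # 14
# exp-029: 3259 so far, 6518 needed -- same answer despite 11x the traffic
print(days_to_target(3259, 6518))  # 14
```

Because every target in the table is exactly 2x the current count, the remaining impressions always equal the collected impressions, which is why the timeline is identical across experiments of very different sizes.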
Context
Power