Portfolio Overview
Analysis overview and configuration
| Parameter | Value |
|---|---|
| batch_assignments | auto |
| content_type_regex | /articles/\|/tutorials/\|/blogs/\|/whitepapers/ |
| min_impressions | 10 |
| win_threshold | 0.0 |
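The content_type_regex parameter suggests pages are bucketed by URL path. A minimal sketch of how such classification might work (the function name and the "other" fallback label are assumptions, though "other" matches the content_type values seen later in the verdict table):

```python
import re

# Alternatives taken from the content_type_regex parameter above.
CONTENT_TYPE_PATTERN = re.compile(r"/(articles|tutorials|blogs|whitepapers)/")

def classify_content_type(url_path: str) -> str:
    """Return the matched content type, or 'other' when no pattern matches."""
    match = CONTENT_TYPE_PATTERN.search(url_path)
    return match.group(1) if match else "other"

print(classify_content_type("/articles/seo-titles"))  # articles
print(classify_content_type("/pricing"))              # other
```

Every experiment in this report falls into the "other" bucket, which is consistent with a fallback like this firing when none of the listed paths match.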
Purpose
This analysis evaluates a portfolio of 12 active experiments to identify winning hypotheses and success patterns. The objective is to determine which treatments outperform controls and what characteristics—impression volume, ad position, content type—correlate with experiment success. Understanding these patterns enables more efficient hypothesis testing and resource allocation.
Key Findings
- Win Rate: 33.33% (1 winning experiment out of 3 testable) with 75% of experiments marked insufficient for statistical conclusion
- Winning Experiment (exp-016): 376.65% adjusted lift with projected 8.89 monthly click uplift from ~1,277 impressions
- Position Pattern: Ads in positions 11-20 show 16.67% win rate vs. 0% for positions 4-10, suggesting placement matters
- Impression Bucket Pattern: Mid-volume experiments (500-2,000 impressions) achieve 12.5% win rate; low and high volumes show 0%
- Data Maturity Issue: 9 of 12 experiments need ~14 additional days to reach 80% statistical power
Interpretation
The portfolio shows early-stage results with limited statistical confidence. Only one experiment demonstrates clear success, while most remain underpowered. The position-based pattern (positions 11-20 outperforming positions 4-10) is directional rather than conclusive and should be revisited once the underpowered experiments mature.
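The headline win rate counts only testable verdicts (winning, losing, or neutral) in the denominator, while the insufficiency share is taken over all 12 experiments. A sketch of that arithmetic, using the verdicts from the table later in this report:

```python
# Verdicts per experiment, as reported in the Experiment Verdicts table.
verdicts = {
    "exp-007": "insufficient", "exp-009": "insufficient", "exp-010": "neutral",
    "exp-012": "insufficient", "exp-015": "insufficient", "exp-016": "winning",
    "exp-017": "neutral", "exp-027": "insufficient", "exp-028": "insufficient",
    "exp-029": "insufficient", "exp-030": "insufficient", "exp-031": "insufficient",
}

# Win rate over testable experiments only; insufficiency over all experiments.
testable = [v for v in verdicts.values() if v != "insufficient"]
win_rate = 100 * testable.count("winning") / len(testable)
insufficient_share = 100 * list(verdicts.values()).count("insufficient") / len(verdicts)

print(f"win rate: {win_rate:.2f}% of {len(testable)} testable")  # 33.33% of 3
print(f"insufficient: {insufficient_share:.0f}%")                # 75%
```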
Data preprocessing and column mapping
Purpose
This section documents the data preprocessing pipeline for the experiment analysis, showing that all 19 observations (12 experiment verdicts, 1 batch summary, 6 pattern analyses, and supporting datasets) were retained without removal. Perfect retention indicates either minimal data quality issues or that no aggressive filtering was applied, which is critical for maintaining statistical validity in A/B testing analysis where sample size directly impacts power and significance detection.
Key Findings
- Retention Rate: 100% (19/19 rows) - No observations were excluded during preprocessing, preserving the full experimental dataset
- Rows Removed: 0 - No filtering, deduplication, or outlier removal occurred
- Train/Test Split: Not applicable - This is descriptive analysis of completed experiments, not predictive modeling
- Data Completeness: All 12 experiments retained despite 75% missing p-values, suggesting missing values were not treated as grounds for exclusion
Interpretation
The 100% retention rate reflects a conservative preprocessing approach appropriate for experiment analysis, where each trial represents a distinct business decision point. However, the absence of any data cleaning raises questions about how missing p-values (75% of cases marked "insufficient") and extreme values (raw lift ranging -100% to +732%) were handled. The lack of train/test splitting is expected since this is retrospective analysis of completed experiments rather than predictive modeling.
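A retention audit like the one described reduces to a pair of invariant checks. A sketch using the counts reported in this section (the variable names are illustrative):

```python
# Counts reported in this section.
rows_in = 19
rows_removed = 0
rows_out = rows_in - rows_removed

retention_rate = rows_out / rows_in
assert retention_rate == 1.0, "preprocessing dropped rows unexpectedly"

# The component counts reported above must add up to the retained total.
components = {"experiment_verdicts": 12, "batch_summary": 1, "pattern_analyses": 6}
assert sum(components.values()) == rows_out

print(f"retained {rows_out}/{rows_in} rows ({retention_rate:.0%})")  # 19/19 (100%)
```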
Executive Summary
Executive summary with key findings and recommendations
| Finding | Value |
|---|---|
| Total Experiments | 12 (3 testable) |
| Program Win Rate | 33.3% |
| Industry Benchmark | 15% (SearchPilot) |
| Winners (Promote) | 1 experiment |
| Projected Monthly Click Uplift | 9 clicks |
| Estimated Monthly Value | $44 (at $5 CPC) |
Program Health:
• Analyzed 12 experiments (7 with treatment data, 5 control-only)
• Overall win rate: 33.3% — Above 15% industry benchmark ✓
• 1 winning, 0 losing, 2 neutral, 9 insufficient data
Effect Sizes:
• Winners average +376.6% position-adjusted CTR lift
• Median effect size: -50.0%
• Largest positive lift: +376.6%
ROI Projection:
• 9 additional monthly clicks if all winners promoted
• Estimated value: $44/month at $5 CPC
Recommendations:
• Promote: 1 winning experiment immediately
• Monitor: 2 neutral experiments (may need more time)
• End Early: 0 losing experiments
• Data Collection: 9 experiments need 14 more days on average
Purpose
This analysis evaluates a portfolio of 12 SEO experiments to determine whether the testing program is delivering measurable business value. The assessment synthesizes win rates, effect sizes, and ROI projections to inform deployment decisions and resource allocation for the broader optimization initiative.
Key Findings
- Overall Win Rate: 33.3% — Exceeds the 15% industry benchmark, though with only 3 testable experiments the comparison carries wide uncertainty
- Winning Effect Size: +376.6% adjusted CTR lift in the single confirmed winner, demonstrating substantial impact when treatments succeed
- Statistical Maturity: 75% of experiments (9 of 12) lack sufficient data; only 3 experiments have testable verdicts, limiting confidence in portfolio-level conclusions
- Projected Monthly Value: 8.9 additional clicks from promotion of winning experiments; modest absolute impact but positive directional signal
- Data Collection Timeline: Insufficient experiments require approximately 14 additional days to reach 80% statistical power
Interpretation
The program demonstrates promise with a win rate well above industry norms and one experiment showing exceptional lift. However, the portfolio remains largely underpowered—75% of experiments cannot yet support confident decisions. The median adjusted lift of -50% reflects the high proportion of inconclusive cases rather than true negative performance. This suggests the testing infrastructure is sound but needs more data before portfolio-level conclusions can be drawn with confidence.
Experiment Verdicts
Per-experiment win/loss/neutral status with CTR lift and significance
| experiment_id | control_impressions | control_clicks | control_ctr | control_adj_ctr | treatment_impressions | treatment_clicks | treatment_ctr | treatment_adj_ctr | raw_lift_pct | adjusted_lift_pct | verdict | content_type | impression_bucket | position_bucket | batch_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exp-007 | 1693 | 0 | 0 | 0 | 41 | 1 | 0.0244 | 0.7644 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-009 | 1639 | 0 | 0 | 0 | 262 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-010 | 1567 | 1 | 0.0006 | 0.0459 | 422 | 1 | 0.0024 | 0.1261 | 271.3 | 174.7 | neutral | other | 500-2000 | 4-10 | all_experiments |
| exp-012 | 1303 | 2 | 0.0015 | 0.3483 | 282 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-015 | 1331 | 3 | 0.0023 | 0.3679 | 176 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 11-20 | all_experiments |
| exp-016 | 1082 | 2 | 0.0018 | 0.4034 | 195 | 3 | 0.0154 | 1.923 | 732.3 | 376.6 | winning | other | 500-2000 | 11-20 | all_experiments |
| exp-017 | 919 | 1 | 0.0011 | 0.1589 | 291 | 1 | 0.0034 | 0.3765 | 215.8 | 136.9 | neutral | other | 500-2000 | 11-20 | all_experiments |
| exp-027 | 287 | 43 | 0.1498 | 6.055 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 100-500 | 4-10 | all_experiments |
| exp-028 | 2080 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-029 | 3259 | 3 | 0.0009 | 0.3009 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 11-20 | all_experiments |
| exp-030 | 1727 | 2 | 0.0012 | 0.0874 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 500-2000 | 4-10 | all_experiments |
| exp-031 | 2782 | 5 | 0.0018 | 0.107 | 0 | 0 | 0 | 0 | -100 | -100 | insufficient | other | 2000+ | 4-10 | all_experiments |
Purpose
This section identifies which experiments demonstrate statistically significant improvements (winning), declines (losing), or inconclusive results (neutral/insufficient). Of 12 experiments, only 3 have adequate statistical power to draw reliable conclusions, making this a critical filter for distinguishing real effects from noise in the testing portfolio.
Key Findings
- Win Rate: 1 of 3 testable experiments (33.3%) shows positive, significant lift—indicating modest success in the overall testing program
- Adjusted Lift %: The winning experiment (exp-016) demonstrates 376.65% position-adjusted CTR improvement, isolating the title treatment effect from ranking confounds
- Insufficient Data: 75% of experiments lack statistical power, with a median p-value of 0.9, reflecting low click volumes relative to variance
- No Losses: Zero experiments show statistically significant negative effects, suggesting treatments are not harmful
Interpretation
The data reveals a portfolio heavily constrained by sample size rather than treatment quality. The single winning experiment shows substantial effect magnitude, but the high proportion of insufficient verdicts (9/12) indicates most experiments cannot yet distinguish signal from noise. This pattern suggests the testing infrastructure may be underpowered for the baseline click rates observed (mean control CTR = 0.01), requiring either longer run times or higher-traffic segments to achieve reliable conclusions.
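The raw and adjusted lift columns are consistent with the standard relative-lift formula applied to the raw and position-adjusted CTRs respectively. A sketch reproducing exp-016's row (inputs taken from the table; the tiny gap between 376.7% and the reported 376.6% is rounding of the tabulated adjusted CTRs):

```python
def relative_lift_pct(control_ctr: float, treatment_ctr: float) -> float:
    """Percent change of treatment over control; undefined when control CTR is 0."""
    if control_ctr == 0:
        raise ValueError("control CTR is zero; lift is undefined")
    return 100 * (treatment_ctr - control_ctr) / control_ctr

# exp-016: control 2 clicks / 1,082 impressions, treatment 3 clicks / 195 impressions
raw = relative_lift_pct(2 / 1082, 3 / 195)
# Position-adjusted CTRs as reported in the table for exp-016.
adjusted = relative_lift_pct(0.4034, 1.923)

print(f"raw lift: {raw:.1f}%")            # 732.3%
print(f"adjusted lift: {adjusted:.1f}%")  # 376.7%
```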
Batch Comparison
Win rate and average effect size comparison across hypothesis batches
Purpose
This section evaluates hypothesis batch performance by comparing win rates and average effect sizes across experiment groups. It identifies which hypothesis types (e.g., title framing, intent matching) are generating the strongest positive results, enabling prioritization of high-performing hypotheses for scaling and resource allocation.
Key Findings
- Win Rate: 33.3% — Exceeds the 15% industry benchmark, suggesting above-average hypothesis quality, though the testable sample is small
- Average Adjusted Lift: 376.65% — The mean CTR improvement across winning experiments (here, a single winner), demonstrating substantial effect sizes when experiments succeed
- Testable Experiments: 3 of 12 — Only 25% of experiments produced conclusive verdicts; 75% remain insufficient, limiting batch-level conclusions
- Zero Losses: No experiments showed statistically significant negative lift, reducing downside risk
Interpretation
The single batch ("all_experiments") demonstrates strong performance relative to industry standards, with one clear winner and two neutral results among testable experiments. However, the high proportion of insufficient experiments (9 of 12) suggests most hypotheses lack adequate sample size or effect magnitude for confident conclusions. The exceptional 376.65% average lift reflects the winning experiment's substantial impact, though this represents only one successful case within a larger portfolio of underpowered tests.
Context
This analysis treats all 12 experiments as a single batch ("all_experiments"), so no cross-batch comparison is possible until experiments are assigned to distinct hypothesis batches.
Success Patterns
Win rate segmented by content type, impression level, and position bucket
Purpose
This section identifies which page characteristics—traffic volume, ranking position, and content type—correlate with successful title experiments. By segmenting the 12 experiments across impression buckets and position ranges, the analysis reveals whether certain page types respond more favorably to title changes, enabling future experiments to be targeted at high-opportunity segments.
Key Findings
- Position Bucket 11-20 Win Rate: 16.67% (1 win from 6 experiments) — the highest-performing segment, suggesting lower-ranked pages may be more responsive to title optimization
- Impression Bucket 500-2000 Win Rate: 12.5% (1 win from 8 experiments) — moderate traffic pages show modest success, representing the largest tested segment
- Position Bucket 4-10 Win Rate: 0% (0 wins from 6 experiments) — higher-ranked pages show no successful outcomes despite equal sample size
- Overall Pattern: Win rates remain below 20% across all segments, indicating limited predictive power at current sample sizes
Interpretation
The data suggests a weak but directional pattern: pages ranking in positions 11-20 achieved the only position-based win, while mid-traffic pages (500-2000 impressions) showed marginal success. Conversely, higher-ranking pages (4-10) and very low- or high-traffic pages produced no wins, though the small sample size in every segment limits confidence in these patterns.
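The segment win rates above follow from grouping the verdict table by bucket, with all 12 experiments (including insufficient ones) in each segment's denominator. A sketch for the position buckets (rows abbreviated from the verdict table):

```python
from collections import defaultdict

# (experiment_id, position_bucket, verdict) from the Experiment Verdicts table.
rows = [
    ("exp-007", "4-10", "insufficient"),  ("exp-009", "4-10", "insufficient"),
    ("exp-010", "4-10", "neutral"),       ("exp-012", "11-20", "insufficient"),
    ("exp-015", "11-20", "insufficient"), ("exp-016", "11-20", "winning"),
    ("exp-017", "11-20", "neutral"),      ("exp-027", "4-10", "insufficient"),
    ("exp-028", "11-20", "insufficient"), ("exp-029", "11-20", "insufficient"),
    ("exp-030", "4-10", "insufficient"),  ("exp-031", "4-10", "insufficient"),
]

wins, totals = defaultdict(int), defaultdict(int)
for _, bucket, verdict in rows:
    totals[bucket] += 1
    wins[bucket] += verdict == "winning"

for bucket in sorted(totals):
    rate = 100 * wins[bucket] / totals[bucket]
    print(f"positions {bucket}: {rate:.2f}% ({wins[bucket]}/{totals[bucket]})")
```

Note this segment-level denominator differs from the portfolio win rate, which divides by testable experiments only; hence 16.67% (1/6) here versus 33.3% (1/3) at portfolio level.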
Effect Size Distribution
Distribution of position-adjusted CTR lifts across all experiments
Purpose
This section evaluates whether successful experiments deliver substantial impact or marginal gains. Understanding effect size distribution reveals the magnitude of wins relative to losses, helping assess whether the experimental portfolio is generating transformative improvements or incremental gains. This directly informs the value proposition of the testing program.
Key Findings
- Median Effect Size: -50% — The typical experiment shows negative or zero lift, indicating most tests underperform control
- Maximum Positive Lift: +376.6% — The single winning experiment (exp-016) demonstrates exceptionally large impact, far exceeding the "large win" threshold (>10%)
- Effect Distribution: Highly skewed with extreme variance (SD=150.91%) — Results cluster at -100%, 0%, or +174-376%, showing no moderate wins
- Winner vs. Rest Gap: The winner's +376.6% far exceeds every other outcome; with zero statistically significant losers, results split between one large win and inconclusive cases rather than forming a spectrum
Interpretation
The portfolio exhibits polarized results: one transformative win offset by predominantly neutral or negative experiments. The absence of moderate wins (5-10% range) suggests either hypothesis diversity with high variance or insufficient statistical power to detect smaller effects. The extreme positive outlier (exp-016) represents a genuine breakthrough, but 75% insufficient verdicts indicate most experiments lack conclusive evidence. This distribution reflects early-stage testing where sample sizes have not yet accumulated enough to resolve moderate effects.
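The median and spread figures can be reproduced directly from the adjusted-lift column of the verdict table. A sketch (sample standard deviation assumed; using the rounded table values gives ~150.9% versus the reported 150.91%, a gap attributable to rounding of the inputs):

```python
import statistics

# adjusted_lift_pct values from the Experiment Verdicts table, in row order.
adjusted_lifts = [0, 0, 174.7, -100, -100, 376.6, 136.9, -100, 0, -100, -100, -100]

median = statistics.median(adjusted_lifts)  # midpoint of -100 and 0 -> -50.0
spread = statistics.stdev(adjusted_lifts)   # sample SD, ~150.9 with rounded inputs

print(f"median adjusted lift: {median}%")   # -50.0%
print(f"sample SD: {spread:.1f}%")
```

The -50% median is an artifact of the even-length sample straddling the -100% cluster and the zeros, which is why the Interpretation treats it as reflecting inconclusive cases rather than true negative performance.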
ROI Projection
Projected click uplift if winning experiments are promoted to production
| experiment_id | adjusted_lift_pct | monthly_impressions_estimate | projected_monthly_click_uplift |
|---|---|---|---|
| exp-016 | 376.6 | 1277 | 8.891 |
Purpose
This section quantifies the business impact of promoting winning experiments to production by projecting incremental click volume and associated revenue. It translates experimental lift percentages into actionable monthly and annual metrics, enabling stakeholders to understand the tangible value generated from the 1 winning experiment (exp-016) identified across the 3 testable experiments in this batch.
Key Findings
- Projected Monthly Click Uplift: 8.89 clicks — derived from exp-016's 376.65% adjusted lift applied to 1,277 monthly impressions
- Projected Annual Click Uplift: 106.69 clicks — annualized monthly projection showing sustained impact over 12 months
- Estimated Monthly Revenue Value: $44.45 — calculated at $5 cost-per-click, representing direct traffic value from the winning variant
- Win Rate Context: Only 1 of 3 testable experiments achieved statistical significance, limiting the overall uplift pool
Interpretation
The winning experiment demonstrates substantial lift (376.65%), but the modest absolute click gains (9 monthly) reflect the relatively small impression volume (1,277) and low baseline click rates observed across the batch. The $44 monthly value represents incremental revenue from a single high-performing variant. This projection assumes consistent traffic patterns and sustained treatment effect post-launch.
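One plausible reconstruction of the projection arithmetic, using exp-016's raw control CTR as the baseline (the baseline choice is an assumption; the report does not document it, but this reproduces the stated figures to within rounding):

```python
# Inputs from the ROI Projection table and exp-016's verdict row.
monthly_impressions = 1277
baseline_ctr = 2 / 1082      # exp-016 control: 2 clicks / 1,082 impressions
adjusted_lift = 3.766        # +376.6% expressed as a multiplier on baseline clicks
cpc_usd = 5.0

baseline_clicks = monthly_impressions * baseline_ctr
monthly_uplift = baseline_clicks * adjusted_lift   # ~8.89 clicks
annual_uplift = monthly_uplift * 12                # ~106.7 clicks
monthly_value = monthly_uplift * cpc_usd           # ~$44.45

print(f"monthly uplift: {monthly_uplift:.2f} clicks")
print(f"annual uplift: {annual_uplift:.2f} clicks")
print(f"monthly value: ${monthly_value:.2f}")
```

The small difference from the reported 106.69 annual clicks comes from rounding the lift to one decimal place here.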
Data Sufficiency
Power analysis and recommendations for experiments needing more data
| experiment_id | current_impressions | power_estimate | impressions_needed_80pct | estimated_days_to_significance | recommended_action |
|---|---|---|---|---|---|
| exp-007 | 1734 | 0.79 | 3468 | 14 | Run 14 more days |
| exp-009 | 1901 | 0.79 | 3802 | 14 | Run 14 more days |
| exp-012 | 1585 | 0.79 | 3170 | 14 | Run 14 more days |
| exp-015 | 1507 | 0.79 | 3014 | 14 | Run 14 more days |
| exp-027 | 287 | 0.79 | 574 | 14 | Run 14 more days |
| exp-028 | 2080 | 0.79 | 4160 | 14 | Run 14 more days |
| exp-029 | 3259 | 0.79 | 6518 | 14 | Run 14 more days |
| exp-030 | 1727 | 0.79 | 3454 | 14 | Run 14 more days |
| exp-031 | 2782 | 0.79 | 5564 | 14 | Run 14 more days |
Purpose
This section identifies which experiments lack sufficient statistical power to draw reliable conclusions. Nine of the twelve experiments currently operate at 79% power—below the 80% threshold needed for confident decision-making. Understanding data sufficiency is critical because premature conclusions from underpowered tests risk false negatives, while extending tests unnecessarily delays business decisions.
Key Findings
- Current Power Estimate: 0.79 across all insufficient experiments—just below the 80% target threshold, indicating borderline adequacy for statistical inference
- Impressions Needed: Targets average 3,747 total impressions for 80% power (range: 574–6,518), roughly 2x current collection—about 1,874 additional impressions per experiment on average
- Timeline to Significance: Uniform 14-day extension needed across all nine experiments, suggesting consistent traffic patterns and effect sizes
- Sample Size Variation: Current impressions range from 287 to 3,259, with smaller experiments (exp-027) requiring proportionally less additional data
Interpretation
The consistent 14-day recommendation across heterogeneous sample sizes indicates the power calculation accounts for both current impressions and expected traffic velocity. These experiments sit at the margin of statistical reliability; additional data collection will push them above the 80% power threshold, enabling defensible conclusions about treatment effects. The uniform timeline suggests traffic distribution is stable across experiment conditions.
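The uniform 14-day recommendation is consistent with a simple model in which each experiment has already run for 14 days, so doubling its impressions at a constant daily rate takes another 14 days regardless of size. This model is an assumption, not the report's documented method, but it reproduces the table exactly:

```python
import math

def days_to_target(current_impressions: int, target_impressions: int,
                   days_run_so_far: int = 14) -> int:
    """Days until the impression target, assuming traffic continues at the
    observed daily rate (current impressions / days run so far)."""
    remaining = target_impressions - current_impressions
    # remaining / (current / days) rearranged to avoid floating-point drift
    return math.ceil(remaining * days_run_so_far / current_impressions)

# exp-027: 287 impressions so far, 574 needed for 80% power
print(days_to_target(287, 574))    # 14
# exp-029: 3259 so far, 6518 needed -- same answer despite 11x the traffic
print(days_to_target(3259, 6518))  # 14
```

Because every target in the table is exactly 2x the current count, the remaining impressions always equal the collected impressions, which is why the timeline is identical across experiments of very different sizes.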
Context
Power