Analysis overview and configuration
| Parameter | Value |
|---|---|
| confidence_level | 0.95 |
| include_interaction_terms | FALSE |
| model_selection_method | forward |
| diagnostic_plots | TRUE |
| vif_threshold | 10 |
| cv_folds | 5 |
| cv_seed | 42 |
| cooks_d_threshold | 0.5 |
| include_prediction_intervals | TRUE |
| include_standardized_coefs | TRUE |
| heteroscedasticity_test | breusch_pagan |
| alpha | 0.05 |
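For readers who want to reproduce the run, the parameters above could be carried in a small typed container. This is only a sketch: the class name `AnalysisConfig` is hypothetical, and the consistency check between `alpha` and `confidence_level` is an assumption about how the two settings relate, not something the report states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisConfig:
    """Hypothetical container mirroring the parameter table in this report."""

    confidence_level: float = 0.95
    include_interaction_terms: bool = False
    model_selection_method: str = "forward"
    diagnostic_plots: bool = True
    vif_threshold: float = 10.0
    cv_folds: int = 5
    cv_seed: int = 42
    cooks_d_threshold: float = 0.5
    include_prediction_intervals: bool = True
    include_standardized_coefs: bool = True
    heteroscedasticity_test: str = "breusch_pagan"
    alpha: float = 0.05

    def __post_init__(self):
        # Assumption: alpha and confidence_level describe the same cutoff.
        if abs((1.0 - self.confidence_level) - self.alpha) > 1e-9:
            raise ValueError("alpha should equal 1 - confidence_level")


config = AnalysisConfig()
print(config.cv_folds, config.vif_threshold)
```

Freezing the dataclass keeps the settings immutable once an analysis starts, which makes it easier to log exactly what was run.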
This analysis builds a predictive model for sales revenue based on advertising spend across three digital channels (TikTok, Facebook, Google Ads). The objective is to quantify each channel's contribution to revenue and enable data-driven budget allocation decisions for the marketing agency.
The model reveals a nuanced channel hierarchy: while Google
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 0 |
| Final Rows | 0 |
| Rows Removed | 0 |
| Retention Rate | 100% |
This section documents the data preprocessing pipeline for the marketing ROI analysis. The table records zero rows processed, which conflicts with the main analysis, where the regression was fit to 200 observations. The discrepancy suggests that the preprocessing metadata was not captured or that the pipeline documentation is incomplete.
The regression output reports initial_rows = 200 and rows_removed = 0, so either the preprocessing step failed to log the actual data pipeline or the data was loaded directly without a formal preprocessing stage. The 100% retention rate is uninformative: with zero rows recorded the ratio is 0/0, so it shows only that no removals were logged, not that preprocessing was thorough.
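A minimal sketch of how a retention figure could be computed so that an empty preprocessing log is reported as undefined rather than as 100%. The function name `retention_rate` is hypothetical and is not taken from the pipeline that produced this report.

```python
from typing import Optional


def retention_rate(initial_rows: int, final_rows: int) -> Optional[float]:
    """Return final/initial as a fraction, or None when no rows were recorded."""
    if initial_rows < 0 or final_rows < 0 or final_rows > initial_rows:
        raise ValueError("row counts must satisfy 0 <= final_rows <= initial_rows")
    if initial_rows == 0:
        # 0/0 is undefined, so do not report it as full retention.
        return None
    return final_rows / initial_rows


print(retention_rate(0, 0))      # counts from the preprocessing table
print(retention_rate(200, 200))  # counts from the regression output
```

Returning `None` for the zero-row case makes the metadata gap visible in the report instead of hiding it behind a reassuring percentage.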
This metadata gap creates uncertainty about data quality decisions, missing value handling, and feature engineering applied before modeling. Given the main analysis shows heteroscedasticity issues and non-normal residuals, understanding preprocessing choices would be critical
Executive summary
| Metric | Value |
|---|---|
| Model Performance | R² = 78.2% (explains 78.2% of sales variance) |
| Best Channel | Google with ROI = $1.22 per ad dollar |
| Significant Channels | 3 of 3 channels statistically significant |
| Model Quality | Strong |
This executive summary synthesizes the advertising channel analysis to assess whether the regression model reliably explains sales performance and supports budget allocation decisions. The 78.2% variance explained indicates strong predictive power, but diagnostic concerns require careful interpretation before deployment.
The model demonstrates strong predictive capability with all channels showing reliable, significant effects. Google's superior ROI suggests budget reallocation potential. However, violated diagnostic assumptions—
How well does the model predict sales? Actual vs predicted values with model performance metrics
This section evaluates how accurately the regression model predicts sales outcomes by comparing actual values against model predictions. Understanding model fit is essential for assessing whether the three marketing channels (TikTok, Facebook, Google Ads) reliably explain sales variation and whether predictions are trustworthy for business decisions.
The model demonstrates solid predictive capability, with actual sales clustering reasonably close to predicted values. The near-identical R² and adjusted R² values indicate that model complexity is appropriate: no unnecessary predictors inflate performance artificially. Residuals average zero, with a median of -$244. Assuming residuals are computed as actual minus predicted, a negative median alongside a zero mean points to a right-skewed error distribution, in which the model slightly overpredicts for most observations and underpredicts by larger amounts for a few. Overall bias is minimal.
This fit assessment assumes linear relationships between marketing spend and sales.
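The fit metrics discussed in this section can be sketched as follows. The helper names are hypothetical, and the RMSE here divides by n rather than by the residual degrees of freedom, which may differ from the convention used to produce the report.

```python
import math
import statistics


def adjusted_r2(r2, n_obs, n_predictors):
    """Penalize R² for the number of predictors in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def fit_metrics(actual, predicted, n_predictors):
    """Return R², adjusted R², RMSE, and residual summaries."""
    n_obs = len(actual)
    if n_obs != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    residuals = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
    mean_actual = sum(actual) / n_obs
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "r2": r2,
        "adj_r2": adjusted_r2(r2, n_obs, n_predictors),
        "rmse": math.sqrt(ss_res / n_obs),
        "mean_residual": sum(residuals) / n_obs,
        "median_residual": statistics.median(residuals),
    }


# With the report's R² of 0.782, 200 observations, and 3 predictors,
# the adjustment is small (roughly 0.779).
print(round(adjusted_r2(0.782, 200, 3), 3))
```

With 200 observations and only three predictors, the penalty term (n - 1) / (n - p - 1) is close to 1, which is why R² and adjusted R² are nearly identical.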
Which advertising channels drive the most sales per dollar spent? Coefficient estimates with confidence intervals
This section quantifies the sales impact of each advertising channel by estimating marginal ROI—the dollars of sales generated per dollar spent. All three channels show statistically significant effects, meaning their contributions to sales are reliable and not due to chance. These coefficients directly address the core business question of which channels deliver the strongest financial returns.
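As a worked illustration of how marginal ROI feeds a budget decision, the sketch below uses the coefficients reported elsewhere in this report (TikTok 0.36, Facebook 0.49, Google Ads 1.22). It assumes that returns stay constant at the margin, as the linear model implies, and that channels do not interact. The function names are hypothetical.

```python
# Marginal ROI per ad dollar, as reported in this analysis.
ROI = {"TikTok": 0.36, "Facebook": 0.49, "Google Ads": 1.22}


def expected_sales_change(shift_dollars, from_channel, to_channel, roi=ROI):
    """Predicted change in sales when moving budget between two channels."""
    return shift_dollars * (roi[to_channel] - roi[from_channel])


def roi_ratio(channel, baseline, roi=ROI):
    """How many times larger one channel's marginal ROI is than another's."""
    return roi[channel] / roi[baseline]


# Moving $1,000 from TikTok to Google Ads under the linear model.
print(round(expected_sales_change(1000, "TikTok", "Google Ads"), 2))
print(round(roi_ratio("Google Ads", "TikTok"), 2))
print(round(roi_ratio("Google Ads", "Facebook"), 2))
```

The estimate is only as good as the linearity assumption: if a channel saturates at higher spend, the realized gain from a large reallocation would be smaller than this arithmetic suggests.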
Google Ads demonstrates substantially higher efficiency than both social channels, delivering roughly 3.4× the return of TikTok (1.22 vs. 0.36) and about 2.5× that of Facebook (1.22 vs. 0.49). The confidence intervals (none crossing zero) confirm that each effect is distinguishable from zero; how stable the ranking itself is depends on how much the intervals overlap. This reflects the model's R² of 0.782, meaning these three channels explain 78% of
Are model assumptions satisfied? Residual plots check for homoscedasticity and linearity
This section evaluates whether the linear regression model satisfies two critical assumptions: homoscedasticity (constant variance across fitted values) and linearity (random scatter around zero). Violations of these assumptions undermine model reliability and suggest the relationship between predictors and outcomes may not be adequately captured by the linear specification.
The residual plot reveals violated homoscedasticity assumptions. The positive skew and median offset from zero indicate the model systematically underpredicts at certain fitted value ranges. The standardized residual exceeding ±3σ represents a notable outlier. These violations, confirmed by the failed diagnostic tests, suggest the linear model may misspecify the relationship between marketing
Are residuals normally distributed? QQ plot validates normality assumption required for inference
The QQ plot assesses whether residuals follow a normal distribution—a critical assumption for valid p-values and confidence intervals in regression. Deviations from the 45° reference line indicate non-normality, which can undermine the reliability of statistical inference for the marketing channel ROI model.
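The points behind a QQ plot can be sketched with the standard library alone. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as are the function names and the toy residuals, which are invented for illustration.

```python
from statistics import NormalDist, mean, pstdev


def qq_points(residuals):
    """Pair each standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    std_normal = NormalDist()
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (r - mu) / sigma))
    return points


def skewness(values):
    """Population skewness; positive values indicate a longer upper tail."""
    n = len(values)
    mu, sigma = mean(values), pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / n


# Invented residuals with one large positive error.
toy_residuals = [-3, -2, -2, -1, -1, -1, 0, 0, 1, 9]
for theoretical, observed in qq_points(toy_residuals):
    print(f"{theoretical:6.2f} {observed:6.2f}")
print("skewness > 0:", skewness(toy_residuals) > 0)
```

Points that rise above the reference line at the right-hand end, as the last toy point does, are the signature of the upper-tail departure described in this section.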
The residuals are not normally distributed, particularly in the upper tail, which violates the normality assumption that exact OLS inference relies on. This means the reported p-values (all shown as 0.0000, i.e. below 0.0001) and 95% confidence intervals for the TikTok, Facebook, and Google Ads coefficients may be unreliable. The positive skew combined with heteroscedast
Are predictors highly correlated? VIF (Variance Inflation Factor) detects multicollinearity that inflates coefficient uncertainty
| test | statistic | p_value | result |
|---|---|---|---|
| Normality (Shapiro-Wilk) | 0.9630 | < 0.0001 | Fail |
| Homoscedasticity (Breusch-Pagan) | 9.1101 | 0.0025 | Fail |
| Autocorrelation (Durbin-Watson) | 1.2206 | N/A | Fail |
This section evaluates whether predictors (TikTok, Facebook, Google Ads) are highly correlated with each other—a condition called multicollinearity that inflates coefficient uncertainty and reduces model reliability. VIF quantifies this relationship, with values above 10 indicating problematic correlation that compromises statistical inference.
The very low VIF values (all ≤1.02) show that spend on the three advertising channels is nearly uncorrelated in this dataset. This strengthens confidence in the coefficient estimates: each channel's ROI (TikTok: 0.36, Facebook: 0.49, Google Ads: 1.22) is estimated with almost no variance shared with the other channels, so the model can separate individual channel contributions cleanly.
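One way to compute VIF without a statistics library relies on the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below takes that route. The function names are hypothetical, and this is not necessarily how the values in this report were produced.

```python
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([v - m for v in col])
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    k = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Invented, mutually uncorrelated spend patterns give VIFs of 1.
print(vif([[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]))
```

A VIF of 1 means a predictor shares no variance with the others, so values of 1.02 or below sit very close to the ideal case.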
While multicollinearity is not a concern, the diagnostic tests reveal violations in normality (Shap
Which observations have outsized influence on the model? Cook's Distance and leverage identify problematic data points
This section identifies observations that disproportionately affect model coefficients and predictions. By detecting influential points and high-leverage cases, we can assess whether the model's estimates are robust or driven by a small number of unusual data points. This is critical for validating the reliability of the marketing channel ROI estimates.
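Cook's distance combines each observation's residual with its leverage. The sketch below assumes residuals and leverages (the diagonal of the hat matrix) are already available. The 0.5 cutoff matches `cooks_d_threshold` in the configuration, while the 2p/n leverage cutoff is a common rule of thumb and an assumption here. The toy numbers are invented.

```python
def cooks_distance(residuals, leverages, n_params):
    """Cook's D per observation; n_params counts the intercept."""
    n = len(residuals)
    if n != len(leverages) or n <= n_params:
        raise ValueError("need matching inputs and more rows than parameters")
    mse = sum(e * e for e in residuals) / (n - n_params)
    return [
        (e * e / (n_params * mse)) * (h / (1.0 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]


def flag_points(residuals, leverages, n_params, cooks_threshold=0.5):
    """Return indices of influential and of high-leverage observations."""
    n = len(residuals)
    leverage_cutoff = 2.0 * n_params / n  # rule of thumb, assumed
    distances = cooks_distance(residuals, leverages, n_params)
    influential = [i for i, d in enumerate(distances) if d > cooks_threshold]
    high_leverage = [i for i, h in enumerate(leverages) if h > leverage_cutoff]
    return influential, high_leverage


# Invented example: observation 4 has high leverage but a small residual.
toy_residuals = [1.0, -1.0, 0.5, -0.5, 0.1, -0.1]
toy_leverages = [0.2, 0.2, 0.3, 0.3, 0.7, 0.3]
print(flag_points(toy_residuals, toy_leverages, n_params=2))
```

The toy case shows that a point can sit far from the bulk of the predictor values and still have little effect on the fit, provided its residual is small.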
The model demonstrates strong stability: zero influential points means the TikTok, Facebook, and Google Ads coefficients are not driven by outliers. The 8 high-leverage observations represent unusual combinations of predictor values but do not distort estimates because their residuals remain moderate. This validates that the ROI estimates (TikTok: 0.36, Facebook: 0.49,
How well does the model generalize to new data? Cross-validation assesses out-of-sample performance
This section evaluates whether the marketing ROI model generalizes reliably to new, unseen data. Five-fold cross-validation partitions the dataset into five folds, trains on four, and tests on the remaining one, rotating until each fold has served as the test set exactly once. This reveals whether the model's training performance (R² = 0.782) holds up on data it has not seen, which is critical for real-world deployment.
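The fold rotation can be sketched as follows, using the configured `cv_folds` = 5 and `cv_seed` = 42 and the 200 observations in the main analysis. The function name is hypothetical, and the actual fold assignment used for this report is not known, so the shuffling scheme is an assumption.

```python
import random


def k_fold_indices(n_rows, n_folds=5, seed=42):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        yield train_idx, test_idx


splits = list(k_fold_indices(200))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each iteration would fit the regression on the 160 training rows and score it on the 40 held-out rows. Averaging the five scores gives the cross-validated metric that is compared with training performance.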
The model demonstrates excellent generalization. The negligible gap between training and cross-validation metrics (0.5% difference in RMSE) suggests the three marketing channels (TikTok
What is the uncertainty around individual predictions? 95% prediction intervals quantify forecast precision
Prediction intervals quantify uncertainty around individual forecasts by establishing lower and upper bounds where actual values are expected to fall. This section evaluates whether the model's uncertainty estimates are well-calibrated—critical for risk assessment and decision-making when deploying the marketing ROI model in production environments.
The 97% coverage rate demonstrates the model produces trustworthy uncertainty bounds. Predictions are neither overconfident (which would yield <95% coverage) nor overly conservative (which would exceed 98%). The narrow standard deviation of interval widths ($23.69) indicates uncertainty is uniformly estimated, not concentrated in specific regions. This calibration validates the model's suitability for business decisions requiring probabilistic forecasts of marketing channel ROI.
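Coverage and interval-width spread can be checked with a few lines. This is a sketch with hypothetical function names and invented toy values. It uses the population standard deviation for widths, which may differ from the convention behind the $23.69 figure.

```python
from statistics import pstdev


def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside their prediction interval."""
    if not (len(actual) == len(lower) == len(upper)):
        raise ValueError("inputs must have the same length")
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return hits / len(actual)


def interval_width_spread(lower, upper):
    """Standard deviation of interval widths; small values mean uniform widths."""
    return pstdev([hi - lo for lo, hi in zip(lower, upper)])


# Invented example: three of four actual values fall inside their interval.
actual = [10, 20, 30, 40]
lower = [8, 18, 31, 35]
upper = [12, 22, 35, 45]
print(interval_coverage(actual, lower, upper))
print(interval_width_spread(lower, upper))
```

If the 97% figure was computed over all 200 observations, it corresponds to 194 actual values falling inside their intervals.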
Which channel has the strongest impact? Standardized coefficients enable fair comparison across different spend scales
This section identifies which marketing channel drives the strongest relative impact on sales outcomes by comparing standardized effect sizes. Standardized coefficients normalize for differences in spend scale across channels, enabling fair comparison of true influence regardless of budget magnitude. This directly addresses the core business question: which channel delivers the most efficient return per unit of variation in spending?
Despite Google Ads having the largest raw coefficient (1.22), TikTok shows the strongest standardized effect. This does not mean TikTok is more efficient per dollar. It means that a one-standard-deviation change in TikTok spend moves sales more than a one-standard-deviation change in either other channel, which implies that TikTok spend varies over a much wider range than Google Ads spend in this dataset. Per dollar, Google Ads remains the most efficient channel; per unit of typical spend variation, TikTok is the most influential channel in the model.
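The relationship between raw and standardized coefficients is β_std = b · sd(x) / sd(y). The sketch below shows how a channel with a smaller raw coefficient can still have the larger standardized effect. The raw coefficients come from this report, but the spend and sales figures are invented purely for illustration and the function name is hypothetical.

```python
from statistics import stdev


def standardized_coefficient(raw_coef, x_values, y_values):
    """Change in y (in SDs) per one-SD change in x."""
    return raw_coef * stdev(x_values) / stdev(y_values)


# Invented data: TikTok spend varies widely, Google Ads spend barely moves.
tiktok_spend = [1000, 3000, 5000, 7000, 9000]
google_spend = [900, 1000, 1100, 1200, 1300]
sales = [2000, 3000, 3500, 5000, 6000]

beta_tiktok = standardized_coefficient(0.36, tiktok_spend, sales)
beta_google = standardized_coefficient(1.22, google_spend, sales)
print(round(beta_tiktok, 3), round(beta_google, 3))
```

In the toy data the ordering flips only because the spread of TikTok spend is about twenty times that of Google Ads spend, so the standardized ranking answers a different question from the per-dollar ranking.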
Is variance constant across fitted values? Scale-Location plot detects heteroscedasticity that violates regression assumptions
This section evaluates whether prediction error variance remains constant across all fitted values—a core assumption of linear regression. The Scale-Location plot visualizes this relationship, while the Breusch-Pagan test provides statistical confirmation. Detecting heteroscedasticity is critical because it undermines the reliability of confidence intervals and hypothesis tests, even when predictions appear accurate.
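A minimal sketch of a studentized (Koenker) Breusch-Pagan check against the fitted values alone: regress squared residuals on fitted values and use n · R² as the statistic, referred to a chi-square distribution with one degree of freedom. This variant is an assumption, since the report does not say which regressors its test used. That said, the reported statistic of 9.1101 with p = 0.0025 does appear consistent with one degree of freedom. Function names and toy data are hypothetical.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def studentized_bp_on_fitted(residuals, fitted):
    """Return (statistic, p_value) for squared residuals regressed on fitted values."""
    n = len(residuals)
    squared = [e * e for e in residuals]
    mean_sq = sum(squared) / n
    mean_fit = sum(fitted) / n
    cov = sum((s - mean_sq) * (f - mean_fit) for s, f in zip(squared, fitted))
    var_sq = sum((s - mean_sq) ** 2 for s in squared)
    var_fit = sum((f - mean_fit) ** 2 for f in fitted)
    if var_sq == 0 or var_fit == 0:
        return 0.0, 1.0  # no variation to explain, so no evidence of a trend
    r_squared = cov * cov / (var_sq * var_fit)
    statistic = n * r_squared
    return statistic, chi2_sf_1df(statistic)


# Invented residuals whose spread grows with the fitted value.
toy_residuals = [0.1, -0.1, 0.5, -0.5, 1.0, -1.0, 2.0, -2.0]
toy_fitted = [1, 2, 3, 4, 5, 6, 7, 8]
print(studentized_bp_on_fitted(toy_residuals, toy_fitted))

# p-value implied by the reported statistic under the 1-df assumption.
print(round(chi2_sf_1df(9.1101), 4))
```

A small p-value says the squared residuals trend with the fitted values. It does not say how large the variance change is, which is why the Scale-Location plot is read alongside the test.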
The model exhibits heteroscedasticity, meaning the variance of prediction errors is not constant across the range of fitted values. While the trend is modest (smooth line variation of ±0.08 around mean 0.87), it is statistically significant. This suggests that uncertainty in marketing ROI predictions may be systematically higher or lower depending on the predicted sales level, potentially affecting the precision of confidence intervals around channel-specific ROI estimates (TikTok: 0