Executive Summary
Key metrics for churn prediction model performance and customer cohort
Analyzed 2000 customers with a 23.5% churn rate. The logistic regression model achieves 84.7% accuracy with AUC of 0.870, while cross-validation AUC of 0.855 confirms robust generalization to unseen customer segments. Features like recency (days since last order), tenure, and satisfaction score emerge as the top predictors for identifying high-risk customers for targeted retention campaigns.
Analysis Overview
Dataset summary and machine learning approach for churn prediction
Dataset contains 2000 customers with 19 behavioral, engagement, and demographic features. The churn base rate is 23.5%, indicating moderate class imbalance. A two-model approach pairs Random Forest (capturing non-linear feature interactions and ranking feature importance) with Logistic Regression (providing interpretable coefficients). Both models are evaluated with 5-fold cross-validation to estimate generalization performance on new customers.
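The 5-fold cross-validation amounts to a shuffled index split in which each customer lands in exactly one held-out fold. A minimal illustrative sketch in Python (the report's actual pipeline is R/caret; fold count matches the report, the seed is an assumption):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Split row indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# 2000 customers into 5 folds of 400; each fold serves once as the
# held-out validation set while the other four train the model.
folds = kfold_indices(2000, k=5)
```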
Data Quality & Preprocessing
Data quality assessment and preprocessing steps applied
Input dataset: 2000 rows × 21 columns (14 numeric, 5 categorical). No rows were removed during preprocessing, so all 2000 customers remain in the dataset; rows with missing values are excluded only at model-fitting time via listwise deletion. Categorical features were converted to factors; numeric features were centered and scaled for logistic regression via caret::preProcess. Feature engineering: none; all predictors derive directly from raw dataset columns.
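The two preprocessing steps (listwise deletion at fit time, center-and-scale standardization per caret's defaults) can be sketched with toy data; this is an illustrative Python sketch, not the report's R code:

```python
import math

def listwise_delete(rows):
    """Drop any row containing a missing value (None), mirroring
    listwise deletion applied at model-fitting time."""
    return [r for r in rows if all(v is not None for v in r)]

def standardize(column):
    """Center to mean 0 and scale to sample SD 1, as caret's
    default center/scale preprocessing does."""
    m = sum(column) / len(column)
    sd = math.sqrt(sum((v - m) ** 2 for v in column) / (len(column) - 1))
    return [(v - m) / sd for v in column]

# Toy data: the row with a missing value is dropped before fitting.
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
complete = listwise_delete(rows)
z = standardize([r[0] for r in complete])
```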
Feature Distributions by Churn Status
Comparison of customer behavioral and demographic features between churned and active customers
Key features show pronounced differences between cohorts: Cashback Amount Total and Account Tenure Months display the largest separation between churners and active customers, indicating strong predictive signal. Customer ID also ranks highly on raw separation, but as an arbitrary identifier it carries no genuine signal and should be excluded from modeling to avoid leakage. Behavioral metrics, particularly recency, engagement frequency, and satisfaction, are critical signals for early identification of at-risk segments.
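One common way to quantify per-feature separation between two cohorts is a standardized mean difference (Cohen's d with pooled SD); whether the report used this statistic is an assumption, and the toy values below are illustrative only:

```python
import math

def cohens_d(a, b):
    """Standardized mean difference between two groups, using the
    pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                       / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

# Hypothetical tenure values: churners skew low, actives skew high,
# yielding a large negative d (strong separation).
churned = [2.0, 3.0, 4.0]
active = [6.0, 7.0, 8.0]
d = cohens_d(churned, active)
```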
Feature Correlation Matrix
Pearson correlations among numeric behavioral, engagement, and demographic features
Numeric features show low to moderate pairwise correlations (|r| < 0.7), suggesting they capture largely distinct dimensions of customer behavior. This limits multicollinearity concerns and supports stable coefficient estimates.
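As a reference for how each cell of the matrix is computed, a small stdlib sketch of Pearson's r:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear feature pairs sit at the extremes r = +1 / -1;
# the report's features all fall below |r| = 0.7.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3], [3, 2, 1])
```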
Feature Importance Ranking
Random Forest feature importance based on mean decrease in Gini impurity
The Random Forest model identifies Satisfaction Score, Account Tenure Months, and Cashback Amount Total as the top predictive features for churn risk. These three features contribute a combined mean decrease in Gini of approximately 208.0, the largest share of the model's total importance, making them prime candidates for targeted retention strategies and customer monitoring dashboards.
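Mean decrease in Gini sums the impurity reduction over every split in which a feature is used. A toy sketch of a single split's contribution, with hypothetical node sizes and churn proportions (the parent proportion matches the report's 23.5% base rate):

```python
def gini(p):
    """Gini impurity of a binary-outcome node with positive proportion p."""
    return 2 * p * (1 - p)

def gini_decrease(n_parent, p_parent, n_left, p_left, n_right, p_right):
    """Impurity reduction from one split: parent impurity minus the
    size-weighted impurity of the two children. Random Forest feature
    importance accumulates these decreases across all trees."""
    child = (n_left * gini(p_left) + n_right * gini(p_right)) / n_parent
    return gini(p_parent) - child

# A split that concentrates churners into the left child drives
# impurity sharply down; node counts/proportions here are hypothetical.
drop = gini_decrease(100, 0.235, 20, 0.9, 80, 0.07)
```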
Logistic Regression Coefficients
Interpretable feature effects on churn probability with statistical significance
Of the model terms (the 19 raw predictors expand under dummy coding of categorical levels), 13 carry positive coefficients that increase churn risk and 13 carry negative coefficients that decrease it; 12 show statistically significant effects (p < 0.05). Negative-coefficient features such as tenure and purchase frequency are protective, while high recency (a long gap since the last order) and low satisfaction strongly increase risk.
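Each logistic coefficient translates into a multiplicative effect on the odds of churn via exp(beta) per one-unit (here, one-SD) increase in the predictor. The coefficient values below are hypothetical, chosen only to match the signs described above:

```python
import math

def odds_ratio(beta):
    """Convert a logistic regression coefficient to an odds ratio:
    the multiplicative change in churn odds per unit increase."""
    return math.exp(beta)

# Hypothetical magnitudes: a protective (negative) tenure effect
# shrinks the odds; a risk-raising (positive) recency effect grows them.
tenure_or = odds_ratio(-0.5)   # < 1: protective
recency_or = odds_ratio(0.7)   # > 1: increases risk
```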
ROC Curve Analysis
Receiver Operating Characteristic curve showing sensitivity-specificity tradeoff at all probability thresholds
The ROC curve shows an AUC of 0.870, indicating strong discrimination between churners and active customers. The chosen operating threshold of 0.30 balances 72.4% sensitivity (catching at-risk customers) against a 15.5% false positive rate, trading some unnecessary interventions for broader coverage of the retention program's target population.
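One standard rule for choosing an operating threshold on a ROC curve is maximizing Youden's J (sensitivity minus false positive rate); whether the report used this rule is an assumption. In the sketch below, the 0.30 and 0.50 rows reuse sensitivity/FPR figures stated elsewhere in this report, while the first row is made up:

```python
def youden_j(sensitivity, fpr):
    """Youden's J statistic: sensitivity minus false positive rate."""
    return sensitivity - fpr

# Candidate (threshold, sensitivity, FPR) points along the ROC curve.
points = [(0.10, 0.95, 0.60), (0.30, 0.724, 0.155), (0.50, 0.558, 0.065)]
best = max(points, key=lambda t: youden_j(t[1], t[2]))
```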
Confusion Matrix
Classification performance breakdown showing true positives, false positives, true negatives, and false negatives
At the default 0.5 classification threshold (stricter than the 0.30 operating point recommended in the ROC analysis), the model correctly identifies 55.8% of actual churners (sensitivity) and 93.5% of actual active customers (specificity). With 263 true positives, the model catches a substantial share of at-risk customers for proactive retention. The 208 false negatives represent missed opportunities, while the 99 false positives are false alarms requiring verification.
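The reported rates follow directly from the confusion-matrix counts; the true-negative count of 1430 is not stated but is inferred from the 2000-customer total (2000 - 263 - 99 - 208):

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the standard rates from raw confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall on churners
        "specificity": tn / (tn + fp),   # recall on active customers
        "precision":   tp / (tp + fp),   # share of flags that churn
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Counts from the report: 263 TP, 99 FP, 208 FN; TN inferred as 1430.
m = classification_metrics(tp=263, fp=99, tn=1430, fn=208)
```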
Classification Performance Metrics
Training and cross-validated performance metrics comparing model quality and generalization
| Metric | Training Value | CV Value |
|---|---|---|
| Accuracy | 0.847 | 0.838 |
| Precision | 0.727 | 0.707 |
| Recall | 0.558 | 0.545 |
| F1 Score | 0.631 | 0.612 |
| AUC | 0.870 | 0.855 |
The model shows strong overall performance: 84.7% training accuracy and 0.870 AUC. Cross-validation confirms good generalization, with a CV AUC of 0.855 and a mean train-to-CV drop of 0.015 across the five metrics, indicating the model should perform consistently on new customer data. Precision (0.727) is notably higher than recall (0.558), so flagged customers are usually genuine churn risks but some churners are missed; lowering the classification threshold, as in the ROC analysis, trades precision for recall.
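The quoted mean drift of 0.015 is reproducible directly from the table above:

```python
# Training vs cross-validated metrics, copied from the table.
training = {"accuracy": 0.847, "precision": 0.727, "recall": 0.558,
            "f1": 0.631, "auc": 0.870}
cv = {"accuracy": 0.838, "precision": 0.707, "recall": 0.545,
      "f1": 0.612, "auc": 0.855}

# Mean train-to-CV drop across the five metrics (all drops are positive).
drift = sum(training[k] - cv[k] for k in training) / len(training)
```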