Executive Summary
Key metrics for churn prediction model performance and customer cohort
Analyzed 2000 customers with a 23.5% churn rate. The logistic regression model achieves 84.7% accuracy with AUC of 0.870, while cross-validation AUC of 0.855 confirms robust generalization to unseen customer segments. Features like recency (days since last order), tenure, and satisfaction score emerge as the top predictors for identifying high-risk customers for targeted retention campaigns.
Analysis Overview
Dataset summary and machine learning approach for churn prediction
Dataset contains 2000 customers with 19 behavioral, engagement, and demographic features. The churn base rate is 23.5%, indicating moderate class imbalance. A two-model approach pairs Random Forest (capturing non-linear feature interactions and ranking feature importance) with Logistic Regression (providing interpretable coefficients). Both models are evaluated with 5-fold cross-validation to estimate generalization performance on new customers.
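The 5-fold cross-validation amounts to a shuffled index split in which each customer lands in exactly one held-out fold. A minimal illustrative sketch in Python (the report's actual pipeline is R/caret; fold count matches the report, the seed is an assumption):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Split row indices 0..n-1 into k shuffled, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# 2000 customers into 5 folds of 400; each fold serves once as the
# held-out validation set while the other four train the model.
folds = kfold_indices(2000, k=5)
```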
Data Quality & Preprocessing
Data quality assessment and preprocessing steps applied
Input dataset: 2000 rows × 21 columns (14 numeric, 5 categorical). No rows were removed during preprocessing, so all 2000 customers remain in the dataset; rows with missing values are excluded only at model-fitting time via listwise deletion. Categorical features were converted to factors; numeric features were centered and scaled for logistic regression via caret::preProcess. Feature engineering: none; all predictors derive directly from raw dataset columns.
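The two preprocessing steps (listwise deletion at fit time, center-and-scale standardization per caret's defaults) can be sketched with toy data; this is an illustrative Python sketch, not the report's R code:

```python
import math

def listwise_delete(rows):
    """Drop any row containing a missing value (None), mirroring
    listwise deletion applied at model-fitting time."""
    return [r for r in rows if all(v is not None for v in r)]

def standardize(column):
    """Center to mean 0 and scale to sample SD 1, as caret's
    default center/scale preprocessing does."""
    m = sum(column) / len(column)
    sd = math.sqrt(sum((v - m) ** 2 for v in column) / (len(column) - 1))
    return [(v - m) / sd for v in column]

# Toy data: the row with a missing value is dropped before fitting.
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
complete = listwise_delete(rows)
z = standardize([r[0] for r in complete])
```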
Feature Distributions by Churn Status
Comparison of customer behavioral and demographic features between churned and active customers
Key features show pronounced differences between cohorts: Cashback Amount Total and Account Tenure Months display the largest separation between churners and active customers, indicating strong predictive signal. Customer ID also ranks highly on raw separation, but as an arbitrary identifier it carries no genuine signal and should be excluded from modeling to avoid leakage. Behavioral metrics, particularly recency, engagement frequency, and satisfaction, are critical signals for early identification of at-risk segments.
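One common way to quantify per-feature separation between two cohorts is a standardized mean difference (Cohen's d with pooled SD); whether the report used this statistic is an assumption, and the toy values below are illustrative only:

```python
import math

def cohens_d(a, b):
    """Standardized mean difference between two groups, using the
    pooled sample standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                       / (len(a) + len(b) - 2))
    return (ma - mb) / pooled

# Hypothetical tenure values: churners skew low, actives skew high,
# yielding a large negative d (strong separation).
churned = [2.0, 3.0, 4.0]
active = [6.0, 7.0, 8.0]
d = cohens_d(churned, active)
```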
Feature Correlation Matrix
Pearson correlations among numeric behavioral, engagement, and demographic features
Numeric features show low to moderate pairwise correlations (|r| < 0.7), suggesting they capture largely distinct dimensions of customer behavior. This limits multicollinearity concerns and supports stable coefficient estimates.
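As a reference for how each cell of the matrix is computed, a small stdlib sketch of Pearson's r:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear feature pairs sit at the extremes r = +1 / -1;
# the report's features all fall below |r| = 0.7.
r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_neg = pearson_r([1, 2, 3], [3, 2, 1])
```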
Feature Importance Ranking
Random Forest feature importance based on mean decrease in Gini impurity
The Random Forest model identifies Satisfaction Score, Account Tenure Months, and Cashback Amount Total as the top predictive features for churn risk. These three features contribute a combined mean decrease in Gini of approximately 208.0, the largest share of the model's total importance, making them prime candidates for targeted retention strategies and customer monitoring dashboards.
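Mean decrease in Gini sums the impurity reduction over every split in which a feature is used. A toy sketch of a single split's contribution, with hypothetical node sizes and churn proportions (the parent proportion matches the report's 23.5% base rate):

```python
def gini(p):
    """Gini impurity of a binary-outcome node with positive proportion p."""
    return 2 * p * (1 - p)

def gini_decrease(n_parent, p_parent, n_left, p_left, n_right, p_right):
    """Impurity reduction from one split: parent impurity minus the
    size-weighted impurity of the two children. Random Forest feature
    importance accumulates these decreases across all trees."""
    child = (n_left * gini(p_left) + n_right * gini(p_right)) / n_parent
    return gini(p_parent) - child

# A split that concentrates churners into the left child drives
# impurity sharply down; node counts/proportions here are hypothetical.
drop = gini_decrease(100, 0.235, 20, 0.9, 80, 0.07)
```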
Logistic Regression Coefficients
Interpretable feature effects on churn probability with statistical significance
Of the model terms (the 19 raw predictors expand under dummy coding of categorical levels), 13 carry positive coefficients that increase churn risk and 13 carry negative coefficients that decrease it; 12 show statistically significant effects (p < 0.05). Negative-coefficient features such as tenure and purchase frequency are protective, while high recency (a long gap since the last order) and low satisfaction strongly increase risk.
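Each logistic coefficient translates into a multiplicative effect on the odds of churn via exp(beta) per one-unit (here, one-SD) increase in the predictor. The coefficient values below are hypothetical, chosen only to match the signs described above:

```python
import math

def odds_ratio(beta):
    """Convert a logistic regression coefficient to an odds ratio:
    the multiplicative change in churn odds per unit increase."""
    return math.exp(beta)

# Hypothetical magnitudes: a protective (negative) tenure effect
# shrinks the odds; a risk-raising (positive) recency effect grows them.
tenure_or = odds_ratio(-0.5)   # < 1: protective
recency_or = odds_ratio(0.7)   # > 1: increases risk
```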
ROC Curve Analysis
Receiver Operating Characteristic curve showing sensitivity-specificity tradeoff at all probability thresholds
The ROC curve shows an AUC of 0.870, indicating strong discrimination between churners and active customers. The chosen operating threshold of 0.30 balances 72.4% sensitivity (catching at-risk customers) against a 15.5% false positive rate, trading some unnecessary interventions for broader coverage of the retention program's target population.
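One standard rule for choosing an operating threshold on a ROC curve is maximizing Youden's J (sensitivity minus false positive rate); whether the report used this rule is an assumption. In the sketch below, the 0.30 and 0.50 rows reuse sensitivity/FPR figures stated elsewhere in this report, while the first row is made up:

```python
def youden_j(sensitivity, fpr):
    """Youden's J statistic: sensitivity minus false positive rate."""
    return sensitivity - fpr

# Candidate (threshold, sensitivity, FPR) points along the ROC curve.
points = [(0.10, 0.95, 0.60), (0.30, 0.724, 0.155), (0.50, 0.558, 0.065)]
best = max(points, key=lambda t: youden_j(t[1], t[2]))
```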
Confusion Matrix
Classification performance breakdown showing true positives, false positives, true negatives, and false negatives
At the default 0.5 classification threshold (stricter than the 0.30 operating point recommended in the ROC analysis), the model correctly identifies 55.8% of actual churners (sensitivity) and 93.5% of actual active customers (specificity). With 263 true positives, the model catches a substantial share of at-risk customers for proactive retention. The 208 false negatives represent missed opportunities, while the 99 false positives are false alarms requiring verification.
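The reported rates follow directly from the confusion-matrix counts; the true-negative count of 1430 is not stated but is inferred from the 2000-customer total (2000 - 263 - 99 - 208):

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive the standard rates from raw confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall on churners
        "specificity": tn / (tn + fp),   # recall on active customers
        "precision":   tp / (tp + fp),   # share of flags that churn
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Counts from the report: 263 TP, 99 FP, 208 FN; TN inferred as 1430.
m = classification_metrics(tp=263, fp=99, tn=1430, fn=208)
```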
Classification Performance Metrics
Training and cross-validated performance metrics comparing model quality and generalization
| Metric | Training Value | CV Value |
|---|---|---|
| Accuracy | 0.847 | 0.838 |
| Precision | 0.727 | 0.707 |
| Recall | 0.558 | 0.545 |
| F1 Score | 0.631 | 0.612 |
| AUC | 0.870 | 0.855 |
The model shows strong overall performance: 84.7% training accuracy and 0.870 AUC. Cross-validation confirms good generalization, with a CV AUC of 0.855 and a mean train-to-CV drop of 0.015 across the five metrics, indicating the model should perform consistently on new customer data. Precision (0.727) is notably higher than recall (0.558), so flagged customers are usually genuine churn risks but some churners are missed; lowering the classification threshold, as in the ROC analysis, trades precision for recall.
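The quoted mean drift of 0.015 is reproducible directly from the table above:

```python
# Training vs cross-validated metrics, copied from the table.
training = {"accuracy": 0.847, "precision": 0.727, "recall": 0.558,
            "f1": 0.631, "auc": 0.870}
cv = {"accuracy": 0.838, "precision": 0.707, "recall": 0.545,
      "f1": 0.612, "auc": 0.855}

# Mean train-to-CV drop across the five metrics (all drops are positive).
drift = sum(training[k] - cv[k] for k in training) / len(training)
```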