K-Means Customer Segmentation
Analysis overview and configuration
| Parameter | Value |
|---|---|
| k_min | 2 |
| k_max | 8 |
| k_clusters | |
| scale_features | TRUE |
| n_start | 25 |
| max_iter | 300 |
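The n_start and max_iter settings above control how many random initializations are tried and how many assignment/update iterations each one may run. The following is a minimal pure-Python sketch of that idea, not the code behind this report; the function names and the toy points are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, k, max_iter, rng):
    """One K-Means run from a random start; returns (wss, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return wss, labels

def kmeans(points, k, n_start=25, max_iter=300, seed=0):
    """Best of n_start random restarts, judged by total within-cluster SS."""
    rng = random.Random(seed)
    runs = (lloyd(points, k, max_iter, rng) for _ in range(n_start))
    return min(runs, key=lambda run: run[0])

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
wss, labels = kmeans(points, k=2)
```

Keeping the run with the lowest within-cluster sum of squares is what makes multiple starts useful: a single random start can settle in a poor local optimum.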
Purpose
This analysis applies K-Means clustering to segment 162 customers from Online Retail Co based on RFM (Recency, Frequency, Monetary) behavioral metrics. The objective is to identify distinct customer groups for targeted engagement strategies. After removing invalid transactions and rows with missing customer identifiers, 78.2% of the raw rows were retained.
Key Findings
- Optimal K Value: 6 clusters identified through silhouette analysis (avg_silhouette = 0.732), indicating well-separated, cohesive groups
- Between-Cluster Separation: Between-cluster sum of squares accounts for 88.2% of total sum of squares, so most of the variation in the scaled RFM features lies between clusters rather than within them
- Cluster Distribution: Highly imbalanced. Cluster 6 (Potential Loyalists) dominates at 49.4% of customers, while Cluster 4 (one of two Champions clusters) contains only 1.9%
- Variance Explained: PC1 and PC2 together capture 81.6% of variance, enabling effective 2D visualization of cluster separation
Interpretation
The model partitions customers into six behaviorally distinct segments. The average silhouette coefficient of 0.732 (well above the 0.5 threshold) indicates that most customers are appropriately assigned to their clusters. The dominance of Cluster 6 suggests most customers
Data preprocessing and column mapping
Purpose
This section documents the data cleaning process that prepared raw transaction data for customer segmentation. Understanding data retention and removal reasons is critical because it directly affects the representativeness of the 162 customers ultimately segmented and the reliability of behavioral insights derived from RFM clustering.
Key Findings
- Retention Rate: 78.2% (3,908 of 5,000 rows retained) - A substantial loss indicating significant data quality issues in the raw dataset
- Rows Removed: 1,092 observations eliminated due to cancelled transactions, missing customer IDs, and invalid quantities/prices
- Final Customer Base: 162 unique customers analyzed, derived from cleaned transaction data
- No Train/Test Split: Unsupervised clustering applied to entire cleaned dataset without holdout validation
Interpretation
The 21.8% removal rate (1,092 of 5,000 rows) suggests the raw transaction data contained considerable noise, primarily cancelled orders and data entry errors. Filtering these rows ensures the clustering model operates only on valid, completed transactions, which is appropriate for RFM analysis. However, the loss of over 1,000 records means the final 162-customer sample may not fully represent the original customer population, potentially biasing insights toward customers with cleaner transaction histories.
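The filtering rules and the retention arithmetic can be sketched as follows. This is an illustration only: the field names (invoice, customer_id, quantity, unit_price) and the use of a leading "C" to mark cancelled invoices are assumptions, since the report does not show the raw schema.

```python
def is_valid(row):
    """Keep completed transactions that have a customer ID and positive quantity and price."""
    return (
        not str(row["invoice"]).startswith("C")  # assumed cancellation marker
        and row["customer_id"] is not None
        and row["quantity"] > 0
        and row["unit_price"] > 0
    )

def retention_rate(kept, total):
    """Share of rows retained, as a percentage rounded to one decimal."""
    return round(100 * kept / total, 1)

# Hypothetical rows, one per removal reason plus one valid row.
rows = [
    {"invoice": "536365", "customer_id": 17850, "quantity": 6, "unit_price": 2.55},
    {"invoice": "C536379", "customer_id": 14527, "quantity": 1, "unit_price": 27.5},
    {"invoice": "536414", "customer_id": None, "quantity": 56, "unit_price": 1.0},
    {"invoice": "536589", "customer_id": 13047, "quantity": -10, "unit_price": 1.0},
]
cleaned = [r for r in rows if is_valid(r)]

# Row counts reported in this section.
reported_retention = retention_rate(3908, 5000)
reported_removal = retention_rate(5000 - 3908, 5000)
```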
Context
Since K-Means is unsupervised, no train/test split was necessary. The high removal rate reflects typical e
Executive Summary
Executive summary of k-means customer segmentation results
| Metric | Value |
|---|---|
| Customers Analyzed | 162 |
| Optimal Clusters (k) | 6 |
| Cluster Separation | 88.2% between-cluster SS |
| Silhouette Quality | 0.7323 (strong) |
| Highest Value Segment | Champions |
| Largest Segment | Potential Loyalists |
Key Findings:
• The 'Champions' segment represents your most valuable customers — prioritize retention and upsell
• Cluster separation is 88.2% (between-cluster SS / total SS) — good separation
• Silhouette score 0.7323 indicates strong cluster structure
• PCA explains 81.6% of variance in 2D visualization
Recommended Actions:
• Champions/Loyal: Reward with loyalty programs and early access
• At Risk: Deploy win-back campaigns with targeted discounts
• Lost: Assess ROI of reactivation vs. new customer acquisition
• Re-run segmentation quarterly to track customer migration between segments
Purpose
This section synthesizes the clustering model's performance and business relevance. It answers whether the segmentation successfully partitioned the customer base into actionable, well-separated groups that can drive targeted business strategies.
Key Findings
- Silhouette Score (0.732): Strong cluster cohesion and separation; customers are reliably assigned to distinct behavioral groups with minimal overlap
- Between-Cluster Variance (88.2%): Excellent cluster structure; 88.2% of total variance is explained by differences between clusters rather than within them
- Variance Explained (81.6%): Two principal components capture most RFM variation, enabling effective 2D visualization and interpretation
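- Segment Distribution: Potential Loyalists (49.4%, Cluster 6) and Recent Customers (about 40%, Clusters 1 and 5 combined) make up most of the base; Champions (Clusters 3 and 4) account for about 6.8% and At Risk for 3.7%. Potential Loyalists are therefore the largest opportunity segment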
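The between-cluster ratio quoted above is the between-cluster sum of squares divided by the total sum of squares. A minimal sketch of that calculation, using hypothetical toy points rather than the report's data:

```python
def ss_ratio(points, labels):
    """Between-cluster SS divided by total SS for labelled points."""
    dims = range(len(points[0]))

    def mean(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in dims]

    def ss(pts, center):
        return sum((p[d] - center[d]) ** 2 for p in pts for d in dims)

    total_ss = ss(points, mean(points))
    within_ss = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        within_ss += ss(members, mean(members))
    # Total SS decomposes into within-cluster SS plus between-cluster SS.
    return (total_ss - within_ss) / total_ss

# Toy example (hypothetical): two tight pairs far apart on the first axis.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
ratio = ss_ratio(toy_points, toy_labels)
```

A ratio close to 1 means nearly all variation lies between clusters. Note that the ratio always rises as k increases, so it describes separation for a given k but cannot be used on its own to choose k.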
Interpretation
The model successfully achieved its objective: 162 customers are segmented into 6 behaviorally distinct groups with high statistical confidence. The 88.2% between-cluster ratio and 0.732 silhouette score indicate genuine customer behavior patterns, not artificial divisions. RFM features (recency, frequency, monetary value) effectively differentiate customer tiers, with Champions showing 2
Elbow Curve
Elbow curve showing within-cluster sum of squares (WSS) vs. number of clusters. The optimal k is identified at the 'elbow' — the point of maximum curvature.
Purpose
The elbow method identifies the optimal number of clusters by analyzing how within-cluster sum of squares (WSS) decreases as k increases. This section determines the "sweet spot" where adding more clusters yields diminishing returns—balancing cluster compactness with interpretability for customer segmentation.
Key Findings
- Optimal k (Elbow Method): 4 clusters - WSS reduction rate drops from 39.2% (k=3 to k=4) to 29% (k=4 to k=5), signaling the inflection point
- Total WSS at k=4: 101.26 - Represents the aggregate within-cluster variance at the recommended cluster count
- Diminishing Returns Pattern: Beyond k=4, percentage improvements flatten (21% at k=6, 18.3% at k=7), indicating marginal gains from additional clusters
Interpretation
The elbow heuristic suggests k=4 for RFM-based customer segmentation. However, the final model uses k=6, the value that maximizes average silhouette width (avg_silhouette=0.73); this criterion weighs cluster cohesion and separation rather than WSS reduction alone. The choice reflects a preference for well-separated customer segments over the more parsimonious partition indicated by the elbow.
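The percentage reductions used in the elbow reading can be computed from the WSS value at each k. The sketch below assumes each reduction is measured relative to the previous k; the WSS values are made up for illustration, because the report gives the actual WSS only for k=4.

```python
def pct_reductions(wss_by_k):
    """Percent drop in WSS from each k to the next, keyed by the larger k."""
    ks = sorted(wss_by_k)
    return {
        k2: round(100 * (wss_by_k[k1] - wss_by_k[k2]) / wss_by_k[k1], 1)
        for k1, k2 in zip(ks, ks[1:])
    }

# Hypothetical WSS curve, not taken from the report.
toy_wss = {2: 400.0, 3: 200.0, 4: 100.0, 5: 80.0, 6: 70.0}
drops = pct_reductions(toy_wss)
```

In this toy curve the reduction falls sharply after k=4, which is the pattern the elbow method looks for.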
Context
The el
Silhouette Analysis
Average silhouette width across cluster counts k = 2 to 8. Higher silhouette = better-defined clusters.
Purpose
Silhouette analysis validates cluster quality by measuring how well each customer fits within their assigned segment versus neighboring clusters. This section confirms that k = 6 produces the strongest cluster separation, ensuring the segmentation is statistically sound and meaningful for customer behavioral grouping.
Key Findings
- Optimal k Value: 6 clusters selected based on peak silhouette score of 0.732, indicating well-separated, internally cohesive segments
- Silhouette Quality Rating: Strong (>0.7 threshold), meaning customers are reliably assigned to their true behavioral groups with minimal overlap
- Score Range Across Models: Silhouette widths ranged from 0.60 (k=2) to 0.73 (k=6) and declined for k > 6, so k=6 is the peak of the curve
- Spread Across k: The standard deviation of 0.05 across k values shows that average silhouette width changes only gradually with k; k=6 is the best of the candidates tested rather than a sharply isolated optimum
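For reference, the silhouette width of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. A minimal sketch using Euclidean distance on hypothetical toy data (the function name is an assumption, and at least two clusters are required):

```python
from math import dist

def silhouette_widths(points, labels):
    """Per-point silhouette width s = (b - a) / max(a, b)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    widths = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # Distance from p to itself is 0, so divide by the other members only.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in pts) / len(pts)
                for other, pts in clusters.items() if other != l)
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths

# Toy example (hypothetical): two pairs on a line.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = [0, 0, 1, 1]
widths = silhouette_widths(toy_points, toy_labels)
avg_width = sum(widths) / len(widths)
```

The average over all points corresponds to the avg_silhouette figure reported for each k; averaging within a cluster gives the per-cluster values.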
Interpretation
The 0.732 silhouette score demonstrates that the six customer segments are distinctly separated in RFM space, with minimal misclassification risk. This strong structure validates that behavioral differences between clusters (Recent Customers, Champions, At Risk, Potential Loyalists) are genuine and statistically robust, not artifacts
Customer Clusters (PCA)
2D PCA projection of the 162 customers colored by cluster assignment. PCA reduces 3 RFM dimensions to 2 for visual inspection of cluster separability.
Purpose
This PCA scatter plot visualizes the 162 segmented customers in 2D space, reducing the three RFM dimensions to their principal components for interpretability. It serves as a diagnostic tool to assess whether the k-means algorithm produced well-separated, meaningful clusters—a critical validation that the segmentation reflects genuine behavioral differences rather than arbitrary partitions.
Key Findings
- Variance Explained (81.6%): PC1 and PC2 together capture over four-fifths of the total variance in RFM data, indicating that the 2D projection preserves most information from the original three dimensions.
- Between-Cluster SS Ratio (88.2%): The vast majority of total sum-of-squares variation occurs between clusters rather than within them, confirming strong cluster separation and cohesion.
- Cluster Distribution: Cluster 6 dominates with 49.4% of customers (80), while clusters 1 and 4 are small (2.5% and 1.9%), suggesting heterogeneous segment sizes reflecting natural customer behavior patterns.
- Spatial Separation: Clusters occupy distinct regions in PC space, with minimal overlap, validating the k=6 choice.
Interpretation
The high between-cluster SS ratio and substantial variance retention demonstrate that the six-cluster solution successfully partitions customers into behavior
Cluster Profiles
Standardized mean RFM feature values per cluster. Positive bars indicate above-average values; negative bars indicate below-average.
Purpose
This section reveals the behavioral DNA of each customer segment by displaying standardized RFM (Recency, Frequency, Monetary) profiles. It directly supports the segmentation objective by showing what distinguishes each cluster—enabling targeted strategies based on purchase recency, engagement frequency, and spending value. The high silhouette score (0.732) confirms these profiles represent genuinely distinct customer groups.
Key Findings
- Cluster 1 (Recent Customers): Monetary value of +3.54 (z-score) with recent purchases (recency -0.09) but low frequency—high-value but infrequent buyers
- Cluster 4 (Champions): Extreme monetary strength (+4.34 scaled, $2,923 raw mean) with recent activity and highest frequency—the most valuable segment
- Cluster 6 (Potential Loyalists): Largest segment (49.4%) with below-average monetary (-0.22) and frequency (-0.32) and a raw mean recency of 2 (versus 1 to 1.5 for most other clusters); a growth opportunity
- Profile Range: Mean scaled values range from -1.09 to +4.34, indicating strong differentiation across segments
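The scaled values above are standardized feature means. One plausible way to obtain them, sketched here under the assumption that each feature is z-scored across all customers using the sample standard deviation and then averaged within each cluster (the report does not state the exact scaling method):

```python
from statistics import mean, stdev

def scaled_cluster_means(values, labels):
    """Mean z-score of one feature per cluster (sample standard deviation)."""
    mu, sd = mean(values), stdev(values)
    z = [(v - mu) / sd for v in values]
    return {
        c: mean([zv for zv, l in zip(z, labels) if l == c])
        for c in sorted(set(labels))
    }

# Toy monetary values (hypothetical): three modest spenders and one large one.
toy_values = [100.0, 200.0, 300.0, 1400.0]
toy_labels = [1, 1, 1, 2]
profile = scaled_cluster_means(toy_values, toy_labels)
```

A small cluster of very high spenders gets a large positive score while the majority sits slightly below zero, which is the same pattern as the +4.34 and -0.22 monetary values above.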
Interpretation
The profiles demonstrate that monetary value alone doesn't define segment quality. Cluster 1 and 4 both show high spending, but Cluster 4 combines this
Cluster Summary
Summary statistics for each of the 6 customer segments, including size, RFM means, and silhouette scores.
| cluster | segment_label | n_customers | pct_customers | mean_recency | mean_frequency | mean_monetary | avg_silhouette |
|---|---|---|---|---|---|---|---|
| 1 | Recent Customers | 4 | 2.5 | 1.5 | 1 | 2475 | 0.4884 |
| 2 | At Risk | 6 | 3.7 | 2 | 2 | 914.8 | 0.6824 |
| 3 | Champions | 8 | 4.9 | 1 | 2.12 | 614.2 | 0.4989 |
| 4 | Champions | 3 | 1.9 | 1 | 2.67 | 2923 | 0.1929 |
| 5 | Recent Customers | 61 | 37.7 | 1 | 1 | 344.1 | 0.7634 |
| 6 | Potential Loyalists | 80 | 49.4 | 2 | 1 | 368.6 | 0.7681 |
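Because two segment labels each cover two clusters, per-label shares require summing across clusters. A short sketch using the counts and monetary means from the table above (only the variable names are assumptions):

```python
# (cluster, segment_label, n_customers, mean_monetary), copied from the table above
rows = [
    (1, "Recent Customers", 4, 2475.0),
    (2, "At Risk", 6, 914.8),
    (3, "Champions", 8, 614.2),
    (4, "Champions", 3, 2923.0),
    (5, "Recent Customers", 61, 344.1),
    (6, "Potential Loyalists", 80, 368.6),
]

total = sum(n for _, _, n, _ in rows)

# Sum customer counts per segment label.
by_label = {}
for _, label, n, _ in rows:
    by_label[label] = by_label.get(label, 0) + n

shares = {label: round(100 * n / total, 1) for label, n in by_label.items()}

# Ratio of mean monetary value, Cluster 4 versus Cluster 5.
monetary = {cluster: m for cluster, _, _, m in rows}
ratio_4_vs_5 = round(monetary[4] / monetary[5], 1)
```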
Purpose
This section summarizes the behavioral characteristics of each of the 6 customer segments identified by K-Means clustering. It translates RFM metrics (Recency, Frequency, Monetary) into actionable segment profiles, enabling understanding of how customers distribute across groups and how well-defined each segment is. This directly supports the segmentation objective by quantifying segment size, composition, and internal cohesion.
Key Findings
- Cluster Size Imbalance: Cluster 6 dominates with 49.4% of customers (80), while Cluster 4 contains only 1.9% (3). Uneven segment sizes are common in RFM data and do not by themselves indicate algorithmic failure, but a three-customer cluster should be interpreted with caution.
- Silhouette Score Variation: Cluster 6 (0.77) and Cluster 5 (0.76) show strong internal cohesion, while Cluster 4 (0.19) is poorly defined, suggesting boundary ambiguity for high-value, high-frequency customers.
- Monetary Spread: Cluster 4 (Champions) averages $2,923 per customer versus $344 for Cluster 5 (Recent Customers), an 8.5× difference that indicates distinct value tiers.
- Recency Pattern: Most clusters sit near recency=1 (recent), except Clusters 2 and
Top Customers by Segment
Top 20 customers by monetary value with cluster assignments and RFM metrics.
| Customer ID | cluster | recency | frequency | monetary | segment_label |
|---|---|---|---|---|---|
| 13777 | 4 | 1 | 2 | 3452 | Champions |
| 15061 | 4 | 1 | 3 | 3452 | Champions |
| 17511 | 1 | 1 | 1 | 3109 | Recent Customers |
| 12758 | 1 | 2 | 1 | 2455 | Recent Customers |
| 18102 | 1 | 2 | 1 | 2286 | Recent Customers |
| 14299 | 1 | 1 | 1 | 2052 | Recent Customers |
| 17949 | 4 | 1 | 3 | 1866 | Champions |
| 16131 | 5 | 1 | 1 | 1318 | Recent Customers |
| 17428 | 2 | 2 | 2 | 1292 | At Risk |
| 13767 | 6 | 2 | 1 | 1198 | Potential Loyalists |
| 13526 | 2 | 2 | 2 | 1182 | At Risk |
| 12931 | 6 | 2 | 1 | 1160 | Potential Loyalists |
| 14156 | 2 | 2 | 2 | 1067 | At Risk |
| 17377 | 3 | 1 | 2 | 1052 | Champions |
| 12435 | 5 | 1 | 1 | 1008 | Recent Customers |
| 13758 | 6 | 2 | 1 | 996.1 | Potential Loyalists |
| 15311 | 3 | 1 | 2 | 991 | Champions |
| 14440 | 5 | 1 | 1 | 935 | Recent Customers |
| 12533 | 6 | 2 | 1 | 929.9 | Potential Loyalists |
| 14527 | 5 | 1 | 1 | 923.8 | Recent Customers |
Purpose
This section validates the k-means segmentation by examining the top 20 highest-spending customers and their cluster assignments. It serves as a quality check to ensure that premium-value customers are appropriately placed in segments aligned with their RFM behavior, enabling targeted retention and engagement strategies for the company's most valuable accounts.
Key Findings
- Monetary Range: $923.79–$3,451.88 across top spenders, with mean of $1,636.22 — indicating substantial variation in high-value customer spending levels
- Segment Distribution: Recent Customers dominate (40%), followed by Champions (25%), showing that recency and frequency vary significantly even among top spenders
- Recency Pattern: Mean recency of 1.45 (median=1) indicates most top spenders purchased recently, though some show recency=2, suggesting engagement decay despite high lifetime value
- Frequency Concentration: Mean frequency of 1.5 reveals that high monetary value doesn't correlate strongly with repeat purchases — many top spenders are one-time or low-frequency buyers
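The summary figures above can be rechecked from the top-20 table. The sketch below uses the values as printed there; because the table rounds monetary values, the recomputed mean differs from the reported $1,636.22 by a couple of cents.

```python
from statistics import mean, median

# (cluster, recency, frequency, monetary, segment_label), copied from the table above
top20 = [
    (4, 1, 2, 3452, "Champions"), (4, 1, 3, 3452, "Champions"),
    (1, 1, 1, 3109, "Recent Customers"), (1, 2, 1, 2455, "Recent Customers"),
    (1, 2, 1, 2286, "Recent Customers"), (1, 1, 1, 2052, "Recent Customers"),
    (4, 1, 3, 1866, "Champions"), (5, 1, 1, 1318, "Recent Customers"),
    (2, 2, 2, 1292, "At Risk"), (6, 2, 1, 1198, "Potential Loyalists"),
    (2, 2, 2, 1182, "At Risk"), (6, 2, 1, 1160, "Potential Loyalists"),
    (2, 2, 2, 1067, "At Risk"), (3, 1, 2, 1052, "Champions"),
    (5, 1, 1, 1008, "Recent Customers"), (6, 2, 1, 996.1, "Potential Loyalists"),
    (3, 1, 2, 991, "Champions"), (5, 1, 1, 935, "Recent Customers"),
    (6, 2, 1, 929.9, "Potential Loyalists"), (5, 1, 1, 923.8, "Recent Customers"),
]

mean_monetary = mean(row[3] for row in top20)
mean_recency = mean(row[1] for row in top20)
median_recency = median(row[1] for row in top20)
mean_frequency = mean(row[2] for row in top20)

# Share of the top 20 held by each segment label, in percent.
label_share = {}
for row in top20:
    label_share[row[4]] = label_share.get(row[4], 0) + 100 / len(top20)
```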
Interpretation
The segmentation successfully differentiates high-value customers across multiple behavioral profiles. Champions (clusters 3–4) combine recent activity with higher frequency, while Recent Customers (clusters 1, 5) show high monetary value but lower engagement frequency. This heter