Analysis overview and configuration
| Parameter | Value |
|---|---|
| k_min | 2 |
| k_max | 8 |
| k_clusters | k_clusters |
| scale_features | TRUE |
| n_start | 25 |
| max_iter | 300 |
This analysis applies K-Means clustering to segment 162 customers of Online Retail Co by their RFM (Recency, Frequency, Monetary) behavioral metrics. The objective is to identify distinct customer groups for targeted engagement strategies. After removing invalid transactions and rows with missing customer identifiers, 78.2% of the raw transaction rows were retained for analysis.
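As a rough illustration of how the configuration above could drive the clustering, the following pure-Python sketch scales the features, runs Lloyd's algorithm from several random starts, and keeps the solution with the lowest within-cluster sum of squares. The function names (`zscore_columns`, `kmeans`), the seed, the use of the population standard deviation, and the toy rows are assumptions for illustration; the report does not state which implementation was used.

```python
import random
from statistics import mean, pstdev

def zscore_columns(rows):
    """Scale each column to mean 0 and standard deviation 1 (cf. scale_features)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_start=25, max_iter=300, seed=0):
    """Lloyd's algorithm with n_start random initialisations; returns (labels, wss)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_start):
        centers = rng.sample(points, k)
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            for j in range(k):
                members = [p for p, l in zip(points, labels) if l == j]
                if members:  # keep the old centre if a cluster went empty
                    centers[j] = tuple(mean(c) for c in zip(*members))
        wss = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
        if best is None or wss < best[1]:
            best = (labels, wss)
    return best

# Hypothetical (recency, frequency, monetary) rows, not taken from the report.
toy = [(1, 1, 300), (1, 1, 320), (1, 1, 310), (2, 3, 2900), (2, 3, 3000), (2, 3, 3100)]
labels, wss = kmeans(zscore_columns(toy), k=2)
```

Scaling matters here because monetary values span thousands while recency and frequency span single digits; without it, distance would be dominated by the monetary column.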
The model partitions customers into six behaviorally distinct segments. The average silhouette coefficient of 0.732, well above the commonly used 0.5 threshold, indicates that most customers sit comfortably within their assigned clusters. The dominance of Cluster 6 suggests most customers
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 5,000 |
| Final Rows | 3,908 |
| Rows Removed | 1,092 |
| Retention Rate | 78.2% |
This section documents the data cleaning process that prepared raw transaction data for customer segmentation. Understanding data retention and removal reasons is critical because it directly affects the representativeness of the 162 customers ultimately segmented and the reliability of behavioral insights derived from RFM clustering.
The 21.8% removal rate suggests the raw transaction data contained considerable noise, primarily cancelled orders and data entry errors. This aggressive filtering ensures the clustering model operates on valid, completed transactions only, which is appropriate for RFM analysis. However, the loss of 1,092 records means the final set of 162 customers may not fully represent the original customer population, potentially biasing insights toward customers with cleaner transaction histories.
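The cleaning step and the retention figures can be sketched as follows. The field names (`invoice`, `customer_id`, `quantity`, `unit_price`), the `C` prefix as a cancellation marker, and the sample rows are all assumptions made for illustration; only the 5,000 and 3,908 row counts come from the table above.

```python
def is_valid(row):
    """Keep rows with a customer ID, a non-cancelled invoice, and positive quantity and price."""
    return (
        row["customer_id"] is not None
        and not str(row["invoice"]).startswith("C")  # assumed marker for cancellations
        and row["quantity"] > 0
        and row["unit_price"] > 0
    )

# Hypothetical rows for illustration only.
raw = [
    {"invoice": "A0001", "customer_id": 1001, "quantity": 6, "unit_price": 2.55},
    {"invoice": "CA0002", "customer_id": 1002, "quantity": -1, "unit_price": 27.50},
    {"invoice": "A0003", "customer_id": None, "quantity": 56, "unit_price": 0.0},
    {"invoice": "A0004", "customer_id": 1003, "quantity": 32, "unit_price": 1.69},
]
clean = [r for r in raw if is_valid(r)]

# Retention as reported in the table above: 3,908 of 5,000 rows kept.
retention_rate = 3_908 / 5_000      # 0.7816, reported as 78.2%
removal_rate = 1 - retention_rate   # 0.2184, reported as 21.8%
```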
Since K-Means is unsupervised, no train/test split was necessary. The high removal rate reflects typical e
| Metric | Value |
|---|---|
| Customers Analyzed | 162 |
| Optimal Clusters (k) | 6 |
| Cluster Separation | 88.2% between-cluster SS |
| Silhouette Quality | 0.7323 (strong) |
| Highest Value Segment | Champions |
| Largest Segment | Potential Loyalists |
This section synthesizes the clustering model's performance and business relevance. It answers whether the segmentation successfully partitioned the customer base into actionable, well-separated groups that can drive targeted business strategies.
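For readers unfamiliar with the cluster-separation metric, the between-cluster share of the total sum of squares can be computed as in the sketch below. The helper names and the four toy points are hypothetical; the report's 88.2% figure comes from the full 162-customer model, not from this example.

```python
from statistics import mean

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(mean(c) for c in zip(*points))

def between_ss_ratio(points, labels):
    """Return between-cluster SS / total SS for a given assignment."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        c = centroid(members)
        within_ss += sum(sq_dist(p, c) for p in members)
    return (total_ss - within_ss) / total_ss

# Hypothetical 2-D points in two tight groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
ratio = between_ss_ratio(pts, [0, 0, 1, 1])
```

A ratio close to 1 means most of the variance lies between clusters rather than inside them.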
The model achieved its objective: the 162 customers are segmented into 6 behaviorally distinct groups. The between-cluster sum of squares accounts for 88.2% of the total, and the average silhouette score is 0.732, both of which indicate genuine customer behavior patterns rather than artificial divisions. RFM features (recency, frequency, monetary value) effectively differentiate customer tiers, with Champions showing 2
Elbow curve showing within-cluster sum of squares (WSS) vs. number of clusters. The optimal k is identified at the 'elbow' — the point of maximum curvature.
The elbow method identifies the optimal number of clusters by analyzing how within-cluster sum of squares (WSS) decreases as k increases. This section determines the "sweet spot" where adding more clusters yields diminishing returns—balancing cluster compactness with interpretability for customer segmentation.
The elbow curve suggests k=4 as the point beyond which additional clusters yield diminishing WSS reductions. However, the final model uses k=6, selected by silhouette analysis (avg_silhouette=0.73), which measures cluster cohesion and separation directly rather than relying on the bend in the WSS curve. Because WSS always decreases as k grows, choosing k=6 over k=4 trades a more parsimonious partition for finer-grained segments with the strongest average separation.
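One simple way to locate an elbow numerically is to take the k with the largest second difference of the WSS curve, a discrete stand-in for maximum curvature. The sketch below assumes that heuristic; the report does not say how its elbow was identified, and the WSS values are made up so that the bend falls at k = 4 as described above. They are not the report's actual values.

```python
def elbow_k(wss_by_k):
    """Pick the interior k with the largest second difference of WSS."""
    ks = sorted(wss_by_k)
    best_k, best_curv = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        curvature = wss_by_k[prev] - 2 * wss_by_k[k] + wss_by_k[nxt]
        if curvature > best_curv:
            best_k, best_curv = k, curvature
    return best_k

# Hypothetical WSS values for k = 2..8 (illustration only).
wss = {2: 300.0, 3: 200.0, 4: 110.0, 5: 95.0, 6: 85.0, 7: 78.0, 8: 72.0}
chosen = elbow_k(wss)
```

The endpoints of the k range cannot be selected by this rule because they have no neighbor on one side.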
The el
Average silhouette width across cluster counts k = 2 to 8. Higher silhouette = better-defined clusters.
Silhouette analysis validates cluster quality by measuring how well each customer fits within their assigned segment versus neighboring clusters. This section confirms that k = 6 produces the strongest cluster separation, ensuring the segmentation is statistically sound and meaningful for customer behavioral grouping.
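The silhouette width of a single customer compares its mean distance to members of its own cluster (a) with its mean distance to the nearest other cluster (b), giving s = (b - a) / max(a, b). A minimal sketch, assuming Euclidean distance and using hypothetical points:

```python
from math import dist
from statistics import mean

def silhouette_widths(points, labels):
    """Return s(i) = (b - a) / max(a, b) for every point."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        if not own:  # singleton cluster: s(i) is conventionally 0
            widths.append(0.0)
            continue
        a = mean(own)
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths

# Hypothetical 2-D points in two tight groups.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
widths = silhouette_widths(pts, [0, 0, 1, 1])
avg_width = mean(widths)
```

Values near 1 indicate a point far from neighboring clusters, values near 0 a point on a boundary, and negative values a point that sits closer to another cluster than to its own.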
The 0.732 average silhouette score indicates that the six customer segments are, on the whole, distinctly separated in RFM space. This strong structure supports the view that behavioral differences between the labeled segments (Recent Customers, Champions, At Risk, Potential Loyalists) are genuine and statistically robust, not artifacts
2D PCA projection of the 162 customers colored by cluster assignment. PCA reduces 3 RFM dimensions to 2 for visual inspection of cluster separability.
This PCA scatter plot shows the 162 segmented customers in 2D, projecting the three RFM dimensions onto their first two principal components. It serves as a visual diagnostic of whether the k-means algorithm produced well-separated, meaningful clusters, that is, whether the segmentation reflects genuine behavioral differences rather than arbitrary partitions.
The high between-cluster SS ratio and substantial variance retention demonstrate that the six-cluster solution successfully partitions customers into behavior
Standardized mean RFM feature values per cluster. Positive bars indicate above-average values; negative bars indicate below-average.
This section reveals the behavioral DNA of each customer segment by displaying standardized RFM (Recency, Frequency, Monetary) profiles. It directly supports the segmentation objective by showing what distinguishes each cluster—enabling targeted strategies based on purchase recency, engagement frequency, and spending value. The high silhouette score (0.732) confirms these profiles represent genuinely distinct customer groups.
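A standardized profile of this kind can be produced by z-scoring each feature across all customers and then averaging the z-scores within each cluster. The sketch below assumes the population standard deviation and uses hypothetical rows and labels; the report does not specify which scaling variant was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_profiles(rows, labels):
    """Z-score each feature across all customers, then average the z-scores per cluster."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append([(v - m) / s for v, m, s in zip(row, mus, sds)])
    return {lab: [mean(c) for c in zip(*zs)] for lab, zs in grouped.items()}

# Hypothetical (recency, frequency, monetary) rows and labels.
rows = [(1, 1, 300), (1, 1, 340), (2, 3, 2900), (2, 3, 3100)]
profiles = cluster_profiles(rows, ["low", "low", "high", "high"])
```

A positive entry in a profile means the cluster's average for that feature is above the overall mean, and a negative entry means it is below.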
The profiles demonstrate that monetary value alone does not define segment quality. Clusters 1 and 4 both show high spending, but Cluster 4 combines this
Summary statistics for each of the 6 customer segments, including size, RFM means, and silhouette scores.
| cluster | segment_label | n_customers | pct_customers | mean_recency | mean_frequency | mean_monetary | avg_silhouette |
|---|---|---|---|---|---|---|---|
| 1 | Recent Customers | 4 | 2.5 | 1.5 | 1 | 2475 | 0.4884 |
| 2 | At Risk | 6 | 3.7 | 2 | 2 | 914.8 | 0.6824 |
| 3 | Champions | 8 | 4.9 | 1 | 2.12 | 614.2 | 0.4989 |
| 4 | Champions | 3 | 1.9 | 1 | 2.67 | 2923 | 0.1929 |
| 5 | Recent Customers | 61 | 37.7 | 1 | 1 | 344.1 | 0.7634 |
| 6 | Potential Loyalists | 80 | 49.4 | 2 | 1 | 368.6 | 0.7681 |
This section summarizes the behavioral characteristics of each of the 6 customer segments identified by K-Means clustering. It translates RFM metrics (Recency, Frequency, Monetary) into actionable segment profiles, showing how customers are distributed across groups and how well-defined each segment is. This directly supports the segmentation objective by quantifying segment size, composition, and internal cohesion.
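The table's figures can be cross-checked with a few lines of arithmetic. The sketch below uses only values copied from the summary table; the variable names are illustrative. It recovers the overall silhouette as the customer-weighted mean of the per-cluster values, recomputes each segment's share of customers, and identifies the least cohesive cluster.

```python
# (cluster, n_customers, avg_silhouette) copied from the summary table.
segments = [
    (1, 4, 0.4884),
    (2, 6, 0.6824),
    (3, 8, 0.4989),
    (4, 3, 0.1929),
    (5, 61, 0.7634),
    (6, 80, 0.7681),
]
total = sum(n for _, n, _ in segments)
overall = sum(n * s for _, n, s in segments) / total
shares = {c: round(100 * n / total, 1) for c, n, _ in segments}
weakest = min(segments, key=lambda seg: seg[2])[0]

print(total, round(overall, 4), weakest)  # 162 0.7323 4
```

The weighted mean shows that the overall score is driven largely by clusters 5 and 6, which together hold 141 of the 162 customers, while cluster 4 has the lowest cohesion at 0.1929.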
Top 20 customers by monetary value with cluster assignments and RFM metrics.
| Customer ID | cluster | recency | frequency | monetary | segment_label |
|---|---|---|---|---|---|
| 13777 | 4 | 1 | 2 | 3452 | Champions |
| 15061 | 4 | 1 | 3 | 3452 | Champions |
| 17511 | 1 | 1 | 1 | 3109 | Recent Customers |
| 12758 | 1 | 2 | 1 | 2455 | Recent Customers |
| 18102 | 1 | 2 | 1 | 2286 | Recent Customers |
| 14299 | 1 | 1 | 1 | 2052 | Recent Customers |
| 17949 | 4 | 1 | 3 | 1866 | Champions |
| 16131 | 5 | 1 | 1 | 1318 | Recent Customers |
| 17428 | 2 | 2 | 2 | 1292 | At Risk |
| 13767 | 6 | 2 | 1 | 1198 | Potential Loyalists |
| 13526 | 2 | 2 | 2 | 1182 | At Risk |
| 12931 | 6 | 2 | 1 | 1160 | Potential Loyalists |
| 14156 | 2 | 2 | 2 | 1067 | At Risk |
| 17377 | 3 | 1 | 2 | 1052 | Champions |
| 12435 | 5 | 1 | 1 | 1008 | Recent Customers |
| 13758 | 6 | 2 | 1 | 996.1 | Potential Loyalists |
| 15311 | 3 | 1 | 2 | 991 | Champions |
| 14440 | 5 | 1 | 1 | 935 | Recent Customers |
| 12533 | 6 | 2 | 1 | 929.9 | Potential Loyalists |
| 14527 | 5 | 1 | 1 | 923.8 | Recent Customers |
This section validates the k-means segmentation by examining the top 20 highest-spending customers and their cluster assignments. It serves as a quality check to ensure that premium-value customers are appropriately placed in segments aligned with their RFM behavior, enabling targeted retention and engagement strategies for the company's most valuable accounts.
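Because clusters 1 and 4 contain only 4 and 3 customers respectively, all of their members appear in the top-20 list, so their mean monetary values can be recomputed directly from it. The sketch below uses only values copied from the two tables; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# (customer_id, cluster, monetary) copied from the top-20 table.
top20 = [
    (13777, 4, 3452), (15061, 4, 3452), (17511, 1, 3109), (12758, 1, 2455),
    (18102, 1, 2286), (14299, 1, 2052), (17949, 4, 1866), (16131, 5, 1318),
    (17428, 2, 1292), (13767, 6, 1198), (13526, 2, 1182), (12931, 6, 1160),
    (14156, 2, 1067), (17377, 3, 1052), (12435, 5, 1008), (13758, 6, 996.1),
    (15311, 3, 991), (14440, 5, 935), (12533, 6, 929.9), (14527, 5, 923.8),
]
counts = Counter(cluster for _, cluster, _ in top20)

spend = defaultdict(list)
for _, cluster, monetary in top20:
    spend[cluster].append(monetary)

mean_c1 = sum(spend[1]) / len(spend[1])  # 2475.5, listed as 2475 in the summary table
mean_c4 = sum(spend[4]) / len(spend[4])  # about 2923.3, listed as 2923
```

Both recomputed means agree with the segment summary to the precision shown there, and every one of the six clusters contributes at least two customers to the top 20.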
The segmentation successfully differentiates high-value customers across multiple behavioral profiles. Champions (clusters 3–4) combine recent activity with higher frequency, while Recent Customers (clusters 1, 5) show high monetary value but lower engagement frequency. This heter