Analysis overview and configuration

Configuration

Analysis Type: Kmeans

Company: Online Retail Co

Objective: Segment customers into behavioral groups using k-means clustering on RFM features

Analysis Date: 2026-03-14

Processing ID: analytics__ml__clustering__kmeans_test_20260314_214509

Total Observations: 3908

Module Parameters

Parameter	Value
k_min	2
k_max	8
k_clusters	(not set)
scale_features	TRUE
n_start	25
max_iter	300
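The report does not show the clustering implementation itself. As a rough illustration of what n_start and max_iter typically control, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) with random restarts. The function names, the seed argument, and the toy points are assumptions for illustration only, not the module's actual code.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, max_iter, rng):
    """One run of Lloyd's algorithm from a random initialisation."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


def kmeans(points, k, n_start=25, max_iter=300, seed=0):
    """Run n_start random restarts and keep the one with the lowest total WSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, max_iter, rng) for _ in range(n_start))
    return min(runs, key=lambda run: run[2])


# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, wss = kmeans(toy_points, k=2)
```

Restarts matter because a single run can settle in a poor local optimum; keeping the lowest-WSS run is the usual remedy.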

Kmeans analysis for Online Retail Co

Interpretation

Purpose

This analysis applies K-Means clustering to segment 162 customers of Online Retail Co using RFM (Recency, Frequency, Monetary) behavioral metrics. The objective is to identify distinct customer groups for targeted engagement strategies. The input was cleaned first: 78.2% of transaction rows were retained after invalid transactions and rows with missing customer identifiers were removed.
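For readers unfamiliar with RFM, the sketch below shows one common way to derive the three features from transaction rows. The row layout (customer_id, invoice_id, invoice_date, amount), the sample rows, and the choice of days as the recency unit are all assumptions; the report does not state how its recency values were computed or in which unit.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, as_of):
    """Aggregate (customer_id, invoice_id, invoice_date, amount) rows into RFM features."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer_id, invoice_id, invoice_date, amount in transactions:
        if customer_id not in last_purchase or invoice_date > last_purchase[customer_id]:
            last_purchase[customer_id] = invoice_date
        invoices[customer_id].add(invoice_id)  # distinct invoices, not line items
        spend[customer_id] += amount
    return {
        cid: {
            "recency": (as_of - last_purchase[cid]).days,  # assumed unit: days
            "frequency": len(invoices[cid]),
            "monetary": round(spend[cid], 2),
        }
        for cid in last_purchase
    }


# Hypothetical sample rows, not taken from the report's data.
rows = [
    ("A", "inv1", date(2026, 1, 5), 120.0),
    ("A", "inv2", date(2026, 2, 20), 80.5),
    ("B", "inv3", date(2026, 3, 1), 40.0),
    ("B", "inv3", date(2026, 3, 1), 15.0),
]
rfm = rfm_table(rows, as_of=date(2026, 3, 14))
```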

Key Findings

Optimal K Value: 6 clusters identified through silhouette analysis (avg_silhouette = 0.732), indicating well-separated, cohesive groups
Between-Cluster Separation: 88.2% ratio demonstrates strong cluster distinctiveness relative to within-cluster variance
Cluster Distribution: Highly imbalanced—Cluster 6 (Potential Loyalists) dominates at 49.4% of customers, while Cluster 4 (Champions) contains only 1.9%
Variance Explained: PC1 and PC2 together capture 81.6% of variance, enabling effective 2D visualization of cluster separation

Interpretation

The model partitions customers into six behaviorally distinct segments. The average silhouette coefficient of 0.732, well above the commonly used 0.5 threshold, indicates that most customers sit closer to their own cluster than to any neighboring one, although the small Cluster 4 is a weak spot (silhouette 0.19). The dominance of Cluster 6 suggests most customers

Data preprocessing and column mapping

Data Quality

Metric	Value
Initial Rows	5,000
Final Rows	3,908
Rows Removed	1,092
Retention Rate	78.2%

Processed 5,000 observations, retained 3,908 (78.2%) after cleaning

Interpretation

Purpose

This section documents the data cleaning process that prepared raw transaction data for customer segmentation. Understanding data retention and removal reasons is critical because it directly affects the representativeness of the 162 customers ultimately segmented and the reliability of behavioral insights derived from RFM clustering.

Key Findings

Retention Rate: 78.2% (3,908 of 5,000 rows retained) - A substantial loss indicating significant data quality issues in the raw dataset
Rows Removed: 1,092 observations eliminated due to cancelled transactions, missing customer IDs, and invalid quantities/prices
Final Customer Base: 162 unique customers analyzed, derived from cleaned transaction data
No Train/Test Split: Unsupervised clustering applied to entire cleaned dataset without holdout validation
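The three removal reasons listed above could be expressed as a row filter along the following lines. This is a sketch under stated assumptions: the field names (invoice, customer_id, quantity, price), the use of a "C" prefix to mark cancelled invoices, and the sample rows are hypothetical and not confirmed by the report.

```python
def is_valid(row):
    """Return True when a transaction row passes all three cleaning rules."""
    if str(row.get("invoice", "")).startswith("C"):  # assumed cancellation marker
        return False
    if not row.get("customer_id"):  # missing customer identifier
        return False
    return row.get("quantity", 0) > 0 and row.get("price", 0) > 0


def clean(rows):
    """Filter rows and report the retention rate as a percentage."""
    kept = [r for r in rows if is_valid(r)]
    retention_pct = round(100 * len(kept) / len(rows), 1) if rows else 0.0
    return kept, retention_pct


# Hypothetical rows, one per outcome.
sample = [
    {"invoice": "1001", "customer_id": "A1", "quantity": 6, "price": 2.55},    # kept
    {"invoice": "C1002", "customer_id": "A2", "quantity": -1, "price": 27.5},  # cancelled
    {"invoice": "1003", "customer_id": None, "quantity": 56, "price": 1.25},   # missing ID
    {"invoice": "1004", "customer_id": "A3", "quantity": 3, "price": 0.0},     # invalid price
]
kept, pct = clean(sample)

# The report's own figures: 3,908 of 5,000 rows retained.
report_retention = round(100 * 3908 / 5000, 1)
```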

Interpretation

The 21.8% removal rate suggests the raw transaction data contained considerable noise, primarily cancelled orders and data entry errors. Filtering these rows means the clustering model operates only on valid, completed transactions, which is appropriate for RFM analysis. However, with over 1,000 records removed, the 162 customers that remain may not fully represent the original customer population, and insights may be biased toward customers with cleaner transaction histories.

Context

Since K-Means is unsupervised, no train/test split was necessary. The high removal rate reflects typical e

Key Metrics

Total_Customers: 162
Clusters_Identified: 6
Avg_Silhouette_Width: 0.7323
Between_SS_Ratio_pct: 88.2
Variance_Explained_pct: 81.6
Highest_Value_Segment: Champions

Key Findings

Metric	Value
Customers Analyzed	162
Optimal Clusters (k)	6
Cluster Separation	88.2% between-cluster SS
Silhouette Quality	0.7323 (strong)
Highest Value Segment	Champions
Largest Segment	Potential Loyalists

Summary

Bottom Line: K-means clustering identified 6 distinct customer segments among 162 customers, with 88.2% between-cluster variance (strong cluster structure).

Key Findings:
• The 'Champions' segment represents your most valuable customers — prioritize retention and upsell
• Cluster separation is 88.2% (between-cluster SS / total SS) — good separation
• Silhouette score 0.7323 indicates strong cluster structure
• PCA explains 81.6% of variance in 2D visualization

Recommended Actions:
• Champions/Loyal: Reward with loyalty programs and early access
• At Risk: Deploy win-back campaigns with targeted discounts
• Lost: Assess ROI of reactivation vs. new customer acquisition
• Re-run segmentation quarterly to track customer migration between segments

Interpretation

EXECUTIVE SUMMARY: K-MEANS CUSTOMER SEGMENTATION

Purpose

This section synthesizes the clustering model's performance and business relevance. It answers whether the segmentation successfully partitioned the customer base into actionable, well-separated groups that can drive targeted business strategies.

Key Findings

Silhouette Score (0.732): Strong cluster cohesion and separation; customers are reliably assigned to distinct behavioral groups with minimal overlap
Between-Cluster Variance (88.2%): Excellent cluster structure; 88.2% of total variance is explained by differences between clusters rather than within them
Variance Explained (81.6%): Two principal components capture most RFM variation, enabling effective 2D visualization and interpretation
Segment Distribution: Potential Loyalists (49.4%) and Recent Customers (40.1% across Clusters 1 and 5) dominate the base; Champions (Clusters 3 and 4) account for only 6.8%. The Potential Loyalists group is a large opportunity segment
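The between-cluster ratio quoted above is the share of the total sum of squares that is not within-cluster, i.e. (total SS - within SS) / total SS. A minimal sketch on hypothetical toy points (not the report's data) shows the arithmetic:

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def sum_sq(points, center):
    """Sum of squared Euclidean distances from each point to center."""
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in points)


def between_ss_ratio(points, labels):
    """Fraction of total sum of squares explained by the cluster assignment."""
    total_ss = sum_sq(points, centroid(points))
    within_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within_ss += sum_sq(members, centroid(members))
    return (total_ss - within_ss) / total_ss


# Hypothetical toy data: two tight pairs far apart.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
ratio = between_ss_ratio(toy_points, toy_labels)  # 100 / 104
```

Because adding clusters can only lower the within-cluster sum of squares, this ratio rises with k and is best read alongside the silhouette score rather than on its own.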

Interpretation

The model successfully achieved its objective: 162 customers are segmented into 6 behaviorally distinct groups with high statistical confidence. The 88.2% between-cluster ratio and 0.732 silhouette score indicate genuine customer behavior patterns, not artificial divisions. RFM features (recency, frequency, monetary value) effectively differentiate customer tiers, with Champions showing 2

Elbow curve showing within-cluster sum of squares (WSS) vs. number of clusters. The optimal k is identified at the 'elbow' — the point of maximum curvature.

Interpretation

Purpose

The elbow method identifies the optimal number of clusters by analyzing how within-cluster sum of squares (WSS) decreases as k increases. This section determines the "sweet spot" where adding more clusters yields diminishing returns—balancing cluster compactness with interpretability for customer segmentation.

Key Findings

Optimal k (Elbow Method): 4 clusters - WSS reduction rate drops from 39.2% (k=3 to k=4) to 29% (k=4 to k=5), signaling the inflection point
Total WSS at k=4: 101.26 - Represents the aggregate within-cluster variance at the recommended cluster count
Diminishing Returns Pattern: Beyond k=4, percentage improvements flatten (21% at k=6, 18.3% at k=7), indicating marginal gains from additional clusters
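The percentages above are step-to-step WSS reductions. The report gives only one absolute WSS value (101.26 at k=4), so the sketch below uses clearly hypothetical WSS numbers to show how such rates are computed and how one simple elbow rule (pick the k after which the reduction rate falls the most) could be applied. The rule itself is an assumption; the report does not state which curvature criterion it uses.

```python
def reduction_rates(wss_by_k):
    """Percent WSS reduction achieved when moving from k-1 to k clusters."""
    ks = sorted(wss_by_k)
    return {
        k: 100 * (wss_by_k[prev] - wss_by_k[k]) / wss_by_k[prev]
        for prev, k in zip(ks, ks[1:])
    }


def elbow_k(wss_by_k):
    """Return the k after which the reduction rate falls by the largest amount."""
    rates = reduction_rates(wss_by_k)
    ks = sorted(rates)
    drops = {k: rates[k] - rates[nxt] for k, nxt in zip(ks, ks[1:])}
    return max(drops, key=drops.get)


# Hypothetical WSS values for illustration only.
toy_wss = {2: 400.0, 3: 200.0, 4: 100.0, 5: 80.0, 6: 70.0}
rates = reduction_rates(toy_wss)  # {3: 50.0, 4: 50.0, 5: 20.0, 6: 12.5}
best_k = elbow_k(toy_wss)         # 4
```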

Interpretation

The elbow curve suggests k=4 as the point of diminishing returns for RFM-based customer segmentation. However, the final model uses k=6, chosen by silhouette analysis (avg_silhouette=0.73), which scores cluster cohesion and separation rather than WSS reduction alone. The two criteria disagree here, and the report follows the silhouette criterion, favoring well-separated customer segments over the more parsimonious four-cluster solution.

Context

The el

Average silhouette width across cluster counts k = 2 to 8. Higher silhouette = better-defined clusters.

Interpretation

Purpose

Silhouette analysis validates cluster quality by measuring how well each customer fits within their assigned segment versus neighboring clusters. This section confirms that k = 6 produces the strongest cluster separation, ensuring the segmentation is statistically sound and meaningful for customer behavioral grouping.

Key Findings

Optimal k Value: 6 clusters selected based on peak silhouette score of 0.732, indicating well-separated, internally cohesive segments
Silhouette Quality Rating: Strong (>0.7 threshold), meaning customers are reliably assigned to their true behavioral groups with minimal overlap
Score Range Across Models: Silhouette widths ranged from 0.60 (k=2) to 0.73 (k=6) and declined consistently after k=6, confirming k=6 as the peak
Cluster Stability: The standard deviation of 0.05 across k values shows that scores change gradually from one k to the next, so k=6 is the best of the candidates tested rather than an overwhelming winner
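For reference, the silhouette width of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster. The sketch below computes it on hypothetical one-dimensional toy points; it assumes at least two clusters and uses the common convention of a width of 0 for singleton clusters.

```python
from math import dist


def silhouette_widths(points, labels):
    """Per-point silhouette widths using Euclidean distance."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [
            dist(p, q)
            for j, (q, l) in enumerate(zip(points, labels))
            if l == lab and j != i
        ]
        if not own:
            widths.append(0.0)  # singleton cluster convention
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels)
            if other != lab
        )
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical toy data: two pairs on a line.
toy_points = [(0,), (1,), (10,), (11,)]
toy_labels = [0, 0, 1, 1]
widths = silhouette_widths(toy_points, toy_labels)
avg_silhouette = sum(widths) / len(widths)
```

Averaging these widths over all points gives the figure compared across k values.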

Interpretation

The 0.732 average silhouette score indicates that the six customer segments are well separated in RFM space, with few customers lying near a cluster boundary. The six clusters map onto four segment labels (Recent Customers, Champions, At Risk, Potential Loyalists), and this strong structure suggests that the behavioral differences between them are genuine and statistically robust, not artifacts

2D PCA projection of the 162 customers colored by cluster assignment. PCA reduces 3 RFM dimensions to 2 for visual inspection of cluster separability.

Interpretation

Purpose

This PCA scatter plot visualizes the 162 segmented customers in 2D space, reducing the three RFM dimensions to their principal components for interpretability. It serves as a diagnostic tool to assess whether the k-means algorithm produced well-separated, meaningful clusters—a critical validation that the segmentation reflects genuine behavioral differences rather than arbitrary partitions.

Key Findings

Variance Explained (81.6%): PC1 and PC2 together capture over four-fifths of the total variance in RFM data, indicating that the 2D projection preserves most information from the original three dimensions.
Between-Cluster SS Ratio (88.2%): The vast majority of total sum-of-squares variation occurs between clusters rather than within them, confirming strong cluster separation and cohesion.
Cluster Distribution: Cluster 6 dominates with 49.4% of customers (80), while clusters 1 and 4 are small (2.5% and 1.9%), suggesting heterogeneous segment sizes reflecting natural customer behavior patterns.
Spatial Separation: Clusters occupy distinct regions in PC space, with minimal overlap, validating the k=6 choice.

Interpretation

The high between-cluster SS ratio and substantial variance retention demonstrate that the six-cluster solution successfully partitions customers into behavior

Standardized mean RFM feature values per cluster. Positive bars indicate above-average values; negative bars indicate below-average.

Interpretation

Purpose

This section reveals the behavioral DNA of each customer segment by displaying standardized RFM (Recency, Frequency, Monetary) profiles. It directly supports the segmentation objective by showing what distinguishes each cluster—enabling targeted strategies based on purchase recency, engagement frequency, and spending value. The high silhouette score (0.732) confirms these profiles represent genuinely distinct customer groups.

Key Findings

Cluster 1 (Recent Customers): Monetary value of +3.54 (z-score) with recent purchases (recency -0.09) but low frequency—high-value but infrequent buyers
Cluster 4 (Champions): Extreme monetary strength (+4.34 scaled, $2,923 raw mean) with recent activity and highest frequency—the most valuable segment
Cluster 6 (Potential Loyalists): Largest segment (49.4%) with below-average monetary (-0.22) and frequency (-0.32); its mean recency of 2 is above the overall average, so engagement is less recent than in Cluster 5. Growth opportunity
Profile Variance: Mean scaled values range from -1.09 to +4.34, indicating strong differentiation across segments
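The direction of each standardized bar can be cross-checked from the per-cluster raw means in this report's segment summary table: the customer-weighted average of the cluster means is the overall mean (up to table rounding), and a cluster sits above or below zero depending on which side of that mean it falls. The sketch below does only that sign check; the standard deviations needed to reproduce the z-scores themselves are not given in the report.

```python
# (cluster, n_customers, mean_recency, mean_frequency, mean_monetary),
# copied from the report's segment summary table.
CLUSTERS = [
    (1, 4, 1.5, 1.0, 2475.0),
    (2, 6, 2.0, 2.0, 914.8),
    (3, 8, 1.0, 2.12, 614.2),
    (4, 3, 1.0, 2.67, 2923.0),
    (5, 61, 1.0, 1.0, 344.1),
    (6, 80, 2.0, 1.0, 368.6),
]
FEATURES = {"recency": 2, "frequency": 3, "monetary": 4}  # column positions


def overall_mean(feature):
    """Customer-weighted mean of the per-cluster means for one feature."""
    col = FEATURES[feature]
    total_n = sum(row[1] for row in CLUSTERS)
    return sum(row[1] * row[col] for row in CLUSTERS) / total_n


def bar_direction(cluster_id, feature):
    """Whether a cluster's mean lies above or below the overall mean."""
    row = next(r for r in CLUSTERS if r[0] == cluster_id)
    return "above" if row[FEATURES[feature]] > overall_mean(feature) else "below"


mean_monetary = overall_mean("monetary")  # roughly 491
mean_recency = overall_mean("recency")    # roughly 1.54
```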

Interpretation

The profiles demonstrate that monetary value alone doesn't define segment quality. Cluster 1 and 4 both show high spending, but Cluster 4 combines this

Summary statistics for each of the 6 customer segments, including size, RFM means, and silhouette scores.

cluster	segment_label	n_customers	pct_customers	mean_recency	mean_frequency	mean_monetary	avg_silhouette
1	Recent Customers	4	2.5	1.5	1	2475	0.4884
2	At Risk	6	3.7	2	2	914.8	0.6824
3	Champions	8	4.9	1	2.12	614.2	0.4989
4	Champions	3	1.9	1	2.67	2923	0.1929
5	Recent Customers	61	37.7	1	1	344.1	0.7634
6	Potential Loyalists	80	49.4	2	1	368.6	0.7681
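As a consistency check on the table above, the overall silhouette width should equal the customer-weighted average of the per-cluster values, and each percentage should follow from the cluster size. The short sketch below recomputes both from the table's own numbers; only the variable names are invented here.

```python
# (cluster, segment_label, n_customers, avg_silhouette), copied from the table above.
rows = [
    (1, "Recent Customers", 4, 0.4884),
    (2, "At Risk", 6, 0.6824),
    (3, "Champions", 8, 0.4989),
    (4, "Champions", 3, 0.1929),
    (5, "Recent Customers", 61, 0.7634),
    (6, "Potential Loyalists", 80, 0.7681),
]

total_customers = sum(n for _, _, n, _ in rows)

# Customer-weighted average silhouette across clusters.
overall_silhouette = sum(n * s for _, _, n, s in rows) / total_customers

# Share of customers per cluster, rounded as in the table.
pct_by_cluster = {c: round(100 * n / total_customers, 1) for c, _, n, _ in rows}

# Share of customers per segment label (labels span more than one cluster).
pct_by_label = {}
for _, label, n, _ in rows:
    pct_by_label[label] = pct_by_label.get(label, 0) + n
pct_by_label = {k: round(100 * v / total_customers, 1) for k, v in pct_by_label.items()}
```

The weighted average reproduces the reported 0.7323, and grouping by label shows that Champions span two clusters totalling 11 customers.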

Interpretation

Purpose

This section summarizes the behavioral characteristics of each of the 6 customer segments identified by K-Means clustering. It translates RFM metrics (Recency, Frequency, Monetary) into actionable segment profiles, enabling understanding of how customers distribute across groups and how well-defined each segment is. This directly supports the segmentation objective by quantifying segment size, composition, and internal cohesion.

Key Findings

Cluster Size Imbalance: Cluster 6 dominates with 49.4% of customers (80), while Cluster 4 contains only 1.9% (3). Uneven sizes are common in RFM data and are not by themselves a sign of algorithmic failure, but a three-customer cluster should be interpreted with caution.
Silhouette Score Variation: Cluster 6 (0.77) and Cluster 5 (0.76) show strong internal cohesion, while Cluster 4 (0.19) is poorly defined, suggesting boundary ambiguity for high-value, high-frequency customers.
Monetary Spread: Cluster 4 (Champions) averages $2,923 per customer versus $344 for Cluster 5 (Recent Customers), an 8.5× difference that indicates distinct value tiers.
Recency Pattern: Most clusters cluster near recency=1 (recent), except Clusters 2 and

Top 20 customers by monetary value with cluster assignments and RFM metrics.

Customer ID	cluster	recency	frequency	monetary	segment_label
13777	4	1	2	3452	Champions
15061	4	1	3	3452	Champions
17511	1	1	1	3109	Recent Customers
12758	1	2	1	2455	Recent Customers
18102	1	2	1	2286	Recent Customers
14299	1	1	1	2052	Recent Customers
17949	4	1	3	1866	Champions
16131	5	1	1	1318	Recent Customers
17428	2	2	2	1292	At Risk
13767	6	2	1	1198	Potential Loyalists
13526	2	2	2	1182	At Risk
12931	6	2	1	1160	Potential Loyalists
14156	2	2	2	1067	At Risk
17377	3	1	2	1052	Champions
12435	5	1	1	1008	Recent Customers
13758	6	2	1	996.1	Potential Loyalists
15311	3	1	2	991	Champions
14440	5	1	1	935	Recent Customers
12533	6	2	1	929.9	Potential Loyalists
14527	5	1	1	923.8	Recent Customers

Interpretation

Purpose

This section validates the k-means segmentation by examining the top 20 highest-spending customers and their cluster assignments. It serves as a quality check to ensure that premium-value customers are appropriately placed in segments aligned with their RFM behavior, enabling targeted retention and engagement strategies for the company's most valuable accounts.

Key Findings

Monetary Range: $923.79–$3,451.88 across top spenders, with mean of $1,636.22 — indicating substantial variation in high-value customer spending levels
Segment Distribution: Recent Customers dominate (40%), followed by Champions (25%), showing that recency and frequency vary significantly even among top spenders
Recency Pattern: Mean recency of 1.45 (median=1) indicates most top spenders purchased recently, though some show recency=2, suggesting engagement decay despite high lifetime value
Frequency Concentration: Mean frequency of 1.5 reveals that high monetary value doesn't correlate strongly with repeat purchases — many top spenders are one-time or low-frequency buyers
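The summary figures above can be recomputed from the top-20 table. The sketch below uses the table's rounded monetary values, so its mean (about $1,636.24) differs in the cents from the reported $1,636.22, which was presumably computed on unrounded amounts.

```python
from collections import Counter
from statistics import mean, median

# (customer_id, cluster, recency, frequency, monetary, segment_label),
# copied from the top-20 table above.
top20 = [
    (13777, 4, 1, 2, 3452, "Champions"),
    (15061, 4, 1, 3, 3452, "Champions"),
    (17511, 1, 1, 1, 3109, "Recent Customers"),
    (12758, 1, 2, 1, 2455, "Recent Customers"),
    (18102, 1, 2, 1, 2286, "Recent Customers"),
    (14299, 1, 1, 1, 2052, "Recent Customers"),
    (17949, 4, 1, 3, 1866, "Champions"),
    (16131, 5, 1, 1, 1318, "Recent Customers"),
    (17428, 2, 2, 2, 1292, "At Risk"),
    (13767, 6, 2, 1, 1198, "Potential Loyalists"),
    (13526, 2, 2, 2, 1182, "At Risk"),
    (12931, 6, 2, 1, 1160, "Potential Loyalists"),
    (14156, 2, 2, 2, 1067, "At Risk"),
    (17377, 3, 1, 2, 1052, "Champions"),
    (12435, 5, 1, 1, 1008, "Recent Customers"),
    (13758, 6, 2, 1, 996.1, "Potential Loyalists"),
    (15311, 3, 1, 2, 991, "Champions"),
    (14440, 5, 1, 1, 935, "Recent Customers"),
    (12533, 6, 2, 1, 929.9, "Potential Loyalists"),
    (14527, 5, 1, 1, 923.8, "Recent Customers"),
]

mean_monetary = mean(row[4] for row in top20)
mean_recency = mean(row[2] for row in top20)
median_recency = median(row[2] for row in top20)
mean_frequency = mean(row[3] for row in top20)

# Percentage of the top 20 carrying each segment label.
label_counts = Counter(row[5] for row in top20)
label_pct = {label: 100 * n / len(top20) for label, n in label_counts.items()}
```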

Interpretation

The segmentation successfully differentiates high-value customers across multiple behavioral profiles. Champions (clusters 3–4) combine recent activity with higher frequency, while Recent Customers (clusters 1, 5) show high monetary value but lower engagement frequency. This heter
