Overview

K-Means Customer Segmentation

RFM-Based Clustering Analysis

Analysis overview and configuration

Configuration

Analysis Type: Kmeans
Company: Online Retail Co
Objective: Segment customers into behavioral groups using k-means clustering on RFM features
Analysis Date: 2026-03-14
Processing Id: analytics__ml__clustering__kmeans_test_20260314_214509
Total Observations: 3908

Module Parameters

Parameter | Value
k_min | 2
k_max | 8
k_clusters | (blank)
scale_features | TRUE
n_start | 25
max_iter | 300
Kmeans analysis for Online Retail Co
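
The parameters above can be read as a recipe: standardize the features, then for each candidate k from k_min to k_max run k-means from n_start random initialisations with at most max_iter iterations each, keeping the run with the lowest within-cluster sum of squares. The sketch below illustrates that recipe with a hand-written Lloyd's algorithm in NumPy on synthetic data. The report does not say which implementation it used, so the function names, the seed, and the data are all assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Z-score each column (the scale_features = TRUE step)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def _assign(X, centers):
    """Return nearest-center labels and squared distances to those centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1)

def kmeans_once(X, k, max_iter, rng):
    """One run of Lloyd's algorithm from a random initialisation."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels, _ = _assign(X, centers)
        # Recompute centers; keep the old center if a cluster empties.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels, d2 = _assign(X, centers)
    return labels, centers, float(d2.sum())

def kmeans(X, k, n_start=25, max_iter=300, seed=0):
    """Best of n_start runs, judged by total within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, max_iter, rng) for _ in range(n_start)]
    return min(runs, key=lambda run: run[2])

# Synthetic three-feature data with three obvious groups (not the report's data).
rng = np.random.default_rng(42)
raw = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 3)) for c in (0.0, 5.0, 10.0)])
X = standardize(raw)

# Candidate range from the table: k_min = 2 through k_max = 8.
wss_by_k = {k: kmeans(X, k)[2] for k in range(2, 9)}
labels, centers, wss = kmeans(X, k=3)
```

The dictionary wss_by_k is the raw material for an elbow curve, and labels holds one cluster index per row of X.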

Interpretation

Purpose

This analysis applies K-Means clustering to segment 162 customers from Online Retail Co based on RFM (Recency, Frequency, Monetary) behavioral metrics. The objective is to identify distinct customer groups for targeted engagement strategies, with data quality validated through a 78.2% retention rate after removing invalid transactions and missing identifiers.

Key Findings

  • Optimal K Value: 6 clusters identified through silhouette analysis (avg_silhouette = 0.732), indicating well-separated, cohesive groups
  • Between-Cluster Separation: 88.2% ratio demonstrates strong cluster distinctiveness relative to within-cluster variance
  • Cluster Distribution: Highly imbalanced—Cluster 6 (Potential Loyalists) dominates at 49.4% of customers, while Cluster 4 (Champions) contains only 1.9%
  • Variance Explained: PC1 and PC2 together capture 81.6% of variance, enabling effective 2D visualization of cluster separation

Interpretation

The model partitions customers into six behaviorally distinct segments with strong statistical validity. The silhouette coefficient of 0.732 (well above the 0.5 threshold) confirms that customers are, on average, appropriately assigned to their clusters. The dominance of Cluster 6 suggests most customers

Data Preparation

Data Preprocessing

Transaction Filtering & RFM Aggregation

Data preprocessing and column mapping

Data Quality

Initial Rows: 5,000
Final Rows: 3,908
Rows Removed: 1,092
Retention Rate: 78.2%

Processed 5,000 observations, retained 3,908 (78.2%) after cleaning

Interpretation

Purpose

This section documents the data cleaning process that prepared raw transaction data for customer segmentation. Understanding data retention and removal reasons is critical because it directly affects the representativeness of the 162 customers ultimately segmented and the reliability of behavioral insights derived from RFM clustering.

Key Findings

  • Retention Rate: 78.2% (3,908 of 5,000 rows retained) - A substantial loss indicating significant data quality issues in the raw dataset
  • Rows Removed: 1,092 observations eliminated due to cancelled transactions, missing customer IDs, and invalid quantities/prices
  • Final Customer Base: 162 unique customers analyzed, derived from cleaned transaction data
  • No Train/Test Split: Unsupervised clustering applied to entire cleaned dataset without holdout validation

Interpretation

The 21.8% removal rate suggests the raw transaction data contained considerable noise, primarily cancelled orders and data entry errors. This aggressive filtering ensures the clustering model operates on valid, completed transactions only, which is appropriate for RFM analysis. However, removing more than 1,000 transaction rows means the 162 customers in the final dataset may not fully represent the original customer population, which could bias insights toward customers with cleaner transaction histories.
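
The cleaning and aggregation described here can be sketched in a few lines. This is a minimal illustration, not the report's pipeline: the field names (invoice, customer_id, quantity, unit_price, date), the rule that a cancelled invoice starts with "C", and recency measured in days before a reference date are all assumptions, since the report does not state its column mapping or its recency unit.

```python
from collections import defaultdict
from datetime import date

def clean(rows):
    """Drop cancelled rows, rows without a customer ID, and non-positive quantities or prices."""
    kept = []
    for r in rows:
        if r["invoice"].startswith("C"):      # assumed cancellation marker
            continue
        if not r["customer_id"]:              # missing identifier
            continue
        if r["quantity"] <= 0 or r["unit_price"] <= 0:
            continue
        kept.append(r)
    return kept

def rfm(rows, as_of):
    """Aggregate cleaned rows to one (recency, frequency, monetary) tuple per customer."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for r in rows:
        cid = r["customer_id"]
        last_purchase[cid] = max(last_purchase.get(cid, r["date"]), r["date"])
        invoices[cid].add(r["invoice"])
        spend[cid] += r["quantity"] * r["unit_price"]
    return {
        cid: ((as_of - last_purchase[cid]).days, len(invoices[cid]), round(spend[cid], 2))
        for cid in last_purchase
    }

# Six made-up transaction rows; three survive cleaning.
rows = [
    {"invoice": "1001", "customer_id": "A", "quantity": 2, "unit_price": 5.0, "date": date(2026, 1, 10)},
    {"invoice": "1002", "customer_id": "A", "quantity": 1, "unit_price": 20.0, "date": date(2026, 2, 1)},
    {"invoice": "C1003", "customer_id": "A", "quantity": -1, "unit_price": 20.0, "date": date(2026, 2, 2)},
    {"invoice": "1004", "customer_id": "", "quantity": 3, "unit_price": 4.0, "date": date(2026, 2, 3)},
    {"invoice": "1005", "customer_id": "B", "quantity": 4, "unit_price": 0.0, "date": date(2026, 2, 4)},
    {"invoice": "1006", "customer_id": "B", "quantity": 4, "unit_price": 2.5, "date": date(2026, 2, 5)},
]
cleaned = clean(rows)
retention_pct = round(100 * len(cleaned) / len(rows), 1)
table = rfm(cleaned, as_of=date(2026, 3, 1))
```

Note that the retention rate is a share of transaction rows, while the RFM table has one entry per customer, which is why 3,908 retained rows can correspond to 162 customers.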

Context

Since K-Means is unsupervised, no train/test split was necessary. The high removal rate reflects typical e

Executive Summary

Customer Segmentation Results

Key Metrics

Total_Customers: 162
Clusters_Identified: 6
Avg_Silhouette_Width: 0.7323
Between_SS_Ratio_pct: 88.2
Variance_Explained_pct: 81.6
Highest_Value_Segment: Champions

Key Findings

Metric | Value
Customers Analyzed | 162
Optimal Clusters (k) | 6
Cluster Separation | 88.2% between-cluster SS
Silhouette Quality | 0.7323 (strong)
Highest Value Segment | Champions
Largest Segment | Potential Loyalists

Summary

Bottom Line: K-means clustering identified 6 distinct customer segments among 162 customers, with 88.2% between-cluster variance (strong cluster structure).

Key Findings:
• The 'Champions' segment represents your most valuable customers — prioritize retention and upsell
• Cluster separation is 88.2% (between-cluster SS / total SS) — good separation
• Silhouette score 0.7323 indicates strong cluster structure
• PCA explains 81.6% of variance in 2D visualization

Recommended Actions:
• Champions/Loyal: Reward with loyalty programs and early access
• At Risk: Deploy win-back campaigns with targeted discounts
• Lost: Assess ROI of reactivation vs. new customer acquisition
• Re-run segmentation quarterly to track customer migration between segments

Interpretation

EXECUTIVE SUMMARY: K-MEANS CUSTOMER SEGMENTATION

Purpose

This section synthesizes the clustering model's performance and business relevance. It answers whether the segmentation successfully partitioned the customer base into actionable, well-separated groups that can drive targeted business strategies.

Key Findings

  • Silhouette Score (0.732): Strong cluster cohesion and separation; customers are reliably assigned to distinct behavioral groups with minimal overlap
  • Between-Cluster Variance (88.2%): Excellent cluster structure; 88.2% of total variance is explained by differences between clusters rather than within them
  • Variance Explained (81.6%): Two principal components capture most RFM variation, enabling effective 2D visualization and interpretation
  • Segment Distribution: Potential Loyalists (49.4%, a large opportunity segment) and Recent Customers (37.7% in Cluster 5, 40.1% including Cluster 1) dominate the base; Champions are a small group (4.9% in Cluster 3, 6.8% including Cluster 4)
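
For reference, the between-cluster ratio quoted above follows from the standard sum-of-squares decomposition. The notation below (x_i for a customer's scaled RFM vector, \bar{x} for the overall mean, C_k, n_k and \mu_k for the members, size and centroid of cluster k) is introduced here for illustration and does not appear in the report.

```latex
\begin{aligned}
\mathrm{TSS} &= \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2, \\
\mathrm{WSS} &= \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2, \\
\mathrm{BSS} &= \sum_{k=1}^{K} n_k \lVert \mu_k - \bar{x} \rVert^2 = \mathrm{TSS} - \mathrm{WSS}, \\
\frac{\mathrm{BSS}}{\mathrm{TSS}} &= 1 - \frac{\mathrm{WSS}}{\mathrm{TSS}} = 0.882 \quad (n = 162,\ K = 6).
\end{aligned}
```

Because WSS can only fall as K grows, this ratio rises with K by construction, so it is most informative when compared across solutions with the same number of clusters.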

Interpretation

The model successfully achieved its objective: 162 customers are segmented into 6 behaviorally distinct groups with high statistical confidence. The 88.2% between-cluster ratio and 0.732 silhouette score indicate genuine customer behavior patterns, not artificial divisions. RFM features (recency, frequency, monetary value) effectively differentiate customer tiers, with Champions showing 2

Figure 4

Elbow Curve

Within-Cluster Sum of Squares vs. Number of Clusters

Elbow curve showing within-cluster sum of squares (WSS) vs. number of clusters. The optimal k is identified at the 'elbow' — the point of maximum curvature.

Interpretation

Purpose

The elbow method identifies the optimal number of clusters by analyzing how within-cluster sum of squares (WSS) decreases as k increases. This section determines the "sweet spot" where adding more clusters yields diminishing returns—balancing cluster compactness with interpretability for customer segmentation.

Key Findings

  • Optimal k (Elbow Method): 4 clusters - WSS reduction rate drops from 39.2% (k=3 to k=4) to 29% (k=4 to k=5), signaling the inflection point
  • Total WSS at k=4: 101.26 - Represents the aggregate within-cluster variance at the recommended cluster count
  • Diminishing Returns Pattern: Beyond k=4, percentage improvements flatten (21% at k=6, 18.3% at k=7), indicating marginal gains from additional clusters

Interpretation

The elbow curve suggests k=4 as the most parsimonious partition for RFM-based customer segmentation. However, the final model uses k=6, selected by silhouette analysis (avg_silhouette=0.73), which scores cluster cohesion and separation directly rather than the rate of WSS reduction. Because WSS always falls as k grows, the trade-off is between the simpler four-cluster solution and the better-separated six-cluster one, and the report favors separation quality.
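
The elbow figures can be cross-checked against the 88.2% between-cluster ratio reported elsewhere. The sketch below assumes that the quoted reductions are successive (29% from k=4 to k=5, 21% from k=5 to k=6) and that features were z-scored with the sample standard deviation, in which case the total sum of squares equals (n - 1) times the number of features. Both are assumptions; the report states neither explicitly.

```python
def pct_reductions(wss_by_k):
    """Percentage drop in WSS when moving from each k to the next larger k."""
    ks = sorted(wss_by_k)
    return {k: 100 * (wss_by_k[p] - wss_by_k[k]) / wss_by_k[p] for p, k in zip(ks, ks[1:])}

wss4 = 101.26                      # reported total WSS at k=4
wss5 = wss4 * (1 - 0.29)           # assumed: 29% reduction from k=4 to k=5
wss6 = wss5 * (1 - 0.21)           # assumed: 21% reduction from k=5 to k=6

# Assumed: 162 customers, 3 z-scored features, so TSS = (162 - 1) * 3.
total_ss = (162 - 1) * 3
between_ratio_pct = 100 * (1 - wss6 / total_ss)

reductions = pct_reductions({4: wss4, 5: wss5, 6: wss6})
```

Under these assumptions between_ratio_pct rounds to 88.2, which matches the reported ratio and suggests the elbow and separation figures describe the same fitted models.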

Context

The el

Figure 5

Silhouette Analysis

Average Silhouette Width by Number of Clusters

Average silhouette width across cluster counts k = 2 to 8. Higher silhouette = better-defined clusters.

Interpretation

Purpose

Silhouette analysis validates cluster quality by measuring how well each customer fits within their assigned segment versus neighboring clusters. This section confirms that k = 6 produces the strongest cluster separation, ensuring the segmentation is statistically sound and meaningful for customer behavioral grouping.

Key Findings

  • Optimal k Value: 6 clusters selected based on peak silhouette score of 0.732, indicating well-separated, internally cohesive segments
  • Silhouette Quality Rating: Strong (>0.7 threshold), meaning customers are reliably assigned to their true behavioral groups with minimal overlap
  • Score Range Across Models: Silhouette widths ranged from 0.60 (k=2) to 0.73 (k=6), with consistent decline after k=6, confirming k=6 as the peak
  • Cluster Stability: The 0.05 standard deviation across k values shows that silhouette width changes gradually with k, so k=6 is the best of the candidates tested rather than an overwhelming winner
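
As a reminder of what the silhouette width measures, the sketch below computes it from first principles: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the width is (b - a) / max(a, b). The four toy points and the function name are made up for illustration; this is not the report's code, and it assumes at least two clusters.

```python
from math import dist
from statistics import mean

def silhouette_widths(points, labels):
    """Per-point silhouette widths using Euclidean distance (needs >= 2 clusters)."""
    widths = []
    for i, (p, li) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, lj) in enumerate(zip(points, labels)) if lj == li and j != i]
        if not own:                      # singleton cluster: width is defined as 0
            widths.append(0.0)
            continue
        a = mean(own)
        b = min(
            mean([dist(p, q) for q, lj in zip(points, labels) if lj == other])
            for other in set(labels) if other != li
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom else 0.0)
    return widths

points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Sensible grouping: left pair versus right pair.
good_avg = mean(silhouette_widths(points, [0, 0, 1, 1]))

# Poor grouping: each cluster straddles both sides.
bad_avg = mean(silhouette_widths(points, [0, 1, 0, 1]))
```

Repeating this for each candidate k and taking the k with the highest average is the selection rule the report describes.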

Interpretation

The 0.732 silhouette score demonstrates that the six customer segments are distinctly separated in RFM space, with minimal misclassification risk. This strong structure validates that behavioral differences between clusters (Recent Customers, Champions, At Risk, Potential Loyalists) are genuine and statistically robust, not artifacts

Figure 6

Customer Clusters (PCA)

2D PCA Projection of Customer Segments

2D PCA projection of the 162 customers colored by cluster assignment. PCA reduces 3 RFM dimensions to 2 for visual inspection of cluster separability.

Interpretation

Purpose

This PCA scatter plot visualizes the 162 segmented customers in 2D space, reducing the three RFM dimensions to their principal components for interpretability. It serves as a diagnostic tool to assess whether the k-means algorithm produced well-separated, meaningful clusters—a critical validation that the segmentation reflects genuine behavioral differences rather than arbitrary partitions.

Key Findings

  • Variance Explained (81.6%): PC1 and PC2 together capture over four-fifths of the total variance in RFM data, indicating that the 2D projection preserves most information from the original three dimensions.
  • Between-Cluster SS Ratio (88.2%): The vast majority of total sum-of-squares variation occurs between clusters rather than within them, confirming strong cluster separation and cohesion.
  • Cluster Distribution: Cluster 6 dominates with 49.4% of customers (80), while clusters 1 and 4 are small (2.5% and 1.9%), suggesting heterogeneous segment sizes reflecting natural customer behavior patterns.
  • Spatial Separation: Clusters occupy distinct regions in PC space, with minimal overlap, validating the k=6 choice.
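
The variance-explained figure comes from the eigenvalues of the standardized feature matrix. The sketch below shows one way to obtain it with a singular value decomposition in NumPy, using synthetic three-column data in place of the RFM features. The function name and the data are assumptions for illustration.

```python
import numpy as np

def pca_two_components(X):
    """Return explained-variance ratios and the first two component scores."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values of the standardized matrix give the component variances.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    variances = s ** 2 / (len(Z) - 1)
    scores = Z @ vt.T
    return variances / variances.sum(), scores[:, :2]

# Synthetic data: two correlated columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.3 * rng.normal(size=200),
    base + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])
ratios, scores = pca_two_components(X)
two_component_share = float(ratios[:2].sum())
```

With only three standardized features, the first two components always retain at least two thirds of the variance, because the smallest of three eigenvalues cannot exceed their average. The reported 81.6% is best read against that floor.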

Interpretation

The high between-cluster SS ratio and substantial variance retention demonstrate that the six-cluster solution successfully partitions customers into behavior

Figure 7

Cluster Profiles

Standardized RFM Feature Means by Cluster

Standardized mean RFM feature values per cluster. Positive bars indicate above-average values; negative bars indicate below-average.

Interpretation

Purpose

This section reveals the behavioral DNA of each customer segment by displaying standardized RFM (Recency, Frequency, Monetary) profiles. It directly supports the segmentation objective by showing what distinguishes each cluster—enabling targeted strategies based on purchase recency, engagement frequency, and spending value. The high silhouette score (0.732) confirms these profiles represent genuinely distinct customer groups.

Key Findings

  • Cluster 1 (Recent Customers): Monetary value of +3.54 (z-score) with recent purchases (recency -0.09) but low frequency—high-value but infrequent buyers
  • Cluster 4 (Champions): Extreme monetary strength (+4.34 scaled, $2,923 raw mean) with recent activity and highest frequency—the most valuable segment
  • Cluster 6 (Potential Loyalists): Largest segment (49.4%) with below-average monetary (-0.22) and frequency (-0.32) and a mean recency of 2, less recent than the recency of 1 seen in most clusters; a growth opportunity
  • Profile Variance: Mean scaled values range from -1.09 to +4.34, indicating strong differentiation across segments
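
The scaled monetary values quoted above can be checked for internal consistency against the cluster sizes and raw monetary means in the Cluster Summary table (Table 8). The sketch below rebuilds the overall mean from those values, backs out the standard deviation implied by Cluster 4's z-score, and then predicts the z-scores for Clusters 1 and 6. It assumes the z-scores were computed as (cluster mean - overall mean) / overall standard deviation, which the report implies but does not state.

```python
# (n_customers, mean_monetary) per cluster, from the report's cluster summary.
clusters = {
    1: (4, 2475.0),
    2: (6, 914.8),
    3: (8, 614.2),
    4: (3, 2923.0),
    5: (61, 344.1),
    6: (80, 368.6),
}
# Scaled monetary values quoted in the key findings.
reported_z = {1: 3.54, 4: 4.34, 6: -0.22}

n_total = sum(n for n, _ in clusters.values())
grand_mean = sum(n * m for n, m in clusters.values()) / n_total

# Back out the standard deviation implied by Cluster 4's reported z-score.
implied_sd = (clusters[4][1] - grand_mean) / reported_z[4]

# Predict the reported z-scores from that mean and standard deviation.
predicted_z = {c: round((clusters[c][1] - grand_mean) / implied_sd, 2) for c in reported_z}
```

The predictions agree with the reported values to two decimals, with an overall mean of about $491 and an implied standard deviation of about $560.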

Interpretation

The profiles demonstrate that monetary value alone doesn't define segment quality. Cluster 1 and 4 both show high spending, but Cluster 4 combines this

Table 8

Cluster Summary

Size and RFM Statistics per Segment

Summary statistics for each of the 6 customer segments, including size, RFM means, and silhouette scores.

cluster | segment_label | n_customers | pct_customers | mean_recency | mean_frequency | mean_monetary | avg_silhouette
1 | Recent Customers | 4 | 2.5 | 1.5 | 1 | 2475 | 0.4884
2 | At Risk | 6 | 3.7 | 2 | 2 | 914.8 | 0.6824
3 | Champions | 8 | 4.9 | 1 | 2.12 | 614.2 | 0.4989
4 | Champions | 3 | 1.9 | 1 | 2.67 | 2923 | 0.1929
5 | Recent Customers | 61 | 37.7 | 1 | 1 | 344.1 | 0.7634
6 | Potential Loyalists | 80 | 49.4 | 2 | 1 | 368.6 | 0.7681
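
The overall silhouette width is the size-weighted average of the per-cluster values, so the table can be checked against the headline 0.7323. The short sketch below does that arithmetic and also totals customers by segment label, since two labels (Champions and Recent Customers) each cover two clusters. The variable names are illustrative only.

```python
# (segment_label, n_customers, avg_silhouette) per cluster, copied from the table above.
clusters = {
    1: ("Recent Customers", 4, 0.4884),
    2: ("At Risk", 6, 0.6824),
    3: ("Champions", 8, 0.4989),
    4: ("Champions", 3, 0.1929),
    5: ("Recent Customers", 61, 0.7634),
    6: ("Potential Loyalists", 80, 0.7681),
}

total = sum(n for _, n, _ in clusters.values())

# Size-weighted mean of the per-cluster silhouette widths.
overall_silhouette = sum(n * s for _, n, s in clusters.values()) / total

# Customers per segment label, as a percentage of all customers.
label_counts = {}
for label, n, _ in clusters.values():
    label_counts[label] = label_counts.get(label, 0) + n
label_pct = {label: round(100 * n / total, 1) for label, n in label_counts.items()}
```

The weighted average reproduces 0.7323, which also shows that the headline figure is driven mostly by Clusters 5 and 6; the four small clusters, including Cluster 4 at 0.19, carry little weight in it.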

Interpretation

Purpose

This section summarizes the behavioral characteristics of each of the 6 customer segments identified by K-Means clustering. It translates RFM metrics (Recency, Frequency, Monetary) into actionable segment profiles, enabling understanding of how customers distribute across groups and how well-defined each segment is. This directly supports the segmentation objective by quantifying segment size, composition, and internal cohesion.

Key Findings

  • Cluster Size Imbalance: Cluster 6 dominates with 49.4% of customers (80), while Cluster 4 contains only 1.9% (3). This reflects natural customer distribution rather than algorithmic failure.
  • Silhouette Score Variation: Cluster 6 (0.77) and Cluster 5 (0.76) show strong internal cohesion, while Cluster 4 (0.19) is poorly defined, suggesting boundary ambiguity for high-value, high-frequency customers.
  • Monetary Spread: Cluster 4 (Champions) averages $2,923 per customer versus $344 for Cluster 5 (Recent Customers), an 8.5× difference, indicating distinct value tiers.
  • Recency Pattern: Most clusters sit near recency=1 (recent), except Clusters 2 and

Table 9

Top Customers by Segment

Highest-Value Customers with Cluster Assignments

Top 20 customers by monetary value with cluster assignments and RFM metrics.

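Customer ID | cluster | recency | frequency | monetary | segment_label
13777 | 4 | 1 | 2 | 3452 | Champions
15061 | 4 | 1 | 3 | 3452 | Champions
17511 | 1 | 1 | 1 | 3109 | Recent Customers
12758 | 1 | 2 | 1 | 2455 | Recent Customers
18102 | 1 | 2 | 1 | 2286 | Recent Customers
14299 | 1 | 1 | 1 | 2052 | Recent Customers
17949 | 4 | 1 | 3 | 1866 | Champions
16131 | 5 | 1 | 1 | 1318 | Recent Customers
17428 | 2 | 2 | 2 | 1292 | At Risk
13767 | 6 | 2 | 1 | 1198 | Potential Loyalists
13526 | 2 | 2 | 2 | 1182 | At Risk
12931 | 6 | 2 | 1 | 1160 | Potential Loyalists
14156 | 2 | 2 | 2 | 1067 | At Risk
17377 | 3 | 1 | 2 | 1052 | Champions
12435 | 5 | 1 | 1 | 1008 | Recent Customers
13758 | 6 | 2 | 1 | 996.1 | Potential Loyalists
15311 | 3 | 1 | 2 | 991 | Champions
14440 | 5 | 1 | 1 | 935 | Recent Customers
12533 | 6 | 2 | 1 | 929.9 | Potential Loyalists
14527 | 5 | 1 | 1 | 923.8 | Recent Customers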

Interpretation

Purpose

This section validates the k-means segmentation by examining the top 20 highest-spending customers and their cluster assignments. It serves as a quality check to ensure that premium-value customers are appropriately placed in segments aligned with their RFM behavior, enabling targeted retention and engagement strategies for the company's most valuable accounts.

Key Findings

  • Monetary Range: $923.79–$3,451.88 across top spenders, with mean of $1,636.22 — indicating substantial variation in high-value customer spending levels
  • Segment Distribution: Recent Customers dominate (40%), followed by Champions (25%), showing that recency and frequency vary significantly even among top spenders
  • Recency Pattern: Mean recency of 1.45 (median=1) indicates most top spenders purchased recently, though some show recency=2, suggesting engagement decay despite high lifetime value
  • Frequency Concentration: Mean frequency of 1.5 reveals that high monetary value doesn't correlate strongly with repeat purchases — many top spenders are one-time or low-frequency buyers
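
The summary statistics above can be recomputed from the Top 20 table. The sketch below does so with the table's displayed values, which are rounded, so the mean monetary value differs from the reported $1,636.22 by a couple of cents. Only the variable names are introduced here.

```python
from collections import Counter

# (customer_id, cluster, recency, frequency, monetary, segment_label) from the table above.
top20 = [
    (13777, 4, 1, 2, 3452, "Champions"), (15061, 4, 1, 3, 3452, "Champions"),
    (17511, 1, 1, 1, 3109, "Recent Customers"), (12758, 1, 2, 1, 2455, "Recent Customers"),
    (18102, 1, 2, 1, 2286, "Recent Customers"), (14299, 1, 1, 1, 2052, "Recent Customers"),
    (17949, 4, 1, 3, 1866, "Champions"), (16131, 5, 1, 1, 1318, "Recent Customers"),
    (17428, 2, 2, 2, 1292, "At Risk"), (13767, 6, 2, 1, 1198, "Potential Loyalists"),
    (13526, 2, 2, 2, 1182, "At Risk"), (12931, 6, 2, 1, 1160, "Potential Loyalists"),
    (14156, 2, 2, 2, 1067, "At Risk"), (17377, 3, 1, 2, 1052, "Champions"),
    (12435, 5, 1, 1, 1008, "Recent Customers"), (13758, 6, 2, 1, 996.1, "Potential Loyalists"),
    (15311, 3, 1, 2, 991, "Champions"), (14440, 5, 1, 1, 935, "Recent Customers"),
    (12533, 6, 2, 1, 929.9, "Potential Loyalists"), (14527, 5, 1, 1, 923.8, "Recent Customers"),
]

n = len(top20)
mean_recency = sum(row[2] for row in top20) / n
mean_frequency = sum(row[3] for row in top20) / n
mean_monetary = sum(row[4] for row in top20) / n

# Share of the top 20 held by each segment label, in percent.
segment_pct = {label: 100 * count / n for label, count in Counter(row[5] for row in top20).items()}
```

The recomputed shares are 40% Recent Customers, 25% Champions, 20% Potential Loyalists and 15% At Risk, with mean recency 1.45 and mean frequency 1.5, in line with the findings listed above.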

Interpretation

The segmentation successfully differentiates high-value customers across multiple behavioral profiles. Champions (clusters 3–4) combine recent activity with higher frequency, while Recent Customers (clusters 1, 5) show high monetary value but lower engagement frequency. This heter
