Analysis overview and configuration
| Parameter | Value |
|---|---|
| n_clusters | 4 |
| scale_data | TRUE |
| max_k | 6 |
The analysis identified 4 employee clusters, but silhouette score of 0.29 indicates weak cluster separation—consider using 3 clusters instead for clearer, more actionable segments.
This workforce segmentation analysis applied K-means clustering to 1,470 employees across 5 features (age, income, job satisfaction, tenure, performance rating) to identify natural employee groups for targeted HR strategies. The analysis evaluated cluster quality and generated employee-to-cluster assignments with departmental breakdowns.
The weak silhouette score (0.29) signals that clusters overlap substantially and lack clear boundaries. Employees within clusters are not distinctly similar to each other relative to other clusters. The algorithm's recommendation of K=3 (silhouette 0.36) suggests the data contains three natural groupings, not four. Cluster 4's low satisfaction and performance profile is actionable but represents only 27% of the workforce, limiting its strategic impact relative to the dominant Cluster 2.
K-means assumes spherical clusters of similar size; the 3:1 size ratio between Cluster 2 and Cluster 1 violates this assumption. The weak silhouette score may reflect genuine workforce heterogeneity or indicate that the five features do not cleanly separate employees into distinct personas. Departmental distribution (R&D dominates all clusters at 62–71%) suggests department is not a primary differentiator.
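The configuration above (n_clusters = 4, scale_data = TRUE) corresponds to standardizing the features and then running K-means. The report does not say which software produced the results, so the following is only a minimal pure-Python sketch of Lloyd's algorithm. The function names, the toy (age, income) rows, and the first-k-points initialization are assumptions made for illustration.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Convert each column to z-scores (population standard deviation)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(r, mus, sds)) for r in rows]


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a naive start: the first k points are the centroids.

    Assumption: real tools usually use random restarts or k-means++ instead.
    """
    centroids = list(points[:k])
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


# Made-up (age, monthly income) rows, not taken from the report's dataset.
toy_rows = [(25, 3000), (50, 15000), (27, 3200), (52, 14800)]
toy_labels, toy_centroids = kmeans(standardize(toy_rows), k=2)
print(toy_labels)
```

Without the standardization step, income (in the thousands) would dominate the distance calculation and age would have almost no influence on the assignments.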
Data preprocessing and column mapping
| Metric | Value |
|---|---|
| Initial Rows | 1,470 |
| Final Rows | 1,470 |
| Rows Removed | 0 |
| Retention Rate | 100% |
All 1,470 employee records passed quality checks with zero rows removed, ensuring a complete dataset for clustering analysis.
Data preprocessing is the foundation of any statistical analysis. This section documents how the raw dataset was cleaned, validated, and prepared for the clustering model. A 100% retention rate indicates no missing values, duplicates, or outliers were flagged for removal—a clean starting point that strengthens confidence in downstream results.
The dataset entered the analysis in excellent condition. No rows were dropped for missing data, outliers, or data quality issues, meaning the full workforce population is represented in the four clusters. This complete retention is ideal for workforce segmentation—every employee has been assigned to a cluster, avoiding selection bias that could skew HR insights. The absence of preprocessing exclusions also means the cluster profiles reflect the true composition of the organization without artificial filtering.
While 100% retention is positive, note that the data quality report does not detail missing value patterns, outlier detection methods, or standardization/scaling procedures applied to the five clustering features. The clustering analysis itself (silhouette score of 0.293) suggests weak cluster separation, which may reflect genuine workforce diversity rather than data quality issues.
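The retention figures in the table follow from simple row counts. The sketch below shows one way such a check could work, assuming rows are dropped for missing values or exact duplication. The report does not document which checks were actually run, so `clean` and the toy rows are hypothetical.

```python
def clean(rows):
    """Drop rows with missing values and exact duplicates (assumed rules)."""
    seen = set()
    kept = []
    for row in rows:
        if any(value is None for value in row):
            continue  # missing value
        if row in seen:
            continue  # exact duplicate
        seen.add(row)
        kept.append(row)
    return kept


def retention_rate(initial_rows, final_rows):
    """Share of rows that survive cleaning, in percent."""
    return 100 * final_rows / initial_rows


# Toy rows (age, income, satisfaction); illustrative only.
toy_rows = [(41, 5993, 4), (49, 5130, 2), (49, 5130, 2), (37, None, 3)]
toy_kept = clean(toy_rows)
print(retention_rate(len(toy_rows), len(toy_kept)))  # 50.0

# The report's own counts: 1,470 rows in, 1,470 rows out.
print(retention_rate(1470, 1470))  # 100.0
```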
| Finding | Value |
|---|---|
| Workforce Segments Discovered | 4 distinct segments |
| Total Employees Analyzed | 1,470 |
| Cluster Quality (Silhouette) | 0.2927 (weak) |
| Potential Flight-Risk Segment | Cluster 4 |
| Largest Segment Share | 43.5% of workforce |
| Smallest Segment Share | 14.1% of workforce |
| PCA Variance Explained (2D) | 57.8% |
Four workforce segments were identified, but weak cluster quality (silhouette = 0.29) limits confidence. Cluster 4 emerges as a potential flight-risk group of 393 employees (26.7% of workforce) with very low job satisfaction (mean 1.48 on the 1–4 scale) and an average performance rating of 3.0.
This analysis segments your 1,470-person workforce into four natural groups using five employee attributes (age, income, satisfaction, tenure, performance). The goal is to identify which employee populations require targeted HR interventions—particularly those at risk of departure. The weak silhouette score signals that clusters overlap considerably, meaning segment boundaries are fuzzy rather than crisp.
The analysis identified four workforce archetypes, but the weak silhouette score means the clusters are not tightly separated: many employees sit near cluster boundaries. Cluster 4 is your actionable priority. Its members hold a performance rating of 3.0, earn near the bottom of the income range (mean 4,880), and report the lowest job satisfaction (1.48). This combination is a classic churn signature. Cluster 3 represents your institutional knowledge base (15+ years tenure, highest pay), while Cluster 2 is your largest group and growth pipeline (lowest cost, highest satisfaction at 3.51). Cluster 1 holds your highest-rated performers (rating 4) at mid-range pay.
The analysis used K=4 clusters, though the silhouette analysis recommended K=3 (silhouette=0.36). The choice of K=4 provides more granular segmentation but at the cost of weaker statistical separation. PCA explains 57.8% of variance in two dimensions, sufficient for visualization but indicating that employee profiles are genuinely multidimensional. No data quality issues were detected (zero rows removed).
---
Confidence: Moderate (65%)
Deploy this segmentation immediately for Cluster 4 retention focus—the flight-risk signal is clear and actionable regardless of silhouette weakness. Use clusters 1–3 as exploratory guidance for HR program design, but validate segment membership through qualitative interviews before making major policy decisions. The weak silhouette score means you should not use automated cluster assignment for individual employee decisions without human review.
Distribution of employees across discovered workforce segments
Cluster 2 dominates your workforce at 43.5% (640 employees), while three smaller segments range from 14.1% to 26.7%, indicating one core group with three distinct outlier populations.
This section reveals how your 1,470-person workforce naturally segments into four distinct groups based on employee characteristics. Understanding segment sizes tells you whether you have a homogeneous workforce or multiple distinct populations requiring different management strategies. The unequal distribution suggests one large "typical" employee profile with three smaller, potentially higher-risk or higher-value groups.
The 3:1 ratio between largest and smallest clusters reveals a workforce with one dominant profile and three smaller populations. This imbalance is typical in employee segmentation—most staff cluster around average characteristics while outliers (high performers, flight risks, or specialized roles) form smaller groups. The weak silhouette score (0.293, below the 0.5 threshold for reasonable separation) suggests these clusters overlap considerably; employees near cluster boundaries share traits with multiple groups.
The silhouette score indicates the 4-cluster solution provides weak but usable segmentation. The analysis flagged k=3 as optimal, but k=4 was selected—likely to isolate a specific high-value or high-risk group. Verify whether the smaller clusters represent actionable populations (e.g., flight risks, top talent) before investing in segment-specific interventions.
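The segment shares and the 3:1 size ratio quoted in this section can be reproduced from the cluster sizes reported elsewhere in this document. A short check, using only those four counts:

```python
# Cluster sizes as reported in the cluster profile table.
cluster_sizes = {
    "Cluster 1": 208,
    "Cluster 2": 640,
    "Cluster 3": 229,
    "Cluster 4": 393,
}

total = sum(cluster_sizes.values())
shares = {name: round(100 * n / total, 1) for name, n in cluster_sizes.items()}
size_ratio = max(cluster_sizes.values()) / min(cluster_sizes.values())

print(total)                 # 1470
print(shares)                # 14.1, 43.5, 15.6, 26.7
print(round(size_ratio, 2))  # 3.08
```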
Radar-style heatmap comparing standardized feature centroids across all clusters
Cluster 4 exhibits the classic flight-risk profile: lowest job satisfaction (−1.13 std) paired with below-average performance (−0.43 std), signaling disengaged mid-career employees at immediate retention risk.
This heatmap reveals how the four employee segments differ across five key dimensions—Age, Monthly Income, Job Satisfaction, Years at Company, and Performance Rating. By comparing standardized feature values (z-scores), we identify which clusters are above or below average on each dimension, enabling targeted retention and engagement strategies. The analysis specifically flags Cluster 4 as a flight-risk segment based on the combination of low satisfaction and weak performance indicators.
Cluster 4 represents 393 employees (26.7% of workforce) who may be caught in a disengagement spiral: low satisfaction may be depressing performance, which in turn may suppress advancement and income growth. The clustering itself cannot establish that causal direction. Unlike Cluster 1 (underpaid high performers), Cluster 4 lacks the performance lever for compensation negotiation. Cluster 3 shows that tenure and income rise together, suggesting career progression works, but Cluster 4's below-average tenure (−0.26 std) and low satisfaction suggest its members may not stay long enough to reach that level.
The standardized scale allows direct comparison across features with different units (age in years, income in dollars, satisfaction on a 1–4 scale). Cluster 1's high performance despite low income and Cluster 4's low satisfaction despite average tenure both warrant immediate investigation into compensation equity and role fit.
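To make the z-score scale concrete, the workforce-wide mean can be recovered as the size-weighted average of the cluster means. The sketch below does this for job satisfaction, using the sizes and means from the cluster profile table. The report does not state the standard deviation, so `implied_sd` is a back-calculation from the reported −1.13 std for Cluster 4, not a figure from the analysis.

```python
# (n_employees, mean job satisfaction) per cluster, from the profile table.
clusters = {
    "Cluster 1": (208, 2.74),
    "Cluster 2": (640, 3.51),
    "Cluster 3": (229, 2.69),
    "Cluster 4": (393, 1.48),
}


def z_score(value, mean, sd):
    """Standardized distance of a value from the overall mean."""
    return (value - mean) / sd


n_total = sum(n for n, _ in clusters.values())
overall_mean = sum(n * m for n, m in clusters.values()) / n_total

# Solve z = (x - mean) / sd for sd, using the reported z of -1.13.
# Assumption: this is an implied value, not one stated in the report.
reported_z_cluster4 = -1.13
implied_sd = (clusters["Cluster 4"][1] - overall_mean) / reported_z_cluster4

print(round(overall_mean, 2))  # 2.73
print(round(implied_sd, 2))    # 1.11
```

Because the cluster means are rounded to two decimals, both results are approximate.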
2D scatter plot of employees colored by cluster assignment using top two principal components
The four workforce clusters show moderate overlap in the 2D projection, with 57.8% of variance captured—indicating natural groupings exist but are not sharply separated.
This visualization compresses the five clustering features (age, income, satisfaction, tenure, performance) into two dimensions to assess whether the K-means algorithm found distinct, separable employee segments. The scatter plot reveals the spatial relationship between clusters and identifies potential boundary cases or outliers that blur segment boundaries.
The weak silhouette score of 0.293 (from the overall analysis) aligns with this visual pattern: the clusters capture real differences but are not cleanly separated. The overlap visible in 2D does not invalidate the clustering; it reflects that employees in different segments share some characteristics while differing in others. Because the 2D projection discards 42.2% of the variance, some distinguishing features are invisible here, so separation in the full five-dimensional space may be somewhat better than the plot suggests. The weak silhouette score indicates that substantial overlap remains there as well.
PCA projection is a lossy visualization tool. Apparent overlap in 2D may shrink when viewing the complete feature space. The skewed distribution of PC1 (skew = -0.53) points to a tail of employees with extreme PC1 scores, possibly on the high-income or high-performance end, depending on the sign of the PC1 loadings (not reported here). These employees may represent distinct subgroups worth investigating separately.
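The "variance explained" figure is the share of total variance carried by the retained principal components, which is the sum of their eigenvalues divided by the sum of all eigenvalues. The sketch below illustrates the calculation. The eigenvalues are hypothetical, chosen only to land near the reported 57.8%, because the report does not list the actual values.

```python
def explained_variance_ratio(eigenvalues, n_components):
    """Fraction of total variance captured by the largest n_components."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:n_components]) / sum(ordered)


# Hypothetical eigenvalues for five standardized features (they sum to 5).
# Illustrative only: these are NOT the values from the report's PCA.
hypothetical_eigenvalues = [1.9, 1.0, 0.9, 0.7, 0.5]

retained = explained_variance_ratio(hypothetical_eigenvalues, 2)
lost = 1 - retained

print(round(100 * retained, 1))  # 58.0
print(round(100 * lost, 1))      # 42.0
```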
Which departments are overrepresented in each workforce segment
Research & Development dominates all four clusters at 62.8–70.7%, indicating the flight-risk segment (Cluster 4) has no department-specific concentration — the attrition risk is systemic, not localized to one business unit.
This section identifies whether specific departments are overrepresented in at-risk employee segments. If one department drove the flight-risk cluster, it would signal a localized problem (management, culture, compensation) that could be addressed surgically. Even distribution across departments points to company-wide issues affecting all business units equally.
The flight-risk cluster (Cluster 4) does not concentrate in any single department. R&D's 62.8% representation in Cluster 4 is close to its 65–70.7% presence in the other clusters, indicating R&D employees are distributed across all risk profiles. This pattern argues against department-specific management failures or localized compensation issues as the primary driver of flight risk. Instead, the risk factors appear systemic, affecting business units roughly in proportion to their size.
This analysis assumes department assignment is current and accurate. Cross-validation with actual historical attrition data by department would confirm whether this proportional distribution truly predicts departure risk or whether other unmeasured factors (role level, tenure, manager quality) better explain the flight-risk cluster's composition.
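The departmental breakdown discussed above is a cross-tabulation of cluster against department, expressed as within-cluster percentages. A minimal sketch of that calculation follows. The (cluster, department) pairs are made up for illustration and do not reproduce the report's counts.

```python
from collections import Counter, defaultdict


def department_shares(records):
    """Percentage of each cluster's members that belong to each department."""
    counts = defaultdict(Counter)
    for cluster, department in records:
        counts[cluster][department] += 1
    return {
        cluster: {
            dept: 100 * n / sum(depts.values()) for dept, n in depts.items()
        }
        for cluster, depts in counts.items()
    }


# Hypothetical assignments; not the report's data.
toy_records = [
    ("Cluster 1", "Research & Development"),
    ("Cluster 1", "Research & Development"),
    ("Cluster 1", "Sales"),
    ("Cluster 4", "Research & Development"),
    ("Cluster 4", "Research & Development"),
    ("Cluster 4", "Research & Development"),
    ("Cluster 4", "Sales"),
    ("Cluster 4", "Human Resources"),
]

toy_shares = department_shares(toy_records)
for cluster, depts in sorted(toy_shares.items()):
    print(cluster, {d: round(p, 1) for d, p in depts.items()})
```

Comparing one department's share across clusters (here 66.7% vs. 60.0% for R&D in the toy data) is the same comparison the section makes with 62.8–70.7%.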
Silhouette scores by k showing how the optimal number of clusters was selected
The analysis chose 4 clusters despite k=3 being statistically optimal (silhouette 0.36 vs. 0.29), trading cluster quality for business interpretability.
Silhouette analysis evaluates cluster quality across different values of k (number of clusters) to determine the optimal segmentation. This section shows whether the chosen 4-cluster model is statistically justified or represents a trade-off between statistical purity and practical usability. Understanding this trade-off is critical for assessing whether the resulting employee segments are reliable or merely convenient.
The 4-cluster solution sacrifices statistical quality for business utility. At k=3, employees cluster more cohesively, but the analysis team selected k=4 — likely because the fourth cluster represents a meaningful business segment (e.g., the identified flight-risk group) despite lower silhouette performance. This is a valid trade-off when domain insight justifies it, but it means cluster boundaries are softer: some employees sit between clusters and could reasonably belong to multiple groups.
Silhouette scores below 0.5 are typical for HR data, where employee characteristics naturally overlap. The weak absolute scores suggest the five features (Age, Income, Satisfaction, Tenure, Performance) don't create sharp employee boundaries. Verify that the k=4 choice was driven by business need, not statistical convenience.
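For reference, the silhouette coefficient of a point is (b − a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The overall score is the average over all points. The sketch below implements this definition in plain Python on four made-up points. It is an illustration of the metric, not the code used to produce the 0.29 and 0.36 figures.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up points forming two tight, well-separated pairs.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the pairs
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across them

print(round(good, 2))  # 0.9
print(bad < 0)         # True
```

Scores near 1 indicate compact, well-separated clusters, scores near 0 indicate overlapping ones, and negative scores indicate points that sit closer to another cluster than to their own.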
Detailed cluster profile table showing mean feature values for each segment
| Cluster | Employees | Workforce share (%) | Mean age (years) | Mean monthly income | Mean job satisfaction | Mean years at company | Mean performance rating |
|---|---|---|---|---|---|---|---|
| Cluster 1 | 208 | 14.1 | 35.96 | 5,434 | 2.74 | 6.06 | 4 |
| Cluster 2 | 640 | 43.5 | 34.95 | 4,813 | 3.51 | 5.38 | 3 |
| Cluster 3 | 229 | 15.6 | 47.44 | 14,983 | 2.69 | 15.21 | 3.08 |
| Cluster 4 | 393 | 26.7 | 34.53 | 4,880 | 1.48 | 5.39 | 3 |
The cluster profile table shows Cluster 3 as the most distinct segment (oldest, highest paid, longest tenured), while Clusters 2 and 4 are nearly identical except for job satisfaction (3.51 vs. 1.48).
This section reveals the defining characteristics of each employee segment by comparing mean values across five key features (Age, Monthly Income, Job Satisfaction, Years at Company, and Performance Rating). These profiles enable HR to assign business-meaningful labels to clusters and identify retention risks tied to compensation, tenure, or engagement.
The clustering algorithm assigned all 1,470 employees to 4 segments. The mean values show meaningful variation across clusters and agree with the cluster_centroids table. For example, Cluster 3 has substantially higher mean income ($14,983 vs. $4,813–$5,434 in other clusters) and longer tenure (15.21 years vs. 5.4–6.1 years).
The silhouette score of 0.293 indicates weak cluster separation, so segment boundaries are soft and overlapping. Note this limitation when assigning business labels to clusters.