Executive Summary
Key findings from the diabetes risk driver analysis
The logistic regression model achieves an AUC of 0.8666 (86.7%), indicating good discrimination between diabetic and non-diabetic patients across all thresholds. At the Youden-optimal threshold of 0.3227, the model reaches a sensitivity of 82.8% and a specificity of 78.2%. Glucose is the strongest single predictor of diabetes onset. Before model fitting, 652 zero-coded missing values were detected across five clinical variables and imputed.
Clinical Summary by Diabetes Status
Mean clinical measurements with Welch t-test p-values by diabetes outcome
| Predictor | Mean (Diabetic) | Mean (Non-diabetic) | p-value |
|---|---|---|---|
| Pregnancies | 4.866 | 3.298 | < 0.001 |
| Glucose | 142.3 | 110.6 | < 0.001 |
| Blood Pressure | 75.27 | 70.84 | < 0.001 |
| Skin Thickness | 32.67 | 27.17 | < 0.001 |
| Insulin | 187.6 | 117.2 | < 0.001 |
| BMI | 35.4 | 30.85 | < 0.001 |
| Diabetes Pedigree Function | 0.55 | 0.43 | < 0.001 |
| Age | 37.07 | 31.19 | < 0.001 |
All 8 clinical measurements differ significantly between diabetic and non-diabetic patients (p < 0.05 by Welch t-test). Because all eight p-values fall below the reporting precision, the table alone cannot rank the predictors by significance. Across the 268 diabetic and 500 non-diabetic patients, every measurement has a higher mean in the diabetic group.
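The Welch t-test behind these p-values can be sketched as follows. This is a minimal illustration on hypothetical readings, not the report's data, and it stops at the test statistic and the Welch-Satterthwaite degrees of freedom. Turning these into a p-value needs the t-distribution CDF, which a statistics library provides (for example `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean; variance() divides by n - 1.
    sa, sb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df

# Hypothetical glucose-like readings, chosen only for illustration.
diabetic = [148, 183, 137, 197, 125, 168]
non_diabetic = [85, 89, 116, 110, 115, 103, 126, 99]

t_stat, dof = welch_t(diabetic, non_diabetic)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Unlike the pooled-variance t-test, this form does not assume equal variances in the two groups, which suits groups of unequal size such as 268 vs. 500 patients.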
Zero-Value Missing Data Profile
Count of biologically impossible zero values by clinical variable
Zero-coded missing values are most prevalent in Insulin, with 374 zeros (48.7% of patients). In total, 652 zero-coded values were detected across 5 clinical variables before model fitting. All zeros in these five variables were replaced with class-conditional medians (computed separately for diabetic and non-diabetic patients) prior to analysis.
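The class-conditional median imputation described above could look like the sketch below. The column names (`Insulin`, `Outcome`), the row layout and the sample values are assumptions made for illustration. The median is computed from the non-zero values within each outcome class.

```python
from statistics import median

def impute_zeros(rows, columns, label="Outcome"):
    """Replace zeros in `columns` with the median of the non-zero values
    observed in the same outcome class. Returns new rows; input is untouched."""
    out = [dict(r) for r in rows]
    classes = {r[label] for r in rows}
    for col in columns:
        medians = {}
        for cls in classes:
            observed = [r[col] for r in rows if r[label] == cls and r[col] != 0]
            if observed:  # a class with no observed values keeps its zeros
                medians[cls] = median(observed)
        for r in out:
            if r[col] == 0 and r[label] in medians:
                r[col] = medians[r[label]]
    return out

# Hypothetical records; 0 marks a missing insulin measurement.
patients = [
    {"Insulin": 0,   "Outcome": 1},
    {"Insulin": 100, "Outcome": 1},
    {"Insulin": 200, "Outcome": 1},
    {"Insulin": 0,   "Outcome": 0},
    {"Insulin": 50,  "Outcome": 0},
]
imputed = impute_zeros(patients, ["Insulin"])
print([p["Insulin"] for p in imputed])
```

Only columns where zero is biologically impossible should be passed in `columns`; a zero in Pregnancies, for instance, is a valid value.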
Clinical Measurements by Diabetes Status
Standardised distributions of top predictors by diabetes outcome
Box plots show standardised (z-score) clinical measurements for diabetic vs. non-diabetic patients across the top 5 predictors ranked by model importance. Standardisation places all variables on a common scale, enabling direct visual comparison of distributional separation. Glucose shows the greatest separation between groups, consistent with its position as the strongest predictor in the logistic regression model.
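The z-score standardisation used for the box plots can be sketched as below, using hypothetical values. Whether the report used the sample (n - 1) or population standard deviation is not stated; the sample form is assumed here.

```python
from statistics import mean, stdev

def z_scores(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical glucose readings in mg/dL, chosen only for illustration.
glucose = [85, 110, 142, 168, 99]
print([round(z, 2) for z in z_scores(glucose)])
```

After this transformation each variable has mean 0 and standard deviation 1, so variables in different units can share one axis.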
Odds Ratios — Logistic Regression Predictors
Exponentiated logistic regression coefficients
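6 of 8 clinical predictors have odds ratio confidence intervals that exclude 1.0, indicating statistically significant associations with diabetes at the 95% confidence level. Diabetes Pedigree Function has the highest odds ratio, 2.198 (95% CI: 1.2235 to 3.9853). An odds ratio above 1.0 indicates increased odds of diabetes per unit increase in that predictor; a CI that includes 1.0 means the effect is not statistically distinguishable from no effect (an odds ratio of 1, equivalently a log-odds coefficient of zero). Because each odds ratio is expressed per unit of its own predictor, the values are not directly comparable across predictors measured in different units.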
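The step from a fitted coefficient to an odds ratio and its interval can be sketched as below. The report does not say how its intervals were computed. This sketch assumes the Wald approximation (coefficient plus or minus 1.96 standard errors, then exponentiated), and the coefficient and standard error are hypothetical values, not the fitted ones.

```python
from math import exp

def odds_ratio_ci(beta, se, z=1.96):
    """Wald interval on the log-odds scale, exponentiated to the OR scale."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

# Hypothetical log-odds coefficient and standard error, for illustration only.
odds_ratio, lower, upper = odds_ratio_ci(beta=0.79, se=0.30)

# The association is significant at the 5% level when the interval excludes 1.
significant = lower > 1.0 or upper < 1.0
print(f"OR = {odds_ratio:.3f} (95% CI: {lower:.3f} to {upper:.3f}), "
      f"significant = {significant}")
```

Other interval methods, such as profile likelihood, give intervals that are not exactly symmetric around the coefficient on the log scale, so results can differ slightly from this approximation.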
ROC Curve — Model Discrimination
Receiver operating characteristic curve with AUC
The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across all possible classification thresholds. The model achieves an AUC of 0.8666 (86.7%), indicating good discriminative ability for clinical screening. At the Youden-optimal threshold, sensitivity is 82.8% and specificity is 78.2%, meaning the model correctly identifies 82.8% of true diabetic patients while correctly ruling out 78.2% of non-diabetic patients.
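The ROC construction, the AUC and the Youden-optimal threshold can be sketched in a few lines. The labels and predicted probabilities are hypothetical and the function names are illustrative. The AUC is taken by the trapezoidal rule and the threshold is the one that maximises J = sensitivity + specificity - 1.

```python
def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) for each distinct score, highest first.
    A patient is classified positive when score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((thr, fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve, starting from the (0, 0) corner."""
    area, prev_fpr, prev_tpr = 0.0, 0.0, 0.0
    for _, fpr, tpr in points:
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area

def youden_threshold(points):
    """Threshold maximising J = tpr - fpr (first one wins on ties)."""
    thr, fpr, tpr = max(points, key=lambda p: p[2] - p[1])
    return thr, tpr - fpr

# Hypothetical outcomes (1 = diabetic) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(labels, scores)
auc = auc_trapezoid(points)
best_thr, best_j = youden_threshold(points)
print(f"AUC = {auc:.2f}, Youden threshold = {best_thr}, J = {best_j:.2f}")
```

Youden's J weights sensitivity and specificity equally; a screening programme that penalises missed cases more heavily might choose a lower threshold than the J-optimal one.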
Confusion Matrix at Optimal Threshold
True and false positives/negatives at Youden's J threshold
At the Youden-optimal threshold of 0.323, the model correctly classifies 222 diabetic patients (true positives) and 391 non-diabetic patients (true negatives), for an overall accuracy of 79.8%. 46 diabetic patients are missed (false negatives) and 109 non-diabetic patients are incorrectly flagged (false positives). Sensitivity of 82.8% means the model catches most true cases, which is important for a clinical screening tool where missed diagnoses carry high cost.
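The rates quoted above follow directly from the four confusion-matrix counts. The short check below recomputes them; the variable names are illustrative.

```python
# Counts reported at the Youden-optimal threshold.
tp, fn = 222, 46    # diabetic patients: correctly flagged, missed
fp, tn = 109, 391   # non-diabetic patients: incorrectly flagged, correctly cleared

sensitivity = tp / (tp + fn)                 # share of diabetic patients caught
specificity = tn / (tn + fp)                 # share of non-diabetic patients cleared
accuracy = (tp + tn) / (tp + fn + fp + tn)   # share of all patients classified correctly
youden_j = sensitivity + specificity - 1

print(f"sensitivity = {sensitivity:.1%}")
print(f"specificity = {specificity:.1%}")
print(f"accuracy    = {accuracy:.1%}")
print(f"Youden's J  = {youden_j:.3f}")
```

The row totals, 222 + 46 = 268 and 109 + 391 = 500, match the group sizes reported for diabetic and non-diabetic patients.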
Variable Importance by Log-Odds Magnitude
Predictors ranked by absolute standardised log-odds
Glucose is the most important predictor with an absolute standardised log-odds of 0.9265, meaning a one-standard-deviation increase corresponds to the largest shift in diabetes log-odds among all clinical variables. The gap to the second-ranked predictor is 0.4459. Standardising by predictor scale puts all coefficients on a fair footing: variables with different measurement units (e.g., glucose in mg/dL vs. pregnancies as a count) are directly comparable in this ranking.
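The report does not state how the coefficients were standardised. A common approach, assumed in the sketch below, multiplies each raw log-odds coefficient by its predictor's standard deviation, which is equivalent to fitting the model on z-scored predictors. The coefficients and standard deviations are hypothetical, not the fitted values.

```python
def standardised_log_odds(raw_coefs, sds):
    """Change in log-odds per one-standard-deviation increase in each predictor."""
    return {name: raw_coefs[name] * sds[name] for name in raw_coefs}

def rank_by_magnitude(std_coefs):
    """Sort predictors by absolute standardised log-odds, largest first."""
    return sorted(std_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical per-unit coefficients and predictor standard deviations.
raw = {"Glucose": 0.03, "Pregnancies": 0.12, "BMI": 0.08}
sd = {"Glucose": 30.0, "Pregnancies": 3.4, "BMI": 7.0}

ranking = rank_by_magnitude(standardised_log_odds(raw, sd))
gap = abs(ranking[0][1]) - abs(ranking[1][1])
for name, value in ranking:
    print(f"{name}: {abs(value):.3f}")
print(f"gap between first and second: {gap:.3f}")
```

In this illustration Pregnancies has the largest raw coefficient but the smallest standardised one, which shows why raw per-unit coefficients cannot be ranked directly.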