Executive Summary
Model performance overview and strongest risk factor
The logistic regression model achieved an AUC of 0.94, indicating strong discriminative ability between diabetic and non-diabetic patients. The predictor with the largest odds ratio was diabetes_pedigree at 4.16, meaning each one-unit increase multiplies the odds of diabetes by that factor when the other predictors are held fixed. Overall model accuracy was 88.1%, with 34.9% of the 768 patients in this dataset diagnosed as diabetic.
Risk Factor Ranking by Odds Ratio
Relative impact of each biomarker on diabetes risk
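Biomarkers are ranked by their odds ratios from the logistic regression model. The highest odds ratio belongs to diabetes_pedigree (4.16), which gives it the largest per-unit effect on diabetes odds in this cohort. Because each odds ratio is expressed per one-unit change, its magnitude depends on the predictor's measurement scale, so the ranking reflects per-unit effects rather than standardized importance. All 8 predictors reach statistical significance at p < 0.05 after controlling for the other biomarkers in the model.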
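The scale dependence of per-unit odds ratios can be made concrete with a minimal sketch. It uses only the odds ratios reported in this section; the function name `odds_ratio_for_increment` and the chosen increments (0.1 for diabetes_pedigree, 10 mg/dL for glucose) are illustrative assumptions, not part of the original analysis.

```python
import math

def odds_ratio_for_increment(odds_ratio_per_unit: float, increment: float) -> float:
    """Rescale a per-unit odds ratio to an arbitrary increment of the predictor."""
    # The logistic regression coefficient is the log of the per-unit odds ratio.
    beta = math.log(odds_ratio_per_unit)
    # Log-odds are linear in the predictor, so the coefficient scales with the increment.
    return math.exp(beta * increment)

# Increments below are assumptions chosen for illustration only.
pedigree_or_per_tenth = odds_ratio_for_increment(4.155, 0.1)  # +0.1 in diabetes_pedigree
glucose_or_per_ten = odds_ratio_for_increment(1.058, 10)      # +10 mg/dL in glucose

print(f"diabetes_pedigree, +0.1 units: {pedigree_or_per_tenth:.2f}")
print(f"glucose, +10 mg/dL:            {glucose_or_per_ten:.2f}")
```

Under these assumed increments, the glucose odds ratio comes out larger than the diabetes_pedigree one, which is why a per-unit ranking should be read alongside each predictor's typical range.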
Measurement Distributions by Diabetes Status
Spread of each biomarker in diabetic vs non-diabetic groups
Glucose shows the clearest separation: median 136.3 mg/dL in diabetic vs 105 mg/dL in non-diabetic patients. Box plots show the interquartile range and median for each biomarker split by diagnosis. A larger offset between the two groups' boxes, relative to the height of the boxes, indicates stronger discriminating power. Overlapping distributions suggest a predictor has weaker standalone discriminative ability.
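As a rough illustration of how the box-plot quantities are obtained, the sketch below computes the quartiles for two groups and compares the median offset to the average box height. The readings are hypothetical values invented for the example, not data from this report, and `median_offset_ratio` is an informal heuristic introduced here rather than a standard statistic.

```python
import statistics

def box_summary(values):
    """Return (q1, median, q3), the three values a box plot draws as the box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, median, q3

def median_offset_ratio(group_a, group_b):
    """Median difference divided by the mean interquartile range of the two groups."""
    q1_a, med_a, q3_a = box_summary(group_a)
    q1_b, med_b, q3_b = box_summary(group_b)
    mean_iqr = ((q3_a - q1_a) + (q3_b - q1_b)) / 2
    return (med_a - med_b) / mean_iqr

# Hypothetical glucose readings in mg/dL (NOT taken from the report's dataset).
diabetic = [118, 125, 131, 137, 142, 150, 168]
non_diabetic = [85, 92, 99, 105, 110, 117, 126]

print("diabetic box:     ", box_summary(diabetic))
print("non-diabetic box: ", box_summary(non_diabetic))
print("offset ratio:     ", round(median_offset_ratio(diabetic, non_diabetic), 2))
```

A ratio well above 1 means the medians are further apart than a typical box is tall, which corresponds to little overlap between the boxes.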
Feature Correlation Matrix
Pairwise correlations among biomarkers to assess multicollinearity
The correlation heatmap shows pairwise Pearson correlations among all eight biomarkers. The highest correlation is between bmi and glucose (r = 0.2). Correlations above roughly 0.7 are commonly treated as a warning sign of multicollinearity, which makes it difficult to isolate each predictor's independent contribution to diabetes risk. At r = 0.2, the strongest observed pair is well below that level.
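The multicollinearity check described above can be sketched as a pairwise scan for correlations beyond the 0.7 cutoff. The column values are hypothetical toy data, and `glucose_repeat` is an invented near-duplicate column added only so that one pair is flagged; neither comes from the report's dataset.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def flag_collinear_pairs(columns, threshold=0.7):
    """List column pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson_r(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical toy columns (NOT the report's data).
columns = {
    "glucose": [90, 110, 130, 150, 170],
    "bmi": [31, 24, 35, 27, 33],
    "glucose_repeat": [91, 109, 131, 149, 171],  # invented near-duplicate of glucose
}

flagged = flag_collinear_pairs(columns)
print(flagged)
```

Only the near-duplicate pair is flagged in this toy input; the glucose and bmi columns correlate weakly and pass the check.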
Mean Biomarker Profiles: Diabetic vs Non-Diabetic
Z-score standardized group means for each biomarker
Z-score standardized mean values allow fair comparison across biomarkers with different units. The largest group difference is in glucose (1.08 standard deviations). Diabetic patients consistently score higher on metabolic risk indicators. Bars extending further from zero indicate measurements where the two groups diverge most.
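A minimal sketch of the standardization step follows. It assumes z-scores are computed over the pooled cohort using the population standard deviation; the report does not state which convention was used, and the four readings are hypothetical.

```python
import statistics

def standardized_group_means(values, is_diabetic):
    """Z-score a biomarker over the whole cohort, then average within each group."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population SD over the pooled cohort
    z_scores = [(v - mean) / sd for v in values]
    diabetic_z = [z for z, flag in zip(z_scores, is_diabetic) if flag]
    non_diabetic_z = [z for z, flag in zip(z_scores, is_diabetic) if not flag]
    return statistics.fmean(diabetic_z), statistics.fmean(non_diabetic_z)

# Hypothetical glucose readings and diagnoses (NOT the report's data).
glucose = [100, 120, 140, 160]
is_diabetic = [False, False, True, True]

diabetic_mean_z, non_diabetic_mean_z = standardized_group_means(glucose, is_diabetic)
group_difference = diabetic_mean_z - non_diabetic_mean_z
print(f"group difference: {group_difference:.2f} standard deviations")
```

The group difference is expressed in standard deviations, which is what allows bars for biomarkers with different units to be compared on one axis.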
Diabetes Prevalence by Age Group
Fraction of patients diagnosed with diabetes by age band
Diabetes prevalence rises sharply with age across decade bands. The youngest group (20-29) has a prevalence of 24.3%, while the oldest recorded group reaches 87.5%. Prevalence first exceeds 50% in the 50-59 age group. This monotone increase aligns with the biological understanding that insulin resistance accumulates over time alongside lifestyle and metabolic changes.
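The banding behind such a chart can be sketched as follows. The decade edges (20-29, 30-39, and so on) follow the band named in the text, while the ages and outcomes are hypothetical toy values rather than records from this cohort.

```python
from collections import defaultdict

def prevalence_by_decade(ages, outcomes):
    """Fraction of diabetic patients (outcome 1) within each decade age band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [diabetic count, total count]
    for age, outcome in zip(ages, outcomes):
        start = (age // 10) * 10
        band = f"{start}-{start + 9}"
        counts[band][0] += outcome
        counts[band][1] += 1
    # Sorting the string labels orders the bands correctly for two-digit ages.
    return {band: pos / total for band, (pos, total) in sorted(counts.items())}

# Hypothetical ages and diagnoses (NOT the report's data).
ages = [21, 25, 28, 29, 34, 37, 52, 55]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1]

prevalence = prevalence_by_decade(ages, outcomes)
for band, fraction in prevalence.items():
    print(f"{band}: {fraction:.1%}")
```

Bands with few patients produce unstable fractions, so reporting the band sizes alongside the prevalence helps when reading the oldest groups.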
Glucose vs BMI by Diabetes Outcome
Joint distribution of glucose and BMI colored by diabetes diagnosis
Each point is one patient, with size proportional to age. Diabetic patients cluster toward higher glucose levels (mean 130 mg/dL vs 95.5 mg/dL in non-diabetics) and higher BMI (mean 34.1). The upper-right region (high glucose combined with elevated BMI) shows the highest concentration of diabetic cases, reflecting the joint metabolic risk these two factors create.
Logistic Regression Coefficients
Full regression table with odds ratios, confidence intervals, and p-values
| predictor_name | odds_ratio | ci_lower | ci_upper | p_value | significance_stars |
|---|---|---|---|---|---|
| pregnancies | 1.746 | 1.530 | 1.992 | <0.0001 | *** |
| glucose | 1.058 | 1.046 | 1.071 | <0.0001 | *** |
| blood_pressure | 1.039 | 1.020 | 1.059 | 0.0001 | *** |
| skin_thickness | 1.104 | 1.072 | 1.138 | <0.0001 | *** |
| insulin | 1.015 | 1.009 | 1.020 | <0.0001 | *** |
| bmi | 1.111 | 1.073 | 1.151 | <0.0001 | *** |
| diabetes_pedigree | 4.155 | 1.965 | 8.785 | 0.0002 | *** |
| age | 1.074 | 1.048 | 1.102 | <0.0001 | *** |
The table shows all eight biomarker coefficients from the logistic regression, expressed as odds ratios with 95% confidence intervals. All 8 predictors are statistically significant at p < 0.05 after controlling for the other variables. The largest odds ratio belongs to diabetes_pedigree (4.155, 95% CI: 1.965–8.785), although its wide interval shows that the estimate is less precise than those of the other predictors.
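The relationship between an odds ratio, its confidence interval and its p-value can be sketched by working backwards from a table row. This assumes the intervals are Wald intervals of the form exp(beta ± 1.96 × SE), which the report does not confirm; `wald_summary` is a hypothetical helper name.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal distribution

def wald_summary(odds_ratio, ci_lower, ci_upper):
    """Recover coefficient, standard error, z statistic and p-value from an OR and its CI."""
    beta = math.log(odds_ratio)
    # Assumption: the interval is symmetric on the log-odds scale (Wald interval).
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * Z_95)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, se, z, p_value

# diabetes_pedigree row from the table above.
beta, se, z, p_value = wald_summary(4.155, 1.965, 8.785)
print(f"beta={beta:.3f}  se={se:.3f}  z={z:.2f}  p={p_value:.4f}")
```

For the diabetes_pedigree row this reproduces a p-value close to the 0.0002 in the table, which is consistent with Wald intervals but does not establish that they were computed this way.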
Model Performance Metrics
Summary of classification performance at the chosen decision threshold
| metric_name | metric_value |
|---|---|
| AUC (ROC) | 0.9398 |
| Accuracy | 0.8815 |
| Sensitivity (Recall) | 0.8022 |
| Specificity | 0.924 |
| Positive Predictive Value | 0.8498 |
| N Observations | 768 |
| Diabetes Prevalence | 0.349 |
The model achieves an AUC of 0.94, a measure that does not depend on the decision threshold. At a classification threshold of 50%, overall accuracy is 88.1%, sensitivity is 80.2% (fraction of actual diabetics correctly identified), and specificity is 92.4% (fraction of non-diabetics correctly classified). Raising the classification threshold would increase specificity at the cost of sensitivity, and lowering it would do the reverse; the optimal threshold depends on the relative cost of false negatives vs false positives.
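The threshold-dependent metrics all derive from the four cells of a confusion matrix, as the sketch below shows. The report does not list the cell counts; the values used here are back-calculated as one set of integers that reproduces the reported rates for N = 768 and should be read as an assumption, not as reported results.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard classification rates from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # actual diabetics correctly identified
        "specificity": tn / (tn + fp),   # non-diabetics correctly classified
        "ppv": tp / (tp + fp),           # predicted diabetics who are diabetic
        "prevalence": (tp + fn) / total,
    }

# Assumed counts, back-calculated from the reported rates (not given in the report).
metrics = classification_metrics(tp=215, fn=53, tn=462, fp=38)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Moving the decision threshold shifts patients between these cells, which is why sensitivity and specificity trade off against each other while AUC stays unchanged.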