Analytics · Health · Diabetes · Risk Factors

Executive Summary

Model performance overview and strongest risk factor

metric_name	metric_value
n_observations	768
auc	0.9398
accuracy	0.8815
sensitivity	0.8022
specificity	0.924
diabetes_prevalence	0.349
strongest_predictor (diabetes_pedigree odds ratio)	4.1546
ppv	0.8498
tp	215
tn	462

The logistic regression model achieved an AUC of 0.94, indicating strong discriminative ability between diabetic and non-diabetic patients. The predictor with the largest odds ratio was diabetes_pedigree (4.15): each one-unit increase multiplies the odds of diabetes by that factor, holding the other predictors constant. Overall model accuracy was 88.2%, and 34.9% of the 768 patients in this dataset were diagnosed as diabetic.

Visualization

Risk Factor Ranking by Odds Ratio

Relative impact of each biomarker on diabetes risk

Interpretation

Biomarkers are ranked by their odds ratios from the logistic regression model. diabetes_pedigree has the highest odds ratio (4.15), making it the predictor with the largest per-unit effect on the odds of diabetes in this cohort. Because each odds ratio is expressed per one unit of its own predictor, the ranking reflects per-unit effects rather than effects on a common scale. All 8 predictors reach statistical significance at p < 0.05 after controlling for the other biomarkers in the model.
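The per-unit reading of an odds ratio can be illustrated with a short sketch. The odds ratios are copied from the coefficient table in this report; the helper name `scaled_odds_ratio` and the 10-unit glucose step are illustrative assumptions, not part of the original analysis.

```python
# Odds ratios copied from the report's coefficient table.
odds_ratios = {
    "pregnancies": 1.746,
    "glucose": 1.058,
    "blood_pressure": 1.039,
    "skin_thickness": 1.104,
    "insulin": 1.015,
    "bmi": 1.111,
    "diabetes_pedigree": 4.155,
    "age": 1.074,
}


def scaled_odds_ratio(or_per_unit, delta):
    """Odds ratio for a change of `delta` units (hypothetical helper).

    In a logistic model the log-odds are linear in the predictor,
    so a per-unit odds ratio compounds multiplicatively.
    """
    return or_per_unit ** delta


# Rank predictors by per-unit odds ratio, largest first.
ranked = sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative assumption: a 10 mg/dL step in glucose.
glucose_per_10 = scaled_odds_ratio(odds_ratios["glucose"], 10)

print(ranked[0][0])
print(round(glucose_per_10, 2))
```

The small per-unit odds ratio for glucose compounds over a realistic range of values, which is why a per-unit ranking alone can understate predictors measured on fine-grained scales.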

Visualization

Measurement Distributions by Diabetes Status

Spread of each biomarker in diabetic vs non-diabetic groups

Interpretation

Box plots show the interquartile range and median for each biomarker, split by diagnosis. Glucose shows the clearest separation: median 136.3 mg/dL in diabetic vs 105 mg/dL in non-diabetic patients. A larger offset between the two groups' boxes, relative to the height of the boxes, indicates stronger discriminating power. Overlapping distributions suggest a predictor has weaker standalone discriminative ability.
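As a rough sketch of what each box encodes, the median and interquartile range can be computed per group as below. The glucose values are made-up toy numbers, not data from this report, and `box_summary` is a hypothetical helper. The quartile method is Python's default ("exclusive"); a plotting library may use a slightly different convention.

```python
from statistics import median, quantiles


def box_summary(values):
    """Return the quartiles, median, and IQR that a box plot displays."""
    q1, _, q3 = quantiles(values, n=4)  # default method is "exclusive"
    return {"q1": q1, "median": median(values), "q3": q3, "iqr": q3 - q1}


# Toy glucose readings (mg/dL), invented for illustration only.
toy_diabetic = [110, 125, 136, 150, 170]
toy_non_diabetic = [85, 95, 105, 112, 125]

diabetic_box = box_summary(toy_diabetic)
non_diabetic_box = box_summary(toy_non_diabetic)

# Offset between the group medians: the larger it is relative to the IQRs,
# the better the biomarker separates the two groups on its own.
median_offset = diabetic_box["median"] - non_diabetic_box["median"]
print(diabetic_box, non_diabetic_box, median_offset)
```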

Visualization

Feature Correlation Matrix

Pairwise correlations among biomarkers to assess multicollinearity

Interpretation

The correlation heatmap shows pairwise Pearson correlations among all eight biomarkers. The highest correlation is between bmi and glucose (r = 0.2). Absolute correlations above 0.7 can cause multicollinearity in the regression model, making it difficult to isolate each predictor's independent contribution to diabetes risk. Since the largest observed correlation is well below that level, multicollinearity is unlikely to distort the coefficients here.
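The screening rule described above can be sketched as follows. The column values are toy numbers chosen for illustration, not the report's data, and `pearson` and `flag_collinear` are hypothetical helper names; the 0.7 cutoff is the one quoted in the text.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(sxx * syy)


def flag_collinear(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged


# Toy columns, invented for illustration: "y" is an exact multiple of "x".
toy = {"x": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10], "z": [5, 1, 4, 2, 3]}
flagged = flag_collinear(toy)
print(flagged)
```

Applied to the eight biomarkers in this report, a check like this would presumably return an empty list, given that the largest reported correlation is 0.2.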

Visualization

Mean Biomarker Profiles: Diabetic vs Non-Diabetic

Z-score standardized group means for each biomarker

Interpretation

Z-score standardized mean values allow fair comparison across biomarkers with different units. Glucose shows the largest group difference (1.08 standard deviations). Diabetic patients consistently score higher on metabolic risk indicators. Bars extending further from zero indicate measurements where the two groups diverge most.
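A minimal sketch of the standardization step, assuming z-scores are computed over the whole cohort with the population standard deviation (the report does not say which convention was used). The values and the function name `standardized_group_means` are illustrative only.

```python
from statistics import mean, pstdev


def standardized_group_means(values, labels):
    """Z-score the values over the full cohort, then average within each group."""
    mu, sigma = mean(values), pstdev(values)  # assumption: population SD
    z_scores = [(v - mu) / sigma for v in values]
    groups = {}
    for score, label in zip(z_scores, labels):
        groups.setdefault(label, []).append(score)
    return {label: mean(scores) for label, scores in groups.items()}


# Toy measurements, invented for illustration (1 = diabetic, 0 = non-diabetic).
toy_values = [1, 2, 3, 4, 5, 6]
toy_labels = [0, 0, 0, 1, 1, 1]

group_means = standardized_group_means(toy_values, toy_labels)

# Group difference in standard deviation units, the quantity the chart compares.
difference = group_means[1] - group_means[0]
print(group_means, round(difference, 2))
```

Because the result is expressed in standard deviations, the same calculation can be compared directly between, say, glucose in mg/dL and BMI.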

Visualization

Diabetes Prevalence by Age Group

Fraction of patients diagnosed with diabetes by age band

Interpretation

Diabetes prevalence rises sharply with age across decade bands. The youngest group (20-29) has a prevalence of 24.3%, while the oldest recorded group reaches 87.5%. Prevalence first exceeds 50% in the 50-59 age group. This monotone increase aligns with the biological understanding that insulin resistance accumulates over time alongside lifestyle and metabolic changes.
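The banding behind this chart can be sketched as below, assuming decade bands that start at multiples of ten (20-29, 30-39, and so on). The ages and outcomes are toy values, not the report's data, and `prevalence_by_decade` is a hypothetical helper.

```python
from collections import defaultdict


def prevalence_by_decade(ages, outcomes):
    """Fraction of patients with outcome 1 in each decade age band."""
    counts = defaultdict(lambda: [0, 0])  # band start -> [diabetic, total]
    for age, outcome in zip(ages, outcomes):
        start = (age // 10) * 10
        counts[start][0] += outcome
        counts[start][1] += 1
    return {
        f"{start}-{start + 9}": positives / total
        for start, (positives, total) in sorted(counts.items())
    }


# Toy cohort, invented for illustration (1 = diabetic, 0 = non-diabetic).
toy_ages = [21, 25, 29, 33, 38, 52, 57]
toy_outcomes = [0, 0, 1, 0, 1, 1, 1]

prevalence = prevalence_by_decade(toy_ages, toy_outcomes)
print(prevalence)
```

Bands with few patients give unstable fractions, so the group sizes are worth reporting alongside the prevalence in the oldest band.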

Visualization

Glucose vs BMI by Diabetes Outcome

Joint distribution of glucose and BMI colored by diabetes diagnosis

Interpretation

Each point is one patient, with size proportional to age. Diabetic patients cluster toward higher glucose levels (mean 130 mg/dL vs 95.5 mg/dL in non-diabetics) and higher BMI (mean 34.1). The upper-right region, where high glucose combines with elevated BMI, shows the highest concentration of diabetic cases, reflecting the joint metabolic risk these two factors create.

Data Table

Logistic Regression Coefficients

Full regression table with odds ratios, confidence intervals, and p-values

predictor_name	odds_ratio	ci_lower	ci_upper	p_value	significance_stars
pregnancies	1.746	1.53	1.992	<0.0001	***
glucose	1.058	1.046	1.071	<0.0001	***
blood_pressure	1.039	1.02	1.059	0.0001	***
skin_thickness	1.104	1.072	1.138	<0.0001	***
insulin	1.015	1.009	1.02	<0.0001	***
bmi	1.111	1.073	1.151	<0.0001	***
diabetes_pedigree	4.155	1.965	8.785	0.0002	***
age	1.074	1.048	1.102	<0.0001	***

Interpretation

The table shows all eight biomarker coefficients from the logistic regression, expressed as odds ratios with 95% confidence intervals. All 8 predictors are statistically significant at p < 0.05 after controlling for the other variables. diabetes_pedigree has the largest odds ratio (4.155, 95% CI: 1.965–8.785).
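The columns of the table are linked: under the usual Wald approximation, the coefficient, its standard error, and the p-value can be recovered from an odds ratio and its 95% confidence interval. The sketch below assumes the intervals are Wald intervals on the log-odds scale with a critical value of 1.96, which the report does not state; `wald_from_ci` is a hypothetical helper.

```python
import math


def wald_from_ci(odds_ratio, ci_lower, ci_upper, z_crit=1.96):
    """Recover (beta, standard error, z, two-sided p) from an odds ratio and CI.

    Assumes the interval is exp(beta +/- z_crit * se), i.e. a Wald interval.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, se, z, p


# Rows copied from the coefficient table in this report.
pedigree = wald_from_ci(4.155, 1.965, 8.785)
blood_pressure = wald_from_ci(1.039, 1.02, 1.059)

print(round(pedigree[3], 4))
print(round(blood_pressure[3], 4))
```

Both recovered p-values agree with the table to the precision shown. This also explains how the predictor with the largest odds ratio can have the widest interval: its standard error on the log scale is far larger than that of the others.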

Data Table

Model Performance Metrics

Summary of classification performance at the chosen decision threshold

metric_name	metric_value
AUC (ROC)	0.9398
Accuracy	0.8815
Sensitivity (Recall)	0.8022
Specificity	0.924
Positive Predictive Value	0.8498
N Observations	768
Diabetes Prevalence	0.349

Interpretation

The model achieves an AUC of 0.94, which summarizes discrimination across all possible thresholds. At a classification threshold of 50%, overall accuracy is 88.2%, sensitivity is 80.2% (fraction of actual diabetics correctly identified), and specificity is 92.4% (fraction of non-diabetics correctly classified). Raising the classification threshold would increase specificity at the cost of sensitivity, and lowering it would do the reverse; the optimal threshold depends on the relative cost of false negatives vs false positives.
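The threshold-dependent metrics can be cross-checked from the confusion-matrix counts. The true positive and true negative counts (215 and 462) come from the executive summary; the false negative and false positive counts are derived here on the assumption that the 34.9% prevalence corresponds to 268 diabetic patients out of 768.

```python
# Stated in the report.
n = 768
tp, tn = 215, 462
prevalence = 0.349

# Derived under the assumption that prevalence * n rounds to the
# number of diabetic patients.
positives = round(prevalence * n)  # diabetic patients
negatives = n - positives          # non-diabetic patients
fn = positives - tp                # diabetics the model missed
fp = negatives - tn                # non-diabetics flagged as diabetic

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)

for name, value in [
    ("accuracy", accuracy),
    ("sensitivity", sensitivity),
    ("specificity", specificity),
    ("ppv", ppv),
]:
    print(f"{name}: {value:.4f}")
```

The four values reproduce the figures in the metrics table, which suggests the reported counts and rates are internally consistent.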
