Executive Summary
Top-line stroke risk findings: AUC, top odds ratios, and key predictors
The logistic regression model achieves an AUC of 0.8457 when predicting stroke from clinical and demographic predictors in 5,110 patients (stroke prevalence 4.87%). The adjusted predictor with the largest odds ratio is Hypertension (OR = 1.485). XGBoost feature importance is broadly consistent with the logistic regression predictors, with Age as the highest-gain feature. The 201 missing BMI values were imputed with the median.
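As a reading aid for the AUC figure, the sketch below computes AUC as the share of (stroke, non-stroke) pairs in which the stroke patient receives the higher predicted risk, with ties counted as half. The function name and the toy labels and scores are hypothetical and are not taken from the report's data or code.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = stroke, 0 = no stroke; scores are predicted risks.
toy_labels = [0, 0, 1, 0, 1, 0]
toy_scores = [0.02, 0.10, 0.35, 0.40, 0.80, 0.05]
toy_auc = auc_from_scores(toy_labels, toy_scores)
print(toy_auc)  # 7 of 8 pairs are ordered correctly -> 0.875
```

Under this reading, an AUC of 0.8457 means that in roughly 85% of such pairs the model assigns the higher risk to the stroke patient. The pairwise loop is quadratic and is meant only for illustration.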
Risk Factor Summary by Stroke Outcome
Mean clinical values by stroke vs no-stroke group
| Stroke | Age | BMI | Hypertension | Heart Disease | Avg Glucose Level |
|---|---|---|---|---|---|
| No Stroke | 42 | 28.8 | 0.089 | 0.047 | 104.8 |
| Stroke | 67.7 | 30.1 | 0.265 | 0.189 | 132.5 |
Stroke patients are on average 67.7 years old versus 42 years for non-stroke patients, a gap of about 26 years. Mean average glucose level is 132.5 in stroke patients versus 104.8 in non-stroke patients, reflecting the link between hyperglycemia and cerebrovascular risk. Hypertension and heart disease rates are both markedly higher in the stroke group (26.5% vs 8.9% and 18.9% vs 4.7%, respectively), consistent with their roles as primary cardiovascular risk factors.
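The table's Hypertension and Heart Disease columns are means of 0/1 indicators, which is why they read as proportions. A minimal sketch of that kind of grouped summary follows; the rows, column names, and helper function are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy rows; the real cohort has 5110 patients.
rows = [
    {"stroke": 0, "age": 30, "hypertension": 0},
    {"stroke": 0, "age": 50, "hypertension": 0},
    {"stroke": 0, "age": 46, "hypertension": 1},
    {"stroke": 1, "age": 70, "hypertension": 1},
    {"stroke": 1, "age": 66, "hypertension": 0},
]


def group_means(rows, group_key, value_keys):
    """Return {group: {column: mean}} for the requested columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {
        group: {key: mean(r[key] for r in members) for key in value_keys}
        for group, members in groups.items()
    }


summary = group_means(rows, "stroke", ["age", "hypertension"])
# The mean of a 0/1 column is the share of patients with the condition.
print(summary[1]["hypertension"])  # 0.5, i.e. 50% of the toy stroke group
```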
Age Distribution by Stroke Outcome
Age distribution comparison between stroke and non-stroke patients
The median age for stroke patients is 71 years versus 43 years for non-stroke patients, a difference of 28 years. The age distribution for stroke cases is visibly shifted toward older ages and concentrated in the upper age range, consistent with age being a strong continuous predictor of stroke risk. Even the lower quartile of the stroke group exceeds the median age of non-stroke patients.
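The quartile comparison above can be checked with a few lines of standard-library Python. The two age lists below are hypothetical toy values chosen for illustration, not the cohort's data, and the `inclusive` quantile method is an assumption, since the report does not say which method was used.

```python
from statistics import quantiles

# Hypothetical toy ages for each outcome group.
stroke_ages = [58, 63, 71, 78, 80]
no_stroke_ages = [20, 31, 43, 55, 66]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
stroke_q1, stroke_median, stroke_q3 = quantiles(stroke_ages, n=4, method="inclusive")
no_q1, no_median, no_q3 = quantiles(no_stroke_ages, n=4, method="inclusive")

median_gap = stroke_median - no_median
q1_exceeds_other_median = stroke_q1 > no_median

print(median_gap, q1_exceeds_other_median)
```

When the lower quartile of one group exceeds the median of the other, at least three quarters of the first group are older than half of the second.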
Stroke Rate by Clinical Risk Factor Group
Raw stroke incidence rates by clinical and demographic subgroups
Among clinical subgroups, Heart Disease has the highest raw stroke rate at 17.03%. Patients with hypertension or pre-existing heart disease show stroke rates roughly 3-4x the overall cohort average of 4.87%. These unadjusted rates are subject to confounding (e.g., older patients are more likely to have hypertension), so the logistic regression odds ratios provide the adjusted picture.
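The "3-4x" comparison is a ratio of a subgroup's raw rate to the overall rate. A small worked check using the two percentages quoted above (the function name is illustrative):

```python
def rate_ratio(subgroup_rate_pct, overall_rate_pct):
    """Ratio of a subgroup's raw stroke rate to the overall cohort rate."""
    if overall_rate_pct <= 0:
        raise ValueError("overall rate must be positive")
    return subgroup_rate_pct / overall_rate_pct


# Figures quoted in the text: heart disease 17.03%, overall 4.87%.
heart_disease_ratio = rate_ratio(17.03, 4.87)
print(round(heart_disease_ratio, 2))  # about 3.5
```

This ratio is unadjusted. It describes how much more common stroke is in the subgroup, not the effect of heart disease with age and other factors held fixed.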
Logistic Regression Odds Ratios
Adjusted odds ratios with 95% confidence intervals
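Odds ratios are adjusted for all other predictors in the model. The largest adjusted odds ratio belongs to Hypertension (OR = 1.485). Four predictors have 95% CIs entirely above 1.0 (i.e., are independently associated with elevated stroke risk): Hypertension, Age, Avg. Glucose Level, and a fourth whose label appears as NA in the output. If continuous predictors such as Age were entered unscaled, their odds ratios are per unit of the predictor and are not directly comparable in magnitude with those of binary predictors. Error bars show 95% profile-likelihood confidence intervals.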
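For readers less familiar with odds ratios, the sketch below shows how an OR and an interval are obtained from a logistic regression coefficient. It uses a Wald interval (coefficient ± 1.96 standard errors) as a simpler stand-in for the profile-likelihood intervals the report uses, so the bounds will generally differ. The coefficient is back-calculated from the reported OR of 1.485, and the standard error is a hypothetical value chosen only for illustration.

```python
import math


def odds_ratio_with_wald_ci(beta, se, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald interval bounds."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Coefficient implied by the reported OR (log of 1.485).
beta_hypertension = math.log(1.485)
# Hypothetical standard error; the report does not state one.
se_hypothetical = 0.16

or_point, ci_low, ci_high = odds_ratio_with_wald_ci(beta_hypertension, se_hypothetical)

# A predictor is flagged as elevating risk when the whole interval is above 1.
interval_above_one = ci_low > 1.0
print(round(or_point, 3), interval_above_one)
```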
XGBoost Feature Importance
Non-linear feature importance by gain metric from gradient-boosted trees
XGBoost gain importance ranks Age as the most predictive feature (gain = 0.55998), followed by BMI. Gain measures the average reduction in the training loss contributed by splits that use a feature. Comparing this ranking with the logistic regression odds ratios indicates whether non-linear effects or interactions change which predictors matter most.
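A value such as 0.55998 suggests that the gains are expressed as shares that sum to 1 across features, although the report does not state this. Under that assumption, a minimal sketch of the normalization looks as follows. The split list and its numbers are hypothetical and do not come from the fitted model.

```python
from collections import defaultdict

# Hypothetical (feature, loss reduction) pairs, one per tree split.
splits = [
    ("age", 40.0),
    ("age", 16.0),
    ("bmi", 20.0),
    ("avg_glucose_level", 14.0),
    ("bmi", 10.0),
]


def gain_shares(splits):
    """Sum the gain per feature and express each as a share of the total."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    grand_total = sum(totals.values())
    return {feature: total / grand_total for feature, total in totals.items()}


shares = gain_shares(splits)
ranking = sorted(shares, key=shares.get, reverse=True)
print(ranking)  # ['age', 'bmi', 'avg_glucose_level']
```

Read this way, a gain of about 0.56 would mean that splits on Age account for a little over half of the total loss reduction achieved by the trees.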
Risk Factor Correlation Matrix
Correlation matrix of numeric risk factors to detect multicollinearity
The heatmap shows Pearson correlations among numeric risk factors. The maximum off-diagonal correlation is 0.324. Age and average glucose level are moderately correlated with stroke, but correlations among predictors are generally low, so the logistic regression coefficient estimates and their standard errors are unlikely to be severely affected by multicollinearity.
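To put the 0.324 figure in context, the sketch below finds the largest absolute pairwise Pearson correlation in a set of columns and converts a correlation into a variance inflation factor (VIF). The formula 1 / (1 - r^2) is exact only for a model with two predictors. With more predictors the VIF depends on the R^2 from regressing each predictor on all the others, so this is a rough guide rather than the model's actual VIF. The toy columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_off_diagonal(columns):
    """Return the column pair with the largest absolute correlation, and that r."""
    best_pair = max(
        combinations(columns, 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )
    return best_pair, pearson(columns[best_pair[0]], columns[best_pair[1]])


def two_predictor_vif(r):
    """VIF implied by a pairwise correlation in a two-predictor model."""
    return 1.0 / (1.0 - r * r)


# Hypothetical toy columns (not the cohort's data).
toy_columns = {
    "age": [30, 45, 60, 75],
    "bmi": [22, 30, 26, 29],
    "avg_glucose_level": [90, 100, 110, 120],
}
toy_pair, toy_r = max_off_diagonal(toy_columns)

# VIF implied by the maximum correlation reported in the text.
vif_at_reported_max = two_predictor_vif(0.324)
print(round(vif_at_reported_max, 3))  # about 1.117
```

A VIF near 1.1 is far below the values of 5 to 10 that are commonly treated as a warning sign, which supports the reading that multicollinearity is mild here.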
Stroke Rate by Smoking Status
Stroke prevalence by smoking status category
Among smoking status categories, 'formerly smoked' has the highest stroke rate at 7.91% versus the overall rate of 4.87%. Smoking status is a self-reported behavioral variable with a meaningful 'unknown' category, which may reflect missingness or historical non-recording. The adjusted logistic regression separates smoking's independent effect from its correlation with other risk factors like age and hypertension.