Your hospital's stroke rate just hit 4.9%—nearly double the national average. The cardiology team wants to know why. The quality improvement committee demands an action plan. And everyone has a theory: "It's the aging population." "No, it's uncontrolled hypertension." "Actually, we're seeing more diabetics."

Here's the problem: raw stroke rates by subgroup don't tell you which risk factors matter independently. Older patients have higher stroke rates, yes—but they also have higher rates of hypertension, diabetes, and heart disease. Is age the driver, or is it the conditions that accumulate with age? Without controlling for confounders, you're just looking at correlation.

Stroke risk factor analysis answers the adjustment question: which clinical and demographic variables remain associated with stroke after adjusting for all other measured factors? This guide walks through a dual-model approach, using logistic regression for interpretable odds ratios and XGBoost for flexible, non-linear prediction. We'll analyze 5,110 patient records with 249 stroke events (a 4.9% event rate), comparing what each model reveals about risk factor importance.

The Research Question: Which Risk Factors Actually Matter?

Before we build models, let's state the hypothesis clearly: age, hypertension, heart disease, elevated glucose, and BMI independently increase stroke risk after controlling for demographics and lifestyle factors.

This is a prediction problem, not an experimental study. We're working with observational data—patient records from the Fedesoriano stroke prediction dataset. That means we can identify associations and build risk scores, but we cannot make causal claims. A properly randomized trial would require assigning patients to "high glucose" vs. "normal glucose" conditions, which is neither ethical nor feasible.

What we can do is control for measured confounders using multivariable regression. If age predicts stroke even after adjusting for hypertension, glucose, BMI, and smoking status, we have evidence that age is associated with stroke independently of those measured factors. The analysis produces two outputs: adjusted odds ratios from logistic regression (interpretable, clinically meaningful) and feature importance scores from XGBoost (often more accurate, less interpretable).

Sample Size Check: Is This Analysis Adequately Powered?

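The standard rule for logistic regression is 10-15 events per predictor variable. Our dataset includes 249 stroke events and 10 predictor variables (age, hypertension, heart disease, glucose, BMI, gender, marital status, work type, residence type, smoking status). That gives us 24.9 events per predictor, comfortably above the 10-event threshold. Strictly, the rule counts estimated coefficients rather than columns: a categorical variable with k levels uses k − 1 coefficients once dummy-coded, so the effective ratio is somewhat lower than 24.9.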
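
The events-per-predictor check is simple enough to script before any modeling. The sketch below uses the article's figures (249 events, 10 predictors); the function name and the 15-coefficient scenario are illustrative assumptions, not counts taken from the dataset.

```python
def events_per_parameter(n_events: int, n_parameters: int) -> float:
    """Outcome events divided by the number of estimated coefficients."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters

# Figures from the article: 249 strokes, 10 predictor variables.
epv = events_per_parameter(249, 10)
print(f"EPV = {epv:.1f}")  # EPV = 24.9
print("adequate" if epv >= 10 else "underpowered")

# Hypothetical scenario: dummy-coding expands the 10 variables to 15 coefficients.
epv_dummy = events_per_parameter(249, 15)
print(f"EPV after dummy-coding = {epv_dummy:.1f}")
```

Counting coefficients instead of columns is the more conservative reading of the rule, which is why the second call is worth running on your own design matrix.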

Below 10 events per predictor, you risk overfitting: the model memorizes noise instead of learning true patterns. Confidence intervals become unreliably wide. P-values lose meaning. If your stroke count is below 100, reduce the number of predictors or collect more data before proceeding.

Why Use Two Models Instead of One?

Logistic regression and XGBoost answer different questions. Logistic regression estimates the independent association of each predictor with the outcome, holding all others constant. It produces odds ratios you can communicate to clinicians: "Each 10-year increase in age roughly doubles stroke odds (an increase of about 97%), controlling for all other factors." The model assumes linear relationships on the log-odds scale and no interactions unless explicitly specified.

XGBoost builds an ensemble of decision trees sequentially, with each tree correcting the errors of those before it and each split chosen to most reduce the training loss. It can capture non-linear effects and interactions without requiring you to specify them. Age might interact with hypertension: the stroke risk increase from high blood pressure could be much larger in patients over 65. Logistic regression won't detect this unless you manually add an age × hypertension interaction term. XGBoost can pick it up on its own.
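
To make the interaction idea concrete: once an age × hypertension term is in a logistic model, the hypertension odds ratio is no longer one number but a function of age. The sketch below uses hypothetical coefficients chosen purely for illustration; they are not estimated from the article's data.

```python
import math

# Hypothetical log-odds coefficients (illustration only, not fitted values).
B_HTN = 0.30         # hypertension main effect
B_AGE_X_HTN = 0.01   # additional hypertension effect per year of age

def hypertension_odds_ratio(age: float) -> float:
    """Odds ratio for hypertension at a given age in a model with an
    age x hypertension interaction: exp(b_htn + b_interaction * age)."""
    return math.exp(B_HTN + B_AGE_X_HTN * age)

for age in (45, 65, 85):
    print(age, round(hypertension_odds_ratio(age), 2))
```

With a positive interaction coefficient the odds ratio grows with age, which is the pattern a main-effects-only model would average away.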

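The tradeoff: XGBoost often achieves a higher AUC when non-linearities and interactions are present, but the size of the gain varies by dataset and should be confirmed on held-out data. It also provides no coefficients to read off; without post-hoc tools such as SHAP values, it behaves as a black-box predictor. Use logistic regression when you need to explain why the model flagged a patient as high-risk. Use XGBoost when prediction accuracy is paramount and you don't need to justify individual predictions to a clinical team.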

In practice, run both. Compare the feature rankings. If both models agree that age and hypertension are the top predictors, you have convergent evidence. If they disagree—logistic regression says glucose matters most, XGBoost says age—that tells you interactions or non-linearities are present, and you should investigate further.

Risk Factor Summary by Stroke Outcome

Start with descriptive statistics before fitting any model. This table shows mean age, average glucose, and BMI for stroke vs. non-stroke patients. Stroke patients are older (mean age 67.7 vs. 42.0 years), have higher average glucose levels (132.5 vs. 105.3 mg/dL), and slightly higher BMI (30.5 vs. 28.8).

These are raw, unadjusted differences. Age shows the largest gap—a 25.7-year difference between groups. But correlation is not causation, and univariate differences don't account for confounding. Older patients are also more likely to have hypertension and heart disease. The glucose difference might be driven entirely by age, not by an independent glucose effect.

This table tells you which variables to include in the multivariable model. Any predictor with a meaningful difference between stroke and non-stroke groups is a candidate. Variables with nearly identical means (like BMI, which differs by only 1.7 points) might not contribute much predictive power, but include them anyway. If they're truly unimportant, the fitted coefficient will sit near zero with a confidence interval that spans no effect. Only a penalized model (L1 or L2 regularization) actively shrinks coefficients toward zero.

Before proceeding, check for missing data. If 30% of glucose values are missing and missingness correlates with stroke outcome, your estimates will be biased. Multiple imputation is the safer default; restricting the analysis to complete cases is defensible only when missingness is plausibly unrelated to the outcome. Either way, never silently drop missing values without documenting the decision.
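
A quick way to see whether missingness tracks the outcome is to tabulate the missing share separately for stroke and non-stroke patients. This is a minimal sketch on made-up records; the field names `stroke` and `glucose` are assumptions about how your data is laid out.

```python
from collections import defaultdict

def missing_rate_by_outcome(records, field, outcome="stroke"):
    """Share of records with a missing `field`, split by outcome value."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for row in records:
        total[row[outcome]] += 1
        if row.get(field) is None:
            missing[row[outcome]] += 1
    return {k: missing[k] / total[k] for k in total}

# Tiny made-up example; None marks a missing glucose value.
records = [
    {"stroke": 1, "glucose": None}, {"stroke": 1, "glucose": 131.0},
    {"stroke": 0, "glucose": 98.5}, {"stroke": 0, "glucose": 104.2},
    {"stroke": 0, "glucose": None}, {"stroke": 0, "glucose": 110.7},
]
rates = missing_rate_by_outcome(records, "glucose")
print(rates)  # {1: 0.5, 0: 0.25}
```

A large gap between the two rates is a signal that complete-case analysis could bias the estimates.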

Age Distribution by Stroke Outcome

Box plots reveal distributional differences that summary statistics hide. The median age for stroke patients is 70 years; for non-stroke patients, it's 43. The interquartile range (IQR) for stroke patients spans roughly 60-78 years, while the non-stroke IQR spans 33-58. The two IQRs do not overlap at all, although the whiskers and outliers do.

This tells you age is a strong predictor—probably the strongest in the dataset. The separation between distributions is clean. If you built a model using only age, it would perform reasonably well. But notice the outliers: the stroke group includes patients as young as 40, and the non-stroke group includes patients over 80. Age alone doesn't determine stroke risk. Some young patients have multiple comorbidities; some elderly patients have none.

From an experimental design perspective, this distribution suggests age is a potential confounder for every other predictor. Hypertension prevalence increases with age. Glucose levels increase with age. If you compare stroke rates between hypertensive and normotensive patients without adjusting for age, you're confounding hypertension with age. The multivariable model accounts for this by including age as a covariate.

One caution: if your outcome is rare and your sample size is small, extreme age values can exert disproportionate influence on the logistic regression coefficients. Check for high-leverage points. If a single 85-year-old stroke patient is driving your entire age effect, your model isn't robust.

Stroke Rate by Clinical Risk Factor Group

Raw incidence rates by subgroup give you a sense of effect magnitude before modeling. Patients with heart disease have a 17.0% stroke rate compared to 4.4% for patients without heart disease. Hypertension shows a similar pattern: 9.0% stroke rate vs. 3.5% for normotensive patients. Gender differences are modest: 5.1% for females vs. 4.5% for males.

These are unadjusted rates. Patients with heart disease are older on average than patients without heart disease. The 17% stroke rate could be entirely explained by age, or it could reflect an independent heart disease effect. You can't tell from this chart alone.

Marital status shows an interesting pattern: formerly married patients (widowed, divorced, separated) have higher stroke rates than those never married or currently married. This likely reflects age confounding—formerly married patients tend to be older. But it could also reflect socioeconomic factors, social isolation, or access to care. Unless you include marital status in the multivariable model, you won't know if it has independent predictive value.

Use these rates to set clinical priorities. If hypertensive patients have a 9% stroke rate and you can reduce that to 6% through blood pressure management, you'll prevent 3 strokes per 100 patients treated. That's a number needed to treat (NNT) of 33. Whether that justifies intervention depends on cost, side effects, and patient preferences—but the analysis gives you the baseline risk to work from.
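
The NNT arithmetic above can be reproduced in a few lines. The 9% baseline comes from the article's hypertension subgroup; the 6% treated risk is the article's hypothetical target, not an observed result.

```python
def number_needed_to_treat(control_risk: float, treated_risk: float) -> float:
    """NNT = 1 / absolute risk reduction."""
    arr = control_risk - treated_risk
    if arr <= 0:
        raise ValueError("treated_risk must be lower than control_risk")
    return 1.0 / arr

nnt = number_needed_to_treat(0.09, 0.06)
print(f"ARR = {0.09 - 0.06:.2%}, NNT = {nnt:.1f}")  # ARR = 3.00%, NNT = 33.3
```

Some reporting conventions round NNT up to the next whole patient (34 here); the article's figure of 33 is the nearest-integer rounding of the same value.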

Logistic Regression Odds Ratios

Now we get to the core question: which risk factors independently predict stroke after controlling for all others? Age shows the strongest effect: odds ratio 1.07 per year, meaning each additional year of age increases stroke odds by 7%. A 70-year-old has roughly 4× the stroke odds of a 50-year-old, holding all other factors constant (1.07^20 ≈ 3.87).

Hypertension has an adjusted odds ratio of 2.12 (95% CI: 1.56-2.88), meaning hypertensive patients have 2.1× the stroke odds of normotensive patients of the same age, glucose level, BMI, and other characteristics. This is the association that remains after adjusting for age and the other covariates. The raw stroke rate ratio was 2.57 (9.0% ÷ 3.5%), and the corresponding raw odds ratio, which is the like-for-like comparison, is about 2.73. The adjusted odds ratio is lower because some of the hypertension-stroke association is explained by age.

Heart disease shows an adjusted odds ratio of 1.85 (95% CI: 1.21-2.82). Average glucose level has an odds ratio of 1.005 per mg/dL—each 10-point increase in glucose raises stroke odds by about 5% (1.005^10 ≈ 1.051). BMI is not statistically significant (OR 1.01, 95% CI: 0.98-1.04), suggesting that once you account for age, hypertension, and glucose, body mass index adds little independent predictive information.
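
Per-unit odds ratios compound multiplicatively, so converting them to a clinically meaningful step is a matter of exponentiation. The sketch below reproduces the article's age and glucose conversions; the function name is illustrative.

```python
def scaled_odds_ratio(or_per_unit: float, units: float) -> float:
    """Odds ratio for a change of `units`, given a per-unit odds ratio."""
    return or_per_unit ** units

age_20y = scaled_odds_ratio(1.07, 20)      # 70-year-old vs. 50-year-old
glucose_10 = scaled_odds_ratio(1.005, 10)  # +10 mg/dL average glucose
print(f"{age_20y:.2f}  {glucose_10:.3f}")  # 3.87  1.051
```

This also shows why raw per-unit odds ratios should not be compared across predictors: 1.07 per year looks small next to 2.12 for hypertension until it is scaled to a 20-year gap.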

Confidence intervals matter as much as point estimates. The age effect is tightly estimated (CI: 1.06-1.08). The heart disease effect has a wider interval (1.21-2.82), reflecting greater uncertainty—we have fewer heart disease patients in the sample. If a confidence interval includes 1.0, the effect is not statistically significant at conventional thresholds. That doesn't mean the true effect is exactly zero; it means the data are consistent with no effect, and you should not make strong clinical recommendations based on that predictor.
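
Odds-ratio confidence intervals are built on the log-odds scale and then exponentiated, which is why they are asymmetric around the point estimate. The sketch below assumes a standard Wald interval and backs out the standard error implied by the article's reported hypertension interval; the article does not state which interval method was used.

```python
import math

Z95 = 1.959964  # two-sided 95% quantile of the standard normal

def odds_ratio_ci(beta: float, se: float):
    """Wald 95% CI for an odds ratio from a log-odds coefficient and its SE."""
    return math.exp(beta - Z95 * se), math.exp(beta + Z95 * se)

def se_from_ci(lower: float, upper: float) -> float:
    """Log-odds standard error implied by a reported 95% CI."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)

# Hypertension as reported in the article: OR 2.12 (95% CI 1.56-2.88).
se = se_from_ci(1.56, 2.88)
ci_low, ci_high = odds_ratio_ci(math.log(2.12), se)
print(f"SE = {se:.3f}, CI = {ci_low:.2f}-{ci_high:.2f}")
```

The round trip recovers the reported interval, which is a handy consistency check when reading published odds ratios.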

XGBoost Feature Importance

XGBoost ranks age as the most important feature, with an importance score of 0.35—consistent with logistic regression. Average glucose level ranks second (importance 0.22), higher than its logistic regression ranking. BMI ranks third (0.15), despite being non-significant in logistic regression. Hypertension ranks fourth (0.12), lower than its logistic regression ranking.

Why the disagreement? XGBoost importance scores measure each feature's contribution to prediction accuracy, including interaction effects. Glucose might interact with age: the stroke risk increase from high glucose could be much steeper in older patients. Logistic regression doesn't capture this unless you manually add an age × glucose interaction term. XGBoost finds it automatically by splitting first on age, then on glucose within age bins.

BMI's higher XGBoost ranking suggests non-linear effects. Perhaps stroke risk is elevated only for BMI > 35, not across the entire BMI range. Logistic regression assumes a linear relationship on the log-odds scale: each 1-point BMI increase has the same effect whether you're going from 20 to 21 or from 40 to 41. XGBoost can model threshold effects: low risk below BMI 30, moderate risk from 30-35, high risk above 35.

Which model should you trust? Both. Logistic regression tells you the average independent association of each predictor with stroke. XGBoost tells you which features most improve prediction when interactions and non-linearities are allowed. If you're building a clinical decision support tool, use XGBoost for the predictions and logistic regression to explain the key drivers to clinicians. If you're designing an intervention, focus on modifiable factors with large, precisely estimated adjusted odds ratios, such as hypertension. Age has the strongest association, but it is not a lever anyone can pull, and these remain associations rather than proven causal effects.

Feature Importance ≠ Causal Importance

XGBoost feature importance measures predictive contribution, not causal effect size. If age and hypertension are highly correlated, the model might assign most of the importance to age simply because it splits on age first. That doesn't mean hypertension is unimportant causally; it means age captures most of the information hypertension provides. For adjusted associations, rely on logistic regression's odds ratios rather than XGBoost importance scores, and remember that with observational data neither one is a causal effect estimate.

How to Interpret Your Results and Avoid Common Pitfalls

Once you have odds ratios and feature importance scores, the temptation is to conclude "age causes strokes" or "hypertension increases stroke risk." Resist that temptation. This is an observational study. You didn't randomize patients to high vs. low age or hypertensive vs. normotensive conditions. Confounding could still be present if you failed to measure important variables.

What if frailty is the true driver? Frailty correlates with age and may contribute to both hypertension and stroke. Your model attributes the effect to age and hypertension because frailty isn't measured. This is unmeasured confounding, and no amount of adjustment for the variables you did measure fixes it. Addressing it takes a different design, such as a randomized trial (not feasible for age) or an instrumental variable approach (which requires specialized methods and strong assumptions).

Second pitfall: assuming linearity without checking. Logistic regression assumes each year of age has the same multiplicative effect on stroke odds. But what if stroke risk is flat until age 50, then accelerates? Fit a model with age splines or categorize age into non-overlapping bins (40-49, 50-59, 60-69, 70+) and compare odds ratios. If the age effect is non-linear, a single linear term will misstate risk across parts of the age range, overestimating it in some age bands and underestimating it in others.
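
A first look at linearity needs nothing more than observed stroke rates per age band. The sketch below uses made-up toy data and assumed bin edges; in practice you would compare these binned rates (or their log-odds) against what the linear model predicts.

```python
import bisect

def stroke_rate_by_age_bin(ages, strokes, edges=(40, 50, 60, 70)):
    """Observed stroke rate per age bin; default bins are
    <40, 40-49, 50-59, 60-69, and 70+. Empty bins yield None."""
    counts = [0] * (len(edges) + 1)
    events = [0] * (len(edges) + 1)
    for age, stroke in zip(ages, strokes):
        b = bisect.bisect_right(edges, age)  # index of the bin containing `age`
        counts[b] += 1
        events[b] += stroke
    return [e / c if c else None for e, c in zip(events, counts)]

# Made-up toy data, for illustration only.
ages = [35, 45, 48, 55, 58, 62, 66, 72, 79, 84]
strokes = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
rates = stroke_rate_by_age_bin(ages, strokes)
print(rates)
```

Rates that stay flat across the younger bins and then climb steeply are the signature of the flat-then-accelerating pattern described above.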

Third pitfall: ignoring class imbalance. If your stroke rate is 2%, a naive model that predicts "no stroke" for every patient achieves 98% accuracy. Use AUC (area under the ROC curve) instead of accuracy to evaluate model performance. AUC measures discrimination: can the model separate stroke from non-stroke patients across all possible probability thresholds? An AUC of 0.75 means a randomly selected stroke patient has a higher predicted probability than a randomly selected non-stroke patient 75% of the time.
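
The pairwise interpretation of AUC can be computed directly, which makes the definition easy to verify on a small example. This is a brute-force sketch on made-up scores, fine for illustration but quadratic in sample size; library implementations use a rank-based shortcut.

```python
def pairwise_auc(pos_scores, neg_scores) -> float:
    """Share of (stroke, non-stroke) pairs where the stroke patient has
    the higher predicted probability; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up predicted probabilities: two stroke patients, three non-stroke.
auc = pairwise_auc([0.9, 0.4], [0.5, 0.3, 0.1])
print(round(auc, 3))  # 0.833

# Accuracy of always predicting "no stroke" at the 2% example rate.
naive_accuracy = 1 - 0.02
print(naive_accuracy)  # 0.98
```

The naive classifier scores 98% accuracy yet ranks no one, so its AUC is 0.5; that contrast is the reason to prefer AUC here.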

Fourth pitfall: training and testing on the same data. Your model might achieve 0.85 AUC on the training set by overfitting noise. Hold out 20-30% of data for testing, or use 10-fold cross-validation. If training AUC is 0.85 but test AUC is 0.65, your model doesn't generalize. Reduce model complexity, increase sample size, or remove weak predictors.
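
With a rare outcome, a plain random split can leave the test set with very few events, so it helps to split within each class. The following is a minimal stdlib sketch of a stratified holdout on made-up labels; the function name and the 25% default are assumptions, and libraries such as scikit-learn offer equivalent utilities.

```python
import random

def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test so each class keeps roughly the same share."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

labels = [1] * 8 + [0] * 152  # made-up sample with a 5% event rate
train_idx, test_idx = stratified_holdout(labels)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_rate)  # 120 40 0.05
```

The test set keeps the 5% event rate of the full sample, so its AUC is estimated on the true class distribution.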

When Should You Run This Analysis?

Stroke risk factor analysis is appropriate when you have a binary outcome (stroke yes/no), multiple potential predictors, and a dataset large enough to support multivariable modeling. Here are the conditions that justify this approach:

You have sufficient outcome events. At least 100 stroke cases for a 5-10 predictor model. Below 50 events, confidence intervals become too wide for clinical utility. Above 500 events, you can reliably detect even small effect sizes and test for interactions.

You need to isolate independent effects. If you're comparing stroke rates across hospitals and want to adjust for case mix (age, comorbidities), multivariable regression is the right tool. If you're just describing raw rates, skip the modeling and present stratified counts.

You're building a risk prediction tool. Logistic regression produces a probability score for each patient. Clinicians can use that score to stratify patients into low, medium, and high risk groups, then target preventive interventions to high-risk patients. XGBoost improves prediction accuracy at the cost of interpretability.

You've measured the important confounders. If your dataset lacks information on smoking, family history, or prior TIA, your odds ratios will be biased. Measure what matters, or acknowledge the limitation in your interpretation. Never present adjusted odds ratios as causal effects when unmeasured confounding is likely.

Validating Your Model on New Data

A model trained on data from one hospital might not generalize to another hospital with different patient demographics or care protocols. External validation is essential before deploying a risk prediction tool in clinical practice.

Collect data from a separate patient cohort—different time period, different facility, or different geographic region. Apply your trained model (using the exact same coefficients) to the new data and calculate AUC. If training AUC was 0.80 and external validation AUC is 0.78, the model generalizes well. If external AUC drops to 0.60, the model overfit to site-specific patterns and won't be useful elsewhere.

Check calibration, not just discrimination. A well-calibrated model's predicted probabilities match observed frequencies. If the model predicts 10% stroke risk for 100 patients, roughly 10 of those patients should experience strokes. Plot predicted probabilities against observed outcomes in deciles. If high-risk predictions are too high (predicted 20%, observed 12%), recalibrate using Platt scaling or isotonic regression.
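
The decile comparison described above amounts to sorting patients by predicted risk, cutting them into equal-sized groups, and comparing mean prediction with observed rate in each group. A minimal sketch on made-up data follows; it uses two bins rather than ten so the numbers are easy to follow.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Mean predicted risk vs. observed event rate in equal-sized bins
    ordered by predicted probability (deciles when n_bins=10)."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed, len(chunk)))
    return rows

# Made-up example: the model predicts 10% and 20% for two groups of 10 patients,
# but one stroke occurs in each group.
probs = [0.10] * 10 + [0.20] * 10
outcomes = [1] + [0] * 9 + [1] + [0] * 9
rows = calibration_table(probs, outcomes, n_bins=2)
for mean_pred, observed, size in rows:
    print(f"predicted {mean_pred:.0%}  observed {observed:.0%}  n={size}")
```

Here the low-risk group is well calibrated while the high-risk group is overpredicted (20% predicted, 10% observed), the pattern that would prompt recalibration.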

Update the model periodically. Stroke risk factors shift over time as treatment guidelines change and population health improves. A model trained on 2015 data might overestimate stroke risk in 2026 if blood pressure control has improved. Retrain annually or whenever the external validation AUC drops below acceptable thresholds.

Communicating Results to Clinical Stakeholders

Data scientists love odds ratios and AUC curves. Clinicians want actionable insights. Here's how to bridge the gap:

Translate odds ratios into absolute risk. "Hypertension increases stroke odds by 112%" is less useful than "Among 65-year-olds, hypertension raises 5-year stroke risk from 8% to about 16%" (illustrative figures, obtained by applying the odds ratio of 2.12 to an assumed 8% baseline). Calculate absolute risk using the logistic regression equation, then present risk differences for clinically relevant subgroups.
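
Converting an odds ratio into an absolute risk means moving from probability to odds, multiplying, and converting back. The sketch below applies the article's hypertension odds ratio (2.12) to an assumed 8% baseline risk; the baseline is illustrative, not an estimate from the dataset.

```python
def risk_with_factor(baseline_risk: float, odds_ratio: float) -> float:
    """Apply an odds ratio to a baseline risk and return the resulting probability."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)

exposed = risk_with_factor(0.08, 2.12)
print(f"{exposed:.1%}")                        # 15.6%
print(f"risk difference: {exposed - 0.08:.1%}")  # risk difference: 7.6%
```

Note that the risk does not simply double: an odds ratio of 2.12 moves an 8% risk to roughly 15.6%, and the gap between odds ratio and risk ratio widens as the baseline risk rises.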

Use risk stratification tiers. Classify patients into low (predicted risk < 5%), medium (5-15%), and high (> 15%) categories. Recommend different management strategies for each tier: monitor low-risk patients annually, screen medium-risk patients every 6 months, treat high-risk patients aggressively. Clinicians think in risk categories, not continuous probabilities.
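
The tier cutoffs translate directly into a small lookup function. This sketch uses the article's thresholds (5% and 15%); assigning the boundary values to the medium tier is an assumption, since the text does not say which side they fall on.

```python
def risk_tier(predicted_risk: float) -> str:
    """Map a predicted stroke probability to the low / medium / high tiers."""
    if not 0.0 <= predicted_risk <= 1.0:
        raise ValueError("predicted_risk must be a probability in [0, 1]")
    if predicted_risk < 0.05:
        return "low"
    if predicted_risk <= 0.15:
        return "medium"  # assumption: both boundary values count as medium
    return "high"

for p in (0.03, 0.05, 0.15, 0.22):
    print(p, risk_tier(p))
```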

Acknowledge uncertainty. Confidence intervals communicate precision. "Hypertension OR 2.1 (95% CI: 1.6-2.9)" tells clinicians the effect could plausibly range from a 60% increase to a 190% increase. Wide intervals mean more data are needed. Narrow intervals mean the estimate is reliable.

Explain what the model doesn't capture. If family history isn't in the dataset, say so. If the model predicts first stroke but not recurrent stroke, clarify the scope. Clinicians combine model predictions with clinical judgment; they need to know where the model's blind spots are.

Full Interactive Report

The charts and tables in this article are static screenshots. The actual analysis produces an interactive report where you can explore confidence intervals, drill into subgroups, and export results for presentation. Here's what the full report includes:

  • Risk factor summary with means and standard deviations by outcome
  • Age distribution box plots with median, IQR, and outliers
  • Stroke incidence rates by clinical and demographic subgroups
  • Logistic regression coefficients, odds ratios, and 95% confidence intervals
  • XGBoost feature importance with SHAP values for individual predictions
  • ROC curves and AUC for both models on held-out test data
  • Calibration plots showing predicted vs. observed stroke rates by decile
  • Individual patient risk scores with feature contributions

Frequently Asked Questions

What's the difference between logistic regression and XGBoost for stroke risk prediction?

Logistic regression estimates independent odds ratios for each risk factor, assuming linear relationships on the log-odds scale. XGBoost builds an ensemble of decision trees that can capture non-linear effects and complex interactions without explicit modeling. Use logistic regression when you need interpretable coefficients for clinical communication; use XGBoost when prediction accuracy is paramount.

How many patients do I need for a reliable stroke risk factor analysis?

The standard rule is 10-15 events per predictor variable. If you're testing 8 risk factors and your stroke rate is 5%, you need at least 1,600 patients to observe 80 strokes (10 events × 8 predictors). Underpowered analyses produce unstable odds ratios and wide confidence intervals that aren't clinically useful.
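
The same rule can be turned around to plan a study: required events divided by the expected stroke rate gives the minimum cohort size. A minimal sketch, with an illustrative function name and the 10-events default taken from the rule above:

```python
import math

def patients_needed(n_predictors: int, event_rate: float,
                    events_per_predictor: int = 10) -> int:
    """Minimum cohort size so the expected event count meets the rule of thumb."""
    if not 0.0 < event_rate <= 1.0:
        raise ValueError("event_rate must be in (0, 1]")
    events_needed = n_predictors * events_per_predictor
    # Small epsilon guards against floating-point noise pushing ceil() up by one.
    return math.ceil(events_needed / event_rate - 1e-9)

print(patients_needed(8, 0.05))      # 1600
print(patients_needed(8, 0.05, 15))  # 2400
```

Using the upper end of the rule (15 events per predictor) raises the requirement to 2,400 patients for the same eight risk factors.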

Why do logistic regression and XGBoost rank features differently?

Logistic regression odds ratios measure the independent association of one variable with the outcome while holding all others constant, and they are expressed per unit of that variable. XGBoost feature importance measures each variable's contribution to prediction accuracy, including interaction effects. Age might have a modest-looking per-year odds ratio (logistic regression) but be the best splitter in decision trees (XGBoost) because it spans a wide range and interacts with hypertension and glucose levels.

Can I use this analysis to predict individual patient stroke risk?

Yes, but with important caveats. Both models output a probability score for each patient. Logistic regression provides interpretable coefficients you can explain to clinicians. XGBoost typically achieves higher AUC but functions as a black box. For clinical decision support, validate your model on held-out data, calibrate probability thresholds to your risk tolerance, and always combine statistical predictions with clinical judgment.

What stroke rate justifies building a prediction model?

You need sufficient outcome events, not a specific rate. A 2% stroke rate in 10,000 patients (200 events) supports a robust model. A 20% rate in 200 patients (40 events) does not. Class imbalance matters more for model training than analysis validity. Use SMOTE, class weights, or stratified sampling if your stroke rate is below 5%, but never artificially balance your test set—evaluate on the true population distribution.
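
If you go the class-weight route, a common starting value is the ratio of negatives to positives, which is what XGBoost's `scale_pos_weight` parameter is typically set to. The sketch below computes it from the article's counts; treat the result as a starting point to tune, not a prescribed setting.

```python
def positive_class_weight(n_total: int, n_events: int) -> float:
    """Negatives divided by positives: a common starting weight for the rare class."""
    if n_events <= 0 or n_events >= n_total:
        raise ValueError("need at least one event and one non-event")
    return (n_total - n_events) / n_events

weight = positive_class_weight(5110, 249)  # article's dataset: 249 strokes in 5,110 records
print(round(weight, 2))  # 19.52
```

Weighting changes the predicted probabilities, so calibration should be rechecked on an unweighted test set afterward.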
