Executive Summary
Key findings from hospital length of stay prediction model
Analyzed 2,000 patient admissions using random forest regression. On the 400-admission test set the model achieved R² = 0.846 with an RMSE of 0.97 days; the median predicted stay was 3.8 days. The top-ranked feature is readmission_count, followed by major_depression and hemoglobin, suggesting that prior utilization, comorbidity, and lab status are all associated with hospital duration. The model shows promise for clinical triage applications.
Analysis Overview
Data Quality & Preprocessing
Data Quality
Length of Stay Distribution
Distribution of hospital stay duration across all patients
Hospital stays range from 1 to 14 days, with a median of 4.0 days and a mean of 4.0 days. The middle 50% of patients stay between 2 and 5 days (IQR = 3 days). Most admissions are short, with a tail of extended stays reaching 14 days, consistent with routine acute care mixed with a smaller number of complex cases.
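The summary statistics quoted above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the report's own pipeline: the function name `summarize_los` and the sample values are made up for illustration, and the "inclusive" quartile method is an assumption (other quartile conventions give slightly different IQRs).

```python
import statistics

def summarize_los(stays):
    """Return range, median, mean, and IQR for a list of stay lengths in days."""
    # method="inclusive" treats the data as the full population of interest;
    # this choice is an assumption, not something the report specifies.
    q1, q2, q3 = statistics.quantiles(stays, n=4, method="inclusive")
    return {
        "min": min(stays),
        "max": max(stays),
        "median": q2,
        "mean": statistics.mean(stays),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Made-up illustrative values, NOT the report's data.
example_stays = [1, 2, 2, 3, 4, 4, 5, 5, 7, 14]
summary = summarize_los(example_stays)
print(summary)
```

A mean noticeably above the median in such a summary is a quick signal of a right tail of long stays.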
Length of Stay by Major Comorbidities
Hospital stay duration stratified by end-stage renal disease and dialysis status
Patients with end-stage renal disease on dialysis (n=72) have a median stay of 6.0 days versus 4.0 days for those without (n=1928). This 2.0-day difference points to renal comorbidity as a major correlate of extended hospitalization. It is consistent with dialysis patients having higher clinical complexity that requires longer treatment and recovery, although the comparison is unadjusted and the dialysis group is small.
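A stratified comparison like the one above amounts to grouping admissions by a flag and taking the median within each group. The sketch below assumes hypothetical record keys (`dialysis`, `los_days`) and uses made-up rows; it is not the report's data or code.

```python
import statistics
from collections import defaultdict

def median_by_group(records, group_key, value_key):
    """Return {group: (n, median)} for the given grouping and value keys."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {g: (len(values), statistics.median(values)) for g, values in groups.items()}

# Made-up illustrative rows; key names are hypothetical.
records = [
    {"dialysis": True, "los_days": 5},
    {"dialysis": True, "los_days": 6},
    {"dialysis": True, "los_days": 8},
    {"dialysis": False, "los_days": 2},
    {"dialysis": False, "los_days": 3},
    {"dialysis": False, "los_days": 4},
    {"dialysis": False, "los_days": 4},
    {"dialysis": False, "los_days": 6},
]

result = median_by_group(records, "dialysis", "los_days")
median_gap = result[True][1] - result[False][1]
print(result, median_gap)
```

Reporting the group size next to each median, as the function does, makes it easier to judge how much weight a small stratum can bear.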
Feature Correlation Matrix
Pearson correlations between clinical lab values, vitals, and length of stay
Clinical labs and vitals show varying, generally weak, linear relationships with length of stay. Serum creatinine (a renal function marker) and blood urea nitrogen would be expected to be collinear, but in this dataset they are essentially uncorrelated (r = 0.011). The largest correlation with length of stay is for blood_urea_nitrogen (r = 0.137), a weak association that hints at a modest role for renal status. Because no strongly collinear pairs appear among these variables, collinearity between creatinine and BUN is unlikely to distort the regression coefficients here.
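For readers who want to check a correlation value by hand, the Pearson coefficient can be computed directly from its definition. The sketch below is illustrative only: `pearson_r` and `is_collinear` are hypothetical helper names, and the 0.8 cutoff for flagging collinearity is an assumed rule of thumb, not a threshold taken from the report.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def is_collinear(r, threshold=0.8):
    """Flag a pair as collinear when |r| meets an assumed rule-of-thumb cutoff."""
    return abs(r) >= threshold

# Perfectly linear toy data, for checking the function only.
r_toy = pearson_r([1, 2, 3], [2, 4, 6])

# The creatinine-BUN value quoted in the text would not be flagged.
flag_reported_pair = is_collinear(0.011)
print(r_toy, flag_reported_pair)
```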
Feature Importance Ranking
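Random forest feature importance ranked by predictive power (labeled Mean Decrease Gini; for a regression forest the impurity measure is normally the decrease in residual sum of squares rather than Gini)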
Random forest identifies readmission_count, major_depression, and hemoglobin as the three strongest predictors of length of stay. Together these three features account for 72% of total feature importance. The ranking is led by prior utilization (readmissions), followed by a psychiatric comorbidity and a hematology lab value, suggesting that patient history and overall clinical complexity drive hospital duration more than any single renal marker.
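A statement such as "the top three features account for 72% of importance" is a share of normalized importance. The following sketch shows that arithmetic under stated assumptions: the function name `top_k_share` is hypothetical and the importance scores are made-up placeholders with generic feature names, not the model's actual values.

```python
def top_k_share(importances, k=3):
    """Return (top-k feature names, their share of total importance)."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    share = sum(score for _, score in top) / total
    return [name for name, _ in top], share

# Made-up raw importance scores; names and values are placeholders.
example_importances = {
    "feature_a": 40.0,
    "feature_b": 20.0,
    "feature_c": 12.0,
    "feature_d": 18.0,
    "feature_e": 10.0,
}

top_names, share = top_k_share(example_importances, k=3)
print(top_names, share)
```

Because impurity-based importance is relative, the share depends on which features were included in the model; adding or dropping predictors changes the denominator.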
Actual vs. Predicted Length of Stay
Model predictions vs. observed hospital stay duration on test set
The model achieves R² = 0.846 on the test set with a mean absolute error of 0.69 days. Predictions cluster reasonably around the diagonal, though some underestimation of very long stays (>7 days) is visible. Mean bias is -0.03 days, close to zero, suggesting the model is essentially unbiased on average even though the longest stays are underpredicted.
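The four quantities discussed above (R², MAE, RMSE, and mean bias) all derive from the same residuals. The sketch below computes them from paired actual and predicted values. It assumes bias is defined as predicted minus actual, so a negative value means underestimation; the report does not state its sign convention. The toy numbers are made up.

```python
import math

def regression_metrics(actual, predicted):
    """Compute R^2, MAE, RMSE, and mean bias for paired observations."""
    n = len(actual)
    # Assumed convention: residual = predicted - actual (negative = underestimate).
    residuals = [p - a for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "mean_bias": sum(residuals) / n,
    }

# Made-up toy values for checking the arithmetic.
actual = [2, 4, 6, 8]
predicted = [2, 5, 5, 8]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

A mean bias near zero can coexist with systematic errors at the extremes, since over- and underpredictions cancel in the average; checking residuals by stay-length band would show whether long stays are underpredicted.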
Regression Coefficients (Effect Sizes)
Linear regression coefficients showing marginal effect of each predictor on length of stay (days)
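Linear regression quantifies the direction and size of each effect: Readmission Count (5+) is associated with a stay 5.45 days longer than the reference readmission category, while Readmission Count (Level 1) is associated with a stay 0.93 days shorter. Because these are levels of a categorical predictor, each coefficient is a difference from the reference level, not a per-unit slope. Positive coefficients mark factors associated with longer hospitalization and negative coefficients mark factors associated with shorter stays; neither should be read as causal. This complements the random forest by quantifying specific effect magnitudes for clinical decision support.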
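To make the interpretation of categorical coefficients concrete, the sketch below compares two levels of a dummy-coded factor. It assumes standard treatment (dummy) coding with a single reference level; the reference level's name (`"0"`), the dictionary keys, and the function name are hypothetical, and only the two coefficient values come from the text above.

```python
def predicted_difference(coefficients, level_a, level_b, reference_level):
    """Expected difference in stay (days) between two levels of a dummy-coded factor."""
    def effect(level):
        # The reference level has no coefficient; its effect is 0 by construction.
        return 0.0 if level == reference_level else coefficients[level]
    return effect(level_a) - effect(level_b)

# Coefficients quoted in the text; the key names are illustrative labels.
coefficients = {"5+": 5.45, "1": -0.93}
REFERENCE = "0"  # hypothetical name for the reference level

diff_vs_reference = predicted_difference(coefficients, "5+", REFERENCE, REFERENCE)
diff_between_levels = predicted_difference(coefficients, "5+", "1", REFERENCE)
print(diff_vs_reference, diff_between_levels)
```

Under this coding, the gap between two non-reference levels is the difference of their coefficients, holding all other predictors fixed.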
Model Performance Metrics
Summary of random forest configuration and model performance on the test set
| Metric Name | Metric Value |
|---|---|
| Total Patients | 2000 |
| Train Set Size | 1600 |
| Test Set Size | 400 |
| R² (Test Set) | 0.846 |
| RMSE (Days) | 0.97 |
| MAE (Days) | 0.69 |
| Median Predicted LOS | 3.8 |
| Random Forest MTry | 5 |
| Random Forest Trees | 100 |
The model was trained on 80% of the 2,000 patients (1,600) and evaluated on a held-out test set of 400 patients. Performance metrics show R² = 0.846 with RMSE = 0.97 days, indicating strong predictive power on this test set and typical errors of about one day, which is adequate for clinical triage support (identifying high-risk admissions). Model selection and hyperparameter tuning beyond the current settings (mtry = 5, 100 trees) could further improve accuracy.
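The 80/20 partition described above can be sketched as a seeded shuffle of row indices. This is an illustration under assumptions: the report does not say how the split was drawn (random, stratified, or by date), and the function name and seed are hypothetical.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) from a seeded random shuffle."""
    indices = list(range(n_rows))
    # A fixed seed makes the split reproducible; the value itself is arbitrary.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(2000)
print(len(train_idx), len(test_idx))
```

Keeping the test indices fixed across experiments matters when tuning hyperparameters, since re-drawing the split each time makes metric changes hard to attribute.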