Executive Summary
Key findings from the housing value driver analysis
The linear regression model explains 85.5% of the in-sample variance in median home values (R² = 0.8549, adjusted R² = 0.8501). The three strongest predictors by random forest importance are lower_status_pct, distance_employment, and crime_rate. On the held-out test set (101 tracts), the model achieves RMSE = $3.3k and R² = 0.8181. The Charles River location effect is negligible in this dataset: mean home values for riverside and non-riverside tracts differ by $0.17k, which is not statistically significant (p = 0.8972).
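As a consistency check, the adjusted R² can be reproduced from the reported R² with the standard formula. This minimal sketch assumes the model was fitted on the 405 tracts that remain after holding out 101 of the 506 (the report does not state the training-set size) and that all 13 predictors were included.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumption: training size = 506 total tracts - 101 held-out tracts.
n_train = 506 - 101

# Reported R^2 and the 13 predictors named in the report.
adj = adjusted_r2(r2=0.8549, n_obs=n_train, n_predictors=13)
print(round(adj, 4))
```

Under that assumption the result rounds to 0.8501, which agrees with the reported adjusted R².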
Feature Importance Rankings
Random forest permutation importance — which features best predict home values?
Random forest permutation importance (%IncMSE) ranks all 13 predictors by how much prediction error (mean squared error, conventionally measured on out-of-bag samples for %IncMSE) increases when each feature's values are randomly shuffled. 'lower_status_pct' is the strongest driver, with an importance score of 96.3: scrambling its values hurts prediction accuracy most. Features with low scores add little predictive power beyond what the other variables already capture. Importance measures predictive signal, not the direction or causality of an effect.
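The shuffling idea can be sketched in a few lines. This is a simplified illustration and not the %IncMSE computation used for the report: it scores any prediction function on a fixed dataset instead of using a random forest's out-of-bag samples. The names `permutation_importance` and `toy_predict`, and the toy data, are hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the target depends on feature 0 only.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [30.0 - 2.0 * row[0] for row in X]


def toy_predict(row):
    """Stand-in for a fitted model; it ignores feature 1 entirely."""
    return 30.0 - 2.0 * row[0]


scores = permutation_importance(toy_predict, X, y)
print(scores)
```

In this toy setup, shuffling feature 0 raises the error while shuffling the unused feature 1 leaves it unchanged, which is the pattern the importance ranking summarises.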
Feature Correlation Matrix
Pairwise Pearson correlations among all predictors and the target
The heatmap displays Pearson correlations among all 14 variables (13 features plus the target). 'lower_status_pct' has the strongest correlation with home value (r = -0.873). Dark red cells signal strong positive co-movement; dark blue cells signal strong negative association. Pairs of predictors with |r| > 0.7 are strongly collinear, which can inflate the standard errors of their regression coefficients.
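The |r| > 0.7 screen described above can be sketched as follows. The helper names and the toy columns are hypothetical; only the Pearson formula and the 0.7 threshold come from the text.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def collinear_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical toy columns; feature_a and feature_b move almost together.
toy = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "feature_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = collinear_pairs(toy)
print(pairs)
```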
Lower Status vs. Home Value
Scatter of key socioeconomic predictor against median home value
Each point represents one census tract. The scatter shows a strong negative relationship (r = -0.873) between the percentage of lower-status residents and median home value. The curve falls most steeply at low lower_status_pct values and flattens at high values, suggesting a non-linear component that the random forest captures better than linear regression alone. The original Boston Housing dataset top-codes home values at $50k, but the maximum home value in this dataset is $37.4k (see Feature Descriptive Statistics), so no clustering at a $50k ceiling should be expected here.
Regression Coefficients by Feature
Standardised OLS coefficients showing direction and magnitude of each feature's linear effect
Standardised OLS coefficients express each effect in standard-deviation units, which allows fair comparison across predictors measured in different units. 'distance_employment' has the strongest positive effect on home values, and 'lower_status_pct' has the largest negative effect. Four of the 13 predictors are statistically significant (p < 0.05). Bars pointing right indicate value-raising features; bars pointing left indicate value-reducing ones.
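A raw OLS slope can be converted to a standardised one by rescaling with the standard deviations of the predictor and the target. In the sketch below the raw coefficient is a hypothetical placeholder (the report does not list raw slopes), and the standard deviations are the full-sample values from the descriptive statistics table, which may differ slightly from the training-set values.

```python
def standardised_coefficient(raw_coef: float, sd_x: float, sd_y: float) -> float:
    """Change in y (in SDs of y) per one-SD change in x."""
    return raw_coef * sd_x / sd_y


# Full-sample standard deviations from the descriptive statistics table.
sd_lower_status = 6.797  # lower_status_pct
sd_home_value = 7.745    # home_value, in $k

# Hypothetical raw slope, for illustration only: $k per percentage point.
raw_slope = -0.5

beta_std = standardised_coefficient(raw_slope, sd_lower_status, sd_home_value)
print(round(beta_std, 3))
```

Because every standardised coefficient is on the same SD scale, bar lengths in the chart can be compared directly across features.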
Predicted vs. Actual Home Values
Out-of-sample prediction accuracy on the held-out test set
Predicted vs. actual home values on the held-out test set (101 tracts). Points on the 45° diagonal are perfect predictions; assuming predicted values are plotted on the vertical axis, points above the diagonal are overestimates and points below are underestimates. Test-set RMSE is $3.3k and R² = 0.8181, close to the in-sample R² of 0.8549, which indicates that the model generalises reasonably well beyond the training data. Any under-prediction at the top of the range should be read with care: the maximum home value in this dataset is $37.4k (see Feature Descriptive Statistics), so the $50k price ceiling of the original Boston Housing dataset does not bind here.
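Both test-set metrics follow directly from the paired actual and predicted values. The sketch below shows the two formulas on a small set of hypothetical numbers (in $k); it does not reproduce the report's figures.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, using the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out values in $k, for illustration only.
actual = [20.0, 25.0, 30.0, 15.0]
predicted = [22.0, 24.0, 27.0, 16.0]

test_rmse = rmse(actual, predicted)
test_r2 = r_squared(actual, predicted)
print(round(test_rmse, 3), round(test_r2, 3))
```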
Home Value Distribution
Distribution of median home values across all census tracts
Median home value across the 506 tracts is $24.4k (mean: $22.9k). Because the mean lies below the median, the summary statistics point to a mild left skew (a longer tail of low-value tracts) rather than a right skew. No price ceiling artefact is visible in this dataset: the maximum value is $37.4k. Distinct low-value and mid-value clusters may correspond to inner-city and suburban neighborhoods.
Riverside vs. Non-Riverside Home Values
Mean home values comparing Charles River tracts to non-riverside neighborhoods
Tracts bordering the Charles River (n = 38) have a mean home value of $22.71k vs. $22.88k for non-riverside tracts (n = 468). Riverside tracts are therefore $0.17k lower on average, and the difference is not statistically significant at the 5% level by Welch's two-sample t-test (p = 0.8972). The data show no evidence of a waterfront premium, and the small sample of riverside tracts means its mean is estimated less precisely than the non-riverside mean.
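Welch's test compares two means without assuming equal variances. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom using only the standard library. The two samples are hypothetical, since the report does not give per-group standard deviations; a p-value would then come from the t distribution with these degrees of freedom, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using sample variances (n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical samples with unequal sizes and variances.
group_a = [1.0, 2.0, 3.0, 4.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 3), round(dof, 2))
```

With group sizes as unbalanced as 38 and 468, the smaller group's variance term typically dominates the standard error, which is why the riverside mean is the less precise of the two.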
Feature Descriptive Statistics
Range, central tendency, and spread for every housing feature
| feature_name | mean_val | median_val | std_val | min_val | max_val |
|---|---|---|---|---|---|
| home_value | 22.87 | 24.4 | 7.745 | 5 | 37.4 |
| rooms_avg | 6.28 | 6.261 | 0.699 | 4.531 | 8.113 |
| lower_status_pct | 11.21 | 9.45 | 6.797 | 2.32 | 37.97 |
| crime_rate | 1.758 | 0.281 | 6.244 | 0.006 | 89 |
| pollution_level | 0.551 | 0.546 | 0.113 | 0.385 | 0.871 |
| tax_rate | 399 | 340.5 | 172.2 | 187 | 711 |
| pupil_teacher_ratio | 18.43 | 18.5 | 2.068 | 12.6 | 22 |
| distance_employment | 4.91 | 4.374 | 2.51 | 1.13 | 12.13 |
| age_pct | 71.05 | 74.6 | 20.2 | 12.5 | 99.9 |
| industrial_pct | 10.83 | 9.135 | 7.778 | 0.46 | 27.74 |
| residential_zone_pct | 12.81 | 0 | 24.28 | 0 | 100 |
| highway_access | 5.899 | 5 | 4.852 | 1 | 24 |
| charles_river | 0.075 | 0 | 0.264 | 0 | 1 |
| black_index | 345 | 355.2 | 42.56 | 175.8 | 396.9 |
Descriptive statistics for all 14 variables (506 tracts); home_value is in $k. Comparing mean to median reveals distributional skew: 'residential_zone_pct' shows the largest mean-to-median gap relative to its spread (its median is 0, so at least half the tracts have no residential zoning of this type), indicating heavy right skew that may influence regression assumptions. The wide ranges across features (crime_rate spans several orders of magnitude; rooms_avg is tightly bounded) explain why standardisation is essential before comparing coefficient magnitudes.
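The mean-to-median comparison can be made explicit by dividing the gap by the standard deviation. The sketch below applies that ratio to the mean, median, and standard deviation columns of the table above; the ratio is a rough skew indicator chosen here for illustration, not necessarily the measure used to produce the report.

```python
# (mean, median, std) per feature, copied from the table above.
stats = {
    "home_value": (22.87, 24.4, 7.745),
    "rooms_avg": (6.28, 6.261, 0.699),
    "lower_status_pct": (11.21, 9.45, 6.797),
    "crime_rate": (1.758, 0.281, 6.244),
    "pollution_level": (0.551, 0.546, 0.113),
    "tax_rate": (399, 340.5, 172.2),
    "pupil_teacher_ratio": (18.43, 18.5, 2.068),
    "distance_employment": (4.91, 4.374, 2.51),
    "age_pct": (71.05, 74.6, 20.2),
    "industrial_pct": (10.83, 9.135, 7.778),
    "residential_zone_pct": (12.81, 0, 24.28),
    "highway_access": (5.899, 5, 4.852),
    "charles_river": (0.075, 0, 0.264),
    "black_index": (345, 355.2, 42.56),
}

# Positive ratio: mean above median (right skew); negative: left skew.
gap_ratio = {
    name: (mean_val - median_val) / std_val
    for name, (mean_val, median_val, std_val) in stats.items()
}

most_skewed = max(gap_ratio, key=lambda name: abs(gap_ratio[name]))
for name, ratio in sorted(gap_ratio.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:22s} {ratio:+.3f}")
```

On these figures 'residential_zone_pct' has the largest absolute ratio, and home_value has a negative ratio because its mean is below its median.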