Analytics · Housing · Prices · Value Drivers

Executive Summary

Key findings from the housing value driver analysis

metric	value
n_observations	506
r_squared	0.8549
rmse_test	3.3
river_premium	-0.17
n_features	13
adj_r_squared	0.8501

The linear regression model explains 85.5% of variance in median home values (R² = 0.8549, Adjusted R² = 0.8501). The three strongest predictors by random forest importance are: lower_status_pct, distance_employment, crime_rate. On the held-out test set (101 properties) the model achieves RMSE = $3.3k. The Charles River location effect is small or negligible in this dataset.
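The adjusted R² can be checked against the reported figures. The sketch below assumes the regression was fitted on the 405 tracts left after holding out the 101 test properties; the report does not state the training size explicitly, so `n_train` is an inference, and the function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for a model with p predictors fitted on n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Assumption: training size = 506 total tracts - 101 held-out tracts.
n_train = 506 - 101
adj = adjusted_r2(0.8549, n_train, 13)
print(round(adj, 4))  # 0.8501, matching the reported adj_r_squared
```

Under this assumption the penalty for 13 predictors is small, which is why the adjusted and unadjusted values differ only in the third decimal place.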

Visualization

Feature Importance Rankings

Random forest permutation importance — which features best predict home values?

Interpretation

Random forest permutation importance (%IncMSE) ranks all 13 predictors by how much the mean squared prediction error increases when each feature's values are randomly shuffled. 'lower_status_pct' is the strongest driver with an importance score of 96.3: scrambling its values hurts prediction accuracy most. Features with low scores add little predictive power beyond what other variables already capture. Importance measures predictive signal, not the direction or causality of an effect.
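The shuffling idea can be illustrated with a minimal sketch. It reports the raw increase in MSE for a toy model on hypothetical data; it is not the scaled %IncMSE score shown in the chart, and `permutation_importance`, `toy_model`, and the data are illustrative names and values only.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, col, seed=0):
    """Increase in MSE after shuffling column `col` across all rows."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(y, [predict(r) for r in permuted]) - baseline


# Hypothetical toy data: the target depends on column 0 only.
rows = [[float(i), float(i % 3)] for i in range(20)]
y = [2.0 * r[0] for r in rows]


def toy_model(row):
    """Stand-in for a fitted model; it ignores column 1 entirely."""
    return 2.0 * row[0]


imp_signal = permutation_importance(toy_model, rows, y, col=0)
imp_noise = permutation_importance(toy_model, rows, y, col=1)
print(imp_signal, imp_noise)
```

Shuffling the column the model relies on raises the error, while shuffling the ignored column leaves it unchanged, which is the contrast the ranking expresses.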

Visualization

Feature Correlation Matrix

Pairwise Pearson correlations among all predictors and the target

Interpretation

The heatmap displays Pearson correlations among all 14 variables (features + target). 'lower_status_pct' has the strongest correlation with home value (r = -0.873). Dark red cells signal strong positive co-movement; dark blue signals strong negative association. Pairs of predictors with |r| > 0.7 are collinear and may inflate each other's regression standard errors.
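The collinearity screen described above can be sketched in a few lines. The column names and values below are hypothetical, and `collinear_pairs` is an illustrative helper rather than the report's own code; only the 0.7 threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def collinear_pairs(columns, threshold=0.7):
    """Return name pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical predictor columns, for illustration only.
toy_columns = {
    "feat_a": [1, 2, 3, 4, 5],
    "feat_b": [2, 4, 6, 8, 11],
    "feat_c": [3, 1, 4, 1, 5],
}
flagged = collinear_pairs(toy_columns)
print(flagged)
```

Pairs returned by such a screen are candidates for closer inspection, for example by dropping one member or checking variance inflation.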

Visualization

Lower Status vs. Home Value

Scatter of key socioeconomic predictor against median home value

Interpretation

Each point represents one census tract. The scatter shows a strong negative relationship (r = -0.873) between the percentage of lower-status residents and median home value. Home values fall most steeply at low lower_status_pct and flatten as lower_status_pct rises, a non-linear pattern that the random forest captures better than linear regression alone. The original Boston Housing data are top-coded at $50k; the maximum home value in this dataset is $37.4k (see Feature Descriptive Statistics), so any clustering at the upper end sits at that maximum rather than at $50k.

Visualization

Regression Coefficients by Feature

Standardised OLS coefficients showing direction and magnitude of each feature's linear effect

Interpretation

Standardised OLS coefficients allow fair comparison across predictors with different units. 'distance_employment' has the strongest positive effect on home values. 'lower_status_pct' has the largest negative effect. 4 of 13 predictors are statistically significant (p < 0.05). Bars pointing right indicate value-raising features; bars pointing left indicate value-reducing ones.
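As a rough illustration of the rescaling, a raw slope can be converted to a standardised one by multiplying by the predictor's standard deviation and dividing by the target's. The standard deviations below are taken from the Feature Descriptive Statistics table; the raw coefficient of -0.5 is a hypothetical value, since the report does not list unstandardised coefficients.

```python
def standardised_coef(raw_coef: float, sd_x: float, sd_y: float) -> float:
    """Convert a raw OLS slope into standard-deviation units."""
    return raw_coef * sd_x / sd_y


# sd values from the descriptive statistics table; raw_coef is hypothetical.
sd_lower_status = 6.797
sd_home_value = 7.745
beta = standardised_coef(-0.5, sd_lower_status, sd_home_value)
print(round(beta, 3))
```

The result reads as the change in home value, in standard deviations, for a one standard deviation change in the predictor, which is what makes bars in the chart comparable.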

Visualization

Predicted vs. Actual Home Values

Out-of-sample prediction accuracy on the held-out test set

Interpretation

Predicted vs. actual home values on the held-out test set (101 properties). Points on the 45° diagonal are perfect predictions; points above overestimate, points below underestimate. Test-set RMSE is $3.3k and R² = 0.8181, a modest drop from the in-sample R² of 0.8549, which indicates that the model generalises well beyond the training data. Systematic under-prediction at the upper end of the range is consistent with a cap on recorded home values: the maximum in this dataset is $37.4k, and the original Boston Housing data are top-coded at $50k.
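The two test-set metrics quoted above follow from the residuals. A minimal sketch on hypothetical values (the four actual and predicted prices below are made up for illustration and are not the report's data):

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted home values in $k.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 36.0]
toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
print(round(toy_rmse, 3), round(toy_r2, 3))
```

Computing both metrics on data the model never saw is what lets the test-set figures be compared against the in-sample fit.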

Visualization

Home Value Distribution

Distribution of median home values across all census tracts

Interpretation

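The median home value across the 506 tracts is $24.4k (mean: $22.9k). The mean lies below the median, which points to a longer tail of low-value tracts (left skew) rather than right skew, alongside a distinct cluster of high-value tracts. No strong price-ceiling artefact is visible in this dataset: the maximum is $37.4k, below the $50k top-code of the original Boston Housing data. Distinct low-value and mid-value clusters likely correspond to inner-city and suburban neighborhoods.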

Visualization

Riverside vs. Non-Riverside Home Values

Mean home values comparing Charles River tracts to non-riverside neighborhoods

Interpretation

Tracts bordering the Charles River (n = 38) have a mean home value of $22.71k vs. $22.88k for non-riverside tracts (n = 468). Riverside tracts are therefore $0.17k lower on average, and the difference is not statistically significant at the 5% level by Welch's two-sample t-test (p = 0.8972). The data show no waterfront premium, and the small sample of riverside tracts means their mean is estimated less precisely than the non-riverside mean.
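Welch's test can be computed from group summaries alone. In the sketch below the means and group sizes come from the report, but the group standard deviations are not reported, so the value 7.7 used for both groups is a hypothetical placeholder. The function returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value needs a t-distribution CDF, which is left out here.

```python
from math import sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Means and sizes from the report; both sd values (7.7) are hypothetical.
t_stat, df = welch_t(22.71, 7.7, 38, 22.88, 7.7, 468)
print(round(t_stat, 3), round(df, 1))
```

Because the riverside group is small, its variance term dominates the standard error, which is why a $0.17k gap is far too small to register as significant.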

Data Table

Feature Descriptive Statistics

Range, central tendency, and spread for every housing feature

feature_name	mean_val	median_val	std_val	min_val	max_val
home_value	22.87	24.4	7.745	5	37.4
rooms_avg	6.28	6.261	0.699	4.531	8.113
lower_status_pct	11.21	9.45	6.797	2.32	37.97
crime_rate	1.758	0.281	6.244	0.006	89
pollution_level	0.551	0.546	0.113	0.385	0.871
tax_rate	399	340.5	172.2	187	711
pupil_teacher_ratio	18.43	18.5	2.068	12.6	22
distance_employment	4.91	4.374	2.51	1.13	12.13
age_pct	71.05	74.6	20.2	12.5	99.9
industrial_pct	10.83	9.135	7.778	0.46	27.74
residential_zone_pct	12.81	0	24.28	0	100
highway_access	5.899	5	4.852	1	24
charles_river	0.075	0	0.264	0	1
black_index	345	355.2	42.56	175.8	396.9

Interpretation

Descriptive statistics for all 14 variables (506 tracts). Comparing mean to median reveals distributional skew: 'residential_zone_pct' shows the largest mean-to-median gap relative to its spread, indicating heavy skew that may influence regression assumptions. The wide ranges across features (crime_rate spans orders of magnitude; rooms_avg is tightly bounded) explain why standardisation is essential before comparing coefficient magnitudes.
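The skew comparison can be reproduced from the table by dividing the mean-to-median gap by the standard deviation. This is a rough indicator chosen for illustration, not necessarily the exact measure the report used; the values are copied from the table above.

```python
# (mean, median, std) per feature, copied from the descriptive statistics table.
stats = {
    "home_value": (22.87, 24.4, 7.745),
    "rooms_avg": (6.28, 6.261, 0.699),
    "lower_status_pct": (11.21, 9.45, 6.797),
    "crime_rate": (1.758, 0.281, 6.244),
    "pollution_level": (0.551, 0.546, 0.113),
    "tax_rate": (399, 340.5, 172.2),
    "pupil_teacher_ratio": (18.43, 18.5, 2.068),
    "distance_employment": (4.91, 4.374, 2.51),
    "age_pct": (71.05, 74.6, 20.2),
    "industrial_pct": (10.83, 9.135, 7.778),
    "residential_zone_pct": (12.81, 0, 24.28),
    "highway_access": (5.899, 5, 4.852),
    "charles_river": (0.075, 0, 0.264),
    "black_index": (345, 355.2, 42.56),
}

# Gap between mean and median, scaled by the standard deviation.
skew_gap = {name: (mean - med) / sd for name, (mean, med, sd) in stats.items()}
most_skewed = max(skew_gap, key=lambda name: abs(skew_gap[name]))
print(most_skewed, round(skew_gap[most_skewed], 3))
```

A positive gap indicates a longer upper tail and a negative gap a longer lower tail, so the sign is as informative as the size.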
