Executive Summary

Key findings from the housing value driver analysis

n_observations: 506
r_squared: 0.8549
rmse_test: 3.3 ($k)
river_premium: -0.17 ($k)
n_features: 13
adj_r_squared: 0.8501
Interpretation

The linear regression model explains 85.5% of the variance in median home values (R² = 0.8549, adjusted R² = 0.8501). A random forest ranks lower_status_pct, distance_employment, and crime_rate as the three strongest predictors by permutation importance. On the held-out test set (101 properties), the model achieves RMSE = $3.3k. The Charles River location effect is negligible in this dataset: riverside tracts average $0.17k less than non-riverside tracts, and the difference is not statistically significant.
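
The reported adjusted R² can be checked against R² with the standard adjustment formula. The sketch below is illustrative only: the training size of 405 is an inference (506 tracts minus the 101 held out), not a figure stated in the report, and the function name is hypothetical.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_features: int) -> float:
    """Standard adjustment: penalise R-squared for the number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

# Assumption: the model was fitted on 506 - 101 = 405 training tracts.
n_train = 506 - 101
adj_train = adjusted_r_squared(0.8549, n_train, 13)

# For comparison: the same adjustment using all 506 tracts.
adj_full = adjusted_r_squared(0.8549, 506, 13)

print(round(adj_train, 4))  # 0.8501, the value shown in the summary
print(round(adj_full, 4))   # 0.8511
```

Under that assumption the adjustment reproduces 0.8501, which suggests the summary R² values describe the training fit rather than the full 506 tracts.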

Visualization

Feature Importance Rankings

Random forest permutation importance — which features best predict home values?

Interpretation

Random forest permutation importance (%IncMSE) ranks all 13 predictors by how much prediction error increases when a feature's values are randomly shuffled. As defined in R's randomForest package, %IncMSE is measured on out-of-bag samples rather than on a separate test set. 'lower_status_pct' is the strongest driver, with an importance score of 96.3: scrambling its values hurts prediction accuracy the most. Low-scoring features add little predictive power beyond what the other variables already capture. Importance measures predictive signal, not causal direction.
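
The shuffling idea can be sketched in a few lines. This is a minimal illustration, not the report's implementation: it uses synthetic data and a hypothetical fixed `predict` function standing in for a fitted model, and it measures the raw increase in MSE rather than the scaled %IncMSE score.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data: y depends on column 0 only; column 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y = [3.0 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]

def predict(row):
    # Hypothetical stand-in for a fitted model; it ignores column 1.
    return 3.0 * row[0]

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling the informative column raises the error sharply, while shuffling the ignored column leaves it unchanged, which is the pattern the ranking relies on.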

Visualization

Feature Correlation Matrix

Pairwise Pearson correlations among all predictors and the target

Interpretation

The heatmap displays Pearson correlations among all 14 variables (13 features plus the target). 'lower_status_pct' has the strongest correlation with home value (r = -0.873). Dark red cells signal strong positive co-movement; dark blue cells signal strong negative association. Predictor pairs with |r| > 0.7 are strongly collinear, which can inflate the standard errors of their regression coefficients.
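
The |r| > 0.7 screen can be sketched as follows. The column names and values are hypothetical toy data, not the report's features; only the 0.7 threshold comes from the text above.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def collinear_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical toy columns: feature_a and feature_b move together.
toy = {
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [2, 4, 6, 8, 11],
    "feature_c": [5, 1, 4, 2, 3],
}
flagged = collinear_pairs(toy)
print(flagged)
```

Pairs returned by such a screen are candidates for dropping one member or for a variance inflation check before interpreting individual coefficients.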

Visualization

Lower Status vs. Home Value

Scatter of key socioeconomic predictor against median home value

Interpretation

Each point represents one census tract. The scatter shows a strong negative relationship (r = -0.873) between the percentage of lower-status residents and median home value. Home values fall most steeply at low lower_status_pct values and flatten at higher values, a non-linear pattern that the random forest can capture better than linear regression alone. The original Boston Housing dataset top-codes home values at $50k, but the maximum in the descriptive statistics here is $37.4k, so clustering at the top of the range should not be read as that ceiling.

Visualization

Regression Coefficients by Feature

Standardised OLS coefficients showing direction and magnitude of each feature's linear effect

Interpretation

Standardised OLS coefficients allow fair comparison across predictors measured in different units. 'distance_employment' has the strongest positive effect on home values, and 'lower_status_pct' has the largest negative effect. Four of the 13 predictors are statistically significant (p < 0.05). Bars pointing right indicate value-raising features; bars pointing left indicate value-reducing ones.
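
A standardised coefficient is the raw coefficient rescaled by sd(x) / sd(y). The sketch below shows the conversion for a single predictor, where the standardised slope equals the Pearson correlation. The data are hypothetical, and the report's 13-predictor model would need a multiple regression fit instead.

```python
import statistics

def ols_slope(x, y):
    """Raw least-squares slope for a one-predictor regression of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def standardise_coefficient(raw_coef, x, y):
    """Rescale a raw slope to standard-deviation units."""
    return raw_coef * statistics.stdev(x) / statistics.stdev(y)

# Hypothetical values: a percentage predictor and home values in $k.
x = [5.0, 10.0, 15.0, 20.0, 30.0]
y = [34.0, 27.0, 24.0, 18.0, 12.0]

raw = ols_slope(x, y)
beta_std = standardise_coefficient(raw, x, y)
print(round(raw, 4), round(beta_std, 4))
```

The raw slope is in $k per percentage point, while the standardised value is unit-free, which is what makes bars for differently scaled features comparable.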

Visualization

Predicted vs. Actual Home Values

Out-of-sample prediction accuracy on the held-out test set

Interpretation

Predicted vs. actual home values on the held-out test set (101 properties). Points on the 45° diagonal are perfect predictions; points above it are overestimates and points below it are underestimates. Test-set RMSE is $3.3k and R² = 0.8181, modestly below the in-sample R² of 0.8549, which indicates that the model generalises reasonably well to unseen tracts. Under-prediction at the top of the price range is typical of capped data, but the maximum home value in the descriptive statistics is $37.4k, below the $50k top-code of the original Boston Housing data, so a ceiling effect is not evident here.
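
Both test-set metrics follow directly from the residuals. A minimal sketch on hypothetical values (in $k), not the report's 101 test properties:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, computed on the supplied (held-out) data."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out values and predictions, in $k.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]

test_rmse = rmse(actual, predicted)
test_r2 = r_squared(actual, predicted)
print(round(test_rmse, 3), round(test_r2, 2))  # 1.581 0.92
```

Because RMSE is in target units, $3.3k can be read against the spread of home values (standard deviation $7.745k in the descriptive statistics).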

Visualization

Home Value Distribution

Distribution of median home values across all census tracts

Interpretation

The median home value across the 506 tracts is $24.4k (mean: $22.9k). Because the mean sits below the median, the summary statistics point to a longer lower tail (left skew) rather than right skew, alongside a distinct cluster of high-value tracts. No strong price-ceiling artefact is visible in this dataset. Distinct low-value and mid-value clusters likely correspond to inner-city and suburban neighborhoods.

Visualization

Riverside vs. Non-Riverside Home Values

Mean home values comparing Charles River tracts to non-riverside neighborhoods

Interpretation

Tracts bordering the Charles River (n = 38) have a mean home value of $22.71k, versus $22.88k for non-riverside tracts (n = 468). The $0.17k difference, with riverside tracts slightly lower, is not statistically significant at the 5% level by Welch's two-sample t-test (p = 0.8972). The data therefore show no evidence of a riverside premium, and with only 38 riverside tracts the riverside mean is estimated less precisely than the non-riverside mean.
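
Welch's test differs from the pooled t-test in that it does not assume equal variances, which matters when group sizes are as unbalanced as 38 versus 468. The sketch below computes the t statistic and the Welch-Satterthwaite degrees of freedom on hypothetical samples. It stops short of a p-value, which would need a t-distribution CDF from a statistics library.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = diff / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df

# Hypothetical samples with unequal sizes and variances.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_stat, df = welch_t(group_a, group_b)
print(round(t_stat, 3), round(df, 2))
```

The degrees of freedom are generally fractional and fall below n_a + n_b - 2, which widens the reference distribution when one group is small or noisy.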

Data Table

Feature Descriptive Statistics

Range, central tendency, and spread for every housing feature

feature_name            mean_val   median_val   std_val   min_val   max_val
home_value              22.87      24.4         7.745     5         37.4
rooms_avg               6.28       6.261        0.699     4.531     8.113
lower_status_pct        11.21      9.45         6.797     2.32      37.97
crime_rate              1.758      0.281        6.244     0.006     89
pollution_level         0.551      0.546        0.113     0.385     0.871
tax_rate                399        340.5        172.2     187       711
pupil_teacher_ratio     18.43      18.5         2.068     12.6      22
distance_employment     4.91       4.374        2.51      1.13      12.13
age_pct                 71.05      74.6         20.21     2.5       99.9
industrial_pct          10.83      9.135        7.778     0.46      27.74
residential_zone_pct    12.81      0            24.28     0         100
highway_access          5.899      5            4.852     1         24
charles_river           0.075      0            0.264     0         1
black_index             345        355.2        42.56     175.8     396.9
Interpretation

Descriptive statistics for all 14 variables (506 tracts). Comparing mean to median reveals distributional skew: 'residential_zone_pct' shows the largest mean-to-median gap relative to its spread, indicating heavy skew that may influence regression assumptions. The wide ranges across features (crime_rate spans orders of magnitude; rooms_avg is tightly bounded) explain why standardisation is essential before comparing coefficient magnitudes.
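
The mean-to-median comparison can be made explicit with a crude scale-free indicator, (mean - median) / std. The sketch below uses the mean, median, and standard deviation values transcribed from the table above. The indicator itself is an illustrative choice, not necessarily the one used to produce the report.

```python
# (mean, median, std) per feature, transcribed from the table above.
stats = {
    "home_value": (22.87, 24.4, 7.745),
    "rooms_avg": (6.28, 6.261, 0.699),
    "lower_status_pct": (11.21, 9.45, 6.797),
    "crime_rate": (1.758, 0.281, 6.244),
    "pollution_level": (0.551, 0.546, 0.113),
    "tax_rate": (399.0, 340.5, 172.2),
    "pupil_teacher_ratio": (18.43, 18.5, 2.068),
    "distance_employment": (4.91, 4.374, 2.51),
    "age_pct": (71.05, 74.6, 20.21),
    "industrial_pct": (10.83, 9.135, 7.778),
    "residential_zone_pct": (12.81, 0.0, 24.28),
    "highway_access": (5.899, 5.0, 4.852),
    "charles_river": (0.075, 0.0, 0.264),
    "black_index": (345.0, 355.2, 42.56),
}

def skew_gap(mean, median, std):
    """Positive values suggest a longer upper tail, negative a longer lower tail."""
    return (mean - median) / std

gaps = {name: skew_gap(*values) for name, values in stats.items()}
most_skewed = max(gaps, key=lambda name: abs(gaps[name]))
print(most_skewed, round(gaps[most_skewed], 3))  # residential_zone_pct 0.528
```

On these figures residential_zone_pct has the largest gap, consistent with the statement above, while home_value comes out negative because its mean lies below its median.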
