Executive Summary
Top-line finding on the strongest diamond price driver and overall model accuracy
Clarity is the top-ranked price driver, accounting for 18.7% of random forest importance across 500 diamonds. Linear regression explains 95.5% of log-price variation (R² = 0.955). The median retail price in this dataset is $514.
Feature Importance Ranking
Random forest importance showing which diamond attributes explain the most price variation
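Clarity is the most important feature, accounting for 18.7% of the random forest's node impurity reduction; carat is second at 17.8%. Importance scores are normalized to sum to 100%, so relative rankings are directly comparable. The random forest achieved an out-of-bag R² of 0.000, meaning it predicts held-out diamonds no better than a constant mean prediction would, so these rankings should be read with caution.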
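The normalization described above can be sketched as follows. This is a minimal illustration, not the report's actual pipeline: the function name `normalize_importance`, the feature names, and the raw scores are all hypothetical placeholders.

```python
def normalize_importance(raw_scores):
    """Rescale raw importance scores so that they sum to 100 (percent)."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: 100.0 * score / total for name, score in raw_scores.items()}


# Hypothetical raw impurity reductions; NOT values taken from this report.
raw = {"feature_a": 3.0, "feature_b": 2.0, "feature_c": 5.0}

shares = normalize_importance(raw)

# Rank features from most to least important.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)

for name, share in ranked:
    print(f"{name}: {share:.1f}%")
```

Because every score is divided by the same total, the ordering of features is unchanged; only the scale becomes comparable across models.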
Regression Coefficients
OLS coefficients showing direction and magnitude of each attribute's effect on log(price)
Coefficients are from an OLS regression of log(price) on encoded features. length_mm has the largest positive coefficient (2.3112), corresponding to an approximately 908.7% price change per one-unit increase in length_mm. Model R² = 0.955; positive coefficients increase price and negative coefficients decrease it.
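The percentage figure follows from the log-linear form: a one-unit increase in a predictor with coefficient β multiplies the predicted price by exp(β), so the percent change is (exp(β) − 1) × 100. A short sketch of that arithmetic is below; the helper name `log_coef_to_pct_change` is an assumption made for illustration.

```python
import math


def log_coef_to_pct_change(beta):
    """Percent change in price per one-unit increase in a predictor,
    given its coefficient in a regression of log(price)."""
    return (math.exp(beta) - 1.0) * 100.0


# Coefficient reported above for length_mm.
pct = log_coef_to_pct_change(2.3112)
print(f"{pct:.1f}% price change per unit")
```

For small coefficients, 100 × β is a close approximation, but it understates the effect badly for a coefficient as large as 2.3112.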
Carat vs Price
Scatter of carat vs price by cut grade, showing the carat-price relationship across quality tiers
Scatter plot of carat weight versus retail price, colored by cut grade, sampled to 500 points. Pearson r = 0.821, so carat and price are strongly correlated. Within each cut tier, higher-carat diamonds show wider price dispersion, reflecting the multiplicative interaction between weight and quality.
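For reference, the Pearson coefficient can be computed as the covariance of the two variables divided by the product of their standard deviations. The sketch below uses only the standard library; the `carat` and `price` lists are made-up placeholder values, not rows from this dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical placeholder data; NOT taken from the report's 500 diamonds.
carat = [0.3, 0.5, 0.7, 1.0, 1.5]
price = [400.0, 900.0, 1800.0, 4500.0, 9000.0]

r = pearson_r(carat, price)
print(f"Pearson r = {r:.3f}")
```

Pearson r measures linear association only, so a curved carat-price relationship can still produce a high value while being poorly described by a straight line.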
Price Distribution by Cut
Box plots of price distributions by cut grade, showing median and spread differences
Box plots show the price distribution within each cut grade across 500 diamonds. Good has the highest median price ($698) and Premium has the lowest ($398). Note that carat weight confounds this comparison: larger diamonds can be cut at any quality level, which widens the price range within each tier.
Average Price by Color Grade
Mean price per color grade (D through J), testing the monotonic decline hypothesis
Average retail price by GIA color grade, ordered from D (colorless, most valuable) to J (most color). D-grade diamonds average $2,620 vs $1,077 for J-grade in this dataset. The decline is not strictly monotone, likely due to confounding with carat size within each color tier.
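A monotonic-decline check of this kind can be sketched as a comparison of adjacent grades. In the example below only the D and J means come from the text above; the E through I values are hypothetical placeholders chosen to show one violation, and both function names are assumptions.

```python
def is_strictly_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


def find_violations(labels, values):
    """Adjacent label pairs where the value fails to decrease."""
    return [
        (labels[i], labels[i + 1])
        for i in range(len(values) - 1)
        if values[i] <= values[i + 1]
    ]


grades = ["D", "E", "F", "G", "H", "I", "J"]
# D and J are the reported means; E..I are HYPOTHETICAL placeholders.
mean_price = [2620, 2400, 2500, 2100, 1800, 1500, 1077]

monotone = is_strictly_decreasing(mean_price)
violations = find_violations(grades, mean_price)
print(monotone, violations)
```

Listing the offending grade pairs, rather than returning only a yes/no answer, shows where the confounding with carat is most likely to matter.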
Model Performance
Model fit metrics for both random forest and linear regression (R-squared, RMSE)
| metric | value |
|---|---|
| N (observations) | 500 |
| LM R-squared | 0.955 |
| LM RMSE (price, USD) | $1,095 |
| RF OOB R-squared | 0.000 |
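Summary of model fit statistics across 500 diamonds. The linear regression explains 95.5% of log-price variation, which is a strong fit on the log scale. On the dollar scale, however, its RMSE of $1,095 is large relative to the $514 median price, so individual price estimates can carry substantial error. The random forest's OOB R² of 0.000 indicates no predictive gain over a constant mean prediction.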
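To make the two metrics concrete, the sketch below computes R² and RMSE from their definitions and shows that a model which always predicts the mean scores an R² of zero. The price values and function names are hypothetical, chosen only for illustration.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when actual values are constant")
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(y - p) ** 2 for y, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical prices in USD; NOT taken from the report's dataset.
actual = [400.0, 900.0, 1800.0, 4500.0]

# A baseline that predicts the mean price for every diamond.
mean_only = [sum(actual) / len(actual)] * len(actual)

baseline_r2 = r_squared(actual, mean_only)
baseline_rmse = rmse(actual, mean_only)
print(f"baseline R^2 = {baseline_r2:.3f}, RMSE = {baseline_rmse:.0f} USD")
```

Note that R² computed on log(price) and RMSE computed on price in USD are on different scales, so a high value of the former does not by itself imply a small value of the latter.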