Executive Summary
Key findings from random forest feature importance and linear regression analysis
Across 1,599 red wines rated 3–8, the random forest ranks Sulphates as the strongest predictor of quality by mean decrease in accuracy. The linear regression model (R² = 0.361, explaining 36.1% of quality variance) instead identifies Alcohol as the predictor with the largest standardized effect, with Volatile Acidity as the key negative driver. The two methods therefore disagree on the top-ranked feature, while the regression coefficients and the pairwise correlations agree that alcohol content and volatile acidity are the dominant physicochemical levers for wine quality.
Quality Score Distribution
Frequency distribution of expert quality scores across 1,599 red wines
The dataset contains 1,599 red wine samples rated on a 3–8 quality scale. Quality score 5 is the most common, with 681 wines. Mid-range scores of 5 and 6 together account for 82.5% of the dataset, meaning the models primarily differentiate average-quality wines from one another. Scores of 3 and 8 are rare, with only 10 and 18 wines respectively.
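The shares quoted above can be re-derived from the per-score counts listed in the profile table later in this report. The following is a minimal sketch; the variable names are illustrative and not part of the original analysis.

```python
# Per-score wine counts, copied from the profile table in this report.
counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

total = sum(counts.values())                  # total number of wines
mode_score = max(counts, key=counts.get)      # most frequent quality score
mid_share = (counts[5] + counts[6]) / total   # fraction of wines rated 5 or 6

for score, n in sorted(counts.items()):
    print(f"score {score}: {n:4d} wines ({n / total:6.1%})")
print(f"scores 5 and 6 combined: {mid_share:.1%}")
```

A share this concentrated in two adjacent scores is the reason the extreme scores contribute so little to any fitted model.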
Feature Correlation Matrix
Pairwise Pearson correlations across all 11 physicochemical features and quality score
The correlation heatmap shows pairwise Pearson correlations among all 12 variables (the 11 physicochemical features plus the quality score). The feature most positively correlated with quality is Alcohol (r = 0.476). The strongest negative correlation with quality belongs to Volatile Acidity (r = -0.391), consistent with its role as a quality-reducing factor. High correlations between predictor pairs (such as total and free sulfur dioxide) indicate multicollinearity, which can make individual regression coefficients unstable and harder to interpret.
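For readers who want to see what a single cell of the heatmap represents, here is a minimal sketch of the Pearson coefficient computed from scratch. The function name and the toy values are illustrative assumptions and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))   # summed cross-deviations
    ss_x = sum(a * a for a in dx)                 # sum of squares for x
    ss_y = sum(b * b for b in dy)                 # sum of squares for y
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum_xy / math.sqrt(ss_x * ss_y)

# Toy values for illustration only, not taken from the dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
r = pearson_r(alcohol, quality)
print(f"r = {r:.3f}")
```

The coefficient is unitless and bounded by -1 and 1, which is what allows features measured in different units to share one color scale in the heatmap.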
Random Forest Feature Importance
Mean decrease in accuracy for each of the 11 physicochemical features
The random forest ranks all 11 physicochemical features by mean decrease in accuracy, that is, how much prediction accuracy drops when a feature's values are randomly shuffled. Sulphates scores highest with an importance of 54.459, making it the top-ranked predictor of wine quality in this model. Residual Sugar ranks lowest, contributing the least incremental predictive power. Features with near-zero importance provide little signal beyond what the other features already capture.
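The shuffling idea can be sketched in a few lines. This is an illustration of permutation-based importance in general, not the report's actual pipeline: `toy_model` is a hypothetical stand-in for a fitted forest, the rows are invented, and the raw accuracy drop computed here is not on the same scale as the 54.459 score above, whose normalization the report does not state.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        # Rebuild each row with only the chosen feature replaced.
        shuffled = [dict(row, **{feature: value}) for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return sum(drops) / n_repeats

# Hypothetical stand-in for a fitted model: it looks only at alcohol.
def toy_model(row):
    return "high" if row["alcohol"] >= 11.0 else "low"

# Toy rows for illustration only, not taken from the dataset.
rows = [
    {"alcohol": 9.5, "residual_sugar": 2.1},
    {"alcohol": 9.8, "residual_sugar": 1.9},
    {"alcohol": 10.2, "residual_sugar": 2.6},
    {"alcohol": 10.6, "residual_sugar": 2.0},
    {"alcohol": 11.3, "residual_sugar": 2.3},
    {"alcohol": 11.9, "residual_sugar": 1.8},
    {"alcohol": 12.4, "residual_sugar": 2.5},
    {"alcohol": 12.8, "residual_sugar": 2.2},
]
labels = [toy_model(row) for row in rows]  # labels the toy model fits perfectly

imp_alcohol = permutation_importance(toy_model, rows, labels, "alcohol")
imp_sugar = permutation_importance(toy_model, rows, labels, "residual_sugar")
print(f"alcohol: {imp_alcohol:.3f}, residual_sugar: {imp_sugar:.3f}")
```

A feature the model never consults, like `residual_sugar` here, scores exactly zero because shuffling it cannot change any prediction.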
Alcohol Content by Quality Score
Box plot showing alcohol content distribution for each expert quality rating
Box plots show the distribution of alcohol content (% by volume) separately for each quality score. Wines rated 8 have a median alcohol content of 12.2%, compared to 9.9% for wines rated 3. The general upward trend in median alcohol across quality scores marks alcohol as a strong positive correlate of quality. The overlap between adjacent quality groups shows that alcohol alone does not determine quality.
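Each box in such a plot is drawn from a five-number summary of one quality group. Below is a minimal sketch of that summary, assuming the "inclusive" quartile convention; the report does not say which convention its plotting library uses, and the sample values are invented for illustration.

```python
import statistics

def box_stats(values):
    """Five-number summary behind one box: min, Q1, median, Q3, max."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Toy alcohol readings grouped by quality score, for illustration only.
alcohol_by_quality = {
    3: [9.0, 9.5, 10.0, 10.5, 11.0],
    8: [11.0, 11.8, 12.2, 12.9, 13.4],
}

summaries = {score: box_stats(vals) for score, vals in alcohol_by_quality.items()}
for score, s in sorted(summaries.items()):
    print(f"quality {score}: median={s['median']:.1f}, IQR={s['q3'] - s['q1']:.1f}")
```

Comparing medians across groups shows the trend, while overlapping interquartile ranges show how much adjacent scores share the same alcohol levels.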
Volatile Acidity by Quality Score
Box plot showing volatile acidity distribution for each expert quality rating
Volatile acidity (acetic acid content) shows a clear decreasing pattern as quality rises. Wines rated 8 have a median volatile acidity of 0.37 g/L, substantially lower than the 0.845 g/L median for wines rated 3. This supports volatile acidity as a key negative quality driver: higher levels introduce a vinegar-like taste that experts consistently penalize. Low-quality wines display greater spread, suggesting that factors other than volatile acidity also contribute to low ratings.
Linear Regression Coefficients
Standardized beta coefficients for all 11 physicochemical predictors
Standardized beta coefficients show each feature's effect on quality in comparable units, regardless of measurement scale. The overall regression model explains 36.1% of quality variance (R² = 0.361). Alcohol has the largest positive effect (β = 0.3645), while Volatile Acidity (β = -0.2403) negatively impacts quality. Bars pointing right indicate quality-boosting properties; bars pointing left indicate quality-reducing ones.
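To make "comparable units" concrete, the sketch below rescales a raw slope by the ratio of standard deviations and checks that this matches regressing z-scored variables. It covers only the single-predictor case and assumes the common z-score convention; the report does not state how its betas were computed, and the data values are invented.

```python
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def standardized_beta(raw_slope, xs, ys):
    """Rescale a raw slope into standard-deviation units: b * sd(x) / sd(y)."""
    return raw_slope * statistics.stdev(xs) / statistics.stdev(ys)

def zscores(values):
    """Center to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Toy values for illustration only, not taken from the dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]

raw = ols_slope(alcohol, quality)
beta = standardized_beta(raw, alcohol, quality)
beta_from_z = ols_slope(zscores(alcohol), zscores(quality))
print(f"raw slope={raw:.3f}, standardized beta={beta:.3f}")
```

Changing the measurement unit of a predictor changes its raw slope but leaves the standardized value unchanged, which is why the bars in the chart can be compared directly.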
Alcohol vs Volatile Acidity by Quality Tier
Scatter plot of the two dominant quality drivers, colored by quality tier
This scatter plot maps each wine's alcohol content (x-axis) against volatile acidity (y-axis), with color indicating quality tier: High (7–8), Medium (5–6), or Low (3–4). High-quality wines cluster toward the lower right of the plot, where alcohol is high and volatile acidity is low (mean alcohol: 11.5%, mean volatile acidity: 0.406 g/L). Low-quality wines trend toward the upper left, with lower alcohol and higher volatile acidity (mean alcohol: 10.2%, mean volatile acidity: 0.724 g/L). The diagonal separation shows that the two features most strongly correlated with quality together create clear visual quality clusters.
Mean Physicochemical Profile by Quality Score
Mean alcohol, volatile acidity, sulphates, and citric acid by quality rating group
| Quality Score | Count | Mean Alcohol (% vol) | Mean Volatile Acidity (g/L) | Mean Sulphates (g/L) | Mean Citric Acid (g/L) |
|---|---|---|---|---|---|
| 3 | 10 | 9.96 | 0.884 | 0.57 | 0.171 |
| 4 | 53 | 10.27 | 0.694 | 0.596 | 0.174 |
| 5 | 681 | 9.9 | 0.577 | 0.621 | 0.244 |
| 6 | 638 | 10.63 | 0.497 | 0.675 | 0.274 |
| 7 | 199 | 11.47 | 0.404 | 0.741 | 0.375 |
| 8 | 18 | 12.09 | 0.423 | 0.768 | 0.391 |
This table summarizes mean values of four key physicochemical properties across each quality score group. Quality score 8 wines have the highest mean alcohol content (12.09%), while quality score 7 wines show the lowest mean volatile acidity (0.404 g/L). Sulphates and citric acid both show a general upward trend with quality, though with less dramatic separation than alcohol and volatile acidity. The table covers all 1,599 wines: quality groups with fewer than 5 samples would be excluded, but every score has at least 10 wines, so none is dropped.
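The tier-level means quoted for the scatter plot can be cross-checked against this table by taking count-weighted averages of the per-score means. The following is a minimal sketch; the function names are illustrative, and because the inputs are already rounded, the results are approximate.

```python
# Rows copied from the table above: score -> (count, mean alcohol, mean volatile acidity).
profile = {
    3: (10, 9.96, 0.884),
    4: (53, 10.27, 0.694),
    5: (681, 9.90, 0.577),
    6: (638, 10.63, 0.497),
    7: (199, 11.47, 0.404),
    8: (18, 12.09, 0.423),
}

def tier(score):
    """Tier boundaries as defined for the scatter plot: Low 3-4, Medium 5-6, High 7-8."""
    if score <= 4:
        return "Low"
    if score <= 6:
        return "Medium"
    return "High"

def tier_means(profile):
    """Count-weighted mean alcohol and volatile acidity per tier."""
    totals = {}
    for score, (n, alcohol, va) in profile.items():
        acc = totals.setdefault(tier(score), [0, 0.0, 0.0])
        acc[0] += n
        acc[1] += n * alcohol
        acc[2] += n * va
    return {name: (n, a / n, v / n) for name, (n, a, v) in totals.items()}

means = tier_means(profile)
for name in ("Low", "Medium", "High"):
    n, alcohol, va = means[name]
    print(f"{name:6s} n={n:4d} alcohol={alcohol:.1f}% volatile acidity={va:.3f} g/L")
```

Weighting by count matters here: the High tier is dominated by the 199 wines rated 7, so its means sit much closer to the score-7 row than to the score-8 row.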