Executive Summary
Key metrics from the medical insurance cost GLM
The Gamma GLM with a smoker×BMI interaction explains 72.2% of the variance in medical charges (R² = 0.722) across 1,338 policyholders. Smokers are charged on average $23,616 more per year than non-smokers. The model's root mean squared error (RMSE) is $8,060, roughly 61% of the mean annual charge of $13,270.
Distribution of Medical Charges
Histogram of individual annual medical insurance charges
Medical charges are strongly right-skewed: the median is $9,382 but the mean is $13,270, pulled up by high-cost outliers. The maximum charge reaches $63,770. This skew supports the use of a Gamma GLM with log link, which suits positive, right-skewed outcomes and models cost drivers as multiplicative rather than additive.
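The mean-above-median pattern described above can be checked with a moment-based skewness statistic. The following is a minimal sketch using only the standard library; the function name `sample_skewness` and the toy values are illustrative assumptions and are not taken from the dataset.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed population SD."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)


# Made-up toy charges (NOT from the dataset): one large value drags the mean upward.
toy_charges = [2000.0, 3000.0, 4000.0, 5000.0, 40000.0]

mean_above_median = statistics.fmean(toy_charges) > statistics.median(toy_charges)
skew = sample_skewness(toy_charges)
print(mean_above_median, skew > 0)
```

A positive value indicates a long right tail; a symmetric sample gives a value near zero.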
Average Charges by Smoker Status
Mean annual insurance charges comparing smokers to non-smokers
Smokers incur substantially higher medical costs than non-smokers. The average annual charge for smokers is $32,050 vs $8,434 for non-smokers — a gap of $23,616. Smoking status is the single strongest predictor in the model, and this gap is amplified further by the BMI interaction term.
BMI vs Charges by Smoking Status
Scatter plot of BMI against annual charges, colored by smoking status
The scatter reveals a clear interaction: among smokers, higher BMI is strongly associated with higher charges (r = 0.81), while among non-smokers the relationship is much weaker (r = 0.08). These diverging slopes support including the smoker×BMI interaction term in the GLM.
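Per-group correlations like the ones quoted above can be computed by splitting the rows on smoking status and applying Pearson's formula within each group. This is a sketch under assumptions: the row layout `(group, bmi, charge)`, the helper names, and the toy rows are hypothetical and do not reproduce the report's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_by_group(rows):
    """rows: iterable of (group, x, y) tuples. Returns {group: r}."""
    groups = {}
    for group, x, y in rows:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(x)
        ys.append(y)
    return {group: pearson_r(xs, ys) for group, (xs, ys) in groups.items()}


# Made-up toy rows (NOT from the dataset): (smoker, bmi, charge).
toy_rows = [
    ("yes", 25.0, 20000.0), ("yes", 30.0, 30000.0), ("yes", 35.0, 40000.0),
    ("no", 25.0, 8000.0), ("no", 30.0, 9000.0), ("no", 35.0, 8000.0),
]
correlations = r_by_group(toy_rows)
print(correlations)
```

In the toy rows the smoker group is perfectly linear and the non-smoker group has no linear trend, which mimics the qualitative contrast in the plot rather than its actual values.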
GLM Predictor Effects
GLM log-scale coefficients sorted by absolute magnitude
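The largest coefficient by absolute magnitude is 'Region: Southwest' (coefficient = -0.155 on the log scale), which corresponds to expected charges about 14% lower than in the baseline region, holding other predictors fixed. In a Gamma GLM with log link, each coefficient represents a multiplicative shift in expected charges: positive values increase cost, negative values reduce it. Raw coefficient size depends on each predictor's units, so this ranking is not a direct measure of predictor importance. The smoker×BMI interaction appears among the top effects, confirming that smoking amplifies the cost impact of excess BMI.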
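Because the link is logarithmic, a coefficient is converted to a multiplicative effect by exponentiating it. The sketch below applies this to the one coefficient quoted in the text; the helper names `multiplier` and `percent_change` are illustrative assumptions.

```python
import math


def multiplier(coef):
    """Multiplicative effect on expected charges for a one-unit change in the predictor."""
    return math.exp(coef)


def percent_change(coef):
    """The same effect expressed as a percentage change in expected charges."""
    return (math.exp(coef) - 1.0) * 100.0


southwest_coef = -0.155  # log-scale coefficient quoted in the text

print(round(multiplier(southwest_coef), 3))      # 0.856
print(round(percent_change(southwest_coef), 1))  # -14.4
```

For a dummy variable such as a region indicator, the "one-unit change" is membership in that region versus the baseline level, which is assumed here to be the model's reference region.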
Average Charges by Region
Mean annual insurance charges by US geographic region
Regional variation in average charges is moderate. The highest-cost region is 'southeast' ($14,735 mean) and the lowest is 'southwest' ($12,347 mean), a gap of $2,388. These regional differences likely reflect differences in healthcare costs and possibly demographic composition, and they amount to roughly a tenth of the $23,616 smoker premium.
Actual vs Predicted Charges
Scatter of actual vs model-predicted annual charges
Points cluster around the 45-degree line (perfect prediction), with the model explaining 72.2% of charge variance. The RMSE of $8,060 summarizes the typical size of prediction errors, weighting large misses more heavily than a mean absolute error would. Systematic under-prediction at very high charges is common in Gamma GLMs because of extreme values in the right tail of the cost distribution.
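The fit metrics quoted above can be computed from paired actual and predicted values. The following is a sketch, assuming R² is defined on the response scale as one minus the ratio of residual to total sum of squares; the report does not state which R² definition it uses, and the toy numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, shown for comparison with RMSE."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot on the response scale (an assumed definition)."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up toy values (NOT from the dataset): one badly under-predicted high charge.
toy_actual = [1000.0, 2000.0, 3000.0, 10000.0]
toy_predicted = [1200.0, 1800.0, 3000.0, 7000.0]

print(rmse(toy_actual, toy_predicted) > mae(toy_actual, toy_predicted))
print(r_squared(toy_actual, toy_predicted))
```

In the toy example a single large miss pushes RMSE well above MAE, which is the same mechanism by which right-tail under-prediction inflates the reported RMSE.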
Residuals vs Fitted Values
GLM diagnostic: residuals plotted against fitted values
Ideally, residuals should be randomly scattered around zero across all fitted values. A funnel shape in this plot suggests heteroscedasticity, while a curve suggests missing non-linear terms. Any systematic structure at high fitted values may reflect the heavy right tail of charges, which the Gamma distribution partially addresses but cannot eliminate entirely.
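The report does not say which residual type is plotted. For a Gamma GLM, two common choices are Pearson residuals, which scale the raw error by the fitted mean because the Gamma variance function is V(μ) = μ², and deviance residuals. A minimal sketch of both follows, with hypothetical function names and made-up inputs.

```python
import math


def pearson_residual(y, mu):
    """Gamma Pearson residual: (y - mu) / sqrt(V(mu)) with V(mu) = mu**2."""
    return (y - mu) / mu


def deviance_residual(y, mu):
    """Gamma deviance residual: sign(y - mu) * sqrt(unit deviance)."""
    unit_deviance = 2.0 * ((y - mu) / mu - math.log(y / mu))
    # max() guards against tiny negative values caused by floating-point rounding.
    return math.copysign(math.sqrt(max(unit_deviance, 0.0)), y - mu)


# Made-up (actual, fitted) pairs, NOT taken from the model.
pairs = [(150.0, 100.0), (100.0, 100.0), (50.0, 100.0)]
for y, mu in pairs:
    print(pearson_residual(y, mu), deviance_residual(y, mu))
```

Scaling by the fitted mean is what lets a Gamma model tolerate errors that grow with the size of the prediction; a funnel that persists after this scaling points to variance growing faster than the model assumes.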
Descriptive Statistics
Summary statistics for all numeric variables in the dataset
| Variable | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| Age | 39.21 | 39 | 14.05 | 18 | 64 |
| BMI | 30.66 | 30.4 | 6.1 | 15.96 | 53.13 |
| Children | 1.09 | 1 | 1.21 | 0 | 5 |
| Charges | 13,270 | 9,382 | 12,110 | 1,122 | 63,770 |
The dataset contains 1,338 complete observations. Medical charges are widely dispersed (SD $12,110 against a mean of $13,270), and the mean exceeds the median of $9,382 by a wide margin, confirming strong right skew. BMI averages 30.7 and age spans 18 to 64 years.
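One way to compare spread across variables with different units is the coefficient of variation (SD divided by mean). The sketch below computes it from the mean and SD values in the table above; the dictionary layout and function name are illustrative assumptions.

```python
# (mean, SD) pairs copied from the descriptive statistics table.
summary = {
    "Age": (39.21, 14.05),
    "BMI": (30.66, 6.1),
    "Children": (1.09, 1.21),
    "Charges": (13270.0, 12110.0),
}


def coefficient_of_variation(mean, sd):
    """Unit-free relative spread: standard deviation divided by the mean."""
    return sd / mean


cv = {name: coefficient_of_variation(mean, sd) for name, (mean, sd) in summary.items()}

# Print variables from largest to smallest relative spread.
for name, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.2f}")
```

By this particular measure the count variable Children ranks slightly above Charges, so any statement about relative spread depends on the definition chosen; the skew in Charges is better judged from the gap between its mean and median.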