Executive Summary
Overdispersion confirmation and key NB regression findings
Among 1,338 policyholders, the variance-to-mean ratio of 1.3 indicates overdispersion, which favors negative binomial (NB) regression over Poisson. The NB model improves AIC by 65.2 over Poisson (LRT p = 2.401e-16), and the estimated NB theta of 2.6 is consistent with moderate overdispersion. The largest point estimate belongs to Region: Northwest (IRR = 1.1), but no predictor has a 95% CI that excludes 1, so none of the estimated effects on claim frequency is statistically distinguishable from no effect.
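To make the link between theta and overdispersion concrete, the sketch below evaluates the NB2 variance function, Var(Y) = mu + mu^2 / theta. It assumes the reported theta is the NB2 size parameter (the report does not state the parameterization) and plugs in the overall mean rather than per-policyholder fitted means, so the result is only a rough illustration.

```python
def nb2_variance(mu: float, theta: float) -> float:
    """Variance of an NB2 count with mean mu and size parameter theta."""
    return mu + mu ** 2 / theta


mu = 1.09    # overall mean claim count from the report
theta = 2.6  # reported NB theta, ASSUMED to be the NB2 size parameter

variance = nb2_variance(mu, theta)
ratio = variance / mu  # algebraically equal to 1 + mu / theta

print(f"implied variance = {variance:.3f}, variance-to-mean ratio = {ratio:.2f}")
```

As theta grows, the mu^2 / theta term shrinks and the NB2 variance approaches the Poisson variance, which is why a small theta signals stronger overdispersion.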
Claim Count Distribution
Histogram of raw claim counts per policyholder
The distribution of claim counts has a mean of 1.09 and a variance of 1.45, giving a variance-to-mean ratio of 1.3. A ratio above 1 indicates overdispersion relative to Poisson, for which the variance equals the mean. 42.9% of policyholders have zero claims, and the long right tail shows that a small fraction of policyholders accounts for a disproportionate share of claim events, a pattern Poisson cannot accommodate without extra variance. The shape supports the negative binomial specification.
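The arithmetic behind these figures can be checked with a short sketch. It uses only the summary statistics reported above and compares the observed zero share with what a Poisson distribution with the same mean would predict. The intercept-only Poisson is an assumption made for illustration; it ignores the covariates in the fitted models.

```python
import math

# Summary statistics taken from the report
mean_claims = 1.09
var_claims = 1.45
observed_zero_share = 0.429

# Variance-to-mean ratio (index of dispersion); equals 1 under Poisson
vmr = var_claims / mean_claims

# Share of zeros implied by a Poisson distribution with the same mean:
# P(Y = 0) = exp(-mean)
poisson_zero_share = math.exp(-mean_claims)

print(f"variance-to-mean ratio: {vmr:.2f}")
print(f"Poisson-implied zero share: {poisson_zero_share:.3f}")
print(f"observed zero share: {observed_zero_share:.3f}")
```

An observed zero share above the Poisson-implied share is one simple symptom of the extra variance described above.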
Model Comparison: Poisson vs Negative Binomial
AIC, log-likelihood, dispersion ratio, and LRT p-value for both models
| Model | AIC | Log-Likelihood | Dispersion Ratio | LRT p-value |
|---|---|---|---|---|
| Poisson | 3900 | -1942 | 1.5 | 1 |
| Negative Binomial | 3835 | -1908 | 1.13 | 2.401e-16 |
The Poisson model has AIC = 3899.8 and log-likelihood = -1941.9. The negative binomial model lowers AIC by 65.2 units (AIC = 3834.6), and the likelihood ratio test confirms that the improvement is statistically significant (LRT p = 2.401e-16). The Poisson residual deviance ratio of 1.5 exceeds 1, consistent with overdispersion, while the NB ratio of 1.13 is much closer to 1 and indicates a better-calibrated fit.
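The comparison above can be reproduced approximately from the reported figures. The sketch below computes the AIC difference and a likelihood ratio test, assuming a chi-square reference distribution with 1 degree of freedom (the single extra dispersion parameter). The report does not state which reference distribution was used, and the NB log-likelihood is only available rounded (-1908), so the p-value here will not match the reported one exactly.

```python
import math


def lrt_one_df(ll_null: float, ll_alt: float) -> tuple[float, float]:
    """Likelihood ratio statistic and p-value against chi-square with 1 df.

    For 1 df the chi-square survival function is erfc(sqrt(x / 2)).
    """
    stat = max(2.0 * (ll_alt - ll_null), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Figures from the report; the NB log-likelihood is the rounded table value
aic_poisson, aic_nb = 3899.8, 3834.6
ll_poisson, ll_nb = -1941.9, -1908.0

aic_gain = aic_poisson - aic_nb
stat, p_value = lrt_one_df(ll_poisson, ll_nb)

print(f"AIC improvement: {aic_gain:.1f}")
print(f"LRT statistic: {stat:.1f}, p-value: {p_value:.3e}")
```

Because the Poisson model sits on the boundary of the NB parameter space, some references halve this p-value; at this magnitude the conclusion is the same either way.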
Mean Claim Count by Region
Observed mean claim counts per geographic region
Mean claim counts vary only modestly across geographic regions. Policyholders in the Northwest have the highest average at 1.15 claims, while those in the Northeast average 1.05 claims. This raw difference does not control for age, BMI, or smoking status; the IRR chart shows the region effect after adjusting for all covariates. All regions contain at least 5 policyholders.
Mean Claim Count by Policyholder Group
Observed mean claim counts for smoker/sex subgroups
Policyholder subgroups defined by smoking status and sex differ in mean claim counts. The Yes / Male subgroup (male smokers) has the highest average at 1.19 claims, compared with an overall mean of 1.09. Smoking status is often expected to drive the largest subgroup difference because smokers tend to have more frequent health-related events requiring claims. These observed means are unadjusted and provide intuition only; the IRR chart shows adjusted effects.
Negative Binomial Coefficients (Incidence Rate Ratios)
IRR with 95% CI for each predictor — values > 1 increase expected claim count
Incidence rate ratios (IRRs) from the negative binomial model give the multiplicative effect of each predictor on the expected claim count, holding the other variables constant. No predictor has an IRR significantly above 1 (95% CI excluding 1), and none has an IRR significantly below 1. The largest point estimate belongs to Region: Northwest (IRR = 1.1), meaning that membership in that group multiplies the expected claim count by 1.1 relative to the reference region. Because its 95% CI includes 1, this effect is not statistically distinguishable from no effect.
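As a reminder of how an IRR and its interval relate to the underlying model coefficient, the sketch below exponentiates a log-scale coefficient and its Wald interval. The coefficient is back-calculated from the reported IRR of 1.1; the standard error of 0.08 is a hypothetical value chosen only for illustration, since the report does not list standard errors.

```python
import math


def irr_with_ci(beta: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Return (IRR, lower, upper) from a log-scale coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


beta_northwest = math.log(1.1)  # back-calculated from the reported IRR
se_northwest = 0.08             # HYPOTHETICAL standard error, not from the report

irr, lower, upper = irr_with_ci(beta_northwest, se_northwest)
includes_one = lower <= 1.0 <= upper

print(f"IRR = {irr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), includes 1: {includes_one}")
```

An interval that includes 1 means the data are compatible with no change in expected claim count for that predictor.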
Predicted vs Actual Claim Counts
NB fitted values vs observed counts — close alignment indicates good model calibration
The scatter of predicted vs actual claim counts assesses whether the negative binomial model captures the observed distribution without systematic bias. The correlation between fitted and observed values is 0.059, which is very weak, and the mean absolute error is 0.98 claims, close to the overall mean of 1.09. Together these figures suggest that the covariates explain little of the variation in individual claim counts, even though the NB model fits the overall distribution better than Poisson. Points aligned along the diagonal indicate accurate predictions; clusters offset to one side would signal under- or over-prediction for a particular policyholder profile. Discrete integer outcomes produce horizontal bands, which is normal for count models.
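The two summary metrics quoted above can be computed as in the following sketch. The `observed` and `fitted` lists are made-up toy values, not the policyholder data, and serve only to show the calculation.

```python
import math


def mean_absolute_error(observed: list[float], fitted: list[float]) -> float:
    """Average absolute gap between observed counts and fitted means."""
    return sum(abs(o - f) for o, f in zip(observed, fitted)) / len(observed)


def pearson_correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy data for illustration only (NOT from the report)
observed = [0, 1, 0, 3, 2, 0, 1, 5]
fitted = [1.0, 1.1, 0.9, 1.2, 1.1, 1.0, 1.2, 1.1]

mae = mean_absolute_error(observed, fitted)
r = pearson_correlation(fitted, observed)
print(f"MAE = {mae:.2f} claims, correlation = {r:.3f}")
```

When fitted means vary within a narrow band while observed counts range widely, as in the toy data, the correlation tends to be low and the MAE stays close to the spread of the raw counts.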