Executive Summary
Overall classification accuracy, AUC, and the single most discriminating tumor measurement
The logistic regression model achieves 92.9% accuracy on the 113 held-out test cases with an AUC of 0.987, although sensitivity is lower at 83.3%. The dataset contains 569 tumors (37.3% malignant). The strongest single predictor of malignancy, ranked by absolute coefficient, is Radius Worst.
Diagnosis Class Distribution
Count of malignant (M) and benign (B) tumors in the dataset
The dataset contains 212 malignant and 357 benign tumors (37.3% malignant). The modest class imbalance is not severe enough to require resampling for logistic regression.
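The class share quoted above follows directly from the two counts, as does the accuracy a trivial "always benign" classifier would reach, which is a useful floor when reading the accuracy figures. A minimal Python sketch; the variable names are illustrative and not taken from the original analysis:

```python
# Class counts as reported in this section.
malignant, benign = 212, 357
total = malignant + benign

malignant_share = malignant / total
# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(malignant, benign) / total

print(f"total={total}, malignant share={malignant_share:.1%}, "
      f"majority-class baseline accuracy={majority_baseline:.1%}")
```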
PCA Biplot — PC1 vs PC2 by Diagnosis
Each point is one tumor plotted in the 2-D space of the first two principal components
PC1 explains 44.3% and PC2 explains 19% of total feature variance. Malignant tumors (M) cluster at higher PC1 values, consistent with larger and more irregular cell nuclei. The visible separation between clusters suggests that much of the class information in the 30 features is close to linearly separable, which is consistent with the strong logistic regression performance. A two-dimensional projection cannot confirm separability on its own, since it discards the remaining variance.
Scree Plot — Variance Explained per PC
Percentage of total feature variance captured by each principal component
The first 7 principal components capture 90% of the variance in the 30 cell-nucleus measurements. PC1 alone explains 44.27%. The steep drop after PC1 suggests that a few dominant size and shape factors account for most of the variability, which is common when many of the measurements are strongly correlated.
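The "number of components needed for 90%" figure is a cumulative sum over the per-component variance ratios. The sketch below shows the calculation on made-up ratios; `components_needed` and `toy_ratios` are hypothetical names, and the values are not the ones behind this report.

```python
from itertools import accumulate

def components_needed(variance_ratios, threshold=0.90):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`."""
    for k, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold - 1e-12:  # small tolerance for float rounding
            return k
    raise ValueError("threshold is never reached")

# Hypothetical ratios for a 5-feature toy problem (NOT the report's values).
toy_ratios = [0.50, 0.25, 0.15, 0.07, 0.03]
print(components_needed(toy_ratios, 0.90))
```

Applied to the 30 ratios from the actual PCA, the same function would be expected to return the 7 quoted above.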
Logistic Regression Coefficients
Standardised logistic regression coefficients for the top features
Of the top 10 features shown, 5 push the model toward malignancy (positive coefficient) and 5 push toward benign (negative). Features with the largest absolute coefficients have the greatest influence on the predicted probability. All features were z-scored before fitting, so coefficients are directly comparable across measurements.
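To make the role of z-scoring concrete, the sketch below standardises a feature and shows how a standardised coefficient moves the predicted probability. It assumes the usual logistic form, P(malignant) = sigmoid(intercept + sum of coefficient × z-score). The function names and coefficient values are hypothetical, not the fitted model.

```python
import math
from statistics import mean, pstdev

def z_score(values):
    """Standardise a feature to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def predict_proba(z_features, coefs, intercept):
    """Logistic model: P(malignant) = sigmoid(intercept + sum(coef * z))."""
    logit = intercept + sum(c * z for c, z in zip(coefs, z_features))
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical coefficients, not the fitted values from this report.
coefs = [2.0, -1.0]
print(predict_proba([0.0, 0.0], coefs, intercept=0.0))  # average on both features
print(predict_proba([1.0, 0.0], coefs, intercept=0.0))  # +1 SD on the first feature
```

Because every feature is on the same scale, a one-standard-deviation change means the same thing for each measurement, which is what makes the coefficients comparable.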
ROC Curve
True positive rate vs false positive rate across all classification thresholds
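The model achieves an AUC of 0.987; values above 0.9 are conventionally read as excellent discrimination between malignant and benign tumors. The curve shows that the model can reach sensitivity above 90% while keeping the false positive rate below 15%. The operating point reported in the confusion matrix (sensitivity 83.3%, false positive rate 1.4%) sits at a stricter threshold than that. Whether trading more false positives for fewer missed cancers is acceptable depends on the screening context.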
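AUC can be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below computes it by direct pairwise comparison on toy data. `auc_from_scores` is a hypothetical helper, and the labels and scores are illustrative only.

```python
def auc_from_scores(labels, scores):
    """AUC as the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy labels (1 = malignant) and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
print(auc_from_scores(labels, scores))
```

The pairwise form is quadratic in the number of cases, which is fine for a test set of this size; rank-based implementations give the same value more efficiently.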
Confusion Matrix
Predicted vs actual diagnosis on the held-out test set
The model correctly classified 35 malignant and 70 benign tumors. There were 7 false negatives (malignant tumors predicted benign) and 1 false positive. Sensitivity (83.3%, or 35 of 42 malignant cases) measures how well the model catches true cancers; minimising false negatives is the clinical priority.
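Every threshold-dependent metric in this report can be derived from the four confusion-matrix counts. A short sketch, treating malignant as the positive class; the variable names are illustrative:

```python
# Counts reported above, with malignant treated as the positive class.
tp, fn, fp, tn = 35, 7, 1, 70

accuracy = (tp + tn) / (tp + fn + fp + tn)
sensitivity = tp / (tp + fn)   # recall: share of malignant tumors caught
specificity = tn / (tn + fp)   # share of benign tumors correctly cleared
precision = tp / (tp + fp)     # share of malignant predictions that are right
f1 = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("precision", precision),
                    ("f1", f1)]:
    print(f"{name}: {value:.4f}")
```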
Feature Importance — Top Predictors
Top cell-nucleus measurements ranked by absolute logistic regression coefficient
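The most discriminating feature is Radius Worst (|coef| = 736.8858). Features from the 'worst' measurement group (the mean of the three largest values recorded across nuclei in each image) tend to dominate the ranking, suggesting that the most extreme cellular abnormalities are the clearest indicators of malignancy. A coefficient of this magnitude on z-scored features is unusually large and may reflect a weakly regularised fit, strongly correlated features, or near-separable training data, so the raw magnitudes should be interpreted with caution.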
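Ranking by absolute coefficient is a simple sort. The sketch below uses placeholder feature names and values, chosen only to show the ordering and sign convention; they are not this model's coefficients.

```python
# Hypothetical standardised coefficients; names and values are illustrative only.
coefs = {"feature_a": 1.8, "feature_b": -2.4, "feature_c": 0.3, "feature_d": -0.9}

# Sort by magnitude, largest first; the sign gives the direction of the effect.
ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
for name, coef in ranked:
    direction = "malignant" if coef > 0 else "benign"
    print(f"{name}: |coef|={abs(coef):.1f}, pushes toward {direction}")
```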
Model Performance Metrics
Complete classification performance on held-out test set
| Metric | Value |
|---|---|
| Accuracy | 0.9292 |
| Sensitivity (Recall) | 0.8333 |
| Specificity | 0.9859 |
| Precision | 0.9722 |
| F1 Score | 0.8974 |
| AUC | 0.9869 |
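Accuracy, specificity, precision, and AUC are all high, but sensitivity is the weakest of the six metrics: at 83.3%, the classifier misses 7 of the 42 malignant tumors in the test set. Because minimising false negatives is the primary clinical objective, lowering the decision threshold to trade some specificity for higher sensitivity is worth considering. The AUC of 0.987 confirms strong overall discrimination independent of the chosen threshold.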
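Sensitivity and specificity depend on the decision threshold applied to the predicted probability. The report does not state which threshold was used; the sketch below assumes 0.5 as the default and compares it with a lower one on toy data. `sensitivity_specificity` is a hypothetical helper, and the labels and scores are illustrative only.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Classify as malignant (1) when score >= threshold, then return
    (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Toy labels (1 = malignant) and predicted probabilities, for illustration only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.45, 0.30, 0.40, 0.20, 0.10, 0.05]

for threshold in (0.5, 0.25):
    sens, spec = sensitivity_specificity(labels, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data the lower threshold catches every malignant case at the cost of one extra false positive, which is the kind of trade-off the ROC curve summarises across all thresholds.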