Analysis overview and configuration

Configuration

Analysis Type: Logistic

Company: Educational Research Institute

Objective: Identify student characteristics that predict test preparation completion using logistic regression

Analysis Date: 2026-03-14

Processing Id: test_1773549132

Total Observations: 1000

Module Parameters

Parameter	Value
confidence_level	0.95
test_size	0.3
classification_threshold	0.5
positive_class	completed

Logistic analysis for Educational Research Institute

Interpretation

Purpose

This logistic regression analysis identifies which student characteristics predict test preparation completion at an Educational Research Institute. The model evaluates five predictors (math score, reading score, writing score, gender, and lunch plan) against a binary outcome (completed vs. none) using 1,000 complete student records with no missing data.

Key Findings

AUC (0.80): Model demonstrates good discrimination ability between students who complete and don't complete test preparation, exceeding the 0.70 quality threshold.
Accuracy (0.75): Overall correct classification rate. Because only 35.8% of cases are positive, always predicting "none" would already score about 0.64, so accuracy alone overstates the model's usefulness.
Specificity (0.87): Strong at identifying non-completers, but Sensitivity (0.52) is notably weaker at identifying completers.
Significant Predictors (4 of 5): Writing score (OR=1.26), student gender male (OR=7.46), math score (OR=0.92), and reading score (OR=0.91) are statistically significant; lunch plan is not.
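McFadden R² (0.17): A likelihood-based pseudo-R², not a share of variance explained. The fitted model improves the log-likelihood by about 17% over the intercept-only model, indicating modest explanatory power.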
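For reference, McFadden's R² is usually defined as below. The symbols for the fitted-model likelihood and the intercept-only (null) likelihood are notation introduced here, not taken from the report, and the report does not state the underlying log-likelihoods.

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
```

Under this definition a value near 0.17 means the fitted model's log-likelihood is about 17% closer to zero than the null model's, which is why it should not be read as a percentage of variance explained.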

Interpretation

The model successfully identifies non-completers but struggles with true positive detection. Male students show

Data preprocessing and column mapping

Data Quality

Metric	Value
Initial Rows	1,000
Final Rows	1,000
Rows Removed	0
Retention Rate	100%

Processed 1,000 observations, retained 1,000 (100.0%) after cleaning

Interpretation

Purpose

This section documents the data cleaning and preparation phase for the logistic regression model predicting test preparation completion. Perfect data retention indicates no observations were excluded during preprocessing, which is critical for maintaining statistical power and representativeness when identifying student characteristics that predict completion behavior.

Key Findings

Retention Rate: 100% (1,000 of 1,000 rows retained) - No observations were removed during cleaning, suggesting either minimal data quality issues or that missing values were handled through imputation rather than deletion
Rows Removed: 0 - No rows were excluded. This conflicts with the metadata note "missing values removed": had any row contained missing values, the row count would have dropped
Train/Test Split: Not documented in this section - The module parameters list test_size = 0.3, which suggests a 70/30 hold-out split, but the resulting partition sizes are not reported here

Interpretation

The perfect retention rate supports the model's ability to leverage the complete 1,000-student sample for logistic regression estimation. This maximizes statistical power for detecting significant predictors of test preparation completion. However, the discrepancy between "missing values removed" and zero rows removed suggests preprocessing may have occurred at the column level rather than row level, potentially through imputation or feature engineering that isn't explicitly documented here.

Context

The lack of train/test split documentation limits visibility into how model performance metrics (AUC=0.8, Accuracy=

Key Metrics

AUC: 0.8
Accuracy: 74.6%
Observations_Used: 1000
Predictors: 5

Key Findings

Metric	Value	Interpretation
AUC	0.8	Good
Accuracy	74.6%	Moderate
Sensitivity	52.3%	Low
Specificity	87%	High
F1 Score	0.596	Low
McFadden R²	0.172	Weak
Significant Predictors	4	Many factors

Summary

Bottom Line: Logistic regression on 1000 observations predicts 'completed' vs 'none' with AUC = 0.8 (good). Overall accuracy: 74.6%. 4 of 5 predictors are statistically significant.

Key Findings:
• Model discrimination: AUC = 0.8 (good — model reliably separates the two classes)
• Sensitivity: 52.3% — proportion of 'completed' cases correctly identified
• Specificity: 87% — proportion of 'none' cases correctly identified
• McFadden R² = 0.172 (weak model fit — consider adding predictors)

Recommendation: The model provides useful discrimination. Use an optimal threshold of 0.367 to classify new cases. Focus interventions on predictors with large, significant odds ratios.

Interpretation

EXECUTIVE SUMMARY

Purpose

This section synthesizes the logistic regression model's performance in predicting test preparation completion across 1,000 students. Understanding whether the model achieves sufficient predictive accuracy and identifies actionable student characteristics is critical for determining deployment viability and intervention strategy effectiveness.

Key Findings

AUC (0.8): Model demonstrates good discrimination ability—reliably separates students who completed test prep from those who did not
Accuracy (74.6%): Overall correct classification rate; with only 35.8% positive cases, always predicting "none" would already reach about 64%, so accuracy alone overstates the model's usefulness
Sensitivity (52.3%): Captures only about half of students who actually completed prep; high false-negative risk
Specificity (87%): Excellent at identifying non-completers, reducing false-positive interventions
Significant Predictors (4 of 5): Student gender, writing score, math score, and reading score drive predictions; lunch plan is non-significant
McFadden R² (0.172): Weak explanatory power suggests unmeasured factors influence completion behavior
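The accuracy figure is easier to judge against a trivial baseline. The sketch below is a rough illustration, assuming the 35.8% positive rate also holds in the data on which accuracy was measured (the report does not confirm this); the variable names are made up for the example.

```python
# Figures taken from the report.
positive_rate = 0.358   # share of students who completed test preparation
model_accuracy = 0.746  # reported overall accuracy

# A trivial classifier that always predicts the majority class ("none")
# is right for every non-completer and wrong for every completer.
baseline_accuracy = 1 - positive_rate

# Improvement of the model over that trivial baseline, in percentage points.
lift_points = (model_accuracy - baseline_accuracy) * 100

print(f"baseline accuracy: {baseline_accuracy:.1%}")
print(f"gain over baseline: {lift_points:.1f} points")
```

Under that assumption the model gains roughly ten percentage points over always predicting "none".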

Interpretation

The model achieves the business objective of identifying predictive characteristics with acceptable discrimination (AUC 0.8). However, the low sensitivity reveals a critical trade-off: while the model excels at identifying non-completers

ROC curve showing model discrimination ability

Interpretation

Purpose

The ROC curve evaluates the logistic regression model's ability to discriminate between students who completed test preparation and those who did not across all possible classification thresholds. This section directly addresses the model's predictive quality for the stated objective of identifying student characteristics that predict test preparation completion.

Key Findings

AUC (0.8): Indicates good discrimination ability—the model correctly ranks a randomly selected completer higher than a non-completer 80% of the time, substantially better than random guessing (0.5).
Optimal Threshold (0.367): Balances sensitivity and specificity by maximizing Youden's J statistic, suggesting predictions below this probability should be classified as "none" and above as "completed."
Sensitivity-Specificity Trade-off: The curve shows the model achieves ~52% sensitivity (true positive rate) at ~13% false positive rate, reflecting the class imbalance (35.8% positive cases).
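Youden's J is sensitivity + specificity - 1, and the optimal threshold is the one that maximizes it. The following is a minimal sketch of that selection rule, not the report's actual implementation; the function names and the toy data are hypothetical, and only the 52.3% and 87% figures come from the report.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


def best_threshold(probs, labels, candidates):
    """Return (threshold, J) for the candidate with the highest Youden's J.

    probs  -- predicted probabilities of the positive class
    labels -- true classes, 1 for positive and 0 for negative
    Assumes both classes are present, otherwise a rate is undefined.
    """
    best = None
    for t in candidates:
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < t and y == 0)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        j = youden_j(tp / (tp + fn), tn / (tn + fp))
        if best is None or j > best[1]:
            best = (t, j)
    return best


# J at the operating point reported above (52.3% sensitivity, 87% specificity).
reported_j = youden_j(0.523, 0.87)

# Hypothetical toy data, not taken from the report.
toy_probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_best = best_threshold(toy_probs, toy_labels, [0.25, 0.35, 0.50])
print(round(reported_j, 3), toy_best)
```

In the toy data the middle candidate wins because it keeps every positive while rejecting three of the four negatives. A real ROC routine would scan every distinct predicted probability rather than a short candidate list.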

Interpretation

The AUC of 0.8 demonstrates the model has meaningful predictive power for test preparation completion. The model performs substantially better than chance, validating that the selected student characteristics (gender, lunch plan, and test scores) contain discriminative information. However, the moderate AUC reflects inherent complexity in predicting behavioral outcomes and suggests room for improvement through additional

Confusion matrix showing classification accuracy by class

Interpretation

Purpose

This confusion matrix evaluates how well the logistic regression model predicts test preparation completion across the two outcome classes. It reveals the model's ability to correctly identify students who completed preparation versus those who did not, which directly addresses the core objective of identifying predictive student characteristics.

Key Findings

Accuracy (74.6%): Overall correctness across both classes; the model correctly classifies nearly 3 in 4 students
Sensitivity (52.3%): The model identifies only about half of students who actually completed preparation, missing 48% of true completers (51 false negatives)
Specificity (87%): Strong performance identifying non-completers; correctly classifies 87% of students who did not complete
F1 Score (0.596): Moderate balance between precision and recall, reflecting the trade-off between false positives (25) and false negatives (51)
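The metrics above can be recomputed from the four cells of the confusion matrix. The report states only the false positives (25) and false negatives (51); the true positive and true negative counts in this sketch are an assumption, chosen as the integers that reproduce the reported rates.

```python
# False positives and false negatives are stated in the report.
fp, fn = 25, 51
# ASSUMPTION: true positives and true negatives are not stated; these are
# the integer counts that reproduce the reported rates.
tp, tn = 56, 167

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # recall on the 'completed' class
specificity = tn / (tn + fp)   # recall on the 'none' class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(total, round(accuracy, 3), round(sensitivity, 3),
      round(specificity, 3), round(f1, 3))
```

With these counts the total is 299 cases, close to the 300 that test_size = 0.3 would hold out from 1,000 rows. That is consistent with the metrics being measured on a held-out test set, although the report does not say so explicitly.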

Interpretation

The model demonstrates asymmetric performance: it excels at identifying non-completion but struggles with completion detection. The high specificity (87%) indicates the model conservatively predicts completion, resulting in many false negatives. This imbalance reflects the class distribution (35.8% positive cases) and suggests the model's decision boundary favors the majority class. The moderate F1 score indicates reasonable but imperfect predictive utility for

Odds ratios with 95% confidence intervals for all predictors

Interpretation

Purpose

This section quantifies the individual effect of each predictor on the odds of test preparation completion. The odds ratios and confidence intervals reveal which student characteristics are statistically reliable predictors and the magnitude of their influence on completion likelihood. This directly supports the analysis objective to identify which characteristics predict test preparation completion.

Key Findings

Student Gender (Male): OR = 7.46 (95% CI: 4.30–13.25) - Male students have about 7.5 times the odds of completing test preparation, holding the other predictors constant; highly significant (p<0.001), with a confidence interval far above 1.0
Writing Score: OR = 1.26 (95% CI: 1.20–1.33) - Each one-point increase in writing score multiplies completion odds by 1.26 (a 26% increase); statistically significant (p<0.001)
Math & Reading Scores: OR = 0.92 (95% CI: 0.89–0.94) and OR = 0.91 (95% CI: 0.87–0.95) - Each one-point increase lowers completion odds by about 8% and 9% respectively, holding the other predictors constant; both significant (p<0.001)
Lunch Plan: OR = 0.82 (95% CI: 0.55–1.20) - Not statistically significant (p=0.305); the confidence interval crosses 1.0, indicating no reliable effect
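Each odds ratio is the exponential of the corresponding log-odds coefficient, and a simple Wald interval exponentiates the coefficient plus or minus 1.96 standard errors. The sketch below applies that to the log-odds and standard errors from the report's coefficient table. The Wald method is an assumption here: the report does not say how its intervals were computed, so small differences from the reported limits are to be expected.

```python
import math

# (log-odds, standard error) pairs copied from the report's coefficient table.
coefficients = {
    "math score": (-0.0882, 0.0158),
    "reading score": (-0.0946, 0.0218),
    "writing score": (0.2322, 0.0249),
    "student_gendermale": (2.009, 0.287),
    "lunch_planstandard": (-0.2038, 0.1986),
}

Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def odds_ratio_with_wald_ci(log_odds, std_error, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) on the odds-ratio scale."""
    return (
        math.exp(log_odds),
        math.exp(log_odds - z * std_error),
        math.exp(log_odds + z * std_error),
    )


for name, (beta, se) in coefficients.items():
    odds_ratio, lower, upper = odds_ratio_with_wald_ci(beta, se)
    verdict = "excludes 1" if lower > 1 or upper < 1 else "includes 1"
    print(f"{name}: OR={odds_ratio:.2f}, Wald CI=({lower:.2f}, {upper:.2f}), {verdict}")
```

An interval that includes 1 corresponds to a predictor that is not significant at the 5% level, which is the case only for the lunch plan.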

Interpretation

Four of five predictors significantly influence completion odds.

Distribution of predicted probabilities by actual class

Interpretation

Purpose

This section visualizes how well the logistic regression model separates students who completed test preparation from those who did not. The distribution of predicted probabilities reveals the model's confidence in its classifications and identifies overlap regions where the model struggles to distinguish between classes. This directly supports the objective of identifying student characteristics that predict test preparation completion.

Key Findings

Positive Class Percentage: 35.8% of students completed test preparation, creating moderate class imbalance that affects model calibration
Classification Threshold: The Youden-optimal threshold is 0.367, below the configured default of 0.5 (classification_threshold), reflecting the class imbalance and balancing sensitivity against specificity
Predicted Probability Range: Mean of 0.36 with standard deviation of 0.22 indicates moderate spread; skewness of 0.45 indicates a mild right skew, that is, a longer tail toward higher probabilities
Class Separation: Moderate overlap between distributions suggests the model achieves reasonable but imperfect discrimination between completed and non-completed cases
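To see what lowering the threshold does in practice, the sketch below labels a handful of cases at 0.5 and at 0.367. The probabilities are hypothetical and the helper name is made up for illustration; only the two thresholds and the class labels come from the report.

```python
def classify(prob, threshold, positive="completed", negative="none"):
    """Label a case positive when its predicted probability reaches the threshold."""
    return positive if prob >= threshold else negative


# Hypothetical predicted probabilities, not taken from the report.
example_probs = [0.15, 0.36, 0.40, 0.48, 0.55, 0.72]

default_labels = [classify(p, 0.5) for p in example_probs]
lowered_labels = [classify(p, 0.367) for p in example_probs]

# Cases whose label changes when the threshold drops from 0.5 to 0.367.
flipped = [p for p, a, b in zip(example_probs, default_labels, lowered_labels) if a != b]
print(flipped)
```

Only cases with probabilities between the two thresholds change label, all from "none" to "completed". On real data this raises sensitivity at the cost of some specificity.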

Interpretation

The predicted probabilities show meaningful separation between the two classes, consistent with the model's AUC of 0.80. The threshold of 0.367 is optimized below 0.50 because only 35.8% of students completed preparation, allowing the model to balance false positives and false negatives. The observed overlap explains why sensitivity (0

Full coefficient table with log-odds, odds ratios, and significance

variable	log_odds	std_error	z_stat	p_value	odds_ratio	ci_lower	ci_upper	significant
(Intercept)	-5.032	0.5788	-8.694	<0.001	0.007	0.002	0.02	Yes
math score	-0.0882	0.0158	-5.595	<0.001	0.916	0.887	0.944	Yes
reading score	-0.0946	0.0218	-4.334	<0.001	0.91	0.871	0.949	Yes
writing score	0.2322	0.0249	9.309	<0.001	1.261	1.203	1.326	Yes
student_gendermale	2.009	0.287	7.001	<0.001	7.457	4.295	13.25	Yes
lunch_planstandard	-0.2038	0.1986	-1.026	0.3048	0.816	0.552	1.204	No
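The coefficients in the table above can be turned into a predicted probability by summing the log-odds contributions and applying the logistic function. This is a minimal sketch: the function name, the argument names, and the two example students (all scores set to 70) are hypothetical, and the 0/1 coding of the dummy variables is an assumption based on the variable names.

```python
import math

# Coefficients (log-odds) copied from the table above.
INTERCEPT = -5.032
COEF = {
    "math": -0.0882,
    "reading": -0.0946,
    "writing": 0.2322,
    "male": 2.009,               # student_gendermale
    "standard_lunch": -0.2038,   # lunch_planstandard
}


def predict_probability(math_score, reading_score, writing_score,
                        male, standard_lunch):
    """Predicted probability of 'completed' for one student.

    male and standard_lunch are 1 or 0 (assumed dummy coding).
    """
    logit = (
        INTERCEPT
        + COEF["math"] * math_score
        + COEF["reading"] * reading_score
        + COEF["writing"] * writing_score
        + COEF["male"] * male
        + COEF["standard_lunch"] * standard_lunch
    )
    return 1 / (1 + math.exp(-logit))


# Two hypothetical students who differ only in gender.
p_female = predict_probability(70, 70, 70, male=0, standard_lunch=0)
p_male = predict_probability(70, 70, 70, male=1, standard_lunch=0)
print(round(p_female, 3), round(p_male, 3))
```

The two students land on opposite sides of both the 0.5 and the 0.367 thresholds, which illustrates how strongly the gender coefficient drives the predictions.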

Interpretation

Purpose

This section quantifies the relationship between each student characteristic and test preparation completion. The coefficient table reveals which factors statistically predict completion and the magnitude of their effects, directly addressing the research objective to identify predictive student characteristics through logistic regression.

Key Findings

Student Gender (Male): Odds ratio of 7.46 (p<0.001) - male students have about 7.5 times the odds of completing test preparation compared with female students, the largest odds ratio in the model
Writing Score: Odds ratio of 1.26 (p<0.001) - each additional point increases completion odds by 26%, the only academic score with a positive association
Math & Reading Scores: Odds ratios of 0.92 and 0.91 (p<0.001) - with the other predictors held constant, each additional point lowers completion odds by about 8% and 9% respectively, suggesting high-performing students may skip preparation
Lunch Plan: Odds ratio of 0.82 (p=0.305) - not statistically significant; lunch plan, often read as a proxy for socioeconomic status, shows no reliable effect
Model Fit: McFadden's R² = 0.172 indicates modest explanatory power; as a likelihood-based pseudo-R², it does not measure the share of variance explained

Interpretation

The model identifies gender as the dominant predictor of completion, with writing ability as a secondary factor. The inverse relationship
