Your district reports "no significant achievement gaps" across demographic groups. But did anyone check if you had enough students to detect a 10-point gap? Most gap analyses fail before they start—not because gaps don't exist, but because the statistical methodology can't find them. Here's how to run achievement gap analysis that produces defensible results you can act on.
The Three Questions Every Gap Analysis Must Answer
Before we discuss methodology, let's be clear about what student performance gap analysis actually does. It answers three specific questions:
- Which demographic groups show statistically significant differences in test scores? This separates real patterns from random variation in your data.
- How large are these gaps in practical terms? Statistical significance tells you the gap is real; effect size tells you if it matters.
- Did you have adequate statistical power to detect gaps? If you had 15 students per group, a "no gap found" result is meaningless—your test was blind.
Most gap analyses answer question one poorly, ignore question two entirely, and never ask question three. That's how districts conclude "no gaps exist" when they actually lack the statistical power to detect a 15-point difference.
The Experimental Design Perspective
Achievement gap analysis is an observational study, not a randomized experiment. You cannot randomly assign students to demographic groups. This means confounding variables—socioeconomic status, prior achievement, teacher assignment—may explain observed gaps. Statistical significance does not prove causation. Before designing interventions, ask: what else could explain this pattern?
What Your Sample Size Actually Tells You
Here's what most people get wrong about sample size: they think bigger is always better, and smaller means "run the test and see what happens." Both are wrong.
Sample size determines your minimum detectable effect—the smallest gap you can reliably find. With 30 students per group and standard assumptions (α=0.05, power=0.80), you can detect gaps around d=0.75—roughly 11-12 points on a typical standardized test. Anything smaller, and your test lacks power.
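To make the link between group size and minimum detectable effect concrete, here is a minimal sketch. It relies on a normal approximation, so it runs slightly below an exact t-based calculation for small groups. The function name `minimum_detectable_effect` is invented for this example, and the SD of 15 points is an assumption chosen to match the "typical test" figures above.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest Cohen's d two equal groups can reliably detect (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05, two-tailed
    z_beta = z.inv_cdf(power)           # about 0.84 for 80% power
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)


ASSUMED_SD = 15  # assumption: score SD implied by "d=0.5 is about 7-8 points"

for n in (15, 30, 64, 100):
    d = minimum_detectable_effect(n)
    print(f"n={n:>3} per group: d >= {d:.2f} (about {d * ASSUMED_SD:.1f} points)")
```

Run it with your own group sizes before you look at any test results, so the smallest gap you could have found is fixed in advance.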
Statistical Power and Why It Matters
Statistical power is the probability your test will detect a gap that actually exists. Standard practice uses 80% power—a 1-in-5 chance of missing a real gap. Here's what that looks like across different sample sizes for detecting a medium gap (d=0.5, about 7-8 points):
| Students per Group | Power to Detect d=0.5 | What This Means |
|---|---|---|
| 15 | 26% | You'll miss the gap about 74% of the time |
| 30 | 48% | Slightly worse than coin-flip odds of detection |
| 64 | 80% | Adequate power (standard) |
| 100 | 94% | High confidence in detection |
This table reveals why small schools struggle with gap analysis. If your demographic subgroups contain 20-25 students each, you have only about a 35-40% chance of detecting a medium gap even if one exists. Reporting "no significant gaps" from such data is misleading, because you lacked the power to find them.
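If you want to check power figures like these yourself, a normal approximation gets close using only Python's standard library. This is a rough sketch, not the method any particular tool uses: `approx_power` is a hypothetical helper, and exact t-based software reports values a point or two lower for small groups.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with equal group sizes."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the true gap is d
    noncentrality = d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


for n in (15, 20, 25, 30, 64, 100):
    print(f"n={n:>3} per group: power to detect d=0.5 is about {approx_power(0.5, n):.0%}")
```

Swap in the effect size that matters for your district; power for d=0.3 drops off much faster than power for d=0.5.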
Calculating Required Sample Size
Before collecting data, answer this question: what's the smallest gap that would warrant intervention? If you'd implement support programs for a 5-point gap (approximately d=0.35), you need about 130 students per group to detect it reliably.
The formula for required sample size per group is:
n = 2 × ((Z_α/2 + Z_β) / d)²
Where:
Z_α/2 = 1.96 for α=0.05 (two-tailed)
Z_β = 0.84 for 80% power
d = Cohen's d effect size you want to detect
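For d=0.5:
n = 2 × ((1.96 + 0.84) / 0.5)²
n = 2 × 5.6² = 2 × 31.36 = 62.72
n ≈ 63 per group (64 once the calculation uses the t-distribution instead of the normal approximation)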
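The same formula is easy to script. The sketch below is a direct transcription under the same normal approximation; `required_n_per_group` is a name chosen for illustration, and exact t-based tools typically add one more student per group.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(d, alpha=0.05, power=0.80):
    """Students needed per group: n = 2 * ((z_alpha/2 + z_beta) / d)^2, rounded up."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (0.2, 0.35, 0.5, 0.8):
    print(f"d={d}: about {required_n_per_group(d)} students per group")
```

Because n grows with 1/d², halving the gap you want to detect roughly quadruples the students you need.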
MCP Analytics calculates this automatically and flags when your sample size provides insufficient power for the gaps you're trying to detect.
What's Your Sample Size? Is This Test Adequately Powered?
Never report gap analysis results without calculating power. If your power is below 70%, state this limitation clearly. Better yet, combine multiple years of data or aggregate across schools to achieve adequate sample sizes. A well-powered null finding ("we found no gap") is valuable. An underpowered one is misleading.
How Student Performance Gap Analysis Works
The methodology is straightforward—the rigor is in the execution. Here's the actual statistical process:
Step 1: Independent Samples t-Tests for Binary Comparisons
For demographic categories with two groups (male/female, ELL/non-ELL, FRL/non-FRL), the analysis runs independent samples t-tests comparing mean scores:
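t = (M₁ - M₂) / √(SE₁² + SE₂²)
Where:
M₁, M₂ = group means
SE₁, SE₂ = standard errors of the group means (SD / √n for each group)
df = n₁ + n₂ - 2 for Student's t-test, which computes both standard errors from the pooled SD; Welch's t-test uses each group's own SD and the Welch-Satterthwaite approximation for df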
The t-statistic measures how many standard errors separate the two groups. Larger absolute values indicate larger gaps. The p-value tells you if this gap could plausibly occur by chance if no true difference existed.
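When you only have summary statistics (mean, SD, n), you can still reproduce the t-statistic. The sketch below covers both the pooled (Student) and unpooled (Welch) versions. The helper names and the example numbers are hypothetical, and the p-value uses a normal approximation that is reasonable only for large groups; use a statistics package for exact t-distribution p-values.

```python
from math import sqrt
from statistics import NormalDist


def t_from_summary(m1, sd1, n1, m2, sd2, n2, pooled=True):
    """Return (t, df) for an independent-samples t-test from summary statistics."""
    if pooled:
        # Student's t-test: pool the variances, df = n1 + n2 - 2
        df = n1 + n2 - 2
        pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
        se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    else:
        # Welch's t-test: separate standard errors, Welch-Satterthwaite df
        v1, v2 = sd1**2 / n1, sd2**2 / n2
        se = sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return (m1 - m2) / se, df


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (large samples only)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical groups: 100 students each, same SD, 15-point gap
t, df = t_from_summary(360, 45, 100, 345, 45, 100)
print(f"Student: t={t:.2f}, df={df}, approx p={approx_two_sided_p(t):.3f}")

t_w, df_w = t_from_summary(360, 45, 100, 345, 45, 100, pooled=False)
print(f"Welch:   t={t_w:.2f}, df={df_w:.1f}")
```

With equal group sizes and equal SDs the two versions agree; they diverge as the groups become more lopsided.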
Step 2: Effect Size Calculation (Cohen's d)
Statistical significance answers "is this gap real?" Effect size answers "does this gap matter?" Cohen's d expresses the gap in standard deviation units:
d = (M₁ - M₂) / SD_pooled
Where:
SD_pooled = √((SD₁² + SD₂²) / 2)
Interpretation guidelines:
- d = 0.2: Small gap (about 3 points on a typical test)
- d = 0.5: Medium gap (7-8 points, crosses percentile bands)
- d = 0.8: Large gap (12+ points, major intervention needed)
A Cohen's d of 0.5 means the average student in the higher-performing group scores better than 69% of students in the lower-performing group. That's not just statistically significant—it's educationally meaningful.
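As a minimal sketch, the effect size calculation fits in a few lines. `cohens_d` follows the pooled-SD formula given above (a simple average of the two variances). `cohens_d_weighted` is a common alternative that weights each variance by its degrees of freedom; it is not taken from this article and matters mainly when group sizes differ a lot. The "negligible" label below d=0.2 is likewise an assumption added for completeness.

```python
from math import sqrt


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d using the simple average of the two group variances."""
    return (m1 - m2) / sqrt((sd1**2 + sd2**2) / 2)


def cohens_d_weighted(m1, sd1, n1, m2, sd2, n2):
    """Variant that weights each variance by its degrees of freedom."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


def magnitude(d):
    """Label |d| using the 0.2 / 0.5 / 0.8 guidelines."""
    size = abs(d)
    if size < 0.2:
        return "negligible"  # assumed label for gaps below the "small" guideline
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical groups: 7.5-point gap, SD of 15 in both groups
d = cohens_d(107.5, 15, 100, 15)
print(f"d = {d:.2f} ({magnitude(d)})")

# Hypothetical lopsided groups, where the two formulas disagree
print(f"simple:   {cohens_d(107.5, 20, 100, 15):.2f}")
print(f"weighted: {cohens_d_weighted(107.5, 20, 40, 100, 15, 400):.2f}")
```

Whichever version you use, state it in the report so readers can reproduce the number.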
Step 3: Multiple Comparison Corrections
Here's where most analyses fail. If you test gender gaps in math, reading, and writing (3 tests), plus ELL status (3 more), plus FRL status (3 more), you've run 9 comparisons. At α=0.05, you'd expect about 0.45 false positives even if no gaps existed.
Two correction approaches:
Bonferroni correction (conservative): Divide your α level by the number of comparisons. For 9 tests: α = 0.05 / 9 ≈ 0.0056. Only gaps with p < 0.0056 are significant. This controls family-wise error rate but reduces power.
Benjamini-Hochberg procedure (recommended): Controls false discovery rate (FDR) while maintaining better power than Bonferroni. For most educational contexts, use FDR = 0.05. Read our guide to the Benjamini-Hochberg procedure for implementation details.
MCP Analytics applies Benjamini-Hochberg corrections automatically when running multiple demographic comparisons on the same dataset.
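To see how the two corrections differ in practice, here is a small standard-library sketch of both. It is an illustration of the procedures themselves, not a description of how any product implements them, and the nine p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Flag p-values declared significant by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * q
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank
    # Everything ranked at or below the cutoff is significant
    significant = [False] * m
    for index in order[:cutoff_rank]:
        significant[index] = True
    return significant


# Hypothetical p-values for 9 comparisons (3 subjects x 3 demographic variables)
p_values = [0.001, 0.004, 0.012, 0.020, 0.030, 0.045, 0.20, 0.51, 0.74]
print("Bonferroni keeps:", sum(bonferroni(p_values)), "of", len(p_values))
print("Benjamini-Hochberg keeps:", sum(benjamini_hochberg(p_values)), "of", len(p_values))
```

On these made-up values Bonferroni keeps two findings and Benjamini-Hochberg keeps four, which is the power difference described above.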
Step 4: Assumption Checking
T-tests assume:
- Independence: Each student's score is independent. Violated if you have siblings, students who took the test together, or repeated measures.
- Normality: Scores are approximately normally distributed within each group. T-tests are robust to violations with n>30, but severely skewed data needs attention.
- Equal variances: Both groups have similar spreads. If one group's SD is >2× the other's, use Welch's t-test instead.
Check these before interpreting results. Violations don't automatically invalidate your analysis, but they affect which version of the t-test to use. See our guide to t-test assumptions for diagnostic approaches.
When to Use Student Performance Gap Analysis
Use this analysis when you have:
- Continuous outcome variables: Test scores, scale scores, percentile ranks—any numeric measurement
- Clear demographic categories: Binary groups (ELL/non-ELL) or categorical groups (race/ethnicity)
- Adequate sample sizes: About 64 students per group for 80% power to detect medium gaps (d=0.5); with 30 per group, only large gaps (around d=0.75) are reliably detectable
- Independent observations: Each student appears once in each comparison
- Standardized assessment data: Common measures across groups (same test, same conditions)
Ideal Use Cases
Annual achievement gap monitoring: Compare demographic performance on state tests, district benchmarks, or grade-level assessments. Track gaps over time to measure intervention effectiveness.
Equity audits: Systematic analysis of gaps across multiple subjects, grade levels, and demographic dimensions. Identify which gaps are largest and most actionable.
Program evaluation: Compare outcomes for students in intervention programs (ELL support, Title I, special education) versus matched comparison groups. Did the program close the gap?
School comparison: Benchmark achievement gaps across schools in your district. Which schools show smaller gaps for similar demographics? What are they doing differently?
When NOT to Use This Analysis
Small sample sizes (n<20 per group): Results are unstable and lack power. Combine years, aggregate schools, or use Bayesian methods designed for small samples.
Non-independent observations: If students took multiple tests, use repeated-measures analysis. If data includes siblings or students nested in classrooms, use multilevel models.
Proficiency categories instead of scores: If you only have "below/meets/exceeds" data, use chi-square tests instead. T-tests require numeric scores.
Severely non-normal data: If scores are heavily skewed or have major outliers, use Mann-Whitney U tests instead of t-tests.
Causal claims: This is an observational analysis. You can say "ELL students scored 12 points lower on average" but not "ELL status caused lower scores." Confounding variables—prior achievement, teacher quality, class size—may explain the gap.
Data Requirements and Preparation
Before running any tests, check the design: what data do you actually have?
Required Data Columns
Your dataset needs:
- Student identifier: Unique ID for each student (can be anonymized)
- Test score variables: Numeric scores for math, reading, writing, or other assessments
- Demographic variables: Gender, race/ethnicity, ELL status, FRL status, IEP status, etc.
Example data structure:
student_id,math_score,reading_score,writing_score,gender,race,ell,frl
1001,342,356,348,F,Hispanic,Yes,Yes
1002,378,365,371,M,White,No,No
1003,351,342,338,F,Black,No,Yes
1004,389,392,385,M,Asian,No,No
...
Data Cleaning Checklist
Remove duplicate records: Each student should appear once. If you have multiple test dates, choose one (typically most recent or highest stakes).
Handle missing data appropriately: Students with missing test scores or demographic data get excluded from that specific comparison. Don't impute test scores—if a student didn't test, they're out. Document how many students were excluded and why.
Verify demographic coding consistency: Check for variations in category names (ELL vs EL, F vs Female vs f). Standardize before analysis.
Flag small groups: Identify demographic categories with fewer than 30 students. Results from these groups should be suppressed or flagged as low-confidence.
Check for impossible values: Test scores outside valid ranges, negative numbers, dates that don't make sense. These are data errors that will corrupt your results.
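The checklist above can be sketched with Python's built-in `csv` module. Everything specific here is an assumption made for illustration: the sample rows, the code mappings, the 200-500 valid score range, and the choice to keep the first record when a student appears twice (your rule may be most recent or highest stakes).

```python
import csv
import io
from collections import Counter

# Hypothetical raw export with typical problems: a duplicate row, a missing
# score, inconsistent category codes, and an impossible value.
RAW = """student_id,math_score,gender,ell
1001,342,F,Yes
1002,378,M,No
1002,378,M,No
1003,,Female,No
1004,389,m,EL
1005,9999,F,No
"""

GENDER_CODES = {"f": "F", "female": "F", "m": "M", "male": "M"}
ELL_CODES = {"yes": "Yes", "ell": "Yes", "el": "Yes", "no": "No"}
VALID_RANGE = (200, 500)  # assumed scale-score range
MIN_GROUP_SIZE = 30       # flag groups smaller than this


def clean(rows):
    """Return (kept_rows, exclusion_counts) after applying the checklist."""
    seen, kept, excluded = set(), [], Counter()
    for row in rows:
        student_id = row["student_id"].strip()
        if student_id in seen:  # keep the first record only (assumed rule)
            excluded["duplicate record"] += 1
            continue
        seen.add(student_id)
        try:
            score = float(row["math_score"])
        except ValueError:  # blank or non-numeric: exclude, never impute
            excluded["missing or non-numeric score"] += 1
            continue
        if not VALID_RANGE[0] <= score <= VALID_RANGE[1]:
            excluded["score outside valid range"] += 1
            continue
        kept.append({
            "student_id": student_id,
            "math_score": score,
            "gender": GENDER_CODES.get(row["gender"].strip().lower(), "UNRECOGNIZED"),
            "ell": ELL_CODES.get(row["ell"].strip().lower(), "UNRECOGNIZED"),
        })
    return kept, excluded


kept, excluded = clean(csv.DictReader(io.StringIO(RAW)))
group_sizes = Counter(row["ell"] for row in kept)
small_groups = sorted(g for g, n in group_sizes.items() if n < MIN_GROUP_SIZE)

print(f"kept {len(kept)} students; excluded: {dict(excluded)}")
print(f"ELL groups below n={MIN_GROUP_SIZE}: {small_groups}")
```

The exclusion counts double as the documentation the checklist asks for: how many students were dropped, and why.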
Privacy Considerations
Federal and state laws (FERPA, state privacy statutes) prohibit reporting results that could identify individual students. Common standard: suppress results for groups smaller than 10 students. Some states use 15 or 20 as thresholds. Check your local requirements before publishing gap analysis results.
Understanding Your Report
A well-designed gap analysis report presents four key elements. Here's what to look for and how to interpret each:
1. Summary Statistics Table
This shows the basic descriptive statistics for each demographic group:
| Group | n | Mean Score | SD | Min | Max |
|---|---|---|---|---|---|
| Male | 847 | 358.2 | 45.3 | 245 | 485 |
| Female | 823 | 366.4 | 43.8 | 251 | 492 |
Check the sample sizes first. Both groups far exceed 64, so this comparison has adequate power for medium gaps. The SDs are similar (45.3 vs 43.8), so the equal-variance assumption is reasonable.
2. Statistical Test Results
This table shows the formal hypothesis test outcomes:
| Comparison | Gap | t-statistic | p-value | Significant? |
|---|---|---|---|---|
| Female - Male | 8.2 | 3.76 | 0.0002 | Yes* |
The gap of 8.2 points favors females. The p-value of 0.0002 is well below 0.05, indicating statistical significance. But did you randomize? Of course not: you can't randomly assign gender. This is an observed association, not a causal finding.
3. Effect Size Estimates
This is where statistical significance becomes practical meaning:
| Comparison | Cohen's d | Magnitude | Percentile Shift |
|---|---|---|---|
| Female - Male | 0.18 | Small | 50th → 57th |
Despite statistical significance, the effect size is small (d=0.18). This gap moves the average student from the 50th to 57th percentile—real but modest. Before designing interventions, ask: is this gap worth addressing relative to larger gaps in your data?
4. Power Analysis
This tells you whether non-significant results are meaningful:
| Comparison | Sample Size | Power for d=0.3 | Power for d=0.5 | Power for d=0.8 |
|---|---|---|---|---|
| ELL - Non-ELL | n₁=45, n₂=823 | 50% | 90% | >99% |
This comparison has only 45 ELL students, and the smaller group caps the power no matter how large the comparison group is. Power is inadequate (about 50%) for detecting a gap of d=0.3, good (about 90%) for medium gaps, and excellent (above 99%) for large gaps. If you find no significant gap here, you can rule out medium and large gaps but not small ones.
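With unequal groups, power depends mostly on the smaller group. The following sketch shows one way to approximate it. It uses a normal approximation, so exact t-based software may differ slightly, and both helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def effective_n_per_group(n1, n2):
    """Equal-group size that gives the same precision as groups of n1 and n2."""
    return 2 * n1 * n2 / (n1 + n2)


def approx_power_unequal(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with unequal groups."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    noncentrality = d / sqrt(1 / n1 + 1 / n2)
    return z.cdf(noncentrality - z_crit) + z.cdf(-noncentrality - z_crit)


n_ell, n_non_ell = 45, 823
print(f"behaves like {effective_n_per_group(n_ell, n_non_ell):.0f} students per group")
for d in (0.3, 0.5, 0.8):
    print(f"power for d={d}: about {approx_power_unequal(d, n_ell, n_non_ell):.0%}")
```

The effective-n line is the useful intuition: a 45 vs 823 comparison is only about as informative as two groups of roughly 85, so adding more non-ELL students barely helps.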
How to Interpret the Results
Interpretation requires looking beyond p-values to understand what the data actually tells you. Here's the framework:
The Four-Quadrant Interpretation Matrix
Quadrant 1: Significant with Large Effect (p<0.05, d≥0.5)
Interpretation: A real gap that matters. The difference is both statistically reliable and educationally meaningful.
Action: Priority for intervention. This gap likely affects student outcomes in meaningful ways. Investigate root causes before designing interventions.
Example: FRL vs non-FRL math gap of 15 points (d=0.82, p<0.001) with n=400 per group.
Quadrant 2: Significant with Small Effect (p<0.05, d<0.3)
Interpretation: A real gap that's small. Large sample size detected a genuine but modest difference.
Action: Monitor but deprioritize. Focus resources on larger gaps unless this one is easily addressed.
Example: Gender gap of 4 points in reading (d=0.15, p=0.003) with n=800 per group.
Quadrant 3: Non-significant with Adequate Power (p≥0.05, power≥80%)
Interpretation: No meaningful gap exists. Your test had sufficient power to detect medium or large gaps, and none appeared.
Action: Celebrate equity in this area. Investigate what's working and apply lessons elsewhere.
Example: IEP vs non-IEP reading gap of 2 points (d=0.08, p=0.43) with n=200 per group, power>99% for d=0.5.
Quadrant 4: Non-significant with Inadequate Power (p≥0.05, power<70%)
Interpretation: Inconclusive. Your test couldn't detect medium gaps, so "no significant difference" doesn't mean much.
Action: Collect more data, combine years, or aggregate schools. Don't conclude "no gap" from underpowered analysis.
Example: Asian vs White math gap of 6 points (d=0.32, p=0.26) with n=25 per group, power=41% for d=0.5.
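The matrix translates naturally into a small decision function. This is a sketch using the thresholds stated in the four quadrants; the two "between quadrants" branches are an assumption added here because the thresholds leave d from 0.3 to 0.5 and power from 70% to 80% unassigned. The inputs in the loop are hypothetical.

```python
def classify_gap(p_value, d, power, alpha=0.05):
    """Map one comparison onto the four-quadrant matrix.

    `power` is the power to detect the smallest gap you care about,
    expressed as a fraction between 0 and 1.
    """
    if p_value < alpha:
        if abs(d) >= 0.5:
            return "Quadrant 1: priority for intervention"
        if abs(d) < 0.3:
            return "Quadrant 2: monitor but deprioritize"
        return "Between quadrants 1 and 2: significant, moderate effect"
    if power >= 0.80:
        return "Quadrant 3: no meaningful gap detected"
    if power < 0.70:
        return "Quadrant 4: inconclusive, collect more data"
    return "Between quadrants 3 and 4: borderline power"


# Hypothetical comparisons: (p-value, Cohen's d, power for d=0.5)
for p, d, power in [(0.001, 0.80, 0.99), (0.01, 0.15, 0.99),
                    (0.40, 0.08, 0.95), (0.25, 0.30, 0.40)]:
    print(f"p={p}, d={d}, power={power:.0%} -> {classify_gap(p, d, power)}")
```

Making the borderline cases explicit forces a deliberate call on them instead of letting them default into "no gap."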
Contextualizing Effect Sizes
Cohen's d becomes more meaningful when you translate it to familiar metrics:
Percentile movement: d=0.5 moves a student from 50th to 69th percentile. d=0.8 moves them from 50th to 79th.
Score points: On tests with SD=50 (common for scale scores), d=0.5 = 25 points. On tests with SD=15 (common for norm-referenced standard scores), d=0.5 = 7.5 points.
Overlap coefficient: d=0.5 means the two distributions overlap by 80%. d=0.8 means 69% overlap. d=1.0 means 62% overlap.
Years of learning: Rough heuristic—d=0.3 to 0.4 approximates one year's growth on many standardized tests. A d=0.8 gap represents roughly two years of learning difference.
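The percentile, score-point, and overlap translations all follow from the normal distribution, so you can compute them for any d. The sketch below assumes both groups are normally distributed with equal SDs; the function names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()


def percentile_of_average_student(d):
    """Percentile, within the lower group, of the higher group's average student."""
    return 100 * Z.cdf(d)


def overlap_coefficient(d):
    """Shared area of two equal-SD normal distributions whose means differ by d SDs."""
    return 2 * Z.cdf(-abs(d) / 2)


def gap_in_points(d, sd):
    """Convert an effect size to score points for a test with the given SD."""
    return d * sd


for d in (0.2, 0.5, 0.8, 1.0):
    print(f"d={d}: {percentile_of_average_student(d):.0f}th percentile, "
          f"{overlap_coefficient(d):.0%} overlap, "
          f"{gap_in_points(d, 50):.1f} points at SD=50, "
          f"{gap_in_points(d, 15):.1f} points at SD=15")
```

The overlap figures are worth showing to stakeholders: even at d=0.8, most of the two score distributions still coincide.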
Before Designing Interventions
Observed gaps don't prove discrimination or bias. They might reflect prior achievement differences, resource allocation, curriculum alignment, teacher effectiveness, or countless other factors. Use gap analysis to identify disparities, then use additional research methods to understand causes. Correlation is interesting. Causation requires more rigorous investigation.
Common Pitfalls to Avoid
Most gap analysis failures fall into predictable categories. Here's how to avoid them:
Pitfall 1: Confusing "No Significant Gap" with "No Gap"
The Error: Reporting "our analysis found no achievement gaps" when you had 20 students per group and only about 34% power to detect a medium gap (d=0.5).
Why It Matters: Absence of evidence is not evidence of absence. With low power, you'd miss even large gaps. Stakeholders may assume equity exists when you simply lacked the data to test it.
The Fix: Always report sample sizes and power analysis alongside results. State "with n=20 per group, we had insufficient power to detect gaps below d=0.9. No gaps larger than this were observed." Very different from "no gaps exist."
Pitfall 2: Running Multiple Tests Without Correction
The Error: Testing 5 demographic variables across 3 subjects (15 comparisons) at α=0.05, expecting no false positives.
Why It Matters: With 15 tests, you'd expect about 0.75 false positives even if no real gaps existed. One "significant" finding might be random noise.
The Fix: Apply Benjamini-Hochberg FDR correction or Bonferroni correction. Report both raw and corrected p-values. Flag which findings survive correction.
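The arithmetic behind this pitfall takes two lines. The sketch below assumes the tests are independent, which is a simplification: comparisons that reuse the same students are correlated, so treat the second number as a rough guide.

```python
def expected_false_positives(num_tests, alpha=0.05):
    """Average number of false positives when no real gaps exist."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


for m in (3, 9, 10, 15):
    print(f"{m:>2} tests: expect {expected_false_positives(m):.2f} false positives, "
          f"{chance_of_any_false_positive(m):.0%} chance of at least one")
```

At 15 uncorrected tests, seeing at least one spurious "significant" gap is more likely than not.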
Pitfall 3: Ignoring Effect Sizes
The Error: With 1,000 students per group, finding a 2-point gap (d=0.08) statistically significant and treating it as urgent.
Why It Matters: Large samples detect tiny effects. A 2-point gap might be statistically real but too small to justify intervention resources.
The Fix: Set practical significance thresholds before analysis. Many districts use d≥0.4 as the threshold for action. Report both p-values and effect sizes, prioritizing large gaps over barely-significant ones.
Pitfall 4: Violating Assumptions Without Checking
The Error: Running standard t-tests when one group has SD=60 and the other has SD=25 (heterogeneous variances).
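Why It Matters: Unequal variances can distort Type I error rates, especially when group sizes also differ (the error rate is inflated when the smaller group has the larger variance). Your "significant" finding might be an artifact of the assumption violation.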
The Fix: Check assumptions first. Use Levene's test or plot residuals. If variances differ substantially (ratio >2:1), use Welch's t-test instead of Student's t-test. See our guide to t-test assumptions.
Pitfall 5: Making Causal Claims from Observational Data
The Error: Concluding "being female causes higher reading scores" or "ELL status causes lower math achievement."
Why It Matters: Did you randomize? No—students weren't randomly assigned to demographic groups. Confounding variables (prior achievement, teacher assignment, family support) might explain the gap. Causal claims require experimental designs or quasi-experimental methods like propensity score matching.
The Fix: Use precise language. Say "students qualifying for FRL scored an average of 12 points lower" not "FRL status caused lower scores." Gaps indicate disparities that warrant investigation, not proven causes.
Pitfall 6: Suppression Rule Violations
The Error: Publishing results for a group with 8 students, potentially identifying individual students.
Why It Matters: Federal law (FERPA) and state statutes prohibit releasing data that could identify students. Small groups make individuals identifiable, especially in combination with other variables.
The Fix: Follow your state's suppression rules (typically n≥10 or n≥15). Flag suppressed results as "N<10, data suppressed for privacy." Consider combining years or schools to reach minimum thresholds while maintaining privacy.
Connecting Gaps to Interventions: The Next Steps
You've identified significant gaps with meaningful effect sizes. Now what? Here's how to move from analysis to action:
Step 1: Investigate Root Causes
Statistical gaps don't explain themselves. Before designing interventions, ask:
- Prior achievement: Do gaps exist on earlier assessments? Is this a new gap or continuation of existing patterns?
- Opportunity gaps: Do demographic groups have equal access to advanced courses, experienced teachers, instructional resources?
- Assessment bias: Could test format, language, or cultural content favor certain groups?
- Support systems: Do all students have equal access to tutoring, intervention programs, family support?
Use qualitative methods—teacher interviews, student focus groups, classroom observations—to understand mechanisms behind statistical patterns.
Step 2: Design Targeted Interventions
Match intervention intensity to gap magnitude:
Small gaps (d=0.2-0.4): Targeted supports within existing structures. Modified instruction, peer tutoring, additional practice.
Medium gaps (d=0.4-0.7): Systematic intervention programs. Dedicated instructional time, specialist support, evidence-based curricula.
Large gaps (d>0.7): Comprehensive support systems. Intensive interventions, family engagement, multi-year programs, possible structural changes.
Step 3: Build in Evaluation
Before implementing interventions, establish your evaluation design:
- Baseline measurement: Document current gap size and significance
- Target reduction: Define success—"reduce gap by 50%" or "achieve d<0.3"
- Comparison group: Identify students not receiving intervention for comparison
- Timeline: Set checkpoints for interim evaluation
- Success metrics: How will you know if the intervention worked?
Run the same gap analysis annually to track progress. Did the gap narrow? Did it widen? Did scores improve for both groups, maintaining the gap?
Step 4: Communicate Results Responsibly
When sharing gap analysis findings:
- Lead with context: Explain what gaps measure and what they don't
- Emphasize variability: Most variation is within groups, not between them. Many students in lower-performing groups outscore those in higher-performing groups.
- Avoid deficit language: Frame as opportunity gaps, not student deficits. Focus on what schools can control.
- Highlight progress: Show trend data when gaps narrow. Celebrate schools that reduce gaps.
- Include uncertainty: Report confidence intervals and power limitations. Statistical findings aren't absolute truth.
The Accountability Perspective
Achievement gap analysis serves accountability systems, but remember its limits. Gaps reflect complex interactions of student background, school resources, curriculum quality, teacher effectiveness, and measurement properties. Holding schools accountable for gap closure makes sense only when accompanied by resources, support, and recognition that some factors lie outside school control. Use gap analysis to identify disparities, allocate resources, and monitor progress—not to punish schools for demographic composition.
Real-World Example: District Math Performance Analysis
Let's examine a realistic scenario to see gap analysis in practice:
The Research Question
A mid-sized district (8,500 students, grades 3-8) wants to analyze math achievement gaps across demographic groups using state assessment data. They're particularly concerned about ELL and FRL gaps based on previous reports.
The Data
Spring 2025 state math assessment scale scores (range: 200-500, M=350, SD=50). Demographic breakdowns:
- Female: n=847, M=358.2, SD=45.3
- Male: n=823, M=350.1, SD=47.8
- ELL: n=412, M=325.8, SD=51.2
- Non-ELL: n=1,258, M=361.4, SD=42.6
- FRL: n=1,142, M=338.6, SD=48.5
- Non-FRL: n=528, M=374.8, SD=38.2
Step 1: Check Sample Sizes and Power
All groups exceed n=400, providing excellent power (>99%) to detect both medium effects (d=0.5) and smaller effects of d=0.3. Even a gap of d=0.2 would be detected with roughly 94% power or better in every comparison, so any gaps of d=0.3 or larger should be detectable.
Step 2: Run t-Tests
| Comparison | Gap | t-statistic | df | p-value | Cohen's d |
|---|---|---|---|---|---|
| Female - Male | +8.1 | 3.52 | 1,668 | 0.0004 | 0.17 |
| Non-ELL - ELL | +35.6 | 13.45 | 1,668 | <0.0001 | 0.76 |
| Non-FRL - FRL | +36.2 | 14.82 | 1,668 | <0.0001 | 0.82 |
Step 3: Apply Multiple Comparison Correction
With 3 comparisons, Benjamini-Hochberg FDR correction at q=0.05:
- Rank p-values: p₁<0.0001, p₂<0.0001, p₃=0.0004
- Critical values: p₁≤0.0167, p₂≤0.0333, p₃≤0.05
- All three comparisons remain significant after correction
Step 4: Interpret Effect Sizes
Gender gap (d=0.17): Small effect. Statistically significant but practically modest. Moves the average student from the 50th to the 57th percentile. Monitor, but not an urgent priority.
ELL gap (d=0.76): Large effect. 35.6-point gap represents about 71% of one standard deviation. Average ELL student scores better than only 22% of non-ELL students. This gap warrants immediate investigation and intervention.
FRL gap (d=0.82): Large effect. 36.2-point gap represents about 72% of one SD. This is the most substantial gap in the analysis. Average FRL student scores better than only 21% of non-FRL students.
Step 5: Check Assumptions
Equal variances: SD ratios range from 1.06 to 1.27, well within acceptable limits. Standard t-tests are appropriate.
Normality: With n>400 in all groups, t-tests are robust to normality violations by the central limit theorem. Q-Q plots show approximate normality.
Independence: Each student appears once. No siblings or repeated measures in dataset.
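The equal-variance check in this step is simple enough to script. The sketch below recomputes the SD ratios from the group SDs listed in the data section and applies the 2:1 rule of thumb for switching to Welch's t-test. The helper names are hypothetical, and a ratio check is a quick screen, not a replacement for Levene's test.

```python
# Group SDs from the district example: (first group, second group)
GROUP_SDS = {
    "Female vs Male": (45.3, 47.8),
    "ELL vs Non-ELL": (51.2, 42.6),
    "FRL vs Non-FRL": (48.5, 38.2),
}


def sd_ratio(sd_a, sd_b):
    """Ratio of the larger SD to the smaller one (always >= 1)."""
    return max(sd_a, sd_b) / min(sd_a, sd_b)


def suggested_test(sd_a, sd_b, max_ratio=2.0):
    """Apply the rule of thumb: switch to Welch's t-test above a 2:1 SD ratio."""
    return "Welch's t-test" if sd_ratio(sd_a, sd_b) > max_ratio else "Student's t-test"


for comparison, (sd_a, sd_b) in GROUP_SDS.items():
    print(f"{comparison}: SD ratio {sd_ratio(sd_a, sd_b):.2f} -> {suggested_test(sd_a, sd_b)}")
```

Some analysts skip the decision entirely and default to Welch's test, since it costs little power when variances happen to be equal.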
Recommendations to District Leadership
Immediate priorities: ELL and FRL gaps demand attention. Both exceed d=0.7, indicating substantial educational disparities.
Root cause investigation: Before designing interventions, examine:
- Do ELL students receive adequate math instruction in accessible language?
- Are FRL students disproportionately assigned to less experienced teachers?
- Do these gaps exist at earlier grade levels, or do they emerge over time?
- Are there interaction effects? (ELL+FRL students may face compounded gaps)
Intervention planning: Gaps of this magnitude require systematic, well-resourced interventions. Consider:
- Math intervention programs with proven effect sizes >0.3
- Professional development on supporting ELL students in math
- Family engagement programs to build math support at home
- Extended learning time (tutoring, summer programs) for identified students
Evaluation design: Establish baseline (current gaps), set targets (reduce gaps to d<0.4 within 3 years), and track progress annually using the same analysis.
What This Example Demonstrates
This analysis moved from statistical findings to actionable recommendations by:
- Verifying adequate power before interpreting results
- Applying appropriate corrections for multiple comparisons
- Translating effect sizes into practical meaning
- Checking assumptions to ensure valid inference
- Prioritizing gaps by magnitude, not just significance
- Framing gaps as opportunities for improvement, not student deficits
- Connecting findings to intervention planning and evaluation
This is what rigorous gap analysis looks like—not just p-values, but a complete picture of statistical validity, practical significance, and actionable next steps.
Try It Yourself: MCP Analytics Student Performance Gap Analysis
MCP Analytics automates the entire workflow described in this article. Upload your CSV file with student IDs, test scores, and demographic variables. The system:
- Runs independent samples t-tests for all binary demographic comparisons
- Calculates Cohen's d effect sizes with confidence intervals
- Applies Benjamini-Hochberg multiple comparison corrections
- Checks statistical assumptions and flags violations
- Performs power analysis to identify underpowered comparisons
- Generates visualizations showing gap magnitude and significance
- Produces a complete report with interpretation guidelines
The analysis runs in about 60 seconds and returns a comprehensive report you can share with stakeholders.
Frequently Asked Questions
What sample size do I need to detect achievement gaps?
For a medium effect size (d=0.5, roughly 7-8 points on a standardized test), you need approximately 64 students per group to achieve 80% power at α=0.05. Smaller groups (n<30) lack statistical power to detect anything but the largest gaps. Always run a power analysis before concluding "no gap exists"—you may simply have insufficient sample size.
How do I interpret Cohen's d effect size for test score gaps?
Cohen's d expresses the gap in standard deviation units: d=0.2 is small (about 3 points on a typical test), d=0.5 is medium (7-8 points), d=0.8 is large (12+ points). A d=0.5 gap means the average student in the higher-performing group scores better than 69% of students in the lower-performing group. Focus on d≥0.4 for intervention planning—smaller gaps may be statistically significant but too small to justify resource allocation.
Can I use this analysis with small demographic groups?
Groups smaller than 15-20 students severely limit statistical power and make assumption checks unreliable. If you have small groups, consider combining years of data, aggregating across schools, or using Bayesian methods that handle small samples better. Never report gaps for groups under 10 students—results are unstable and may violate student privacy requirements (FERPA).
What's the difference between statistical and practical significance in achievement gaps?
Statistical significance (p<0.05) means the gap likely isn't due to chance. Practical significance means the gap is large enough to matter. With 500 students per group, a 2-point gap might be statistically significant but too small for intervention. Conversely, a 15-point gap in small groups (n=20) might not reach significance but clearly warrants investigation. Always examine both p-values and effect sizes before making decisions.
Should I use t-tests or ANOVA for multiple demographic comparisons?
For binary comparisons (male vs female, ELL vs non-ELL), use independent samples t-tests. For three or more groups (multiple race/ethnicity categories), use ANOVA followed by post-hoc tests with Bonferroni or Benjamini-Hochberg corrections. Running multiple uncorrected t-tests inflates false positive rates: with 5 groups, you'd run 10 pairwise comparisons, giving roughly a 40% chance of at least one false positive.