ICC Reliability Study Overview
ICC Reliability Study Configuration
Analysis overview and configuration
test_1772935870
Analysis Overview
This analysis evaluates inter-rater reliability across five clinicians assessing clinical severity ratings for 120 subjects across four diagnostic categories. The objective is to determine whether clinicians provide consistent and reproducible severity assessments, which is critical for ensuring diagnostic validity and treatment planning consistency in clinical practice.
The
Analysis Overview
This analysis evaluates inter-rater reliability across five clinicians assessing clinical severity ratings for 120 subjects across four diagnostic categories. The objective is to determine whether clinicians provide consistent and reproducible severity assessments, which is critical for ensuring diagnostic validity and treatment planning consistency in clinical practice.
The
Data Quality & Completeness
Data preprocessing and column mapping
Data Preprocessing
This section documents the data preprocessing pipeline for the inter-rater reliability study. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of ICC calculations and ensuring all 120 subjects with 5 raters each are represented in the final analysis.
The perfect retention rate demonstrates exceptionally clean input data with no missing values, duplicates, or invalid entries requiring removal. This is particularly important for ICC(2,k) calculations, which depend on balanced designs across all raters and subjects. The absence of any filtering or exclusion criteria means the reported ICC of 0.854 and stratified reliability estimates are based on the complete intended sample, strengthening confidence in the inter-rater reliability conclusions across all four diagnostic categories.
No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of
Data Preprocessing
This section documents the data preprocessing pipeline for the inter-rater reliability study. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of ICC calculations and ensuring all 120 subjects with 5 raters each are represented in the final analysis.
The perfect retention rate demonstrates exceptionally clean input data with no missing values, duplicates, or invalid entries requiring removal. This is particularly important for ICC(2,k) calculations, which depend on balanced designs across all raters and subjects. The absence of any filtering or exclusion criteria means the reported ICC of 0.854 and stratified reliability estimates are based on the complete intended sample, strengthening confidence in the inter-rater reliability conclusions across all four diagnostic categories.
No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of
Key Findings and Recommendations
Key Findings & Recommendations
| Finding | Value |
|---|---|
| Overall ICC | 0.854 |
| Reliability Level | Excellent |
| 95% Confidence Interval | [0.812, 0.890] |
| ICC Model Used | ICC2k |
| Subjects Analyzed | 120 |
| Number of Raters | 5 |
| Total Observations | 600 |
| F-Test Result | F=0.00, p=1.000 |
| Standard Error (SEM) | 6.39 points |
Bottom Line: This inter-rater_id reliability study assessed 120 subjects rated by 5 clinicians. The Intraclass Correlation Coefficient (ICC) of 0.854 indicates excellent reliability (95% CI: [0.812, 0.890]).
Key Findings:
• ICC Model: ICC2k was used to assess inter-rater_id reliability with random raters (generalizable)
• Statistical Significance: F-test (F=0.00, p=1.000) does not confirm significant between-subject_id variance
• Measurement Error: SEM = 6.39 points (expected error in individual ratings)
• Bland-Altman: Mean difference = -1.51 points, 95% LOA [-19.55, 16.53]
• Stratified Analysis: Reliability assessed across 4 diagnostic categories
Recommendation: Current rater_id training and protocols are effective. Maintain current practices and monitor reliability over time.
Executive Summary
This analysis evaluates whether five clinicians can reliably and consistently rate clinical severity across 120 patients with four diagnostic categories. The objective is to validate that the assessment protocol produces trustworthy, reproducible ratings regardless of which clinician performs the evaluation—a critical requirement for clinical decision-making and research validity.
Executive Summary
This analysis evaluates whether five clinicians can reliably and consistently rate clinical severity across 120 patients with four diagnostic categories. The objective is to validate that the assessment protocol produces trustworthy, reproducible ratings regardless of which clinician performs the evaluation—a critical requirement for clinical decision-making and research validity.
Model Comparison with Confidence Intervals
Model Comparison with 95% Confidence Intervals
All 10 ICC forms with rater_confidence intervals for model selection
ICC Values - All 10 Forms
This section quantifies inter-rater reliability across all ICC model variants to establish whether clinician severity ratings are sufficiently consistent for clinical decision-making. The primary ICC(2k) value of 0.854 directly addresses the core objective: assessing whether the five clinicians provide reliable, generalizable assessments across the 120 subjects and four diagnostic categories.
The ICC(2k) model was appropriately selected because it assumes raters are random representatives of a larger clinician population, making findings generalizable beyond these five clinicians. The 0.11-point gap between single-rater (0.85) and average-rater
ICC Values - All 10 Forms
This section quantifies inter-rater reliability across all ICC model variants to establish whether clinician severity ratings are sufficiently consistent for clinical decision-making. The primary ICC(2k) value of 0.854 directly addresses the core objective: assessing whether the five clinicians provide reliable, generalizable assessments across the 120 subjects and four diagnostic categories.
The ICC(2k) model was appropriately selected because it assumes raters are random representatives of a larger clinician population, making findings generalizable beyond these five clinicians. The 0.11-point gap between single-rater (0.85) and average-rater
Statistical Significance and Measurement Error
ICC Model Justification and SEM
ICC model selection guidance and statistical significance
| Metric | Value |
|---|---|
| ICC Model Used | ICC2k |
| ICC Value | 0.854 |
| 95% CI | [0.812, 0.890] |
| F-Statistic | 0.00 |
| P-Value | 1.000 |
| SEM | 6.39 |
| Interpretation | Excellent |
Model Selection & Statistics
This section evaluates whether the ICC model assumptions are statistically valid and quantifies measurement error inherent in the clinical severity rating process. It determines which ICC variant (fixed vs. random raters) is most appropriate for generalizing reliability findings beyond the current five clinicians.
Despite the non-significant F-test, the excellent ICC(2k) of 0.854 (95% CI: 0.812–0.890) demonstrates strong inter-rater reliability across 120 subjects and 600 observations. The SEM of 6.39 indicates clinically acceptable measurement precision for severity assessment. The ICC(
Model Selection & Statistics
This section evaluates whether the ICC model assumptions are statistically valid and quantifies measurement error inherent in the clinical severity rating process. It determines which ICC variant (fixed vs. random raters) is most appropriate for generalizing reliability findings beyond the current five clinicians.
Despite the non-significant F-test, the excellent ICC(2k) of 0.854 (95% CI: 0.812–0.890) demonstrates strong inter-rater reliability across 120 subjects and 600 observations. The SEM of 6.39 indicates clinically acceptable measurement precision for severity assessment. The ICC(
Individual Rater Statistics and Bias
Individual Rater Performance & Bias
Rater-level statistics showing mean scores and systematic bias
| rater_id | mean_score | sd_score | n_subjects | bias | mean_confidence |
|---|---|---|---|---|---|
| Dr_Adams | 51.384 | 15.544 | 120.000 | 0.042 | 3.725 |
| Dr_Baker | 52.892 | 17.155 | 120.000 | 1.550 | 3.683 |
| Dr_Chen | 49.087 | 15.999 | 120.000 | -2.254 | 3.508 |
| Dr_Davis | 53.471 | 17.193 | 120.000 | 2.129 | 3.775 |
| Dr_Evans | 49.875 | 17.511 | 120.000 | -1.467 | 3.800 |
Rater Statistics
This section evaluates individual rater performance and systematic bias to assess whether clinicians are applying severity rating scales consistently. Understanding rater-level variation is essential for validating the overall inter-rater reliability findings and identifying whether observed agreement reflects true clinical consensus or masks individual scoring patterns.
The minimal bias across all five clinicians (SD = 1.88) demonstrates that systematic scoring differences are negligible relative to the overall scale range. Despite individual confidence variations, raters maintain comparable mean scores and rating variability, supporting the excellent ICC (0.854) observed at the aggregate
Rater Statistics
This section evaluates individual rater performance and systematic bias to assess whether clinicians are applying severity rating scales consistently. Understanding rater-level variation is essential for validating the overall inter-rater reliability findings and identifying whether observed agreement reflects true clinical consensus or masks individual scoring patterns.
The minimal bias across all five clinicians (SD = 1.88) demonstrates that systematic scoring differences are negligible relative to the overall scale range. Despite individual confidence variations, raters maintain comparable mean scores and rating variability, supporting the excellent ICC (0.854) observed at the aggregate
Pairwise Correlation Heatmap
Pairwise Correlation Heatmap
Pairwise correlations between all raters showing agreement patterns
Rater-by-Rater Agreement
This section evaluates pairwise agreement between all five clinicians to identify whether specific rater combinations show systematic disagreement. Strong correlations across all pairs validate that the overall ICC(2k) = 0.854 reflects genuine consensus rather than masking problematic rater combinations. Identifying weak pairs is critical for understanding whether inter-rater reliability issues are localized or pervasive.
The absence of any weak pairwise correlations (<0.5) confirms that the excellent overall ICC reflects genuine consensus rather than compensating disagreements. All five clinicians interpret clinical severity ratings similarly, with minimal systematic bias between any pair. This uniform agreement across
Rater-by-Rater Agreement
This section evaluates pairwise agreement between all five clinicians to identify whether specific rater combinations show systematic disagreement. Strong correlations across all pairs validate that the overall ICC(2k) = 0.854 reflects genuine consensus rather than masking problematic rater combinations. Identifying weak pairs is critical for understanding whether inter-rater reliability issues are localized or pervasive.
The absence of any weak pairwise correlations (<0.5) confirms that the excellent overall ICC reflects genuine consensus rather than compensating disagreements. All five clinicians interpret clinical severity ratings similarly, with minimal systematic bias between any pair. This uniform agreement across
Difference vs Mean Plot with Limits of Agreement
Difference vs Mean Plot
Bland-Altman plot visualizing agreement between two key raters
Bland-Altman Agreement
This section quantifies agreement between two individual raters on clinical severity ratings using the Bland-Altman method. It complements the overall ICC analysis by examining systematic bias and variability patterns between specific rater pairs, revealing whether disagreements are random or reflect consistent scoring tendencies that could affect clinical decision-making.
The near-zero mean difference indicates no systematic bias between these two raters—a critical finding for clinical validity. However, the wide limits of agreement (±19.5 points) reveal substantial individual-level disagreement despite strong overall reliability. This pattern reflects the distinction between group-level consistency (ICC) and pairwise agreement: while raters rank severity similarly across subjects
Bland-Altman Agreement
This section quantifies agreement between two individual raters on clinical severity ratings using the Bland-Altman method. It complements the overall ICC analysis by examining systematic bias and variability patterns between specific rater pairs, revealing whether disagreements are random or reflect consistent scoring tendencies that could affect clinical decision-making.
The near-zero mean difference indicates no systematic bias between these two raters—a critical finding for clinical validity. However, the wide limits of agreement (±19.5 points) reveal substantial individual-level disagreement despite strong overall reliability. This pattern reflects the distinction between group-level consistency (ICC) and pairwise agreement: while raters rank severity similarly across subjects
Actionable Next Steps for Improving Reliability
Actionable Next Steps
Actionable recommendations based on ICC analysis findings
Recommendations
This section synthesizes the inter-rater reliability analysis to guide clinical practice improvements. It translates the ICC findings into actionable insights for the Clinical Assessment Research Lab, helping stakeholders understand whether current rater training and assessment protocols are sufficiently reliable for clinical decision-making across diagnostic categories.
The excellent overall ICC (0.854) confirms that clinical severity ratings are highly consistent across raters, validating current assessment protocols. However, stratified analysis reveals diagnostic categories warrant differential attention—Depression ratings are more reliable than Anxiety or PTSD assessments. Individual rater correlations (mean 0.89) further support protocol effectiveness, though modest bias patterns suggest some clinicians systematically
Recommendations
This section synthesizes the inter-rater reliability analysis to guide clinical practice improvements. It translates the ICC findings into actionable insights for the Clinical Assessment Research Lab, helping stakeholders understand whether current rater training and assessment protocols are sufficiently reliable for clinical decision-making across diagnostic categories.
The excellent overall ICC (0.854) confirms that clinical severity ratings are highly consistent across raters, validating current assessment protocols. However, stratified analysis reveals diagnostic categories warrant differential attention—Depression ratings are more reliable than Anxiety or PTSD assessments. Individual rater correlations (mean 0.89) further support protocol effectiveness, though modest bias patterns suggest some clinicians systematically