Context and Data Preparation

ICC Reliability Study Overview

Analysis Overview

ICC Reliability Study Configuration

Analysis overview and configuration

Icc

Clinical Assessment Research Lab

Assess inter-rater_id reliability of clinical severity ratings across multiple clinicians and diagnostic categories

Module Configuration

icc_model ICC2k

confidence_level 0.95

min_subjects 30

min_raters 3

Processing ID

test_1772935870

Key Insights

Analysis Overview

Purpose

This analysis evaluates inter-rater reliability across five clinicians assessing clinical severity ratings for 120 subjects across four diagnostic categories. The objective is to determine whether clinicians provide consistent and reproducible severity assessments, which is critical for ensuring diagnostic validity and treatment planning consistency in clinical practice.

Key Findings

Primary ICC (ICC2k): 0.854 (95% CI: 0.812–0.890) - Excellent agreement indicating strong reliability among raters
Average Rater ICC: 0.97 - When ratings are averaged across all five raters, reliability reaches near-perfect levels
Single Rater ICC: 0.85–0.86 - Individual clinician assessments show good but lower reliability than averaged scores
Bland-Altman Mean Difference: -1.508 with limits of agreement (-19.547 to 16.532) - Minimal systematic bias but moderate individual variation
Stratified Reliability: Depression (0.90) shows highest agreement; Anxiety and PTSD (0.81–0.82) show lowest, suggesting diagnostic category influences consistency
Rater Correlations: Range 0.81–1.00 (mean 0.89) - Most pairwise comparisons demonstrate strong agreement

Interpretation

The

Key Insights

Analysis Overview

Purpose

Key Findings

Primary ICC (ICC2k): 0.854 (95% CI: 0.812–0.890) - Excellent agreement indicating strong reliability among raters
Average Rater ICC: 0.97 - When ratings are averaged across all five raters, reliability reaches near-perfect levels
Single Rater ICC: 0.85–0.86 - Individual clinician assessments show good but lower reliability than averaged scores
Bland-Altman Mean Difference: -1.508 with limits of agreement (-19.547 to 16.532) - Minimal systematic bias but moderate individual variation
Stratified Reliability: Depression (0.90) shows highest agreement; Anxiety and PTSD (0.81–0.82) show lowest, suggesting diagnostic category influences consistency
Rater Correlations: Range 0.81–1.00 (mean 0.89) - Most pairwise comparisons demonstrate strong agreement

Interpretation

The

Data Preprocessing

Data Quality & Completeness

600

Final Observations

Data preprocessing and column mapping

Data Pipeline

600

Initial Records

600

Clean Records

Column Mapping

subject

→

subject_id

rater

→

rater_id

rating

→

severity_score

diagnostic_category

→

Key Insights

Data Preprocessing

Purpose

This section documents the data preprocessing pipeline for the inter-rater reliability study. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of ICC calculations and ensuring all 120 subjects with 5 raters each are represented in the final analysis.

Key Findings

Retention Rate: 100% (600/600 rows) - All observations passed quality checks with zero exclusions
Rows Removed: 0 - No data loss occurred during preprocessing
Train/Test Split: Not applicable - This is a reliability assessment, not a predictive model requiring data partitioning
Data Integrity: Complete dataset preserved ensures ICC estimates reflect the full sample without selection bias

Interpretation

The perfect retention rate demonstrates exceptionally clean input data with no missing values, duplicates, or invalid entries requiring removal. This is particularly important for ICC(2,k) calculations, which depend on balanced designs across all raters and subjects. The absence of any filtering or exclusion criteria means the reported ICC of 0.854 and stratified reliability estimates are based on the complete intended sample, strengthening confidence in the inter-rater reliability conclusions across all four diagnostic categories.

Context

No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of

Key Insights

Data Preprocessing

Purpose

Key Findings

Retention Rate: 100% (600/600 rows) - All observations passed quality checks with zero exclusions
Rows Removed: 0 - No data loss occurred during preprocessing
Train/Test Split: Not applicable - This is a reliability assessment, not a predictive model requiring data partitioning
Data Integrity: Complete dataset preserved ensures ICC estimates reflect the full sample without selection bias

Interpretation

Context

No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of

Executive Summary

Key Findings and Recommendations

TLDR

Executive Summary

Key Findings & Recommendations

0.854

Overall ICC

Key Performance Indicators

Primary icc

85.4%

Icc interpretation

Excellent

N subjects

120

N raters

Key Findings

Key findings

Finding	Value
Overall ICC	0.854
Reliability Level	Excellent
95% Confidence Interval	[0.812, 0.890]
ICC Model Used	ICC2k
Subjects Analyzed	120
Number of Raters	5
Total Observations	600
F-Test Result	F=0.00, p=1.000
Standard Error (SEM)	6.39 points

Executive Summary

Bottom Line: This inter-rater_id reliability study assessed 120 subjects rated by 5 clinicians. The Intraclass Correlation Coefficient (ICC) of 0.854 indicates excellent reliability (95% CI: [0.812, 0.890]).

Key Findings:
• ICC Model: ICC2k was used to assess inter-rater_id reliability with random raters (generalizable)
• Statistical Significance: F-test (F=0.00, p=1.000) does not confirm significant between-subject_id variance
• Measurement Error: SEM = 6.39 points (expected error in individual ratings)
• Bland-Altman: Mean difference = -1.51 points, 95% LOA [-19.55, 16.53]
• Stratified Analysis: Reliability assessed across 4 diagnostic categories

Recommendation: Current rater_id training and protocols are effective. Maintain current practices and monitor reliability over time.

Key Insights

Executive Summary

EXECUTIVE SUMMARY: INTER-RATER RELIABILITY ASSESSMENT

Purpose

This analysis evaluates whether five clinicians can reliably and consistently rate clinical severity across 120 patients with four diagnostic categories. The objective is to validate that the assessment protocol produces trustworthy, reproducible ratings regardless of which clinician performs the evaluation—a critical requirement for clinical decision-making and research validity.

Key Findings

Primary ICC (0.854): Excellent inter-rater reliability with narrow 95% confidence interval [0.812–0.890], indicating strong agreement among clinicians
Measurement Error (SEM = 6.39): Expected rating variation of approximately ±6 points on the severity scale, clinically acceptable given the grand mean of 51.3
Bland-Altman Agreement: Mean difference of −1.51 points with limits of agreement [−19.55, 16.53], showing minimal systematic bias but moderate individual variation
Diagnostic Consistency: Stratified analysis across four categories (Anxiety, PTSD, OCD, Depression) demonstrates reliability ranges 0.81–0.90, with Depression showing strongest agreement (0.90)
Rater Parity: Individual clinician biases range −2.25 to +2.13 points; no rater systematically infl

Key Insights

Executive Summary

EXECUTIVE SUMMARY: INTER-RATER RELIABILITY ASSESSMENT

Purpose

Key Findings

Primary ICC (0.854): Excellent inter-rater reliability with narrow 95% confidence interval [0.812–0.890], indicating strong agreement among clinicians
Measurement Error (SEM = 6.39): Expected rating variation of approximately ±6 points on the severity scale, clinically acceptable given the grand mean of 51.3
Bland-Altman Agreement: Mean difference of −1.51 points with limits of agreement [−19.55, 16.53], showing minimal systematic bias but moderate individual variation
Diagnostic Consistency: Stratified analysis across four categories (Anxiety, PTSD, OCD, Depression) demonstrates reliability ranges 0.81–0.90, with Depression showing strongest agreement (0.90)
Rater Parity: Individual clinician biases range −2.25 to +2.13 points; no rater systematically infl

ICC Results - All Forms

Model Comparison with Confidence Intervals

ICC

ICC Values - All 10 Forms

Model Comparison with 95% Confidence Intervals

0.854

Primary ICC

All 10 ICC forms with rater_confidence intervals for model selection

0.854

primary icc

ICC2k

icc model

Excellent

icc interpretation

Key Insights

ICC Values - All 10 Forms

Purpose

This section quantifies inter-rater reliability across all ICC model variants to establish whether clinician severity ratings are sufficiently consistent for clinical decision-making. The primary ICC(2k) value of 0.854 directly addresses the core objective: assessing whether the five clinicians provide reliable, generalizable assessments across the 120 subjects and four diagnostic categories.

Key Findings

Primary ICC (ICC2k): 0.854 [0.812–0.890] – Exceeds the 0.75 threshold for excellent reliability, confirming ratings are suitable for clinical use
Single-Rater Reliability: 0.85–0.86 – Individual clinician assessments show strong agreement, though slightly lower than averaged ratings
Average-Rater Reliability: 0.97 – Combining multiple raters yields near-perfect consistency, demonstrating systematic measurement validity
Confidence Interval Width: 0.078 points – Narrow CI reflects stable estimates across the sample of 120 subjects and 600 observations

Interpretation

The ICC(2k) model was appropriately selected because it assumes raters are random representatives of a larger clinician population, making findings generalizable beyond these five clinicians. The 0.11-point gap between single-rater (0.85) and average-rater

Key Insights

ICC Values - All 10 Forms

Purpose

Key Findings

Primary ICC (ICC2k): 0.854 [0.812–0.890] – Exceeds the 0.75 threshold for excellent reliability, confirming ratings are suitable for clinical use
Single-Rater Reliability: 0.85–0.86 – Individual clinician assessments show strong agreement, though slightly lower than averaged ratings
Average-Rater Reliability: 0.97 – Combining multiple raters yields near-perfect consistency, demonstrating systematic measurement validity
Confidence Interval Width: 0.078 points – Narrow CI reflects stable estimates across the sample of 120 subjects and 600 observations

Interpretation

Model Selection & Statistics

Statistical Significance and Measurement Error

Model Selection & Statistics

ICC Model Justification and SEM

F statistic

ICC model selection guidance and statistical significance

f statistic

p value

6.39

sem

16.73

pooled sd

Icc model comparison

Metric	Value
ICC Model Used	ICC2k
ICC Value	0.854
95% CI	[0.812, 0.890]
F-Statistic	0.00
P-Value	1.000
SEM	6.39
Interpretation	Excellent

Key Insights

Model Selection & Statistics

Purpose

This section evaluates whether the ICC model assumptions are statistically valid and quantifies measurement error inherent in the clinical severity rating process. It determines which ICC variant (fixed vs. random raters) is most appropriate for generalizing reliability findings beyond the current five clinicians.

Key Findings

F-statistic (0.00) & p-value (1.000): The non-significant F-test indicates insufficient between-subject variance to formally validate ICC computation, though the overall ICC(2k) of 0.854 remains excellent and practically meaningful.
Standard Error of Measurement (6.39): Individual ratings deviate from true scores by approximately ±6.39 points (68% confidence) or ±12.78 points (95% confidence), representing ~38% of the pooled standard deviation (16.73).
Model Selection: ICC(2) was appropriately selected, treating the five raters as a random sample generalizable to similar clinician populations.

Interpretation

Despite the non-significant F-test, the excellent ICC(2k) of 0.854 (95% CI: 0.812–0.890) demonstrates strong inter-rater reliability across 120 subjects and 600 observations. The SEM of 6.39 indicates clinically acceptable measurement precision for severity assessment. The ICC(

Key Insights

Model Selection & Statistics

Purpose

Key Findings

F-statistic (0.00) & p-value (1.000): The non-significant F-test indicates insufficient between-subject variance to formally validate ICC computation, though the overall ICC(2k) of 0.854 remains excellent and practically meaningful.
Standard Error of Measurement (6.39): Individual ratings deviate from true scores by approximately ±6.39 points (68% confidence) or ±12.78 points (95% confidence), representing ~38% of the pooled standard deviation (16.73).
Model Selection: ICC(2) was appropriately selected, treating the five raters as a random sample generalizable to similar clinician populations.

Interpretation

Rater Performance Summary

Individual Rater Statistics and Bias

Rater Statistics

Individual Rater Performance & Bias

Number of Raters

Rater-level statistics showing mean scores and systematic bias

rater_id	mean_score	sd_score	n_subjects	bias	mean_confidence
Dr_Adams	51.384	15.544	120.000	0.042	3.725
Dr_Baker	52.892	17.155	120.000	1.550	3.683
Dr_Chen	49.087	15.999	120.000	-2.254	3.508
Dr_Davis	53.471	17.193	120.000	2.129	3.775
Dr_Evans	49.875	17.511	120.000	-1.467	3.800

n raters

51.34

grand mean

Key Insights

Rater Statistics

Purpose

This section evaluates individual rater performance and systematic bias to assess whether clinicians are applying severity rating scales consistently. Understanding rater-level variation is essential for validating the overall inter-rater reliability findings and identifying whether observed agreement reflects true clinical consensus or masks individual scoring patterns.

Key Findings

Bias Range: -2.25 to +2.13 points—all raters deviate minimally from the grand mean (51.34), indicating no systematic over- or under-rating exceeds the ±5-point threshold
Standard Deviation: 15.54 to 17.51 across raters—consistent variability in severity scoring, suggesting similar rating dispersion patterns
Mean Confidence: 3.51 to 3.80 (on presumed 5-point scale)—Dr. Evans shows highest confidence (3.80) despite largest negative bias (-1.47); Dr. Chen shows lowest confidence (3.51)
Balanced Contribution: All raters evaluated identical 120 subjects, ensuring equal representation

Interpretation

The minimal bias across all five clinicians (SD = 1.88) demonstrates that systematic scoring differences are negligible relative to the overall scale range. Despite individual confidence variations, raters maintain comparable mean scores and rating variability, supporting the excellent ICC (0.854) observed at the aggregate

Key Insights

Rater Statistics

Purpose

Key Findings

Bias Range: -2.25 to +2.13 points—all raters deviate minimally from the grand mean (51.34), indicating no systematic over- or under-rating exceeds the ±5-point threshold
Standard Deviation: 15.54 to 17.51 across raters—consistent variability in severity scoring, suggesting similar rating dispersion patterns
Mean Confidence: 3.51 to 3.80 (on presumed 5-point scale)—Dr. Evans shows highest confidence (3.80) despite largest negative bias (-1.47); Dr. Chen shows lowest confidence (3.51)
Balanced Contribution: All raters evaluated identical 120 subjects, ensuring equal representation

Interpretation

Rater-by-Rater Agreement

Pairwise Correlation Heatmap

Rater-by-Rater Agreement

Pairwise Correlation Heatmap

Pairwise correlations between all raters showing agreement patterns

Key Insights

Rater-by-Rater Agreement

Purpose

This section evaluates pairwise agreement between all five clinicians to identify whether specific rater combinations show systematic disagreement. Strong correlations across all pairs validate that the overall ICC(2k) = 0.854 reflects genuine consensus rather than masking problematic rater combinations. Identifying weak pairs is critical for understanding whether inter-rater reliability issues are localized or pervasive.

Key Findings

Mean Pairwise Correlation: 0.89 - All rater pairs demonstrate strong agreement, well above the 0.5 threshold for acceptable reliability
Range: 0.81–1.0 - Minimum correlation (Dr_Evans & Dr_Baker = 0.81) remains in the “excellent” range, indicating no problematic rater pairs
Consistency Pattern: No correlations fall below 0.80, suggesting uniform interpretation of severity criteria across all clinicians
Strongest Agreement: Dr_Adams & Dr_Chen (r = 0.93) and Dr_Adams & Dr_Davis (r = 0.90) show exceptional alignment

Interpretation

The absence of any weak pairwise correlations (<0.5) confirms that the excellent overall ICC reflects genuine consensus rather than compensating disagreements. All five clinicians interpret clinical severity ratings similarly, with minimal systematic bias between any pair. This uniform agreement across

Key Insights

Rater-by-Rater Agreement

Purpose

Key Findings

Mean Pairwise Correlation: 0.89 - All rater pairs demonstrate strong agreement, well above the 0.5 threshold for acceptable reliability
Range: 0.81–1.0 - Minimum correlation (Dr_Evans & Dr_Baker = 0.81) remains in the “excellent” range, indicating no problematic rater pairs
Consistency Pattern: No correlations fall below 0.80, suggesting uniform interpretation of severity criteria across all clinicians
Strongest Agreement: Dr_Adams & Dr_Chen (r = 0.93) and Dr_Adams & Dr_Davis (r = 0.90) show exceptional alignment

Interpretation

Bland-Altman Agreement Analysis

Difference vs Mean Plot with Limits of Agreement

Bland-Altman Agreement

Difference vs Mean Plot

-1.508

Mean Difference

Bland-Altman plot visualizing agreement between two key raters

-1.508

ba mean diff

9.204

ba sd diff

-19.547

ba loa lower

Key Insights

Bland-Altman Agreement

Purpose

This section quantifies agreement between two individual raters on clinical severity ratings using the Bland-Altman method. It complements the overall ICC analysis by examining systematic bias and variability patterns between specific rater pairs, revealing whether disagreements are random or reflect consistent scoring tendencies that could affect clinical decision-making.

Key Findings

Mean Difference (-1.51 points): Negligible systematic bias; neither rater consistently scores higher or lower than the other, indicating balanced rating behavior across the 120 subjects.
Limits of Agreement (±19.55 to +16.53): 95% of rating differences fall within this ±18-point range around a mean score of ~52, representing clinically meaningful variability for individual rater pairs.
Standard Deviation (9.20): Moderate scatter in differences suggests raters diverge by approximately 9 points on average, consistent with the excellent ICC (0.854) observed at the group level.

Interpretation

The near-zero mean difference indicates no systematic bias between these two raters—a critical finding for clinical validity. However, the wide limits of agreement (±19.5 points) reveal substantial individual-level disagreement despite strong overall reliability. This pattern reflects the distinction between group-level consistency (ICC) and pairwise agreement: while raters rank severity similarly across subjects

Key Insights

Bland-Altman Agreement

Purpose

Key Findings

Mean Difference (-1.51 points): Negligible systematic bias; neither rater consistently scores higher or lower than the other, indicating balanced rating behavior across the 120 subjects.
Limits of Agreement (±19.55 to +16.53): 95% of rating differences fall within this ±18-point range around a mean score of ~52, representing clinically meaningful variability for individual rater pairs.
Standard Deviation (9.20): Moderate scatter in differences suggests raters diverge by approximately 9 points on average, consistent with the excellent ICC (0.854) observed at the group level.

Interpretation

Recommendations

Actionable Next Steps for Improving Reliability

Recommendations

Actionable Next Steps

0.854

Primary icc

Actionable recommendations based on ICC analysis findings

0.854

primary icc

Excellent

icc interpretation

120

n subjects

n raters

Key Insights

Recommendations

Purpose

This section synthesizes the inter-rater reliability analysis to guide clinical practice improvements. It translates the ICC findings into actionable insights for the Clinical Assessment Research Lab, helping stakeholders understand whether current rater training and assessment protocols are sufficiently reliable for clinical decision-making across diagnostic categories.

Key Findings

Primary ICC (0.854): Excellent reliability indicates strong agreement among the 5 clinicians rating 120 subjects across 600 observations
Confidence Interval (0.812–0.890): Narrow range demonstrates stable, reproducible reliability estimates
Stratified Performance: Depression shows strongest ICC (0.90), while Anxiety and PTSD show slightly lower values (0.81–0.82), suggesting category-specific variation
Rater Bias: Individual clinicians show minimal systematic bias (range: −2.25 to +2.13 points), with Dr. Chen and Dr. Evans slightly underrating severity

Interpretation

The excellent overall ICC (0.854) confirms that clinical severity ratings are highly consistent across raters, validating current assessment protocols. However, stratified analysis reveals diagnostic categories warrant differential attention—Depression ratings are more reliable than Anxiety or PTSD assessments. Individual rater correlations (mean 0.89) further support protocol effectiveness, though modest bias patterns suggest some clinicians systematically

Key Insights

Recommendations

Purpose

Key Findings

Primary ICC (0.854): Excellent reliability indicates strong agreement among the 5 clinicians rating 120 subjects across 600 observations
Confidence Interval (0.812–0.890): Narrow range demonstrates stable, reproducible reliability estimates
Stratified Performance: Depression shows strongest ICC (0.90), while Anxiety and PTSD show slightly lower values (0.81–0.82), suggesting category-specific variation
Rater Bias: Individual clinicians show minimal systematic bias (range: −2.25 to +2.13 points), with Dr. Chen and Dr. Evans slightly underrating severity