Context and Data Preparation

ICC Reliability Study Overview

OV

Analysis Overview

ICC Reliability Study Configuration

Analysis overview and configuration

Icc
Clinical Assessment Research Lab
Assess inter-rater_id reliability of clinical severity ratings across multiple clinicians and diagnostic categories
Module Configuration
icc_model ICC2k
confidence_level 0.95
min_subjects 30
min_raters 3
Processing ID
test_1772935870
IN

Key Insights

Analysis Overview

Purpose

This analysis evaluates inter-rater reliability across five clinicians assessing clinical severity ratings for 120 subjects across four diagnostic categories. The objective is to determine whether clinicians provide consistent and reproducible severity assessments, which is critical for ensuring diagnostic validity and treatment planning consistency in clinical practice.

Key Findings

  • Primary ICC (ICC2k): 0.854 (95% CI: 0.812–0.890) - Excellent agreement indicating strong reliability among raters
  • Average Rater ICC: 0.97 - When ratings are averaged across all five raters, reliability reaches near-perfect levels
  • Single Rater ICC: 0.85–0.86 - Individual clinician assessments show good but lower reliability than averaged scores
  • Bland-Altman Mean Difference: -1.508 with limits of agreement (-19.547 to 16.532) - Minimal systematic bias but moderate individual variation
  • Stratified Reliability: Depression (0.90) shows highest agreement; Anxiety and PTSD (0.81–0.82) show lowest, suggesting diagnostic category influences consistency
  • Rater Correlations: Range 0.81–1.00 (mean 0.89) - Most pairwise comparisons demonstrate strong agreement

Interpretation

The

IN

Key Insights

Analysis Overview

Purpose

This analysis evaluates inter-rater reliability across five clinicians assessing clinical severity ratings for 120 subjects across four diagnostic categories. The objective is to determine whether clinicians provide consistent and reproducible severity assessments, which is critical for ensuring diagnostic validity and treatment planning consistency in clinical practice.

Key Findings

  • Primary ICC (ICC2k): 0.854 (95% CI: 0.812–0.890) - Excellent agreement indicating strong reliability among raters
  • Average Rater ICC: 0.97 - When ratings are averaged across all five raters, reliability reaches near-perfect levels
  • Single Rater ICC: 0.85–0.86 - Individual clinician assessments show good but lower reliability than averaged scores
  • Bland-Altman Mean Difference: -1.508 with limits of agreement (-19.547 to 16.532) - Minimal systematic bias but moderate individual variation
  • Stratified Reliability: Depression (0.90) shows highest agreement; Anxiety and PTSD (0.81–0.82) show lowest, suggesting diagnostic category influences consistency
  • Rater Correlations: Range 0.81–1.00 (mean 0.89) - Most pairwise comparisons demonstrate strong agreement

Interpretation

The

PP

Data Preprocessing

Data Quality & Completeness

600
Final Observations

Data preprocessing and column mapping

Data Pipeline
600
Initial Records
600
Clean Records
Column Mapping
subject
subject_id
rater
rater_id
rating
severity_score
diagnostic_category
category
secondary_rating
functional_impairment
confidence
rater_confidence
600 Records
MCP Analytics
IN

Key Insights

Data Preprocessing

Purpose

This section documents the data preprocessing pipeline for the inter-rater reliability study. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of ICC calculations and ensuring all 120 subjects with 5 raters each are represented in the final analysis.

Key Findings

  • Retention Rate: 100% (600/600 rows) - All observations passed quality checks with zero exclusions
  • Rows Removed: 0 - No data loss occurred during preprocessing
  • Train/Test Split: Not applicable - This is a reliability assessment, not a predictive model requiring data partitioning
  • Data Integrity: Complete dataset preserved ensures ICC estimates reflect the full sample without selection bias

Interpretation

The perfect retention rate demonstrates exceptionally clean input data with no missing values, duplicates, or invalid entries requiring removal. This is particularly important for ICC(2,k) calculations, which depend on balanced designs across all raters and subjects. The absence of any filtering or exclusion criteria means the reported ICC of 0.854 and stratified reliability estimates are based on the complete intended sample, strengthening confidence in the inter-rater reliability conclusions across all four diagnostic categories.

Context

No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of

IN

Key Insights

Data Preprocessing

Purpose

This section documents the data preprocessing pipeline for the inter-rater reliability study. Perfect data retention (100%) indicates no observations were excluded during cleaning, which is critical for maintaining the integrity of ICC calculations and ensuring all 120 subjects with 5 raters each are represented in the final analysis.

Key Findings

  • Retention Rate: 100% (600/600 rows) - All observations passed quality checks with zero exclusions
  • Rows Removed: 0 - No data loss occurred during preprocessing
  • Train/Test Split: Not applicable - This is a reliability assessment, not a predictive model requiring data partitioning
  • Data Integrity: Complete dataset preserved ensures ICC estimates reflect the full sample without selection bias

Interpretation

The perfect retention rate demonstrates exceptionally clean input data with no missing values, duplicates, or invalid entries requiring removal. This is particularly important for ICC(2,k) calculations, which depend on balanced designs across all raters and subjects. The absence of any filtering or exclusion criteria means the reported ICC of 0.854 and stratified reliability estimates are based on the complete intended sample, strengthening confidence in the inter-rater reliability conclusions across all four diagnostic categories.

Context

No train/test split was applied because this analysis assesses measurement agreement rather than predictive performance. The 100% retention aligns with the study design of

Executive Summary

Key Findings and Recommendations

TLDR

Executive Summary

Key Findings & Recommendations

0.854
Overall ICC

Key Performance Indicators

Primary icc
85.4%
Icc interpretation
Excellent
N subjects
120
N raters
5

Key Findings

Key findings

Finding Value
Overall ICC 0.854
Reliability Level Excellent
95% Confidence Interval [0.812, 0.890]
ICC Model Used ICC2k
Subjects Analyzed 120
Number of Raters 5
Total Observations 600
F-Test Result F=0.00, p=1.000
Standard Error (SEM) 6.39 points

Executive Summary

Bottom Line: This inter-rater_id reliability study assessed 120 subjects rated by 5 clinicians. The Intraclass Correlation Coefficient (ICC) of 0.854 indicates excellent reliability (95% CI: [0.812, 0.890]).

Key Findings:
• ICC Model: ICC2k was used to assess inter-rater_id reliability with random raters (generalizable)
• Statistical Significance: F-test (F=0.00, p=1.000) does not confirm significant between-subject_id variance
• Measurement Error: SEM = 6.39 points (expected error in individual ratings)
• Bland-Altman: Mean difference = -1.51 points, 95% LOA [-19.55, 16.53]
• Stratified Analysis: Reliability assessed across 4 diagnostic categories

Recommendation: Current rater_id training and protocols are effective. Maintain current practices and monitor reliability over time.

IN

Key Insights

Executive Summary

EXECUTIVE SUMMARY: INTER-RATER RELIABILITY ASSESSMENT

Purpose

This analysis evaluates whether five clinicians can reliably and consistently rate clinical severity across 120 patients with four diagnostic categories. The objective is to validate that the assessment protocol produces trustworthy, reproducible ratings regardless of which clinician performs the evaluation—a critical requirement for clinical decision-making and research validity.

Key Findings

  • Primary ICC (0.854): Excellent inter-rater reliability with narrow 95% confidence interval [0.812–0.890], indicating strong agreement among clinicians
  • Measurement Error (SEM = 6.39): Expected rating variation of approximately ±6 points on the severity scale, clinically acceptable given the grand mean of 51.3
  • Bland-Altman Agreement: Mean difference of −1.51 points with limits of agreement [−19.55, 16.53], showing minimal systematic bias but moderate individual variation
  • Diagnostic Consistency: Stratified analysis across four categories (Anxiety, PTSD, OCD, Depression) demonstrates reliability ranges 0.81–0.90, with Depression showing strongest agreement (0.90)
  • Rater Parity: Individual clinician biases range −2.25 to +2.13 points; no rater systematically infl
IN

Key Insights

Executive Summary

EXECUTIVE SUMMARY: INTER-RATER RELIABILITY ASSESSMENT

Purpose

This analysis evaluates whether five clinicians can reliably and consistently rate clinical severity across 120 patients with four diagnostic categories. The objective is to validate that the assessment protocol produces trustworthy, reproducible ratings regardless of which clinician performs the evaluation—a critical requirement for clinical decision-making and research validity.

Key Findings

  • Primary ICC (0.854): Excellent inter-rater reliability with narrow 95% confidence interval [0.812–0.890], indicating strong agreement among clinicians
  • Measurement Error (SEM = 6.39): Expected rating variation of approximately ±6 points on the severity scale, clinically acceptable given the grand mean of 51.3
  • Bland-Altman Agreement: Mean difference of −1.51 points with limits of agreement [−19.55, 16.53], showing minimal systematic bias but moderate individual variation
  • Diagnostic Consistency: Stratified analysis across four categories (Anxiety, PTSD, OCD, Depression) demonstrates reliability ranges 0.81–0.90, with Depression showing strongest agreement (0.90)
  • Rater Parity: Individual clinician biases range −2.25 to +2.13 points; no rater systematically infl

ICC Results - All Forms

Model Comparison with Confidence Intervals

ICC

ICC Values - All 10 Forms

Model Comparison with 95% Confidence Intervals

0.854
Primary ICC

All 10 ICC forms with rater_confidence intervals for model selection

0.854
primary icc
ICC2k
icc model
Excellent
icc interpretation
IN

Key Insights

ICC Values - All 10 Forms

Purpose

This section quantifies inter-rater reliability across all ICC model variants to establish whether clinician severity ratings are sufficiently consistent for clinical decision-making. The primary ICC(2k) value of 0.854 directly addresses the core objective: assessing whether the five clinicians provide reliable, generalizable assessments across the 120 subjects and four diagnostic categories.

Key Findings

  • Primary ICC (ICC2k): 0.854 [0.812–0.890] – Exceeds the 0.75 threshold for excellent reliability, confirming ratings are suitable for clinical use
  • Single-Rater Reliability: 0.85–0.86 – Individual clinician assessments show strong agreement, though slightly lower than averaged ratings
  • Average-Rater Reliability: 0.97 – Combining multiple raters yields near-perfect consistency, demonstrating systematic measurement validity
  • Confidence Interval Width: 0.078 points – Narrow CI reflects stable estimates across the sample of 120 subjects and 600 observations

Interpretation

The ICC(2k) model was appropriately selected because it assumes raters are random representatives of a larger clinician population, making findings generalizable beyond these five clinicians. The 0.11-point gap between single-rater (0.85) and average-rater

IN

Key Insights

ICC Values - All 10 Forms

Purpose

This section quantifies inter-rater reliability across all ICC model variants to establish whether clinician severity ratings are sufficiently consistent for clinical decision-making. The primary ICC(2k) value of 0.854 directly addresses the core objective: assessing whether the five clinicians provide reliable, generalizable assessments across the 120 subjects and four diagnostic categories.

Key Findings

  • Primary ICC (ICC2k): 0.854 [0.812–0.890] – Exceeds the 0.75 threshold for excellent reliability, confirming ratings are suitable for clinical use
  • Single-Rater Reliability: 0.85–0.86 – Individual clinician assessments show strong agreement, though slightly lower than averaged ratings
  • Average-Rater Reliability: 0.97 – Combining multiple raters yields near-perfect consistency, demonstrating systematic measurement validity
  • Confidence Interval Width: 0.078 points – Narrow CI reflects stable estimates across the sample of 120 subjects and 600 observations

Interpretation

The ICC(2k) model was appropriately selected because it assumes raters are random representatives of a larger clinician population, making findings generalizable beyond these five clinicians. The 0.11-point gap between single-rater (0.85) and average-rater

Model Selection & Statistics

Statistical Significance and Measurement Error

MS

Model Selection & Statistics

ICC Model Justification and SEM

0
F statistic

ICC model selection guidance and statistical significance

0
f statistic
1
p value
6.39
sem
16.73
pooled sd

Icc model comparison

Metric Value
ICC Model Used ICC2k
ICC Value 0.854
95% CI [0.812, 0.890]
F-Statistic 0.00
P-Value 1.000
SEM 6.39
Interpretation Excellent
IN

Key Insights

Model Selection & Statistics

Purpose

This section evaluates whether the ICC model assumptions are statistically valid and quantifies measurement error inherent in the clinical severity rating process. It determines which ICC variant (fixed vs. random raters) is most appropriate for generalizing reliability findings beyond the current five clinicians.

Key Findings

  • F-statistic (0.00) & p-value (1.000): The non-significant F-test indicates insufficient between-subject variance to formally validate ICC computation, though the overall ICC(2k) of 0.854 remains excellent and practically meaningful.
  • Standard Error of Measurement (6.39): Individual ratings deviate from true scores by approximately ±6.39 points (68% confidence) or ±12.78 points (95% confidence), representing ~38% of the pooled standard deviation (16.73).
  • Model Selection: ICC(2) was appropriately selected, treating the five raters as a random sample generalizable to similar clinician populations.

Interpretation

Despite the non-significant F-test, the excellent ICC(2k) of 0.854 (95% CI: 0.812–0.890) demonstrates strong inter-rater reliability across 120 subjects and 600 observations. The SEM of 6.39 indicates clinically acceptable measurement precision for severity assessment. The ICC(

IN

Key Insights

Model Selection & Statistics

Purpose

This section evaluates whether the ICC model assumptions are statistically valid and quantifies measurement error inherent in the clinical severity rating process. It determines which ICC variant (fixed vs. random raters) is most appropriate for generalizing reliability findings beyond the current five clinicians.

Key Findings

  • F-statistic (0.00) & p-value (1.000): The non-significant F-test indicates insufficient between-subject variance to formally validate ICC computation, though the overall ICC(2k) of 0.854 remains excellent and practically meaningful.
  • Standard Error of Measurement (6.39): Individual ratings deviate from true scores by approximately ±6.39 points (68% confidence) or ±12.78 points (95% confidence), representing ~38% of the pooled standard deviation (16.73).
  • Model Selection: ICC(2) was appropriately selected, treating the five raters as a random sample generalizable to similar clinician populations.

Interpretation

Despite the non-significant F-test, the excellent ICC(2k) of 0.854 (95% CI: 0.812–0.890) demonstrates strong inter-rater reliability across 120 subjects and 600 observations. The SEM of 6.39 indicates clinically acceptable measurement precision for severity assessment. The ICC(

Rater Performance Summary

Individual Rater Statistics and Bias

RS

Rater Statistics

Individual Rater Performance & Bias

5
Number of Raters

Rater-level statistics showing mean scores and systematic bias

rater_id mean_score sd_score n_subjects bias mean_confidence
Dr_Adams 51.384 15.544 120.000 0.042 3.725
Dr_Baker 52.892 17.155 120.000 1.550 3.683
Dr_Chen 49.087 15.999 120.000 -2.254 3.508
Dr_Davis 53.471 17.193 120.000 2.129 3.775
Dr_Evans 49.875 17.511 120.000 -1.467 3.800
5
n raters
51.34
grand mean
IN

Key Insights

Rater Statistics

Purpose

This section evaluates individual rater performance and systematic bias to assess whether clinicians are applying severity rating scales consistently. Understanding rater-level variation is essential for validating the overall inter-rater reliability findings and identifying whether observed agreement reflects true clinical consensus or masks individual scoring patterns.

Key Findings

  • Bias Range: -2.25 to +2.13 points—all raters deviate minimally from the grand mean (51.34), indicating no systematic over- or under-rating exceeds the ±5-point threshold
  • Standard Deviation: 15.54 to 17.51 across raters—consistent variability in severity scoring, suggesting similar rating dispersion patterns
  • Mean Confidence: 3.51 to 3.80 (on presumed 5-point scale)—Dr. Evans shows highest confidence (3.80) despite largest negative bias (-1.47); Dr. Chen shows lowest confidence (3.51)
  • Balanced Contribution: All raters evaluated identical 120 subjects, ensuring equal representation

Interpretation

The minimal bias across all five clinicians (SD = 1.88) demonstrates that systematic scoring differences are negligible relative to the overall scale range. Despite individual confidence variations, raters maintain comparable mean scores and rating variability, supporting the excellent ICC (0.854) observed at the aggregate

IN

Key Insights

Rater Statistics

Purpose

This section evaluates individual rater performance and systematic bias to assess whether clinicians are applying severity rating scales consistently. Understanding rater-level variation is essential for validating the overall inter-rater reliability findings and identifying whether observed agreement reflects true clinical consensus or masks individual scoring patterns.

Key Findings

  • Bias Range: -2.25 to +2.13 points—all raters deviate minimally from the grand mean (51.34), indicating no systematic over- or under-rating exceeds the ±5-point threshold
  • Standard Deviation: 15.54 to 17.51 across raters—consistent variability in severity scoring, suggesting similar rating dispersion patterns
  • Mean Confidence: 3.51 to 3.80 (on presumed 5-point scale)—Dr. Evans shows highest confidence (3.80) despite largest negative bias (-1.47); Dr. Chen shows lowest confidence (3.51)
  • Balanced Contribution: All raters evaluated identical 120 subjects, ensuring equal representation

Interpretation

The minimal bias across all five clinicians (SD = 1.88) demonstrates that systematic scoring differences are negligible relative to the overall scale range. Despite individual confidence variations, raters maintain comparable mean scores and rating variability, supporting the excellent ICC (0.854) observed at the aggregate

Rater-by-Rater Agreement

Pairwise Correlation Heatmap

RA

Rater-by-Rater Agreement

Pairwise Correlation Heatmap

Pairwise correlations between all raters showing agreement patterns

IN

Key Insights

Rater-by-Rater Agreement

Purpose

This section evaluates pairwise agreement between all five clinicians to identify whether specific rater combinations show systematic disagreement. Strong correlations across all pairs validate that the overall ICC(2k) = 0.854 reflects genuine consensus rather than masking problematic rater combinations. Identifying weak pairs is critical for understanding whether inter-rater reliability issues are localized or pervasive.

Key Findings

  • Mean Pairwise Correlation: 0.89 - All rater pairs demonstrate strong agreement, well above the 0.5 threshold for acceptable reliability
  • Range: 0.81–1.0 - Minimum correlation (Dr_Evans & Dr_Baker = 0.81) remains in the “excellent” range, indicating no problematic rater pairs
  • Consistency Pattern: No correlations fall below 0.80, suggesting uniform interpretation of severity criteria across all clinicians
  • Strongest Agreement: Dr_Adams & Dr_Chen (r = 0.93) and Dr_Adams & Dr_Davis (r = 0.90) show exceptional alignment

Interpretation

The absence of any weak pairwise correlations (<0.5) confirms that the excellent overall ICC reflects genuine consensus rather than compensating disagreements. All five clinicians interpret clinical severity ratings similarly, with minimal systematic bias between any pair. This uniform agreement across

IN

Key Insights

Rater-by-Rater Agreement

Purpose

This section evaluates pairwise agreement between all five clinicians to identify whether specific rater combinations show systematic disagreement. Strong correlations across all pairs validate that the overall ICC(2k) = 0.854 reflects genuine consensus rather than masking problematic rater combinations. Identifying weak pairs is critical for understanding whether inter-rater reliability issues are localized or pervasive.

Key Findings

  • Mean Pairwise Correlation: 0.89 - All rater pairs demonstrate strong agreement, well above the 0.5 threshold for acceptable reliability
  • Range: 0.81–1.0 - Minimum correlation (Dr_Evans & Dr_Baker = 0.81) remains in the “excellent” range, indicating no problematic rater pairs
  • Consistency Pattern: No correlations fall below 0.80, suggesting uniform interpretation of severity criteria across all clinicians
  • Strongest Agreement: Dr_Adams & Dr_Chen (r = 0.93) and Dr_Adams & Dr_Davis (r = 0.90) show exceptional alignment

Interpretation

The absence of any weak pairwise correlations (<0.5) confirms that the excellent overall ICC reflects genuine consensus rather than compensating disagreements. All five clinicians interpret clinical severity ratings similarly, with minimal systematic bias between any pair. This uniform agreement across

Bland-Altman Agreement Analysis

Difference vs Mean Plot with Limits of Agreement

BA

Bland-Altman Agreement

Difference vs Mean Plot

-1.508
Mean Difference

Bland-Altman plot visualizing agreement between two key raters

-1.508
ba mean diff
9.204
ba sd diff
-19.547
ba loa lower
IN

Key Insights

Bland-Altman Agreement

Purpose

This section quantifies agreement between two individual raters on clinical severity ratings using the Bland-Altman method. It complements the overall ICC analysis by examining systematic bias and variability patterns between specific rater pairs, revealing whether disagreements are random or reflect consistent scoring tendencies that could affect clinical decision-making.

Key Findings

  • Mean Difference (-1.51 points): Negligible systematic bias; neither rater consistently scores higher or lower than the other, indicating balanced rating behavior across the 120 subjects.
  • Limits of Agreement (±19.55 to +16.53): 95% of rating differences fall within this ±18-point range around a mean score of ~52, representing clinically meaningful variability for individual rater pairs.
  • Standard Deviation (9.20): Moderate scatter in differences suggests raters diverge by approximately 9 points on average, consistent with the excellent ICC (0.854) observed at the group level.

Interpretation

The near-zero mean difference indicates no systematic bias between these two raters—a critical finding for clinical validity. However, the wide limits of agreement (±19.5 points) reveal substantial individual-level disagreement despite strong overall reliability. This pattern reflects the distinction between group-level consistency (ICC) and pairwise agreement: while raters rank severity similarly across subjects

IN

Key Insights

Bland-Altman Agreement

Purpose

This section quantifies agreement between two individual raters on clinical severity ratings using the Bland-Altman method. It complements the overall ICC analysis by examining systematic bias and variability patterns between specific rater pairs, revealing whether disagreements are random or reflect consistent scoring tendencies that could affect clinical decision-making.

Key Findings

  • Mean Difference (-1.51 points): Negligible systematic bias; neither rater consistently scores higher or lower than the other, indicating balanced rating behavior across the 120 subjects.
  • Limits of Agreement (±19.55 to +16.53): 95% of rating differences fall within this ±18-point range around a mean score of ~52, representing clinically meaningful variability for individual rater pairs.
  • Standard Deviation (9.20): Moderate scatter in differences suggests raters diverge by approximately 9 points on average, consistent with the excellent ICC (0.854) observed at the group level.

Interpretation

The near-zero mean difference indicates no systematic bias between these two raters—a critical finding for clinical validity. However, the wide limits of agreement (±19.5 points) reveal substantial individual-level disagreement despite strong overall reliability. This pattern reflects the distinction between group-level consistency (ICC) and pairwise agreement: while raters rank severity similarly across subjects

Recommendations

Actionable Next Steps for Improving Reliability

RC

Recommendations

Actionable Next Steps

0.854
Primary icc

Actionable recommendations based on ICC analysis findings

0.854
primary icc
Excellent
icc interpretation
120
n subjects
5
n raters
IN

Key Insights

Recommendations

Purpose

This section synthesizes the inter-rater reliability analysis to guide clinical practice improvements. It translates the ICC findings into actionable insights for the Clinical Assessment Research Lab, helping stakeholders understand whether current rater training and assessment protocols are sufficiently reliable for clinical decision-making across diagnostic categories.

Key Findings

  • Primary ICC (0.854): Excellent reliability indicates strong agreement among the 5 clinicians rating 120 subjects across 600 observations
  • Confidence Interval (0.812–0.890): Narrow range demonstrates stable, reproducible reliability estimates
  • Stratified Performance: Depression shows strongest ICC (0.90), while Anxiety and PTSD show slightly lower values (0.81–0.82), suggesting category-specific variation
  • Rater Bias: Individual clinicians show minimal systematic bias (range: −2.25 to +2.13 points), with Dr. Chen and Dr. Evans slightly underrating severity

Interpretation

The excellent overall ICC (0.854) confirms that clinical severity ratings are highly consistent across raters, validating current assessment protocols. However, stratified analysis reveals diagnostic categories warrant differential attention—Depression ratings are more reliable than Anxiety or PTSD assessments. Individual rater correlations (mean 0.89) further support protocol effectiveness, though modest bias patterns suggest some clinicians systematically

IN

Key Insights

Recommendations

Purpose

This section synthesizes the inter-rater reliability analysis to guide clinical practice improvements. It translates the ICC findings into actionable insights for the Clinical Assessment Research Lab, helping stakeholders understand whether current rater training and assessment protocols are sufficiently reliable for clinical decision-making across diagnostic categories.

Key Findings

  • Primary ICC (0.854): Excellent reliability indicates strong agreement among the 5 clinicians rating 120 subjects across 600 observations
  • Confidence Interval (0.812–0.890): Narrow range demonstrates stable, reproducible reliability estimates
  • Stratified Performance: Depression shows strongest ICC (0.90), while Anxiety and PTSD show slightly lower values (0.81–0.82), suggesting category-specific variation
  • Rater Bias: Individual clinicians show minimal systematic bias (range: −2.25 to +2.13 points), with Dr. Chen and Dr. Evans slightly underrating severity

Interpretation

The excellent overall ICC (0.854) confirms that clinical severity ratings are highly consistent across raters, validating current assessment protocols. However, stratified analysis reveals diagnostic categories warrant differential attention—Depression ratings are more reliable than Anxiety or PTSD assessments. Individual rater correlations (mean 0.89) further support protocol effectiveness, though modest bias patterns suggest some clinicians systematically