Three psychiatrists independently rate 50 patients on the Hamilton Depression Rating Scale. Before you can publish those ratings as outcome data, your IRB and your journal reviewers will ask the same question: do the raters agree? The Intraclass Correlation Coefficient (ICC) answers that question with a single number and a confidence interval. An ICC of 0.85 means excellent agreement. An ICC of 0.45 means that more than half the variation in scores comes from rater disagreement and noise, not patient differences. The analysis takes hours to set up in SPSS and requires choosing among 10 ICC forms. Upload a CSV and get all 10 forms with model selection guidance in under 60 seconds.
Why Inter-Rater Reliability Matters in Healthcare
Any clinical study that uses subjective assessments depends on rater agreement. Psychiatric symptom severity scales, pain ratings, radiological imaging reads, pathology grading, functional status assessments, wound staging, and neonatal behavioral scoring all involve human judgment. If different clinicians produce different scores for the same patient, the measurement itself is unreliable, and any downstream analysis built on those measurements is compromised.
The stakes are concrete. A systematic review and meta-analysis of inter-rater reliability in psychiatric diagnosis found that agreement varies substantially across diagnostic categories, with some conditions achieving good reliability while others remain problematic (PMC, 2025). In radiology, studies have shown that when two radiologists interpreted the same X-rays without a standardized rating scale, their diagnoses varied by 15%, a figure that clear guidelines and training reduced to just 3% (Encord). These are not academic abstractions. A depression rating that varies by 15 points depending on which clinician administers it could mean the difference between "clinically improved" and "no change" in a treatment efficacy study.
Journals and IRBs expect ICC values with confidence intervals before they will accept clinical rating data. The Shrout and Fleiss (1979) taxonomy defines the standard framework, and Koo and Li (2016) provide the widely adopted interpretation benchmarks: below 0.50 is poor, 0.50-0.75 is moderate, 0.75-0.90 is good, and above 0.90 is excellent. Meeting this reporting standard typically requires either an SPSS Reliability Analysis module ($2,600+/year), SAS PROC MIXED, or proficiency with R's psych::ICC function. Many clinical researchers report the wrong ICC form because the choice depends on study design, and the software does not guide the decision.
When to Run an ICC Analysis
ICC analysis is required or strongly recommended in several common clinical research scenarios:
Before launching a multi-rater clinical study. During rater training calibration exercises, you need ICC above 0.75 before data collection begins. If raters cannot achieve acceptable agreement on training cases, the study protocol needs revision. Running ICC on the calibration data identifies which raters are outliers and whether additional training can close the gap. This is a prerequisite for many IRBs.
Validating a new clinical assessment instrument. If your department developed a new pain assessment scale, functional status measure, or behavioral checklist, you must demonstrate inter-rater reliability before it can be used in research. The validation study typically involves 3-5 raters independently scoring 30-50 patients, with ICC as the primary outcome. Below 0.70 on single-measure ICC means the instrument needs revision.
Multi-site clinical trials with centralized rating. When multiple sites contribute patient data rated by different clinicians, demonstrating consistent scoring across sites is essential. Stratified ICC by site shows whether some locations have systematically higher or lower scores, which would bias the treatment comparison. CROs routinely include ICC analysis in their rater qualification protocols.
Quality improvement for clinical assessments. Nursing quality leads and clinical directors use ICC to audit whether staff apply standardized scales consistently. If your ICU nurses score the Richmond Agitation-Sedation Scale differently, patient sedation management is inconsistent. ICC quantifies the problem. Rater bias analysis identifies which nurses need recalibration.
Dual-read imaging studies. Radiology and pathology departments routinely have two readers independently interpret the same imaging studies or tissue samples. ICC and Bland-Altman analysis quantify agreement and identify systematic bias between readers. This is standard practice for mammography, CT interpretation, and histopathological grading.
What Data You Need
A CSV in long format with one row per subject-rater combination. At minimum, three columns:
- Subject ID — patient identifier, specimen number, or case ID
- Rater ID — clinician name, reviewer code, or instrument label
- Rating score — the numeric assessment (severity score, pain rating, diagnostic grade)
Optional but valuable: a category column for stratified ICC analysis (diagnostic category, study site, patient subgroup). Stratified ICC often reveals that overall agreement masks category-specific problems. Your overall ICC might be 0.82, but depression ratings could be at 0.86 while anxiety ratings sit at 0.58.
For stable ICC estimates, aim for at least 30 subjects rated by 3 or more raters (minimum 90 observations). With 100 subjects, the 95% confidence interval around ICC is typically narrow enough to make definitive decisions. Fewer than 20 subjects produces confidence intervals so wide that you cannot distinguish good reliability from poor reliability.
If your data is in wide format (one column per rater), reshape it to long format before uploading. Each row should be one rater's score for one patient.
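If you need to do that reshape programmatically, a minimal sketch with pandas (the column names here are illustrative, not required by the upload):

```python
import pandas as pd

# Hypothetical wide-format export: one column per rater.
wide = pd.DataFrame({
    "subject_id": ["P01", "P02", "P03"],
    "rater_A": [14, 22, 9],
    "rater_B": [16, 21, 11],
    "rater_C": [15, 24, 10],
})

# Melt to long format: one row per subject-rater combination.
long = wide.melt(id_vars="subject_id", var_name="rater_id", value_name="rating")
print(long)
# long.to_csv("ratings_long.csv", index=False)  # ready to upload
```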
Which ICC Form to Report
This is the most common source of errors in clinical ICC reporting. There are 10 ICC forms, and the correct choice depends on two study design decisions. The report computes all 10 and recommends the appropriate form, but understanding the logic helps you verify the recommendation.
Are your raters a random sample from a larger population of potential raters? If your five clinicians represent "any trained clinician who might use this scale" (which is true in most instrument validation and multi-site studies), use ICC(2) — the two-way random effects model. This allows you to generalize your reliability finding to raters not in your study. If these are the only raters who will ever use the instrument (rare in clinical research), use ICC(3) — the two-way mixed effects model.
Do you care about absolute agreement or rank-order consistency? Agreement ICC penalizes systematic differences between raters. If Clinician A consistently scores 10 points higher than Clinician B, agreement ICC will be lower. Use agreement when the actual score matters for clinical decisions (PHQ-9 cutoff scores, sedation scale thresholds). Use consistency when only the relative ordering matters (ranking patients for severity-based triage).
Single measure or average? Single-measure ICC reflects the reliability of one rater's score. Average-measure ICC reflects the reliability of the mean across all raters. If each patient in your clinical trial will be assessed by a single clinician at each visit, report single-measure ICC. If outcomes are based on consensus ratings (average of a panel), report average-measure ICC.
For most clinical research: ICC(2,1) — two-way random effects, single measure, absolute agreement — is the correct choice. The report recommends a form based on your data structure, but always verify against your study design.
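To make the agreement-versus-consistency distinction concrete, here is a minimal sketch (not the tool's implementation) of how the two single-measure forms fall out of the two-way ANOVA mean squares, checked against the worked example in Shrout and Fleiss (1979):

```python
import numpy as np

def icc_2way(X):
    """Shrout-Fleiss single-measure ICCs from an n-subjects x k-raters
    matrix (assumes complete data: every rater scores every subject)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    row_means = X.mean(axis=1)
    col_means = X.mean(axis=0)
    # Two-way ANOVA mean squares
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = np.sum((X - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                        # residual
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # agreement
    icc31 = (msr - mse) / (msr + (k - 1) * mse)                        # consistency
    return {"ICC(2,1)": icc21, "ICC(3,1)": icc31}

# The 6-subject, 4-rater table from Shrout and Fleiss (1979)
sf = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
      [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
print(icc_2way(sf))  # ICC(2,1) ≈ 0.29, ICC(3,1) ≈ 0.71, the published values
```

The only difference between the two forms is whether rater variance (msc) enters the denominator, which is exactly the agreement-versus-consistency decision described above.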
How to Read the Report
ICC summary with confidence intervals. The headline number and its 95% CI. Using Koo and Li (2016) benchmarks: below 0.50 is poor, 0.50-0.75 is moderate, 0.75-0.90 is good, above 0.90 is excellent. The confidence interval matters as much as the point estimate. An ICC of 0.72 with a CI of 0.65-0.78 is clearly in the "moderate" range. An ICC of 0.72 with a CI of 0.45-0.88 could be anything from poor to good — you need more data.
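The benchmark bands are easy to encode. The sketch below (an illustration, not the report's logic) also flags a confidence interval that straddles more than one band, which is the "you need more data" situation:

```python
def koo_li_band(icc):
    """Koo and Li (2016) interpretation benchmarks."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

def interpret(point, ci_low, ci_high):
    """Band of the point estimate, flagged when the 95% CI
    spans more than one band."""
    label = koo_li_band(point)
    if koo_li_band(ci_low) != koo_li_band(ci_high):
        label += f" (CI spans {koo_li_band(ci_low)} to {koo_li_band(ci_high)})"
    return label

print(interpret(0.72, 0.65, 0.78))  # moderate (CI spans moderate to good)
print(interpret(0.72, 0.45, 0.88))  # moderate (CI spans poor to good)
```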
All 10 ICC forms. A horizontal bar chart with confidence interval error bars for each form. This lets you compare agreement versus consistency models and single versus average measures side by side. If agreement ICC and consistency ICC are nearly identical, there is no systematic bias between raters. A large gap means one or more raters score consistently higher or lower than the others.
Stratified ICC by category. If you provided a grouping variable, this chart shows ICC within each subgroup. Look for categories where reliability drops below your threshold. In psychiatric research, structured interview instruments often achieve good ICC for depression and schizophrenia but lower ICC for personality disorders and somatoform conditions. The stratified view tells you exactly where to focus rater training.
Rater correlation heatmap. Pairwise correlations between all raters. A uniformly warm heatmap (all correlations above 0.60) means general agreement. A cold row or column identifies a rater who disagrees with the group. Suspiciously perfect correlations (above 0.95) between specific pairs may indicate raters shared information, violating independence.
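The heatmap's underlying numbers are just pairwise correlations on a subjects-by-raters pivot. A sketch with pandas, using made-up data in which rater C disagrees with the group (names and values are illustrative):

```python
import pandas as pd

# Hypothetical long-format ratings: raters A and B track each other,
# rater C does not.
ratings = {
    "A": [10, 14, 8, 20, 16],
    "B": [11, 15, 9, 19, 17],
    "C": [12, 13, 15, 14, 11],
}
long = pd.DataFrame(
    [(f"P{i}", rater, score)
     for rater, scores in ratings.items()
     for i, score in enumerate(scores, start=1)],
    columns=["subject_id", "rater_id", "rating"],
)

mat = long.pivot(index="subject_id", columns="rater_id", values="rating")
corr = mat.corr()  # Pearson by default; this matrix is what the heatmap shows
print(corr.round(2))

# Each rater's mean correlation with the others; the minimum is the
# calibration candidate (the "cold row" in the heatmap).
mean_corr = (corr.sum() - 1) / (len(corr) - 1)
print("least consistent rater:", mean_corr.idxmin())
```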
Bland-Altman agreement plots. The gold standard for visualizing agreement. Points clustered around zero with no pattern means good agreement. A cluster shifted above or below zero reveals systematic bias. A funnel shape (wider spread at higher values) means agreement deteriorates for more severe cases. The limits of agreement (mean difference plus or minus 1.96 SD) directly quantify the expected range of disagreement between any two raters.
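For any pair of raters, the limits of agreement reduce to a few lines. A sketch with illustrative scores, assuming paired, complete data:

```python
import numpy as np

def limits_of_agreement(a, b):
    """Bland-Altman bias and 95% limits of agreement for two raters'
    scores on the same subjects (paired arrays, same order)."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)           # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Made-up paired scores from two raters on seven subjects
rater1 = [14, 22, 9, 17, 25, 12, 19]
rater2 = [16, 21, 11, 18, 27, 13, 22]
bias, lo, hi = limits_of_agreement(rater1, rater2)
print(f"bias {bias:.1f}, limits of agreement [{lo:.1f}, {hi:.1f}]")
```

A nonzero bias is the shifted cluster described above; wide limits mean two raters can plausibly disagree by that much on any given patient.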
Rater summary table. Each rater's mean score, standard deviation, and systematic bias relative to the group average. If Dr. Chen averages 72 while the group averages 65, that is a +7 systematic bias. This table is what you bring to a rater calibration meeting.
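The bias column is just each rater's mean minus the grand mean. A sketch of the same table with pandas, on made-up ratings (rater names are hypothetical):

```python
import pandas as pd

# Hypothetical long-format ratings for two raters on three patients.
long = pd.DataFrame({
    "subject_id": ["P1", "P1", "P2", "P2", "P3", "P3"],
    "rater_id":   ["Chen", "Patel", "Chen", "Patel", "Chen", "Patel"],
    "rating":     [72, 60, 70, 66, 74, 62],
})

grand_mean = long["rating"].mean()
summary = long.groupby("rater_id")["rating"].agg(["mean", "std", "count"])
summary["bias"] = summary["mean"] - grand_mean  # systematic offset per rater
print(summary)
```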
What to Do With the Results
If ICC is adequate (above 0.75)
- Proceed with data collection (if this was a calibration study) and report the ICC in your methods section
- Include the ICC table and Bland-Altman plot in your manuscript appendix or supplementary materials
- Note any stratified categories where ICC drops below threshold for targeted monitoring during the study
If ICC is below threshold
- Identify problematic raters from the bias table and correlation heatmap. Focus training on specific raters, not the entire team.
- Check category-specific ICC — the problem may be isolated to specific diagnostic categories or score ranges where the rating criteria are ambiguous
- Revise the rating instrument if multiple raters struggle. Add anchor points, behavioral descriptors, or decision rules at each severity level.
- Increase the number of raters if using averaged ratings — average-measure ICC is always higher than single-measure ICC, so using consensus panels can compensate for moderate individual reliability
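The last point can be quantified with the Spearman-Brown step-up formula, which links single-measure and average-measure reliability and follows algebraically from the Shrout-Fleiss forms. For example, a single-rater ICC of 0.55 crosses the 0.75 threshold once three raters are averaged:

```python
def average_measure_icc(icc_single, k):
    """Spearman-Brown step-up: reliability of the mean of k raters,
    given the single-measure ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# Averaging more raters compensates for moderate individual reliability.
for k in (1, 2, 3, 4):
    print(k, round(average_measure_icc(0.55, k), 2))  # 0.55, 0.71, 0.79, 0.83
```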
When to Use Something Else
- Binary ratings (yes/no, present/absent): Use Cohen's kappa (2 raters) or Fleiss' kappa (3+ raters). ICC is designed for continuous or ordinal scales with 5+ levels. Binary data produces unstable ICC estimates.
- Internal consistency of a multi-item scale: Use Cronbach's alpha. It measures whether the items on a questionnaire measure the same construct. ICC measures whether different people applying the same instrument agree.
- Comparing two measurement methods: A dedicated Bland-Altman analysis with limits of agreement and proportional bias testing goes deeper than what the ICC report provides.
- Comparing satisfaction scores across departments: Patient satisfaction analysis uses ANOVA and post-hoc tests for group comparisons rather than rater agreement.
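For the binary case in the first bullet, Cohen's kappa for two raters is simple enough to compute by hand. A self-contained sketch on made-up yes/no ratings:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical labels on the same cases:
    observed agreement corrected for chance agreement."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n            # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[lab] * c2[lab] for lab in set(r1) | set(r2)) / n**2  # by chance
    return (po - pe) / (1 - pe)

r1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no",  "yes", "no"]
r2 = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes", "yes", "no"]
print(round(cohens_kappa(r1, r2), 2))  # → 0.6
```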
References
- Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin. 1979;86(2):420-428.
- Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine. 2016;15(2):155-163.
- Alavi M, et al. A primer of inter-rater reliability in clinical measurement studies. Journal of Clinical Nursing. 2022.
- Inter-rater reliability of psychiatric diagnosis: a systematic review and meta-analysis. PMC. 2025.
- Inter-Rater Reliability: Definition and Applications. Encord. encord.com