Two doctors rate the same 50 patients on a severity scale. Do they agree? Three teachers grade the same set of essays. Are the scores consistent? A factory runs two instruments on the same batch. Do the readings match? The Intraclass Correlation Coefficient (ICC) answers these questions with a single number: an ICC of 0.9 means excellent agreement, while 0.4 means your raters are barely better than noise. Upload a CSV and find out in under 60 seconds.
What Is the Intraclass Correlation Coefficient?
The ICC measures how consistent ratings are when multiple raters (or instruments, or time points) evaluate the same set of subjects. Unlike a simple correlation between two raters, ICC handles any number of raters simultaneously and distinguishes between two fundamentally different questions: do the raters rank subjects in the same order (consistency), and do they assign the same actual scores (agreement)?
The intuition is straightforward. ICC partitions the total variance in your ratings into three buckets: variance due to real differences between subjects (the signal you want), variance due to differences between raters (systematic bias), and residual noise (random disagreement). An ICC close to 1.0 means almost all the variance comes from real subject differences — raters agree strongly. An ICC close to 0 means the raters contribute as much noise as the subjects contribute signal — the ratings are unreliable.
Consider a concrete example. Five psychologists independently rate 100 patients on depression severity using a 0-100 scale. If the psychologists tend to give similar scores to the same patient — Patient A gets scores of 72, 75, 70, 73, and 71 from the five raters — that is high agreement and the ICC will be high. But if Patient A gets scores of 40, 85, 55, 90, and 30, the raters disagree wildly and the ICC will be low, even if each rater individually uses the full scale.
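The variance partition is easy to see in code. Here is a minimal Python sketch (simulated data; the report's own computations run in R) that computes the one-way single-measure ICC from between-subject and within-subject mean squares — when rater noise is small relative to real subject differences, the ICC approaches 1:

```python
import numpy as np

def icc_oneway(ratings):
    """ICC(1,1): one-way random effects, single measure.
    ratings: 2-D array, rows = subjects, cols = raters."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-subject and within-subject mean squares
    ms_between = k * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_within = ((ratings - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(0)
true_severity = rng.uniform(20, 80, size=30)                  # real subject differences
tight = true_severity[:, None] + rng.normal(0, 2, (30, 5))    # raters agree closely
noisy = true_severity[:, None] + rng.normal(0, 25, (30, 5))   # raters disagree wildly
print(round(icc_oneway(tight), 2))   # close to 1: variance is mostly signal
print(round(icc_oneway(noisy), 2))   # much lower: rater noise swamps the signal
```

The same subjects produce a high or low ICC depending entirely on how much noise the raters add — exactly the partition described above.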
ICC is not the same as Pearson correlation. Two raters can correlate at r = 0.95 while one consistently scores 20 points higher than the other. Correlation only captures the ranking relationship; it misses systematic bias entirely. ICC catches both. If you need to know whether raters agree on actual values (not just relative ordering), ICC is the right measure.
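The gap between correlation and agreement is easy to demonstrate. In this Python sketch (hypothetical scores; the terms "agreement" and "consistency" follow the Shrout-Fleiss two-way formulas), rater B simply adds 20 points to everything rater A says:

```python
import numpy as np

def icc_agreement_consistency(x):
    """ICC(2,1) agreement and ICC(3,1) consistency from a
    subjects-by-raters matrix, via two-way ANOVA mean squares."""
    n, k = x.shape
    grand = x.mean()
    row_m, col_m = x.mean(axis=1), x.mean(axis=0)
    ms_rows = k * ((row_m - grand) ** 2).sum() / (n - 1)
    ms_cols = n * ((col_m - grand) ** 2).sum() / (k - 1)
    sse = ((x - row_m[:, None] - col_m[None, :] + grand) ** 2).sum()
    ms_err = sse / ((n - 1) * (k - 1))
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    return agreement, consistency

rater_a = np.arange(10) * 5.0    # scores 0, 5, ..., 45
rater_b = rater_a + 20.0         # identical ranking, +20 systematic bias
r = np.corrcoef(rater_a, rater_b)[0, 1]
agr, con = icc_agreement_consistency(np.column_stack([rater_a, rater_b]))
print(f"Pearson r = {r:.2f}, consistency ICC = {con:.2f}, agreement ICC = {agr:.2f}")
```

Pearson and consistency ICC are a perfect 1.0 here, but the agreement ICC drops to roughly 0.53 — it is the only one of the three that notices the 20-point bias.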
Real-World Applications
ICC shows up wherever multiple people or instruments measure the same thing, and you need to know whether those measurements are trustworthy.
Inter-rater reliability in clinical settings. Do two doctors agree on diagnoses? In mental health, multiple clinicians often rate patient severity on standardized scales like the Hamilton Depression Rating Scale or the PHQ-9. Before you can use those ratings to make treatment decisions, you need to know whether different clinicians would reach the same conclusions. An ICC below 0.60 for a clinical scale means the scores depend more on who rates the patient than on the patient's actual condition — a serious problem for diagnosis and treatment planning.
Test-retest reliability in surveys and assessments. Does the same survey give consistent results when administered twice? If you measure employee engagement this month and again next month (assuming nothing changed), the scores should be similar. ICC quantifies that stability. An engagement survey with test-retest ICC of 0.85 is a reliable instrument. One with ICC of 0.50 means half the score variation comes from measurement noise rather than actual differences in engagement.
Measurement reliability across instruments. Do two instruments agree? A manufacturing lab might use two spectrophotometers to measure the same samples. A hospital might compare blood pressure readings from two devices. ICC tells you whether the instruments are interchangeable or whether switching instruments introduces meaningful measurement error.
Performance review consistency. When three managers rate the same set of employees, are the ratings fair and consistent? ICC reveals whether one manager systematically scores higher than the others (bias) and whether the managers agree on who the top and bottom performers are (consistency). Companies use this to audit their review processes and identify managers who need calibration training.
Content moderation and quality scoring. Online platforms have teams of reviewers scoring content for quality, toxicity, or relevance. ICC tells you whether the scoring rubric is being applied consistently. Low ICC means the rubric is ambiguous or the reviewers need training.
What Data Do You Need?
You need a CSV in long format with one row per subject-rater combination. At minimum, you need three columns: a subject identifier (patient ID, employee ID, sample number), a rater identifier (clinician name, manager ID, instrument label), and a numeric rating (severity score, performance score, measurement value). An optional fourth column can define subgroups (diagnostic category, department, product line) for stratified analysis.
For example, if 5 raters each score 100 subjects, your CSV will have 500 rows — one row for each rater-subject pair. The columns might be patient_id, clinician_id, severity_score, and diagnosis_category.
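A quick way to sanity-check the long format is to pivot it into a subjects-by-raters matrix and look for gaps. A small Python sketch (using the hypothetical column names above; pandas stands in for the report's own R pipeline):

```python
import io
import pandas as pd

# A miniature long-format file: one row per subject-rater pair.
csv = io.StringIO("""patient_id,clinician_id,severity_score
P01,C1,72
P01,C2,75
P01,C3,70
P02,C1,40
P02,C2,85
P02,C3,55
""")
long = pd.read_csv(csv)

# Pivot to the subjects-by-raters matrix that ICC formulas operate on.
wide = long.pivot(index="patient_id", columns="clinician_id",
                  values="severity_score")
print(wide)
# Each subject should appear exactly once per rater; missing cells
# show up as NaN and signal an incomplete rating design.
assert not wide.isna().any().any()
```

If the pivot raises an error about duplicate entries, the same rater scored the same subject twice — worth resolving before running any reliability analysis.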
For stable ICC estimates, aim for at least 30 subjects rated by 3 or more raters (minimum 90 observations). More subjects improve precision — with 100 subjects, the 95% confidence interval around ICC is typically narrow enough to make clear decisions. Fewer than 30 subjects produces wide confidence intervals that make it hard to distinguish good reliability from poor reliability.
ICC works best with continuous or ordinal rating scales that have at least 5 levels. If your ratings are binary (yes/no, pass/fail), use Cohen's kappa or Fleiss' kappa instead. If your scale has only 2-3 levels, the ICC estimate becomes unstable and kappa-based measures are more appropriate.
How to Read the Report
ICC Summary Metrics
The report opens with the headline number: the recommended ICC value, its 95% confidence interval, the F-statistic, and the standard error of measurement (SEM). The ICC value itself ranges from 0 to 1. Cicchetti's (1994) widely used benchmarks classify reliability as poor (below 0.40), fair (0.40-0.59), good (0.60-0.74), or excellent (0.75 and above). The confidence interval matters — an ICC of 0.72 with a CI of 0.65-0.78 is solidly in the good range, but an ICC of 0.72 with a CI of 0.45-0.88 could be anywhere from fair to excellent, and you need more data.
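The benchmark bands are simple threshold checks; a few lines of Python (an illustrative sketch of the classification logic, not the report's internal code) make the cutoffs explicit:

```python
def classify_icc(icc):
    """Benchmark bands for ICC point estimates (Cicchetti, 1994)."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"

for value in (0.35, 0.55, 0.72, 0.90):
    print(value, classify_icc(value))   # poor, fair, good, excellent
```

Remember that the label belongs to the point estimate only — if the 95% CI crosses a band boundary, report the range, not just the label.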
All 10 ICC Forms
There is not just one ICC — the McGraw and Wong (1996) taxonomy defines 10 forms based on three choices. First, the model: ICC(1) is a one-way model in which each subject may be rated by a different set of raters; ICC(2) treats the raters as a random sample from a larger population (two-way random effects); ICC(3) treats them as the only raters of interest (two-way mixed effects). Second, the type: agreement ICC penalizes systematic differences between raters, while consistency ICC ignores them. Third, the unit: single-measure ICC reflects the reliability of one rater's score, while average-measure ICC reflects the reliability of the mean across all raters. The report presents all 10 forms as a horizontal bar chart with confidence-interval error bars. Which form to use depends on your study design: if your raters are a random sample and you want to generalize to other raters, use ICC(2); if you only care about these specific raters, use ICC(3). The report recommends the most appropriate form based on your data structure.
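The mean-square formulas behind the six classic Shrout-Fleiss forms (the 10-form taxonomy adds consistency/agreement variants of these) can be sketched in a few lines of Python. The report itself computes these in R; the numbers below reproduce the worked example from Shrout and Fleiss (1979):

```python
import numpy as np

def icc_forms(x):
    """The six classic Shrout-Fleiss ICC forms from a
    subjects-by-raters matrix of ratings."""
    n, k = x.shape
    grand = x.mean()
    row_m, col_m = x.mean(axis=1), x.mean(axis=0)
    msb = k * ((row_m - grand) ** 2).sum() / (n - 1)            # between subjects
    msw = ((x - row_m[:, None]) ** 2).sum() / (n * (k - 1))     # within subjects
    msc = n * ((col_m - grand) ** 2).sum() / (k - 1)            # between raters
    mse = ((x - row_m[:, None] - col_m[None, :] + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return {
        "ICC(1,1)": (msb - msw) / (msb + (k - 1) * msw),
        "ICC(1,k)": (msb - msw) / msb,
        "ICC(2,1)": (msb - mse) / (msb + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(2,k)": (msb - mse) / (msb + (msc - mse) / n),
        "ICC(3,1)": (msb - mse) / (msb + (k - 1) * mse),
        "ICC(3,k)": (msb - mse) / msb,
    }

# Worked example from Shrout & Fleiss (1979): 6 targets, 4 judges.
data = np.array([
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
], dtype=float)
for name, value in icc_forms(data).items():
    print(f"{name}: {value:.2f}")
# ICC(1,1) ≈ 0.17, ICC(2,1) ≈ 0.29, ICC(3,1) ≈ 0.71
```

Note how widely the forms can disagree on the same data — this is why choosing the model before looking at the numbers matters.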
Stratified ICC by Category
If you provide a grouping variable (like diagnostic category or department), the report computes ICC within each subgroup. This is often where the most actionable insights appear. Your overall ICC might be 0.82 (excellent), but the stratified view might reveal that anxiety disorder ratings have ICC of only 0.58 while depression ratings reach 0.86. That tells you exactly where to focus rater training. The bar chart with error bars makes it easy to spot categories where reliability drops below your threshold.
Rater Correlation Heatmap
The heatmap shows pairwise correlations between all raters. Each cell indicates how closely two raters track each other. A uniformly warm heatmap (all correlations above 0.60) means everyone agrees with everyone. A cold row or column points to a rater who disagrees with the group — either due to training gaps, different interpretation of criteria, or systematic bias. Suspiciously perfect agreement (correlations above 0.95) between specific pairs may indicate that raters collaborated or shared notes, which violates the independence assumption.
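The matrix behind the heatmap is just pairwise correlations between rater columns. A small Python sketch (simulated raters; the report builds its heatmap in R) shows what a "cold" column looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = rng.normal(50, 10, 40)
ratings = np.column_stack([
    truth + rng.normal(0, 3, 40),   # rater 1: tracks the true severity
    truth + rng.normal(0, 3, 40),   # rater 2: tracks the true severity
    rng.normal(50, 10, 40),         # rater 3: effectively guessing
])
corr = np.corrcoef(ratings, rowvar=False)
print(np.round(corr, 2))
# Raters 1 and 2 correlate strongly; rater 3's row is near zero —
# the cold row/column that flags a rater who disagrees with the group.
```

In a real report, a rater like rater 3 is the first candidate for retraining or a rubric review.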
Bland-Altman Agreement Plot
The Bland-Altman plot shows the difference between rater pairs plotted against their average. If raters agree, points cluster around zero with no pattern. A cluster shifted above or below zero reveals systematic bias — one rater scores consistently higher. A funnel shape (wider spread at higher values) means agreement is worse for more severe cases. This plot is the gold standard for visualizing agreement and is directly interpretable: you can read off the limits of agreement (mean difference plus or minus 1.96 standard deviations) and decide whether the disagreement is clinically or practically meaningful.
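The quantities on the plot are straightforward to compute. A minimal Python sketch (hypothetical scores for two raters) derives the bias and the 95% limits of agreement:

```python
import numpy as np

def bland_altman(a, b):
    """Mean difference (bias) and 95% limits of agreement for two raters."""
    diff = np.asarray(a, float) - np.asarray(b, float)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

rater_a = [72, 65, 80, 55, 90, 60, 75, 68]
rater_b = [70, 66, 78, 57, 88, 59, 76, 65]
bias, (lo, hi) = bland_altman(rater_a, rater_b)
print(f"bias = {bias:.2f}, limits of agreement = [{lo:.2f}, {hi:.2f}]")
# bias = 0.75, limits of agreement roughly [-2.84, 4.34]
```

Whether a 4-point disagreement band is acceptable is a domain judgment, not a statistical one — that is exactly what the plot lets you decide.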
Rater Summary Table
The rater summary table shows each rater's mean score, standard deviation, and systematic bias relative to the group average. If Rater 3 averages 75 while the group averages 65, that is a +10 systematic bias. The table makes it easy to identify lenient and strict raters. This is especially valuable for HR teams auditing performance reviews or clinical directors overseeing assessment programs.
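The table reduces to per-rater column statistics against the grand mean. A Python sketch with hypothetical scores (three raters, four subjects) shows a lenient rater standing out:

```python
import numpy as np

raters = ["R1", "R2", "R3"]
scores = np.array([      # rows = subjects, cols = raters
    [60, 62, 71],
    [55, 54, 66],
    [70, 69, 80],
    [65, 66, 74],
], dtype=float)
grand_mean = scores.mean()
print(f"{'rater':>5} {'mean':>6} {'sd':>6} {'bias':>6}")
for name, col in zip(raters, scores.T):
    # bias = how far this rater's average sits from the group average
    print(f"{name:>5} {col.mean():6.2f} {col.std(ddof=1):6.2f} "
          f"{col.mean() - grand_mean:+6.2f}")
```

Here R3 averages nearly 7 points above the group — a systematic leniency that an agreement ICC would penalize but a consistency ICC would ignore.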
Model Comparison Table and Category Reliability Table
The model comparison table presents ICC values for the recommended models side by side — agreement ICC (which penalizes systematic bias between raters) versus consistency ICC (which ignores it). If the two differ substantially, systematic bias is present and needs to be addressed through calibration. The category reliability table shows ICC, sample size, and confidence intervals for each subgroup, making it easy to flag categories that fall below your reliability threshold.
Interpretation and Recommendations
The final section translates all the numbers into plain language: which ICC model was selected and why, whether reliability meets your threshold, which raters or categories need attention, and specific recommendations for improvement. This is the card to share with stakeholders who do not need the statistical details.
Choosing the Right ICC Model
The most common mistake in ICC analysis is choosing the wrong model. The choice comes down to three questions.
Are your raters a random sample from a larger population? If you are studying inter-rater reliability for a clinical scale and your five clinicians represent "any trained clinician who might use this scale," use ICC(2) — the two-way random effects model. This allows you to generalize your reliability finding beyond the specific raters in your study. If only these five clinicians will ever use the scale, use ICC(3) — the two-way mixed effects model.
Do you care about absolute agreement or just consistency? Agreement ICC penalizes systematic differences between raters. If Rater A averages 80 and Rater B averages 60, the agreement ICC will be lower than the consistency ICC. Use agreement when the actual score matters (clinical diagnosis, certification testing). Use consistency when only the ranking matters (relative performance comparisons).
Single measure or average? Single-measure ICC tells you how reliable one rater's score is. Average-measure ICC tells you how reliable the mean of all raters' scores is. If your decision will be based on a single rater's score (e.g., a patient sees one clinician), report single-measure ICC. If it will be based on the average of multiple raters (e.g., a panel of judges), report average-measure ICC.
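The single-versus-average relationship follows the Spearman-Brown step-up formula (exact for consistency-type ICCs): averaging over more raters always increases reliability. A quick Python sketch:

```python
def average_measure(icc_single, k):
    """Spearman-Brown step-up: reliability of the mean of k raters,
    given the single-rater (consistency) ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)

# A single rater at ICC = 0.60 ("good")...
for k in (1, 2, 3, 5):
    print(f"{k} raters: {average_measure(0.60, k):.2f}")
# 1 rater: 0.60, 2 raters: 0.75, 3 raters: 0.82, 5 raters: 0.88
```

This is also why average-measure ICC is only the honest number to report when your real-world decisions will actually use the average of that many raters.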
When to Use Something Else
ICC is powerful but not universal. Several situations call for a different tool.
If your ratings are categorical rather than numeric — pass/fail, present/absent, category A/B/C — use Cohen's kappa (for two raters) or Fleiss' kappa (for three or more). Kappa is designed for nominal and ordinal categories. Applying ICC to binary data produces unstable and misleading results.
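For binary ratings, Cohen's kappa corrects observed agreement for the agreement expected by chance. A minimal Python sketch with hypothetical pass/fail ratings from two raters:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters over nominal categories."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n       # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_exp = sum((c1[c] / n) * (c2[c] / n)                 # chance agreement
                for c in set(c1) | set(c2))
    return (p_obs - p_exp) / (1 - p_exp)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(r1, r2), 2))   # 0.47
```

Note how the raters agree on 6 of 8 cases (75%), yet kappa is only 0.47 — much of that raw agreement is what two coin-flipping raters with the same base rates would achieve anyway.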
If you want to compare two measurement methods — not raters, but instruments or techniques — and visualize where they agree and disagree, Bland-Altman plots are the standard approach. The ICC report includes Bland-Altman plots as one of its layout cards, but if method comparison is your primary question, a dedicated Bland-Altman analysis gives you limits of agreement and proportional bias testing in more detail.
If you are assessing the internal consistency of a multi-item questionnaire — do the 10 items on my survey all measure the same underlying construct? — use Cronbach's alpha. Alpha measures consistency among items, not among raters. It answers a fundamentally different question: whether the questionnaire itself is reliable, not whether the people using it are consistent.
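To make the items-versus-raters distinction concrete, here is a short Python sketch of Cronbach's alpha on simulated questionnaire data (five items all driven by one latent trait):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an n-respondents by k-items score matrix."""
    items = np.asarray(items, float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)

rng = np.random.default_rng(7)
construct = rng.normal(0, 1, 200)                          # latent trait
items = construct[:, None] + rng.normal(0, 0.7, (200, 5))  # 5 noisy items
print(round(cronbach_alpha(items), 2))   # high: items share one construct
```

Swap "respondents" for "subjects" and "items" for "raters" and the arithmetic resembles an average-measure consistency ICC — but the question it answers is about the instrument, not the people scoring with it.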
If you have only two raters and a continuous scale, a simple Pearson correlation paired with a Bland-Altman plot may be sufficient. ICC still works with two raters, but the simpler tools are easier to interpret and communicate when the study design is straightforward.
Common Pitfalls
Using correlation instead of ICC. Pearson correlation measures linear association but ignores systematic bias. Two raters can correlate r = 0.90 while one consistently scores 20 points higher than the other. Correlation says "excellent," ICC correctly says "poor agreement." Always use ICC when absolute agreement matters.
Ignoring the confidence interval. An ICC of 0.70 sounds good, but if the 95% CI is 0.40-0.90, you cannot rule out poor reliability. With small samples (under 30 subjects), confidence intervals are wide enough to be nearly useless. Report the CI alongside the point estimate, and collect enough data to get a CI width below 0.15.
Treating ordinal ratings as continuous without checking. ICC assumes equal intervals between rating levels. A 5-point Likert scale treated as continuous is usually fine in practice (and widely accepted in the literature), but a 3-point scale is too coarse for stable ICC estimates. If you have 5 or more levels, proceed with ICC. Below that, consider weighted kappa.
The R Code Behind the Analysis
Every report includes the exact R code used to produce the results — reproducible, auditable, and citable. The analysis uses the irr package's icc() function to compute all 10 ICC forms with confidence intervals, following the Shrout and Fleiss (1979) framework as extended to the 10-form taxonomy by McGraw and Wong (1996) — the standard treatment in virtually every reliability textbook. Rater correlation heatmaps use base R's cor(), Bland-Altman plots use custom calculations of mean difference and limits of agreement, and stratified ICC is computed by subsetting the data by category and running icc() on each subset. No custom implementations, no black boxes. Every step is visible in the code tab of your report.