The assessment of IRR provides a way of quantifying the degree of agreement between two or more coders who make independent ratings about the features of a set of subjects. In this paper, subjects will be used as a generic term for the people, things, or events that are rated in a study, such as the number of times a child reaches for a caregiver, the level of empathy displayed by an interviewer, or the presence or absence of a psychological diagnosis. Coders will be used as a generic term for the individuals who assign ratings in a study, such as trained research assistants or randomly-selected participants.
In classical test theory (Lord, 1959; Novick, 1966), observed scores (X) from psychometric instruments are thought to be composed of a true score (T) that represents the subject's score that would be obtained if there were no measurement error, and an error component (E) that is due to measurement error (also called noise), such that

Observed Score = True Score + Measurement Error,    (1)

or in abbreviated symbols,

X = T + E,    (2)
where the variance of the observed scores, Var(X), is equal to the variance of the true scores, Var(T), plus the variance of the measurement error, Var(E), if the assumption that the true scores and errors are uncorrelated is met.
Measurement error (E) prevents one from being able to observe a subject’s true score directly, and may be introduced by several factors. For example, measurement error may be introduced by imprecision, inaccuracy, or poor scaling of the items within an instrument (i.e., issues of internal consistency); instability of the measuring instrument in measuring the same subject over time (i.e., issues of test-retest reliability); and instability of the measuring instrument when measurements are made between coders (i.e., issues of IRR). Each of these issues may adversely affect reliability, and the last of these issues is the focus of the current paper.
IRR analysis aims to determine how much of the variance in the observed scores is due to variance in the true scores after the variance due to measurement error between coders has been removed (Novick, 1966), such that

IRR = Var(T) / (Var(T) + Var(E)) = Var(T) / Var(X).    (3)
For example, an IRR estimate of 0.80 would indicate that 80% of the observed variance is due to true score variance or similarity in ratings between coders, and 20% is due to error variance or differences in ratings between coders.
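To make equations 1 through 3 concrete, the brief simulation below is a minimal sketch in Python with NumPy; the specific variances are arbitrary illustrations chosen only to reproduce the 0.80 reliability figure from the example above, not part of any IRR procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 10_000

# Equation 2: each observed score is a true score plus independent error
T = rng.normal(loc=3.0, scale=1.0, size=n_subjects)   # Var(T) = 1.00
E = rng.normal(loc=0.0, scale=0.5, size=n_subjects)   # Var(E) = 0.25
X = T + E

# Because T and E are uncorrelated, Var(X) is approximately Var(T) + Var(E)
print(np.var(X), np.var(T) + np.var(E))                # both ~1.25

# Equation 3: reliability as the proportion of observed variance due to true scores
print(np.var(T) / np.var(X))                           # ~0.80
```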
Because true scores (T) and measurement errors (E) cannot be directly accessed, the IRR of an instrument cannot be directly computed. Instead, true scores can be estimated by quantifying the covariance among sets of observed scores (X) provided by different coders for the same set of subjects, where it is assumed that the shared variance between ratings approximates the value of Var(T) and the unshared variance between ratings approximates Var(E), which allows reliability to be estimated in accordance with equation 3.
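A minimal sketch of this logic, assuming two coders whose ratings contain the same true score plus independent errors of equal variance, shows why the covariance between coders can stand in for Var(T) and the inter-coder correlation for the reliability in equation 3.

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.normal(3.0, 1.0, size=10_000)          # unobservable true scores, Var(T) ~ 1.0

# Two coders rate the same subjects; each adds independent error of equal variance
X1 = T + rng.normal(0.0, 0.5, size=T.size)
X2 = T + rng.normal(0.0, 0.5, size=T.size)

# The covariance between coders approximates Var(T), because the errors are unshared
print(np.cov(X1, X2)[0, 1])                    # ~1.0

# With equal error variances, the correlation approximates Var(T) / Var(X) (equation 3)
print(np.corrcoef(X1, X2)[0, 1])               # ~0.80
```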
IRR analysis is distinct from validity analysis, which assesses how closely an instrument measures an actual construct rather than how well coders provide similar ratings. Instruments may have varying levels of validity regardless of the IRR of the instrument. For example, an instrument may have good IRR but poor validity if coders’ scores are highly similar and have a large shared variance but the instrument does not properly represent the construct it is intended to measure.
How are studies designed to assess IRR?
Before a study utilizing behavioral observations is conducted, several design-related considerations must be decided a priori that impact how IRR will be assessed. These design issues are introduced here, and their impact on computation and interpretation is discussed more thoroughly in the computation sections below.
First, it must be decided whether a coding study is designed such that all subjects in a study are rated by multiple coders, or if a subset of subjects are rated by multiple coders with the remainder coded by single coders. The contrast between these two options is depicted in the left and right columns of the figure below. In general, rating all subjects is acceptable at the theoretical level for most study designs. However, in studies where providing ratings is costly and/or time-intensive, selecting a subset of subjects for IRR analysis may be more practical because it requires fewer overall ratings to be made, and the IRR for the subset of subjects may be used to generalize to the full sample.
Figure. Designs for assigning coders to subjects in IRR studies.
Second, it must be decided whether the subjects that are rated by multiple coders will be rated by the same set of coders (fully crossed design) or whether different subjects are rated by different subsets of coders. The contrast between these two options is depicted in the upper and lower rows of the same figure. Although fully crossed designs can require a higher overall number of ratings to be made, they allow for systematic bias between coders to be assessed and controlled for in an IRR estimate, which can improve overall IRR estimates. For example, ICCs may underestimate the true reliability for some designs that are not fully crossed, and researchers may need to use alternative statistics that are not well distributed in statistical software packages to assess IRR in some studies that are not fully crossed (Putka, Le, McCloy, & Diaz, 2008).
Third, the psychometric properties of the coding system used in a study should be examined for possible areas that could strain IRR estimates. Naturally, rating scales already shown to have poor IRR are likely to produce low IRR estimates in subsequent studies. However, even if a rating system has been shown to have good IRR, restriction of range can potentially occur when a rating system is applied to new populations, which can substantially lower IRR estimates. Restriction of range often lowers IRR estimates because the Var(T) component of equation 3 is reduced, producing a lower IRR estimate even if Var(E) does not change. For example, consider two hypothetical studies where coders rate therapists’ levels of empathy on a well-validated 1 to 5 Likert-type scale where 1 represents very low empathy and 5 represents very high empathy. The first study recruits therapists from a community clinic; the ratings are normally distributed across the five points of the scale, and IRR for empathy ratings is good. The second study uses the same coders and coding system as the first study but recruits therapists from a university clinic who are highly trained at delivering therapy in an empathetic manner; the ratings are restricted to mostly 4’s and 5’s on the scale, and IRR for empathy ratings is low. IRR is likely to have been reduced due to restriction of range, where Var(T) was reduced in the second study even though Var(E) may have been similar between studies because the same coders and coding system were used. In cases where restricted range is likely, it is worth considering whether the scale should be modified, for example by expanding it into a 1 to 9 Likert-type scale, adjusting the anchoring points, or omitting the scale altogether. These decisions are best made before a study begins, and pilot testing may be helpful for assessing the suitability of new or modified scales.
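The restriction-of-range effect can be illustrated with a toy simulation; the cutoff used to mimic the university-clinic sample is arbitrary, and the inter-coder Pearson correlation stands in here for a formal IRR statistic.

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(3.0, 1.0, size=20_000)              # latent empathy, community-clinic spread
X1 = T + rng.normal(0.0, 0.5, size=T.size)         # coder 1's ratings
X2 = T + rng.normal(0.0, 0.5, size=T.size)         # coder 2's ratings

# Full-range sample: inter-coder agreement is high
print(np.corrcoef(X1, X2)[0, 1])                   # ~0.80

# Restricted sample: keep only highly empathetic therapists (arbitrary cutoff)
mask = T > 4.0
print(np.corrcoef(X1[mask], X2[mask])[0, 1])       # substantially lower; Var(T) has shrunk
```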
Fourth, in studies using trained coders, it may often be necessary to conduct a considerable amount of training with practice subjects before subjects from the real study are coded. In these cases it is common to specify an a priori level of IRR that must be achieved before subjects from the real study are rated and to report this in the final study write-up. Commonly, the qualitative ratings for different IRR statistics can be used to assign these cutoff points; for example, a researcher may require all IRR estimates to be at least in the “good” range before coders can rate the real subjects in a study.
What are common mistakes that people make in assessing and reporting IRR?
Most general courses in statistics and experimental design devote little or no time to the study of IRR, which, combined with the lack of published comprehensive guidelines for assessing and reporting IRR, may result in several commonly-made mistakes in behavioral research. Several of these mistakes are briefly described below.
Using percentages of agreement
Despite being definitively rejected as an adequate measure of IRR (Cohen, 1960; Krippendorff, 1980), many researchers continue to report the percentage that coders agree in their ratings as an index of coder agreement. For categorical data, this may be expressed as the number of agreements in observations divided by the total number of observations. For ordinal, interval, or ratio data where close-but-not-perfect agreement may be acceptable, percentages of agreement are sometimes expressed as the percentage of ratings that are in agreement within a particular interval. Perhaps the biggest criticism of percentages of agreement is that they do not correct for agreements that would be expected by chance and therefore overestimate the level of agreement. For example, if coders were to randomly rate 50% of subjects as “depressed” and 50% as “not depressed” without regard to the subject’s actual characteristics, the expected percentage of agreement would be 50% even though all overlapping ratings were due to chance. If coders randomly rated 10% of subjects as depressed and 90% as not depressed, the expected percentage of agreement would be 82% even though this seemingly high level of agreement is still due entirely to chance.
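Both chance-agreement figures can be verified directly, and contrasting them with Cohen's kappa, which corrects observed agreement for chance agreement, shows why raw percentages of agreement overstate reliability. The 50/50 and 10/90 base rates below are the hypothetical ones from the example.

```python
# Expected agreement when two coders each rate "depressed" with probability p,
# independently of the subjects' actual characteristics
def chance_agreement(p: float) -> float:
    return p * p + (1 - p) * (1 - p)

print(chance_agreement(0.50))   # 0.50 -> 50% agreement expected by chance alone
print(chance_agreement(0.10))   # 0.82 -> 82% agreement expected by chance alone

# Cohen's kappa corrects observed agreement (p_o) for chance agreement (p_e):
# kappa = (p_o - p_e) / (1 - p_e)
p_o, p_e = 0.82, chance_agreement(0.10)
print((p_o - p_e) / (1 - p_e))  # ~0: agreement is no better than chance
```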
Not reporting which statistic or variant was used in an IRR analysis
Many studies fail to report which statistic was used to compute IRR (e.g., Cohen’s kappa, Fleiss’s kappa, ICCs) or which variant of that statistic was computed (e.g., Siegel & Castellan’s (1988) variant of Cohen’s kappa, a two-way consistency average-measures ICC). Reporting both the statistic and its computational variant is crucial because there are many statistics for computing IRR and different variants can substantially influence the interpretation of IRR estimates. Reference manuals for statistical software packages typically provide references for the variants of IRR statistics that are used for computations, and some software packages allow users to select which variant they wish to compute.
Not using the correct statistic for the study design
Many factors must be considered in the selection of the most appropriate statistical test, such as the metric in which a variable was coded (e.g., nominal vs. ordinal, interval, or ratio), the design of the study (e.g., whether all subjects vs. a subset of subjects are rated by multiple coders), and the intended purpose of the IRR estimate (e.g., to estimate the reliability of individual coders’ ratings vs. the reliability of the mean ratings from multiple coders). Researchers should be careful to assess the appropriateness of a statistic for their study design and look for alternative options that may be more suitable for their study. Appropriate statistics for various study designs are discussed in more depth in the computation sections below.
Not performing IRR analyses on variables in their final transformed form
It is often more appropriate to report IRR estimates for variables in the form in which they will be used for model testing rather than in their raw form. For example, if a researcher counts the frequency of certain behaviors and then square-root transforms these counts for use in subsequent hypothesis testing, assessing IRR for the transformed variables, rather than the raw behavior counts, more accurately indicates the relative level of measurement error that is present in the final hypothesis testing. In situations where IRR estimates are high for a variable in its raw form but low for the variable in its final form (or vice versa), both IRR estimates may be reported to demonstrate that coders reliably rated subjects, despite the IRR for the final variable being low and possibly containing too much measurement error for further analysis.
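As a hypothetical illustration, the sketch below compares an IRR estimate (approximated here by a Pearson correlation) computed on raw behavior counts with one computed on the square-root transformed counts that would actually enter hypothesis tests; the simulated counts are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical frequency counts of a behavior from two coders rating the same 50 children
rate = rng.gamma(shape=2.0, scale=3.0, size=50)    # each child's underlying behavior rate
counts_1 = rng.poisson(rate)                       # coder 1's counts
counts_2 = rng.poisson(rate)                       # coder 2's counts

# IRR (approximated by a Pearson correlation) on the raw counts...
print(np.corrcoef(counts_1, counts_2)[0, 1])

# ...and on the square-root transformed counts used in the final analyses
print(np.corrcoef(np.sqrt(counts_1), np.sqrt(counts_2))[0, 1])
```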
Not interpreting the effect of IRR on power and pertinent study questions
Finally, many researchers neglect to interpret the effect of IRR estimates on questions of interest to their study. For example, if it is important to show that coders can independently reach similar conclusions about the subjects they observe, it can be helpful to provide qualitative interpretations of IRR estimates by comparing them to previously-observed IRR estimates from similar instruments or providing qualitative ratings based on pre-established cutoff points for good, acceptable, and unacceptable IRR.
Implications of IRR estimates on statistical power should be commented on if the variables observed in the study are subject to subsequent hypothesis testing. Low IRR indicates that the observed ratings contain a large amount of measurement error, which adds noise to the signal a researcher wishes to detect in their hypothesis tests. Low IRR may increase the probability of type-II errors, as the increase in noise may suppress the researcher’s ability to detect a relationship that actually exists, and thus lead to false conclusions about the hypotheses under study.
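One way to quantify this is the classical correction-for-attenuation result from classical test theory, under which the correlation that can be observed between two measured variables shrinks with the square root of the product of their reliabilities; the effect sizes below are purely illustrative.

```python
import math

true_r = 0.40               # hypothetical true correlation between the constructs
irr_x, irr_y = 0.90, 0.50   # reliabilities (e.g., IRR) of the two measured variables

# Correction for attenuation: the observable correlation shrinks with low reliability
observed_r = true_r * math.sqrt(irr_x * irr_y)
print(observed_r)           # ~0.27

# A smaller detectable effect means more subjects are needed for the same power,
# so low IRR raises the risk of type-II errors at a fixed sample size.
```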
Possible reasons for low IRR should be discussed, e.g., IRR may be low due to restricted range, poor psychometric properties of a scale, poorly trained coders, difficulty in observing or quantifying the construct of interest, or other reasons. Decisions about dropping or retaining variables with low IRR from analyses should be discussed, and alternative models may need to be proposed if variables are dropped.