A conceptual understanding of kappa may still leave the actual calculations a mystery. The following example is intended for those who desire a more complete understanding of the kappa statistic.

Let us assume that 2 hopeless clinicians are assessing the presence of Murphy's sign in a group of patients. They have no idea what they are doing, and their evaluations are no better than blind guesses. Let us say they are each guessing the presence and absence of Murphy's sign in a 50:50 ratio: half the time they guess that Murphy's sign is present, and the other half that it is absent. If you were completing a 2 × 2 table, with these 2 clinicians evaluating the same 100 patients, how would the cells, on average, get filled in?

represents the completed 2 × 2 table. Guessing at random, the 2 hopeless clinicians have agreed on the assessments of 50% of the patients. How did we arrive at the numbers shown in the table? According to the laws of chance, each clinician guesses that half of the 50 patients assessed as positive by the other clinician (i.e., 25 patients) have Murphy's sign, and likewise that half of the 50 patients assessed as negative by the other clinician have it. Each of the 4 cells therefore contains, on average, 25 patients, and the 2 cells of agreement (positive–positive and negative–negative) together account for 50 patients.
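The expected cell counts under chance follow from multiplying each clinician's marginal proportions; a minimal sketch of the arithmetic (the variable names are ours, not from the article):

```python
# Two clinicians each independently guess "positive" for 50% of 100 patients.
n = 100
p1 = p2 = 0.5  # proportion of patients each clinician calls positive

# Expected 2 x 2 cell counts under chance alone
both_positive = n * p1 * p2              # 25 patients in the positive-positive cell
both_negative = n * (1 - p1) * (1 - p2)  # 25 patients in the negative-negative cell
disagreements = n - both_positive - both_negative  # the remaining 50 patients

chance_agreement = (both_positive + both_negative) / n
print(chance_agreement)  # 0.5
```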

How would this exercise work if the same 2 hopeless clinicians were to randomly guess that 60% of the patients had a positive result for Murphy's sign? provides the answer in this situation. The clinicians would agree for 52 of the 100 patients (or 52% of the time) and would disagree for 48 of the patients. In a similar way, using 2 × 2 tables for higher and higher positive proportions (i.e., how often the observer makes the diagnosis), you can figure out how often the observers will, on average, agree by chance alone (as delineated in ).
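The same calculation generalizes to any positive proportion. A small sketch (the function name is ours) reproduces the 60% case and shows how chance agreement climbs as the positive proportion moves away from 50%:

```python
def chance_agreement(p_positive):
    """Expected agreement by chance when both observers independently
    call a proportion p_positive of patients positive at random."""
    p = p_positive
    # Agree when both guess positive, or both guess negative
    return p * p + (1 - p) * (1 - p)

print(round(chance_agreement(0.5), 2))  # 0.5  -> agree on 50 of 100 patients
print(round(chance_agreement(0.6), 2))  # 0.52 -> agree on 52 of 100 patients
print(round(chance_agreement(0.9), 2))  # 0.82 -> agree on 82 of 100 patients
```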

At this point, we have demonstrated 2 things. First, even if the reviewers have no idea what they are doing, there will be substantial agreement by chance alone. Second, the magnitude of the agreement by chance increases as the proportion of positive (or negative) assessments increases.

But how can we calculate kappa when the clinicians whose assessments are being compared are no longer “hopeless,” in other words, when their assessments reflect a level of expertise that one might actually encounter in practice? It's not very hard.

Let's take a simple example, returning to the premise that each of the 2 clinicians assesses Murphy's sign as being present in 50% of the patients. Here, we assume that the 2 clinicians now have some knowledge of Murphy's sign and their assessments are no longer random. Each decides that 50% of the patients have Murphy's sign and 50% do not, but they still don't agree on every patient. Rather, for 40 patients they agree that Murphy's sign is present, and for 40 patients they agree that Murphy's sign is absent. Thus, they agree on the diagnosis for 80% of the patients, and they disagree for 20% of the patients (see ). How do we calculate the kappa score in this situation?

Recall that if each clinician found that 50% of the patients had Murphy's sign but their decision about the presence of the sign in each patient was random, the clinicians would be in agreement 50% of the time, each cell of the 2 × 2 table would have 25 patients (as shown in ), chance agreement would be 50%, and maximum agreement beyond chance would also be 50%.

The no-longer-hopeless clinicians' agreement on 80% of the patients is therefore 30 percentage points above chance. Kappa compares the observed agreement above chance with the maximum possible agreement above chance: 30%/50% = 60% of the possible agreement above chance, which gives these clinicians a kappa of 0.6, as shown in .
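The arithmetic of this worked example can be written out directly (the variable names are ours):

```python
observed_agreement = 0.80  # the clinicians agree on 80 of 100 patients
chance_agreement = 0.50    # expected agreement from 50:50 random guessing

# kappa = agreement above chance / maximum possible agreement above chance
kappa = (observed_agreement - chance_agreement) / (1 - chance_agreement)
print(round(kappa, 2))  # 0.6
```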

Hence, to calculate kappa when only 2 alternatives are possible (e.g., presence or absence of a finding), you need just 2 numbers: the percentage of patients that the 2 assessors agreed on and the expected agreement by chance. Both can be determined by constructing a 2 × 2 table exactly as illustrated above.
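Putting the two pieces together, kappa can be computed from the four cell counts of any 2 × 2 agreement table; a sketch following the same logic (the function and parameter names are ours). In the article's example the 20 disagreements split 10 each way, since both clinicians call exactly 50% of patients positive:

```python
def kappa_2x2(a, b, c, d):
    """Kappa from a 2 x 2 agreement table:
    a = both positive, b = rater 1 positive / rater 2 negative,
    c = rater 1 negative / rater 2 positive, d = both negative."""
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from each rater's marginal (overall) positive proportion
    p1_pos = (a + b) / n
    p2_pos = (a + c) / n
    chance = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)
    return (observed - chance) / (1 - chance)

# Agree on 40 positives and 40 negatives, disagree on 20 patients
print(round(kappa_2x2(40, 10, 10, 40), 2))  # 0.6
# The hopeless clinicians' table: 25 in every cell, kappa is 0
print(round(kappa_2x2(25, 25, 25, 25), 2))  # 0.0
```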

The bottom line

Chance agreement is not always 50%; rather, it varies from one clinical situation to another. When the prevalence of a disease or outcome is low, 2 observers will guess that most patients are normal and the symptom of the disease is absent. This situation will lead to a high percentage of agreement simply by chance. When the prevalence is high, there will also be high apparent agreement, with most patients judged to exhibit the symptom. Kappa measures the agreement after correcting for this variable degree of chance agreement.