Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2859621

Formats

Article sections

- SUMMARY
- 1. Introduction
- 2. Methods
- 3. Simulation Studies
- 4. Example
- 5. Discussion
- Supplementary Material
- REFERENCES

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 April 26.

Published in final edited form as:

Published online 2008 May 23. doi: 10.1111/j.1541-0420.2008.01054.x

PMCID: PMC2859621

NIHMSID: NIHMS186704

Department of Biostatistics, Rollins School of Public Health of Emory University, 1518 Clifton RD NE, Atlanta, GA, 30322, USA

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

Assessing agreement is often of interest in clinical studies to evaluate the similarity of measurements produced by different raters or methods on the same subjects. We present a modified weighted kappa coefficient to measure agreement between bivariate discrete survival times. The proposed kappa coefficient accommodates censoring by redistributing the mass of censored observations within the grid where the unobserved events may potentially happen. A generalized modified weighted kappa is proposed for multivariate discrete survival times. We estimate the modified kappa coefficients nonparametrically through a multivariate survival function estimator. The asymptotic properties of the kappa estimators are established and the performance of the estimators are examined through simulation studies of bivariate and trivariate survival times. We illustrate the application of the modified kappa coefficient in the presence of censored observations with data from a prostate cancer study.

In many scenarios in biomedical sciences, the same outcome may be measured by different methods or raters. For example, the disease status of patients may be evaluated by different raters; an event may be measured by a gold standard or a relatively simple method. Evaluation of agreement between these measurements is needed to determine whether different raters generate similar values on same patients; or whether the simple method reproduces the results of the gold standard.

Statistical methods for measuring agreement between categorical outcomes are well established. Cohen (1960) developed the kappa statistic as an agreement index for two binary variables. It has an appealing interpretation as a measure of chance-corrected agreement. Later, Cohen (1968) generalized the original kappa to the weighted kappa coefficient for ordinal discrete outcomes. Since its development, kappa with its extensions (Cohen, 1960 and 1968; Fleiss 1971; Kraemer, 1980) have been well studied in the literature and broadly applied in many areas (Maclure and Willett, 1987; Korten et al., 1992; Klar et al., 2000; Williamson et al., 2000). However, applications of kappa in survival studies have been quite limited due to the presence of censoring.

In biomedical sciences, researchers are often interested in assessing agreement between two survival times that are measured on the same subjects. For example, in depression studies, the time of onset of clinical depression is measured using both clinician-administered and patient self-reported scales. Evaluating agreement between the disease onset times is useful in assessing the reliability of patient self-report and identifying an appropriate instrument for diagnosing depression. Later in this paper, we will describe another study where the time to the recurrence of prostate cancer is assessed by two different techniques.

Due to the presence of censored observations in survival studies, most existing agreement measures such as kappa cannot be directly applied. To assess the agreement between discrete survival times, Guo and Manatunga (2005) developed a local kappa coefficient that measures local agreement between bivariate survival times on each point of the two-dimensional time grid. The agreement between the survival times is reflected by the pattern of the local kappas on the two-dimensional time grid. This method can be used for investigating covariate effects on the agreement through modeling the local kappa coefficient.

In many studies, a single agreement index is needed to represent the overall strength of agreement between two survival times. Since discrete survival times can be viewed as ordinal outcomes, Cohen's weighted kappa (Cohen, 1968) seems to be a natural choice for our purpose. However, the presence of censoring makes it infeasible to estimate the weighted kappa with survival times. In this paper, we propose a modified weighted kappa coefficient that can incorporate censored observations. A major difference between the modified weighted kappa and Cohen's weighted kappa is that they are defined based on two types of cell probabilities in the contingency table of two ordinal outcomes. Cohen's weighted kappa is based on unconditional cell probabilities. The modified weighted kappa is based on redistributed cell probabilities which are the conditional probabilities for an event to occur in a particular cell given the observed survival times and censoring indicators. In calculating the redistributed cell probabilities, the mass of a censored observation is redistributed to the cells where the unobserved event may possibly occur. We also extend the method for multivariate cases by proposing a generalized modified weighted kappa for multivariate survival times. Estimates of the modified weighted kappa coefficient can be obtained through a nonparametric estimator of the bivariate survival function. Asymptotic properties such as strong consistency and asymptotic normality are established for the proposed estimator in the paper.

The paper is organized as follows. In Section 2, we present the methodology for the modified weighted kappa coefficient and its extension with multiple raters. In Section 3, we evaluate the performance of the proposed method through simulations. In Section 4, we present an example from a prostate cancer study. Finally, we conclude with a discussion.

Cohen (1960) proposed the kappa index as a measure of agreement for a binary test. It is interpreted as the proportion of observed agreement to its maximum possible value after chance agreement is removed from consideration. Cohen (1968) considered the extension of the original kappa to ordinal variables: measurements that incorporate natural orders such as the degree of severity of a disease that can be categorized as normal, mild, moderate and severe. For ordinal data, the disagreement in adjacent categories is less serious than the disagreement in more disparate categories. Hence Cohen (1968) proposed the weighted kappa to allow various degrees of disagreement to be differentially weighted in evaluating the overall agreement.

Let *Y*_{1} and *Y*_{2} denote ordinal ratings with *R* ordered categories by two raters. Without loss of generality, we assume *Y _{j}* = 1, …,

$$\kappa =\frac{{P}_{o}^{w}-{P}_{e}^{w}}{1-{P}_{e}^{w}},$$

(1)

where ${P}_{o}^{w}={\displaystyle {\sum}_{i=1}^{R}{\displaystyle {\sum}_{j=1}^{R}{w}_{\mathit{\text{ij}}}{p}_{\mathit{\text{ij}}}}}$ denotes weighted observed agreement probabilities and ${P}_{e}^{w}={\displaystyle {\sum}_{i=1}^{R}{\displaystyle {\sum}_{j=1}^{R}{w}_{\mathit{\text{ij}}}{p}_{i.}{p}_{.j}}}$ represents the expected agreement when raters are independent. Estimates are calculated by replacing the true probabilities *p _{ij}* in (1) with the sample proportions

Various weighted kappa coefficients can be obtained by choosing different weight functions. The two most common sets of weights are the quadratic weights with ${w}_{\mathit{\text{ij}}}=1-\frac{{(i-j)}^{2}}{{(R-1)}^{2}}$ and the Cicchetti weights (Cicchetti and Allison, 1971) with ${w}_{\mathit{\text{ij}}}=1-\frac{\left|i-j\right|}{\left|R-1\right|}$. When the quadratic weights are applied, the weighted kappa coefficient has been shown to be closely related to various agreement measures for continuous outcomes. Cohen (1968) proved the weighted kappa coefficient with quadratic weights is equivalent to the product-moment correlation coefficient under the assumption that the two marginal distributions are the same. Fleiss and Cohen (1973) showed the weighted kappa with quadratic weights is asymptotically equivalent to the intraclass correlation coefficient for ordinal ratings from a Gaussian general linear model. More recently, King and Chinchilli (2001) proved that the weighted kappa with quadratic weights is equivalent to Lin's concordance correlation coefficient (Lin, 1989), a popular measure of agreement for continuous measurements.

We now consider the assessment of agreement between survival outcomes. Let *T*_{1} and *T*_{2} denote two correlated discrete survival times and assume that the joint distribution of (*T*_{1}, *T*_{2}) is concentrated on an integral grid {(*l*_{1}, *l*_{2}), *l*_{1} = 1, …, *m*_{1}, *l*_{2} = 1, …, *m*_{2}}. The survival times (*T*_{1}, *T*_{2}) has the joint survival function of *S*(*l*_{1}, *l*_{2}) = Pr(*T*_{1} > *l*_{1}, *T*_{2} > *l*_{2}) and the density function of *p*(*l*_{1}, *l*_{2}) = Pr(*T*_{1} = *l*_{1}, *T*_{2} = *l*_{2}). Following Oakes (1989), the distribution of the censoring times (*C*_{1}, *C*_{2}) is assumed to be concentrated on the grid $\{({l}_{1}+\frac{1}{2},{l}_{2}+\frac{1}{2}),\phantom{\rule{thinmathspace}{0ex}}{l}_{1}=1,\dots \phantom{\rule{thinmathspace}{0ex}},{m}_{1},{l}_{2}=1,\dots \phantom{\rule{thinmathspace}{0ex}},{m}_{2}\phantom{\rule{thinmathspace}{0ex}}\}$ to avoid ties between censored and uncensored observations. The joint survival function for (*C*_{1}, *C*_{2}) is $G({l}_{1}+\frac{1}{2},{l}_{2}+\frac{1}{2})=\text{Pr}({C}_{1}>{l}_{1}+\frac{1}{2},{C}_{2}>{l}_{2}+\frac{1}{2})$ and the density function is $g({l}_{1}+\frac{1}{2},{l}_{2}+\frac{1}{2})=\text{Pr}({C}_{1}={l}_{1}+\frac{1}{2},{C}_{2}={l}_{2}+\frac{1}{2})$ It is assumed that the censoring times are independent of the survival times. The observed times and censoring indicators are * _{j}* =

We define the redistributed cell probability as

$${O}_{{l}_{1}{l}_{2}}=\text{Pr}({T}_{1}={l}_{1},{T}_{2}={l}_{2}|{\tilde{T}}_{1},{\tilde{T}}_{2},{\delta}_{1},{\delta}_{2}),\phantom{\rule{thinmathspace}{0ex}}{l}_{j}=1,\dots \phantom{\rule{thinmathspace}{0ex}},{m}_{j}.$$

*O*_{l1,l2} is the probability for the event to happen at (*l*_{1}, *l*_{2}) given a random observation (_{1}, _{2}, δ_{1}, δ_{2}). Depending on the censoring status of (δ_{1}, δ_{2}), *O*_{l1l2} can be written in terms of the joint survival function *S* in the following ways,

- if (δ
_{1}, δ_{2}) = (1, 1),$${O}_{{l}_{1},{l}_{2}}=\{\begin{array}{cc}1\hfill & \text{for}\phantom{\rule{thinmathspace}{0ex}}({l}_{1},{l}_{2})=({\tilde{T}}_{1},{\tilde{T}}_{2}),\hfill \\ 0\hfill & \text{otherwise}.\hfill \end{array}$$(2) - if (δ
_{1}, δ_{2}) = (0, 1),$${O}_{{l}_{1}{l}_{2}}=\{\begin{array}{cc}\frac{S({l}_{1}-1,{l}_{2}-1)-S({l}_{1}-1,{l}_{2})-S({l}_{1},{l}_{2}-1)+S({l}_{1},{l}_{2})}{S({\tilde{T}}_{1},{\tilde{T}}_{2}-1)-S({\tilde{T}}_{1},{\tilde{T}}_{2})}\hfill & \text{for}\phantom{\rule{thinmathspace}{0ex}}({l}_{1},{l}_{2})=({\tilde{T}}_{1}+{a}_{1},{\tilde{T}}_{2})\hfill \\ \hfill & \text{where}\phantom{\rule{thinmathspace}{0ex}}{a}_{1}=1,\dots ,{m}_{1}-{\tilde{T}}_{1},\hfill \\ 0\hfill & \text{otherwise}.\hfill \end{array}$$(3) - if (δ
_{1}, δ_{2}) = (1, 0),$${O}_{{l}_{1}{l}_{2}}=\{\begin{array}{cc}\frac{S({l}_{1}-1,{l}_{2}-1)-S({l}_{1}-1,{l}_{2})-S({l}_{1},{l}_{2}-1)+S({l}_{1},{l}_{2})}{S({\tilde{T}}_{1}-1,{\tilde{T}}_{2})-S({\tilde{T}}_{1},{\tilde{T}}_{2})}\hfill & \text{for}\phantom{\rule{thinmathspace}{0ex}}({l}_{1},{l}_{2})=({\tilde{T}}_{1},{\tilde{T}}_{2}+{a}_{2})\hfill \\ \hfill & \text{where}\phantom{\rule{thinmathspace}{0ex}}{a}_{2}=1,\dots ,{m}_{2}-{\tilde{T}}_{2},\hfill \\ 0\hfill & \text{otherwise}.\hfill \end{array}$$(4) - if (δ
_{1}, δ_{2}) = (0, 0),$${O}_{{l}_{1}{l}_{2}}=\{\begin{array}{cc}\frac{S({l}_{1}-1,{l}_{2}-1)-S({l}_{1}-1,{l}_{2})-S({l}_{1},{l}_{2}-1)+S({l}_{1},{l}_{2})}{S({\tilde{T}}_{1},{\tilde{T}}_{2})}\hfill & \text{for}\phantom{\rule{thinmathspace}{0ex}}({l}_{1},{l}_{2})=({\tilde{T}}_{1}+{a}_{1},{\tilde{T}}_{2}+{a}_{2})\hfill \\ \hfill & \text{where}\phantom{\rule{thinmathspace}{0ex}}{a}_{j}=1,\dots ,{m}_{j}-{\tilde{T}}_{j},\hfill \\ \hfill & j=1,2,\hfill \\ 0\hfill & \text{otherwise}.\hfill \end{array}$$(5)

Therefore, the probability *O*_{l1,l2} represents all the mass of an uncensored observation at (*l*_{1}, *l*_{2}) and redistributed mass at (*l*_{1}, *l*_{2}) from an earlier censored observation. In the next theorem, we show that the expectation of the redistributed cell probability *O*_{l1,l2} is equal to the unconditional cell probability *p*(*l*_{1}, *l*_{2}).

*E* (*O*_{l1,l2}) = *Pr*(*T*_{1} = *l*_{1}, *T*_{2} = *l*_{2}) = *p*(*l*_{1}, *l*_{2}), *for* *l _{j}* = 1, …,

See Web Appendix A.

We propose the following modified weighted kappa coefficient to measure the overall agreement between (*T*_{1}, *T*_{2}),

$${\kappa}^{w}=\frac{{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}\phantom{\rule{thinmathspace}{0ex}}{O}_{{l}_{1}{l}_{2}}-}}{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}{O}_{{l}_{1.}}{O}_{.{l}_{2}}}}}{1-{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}{O}_{{l}_{1.}}{O}_{.{l}_{2}}}}},$$

(6)

where ${O}_{{l}_{1.}}={\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{O}_{{l}_{1}{l}_{2}}\phantom{\rule{thinmathspace}{0ex}}\text{and}\phantom{\rule{thinmathspace}{0ex}}{O}_{.{l}_{2}}=}{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{O}_{{l}_{1}{l}_{2}}}$. For the weight function, we recommend using the quadratic weights because the quadratic weights lead to close connection between the weighted kappa coefficient and agreement measures for continuous outcomes such as Lin's concordance correlation coefficient (Lin, 1989). Other weight functions can be applied under the same framework.

Let (_{k1}, _{k2}, δ_{k1}, δ*k*_{2}), *k* = 1, …, *n* be *n* random observations. The sample estimator of the proposed modified weighted kappa is

$${\widehat{\kappa}}^{w}=\frac{{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}\phantom{\rule{thinmathspace}{0ex}}{\widehat{O}}_{{l}_{1}{l}_{2}}-}}{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}{\widehat{O}}_{{l}_{1}.}{\widehat{O}}_{.{l}_{2}}}}}{1-{\displaystyle {\sum}_{{l}_{1}=1}^{{m}_{1}}{\displaystyle {\sum}_{{l}_{2}=1}^{{m}_{2}}{w}_{{l}_{1},{l}_{2}}{\widehat{O}}_{{l}_{1}.}{\widehat{O}}_{.{l}_{2}}}}},$$

(7)

where ${\widehat{O}}_{{l}_{1}{l}_{2}}={\displaystyle {\sum}_{k=1}^{n}{\widehat{O}}_{{l}_{1}{l}_{2}}^{k}/n}$ is the sample estimator of the redistributed cell probability with ${\widehat{O}}_{{l}_{1}{l}_{2}}^{k}=\widehat{\text{Pr}}({T}_{1}={l}_{1},{T}_{2}={l}_{2}|{\tilde{T}}_{k1},{\tilde{T}}_{k2},{\delta}_{k1},{\delta}_{k2})$ Without censoring, * ^{w}* is equivalent to the sample estimator of Cohen's weighted kappa. In the presence of censoring, a nonparametric estimator of the bivariate survival function

*The estimator* ^{w} *has the following asymptotic statistical properties as n* → ∞:

^{w}*is strongly consistent. That is*, |^{w}− κ^{w}| → 0*with probability 1*.- $\sqrt{n}\left({\widehat{\kappa}}^{w}-{\kappa}^{w}\right)$
*converges weakly to a zero-mean normal distribution*. *Let*(_{k1},_{k2}, δ_{k1}, δ_{k2}),*k*= 1, …,*n**represent the observed data. A bootstrap estimator*κ^{#}*can be obtained based on a bootstrap sample that is drown randomly with replacement from*(_{k1},_{k2}, δ_{k1}, δ_{k2}),*k*= 1, …,*n*. Then $\sqrt{n}\left({\kappa}^{\#}-{\widehat{\kappa}}^{w}\right)$,*given the observed data, weakly converges to the same limit distribution as*$\sqrt{n}\left({\widehat{\kappa}}^{w}-{\kappa}^{w}\right)$*in probability*.

See Web Appendix B

The analytical expression for the variance of * ^{w}* is technically challenging since it involves the covariance of

In many studies, the time to events are measured by more than two raters or methods. We now extend the proposed modified weighted kappa to measure agreement among multivariate discrete survival times. Suppose the survival times of the same set of subjects are measured by *R* raters (or methods) as *T*_{1}, …, *T _{R}*. Let {(

Define the multivariate redistributed cell probabilities as

$${O}_{{l}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{l}_{R}}=\text{Pr}({T}_{1}={l}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{T}_{R}={l}_{R}|{\tilde{T}}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\tilde{T}}_{R},{\delta}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\delta}_{R}),\phantom{\rule{thinmathspace}{0ex}}{l}_{j}=1,\dots ,{m}_{j},$$

To measure the agreement among (*T*_{1}, …, *T _{R}*), we propose the generalized modified weighted kappa,

$${\kappa}_{g}^{w}=\frac{{P}_{o}-{P}_{e}}{1-{P}_{e}}.$$

The observed agreement *P _{o}* and expected agreement

$${P}_{o}={\displaystyle \sum _{{l}_{1}=1}^{{m}_{1}}\cdots}{\displaystyle \sum _{{l}_{R}=1}^{{m}_{R}}\frac{{\displaystyle {\sum}_{\underset{{i}_{1},{i}_{2}\in (1,\dots ,R)}{{i}_{1}<{i}_{2}}}w({l}_{{i}_{1}},{l}_{{i}_{2}})}}{\left(\begin{array}{c}R\hfill \\ 2\hfill \end{array}\right)}}\phantom{\rule{thinmathspace}{0ex}}{O}_{{l}_{1},\dots ,{l}_{R}}$$

(8)

$${P}_{e}=\frac{1}{\left(\begin{array}{c}R\hfill \\ 2\hfill \end{array}\right)}{\displaystyle \sum _{\underset{{i}_{1},{i}_{2}\in (1,\dots ,R)}{{i}_{1}<{i}_{2}}}{\displaystyle \sum _{{l}_{{i}_{1}}=1}^{{m}_{{i}_{1}}}{\displaystyle \sum _{{l}_{{i}_{2}}=1}^{{m}_{{i}_{2}}}w({l}_{{i}_{1}},{l}_{{i}_{2}})\phantom{\rule{thinmathspace}{0ex}}{O}_{{l}_{{i}_{\dot{1}}}}{O}_{{l}_{{i}_{\dot{2}}}}},}}$$

(9)

where

$$\begin{array}{cc}{O}_{{l}_{{i}_{\dot{1}}}}\hfill & =\text{Pr}({T}_{{i}_{1}}={l}_{{i}_{1}}|{\tilde{T}}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\tilde{T}}_{R},{\delta}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\delta}_{2}),\hfill \\ {O}_{{l}_{{i}_{\dot{2}}}}\hfill & =\text{Pr}({T}_{{i}_{2}}={l}_{{i}_{2}}|{\tilde{T}}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\tilde{T}}_{R},{\delta}_{1},\dots \phantom{\rule{thinmathspace}{0ex}},{\delta}_{2}).\hfill \end{array}$$

Note that when calculating the observed overall agreement *P _{o}*, the weight assigned to

The generalized modified weighted kappa ${\kappa}_{g}^{w}$ is closely related to the pairwise modified weighted kappa κ* ^{w}* defined in (6). When

$${\kappa}_{g}^{w}=\frac{{\displaystyle \sum _{{i}_{1}=1}^{R-1}{\displaystyle \sum _{{i}_{2}={i}_{1}+1}^{R}{P}_{{e}_{{i}_{1}{i}_{2}}}{\kappa}_{{i}_{1}{i}_{2}}^{w}}}}{{\displaystyle \sum _{{i}_{1}=1}^{R-1}{\displaystyle \sum _{{i}_{2}={i}_{1}+1}^{R}{P}_{{e}_{{i}_{1}{i}_{2}}}}}},$$

where ${P}_{e{i}_{1}{i}_{2}}={\displaystyle {\sum}_{{l}_{{i}_{1}=1}}^{{m}_{{i}_{1}}}{\displaystyle {\sum}_{{l}_{{i}_{2}}=1}^{{m}_{{i}_{2}}}w({l}_{{i}_{1}},{l}_{{i}_{2}}){O}_{{l}_{{i}_{\dot{1}}}}{O}_{{l}_{{i}_{\dot{2}}}}}}$ is agreement between raters *i*_{1} and *i*_{2} that is expected by chance and ${\kappa}_{{i}_{1}{i}_{2}}^{w}$ is the pairwise modified kappa for raters *i*_{1} and *i*_{2}. Therefore, the generalized kappa is a weighted average of all the pairwise kappas among the multiple raters. Heavier weights are given to pairwise kappas between raters that have higher expected-by-chance agreement.

The redistributed cell probabilities *O*_{l1, …, lR} and *O _{lj˙}* can be written in terms of the joint survival function

There are two types of mechanisms through which discrete survival times arise in practice(Kalbfleisch and Prentice, 2002). Often, discrete survival data arise when underlying continuous survival time is subject to interval censoring. In other instance, time may be truly discrete, as, for example, time to pregnancy measured as the number of menstrual cycles before a woman conceives.

The performance of the proposed estimator * ^{w}* was evaluated using discrete bivariate survival times generated from both mechanisms. For grouped times, we first generated continuous times from the Clayton model (Clayton, 1978). We assumed that

$$S({t}_{1},{t}_{2})={({e}^{{t}_{1}/\theta}+{e}^{{t}_{2}/\theta}-1)}^{-\theta},$$

(10)

with a constant odds ratio of $1+\frac{1}{\theta}$, where θ → ∞ indicates independence and θ → 0 indicates maximal positive dependence. The continuous random survival times ${T}_{j}^{*},j=1,2$ were then grouped into 5 intervals: [*a*_{1}, *a*_{2}), [*a*_{2}, *a*_{3}), [*a*_{3}, *a*_{4}), [*a*_{4}, *a*_{5}), [*a*_{5}, ∞) and the discrete survival times were defined as

$$\begin{array}{cc}{T}_{j}=\hfill & 1I\{{T}_{j}^{*}\in [{a}_{1},{a}_{2})\}+2I\{{T}_{j}^{*}\in [{a}_{2},{a}_{3})\}+3I\{\phantom{\rule{thinmathspace}{0ex}}({T}_{j}^{*}\in [{a}_{3},{a}_{4})\}+\hfill \\ \hfill & 4I\{{T}_{j}^{*}\in [{a}_{4},{a}_{5})\}+5I\{{T}_{j}^{*}\in [{a}_{5},\infty )\},\text{\hspace{1em}}j=1,2.\hfill \end{array}$$

Here *I* is an indicator variable. The cut points *a _{k}* (

We considered three sets of simulations by specifying three different θ parameters (θ = 0.95, 0.5 and 0.25) in the Clayton model. The true weighted kappa coefficients of the three Clayton models were 0.472, 0.651 and 0.804, representing moderate, substantial and almost perfect agreement respectively (Landis and Koch, 1977). For each set of simulation, we generated 50, 100 or 200 pairs of discrete survival times. The censoring variables (*C*_{1}, *C*_{2}) were independent from the survival times and from each other. Each censoring variable was assumed to follow a multinomial distribution that took values in $(1\frac{1}{2},2\frac{1}{2},3\frac{1}{2},4\frac{1}{2},5\frac{1}{2})$ with probability *p* = (*p*_{1}, *p*_{2}, *p*_{3}, *p*_{4}, *p*_{5})′. Three scenarios were considered: (1) *p* = (0.05, 0.05, 0.05, 0.05, 0.80) which induced a censoring proportion of 10%; (2) *p* = (0.1, 0.15, 0.25, 0.2, 0.3) which induced a censoring proportion of 30%; and (3) *p* = (0.2, 0.3, 0.3, 0.17, 0.03) with a censoring proportion of 50%.

We also performed the simulation study using truly discrete survival data generated through the discrete Clayton model (Shih, 1989). The discrete survival times *T _{j}, j* = 1, 2 took values in 1, …, 5 with marginal probabilities of 0.15, 0.2, 0.3, 0.2 and 0.15, respectively. Three values for the constant local odds ratio were considered such that true weighted kappa coefficients of the three discrete Clayton models were 0.472, 0.651 and 0.804. Censoring variables were generated from multinomial distributions to induce censoring proportion of 10%, 30% and 50%.

The variance of the kappa estimator was evaluated using the bootstrap procedure. As a comparison to the proposed method, we considered the currently available alternate method which does not take into account of censored observations and estimates sample kappa based on empirical bivariate distribution calculated based on complete observations in the simulated data sets.

The simulation results based on grouped times and truly discrete times are presented Table 1 and and2,2, respectively. The results are based on 500 simulation runs. Sample means and sample standard deviations for * ^{w}* are presented, along with the mean of bootstrap standard error estimates based on 200 bootstrap samples. From Table 1, we can see the modified weighted kappa estimates show small bias under low and medium censoring percentages but are biased downwards with heavy censoring. However, The bias is reduced as the sample size increases. From Table 2, the estimator is biased downwards with the increase of censoring but the bias again decreases with larger sample size. In both tables, the average of the bootstrap standard error of the proposed estimator is very close to the standard deviation of the estimates. For grouped times (Table 1), the coverage probabilities of the proposed estimator are close to the nominal level with low and medium censoring but are lower with heavy censoring. For truly discrete times (Table 2), the coverage probabilities are close to the nominal level in all scenarios except when the data has a small sample size and heavy censoring. In both tables, our proposed kappa estimator has significantly smaller bias and higher coverage probabilities than the sample estimator of kappa based on complete observations.

Simulation results for the modified weighted kappa estimator with grouped survival times simulated from the Clayton model based on 500 simulation runs, with comparison to sample kappa estimator based on complete observations

Simulation results for the modified weighted kappa estimator with discrete survival times simulated from the Discrete Clayton model based on 500 simulation runs, with comparison to sample kappa estimator based on complete observations

Comparisons between simulation results from Table 1 and and22 show that the performance of the proposed kappa estimator is generally better for truly discrete survival times than for grouped times. For grouped times, the kappa estimator demonstrates a complicated pattern in small sample size, for example the estimator appears to be biased upwards for low censoring but downwards under heavy censoring. The explanation is that there are two sources of bias in estimating kappa coefficient based on grouped survival times. The first source of bias is due to censoring. For medium to heavy censoring, the proposed method underestimates kappa. The bias due to censoring is present for both grouped times and the truly discrete times. The second source of bias is present only with grouped times and is due to estimating the discrete bivariate distribution based on grouped data that are generated from continuous survival models. To demonstrate the bias due to grouping, we generated both grouped times and truly discrete survival times in the absence of censoring and evaluated the bias in the estimated discrete bivariate distribution and kappa estimator. Results in Table 3 show that the bias is negligible for truly discrete times but is noticeable for grouped survival times with small to moderate sample sizes (Table 3). More specifically, we found for grouped times the estimated discrete distribution tends to be over-estimated on the diagonal line, i.e. (*T*_{1}, *T*_{2}) *s.t.T*_{1} = *T*_{2}, and negatively biased on the off-diagonal line (similar trend is also observed when grouping continuous times from other bivariate survival functions such as Gumbel model). Consequently, the estimated kappa is positively biased for grouped data without censoring. The additional source of variability due to the estimation of a discrete distribution based on grouped data causes the decrease in the accuracy in estimating kappa based on grouped survival times. Furthermore, the two sources of bias have opposite directions resulting in the complicated pattern in the kappa estimator with grouped times in small sample sizes. In the presence of light censoring (10%), the positive bias due to grouping is dominant and hence the kappa estimate is biased upwards. When the censoring proportion increases, the negative bias due to censoring becomes more prominent and hence the kappa estimates demonstrate downwards bias. However, both censoring and grouping bias diminish with the increase of sample size. Our simulation studies with more sample size scenarios (Web Table 1) show that the proposed estimator demonstrates desirable performance for grouped survival times with larger sample sizes.

In this section, we evaluate the performance of the generalized modified weighted kappa coefficient through a simulation study of discrete trivariate survival times that are obtained by grouping continuous times. Continuous trivariate survival times were generated from the continuous multivariate Clayton model with unit exponential marginals,

$$S({t}_{1},{t}_{2},{t}_{3})={({e}^{{t}_{1}/\theta}+{e}^{{t}_{2}/\theta}+{e}^{{t}_{3}/\theta}-2)}^{-\theta},$$

(11)

The discrete survival times were then created by grouping continuous times into intervals. Three sets of simulations were considered using the same θ parameters as in the bivariate simulations. We consider sample sizes of 100, 200 and 350. Table 4 provides a summary of the statistics for the trivariate simulation study. The estimates for the generalized modified weighted kappa coefficient demonstrate similar performance as the estimates of the bivariate modified kappa.

Prostate cancer is the most common cancer in US men. Depending on the patients’ demographic and disease characteristics, various kinds of treatments are available. One major diffculty in treating and monitoring prostate cancer is the lack of a standard definition for disease freedom after treatments. There is a general consensus that posttreatment disease status is reflected in the prostate specific antigen (PSA) with high level PSA indicating relapse. However, there is no universal agreement regarding the exact pattern of the PSA level that defines disease recurrence. Various definitions had been proposed for different kinds of treatments and the disease-free survival rates based on these definitions are used as important guidaffce for physicians in treatment selection. Since disease-free survival rates depend heavily on the definition of disease freedom, potential discrepancies between definitions may lead to different conclusions regarding the treatments effectiveness. Therefore, it is important to assess the agreement between different definitions before comparing the disease-free survival rates derived from them.

Radical prostatectomy and irradiation are two commonly used treatments for prostate cancer (Critz et al., 1995). For radical prostatectomy, disease freedom is defined by reaching and maintaining an undetectable prostate specific antigen (PSA) nadir ranging from 0.2ng/ml to 0.5ng/ml (Critz et al., 1996). For irradiation, according to the American Society of Therapeutic Radiation Oncology(ASTRO) consensus criteria (1997), posttreatment disease freedom is represented by a non-rising PSA with a rising PSA defined as three consecutive PSA increases measured six months apart. For years, there is a controversy regarding the disease free rates between the two treatments. Some researchers claim irradiation cures fewer patients than radical prostatectomy while others argue the two treatments are equally effective (Critz et al., 1996). In order to establish the comparability of disease-free survival rates between the two treatments, researchers (Critz et al., 1996) are interested in studying the agreement between the two definitions of disease freedom. Additionally, potential difference in the strength of agreement among various covariate subgroups is also of interest.

In this study, 1305 men with prostate cancer received simultaneous irradiation by integrating iodine 125 prostate implant with a follow-up external beam radiation. The disease status of all subjects was evaluated every six months after the treatment of the external beam radiation. The survival time was defined as the time elapsed from the end of the irradiation to the recurrence of prostate cancer which was determined based on two different definitions. In specific, ${T}_{1}^{*}$ is the time when patients’ posttreatment PSA level exceeded the nadir of 0.2 ng/ml while ${T}_{2}^{*}$ is based on the ASTRO definition and represents the midpoint between the time when the lowest PSA was achieved after irradiation and the time when the first of the three consecutive rises in the PSA level occurred. The survival time was measured in terms of months. The two cancer recurrence times were subject to censoring due to end of the follow-up on a patient. During the study period, 156 subjects experienced prostate cancer recurrence based on both definitions and 64 had cancer recurrence based on one of the definitions, representing approximately 80% of censoring. The absolute difference between the observed times based on the two definitions ranges between 0 and 108 months with the mean of 1.9 months.

In the prostate cancer study, cancer recurrence times were collected in a discrete manner since the subjects were evaluated only every six months. Therefore, the proposed modified kappa could be used to evaluate agreement between cancer recurrence times measured by the two definitions. Due to the sparsity of events, we grouped survival times ${T}_{1}^{*}$ and ${T}_{2}^{*}$ into five intervals: no more than 30 months, 31–60 months, 61–90 months, 91–162 months and >162 months, and the resulting discrete survival times are denoted as *T*_{1} and *T*_{2}. Without loss of generality, *T _{j}* = 1, 2, …, 5 for

As an alternative approach, we also measured the agreement between the two definitions using Lin's concordance correlation coefficient (CCC) (Lin, 1998), treating the ungrouped data ${T}_{j}^{*}(j=1,2)$ as continuous survival times. A nonparametric estimator (Guo and Manatunga, 2007) was used to accommodate censored observations. The estimated CCC for the ungrouped data is 0.792 and is quite close to the estimated modified kappa of the grouped data. In both cases, we concluded there is strong agreement between cancer recurrence times based on the two definitions in the prostate cancer data.

In this paper, we propose an extension to Cohen's (1968) weighted kappa coefficient for measuring agreement between discrete survival times. To the best of our knowledge, there has been no previous work for adapting kappa coefficient to survival outcomes. To accommodate censored observations, we first estimate the joint survival function of two survival times and then redistribute the mass of a censored observation to those cells where the unobserved event may potentially happen. A key underlying assumption for the proposed method is that the censoring distribution is independent of the joint survival function. This assumption ensures the mass of censored observations can be appropriately redistributed based on the estimated survival function. Among various estimators of the survival function, we choose Prentice and Cai’s (1992) estimator since it is adequate for most practical purposes and is more effcient than many alternatives (Kalbfleisch and Prentice, 2002). Additionally, Prentice-Cai estimator can incorporate both univariate and bivariate censoring, which is an advantage over estimators that are only applicable to univariate censoring (Lin and Ying, 1993; Tsai and Crowley, 1998).

The proposed modified weighted kappa is developed for measuring agreement between discrete survival times. However, it is also useful for continuous outcomes in certain situations: for example, when the survival times are actually measured in a discrete way or when the events are too sparse in the original continuous time scale. In these cases, one can discretize the continuous times and apply the proposed modified weighted kappa to measure agreement between the grouped survival times. We recommend several guidelines for the discretization. When it is possible, we suggest grouping the survival times into practically meaningful intervals which are associated with clinical interpretations. For example, time to onset of diabetes may be grouped into juvenile, adult and elderly diabetes. The second guideline is that discretization shall capture the empirical distribution of observed events so that the distribution of discrete survival times demonstrates a reasonable representation of the underlying continuous survival distribution. For example, it is undesirable to group in a way that most of the observed events fall into a few of the intervals while the rest of the intervals are empty. In that case, the estimated modified weighted kappa is calculated based on only a few nonempty cells in the contingency table and hence may not correctly reflect agreement between the original times. Finally, we recommend grouping times after the last observed event into one or two intervals where the last interval is either one sided or ends at the maximum time point where the event of interest may possibly happen. In practice, different discretizations inevitably influence the magnitude of the estimated kappa because kappa is known to be dependent on the marginal distributions (Cook, 1998). The above guidelines can help the estimated kappa correctly reflect agreement between original survival times. Alternatively, one can use an agreement measure defined for continuous scales. One popular agreement index for continuous outcomes is the concordance correlation coefficient (CCC) (Lin, 1998). A nonparametric estimator for the CCC has been proposed for survival times that can accommodate censored observations (Guo and Manatunga, 2007).

The performance of the proposed estimates for the modified weighted kappa and the generalized modified weighted kappa are satisfactory with low to medium censoring. With heavy censoring, the estimates are more biased with relatively low coverage probabilities in small to medium sample sizes. With increased sample size, the performance of the estimator improves. It should be pointed out that estimation bias for a statistic constructed based on survival functions is unavoidable in the presence of heavy censoring due to the difficulty in estimating the right tail of the survival function. For example, Lin and Ying (1993) noted bias when estimating the correlation coefficient between two survival times in the presence of heavy censoring.

When there is poor agreement between two survival time measurements, we can evaluate the empirical distribution of the observed events in the contingency table to find out causes of disagreement. For example, if the proposed kappa is calculated based on discrete survival times that are grouped from continuous times. One may want to check whether current discretization is suitable and whether agreement improves when choosing more appropriate discretization by following the suggested guidelines we present above. Evaluation of the cause of disagreement can provide helpful information when investigators are interested in adjusting the two measurements to reduce discrepancy between them, especially when one measurement is consistently biased from the other.

As a reviewer pointed out, the proposed agreement method may be useful to the multiple time scale problem. In many survival studies, the time to event of interest may be measured using multiple plausible scales. Often, the first time scale is the original chronological time or age of a subject at failure. The second alternative time scale is based on time-varying covariates, such as usage or exposure measures, and is often viewed as the operational time. One classical example is that the lifetime of a car can be measured either by its age or by the milage driven. In multiple time scale problems, the first chronological time scale is readily defined while the definition of the operational time scale is usually not obvious. Existing methods (Farewell and Cox, 1979; Oakes, 1995; Kordonsky and Gertsbakh, 1997; Duchesne and Lawless, 2000) aim to select a time scale that “captures” most of the variation in the failure times given the time-dependent covariates. Duchesne and Lawless (2000) have proven that the time scale with the smallest squared coefficient of variation is the ideal time scale within certain survival distribution family. Compared to the existing methods, our proposed agreement method is useful for identifying an ideal time scale that is equivalent to the original time scale in measuring the event of interest. We propose the following scheme: First, fix the partition on the chronological time scales where there exists a natural or meaningful grid. Second, partition the range of the operational time and measure the agreement between the two discretized time scales using the proposed modified weighted kappa. Repeat the second step for different partitions and select the time scale that maximizes the modified weighted kappa. The ideal scale that provides the strongest agreement to the original time scale can then be used in interpreting the risk associated with the operational time scale with respect to the risk associated with the chronological time scale.

We thank the referees, Associate Editor and Editor for valuable comments resulting in substantial improvements to the manuscript. This work was supported by an NIH grant R01-MH079448-02.

**Supplementary Materials**

Web Appendices and Tables referenced in Sections 2, 3 and 4 are available under the Paper Information link at the Biometrics web site http://www.tibs.org/biometrics/.

- American Society for Therapeutic Radiology and Oncology Consensus Panel. Consensus statement: guidelines for PSA following radiation therapy. International Journal of Radiation Oncology, Biology, Physics. 1997;37:1035–1041. [PubMed]
- Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. American Journal of EEG Technology. 1971;11:101–109.
- Clayton DG. A model for association in bivariate life tables and its application in epidemiological studies of familial tendency in chronic disease incidence. Biometrika. 1978;65:141–151.
- Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37–46.
- Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70:213–220. [PubMed]
- Cook RJ. Kappa and its dependence on marginal rates. In: Armitage P, Colton T, editors. The Encyclopedia of Bio-statistics. New York: Wiley; 1998. pp. 2166–2168.
- Critz FA, Tarlton RS, Holladay DA. Prostate specific antigen -monitored combination radiotherapy for patients with prostate cancer. Cancer. 1995;75:2383–2391. [PubMed]
- Critz FA, Levinson K, Williams WH, Holladay DA. Prostate-Specific Antigen Nadir: the optimum level after irradiation for prostate cancer. Journal of Clinical Oncology. 1996;14:2893–2900. [PubMed]
- Duchesne T, Lawless J. Alternative time scales and failure time models. Lifetime Data Analysis. 2000;6:157–179. [PubMed]
- Farewell VT, Cox DR. A note on multiple time scales in life testing. Applied Statistics. 1979;28:73–75.
- Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin. 1971;76:378–382.
- Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement. 1973;33:613–619.
- Guo Y, Manatunga AK. Modeling the agreement of discrete bivariate survival times using kappa coefficient. Lifetime Data Analysis. 2005;11:309–332. [PubMed]
- Guo Y, Manatunga AK. Nonparametric estimation of the concordance correlation coefficient under univariate censoring. Biometrics. 2007;63:164–172. [PubMed]
- Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New Jersey: John Wiley & Sons; 2002.
- King TS, Chinchilli VM. A generalized concordance correlation coefficient for continuous and categorical data. Statistics in Medicine. 2001;20:2131–2147. [PubMed]
- Klar N, Lipsitz SR, Ibrahim J. An estimating equations approach for modelling kappa. Biometrical Journal. 2000;42:45–58.
- Korten AE, Jorm AF, Henderson AS, McCusker E, Creasey H. Control-informant agreement on exposure history in case-control studies of Alzheimer's disease. International Journal of Epidemiology. 1992;21:1121–1131. [PubMed]
- Kraemer HC. Extension of the kappa coefficient. Biometrics. 1980;36:207–216. [PubMed]
- Kordonsky KB, Gertsbakh I. Multiple time scales and the lifetime coefficient of variation: Engineering applications. Lifetime Data Analysis. 1997;3:139–156. [PubMed]
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174. [PubMed]
- Lin DY, Ying Z. A simple nonparametric estimator of the bivariate survival function under univariate censoring. Biometrika. 1993;80:573–581.
- Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255–268. [PubMed]
- Maclure M, Willett WC. Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology. 1987;126:161–169. [PubMed]
- Oakes D. Bivariate survival models induced by frailties. Journal of the American Statistical Association. 1989;84:487–493.
- Oakes D. Multiple time scales in survival analysis. Lifetime Data Analysis. 1995;1:7–18. [PubMed]
- Prentice RL, Cai J. Covariance and survival function estimation using censored multivariate failure time data. Biometrika. 1992;79:495–512.
- Shih JH. Modeling multivariate discrete failure time data. Biometrics. 1998;54:1115–1128. [PubMed]
- Tsai WY, Crowley J. A note on nonparametric estimators of the bivariate survival function under univariate censoring. Biometrika. 1998;85:573–580.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer-Verlag; 1996.
- Williamson JM, Manatunga A, Lipsitz SR. Modelling kappa for measuring dependent categorical agreement data. Biostatistics. 2000;1:191–202. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |