Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2754119

Formats

Article sections

- Summary
- 1. Introduction
- 2. Methods
- 3. Design of Simulation Study
- 4. Example
- 5. Discussion
- Supplementary Material
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2009 September 29.

Published in final edited form as:

Published online 2008 January 11. doi: 10.1111/j.1541-0420.2007.00975.x

PMCID: PMC2754119

NIHMSID: NIHMS112702

Division of Biostatistics, Medical College of Wisconsin, 8701 Watertown Plank Road, Milwaukee, Wisconsin 53226, U.S.A.

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

In some clinical studies comparing treatments in terms of their survival curves, researchers may anticipate that the survival curves will cross at some point, leading to interest in a long-term survival comparison. However, simple comparison of the survival curves at a fixed point may be inefficient, and use of a weighted log-rank test may be overly sensitive to early differences in survival. We formulate the problem as one of testing for differences in survival curves after a prespecified time point, and propose a variety of techniques for testing this hypothesis. We study these methods using simulation and illustrate them on a study comparing survival for autologous and allogeneic bone marrow transplants.

When comparing treatments in terms of their time-to-event distribution, there may be reason to believe that the survival curves will cross, and standard comparison techniques in such cases could lead to misleading results. Often researchers in such cases will focus on which treatment has a better long-term survival probability. In particular, this research is motivated by a common scenario in hematopoietic stem cell transplantation, illustrated using a study comparing autologous and allogeneic bone marrow transplants for follicular lymphoma (Van Besien et al., 2003). The sample contained 175 patients with an HLA-identical sibling allogeneic transplant and 596 patients with an unpurged autologous transplant. We are interested in comparing the disease-free survival (DFS) curves (i.e., the probability a patient is alive and disease free) between the two treatment arms. However, this comparison is complicated by the likely possibility that the hazard functions from these two treatments will cross at some point. Allogeneic transplants tend to have a higher mortality early due to the toxicity of the higher doses of chemotherapy used to ablate the immune system as well as graft-versus-host disease from the donor cells. However, the donor cells may provide a graft-versus-lymphoma effect resulting in less relapse of the primary disease in long-term survivors. In contrast, autologous transplants have lower early toxicity because patients do not experience graft-versus-host disease. However, these patients do not benefit from the protection against relapse from the graft-versus-lymphoma effect, so they tend to experience more relapses. These contrasting profiles are illustrated by the Kaplan–Meier curves for this dataset in Figure 1. The DFS of the allogeneic transplant arm drops quickly early, but then levels off, whereas the DFS of the autologous transplant arm decreases more slowly but does not plateau. The two curves appear to cross at about 1 year. The unweighted log-rank test will have poor power to detect such a difference in survival curves.

Generally, the question of interest is, which if any of the treatments yields better long-term survival? Several strategies for addressing this question are possible. One may pick a single long-term time point, and compare the survival estimates between the treatment groups at this single time point, as is discussed in Klein et al. (2007). However, there are some potential problems with this. First the results may be sensitive to the time point chosen. Second, this strategy ignores events occurring after the selected time point. For example, in a clinical trial comparing two treatments one might select 3-year survival as the primary endpoint. However, if everyone is followed up for 3 years and accrual occurs over a period of time such as 2 years, then there is substantial information on later events (between 3 and 5 years) for patients enrolled early in the trial. Therefore, selecting a single time point may be inefficient.

Another alternative would be to estimate simultaneous confidence bounds for the difference in survival curves (Parzen, Wei, and Ying, 1997; Zhang and Klein, 2001), which identify time regions where the two treatments are different. However, because of the large number of time points being considered and adjusted for, these tend to be quite wide and may be inefficient in determining late differences between treatments.

Another option would be the weighted log-rank test, with more weight placed on later time points to reflect interest in late events. For example, Fleming and Harrington (1981) proposed a class of weighted log-rank tests with a weight function equal to *Ŝ*(*t*)^{ρ}(1 − *Ŝ*(*t*))^{γ}. Here setting *ρ* = 0 and *γ* = 1 would place more weight on late events and hence late differences in the hazard rates and/or the survival curves. However, even though the weight is placed appropriately, this test is still designed to test the null hypothesis that the entire survival curves are equal. If we are focused instead on late differences rather than the entire survival curve, even the weighted log-rank test may be overly sensitive to early differences in the survival curves. We will illustrate this point in simulations presented later in the article.

We propose a specific formulation of the hypothesis to focus on late differences in the survival curve. We assume that atime point *t*_{0} can be prespecified, so that survival curves are presumed likely to cross prior to that time point if at all. The null hypothesis is H_{0} : *S*_{1}(*t*) = *S*_{0}(*t*), for all *t* ≥ *t*_{0}, where *S*_{1}(*t*) and *S*_{0}(*t*) denote the survival curves at time *t* for the treatment and control groups, respectively, versus the alternative, *H*_{1} : *S*_{1}(*t*) ≠ *S*_{0}(*t*), for some *t* ≥ *t*_{0}. This formulation allows us to specify exactly over what time range the comparison of treatments is of interest, for example, after *t*_{0}.

Note that this null hypothesis is equivalent to H_{0} : {*S*_{1}(*t*_{0}) = *S*_{0}(*t*_{0})} {*λ*_{1}(*t*) = *λ*_{0}(*t*), *t* > *t*_{0}}, where *λ _{k}*(

In the next section, we describe possible methods for testing this null hypothesis.

The data consist of *n*_{1} + *n*_{0} = *n* subjects with event times *t _{j}*. Let the distinct event times be ordered such that

The Kaplan–Meier estimate of survival in group *k* is given by

$${\widehat{S}}_{k}\left(t\right)=\underset{{t}_{j}<t}{\Pi}\left(1-\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}}\right).$$

The variance of the Kaplan–Meier estimate is estimated by Greenwood's formula given by

$$\widehat{\mathrm{var}}\left\{{\widehat{S}}_{k}\left(t\right)\right\}={\widehat{S}}_{k}{\left(t\right)}^{2}{\widehat{\sigma}}_{k}{\left(t\right)}^{2},$$

where

$${\widehat{\sigma}}_{k}{\left(t\right)}^{2}=\underset{{t}_{j}\le t}{\Sigma}\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}({Y}_{\mathit{kj}}-{d}_{\mathit{kj}})}.$$

The Nelson–Aalen estimate of the cumulative hazard function is

$${\widehat{\Lambda}}_{k}\left(t\right)=\underset{{t}_{j}\le t}{\Sigma}\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}},$$

with variance estimated by

$$\widehat{\mathrm{var}}\left\{{\widehat{\Lambda}}_{k}\left(t\right)\right\}=\underset{{t}_{j}\le t}{\Sigma}\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}^{2}}.$$

The simplest method for testing the null hypothesis that the survival curves after time *t*_{0} are equal would be to compare the survival curves at a selected point *t*′ > *t*_{0}, using the difference in Kaplan–Meier estimates of survival at *t*′. One can also construct a test statistic based on transformations of the survival probabilities at a fixed point in time, as described in Klein et al. (2007). Their recommendations were that the complementary log–log transformation of the survival probability works the best overall, resulting in the test statistic

$${Z}_{\mathrm{CLL}}\left({t}^{\prime}\right)=\frac{\mathrm{log}[-\mathrm{log}\left\{{\widehat{S}}_{1}\left({t}^{\prime}\right)\right\}]-\mathrm{log}[-\mathrm{log}\left\{{\widehat{S}}_{0}\left({t}^{\prime}\right)\right\}]}{\sqrt{{\widehat{\sigma}}_{1}^{2}\left({t}^{\prime}\right)\u2215{\left[\mathrm{log}\left\{{\widehat{S}}_{1}\left({t}^{\prime}\right)\right\}\right]}^{2}+{\widehat{\sigma}}_{0}^{2}\left({t}^{\prime}\right)\u2215{\left[\mathrm{log}\left\{{\widehat{S}}_{0}\left({t}^{\prime}\right)\right\}\right]}^{2}}}.$$

(1)

Alternatively, one could compare the cumulative hazard functions at a selected time point *t*′ > *t*_{0}, using the Nelson–Aalen estimates at *t*′, _{k}(*t*′). Tests based on the cumulative hazard function should behave similarly to those using a log transformation of the survival function.

One way to compare the entire survival curve after *t*_{0} is to consider a modification of the weighted Kaplan–Meier statistic (Pepe and Fleming, 1989, 1991), where the integral is taken over the restricted range after *t*_{0}. The statistic is given by

$${W}_{\mathrm{WKM}}\left({t}_{0}\right)={\int}_{{t}_{0}}^{{t}_{m}}\widehat{w}\left(t\right)\{{\widehat{S}}_{1}\left(t\right)-{\widehat{S}}_{0}\left(t\right)\}\phantom{\rule{thinmathspace}{0ex}}\mathit{dt},$$

where *ŵ*(*t*) = {*n*_{1}*Ĝ*_{1}(*t*) + *n*_{0}*Ĝ*_{0}(*t*)}^{−1}*nĜ*_{1}(*t*)*Ĝ*_{0}(*t*) and *Ĝ _{k}*(

$${\widehat{\mathrm{var}}}_{\mathrm{WKM}}\left({t}_{0}\right)=\underset{k=0}{\overset{1}{\Sigma}}\left\{{A}_{k0}^{2}\underset{j=1}{\overset{\ell -1}{\Sigma}}\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}^{2}}+\underset{j=\ell}{\overset{m-1}{\Sigma}}{A}_{\mathit{kj}}^{2}\frac{{d}_{\mathit{kj}}}{{Y}_{\mathit{kj}}^{2}}\right\},$$

where ${A}_{k0}={\int}_{{t}_{0}}^{{t}_{m}}\widehat{w}\left(t\right){\widehat{S}}_{k}\left(t\right)\mathit{dt}$ and ${A}_{\mathit{kj}}={\int}_{{t}_{j}}^{{t}_{m}}\widehat{w}\left(t\right){\widehat{S}}_{k}\left(t\right)\mathit{dt}$. A sketch of the derivation of this variance expression is given in Web Appendix A. Then the standardized weighted K–M statistic follows a standard normal distribution under the null hypothesis and is given by

$${Z}_{\mathrm{WKM}}\left({t}_{0}\right)=\frac{{W}_{\mathrm{WKM}}\left({t}_{0}\right)}{\sqrt{{\widehat{\mathrm{var}}}_{\mathrm{WKM}}\left({t}_{0}\right)}}.$$

(2)

Another test is based on a pseudo-value regression technique proposed by Andersen, Klein, and Rosthoj (2003) and Klein and Andersen (2005). Originally applied in the context of regression models for multistate models and competing risks data, it can also be used in the simple survival comparison context. For a given time point *τ _{j}*, compute the pooled sample Kaplan–Meier estimator,

To perform inference on survival curves after a fixed time *t*_{0}, we use the pseudo-values defined for event times *t* > *t*_{0}. Let *τ*_{1} correspond to the earliest event time occurring after *t*_{0}, *τ*_{2} correspond to the next earliest event time after *t*_{0}, and so forth, so that there are a total of *m*′ such observed event times in the dataset. We consider a generalized linear model for the pseudo-values, given by *g*(*θ _{ij}*) =

Inference on *β* may be performed using generalized estimating equations (GEE; Liang and Zeger, 1986). Let *μ*(·) = *g*^{−1}(·) be the mean function. Define *dμ _{i}* (

$$U(\beta ,\alpha )=\underset{i}{\Sigma}d{\mu}_{i}{(\beta ,\alpha )}^{\prime}{V}_{i}^{-1}({\widehat{\theta}}_{i}-{\theta}_{i})=\underset{i}{\Sigma}{U}_{i}(\beta ,\alpha )=0.$$

Let (, ) be the solution to this equation. Using results of Liang and Zeger (1986), under standard regulatory conditions, it follows that $\sqrt{n}\{(\widehat{\beta},\widehat{\alpha})-(\beta ,\alpha )\}$ is asymptotically multivariate normal with mean 0. The covariance matrix of (, ) can be estimated by the sandwich estimator (, ) where

$$\widehat{\Sigma}(\beta ,\alpha )={\left\{I(\beta ,\alpha )\right\}}^{-1}\left\{\underset{i}{\Sigma}{U}_{i}(\beta ,\alpha ){U}_{i}{(\beta ,\alpha )}^{\prime}\right\}{\left\{I(\beta ,\alpha )\right\}}^{-1},$$

and

$$I(\beta ,\alpha )=\underset{i}{\Sigma}d{\mu}_{i}(\beta ,\alpha ){V}_{i}^{-1}d{\mu}_{i}{(\beta ,\alpha )}^{\prime}$$

is the model-based equivalent of the information matrix (Andersen et al., 2003).

When the number of time points or pseudo-values being included for each patient is large, this can present numerical difficulties in several aspects. Estimation of the parameters can be slow because there are a large number of parameters, and numerical algorithms must be used. Furthermore, calculation of the matrix requires the difficult inversion of a high-dimensional matrix *I*(, ). One option is to consider alimited number of points (say 5 or 10) spread out equally on an event scale over the time period after *t*_{0}. An alternative is to use the generalized score statistic for *β* (Rotnitzky and Jewell, 1990; Boos, 1992), as considered in Lu (2006) for the pseudo-value regression context. The generalized score statistic for *β* when there is asingle dichotomous predictor can be shown to have a closed form, assuming an independent working correlation matrix and using the complementary log-log link function. Let * _{j}* = log{−log(

$$\begin{array}{cc}\hfill {\chi}_{\mathrm{PSV}}^{2}\left({t}_{0}\right)& ={\left\{\widehat{\Sigma}{(0,\stackrel{~}{\alpha})}_{11}\right\}}^{-1}{\left[{\left\{{I}^{-1}(0,\stackrel{~}{\alpha})\right\}}_{11}U{(0,\stackrel{~}{\alpha})}_{1}\right]}^{2}\hfill \\ \hfill & =\frac{{n}^{2}{\left\{{\displaystyle \underset{j}{\Sigma}}{n}_{1}{q}_{j}({\stackrel{-}{\theta}}_{1j}-{\stackrel{-}{\theta}}_{\cdot j})\right\}}^{2}}{{\displaystyle \underset{j,{j}^{\prime}}{\Sigma}}{q}_{j}{q}_{{j}^{\prime}}\left[{\displaystyle \underset{i}{\Sigma}}\{{n}_{0}^{2}{Z}_{i}+{n}_{1}^{2}(1-{Z}_{i})\}({\widehat{\theta}}_{\mathit{ij}}-{\stackrel{-}{\theta}}_{\cdot j})({\widehat{\theta}}_{{\mathit{ij}}^{\prime}}-{\stackrel{-}{\theta}}_{\cdot {j}^{\prime}})\right]},\hfill \end{array}$$

(3)

where the matrix element (·)_{11} or vector element (·)_{1} refers to the *β* component. This statistic asymptotically follows a ${\chi}_{1}^{2}$ distribution under the null hypothesis that *β* = 0. Note, however, that the method can be biased when the censoring distribution depends on covariates.

The pseudo-value regression technique offers several potential improvements over the other methods studied in this article. First, it allows for straightforward inclusion of additional covariates in the generalized linear model. Although other methods discussed here also can be extended to include additional covariates, the generalized linear model framework makes this extension very straightforward. Another advantage is that the pseudo-value regression approach allows one to model the effect of treatment as a time-dependent predictor. Even though we are attempting to eliminate the effect of early differences in outcome by comparing the survival curves after *t*_{0}, there is still the possibility that the effect of treatment is still not constant or even consistent after time *t*_{0}. Using a single parameter to describe the treatment effect may not be sensitive to these kinds of differences in the late survival curves, whereas allowing for a time-dependent effect in the generalized linear model will have better power to capture such a treatment effect. However, both of these extensions make the analysis more complex, because the simplified form of the generalized score test no longer holds. It is likely that these extensions would require one to use alimited number of time points after *t*_{0}, rather than all event times as is done here. We do not consider these extensions further in this article.

Finally, we consider alternative test statistics based on the formulation of the overall hypothesis as an intersection of two sub-hypotheses, H_{0} = H_{01} ∩ H_{02}, given by H_{01} : *S*_{1}(*t*_{0}) = *S*_{0}(*t*_{0}), and H_{02} : *λ*_{1}(*t*) = *λ*_{0}(*t*), *t* > *t*_{0}. Hypothesis H_{01} can be tested using either a standardized difference in Kaplan–Meier estimates or alternatively a standardized difference in Nelson–Aalen estimates of the cumulative hazard function. Let *X*_{NA}(*t*_{0}) = _{1}(*t*_{0}) – _{0}(*t*_{0}), and let ${\widehat{\sigma}}_{\mathrm{NA}}^{2}\left({t}_{0}\right)=\widehat{\mathrm{var}}\left\{{\widehat{\Lambda}}_{1}\left({t}_{0}\right)\right\}+\widehat{\mathrm{var}}\left\{{\widehat{\Lambda}}_{0}\left({t}_{0}\right)\right\}$. Then the test statistic for H_{01} is *Z*_{NA}(*t*_{0}) = *X*_{NA}/_{NA}.

Hypothesis H_{02} can be tested using a log-rank test, starting from *t*_{0}, given by

$${X}_{\mathrm{LR}}\left({t}_{0}\right)=\underset{{t}_{j}>{t}_{0}}{\Sigma}\frac{{Y}_{1j}{Y}_{0j}}{{Y}_{j}}\left\{\frac{{d}_{1j}}{{Y}_{1j}}-\frac{{d}_{0j}}{{Y}_{0j}}\right\}.$$

The log-rank test has variance consistently estimated by

$${\widehat{\sigma}}_{\mathrm{LR}}^{2}\left({t}_{0}\right)=\underset{{t}_{j}>{t}_{0}}{\Sigma}\frac{{Y}_{1j}{Y}_{0j}}{{Y}_{j}^{2}}\left(\frac{{Y}_{j}-{d}_{j}}{{Y}_{j}-1}\right){d}_{j},$$

where *d _{j}* and

$${Z}_{\mathrm{LR}}\left({t}_{0}\right)=\frac{{X}_{\mathrm{LR}}\left({t}_{0}\right)}{{\widehat{\sigma}}_{\mathrm{LR}}\left({t}_{0}\right)},$$

(4)

which asymptotically follows a standard normal distribution under H_{0}. One also could consider a weighted log-rank test of H_{02}; however, in simulations it made little difference, probably because we are only testing hazard rates after *t*_{0}, so we do not consider it further.

Note that $\sqrt{n}({X}_{\mathrm{NA}}\left({t}_{0}\right),{X}_{\mathrm{LR}}\left({t}_{0}\right))$ asymptotically follows a bivariate normal distribution under H_{0} with mean (0, 0). The variance–covariance matrix of (*X*_{NA}(*t*_{0}), *X*_{LR}(*t*_{0})) can be estimated by $\widehat{\Sigma}=\text{Diag}\phantom{\rule{thinmathspace}{0ex}}\{{\widehat{\sigma}}_{\mathrm{NA}}^{2}\left({t}_{0}\right),{\widehat{\sigma}}_{\mathrm{LR}}^{2}\left({t}_{0}\right)\}$. The off-diagonal elements of the covariance matrix are 0 because the events contributing to each statistic are occurring over nonoverlapping times; the Nelson–Aalen estimator only considers events occurring up until time *t*_{0}, whereas the log-rank test only considers events starting after *t*_{0}. An equivalent result is that (*Z*_{NA}(*t*_{0}), *Z*_{LR}(*t*_{0})) follows a bivariate normal distribution with mean (0, 0) and variance–covariance matrix equal to the identity matrix. To test the composite null hypothesis H_{0}, we can consider a number of ways of combining *Z*_{NA}(*t*_{0}) and *Z*_{LR}(*t*_{0}), including linear combination tests, a quadratic form test, and a Union-Intersection (Roy, 1953) test. However, the Union-Intersection test is omitted for brevity because it did not perform as well as the quadratic test.

Linear combination tests consist of a linear combination of the test statistics for each component null hypothesis. These tests have the form

$$Z\left({t}_{0}\right)=\frac{{\mathit{aX}}_{\mathrm{NA}}\left({t}_{0}\right)+{\mathit{bX}}_{\mathrm{LR}}\left({t}_{0}\right)}{\sqrt{{a}^{2}{\widehat{\sigma}}_{\mathrm{NA}}^{2}\left({t}_{0}\right)+{b}^{2}{\widehat{\sigma}}_{\mathrm{LR}}^{2}\left({t}_{0}\right)}}.$$

Several sets of weights are possible; we present one which performed well in simulations. If we set $a={\left\{2{\widehat{\sigma}}_{\mathrm{NA}}^{2}\left({t}_{0}\right)\right\}}^{-1\u22152}$ and $b={\left\{2{\widehat{\sigma}}_{\mathrm{LR}}^{2}\left({t}_{0}\right)\right\}}^{-1\u22152}$, this yields,

$${Z}_{\mathrm{OLS}}\left({t}_{0}\right)=\frac{1}{\sqrt{2}}\{{Z}_{\mathrm{NA}}\left({t}_{0}\right)+{Z}_{\mathrm{LR}}\left({t}_{0}\right)\}.$$

(5)

This statistic is analogous to the ordinary least squares (OLS) test of O'Brien (1984) for the multiple endpoint testing problem.

Sposto, Stablein, and Carter-Campbell (1997) proposed a partially grouped log-rank test, which is another special case of the linear combination test. Instead of testing H_{01} with a difference in Nelson–Aalen estimates at *t*_{0}, they used the difference in Kaplan–Meier estimates. They estimate the variance of this difference using the pooled samples, yielding a final statistic

$${Z}_{\mathrm{SP},\mathrm{P}}\left({t}_{0}\right)=\frac{\left[{n}_{1}{n}_{0}{n}^{-1}\{{\widehat{S}}_{0}\left({t}_{0}\right)-{\widehat{S}}_{1}\left({t}_{0}\right)\}\right]+{X}_{\mathrm{LR}}\left({t}_{0}\right)}{\sqrt{{n}_{1}{n}_{0}\widehat{\mathrm{var}}\left\{{\widehat{S}}_{p}\left({t}_{0}\right)\right\}+{\widehat{\sigma}}_{\mathrm{LR}}^{2}\left({t}_{0}\right)}},$$

(6)

where *Ŝ _{p}*(

Next we consider quadratic forms of (*X*_{NA}(*t*_{0}), *X*_{LR}(*t*_{0})), which result in a *χ*^{2} test asymptotically. Here

$$\begin{array}{cc}\hfill {\chi}^{2}\left({t}_{0}\right)& ={({X}_{\mathrm{NA}}\left({t}_{0}\right),{X}_{\mathrm{LR}}\left({t}_{0}\right))}^{\prime}{\widehat{\Sigma}}^{-1}({X}_{\mathrm{NA}}\left({t}_{0}\right),{X}_{\mathrm{LR}}\left({t}_{0}\right))\hfill \\ \hfill & ={Z}_{\mathrm{NA}}^{2}\left({t}_{0}\right)+{Z}_{\mathrm{LR}}^{2}\left({t}_{0}\right).\hfill \end{array}$$

(7)

This follows a ${\chi}_{2}^{2}$ distribution under H_{0}.

A simulation study was designed to compare the performance of the different procedures in terms of their type I error rate and power to detect late differences in survival curves. We set it up based on comparing the survival curves between two equal-sized groups after *t*_{0} = 24 months, and we assume that all patients have a maximum of 72 months of follow-up. In addition to censoring at the fixed time of 72 months, we also overlay an exponential censoring pattern prior to 72 months, with rates selected to induce one of three censoring percentages by 24 months: (1) 0% censoring in each group at 24 months, (2) 15% censoring in each group by 24 months, and (3) 10% censoring in the control group and 20% censoring in the treatment group by 24 months. The overall censoring percentage is between 35 and 60% depending on the scenario, and the censoring time was generated independently of the event time.

For the type I error rate simulations, total sample sizes of *n* = 100, 200, and 400 with equal sample size per group are studied, across four null hypothesis scenarios. To generate these scenarios we assume piecewise exponential survival curves, where the survival curves differ by 0%, 5%, 10%, or 15% at 8 months (scenarios A–D, respectively), but are equal at 24 months and beyond. These curves are shown in Figure 2a.

Survival curves for treatment (S1) and control (S0) groups used in simulations. Curves for the null hypothesis simulations are shown in (a) for each of the four scenarios, and curves for the alternative hypothesis simulations are shown in (b)–(f) **...**

For the power simulations, total sample sizes of *n* = 400 are studied, across five alternative hypothesis scenarios. We generate these scenarios using Weibull survival curves. The curves are shown in Figure 2b–f for scenarios E–I, respectively. Scenario E refers to a proportional hazards situation with no crossing hazards or survival curves. Scenarios F and G refer to situations where the survival curves cross prior to *t*_{0} but to differing degrees. Scenario H is a situation where the survival curves cross exactly at *t*_{0}, and in Scenario I the survival curves cross after *t*_{0}. The percentage of patients still at risk at time *t*_{0}, which impacts the power, varies between 30 and 66% depending on the scenario, and can be roughly obtained for a given scenario by combining the survival probability from Figure 2 at *t*_{0} with the censoring percentage at *t*_{0} (0%, 15%, or 10%, 20%).

For comparison purposes, the log-rank test and the weighted log-rank test with Fleming–Harrington weights of *ρ* = 0, *γ* = 1 are also shown, even though they are not specifically suited for testing the null hypothesis of interest here. All simulations used 10,000 Monte Carlo samples.

To summarize the simulation results, we applied analysis of variance (ANOVA) techniques. For the type I error rate, we defined the outcome *Y* as the percent rejection rate minus the nominal rate of 5%, so that good performance is indicated by values of the expectation of *Y* near 0. For the power simulations, the outcome *Y* is defined as the percent rejection rate, so that good performance is defined by high values of the expectation of *Y*.

There were 11 test statistics considered in the simulations: three pointwise comparisons based on the complementary log–log transformation (1) evaluated at three different time points *t*′ after *t*_{0}, the statistics given in equations (2)-(7), and the log-rank and weighted log-rank tests starting at time 0. In addition, we included in the ANOVA model factors for scenario (A–D; 4 levels), total sample size *n* (100, 200, and 400; 3 levels), and censoring pattern (0%, 15%, or (10%,20%); 3 levels). We also considered models in which each of these factors interacted with the main test statistic effect to examine performance of these statistics for specific levels of the other factors (lower order terms included when interaction is present). Specifically, we fit the following models:

$$E\left({Y}_{\mathit{tsnc}}\right)={\mu}_{\mathit{ts}}+{\alpha}_{n}+{\beta}_{c},$$

(8)

$$E\left({Y}_{\mathit{tsnc}}\right)={\mu}_{\mathit{tn}}+{\beta}_{c}+{\gamma}_{s},$$

(9)

$$E\left({Y}_{\mathit{tsnc}}\right)={\mu}_{\mathit{tc}}+{\alpha}_{n}+{\gamma}_{s},$$

(10)

$$E\left({Y}_{\mathit{tsnc}}\right)={\mu}_{t}+{\alpha}_{n}+{\beta}_{c}+{\gamma}_{s}.$$

(11)

Here the subscripts *t*, *s*, *n*, and *c* refer to test (11 levels), scenario (4 levels), *n* (3 levels), and censoring pattern (3 levels), respectively, as described above. We fit the model without an intercept and normalized the effects of the other factors to have a sum of zero because then the estimates for the interaction terms have the interpretation as average deviations from the nominal level of 5% adjusted for the effects of the other factors. The results are shown for each test in Table 1 by scenario (model (8)), by total sample size *n* (model (9)), by censoring pattern (model (10)), and overall (model (11)).

We can see that most of the methods control the type I error rate accurately even for a total sample size of *n* = 100, whereas the *Z*_{OLS}(*t*_{0}) statistic has the most accurate control. Entries in the table where the type I error rate is more than 2 SEs from the nominal level are marked in bold. The weighted Kaplan–Meier statistic, *Z*_{WKM}(*t*_{0}), has an inflated type I error by about 1% for *n* = 100, although this inflation of the error rate dissipates with larger sample size. The *χ*^{2}(*t*_{0}) statistic is somewhat conservative for smaller sample sizes. Also, we point out that the log-rank and weighted log-rank tests starting at time 0 do not control the type I error rate for this hypothesis test for scenarios B–D, because of the early differences in the survival curves prior to *t*_{0} = 24. The inflation of the type I error rate worsens with larger early differences in the survival curves (scenario D) and with larger sample sizes, because the test has higher power to detect these early differences. However, the log-rank and weighted log-rank procedures are not designed to test the null hypothesis H_{0} of equal survival curves after *t*_{0}, so this result is not unexpected.

For the power results, we only considered one sample size (total *n* = 400) and the same censoring patterns as in the type I error simulations. The power of the various procedures is expected to depend heavily on the scenario; therefore to summarize the power results, we fit an ANOVA model with an interaction between the test statistic and the scenario, adjusting for censoring pattern,

$$E\left({Y}_{\mathit{tsc}}\right)={\mu}_{\mathit{ts}}+{\beta}_{c}.$$

(12)

Here the subscripts *t*, *s*, and *c* refer to test (11 levels), scenario (5 levels), and censoring pattern (3 levels), respectively.

The results are shown for each test and scenario combination in Table 2. Several general patterns emerge from examining this table. The pointwise comparisons based on the complementary log–log transformation, *Z*_{CLL}(*t*′), are sensitive only to differences at that point *t*′ and do not compare the entire curves. The pointwise comparison at 72 months has the highest power among pointwise comparisons because the largest differences in survival curves are seen there. Although the pointwise comparison at 72 months does well in scenarios F and G, it suffers from loss of power compared to some of the other methods for the proportional hazards situation (E) and when the differences at 72 months are not as pronounced (H–I). In these cases, a comparison of the entire curves using one of the other techniques can have more power.

The weighted Kaplan–Meier comparison, *Z*_{WKM}(*t*_{0}), and the pseudo-value technique, ${\chi}_{\mathrm{PSV}}^{2}\left({t}_{0}\right)$, both aggregate differences in survival curves across the times after *t*_{0}, and as expected they perform well when those differences are in a consistent direction (scenarios E–G) and poorly when those differences are not in the same direction (scenario I) or when many of those early differences are very small (scenario H). Similarly, the linear combination tests (*Z*_{OLS}(*t*_{0}) and *Z*_{SP,P}(*t*_{0})) combine the test statistics before and after *t*_{0} in a linear fashion and would be expected to perform well when those statistics have the same sign (E–G) and poorly when they have the opposite sign (I) or one of them has a zero mean (H). Among these, the OLS test *Z*_{OLS}(*t*_{0}) has the best power in scenarios (F–I) and has only slightly less power for scenario (E), and would be recommended.

The *χ*^{2}(*t*_{0}) combination test performs well for scenarios (H–I), but may be less efficient when the statistics have the same sign (E–G). However, it appears that the magnitude of the relative power loss for scenarios (E–G) is modest, so this may be a good general method. Note that the log-rank test starting at *t*_{0}, *Z*_{LR}(*t*_{0}), has high power for scenarios (G–I); this component of the combination tests is driving the power much more than the other component, and this disparity in the component statistics is what produces good power for the *χ*^{2}(*t*_{0}) combination test, as opposed to a linear combination test.

It is also worth noting that although the pseudo-value test ${\chi}_{\mathrm{PSV}}^{2}\left({t}_{0}\right)$ had poor power for scenarios (H–I), it was implemented assuming a time-independent effect of treatment on survival. One advantage of these models is that one can test for whether there is a time-dependent effect and incorporate this into the models, thereby improving the sensitivity to the treatment effect captured in these scenarios. Further investigation of this is needed, however.

The power for the log-rank and weighted log-rank tests starting at time 0 is also shown for comparative purposes. As expected, the log-rank test has the highest power for the proportional hazards alternative (E), but performs poorly for the remaining crossing hazards alternatives (H–I). The weighted log-rank test has the highest power among all the tests for scenarios (F–H), but performs somewhat worse for E and I. Also, the weighted log-rank test can be overly sensitive to early differences in survival curves, and is not suitable for testing the null hypothesis H_{0} of equal survival curves after *t*_{0}.

Overall, the *χ*^{2}(*t*_{0}) test from equation (7) has the highest power in scenarios H and I, and power that is not substantially worse than the other methods for scenarios E–G, and could be recommended as a general omnibus technique against a variety of alternative hypothesis scenarios. However, note that in scenarios E–G there is a survival difference at *t*_{0} that is at least maintained or increased, implying a consistent and easily interpretable difference in long-term survival. On the other hand, in scenario I the survival curves come together and cross sometime after *t*_{0}. This makes it difficult to interpret which treatment is better, because it depends on when after *t*_{0} you look and how much follow-up there is in the study. The power advantage of the *χ*^{2}(*t*_{0}) test for scenario I may not be worth the power loss for the more relevant and interpretable scenarios E–G. The OLS linear combination test *Z*_{OLS}(*t*_{0}) from equation (5) has high power for scenarios E–G, and is a better choice than the *χ*^{2}(*t*_{0}) test for identifying consistent differences in long-term survival after *t*_{0}.

We now return to the motivating example comparing autologous and allogeneic bone marrow transplants for follicular lymphoma. We are interested in comparing the DFS curves (i.e., the probability that a patient is alive and disease free) between the two treatment arms, but in particular we are interested in comparing the DFS curves after 1 year, which should eliminate much of the anticipated differences in early mortality between autologous and allogeneic transplants. There is a modest amount (42%) of censoring present in the dataset.

The usual log-rank test gives *p* = 0.443, whereas a weighted log-rank test with Fleming–Harrington weights of *ρ* = 0, *γ* = 1 gives a *p*-value of <0.001. Each of the methods described above was applied using *t*_{0} = 12 months. The pointwise comparison using the complementary log–log transformation is shown at *t*′ = 36 months. The *p*-values are given in Table 3.

The table indicates that the pointwise comparison *Z*_{CLL}(36) does not find a significant difference in the survival estimates at 36 months. However, if you compare the entire survival curves after 12 months, the weighted Kaplan–Meier comparison, the pseudo-value approach, and the *Z*_{SP,P}(12) test all have nonsignificant *p*-values, whereas the *χ*^{2}(12) test and the linear combination test *Z*_{OLS}(12) have significant *p*-values.

In understanding the discrepancies in the results, we need to look at the shape of the survival curves and where they cross. The survival curves cross right around *t*_{0}, whereas the simple log-rank test of the hazard functions after *t*_{0}, *Z*_{LR}(12), is highly significant. This situation is most similar to simulation scenario H. The weighted Kaplan–Meier statistic, the pseudo-value approach, and the *Z*_{SP,P}(*t*_{0}) combination test all had low power in this scenario, whereas the *χ*^{2}(*t*_{0}) test and the *Z*_{OLS}(*t*_{0}) test had higher power. The latter two tests are more sensitive to the component test of the hazard functions after *t*_{0}. Using these latter tests, one would conclude that there is a significant difference in the survival curves after 12 months.

We have considered a number of methods for comparing two survival curves after a prespecified time point, *t*_{0}. This situation may be of interest when the survival curves are expected to cross, so that we are only interested in late differences. Our simulations indicate that a simple pointwise comparison of the curves at some time after *t*_{0} is sensitive to the time point chosen and may be inefficient in some settings because they ignore differences in the curves at other times after *t*_{0}. Use of the weighted log-rank test may be overly sensitive to early differences prior to *t*_{0}. Two of the methods studied stand out, both of which are based on a combination of a pointwise comparison of the survival curves at *t*_{0} and a log-rank test after *t*_{0}. A *χ*^{2}(*t*_{0}) combination test given in equation (7) performs well as an omnibus test against a wide variety of alternative hypothesis scenarios. But if the interest is on identifying consistent differences in long-term survival after *t*_{0}, a simple equally weighted linear combination test *Z*_{OLS}(*t*_{0}) given in equation (5) has better power for this alternative hypothesis and is recommended. A pseudo-value regression approach was also discussed, which can be easily extended to account for covariates. Also, the regression model framework can be used to test whether the treatment effect is consistent across time, thereby allowing flexibility for dealing with different alternative hypotheses such as survival curves crossing after the prespecified time *t*_{0}. Further research is needed on adapting this approach.

This research was partially supported by a grant (R01 CA54706-10) from the National Cancer Institute.

6. Supplementary Materials

Web Appendix A, referenced in Section 2.3, is available under the Paper Information link at the *Biometrics* website http://www.biometrics.tibs.org.

- Andersen PK, Klein JP, Rosthoj S. Generalized linear models for correlated pseudo-observations with applications to multi-state models. Biometrika. 2003;90:15–27.
- Boos D. On generalized score tests. American Statistician. 1992;46:327–333.
- Fleming TR, Harrington DP. A class of hypothesis tests for one and two samples of censored survival data. Communications in Statistics. 1981;10:763–794.
- Klein JP, Andersen PK. Regression modeling of competing risks data based on pseudo-values of the cumulative incidence function. Biometrics. 2005;61:223–229. [PubMed]
- Klein JP, Logan B, Harhoff M, Andersen PK. Analyzing survival curves at a fixed point in time. Statistics in Medicine. 2007;26:4505–4519. [PubMed]
- Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;78:13–22.
- Lu L. Explained variation in survival analysis and hypothesis testing for current leukemia-free survival. Medical College of Wisconsin; Milwaukee, Wisconsin: 2006. Ph.D. Dissertation.
- O'Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40:1079–1087. [PubMed]
- Parzen MI, Wei LJ, Ying Z. Simultaneous confidence intervals for the difference of two survival functions. Scandinavian Journal of Statistics. 1997;24:309–314.
- Pepe MS, Fleming TR. Weighted Kaplan-Meier statistics: A class of distance tests for censored survival data. Biometrics. 1989;45:497–507. [PubMed]
- Pepe MS, Fleming TR. Weighted Kaplan-Meier statistics: Large sample and optimality considerations. Journal of the Royal Statistical Society, Series B. 1991;53:341–352.
- Rotnitzky A, Jewell NP. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika. 1990;77:485–497.
- Roy SN. On aheuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics. 1953;24:220–238.
- Sposto R, Stablein D, Carter-Campbell S. A partially grouped logrank test. Statistics in Medicine. 1997;16:695–704. [PubMed]
- Van Besien K, Loberiza F, Bajorunaite R, et al. Comparison of autologous and allogeneic hematopoietic stem cell transplantation for follicular lymphoma. Blood. 2003;102:3521–3529. [PubMed]
- Zhang M-J, Klein JP. Confidence bands for the difference of two survival curves under proportional hazards model. Lifetime Data Analysis. 2001;7:243–254. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |