Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3172162

Formats

Article sections

- Summary
- 1 Introduction
- 2 The additive risks model for effect of haplo-match
- 3 Large sample properties
- 4 Simulations
- 5 Haplotype Match for BMT patients
- 6 Discussion
- References

Authors

Related links

Scand Stat Theory Appl. Author manuscript; available in PMC 2012 September 1.

Published in final edited form as:

Scand Stat Theory Appl. 2011 September; 38(3): 409–423.

doi: 10.1111/j.1467-9469.2010.00720.xPMCID: PMC3172162

NIHMSID: NIHMS244415

Thomas H. Scheike

Department of Biostatistics, University of Copenhagen

Torben Martinussen

Department of Biostatistics, University of Southern Denmark

Division of Biostatistics, Medical College of Wisconsin

Thomas H. Scheike, Department of Biostatistics, University of Copenhagen, Øster Farimagsgade 5B, P.O.B. 2099, DK-1014 Copenhagen K, Denmark ; Email: kd.uk.tatsoib@st

See other articles in PMC that cite the published article.

In this paper we consider a problem from bone marrow transplant (BMT) studies where there is interest on assessing the effect of haplotype match for donor and patient on the overall survival. The BMT study we consider is based on donors and patients that are genotype matched, and this therefore leads to a missing data problem. We show how Aalen's additive risk model can be applied in this setting with the benefit that the time-varying haplo-match effect can be easily studied. This problem has not been considered before, and the standard approach where one would use the EM-algorithm cannot be applied for this model because the likelihood is hard to evaluate without additional assumptions. We suggest an approach based on multivariate estimating equations that are solved using a recursive structure. This approach leads to an estimator where the large sample properties can be developed using product-integration theory. Small sample properties are investigated using simulations in a setting that mimics the motivating haplo-match problem.

Hematopoietic cell transplantation (HCT) is a life saving procedure for many cancer patients, but it also has side effects, especially for transplants from unrelated donors, for example when the donor is not from the patient's family. Graft-versus-host disease (GVHD) complications is one of the major causes for treatment related death. To reduce GVHD, human leukocyte antigen (HLA) fully matched graft needs to be selected. When a patient does not have an HLA-identical sibling donor, then an HLA fully matched unrelated transplant often will be considered. For an HLA fully matched unrelated transplant, the donor is selected such that the patient and donor have identical HLA genotypes, but they may not necessarily have matching haplotype pairs. For example, suppose the donor and patient have identical HLA-A/B/DRB1 alleles of {(A1, A3), (B8, B7), (DR3, DR2)}. Now, there are four potential haplotypes that are consistent with this genotype, namely {(A1, B8, DR3), (A3, B7, DR2)}, {(A3, B8, DR3), (A1, B7, DR2)}, {(A1, B8, DR2), (A3, B7, DR3)}, and {(A3, B8, DR2), (A1, B7, DR3)}. Here, a donor with {(A1, B8, DR3), (A3, B7, DR2)} and a patient with {(A1, B8, DR3), (A3, B7, DR2)} are haplotype matched, but a donor with {(A3, B8, DR3), (A1, B7, DR2)} and a patient with {(A1, B8, DR3), (A3, B7, DR2)} are haplotype miss-matched, see Petersdorf et al. (2007) for details. Among HLA-genotype matched transplants with unrelated donors it has not been fully investigated whether there is an effect of haplotype matching. This is an important question since if a positive beneficial haplotype matching effect can be identified, then patient and physician need to search for a haplotype matched donor among those available HLA-genotype fully matched potential donors. Among a patient and donor with identical HLA genotype, the typical probability of a HLA haplotype match is around 70% (Petersdorf et al., 2007).

Recently, Petersdorf et al. (2007) studied the effect of major histocompatibility complex (MHC) haplotype match of donor and patient after HLA-identical unrelated HCT for leukemia. Their analysis was based on observing the haplotype pairs directly and therefore was carried out as a Cox regression analysis. Identifying the haplotype pairs is a very costly, time-consuming and not entirely unproblematic procedure. In this study, we show how modeling of the missing haplotypes for donor and patient can also be used to address the issue using only the observed genotype and failure time data. For illustration purpose, we consider a similar cohort of patients with HLA-identical unrelated HCT for acute leukemia. The data was selected from the Center for International Blood and Marrow Transplant Research (CIBMTR). We found a strong time-varying effect for the haplotype matching, and therefore the use of Cox's proportional hazards model is not appropriate to assess the effect of haplo-type match for this transplant data. We instead consider Aalen's additive intensity model (Aalen, 1989). This model allows the haplotype matching effect to be varying over time and the time-dependent excess risk can be estimated directly. This excess risk translates directly into a simple relative survival percentage. One particular advantage of this approach is that the absolute size of the effect is directly accessible and that the model does not rely on any particular structure for the effect.

Missing data for survival analysis has been considered in detail in many different settings for Cox's regression model. For example with missing partial covariates for some patients and in frailty models for unobserved cluster effect. Estimating haplotype effects using Cox's regression model has been considered in a large number of papers (Schaid et al., 2002; Zaykin et al., 2002; Schaid, 2004; Zhao et al., 2003; Tregouet & Tiret, 2004; Lin & Zeng, 2006; Epstein & Satten, 2003; Chen et al., 2004; Scheike et al., 2010). One of the attractions for fitting a Cox's regression model is that the EM-algorithm can be applied in this setting. This makes it easy to deal with the missing data. Alternative estimating equations have been considered in Zucker (2005); Gorfine et al. (2005); Scheike (2006); Scheike et al. (2010).

There has been no related statistical work on how to fit an additive intensity model in this setting where no simple likelihood is available. In this paper, we suggest a new approach that fits the additive risk model without smoothing or additional assumptions. The proposed method is simple to implement and use. The basic idea is to solve the estimating equations recursively to deal with the structure of the model appropriately. It turns out that the large sample properties of the proposed estimators can be derived based on product integration theory (Gill & Johansen, 1990).

The model and estimating procedures are presented in section 2. Section 3 gives large sample properties and also a test for investigating time-invariance. Small sample performs are assessed by simulation studies that are presented in section 4. In section 5 we consider a worked example that assesses the effect of haplotype matching for HCT patients. Section 6 concludes with some final remarks. Proofs are collected in the Appendix.

Let $\stackrel{~}{T}$ be the failure time of interest and let C be a potential censoring time such that we observe $T=\stackrel{~}{T}\bigwedge C$ and = $\delta =I(\stackrel{~}{T}\le C)$. Let *N*(*t*) = *I*(*T* ≤ *t*, δ = 1). We assume that, given covariates *X* and observed haplotype pairs for the donor *H _{d}* = (

$$\lambda \left(t\right)=R\left(t\right)\lambda (t;{H}_{d},{H}_{p},X)=R\left(t\right)\left\{X{({H}_{d},{H}_{p})}^{\u22ba}\alpha \left(t\right)\right\}$$

(1)

where *R*(*t*) = *I*(*T* ≥ *t*) indicates individual is at risk at time *t*, and *X*(*H _{d}*,

If *H _{d}* and

To estimate the effect of haplo-match we need to model how the haplotype pairs relate to the observed genotypes. We assume that the probability of a specific haplotype pair for both donor and patient *H* = (*h _{i}, h_{j}*) can be computed under Hardy-Weinberg equilibrium as

$$P\{H=({h}_{i},{h}_{j})\}={\pi}_{i}{\pi}_{j}$$

(2)

where π_{i}, *i* = 1, .., *K* gives the distribution of the considered haplotypes. One may consider any relevant parametric model for the frequencies of the haplotype pairs. Explicitly, we parameterize the haplotype frequencies as

$${\pi}_{j}=\frac{\mathrm{exp}\left({\eta}_{j}\right)}{\mathrm{exp}\left({\eta}_{1}\right)+\cdots +\mathrm{exp}\left({\eta}_{K}\right)}$$

with η_{j} ] − ∞, ∞[ and with the identifiability restriction η_{K} = 0 (Σ π_{i} = 1). We denote the haplotype parameters as = ν = (η_{1},…, η_{K−1}). For the HCT data, these parameters are unknown and must be estimated from the data, but one may also search known data-bases to get estimates of these parameters. We here treat the haplotype parameters as unknown. In the HCT data, we have a very rich haplotype space. Thus, a very large number of parameters need to be estimated and since we only have limited observed data, we will make further simplifying assumptions about the haplotype frequencies (see section 5).

Based on this model we can infer the conditional distribution of the underlying haplotype pairs given the observed genotype, *P* (*H = h|G = g*). The observed intensity given genotype can be written as

$$\begin{array}{cc}\hfill \lambda (t;G,X)& =\frac{{\Sigma}_{{h}_{d},{h}_{p}\in G}P({H}_{d}={h}_{d},{H}_{p}={h}_{p}\mid G)\mathrm{exp}\left\{-{\int}_{0}^{{t}^{-}}\lambda (s;{h}_{d},{h}_{p},X)ds\right\}\lambda (t;{h}_{d},{h}_{p},X)}{{\Sigma}_{{h}_{d},{h}_{p}\in G}P({H}_{d}={h}_{d},{H}_{p}={h}_{p}\mid G)\mathrm{exp}\left\{-{\int}_{0}^{{t}^{-}}\lambda (s;{h}_{d},{h}_{p},X)ds\right\}}R\left(t\right)\hfill \\ \hfill & =\frac{{\Sigma}_{{h}_{d},{h}_{p}\in G}P({H}_{d}={h}_{d},{H}_{p}={h}_{p}\mid G)\mathrm{exp}\left\{-{\int}_{0}^{{t}^{-}}\lambda (s;{h}_{d},{h}_{p},X)ds\right\}R\left(t\right)X{({h}_{d},{h}_{p})}^{\u22ba}}{{\Sigma}_{{h}_{d},{h}_{p}\in G}P({H}_{d}={h}_{d},{H}_{p}={h}_{d}\mid G)\mathrm{exp}\left\{-{\int}_{0}^{{t}^{-}}\lambda (s;{h}_{d},{h}_{p},X)ds\right\}}\alpha \left(t\right)\hfill \\ \hfill & =R\left(t\right)\frac{\mathcal{T}{(t,\nu ,G,X)}^{T}}{\mathcal{N}(t,\nu ,G,X)}\alpha \left(t\right)=R\left(t\right)f{\{\nu ,A\left({t}^{-}\right)\}}^{\u22ba}\alpha \left(t\right),\hfill \end{array}$$

where $A\left(t\right)={\int}_{0}^{t}\alpha \left(s\right)$ and Σ_{hd,hpG} is the sum over all haplotype pairs that are consistent with the observed genotype. For unrelated donors and patients, we can assume that *P*(*H _{d} = h_{d}, H_{p} = h_{p}|G*) =

The likelihood for a subject is

$$P\left(G\right)\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}\lambda {(T;G,X)}^{\delta}\mathrm{exp}\left\{-{\int}_{0}^{T}\lambda (s;G,X)ds\right\}$$

and the score equation for ν can be partitioned into two parts

$$\stackrel{~}{U}\left(\nu \right)={U}^{G}\left(\nu \right)+{U}^{N}\left(\nu \right)$$

where *U ^{G}* is the part of the score that relates to

For the nonparametric additive structure of the hazard function, the likelihood is hard to work with even when haplotype pairs, *H*, are fully observed. Therefore, we consider an alternative approach for estimating the nonparametric regression functions. We assume that *n* i.i.d. observations are available in the time-interval [0, τ], for some τ < ∞. Let *N*(*t*) = {*N*_{1}, ..,*N*_{n}(*t*)}^{T} be the *n*-dimensional counting process of all subjects with corresponding intensity λ(*t*) = {λ_{1}(*t*), …, λ_{n}(*t*)}^{T}.

We first note that estimation can be done recursively. Assuming for the moment that ν is known and $\widehat{A}\left(t\right)=0$ for all t smaller than the first jump-time. The intensity of *N _{i}*(

$$\widehat{A}\left(t\right)={\int}_{0}^{t}{\stackrel{~}{Y}}^{-}\{s,\widehat{A}\left({s}^{-}\right)\}dN\left(s\right),$$

where the *i*th row of $\stackrel{~}{Y}\{t,A\left({t}^{-}\right)\}\mathrm{is}{\stackrel{~}{Y}}_{i}\{t,\widehat{A}\left({t}^{-}\right)\}$ and ${\stackrel{~}{Y}}^{-}(\cdot )$ is the generalized weighted inverse of $\stackrel{~}{Y}(\cdot )$, which is given by

$${\left({\stackrel{~}{Y}}^{\u22ba}W\stackrel{~}{Y}\right)}^{-1}{\stackrel{~}{Y}}^{\u22ba}W.$$

Here the diagonal elements of the weight function, *W* are *N* (*t*, ν,*G*_{i}, *X*_{i})/*T*(t, ν, *G*_{i}, *X*_{i}). The efficient weight is equivalent to the efficient weight in the one-dimensional case, which involves inverse of *X*(*h _{d}, h_{p}*)

Next, we look at an estimating equation for ν based on the survival data. We first need some definitions. Let the *i*th row of $\stackrel{~}{Z}\left(t\right)\phantom{\rule{thinmathspace}{0ex}}\mathrm{be}\phantom{\rule{thinmathspace}{0ex}}{R}_{i}\left(t\right){\stackrel{~}{Z}}_{i}(t,\nu ,G,X)$ where

$${\stackrel{~}{Z}}_{i}(t,\nu ,G,X)=\left[\frac{{D}_{\nu}\mathcal{T}(t,\nu ,{G}_{i},{X}_{i})}{\mathcal{T}(t,\nu ,{G}_{i},{X}_{i})}-\frac{{D}_{\nu}\mathcal{N}(t,\nu ,{G}_{i},{X}_{i})}{\mathcal{N}(t,\nu ,{G}_{i},{X}_{i})}\right]{R}_{i}\left(t\right)$$

is the derivative of the intensity with respect to ν multiplied by the weight described above. Here *D _{ν}h*(

This leads to an estimating equation for the unknown ν based on the survival data

$$\begin{array}{cc}\hfill {U}^{N}(\nu ,A)& ={\int}_{0}^{\tau}{\stackrel{~}{Z}}^{\u22ba}\left(t\right)\{dN\left(t\right)-\lambda (t;G,X)dt\}\hfill \\ \hfill & ={\int}_{0}^{\tau}{\stackrel{~}{Z}}^{\u22ba}\left(t\right)\left\{dN\left(t\right)-\stackrel{~}{Y}\{t,A\left({t}^{-}\right)\}dA\left(t\right)\right\}.\hfill \end{array}$$

We now replace A(t) with Â(*t*) and add the score contribution coming from *P* (*G*) to obtain the following estimating equation for ν:

$$U(\nu ,\widehat{A})={\int}_{0}^{\tau}{\stackrel{~}{Z}}^{\u22ba}\left(t\right)\left\{I-\stackrel{~}{Y}\{t,\widehat{A}\left({t}^{-}\right)\}{\stackrel{~}{Y}}^{-}\{t,\widehat{A}\left({t}^{-}\right)\}\right\}dN\left(t\right)+{U}^{G}\left(\nu \right).$$

(3)

Then, ν is estimated by solving *U*(ν, Â) = 0 and denoted by $\widehat{\nu}$. Finally, we then estimate *A*(*t*) by least square estimator with $\widehat{\nu}$.

One may ignore the information about ν contained in the survival data and base estimation solely on the genotype data. This we refer to as the two-stage approach.

In this Section we outline the large sample properties and derive estimators for the variance of the suggested estimators. When the estimating function, *U*, is evaluated in the true parameter values ν_{0} and *A* it becomes a martingale, but the asymptotic analysis is complicated by the fact that *A* is estimated using a recursively constructed estimator.

With $\widehat{\nu}$ the proper root of (3) and with additional regularity conditions it follows that it is consistent and asymptotically normal, we refer to the Appendix for further details. But the key arguments are based on the one dimensional case as in Bagdonavicius & Nikulin (1999); Chen et al. (2002); Dabrowska (2005); Gorfine et al. (2005); Zucker (2005). The key to the extension to the multivariate additive risk model is based on the powerful and elegant product integration theory given by Gill & Johansen (1990). This also is utilized by Dabrowska (2005).

It can be established that

$$\begin{array}{cc}\hfill {n}^{-1\u22152}U(t,{\nu}_{0})& ={n}^{-1\u22152}\sum _{i=1}^{n}{U}_{i}(t,{\nu}_{0})+{o}_{p}\left(1\right)\hfill \\ \hfill & ={n}^{-1\u22152}\sum _{i=1}^{n}{\int}_{0}^{t}{q}_{i}(s,{\nu}_{0})d{M}_{i}\left(s\right)+{n}^{-1\u22152}\sum _{i=1}^{n}{U}_{i}^{G}\left({\nu}_{0}\right)+{o}_{p}\left(1\right)\hfill \end{array}$$

where *U*_{i}(*t*, ν_{0}) is defined in the second equation, ${M}_{i}\left(t\right)={N}_{i}\left(t\right)-{\int}_{0}^{t}{\lambda}_{i}\left(s\right)\mathit{ds}$ is the basic martingale and *q _{i}*(

$$\widehat{\Sigma}(t,\widehat{\nu})={n}^{-1}\sum _{i=1}^{n}{\widehat{U}}_{i}{(t,\widehat{\nu})}^{\otimes 2},$$

(4)

where ${\widehat{U}}_{i}(t,\widehat{\nu})={\int}_{0}^{t}{\widehat{q}}_{i}(s,\widehat{\nu})d{\widehat{M}}_{i}\left(s\right)+{U}_{i}^{G}\left(\widehat{\nu}\right)$ and ${\widehat{q}}_{i}$ and ${\widehat{M}}_{i}$ are estimators of *q _{i}* and

Based on this it can be derived that $\sqrt{n}(\widehat{\nu}-{\nu}_{0})$ converges to a zero-mean normal distribution with covariance matrix that can be consistently estimated by

$$n{\mathcal{I}}^{-1}(\tau ,\widehat{\nu})\widehat{\Sigma}(\tau ,\widehat{\nu})n{\mathcal{I}}^{-1}(\tau ,\widehat{\nu}),$$

where $\mathcal{I}$ denotes minus of the derivative of *U* with respect to ν.

It also follows that $\sqrt{n}\{\widehat{A}\left(t\right)-A\left(t\right)\}$ converges to a Gaussian process. The key is again to establish an i.i.d. decomposition. It turns out, based on a Taylor series expansion, that (up to an *o _{p}*(1) term)

$$\sqrt{n}\{\widehat{A}\left(t\right)-A\left(t\right)\}={n}^{-1\u22152}\sum _{i=1}^{n}{L}_{i}(t,{\nu}_{0})$$

where the *L _{i}*'s are defined in (10) in the Appendix and they are i.i.d. processes.

Let ${\widehat{L}}_{i}$ denote the estimator of *L _{i}* by plugging in the estimates of the unknown quantities, see Appendix for details. The variance of

$${n}^{-1}\sum _{i=1}^{n}{\widehat{L}}_{i}^{\otimes 2}(t,\widehat{\nu}).$$

(5)

The asymptotic distribution can be approximated by a resampling scheme

$$\Delta \left(t\right)={n}^{-1\u22152}\sum _{i=1}^{n}{G}_{i}{\widehat{L}}_{i}(t,\widehat{\nu}),$$

(6)

where *G*_{1}, …,*G _{n}* are independent standard normals and are independent of the counting processes and their covariates.

By a resampling technique we can construct (1 – α)% confidence bands for each cumulative regression coefficient function of *A _{j}*(

To derive a formal test to see if a given effect varies over time or if it leads to constant excess risk, we consider a supremum test that we evaluate based on the resampling technique. Considering for example the jth component of the model, α_{j}(*t*), we now wish to test *H*_{0} : α_{j}(*t*) γ_{j} for *t* [0, τ]. As in Martinussen & Scheike (2006) a test can be based on a supremum test statistic

$${\widehat{W}}_{j}=\underset{t\in [0,\tau ]}{\mathrm{sup}}\mid {\widehat{A}}_{j}\left(t\right)-\frac{{\widehat{A}}_{j}\left(\tau \right)}{\tau}t\mid .$$

Let ${\stackrel{~}{\Delta}}_{j}^{k}\left(t\right)={n}^{-1\u22152}{\Delta}_{j}^{k}\left(t\right)$ where ${\Delta}_{j}^{k}\left(t\right)$ is the kth realization of the resampling process of $\sqrt{n}\{{\widehat{A}}_{j}\left(t\right)-{A}_{j}\left(t\right)\}$. Based on derived large sample property and under *H*_{0}, ${\widehat{W}}_{j}$ can be asymptotically approximated by

$${\widehat{W}}_{j}^{k}=\underset{t\in [0,\tau ]}{\mathrm{sup}}\mid {\stackrel{~}{\Delta}}_{j}^{k}\left(t\right)-\frac{{\stackrel{~}{\Delta}}_{j}^{k}\left(\tau \right)}{\tau}t\mid $$

for κ = 1, …, *K*. A formal test can be derived by comparing observed ${\widehat{W}}_{j}$ and simulated $\{{\widehat{W}}_{j}^{1},\dots ,{\widehat{W}}_{j}^{K}\}$

We simulated data similar to the BMT example data considered in section 5. The underlying additive hazard model had a baseline at 0.3 with the effect of haplotype match being −0.2 in the first 1.5 time units and 0 afterwards. The survival times were right censored at 3 time units. In the 2054 patients we consider below we found that there where 20, 31 and 13 different allele types for the loci A, B, and DRB1, and that this lead to a total of 2,325 possible haplotypes. These haplotypes where asssumed to have frequencies similar to those we found in a structured MLE based on the genotype data where we found that 1901 of the haplotypes had frequency 9.95 * 10^{−5}, 214 had frequency 0.00072, 192 had frequency 0.0019, 14 had frequency 0.0087, 2 had frequency 0.027, 1 had frequency 0.042, and 1 had frequency 0.083, respectively. We tried various other haplotype distributions that were not based directly on the underlying structured model, and we will study this issue further in future work. We then considered sample sizes of 2400, 1200, 600 and 300 that lead to approximately 1100, 650, 320 and 160 number of deaths on average, respectively. 2000 random samples were generated for each simulation study, and the resampling based tests were based on 1000 resamples.

We present detailed results for sample size of 600. Figure 1(a) and 1(b) show the mean of the estimated cumulative intensity (broken line) and the true cumulative coefficient (solid line) for the baseline and the effect of the haplotype match, respectively. Figure 1(c) and 1(d) give the mean of the estimated standard errors (broken line) and the sample standard deviation of the realizations (solid line) for baseline and effect of the haplotype match, respectively. Figure 1 shows that the estimators are essentially unbiased and the standard errors are well estimated by the derived estimator. There is some slight underestimation of the variance, that diminishes when the sample sizes increase (figures are not shown here).

Mean of estimated cumulative hazards (broken line) versus true cumulative hazards (solid line) for the baseline and haplomatch effect (Panels a and b), and observed pointwise standard deviation of estimates (solid line) versus mean of estimated standard **...**

The standard errors of the cumulative coefficients are approximately halved if the haplotype match had been observed. We also considered the effect of estimating the haplotype parameters and this only had minor impact on the standard errors, estimates and coverage probabilities.

Figure 2 shows the pointwise coverage probabilities for the baseline intensity and for the effect of haplotype match, respectively. The slight bias in the variances result in coverage probabilities that are slightly lower for most of the time-axis. The initial problems with the coverage probabilities are well known and due restrictions close to 0, these problems may be remedied by choosing an appropriate transformation.

Observed pointwise coverage of 95 % pointwise confidence interval for sample size 500 for the cumulative baseline (solid line) and cumulative excess effect of haplomatch (broken line).

We also considered the test for constant effects developed earlier as well as the coverage of the confidence bands. The observed coverage is considered in the time-period from 0.3 to 3 to disregard the initial problematic time period. This resulted in the observed levels reported in Table 1.

Observed power for test of constant effects and coverage probabilities for confidence bands for model where baseline is constant and haplotype match effect is non-constant.

The test for constant effect achieves the nominal 5% power for the constant effects for the baseline and the power for the time-varying effect of the haplotype match increases with increasing sample size. The coverage probabilities of confidence bands are in an acceptable level. Slightly to low for the baseline, but close to the true 95% for the haplo-match effect.

The finite sample properties are well described by the estimators. Both the pointwise and uniform description of the uncertainty works well. The slight problems are similar to those found for the additive intensity model with known covariates.

Recently, Petersdorf et al. (2007) studied the effect of MHC haplotype match based on 246 leukemia patients. All these 246 patients received a HLA-identical unrelated HCT between 1986 and 2003. Their analysis was based on observing the haplotypes directly. This is a very costly and time-consuming procedure. They identified 191 (78%) transplants that were haplotype-matched. They demonstrated that the MHC haplotype match of donor and patient had significant impact on the transplantation outcome of developing acute graft-versus-host disease (GVHD) and recurrent malignancy, but had no impact on the treatment related mortality (TRM) which is defined as death in complete remission and overall mortality. For the illustration purpose, we consider a similar cohort of patients with HLA-identical unrelated HCT for leukemia and transplanted between 1995 and 2007, and donor-patient matched on four loci HLA-A/B/C/DRB1 based on low-resolution (2 digits) classification. The data presented here were obtained from the Statistical Center of the Center for International Blood and Marrow Transplant Research (CIBMTR) located at the Medical College of Wisconsin. The analysis has not been reviewed or approved by the CIBMTR. The CIBMTR is comprised of clinical and basic scientists who share data on their transplant patients with CIBMTR Statistical Center.

The selected data consists of 5,145 leukemia patients (2,054 for acute myeloid leukemia (AML), 1,171 for acute lymphoblastic leukemia (ALL), 1,037 for chronic myelogenous leukemia (CML) and 883 for myelodysplastic syndrome (MDS)). The analysis showed that a haplotype match only had a significant effect on AML patients. For illustration purposes, we here focus on the overall survival of the 2,054 AML patients.

The genotype data used in estimation is based on the three-loci HLA-A, HLA-B and HLA-DRB1. This gives a very rich haplotype space. To reduce the total number of possible haplotypes, low-resolution (2 digits) classification of alleles at loci {A1, A2}, {B1, B2} and {DRB11, DRB12} were used to estimated haplotype frequencies. There were 20, 31 and 13 different allele types for the loci A, B, and DRB1, respectively. This leads to a total of 2,325 possible haplotypes. Some additional structure is needed to be able to estimate the parameters in this setting, and we grouped rare haplotypes into groups with a common haplotype frequency. We ended up with a model with 10 parameters where the 5 largest frequencies have their own parameters and other frequencies where allowed to be in 5 different groups. We tried modifying the models in many different ways but the results were quite robust to these choices. In the regression analysis below we adjust for disease stage at pre-transplant (high risk patient (36%) versus early or intermediate risk patient (64%)), age of patient (> 45 (50%) versus 19 – 45 (38%) versus ≤ 18 (12%) years), GVHD prophylaxis (Cyclosporine-A (CsA) ± other (44%) versus others (56%)), graft source (bone marrow (BM) (45%) vs peripheral blood (PB) (55%)) and year of transplant (1995–2001: early, (25%) versus 2002–2007 (75%)). We considered the survival for the first 90 months (τ = 90). There were 1,206 patients died within this study period.

We fitted an additive model to evaluate the effect of haplo-match by *X*^{T}(*H _{P}, H_{D}*) = {

Cumulative excess hazard for covariates based on additive model with 95% point-wise confidence intervals and 95% confidence bands (broken lines).

The formal tests for a constant excess risk lead to p-values for disease stage, age group 19 – 45, age group > 45, GVHD prophylaxis, graft source and year of transplant that are <0.001, 0.051, <0.001, 0.002, 0.014 and 0.002, respectively. The haplo-match effect lead to a p-value of *p* = 0.033. Figure 3 and these tests results indicate that a similar Cox regression analyses would not be appropriate here.

Figure 3 indicates the overall effect of matching haplotypes is significant even though the significance test only leads to a p-value of 0.055. Note, also that the test for constant effect also provides an alternative test for significance.

Also, the negative cumulative hazards for the haplo-matching indicates that patient with a haplo-matched HCT has a higher survival probability. Figure 4 gives the predicted survival probabilities of haplo-mismatch (panel a) and haplo-matched (panel b) for a young patient (≤ 18) with early or intermediate disease at pre-transplant, received other than CsA ± others for GVHD prophylaxis and received a PB transplant between 2002–2007. The survival predictions are provided with 95% pointwise confidence intervals (dotted lines) and with 95% confidence bands (broken lines). The predicted 5 year survival probabilities for haplo-mismatched and haplo-matched HCT are 55% (95% CI: 38%–72%) and 83% (64%–100%), respectively.

Predicted survival probabilities with 95% pointwise confidence intervals (dotted lines) and 95% confidence bands (broken lines) based on additive risk model for a specific patient (see text) with haplotype mismatched HCT (Panel a) and with haplotype matched **...**

This BMT example showed a positive beneficial effect in overall survival for haplotype matched transplant. As we discussed in section 1, this positive effect could lead to a new decision in choosing a best matched donor.

The popularity of the additive hazards model risk model has grown as it is well suited to study time-dynamics, that is whether or not effects of covariates change with time. In this paper we have demonstrated how the additive risk model can be applied also when there is missing data, and in particular how one can estimate effects of haplo-type match in BMT studies.

The large sample properties of the suggested estimators were derived using product integration theory, and these seem to give an accurate description of the variability even for very small samples. Furthermore, the formal test of whether or not covariate effects vary with time worked quite well in the simulations.

In the context of the the BMT example, we studied the effect of haplotype match, and here specific modeling of the haplotype frequencies were needed. It would be interesting to further consider how to make effects robust to misspecification of the haplotype modeling, and one may pursue ideas like Allen et al. (2005). Further, one may model the haplotypes with additional complexity to allow for lacking Hardy-Weinberg Equilibrium and recombination that is know to present on the HLA gene for the three loci we consider.

The approach is implemented in the R-package HaploSurvival, that is being developed by the first author.

We appreciate helpful and constructive suggestions by the Editors and a referee.

This research was supported by NIH-grant 2 R01 CA54706-10, and a grant from the Danish Research Council on “point process modeling and statistical inference”.

We start by making the observation that (up to *o _{p}*(

$$\widehat{A}({\nu}_{0},t)-{A}_{0}\left(t\right)={\int}_{0}^{t}\stackrel{~}{Y}-({\nu}_{0},\widehat{A})dM-{\int}_{0}^{t}\stackrel{~}{Y}-({\nu}_{0},\widehat{A})\left\{\stackrel{~}{Y}({\nu}_{0},\widehat{A})-\stackrel{~}{Y}({\nu}_{0},{A}_{0})\right\}d{A}_{0}\left(s\right),$$

where *M*(*s*) is the (vector) counting process martingale, and where the time arguments were omitted for notational simplicity. Let ${S}_{0}(t,{\nu}_{0})={\stackrel{~}{Y}}^{\mathrm{T}}(t,{\nu}_{0},\widehat{A})\stackrel{~}{Y}(t,{\nu}_{0},\widehat{A})$ and let the limit of this quantity be denoted as *s*_{0}(*t*). Note that the first term, *V _{n}*, by the martingale CLT, can be shown to converge to a Gaussian martingale process

$${V}_{n}\left(t\right)=\sum _{i=1}^{n}{\int}_{0}^{t}{S}_{0}^{-1}(s,{\nu}_{0}){\stackrel{~}{Y}}_{i}{W}_{i}d{M}_{i\cdot}$$

Also, a Taylor expansion using matrix derivatives yields (up to *o _{p}*(

$$\stackrel{~}{Y}(t,{\nu}_{0},\widehat{A})-\stackrel{~}{Y}(t,{\nu}_{0},{A}_{0})=\sum _{j}{D}_{A,j}\stackrel{~}{Y}(t,{\nu}_{0},{A}_{0}))\left\{{\widehat{A}}_{j}\left({t}^{-}\right)-{A}_{0}\left({t}^{-}\right)\right\},$$

where the *i*th row of ${D}_{{A}_{j}}\left(\stackrel{~}{Y}({\nu}_{0},A)\right)$ is *D _{Aj}f_{i}*(ν,

$$\begin{array}{cc}\hfill {E}_{j}\left(t\right)& ={\stackrel{~}{Y}}^{-}(t,{\nu}_{0},{A}_{0}){D}_{{A}_{j}}\stackrel{~}{Y}(t,{\nu}_{0},{A}_{0}))\hfill \\ \hfill \stackrel{~}{Y}-({\nu}_{0},\widehat{A})\left\{\stackrel{~}{Y}(t,{\nu}_{0},\widehat{A})-\stackrel{~}{Y}(t,{\nu}_{0},{A}_{0})\right\}& =\sum _{j}{E}_{j}\left(t\right)\left\{{\widehat{A}}_{j}\left({t}^{-}\right)-{A}_{0}\left({t}^{-}\right)\right\}+{o}_{p}\left({n}^{-1\u22152}\right).\hfill \end{array}$$

This implies that $\sqrt{n}\{\widehat{A}({\nu}_{0},t)-{A}_{0}\left(t\right)\}$ converges in distribution to a p-dimensional process *W* that satisfies the integral equation

$$W\left(t\right)=\sum _{j=1}^{p}{\int}_{0}^{t}-{W}_{j}\left({s}^{-}\right){e}_{j}(s,{\nu}_{0})d{A}_{0}\left(s\right)+V\left(t\right).$$

Denote the elements of *e _{j}*(

$$dW\left(t\right)=-W\left({t}^{-}\right)Fdt+dV\left(t\right).$$

Using the product integration formulae (Gill & Johansen, 1990) leads to the solution

$$\begin{array}{cc}\hfill W\left(t\right)& ={\int}_{0}^{t}dV\left(s\right)\mathcal{F}(s,t),\hfill \\ \hfill \mathcal{F}(s,t)& =\prod _{u\in ]s,t]}(I-Fdu),\hfill \end{array}$$

(7)

and where *F*(*s, t*) is the product integral of *F*. In the one-dimensional (commutative) case this equals exp $(-{\int}_{s}^{t}e(u,{\nu}_{0})d{A}_{0}\left(u\right))$. Note that the matrix *F*(*s, t*) is estimated consistently by the product of (the atoms)

$$\widehat{\mathcal{F}}(s,t)=\prod _{]s,t]}(I-d\widehat{F}),$$

(8)

where $d{\widehat{F}}_{\mathit{jk}}\left(t\right)={\Sigma}_{m=1}^{p}{E}_{j,\mathrm{km}}\left(t\right)d{\widehat{A}}_{m}\left(t\right)$. Using the product integral rather than the exponential in the one-dimensional case yields an estimator with the same asymptotic properties.

We now consider the estimating function for ν based on the counting process data, ${\stackrel{~}{U}}^{N}\left(\nu \right)$. By the martingale decomposition

$${\stackrel{~}{U}}^{N}\left({\nu}_{0}\right)=\int {Z}^{\u22ba}\left\{I-\stackrel{~}{Y}({\nu}_{0},\widehat{A}){\stackrel{~}{Y}}^{-}({\nu}_{0},\widehat{A})\right\}dM+\int {Z}^{\u22ba}\left\{I-\stackrel{~}{Y}({\nu}_{0},\widehat{A}){\stackrel{~}{Y}}^{-}({\nu}_{0},\widehat{A})\right\}\stackrel{~}{Y}({\nu}_{0},{A}_{0})d{A}_{0}.$$

By a Taylor expansion, as before, it may be shown that the difference between the last two terms in the previous equation (up to *o _{p}*(

$$-\sum _{j=1}^{p}{Z}^{\u22ba}\left\{I-\stackrel{~}{Y}({\nu}_{0},{A}_{0}){\stackrel{~}{Y}}^{-}({\nu}_{0},{A}_{0})\right\}{D}_{{A}_{j}}\stackrel{~}{Y}(\nu ,{A}_{0})\left\{{\widehat{A}}_{j}\left({t}^{-}\right)-{A}_{0j}\left({t}^{-}\right)\right\}=\sum _{j=1}^{p}{\stackrel{~}{s}}_{j}\left\{{\widehat{A}}_{j}(t-)-{A}_{0j}(t-)\right\}.$$

With the matrices in each of the terms converging to ${\stackrel{~}{s}}_{j}$.

The last two term therefore equals (*o _{p}*(n

$$\sum _{j=1}^{p}\int {\stackrel{~}{s}}_{j}{W}_{j}d{A}_{0},$$

that can be written as

$$\int G\left(t\right)W\left(t\right)dt,$$

with *G*(*t*) is an *q* × *p* matrix with elements ${G}_{\mathit{kj}}\left(t\right)={\Sigma}_{m=1}^{p}{\stackrel{~}{s}}_{j,\mathrm{km}}{\alpha}_{0,m}\left(t\right)$. Now changing the order of integration and using the structure of *W* (*t*) one gets

$${\int}_{0}^{\tau}G\left(t\right)W\left(t\right)dt={\int}_{0}^{\tau}G\left(t\right){\int}_{0}^{t}\mathcal{F}(s,t)V\left(s\right)dsdt={\int}_{0}^{\tau}g\left(s\right)V\left(s\right)ds=\sum _{i=1}^{n}{\int}_{0}^{\tau}g\left(s\right){s}_{0}^{-1}(s,\nu ){\stackrel{~}{Y}}_{i}d{M}_{i}\left(s\right),$$

where

$$g\left(s\right)={\int}_{s}^{\tau}G\left(t\right)\mathcal{F}(s,t)dt={\int}_{s}^{\tau}\mathcal{F}(s,t)dG\left(t\right).$$

Define

$$\begin{array}{cc}\hfill {q}_{1i}(t,\nu )& ={\stackrel{~}{Z}}_{i}-{s}_{z}(t,\nu ){s}_{0}^{-1}(t,\nu ){\stackrel{~}{Y}}_{i}{W}_{i},\hfill \\ \hfill {q}_{2i}(t,\nu )& =g\left(t\right){s}_{0}^{-1}(t,\nu ){\stackrel{~}{Y}}_{i}{W}_{i},\hfill \\ \hfill {q}_{i}\left(t\right)& ={q}_{1i}(t,\nu )+{q}_{2i}(t,\nu ).\hfill \end{array}$$

(9)

Then one can write the estimating function as follows (up to an *o _{p}*(1) term)

$${n}^{-1\u22152}{\stackrel{~}{U}}^{N}\left({\nu}_{0}\right)={n}^{-1\u22152}\sum _{i=1}^{n}{\int}_{0}^{\tau}\{{q}_{1i}(t,{\nu}_{0})+{q}_{2i}(t,{\nu}_{0})\}d{M}_{i}\left(t\right).$$

This is a sum of i.i.d. terms (or a martingale) and therefore converges to a normal distribution with variance that is estimated by the robust estimator given in (4). The total score for ν is *U ^{N}*(ν

Based on a Taylor series expansion (up to an *o _{p}*(1) term) we find that

$$\begin{array}{cc}\hfill & \sqrt{n}\left\{\widehat{A}\left(t\right)-{A}_{0}\left(t\right)\right\}\hfill \\ \hfill & ={n}^{1\u22152}{\int}_{0}^{t}\frac{\partial}{\partial \nu}{\stackrel{~}{Y}}^{-}(t,{\nu}_{0})d\mathit{N}(\widehat{\nu}-{\nu}_{0})+{n}^{-1\u22152}\sum _{i=1}^{n}{\int}_{0}^{t}\mathcal{F}(s,t){s}_{0}^{-1}(s,{\nu}_{0}){\stackrel{~}{Y}}_{i}\left(s\right)d{M}_{i}\left(s\right)\hfill \\ \hfill & ={n}^{-1\u22152}\sum _{i=1}^{n}{L}_{i}(t,{\nu}_{0}),\hfill \end{array}$$

where

$${L}_{i}(t,\nu )=P(t,{\nu}_{0}){I}^{-1}\left(\tau \right){\int}_{0}^{\tau}{q}_{i}(t,{\nu}_{0})d{M}_{i}\left(t\right)+{\int}_{0}^{t}\mathcal{F}(s,t){s}_{0}^{-1}(s,{\nu}_{0}){\stackrel{~}{Y}}_{i}\left(s\right)d{M}_{i}\left(s\right),$$

(10)

and with *P*(*t*, ν_{0}) the limit of ${n}^{-1}\widehat{P}(t,\widehat{\nu})$ defined in (11), with *I*(τ) the limit of *n*^{−1}*I*(τ, ν_{0}) (12), and with the product integral *F*(*s, t*) defined in (7).

The derivative ofÃ(*t*) with respect to ν is

$$\stackrel{~}{P}(t,\nu )=-{\int}_{0}^{t}{S}_{0}^{-1}\left(\widehat{A}\right){D}_{\nu}{S}_{0}\left(\widehat{A}\right){S}_{0}^{-1}\left(\widehat{A}\right){\stackrel{~}{Y}}^{\u22ba}WdN+{\int}_{0}^{t}{S}_{0}^{-1}\left(\widehat{A}\right){D}_{\nu}{\stackrel{~}{Y}}^{\u22ba}WdN+{\int}_{0}^{t}{S}_{0}^{-1}\left(\widehat{A}\right){\stackrel{~}{Y}}^{\u22ba}{D}_{\nu}WdN.$$

(11)

The derivative with respect to ν of the estimating function, *U*, is

$$\mathcal{I}\left(\nu \right)=\mathcal{I}(\tau ,\nu )=\int {D}_{\nu}\left({\stackrel{~}{Z}}^{\u22ba}\right)\{I-\stackrel{~}{Y}{\stackrel{~}{Y}}^{-}\}dN-\int {\stackrel{~}{Z}}^{\u22ba}{D}_{\nu}\left({\stackrel{~}{Y}}^{\u22ba}\right){S}_{0}^{-1}{\stackrel{~}{Y}}^{\u22ba}WdN+\int {\stackrel{~}{Z}}^{\u22ba}\stackrel{~}{Y}{S}_{0}^{-1}\left({D}_{\nu}{S}_{0}\right){S}_{0}^{-1}{\stackrel{~}{Y}}^{\u22ba}WdN-{\stackrel{~}{Z}}^{\u22ba}\stackrel{~}{Y}{S}_{0}^{-1}{D}_{\nu}{\left(\stackrel{~}{Y}\right)}^{\u22ba}WdN-\int {\stackrel{~}{Z}}^{\u22ba}\stackrel{~}{Y}{S}_{0}^{-1}{\stackrel{~}{Y}}^{T}{D}_{\nu}WdN.$$

(12)

Based on the derivation of the estimating function it straightforward to do a Newton-Raphson iteration to find a solution to the estimating equations. Note that the first term is a martingale when evaluated in the true parameter, and therefore will converge towards 0 in probability.

Note also that an estimator of *F*(*s, t*) is given in (8).

- Aalen OO. A linear regression model for the analysis of life times. Statist. Med. 1989;8:907–925. [PubMed]
- Allen AS, Satten GA, Tsiatis AA. Locally-efficient robust estimation of haplotype-disease association in family-based studies. Biometrika. 2005;92:559–571.
- Bagdonavicius V, Nikulin M. Generalised proportional hazards model based on modified parital likelihood. Lifetime Data Anal. 1999;5:329–350. [PubMed]
- Chen J, Peters U, Foster C, Chatterjee N. A haplotype-based test of association using data from cohort and nested case-control epidemiologic studies. Human Heredity. 2004;58:18–29. [PubMed]
- Chen K, Jin Z, Ying Z. Semiparametric analysis of transformation models with censored data. Biometrika. 2002;89:659–668.
- Dabrowska DM. Quantile regession in transfomation models. arXiv:math.ST. 2005;05115082v1:1–34.
- Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics. 2003;73:1316–1329. [PubMed]
- Gill RD, Johansen S. A survey of product-integration with a view towards application in survival analysis. Ann. Statist. 1990;18:1501–1555.
- Gorfine M, Zucker D, Hsu L. Prospective survival analysis with a general semiparametric shared frailty model: A pseudo full likelihood approach. Biometrika. 2005;93:735–741. [PMC free article] [PubMed]
- Huffer FW, McKeague IW. Weighted least squares estimation for Aalen's additive risk model. J. Amer. Statist. Assoc. 1991;86:114–129.
- Lin D, Zeng D. Likelihood based inference on haplotype effects in genetic association studies. J. Amer. Statist. Assoc. 2006;101:89–110.
- Martinussen T, Scheike T. Dynamic Regression Models for Survival Data. Springer; Verlag New York: 2006.
- Petersdorf E, Malkki M, Gooley T, Martin P, Guo Z. MHC haplotype matcing for unrelated hematopoietic cell transplantation. PLOS medicine. 2007;4:59–68.
- Schaid DJ. Evaluating associations of haplotypes with traits. Genetic Epidemiology. 2004;27:348–364. [PubMed]
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score tests for association between traits and haplotypes when the linkage phase is ambiguous. American Journal of Human Genetics. 2002;70:425–434. [PubMed]
- Scheike T, Martinussen T, Silver J. Estimating haplotype effects for survival data. Biometrics. 2010 DOI: 10.1111/j.1541-0420.2009.01329.x. [PubMed]
- Scheike TH. A flexible semiparametric transformation model for survival data. Lifetime Data Anal. 2006;12:461–80. [PubMed]
- Tregouet D, Tiret L. Cox proportional hazards survival regression in haplotype-based association analysis using the stochastic-em algorithm. European Journal of Human Genetics. 2004;12:971–974. [PubMed]
- Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Human Heredity. 2002;53:79–91. [PubMed]
- Zhao LP, Li SS, Khalid N. A method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. American Journal of Human Genetics. 2003;72:1231–1250. [PubMed]
- Zucker DM. A pseudo-partial likelihood method for semiparametric survival regression with covariate errors. J. Amer. Statist. Assoc. 2005;100:1264–1277.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |