Let

be the failure time of interest and let C be a potential censoring time such that we observe

and =

. Let

*N*(

*t*) =

*I*(

*T* ≤

*t*, δ = 1). We assume that, given covariates

*X* and observed haplotype pairs for the donor

*H*_{d} = (

*H*_{d1},

*H*_{d2}) and the patient

*H*_{p} = (

*H*_{p1},

*H*_{p2}), the counting process

*N*(

*t*) has an additive intensity

where

*R*(

*t*) =

*I*(

*T* ≥

*t*) indicates individual is at risk at time

*t*, and

*X*(

*H*_{d},

*H*_{p}) is a function of the observed covariates and haplotypes of donors and patients. In this paper, we consider the design where

*X*(

*H*_{d}, H_{p})

^{T} = {

*X*^{T},

*I*(

*H*_{d} = H_{p})}. This simple design gives the effect of haplotype matching between donor and patient.

If *H*_{d} and *H*_{p} are known, a simple standard survival analysis based on the additive model can be used. However, the problem here is that we only observe the underlying HLA genotypes related to *H*_{d} and *H*_{p}. For HLA-identical unrelated transplant data, the patient and donor have identical genotypes, but may not necessarily have matched haplotype pairs (see section 1).

To estimate the effect of haplo-match we need to model how the haplotype pairs relate to the observed genotypes. We assume that the probability of a specific haplotype pair for both donor and patient

*H* = (

*h*_{i}, h_{j}) can be computed under Hardy-Weinberg equilibrium as

where π

_{i},

*i* = 1, ..,

*K* gives the distribution of the considered haplotypes. One may consider any relevant parametric model for the frequencies of the haplotype pairs. Explicitly, we parameterize the haplotype frequencies as

with η

_{j} ] − ∞, ∞[ and with the identifiability restriction η

_{K} = 0 (Σ π

_{i} = 1). We denote the haplotype parameters as = ν = (η

_{1},…, η

_{K−1}). For the HCT data, these parameters are unknown and must be estimated from the data, but one may also search known data-bases to get estimates of these parameters. We here treat the haplotype parameters as unknown. In the HCT data, we have a very rich haplotype space. Thus, a very large number of parameters need to be estimated and since we only have limited observed data, we will make further simplifying assumptions about the haplotype frequencies (see section 5).

Based on this model we can infer the conditional distribution of the underlying haplotype pairs given the observed genotype,

*P* (

*H = h|G = g*). The observed intensity given genotype can be written as

where

and Σ

_{hd,hpG} is the sum over all haplotype pairs that are consistent with the observed genotype. For unrelated donors and patients, we can assume that

*P*(

*H*_{d} = h_{d}, H_{p} = h_{p}|G) =

*P*(

*H*_{d} = h_{d}|G) ×

*P*(

*H*_{p} = h_{p}|G). For related donors and patients this independence assumption does not hold. Here, the observed hazard function has an additive structure and it also depends on the cumulated regression function evaluated at time

*t*^{−}, i.e. just before time

*t*.

The likelihood for a subject is

and the score equation for ν can be partitioned into two parts

where

*U*^{G} is the part of the score that relates to

*P*_{ν}(

*G*) and

*U*^{N} relates to the information contained in the survival data.

For the nonparametric additive structure of the hazard function, the likelihood is hard to work with even when haplotype pairs, *H*, are fully observed. Therefore, we consider an alternative approach for estimating the nonparametric regression functions. We assume that *n* i.i.d. observations are available in the time-interval [0, τ], for some τ < ∞. Let *N*(*t*) = {*N*_{1}, ..,*N*_{n}(*t*)}^{T} be the *n*-dimensional counting process of all subjects with corresponding intensity λ(*t*) = {λ_{1}(*t*), …, λ_{n}(*t*)}^{T}.

We first note that estimation can be done recursively. Assuming for the moment that ν is known and

for all t smaller than the first jump-time. The intensity of

*N*_{i}(

*t*) can be written as

, where

. This leads to a least square estimator

where the

*i*th row of

and

is the generalized weighted inverse of

, which is given by

Here the diagonal elements of the weight function,

*W* are

*N* (

*t*, ν,

*G*_{i},

*X*_{i})/

*T*(t, ν,

*G*_{i},

*X*_{i}). The efficient weight is equivalent to the efficient weight in the one-dimensional case, which involves inverse of

*X*(

*h*_{d}, h_{p})

^{T} α(

*t*), and it can be implemented using a two-stage approach as for the additive risk model (

Huffer & McKeague, 1991).

Next, we look at an estimating equation for ν based on the survival data. We first need some definitions. Let the

*i*th row of

where

is the derivative of the intensity with respect to ν multiplied by the weight described above. Here

*D*_{ν}h(

*x*) denotes the derivative with respect to ν evaluated at

*x*.

This leads to an estimating equation for the unknown ν based on the survival data

We now replace A(t) with Â(

*t*) and add the score contribution coming from

*P* (

*G*) to obtain the following estimating equation for ν:

Then, ν is estimated by solving

*U*(ν, Â) = 0 and denoted by

. Finally, we then estimate

*A*(

*t*) by least square estimator with

.

One may ignore the information about ν contained in the survival data and base estimation solely on the genotype data. This we refer to as the two-stage approach.