Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Biometrics. Author manuscript; available in PMC 2010 December 1.
Published in final edited form as:
PMCID: PMC2819485

Estimated Pseudo-Partial-Likelihood Method for Correlated Failure Time Data with Auxiliary Covariates


As biological studies become more expensive to conduct, statistical methods that take advantage of existing auxiliary information about an expensive exposure variable are desirable in practice. Such methods should improve the study efficiency and increase the statistical power for a given number of assays. In this paper, we consider an inference procedure for multivariate failure time with auxiliary covariate information. We propose an estimated pseudo-partial likelihood estimator under the marginal hazard model framework and develop the asymptotic properties for the proposed estimator. We conduct simulation studies to evaluate the performance of the proposed method in practical situations and demonstrate the proposed method with a data set from the Studies of Left Ventricular Dysfunction (SOLVD,1991).

Keywords: Auxiliary covariate, Marginal hazard model, Multivariate data, Pseudo-partial likelihood, Validation sample

1. Introduction

Statistical methods are usually developed assuming that the exposure variable is fully observed. In many studies, due to financial limitations or technical difficulties, the true exposure may only be measured precisely in a subset of the study cohort. This subset is often referred to as the validation set. With the continuing advancement in the use of biological markers in epidemiology and genetic studies, which often involve expensive assays, there is a growing incentive to further improve study efficiency and power by optimally incorporating into the statistical analysis the available auxiliary covariate. For example, in the Studies of Left Ventricular Dysfunction (SOLVD,1991) prevention trial, it is of interest to assess the effects of covariates (e.g. ejection fraction, intervention, and gender) on the risk of heart failure and on the first myocardial infarction. The gold standard of the ejection fraction (LVEF) measurement was to use a standardized radionucleotide technique. Since this standardized radionucleotide technique is too expensive to be used on every patient, the LVEF was only measured for a subset of 108 out of a total of the 4228 SOLVD patients. A cheaper and easily obtained measure of ejection fraction (EF) was, however, ascertained for all the patients using a nonstandardized technique. EF was considered as an auxiliary covariate of LVEF. The auxiliary covariate is defined as the surrogate information (the nonstandardized EF measure in the SOLVD study) that relates to the true exposure variable (LVEF) but provides no additional information to the regression model when the true covariates are known. Some proposed methods have been developed for the univariate survival time data in the areas of mismeasured covariates, missing data, and auxiliary covariate problems. This includes but is not limited to Pepe and Fleming (1991), Carroll and Wang (1991), Lin and Ying (1993), Zhou and Pepe (1995), Wang, Lin, and Gutierrez (1998), Hu, Tsiatis and Davidian (1998), Tsiatis and Davidian (2001), Huang and Wang (2000), Zhou and Wang (2000), Hu and Lin (2002) and Wang and Zhou (2006).

All the aforementioned studies assumed that each failure time is taken from independent subjects. In practice, multivariate or correlated failure time data with auxiliary data is just as likely to be encountered. For example, in the SOLVD study, the heart failure time and the first myocardial infarction time from the same subject could be correlated. Models dealing with multivariate failure time data where the true covariates of interest are fully available for all subjects have been well studied. In particular, if the correlation among the observations is not of interest, the marginal proportional hazards model is widely used, e.g., Wei, Lin and Weissfeld (1989); Lee, Wei and Amato (1992); Liang, Self and Chang (1993); Lin (1994); Cai and Prentice (1995, 1997); Spiekerman and Lin (1998); Clegg, Cai and Sen (2000). There has been limited progress on the methods for dealing with covariate measurement error for multivariate failure time. Greene and Cai (2004) proposed using the SIMEX approach for handling measurement errors in the marginal hazards model for multivariate failure time data, when a validation set is not available.

In this paper, assuming a validation set is available, we develop an estimated pseudo partial likelihood method for handling auxiliary covariates in the presence of a validation sample under the framework of the marginal hazards model with distinguishable baseline hazards (Wei et. al., 1989). The auxiliary covariate could be a mismeasured surrogate to the true covariate, or any covariate which is informative about the true covariate. The proposed method is nonparametric with respect to the conditional distribution of the exposure variable conditional on the auxiliary covariate.

The rest of the paper is organized as follows. In Section 2, we outline the model and present the estimated pseudo-partial likelihood estimator. We develop the asymptotic properties of the proposed estimator and propose a variance estimator in Section 3. In Section 4, we evaluate the proposed methodology through simulation studies. We apply the proposed estimator to study the effect of ejection fraction on the risk of heart failure and first myocardial infarction using the data from the SOLVD study. Final remarks are given in Section 5. Outline of the proof for theoretical results are given in the Web Appendix.

2. Model and Estimation

2.1 Notation and Data Structure

Suppose that there is a random sample of n independent subjects from an underlying population and that there are K different types of failures of interest for each subject. Let (i, k) denote the kth failure type for the ith subject. Let Tik and Cik denote the potential failure time and censoring time, respectively. With censoring, we observe Xik = min(Tik, Cik). Let Δik = I(XikCik) be the failure indicator and Yik(t) = I(Xikt) denote the at-risk indicator process. Let (Eik, Zik) denote a set of covariates, where Eik is the primary exposure subjecting to missing and Zik = (Zik1, ···, Zikp)’ is the remaining covariate vector that is always observed. We denote variable A as an auxiliary variable for the exposure variable E, assuming that conditional on E, A provides no additional information to the regression model, i.e. λ(t; E(t), Z(t), A(t)) = λ(t; E(t), Z(t)).

Suppose that there is a simple random validation sample with sample size nv, denoted by V, such that subjects belonging to V have their (E, A) measured. Similarly, let V denote the remaining subjects, the non-validation set, and assume that the subjects in V will only have their A measured. Hence, the observed data structure for (i, k) is:


2.2 Models and Estimated Pseudo-Partial Likelihood Function

Assume that, the marginal hazard function for the kth failure type of subject i takes the form:


where Eik is an m-vector consisting of Eik and possibly interaction terms between Eik and some fully observed covariates, β=(β1,β2) is the relative risk parameter to be estimated, and λ0k(t) is an unspecified marginal baseline hazard function pertaining to the type k failure.

If subject i belongs to the validation set, then Zik and Eik are observed and the marginal model takes the form as in (1). If subject i belongs to the non-validation set V, we only observe Zik(t) and Aik(t). Under this situation, we can show, using the argument of Prentice (1982) and Zhou and Pepe (1995), that the hazard function for λik(t; Zik(t), Aik(t)) satisfied the induced model


where A* includes auxiliary variable A and the part of the information in covariate Z that, given A, are still related to E. That is, A* satisfying the following conditional dependence f(Eik(t)Xikt,Zik(t),Aik(t))=f(Eik(t)Xikt,Aik(t)). Notice that under this formulation, A* still satisfies the auxiliary assumption that given E and Z, A* does not contribute to the regression model, i.e., λ(t; Z(t), E(t), A*(t)) = λ(t; Z(t), E(t)).

Equation (2) implies that this induced hazard model is also a proportional hazard model with the relative risk function exp(β2Zik(t))ϕik(β1;t), where ϕik(β1,t)=E{eβ1Eik(t)Yik(t)=1,Aik(t)}. Based on (1) and (2), the relative risk function can be written in general as




and the binary variable ηi = 1 or 0 denote whether subject i is in validation set V or not. In addition to the unspecified baseline hazard function λ0k(t), the expectation above also depends on the underlying distributions of Eik(t) and Zik(t). If f(Eik(t)Xikt,Aik(t)) is a known function up to a parameter θ, then the inference about β and θ can be drawn from a pseudo partial likelihood based on the model for multivariate failure times (Wei et. al., 1989; Cai and Prentice, 1995). However, misspecification of such parameterization may lead to biased estimates. We develop an estimated pseudo-partial likelihood approach for correlated failure time data that avoids making undesirable parametric assumptions on the conditional distribution.

If all the observations were independent, we could write the partial likelihood as


When the failure times within a subject are not independent, the above function is referred to as the pseudo-partial likelihood (Wei et. al., 1989; Cai and Prentice, 1995). Since the induced relative risk function rik(β, t) is unknown, we will estimate it by using the validation set. Without loss of generality, we assume that {Aik} are identically distributed categorical variables with the distribution Pr(A* = am) = pm, m = 1, ···, L, m=1Lpm=1. Hence, if subject i is in the non-validation set V, we will estimate the induced hazard function for the kth type failure, ϕik(β1, t), as:


It follows that the estimated relative risk function is: r^ik(β,t)=R^ik(β1,t)exp[β2Zik(t)], where R^ik(β1,t)=exp(β1Eik(t))ηi+ϕ^ik(β1,t)(1ηi).

Replacing rik(β, t) by [r with circumflex]ik(β, t) in (3), we obtain an estimated pseudo-partial likelihood function:


We define our proposed estimator [beta]E as the maximizer of (5). [beta]E can be obtained by solving the estimated pseudo-partial likelihood score equation, Û(β) = ([partial differential]/[partial differential]β){log EPPL(β)} = 0, where


and Nik(t) = I(Xikt, Δik = 1) is the counting process corresponding to failure time Tik. For a function g(β, u), g(j)(β, u) denotes the jth derivative of g(β, u) with respect to β. A Newton-Raphson iterative procedure can be invoked to obtain [beta]E.

3. Asymptotic Properties

To investigate the asymptotic properties of the estimated pseudo-partial likelihood estimator [beta]E, we define the following notations. For a vector a, define a[multiply sign in circle]0 = 1, a[multiply sign in circle]1 = a, a[multiply sign in circle]2 = aa’, ||a|| = supi |ai|. For a matrix A, define ||A|| = supi,j |aij|. We also define


Assume that the study duration is from 0 to τ. Suppose that β0=(β10,β20) is the true hazards parameter. Our asymptotic results rely on the following assumptions:

  • (i) (Finite interval): 0τλ0k(t)dt< (k = 1, ···, K);
  • (ii) Pr(Yik(t)=1Aik(t)=am)>0, for m = 1, 2, ···, L;
  • (iii) For any k = 1, ···, K, there exits a neighborhood B2 of β20 such that
  • (iv) There exists an open set B1, containing β10, such that ϕik(β1, t) is bounded away from 0 on B1 × [0, τ]. Σ(β0), as defined in Theorem 2, is positive definite.
  • (v) For any k = 1, ···, K,
  • (vi) supt[0,τ]Lk(d)(t)=Op(1), d = 0, 1, where
    and γik(β1,t)=exp(β1Eik).

Following closely the argument of Foutz (1977), we can show that [beta]E is consistent for β0. To show the asymptotic normality of [beta]E, we use the Taylor expansion of the score equation which, using martingale representation and theory, can be shown to be asymptotically equivalent to a sum of two independent terms. Each of the terms can be shown to be a sum of independent vectors. The multivariate central limit theorem is then applied. We summarize the results in the following theorems and give the outline of the proofs in the Web Appendix.

Theorem 1

(Consistency) [beta]E is a consistent estimator of β0 under assumptions (i)-(vi).

Theorem 2

(Asymptotic Normality) Under the assumptions (i)-(vi), we have that n1/2([beta]E - β0) is asymptotically normally distributed with mean zero and variance matrix ΣEPPL(β0)=Σ1(β0)[Σ1(β0)+Σ2(β0)]Σ1(β0)T, where




Here sk(11)(β0,t) is the first m elements of sk(1)(β0,t) and sk(12)(β0,t) contains the remaining p elements, so sk(1)(β0,t)=(sk(11)(β0,t)sk(12)(β0,t)), and


q=limn(nvn) denotes the validation fraction and Mik(t)=Nik(t)0tλik(u)du is the marginal martingale.

Remark 1

Observe that, when the validation fraction q = 1, the variance matrix ΣEPPL is the same as that of the usual pseudo-partial likelihood estimator (Wei et. al., 1989; Cai and Prentice, 1995), as it should be.

The variance estimator for [beta]E can be consistently estimated by replacing the population quantities in the asymptotic covariance matrix ΣEPPL(β0) with their corresponding sample quantities. The cumulative hazard Λ0k(t)=0tλ0k(u)du can be estimated by Aalen-Breslow type of estimator:


where S^k(0)(β,s) is the empirical estimator of sk(0)(β,s) which is defined as:


4. Numerical Studies

4.1 Simulation Studies

In this section, we examine the finite sample properties of the proposed estimator [beta]E via simulation studies. We compare [beta]E with two naive estimators, [beta]V, which is the pseudo-partial likelihood estimator (Wei et. al., 1989) based only on the validation set, and [beta]N, which is the pseudo-partial likelihood estimator using the auxiliary variable in place of the true exposure variable. We evaluate these estimators under various situations with different levels of censoring proportions, validation fractions, dependence between failure times within a subject, and the informativeness of the auxiliary covariate.

We generated the multivariate failure times from the popular multivariate Clayton and Cuzick (1985) distribution with exponential marginals. The joint survival distribution function takes the form


where Dk=(Ek,Zk) and β(k) is the corresponding coefficient for Dk. The parameter θ, where θ > 0, controls the degree of dependence between Tj and Tl (j, l = 1, ···, K), with θ → ∞ corresponding to independence and θ → 0 corresponding to increasing positive correlation. We take λ0k to be an arbitrary constant baseline hazard. We simulate failure times that follow Clayton-Cuzick distribution by transforming independent uniform (0, 1) variables (u1, ···, uK) as follows: t1=(1λ01)eβ(1)D1×ln(1u1) and tk=(1λ0k)eβ(k)Dkln[(k1)i=1k1ai+(i=1k1ai(K2))(1uk)(θ+k1)1], for k = 2, ···, K, where al=eθ1tleβ(l)Dl for l = 1, ···, k - 1. We considered an exposure variable E which could have different effect for different failure types, and an adjustment covariate Z which has the same effect for different failure types. The generated failure times satisfy the following marginal hazards model:


where βk1 denotes the exposure (E) effect on the kth failure type and β2 is the common effect of Z.

We simulate K = 2 failure types with baseline hazard λ0k = 1. We consider (β11, β21) = (0, 0) or (β11, β21)= (ln(2), ln(1.3))=(0.693, 0.262) and β2 = -0.2. We set θ = 0.25, 1.5, or 5.7, which respectively represents a strong, modest, or weak positive dependence between failure times within a subject. When β11 = β21 = 0 and β2 = -0.2 with 20% censoring, these values of θ correspond to the correlation coefficients (for the failure times within a subject) of 62%, 30% and 10%, respectively. Censoring times were generated from uniform distribution over (0, c), where c was chosen to yield the censoring percentage of 20% or 50%.

Mimicking the SOLVD study data structure, where f(E|A, Z) = f(E|A), we create our exposure and auxiliary variables for each subject from the following scheme. We generate the partly observed covariate (E1, E2) from a multivariate normal distribution with marginal mean of 0, standard deviation of 1, and the correlation between E1 and E2 being r = 0 and 0.8, which represent cases of independence and strong dependence between E1 and E2. The fully observed covariate (Z1, Z2) are generated from independent normal distribution N(0, 1). To generate auxiliary variable A, we first generate Wk = Ek + ek, (k = 1, 2), where ek ~ N(0, σ2), and σ is the parameter that controls the strength of the association between Ek and Wk. The auxiliary covariate A is then assigned the value of 1, 2, 3 or 4 based on whether Wk is in the interval (-∞, Q1], (Q1, Q2], (Q2, Q3], or (Q3, +∞), where Q1, Q2, Q3 are the quartiles of W. The parameter σ also controls the strength of the association between E and A: as σ2 increases, A becomes less informative about E. We set σ = 0.2 or 0.8.

For each specified set of parameters, the number of independent subjects is n = 200 and each simulation is repeated 1000 times. The sample standard deviation of the 1000 estimates are given in the corresponding SD columns. The SE columns give the average of the estimated standard errors and “95% CI” is the nominal 95% confidence interval coverage of the true parameter using the estimated standard error.

Table 1 displays the simulation results when the exposure E has no effect on failures, i.e. β11 = β21 = 0, under a variety of configurations of the parameters when validation fraction is 50% and censoring rate is 20%. Under this situation, failure times satisfy the following model:


Table 2 provides results under the situation when E has different effect on different failure types. From Tables Tables11 and and2,2, we make the following observations. (i) In Table 1, i.e. when β1=0, all the estimates are approximately unbiased. In Table 2, i.e. when β11 ≠ 0, β21 ≠ 0, [beta]N is biased towards 0. Both the validation estimator [beta]V and the proposed estimator [beta]E are approximately unbiased. (ii) The proposed estimator [beta]E is more efficient than the validation set only estimator [beta]V. (iii) When σ is large, i.e. when A is not as informative about E, [beta]E is less accurate in estimating the true β, e. g. the bias exibited in [beta]E is larger for σ = 0.8. This bias, however, decreases as we increase the sample size to n = 500 (results not shown). (iv) The proposed variance estimator provides a good estimation of ΣEPPL(β). (v) The 95% confidence intervals based on the proposed estimated standard errors provide good coverage for most of the situations studied when σ = 0.2. The coverage rates are lower when σ = 0.8. However, when we increased n to 500 (results not shown), the 95% CI coverage rates increased and were close to 95%.

Table 1
Simulation results for β1 = 0 under model: λik(t; Eik, Zik) = λ0k(t) exp1Eik + β2Zik} (k = 1, 2). Validation fraction is 50% and censoring rate is 20%
Table 2
Simulation results for different covariate effects across failure types under the model: λik(t; Eik(t), Zik(t)) = λ0k(t) expk1Eik(t) + β2Zik(t)} (k = 1, 2). β11 = log(2) = 0.693, β21 = log(1.3) = 0.262, ...

Table 3 compares the estimated relative efficiency of [beta]E vs. [beta]V under different censoring proportions (20%, 50%), validation fractions (30%, 50%, 70%) and strength of correlations between failure times within a subject (θ = 0.25, 1.5, 5.7). The estimated relative efficiency (RE) are calculated as (SD([beta]V)/SD([beta]E))2. From Table 3, we had the following observation: when the validation fraction decreases, the efficiency gain of [beta]E relative to [beta]V increases. For example, when θ = 0.25 and 20% censoring, the REs for ([beta]11, [beta]21, [beta]2) increase from (1.452, 1.399, 1.423) to (3.466, 3.551, 3.894) when validation fraction decreases from 70% to 30%. We observe the same trend when censoring rate is 50% or θ=1.5 or 5.7. This suggests that when the validation fraction is small, using our proposed method is even more beneficial compared to using the estimator based on the validation set only.

Table 3
Relative efficiency for [beta]E vs. [beta]V with β11 = log(2) = 0.693, β21 = log(1.3) = 0.262, β2 = -0.2, r = 0.8, and σ = 0.2

4.2 Analysis of the SOLVD Study Data Set

We illustrated the proposed method with a data set from the SOLVD study to evaluate the effect of ejection fraction on the time of heart failure and the time to first nonfatal myocardial infarction (nonfatal MI). The SOLVD study was a randomized, double-masked, placebo-controlled trial between 1986 and 1991. The trial had a three year recruitment and a two year follow-up. The basic inclusion criteria for the prevention trial were: age between 21 and 80 years, inclusive, no overt symptoms of congestive heart failure, and left ventricular ejection fraction less than 35 percent. Ejection fraction is a number between 0 and 100 that measures the efficiency of the heart in ejecting blood. A total of 4228 patients with asymptomatic left ventricular dysfunction were randomly assigned to receive either enalapril or placebo at one of the 83 hospitals linked to 23 centers in the United States, Canada, and Belgium.

The correlated outcomes of interest are time to heart failure and time to the first nonfatal MI after the randomization. The primary clinical issues of interest are the effects of covariates on the risk of heart failure and on the nonfatal MI after adjusting for the confounding variables. In the SOLVD study, only 108 among the total of 4228 patients have their ejection fraction accurately measured using a standardized radionucleotide technique (LVEF). A related nonstandardized measure (EF) was, however, ascertained for all the patients. Therefore, the nonstandardized measure (EF) is a surrogate measure for the standardized measure for ejection fraction (LVEF) in this case. Both LVEF and EF were measured in percentage.

The average LVEF in the validation set is 28.3% ranging from 12.3% to 45.4%. The average EF for the entire cohort is 19.37% ranging from 1% to 32%. We create a new auxiliary variable VEF taking values a1, ···, a4 depending on whether EF belongs to the interval [min(EF), Q1], (Q1, Q2], (Q2, Q3] and (Q3, max(EF)] respectively, where a1, ···, a4 are the values of the midpoint of the aforementioned intervals, and Q1, Q2, and Q3 are the quartiles of EF. We use VEF as the auxiliary covariate for LVEF. Other covariates we considered here are patient’s gender (SEX), which is coded 1 for male and 0 for female; treatment (TRT), which is coded as 1 for enalapril and 0 for placebo, and patient’s age (AGE), which was measured in years. The average age of the patients is 59 years old with a standard deviation of 10 years. Hence, in terms of the notation in the previous sections, we have E = LVEF, A = VEF, Z = (TRT, SEX, AGE)’. To check whether, for given VEF, LVEF is dependent on TRT, SEX, and AGE, we added TRT, SEX, and AGE to the linear model of LVEF on VEF. The results showed that the TRT, SEX, and AGE effects are not statistically significant for given VEF. We also examined this relationship for subjects who are at risk at some selected time points and the same conclusion was arrived. Hence we took A* = VEF in this case.

We fit the following marginal hazards model to the SOLVD data:



Rik(β1,t)={exp(β1LVEFik+γ1I(k=2)LVEFik)whenLVEFis observedϕik(β1,γ1,t)whenLVEFis missing}



where k denotes failure type with k = 1 for heart failure and k = 2 for nonfatal MI and i denotes the patient with i = 1, ···, 4228.

Table 4 presents the data analysis results. The columns under “Proposed method” list results using the proposed method. The columns under “Validation method” are the results from fitting model (1), using the pseudo-partial likelihood approach, based only on the validation set. The proposed method utilizes the auxiliary information while the validation analysis relies only on the available true ejection fraction.

Table 4
SOLVD data analysis results

An inspection of the point estimates of the covariate effects reveals that the estimates from the proposed method are very close to those from the validation set only analysis, except for SEX. However, the proposed estimator is much more precise than the validation set only analysis, e.g, for the effect of LVEF on heart failure, the standard error is 0.008 for the proposed estimator and 0.038 for the validation only estimator. Consequently, the proposed method provided a tighter 95% confidence intervals, e.g., the 95% CI for LVEF for heart failure is (-0.071, -0.039) for the proposed method and (-0.149, -0.001) for the validation set only analysis. The p-values in Table 4 indicate that at 0.05 significance level, LVEF, TRT, SEX and AGE are all statistically significant for heart failure using the proposed method, while only LVEF is marginally significant for heart failure for the validation analysis. Note that this real data set has an extremely low validation fraction with a moderate validation sample size. From our additional simulations (results not shown), our proposed variance estimator could over-estimate the true variance when the validation fraction is low (see Concluding Remarks). Hence these confidence intervals should be interpreted on the conservative side.

The results from the proposed method also indicate that the effects of ejection fraction, treatment, sex, and age are different for the heart failure and for the non-fatal MI. Specifically, ejection fraction, treatment, sex and age do not seem to affect the risk of non-fatal MI, but is related to the risk of heart failure. The risk of heart failure increases by 2.5% (95% CI: (1.3%, 3.7%)) per year increase in age. With 1% decrease in ejection fraction, the risk of heart failure is about 5.3% (95% CI: (3.8%, 6.8%)) higher. Females are at 25.3% (95% CI: (0, 44.6%)) higher risk for heart failure than males. Enalapril reduces the risk of heart failure by 35.5% (95% CI: (17.6%, 49.5%)).

In conclusion, we found that the pseudo-partial likelihood method using only the validation data yielded no significant covariate effect and the estimated standard errors were much bigger than those from the proposed method. This is because the validation set is only a very small subset (n = 108) of the entire cohort (n = 4228). Utilizing the auxiliary information in the proposed method, we had in effect regained the statistical power of the study that would have been lost had one conducted the analysis using only the validation data set.

5. Concluding Remarks

In this paper, we studied an estimated pseudo-partial likelihood method for multivariate failure time data with an auxiliary covariate. A key feature of this method is that it is nonparametric with respect to the association between the missing covariate and the observed auxiliary covariate. The estimated pseudo-partial likelihood estimator asymptotically follows a normal distribution. Simulation studies demonstrate that the proposed estimator approaches the large sample properties with moderate sample size. These results also show that the proposed estimator outperforms the estimator which uses only data from the validation sample. The proposed variance estimator based on the approximated asymptotic variance performs well. When the auxiliary covariate A is more informative about the partly observed covariate E, the proposed estimator is more efficient.

The real data example also demonstrates that a much more precise estimator can be obtained by incorporating the auxiliary covariate information. The proposed method shows improved statistical power over what would be achieved using only the validation set.

We have a couple of cautionary notes on the limitations of the proposed method. First, the auxiliary variable A* is assumed to be discrete with the number of categories fixed. The asymptotic properties were developed under this assumption. One way to deal with a continuous auxiliary variable is to discretize it into categories and then apply the proposed method. In practice, the number of categories of the auxiliary variable cannot be too large (no more than 6), especially when the sample size is small. In our simulation, we have run into convergence problems when the validation size is less than 60 and the number of categories is greater than 8. We recommend reducing the number of categories of the auxiliary variable if it is greater than 8. Second, if the validation size is small and the validation fraction is also very low, the resulting variance estimator tends to over-estimate the true variance. Increasing the validation sample size helps to alleviate this problem (see Web Table 1).

To fully take advantage of a continuous auxiliary covariate, a nonlinear smoothing version of (4) can be developed. Cai and Prentice (1995) showed that more efficient β-estimators could be obtained by introducing weights into the pseudo-partial likelihood score equations. Future work that introduces suitable weights to our proposed method to improve the efficiency of estimators is certainly warranted.

Supplementary Material


This work was partially supported by SRF for ROCS. SEM. and the National Natural Science Foundation of China 10771163 (Liu) and NIH grants R01 CA79949 (Zhou) and R01 HL57444 (Cai). The authors thank the co-editor, the associate editor, and the two referees for their valuable suggestions which have led to significant improvement of this paper.


6. Supplementary Materials The Web Appendix and Web Table referenced in Sections 1, 3 and 5, are available under the Paper Information link at the Biometrics web site

Contributor Information

Yanyan Liu, School of Mathematics and Statistics, Wuhan University, P. R. of China.

Haibo Zhou, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, N.C. 27599-7420, U.S.A.

Jianwen Cai, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, N.C. 27599-7420, U.S.A.


  • Andersen PK, Gill RD. Cox’s regression model for counting processed: A large sample study. Ann.Statist. 1982;10:1100–20.
  • Breslow NE. Covariate analysis of censored survival data. Biometrics. 1974;30:89–99. [PubMed]
  • Clayton D, Cuzick J. Multivariate generalizations of the propotional hazard model (with discussion) (Series A).J. Royal Statist. Soc. 1985;54:168–184.
  • Cai J, Prentice RL. Estimating equations for hazard ratio parameters based on correlated failure time data. Biometrika. 1995;82:151–164.
  • Cai J, Prentice RL. Regression analysis for correlated failure time data. Lifetime Data Analysis. 1997;3:197–213. [PubMed]
  • Clegg LX, Cai J, Sen PK. Handbook of Statistics. Vol. 18. 2000. Modeling multivariate Failure time data; pp. 804–838.
  • Carroll RJ, Wang MP. Semiparametric estimation in logistic measurement error models. (Series B).J. Roy. Statist. Soc. 1991;53:573–585.
  • Cox DR. Regression Models and Life-Tables. (B).J. Roy. Statist. Soc. 1972;34:187–202.
  • Foutz RV. On the Unique consistent solution to the likelihood equations. Journal of the American Statistical Association. 1977;72:147–148.
  • Greene WF, Cai J. Measurement Error in Covariate in the Marginal Hazards Model for Multivariate Failure Time Data. Biometrics. 2004;60:987–996. [PubMed]
  • Huang Y, Wang CY. Cox Regression with Accurate Covariates Uncertainable-A Noparametric Approach. Journal of the American Statistical Association. 2000;95:1209–1219.
  • Hu C, Lin D. Cox Regression with Covariate Measurement Error. Scand. J. Statist. 2002;29:637–655.
  • Hu P, Tsiatis AA, Davidian M. Estimating the parameters in the Cox model when covariates variables are measured with error. Biometrics. 1998;54:1407–1419. [PubMed]
  • Lee EW, Wei LJ, Amato DA. Cox-Type Regression Analysis for Large Numbers of Small Groups of Correlated Failure Time Observations. In: Klein JP, Goel PK, editors. Survival Analysis: State of the Art. Kluwer Academic Publishers; 1992. pp. 237–247.
  • Liang KY, Self SG, Chang Y. Modeling Marginal Hazards in Multivariate Failure Time Data. (B).J. Royal Statist. Soc. 1993;55:441–453.
  • Lin DY. Cox Regression Analysis of Multivariate Failure Time Data: The Marginal Approach. Statist. Med. 1994;13:2233–2247. [PubMed]
  • Lin DY, Ying Z. Cox regression with incomplete covariate measurements. Journal of the American Statistical Association. 1993;88:1341–1349.
  • Pepe MS, Fleming TR. A nonparametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
  • Prentice RL. covariate measurement errors and parameter estimation in a failure time regression model. Biometrika. 1982;69:331–342.
  • Spiekerman CF, Lin DY. Marginal regression models for multivaraite failure time data. J. Amer. Statist. Assoc. 1998;93:1164–1175.
  • SOLVD Investigators Studies of Left Ventricular Dysfunction (SOLVD)—Rationale, Design and Methods: Two Trials that Evaluate the Effect of Enalapril in Patients with Reduced Ejection Fraction’ The American Journal of Cardiology. 1990;66:315–322. [PubMed]
  • SOLVD Investigators Effect of Enalapril on Survival in Patients with Reduced Left Ventricular Ejection Fractions and Congestive Heart Failure. New England Journal of Medicine. 1991;325:293–302. [PubMed]
  • Tsiatis AA, Davidian M. A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika. 2001;88:447–458.
  • Wang N, Lin X, Gutierrez RG, Carroll RJ. Bias analysis and SIMEX approach in generalized linear mixed measurement error models. Journal of the American Statistical Association. 1998;93:249–261.
  • Wang X, Zhou H. A semiparametric Empirical Likelihood Method for Biased Sampling Schemed with Auxiliary Covariates. Biometrics. 2006;62:1149–1160. [PubMed]
  • Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. Journal of the American Statistical Association. 1989;84:1065–73.
  • Zhou H. University of Washington Ph.D. Thesis. 1992. Auxiliary and Missing Covariate Problems in Failure Time Regression Analysis.
  • Zhou H, Pepe MS. Auxiliary covariate data in failure time regression analysis. Biometrika. 1995;82:139–149.
  • Zhou H, Wang C-Y. Failure time regression with continuous covariates measured with error. J. R. Statist. Soc. B. 2000;62:657–665.