Motivated by a colorectal cancer study, we propose a class of frailty
semi-competing risks survival models to account for the dependence between
disease progression time, survival time, and treatment switching. Properties of
the proposed models are examined and an efficient Gibbs sampling algorithm using
the collapsed Gibbs technique is developed. A Bayesian procedure for assessing
the treatment effect is also proposed. The Deviance Information Criterion (DIC)
with an appropriate deviance function and Logarithm of the Pseudomarginal
Likelihood (LPML) are constructed for model comparison. A simulation study is
conducted to examine the empirical performance of DIC and LPML as well as
the posterior estimates. The proposed method is further applied to analyze data
from a colorectal cancer study.
Competing risks; Panitumumab; Partial treatment switching; Posterior propriety; Semi-Markov model
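As a rough illustration of the LPML criterion mentioned above, the sketch below computes the conditional predictive ordinate (CPO) of each observation as the harmonic mean of its likelihood over posterior draws, then sums the log CPOs. It assumes a matrix of pointwise log-likelihoods is already available from an MCMC run; the function name `lpml` and the input layout are illustrative choices, not the authors' implementation.

```python
import numpy as np

def lpml(loglik):
    """LPML from pointwise log-likelihoods.

    loglik: array of shape (S, n); entry [s, i] is the log-likelihood of
    observation i under posterior draw s (hypothetical input layout).
    Returns (LPML, vector of log CPO_i).
    """
    loglik = np.asarray(loglik, dtype=float)
    n_draws = loglik.shape[0]
    # CPO_i = 1 / mean_s(1 / L_i(theta_s)); evaluated in log space for stability.
    neg = -loglik
    shift = neg.max(axis=0)
    log_mean_inv = shift + np.log(np.exp(neg - shift).sum(axis=0)) - np.log(n_draws)
    log_cpo = -log_mean_inv
    return log_cpo.sum(), log_cpo

# Toy check: one observation, two draws with likelihoods 1 and 0.5.
total, per_obs = lpml(np.log([[1.0], [0.5]]))
print(total)
```

Under this convention a larger LPML indicates a better-fitting model.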
Genetic data are now collected frequently in clinical studies and epidemiological cohort studies. For a large study, it may be prohibitively expensive to genotype all study subjects, especially with the next-generation sequencing technology. Two-phase sampling, such as case-cohort and nested case-control sampling, is cost-effective in such settings but entails considerable analysis challenges, especially if efficient estimators are desired. A different type of missing data arises when the investigators are interested in the haplotypes or the genetic markers that are not on the genotyping platform used for the current study. Valid and efficient analysis of such missing data is also interesting and challenging. This article provides an overview of these issues and outlines some directions for future research.
Case-cohort design; Case-control design; Censoring; Genome-wide association studies; Haplotypes; Next-generation sequencing; Nonparametric likelihood; Single nucleotide polymorphisms; Two-phase study; Women’s Health Initiative
The cross-ratio is an important local measure that characterizes the dependence between bivariate failure times. To estimate the cross-ratio in follow-up studies where delayed entry is present, estimation procedures need to account for left truncation. Ignoring left truncation yields biased estimates of the cross-ratio. We extend the method of Hu et al. (2011) by modifying the risk sets and relevant indicators to handle left-truncated bivariate failure times, which yields the cross-ratio estimate with desirable asymptotic properties that can be shown by the same techniques used in Hu et al. (2011). Numerical studies are conducted.
Bivariate survival; Cross-ratio; Left truncation; Pseudo-partial likelihood; Right censoring
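For reference, the cross-ratio discussed above is usually defined as follows. The notation is an assumption made here for illustration: (T_1, T_2) are the bivariate failure times, S is their joint survival function, and \lambda_1 is the conditional hazard of T_1.

```latex
\theta(t_1, t_2)
  = \frac{\lambda_1(t_1 \mid T_2 = t_2)}{\lambda_1(t_1 \mid T_2 > t_2)}
  = \frac{S(t_1, t_2)\,\dfrac{\partial^2 S(t_1, t_2)}{\partial t_1\,\partial t_2}}
         {\dfrac{\partial S(t_1, t_2)}{\partial t_1}\,
          \dfrac{\partial S(t_1, t_2)}{\partial t_2}}
```

A value of \theta(t_1, t_2) = 1 corresponds to local independence, and values above 1 indicate positive local dependence.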
Two-phase study methods, in which more detailed or more expensive exposure information is only collected on a sample of individuals with events and a small proportion of other individuals, are expected to play a critical role in biomarker validation research. One major limitation of standard two-phase designs is that they are most conveniently employed with study cohorts in which information on longitudinal follow-up and other potential matching variables is electronically recorded. However, in many practical situations, such information may not be readily available for every potential candidate at the sampling stage. Study eligibility needs to be verified by reviewing information from medical charts one by one. In this manuscript, we study in depth a novel study design commonly undertaken in practice that involves sampling until quotas of eligible cases and controls are identified. We propose semiparametric methods to calculate risk distributions and a wide variety of prediction indices when outcomes are censored failure times and data are collected under the quota sampling design. Consistency and asymptotic normality of our estimators are established using empirical process theory. Simulation results indicate that the proposed procedures perform well in finite samples. Application is made to the evaluation of a new risk model for predicting the onset of cardiovascular disease.
Biomarker; Nested case-control study; Prediction accuracy; Prognosis; Quota sampling; Risk prediction
Two common features of clinical trials, and other longitudinal studies, are (1) a primary interest in composite endpoints, and (2) the problem of subjects withdrawing prematurely from the study. In some settings, withdrawal may only affect observation of some components of the composite endpoint, for example when another component is death, information on which may be available from a national registry. In this paper, we use the theory of augmented inverse probability weighted estimating equations to show how such partial information on the composite endpoint for subjects who withdraw from the study can be incorporated in a principled way into the estimation of the distribution of time to composite endpoint, typically leading to increased efficiency without relying on additional assumptions above those that would be made by standard approaches. We describe our proposed approach theoretically, and demonstrate its properties in a simulation study.
Augmented inverse probability weighted estimator; Composite endpoint; Missing data; Nelson–Aalen estimator; Semi-parametric efficiency; Withdrawal
Nonparametric estimators of component and system life distributions are developed and presented for situations where recurrent competing risks data from series systems are available. The use of recurrences of components’ failures leads to improved efficiencies in statistical inference, thereby leading to resource-efficient experimental or study designs or improved inferences about the distributions governing the event times. Finite and asymptotic properties of the estimators are obtained through simulation studies and analytically. The detrimental impact of parametric model misspecification is also vividly demonstrated, lending credence to the virtue of adopting nonparametric or semiparametric models, especially in biomedical settings. The estimators are illustrated by applying them to a data set pertaining to car repairs for vehicles that were under warranty.
Recurrent events; Competing risks; Perfect and partial repairs; Martingales; Survival analysis; Repairable systems; Nonparametric methods
Developing individualized prediction rules for disease risk and prognosis has played a key role in modern medicine. When new genomic or biological markers become available to assist in risk prediction, it is essential to assess the improvement in clinical usefulness of the new markers over existing routine variables. Net reclassification improvement (NRI) has been proposed to assess improvement in risk reclassification in the context of comparing two risk models and the concept has been quickly adopted in medical journals. We propose both nonparametric and semiparametric procedures for calculating NRI as a function of a future prediction time t with a censored failure time outcome. The proposed methods accommodate covariate-dependent censoring, therefore providing more robust and sometimes more efficient procedures compared with the existing nonparametric-based estimators. Simulation results indicate that the proposed procedures perform well in finite samples. We illustrate these procedures by evaluating a new risk model for predicting the onset of cardiovascular disease.
Inverse probability weighted (IPW) estimator; Net reclassification improvement (NRI); Risk prediction; Survival analysis
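To make the NRI concept concrete, here is a minimal sketch of the basic category-based NRI for a fully observed binary outcome. It deliberately ignores censoring and the dependence on the prediction time t, so it is not the nonparametric or semiparametric estimator proposed in the abstract; the function name and the risk cutoffs are hypothetical.

```python
import numpy as np

def nri(risk_old, risk_new, event, cutoffs):
    """Category-based NRI, assuming every outcome is observed (no censoring)."""
    event = np.asarray(event).astype(bool)
    # Assign each subject to a risk category under the old and the new model.
    cat_old = np.digitize(risk_old, cutoffs)
    cat_new = np.digitize(risk_new, cutoffs)
    up = cat_new > cat_old
    down = cat_new < cat_old
    # Events should move up; non-events should move down.
    nri_event = up[event].mean() - down[event].mean()
    nri_nonevent = down[~event].mean() - up[~event].mean()
    return nri_event + nri_nonevent

value = nri(
    risk_old=[0.05, 0.15, 0.15, 0.25],
    risk_new=[0.15, 0.25, 0.05, 0.25],
    event=[1, 1, 0, 0],
    cutoffs=[0.1, 0.2],
)
print(value)
```

With censored failure times, the event and non-event proportions can no longer be computed by simple averaging, which is the gap the proposed procedures address.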
This paper studies the generalized semiparametric regression model for longitudinal data where the covariate effects are constant for some and time-varying for others. Different link functions can be used to allow more flexible modelling of longitudinal data. The nonparametric components of the model are estimated using a local linear estimating equation and the parametric components are estimated through a profile estimating function. The method automatically adjusts for heterogeneity of sampling times, allowing the sampling strategy to depend on the past sampling history as well as possibly time-dependent covariates without specifically modelling such dependence. A K-fold cross-validation bandwidth selection is proposed as a working tool for locating an appropriate bandwidth. A criterion for selecting the link function is proposed to provide better fit of the data. Large sample properties of the proposed estimators are investigated. Large sample pointwise and simultaneous confidence intervals for the regression coefficients are constructed. Formal hypothesis testing procedures are proposed to check for the covariate effects and whether the effects are time-varying. A simulation study is conducted to examine the finite sample performances of the proposed estimation and hypothesis testing procedures. The methods are illustrated with a data example.
Asymptotics; Kernel smoothing; Link function; Sampling adjusted estimation; Testing time-varying effects; Weighted least squares
Counting process models have played an important role in survival and event history analysis for more than 30 years. Nevertheless, almost all models that are being used have a very simple structure. Analyzing recurrent events invites the application of more complex models with dynamic covariates. We discuss how to define valid models in such a setting. One has to check carefully that a suggested model is well defined as a stochastic process. We give conditions for this to hold. Some detailed discussion is presented in relation to a Cox type model, where the exponential structure combined with feedback leads to an exploding model. In general, counting process models with dynamic covariates can be formulated to avoid explosions. In particular, models with a linear feedback structure do not explode, making them useful tools in general modeling of recurrent events.
Recurrent events; Cox regression; Explosion; Honest process; Birth process; Aalen regression; Stochastic differential equation; Lipschitz condition; Feller criterion; Martingale problem
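The contrast between exponential and linear feedback can be illustrated with a simplified pure birth process in which the only dynamic covariate is the number of previous events k. This is an assumption made for illustration, not the general setting of the paper. For such a process the expected time to accumulate infinitely many events is the sum of the mean waiting times 1/rate(k), and the process explodes exactly when that sum is finite. The parameter `beta` and both waiting-time functions are hypothetical.

```python
import math

beta = 0.5  # hypothetical feedback strength

def mean_wait_exponential(k):
    # Cox-type feedback: intensity exp(beta * k), so the mean wait is exp(-beta * k).
    return math.exp(-beta * k)

def mean_wait_linear(k):
    # Linear feedback: intensity 1 + beta * k.
    return 1.0 / (1.0 + beta * k)

def expected_time_to_n_events(mean_wait, n_events):
    """Sum of expected waiting times for the first n_events events."""
    return sum(mean_wait(k) for k in range(n_events))

for n in (10, 100, 1000, 10000):
    print(
        n,
        round(expected_time_to_n_events(mean_wait_exponential, n), 4),
        round(expected_time_to_n_events(mean_wait_linear, n), 4),
    )
```

The first column of sums converges to 1 / (1 - exp(-beta)), so infinitely many events are expected in finite time, while the second grows without bound like a harmonic series, so the linear model does not explode.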
Absolute risk is the probability that a cause-specific event occurs in a given time interval in the presence of competing events. We present methods to estimate population-based absolute risk from a complex survey cohort that can accommodate multiple exposure-specific competing risks. The hazard function for each event type consists of an individualized relative risk multiplied by a baseline hazard function, which is modeled nonparametrically or parametrically with a piecewise exponential model. An influence method is used to derive a Taylor-linearized variance estimate for the absolute risk estimates. We introduce novel measures of the cause-specific influences that can guide modeling choices for the competing event components of the model. To illustrate our methodology, we build and validate cause-specific absolute risk models for cardiovascular and cancer deaths using data from the National Health and Nutrition Examination Survey. Our applications demonstrate the usefulness of survey-based risk prediction models for predicting health outcomes and quantifying the potential impact of disease prevention programs at the population level.
Absolute risk; Censored data; Crude risk; Cumulative incidence; NHANES; Survey cohort
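The absolute risk described above has a closed form when the cause-specific hazards are piecewise constant. The sketch below assumes two event types sharing the same interval breakpoints, with hazards that already include any individualized relative risk; the function name and arguments are illustrative, and no survey weighting or variance estimation is attempted.

```python
import math

def absolute_risk(breaks, h1, h2, t):
    """Probability of a cause-1 event in [0, t] with a competing cause 2.

    breaks: interval start points, beginning at 0 (the last interval is open-ended).
    h1, h2: cause-specific hazards, constant within each interval.
    """
    risk, surv = 0.0, 1.0
    for j, start in enumerate(breaks):
        if start >= t:
            break
        end = breaks[j + 1] if j + 1 < len(breaks) else math.inf
        width = min(end, t) - start
        total = h1[j] + h2[j]
        if total > 0:
            # Probability of any event in this interval, given event-free at its start.
            p_any = 1.0 - math.exp(-total * width)
            # A fraction h1 / (h1 + h2) of those events are of cause 1.
            risk += surv * (h1[j] / total) * p_any
            surv *= 1.0 - p_any
    return risk

print(absolute_risk([0.0, 2.0], [0.1, 0.2], [0.3, 0.3], t=5.0))
```

Because the competing hazard enters through the overall survival term, raising `h2` lowers the absolute risk of cause 1 even when `h1` is unchanged.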
We propose an evidence synthesis approach through a degradation model to estimate causal influences of physiological factors on myocardial infarction (MI) and coronary heart disease (CHD). For instance, several studies give incidences of MI and CHD for different age strata, while other studies give relative or absolute risks for strata of the main risk factors of MI or CHD. Evidence synthesis of several studies allows incorporating these disparate pieces of information into a single model. To do this we need to develop a sufficiently general dynamical model; we also need to estimate the distribution of explanatory factors in the population. We develop a degradation model for both MI and CHD using a Brownian motion with drift, and the drift is modeled as a function of indicators of obesity, lipid profile, inflammation and blood pressure. Conditionally on these factors the times to MI or CHD have inverse Gaussian (IG) distributions. The results we want to fit are generally not conditional on all the factors and thus we need marginal distributions of the time of occurrence of MI and CHD; this leads us to manipulate the inverse Gaussian normal distribution (IGN, an IG whose drift parameter has a normal distribution). Another possible model arises if a factor modifies the threshold. This led us to define an extension of IGN obtained when both drift and threshold parameters have normal distributions. We applied the model to results published in five important studies of MI and CHD and their risk factors. The fit of the model using the evidence synthesis approach was satisfactory and the effects of the four risk factors were highly significant.
Causality; Causal inference; Coronary heart disease; Degradation model; Epidemiology; Evidence synthesis; Inverse Gaussian distribution; Myocardial infarction; Stochastic processes
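The link between a Brownian degradation process and the inverse Gaussian distribution can be stated explicitly. The symbols below are assumptions introduced for illustration: \nu is the drift, \sigma the diffusion coefficient, W a standard Brownian motion, and a > 0 the failure threshold.

```latex
X(t) = \nu t + \sigma W(t), \qquad
T = \inf\{\, t > 0 : X(t) \ge a \,\},
\qquad
f_T(t) = \frac{a}{\sigma\sqrt{2\pi t^{3}}}
         \exp\!\left\{ -\frac{(a - \nu t)^{2}}{2\sigma^{2} t} \right\},
\quad t > 0 .
```

For \nu > 0 this is a proper inverse Gaussian density with mean a/\nu. Letting \nu vary across individuals according to a normal distribution, as the abstract describes, yields the marginal distribution of the event time.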
The AUC (area under the ROC curve) is a commonly used metric to assess discrimination of risk prediction rules; however, standard errors of the AUC are usually based on the Mann-Whitney U test, which assumes independence of sampling units. For ophthalmologic applications, it is desirable to assess risk prediction rules based on eye-specific outcome variables, which are generally highly but not perfectly correlated in fellow eyes (e.g., progression of individual eyes to age-related macular degeneration (AMD)). In this article, we use the extended Mann-Whitney U test (Rosner et al., 2009) for the case where subunits within a cluster may have different progression status and assess discrimination of different prediction rules in this setting. Both data analyses based on progression of AMD and simulation studies show reasonable accuracy of this extended Mann-Whitney U test to assess discrimination of eye-specific risk prediction rules.
risk prediction; ROC curves; clustered data; GEE
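The connection between the AUC and the Mann-Whitney U statistic can be sketched as follows. This is the standard estimator for independent units only: it counts the proportion of (progressor, non-progressor) pairs in which the progressor has the higher score, with ties counted as one half. It does not implement the clustered extension of Rosner et al.; the function name is illustrative.

```python
def auc_mann_whitney(scores_pos, scores_neg):
    """AUC as the Mann-Whitney U statistic divided by the number of pairs.

    Assumes independent sampling units (no within-cluster correlation).
    """
    wins = 0.0
    for x in scores_pos:
        for y in scores_neg:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5  # ties contribute one half
    return wins / (len(scores_pos) * len(scores_neg))

print(auc_mann_whitney([0.9, 0.8, 0.4], [0.7, 0.3]))
```

The point estimate stays the same when fellow eyes are correlated; what changes is its variance, which is why the usual standard error is not appropriate for eye-specific outcomes.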
Multiple biomarkers are frequently observed or collected for detecting or understanding a disease. The research interest of this paper is to extend tools of ROC analysis from the univariate marker setting to the multivariate marker setting for evaluating predictive accuracy of biomarkers using a tree-based classification rule. Using an arbitrarily combined and-or classifier, an ROC function together with a weighted ROC function (WROC) and their conjugate counterparts are introduced for examining the performance of multivariate markers. Specific features of the ROC and WROC functions and other related statistics are discussed in comparison with the familiar properties for a univariate marker. Nonparametric methods are developed for estimating the ROC and WROC functions, the area under the curve (AUC), and the concordance probability. With emphasis on population average performance of markers, the proposed procedures and inferential results are useful for evaluating marker predictability based on multivariate marker measurements with different choices of markers, and for evaluating different and-or combinations in classifiers.
Concordance probability; Multiple markers; Prediction accuracy; U-statistics
In many clinical applications, understanding when measurement of new markers is necessary to provide added accuracy to existing prediction tools could lead to more cost effective disease management. Many statistical tools for evaluating the incremental value (IncV) of the novel markers over the routine clinical risk factors have been developed in recent years. However, most existing literature focuses primarily on global assessment. Since the IncVs of new markers often vary across subgroups, it would be of great interest to identify subgroups for which the new markers are most/least useful in improving risk prediction. In this paper we provide novel statistical procedures for systematically identifying potential traditional-marker based subgroups in whom it might be beneficial to apply a new model with measurements of both the novel and traditional markers. We consider various conditional time-dependent accuracy parameters for censored failure time outcome to assess the subgroup-specific IncVs. We provide non-parametric kernel-based estimation procedures to calculate the proposed parameters. Simultaneous interval estimation procedures are provided to account for sampling variation and adjust for multiple testing. Simulation studies suggest that our proposed procedures work well in finite samples. The proposed procedures are applied to the Framingham Offspring Study to examine the added value of an inflammation marker, C-reactive protein, on top of the traditional Framingham risk score for predicting 10-year risk of cardiovascular disease.
Incremental value; Partial area under the ROC curve; Prognostic accuracy; Risk prediction; Subgroup analysis; Time dependent ROC analysis
The area under the receiver operating characteristic curve (AUC) is the most commonly reported measure of discrimination for prediction models with binary outcomes. However, recently it has been criticized for its inability to increase when important risk factors are added to a baseline model with good discrimination. This has led to the claim that the reliance on the AUC as a measure of discrimination may miss important improvements in clinical performance of risk prediction rules derived from a baseline model. In this paper we investigate this claim by relating the AUC to measures of clinical performance based on sensitivity and specificity under the assumption of multivariate normality. The behavior of the AUC is contrasted with that of discrimination slope. We show that unless rules with very good specificity are desired, the change in the AUC does an adequate job as a predictor of the change in measures of clinical performance. However, stronger or more numerous predictors are needed to achieve the same increment in the AUC for baseline models with good versus poor discrimination. When excellent specificity is desired, our results suggest that the discrimination slope might be a better measure of model improvement than AUC. The theoretical results are illustrated using a Framingham Heart Study example of a model for predicting the 10-year incidence of atrial fibrillation.
risk prediction; discrimination; AUC; IDI; Youden index; relative utility
When an existing risk prediction model is not sufficiently predictive, additional variables are sought for inclusion in the model. This paper addresses study designs to evaluate the improvement in prediction performance that is gained by adding a new predictor to a risk prediction model. We consider studies that measure the new predictor in a case–control subset of the study cohort, a practice that is common in biomarker research. We ask if matching controls to cases with regard to baseline predictors improves efficiency. A variety of measures of prediction performance are studied. We find through simulation studies that matching improves the efficiency with which most measures are estimated, but can reduce efficiency for some. Efficiency gains are smaller when more controls per case are included in the study. A method that models the distribution of the new predictor in controls appears to improve estimation efficiency considerably.
Classification; Diagnosis; Medical decision making; Receiver operating characteristic curve
Recurrent event data are often encountered in biomedical research, for example, recurrent infections or recurrent hospitalizations for patients after renal transplant. In many studies, there is more than one type of event of interest. Cai and Schaubel (2004) advocated a proportional marginal rate model for multiple type recurrent event data. In this paper, we propose a general additive marginal rate regression model. An estimating equations approach is used to obtain the estimators of the regression coefficients and the baseline rate function. We prove the consistency and asymptotic normality of the proposed estimators. The finite sample properties of our estimators are demonstrated by simulations. The proposed methods are applied to the India renal transplant study to examine risk factors for bacterial, fungal and viral infections.
additive model; empirical process; multiple type recurrent events; recurrent events
In this paper we consider a problem from hematopoietic cell transplant (HCT) studies where there is interest in assessing the effect of haplotype match for donor and patient on the cumulative incidence function for right-censored competing risks data. For the HCT study, the donor's and patient's genotypes are fully observed and matched but their haplotypes are missing. In this paper we describe how to deal with missing covariates of each individual for competing risks data. We suggest a procedure for estimating the cumulative incidence functions for a flexible class of regression models when there are missing data, and establish the large sample properties. Small sample properties are investigated using simulations in a setting that mimics the motivating haplotype matching problem. The proposed approach is then applied to the HCT study.
Binomial modeling; Bone marrow transplant; Competing risks; Haplotype effects; Haplotype match; Missing covariates; Inverse-censoring probability weighting; Nonparametric effects; Non-proportionality; Regression effects
Time-to-event data in which failures are only assessed at discrete time points are common in many clinical trials. Examples include oncology studies where events are observed through periodic screenings such as radiographic scans. When the survival endpoint is acknowledged to be discrete, common methods for the analysis of observed failure times include the discrete hazard models (e.g., the discrete-time proportional hazards and the continuation ratio model) and the proportional odds model. In this manuscript, we consider estimation of a marginal treatment effect in discrete hazard models where the constant treatment effect assumption is violated. We demonstrate that the estimator resulting from these discrete hazard models is consistent for a parameter that depends on the underlying censoring distribution. An estimator that removes the dependence on the censoring mechanism is proposed and its asymptotic distribution is derived. Basing inference on the proposed estimator allows for statistical inference that is scientifically meaningful and reproducible. Simulation is used to assess the performance of the presented methodology in finite samples.
Censoring; Estimating equations; Discrete survival endpoints; Model misspecification; Robust inference
Competing risks data are routinely encountered in various medical applications due to the fact that patients may die from different causes. Recently, several models have been proposed for fitting such survival data. In this paper, we develop a fully specified subdistribution model for survival data in the presence of competing risks via a subdistribution model for the primary cause of death and conditional distributions for other causes of death. Various properties of this fully specified subdistribution model have been examined. An efficient Gibbs sampling algorithm via latent variables is developed to carry out posterior computations. Deviance Information Criterion (DIC) and Logarithm of the Pseudomarginal Likelihood (LPML) are used for model comparison. An extensive simulation study is carried out to examine the performance of DIC and LPML in comparing the cause-specific hazards model, the mixture model, and the fully specified subdistribution model. The proposed methodology is applied to analyze a real dataset from a prostate cancer study in detail.
Latent variables; Markov chain Monte Carlo; Partial likelihood; Proportional hazards
Standard methods for estimating the effect of a time-varying exposure on survival may be biased in the presence of time-dependent confounders themselves affected by prior exposure. This problem can be overcome by inverse probability weighted estimation of Marginal Structural Cox Models (Cox MSM), g-estimation of Structural Nested Accelerated Failure Time Models (SNAFTM) and g-estimation of Structural Nested Cumulative Failure Time Models (SNCFTM). In this paper, we describe a data generation mechanism that approximately satisfies a Cox MSM, an SNAFTM and an SNCFTM. Besides providing a procedure for data simulation, our formal description of a data generation mechanism that satisfies all three models allows one to assess the relative advantages and disadvantages of each modeling approach. A simulation study is also presented to compare effect estimates across the three models.
A copula model for bivariate survival data with hybrid censoring is
proposed to study the association between survival time of individuals infected
with HIV and persistence time of infection with an additional virus. Survival
with HIV is right censored and the persistence time of the additional virus is
subject to interval censoring case 1. A pseudo-likelihood method is developed to
study the association between the two event times under such hybrid censoring.
Asymptotic consistency and normality of the pseudo-likelihood estimator are
established based on empirical process theory. Simulation studies indicate good
performance of the estimator with moderate sample size. The method is applied to
a motivating HIV study which investigates the effect of GB virus type C (GBV-C)
co-infection on survival time of HIV infected individuals.
Association measure; Bivariate survival model; Copula; Current status data; Kendall's τ; Right censored data; Empirical process
In many biomedical studies, it is common that, due to budget constraints, the primary covariate is only collected in a randomly selected subset of the full study cohort. Often, there is an inexpensive auxiliary covariate for the primary exposure variable that is readily available for all the cohort subjects. Valid statistical methods that make use of the auxiliary information to improve study efficiency need to be developed. To this end, we develop an estimated partial likelihood approach for correlated failure time data with auxiliary information. We assume a marginal hazard model with a common baseline hazard function. The asymptotic properties of the proposed estimators are developed. The proof of the asymptotic results is nontrivial since the moments used in the estimating equation are not martingale-based and classical martingale theory is not sufficient. Instead, our proofs rely on modern empirical process theory. The proposed estimator is evaluated through simulation studies and is shown to have increased efficiency compared to existing methods. The proposed methods are illustrated with a data set from the Framingham study.
Marginal hazard model; Correlated failure time; Validation set; Auxiliary covariate
Multi-state models provide a convenient statistical framework for a wide variety of medical applications characterized by multiple events and longitudinal data. We illustrate this through four examples. The potential value of the incorporation of unobserved or partially observed states is highlighted. In addition, joint modelling of multiple processes is illustrated with application to potentially informative loss to follow-up, mis-measured or misclassified data and causal inference.
Causal inference; Classification uncertainty; Informative missing data; Multi-state models; Time dependent explanatory variables
We derive estimators of the mean of a function of a quality-of-life adjusted failure time, in the presence of competing right censoring mechanisms. Our approach allows for the possibility that some or all of the competing censoring mechanisms are associated with the endpoint, even after adjustment for recorded prognostic factors, with the degree of residual association possibly different for distinct censoring processes. Our methods generalize from a single to many censoring processes and from ignorable to non-ignorable censoring processes.
Cause-specific; Dependent censoring; Inverse weighted probability; Sensitivity analysis