Search tips
Search criteria

Results 1-25 (1159881)

Clipboard (0)

Related Articles

1.  Collaborative Double Robust Targeted Maximum Likelihood Estimation* 
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
PMCID: PMC2898626  PMID: 20628637
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
2.  Targeted Maximum Likelihood Based Causal Inference: Part I 
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention would have been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
PMCID: PMC3126670  PMID: 20737021
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
3.  The Relative Performance of Targeted Maximum Likelihood Estimators 
There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins, et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes, are especially suitable for dealing with positivity violations because in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
PMCID: PMC3173607  PMID: 21931570
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
4.  An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics 
A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
PMCID: PMC3126668  PMID: 21731530
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
5.  Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation 
Statistics in medicine  2009;28(1):39-64.
Covariate adjustment using linear models for continuous outcomes in randomized trials has been shown to increase efficiency and power over the unadjusted method in estimating the marginal effect of treatment. However, for binary outcomes, investigators generally rely on the unadjusted estimate as the literature indicates that covariate-adjusted estimates based on the logistic regression models are less efficient. The crucial step that has been missing when adjusting for covariates is that one must integrate/average the adjusted estimate over those covariates in order to obtain the marginal effect. We apply the method of targeted maximum likelihood estimation (tMLE) to obtain estimators for the marginal effect using covariate adjustment for binary outcomes. We show that the covariate adjustment in randomized trials using the logistic regression models can be mapped, by averaging over the covariate(s), to obtain a fully robust and efficient estimator of the marginal effect, which equals a targeted maximum likelihood estimator. This tMLE is obtained by simply adding a clever covariate to a fixed initial regression. We present simulation studies that demonstrate that this tMLE increases efficiency and power over the unadjusted method, particularly for smaller sample sizes, even when the regression model is mis-specified.
PMCID: PMC2857590  PMID: 18985634
clinical trails; efficiency; covariate adjustment; variable selection
6.  Identification and efficient estimation of the natural direct effect among the untreated 
Biometrics  2013;69(2):310-317.
The natural direct effect (NDE), or the effect of an exposure on an outcome if an intermediate variable was set to the level it would have been in the absence of the exposure, is often of interest to investigators. In general, the statistical parameter associated with the NDE is difficult to estimate in the non-parametric model, particularly when the intermediate variable is continuous or high dimensional. In this paper we introduce a new causal parameter called the natural direct effect among the untreated, discus identifiability assumptions, propose a sensitivity analysis for some of the assumptions, and show that this new parameter is equivalent to the NDE in a randomized controlled trial. We also present a targeted minimum loss estimator (TMLE), a locally efficient, double robust substitution estimator for the statistical parameter associated with this causal parameter. The TMLE can be applied to problems with continuous and high dimensional intermediate variables, and can be used to estimate the NDE in a randomized controlled trial with such data. Additionally, we define and discuss the estimation of three related causal parameters: the natural direct effect among the treated, the indirect effect among the untreated and the indirect effect among the treated.
PMCID: PMC3692606  PMID: 23607645
Causal inference; direct effect; indirect effect; mediation analysis; semiparametric models; targeted minimum loss estimation
7.  Cervical Cancer Precursors and Hormonal Contraceptive Use in HIV-Positive Women: Application of a Causal Model and Semi-Parametric Estimation Methods 
PLoS ONE  2014;9(6):e101090.
To demonstrate the application of causal inference methods to observational data in the obstetrics and gynecology field, particularly causal modeling and semi-parametric estimation.
Human immunodeficiency virus (HIV)-positive women are at increased risk for cervical cancer and its treatable precursors. Determining whether potential risk factors such as hormonal contraception are true causes is critical for informing public health strategies as longevity increases among HIV-positive women in developing countries.
We developed a causal model of the factors related to combined oral contraceptive (COC) use and cervical intraepithelial neoplasia 2 or greater (CIN2+) and modified the model to fit the observed data, drawn from women in a cervical cancer screening program at HIV clinics in Kenya. Assumptions required for substantiation of a causal relationship were assessed. We estimated the population-level association using semi-parametric methods: g-computation, inverse probability of treatment weighting, and targeted maximum likelihood estimation.
We identified 2 plausible causal paths from COC use to CIN2+: via HPV infection and via increased disease progression. Study data enabled estimation of the latter only with strong assumptions of no unmeasured confounding. Of 2,519 women under 50 screened per protocol, 219 (8.7%) were diagnosed with CIN2+. Marginal modeling suggested a 2.9% (95% confidence interval 0.1%, 6.9%) increase in prevalence of CIN2+ if all women under 50 were exposed to COC; the significance of this association was sensitive to method of estimation and exposure misclassification.
Use of causal modeling enabled clear representation of the causal relationship of interest and the assumptions required to estimate that relationship from the observed data. Semi-parametric estimation methods provided flexibility and reduced reliance on correct model form. Although selected results suggest an increased prevalence of CIN2+ associated with COC, evidence is insufficient to conclude causality. Priority areas for future studies to better satisfy causal criteria are identified.
PMCID: PMC4076246  PMID: 24979709
8.  Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation 
Biometrics  2013;70(1):144-152.
Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD) which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on risk for ESLD. In this case study, we examine whether HCV clearance affects risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included non-censoring variable missingness and event sparsity.
In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly-robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputations to account for the non-monotone missing data.
PMCID: PMC3954273  PMID: 24571372
Double-robust; Inverse probability of treatment weighting; Kaplan-Meier; Longitudinal data; Marginal structural model; Survival analysis; Targeted maximum likelihood estimation
9.  Population Intervention Causal Effects Based on Stochastic Interventions 
Biometrics  2011;68(2):541-549.
Estimating the causal effect of an intervention on a population typically involves defining parameters in a nonparametric structural equation model (Pearl, 2000, Causality: Models, Reasoning, and Inference) in which the treatment or exposure is deterministically assigned in a static or dynamic way. We define a new causal parameter that takes into account the fact that intervention policies can result in stochastically assigned exposures. The statistical parameter that identifies the causal parameter of interest is established. Inverse probability of treatment weighting (IPTW), augmented IPTW (A-IPTW), and targeted maximum likelihood estimators (TMLE) are developed. A simulation study is performed to demonstrate the properties of these estimators, which include the double robustness of the A-IPTW and the TMLE. An application example using physical activity data is presented.
PMCID: PMC4117410  PMID: 21977966
Causal effect; Counterfactual outcome; Double robustness; Stochastic intervention; Targeted maximum likelihood estimation
10.  Structured measurement error in nutritional epidemiology: applications in the Pregnancy, Infection, and Nutrition (PIN) Study 
Preterm birth, defined as delivery before 37 completed weeks’ gestation, is a leading cause of infant morbidity and mortality. Identifying factors related to preterm delivery is an important goal of public health professionals who wish to identify etiologic pathways to target for prevention. Validation studies are often conducted in nutritional epidemiology in order to study measurement error in instruments that are generally less invasive or less expensive than ”gold standard” instruments. Data from such studies are then used in adjusting estimates based on the full study sample. However, measurement error in nutritional epidemiology has recently been shown to be complicated by correlated error structures in the study-wide and validation instruments. Investigators of a study of preterm birth and dietary intake designed a validation study to assess measurement error in a food frequency questionnaire (FFQ) administered during pregnancy and with the secondary goal of assessing whether a single administration of the FFQ could be used to describe intake over the relatively short pregnancy period, in which energy intake typically increases. Here, we describe a likelihood-based method via Markov Chain Monte Carlo to estimate the regression coefficients in a generalized linear model relating preterm birth to covariates, where one of the covariates is measured with error and the multivariate measurement error model has correlated errors among contemporaneous instruments (i.e. FFQs, 24-hour recalls, and/or biomarkers). Because of constraints on the covariance parameters in our likelihood, identifiability for all the variance and covariance parameters is not guaranteed and, therefore, we derive the necessary and suficient conditions to identify the variance and covariance parameters under our measurement error model and assumptions. We investigate the sensitivity of our likelihood-based model to distributional assumptions placed on the true folate intake by employing semi-parametric Bayesian methods through the mixture of Dirichlet process priors framework. We exemplify our methods in a recent prospective cohort study of risk factors for preterm birth. We use long-term folate as our error-prone predictor of interest, the food-frequency questionnaire (FFQ) and 24-hour recall as two biased instruments, and serum folate biomarker as the unbiased instrument. We found that folate intake, as measured by the FFQ, led to a conservative estimate of the estimated odds ratio of preterm birth (0.76) when compared to the odds ratio estimate from our likelihood-based approach, which adjusts for the measurement error (0.63). We found that our parametric model led to similar conclusions to the semi-parametric Bayesian model.
PMCID: PMC2440718  PMID: 18584067
Adaptive-Rejection Sampling; Dirichlet process prior; MCMC; Semiparametric Bayes
11.  Dimension reduction with gene expression data using targeted variable importance measurement 
BMC Bioinformatics  2011;12:312.
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes with a more operable length ideally including all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary predecessor of the analysis because it can not only reduce the cost of handling numerous variables, but also has the potential to improve the performance of the downstream analysis algorithms.
We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the frame work of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second step improves the first stage estimation with respect to the parameter of interest.
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
PMCID: PMC3166941  PMID: 21849016
12.  Targeted Maximum Likelihood Estimation of Effect Modification Parameters in Survival Analysis 
The Cox proportional hazards model or its discrete time analogue, the logistic failure time model, posit highly restrictive parametric models and attempt to estimate parameters which are specific to the model proposed. These methods are typically implemented when assessing effect modification in survival analyses despite their flaws. The targeted maximum likelihood estimation (TMLE) methodology is more robust than the methods typically implemented and allows practitioners to estimate parameters that directly answer the question of interest. TMLE will be used in this paper to estimate two newly proposed parameters of interest that quantify effect modification in the time to event setting. These methods are then applied to the Tshepo study to assess if either gender or baseline CD4 level modify the effect of two cART therapies of interest, efavirenz (EFV) and nevirapine (NVP), on the progression of HIV. The results show that women tend to have more favorable outcomes using EFV while males tend to have more favorable outcomes with NVP. Furthermore, EFV tends to be favorable compared to NVP for individuals at high CD4 levels.
PMCID: PMC3083138  PMID: 21556287
causal effect; semi-parametric; censored longitudinal data; double robust; efficient influence curve; influence curve; G-computation; Targeted Maximum Likelihood Estimation; Cox-proportional hazards; survival analysis
13.  Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial 
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on the inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic testing, and in facilitating future subjects at risk to make informed decisions on whether to undergo genetic mutation testings.
PMCID: PMC3905630  PMID: 24489419
Censored data; Finite mixture model; Huntington’s disease; Kin-cohort design; Quantitative trait locus
The annals of applied statistics  2014;8(2):703-725.
The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized “causal” estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application and review some related methods and the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
PMCID: PMC4259272  PMID: 25505499
Causal inference; G-computation; inverse probability weighting; marginal effects; missing data; pediatrics
15.  Semiparametric estimation of covariance matrices for longitudinal data 
Estimation of longitudinal data covariance structure poses significant challenges because the data are usually collected at irregular time points. A viable semiparametric model for covariance matrices was proposed in Fan, Huang and Li (2007) that allows one to estimate the variance function nonparametrically and to estimate the correlation function parametrically via aggregating information from irregular and sparse data points within each subject. However, the asymptotic properties of their quasi-maximum likelihood estimator (QMLE) of parameters in the covariance model are largely unknown. In the current work, we address this problem in the context of more general models for the conditional mean function including parametric, nonparametric, or semi-parametric. We also consider the possibility of rough mean regression function and introduce the difference-based method to reduce biases in the context of varying-coefficient partially linear mean regression models. This provides a more robust estimator of the covariance function under a wider range of situations. Under some technical conditions, consistency and asymptotic normality are obtained for the QMLE of the parameters in the correlation function. Simulation studies and a real data example are used to illustrate the proposed approach.
PMCID: PMC2631936  PMID: 19180247
Correlation structure; difference-based estimation; quasi-maximum likelihood; varying-coefficient partially linear model
16.  Targeted maximum likelihood estimation in safety analysis 
Journal of clinical epidemiology  2013;66(8 0):10.1016/j.jclinepi.2013.02.017.
To compare the performance of a targeted maximum likelihood estimator (TMLE) and a collaborative TMLE (CTMLE) to other estimators in a drug safety analysis, including a regression-based estimator, propensity score (PS)–based estimators, and an alternate doubly robust (DR) estimator in a real example and simulations.
Study Design and Setting
The real data set is a subset of observational data from Kaiser Permanente Northern California formatted for use in active drug safety surveillance. Both the real and simulated data sets include potential confounders, a treatment variable indicating use of one of two antidiabetic treatments and an outcome variable indicating occurrence of an acute myocardial infarction (AMI).
In the real data example, there is no difference in AMI rates between treatments. In simulations, the double robustness property is demonstrated: DR estimators are consistent if either the initial outcome regression or PS estimator is consistent, whereas other estimators are inconsistent if the initial estimator is not consistent. In simulations with near-positivity violations, CTMLE performs well relative to other estimators by adaptively estimating the PS.
Each of the DR estimators was consistent, and TMLE and CTMLE had the smallest mean squared error in simulations.
PMCID: PMC3818128  PMID: 23849159
Safety analysis; Targeted maximum likelihood estimation; Doubly robust; Causal inference; Collaborative targeted maximum likelihood estimation; Super learning
17.  A model of yeast cell-cycle regulation based on multisite phosphorylation 
Multisite phosphorylation of CDK target proteins provides the requisite nonlinearity for cell cycle modeling using elementary reaction mechanisms.Stochastic simulations, based on Gillespie's algorithm and using realistic numbers of protein and mRNA molecules, compare favorably with single-cell measurements in budding yeast.The role of transcription–translation coupling is critical in the robust operation of protein regulatory networks in yeast cells.
Progression through the eukaryotic cell cycle is governed by the activation and inactivation of a family of cyclin-dependent kinases (CDKs) and auxiliary proteins that regulate CDK activities (Morgan, 2007). The many components of this protein regulatory network are interconnected by positive and negative feedback loops that create bistable switches and transient pulses (Tyson and Novak, 2008). The network must ensure that cell-cycle events proceed in the correct order, that cell division is balanced with respect to cell growth, and that any problems encountered (in replicating the genome or partitioning chromosomes to daughter cells) are corrected before the cell proceeds to the next phase of the cycle. The network must operate robustly in the context of unavoidable molecular fluctuations in a yeast-sized cell. With a volume of only 5×10−14 l, a yeast cell contains one copy of the gene for each component of the network, a handful of mRNA transcripts of each gene, and a few hundreds to thousands of protein molecules carrying out each gene's function. How large are the molecular fluctuations implied by these numbers, and what effects do they have on the functioning of the cell-cycle control system?
To answer these questions, we have built a new model (Figure 1) of the CDK regulatory network in budding yeast, based on the fact that the targets of CDK activity are typically phosphorylated on multiple sites. The activity of each target protein depends on how many sites are phosphorylated. The target proteins feedback on CDK activity by controlling cyclin synthesis (SBF's role) and degradation (Cdh1's role) and by releasing a CDK-counteracting phosphatase (Cdc14). Every reaction in Figure 1 can be described by a mass-action rate law, with an accompanying rate constant that must be estimated from experimental data. As the transcription and translation of mRNA molecules have major effects on fluctuating numbers of protein molecules (Pedraza and Paulsson, 2008), we have included mRNA transcripts for each protein in the model.
To create a deterministic model, the rate laws are combined, according to standard principles of chemical kinetics, into a set of 60 differential equations that govern the temporal dynamics of the control system. In the stochastic version of the model, the rate law for each reaction determines the probability per unit time that a particular reaction occurs, and we use Gillespie's stochastic simulation algorithm (Gillespie, 1976) to compute possible temporal sequences of reaction events. Accurate stochastic simulations require knowledge of the expected numbers of mRNA and protein molecules in a single yeast cell. Fortunately, these numbers are available from several sources (Ghaemmaghami et al, 2003; Zenklusen et al, 2008). Although the experimental estimates are not always in good agreement with each other, they are sufficiently reliable to populate a stochastic model with realistic numbers of molecules.
By simulating thousands of cells (as in Figure 5), we can build up representative samples for computing the mean and s.d. of any measurable cell-cycle property (e.g. interdivision time, size at division, duration of G1 phase). The excellent fit of simulated statistics to observations of cell-cycle variability is documented in the main text and Supplementary Information.
Of particular interest to us are observations of Di Talia et al (2007) of the timing of a crucial G1 event (export of Whi5 protein from the nucleus) in a population of budding yeast cells growing at a specific growth rate α=ln2/(mass-doubling time). Whi5 export is a consequence of Whi5 phosphorylation, and it occurs simultaneously with the release (activation) of SBF (see Figure 1). Using fluorescently labeled Whi5, Di Talia et al could easily measure (in individual yeast cells) the time, T1, from cell birth to the abrupt loss of Whi5 from the nucleus. Correlating T1 to the size of the cell at birth, Vbirth, they found that, for a sample of daughter cells, αT1 versus ln(Vbirth) could be fit with two straight lines of slope −0.7 and −0.3. Our simulation of this experiment (Figure 7 of the main text) compares favorably with Figure 3d and e in Di Talia et al (2007).
The major sources of noise in our model (and in protein regulatory networks in yeast cells, in general) are related to gene transcription and the small number of unique mRNA transcripts. As each mRNA molecule may instruct the synthesis of dozens of protein molecules, the coefficient of variation of molecular fluctuations at the protein level (CVP) may be dominated by fluctuations at the mRNA level, as expressed in the formula (Pedraza and Paulsson, 2008) where NM, NP denote the number of mRNA and protein molecules, respectively, and ρ=τM/τP is the ratio of half-lives of mRNA and protein molecules. For a yeast cell, typical values of NM and NP are 8 and 800, respectively (Ghaemmaghami et al, 2003; Zenklusen et al, 2008). If ρ=1, then CVP≈25%. Such large fluctuations in protein levels are inconsistent with the observed variability of size and age at division in yeast cells, as shown in the simplified cell-cycle model of Kar et al (2009) and as we have confirmed with our more realistic model. The size of these fluctuations can be reduced to a more acceptable level by assuming a shorter half-life for mRNA (say, ρ=0.1).
There must be some mechanisms whereby yeast cells lessen the protein fluctuations implied by transcription–translation coupling. Following Pedraza and Paulsson (2008), we suggest that mRNA gestation and senescence may resolve this problem. Equation (3) is based on a simple, one-stage, birth–death model of mRNA turnover. In Supplementary Appendix 1, we show that a model of mRNA processing, with 10 stages each of mRNA gestation and senescence, gives reasonable fluctuations at the protein level (CVP≈5%), even if the effective half-life of mRNA is 10 min. A one-stage model with τM=1 min gives comparable fluctuations (CVP≈5%). In the main text, we use a simple birth–death model of mRNA turnover with an ‘effective' half-life of 1 min, in order to limit the computational complexity of the full cell-cycle model.
In order for the cell's genome to be passed intact from one generation to the next, the events of the cell cycle (DNA replication, mitosis, cell division) must be executed in the correct order, despite the considerable molecular noise inherent in any protein-based regulatory system residing in the small confines of a eukaryotic cell. To assess the effects of molecular fluctuations on cell-cycle progression in budding yeast cells, we have constructed a new model of the regulation of Cln- and Clb-dependent kinases, based on multisite phosphorylation of their target proteins and on positive and negative feedback loops involving the kinases themselves. To account for the significant role of noise in the transcription and translation steps of gene expression, the model includes mRNAs as well as proteins. The model equations are simulated deterministically and stochastically to reveal the bistable switching behavior on which proper cell-cycle progression depends and to show that this behavior is robust to the level of molecular noise expected in yeast-sized cells (∼50 fL volume). The model gives a quantitatively accurate account of the variability observed in the G1-S transition in budding yeast, which is governed by an underlying sizer+timer control system.
PMCID: PMC2947364  PMID: 20739927
bistability; cell-cycle variability; size control; stochastic model; transcription–translation coupling
18.  A Joint Model for Prognostic Effect of Biomarker Variability on Outcomes: long-term intraocular pressure (IOP) fluctuation on the risk of developing primary open-angle glaucoma (POAG) 
JP journal of biostatistics  2011;5(2):73-96.
Primary open-angle glaucoma (POAG) is among the leading causes of blindness in the United States and worldwide. While numerous prospective clinical trials have convincingly shown that elevated intraocular pressure (IOP) is a leading risk factor for the development of POAG, an increasingly debated issue in recent years is the effect of IOP fluctuation on the risk of developing POAG. In many applications, this question is addressed via a “naïve” two-step approach where some sample-based estimates (e.g., standard deviation) are first obtained as surrogates for the “true” within-subject variability and then included in Cox regression models as covariates. However, estimates from two-step approach are more likely to suffer from the measurement error inherent in sample-based summary statistics. In this paper we propose a joint model to assess the question whether individuals with different levels of IOP variability have different susceptibility to POAG. In our joint model, the trajectory of IOP is described by a linear mixed model that incorporates patient-specific variance, the time to POAG is fit using a semi-parametric or parametric distribution, and the two models are linked via patient-specific random effects. Parameters in the joint model are estimated under Bayesian framework using Markov chain Monte Carlo (MCMC) methods with Gibbs sampling. The method is applied to data from the Ocular Hypertension Treatment Study (OHTS) and the European Glaucoma Prevention Study (EGPS), two large-scale multi-center randomized trials on the prevention of POAG.
PMCID: PMC3237682  PMID: 22180704
Joint model; Longitudinal data; Markov chain Monte Carlo (MCMC); Patient-specific variance; Survival data
19.  Maximum Likelihood Estimations and EM Algorithms with Length-biased Data 
Length-biased sampling has been well recognized in economics, industrial reliability, etiology applications, epidemiological, genetic and cancer screening studies. Length-biased right-censored data have a unique data structure different from traditional survival data. The nonparametric and semiparametric estimations and inference methods for traditional survival data are not directly applicable for length-biased right-censored data. We propose new expectation-maximization algorithms for estimations based on full likelihoods involving infinite dimensional parameters under three settings for length-biased data: estimating nonparametric distribution function, estimating nonparametric hazard function under an increasing failure rate constraint, and jointly estimating baseline hazards function and the covariate coefficients under the Cox proportional hazards model. Extensive empirical simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and lead to more efficient estimators compared to the estimating equation approaches. The proposed estimates are also more robust to various right-censoring mechanisms. We prove the strong consistency properties of the estimators, and establish the asymptotic normality of the semi-parametric maximum likelihood estimators under the Cox model using modern empirical processes theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
PMCID: PMC3273908  PMID: 22323840
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
20.  Investigating hospital heterogeneity with a multi-state frailty model: application to nosocomial pneumonia disease in intensive care units 
Multistate models have become increasingly useful to study the evolution of a patient’s state over time in intensive care units ICU (e.g. admission, infections, alive discharge or death in ICU). In addition, in critically-ill patients, data come from different ICUs, and because observations are clustered into groups (or units), the observed outcomes cannot be considered as independent. Thus a flexible multi-state model with random effects is needed to obtain valid outcome estimates.
We show how a simple multi-state frailty model can be used to study semi-competing risks while fully taking into account the clustering (in ICU) of the data and the longitudinal aspects of the data, including left truncation and right censoring. We suggest the use of independent frailty models or joint frailty models for the analysis of transition intensities. Two distinct models which differ in the definition of time t in the transition functions have been studied: semi-Markov models where the transitions depend on the waiting times and nonhomogenous Markov models where the transitions depend on the time since inclusion in the study. The parameters in the proposed multi-state model may conveniently be computed using a semi-parametric or parametric approach with an existing R package FrailtyPack for frailty models. The likelihood cross-validation criterion is proposed to guide the choice of a better fitting model.
We illustrate the use of our approach though the analysis of nosocomial infections (ventilator-associated pneumonia infections: VAP) in ICU, with “alive discharge” and “death” in ICU as other endpoints. We show that the analysis of dependent survival data using a multi-state model without frailty terms may underestimate the variance of regression coefficients specific to each group, leading to incorrect inferences. Some factors are wrongly significantly associated based on the model without frailty terms. This result is confirmed by a short simulation study. We also present individual predictions of VAP underlining the usefulness of dynamic prognostic tools that can take into account the clustering of observations.
The use of multistate frailty models allows the analysis of very complex data. Such models could help improve the estimation of the impact of proposed prognostic features on each transition in a multi-centre study. We suggest a method and software that give accurate estimates and enables inference for any parameter or predictive quantity of interest.
PMCID: PMC3537543  PMID: 22702430
21.  Modeling gene expression measurement error: a quasi-likelihood approach 
BMC Bioinformatics  2003;4:10.
Using suitable error models for gene expression measurements is essential in the statistical analysis of microarray data. However, the true probabilistic model underlying gene expression intensity readings is generally not known. Instead, in currently used approaches some simple parametric model is assumed (usually a transformed normal distribution) or the empirical distribution is estimated. However, both these strategies may not be optimal for gene expression data, as the non-parametric approach ignores known structural information whereas the fully parametric models run the risk of misspecification. A further related problem is the choice of a suitable scale for the model (e.g. observed vs. log-scale).
Here a simple semi-parametric model for gene expression measurement error is presented. In this approach inference is based an approximate likelihood function (the extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In case of gene expression this information is available in the form of the postulated (e.g. quadratic) variance structure of the data.
As the quasi-likelihood behaves (almost) like a proper likelihood, it allows for the estimation of calibration and variance parameters, and it is also straightforward to obtain corresponding approximate confidence intervals. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. both on the original linear scale as well as on a transformed scale. It can also be employed in regression approaches to model systematic (e.g. array or dye) effects.
The quasi-likelihood framework provides a simple and versatile approach to analyze gene expression data that does not make any strong distributional assumptions about the underlying error model. For several simulated as well as real data sets it provides a better fit to the data than competing models. In an example it also improved the power of tests to identify differential expression.
PMCID: PMC153502  PMID: 12659637
22.  Comparison of Kasai Autocorrelation and Maximum Likelihood Estimators for Doppler Optical Coherence Tomography 
IEEE transactions on medical imaging  2013;32(6):1033-1042.
In optical coherence tomography (OCT) and ultrasound, unbiased Doppler frequency estimators with low variance are desirable for blood velocity estimation. Hardware improvements in OCT mean that ever higher acquisition rates are possible, which should also, in principle, improve estimation performance. Paradoxically, however, the widely used Kasai autocorrelation estimator’s performance worsens with increasing acquisition rate. We propose that parametric estimators based on accurate models of noise statistics can offer better performance. We derive a maximum likelihood estimator (MLE) based on a simple additive white Gaussian noise model, and show that it can outperform the Kasai autocorrelation estimator. In addition, we also derive the Cramer Rao lower bound (CRLB), and show that the variance of the MLE approaches the CRLB for moderate data lengths and noise levels. We note that the MLE performance improves with longer acquisition time, and remains constant or improves with higher acquisition rates. These qualities may make it a preferred technique as OCT imaging speed continues to improve. Finally, our work motivates the development of more general parametric estimators based on statistical models of decorrelation noise.
PMCID: PMC3745780  PMID: 23446044
Cramer–Rao bound (CRB); Doppler optical coherence tomography; Doppler ultrasound; frequency estimation; maximum likelihood estimation (MLE)
23.  Genetic and Environmental Transmission of Body Mass Index Fluctuation 
Behavior genetics  2012;42(6):867-874.
This study sought to determine the relationship between BMI fluctuation and cardiovascular disease phenotypes, diabetes, and depression and the role of genetic and environmental factors in individual differences in BMI fluctuation using the extended twin-family model (ETFM).
Study Design and Methods
This study included 14,763 twins and their relatives. Health and Lifestyle Questionnaires were obtained from 28,492 individuals from the Virginia 30,000 dataset including twins, parents, siblings, spouses, and children of twins. Self-report cardiovascular disease, diabetes, and depression data were available. From self-reported height and weight, BMI fluctuation was calculated as the difference between highest and lowest BMI after age 18, for individuals 18–80 years. Logistic regression analyses were used to determine the relationship between BMI fluctuation and disease status. The ETFM was used to estimate the significance and contribution of genetic and environmental factors, cultural transmission, and assortative mating components to BMI fluctuation, while controlling for age. We tested sex differences in additive and dominant genetic effects, parental, non-parental, twin, and unique environmental effects.
BMI fluctuation was highly associated with disease status, independent of BMI. Genetic effects accounted for ~34% of variance in BMI fluctuation in males and ~43% of variance in females. The majority of the variance was accounted for by environmental factors, about a third of which were shared among twins. Assortative mating, and cultural transmission accounted for only a small proportion of variance in this phenotype.
Since there are substantial health risks associated with BMI fluctuation and environmental components of BMI fluctuation account for over 60% of variance in males and over 50% of variance in females, environmental risk factors may be appropriate targets to reduce BMI fluctuation.
PMCID: PMC3474545  PMID: 23011216
BMI; Chronic Disease; Weight change; Heritability; Family Studies
24.  A weighted combination of pseudo-likelihood estimators for longitudinal binary data subject to nonignorable non-monotone missingness 
Statistics in medicine  2010;29(14):1511-1521.
For longitudinal binary data with non-monotone non-ignorably missing outcomes over time, a full likelihood approach is complicated algebraically, and with many follow-up times, maximum likelihood estimation can be computationally prohibitive. As alternatives, two pseudo-likelihood approaches have been proposed that use minimal parametric assumptions. One formulation requires specification of the marginal distributions of the outcome and missing data mechanism at each time point, but uses an “independence working assumption,” i.e., an assumption that observations are independent over time. Another method avoids having to estimate the missing data mechanism by formulating a “protective estimator.” In simulations, these two estimators can be very inefficient, both for estimating time trends in the first case and for estimating both time-varying and time-stationary effects in the second. In this paper, we propose use of the optimal weighted combination of these two estimators, and in simulations we show that the optimal weighted combination can be much more efficient than either estimator alone. Finally, the proposed method is used to analyze data from two longitudinal clinical trials of HIV-infected patients.
PMCID: PMC2996053  PMID: 20205269
25.  A probit- log- skew-normal mixture model for repeated measures data with excess zeros, with application to a cohort study of paediatric respiratory symptoms 
A zero-inflated continuous outcome is characterized by occurrence of "excess" zeros that more than a single distribution can explain, with the positive observations forming a skewed distribution. Mixture models are employed for regression analysis of zero-inflated data. Moreover, for repeated measures zero-inflated data the clustering structure should also be modeled for an adequate analysis.
Diary of Asthma and Viral Infections Study (DAVIS) was a one year (2004) cohort study conducted at McMaster University to monitor viral infection and respiratory symptoms in children aged 5-11 years with and without asthma. Respiratory symptoms were recorded daily using either an Internet or paper-based diary. Changes in symptoms were assessed by study staff and led to collection of nasal fluid specimens for virological testing. The study objectives included investigating the response of respiratory symptoms to respiratory viral infection in children with and without asthma over a one year period. Due to sparse data daily respiratory symptom scores were aggregated into weekly average scores. More than 70% of the weekly average scores were zero, with the positive scores forming a skewed distribution. We propose a random effects probit/log-skew-normal mixture model to analyze the DAVIS data. The model parameters were estimated using a maximum marginal likelihood approach. A simulation study was conducted to assess the performance of the proposed mixture model if the underlying distribution of the positive response is different from log-skew normal.
Viral infection status was highly significant in both probit and log-skew normal model components respectively. The probability of being symptom free was much lower for the week a child was viral positive relative to the week she/he was viral negative. The severity of the symptoms was also greater for the week a child was viral positive. The probability of being symptom free was smaller for asthmatics relative to non-asthmatics throughout the year, whereas there was no difference in the severity of the symptoms between the two groups.
A positive association was observed between viral infection status and both the probability of experiencing any respiratory symptoms, and their severity during the year. For DAVIS data the random effects probit -log skew normal model fits significantly better than the random effects probit -log normal model, endorsing our parametric choice for the model. The simulation study indicates that our proposed model seems to be robust to misspecification of the distribution of the positive skewed response.
PMCID: PMC2902491  PMID: 20540810

Results 1-25 (1159881)