1.  Collaborative Double Robust Targeted Maximum Likelihood Estimation* 
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan and Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. The targeted maximum likelihood estimator (TMLE) has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
doi:10.2202/1557-4679.1181
PMCID: PMC2898626  PMID: 20628637
asymptotic linearity; coarsening at random; causal effect; censored data; cross-validation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
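The targeting step described in this abstract can be summarized in one display. For a binary treatment A, covariates W, outcome regression Q, and treatment mechanism g, a common fluctuation submodel and its “clever covariate” are shown below in standard TMLE notation; this is a schematic reading aid, not an excerpt from the article.

```latex
% One targeting (fluctuation) step applied to an initial estimate Q_n,
% using an estimate g_n of the nuisance parameter g. The submodel is chosen
% so that its score at epsilon = 0 equals the efficient influence curve
% D*(Q_n, g_n) of the target parameter.
\begin{align*}
  \operatorname{logit} Q_n(\epsilon)(A,W)
    &= \operatorname{logit} Q_n(A,W) + \epsilon\, H_{g_n}(A,W),\\
  H_{g_n}(A,W)
    &= \frac{A}{g_n(1 \mid W)} - \frac{1-A}{g_n(0 \mid W)},\\
  \left.\frac{\partial}{\partial \epsilon}\,
    \log p_{Q_n(\epsilon)}(O)\right|_{\epsilon = 0}
    &= D^{*}(Q_n, g_n)(O).
\end{align*}
```

In the collaborative version, a sequence of increasingly nonparametric estimates of g_n generates a sequence of such targeted updates of Q, and cross-validated loss on Q selects among them.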
2.  The Relative Performance of Targeted Maximum Likelihood Estimators 
There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes are especially suitable for dealing with positivity violations because, in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
doi:10.2202/1557-4679.1308
PMCID: PMC3173607  PMID: 21931570
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
3.  Dimension reduction with gene expression data using targeted variable importance measurement 
BMC Bioinformatics  2011;12:312.
Background
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of more manageable length that ideally includes all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary precursor to the analysis, because it not only reduces the cost of handling numerous variables but also has the potential to improve the performance of the downstream analysis algorithms.
Results
We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation that targets the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimate with respect to the parameter of interest.
Conclusions
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
doi:10.1186/1471-2105-12-312
PMCID: PMC3166941  PMID: 21849016
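The two-stage idea in this abstract reads naturally as a ranking loop: a flexible machine-learning fit for prediction, followed by a per-variable adjusted importance estimate. The sketch below is a hypothetical simplification that dichotomizes each candidate variable at its median and substitutes a plain AIPW-style correction for the TMLE targeting step; the function name, the random-forest learners, and the truncation level are illustrative choices, not the TMLE-VIM implementation.

```python
# Hypothetical two-stage variable-importance ranking in the spirit of TMLE-VIM.
# Stage 1: flexible machine-learning fit of the outcome given the variables.
# Stage 2: a doubly robust (AIPW-style) adjusted importance estimate per
# variable, used only to rank variables. Illustrative sketch, not the paper's
# implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def variable_importance(X, y, n_top=10, random_state=0):
    n, p = X.shape
    scores = []
    for j in range(p):
        a = (X[:, j] > np.median(X[:, j])).astype(float)   # dichotomized "exposure"
        w = np.delete(X, j, axis=1)                         # remaining covariates
        # Stage 1: outcome regressions under a = 1 and a = 0.
        q = RandomForestRegressor(n_estimators=200, random_state=random_state)
        q.fit(np.column_stack([a, w]), y)
        q1 = q.predict(np.column_stack([np.ones(n), w]))
        q0 = q.predict(np.column_stack([np.zeros(n), w]))
        # "Treatment" mechanism used by the doubly robust correction term.
        g = RandomForestClassifier(n_estimators=200, random_state=random_state)
        g.fit(w, a)
        g1 = np.clip(g.predict_proba(w)[:, 1], 0.025, 0.975)
        # AIPW estimate of the adjusted mean difference (the importance score).
        psi = np.mean(a / g1 * (y - q1) - (1 - a) / (1 - g1) * (y - q0) + q1 - q0)
        scores.append(psi)
    order = np.argsort(-np.abs(scores))
    return order[:n_top], np.asarray(scores)

# Example with simulated expression-like data:
# rng = np.random.default_rng(1)
# X = rng.normal(size=(200, 50)); y = 2 * X[:, 3] + rng.normal(size=200)
# top, scores = variable_importance(X, y)
```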
4.  Targeted Maximum Likelihood Estimation for Dynamic and Static Longitudinal Marginal Structural Working Models 
Journal of causal inference  2014;2(2):147-185.
This paper describes a targeted maximum likelihood estimator (TMLE) for the parameters of longitudinal static and dynamic marginal structural models. We consider a longitudinal data structure consisting of baseline covariates, time-dependent intervention nodes, intermediate time-dependent covariates, and a possibly time-dependent outcome. The intervention nodes at each time point can include a binary treatment as well as a right-censoring indicator. Given a class of dynamic or static interventions, a marginal structural model is used to model the mean of the intervention-specific counterfactual outcome as a function of the intervention, time point, and possibly a subset of baseline covariates. Because the true shape of this function is rarely known, the marginal structural model is used as a working model. The causal quantity of interest is defined as the projection of the true function onto this working model. Iterated conditional expectation double robust estimators for marginal structural model parameters were previously proposed by Robins (2000, 2002) and Bang and Robins (2005). Here we build on this work and present a pooled TMLE for the parameters of marginal structural working models. We compare this pooled estimator to a stratified TMLE (Schnitzer et al. 2014) that is based on estimating the intervention-specific mean separately for each intervention of interest. The performance of the pooled TMLE is compared to the performance of the stratified TMLE and the performance of inverse probability weighted (IPW) estimators using simulations. Concepts are illustrated using an example in which the aim is to estimate the causal effect of delayed switch following immunological failure of first line antiretroviral therapy among HIV-infected patients. Data from the International Epidemiological Databases to Evaluate AIDS, Southern Africa are analyzed to investigate this question using both TML and IPW estimators. Our results demonstrate practical advantages of the pooled TMLE over an IPW estimator for working marginal structural models for survival, as well as cases in which the pooled TMLE is superior to its stratified counterpart.
doi:10.1515/jci-2013-0007
PMCID: PMC4405134  PMID: 25909047
dynamic regime; semiparametric statistical model; targeted minimum loss based estimation; confounding; right censoring
5.  Causal Inference for a Population of Causally Connected Units 
Journal of causal inference  2014;2(1):13-74.
Suppose that we observe a population of causally connected units. On each unit at each time-point on a grid we observe a set of other units the unit is potentially connected with, and a unit-specific longitudinal data structure consisting of baseline and time-dependent covariates, a time-dependent treatment, and a final outcome of interest. The target quantity of interest is defined as the mean outcome for this group of units if the exposures of the units would be probabilistically assigned according to a known specified mechanism, where the latter is called a stochastic intervention. Causal effects of interest are defined as contrasts of the mean of the unit-specific outcomes under different stochastic interventions one wishes to evaluate. This covers a large range of estimation problems, from independent units and independent clusters of units to a single cluster of units in which each unit has a limited number of connections to other units. The allowed dependence includes treatment allocation in response to data on multiple units and so-called causal interference as special cases. We present a few motivating classes of examples, propose a structural causal model, define the desired causal quantities, address the identification of these quantities from the observed data, and define maximum likelihood based estimators based on cross-validation. In particular, we present maximum likelihood based super-learning for this network data. However, such smoothed/regularized maximum likelihood estimators are not targeted and will thereby be overly biased with respect to the target parameter, and, as a consequence, generally do not result in asymptotically normally distributed estimators of the statistical target parameter.
To formally develop estimation theory, we focus on the simpler case in which the longitudinal data structure is a point-treatment data structure. We formulate a novel targeted maximum likelihood estimator of this estimand and show that the double robustness of the efficient influence curve implies that the bias of the targeted minimum loss-based estimation (TMLE) will be a second-order term involving squared differences of two nuisance parameters. In particular, the TMLE will be consistent if either one of these nuisance parameters is consistently estimated. Due to the causal dependencies between units, the data set may correspond with the realization of a single experiment, so that establishing a (e.g. normal) limit distribution for the targeted maximum likelihood estimators, and corresponding statistical inference, is a challenging topic. We prove two formal theorems establishing the asymptotic normality using advances in weak-convergence theory. We conclude with a discussion and refer to an accompanying technical report for extensions to general longitudinal data structures.
doi:10.1515/jci-2013-0002
PMCID: PMC4500386  PMID: 26180755
networks; causal inference; targeted maximum likelihood estimation; stochastic intervention; efficient influence curve
6.  Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation 
Biometrics  2013;70(1):144-152.
Summary
Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD), which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on the risk of ESLD. In this case study, we examine whether HCV clearance affects the risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included missingness in non-censoring variables and event sparsity.
In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly-robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputations to account for the non-monotone missing data.
doi:10.1111/biom.12105
PMCID: PMC3954273  PMID: 24571372
Double-robust; Inverse probability of treatment weighting; Kaplan-Meier; Longitudinal data; Marginal structural model; Survival analysis; Targeted maximum likelihood estimation
7.  An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics 
A concrete example of the collaborative double robust targeted maximum likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
doi:10.2202/1557-4679.1182
PMCID: PMC3126668  PMID: 21731530
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
8.  A Method for Subsampling Terrestrial Invertebrate Samples in the Laboratory: Estimating Abundance and Taxa Richness 
Significant progress has been made in developing subsampling techniques to process large samples of aquatic invertebrates. However, limited information is available regarding subsampling techniques for terrestrial invertebrate samples. Therefore a novel subsampling procedure was evaluated for processing samples of terrestrial invertebrates collected using two common field techniques: pitfall and pan traps. A three-phase sorting protocol was developed for estimating abundance and taxa richness of invertebrates. First, large invertebrates and plant material were removed from the sample using a sieve with a 4 mm mesh size. Second, the sample was poured into a specially designed, gridded sampling tray, and 16 cells, comprising 25% of the sampling tray, were randomly subsampled and processed. Third, the remainder of the sample was scanned for 4–7 min to record rare taxa missed in the second phase. To compare estimated abundance and taxa richness with the true values of these variables for the samples, the remainder of each sample was processed completely. The results were analyzed relative to three sample size categories: samples with less than 250 invertebrates (low abundance samples), samples with 250–500 invertebrates (moderate abundance samples), and samples with more than 500 invertebrates (high abundance samples). The number of invertebrates estimated after subsampling eight or more cells was highly precise for all sizes and types of samples. High accuracy for moderate and high abundance samples was achieved after even as few as six subsamples. However, estimates of the number of invertebrates for low abundance samples were less reliable. The subsampling technique also adequately estimated taxa richness; on average, subsampling detected 89% of taxa found in samples. Thus, the subsampling technique provided accurate data on both the abundance and taxa richness of terrestrial invertebrate samples. Importantly, subsampling greatly decreased the time required to process samples, cutting the time per sample by up to 80%. Based on these data, this subsampling technique is recommended to minimize the time and cost of processing moderate to large samples without compromising the integrity of the data and to maximize the information extracted from large terrestrial invertebrate samples. For samples with a relatively low number of invertebrates, complete counting is preferred.
doi:10.1673/031.010.2501
PMCID: PMC3014723  PMID: 20578889
pitfall traps; laboratory sampling techniques
9.  A Causal Framework for Understanding the Effect of Losses to Follow-up on Epidemiologic Analyses in Clinic-based Cohorts: The Case of HIV-infected Patients on Antiretroviral Therapy in Africa 
American Journal of Epidemiology  2012;175(10):1080-1087.
Although clinic-based cohorts are most representative of the “real world,” they are susceptible to loss to follow-up. Strategies for managing the impact of loss to follow-up are therefore needed to maximize the value of studies conducted in these cohorts. The authors evaluated adult patients starting antiretroviral therapy at an HIV/AIDS clinic in Uganda, where 29% of patients were lost to follow-up after 2 years (January 1, 2004–September 30, 2007). Unweighted, inverse probability of censoring weighted (IPCW), and sampling-based approaches (using supplemental data from a sample of lost patients subsequently tracked in the community) were used to identify the predictive value of sex on mortality. Directed acyclic graphs (DAGs) were used to explore the structural basis for bias in each approach. Among 3,628 patients, unweighted and IPCW analyses found men to have higher mortality than women, whereas the sampling-based approach did not. DAGs encoding knowledge about the data-generating process, including the fact that death is a cause of being classified as lost to follow-up in this setting, revealed “collider” bias in the unweighted and IPCW approaches. In a clinic-based cohort in Africa, unweighted and IPCW approaches—which rely on the “missing at random” assumption—yielded biased estimates. A sampling-based approach can in general strengthen epidemiologic analyses conducted in many clinic-based cohorts, including those examining other diseases.
doi:10.1093/aje/kwr444
PMCID: PMC3353135  PMID: 22306557
Africa; antiretroviral therapy; clinic-based cohorts; directed acyclic graphs; informative censoring; inverse probability of censoring weights; loss to follow-up; missing at random
10.  Targeted maximum likelihood estimation in safety analysis 
Journal of clinical epidemiology  2013;66(8 0):10.1016/j.jclinepi.2013.02.017.
Objectives
To compare the performance of a targeted maximum likelihood estimator (TMLE) and a collaborative TMLE (CTMLE) to other estimators in a drug safety analysis, including a regression-based estimator, propensity score (PS)–based estimators, and an alternate doubly robust (DR) estimator in a real example and simulations.
Study Design and Setting
The real data set is a subset of observational data from Kaiser Permanente Northern California formatted for use in active drug safety surveillance. Both the real and simulated data sets include potential confounders, a treatment variable indicating use of one of two antidiabetic treatments and an outcome variable indicating occurrence of an acute myocardial infarction (AMI).
Results
In the real data example, there is no difference in AMI rates between treatments. In simulations, the double robustness property is demonstrated: DR estimators are consistent if either the initial outcome regression or PS estimator is consistent, whereas other estimators are inconsistent if the initial estimator is not consistent. In simulations with near-positivity violations, CTMLE performs well relative to other estimators by adaptively estimating the PS.
Conclusion
Each of the DR estimators was consistent, and TMLE and CTMLE had the smallest mean squared error in simulations.
doi:10.1016/j.jclinepi.2013.02.017
PMCID: PMC3818128  PMID: 23849159
Safety analysis; Targeted maximum likelihood estimation; Doubly robust; Causal inference; Collaborative targeted maximum likelihood estimation; Super learning
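A minimal version of the kind of doubly robust substitution estimator compared in this paper can be written in a few lines for a binary treatment and a binary outcome (such as an AMI indicator). The sketch below assumes simple logistic initial fits, complete data, and no censoring; it is a generic illustration of a one-step TMLE for the marginal risk difference, not the TMLE, C-TMLE, or data set used in the study.

```python
# Minimal TMLE sketch for the marginal risk difference E[Y(1)] - E[Y(0)]
# with binary treatment A, binary outcome Y, and covariate matrix W.
# Generic illustration only; simplified logistic initial fits, no censoring.
import numpy as np
import statsmodels.api as sm

def tmle_risk_difference(W, A, Y, bound=0.025):
    n = len(Y)
    X = sm.add_constant(W)

    # Initial outcome regression Q(A, W) via logistic regression.
    q_fit = sm.GLM(Y, np.column_stack([X, A]), family=sm.families.Binomial()).fit()
    Q1 = q_fit.predict(np.column_stack([X, np.ones(n)]))
    Q0 = q_fit.predict(np.column_stack([X, np.zeros(n)]))
    QA = np.where(A == 1, Q1, Q0)

    # Propensity score g(1 | W), bounded away from 0 and 1.
    g_fit = sm.GLM(A, X, family=sm.families.Binomial()).fit()
    g1 = np.clip(g_fit.predict(X), bound, 1 - bound)

    # Clever covariate and one targeting (fluctuation) step:
    # logistic regression of Y on H with offset logit(Q(A, W)).
    H = A / g1 - (1 - A) / (1 - g1)
    eps_fit = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
                     offset=np.log(QA / (1 - QA))).fit()
    eps = eps_fit.params[0]

    # Targeted predictions and the substitution estimate.
    expit = lambda x: 1 / (1 + np.exp(-x))
    Q1_star = expit(np.log(Q1 / (1 - Q1)) + eps / g1)
    Q0_star = expit(np.log(Q0 / (1 - Q0)) - eps / (1 - g1))
    return np.mean(Q1_star) - np.mean(Q0_star)
```

Because the targeted predictions are averaged on the outcome scale, the final estimate is a substitution estimator, and consistency of either the outcome regression or the propensity score suffices for consistency of the risk difference.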
11.  Self-Consistent Nonparametric Maximum Likelihood Estimator of the Bivariate Survivor Function 
Biometrika  2014;101(3):505-518.
Summary
As usually formulated, the nonparametric likelihood for the bivariate survivor function is over-parameterized, resulting in uniqueness problems for the corresponding nonparametric maximum likelihood estimator. Here the estimation problem is redefined to include parameters for marginal hazard rates, and for double failure hazard rates only at informative uncensored failure time grid points where there is pertinent empirical information. Double failure hazard rates at other grid points in the risk region are specified rather than estimated. With this approach the nonparametric maximum likelihood estimator is unique and can be calculated using a two-step procedure. The first step involves setting aside all doubly censored observations that are interior to the risk region. The nonparametric maximum likelihood estimator from the remaining data turns out to be the Dabrowska (1988) estimator. The omitted doubly censored observations are included in the procedure in the second stage using self-consistency, resulting in a non-iterative nonparametric maximum likelihood estimator for the bivariate survivor function. Simulation evaluation and asymptotic distributional results are provided. Moderate sample size efficiency for the survivor function nonparametric maximum likelihood estimator is similar to that for the Dabrowska estimator as applied to the entire dataset, while some useful efficiency improvement arises for the corresponding distribution function estimator, presumably due to the avoidance of negative mass assignments.
doi:10.1093/biomet/asu010
PMCID: PMC4306565  PMID: 25632162
Bivariate survivor function; Censored data; Dabrowska estimator; Kaplan–Meier estimator; Non-parametric maximum likelihood; Self-consistency
12.  A two-stage validation study for determining sensitivity and specificity. 
Environmental Health Perspectives  1994;102(Suppl 8):11-14.
A two-stage procedure for estimating sensitivity and specificity is described. The procedure is developed in the context of a validation study for self-reported atypical nevi, a potentially useful measure in the study of risk factors for malignant melanoma. The first stage consists of a sample of N individuals classified only by the test measure. The second stage is a subsample of size m, stratified according to the information collected in the first stage, in which the presence of atypical nevi is determined by clinical examination. Using missing data methods for contingency tables, maximum likelihood estimators for the joint distribution of the test measure and the "gold standard" clinical evaluation are presented, along with efficient estimators for the sensitivity and specificity. Asymptotic coefficients of variation are computed to compare alternative sampling strategies for the second stage.
PMCID: PMC1566548  PMID: 7851324
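Under this design the maximum likelihood estimate of the joint distribution factors into the stage-one test margins and the stage-two conditional probabilities of the gold standard given the test, and sensitivity and specificity follow by Bayes' rule. The sketch below makes that computation explicit; the function name and the example counts are hypothetical.

```python
# Sketch of the two-stage maximum likelihood estimates of sensitivity and
# specificity. Stage 1 gives the test-result margins for all N subjects;
# stage 2 gives gold-standard results for a subsample stratified on the
# stage-1 test result, so the MLE factors as P(gold | test) * P(test).
def two_stage_sens_spec(n_test_pos, n_test_neg,
                        sub_pos_gold_pos, sub_pos_gold_neg,
                        sub_neg_gold_pos, sub_neg_gold_neg):
    # Stage-1 margins.
    N = n_test_pos + n_test_neg
    p_test_pos = n_test_pos / N
    p_test_neg = n_test_neg / N
    # Stage-2 conditional probabilities of the gold standard given the test.
    p_gold_pos_given_pos = sub_pos_gold_pos / (sub_pos_gold_pos + sub_pos_gold_neg)
    p_gold_pos_given_neg = sub_neg_gold_pos / (sub_neg_gold_pos + sub_neg_gold_neg)
    # Joint cell probabilities.
    p_pp = p_gold_pos_given_pos * p_test_pos          # test +, gold +
    p_np = p_gold_pos_given_neg * p_test_neg          # test -, gold +
    p_pn = (1 - p_gold_pos_given_pos) * p_test_pos    # test +, gold -
    p_nn = (1 - p_gold_pos_given_neg) * p_test_neg    # test -, gold -
    sensitivity = p_pp / (p_pp + p_np)   # P(test + | gold +)
    specificity = p_nn / (p_nn + p_pn)   # P(test - | gold -)
    return sensitivity, specificity

# Hypothetical example: 300 of 1000 report atypical nevi; among the verified
# subsample, 90 of 120 test-positives and 20 of 200 test-negatives are confirmed.
# two_stage_sens_spec(300, 700, 90, 30, 20, 180)
```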
13.  Population Intervention Causal Effects Based on Stochastic Interventions 
Biometrics  2011;68(2):541-549.
SUMMARY
Estimating the causal effect of an intervention on a population typically involves defining parameters in a nonparametric structural equation model (Pearl, 2000, Causality: Models, Reasoning, and Inference) in which the treatment or exposure is deterministically assigned in a static or dynamic way. We define a new causal parameter that takes into account the fact that intervention policies can result in stochastically assigned exposures. The statistical parameter that identifies the causal parameter of interest is established. Inverse probability of treatment weighting (IPTW), augmented IPTW (A-IPTW), and targeted maximum likelihood estimators (TMLE) are developed. A simulation study is performed to demonstrate the properties of these estimators, which include the double robustness of the A-IPTW and the TMLE. An application example using physical activity data is presented.
doi:10.1111/j.1541-0420.2011.01685.x
PMCID: PMC4117410  PMID: 21977966
Causal effect; Counterfactual outcome; Double robustness; Stochastic intervention; Targeted maximum likelihood estimation
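For a single time point with covariates W, treatment A, outcome Y, and a known stochastic intervention g*(a | W), the statistical parameter that identifies the causal quantity described above can be written as follows; this is the standard point-treatment form, shown here only as a reading aid.

```latex
\[
  \Psi(P) \;=\; E_W \Bigl[\, \sum_{a} E(Y \mid A = a, W)\, g^{*}(a \mid W) \Bigr],
  \qquad \text{with } g^{*}(a \mid W) = I(a = a_0) \text{ recovering a static intervention.}
\]
```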
14.  Targeted Maximum Likelihood Based Causal Inference: Part I 
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The resulting G-computation formula represents the counterfactual distribution the data would have had if this intervention had been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
doi:10.2202/1557-4679.1211
PMCID: PMC3126670  PMID: 20737021
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
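The G-computation formula invoked above has a compact form for a longitudinal data structure with covariate nodes L_0, ..., L_K, intervention nodes A_0, ..., A_K, and outcome Y. Under a static regimen ā, the counterfactual mean integrates the outcome regression over the non-intervention factors of the factorized likelihood; the notation below is generic rather than copied from the article.

```latex
\[
  E\bigl[Y_{\bar a}\bigr]
  \;=\; \sum_{\bar l}\; E\bigl(Y \mid \bar L_K = \bar l,\ \bar A_K = \bar a\bigr)
    \prod_{k=0}^{K} P\bigl(L_k = l_k \mid \bar L_{k-1} = \bar l_{k-1},\ \bar A_{k-1} = \bar a_{k-1}\bigr).
\]
```

The estimator described in the abstract starts from super-learner fits of these unknown factors and then applies the least favorable fluctuation to each estimated factor before substituting into this formula.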
15.  A Targeted Maximum Likelihood Estimator of a Causal Effect on a Bounded Continuous Outcome 
Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter ɛ representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at ɛ = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.
When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.
doi:10.2202/1557-4679.1260
PMCID: PMC3126669  PMID: 21731529
targeted maximum likelihood estimation; TMLE; causal effect
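The bounded-outcome device described above amounts to rescaling an outcome known to lie in [a, b] to the unit interval and fluctuating on the logit scale, so that every fluctuated regression stays within the bounds; the display below states that idea in generic notation and may differ in detail from the article's exact parameterization.

```latex
\[
  Y^{*} = \frac{Y - a}{b - a}, \qquad
  \operatorname{logit} \bar Q^{*}_n(\epsilon)(A,W)
    = \operatorname{logit} \bar Q^{*}_n(A,W) + \epsilon\, H_{g_n}(A,W), \qquad
  \hat\mu = a + (b - a)\,\hat\mu^{*},
\]
```

where $\bar Q^{*}_n$ is the regression of the rescaled outcome and $\hat\mu^{*}$ is the targeted substitution estimate on the rescaled scale.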
16.  Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation 
Statistics in medicine  2009;28(1):39-64.
SUMMARY
Covariate adjustment using linear models for continuous outcomes in randomized trials has been shown to increase efficiency and power over the unadjusted method in estimating the marginal effect of treatment. However, for binary outcomes, investigators generally rely on the unadjusted estimate as the literature indicates that covariate-adjusted estimates based on the logistic regression models are less efficient. The crucial step that has been missing when adjusting for covariates is that one must integrate/average the adjusted estimate over those covariates in order to obtain the marginal effect. We apply the method of targeted maximum likelihood estimation (tMLE) to obtain estimators for the marginal effect using covariate adjustment for binary outcomes. We show that the covariate adjustment in randomized trials using the logistic regression models can be mapped, by averaging over the covariate(s), to obtain a fully robust and efficient estimator of the marginal effect, which equals a targeted maximum likelihood estimator. This tMLE is obtained by simply adding a clever covariate to a fixed initial regression. We present simulation studies that demonstrate that this tMLE increases efficiency and power over the unadjusted method, particularly for smaller sample sizes, even when the regression model is mis-specified.
doi:10.1002/sim.3445
PMCID: PMC2857590  PMID: 18985634
clinical trials; efficiency; covariate adjustment; variable selection
17.  Maximum Likelihood Estimations and EM Algorithms with Length-biased Data 
SUMMARY
Length-biased sampling has been well recognized in economics, industrial reliability, etiology applications, and epidemiological, genetic, and cancer screening studies. Length-biased right-censored data have a unique data structure different from traditional survival data. The nonparametric and semiparametric estimation and inference methods for traditional survival data are not directly applicable to length-biased right-censored data. We propose new expectation-maximization algorithms for estimation based on full likelihoods involving infinite dimensional parameters under three settings for length-biased data: estimating the nonparametric distribution function, estimating the nonparametric hazard function under an increasing failure rate constraint, and jointly estimating the baseline hazard function and the covariate coefficients under the Cox proportional hazards model. Extensive empirical simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and lead to more efficient estimators compared to the estimating equation approaches. The proposed estimates are also more robust to various right-censoring mechanisms. We prove the strong consistency properties of the estimators, and establish the asymptotic normality of the semi-parametric maximum likelihood estimators under the Cox model using modern empirical process theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
doi:10.1198/jasa.2011.tm10156
PMCID: PMC3273908  PMID: 22323840
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
18.  Mixed effect regression analysis for a cluster-based two-stage outcome-auxiliary-dependent sampling design with a continuous outcome 
Biostatistics (Oxford, England)  2012;13(4):650-664.
The two-stage design is a well-known, cost-effective way of conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments have further allowed one or both stages of the two-stage design to be outcome dependent on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gains in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sampling (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. A simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
doi:10.1093/biostatistics/kxs013
PMCID: PMC3440236  PMID: 22723503
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
19.  Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial 
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers over a wide age range, and suggest that the mutation affects survival rates equally in both genders. The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and in helping future subjects at risk make informed decisions on whether to undergo genetic mutation testing.
doi:10.1080/01621459.2012.699353
PMCID: PMC3905630  PMID: 24489419
Censored data; Finite mixture model; Huntington’s disease; Kin-cohort design; Quantitative trait locus
20.  Child Mortality Estimation: Appropriate Time Periods for Child Mortality Estimates from Full Birth Histories 
PLoS Medicine  2012;9(8):e1001289.
Jon Pedersen and Jing Liu examine the feasibility and potential advantages of using one-year rather than five-year time periods along with calendar year-based estimation when deriving estimates of child mortality.
Background
Child mortality estimates from complete birth histories from Demographic and Health Surveys (DHS) surveys and similar surveys are a chief source of data used to track Millennium Development Goal 4, which aims for a reduction of under-five mortality by two-thirds between 1990 and 2015. Based on the expected sample sizes when the DHS program commenced, the estimates are usually based on 5-y time periods. Recent surveys have had larger sample sizes than early surveys, and here we aimed to explore the benefits of using shorter time periods than 5 y for estimation. We also explore the benefit of changing the estimation procedure from being based on years before the survey, i.e., measured with reference to the date of the interview for each woman, to being based on calendar years.
Methods and Findings
Jackknife variance estimation was used to calculate standard errors for 207 DHS surveys in order to explore to what extent the large samples in recent surveys can be used to produce estimates based on 1-, 2-, 3-, 4-, and 5-y periods. We also recalculated the estimates for the surveys into calendar-year-based estimates. We demonstrate that estimation for 1-y periods is indeed possible for many recent surveys.
Conclusions
The reduction in bias achieved using 1-y periods and calendar-year-based estimation is worthwhile in some cases. In particular, it allows tracking of the effects of particular events such as droughts, epidemics, or conflict on child mortality in a way not possible with previous estimation procedures. Recommendations to use estimation for short time periods when possible and to use calendar-year-based estimation were adopted in the United Nations 2011 estimates of child mortality.
Editors' Summary
Background
In 2000, world leaders set, as Millennium Development Goal 4 (MDG 4), a target of reducing global under-five mortality (the number of children who die before their fifth birthday) to a third of its 1990 level (12 million deaths per year) by 2015. (The MDGs are designed to alleviate extreme poverty by 2015.) To track progress towards MDG 4, the under-five mortality rate (also shown as 5q0) needs to be estimated both “precisely” and “accurately.” A “precise” estimate has a small random error (a quality indicated by a statistical measurement called the coefficient of variation), and an “accurate” estimate is one that is close to the true value because it lacks bias (systematic errors). In an ideal world, under-five mortality estimates would be based on official records of births and deaths. However, developing countries, which are where most under-five deaths occur, rarely have such records, and under-five mortality estimation relies on “complete birth histories” provided by women via surveys. These are collected by Demographic and Health Surveys (DHS, a project that helps developing countries collect data on health and population trends) and record all the births that a surveyed woman has had and the age at death of any of her children who have died.
Why Was This Study Done?
Because the DHS originally surveyed samples of 5,000–6,000 women, estimates of under-five mortality are traditionally calculated using data from five-year time periods. Over shorter periods with this sample size, the random errors in under-five mortality estimates become unacceptably large. Nowadays, the average DHS survey sample size is more than 10,000 women, so it should be possible to estimate under-five mortality over shorter time periods. Such estimates should be able to track the effects on under-five mortality of events such as droughts and conflicts better than estimates made over five years. In this study, the researchers determine appropriate time periods for child mortality estimates based on full birth histories, given different sample sizes. Specifically, they ask whether, with the bigger sample sizes that are now available, details about trends in under-five mortality rates are being missed by using the estimation procedures that were developed for smaller samples. They also ask whether calendar-year-based estimates can be calculated; mortality is usually estimated in “years before the survey,” a process that blurs the reference period for the estimate.
What Did the Researchers Do and Find?
The researchers used a statistical method called “jackknife variance estimation” to determine coefficients of variation for child mortality estimates calculated over different time periods using complete birth histories from 207 DHS surveys. Regardless of the estimation period, half of the estimates had a coefficient of variation of less than 10%, a level of random variation that is generally considered acceptable. However, within each time period, some estimates had very high coefficients of variation. These estimates were derived from surveys where there was a small sample size, low fertility (the women surveyed had relatively few babies), or low child mortality. Other analyses show that although the five-year period estimates had lower standard errors than the one-year period estimates, the latter were affected less by bias than the five-year period estimates. Finally, estimates fixed to calendar years rather than to years before the survey were more directly comparable across surveys and brought out variations in child mortality caused by specific events such as conflicts more clearly.
What Do These Findings Mean?
These findings show that although under-five mortality rate estimates based on five-year periods of data have been the norm, the sample sizes currently employed in DHS surveys make it feasible to estimate mortality for shorter periods. The findings also show that using shorter periods of data in estimations of the under-five mortality rate, and using calendar-year-based estimation, reduces bias (makes the estimations more accurate) and allows the effects of events such as droughts, epidemics, or conflict on under-five mortality rates to be tracked in a way that is impossible when using five-year periods of data. Given these findings, the researchers recommend that time periods shorter than five years should be adopted for the estimation of under-five mortality and that estimations should be pegged to calendar years rather than to years before the survey. Both recommendations have already been adopted by the United Nations Inter-agency Group for Child Mortality Estimation (IGME) and were used in their 2011 analysis of under-five mortality.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001289.
This paper is part of a collection of papers on Child Mortality Estimation Methods published in PLOS Medicine
The United Nations Children's Fund (UNICEF) works for children's rights, survival, development, and protection around the world; it provides information on Millennium Development Goal 4, and its Childinfo website provides detailed statistics about child survival and health, including a description of the United Nations Inter-agency Group for Child Mortality Estimation; the 2011 IGME report on Levels and Trends in Child Mortality is available
The World Health Organization also has information about Millennium Development Goal 4 and provides estimates of child mortality rates (some information in several languages)
Further information about the Millennium Development Goals is available
Information is also available about Demographic and Health Surveys of infant and child mortality
doi:10.1371/journal.pmed.1001289
PMCID: PMC3429388  PMID: 22952435
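The jackknife standard errors described in the Methods can be sketched as a delete-one-cluster loop over primary sampling units. The function below is a generic illustration, not the authors' code; `estimator` stands for, say, a one-year under-five mortality calculation applied to a row-indexable survey data set.

```python
# Generic delete-one-cluster jackknife for a survey-based estimator.
# `data` is any object that can be row-indexed with a boolean mask (e.g. a
# NumPy record array or a pandas DataFrame), `cluster_ids` identifies the
# primary sampling units, and `estimator` maps a data subset to a point
# estimate such as an under-five mortality rate for one calendar year.
import numpy as np

def jackknife_se(data, cluster_ids, estimator):
    clusters = np.unique(cluster_ids)
    k = len(clusters)
    full = estimator(data)
    replicates = np.array([estimator(data[cluster_ids != c]) for c in clusters])
    # Pseudo-value form of the delete-one-group jackknife variance.
    pseudo = k * full - (k - 1) * replicates
    variance = np.sum((pseudo - pseudo.mean()) ** 2) / (k * (k - 1))
    return full, np.sqrt(variance)
```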
21.  Efficient Logistic Regression Designs Under an Imperfect Population Identifier 
Biometrics  2013;70(1):175-184.
Summary
Motivated by actual study designs, this article considers efficient logistic regression designs where the population is identified with a binary test that is subject to diagnostic error. We consider the case where the imperfect test is obtained on all participants, while the gold standard test is measured on a small chosen subsample. Under maximum-likelihood estimation, we evaluate the optimal design in terms of sample selection as well as verification. We show that there may be substantial efficiency gains by choosing a small percentage of individuals who test negative on the imperfect test for inclusion in the sample (e.g., verifying 90% test-positive cases). We also show that a two-stage design may be a good practical alternative to a fixed design in some situations. Under optimal and nearly optimal designs, we compare maximum-likelihood and semi-parametric efficient estimators under correct and misspecified models with simulations. The methodology is illustrated with an analysis from a diabetes behavioral intervention trial.
doi:10.1111/biom.12106
PMCID: PMC3954435  PMID: 24261471
Case-control designs; Diagnostic accuracy; Epidemiologic designs; Misclassification; Measurement error
22.  Double inverse-weighted estimation of cumulative treatment effects under non-proportional hazards and dependent censoring 
Biometrics  2011;67(1):29-38.
Summary
In medical studies of time to event data, non-proportional hazards and dependent censoring are very common issues when estimating the treatment effect. A traditional method for dealing with time-dependent treatment effects is to model the time-dependence parametrically. Limitations of this approach include the difficulty of verifying the correctness of the specified functional form and the fact that, in the presence of a treatment effect that varies over time, investigators are usually interested in the cumulative as opposed to the instantaneous treatment effect. In many applications, censoring time is not independent of event time. Therefore, we propose methods for estimating the cumulative treatment effect in the presence of non-proportional hazards and dependent censoring. Three measures are proposed, including the ratio of cumulative hazards, relative risk, and difference in restricted mean lifetime. For each measure, we propose a double-inverse-weighted estimator, constructed by first using inverse probability of treatment weighting (IPTW) to balance the treatment-specific covariate distributions, then using inverse probability of censoring weighting (IPCW) to overcome the dependent censoring. The proposed estimators are shown to be consistent and asymptotically normal. We study their finite-sample properties through simulation. The proposed methods are used to compare kidney wait list mortality by race.
doi:10.1111/j.1541-0420.2010.01449.x
PMCID: PMC3372067  PMID: 20560935
Cumulative hazard; Dependent censoring; Inverse weighting; Relative Risk; Restricted mean lifetime; Survival analysis; Treatment effect
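The double-weighting construction described above can be read as: balance the treatment arms with inverse probability of treatment weights, correct for dependent censoring with inverse probability of censoring weights, and feed the combined weights into a weighted survival summary such as the restricted mean lifetime. The sketch below implements only the last step, a weighted Kaplan-Meier curve and its restricted mean for one arm; the combined-weight comment and the function name are illustrative simplifications, not the paper's estimators or variance formulas.

```python
# Weighted Kaplan-Meier curve and restricted mean lifetime up to tau.
# `time`, `event`, and `weight` are NumPy arrays; `weight` would carry the
# product of IPTW and IPCW contributions in the double-weighting scheme.
import numpy as np

def weighted_km_rmst(time, event, weight, tau):
    order = np.argsort(time)
    t, d, w = time[order], event[order], weight[order]
    surv, s = [], 1.0
    for u in np.unique(t[d == 1]):
        at_risk = w[t >= u].sum()
        events = w[(t == u) & (d == 1)].sum()
        if at_risk > 0:
            s *= 1.0 - events / at_risk
        surv.append((u, s))
    # Integrate the survival step function from 0 to tau.
    rmst, prev_t, prev_s = 0.0, 0.0, 1.0
    for u, s_u in surv:
        if u > tau:
            break
        rmst += prev_s * (u - prev_t)
        prev_t, prev_s = u, s_u
    rmst += prev_s * (tau - prev_t)
    return np.array(surv), rmst

# Illustrative combined weight for subject i in one treatment arm:
#   w_i = [A_i / ps_i + (1 - A_i) / (1 - ps_i)] / K_c(T_i | covariates),
# with ps_i an estimated propensity score and K_c an estimated censoring
# survival function.
```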
23.  EFFECT OF BREASTFEEDING ON GASTROINTESTINAL INFECTION IN INFANTS: A TARGETED MAXIMUM LIKELIHOOD APPROACH FOR CLUSTERED LONGITUDINAL DATA 
The annals of applied statistics  2014;8(2):703-725.
The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized “causal” estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application and review some related methods and the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
PMCID: PMC4259272  PMID: 25505499
Causal inference; G-computation; inverse probability weighting; marginal effects; missing data; pediatrics
24.  Modelling the initial phase of an epidemic using incidence and infection network data: 2009 H1N1 pandemic in Israel as a case study 
This paper presents new computational and modelling tools for studying the dynamics of an epidemic in its initial stages that use both available incidence time series and data describing the population's infection network structure. The work is motivated by data collected at the beginning of the H1N1 pandemic outbreak in Israel in the summer of 2009. We formulated a new discrete-time stochastic epidemic SIR (susceptible-infected-recovered) model that explicitly takes into account the disease's specific generation-time distribution and the intrinsic demographic stochasticity inherent to the infection process. Moreover, in contrast with many other modelling approaches, the model allows direct analytical derivation of estimates for the effective reproductive number (Re) and of their credible intervals, by maximum likelihood and Bayesian methods. The basic model can be extended to include age–class structure, and a maximum likelihood methodology allows us to estimate the model's next-generation matrix by combining two types of data: (i) the incidence series of each age group, and (ii) infection network data that provide partial information of ‘who-infected-who’. Unlike other approaches for estimating the next-generation matrix, the method developed here does not require making a priori assumptions about the structure of the next-generation matrix. We show, using a simulation study, that even a relatively small amount of information about the infection network greatly improves the accuracy of estimation of the next-generation matrix. The method is applied in practice to estimate the next-generation matrix from the Israeli H1N1 pandemic data. The tools developed here should be of practical importance for future investigations of epidemics during their initial stages. However, they require the availability of data which represent a random sample of the real epidemic process. We discuss the conditions under which reporting rates may or may not influence our estimated quantities and the effects of bias.
doi:10.1098/rsif.2010.0515
PMCID: PMC3104348  PMID: 21247949
epidemic modelling; H1N1 influenza; maximum likelihood; model fitting; next-generation matrix
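A toy version of a discrete-time stochastic transmission model with an explicit generation-time distribution can be written as a renewal process with Poisson demographic noise and susceptible depletion. Everything below, including the reproductive number, the generation-interval weights, and the population size in the example, is invented for illustration and is not an estimate from the Israeli H1N1 data.

```python
# Toy discrete-time stochastic SIR-type model: new cases are Poisson with a
# mean given by R times the generation-interval-weighted past incidence,
# scaled by the remaining susceptible fraction.
import numpy as np

def simulate_incidence(R, gen_dist, n_days, pop_size, i0=5, seed=0):
    rng = np.random.default_rng(seed)
    gen_dist = np.asarray(gen_dist, dtype=float)
    gen_dist = gen_dist / gen_dist.sum()      # pmf over lags 1, 2, ...
    incidence = np.zeros(n_days)
    incidence[0] = i0
    susceptible = pop_size - i0
    for t in range(1, n_days):
        lags = gen_dist[:t][::-1]             # reversed pmf aligned with incidence[t-m:t]
        pressure = float(np.dot(incidence[t - len(lags):t], lags))
        mean_new = R * pressure * susceptible / pop_size
        new_cases = min(int(rng.poisson(mean_new)), susceptible)
        incidence[t] = new_cases
        susceptible -= new_cases
    return incidence

# Example (illustrative values): R = 1.5, generation-time pmf over lags of 1-4 days.
# cases = simulate_incidence(1.5, [0.1, 0.4, 0.3, 0.2], n_days=60, pop_size=100_000)
```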
25.  Estimating Correlation with Multiply Censored Data Arising from the Adjustment of Singly Censored Data 
Environmental data frequently are left censored due to detection limits of laboratory assay procedures. Left censored means that some of the observations are known only to fall below a censoring point (detection limit). This presents difficulties in statistical analysis of the data. In this paper, we examine methods for estimating the correlation between variables each of which is censored at multiple points. Multiple censoring frequently arises due to adjustment of singly censored laboratory results for physical sample size. We discuss maximum likelihood (ML) estimation of the correlation and introduce a new method (cp.mle2) that, instead of using the multiply censored data directly, relies on ML estimates of the covariance of the singly censored laboratory data. We compare the ML methods with Kendall's tau-b (ck.taub), which is a modification of Kendall's tau adjusted for ties, and several commonly used simple substitution methods: correlations estimated with non-detects set to the detection limit divided by two and correlations based on detects only (cs.det) with non-detects set to missing. The methods are compared based on simulations and real data. In the simulations, censoring levels are varied from 0 to 90%, ρ from -0.8 to 0.8 and ν (variance of physical sample size) is set to 0 and 0.5, for a total of 550 parameter combinations with 1000 replications at each combination. We find that with increasing levels of censoring most of the correlation methods are highly biased. The simple substitution methods in general tend toward zero if singly censored and one if multiply censored. ck.taub tends toward zero. Least biased is cp.mle2; however, it has higher variance than some of the other estimators. Overall, cs.det performs the worst and cp.mle2 the best.
PMCID: PMC2565512  PMID: 17265951
Correlation; left censored; detection limit; environmental data
