Related Articles

1.  Collaborative Double Robust Targeted Maximum Likelihood Estimation* 
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
PMCID: PMC2898626  PMID: 20628637
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
2.  The Relative Performance of Targeted Maximum Likelihood Estimators 
There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins, et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes, are especially suitable for dealing with positivity violations because in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
PMCID: PMC3173607  PMID: 21931570
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
3.  Dimension reduction with gene expression data using targeted variable importance measurement 
BMC Bioinformatics  2011;12:312.
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes with a more operable length ideally including all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary predecessor of the analysis because it can not only reduce the cost of handling numerous variables, but also has the potential to improve the performance of the downstream analysis algorithms.
We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage uses a machine learning algorithm, and the second stage improves the first-stage estimate with respect to the parameter of interest.
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
PMCID: PMC3166941  PMID: 21849016
4.  Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation 
Biometrics  2013;70(1):144-152.
Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD) which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on risk for ESLD. In this case study, we examine whether HCV clearance affects risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included non-censoring variable missingness and event sparsity.
In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly-robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputations to account for the non-monotone missing data.
PMCID: PMC3954273  PMID: 24571372
Double-robust; Inverse probability of treatment weighting; Kaplan-Meier; Longitudinal data; Marginal structural model; Survival analysis; Targeted maximum likelihood estimation
5.  Targeted maximum likelihood estimation in safety analysis 
Journal of clinical epidemiology  2013;66(8 0):10.1016/j.jclinepi.2013.02.017.
Objective
To compare the performance of a targeted maximum likelihood estimator (TMLE) and a collaborative TMLE (CTMLE) to other estimators in a drug safety analysis, including a regression-based estimator, propensity score (PS)–based estimators, and an alternate doubly robust (DR) estimator in a real example and simulations.
Study Design and Setting
The real data set is a subset of observational data from Kaiser Permanente Northern California formatted for use in active drug safety surveillance. Both the real and simulated data sets include potential confounders, a treatment variable indicating use of one of two antidiabetic treatments and an outcome variable indicating occurrence of an acute myocardial infarction (AMI).
Results
In the real data example, there is no difference in AMI rates between treatments. In simulations, the double robustness property is demonstrated: DR estimators are consistent if either the initial outcome regression or PS estimator is consistent, whereas other estimators are inconsistent if the initial estimator is not consistent. In simulations with near-positivity violations, CTMLE performs well relative to other estimators by adaptively estimating the PS.
Conclusion
Each of the DR estimators was consistent, and TMLE and CTMLE had the smallest mean squared error in simulations.
PMCID: PMC3818128  PMID: 23849159
Safety analysis; Targeted maximum likelihood estimation; Doubly robust; Causal inference; Collaborative targeted maximum likelihood estimation; Super learning
6.  A Causal Framework for Understanding the Effect of Losses to Follow-up on Epidemiologic Analyses in Clinic-based Cohorts: The Case of HIV-infected Patients on Antiretroviral Therapy in Africa 
American Journal of Epidemiology  2012;175(10):1080-1087.
Although clinic-based cohorts are most representative of the “real world,” they are susceptible to loss to follow-up. Strategies for managing the impact of loss to follow-up are therefore needed to maximize the value of studies conducted in these cohorts. The authors evaluated adult patients starting antiretroviral therapy at an HIV/AIDS clinic in Uganda, where 29% of patients were lost to follow-up after 2 years (January 1, 2004–September 30, 2007). Unweighted, inverse probability of censoring weighted (IPCW), and sampling-based approaches (using supplemental data from a sample of lost patients subsequently tracked in the community) were used to identify the predictive value of sex on mortality. Directed acyclic graphs (DAGs) were used to explore the structural basis for bias in each approach. Among 3,628 patients, unweighted and IPCW analyses found men to have higher mortality than women, whereas the sampling-based approach did not. DAGs encoding knowledge about the data-generating process, including the fact that death is a cause of being classified as lost to follow-up in this setting, revealed “collider” bias in the unweighted and IPCW approaches. In a clinic-based cohort in Africa, unweighted and IPCW approaches—which rely on the “missing at random” assumption—yielded biased estimates. A sampling-based approach can in general strengthen epidemiologic analyses conducted in many clinic-based cohorts, including those examining other diseases.
PMCID: PMC3353135  PMID: 22306557
Africa; antiretroviral therapy; clinic-based cohorts; directed acyclic graphs; informative censoring; inverse probability of censoring weights; loss to follow-up; missing at random
7.  A two-stage validation study for determining sensitivity and specificity. 
Environmental Health Perspectives  1994;102(Suppl 8):11-14.
A two-stage procedure for estimating sensitivity and specificity is described. The procedure is developed in the context of a validation study for self-reported atypical nevi, a potentially useful measure in the study of risk factors for malignant melanoma. The first stage consists of a sample of N individuals classified only by the test measure. The second stage is a subsample of size m, stratified according to the information collected in the first stage, in which the presence of atypical nevi is determined by clinical examination. Using missing data methods for contingency tables, maximum likelihood estimators for the joint distribution of the test measure and the "gold standard" clinical evaluation are presented, along with efficient estimators for the sensitivity and specificity. Asymptotic coefficients of variation are computed to compare alternative sampling strategies for the second stage.
PMCID: PMC1566548  PMID: 7851324
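The two-stage design in the abstract above can be illustrated with a small plug-in calculation. This is only a sketch under stated assumptions, not the paper's own estimator: the test measure is taken to be binary, the second-stage subsample is assumed to be drawn at random within each test stratum, and the function name `two_stage_sens_spec` and its argument layout are hypothetical. The joint distribution is estimated as P(T = t) from stage one times P(D = d | T = t) from stage two, and sensitivity and specificity follow from Bayes' rule.

```python
def two_stage_sens_spec(stage1_counts, stage2_counts):
    """Plug-in sensitivity and specificity from a two-stage validation design.

    stage1_counts: {test_result: number classified in stage one}, keys 0 and 1.
    stage2_counts: {test_result: (n_with_condition, n_without_condition)}
                   from the clinically examined subsample of that stratum.
    Assumes random subsampling within each test stratum (hypothetical setup).
    """
    n_total = sum(stage1_counts.values())
    joint = {}
    for t, n_t in stage1_counts.items():
        with_cond, without_cond = stage2_counts[t]
        m_t = with_cond + without_cond
        if m_t == 0:
            raise ValueError("each test stratum needs a validated subsample")
        p_t = n_t / n_total                        # P(T = t) from stage one
        joint[(t, 1)] = p_t * with_cond / m_t      # P(T = t, D = 1)
        joint[(t, 0)] = p_t * without_cond / m_t   # P(T = t, D = 0)
    sensitivity = joint[(1, 1)] / (joint[(1, 1)] + joint[(0, 1)])
    specificity = joint[(0, 0)] / (joint[(0, 0)] + joint[(1, 0)])
    return sensitivity, specificity


# Made-up counts: 1000 people screened, 50 validated in each stratum.
sens, spec = two_stage_sens_spec({1: 200, 0: 800}, {1: (40, 10), 0: (5, 45)})
```

With these made-up counts the estimates are roughly 0.67 for sensitivity and 0.95 for specificity. Weighting by the stage-one proportions is what corrects for the stratified subsample; computing sensitivity directly from the 100 validated subjects would ignore that test-negatives are under-represented among them.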
8.  An Application of Collaborative Targeted Maximum Likelihood Estimation in Causal Inference and Genomics 
A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
PMCID: PMC3126668  PMID: 21731530
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
9.  A Method for Subsampling Terrestrial Invertebrate Samples in the Laboratory: Estimating Abundance and Taxa Richness 
Significant progress has been made in developing subsampling techniques to process large samples of aquatic invertebrates. However, limited information is available regarding subsampling techniques for terrestrial invertebrate samples. Therefore a novel subsampling procedure was evaluated for processing samples of terrestrial invertebrates collected using two common field techniques: pitfall and pan traps. A three-phase sorting protocol was developed for estimating abundance and taxa richness of invertebrates. First, large invertebrates and plant material were removed from the sample using a sieve with a 4 mm mesh size. Second, the sample was poured into a specially designed, gridded sampling tray, and 16 cells, comprising 25% of the sampling tray, were randomly subsampled and processed. Third, the remainder of the sample was scanned for 4–7 min to record rare taxa missed in the second phase. To compare estimated abundance and taxa richness with the true values of these variables for the samples, the remainder of each sample was processed completely. The results were analyzed relative to three sample size categories: samples with less than 250 invertebrates (low abundance samples), samples with 250–500 invertebrates (moderate abundance samples), and samples with more than 500 invertebrates (high abundance samples). The number of invertebrates estimated after subsampling eight or more cells was highly precise for all sizes and types of samples. High accuracy for moderate and high abundance samples was achieved after even as few as six subsamples. However, estimates of the number of invertebrates for low abundance samples were less reliable. The subsampling technique also adequately estimated taxa richness; on average, subsampling detected 89% of taxa found in samples. Thus, the subsampling technique provided accurate data on both the abundance and taxa richness of terrestrial invertebrate samples. 
Importantly, subsampling greatly decreased the time required to process samples, cutting the time per sample by up to 80%. Based on these data, this subsampling technique is recommended to minimize the time and cost of processing moderate to large samples without compromising the integrity of the data and to maximize the information extracted from large terrestrial invertebrate samples. For samples with a relatively low number of invertebrates, complete counting is preferred.
PMCID: PMC3014723  PMID: 20578889
pitfall traps; laboratory sampling techniques
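The scaling step implied by the second phase of the protocol above (16 cells comprising 25% of the tray, hence a 64-cell tray) can be sketched as simple arithmetic. The function name `estimate_abundance` is hypothetical, and the sketch assumes invertebrates are spread evenly enough that the mean count per sampled cell represents the whole tray; it does not model the first-phase sieve or the third-phase scan for rare taxa.

```python
def estimate_abundance(cell_counts, total_cells):
    """Scale counts from randomly chosen tray cells up to the whole tray.

    cell_counts: invertebrate counts, one per subsampled cell.
    total_cells: number of cells in the gridded tray (64 if 16 cells are 25%).
    """
    if not cell_counts or len(cell_counts) > total_cells:
        raise ValueError("need between 1 and total_cells subsampled cells")
    mean_per_cell = sum(cell_counts) / len(cell_counts)
    return mean_per_cell * total_cells


# Made-up example: 16 cells with 5 invertebrates each, out of 64 cells.
estimated_total = estimate_abundance([5] * 16, 64)
```

Using the mean per cell rather than a fixed multiplier lets the same function handle the six- or eight-cell subsamples that the abstract also evaluates.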
10.  Population Intervention Causal Effects Based on Stochastic Interventions 
Biometrics  2011;68(2):541-549.
Estimating the causal effect of an intervention on a population typically involves defining parameters in a nonparametric structural equation model (Pearl, 2000, Causality: Models, Reasoning, and Inference) in which the treatment or exposure is deterministically assigned in a static or dynamic way. We define a new causal parameter that takes into account the fact that intervention policies can result in stochastically assigned exposures. The statistical parameter that identifies the causal parameter of interest is established. Inverse probability of treatment weighting (IPTW), augmented IPTW (A-IPTW), and targeted maximum likelihood estimators (TMLE) are developed. A simulation study is performed to demonstrate the properties of these estimators, which include the double robustness of the A-IPTW and the TMLE. An application example using physical activity data is presented.
PMCID: PMC4117410  PMID: 21977966
Causal effect; Counterfactual outcome; Double robustness; Stochastic intervention; Targeted maximum likelihood estimation
11.  A Targeted Maximum Likelihood Estimator of a Causal Effect on a Bounded Continuous Outcome 
Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter ɛ representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at ɛ = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.
When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.
PMCID: PMC3126669  PMID: 21731529
targeted maximum likelihood estimation; TMLE; causal effect
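The contrast drawn above between fluctuating on the logit scale and on the linear scale can be made concrete with a small sketch. It assumes known outcome bounds `lower` and `upper`, an already-estimated fluctuation parameter `eps`, and a covariate value `h` that multiplies it; the function names are hypothetical and the numbers are made up. The outcome is rescaled to (0, 1), shifted on the logit scale, and mapped back, so the updated prediction cannot leave the bounds.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p):
    return math.log(p / (1.0 - p))


def fluctuate_logit(pred, eps, h, lower, upper):
    """Update a prediction strictly inside (lower, upper) on the logit scale."""
    scaled = (pred - lower) / (upper - lower)   # map to (0, 1)
    updated = expit(logit(scaled) + eps * h)    # shift on the logit scale
    return lower + (upper - lower) * updated    # map back to original units


def fluctuate_linear(pred, eps, h):
    """Naive linear-scale update, which can leave the known bounds."""
    return pred + eps * h


# Outcome bounded in [0, 10]; a large h mimics a heavily weighted observation.
bounded = fluctuate_logit(9.5, eps=0.2, h=20.0, lower=0.0, upper=10.0)
naive = fluctuate_linear(9.5, eps=0.2, h=20.0)
```

Here the linear update gives 13.5, outside the stated range, while the logit-scale update stays just below 10. The initial prediction must lie strictly between the bounds for the logit to be defined.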
12.  Self-Consistent Nonparametric Maximum Likelihood Estimator of the Bivariate Survivor Function 
Biometrika  2014;101(3):505-518.
As usually formulated, the nonparametric likelihood for the bivariate survivor function is over-parameterized, resulting in uniqueness problems for the corresponding nonparametric maximum likelihood estimator. Here the estimation problem is redefined to include parameters for marginal hazard rates, and for double failure hazard rates only at informative uncensored failure time grid points where there is pertinent empirical information. Double failure hazard rates at other grid points in the risk region are specified rather than estimated. With this approach the nonparametric maximum likelihood estimator is unique, and can be calculated using a two-step procedure. The first step involves setting aside all doubly censored observations that are interior to the risk region. The nonparametric maximum likelihood estimator from the remaining data turns out to be the Dabrowska (1988) estimator. The omitted doubly censored observations are included in the procedure in the second stage using self-consistency, resulting in a non-iterative nonparametric maximum likelihood estimator for the bivariate survivor function. Simulation evaluation and asymptotic distributional results are provided. Moderate sample size efficiency for the survivor function nonparametric maximum likelihood estimator is similar to that for the Dabrowska estimator as applied to the entire dataset, while some useful efficiency improvement arises for the corresponding distribution function estimator, presumably due to the avoidance of negative mass assignments.
PMCID: PMC4306565  PMID: 25632162
Bivariate survivor function; Censored data; Dabrowska estimator; Kaplan–Meier estimator; Non-parametric maximum likelihood; Self-consistency
13.  Mixed effect regression analysis for a cluster-based two-stage outcome-auxiliary-dependent sampling design with a continuous outcome 
Biostatistics (Oxford, England)  2012;13(4):650-664.
Two-stage design is a well-known cost-effective way for conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research development further allowed one or both stages of the two-stage design to be outcome dependent on a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gain in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design where the second-stage data are sampled with an outcome-auxiliary-dependent sample (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. Simulation study indicates that greater study efficiency gains can be achieved under the proposed two-stage OADS design with center-effects when compared with other alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
PMCID: PMC3440236  PMID: 22723503
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
14.  Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation 
Statistics in medicine  2009;28(1):39-64.
Covariate adjustment using linear models for continuous outcomes in randomized trials has been shown to increase efficiency and power over the unadjusted method in estimating the marginal effect of treatment. However, for binary outcomes, investigators generally rely on the unadjusted estimate as the literature indicates that covariate-adjusted estimates based on the logistic regression models are less efficient. The crucial step that has been missing when adjusting for covariates is that one must integrate/average the adjusted estimate over those covariates in order to obtain the marginal effect. We apply the method of targeted maximum likelihood estimation (tMLE) to obtain estimators for the marginal effect using covariate adjustment for binary outcomes. We show that the covariate adjustment in randomized trials using the logistic regression models can be mapped, by averaging over the covariate(s), to obtain a fully robust and efficient estimator of the marginal effect, which equals a targeted maximum likelihood estimator. This tMLE is obtained by simply adding a clever covariate to a fixed initial regression. We present simulation studies that demonstrate that this tMLE increases efficiency and power over the unadjusted method, particularly for smaller sample sizes, even when the regression model is mis-specified.
PMCID: PMC2857590  PMID: 18985634
clinical trials; efficiency; covariate adjustment; variable selection
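The "clever covariate" step described in the covariate-adjustment abstract above can be sketched for a risk difference in a two-arm randomized trial. This is a minimal illustration under assumptions, not the authors' code: the treatment probability `g1` is taken as known, `q0(a, w)` is a hypothetical user-supplied initial regression returning a probability strictly between 0 and 1, and the fluctuation parameter is fitted by a one-dimensional Newton iteration for a logistic model with offset.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_risk_difference(data, q0, g1=0.5, max_iter=50, tol=1e-10):
    """Targeted estimate of E[Y | A=1] - E[Y | A=0], averaged over covariates.

    data: list of (w, a, y) with binary treatment a and binary outcome y.
    q0:   initial regression, q0(a, w) -> P(Y = 1 | A = a, W = w) in (0, 1).
    g1:   known randomization probability P(A = 1).
    """
    # Clever covariate H(A) = A/g1 - (1 - A)/(1 - g1).
    h = [a / g1 - (1 - a) / (1 - g1) for (_, a, _) in data]
    offset = [logit(q0(a, w)) for (w, a, _) in data]

    # Fit eps in logit P(Y=1) = offset + eps * H; the initial fit stays fixed.
    eps = 0.0
    for _ in range(max_iter):
        p = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (y - pi) for hi, (_, _, y), pi in zip(h, data, p))
        info = sum(hi * hi * pi * (1 - pi) for hi, pi in zip(h, p))
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break

    # Updated predictions under each arm, then average over the covariates.
    diffs = [
        expit(logit(q0(1, w)) + eps / g1) - expit(logit(q0(0, w)) - eps / (1 - g1))
        for (w, _, _) in data
    ]
    return sum(diffs) / len(diffs), eps


# Made-up balanced trial with an uninformative initial fit.
trial = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0),
         (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 0)]
estimate, eps_hat = tmle_risk_difference(trial, q0=lambda a, w: 0.5)
```

With a constant initial fit and balanced arms, the targeted estimate reduces to the unadjusted difference in proportions (0.5 for this made-up data), which is a convenient sanity check. An informative `q0` that uses the covariates is where the efficiency gain described in the abstract would come from.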
15.  Maximum Likelihood Estimations and EM Algorithms with Length-biased Data 
Length-biased sampling has been well recognized in economics, industrial reliability, etiology applications, epidemiological, genetic and cancer screening studies. Length-biased right-censored data have a unique data structure different from traditional survival data. The nonparametric and semiparametric estimations and inference methods for traditional survival data are not directly applicable for length-biased right-censored data. We propose new expectation-maximization algorithms for estimations based on full likelihoods involving infinite dimensional parameters under three settings for length-biased data: estimating nonparametric distribution function, estimating nonparametric hazard function under an increasing failure rate constraint, and jointly estimating baseline hazards function and the covariate coefficients under the Cox proportional hazards model. Extensive empirical simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and lead to more efficient estimators compared to the estimating equation approaches. The proposed estimates are also more robust to various right-censoring mechanisms. We prove the strong consistency properties of the estimators, and establish the asymptotic normality of the semi-parametric maximum likelihood estimators under the Cox model using modern empirical processes theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
PMCID: PMC3273908  PMID: 22323840
Cox regression model; EM algorithm; Increasing failure rate; Non-parametric likelihood; Profile likelihood; Right-censored data
16.  Nonparametric estimation for censored mixture data with application to the Cooperative Huntington’s Observational Research Trial 
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on the inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. 
The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic test, and for helping at-risk individuals make informed decisions about whether to undergo genetic mutation testing.
PMCID: PMC3905630  PMID: 24489419
Censored data; Finite mixture model; Huntington’s disease; Kin-cohort design; Quantitative trait locus
17.  Efficient Logistic Regression Designs Under an Imperfect Population Identifier 
Biometrics  2013;70(1):175-184.
Motivated by actual study designs, this article considers efficient logistic regression designs where the population is identified with a binary test that is subject to diagnostic error. We consider the case where the imperfect test is obtained on all participants, while the gold standard test is measured on a small chosen subsample. Under maximum-likelihood estimation, we evaluate the optimal design in terms of sample selection as well as verification. We show that there may be substantial efficiency gains by choosing a small percentage of individuals who test negative on the imperfect test for inclusion in the sample (e.g., verifying 90% test-positive cases). We also show that a two-stage design may be a good practical alternative to a fixed design in some situations. Under optimal and nearly optimal designs, we compare maximum-likelihood and semi-parametric efficient estimators under correct and misspecified models with simulations. The methodology is illustrated with an analysis from a diabetes behavioral intervention trial.
PMCID: PMC3954435  PMID: 24261471
Case-control designs; Diagnostic accuracy; Epidemiologic designs; Misclassification; Measurement error
18.  Double inverse-weighted estimation of cumulative treatment effects under non-proportional hazards and dependent censoring 
Biometrics  2011;67(1):29-38.
In medical studies of time to event data, non-proportional hazards and dependent censoring are very common issues when estimating the treatment effect. A traditional method for dealing with time-dependent treatment effects is to model the time-dependence parametrically. Limitations of this approach include the difficulty to verify the correctness of the specified functional form and the fact that, in the presence of a treatment effect that varies over time, investigators are usually interested in the cumulative as opposed to instantaneous treatment effect. In many applications, censoring time is not independent of event time. Therefore, we propose methods for estimating the cumulative treatment effect in the presence of non-proportional hazards and dependent censoring. Three measures are proposed, including the ratio of cumulative hazards, relative risk and difference in restricted mean lifetime. For each measure, we propose a double-inverse-weighted estimator, constructed by first using inverse probability of treatment weighting (IPTW) to balance the treatment-specific covariate distributions, then using inverse probability of censoring weighting (IPCW) to overcome the dependent censoring. The proposed estimators are shown to be consistent and asymptotically normal. We study their finite-sample properties through simulation. The proposed methods are used to compare kidney wait list mortality by race.
PMCID: PMC3372067  PMID: 20560935
Cumulative hazard; Dependent censoring; Inverse weighting; Relative Risk; Restricted mean lifetime; Survival analysis; Treatment effect
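The double inverse weighting described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' estimator: the function names are hypothetical, the propensity scores and censoring-survival probabilities are assumed to have been estimated elsewhere, and each subject is given a single fixed weight, whereas a full implementation would likely need censoring weights that vary over time.

```python
def double_weights(treated, propensity, censor_surv):
    """Combine IPTW and IPCW into one weight per subject.

    treated:     1 if the subject received treatment, else 0
    propensity:  assumed estimate of P(treated = 1 | covariates)
    censor_surv: assumed estimate of P(still uncensored | covariates)
    """
    weights = []
    for a, p, k in zip(treated, propensity, censor_surv):
        p_observed = p if a == 1 else 1.0 - p  # probability of the arm actually received
        weights.append(1.0 / (p_observed * k))
    return weights


def weighted_cumulative_hazard(times, events, weights):
    """Weighted Nelson-Aalen estimate: at each event time, add
    (weighted number of events) / (weighted number at risk)."""
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    cumulative = 0.0
    curve = []
    for t in event_times:
        d_w = sum(w for ti, d, w in zip(times, events, weights) if ti == t and d == 1)
        r_w = sum(w for ti, w in zip(times, weights) if ti >= t)
        cumulative += d_w / r_w
        curve.append((t, cumulative))
    return curve
```

Computing this curve separately in each treatment group and taking the ratio at a chosen time would give a rough analogue of the "ratio of cumulative hazards" measure named in the abstract.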
The Annals of Applied Statistics  2014;8(2):703-725.
The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized “causal” estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application and review some related methods and the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
PMCID: PMC4259272  PMID: 25505499
Causal inference; G-computation; inverse probability weighting; marginal effects; missing data; pediatrics
20.  Modelling the initial phase of an epidemic using incidence and infection network data: 2009 H1N1 pandemic in Israel as a case study 
This paper presents new computational and modelling tools for studying the dynamics of an epidemic in its initial stages that use both available incidence time series and data describing the population's infection network structure. The work is motivated by data collected at the beginning of the H1N1 pandemic outbreak in Israel in the summer of 2009. We formulated a new discrete-time stochastic epidemic SIR (susceptible-infected-recovered) model that explicitly takes into account the disease's specific generation-time distribution and the intrinsic demographic stochasticity inherent to the infection process. Moreover, in contrast with many other modelling approaches, the model allows direct analytical derivation of estimates for the effective reproductive number (Re) and of their credible intervals, by maximum likelihood and Bayesian methods. The basic model can be extended to include age–class structure, and a maximum likelihood methodology allows us to estimate the model's next-generation matrix by combining two types of data: (i) the incidence series of each age group, and (ii) infection network data that provide partial information of ‘who-infected-who’. Unlike other approaches for estimating the next-generation matrix, the method developed here does not require making a priori assumptions about the structure of the next-generation matrix. We show, using a simulation study, that even a relatively small amount of information about the infection network greatly improves the accuracy of estimation of the next-generation matrix. The method is applied in practice to estimate the next-generation matrix from the Israeli H1N1 pandemic data. The tools developed here should be of practical importance for future investigations of epidemics during their initial stages. However, they require the availability of data which represent a random sample of the real epidemic process. 
We discuss the conditions under which reporting rates may or may not influence our estimated quantities and the effects of bias.
PMCID: PMC3104348  PMID: 21247949
epidemic modelling; H1N1 influenza; maximum likelihood; model fitting; next-generation matrix
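To make the role of the generation-time distribution concrete, the following sketch computes the expected incidence of a discrete-time epidemic under a renewal-type recursion. It is an illustration only: the renewal form, the function name, and all parameter values are assumptions, and the sketch is deterministic, so it omits the demographic stochasticity that the abstract emphasizes.

```python
def expected_incidence(initial, gen_time, re, steps):
    """Expected new infections per time step.

    initial:  list of incidence values observed so far (oldest first)
    gen_time: gen_time[s-1] is the assumed probability that an infection
              is transmitted s steps after the infector was infected
    re:       assumed effective reproductive number
    steps:    number of further time steps to project
    """
    series = list(initial)
    for _ in range(steps):
        t = len(series)
        # Infection pressure: past incidence weighted by the generation-time distribution.
        force = sum(
            w * series[t - s]
            for s, w in enumerate(gen_time, start=1)
            if t - s >= 0
        )
        series.append(re * force)
    return series
```

For example, expected_incidence([1.0], [0.5, 0.5], 2.0, 3) projects three steps from a single initial case with a generation time split evenly between one and two steps.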
21.  Child Mortality Estimation: Appropriate Time Periods for Child Mortality Estimates from Full Birth Histories 
PLoS Medicine  2012;9(8):e1001289.
Jon Pedersen and Jing Liu examine the feasibility and potential advantages of using one-year rather than five-year time periods along with calendar year-based estimation when deriving estimates of child mortality.
Child mortality estimates from complete birth histories from Demographic and Health Surveys (DHS) and similar surveys are a chief source of data used to track Millennium Development Goal 4, which aims for a reduction of under-five mortality by two-thirds between 1990 and 2015. Based on the expected sample sizes when the DHS program commenced, the estimates are usually based on 5-y time periods. Recent surveys have had larger sample sizes than early surveys, and here we aimed to explore the benefits of using shorter time periods than 5 y for estimation. We also explore the benefit of changing the estimation procedure from being based on years before the survey, i.e., measured with reference to the date of the interview for each woman, to being based on calendar years.
Methods and Findings
Jackknife variance estimation was used to calculate standard errors for 207 DHS surveys in order to explore to what extent the large samples in recent surveys can be used to produce estimates based on 1-, 2-, 3-, 4-, and 5-y periods. We also recalculated the estimates for the surveys into calendar-year-based estimates. We demonstrate that estimation for 1-y periods is indeed possible for many recent surveys.
The reduction in bias achieved using 1-y periods and calendar-year-based estimation is worthwhile in some cases. In particular, it allows tracking of the effects of particular events such as droughts, epidemics, or conflict on child mortality in a way not possible with previous estimation procedures. Recommendations to use estimation for short time periods when possible and to use calendar-year-based estimation were adopted in the United Nations 2011 estimates of child mortality.
Editors' Summary
In 2000, world leaders set, as Millennium Development Goal 4 (MDG 4), a target of reducing global under-five mortality (the number of children who die before their fifth birthday) to a third of its 1990 level (12 million deaths per year) by 2015. (The MDGs are designed to alleviate extreme poverty by 2015.) To track progress towards MDG 4, the under-five mortality rate (also shown as 5q0) needs to be estimated both “precisely” and “accurately.” A “precise” estimate has a small random error (a quality indicated by a statistical measurement called the coefficient of variation), and an “accurate” estimate is one that is close to the true value because it lacks bias (systematic errors). In an ideal world, under-five mortality estimates would be based on official records of births and deaths. However, developing countries, which are where most under-five deaths occur, rarely have such records, and under-five mortality estimation relies on “complete birth histories” provided by women via surveys. These are collected by Demographic and Health Surveys (DHS, a project that helps developing countries collect data on health and population trends) and record all the births that a surveyed woman has had and the age at death of any of her children who have died.
Why Was This Study Done?
Because the DHS originally surveyed samples of 5,000–6,000 women, estimates of under-five mortality are traditionally calculated using data from five-year time periods. Over shorter periods with this sample size, the random errors in under-five mortality estimates become unacceptably large. Nowadays, the average DHS survey sample size is more than 10,000 women, so it should be possible to estimate under-five mortality over shorter time periods. Such estimates should be able to track the effects on under-five mortality of events such as droughts and conflicts better than estimates made over five years. In this study, the researchers determine appropriate time periods for child mortality estimates based on full birth histories, given different sample sizes. Specifically, they ask whether, with the bigger sample sizes that are now available, details about trends in under-five mortality rates are being missed by using the estimation procedures that were developed for smaller samples. They also ask whether calendar-year-based estimates can be calculated; mortality is usually estimated in “years before the survey,” a process that blurs the reference period for the estimate.
What Did the Researchers Do and Find?
The researchers used a statistical method called “jackknife variance estimation” to determine coefficients of variation for child mortality estimates calculated over different time periods using complete birth histories from 207 DHS surveys. Regardless of the estimation period, half of the estimates had a coefficient of variation of less than 10%, a level of random variation that is generally considered acceptable. However, within each time period, some estimates had very high coefficients of variation. These estimates were derived from surveys where there was a small sample size, low fertility (the women surveyed had relatively few babies), or low child mortality. Other analyses show that although the five-year period estimates had lower standard errors than the one-year period estimates, the latter were affected less by bias than the five-year period estimates. Finally, estimates fixed to calendar years rather than to years before the survey were more directly comparable across surveys and brought out variations in child mortality caused by specific events such as conflicts more clearly.
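The jackknife calculation mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions: the mortality measure is a plain ratio of deaths to births pooled over survey clusters (not the life-table procedure used for DHS estimates), the jackknife deletes one cluster at a time, and the function names and data layout are hypothetical. The last returned value is the coefficient of variation, i.e. the standard error divided by the estimate.

```python
import math


def mortality_ratio(clusters):
    """Pooled deaths / births over a list of (deaths, births) pairs."""
    deaths = sum(d for d, _ in clusters)
    births = sum(b for _, b in clusters)
    return deaths / births


def jackknife_cv(clusters):
    """Delete-one-cluster jackknife.

    Returns (estimate, standard error, coefficient of variation).
    """
    k = len(clusters)
    full = mortality_ratio(clusters)
    # Re-estimate with each cluster left out in turn.
    leave_one_out = [
        mortality_ratio(clusters[:i] + clusters[i + 1:]) for i in range(k)
    ]
    mean_loo = sum(leave_one_out) / k
    variance = (k - 1) / k * sum((r - mean_loo) ** 2 for r in leave_one_out)
    se = math.sqrt(variance)
    return full, se, se / full
```

Under this sketch, shortening the estimation period corresponds to passing fewer births per cluster, which tends to raise the coefficient of variation; the 10% threshold quoted above would then be checked against the third returned value.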
What Do These Findings Mean?
These findings show that although under-five mortality rate estimates based on five-year periods of data have been the norm, the sample sizes currently employed in DHS surveys make it feasible to estimate mortality for shorter periods. The findings also show that using shorter periods of data in estimations of the under-five mortality rate, and using calendar-year-based estimation, reduces bias (makes the estimations more accurate) and allows the effects of events such as droughts, epidemics, or conflict on under-five mortality rates to be tracked in a way that is impossible when using five-year periods of data. Given these findings, the researchers recommend that time periods shorter than five years should be adopted for the estimation of under-five mortality and that estimations should be pegged to calendar years rather than to years before the survey. Both recommendations have already been adopted by the United Nations Inter-agency Group for Child Mortality Estimation (IGME) and were used in their 2011 analysis of under-five mortality.
Additional Information
Please access these websites via the online version of this summary at
This paper is part of a collection of papers on Child Mortality Estimation Methods published in PLOS Medicine
The United Nations Children's Fund (UNICEF) works for children's rights, survival, development, and protection around the world; it provides information on Millennium Development Goal 4, and its Childinfo website provides detailed statistics about child survival and health, including a description of the United Nations Inter-agency Group for Child Mortality Estimation; the 2011 IGME report on Levels and Trends in Child Mortality is available
The World Health Organization also has information about Millennium Development Goal 4 and provides estimates of child mortality rates (some information in several languages)
Further information about the Millennium Development Goals is available
Information is also available about Demographic and Health Surveys of infant and child mortality
PMCID: PMC3429388  PMID: 22952435
22.  Hierarchical Modeling for Estimating Relative Risks of Rare Genetic Variants: Properties of the Pseudo-Likelihood Method 
Biometrics  2010;67(2):371-380.
Many major genes have been identified that strongly influence the risk of cancer. However, there are typically many different mutations that can occur in the gene, each of which may or may not confer increased risk. It is critical to identify which specific mutations are harmful, and which ones are harmless, so that individuals who learn from genetic testing that they have a mutation can be appropriately counseled. This is a challenging task, since new mutations are continually being identified, and there is typically relatively little evidence available about each individual mutation. In an earlier article we employed hierarchical modeling (Capanu et al. 2008) using the pseudo-likelihood and Gibbs sampling methods to estimate the relative risks of individual rare variants using data from a case-control study and showed that one can draw strength from the aggregating power of hierarchical models to distinguish the variants that contribute to cancer risk. However, further research is needed to validate the application of asymptotic methods to such sparse data. In this article we use simulations to study in detail the properties of the pseudo-likelihood method for this purpose. We also explore two alternative approaches: pseudo-likelihood with correction for the variance component estimate as proposed by Lin and Breslow (1996) and a hybrid pseudo-likelihood approach with Bayesian estimation of the variance component. We investigate the validity of these hierarchical modeling techniques by looking at the bias and coverage properties of the estimators as well as at the efficiency of the hierarchical modeling estimates relative to that of the maximum likelihood estimates. The results indicate that the estimates of the relative risks of very sparse variants have small bias, and that the estimated 95% confidence intervals are typically anti-conservative, though the actual coverage rates are generally above 90 per cent. 
The widths of the confidence intervals narrow as the residual variance in the second-stage model is reduced. The results also show that the hierarchical modeling estimates have shorter confidence intervals relative to estimates obtained from conventional logistic regression, and that these relative improvements increase as the variants become more rare.
PMCID: PMC3015025  PMID: 20707869
hierarchical models; pseudo-likelihood; Bayesian; genetic risk; rare variants
23.  Targeted Maximum Likelihood Based Causal Inference: Part I 
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so-called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention had been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so-called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor until convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so-called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
PMCID: PMC3126670  PMID: 20737021
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
24.  Shrinkage Estimators for Covariance Matrices 
Biometrics  2001;57(4):1173-1184.
Estimation of covariance matrices in small samples has been studied by many authors. Standard estimators, like the unstructured maximum likelihood estimator (ML) or restricted maximum likelihood (REML) estimator, can be very unstable with the smallest estimated eigenvalues being too small and the largest too big. A standard approach to more stably estimating the matrix in small samples is to compute the ML or REML estimator under some simple structure that involves estimation of fewer parameters, such as compound symmetry or independence. However, these estimators will not be consistent unless the hypothesized structure is correct. If interest focuses on estimation of regression coefficients with correlated (or longitudinal) data, a sandwich estimator of the covariance matrix may be used to provide standard errors for the estimated coefficients that are robust in the sense that they remain consistent under misspecification of the covariance structure. With large matrices, however, the inefficiency of the sandwich estimator becomes worrisome. We consider here two general shrinkage approaches to estimating the covariance matrix and regression coefficients. The first involves shrinking the eigenvalues of the unstructured ML or REML estimator. The second involves shrinking an unstructured estimator toward a structured estimator. For both cases, the data determine the amount of shrinkage. These estimators are consistent and give consistent and asymptotically efficient estimates for regression coefficients. Simulations show the improved operating characteristics of the shrinkage estimators of the covariance matrix and the regression coefficients in finite samples. The final estimator chosen includes a combination of both shrinkage approaches, i.e., shrinking the eigenvalues and then shrinking toward structure.
We illustrate our approach on a sleep EEG study that requires estimation of a 24 × 24 covariance matrix and for which inferences on mean parameters critically depend on the covariance estimator chosen. We recommend making inference using a particular shrinkage estimator that provides a reasonable compromise between structured and unstructured estimators.
PMCID: PMC2748251  PMID: 11764258
Empirical Bayes; General linear model; Givens angles; Hierarchical prior; Longitudinal data
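The second approach in the abstract above, shrinking an unstructured estimator toward a structured one, can be illustrated with a convex combination of two matrices. This is a sketch under stated assumptions, not the authors' estimator: the structured target is taken to be the independence (diagonal) structure, and the shrinkage weight lam is supplied by hand, whereas in the article the data determine the amount of shrinkage. The function name is hypothetical.

```python
def shrink_toward_diagonal(cov, lam):
    """Return lam * diag(cov) + (1 - lam) * cov.

    cov: square covariance matrix as a list of lists
    lam: shrinkage weight in [0, 1]; 0 keeps the unstructured estimate,
         1 gives the fully structured (diagonal) estimate
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    n = len(cov)
    return [
        [
            # Variances are unchanged; covariances are pulled toward zero.
            cov[i][j] if i == j else (1.0 - lam) * cov[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Intermediate values of lam trade the instability of the unstructured estimate against the bias of an incorrect structure, which is the compromise the abstract recommends.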
25.  Non-parametric Estimation of a Survival Function with Two-stage Design Studies 
The two-stage design is popular in epidemiology studies and clinical trials due to its cost effectiveness. Typically, the first stage sample contains cheaper and possibly biased information, while the second stage validation sample consists of a subset of subjects with accurate and complete information. In this paper, we study estimation of a survival function with right-censored survival data from a two-stage design. A non-parametric estimator is derived by combining data from both stages. We also study its large sample properties and derive pointwise and simultaneous confidence intervals for the survival function. The proposed estimator effectively reduces the variance and finite-sample bias of the Kaplan–Meier estimator solely based on the second stage validation sample. Finally, we apply our method to a real data set from a medical device post-marketing surveillance study.
PMCID: PMC2729091  PMID: 19696901
censoring; Kaplan–Meier estimator; martingale; Nelson–Aalen estimator; truncation
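For reference, the Kaplan–Meier estimator that the abstract above uses as its comparison can be sketched as follows. This shows only the standard single-sample estimator from right-censored data; it is not the proposed two-stage estimator, and the function name and input layout are assumptions.

```python
def kaplan_meier(times, events):
    """Standard Kaplan-Meier survival curve.

    times:  observed follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    Returns a list of (event time, estimated survival probability).
    """
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, d in zip(times, events) if d == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, d in zip(times, events) if ti == t and d == 1)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Applied to the second-stage validation sample alone, this estimator ignores the first-stage data, which is the inefficiency the proposed estimator is described as reducing.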
