1.  Collaborative Double Robust Targeted Maximum Likelihood Estimation* 
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over the standard targeted maximum likelihood estimators, introduced in van der Laan and Rubin (2006), of a pathwise differentiable parameter of a data generating distribution in a semiparametric model. The targeted maximum likelihood estimation (TMLE) approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
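The following is a minimal illustrative sketch (not the authors' code; the simulated data, variable names, and initial fits for Q and g are all placeholders) of the basic fluctuation step described above, for the familiar special case of the average treatment effect with a binary outcome:

```python
# Minimal TMLE sketch (illustrative only): one fluctuation of an initial
# estimate of Q(A, W) = E[Y | A, W] along the "clever covariate" H(A, W),
# targeting the average treatment effect E[Y(1)] - E[Y(0)].
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
W = rng.normal(size=n)                                        # baseline covariate
A = rng.binomial(1, 1 / (1 + np.exp(-0.5 * W)))               # binary treatment
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * A + 0.5 * W))))   # binary outcome

# Initial (possibly misspecified) estimates of g = P(A=1 | W) and Q = E[Y | A, W].
gfit = sm.GLM(A, sm.add_constant(W), family=sm.families.Binomial()).fit()
gW = gfit.predict(sm.add_constant(W))
design = lambda a: np.column_stack([np.ones(n), a, W])
Qfit = sm.GLM(Y, design(A), family=sm.families.Binomial()).fit()
QAW, Q1W, Q0W = (Qfit.predict(design(a)) for a in (A, np.ones(n), np.zeros(n)))

# Clever covariate: fluctuating Q along H makes the score at epsilon = 0 equal
# the relevant component of the efficient influence curve.
H = A / gW - (1 - A) / (1 - gW)
logit = lambda p: np.log(p / (1 - p))
expit = lambda x: 1 / (1 + np.exp(-x))
eps = sm.GLM(Y, H.reshape(-1, 1), family=sm.families.Binomial(),
             offset=logit(QAW)).fit().params[0]

# Updated (targeted) estimates and the plug-in estimate of the ATE.
Q1_star = expit(logit(Q1W) + eps / gW)
Q0_star = expit(logit(Q0W) - eps / (1 - gW))
print("TMLE estimate of the ATE:", np.mean(Q1_star - Q0_star))
```

The point of the construction is that the fluctuation submodel, indexed by epsilon and the clever covariate H, has score at epsilon = 0 equal to the efficient influence curve component, which is what gives the updated plug-in estimator its double robustness.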
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
doi:10.2202/1557-4679.1181
PMCID: PMC2898626  PMID: 20628637
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
2.  Targeted Maximum Likelihood Estimation for Dynamic and Static Longitudinal Marginal Structural Working Models 
Journal of causal inference  2014;2(2):147-185.
This paper describes a targeted maximum likelihood estimator (TMLE) for the parameters of longitudinal static and dynamic marginal structural models. We consider a longitudinal data structure consisting of baseline covariates, time-dependent intervention nodes, intermediate time-dependent covariates, and a possibly time-dependent outcome. The intervention nodes at each time point can include a binary treatment as well as a right-censoring indicator. Given a class of dynamic or static interventions, a marginal structural model is used to model the mean of the intervention-specific counterfactual outcome as a function of the intervention, time point, and possibly a subset of baseline covariates. Because the true shape of this function is rarely known, the marginal structural model is used as a working model. The causal quantity of interest is defined as the projection of the true function onto this working model. Iterated conditional expectation double robust estimators for marginal structural model parameters were previously proposed by Robins (2000, 2002) and Bang and Robins (2005). Here we build on this work and present a pooled TMLE for the parameters of marginal structural working models. We compare this pooled estimator to a stratified TMLE (Schnitzer et al. 2014) that is based on estimating the intervention-specific mean separately for each intervention of interest. The performance of the pooled TMLE is compared to the performance of the stratified TMLE and the performance of inverse probability weighted (IPW) estimators using simulations. Concepts are illustrated using an example in which the aim is to estimate the causal effect of delayed switch following immunological failure of first line antiretroviral therapy among HIV-infected patients. Data from the International Epidemiological Databases to Evaluate AIDS, Southern Africa are analyzed to investigate this question using both TML and IPW estimators. Our results demonstrate practical advantages of the pooled TMLE over an IPW estimator for working marginal structural models for survival, as well as cases in which the pooled TMLE is superior to its stratified counterpart.
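To make the notion of projecting onto a working model concrete, one common formalization (notation illustrative; the paper's exact weight function and summation set may differ) defines the target parameter as

\[
\beta_0 = \arg\min_{\beta} \; E\Big[\sum_{d \in \mathcal{D}} \sum_{t} h(d, t, V)\,\big(E[Y_d(t) \mid V] - m_{\beta}(d, t, V)\big)^2\Big],
\]

where Y_d(t) is the counterfactual outcome at time t under intervention d, V is the chosen subset of baseline covariates, m_beta is the working marginal structural model, and h is a user-supplied weight function. Because the target is defined as a projection, the working model never needs to be correctly specified for the parameter to be well defined.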
doi:10.1515/jci-2013-0007
PMCID: PMC4405134  PMID: 25909047
dynamic regime; semiparametric statistical model; targeted minimum loss based estimation; confounding; right censoring
3.  Population Intervention Causal Effects Based on Stochastic Interventions 
Biometrics  2011;68(2):541-549.
SUMMARY
Estimating the causal effect of an intervention on a population typically involves defining parameters in a nonparametric structural equation model (Pearl, 2000, Causality: Models, Reasoning, and Inference) in which the treatment or exposure is deterministically assigned in a static or dynamic way. We define a new causal parameter that takes into account the fact that intervention policies can result in stochastically assigned exposures. The statistical parameter that identifies the causal parameter of interest is established. Inverse probability of treatment weighting (IPTW), augmented IPTW (A-IPTW), and targeted maximum likelihood estimators (TMLE) are developed. A simulation study is performed to demonstrate the properties of these estimators, which include the double robustness of the A-IPTW and the TMLE. An application example using physical activity data is presented.
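For reference, when the observed treatment mechanism g(a | W) is replaced by a known stochastic intervention g*(a | W), the statistical parameter identifying the counterfactual mean typically takes the form (illustrative notation, not copied from the paper)

\[
\Psi(P) = E_W\Big[\sum_{a} E(Y \mid A = a, W)\, g^{*}(a \mid W)\Big],
\]

with the IPTW estimator re-weighting observed outcomes by g*(A | W)/g(A | W), and the A-IPTW and TMLE estimators combining such weights with an initial fit of E(Y | A, W) to obtain the double robustness demonstrated in the simulations.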
doi:10.1111/j.1541-0420.2011.01685.x
PMCID: PMC4117410  PMID: 21977966
Causal effect; Counterfactual outcome; Double robustness; Stochastic intervention; Targeted maximum likelihood estimation
4.  A Targeted Maximum Likelihood Estimator of a Causal Effect on a Bounded Continuous Outcome 
Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter ɛ representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at ɛ = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.
When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.
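A compact way to see the construction (a sketch consistent with the description above, with the known bounds written as a and b): the outcome is first rescaled to the unit interval and the fluctuation is carried out on the logit scale, so every update stays within the bounds,

\[
Y^{*} = \frac{Y - a}{b - a} \in [0, 1], \qquad
\operatorname{logit}\bar{Q}^{*}_{\epsilon}(A, W) = \operatorname{logit}\bar{Q}^{*}(A, W) + \epsilon\, H(A, W),
\]

after which an effect estimated on the Y* scale is mapped back to the original scale by multiplying by (b - a).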
doi:10.2202/1557-4679.1260
PMCID: PMC3126669  PMID: 21731529
targeted maximum likelihood estimation; TMLE; causal effect
5.  Causal Inference for a Population of Causally Connected Units 
Journal of causal inference  2014;2(1):13-74.
Suppose that we observe a population of causally connected units. On each unit at each time-point on a grid we observe a set of other units the unit is potentially connected with, and a unit-specific longitudinal data structure consisting of baseline and time-dependent covariates, a time-dependent treatment, and a final outcome of interest. The target quantity of interest is defined as the mean outcome for this group of units if the exposures of the units were probabilistically assigned according to a known specified mechanism, where the latter is called a stochastic intervention. Causal effects of interest are defined as contrasts of the mean of the unit-specific outcomes under different stochastic interventions one wishes to evaluate. This covers a large range of estimation problems, from independent units and independent clusters of units to a single cluster of units in which each unit has a limited number of connections to other units. The allowed dependence includes treatment allocation in response to data on multiple units and so-called causal interference as special cases. We present a few motivating classes of examples, propose a structural causal model, define the desired causal quantities, address the identification of these quantities from the observed data, and define maximum likelihood based estimators based on cross-validation. In particular, we present maximum likelihood based super-learning for this network data. Nonetheless, such smoothed/regularized maximum likelihood estimators are not targeted and will thereby be overly biased with respect to the target parameter, and, as a consequence, will generally not result in asymptotically normally distributed estimators of the statistical target parameter.
To formally develop estimation theory, we focus on the simpler case in which the longitudinal data structure is a point-treatment data structure. We formulate a novel targeted maximum likelihood estimator of this estimand and show that the double robustness of the efficient influence curve implies that the bias of the targeted minimum loss-based estimation (TMLE) will be a second-order term involving squared differences of two nuisance parameters. In particular, the TMLE will be consistent if either one of these nuisance parameters is consistently estimated. Due to the causal dependencies between units, the data set may correspond with the realization of a single experiment, so that establishing a (e.g. normal) limit distribution for the targeted maximum likelihood estimators, and corresponding statistical inference, is a challenging topic. We prove two formal theorems establishing the asymptotic normality using advances in weak-convergence theory. We conclude with a discussion and refer to an accompanying technical report for extensions to general longitudinal data structures.
doi:10.1515/jci-2013-0002
PMCID: PMC4500386  PMID: 26180755
networks; causal inference; targeted maximum likelihood estimation; stochastic intervention; efficient influence curve
6.  Targeted maximum likelihood estimation in safety analysis 
Journal of clinical epidemiology  2013;66(8 0):10.1016/j.jclinepi.2013.02.017.
Objectives
To compare the performance of a targeted maximum likelihood estimator (TMLE) and a collaborative TMLE (CTMLE) to other estimators in a drug safety analysis, including a regression-based estimator, propensity score (PS)–based estimators, and an alternate doubly robust (DR) estimator in a real example and simulations.
Study Design and Setting
The real data set is a subset of observational data from Kaiser Permanente Northern California formatted for use in active drug safety surveillance. Both the real and simulated data sets include potential confounders, a treatment variable indicating use of one of two antidiabetic treatments and an outcome variable indicating occurrence of an acute myocardial infarction (AMI).
Results
In the real data example, there is no difference in AMI rates between treatments. In simulations, the double robustness property is demonstrated: DR estimators are consistent if either the initial outcome regression or PS estimator is consistent, whereas other estimators are inconsistent if the initial estimator is not consistent. In simulations with near-positivity violations, CTMLE performs well relative to other estimators by adaptively estimating the PS.
Conclusion
Each of the DR estimators was consistent, and TMLE and CTMLE had the smallest mean squared error in simulations.
doi:10.1016/j.jclinepi.2013.02.017
PMCID: PMC3818128  PMID: 23849159
Safety analysis; Targeted maximum likelihood estimation; Doubly robust; Causal inference; Collaborative targeted maximum likelihood estimation; Super learning
7.  Identification and efficient estimation of the natural direct effect among the untreated 
Biometrics  2013;69(2):310-317.
Summary
The natural direct effect (NDE), or the effect of an exposure on an outcome if an intermediate variable were set to the level it would have been in the absence of the exposure, is often of interest to investigators. In general, the statistical parameter associated with the NDE is difficult to estimate in the non-parametric model, particularly when the intermediate variable is continuous or high dimensional. In this paper we introduce a new causal parameter called the natural direct effect among the untreated, discuss identifiability assumptions, propose a sensitivity analysis for some of the assumptions, and show that this new parameter is equivalent to the NDE in a randomized controlled trial. We also present a targeted minimum loss estimator (TMLE), a locally efficient, double robust substitution estimator for the statistical parameter associated with this causal parameter. The TMLE can be applied to problems with continuous and high dimensional intermediate variables, and can be used to estimate the NDE in a randomized controlled trial with such data. Additionally, we define and discuss the estimation of three related causal parameters: the natural direct effect among the treated, the indirect effect among the untreated and the indirect effect among the treated.
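In the usual counterfactual notation (treatment A, mediator M, outcome Y), the natural direct effect and its restriction to the untreated can be written, illustratively, as

\[
\mathrm{NDE} = E\big[Y(1, M(0)) - Y(0, M(0))\big], \qquad
\mathrm{NDE}_{A=0} = E\big[Y(1, M(0)) - Y(0, M(0)) \mid A = 0\big],
\]

where Y(a, m) denotes the outcome under treatment level a with the mediator set to m, and M(0) is the mediator value that would be observed in the absence of treatment.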
doi:10.1111/biom.12022
PMCID: PMC3692606  PMID: 23607645
Causal inference; direct effect; indirect effect; mediation analysis; semiparametric models; targeted minimum loss estimation
8.  EFFECT OF BREASTFEEDING ON GASTROINTESTINAL INFECTION IN INFANTS: A TARGETED MAXIMUM LIKELIHOOD APPROACH FOR CLUSTERED LONGITUDINAL DATA 
The annals of applied statistics  2014;8(2):703-725.
The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized “causal” estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application and review some related methods and the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
PMCID: PMC4259272  PMID: 25505499
Causal inference; G-computation; inverse probability weighting; marginal effects; missing data; pediatrics
9.  Modeling the impact of hepatitis C viral clearance on end-stage liver disease in an HIV co-infected cohort with Targeted Maximum Likelihood Estimation 
Biometrics  2013;70(1):144-152.
Summary
Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD) which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on risk for ESLD. In this case study, we examine whether HCV clearance affects risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included non-censoring variable missingness and event sparsity.
In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly-robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputations to account for the non-monotone missing data.
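For orientation, an illustrative form (not necessarily the exact specification fit in the paper) of a logistic MSM for the discrete-time hazard of ESLD, together with the influence-curve-based variance approximation mentioned above, is

\[
\operatorname{logit}\lambda(t \mid \bar{R}(t)) = \beta_0 + \beta_1 t + \beta_2 R(t), \qquad
\widehat{\operatorname{Var}}(\hat{\beta}) \approx \frac{1}{n^{2}} \sum_{i=1}^{n} \widehat{\mathrm{IC}}(O_i)\,\widehat{\mathrm{IC}}(O_i)^{\top},
\]

where R(t) indicates HCV clearance by time t and IC-hat is the estimated efficient influence curve of the MSM parameter; the diagonal of this matrix supplies the standard errors used for confidence intervals.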
doi:10.1111/biom.12105
PMCID: PMC3954273  PMID: 24571372
Double-robust; Inverse probability of treatment weighting; Kaplan-Meier; Longitudinal data; Marginal structural model; Survival analysis; Targeted maximum likelihood estimation
10.  Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation 
Statistics in medicine  2009;28(1):39-64.
SUMMARY
Covariate adjustment using linear models for continuous outcomes in randomized trials has been shown to increase efficiency and power over the unadjusted method in estimating the marginal effect of treatment. However, for binary outcomes, investigators generally rely on the unadjusted estimate as the literature indicates that covariate-adjusted estimates based on the logistic regression models are less efficient. The crucial step that has been missing when adjusting for covariates is that one must integrate/average the adjusted estimate over those covariates in order to obtain the marginal effect. We apply the method of targeted maximum likelihood estimation (tMLE) to obtain estimators for the marginal effect using covariate adjustment for binary outcomes. We show that the covariate adjustment in randomized trials using the logistic regression models can be mapped, by averaging over the covariate(s), to obtain a fully robust and efficient estimator of the marginal effect, which equals a targeted maximum likelihood estimator. This tMLE is obtained by simply adding a clever covariate to a fixed initial regression. We present simulation studies that demonstrate that this tMLE increases efficiency and power over the unadjusted method, particularly for smaller sample sizes, even when the regression model is mis-specified.
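As a brief sketch of the mapping described above (notation illustrative): in a two-arm trial with known randomization probability delta = P(A = 1), the clever covariate added to the fixed initial logistic regression is

\[
H(A, W) = \frac{A}{\delta} - \frac{1 - A}{1 - \delta},
\]

and after the fluctuation the marginal effect is obtained by averaging the updated predictions over the empirical covariate distribution, for example the marginal risk difference

\[
\hat{\psi} = \frac{1}{n}\sum_{i=1}^{n}\big[\bar{Q}^{*}(1, W_i) - \bar{Q}^{*}(0, W_i)\big],
\]

which is exactly the integration/averaging step the abstract identifies as the crucial missing piece.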
doi:10.1002/sim.3445
PMCID: PMC2857590  PMID: 18985634
clinical trials; efficiency; covariate adjustment; variable selection
11.  Effect Modification by Sex and Baseline CD4+ Cell Count Among Adults Receiving Combination Antiretroviral Therapy in Botswana: Results from a Clinical Trial 
Abstract
The Tshepo study was the first clinical trial to evaluate outcomes of adults receiving nevirapine (NVP)-based versus efavirenz (EFV)-based combination antiretroviral therapy (cART) in Botswana. This was a 3 year study (n=650) comparing the efficacy and tolerability of various first-line cART regimens, stratified by baseline CD4+: <200 (low) vs. 201-350 (high). Using targeted maximum likelihood estimation (TMLE), we retrospectively evaluated the causal effect of assigned NNRTI on time to virologic failure or death [intent-to-treat (ITT)] and time to minimum of virologic failure, death, or treatment modifying toxicity [time to loss of virological response (TLOVR)] by sex and baseline CD4+. Sex did significantly modify the effect of EFV versus NVP for both the ITT and TLOVR outcomes with risk differences in the probability of survival of males versus the females of approximately 6% (p=0.015) and 12% (p=0.001), respectively. Baseline CD4+ also modified the effect of EFV versus NVP for the TLOVR outcome, with a mean difference in survival probability of approximately 12% (p=0.023) in the high versus low CD4+ cell count group. TMLE appears to be an efficient technique that allows for the clinically meaningful delineation and interpretation of the causal effect of NNRTI treatment and effect modification by sex and baseline CD4+ cell count strata in this study. EFV-treated women and NVP-treated men had more favorable cART outcomes. In addition, adults initiating EFV-based cART at higher baseline CD4+ cell count values had more favorable outcomes compared to those initiating NVP-based cART.
doi:10.1089/aid.2011.0349
PMCID: PMC3423643  PMID: 22309114
12.  A Targeted Maximum Likelihood Estimator for Two-Stage Designs 
We consider two-stage sampling designs, including so-called nested case control studies, where one takes a random sample from a target population and completes measurements on each subject in the first stage. The second stage involves drawing a subsample from the original sample, collecting additional data on the subsample. This data structure can be viewed as a missing data structure on the full-data structure collected in the second-stage of the study. Methods for analyzing two-stage designs include parametric maximum likelihood estimation and estimating equation methodology. We propose an inverse probability of censoring weighted targeted maximum likelihood estimator (IPCW-TMLE) in two-stage sampling designs and present simulation studies featuring this estimator.
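The weighting idea can be sketched as follows (illustrative notation): with Delta indicating selection into the second-stage subsample and V the variables measured on everyone in the first stage, each observation in the full-data loss function is weighted by the inverse probability of inclusion,

\[
w_i = \frac{\Delta_i}{\hat{\Pi}(\Delta = 1 \mid V_i)},
\]

so the TMLE developed for the complete-data structure is applied to the second-stage observations with weights w_i, which preserves consistency when the inclusion mechanism is known or correctly estimated.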
doi:10.2202/1557-4679.1217
PMCID: PMC3083136  PMID: 21556285
two-stage designs; targeted maximum likelihood estimators; nested case control studies; double robust estimation
13.  The Relative Performance of Targeted Maximum Likelihood Estimators 
There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins, et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes are especially suitable for dealing with positivity violations because, in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
doi:10.2202/1557-4679.1308
PMCID: PMC3173607  PMID: 21931570
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
14.  Estimation of a non-parametric variable importance measure of a continuous exposure 
We define a new measure of variable importance of an exposure on a continuous outcome, accounting for potential confounders. The exposure features a reference level x0 with positive mass and a continuum of other levels. For the purpose of estimating it, we fully develop the semi-parametric methodology of targeted minimum loss estimation (TMLE) [23, 22]. We cover the whole spectrum of its theoretical study (convergence of the iterative procedure which is at the core of the TMLE methodology; consistency and asymptotic normality of the estimator), practical implementation, simulation study and application to a genomic example that originally motivated this article. In the latter, the exposure X and response Y are, respectively, the DNA copy number and expression level of a given gene in a cancer cell. Here, the reference level is x0 = 2, that is, the expected DNA copy number in a normal cell. The confounder is a measure of the methylation of the gene. The fact that there is no clear biological indication that X and Y can be interpreted as an exposure and a response, respectively, is not problematic.
doi:10.1214/12-EJS703
PMCID: PMC3546832  PMID: 23336014
Variable importance measure; non-parametric estimation; targeted minimum loss estimation; robustness; asymptotics
15.  Targeted Maximum Likelihood Estimation of Effect Modification Parameters in Survival Analysis 
The Cox proportional hazards model or its discrete time analogue, the logistic failure time model, posit highly restrictive parametric models and attempt to estimate parameters which are specific to the model proposed. These methods are typically implemented when assessing effect modification in survival analyses despite their flaws. The targeted maximum likelihood estimation (TMLE) methodology is more robust than the methods typically implemented and allows practitioners to estimate parameters that directly answer the question of interest. TMLE will be used in this paper to estimate two newly proposed parameters of interest that quantify effect modification in the time to event setting. These methods are then applied to the Tshepo study to assess whether either gender or baseline CD4 level modifies the effect of two cART therapies of interest, efavirenz (EFV) and nevirapine (NVP), on the progression of HIV. The results show that women tend to have more favorable outcomes using EFV while males tend to have more favorable outcomes with NVP. Furthermore, EFV tends to be favorable compared to NVP for individuals at high CD4 levels.
doi:10.2202/1557-4679.1307
PMCID: PMC3083138  PMID: 21556287
causal effect; semi-parametric; censored longitudinal data; double robust; efficient influence curve; influence curve; G-computation; Targeted Maximum Likelihood Estimation; Cox-proportional hazards; survival analysis
16.  Dimension reduction with gene expression data using targeted variable importance measurement 
BMC Bioinformatics  2011;12:312.
Background
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of a more operable length, ideally including all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary predecessor of the analysis because it not only reduces the cost of handling numerous variables but also has the potential to improve the performance of the downstream analysis algorithms.
Results
We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimate with respect to the parameter of interest.
Conclusions
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
doi:10.1186/1471-2105-12-312
PMCID: PMC3166941  PMID: 21849016
17.  Targeted Estimation of Binary Variable Importance Measures with Interval-Censored Outcomes 
In most experimental and observational studies, participants are not followed in continuous time. Instead, data are collected about participants only at certain monitoring times. These monitoring times are random and often participant specific. As a result, outcomes are only known up to random time intervals, resulting in interval-censored data. In practice, when estimating variable importance measures on interval-censored outcomes, practitioners often ignore the presence of interval censoring and instead treat the data as continuous or right-censored, applying ad hoc approaches to mask the true interval censoring. In this article, we describe targeted minimum loss–based estimation (TMLE) methods tailored for estimation of binary variable importance measures with interval-censored outcomes. We demonstrate the performance of the interval-censored TMLE procedure through simulation studies and apply the method to analyze the effects of a variety of variables on spontaneous hepatitis C virus clearance among injection drug users, using data from the “International Collaboration of Incident HIV and HCV in Injecting Cohorts” project.
doi:10.1515/ijb-2013-0009
PMCID: PMC4491438  PMID: 24637001
interval censoring; missing data; observational data; targeted learning; variable importance
18.  Targeted Maximum Likelihood Based Causal Inference: Part I 
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so-called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The resulting G-computation formula represents the counterfactual distribution the data would have had if this intervention had been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so-called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
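For concreteness, for a longitudinal data structure with covariate nodes L_0, ..., L_{K+1} and intervention nodes A_0, ..., A_K, the G-computation formula referred to above takes the familiar form

\[
P^{\bar{a}}(\bar{L} = \bar{l}) = \prod_{t=0}^{K+1} P\big(L_t = l_t \mid \bar{L}_{t-1} = \bar{l}_{t-1},\, \bar{A}_{t-1} = \bar{a}_{t-1}\big),
\]

obtained from the factorized likelihood by removing the treatment and censoring factors and fixing the intervention nodes at the intervened values; the counterfactual mean of an outcome contained in L_{K+1} is then its expectation under this post-intervention distribution.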
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
doi:10.2202/1557-4679.1211
PMCID: PMC3126670  PMID: 20737021
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
19.  Usual Physical Activity and Hip Fracture in Older Men: An Application of Semiparametric Methods to Observational Data 
American Journal of Epidemiology  2011;173(5):578-586.
Few studies have examined the relation between usual physical activity level and rate of hip fracture in older men or applied semiparametric methods from the causal inference literature that estimate associations without assuming a particular parametric model. Using the Physical Activity Scale for the Elderly, the authors measured usual physical activity level at baseline (2000–2002) in 5,682 US men ≥65 years of age who were enrolled in the Osteoporotic Fractures in Men Study. Physical activity levels were classified as low (bottom quartile of Physical Activity Scale for the Elderly score), moderate (middle quartiles), or high (top quartile). Hip fractures were confirmed by central review. Marginal associations between physical activity and hip fracture were estimated with 3 estimation methods: inverse probability-of-treatment weighting, G-computation, and doubly robust targeted maximum likelihood estimation. During 6.5 years of follow-up, 95 men (1.7%) experienced a hip fracture. The unadjusted risk of hip fracture was lower in men with a high physical activity level versus those with a low physical activity level (relative risk = 0.51, 95% confidence interval: 0.28, 0.92). In semiparametric analyses that controlled confounding, hip fracture risk was not lower with moderate (e.g., targeted maximum likelihood estimation relative risk = 0.92, 95% confidence interval: 0.62, 1.44) or high (e.g., targeted maximum likelihood estimation relative risk = 0.88, 95% confidence interval: 0.53, 2.03) physical activity relative to low. This study does not support a protective effect of usual physical activity on hip fracture in older men.
doi:10.1093/aje/kwq405
PMCID: PMC3105440  PMID: 21303805
aged; confounding factors (epidemiology); exercise; hip fractures; men; motor activity; prospective studies
20.  Associations between Potentially Modifiable Risk Factors and Alzheimer Disease: A Mendelian Randomization Study 
PLoS Medicine  2015;12(6):e1001841.
Background
Potentially modifiable risk factors including obesity, diabetes, hypertension, and smoking are associated with Alzheimer disease (AD) and represent promising targets for intervention. However, the causality of these associations is unclear. We sought to assess the causal nature of these associations using Mendelian randomization (MR).
Methods and Findings
We used SNPs associated with each risk factor as instrumental variables in MR analyses. We considered type 2 diabetes (T2D, N_SNPs = 49), fasting glucose (N_SNPs = 36), insulin resistance (N_SNPs = 10), body mass index (BMI, N_SNPs = 32), total cholesterol (N_SNPs = 73), HDL-cholesterol (N_SNPs = 71), LDL-cholesterol (N_SNPs = 57), triglycerides (N_SNPs = 39), systolic blood pressure (SBP, N_SNPs = 24), smoking initiation (N_SNPs = 1), smoking quantity (N_SNPs = 3), university completion (N_SNPs = 2), and years of education (N_SNPs = 1). We calculated MR estimates of associations between each exposure and AD risk using an inverse-variance weighted approach, with summary statistics of SNP–AD associations from the International Genomics of Alzheimer’s Project, comprising a total of 17,008 individuals with AD and 37,154 cognitively normal elderly controls. We found that genetically predicted higher SBP was associated with lower AD risk (odds ratio [OR] per standard deviation [15.4 mm Hg] of SBP [95% CI]: 0.75 [0.62–0.91]; p = 3.4 × 10⁻³). Genetically predicted higher SBP was also associated with a higher probability of taking antihypertensive medication (p = 6.7 × 10⁻⁸). Genetically predicted smoking quantity was associated with lower AD risk (OR per ten cigarettes per day [95% CI]: 0.67 [0.51–0.89]; p = 6.5 × 10⁻³), although we were unable to stratify by smoking history; genetically predicted smoking initiation was not associated with AD risk (OR = 0.70 [0.37, 1.33]; p = 0.28). We saw no evidence of causal associations between glycemic traits, T2D, BMI, or educational attainment and risk of AD (all p > 0.1). Potential limitations of this study include the small proportion of intermediate trait variance explained by genetic variants and other implicit limitations of MR analyses.
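For reference, the inverse-variance weighted MR estimate combines per-SNP Wald ratios: writing gamma_j for the SNP–risk factor association, Gamma_j for the SNP–AD association, and sigma_j for the standard error of Gamma_j, the standard fixed-effect formula (the paper's implementation details may differ) is

\[
\hat{\beta}_{\mathrm{IVW}} = \frac{\sum_j \hat{\gamma}_j \hat{\Gamma}_j \sigma_j^{-2}}{\sum_j \hat{\gamma}_j^{2} \sigma_j^{-2}}, \qquad
\mathrm{se}\big(\hat{\beta}_{\mathrm{IVW}}\big) = \Big(\sum_j \hat{\gamma}_j^{2} \sigma_j^{-2}\Big)^{-1/2}.
\]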
Conclusions
Inherited lifetime exposure to higher SBP is associated with lower AD risk. These findings suggest that higher blood pressure—or some environmental exposure associated with higher blood pressure, such as use of antihypertensive medications—may reduce AD risk.
Robert A. Scott and colleagues use genetic instruments to identify causal associations between known risk factors and Alzheimer's disease.
Editors' Summary
Background
Worldwide, about 44 million people have dementia, a group of brain degeneration disorders characterized by an irreversible decline in memory, communication, and other “cognitive” functions. Dementia mainly affects older people, and because people are living longer, experts estimate that more than 135 million people will have dementia by 2050. The most common form of dementia, which accounts for 60%–70% of cases, is Alzheimer disease (AD). The earliest sign of AD is often increasing forgetfulness. As the disease progresses, affected individuals gradually lose the ability to look after themselves, they may become anxious or aggressive, and they may have difficulty recognizing friends and relatives. People with late stage disease may lose control of their bladder and of other physical functions. At present, there is no cure for AD, although some of its symptoms can be managed with drugs. Most people with AD are initially cared for at home by relatives and other caregivers, but many affected individuals end their days in a care home or specialist nursing home.
Why Was This Study Done?
Researchers are interested in identifying risk factors for AD, particularly modifiable risk factors, because if such risk factors exist, it might be possible to limit the predicted increase in future AD cases. Epidemiological studies (investigations that examine patterns of disease in populations) have identified several potential risk factors for AD, including hypertension (high blood pressure), obesity, smoking, and dyslipidemia (changes in how the body handles fats). However, epidemiological studies cannot prove that a specific risk factor causes AD. For example, people with hypertension might share another characteristic that causes both hypertension and AD (confounding) or AD might cause hypertension (reverse causation). Information on causality is needed to decide which risk factors to target to help prevent AD. Here, the researchers use “Mendelian randomization” to examine whether differences in several epidemiologically identified risk factors for AD have a causal impact on AD risk. In Mendelian randomization, causal associations are inferred from the effects of genetic variants (which predict levels of modifiable risk factors) on the outcome of interest. Because gene variants are inherited randomly, they are not prone to confounding and are free from reverse causation. So, if hypertension actually causes AD, genetic variants that affect hypertension should be associated with an altered risk of AD.
What Did the Researchers Do and Find?
The researchers identified causal associations between potentially modifiable risk factors and AD risk by analyzing the occurrence of single nucleotide polymorphisms (SNPs, a type of gene variant) known to predict levels of each risk factor, in genetic data from 17,008 individuals with AD and 37,154 cognitively normal elderly controls collected by the International Genomics of Alzheimer’s Project. They report that genetically predicted higher systolic blood pressure (SBP; the pressure exerted on the inside of large blood vessels when the heart is pumping out blood) was associated with lower AD risk (and with a higher probability of taking antihypertensive medication). Predicted smoking quantity was also associated with lower AD risk, but there was no evidence of causal associations between any of the other risk factors investigated and AD risk.
What Do These Findings Mean?
In contrast to some epidemiological studies, these findings suggest that hypertension is associated with lower AD risk. However, because genetically predicted higher SBP was also associated with a higher probability of taking antihypertensive medication, it could be that exposure to such drugs, rather than having hypertension, reduces AD risk. Like all Mendelian randomization studies, the reliability of these findings depends on the validity of several assumptions made by the researchers and on the ability of the SNPs used in the analyses to explain variations in exposure to the various risk factors. Moreover, because all the participants in the International Genomics of Alzheimer’s Project are of European ancestry, these findings may not be valid for other ethnic groups. Given that hypertension is a risk factor for cardiovascular disease, the researchers do not advocate raising blood pressure as a measure to prevent AD (neither do they advocate that people smoke more cigarettes to lower AD risk). Rather, given the strong association between higher SBP gene scores and the probability of exposure to antihypertensive treatment, they suggest that the possibility that antihypertensive drugs might reduce AD risk independently of their effects on blood pressure should be investigated as a priority.
Additional Information
This list of resources contains links that can be accessed when viewing the PDF on a device or via the online version of the article at http://dx.doi.org/10.1371/journal.pmed.1001841.
The UK National Health Service Choices website provides information (including personal stories) about Alzheimer disease
The UK not-for-profit organization Alzheimer’s Society provides information for patients and carers about dementia, including personal experiences of living with Alzheimer disease
The US not-for-profit organization Alzheimer’s Association also provides information for patients and carers about dementia and personal stories about dementia
Alzheimer’s Disease International is the federation of Alzheimer disease associations around the world; it provides links to individual Alzheimer associations, information about dementia, and links to world Alzheimer reports
MedlinePlus provides links to additional resources about Alzheimer disease (in English and Spanish)
Wikipedia has a page on Mendelian randomization (note: Wikipedia is a free online encyclopedia that anyone can edit; available in several languages)
A PLOS Medicine Research Article by Proitsi et al. describes a Mendelian randomization study that looked for a causal association between dyslipidemia and Alzheimer disease
doi:10.1371/journal.pmed.1001841
PMCID: PMC4469461  PMID: 26079503
21.  Cervical Cancer Precursors and Hormonal Contraceptive Use in HIV-Positive Women: Application of a Causal Model and Semi-Parametric Estimation Methods 
PLoS ONE  2014;9(6):e101090.
Objective
To demonstrate the application of causal inference methods to observational data in the obstetrics and gynecology field, particularly causal modeling and semi-parametric estimation.
Background
Human immunodeficiency virus (HIV)-positive women are at increased risk for cervical cancer and its treatable precursors. Determining whether potential risk factors such as hormonal contraception are true causes is critical for informing public health strategies as longevity increases among HIV-positive women in developing countries.
Methods
We developed a causal model of the factors related to combined oral contraceptive (COC) use and cervical intraepithelial neoplasia 2 or greater (CIN2+) and modified the model to fit the observed data, drawn from women in a cervical cancer screening program at HIV clinics in Kenya. Assumptions required for substantiation of a causal relationship were assessed. We estimated the population-level association using semi-parametric methods: g-computation, inverse probability of treatment weighting, and targeted maximum likelihood estimation.
Results
We identified 2 plausible causal paths from COC use to CIN2+: via HPV infection and via increased disease progression. Study data enabled estimation of the latter only with strong assumptions of no unmeasured confounding. Of 2,519 women under 50 screened per protocol, 219 (8.7%) were diagnosed with CIN2+. Marginal modeling suggested a 2.9% (95% confidence interval 0.1%, 6.9%) increase in prevalence of CIN2+ if all women under 50 were exposed to COC; the significance of this association was sensitive to method of estimation and exposure misclassification.
Conclusion
Use of causal modeling enabled clear representation of the causal relationship of interest and the assumptions required to estimate that relationship from the observed data. Semi-parametric estimation methods provided flexibility and reduced reliance on correct model form. Although selected results suggest an increased prevalence of CIN2+ associated with COC, evidence is insufficient to conclude causality. Priority areas for future studies to better satisfy causal criteria are identified.
doi:10.1371/journal.pone.0101090
PMCID: PMC4076246  PMID: 24979709
22.  When to Start Antiretroviral Therapy in Children Aged 2–5 Years: A Collaborative Causal Modelling Analysis of Cohort Studies from Southern Africa 
PLoS Medicine  2013;10(11):e1001555.
Michael Schomaker and colleagues estimate the mortality associated with starting ART at different CD4 thresholds among children aged 2–5 years using observational data collected in cohort studies in Southern Africa.
Please see later in the article for the Editors' Summary
Background
There is limited evidence on the optimal timing of antiretroviral therapy (ART) initiation in children 2–5 y of age. We conducted a causal modelling analysis using the International Epidemiologic Databases to Evaluate AIDS–Southern Africa (IeDEA-SA) collaborative dataset to determine the difference in mortality when starting ART in children aged 2–5 y immediately (irrespective of CD4 criteria), as recommended in the World Health Organization (WHO) 2013 guidelines, compared to deferring to lower CD4 thresholds, for example, the WHO 2010 recommended threshold of CD4 count <750 cells/mm3 or CD4 percentage (CD4%) <25%.
Methods and Findings
ART-naïve children enrolling in HIV care at IeDEA-SA sites who were between 24 and 59 mo of age at first visit and with ≥1 visit prior to ART initiation and ≥1 follow-up visit were included. We estimated mortality for ART initiation at different CD4 thresholds for up to 3 y using g-computation, adjusting for measured time-dependent confounding of CD4 percent, CD4 count, and weight-for-age z-score. Confidence intervals were constructed using bootstrapping.
The median (first; third quartile) age at first visit of 2,934 children (51% male) included in the analysis was 3.3 y (2.6; 4.1), with a median (first; third quartile) CD4 count of 592 cells/mm3 (356; 895) and median (first; third quartile) CD4% of 16% (10%; 23%). The estimated cumulative mortality after 3 y for ART initiation at different CD4 thresholds ranged from 3.4% (95% CI: 2.1–6.5) (no ART) to 2.1% (95% CI: 1.3%–3.5%) (ART irrespective of CD4 value). Estimated mortality was overall higher when initiating ART at lower CD4 values or not at all. There was no mortality difference between starting ART immediately, irrespective of CD4 value, and ART initiation at the WHO 2010 recommended threshold of CD4 count <750 cells/mm3 or CD4% <25%, with mortality estimates of 2.1% (95% CI: 1.3%–3.5%) and 2.2% (95% CI: 1.4%–3.5%) after 3 y, respectively. The analysis was limited by loss to follow-up and the unavailability of WHO staging data.
Conclusions
The results indicate no mortality difference for up to 3 y between ART initiation irrespective of CD4 value and ART initiation at a threshold of CD4 count <750 cells/mm3 or CD4% <25%, but there are overall higher point estimates for mortality when ART is initiated at lower CD4 values.
Editors' Summary
Background
Infection with HIV, the virus that causes AIDS, contributes substantially to the burden of disease in children. Worldwide, more than 3 million children younger than 15 years old (90% of whom live in sub-Saharan Africa) are HIV-positive, and every year around 330,000 more children are infected with HIV. Children usually acquire HIV from their mother during pregnancy, birth, or breastfeeding. The virus gradually destroys CD4 lymphocytes and other immune system cells, leaving infected children susceptible to other potentially life-threatening infections. HIV infection can be kept in check, with antiretroviral therapy (ART)—cocktails of drugs that have to be taken daily throughout life. ART is very effective in children but is expensive, and despite concerted international efforts over the past decade to provide universal access to ART, in 2011, less than a third of children who needed ART were receiving it.
Why Was This Study Done?
For children diagnosed as HIV-positive between the ages of two and five years, the 2010 World Health Organization (WHO) guidelines for the treatment of HIV infection recommended that ART be initiated when the CD4 count dropped below 750 cells/mm3 blood or when CD4 cells represented less than 25% of the total lymphocyte population (CD4 percent). Since June 2013, however, WHO has recommended that all HIV-positive children in this age group begin ART immediately, irrespective of their CD4 values. Earlier ART initiation might reduce mortality (death) and morbidity (illness), but it could also increase the risk of toxicity and of earlier development of drug resistance. In this causal modeling analysis, the researchers estimate the mortality associated with starting ART at different CD4 thresholds among children aged 2–5 years using observational data collected in cohort studies of ART undertaken in southern Africa. Specifically, they compared the estimated mortality associated with the WHO 2010 and WHO 2013 guidelines. Observational studies compare the outcomes of groups (cohorts) with different interventions (here, the timing of ART initiation). Data from such studies are affected by time-dependent confounding: CD4 count, for example, varies with time and is a predictor of both ART initiation and the probability of death. Causal modeling techniques take time-dependent confounding into account and enable the estimation of the causal effect of an intervention on an outcome from observational data.
What Did the Researchers Do and Find?
The researchers used g-computation (a type of causal modeling) adjusting for time-dependent confounding of CD4 percent, CD4 count, and weight-for-age z-score (a measure of whether a child is underweight for their age that provides a proxy indicator of the clinical stage of HIV infection) to estimate mortality for ART initiation at different CD4 thresholds in 2,934 ART-naïve, HIV-positive children aged 2–5 years old at their first visit to one of eight study sites in southern Africa. The average initial CD4 values of these children were a CD4 count of 592 cells/mm3 and a CD4 percent of 16%. The estimated cumulative mortality after three years was 3.4% in all children if ART was never started. If all children had started ART immediately after diagnosis irrespective of CD4 value or if the 2010 WHO-recommended threshold of a CD4 count below 750 cells/mm3 or a CD4 percent below 25% was followed, the estimated cumulative mortalities after three years were 2.1% and 2.2%, respectively (a statistically non-significant difference).
What Do These Findings Mean?
These findings suggest that, among southern African children aged 2–5 years at HIV diagnosis, there is no difference in mortality for up to three years between children in whom ART is initiated immediately and those in whom ART initiation is deferred until their CD4 value falls below a CD4 count of 750 cells/mm3 or a CD4 percent of 25%. Although causal modeling was used in this analysis, the accuracy of these results may be affected by residual confounding. For example, the researchers were unable to adjust for the clinical stage of HIV disease at HIV diagnosis and instead had to use weight-for-age z-scores as a proxy indicator of disease severity. Other limitations of the study include the large number of children lost to follow-up and a possible lack of generalizability—most of the study participants were from urban settings in South Africa. Importantly, however, these findings suggest that the recent change in the WHO guidelines for ART initiation in young children is unlikely to increase or reduce mortality, with the proviso that the long-term effects of earlier ART initiation such as toxicity and the development of resistance to ART need to be explored further.
Additional Information
Please access these websites via the online version of this summary at http://dx.doi.org/10.1371/journal.pmed.1001555
Information is available from the US National Institute of Allergy and Infectious Diseases on HIV infection and AIDS
NAM/aidsmap provides basic information about HIV/AIDS and summaries of recent research findings on HIV care and treatment
Information is available from Avert, an international AIDS charity, on many aspects of HIV/AIDS, including information on HIV and AIDS in Africa and on children and HIV/AIDS (in English and Spanish)
The UNAIDS World AIDS Day Report 2012 provides up-to-date information about the AIDS epidemic and efforts to halt it; the 2013 Progress Report on the Global Plan provides information on progress towards eliminating new HIV infections among children by 2015
The World Health Organization provides information about universal access to AIDS treatment (in several languages); its 2010 guidelines for ART in infants and children and its 2013 consolidated guidelines on the use of ART can be downloaded
The researchers involved in this study are part of the International Epidemiologic Databases to Evaluate AIDS Southern Africa collaboration, which develops and implements methodology to generate the large datasets needed to address high-priority research questions related to HIV/AIDS
Personal stories about living with HIV/AIDS, including stories from young people infected with HIV, are available through Avert, through NAM/aidsmap, and through the charity website Healthtalkonline
doi:10.1371/journal.pmed.1001555
PMCID: PMC3833834  PMID: 24260029
23.  Can Linear Regression Modeling Help Clinicians in the Interpretation of Genotypic Resistance Data? An Application to Derive a Lopinavir-Score 
PLoS ONE  2011;6(11):e25665.
Background
It remains largely unanswered whether a score for a specific antiretroviral (lopinavir/r in this analysis) that improves on the prediction of viral load response provided by existing expert-based interpretation systems (IS) can be derived by statistically analyzing the correlation between genotypic data and virological response.
Methods and Findings
We used data from patients in the UK Collaborative HIV Cohort (UK CHIC) Study for whom genotypic data were stored in the UK HIV Drug Resistance Database (UK HDRD) to construct a training/validation dataset of treatment change episodes (TCE). We used the average squared error (ASE) on 10-fold cross-validation and on a test dataset (the EuroSIDA TCE database) to compare the performance of a newly derived lopinavir/r score with that of the three most widely used expert-based interpretation rules (ANRS, HIVDB, and Rega). Our analysis identified mutations V82A, I54V, K20I, and I62V, which were associated with reduced viral response, and mutations I15V and V91S, which were associated with lopinavir/r hypersensitivity. All models performed equally well (ASE on the test dataset ranging between 1.1 and 1.3, p = 0.34).
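As a sketch of the general approach (not the exact published analysis, which also involved variable selection and the external EuroSIDA test set), the Python fragment below fits a linear regression of virological response on 0/1 mutation indicators and evaluates the resulting score with the 10-fold cross-validated ASE; the file lopinavir_tce.csv and the column vl_change are hypothetical stand-ins for the TCE training data.

# A minimal sketch: derive a lopinavir/r mutation score by linear regression and
# score it with the average squared error (ASE) under 10-fold cross-validation.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_ase(X, y, n_splits=10, seed=0):
    """10-fold cross-validated average squared error of the linear score."""
    errors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        errors.append(np.mean((y[test] - model.predict(X[test])) ** 2))
    return float(np.mean(errors))

# One row per treatment change episode: 0/1 indicators for mutations such as
# "V82A" or "I54V", plus the virological response "vl_change".
tce = pd.read_csv("lopinavir_tce.csv")                       # hypothetical file
mutations = [c for c in tce.columns if c != "vl_change"]
X, y = tce[mutations].to_numpy(), tce["vl_change"].to_numpy()

print("CV ASE of the regression-derived score:", cv_ase(X, y))
score = LinearRegression().fit(X, y)                         # final score: one weight per mutation
# The sign of each fitted coefficient indicates whether a mutation is associated with
# reduced response or with hypersensitivity, depending on how the response is coded.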
Conclusions
We fully explored the potential of linear regression to construct a simple predictive model for lopinavir/r-based TCE. Although the performance of our proposed score was similar to that of existing IS, previously unrecognized lopinavir/r-associated mutations were identified. The analysis illustrates an approach to validating expert-based IS that could be applied in the future to other antiretrovirals and in settings outside HIV research.
doi:10.1371/journal.pone.0025665
PMCID: PMC3217925  PMID: 22110581
24.  Semiparametric Estimation of the Impacts of Longitudinal Interventions on Adolescent Obesity using Targeted Maximum-Likelihood: Accessible Estimation with the ltmle Package 
Journal of causal inference  2014;2(1):95-108.
While child and adolescent obesity is a serious public health concern, few studies have utilized parameters based on the causal inference literature to examine the potential impacts of early intervention. The purpose of this analysis was to estimate the causal effects of early interventions to improve physical activity and diet during adolescence on body mass index (BMI), a measure of adiposity, using semiparametric estimation techniques. The most widespread statistical method in studies of child and adolescent obesity is multivariable regression, with the parameter of interest being the coefficient on the variable of interest. This approach does not appropriately adjust for time-dependent confounding, and the modeling assumptions may not always be met. An alternative parameter to estimate is one motivated by the causal inference literature, which can be interpreted as the mean change in the outcome under interventions that set the exposure of interest. The underlying data-generating distribution, upon which the estimator is based, can be estimated via a parametric or semi-parametric approach. Using data from the National Heart, Lung, and Blood Institute Growth and Health Study, a 10-year prospective cohort study of adolescent girls, we estimated the longitudinal impact of physical activity and diet interventions on 10-year BMI z-scores via a parameter motivated by the causal inference literature, using both parametric and semi-parametric estimation approaches. The parameters of interest were estimated with a recently released R package, ltmle, for estimating means under general longitudinal treatment regimes. We found that early, sustained intervention on total calories had a greater impact than a physical activity intervention or non-sustained interventions. Multivariable linear regression yielded inflated effect estimates compared with estimates based on targeted maximum-likelihood estimation and data-adaptive super learning. Our analysis demonstrates that sophisticated semiparametric estimation of longitudinal treatment-specific means via ltmle provides a powerful yet easy-to-use tool, removing impediments to putting theory into practice.
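The ltmle R package implements the full procedure (data-adaptive super learning followed by a targeting step). As a rough illustration of the underlying idea only, the Python sketch below computes the sequential-regression (iterated conditional expectation) estimate of the mean outcome under a static two-time-point regime; the column names L0, A0, L1, A1, Y and the file growth_study.csv are hypothetical, and neither targeting nor super learning is applied.

# A minimal sketch of the sequential-regression backbone of longitudinal
# treatment-specific mean estimation; column names and data are hypothetical.
import pandas as pd
from sklearn.linear_model import LinearRegression

def sequential_regression_mean(df, Lnodes, Anodes, Ynode, abar):
    """Estimate the mean outcome under the regime `abar` by regressing backwards in time."""
    q = df[Ynode].to_numpy(dtype=float)                # start from the observed outcome
    for t in reversed(range(len(Anodes))):
        hist = Lnodes[: t + 1] + Anodes[: t + 1]       # history through treatment at time t
        model = LinearRegression().fit(df[hist], q)    # regress the current pseudo-outcome on history
        intervened = df[hist].copy()
        intervened[Anodes[t]] = abar[t]                # set treatment at time t to the regime value
        q = model.predict(intervened)                  # predicted pseudo-outcome under the regime
    return float(q.mean())                             # average over the baseline distribution

# Wide-format data: baseline covariate L0, treatment A0, follow-up covariate L1,
# treatment A1, and final outcome Y (e.g. the 10-year BMI z-score).
df = pd.read_csv("growth_study.csv")                   # hypothetical file
print(sequential_regression_mean(df, Lnodes=["L0", "L1"], Anodes=["A0", "A1"],
                                 Ynode="Y", abar=[1, 1]))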
doi:10.1515/jci-2013-0025
PMCID: PMC4452010  PMID: 26046009
obesity; longitudinal data; causal inference
25.  Extracting causal relations on HIV drug resistance from literature 
BMC Bioinformatics  2010;11:101.
Background
In HIV treatment it is critical to have up-to-date resistance data for the applicable drugs, since HIV has a very high mutation rate. These data are made available through scientific publications and must be extracted manually by experts before they can be used by virologists and medical doctors. There is therefore an urgent need for a tool that partially automates this process and is able to retrieve relations between drugs and viral mutations from the literature.
Results
In this work we present a novel method to extract and combine relationships between HIV drugs and mutations in viral genomes. Our extraction method is based on natural language processing (NLP), which produces grammatical relations, and applies a set of rules to these relations. We applied our method to a relevant set of PubMed abstracts and obtained 2,434 extracted relations with an estimated F-score of 84%. We then combined the extracted relations using logistic regression to generate a resistance value for each drug-mutation pair. The results of this combination step show more than 85% agreement with the Stanford HIVDB for the ten most frequently occurring mutations. The system is used in five hospitals from the Virolab project (http://www.virolab.org) to preselect the most relevant novel resistance data from the literature and present them to virologists and medical doctors for further evaluation.
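To make the combination step concrete, the Python sketch below aggregates individual extracted relations into per-pair mention counts and fits a logistic regression against gold-standard resistance calls. The files extracted_relations.csv and hivdb_labels.csv, the column names, and the simple count features are hypothetical simplifications of the pipeline described above.

# A minimal sketch of combining extracted drug-mutation relations into a single
# resistance value per pair; input files and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per extracted relation: drug, mutation, and the relation type
# ("resistance" or "susceptibility") produced by the NLP rule set.
relations = pd.read_csv("extracted_relations.csv")

# Aggregate into per-pair counts of resistance and susceptibility mentions.
counts = (relations
          .groupby(["drug", "mutation", "relation"]).size()
          .unstack("relation", fill_value=0)
          .reset_index())

# Gold-standard resistance calls (0/1) per pair, e.g. derived from Stanford HIVDB,
# used only to fit the combiner.
labels = pd.read_csv("hivdb_labels.csv")
data = counts.merge(labels, on=["drug", "mutation"])

X = data[["resistance", "susceptibility"]]
combiner = LogisticRegression().fit(X, data["resistant"])
data["resistance_value"] = combiner.predict_proba(X)[:, 1]   # per-pair probability of resistance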
Conclusions
The proposed relation extraction and combination method performs well at extracting HIV drug resistance data and can be used in large-scale relation extraction experiments. The developed methods can also be applied to extract other types of relations, such as gene-protein, gene-disease, and disease-mutation relations.
doi:10.1186/1471-2105-11-101
PMCID: PMC2841207  PMID: 20178611
