We propose statistical methods for comparing phenomics data generated by the Biolog Phenotype Microarray (PM) platform for high-throughput phenotyping. Instead of the routinely used visual inspection of data with no sound inferential basis, we develop two approaches. The first approach is based on quantifying the distance between mean or median curves from two treatments and then applying a permutation test; we also consider a permutation test applied to areas under mean curves. The second approach employs functional principal component analysis. Properties of the proposed methods are investigated on both simulated data and data sets from the PM platform.
functional data analysis; principal components; permutation tests; phenotype microarrays; high-throughput phenotyping; phenomics; Biolog
Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter ɛ representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at ɛ = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.
When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.
targeted maximum likelihood estimation; TMLE; causal effect
In their presentation on measures of predictive capacity Gu and Pepe say little about calibration. This comment distinguishes conditional and unconditional calibration and how these relate to the stated results.
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
Multilevel logistic regression models are increasingly being used to analyze clustered data in medical, public health, epidemiological, and educational research. Procedures for estimating the parameters of such models are available in many statistical software packages. There is currently little evidence on the minimum number of clusters necessary to reliably fit multilevel regression models. We conducted a Monte Carlo study to compare the performance of different statistical software procedures for estimating multilevel logistic regression models when the number of clusters was low. We examined procedures available in BUGS, HLM, R, SAS, and Stata. We found that there were qualitative differences in the performance of different software procedures for estimating multilevel logistic models when the number of clusters was low. Among the likelihood-based procedures, estimation methods based on adaptive Gauss-Hermite approximations to the likelihood (glmer in R and xtlogit in Stata) or adaptive Gaussian quadrature (Proc NLMIXED in SAS) tended to have superior performance for estimating variance components when the number of clusters was small, compared to software procedures based on penalized quasi-likelihood. However, only Bayesian estimation with BUGS allowed for accurate estimation of variance components when there were fewer than 10 clusters. For all statistical software procedures, estimation of variance components tended to be poor when there were only five subjects per cluster, regardless of the number of clusters.
statistical software; multilevel models; hierarchical models; random effects model; mixed effects model; generalized linear mixed models; Monte Carlo simulations; Bayesian analysis; R; SAS; Stata; BUGS
Targeted maximum likelihood estimation is a versatile tool for estimating parameters in semiparametric and nonparametric models. We work through an example applying targeted maximum likelihood methodology to estimate the parameter of a marginal structural model. In the case we consider, we show how this can be easily done by clever use of standard statistical software. We point out differences between targeted maximum likelihood estimation and other approaches (including estimating function based methods). The application we consider is to estimate the effect of adherence to antiretroviral medications on virologic failure in HIV positive individuals.
targeted maximum likelihood; marginal structural model
Dynamic treatment regimes are the type of regime most commonly used in clinical practice. For example, physicians may initiate combined antiretroviral therapy the first time an individual’s recorded CD4 cell count drops below either 500 cells/mm3 or 350 cells/mm3. This paper describes an approach for using observational data to emulate randomized clinical trials that compare dynamic regimes of the form “initiate treatment within a certain time period of some time-varying covariate first crossing a particular threshold.” We applied this method to data from the French Hospital database on HIV (FHDH-ANRS CO4), an observational study of HIV-infected patients, in order to compare dynamic regimes of the form “initiate treatment within m months after the recorded CD4 cell count first drops below x cells/mm3” where x takes values from 200 to 500 in increments of 10 and m takes values 0 or 3. We describe the method in the context of this example and discuss some complications that arise in emulating a randomized experiment using observational data.
dynamic treatment regimes; marginal structural models; HIV infection; antiretroviral therapy
Models, such as logistic regression and Poisson regression models, are often used to estimate treatment effects in randomized trials. These models leverage information in variables collected before randomization, in order to obtain more precise estimates of treatment effects. However, there is the danger that model misspecification will lead to bias. We show that certain easy to compute, model-based estimators are asymptotically unbiased even when the working model used is arbitrarily misspecified. Furthermore, these estimators are locally efficient. As a special case of our main result, we consider a simple Poisson working model containing only main terms; in this case, we prove the maximum likelihood estimate of the coefficient corresponding to the treatment variable is an asymptotically unbiased estimator of the marginal log rate ratio, even when the working model is arbitrarily misspecified. This is the log-linear analog of ANCOVA for linear models. Our results demonstrate one application of targeted maximum likelihood estimation.
misspecified model; targeted maximum likelihood; generalized linear model; Poisson regression
We consider the problem of score testing for certain low dimensional parameters of interest in a model that could include finite but high dimensional secondary covariates and associated nuisance parameters. We investigate the possibility of the potential gain in power by reducing the dimensionality of the secondary variables via oracle estimators such as the Adaptive Lasso. As an application, we use a recently developed framework for score tests of association of a disease outcome with an exposure of interest in the presence of a possible interaction of the exposure with other co-factors of the model. We derive the local power of such tests and show that if the primary and secondary predictors are independent, then having an oracle estimator does not improve the local power of the score test. Conversely, if they are dependent, there is the potential for power gain. Simulations are used to validate the theoretical results and explore the extent of correlation needed between the primary and secondary covariates to observe an improvement of the power of the test by using the oracle estimator. Our conclusions are likely to hold more generally beyond the model of interactions considered here.
Adaptive Lasso; gene-environment interactions; Lasso; model selection; oracle estimation; score tests
Repeated cross-sectional cluster randomization trials are cluster randomization trials in which the response variable is measured on a sample of subjects from each cluster at baseline and on a different sample of subjects from each cluster at follow-up. One can estimate the effect of the intervention on the follow-up response alone, on the follow-up responses after adjusting for baseline responses, or on the change in the follow-up response from the baseline response. We used Monte Carlo simulations to determine the relative statistical power of different methods of analysis. We examined methods of analysis based on generalized estimating equations (GEE) and a random effects model to account for within-cluster homogeneity. We also examined cluster-level analyses that treated the cluster as the unit of analysis. We found that the use of random effects models to estimate the effect of the intervention on the change in the follow-up response from the baseline response had lower statistical power compared to the other competing methods across a wide range of scenarios. The other methods tended to have similar statistical power in many settings. However, in some scenarios, those analyses that adjusted for the baseline response tended to have marginally greater power than did methods that did not account for the baseline response.
cluster randomization trials; cluster randomized trials; group randomized trials; statistical power; simulations; community intervention trials; clustered data; cross-sectional studies
In this companion article to “Dynamic Regime Marginal Structural Mean Models for Estimation of Optimal Dynamic Treatment Regimes, Part I: Main Content” [Orellana, Rotnitzky and Robins (2010), IJB, Vol. 6, Iss. 2, Art. 7] we present (i) proofs of the claims in that paper, (ii) a proposal for the computation of a confidence set for the optimal index when this lies in a finite set, and (iii) an example to aid the interpretation of the positivity assumption.
dynamic treatment regime; double-robust; inverse probability weighted; marginal structural model; optimal treatment regime; causality
Dynamic treatment regime is a decision rule in which the choice of the treatment of an individual at any given time can depend on the known past history of that individual, including baseline covariates, earlier treatments, and their measured responses. In this paper we argue that finding an optimal regime can, at least in moderately simple cases, be accomplished by a straightforward application of nonparametric Bayesian modeling and predictive inference. As an illustration we consider an inference problem in a subset of the Multicenter AIDS Cohort Study (MACS) data set, studying the effect of AZT initiation on future CD4-cell counts during a 12-month follow-up.
Bayesian nonparametric regression; causal inference; dynamic programming; monotonicity; optimal dynamic regimes
This paper summarizes recent advances in causal inference and underscores the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries: those about (1) the effects of potential interventions, (2) probabilities of counterfactuals, and (3) direct and indirect effects (also known as "mediation"). Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both. The tools are demonstrated in the analyses of mediation, causes of effects, and probabilities of causation.
structural equation models; confounding; graphical methods; counterfactuals; causal effects; potential-outcome; mediation; policy evaluation; causes of effects
A number of results concerning attributable fractions for sufficient cause interactions are given. Results are given both for etiologic fractions (i.e. the proportion of the disease due to a particular sufficient cause) and for excess fractions (i.e. the proportion of disease that could be eliminated by removing a particular sufficient cause). Results are given both with and without assumptions of monotonicity. Under monotonicity assumptions, exact formulas can be given for the excess fraction. When etiologic fractions are of interest or when monotonicity assumptions do not hold for excess fractions then only lower bounds can be given. The interpretation of the results in this paper and in a proposal by Hoffmann et al. (2006) are discussed and compared. A method is described to estimate the lower bounds on attributable fractions using marginal structural models. Identification is discussed in settings in which time-dependent confounding may be present.
attributable fraction; interaction; marginal structural models; sufficient cause; synergism
Malaria is a major public health problem. An effective vaccine against malaria is actively being sought. We formulate a potential outcomes definition of the efficacy of a malaria vaccine for preventing fever. A challenge in estimating this efficacy is that there is no sure way to determine whether a fever was caused by malaria. We study the properties of two approaches for estimating efficacy: (1) use a deterministic case definition of a malaria caused fever as the conjunction of fever and parasite density above a certain cutoff; (2) use a probabilistic case definition in which the probability that each fever was caused by malaria is estimated. We compare these approaches in a simulation study and find that both approaches can potentially have large biases. We suggest a strategy for choosing an estimator based on the investigator's prior knowledge about the area in which the trial is being conducted and the range of vaccine efficacies over which the investigator would like the estimator to have good properties.
causal inference; case definition; attributable fraction
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention would have been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
In this article, we provide a template for the practical implementation of the targeted maximum likelihood estimator for analyzing causal effects of multiple time point interventions, for which the methodology was developed and presented in Part I. In addition, the application of this template is demonstrated in two important estimation problems: estimation of the effect of individualized treatment rules based on marginal structural models for treatment rules, and the effect of a baseline treatment on survival in a randomized clinical trial in which the time till event is subject to right censoring.
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; path-wise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
Multistate modeling methods are well-suited for analysis of some chronic diseases that move through distinct stages. The memoryless or Markov assumptions typically made, however, may be suspect for some diseases, such as hepatitis C, where there is interest in whether prognosis depends on history. This paper describes methods for multistate modeling where transition risk can depend on any property of past progression history, including time spent in the current stage and the time taken to reach the current stage. Analysis of 901 measurements of fibrosis in 401 patients following liver transplantation found decreasing risk of progression as time in the current stage increased, even when controlled for several fixed covariates. Longer time to reach the current stage did not appear associated with lower progression risk. Analysis of simulation scenarios based on the transplant study showed that greater misclassification of fibrosis produced more technical difficulties in fitting the models and poorer estimation of covariate effects than did less misclassification or error-free fibrosis measurement. The higher risk of progression when less time has been spent in the current stage could be due to varying disease activity over time, with recent progression indicating an “active” period and consequent higher risk of further progression.
fibrosis; hepatitis C; liver transplant; memoryless assumptions; multistate modeling
Self modeling regression (SEMOR) is an approach for modeling sets of observed curves that have a common shape (or sequence of features) but have variability in the amplitude (y-axis) and/or timing (x-axis) of the features across curves. SEMOR assumes the x and y axes for each observed curve can be separately transformed in a parametric manner so that the features across curves are aligned with the common shape, usually represented by non-parametric function. We show that when the common shape is modeled with a regression spline and the transformational parameters are modeled as random with the traditional distribution (normal with mean zero), the SEMOR model may surprisingly suffer from lack of fit and the variance components may be over-estimated. A random effects distribution that restricts the predicted random transformational parameters to have mean zero or the inclusion of a fixed transformational parameter improves estimation. Our work is motivated by arterial pulse pressure waveform data where one of the variance components is a novel measure of short-term variability in blood pressure.
functional data; nonlinear mixed effects models; self-modeling