Summary
Despite modern effective HIV treatment, hepatitis C virus (HCV) co-infection is associated with a high risk of progression to end-stage liver disease (ESLD) which has emerged as the primary cause of death in this population. Clinical interest lies in determining the impact of clearance of HCV on risk for ESLD. In this case study, we examine whether HCV clearance affects risk of ESLD using data from the multicenter Canadian Co-infection Cohort Study. Complications in this survival analysis arise from the time-dependent nature of the data, the presence of baseline confounders, loss to follow-up, and confounders that change over time, all of which can obscure the causal effect of interest. Additional challenges included non-censoring variable missingness and event sparsity.
In order to efficiently estimate the ESLD-free survival probabilities under a specific history of HCV clearance, we demonstrate the doubly-robust and semiparametric efficient method of Targeted Maximum Likelihood Estimation (TMLE). Marginal structural models (MSM) can be used to model the effect of viral clearance (expressed as a hazard ratio) on ESLD-free survival and we demonstrate a way to estimate the parameters of a logistic model for the hazard function with TMLE. We show the theoretical derivation of the efficient influence curves for the parameters of two different MSMs and how they can be used to produce variance approximations for parameter estimates. Finally, the data analysis evaluating the impact of HCV on ESLD was undertaken using multiple imputations to account for the non-monotone missing data.
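The link between a logistic model for the discrete-time hazard and the ESLD-free survival probability can be sketched as follows. This is only an illustration of the product formula S(t) = (1 - h(1)) × ... × (1 - h(t)); the function names, the linear time term, and every coefficient value are hypothetical and are not taken from the study, and clearance status is held fixed over time as a simplification.

```python
import math


def expit(x: float) -> float:
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def survival_curve(beta0, beta_time, beta_clear, cleared, n_intervals):
    """Survival probabilities S(1), ..., S(n_intervals) under a fixed clearance status.

    Hypothetical working model for the discrete hazard h(t):
        logit h(t) = beta0 + beta_time * t + beta_clear * cleared
    """
    surv, s = [], 1.0
    for t in range(1, n_intervals + 1):
        hazard = expit(beta0 + beta_time * t + beta_clear * cleared)
        s *= 1.0 - hazard  # survive interval t given survival up to t
        surv.append(s)
    return surv


# Made-up coefficients, purely for illustration.
never = survival_curve(-3.0, 0.05, -0.7, cleared=0, n_intervals=8)
always = survival_curve(-3.0, 0.05, -0.7, cleared=1, n_intervals=8)
print([round(x, 3) for x in never])
print([round(x, 3) for x in always])
```

With a negative clearance coefficient, the "always cleared" curve lies above the "never cleared" curve at every interval, which is the kind of contrast the marginal structural model summarizes.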
doi:10.1111/biom.12105
PMCID: PMC3954273
PMID: 24571372
Double-robust; Inverse probability of treatment weighting; Kaplan-Meier; Longitudinal data; Marginal structural model; Survival analysis; Targeted maximum likelihood estimation
Abstract
The Tshepo study was the first clinical trial to evaluate outcomes of adults receiving nevirapine (NVP)-based versus efavirenz (EFV)-based combination antiretroviral therapy (cART) in Botswana. This was a 3-year study (n=650) comparing the efficacy and tolerability of various first-line cART regimens, stratified by baseline CD4+: <200 (low) vs. 201-350 (high). Using targeted maximum likelihood estimation (TMLE), we retrospectively evaluated the causal effect of assigned NNRTI on time to virologic failure or death [intent-to-treat (ITT)] and time to minimum of virologic failure, death, or treatment-modifying toxicity [time to loss of virological response (TLOVR)] by sex and baseline CD4+. Sex significantly modified the effect of EFV versus NVP for both the ITT and TLOVR outcomes, with risk differences in the probability of survival of males versus females of approximately 6% (p=0.015) and 12% (p=0.001), respectively. Baseline CD4+ also modified the effect of EFV versus NVP for the TLOVR outcome, with a mean difference in survival probability of approximately 12% (p=0.023) in the high versus low CD4+ cell count group. TMLE appears to be an efficient technique that allows for the clinically meaningful delineation and interpretation of the causal effect of NNRTI treatment and effect modification by sex and baseline CD4+ cell count strata in this study. EFV-treated women and NVP-treated men had more favorable cART outcomes. In addition, adults initiating EFV-based cART at higher baseline CD4+ cell count values had more favorable outcomes compared to those initiating NVP-based cART.
doi:10.1089/aid.2011.0349
PMCID: PMC3423643
PMID: 22309114
We define a new measure of variable importance of an exposure on a continuous outcome, accounting for potential confounders. The exposure features a reference level x0 with positive mass and a continuum of other levels. For the purpose of estimating it, we fully develop the semi-parametric estimation methodology called targeted minimum loss estimation methodology (TMLE) [23, 22]. We cover the whole spectrum of its theoretical study (convergence of the iterative procedure which is at the core of the TMLE methodology; consistency and asymptotic normality of the estimator), practical implementation, simulation study and application to a genomic example that originally motivated this article. In the latter, the exposure X and response Y are, respectively, the DNA copy number and expression level of a given gene in a cancer cell. Here, the reference level is x0 = 2, that is the expected DNA copy number in a normal cell. The confounder is a measure of the methylation of the gene. The fact that there is no clear biological indication that X and Y can be interpreted as an exposure and a response, respectively, is not problematic.
doi:10.1214/12-EJS703
PMCID: PMC3546832
PMID: 23336014
Variable importance measure; non-parametric estimation; targeted minimum loss estimation; robustness; asymptotics
A new class of Marginal Structural Models (MSMs), History-Restricted MSMs (HRMSMs), was recently introduced for longitudinal data for the purpose of defining causal parameters which may often be better suited for public health research or at least more practicable than MSMs (6, 2). HRMSMs allow investigators to analyze the causal effect of a treatment on an outcome based on a fixed, shorter and user-specified history of exposure compared to MSMs. By default, the latter represent the treatment causal effect of interest based on a treatment history defined by the treatments assigned between the study’s start and outcome collection. We lay out in this article the formal statistical framework behind HRMSMs. Beyond allowing a more flexible causal analysis, HRMSMs improve computational tractability and mitigate statistical power concerns when designing longitudinal studies. We also develop three consistent estimators of HRMSM parameters under sufficient model assumptions: the Inverse Probability of Treatment Weighted (IPTW), G-computation and Double Robust (DR) estimators. In addition, we show that the assumptions commonly adopted for identification and consistent estimation of MSM parameters (existence of counterfactuals, consistency, time-ordering and sequential randomization assumptions) also lead to identification and consistent estimation of HRMSM parameters.
doi:10.1214/07-EJS050
PMCID: PMC3475192
PMID: 23087778
causal inference; counterfactual; marginal structural model; longitudinal study; IPTW; G-computation; Double Robust
Background:
The evidence for the effectiveness of antihypertensive medication use for slowing decline in kidney function in older persons is sparse. We addressed this research question by the application of novel methods in a marginal structural model.
Methods:
Change in kidney function was measured by two or more measures of cystatin C in 1,576 hypertensive participants in the Cardiovascular Health Study over 7 years of follow-up (1989–1997 in four U.S. communities). The exposure of interest was antihypertensive medication use. We used a novel estimator in a marginal structural model to account for bias due to confounding and informative censoring.
Results:
The mean annual decline in eGFR was 2.41 ± 4.91 mL/min/1.73 m². In unadjusted analysis, antihypertensive medication use was not associated with annual change in kidney function. Traditional multivariable regression did not substantially change these estimates. Based on a marginal structural analysis, persons on antihypertensives had slower declines in kidney function; participants had an estimated 0.88 (0.13, 1.63) mL/min/1.73 m² per year slower decline in eGFR compared with persons on no treatment. In a model that also accounted for bias due to informative censoring, the estimate for the treatment effect was 2.23 (−0.13, 4.59) mL/min/1.73 m² per year slower decline in eGFR.
Conclusion:
In summary, estimates from a marginal structural model suggested that antihypertensive therapy was associated with preserved kidney function in hypertensive elderly adults. Confirmatory studies may provide power to determine the strength and validity of the findings.
doi:10.2202/1557-4679.1320
PMCID: PMC3204667
PMID: 22049266
aged; kidney function; hypertension; marginal structural model
There is an active debate in the literature on censored data about the relative performance of model-based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs whose parametric submodel is guaranteed to respect the global bounds on the continuous outcomes are especially suitable for dealing with positivity violations because, in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
doi:10.2202/1557-4679.1308
PMCID: PMC3173607
PMID: 21931570
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
Quantitative trait loci mapping is focused on identifying the positions and effects of genes underlying an observed trait. We present a collaborative targeted maximum likelihood estimator in a semi-parametric model using a newly proposed 2-part super learning algorithm to find quantitative trait loci genes in listeria data. Results are compared to the parametric composite interval mapping approach.
doi:10.1016/j.spl.2010.11.001
PMCID: PMC3090625
PMID: 21572586
collaborative targeted maximum likelihood estimation; quantitative trait loci; super learner; machine learning
The Cox proportional hazards model and its discrete-time analogue, the logistic failure time model, posit highly restrictive parametric models and attempt to estimate parameters which are specific to the model proposed. These methods are typically implemented when assessing effect modification in survival analyses despite their flaws. The targeted maximum likelihood estimation (TMLE) methodology is more robust than the methods typically implemented and allows practitioners to estimate parameters that directly answer the question of interest. TMLE will be used in this paper to estimate two newly proposed parameters of interest that quantify effect modification in the time-to-event setting. These methods are then applied to the Tshepo study to assess whether either gender or baseline CD4 level modifies the effect of two cART therapies of interest, efavirenz (EFV) and nevirapine (NVP), on the progression of HIV. The results show that women tend to have more favorable outcomes using EFV while males tend to have more favorable outcomes with NVP. Furthermore, EFV tends to be favorable compared to NVP for individuals at high CD4 levels.
doi:10.2202/1557-4679.1307
PMCID: PMC3083138
PMID: 21556287
causal effect; semi-parametric; censored longitudinal data; double robust; efficient influence curve; influence curve; G-computation; Targeted Maximum Likelihood Estimation; Cox-proportional hazards; survival analysis
We consider two-stage sampling designs, including so-called nested case control studies, where one takes a random sample from a target population and completes measurements on each subject in the first stage. The second stage involves drawing a subsample from the original sample, collecting additional data on the subsample. This data structure can be viewed as a missing data structure on the full-data structure collected in the second-stage of the study. Methods for analyzing two-stage designs include parametric maximum likelihood estimation and estimating equation methodology. We propose an inverse probability of censoring weighted targeted maximum likelihood estimator (IPCW-TMLE) in two-stage sampling designs and present simulation studies featuring this estimator.
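The weighting idea behind inverse probability of censoring weights in a two-stage design can be illustrated with a weighted mean: each subject selected into the second stage is weighted by the inverse of its selection probability. This is a minimal sketch of that weighting step only, not of the IPCW-TMLE itself; the function name and the toy numbers are hypothetical.

```python
def ipcw_mean(values, selected, select_prob):
    """Normalized inverse-probability-weighted mean over second-stage subjects.

    values:      second-stage measurement (None when not collected)
    selected:    1 if the subject entered the second stage, else 0
    select_prob: known probability of second-stage selection for each subject
    """
    num = den = 0.0
    for y, s, p in zip(values, selected, select_prob):
        if s:
            w = 1.0 / p  # subject stands in for 1/p first-stage subjects
            num += w * y
            den += w
    return num / den


# Toy first-stage sample: 2 cases (always sampled), 6 controls (sampled with prob 1/3).
values = [10.0, 12.0, 2.0, 4.0, None, None, None, None]
selected = [1, 1, 1, 1, 0, 0, 0, 0]
select_prob = [1.0, 1.0] + [1.0 / 3.0] * 6

weighted = ipcw_mean(values, selected, select_prob)
unweighted = sum(v for v, s in zip(values, selected) if s) / sum(selected)
print(weighted, unweighted)
```

In this toy sample the unweighted subsample mean overrepresents the fully sampled stratum, while the weighted mean restores the first-stage proportions.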
doi:10.2202/1557-4679.1217
PMCID: PMC3083136
PMID: 21556285
two-stage designs; targeted maximum likelihood estimators; nested case control studies; double robust estimation
In longitudinal and repeated measures data analysis, often the goal is to determine the effect of a treatment or aspect on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model, where the parametric component models effect of the variable of interest and any modification by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite dimensional regression parameter, which is easily estimated using standard software for generalized estimating equations.
The targeted maximum likelihood method provides double robust and locally efficient estimates of the variable importance parameters and inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method in practice to estimating the activity of transcription factors (TFs) over the cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5.
The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms.
doi:10.2202/1544-6115.1553
PMCID: PMC3122882
PMID: 21291412
targeted maximum likelihood; semiparametric; repeated measures; longitudinal; transcription factors
Targeted maximum likelihood estimation of a parameter of a data generating distribution, known to be an element of a semi-parametric model, involves constructing a parametric model through an initial density estimator with parameter ɛ representing an amount of fluctuation of the initial density estimator, where the score of this fluctuation model at ɛ = 0 equals the efficient influence curve/canonical gradient. The latter constraint can be satisfied by many parametric fluctuation models since it represents only a local constraint of its behavior at zero fluctuation. However, it is very important that the fluctuations stay within the semi-parametric model for the observed data distribution, even if the parameter can be defined on fluctuations that fall outside the assumed observed data model. In particular, in the context of sparse data, by which we mean situations where the Fisher information is low, a violation of this property can heavily affect the performance of the estimator. This paper presents a fluctuation approach that guarantees the fluctuated density estimator remains inside the bounds of the data model. We demonstrate this in the context of estimation of a causal effect of a binary treatment on a continuous outcome that is bounded. It results in a targeted maximum likelihood estimator that inherently respects known bounds, and consequently is more robust in sparse data situations than the targeted MLE using a naive fluctuation model.
When an estimation procedure incorporates weights, observations having large weights relative to the rest heavily influence the point estimate and inflate the variance. Truncating these weights is a common approach to reducing the variance, but it can also introduce bias into the estimate. We present an alternative targeted maximum likelihood estimation (TMLE) approach that dampens the effect of these heavily weighted observations. As a substitution estimator, TMLE respects the global constraints of the observed data model. For example, when outcomes are binary, a fluctuation of an initial density estimate on the logit scale constrains predicted probabilities to be between 0 and 1. This inherent enforcement of bounds has been extended to continuous outcomes. Simulation study results indicate that this approach is on a par with, and many times superior to, fluctuating on the linear scale, and in particular is more robust when there is sparsity in the data.
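The logit-scale fluctuation for a bounded continuous outcome can be sketched as follows for the additive effect of a binary treatment. This is a minimal illustration under stated assumptions, not the authors' implementation: the outcome is rescaled to [0, 1] using known bounds, the treatment mechanism `g` is taken as known, the initial outcome fit is a deliberately crude constant, and the simulated data, function names, and Newton solver are all hypothetical choices.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def tmle_bounded_ate(W, A, Y, q_init, g, lo, hi, max_iter=50):
    """Logit-scale TMLE sketch for E[Y_1] - E[Y_0] with Y known to lie in [lo, hi].

    q_init(a, w): initial guess of E[Y | A=a, W=w] on the original scale
    g(w):         P(A=1 | W=w), assumed bounded away from 0 and 1
    """
    width = hi - lo

    def offset(a, w):
        # Rescale the initial fit to (0, 1); trim so the logit stays finite.
        q = (q_init(a, w) - lo) / width
        return logit(min(max(q, 1e-6), 1.0 - 1e-6))

    y_star = [(y - lo) / width for y in Y]
    h = [a / g(w) - (1 - a) / (1.0 - g(w)) for a, w in zip(A, W)]  # clever covariate
    off = [offset(a, w) for a, w in zip(A, W)]

    # Newton steps for the one-dimensional fluctuation parameter eps,
    # maximizing the quasi-binomial log-likelihood with a fixed offset.
    eps = 0.0
    for _ in range(max_iter):
        p = [expit(o + eps * h_i) for o, h_i in zip(off, h)]
        score = sum(h_i * (y - p_i) for h_i, y, p_i in zip(h, y_star, p))
        info = sum(h_i * h_i * p_i * (1.0 - p_i) for h_i, p_i in zip(h, p))
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break

    # Updated predictions under A=1 and A=0, mapped back to the original scale.
    q1 = [lo + width * expit(offset(1, w) + eps / g(w)) for w in W]
    q0 = [lo + width * expit(offset(0, w) - eps / (1.0 - g(w))) for w in W]
    ate = sum(a - b for a, b in zip(q1, q0)) / len(W)
    return ate, q1, q0


# Simulated data (hypothetical): true additive effect is 3, outcomes lie in [1, 8].
random.seed(1)
n = 4000
g_true = lambda w: 0.3 + 0.4 * w
W = [random.random() for _ in range(n)]
A = [1 if random.random() < g_true(w) else 0 for w in W]
Y = [2.0 + 3.0 * a + 2.0 * w + random.uniform(-1.0, 1.0) for a, w in zip(A, W)]
y_bar = sum(Y) / n

ate, q1, q0 = tmle_bounded_ate(W, A, Y, lambda a, w: y_bar, g_true, lo=0.0, hi=10.0)
print(round(ate, 3))
```

Because the update is an inverse-logit transform rescaled to [lo, hi], every updated prediction stays inside the known bounds regardless of how large the clever covariate becomes, which is the property the abstract contrasts with fluctuating on the linear scale.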
doi:10.2202/1557-4679.1260
PMCID: PMC3126669
PMID: 21731529
targeted maximum likelihood estimation; TMLE; causal effect
Background
When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of a more manageable length, ideally including all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary predecessor of the analysis because it not only reduces the cost of handling numerous variables but also has the potential to improve the performance of the downstream analysis algorithms.
Results
We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimation with respect to the parameter of interest.
Conclusions
We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
doi:10.1186/1471-2105-12-312
PMCID: PMC3166941
PMID: 21849016
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
doi:10.2202/1557-4679.1181
PMCID: PMC2898626
PMID: 20628637
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
doi:10.2202/1557-4679.1182
PMCID: PMC3126668
PMID: 21731530
causal effect; cross-validation; collaborative double robust; double robust; efficient influence curve; penalized likelihood; penalization; estimator selection; locally efficient; maximum likelihood estimation; model selection; super efficiency; super learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
Targeted maximum likelihood estimation is a versatile tool for estimating parameters in semiparametric and nonparametric models. We work through an example applying targeted maximum likelihood methodology to estimate the parameter of a marginal structural model. In the case we consider, we show how this can be easily done by clever use of standard statistical software. We point out differences between targeted maximum likelihood estimation and other approaches (including estimating function based methods). The application we consider is to estimate the effect of adherence to antiretroviral medications on virologic failure in HIV positive individuals.
doi:10.2202/1557-4679.1238
PMCID: PMC3126671
PMID: 21731532
targeted maximum likelihood; marginal structural model
Models, such as logistic regression and Poisson regression models, are often used to estimate treatment effects in randomized trials. These models leverage information in variables collected before randomization, in order to obtain more precise estimates of treatment effects. However, there is the danger that model misspecification will lead to bias. We show that certain easy to compute, model-based estimators are asymptotically unbiased even when the working model used is arbitrarily misspecified. Furthermore, these estimators are locally efficient. As a special case of our main result, we consider a simple Poisson working model containing only main terms; in this case, we prove the maximum likelihood estimate of the coefficient corresponding to the treatment variable is an asymptotically unbiased estimator of the marginal log rate ratio, even when the working model is arbitrarily misspecified. This is the log-linear analog of ANCOVA for linear models. Our results demonstrate one application of targeted maximum likelihood estimation.
doi:10.2202/1557-4679.1138
PMCID: PMC2898625
PMID: 20628636
misspecified model; targeted maximum likelihood; generalized linear model; Poisson regression
doi:10.1093/aje/kwm230
PMCID: PMC3052779
PMID: 21399753
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so-called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention had been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so-called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
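For a single time point with one discrete confounder W, the G-computation formula reduces to averaging the stratum-specific outcome means over the marginal distribution of W. The sketch below is a hedged illustration of that special case only; the toy counts and function names are hypothetical, and the plug-in uses empirical cell means, which requires every (treatment, stratum) cell to be non-empty.

```python
from collections import defaultdict


def expand(w, a, events, total):
    """Build (w, a, y) records for one cell from event counts."""
    return [(w, a, 1)] * events + [(w, a, 0)] * (total - events)


# Toy confounded data: treatment is more common in the W=1 stratum.
records = (expand(0, 1, 10, 20) + expand(0, 0, 20, 80)
           + expand(1, 1, 60, 80) + expand(1, 0, 10, 20))


def g_computation_mean(records, a):
    """Plug-in estimate of E[Y_a] = sum_w E[Y | A=a, W=w] * P(W=w)."""
    n = len(records)
    w_count = defaultdict(int)
    y_sum = defaultdict(float)
    cell_n = defaultdict(int)
    for w, trt, y in records:
        w_count[w] += 1
        if trt == a:
            y_sum[w] += y
            cell_n[w] += 1
    # Positivity: cell_n[w] must be > 0 for every observed stratum w.
    return sum((cnt / n) * (y_sum[w] / cell_n[w]) for w, cnt in w_count.items())


def naive_mean(records, a):
    """Unadjusted mean outcome among subjects who received treatment a."""
    ys = [y for _, trt, y in records if trt == a]
    return sum(ys) / len(ys)


g_effect = g_computation_mean(records, 1) - g_computation_mean(records, 0)
naive_effect = naive_mean(records, 1) - naive_mean(records, 0)
print(g_effect, naive_effect)
```

In this toy data the standardized difference is 0.25, while the unadjusted difference is about 0.40, because treated subjects are concentrated in the higher-risk stratum.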
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
doi:10.2202/1557-4679.1211
PMCID: PMC3126670
PMID: 20737021
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
In this article, we provide a template for the practical implementation of the targeted maximum likelihood estimator for analyzing causal effects of multiple time point interventions, for which the methodology was developed and presented in Part I. In addition, the application of this template is demonstrated in two important estimation problems: estimation of the effect of individualized treatment rules based on marginal structural models for treatment rules, and the effect of a baseline treatment on survival in a randomized clinical trial in which the time till event is subject to right censoring.
doi:10.2202/1557-4679.1241
PMCID: PMC3126672
PMID: 21731531
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; path-wise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
Summary
Regression models are often used to test for cause-effect relationships from data collected in randomized trials or experiments. This practice has deservedly come under heavy scrutiny, since commonly used models such as linear and logistic regression will often not capture the actual relationships between variables, and incorrectly specified models potentially lead to incorrect conclusions. In this paper, we focus on hypothesis tests of whether the treatment given in a randomized trial has any effect on the mean of the primary outcome, within strata of baseline variables such as age, sex, and health status. Our primary concern is ensuring that such hypothesis tests have correct Type I error for large samples. Our main result is that for a surprisingly large class of commonly used regression models, standard regression-based hypothesis tests (but using robust variance estimators) are guaranteed to have correct Type I error for large samples, even when the models are incorrectly specified. To the best of our knowledge, this robustness of such model-based hypothesis tests to incorrectly specified models was previously unknown for Poisson regression models and for other commonly used models we consider. Our results have practical implications for understanding the reliability of commonly used, model-based tests for analyzing randomized trials.
doi:10.1111/j.1541-0420.2008.01177.x
PMCID: PMC2748134
PMID: 19210739
Causal effect; Generalized linear model; Misspecified model; Randomized trial; Regression; Robust methods
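The kind of test described in the abstract above can be illustrated with a minimal sketch. It assumes a simulated two-arm trial, a deliberately misspecified linear working model, and a Huber-White (HC0) sandwich variance; the function name `ols_robust_wald`, the simulated data, and the choice of HC0 are illustrative assumptions, not the procedure or data used in the paper.

```python
import numpy as np

def ols_robust_wald(y, X, coef_index):
    """Fit OLS and return (estimate, robust SE, Wald z) for one coefficient,
    using the Huber-White (HC0) sandwich variance estimator."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum over i of resid_i^2 * x_i x_i^T
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    se = np.sqrt(cov[coef_index, coef_index])
    return beta[coef_index], se, beta[coef_index] / se

# Hypothetical simulated trial: treatment is randomized and has no effect,
# while the outcome depends on the baseline covariate nonlinearly.
rng = np.random.default_rng(0)
n = 2000
w = rng.normal(size=n)                    # baseline covariate
a = rng.integers(0, 2, size=n)            # randomized treatment indicator
y = np.exp(w) + rng.normal(size=n)        # true relationship is not linear in w
X = np.column_stack([np.ones(n), a, w])   # misspecified linear working model
est, se, z = ols_robust_wald(y, X, coef_index=1)
```

Comparing `abs(z)` with a standard normal critical value gives the model-based test of no treatment effect; the sandwich variance is what keeps the test meaningful when the working model is wrong.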
Multiple testing has become an integral component in genomic analyses involving microarray experiments where a large number of hypotheses are tested simultaneously. However, before applying more computationally intensive methods, it is often desirable to complete an initial truncation of the variable set using a simpler and faster supervised method such as univariate regression. Once such a truncation is completed, multiple testing methods applied to any subsequent analysis no longer control the appropriate Type I error rates. Here we propose a modified marginal Benjamini & Hochberg step-up FDR controlling procedure for multi-stage analyses (FDR-MSA), which correctly controls Type I error in terms of the entire variable set when only a subset of the initial set of variables is tested. The method is presented with respect to a variable importance application. As the initial subset size increases, we observe convergence to the standard Benjamini & Hochberg step-up FDR controlling multiple testing procedures. We demonstrate the power and Type I error control through simulation and application to the Golub Leukemia data from 1999.
doi:10.2202/1544-6115.1397
PMCID: PMC2861322
PMID: 19222379
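For orientation on the FDR-MSA abstract above, here is a minimal sketch of the standard Benjamini & Hochberg step-up procedure with one hedged extension. The optional `m_total` argument is an assumption made for illustration: it places the size of the initial variable set in the denominator of the step-up thresholds, which is equivalent to assigning a p-value of 1 to every variable removed in the first stage. The paper should be consulted for the exact FDR-MSA definition.

```python
import numpy as np

def bh_step_up(pvalues, alpha=0.05, m_total=None):
    """Benjamini-Hochberg step-up procedure.

    pvalues : p-values of the hypotheses that were actually tested.
    m_total : assumed size of the initial variable set (hypothetical
              parameter); defaults to the number of tested hypotheses,
              which gives the standard procedure.
    Returns a boolean array marking the rejected hypotheses.
    """
    p = np.asarray(pvalues, dtype=float)
    k = p.size
    m = k if m_total is None else int(m_total)
    if m < k:
        raise ValueError("m_total must be at least the number of tested hypotheses")
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, k + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(k, dtype=bool)
    if below.any():
        # Step-up rule: reject every hypothesis up to the largest rank
        # whose ordered p-value falls at or below its threshold.
        last = np.max(np.nonzero(below)[0])
        reject[order[: last + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
standard = bh_step_up(pvals, alpha=0.05)               # denominator is 8
adjusted = bh_step_up(pvals, alpha=0.05, m_total=16)   # 8 further variables screened out
```

When `m_total` equals the number of tested hypotheses, the function reduces to the standard procedure, which matches the convergence behaviour the abstract describes as the initial subset grows.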
The validity of standard confidence intervals constructed in survey sampling is based on the central limit theorem. For small sample sizes, the central limit theorem may give a poor approximation, resulting in confidence intervals that are misleading. We discuss this issue and propose methods for constructing confidence intervals for the population mean tailored to small sample sizes.
We present a simple approach for constructing confidence intervals for the population mean based on tail bounds for the sample mean that are correct for all sample sizes. Bernstein's inequality provides one such tail bound. The resulting confidence intervals have guaranteed coverage probability under much weaker assumptions than are required for standard methods. A drawback of this approach, as we show, is that these confidence intervals are often quite wide. In response to this, we present a method for constructing much narrower confidence intervals, which are better suited for practical applications, and that are still more robust than confidence intervals based on standard methods, when dealing with small sample sizes. We show how to extend our approaches to much more general estimation problems than estimating the population mean. We describe how these methods can be used to obtain more reliable confidence intervals in survey sampling. As a concrete example, we construct confidence intervals using our methods for the number of violent deaths between March 2003 and July 2006 in Iraq, based on data from the study “Mortality after the 2003 invasion of Iraq: A cross sectional cluster sample survey,” by Burnham et al. (2006).
doi:10.2202/1557-4679.1118
PMCID: PMC2827893
PMID: 20231867
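The tail-bound construction mentioned in the abstract above can be sketched as follows. This is a minimal illustration that assumes independent, identically distributed observations, a known bound `range_bound` on |X_i - mean|, and a known bound `variance_bound` on Var(X_i); both names and the numerical inputs are hypothetical, and the paper's own constructions (including the narrower intervals it proposes) may differ. Bernstein's inequality gives P(|sample mean - mean| >= t) <= 2 exp(-n t^2 / (2 sigma^2 + 2 M t / 3)); setting the right-hand side to alpha and solving the resulting quadratic in t gives the half-width.

```python
import math

def bernstein_half_width(n, variance_bound, range_bound, alpha=0.05):
    """Half-width t with 2*exp(-n t^2 / (2*variance_bound + 2*range_bound*t/3)) = alpha."""
    log_term = math.log(2.0 / alpha)
    b = 2.0 * range_bound * log_term / 3.0
    c = 2.0 * variance_bound * log_term
    # Positive root of the quadratic n*t^2 - b*t - c = 0
    return (b + math.sqrt(b * b + 4.0 * n * c)) / (2.0 * n)

def bernstein_tail_bound(t, n, variance_bound, range_bound):
    """Two-sided Bernstein bound on P(|sample mean - mean| >= t)."""
    return 2.0 * math.exp(-n * t * t / (2.0 * variance_bound + 2.0 * range_bound * t / 3.0))

# Hypothetical inputs: 100 observations, variance at most 0.25, deviations at most 1.
t = bernstein_half_width(n=100, variance_bound=0.25, range_bound=1.0, alpha=0.05)
normal_half_width = 1.96 * math.sqrt(0.25 / 100)  # usual CLT-based half-width, for comparison
```

With these inputs the Bernstein half-width exceeds the normal-approximation half-width, which illustrates the drawback the abstract notes: the guarantee holds for every sample size, at the price of a wider interval.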
Matched case-control study designs are commonly implemented in the field of public health. While matching is intended to eliminate confounding, the main potential benefit of matching in case-control studies is a gain in efficiency. Methods for analyzing matched case-control studies have focused on utilizing conditional logistic regression models that provide conditional and not causal estimates of the odds ratio. This article investigates the use of case-control weighted targeted maximum likelihood estimation to obtain marginal causal effects in matched case-control study designs. We compare the use of case-control weighted targeted maximum likelihood estimation in matched and unmatched designs in an effort to explore which design yields the most information about the marginal causal effect. The procedures require knowledge of certain prevalence probabilities and were previously described by van der Laan (2008). In many practical situations where a causal effect is the parameter of interest, researchers may be better served using an unmatched design.
doi:10.2202/1557-4679.1127
PMCID: PMC2827892
PMID: 20231866
Researchers of uncommon diseases are often interested in assessing potential risk factors. Given the low incidence of disease, these studies are frequently case-control in design. Such a design allows a sufficient number of cases to be obtained without extensive sampling and can increase efficiency; however, these case-control samples are then biased since the proportion of cases in the sample is not the same as the population of interest. Methods for analyzing case-control studies have focused on utilizing logistic regression models that provide conditional and not causal estimates of the odds ratio. This article will demonstrate the use of the prevalence probability and case-control weighted targeted maximum likelihood estimation (MLE), as described by van der Laan (2008), in order to obtain causal estimates of the parameters of interest (risk difference, relative risk, and odds ratio). It is meant to be used as a guide for researchers, with step-by-step directions to implement this methodology. We will also present simulation studies that show the improved efficiency of the case-control weighted targeted MLE compared to other techniques.
doi:10.2202/1557-4679.1115
PMCID: PMC2835459
PMID: 20231910
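The weighting step behind the case-control weighted estimator described above can be sketched in a few lines. This is a minimal illustration that assumes the prevalence probability `q0` is known, takes J to be the ratio of controls to cases in the sample, and uses weights of q0 for cases and (1 - q0)/J for controls; the function name and the toy data are hypothetical, and the targeted estimation step itself is not shown.

```python
import numpy as np

def case_control_weights(is_case, q0):
    """Observation weights that map a case-control sample back to the
    population scale: q0 for cases and (1 - q0) / J for controls, where
    J is the number of controls per case in the sample."""
    is_case = np.asarray(is_case, dtype=bool)
    n_cases = int(is_case.sum())
    n_controls = int((~is_case).sum())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("sample must contain both cases and controls")
    J = n_controls / n_cases
    return np.where(is_case, q0, (1.0 - q0) / J)

# Hypothetical sample: 4 cases and 8 controls from a population in which
# the disease prevalence is assumed to be 2%.
is_case = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)
exposed = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
weights = case_control_weights(is_case, q0=0.02)

# Weighted quantities refer to the population rather than the biased sample.
weighted_case_fraction = np.sum(weights * is_case) / np.sum(weights)
weighted_exposure = np.sum(weights * exposed) / np.sum(weights)
```

Under this weighting the weighted fraction of cases equals `q0` by construction, so the over-representation of cases in the sample is undone before any downstream estimator is applied.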
Many applications aim to learn a high dimensional parameter of a data generating distribution based on a sample of independent and identically distributed observations. For example, the goal might be to estimate the conditional mean of an outcome given a list of input variables. In this prediction context, bootstrap aggregating (bagging) has been introduced as a method to reduce the variance of a given estimator at little cost to bias. Bagging involves applying an estimator to multiple bootstrap samples, and averaging the result across bootstrap samples. In order to address the curse of dimensionality, a common practice has been to apply bagging to estimators which themselves use cross-validation, thereby using cross-validation within a bootstrap sample to select fine-tuning parameters trading off bias and variance of the bootstrap sample-specific candidate estimators. In this article we point out that in order to achieve the correct bias variance trade-off for the parameter of interest, one should apply the cross-validation selector externally to candidate bagged estimators indexed by these fine-tuning parameters. We use three simulations to compare the new cross-validated bagging method with bagging of cross-validated estimators and bagging of non-cross-validated estimators.
doi:10.1016/j.jmva.2007.07.004
PMCID: PMC2367370
PMID: 19255599
bootstrap aggregation; data-adaptive regression; resistant HIV; Deletion/Substitution/Addition algorithm
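The distinction drawn in the bagging abstract above, applying cross-validation externally to the bagged candidates rather than inside each bootstrap sample, can be sketched as follows. This is a minimal illustration under stated assumptions: polynomial regression stands in for the candidate estimators, the polynomial degree stands in for the fine-tuning parameter, and all function names and the simulated data are hypothetical rather than taken from the paper.

```python
import numpy as np

def bagged_poly_predict(x_train, y_train, x_new, degree, n_boot, rng):
    """Average the predictions of polynomial fits over bootstrap samples."""
    n = len(x_train)
    preds = np.zeros((n_boot, len(x_new)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
        coefs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[b] = np.polyval(coefs, x_new)
    return preds.mean(axis=0)

def cv_select_bagged(x, y, degrees, n_folds=5, n_boot=25, seed=0):
    """Choose the tuning parameter by V-fold cross-validation applied to the
    bagged estimator itself (external cross-validation)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    risks = {}
    for d in degrees:
        sq_err = 0.0
        for val_idx in folds:
            train_mask = np.ones(len(x), dtype=bool)
            train_mask[val_idx] = False
            pred = bagged_poly_predict(x[train_mask], y[train_mask],
                                       x[val_idx], d, n_boot, rng)
            sq_err += np.sum((y[val_idx] - pred) ** 2)
        risks[d] = sq_err / len(x)  # cross-validated mean squared error
    best = min(risks, key=risks.get)
    return best, risks

# Hypothetical data: a quadratic signal with a small amount of noise.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = x ** 2 + 0.1 * rng.normal(size=200)
best_degree, cv_risks = cv_select_bagged(x, y, degrees=[0, 1, 2, 3])
```

The cross-validated risk is computed for each bagged candidate as a whole, so the bias-variance trade-off is assessed for the estimator that will actually be used rather than for its bootstrap-sample-specific components.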