A sufficient cause interaction between two exposures signals the presence of individuals for whom the outcome would occur only under certain values of the two exposures. When the outcome is dichotomous and all exposures are categorical, then under certain no confounding assumptions, empirical conditions for sufficient cause interactions can be constructed based on the sign of linear contrasts of conditional outcome probabilities between differently exposed subgroups, given confounders. It is argued that logistic regression models are unsatisfactory for evaluating such contrasts, and that Bernoulli regression models with linear link are prone to misspecification. We therefore develop semiparametric tests for sufficient cause interactions under models which postulate probability contrasts in terms of a finite-dimensional parameter, but which are otherwise unspecified. Estimation is often not feasible in these models because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We therefore develop ‘multiply robust tests’ under a union model that assumes at least one of several working submodels holds. In the special case of a randomized experiment or a family-based genetic study in which the joint exposure distribution is known by design or Mendelian inheritance, the procedure leads to asymptotically distribution-free tests of the null hypothesis of no sufficient cause interaction.
Double robustness; Effect modification; Gene-environment interaction; Gene-gene interaction; Semiparametric inference; Sufficient cause; Synergism
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
The current goal of initial antiretroviral (ARV) therapy is suppression of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200 copies per milliliter. A proportion of HIV-infected patients who initiate antiretroviral therapy in clinical practice or antiretroviral clinical trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound on therapy. Frequently, these patients have sustained CD4 cell counts responses and limited or no clinical symptoms and, therefore, have potentially limited indications for altering therapy which they may be tolerating well despite increased viral replication. On the other hand, increased viral replication on therapy leads to selection of resistance mutations to the antiretroviral agents comprising their therapy and potentially cross-resistance to other agents in the same class decreasing the likelihood of response to subsequent antiretroviral therapy. The optimal time to switch antiretroviral therapy to ensure sustained virologic suppression and prevent clinical events in patients who have rebound in their HIV-1 RNA, yet are stable, is not known. Randomized clinical trials to compare early versus delayed switching have been difficult to design and more difficult to enroll. In some clinical trials, such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients randomized to initial antiretroviral treatment combinations, who fail to suppress HIV-1 RNA or have a rebound of HIV-1 RNA on therapy are allowed to switch from the initial ARV regimen to a new regimen, based on clinician and patient decisions. We delineate a statistical framework to estimate the effect of early versus late regimen change using data from ACTG A5095 in the context of two-stage designs.
In causal inference, a large class of doubly robust estimators are derived through semiparametric theory with applications to missing data problems. This class of estimators is motivated through geometric arguments and relies on large samples for good performance. By now, several authors have noted that a doubly robust estimator may be suboptimal when the outcome model is misspecified even if it is semiparametric efficient when the outcome regression model is correctly specified. Through auxiliary variables, two-stage designs, and within the contextual backdrop of our scientific problem and clinical study, we propose improved doubly robust, locally efficient estimators of a population mean and average causal effect for early versus delayed switching to second-line ARV treatment regimens. Our analysis of the ACTG A5095 data further demonstrates how methods that use auxiliary variables can improve over methods that ignore them. Using the methods developed here, we conclude that patients who switch within 8 weeks of virologic failure have better clinical outcomes, on average, than patients who delay switching to a new second-line ARV regimen after failing on the initial regimen. Ordinary statistical methods fail to find such differences. This article has online supplementary material.
Causal inference; Double robustness; Longitudinal data analysis; Missing data; Rubin causal model; Semiparametric efficient estimation
In this paper, we are concerned with how to select significant variables in semiparametric modeling. Variable selection for semiparametric regression models consists of two components: model selection for nonparametric components and select significant variables for parametric portion. Thus, it is much more challenging than that for parametric models such as linear models and generalized linear models because traditional variable selection procedures including stepwise regression and the best subset selection require model selection to nonparametric components for each submodel. This leads to very heavy computational burden. In this paper, we propose a class of variable selection procedures for semiparametric regression models using nonconcave penalized likelihood. The newly proposed procedures are distinguished from the traditional ones in that they delete insignificant variables and estimate the coefficients of significant variables simultaneously. This allows us to establish the sampling properties of the resulting estimate. We first establish the rate of convergence of the resulting estimate. With proper choices of penalty functions and regularization parameters, we then establish the asymptotic normality of the resulting estimate, and further demonstrate that the proposed procedures perform as well as an oracle procedure. Semiparametric generalized likelihood ratio test is proposed to select significant variables in the nonparametric component. We investigate the asymptotic behavior of the proposed test and demonstrate its limiting null distribution follows a chi-squared distribution, which is independent of the nuisance parameters. Extensive Monte Carlo simulation studies are conducted to examine the finite sample performance of the proposed variable selection procedures.
Nonconcave penalized likelihood; SCAD; efficient score; local linear regression; partially linear model; varying coefficient models
There is an active debate in the literature on censored data about the relative performance of model based maximum likelihood estimators, IPCW-estimators, and a variety of double robust semiparametric efficient estimators. Kang and Schafer (2007) demonstrate the fragility of double robust and IPCW-estimators in a simulation study with positivity violations. They focus on a simple missing data problem with covariates where one desires to estimate the mean of an outcome that is subject to missingness. Responses by Robins, et al. (2007), Tsiatis and Davidian (2007), Tan (2007) and Ridgeway and McCaffrey (2007) further explore the challenges faced by double robust estimators and offer suggestions for improving their stability. In this article, we join the debate by presenting targeted maximum likelihood estimators (TMLEs). We demonstrate that TMLEs that guarantee that the parametric submodel employed by the TMLE procedure respects the global bounds on the continuous outcomes, are especially suitable for dealing with positivity violations because in addition to being double robust and semiparametric efficient, they are substitution estimators. We demonstrate the practical performance of TMLEs relative to other estimators in the simulations designed by Kang and Schafer (2007) and in modified simulations with even greater estimation challenges.
censored data; collaborative double robustness; collaborative targeted maximum likelihood estimation; double robust; estimator selection; inverse probability of censoring weighting; locally efficient estimation; maximum likelihood estimation; semiparametric model; targeted maximum likelihood estimation; targeted minimum loss based estimation; targeted nuisance parameter estimator selection
For nonnegative measurements such as income or sick days, zero counts often have special status. Furthermore, the incidence of zero counts is often greater than expected for the Poisson model. This article considers a doubly semiparametric zero-inflated Poisson model to fit data of this type, which assumes two partially linear link functions in both the mean of the Poisson component and the probability of zero. We study a sieve maximum likelihood estimator for both the regression parameters and the nonparametric functions. We show, under routine conditions, that the estimators are strongly consistent. Moreover, the parameter estimators are asymptotically normal and first-order efficient, while the nonparametric components achieve the optimal convergence rates. Simulation studies suggest that the extra flexibility inherent from the doubly semiparametric model is gained with little loss in statistical efficiency. We also illustrate our approach with a dataset from a public health study.
Asymptotic efficiency; Partly linear model; Sieve maximum likelihood estimator; Zero-inflated Poisson model
A primary focus of an increasing number of scientific studies is to determine whether two exposures interact in the effect that they produce on an outcome of interest. Interaction is commonly assessed by fitting regression models in which the linear predictor includes the product between those exposures. When the main interest lies in the interaction, this approach is not entirely satisfactory because it is prone to (possibly severe) bias when the main exposure effects or the association between outcome and extraneous factors are misspecified. In this article, we therefore consider conditional mean models with identity or log link which postulate the statistical interaction in terms of a finite-dimensional parameter, but which are otherwise unspecified. We show that estimation of the interaction parameter is often not feasible in this model because it would require nonparametric estimation of auxiliary conditional expectations given high-dimensional variables. We thus consider ‘multiply robust estimation’ under a union model that assumes at least one of several working submodels holds. Our approach is novel in that it makes use of information on the joint distribution of the exposures conditional on the extraneous factors in making inferences about the interaction parameter of interest. In the special case of a randomized trial or a family-based genetic study in which the joint exposure distribution is known by design or by Mendelian inheritance, the resulting multiply robust procedure leads to asymptotically distribution-free tests of the null hypothesis of no interaction on an additive scale. We illustrate the methods via simulation and the analysis of a randomized follow-up study.
Double robustness; Gene-environment interaction; Gene-gene interaction; Longitudinal data; Semiparametric inference
Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.
Doubly robust; Missing at random; Multiple imputation; Nearest neighbor; Nonparametric imputation; Sensitivity analysis
We consider a class of semiparametric normal transformation models for right censored bivariate failure times. Nonparametric hazard rate models are transformed to a standard normal model and a joint normal distribution is assumed for the bivariate vector of transformed variates. A semiparametric maximum likelihood estimation procedure is developed for estimating the marginal survival distribution and the pairwise correlation parameters. This produces an efficient estimator of the correlation parameter of the semiparametric normal transformation model, which characterizes the bivariate dependence of bivariate survival outcomes. In addition, a simple positive-mass-redistribution algorithm can be used to implement the estimation procedures. Since the likelihood function involves infinite-dimensional parameters, the empirical process theory is utilized to study the asymptotic properties of the proposed estimators, which are shown to be consistent, asymptotically normal and semiparametric efficient. A simple estimator for the variance of the estimates is also derived. The finite sample performance is evaluated via extensive simulations.
Asymptotic normality; Bivariate failure time; Consistency; Semiparametric efficiency; Semiparametric maximum likelihood estimate; Semiparametric normal transformation
Regression methods for survival data with right censoring have been extensively studied under semiparametric transformation models  such as the Cox regression model  and the proportional odds model . However, their practical application could be limited due to possible violation of model assumption or lack of ready interpretation for the regression coefficients in some cases. As an alternative, in this paper, the proportional likelihood ratio model introduced by Luo and Tsai  is extended to flexibly model the relationship between survival outcome and covariates. This model has a natural connection with many important semiparametric models such as generalized linear model and density ratio model, and is closely related to biased sampling problems. Compared with the semiparametric transformation model, the proportional likelihood ratio model is appealing and practical in many ways because of its model flexibility and quite direct clinical interpretation. We present two likelihood approaches for the estimation and inference on the target regression parameters under independent and dependent censoring assumptions. Based on a conditional likelihood approach using uncensored failure times, a numerically simple estimation procedure is developed by maximizing a pairwise pseudo-likelihood . We also develop a full likelihood approach and the most efficient maximum likelihood estimator is obtained by a profile likelihood. Simulation studies are conducted to assess the finite-sample properties of the proposed estimators and compare the efficiency of the two likelihood approaches. An application to survival data for bone marrow transplantation patients of acute leukemia is provided to illustrate the proposed method and other approaches for handling non-proportionality. The relative merits of these methods are discussed in concluding remarks.
conditional likelihood; pairwise pseudo-likelihood; profile likelihood; proportional likelihood ratio model; right-censored data
Most statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously where the gene sets are derived from prior biological knowledge. We call a pathway as a predefined set of genes that serve a particular cellular or physiological function. Limited work has been done in the regression settings to study the effects of clinical covariates and expression levels of genes in a pathway on a continuous clinical outcome. A semiparametric regression approach for identifying pathways related to a continuous outcome was proposed by Liu et al. (2007), who demonstrated the connection between a least squares kernel machine for nonparametric pathway effect and a restricted maximum likelihood (REML) for variance components. However, the asymptotic properties on a semiparametric regression for identifying pathway have never been studied. In this paper, we study the asymptotic properties of the parameter estimates on semiparametric regression and compare Liu et al.’s REML with our REML obtained from a profile likelihood. We prove that both approaches provide consistent estimators, have
n convergence rate under regularity conditions, and have either an asymptotically normal distribution or a mixture of normal distributions. However, the estimators based on our REML obtained from a profile likelihood have a theoretically smaller mean squared error than those of Liu et al.’s REML. Simulation study supports this theoretical result. A profile restricted likelihood ratio test is also provided for the non-standard testing problem. We apply our approach to a type II diabetes data set (Mootha et al., 2003).
Gaussian random process; Kernel machine; Mixed model; Pathway analysis; Profile likelihood; Restricted maximum likelihood
In social and health sciences, many research questions involve understanding the causal effect of a longitudinal treatment on mortality (or time-to-event outcomes in general). Often, treatment status may change in response to past covariates that are risk factors for mortality, and in turn, treatment status may also affect such subsequent covariates. In these situations, Marginal Structural Models (MSMs), introduced by Robins (1997), are well-established and widely used tools to account for time-varying confounding. In particular, a MSM can be used to specify the intervention-specific counterfactual hazard function, i.e. the hazard for the outcome of a subject in an ideal experiment where he/she was assigned to follow a given intervention on their treatment variables. The parameters of this hazard MSM are traditionally estimated using the Inverse Probability Weighted estimation (IPTW, van der Laan and Petersen (2007), Robins et al. (2000b), Robins (1999), Robins et al. (2008)). This estimator is easy to implement and admits Wald-type confidence intervals. However, its consistency hinges on the correct specification of the treatment allocation probabilities, and the estimates are generally sensitive to large treatment weights (especially in the presence of strong confounding), which are difficult to stabilize for dynamic treatment regimes. In this paper, we present a pooled targeted maximum likelihood estimator (TMLE, van der Laan and Rubin (2006)) for MSM for the hazard function under longitudinal dynamic treatment regimes. The proposed estimator is semiparametric efficient and doubly robust, hence offers bias reduction and efficiency gain over the incumbent IPTW estimator. Moreover, the substitution principle rooted in the TMLE potentially mitigates the sensitivity to large treatment weights in IPTW. We compare the performance of the proposed estimator with the IPTW and a non-targeted substitution estimator in a simulation study.
The inverse of the nonparametric information operator is key to finding doubly robust estimators and the semiparametric efficient estimator in missing data problems. It is known that no closed-form expression for the inverse of the nonparametric information operator exists when missing data form nonmonotone patterns. Neumann series is usually applied to approximate the inverse. However, Neumann series approximation is only known to converge in L2 norm, which is not sufficient for establishing statistical properties of the estimators yielded from the approximation. In this article, we show that L∞ convergence of the Neumann series approximations to the inverse of the non-parametric information operator and to the efficient scores in missing data problems can be obtained under very simple conditions. This paves the way to the study of the asymptotic properties of the doubly robust estimators and the locally semiparametric efficient estimator in those difficult situations.
Auxiliary information; Induction; Rate of convergence; Weighted estimating equation
We study the regression relationship among covariates in case-control data, an area known as the secondary analysis of case-control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has either (a) specified a fully parametric distribution for the regression errors, (b) specified a homoscedastic distribution for the regression errors, (c) has specified the rate of disease in the population (we refer this as true population), or (d) has made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric ones in that they draw conclusions about the true population, while technically operating in a hypothetic superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, while all other nonparametric effects are estimated despite of the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies, as well as through an empirical example on the relation between red meat consumption and heterocyclic amines. Our analysis verified the positive relationship between red meat consumption and two forms of HCA, indicating that increased red meat consumption leads to increased levels of MeIQA and PhiP, both being risk factors for colorectal cancer. Computer software as well as data to illustrate the methodology are available at http://wileyonlinelibrary.com/journal/rss-datasets.
Biased samples; Case-control study; Heteroscedastic regression; Secondary analysis; Semiparametric estimation
We consider causal inference in randomized survival studies with right censored outcomes and all-or-nothing compliance, using semiparametric transformation models to estimate the distribution of survival times in treatment and control groups, conditional on covariates and latent compliance type. Estimands depending on these distributions, for example, the complier average causal effect (CACE), the complier effect on survival beyond time t, and the complier quantile effect are then considered. Maximum likelihood is used to estimate the parameters of the transformation models, using a specially designed expectation-maximization (EM) algorithm to overcome the computational difficulties created by the mixture structure of the problem and the infinite dimensional parameter in the transformation models. The estimators are shown to be consistent, asymptotically normal, and semiparametrically efficient. Inferential procedures for the causal parameters are developed. A simulation study is conducted to evaluate the finite sample performance of the estimated causal parameters. We also apply our methodology to a randomized study conducted by the Health Insurance Plan of Greater New York to assess the reduction in breast cancer mortality due to screening.
All-or-nothing compliance; Complier average causal effect; Instrumental variable; Randomized trials; Survival analysis; Semiparametric transformation models
Researchers often seek robust inference for a parameter through semiparametric estimation. Efficient semiparametric estimation currently requires theoretical derivation of the efficient influence function (EIF), which can be a challenging and time-consuming task. If this task can be computerized, it can save dramatic human effort, which can be transferred, for example, to the design of new studies. Although the EIF is, in principle, a derivative, simple numerical differentiation to calculate the EIF by a computer masks the EIF’s functional dependence on the parameter of interest. For this reason, the standard approach to obtaining the EIF relies on the theoretical construction of the space of scores under all possible parametric submodels. This process currently depends on the correctness of conjectures about these spaces, and the correct verification of such conjectures. The correct guessing of such conjectures, though successful in some problems, is a nondeductive process, i.e., is not guaranteed to succeed (e.g., is not computerizable), and the verification of conjectures is generally susceptible to mistakes. We propose a method that can deductively produce semiparametric locally efficient estimators. The proposed method is computerizable, meaning that it does not need either conjecturing, or otherwise theoretically deriving the functional form of the EIF, and is guaranteed to produce the desired estimates even for complex parameters. The method is demonstrated through an example.
Compatibility; Deductive procedure; Gateaux derivative; Influence function; Semiparametric estimation; Turing machine
Rationale and Objectives
Semiparametric methods provide smooth and continuous receiver operating characteristic (ROC) curve fits to ordinal test results and require only that the data follow some unknown monotonic transformation of the model's assumed distributions. The quantitative relationship between cutoff settings or individual test-result values on the data scale and points on the estimated ROC curve is lost in this procedure, however. To recover that relationship in a principled way, we propose a new algorithm for “proper” ROC curves and illustrate it by use of the proper binormal model.
Materials and Methods
Several authors have proposed the use of multinomial distributions to fit semiparametric ROC curves by maximum-likelihood estimation. The resulting approach requires nuisance parameters that specify interval probabilities associated with the data, which are used subsequently as a basis for estimating values of the curve parameters of primary interest. In the method described here, we employ those “nuisance” parameters to recover the relationship between any ordinal test-result scale and true-positive fraction, false-positive fraction, and likelihood ratio. Computer simulations based on the proper binormal model were used to evaluate our approach in estimating those relationships and to assess the coverage of its confidence intervals for realistically sized datasets.
In our simulations, the method reliably estimated simple relationships between test-result values and the several ROC quantities.
The proposed approach provides an effective and reliable semiparametric method with which to estimate the relationship between cutoff settings or individual test-result values and corresponding points on the ROC curve.
Receiver operating characteristic (ROC) analysis; proper binormal model; likelihood ratio; test-result scale; maximum likelihood estimation (MLE)
Due to the need to evaluate the effectiveness of community-based programs in practice, there is substantial interest in methods to estimate the causal effects of community-level treatments or exposures on individual level outcomes. The challenge one is confronted with is that different communities have different environmental factors affecting the individual outcomes, and all individuals in a community share the same environment and intervention. In practice, data are often available from only a small number of communities, making it difficult if not impossible to adjust for these environmental confounders. In this paper we consider an extreme version of this dilemma, in which two communities each receives a different level of the intervention, and covariates and outcomes are measured on a random sample of independent individuals from each of the two populations; the results presented can be straightforwardly generalized to settings in which more than two communities are sampled. We address the question of what conditions are needed to estimate the causal effect of the intervention, defined in terms of an ideal experiment in which the exposed level of the intervention is assigned to both communities and individual outcomes are measured in the combined population, and then the clock is turned back and a control level of the intervention is assigned to both communities and individual outcomes are measured in the combined population. We refer to the difference in the expectation of these outcomes as the marginal (overall) treatment effect. We also discuss conditions needed for estimation of the treatment effect on the treated community. We apply a nonparametric structural equation model to define these causal effects and to establish conditions under which they are identified. These identifiability conditions provide guidance for the design of studies to investigate community level causal effects and for assessing the validity of causal interpretations when data are only available from a few communities. When the identifiability conditions fail to hold, the proposed statistical parameters still provide nonparametric treatment effect measures (albeit non-causal) whose statistical interpretations do not depend on model specifications. In addition, we study the use of a matched cohort sampling design in which the units of different communities are matched on individual factors. Finally, we provide semiparametric efficient and doubly robust targeted MLE estimators of the community level causal effect based on i.i.d. sampling and matched cohort sampling.
causal effect; causal effect among the treated; community-based intervention; efficient influence curve; environmental confounding
Semiparametric linear transformation models have received much attention due to its high flexibility in modeling survival data. A useful estimating equation procedure was recently proposed by Chen et al. (2002) for linear transformation models to jointly estimate parametric and nonparametric terms. They showed that this procedure can yield a consistent and robust estimator. However, the problem of variable selection for linear transformation models is less studied, partially because a convenient loss function is not readily available under this context. In this paper, we propose a simple yet powerful approach to achieve both sparse and consistent estimation for linear transformation models. The main idea is to derive a profiled score from the estimating equation of Chen et al. (2002), construct a loss function based on the profile scored and its variance, and then minimize the loss subject to some shrinkage penalty. Under regularity conditions, we have shown that the resulting estimator is consistent for both model estimation and variable selection. Furthermore, the estimated parametric terms are asymptotically normal and can achieve higher efficiency than that yielded from the estimation equations. For computation, we suggest a one-step approximation algorithm which can take advantage of the LARS and build the entire solution path efficiently. Performance of the new procedure is illustrated through numerous simulations and real examples including one microarray data.
Censored survival data; Linear transformation models; LARS; Shrinkage; Variable selection
We consider tests of hypotheses when the parameters are not identifiable under the null in semiparametric models, where regularity conditions for profile likelihood theory fail. Exponential average tests based on integrated profile likelihood are constructed and shown to be asymptotically optimal under a weighted average power criterion with respect to a prior on the nonidentifiable aspect of the model. These results extend existing results for parametric models, which involve more restrictive assumptions on the form of the alternative than do our results. Moreover, the proposed tests accommodate models with infinite dimensional nuisance parameters which either may not be identifiable or may not be estimable at the usual parametric rate. Examples include tests of the presence of a change-point in the Cox model with current status data and tests of regression parameters in odds-rate models with right censored data. Optimal tests have not previously been studied for these scenarios. We study the asymptotic distribution of the proposed tests under the null, fixed contiguous alternatives and random contiguous alternatives. We also propose a weighted bootstrap procedure for computing the critical values of the test statistics. The optimal tests perform well in simulation studies, where they may exhibit improved power over alternative tests.
Change-point models; contiguous alternative; empirical processes; exponential average test; nonstandard testing problem; odds-rate models; optimal test; power; profile likelihood
We study a class of semiparametric skewed distributions arising when the sample selection process produces non-randomly sampled observations. Based on semiparametric theory and taking into account the symmetric nature of the population distribution, we propose both consistent estimators, i.e. robust to model mis-specification, and efficient estimators, i.e. reaching the minimum possible estimation variance, of the location of the symmetric population. We demonstrate the theoretical properties of our estimators through asymptotic analysis and assess their finite sample performance through simulations. We also implement our methodology on a real data example of ambulatory expenditures to illustrate the applicability of the estimators in practice.
robustness; selection bias; semiparametric model; skewness; skew-symmetric distribution
In longitudinal and repeated measures data analysis, often the goal is to determine the effect of a treatment or aspect on a particular outcome (e.g., disease progression). We consider a semiparametric repeated measures regression model, where the parametric component models effect of the variable of interest and any modification by other covariates. The expectation of this parametric component over the other covariates is a measure of variable importance. Here, we present a targeted maximum likelihood estimator of the finite dimensional regression parameter, which is easily estimated using standard software for generalized estimating equations.
The targeted maximum likelihood method provides double robust and locally efficient estimates of the variable importance parameters and inference based on the influence curve. We demonstrate these properties through simulation under correct and incorrect model specification, and apply our method in practice to estimating the activity of transcription factor (TF) over cell cycle in yeast. We specifically target the importance of SWI4, SWI6, MBP1, MCM1, ACE2, FKH2, NDD1, and SWI5.
The semiparametric model allows us to determine the importance of a TF at specific time points by specifying time indicators as potential effect modifiers of the TF. Our results are promising, showing significant importance trends during the expected time periods. This methodology can also be used as a variable importance analysis tool to assess the effect of a large number of variables such as gene expressions or single nucleotide polymorphisms.
targeted maximum likelihood; semiparametric; repeated measures; longitudinal; transcription factors
This paper investigates the appropriateness of the integration of flexible
propensity score modeling (nonparametric or machine learning approaches) in semiparametric
models for the estimation of a causal quantity, such as the mean outcome under treatment.
We begin with an overview of some of the issues involved in knowledge-based and
statistical variable selection in causal inference and the potential pitfalls of automated
selection based on the fit of the propensity score. Using a simple example, we directly
show the consequences of adjusting for pure causes of the exposure when using inverse
probability of treatment weighting (IPTW). Such variables are likely to be selected when
using a naive approach to model selection for the propensity score. We describe how the
method of Collaborative Targeted minimum loss-based estimation (C-TMLE; van der Laan and Gruber, 2010) capitalizes on the
collaborative double robustness property of semiparametric efficient estimators to select
covariates for the propensity score based on the error in the conditional outcome model.
Finally, we compare several approaches to automated variable selection in low-and
high-dimensional settings through a simulation study. From this simulation study, we
conclude that using IPTW with flexible prediction for the propensity score can result in
inferior estimation, while Targeted minimum loss-based estimation and C-TMLE may benefit
from flexible prediction and remain robust to the presence of variables that are highly
correlated with treatment. However, in our study, standard influence function-based
methods for the variance underestimated the standard errors, resulting in poor coverage
under certain data-generating scenarios.
C-TMLE; IPTW; variable reduction
Most randomized efficacy trials of interventions to prevent HIV or other infectious diseases have assessed intervention efficacy by a method that either does not incorporate baseline covariates, or that incorporates them in a non-robust or inefficient way. Yet, it has long been known that randomized treatment effects can be assessed with greater efficiency by incorporating baseline covariates that predict the response variable. Tsiatis et al. (2007) and Zhang et al. (2008) advocated a semiparametric efficient approach, based on the theory of Robins et al. (1994), for consistently estimating randomized treatment effects that optimally incorporates predictive baseline covariates, without any parametric assumptions. They stressed the objectivity of the approach, which is achieved by separating the modeling of baseline predictors from the estimation of the treatment effect. While their work adequately justifies implementation of the method for large Phase 3 trials (because its optimality is in terms of asymptotic properties), its performance for intermediate-sized screening Phase 2b efficacy trials, which are increasing in frequency, is unknown. Furthermore, the past work did not consider a right-censored time-to-event endpoint, which is the usual primary endpoint for a prevention trial. For Phase 2b HIV vaccine efficacy trials, we study finite-sample performance of Zhang et al.'s (2008) method for a dichotomous endpoint, and develop and study an adaptation of this method to a discrete right-censored time-to-event endpoint. We show that, given the predictive capacity of baseline covariates collected in real HIV prevention trials, the methods achieve 5-15% gains in efficiency compared to methods in current use. We apply the methods to the first HIV vaccine efficacy trial. This work supports implementation of the discrete failure time method for prevention trials.
Auxiliary; Covariate Adjustment; Intermediate-sized Phase 2b Efficacy Trial; Semiparametric Efficiency
We consider estimation, from a double-blind randomized trial, of treatment effect within levels of base-line covariates on an outcome that is measured after a post-treatment event E has occurred in the subpopulation 𝒫E,E that would experience event E regardless of treatment. Specifically, we consider estimation of the parameters γ indexing models for the outcome mean conditional on treatment and base-line covariates in the subpopulation 𝒫E,E. Such parameters are not identified from randomized trial data but become identified if additionally it is assumed that the subpopulation 𝒫Ē,E of subjects that would experience event E under the second treatment but not under the first is empty and a parametric model for the conditional probability that a subject experiences event E if assigned to the first treatment given that the subject would experience the event if assigned to the second treatment, his or her outcome under the second treatment and his or her pretreatment covariates. We develop a class of estimating equations whose solutions comprise, up to asymptotic equivalence, all consistent and asymptotically normal estimators of γ under these two assumptions. In addition, we derive a locally semiparametric efficient estimator of γ. We apply our methods to estimate the effect on mean viral load of vaccine versus placebo after infection with human immunodeficiency virus (the event E) in a placebo-controlled randomized acquired immune deficiency syndrome vaccine trial.
Counterfactuals; Missing data; Potential outcomes; Principal stratification; Structural model; Vaccine trials