In this paper, we investigate several well known approaches for missing
data and their relationships for the parametric probability regression model
when outcome of interest Y is subject to missingness. We
explore the relationships between the mean score method, the inverse probability
weighting (IPW) method and the augmented inverse probability weighted (AIPW)
method with some interesting findings. The asymptotic distributions of the IPW
and AIPW estimators are derived and their efficiencies are compared. Our
analysis details how efficiency may be gained from the AIPW estimator over the
IPW estimator through estimation of validation probability and augmentation. We
show that the AIPW estimator that is based on augmentation using the full set of
observed variables is more efficient than the AIPW estimator that is based on
augmentation using a subset of observed variables. The developed approaches are
applied to Poisson regression model with missing outcomes based on auxiliary
outcomes and a validated sample for true outcomes. We show that, by stratifying
based on a set of discrete variables, the proposed statistical procedure can be
formulated to analyze automated records that only contain summarized information
at categorical levels. The proposed methods are applied to analyze influenza
vaccine efficacy for an influenza vaccine study conducted in Temple-Belton,
Texas during the 2000-2001 influenza season.
Augmented inverse probability weighted estimator; Asymptotic results; Automated records; Auxiliary outcome; Efficiency; Inverse probability weighted Estimator; Mean score estimation; Recurrent events; Vaccine efficacy; Validation sample
Motivated by aging research, we propose an estimator of the effect of a time-varying exposure on an outcome in longitudinal studies with dropout and truncation by death. We use an inverse-probability weighted (IPW) estimator to derive a doubly robust augmented inverse-probability weighted (AIPW) estimator. IPW estimation involves weights for the exposure mechanism, dropout, and mortality; AIPW estimation additionally involves estimating data-generating models via regression. We demonstrate that the estimators identify a causal contrast that is a function of principal strata effects under a set of assumptions. Simulations show that AIPW estimation is unbiased when weights or outcome regressions are correct, and that AIPW estimation is more efficient than IPW estimation when all models are correct. We apply the method to a study of vitamin D and gait speed among older adults.
Causal inference; Longitudinal data analysis; Missing data; Observational studies
This work presents methods for estimating genotype-specific distributions from genetic epidemiology studies where the event times are subject to right censoring, the genotypes are not directly observed, and the data arise from a mixture of scientifically meaningful subpopulations. Examples of such studies include kin-cohort studies and quantitative trait locus (QTL) studies. Current methods for analyzing censored mixture data include two types of nonparametric maximum likelihood estimators (NPMLEs) which do not make parametric assumptions on the genotype-specific density functions. Although both NPMLEs are commonly used, we show that one is inefficient and the other inconsistent. To overcome these deficiencies, we propose three classes of consistent nonparametric estimators which do not assume parametric density models and are easy to implement. They are based on the inverse probability weighting (IPW), augmented IPW (AIPW), and nonparametric imputation (IMP). The AIPW achieves the efficiency bound without additional modeling assumptions. Extensive simulation experiments demonstrate satisfactory performance of these estimators even when the data are heavily censored. We apply these estimators to the Cooperative Huntington’s Observational Research Trial (COHORT), and provide age-specific estimates of the effect of mutation in the Huntington gene on mortality using a sample of family members. The close approximation of the estimated non-carrier survival rates to that of the U.S. population indicates small ascertainment bias in the COHORT family sample. Our analyses underscore an elevated risk of death in Huntington gene mutation carriers compared to non-carriers for a wide age range, and suggest that the mutation equally affects survival rates in both genders. The estimated survival rates are useful in genetic counseling for providing guidelines on interpreting the risk of death associated with a positive genetic testing, and in facilitating future subjects at risk to make informed decisions on whether to undergo genetic mutation testings.
Censored data; Finite mixture model; Huntington’s disease; Kin-cohort design; Quantitative trait locus
Missing data are common in medical and social science studies and often pose a serious challenge in data analysis. Multiple imputation methods are popular and natural tools for handling missing data, replacing each missing value with a set of plausible values that represent the uncertainty about the underlying values. We consider a case of missing at random (MAR) and investigate the estimation of the marginal mean of an outcome variable in the presence of missing values when a set of fully observed covariates is available. We propose a new nonparametric multiple imputation (MI) approach that uses two working models to achieve dimension reduction and define the imputing sets for the missing observations. Compared with existing nonparametric imputation procedures, our approach can better handle covariates of high dimension, and is doubly robust in the sense that the resulting estimator remains consistent if either of the working models is correctly specified. Compared with existing doubly robust methods, our nonparametric MI approach is more robust to the misspecification of both working models; it also avoids the use of inverse-weighting and hence is less sensitive to missing probabilities that are close to 1. We propose a sensitivity analysis for evaluating the validity of the working models, allowing investigators to choose the optimal weights so that the resulting estimator relies either completely or more heavily on the working model that is likely to be correctly specified and achieves improved efficiency. We investigate the asymptotic properties of the proposed estimator, and perform simulation studies to show that the proposed method compares favorably with some existing methods in finite samples. The proposed method is further illustrated using data from a colorectal adenoma study.
Doubly robust; Missing at random; Multiple imputation; Nearest neighbor; Nonparametric imputation; Sensitivity analysis
In this article, we study the estimation of mean response and regression coefficient in semiparametric regression problems when response variable is subject to nonrandom missingness. When the missingness is independent of the response conditional on high-dimensional auxiliary information, the parametric approach may misspecify the relationship between covariates and response while the nonparametric approach is infeasible because of the curse of dimensionality. To overcome this, we study a model-based approach to condense the auxiliary information and estimate the parameters of interest nonparametrically on the condensed covariate space. Our estimators possess the double robustness property, i.e., they are consistent whenever the model for the response given auxiliary covariates or the model for the missingness given auxiliary covariate is correct. We conduct a number of simulations to compare the numerical performance between our estimators and other existing estimators in the current missing data literature, including the propensity score approach and the inverse probability weighted estimating equation. A set of real data is used to illustrate our approach.
Auxiliary covariate; High-dimensional data; Kernel estimation; Missing at random; Semiparametric regression
Longitudinal studies often feature incomplete response and covariate data. Likelihood-based methods such as the expectation–maximization algorithm give consistent estimators for model parameters when data are missing at random (MAR) provided that the response model and the missing covariate model are correctly specified; however, we do not need to specify the missing data mechanism. An alternative method is the weighted estimating equation, which gives consistent estimators if the missing data and response models are correctly specified; however, we do not need to specify the distribution of the covariates that have missing values. In this article, we develop a doubly robust estimation method for longitudinal data with missing response and missing covariate when data are MAR. This method is appealing in that it can provide consistent estimators if either the missing data model or the missing covariate model is correctly specified. Simulation studies demonstrate that this method performs well in a variety of situations.
Doubly robust; Estimating equation; Missing at random; Missing covariate; Missing response
Missing data are common in longitudinal studies due to drop-out, loss to follow-up, and death. Likelihood-based mixed effects models for longitudinal data give valid estimates when the data are ignorably missing; that is, the parameters for the missing data process are distinct from those of the main model for the outcome, and the data are missing at random (MAR). These assumptions, however, are not testable without further information. In some studies, there is additional information available in the form of an auxiliary variable known to be correlated with the missing outcome of interest. Availability of such auxiliary information provides us with an opportunity to test the MAR assumption. If the MAR assumption is violated, such information can be utilized to reduce or eliminate bias when the missing data process depends on the unobserved outcome through the auxiliary information. We compare two methods of utilizing the auxiliary information: joint modeling of the outcome of interest and the auxiliary variable, and multiple imputation (MI). Simulation studies are performed to examine the two methods. The likelihood-based joint modeling approach is consistent and most efficient when correctly specified. However, mis-specification of the joint distribution can lead to biased results. MI is slightly less efficient than a correct joint modeling approach but more robust to model mis-specification when all the variables affecting the missing data mechanism and the missing outcome are included in the imputation model. An example is presented from a dementia screening study.
auxiliary variable MAR (A-MAR); joint modeling; linear mixed effects model; missing data; MNAR; multiple imputation (MI)
We consider a semiparametric regression model that relates a normal outcome to covariates and a genetic pathway, where the covariate effects are modeled parametrically and the pathway effect of multiple gene expressions is modeled parametrically or nonparametrically using least-squares kernel machines (LSKMs). This unified framework allows a flexible function for the joint effect of multiple genes within a pathway by specifying a kernel function and allows for the possibility that each gene expression effect might be nonlinear and the genes within the same pathway are likely to interact with each other in a complicated way. This semiparametric model also makes it possible to test for the overall genetic pathway effect. We show that the LSKM semiparametric regression can be formulated using a linear mixed model. Estimation and inference hence can proceed within the linear mixed model framework using standard mixed model software. Both the regression coefficients of the covariate effects and the LSKM estimator of the genetic pathway effect can be obtained using the best linear unbiased predictor in the corresponding linear mixed model formulation. The smoothing parameter and the kernel parameter can be estimated as variance components using restricted maximum likelihood. A score test is developed to test for the genetic pathway effect. Model/variable selection within the LSKM framework is discussed. The methods are illustrated using a prostate cancer data set and evaluated using simulations.
BLUPs; Kernel function; Model/variable selection; Nonparametric regression; Penalized likelihood; REML; Score test; Smoothing parameter; Support vector machines
This work focuses on the estimation of distribution functions with incomplete data, where the variable of interest Y has ignorable missingness but the covariate X is always observed. When X is high dimensional, parametric approaches to incorporate X — information is encumbered by the risk of model misspecification and nonparametric approaches by the curse of dimensionality. We propose a semiparametric approach, which is developed under a nonparametric kernel regression framework, but with a parametric working index to condense the high dimensional X — information for reduced dimension. This kernel dimension reduction estimator has double robustness to model misspecification and is most efficient if the working index adequately conveys the X — information about the distribution of Y. Numerical studies indicate better performance of the semiparametric estimator over its parametric and nonparametric counterparts. We apply the kernel dimension reduction estimation to an HIV study for the effect of antiretroviral therapy on HIV virologic suppression.
curse of dimensionality; dimension reduction; distribution function; ignorable missingness; kernel regression; quantile
Growing interest on biological pathways has called for new statistical methods for modeling and testing a genetic pathway effect on a health outcome. The fact that genes within a pathway tend to interact with each other and relate to the outcome in a complicated way makes nonparametric methods more desirable. The kernel machine method provides a convenient, powerful and unified method for multi-dimensional parametric and nonparametric modeling of the pathway effect.
In this paper we propose a logistic kernel machine regression model for binary outcomes. This model relates the disease risk to covariates parametrically, and to genes within a genetic pathway parametrically or nonparametrically using kernel machines. The nonparametric genetic pathway effect allows for possible interactions among the genes within the same pathway and a complicated relationship of the genetic pathway and the outcome. We show that kernel machine estimation of the model components can be formulated using a logistic mixed model. Estimation hence can proceed within a mixed model framework using standard statistical software. A score test based on a Gaussian process approximation is developed to test for the genetic pathway effect. The methods are illustrated using a prostate cancer data set and evaluated using simulations. An extension to continuous and discrete outcomes using generalized kernel machine models and its connection with generalized linear mixed models is discussed.
Logistic kernel machine regression and its extension generalized kernel machine regression provide a novel and flexible statistical tool for modeling pathway effects on discrete and continuous outcomes. Their close connection to mixed models and attractive performance make them have promising wide applications in bioinformatics and other biomedical areas.
Many statistical methods for microarray data analysis consider one gene at a time, and they may miss subtle changes at the single gene level. This limitation may be overcome by considering a set of genes simultaneously where the gene sets are derived from prior biological knowledge. Limited work has been carried out in the regression setting to study the effects of clinical covariates and expression levels of genes in a pathway either on a continuous or on a binary clinical outcome. Hence, we propose a Bayesian approach for identifying pathways related to both types of outcomes. We compare our Bayesian approaches with a likelihood-based approach that was developed by relating a least squares kernel machine for nonparametric pathway effect with a restricted maximum likelihood for variance components. Unlike the likelihood-based approach, the Bayesian approach allows us to directly estimate all parameters and pathway effects. It can incorporate prior knowledge into Bayesian hierarchical model formulation and makes inference by using the posterior samples without asymptotic theory. We consider several kernels (Gaussian, polynomial, and neural network kernels) to characterize gene expression effects in a pathway on clinical outcomes. Our simulation results suggest that the Bayesian approach has more accurate coverage probability than the likelihood-based approach, and this is especially so when the sample size is small compared with the number of genes being studied in a pathway. We demonstrate the usefulness of our approaches through its applications to a type II diabetes mellitus data set. Our approaches can also be applied to other settings where a large number of strongly correlated predictors are present.
Gaussian random process; kernel machine; pathway
The current goal of initial antiretroviral (ARV) therapy is suppression of plasma human immunodeficiency virus (HIV)-1 RNA levels to below 200 copies per milliliter. A proportion of HIV-infected patients who initiate antiretroviral therapy in clinical practice or antiretroviral clinical trials either fail to suppress HIV-1 RNA or have HIV-1 RNA levels rebound on therapy. Frequently, these patients have sustained CD4 cell counts responses and limited or no clinical symptoms and, therefore, have potentially limited indications for altering therapy which they may be tolerating well despite increased viral replication. On the other hand, increased viral replication on therapy leads to selection of resistance mutations to the antiretroviral agents comprising their therapy and potentially cross-resistance to other agents in the same class decreasing the likelihood of response to subsequent antiretroviral therapy. The optimal time to switch antiretroviral therapy to ensure sustained virologic suppression and prevent clinical events in patients who have rebound in their HIV-1 RNA, yet are stable, is not known. Randomized clinical trials to compare early versus delayed switching have been difficult to design and more difficult to enroll. In some clinical trials, such as the AIDS Clinical Trials Group (ACTG) Study A5095, patients randomized to initial antiretroviral treatment combinations, who fail to suppress HIV-1 RNA or have a rebound of HIV-1 RNA on therapy are allowed to switch from the initial ARV regimen to a new regimen, based on clinician and patient decisions. We delineate a statistical framework to estimate the effect of early versus late regimen change using data from ACTG A5095 in the context of two-stage designs.
In causal inference, a large class of doubly robust estimators are derived through semiparametric theory with applications to missing data problems. This class of estimators is motivated through geometric arguments and relies on large samples for good performance. By now, several authors have noted that a doubly robust estimator may be suboptimal when the outcome model is misspecified even if it is semiparametric efficient when the outcome regression model is correctly specified. Through auxiliary variables, two-stage designs, and within the contextual backdrop of our scientific problem and clinical study, we propose improved doubly robust, locally efficient estimators of a population mean and average causal effect for early versus delayed switching to second-line ARV treatment regimens. Our analysis of the ACTG A5095 data further demonstrates how methods that use auxiliary variables can improve over methods that ignore them. Using the methods developed here, we conclude that patients who switch within 8 weeks of virologic failure have better clinical outcomes, on average, than patients who delay switching to a new second-line ARV regimen after failing on the initial regimen. Ordinary statistical methods fail to find such differences. This article has online supplementary material.
Causal inference; Double robustness; Longitudinal data analysis; Missing data; Rubin causal model; Semiparametric efficient estimation
This technical paper offers a critical re-evaluation of (spectral) Granger causality measures in the analysis of biological timeseries. Using realistic (neural mass) models of coupled neuronal dynamics, we evaluate the robustness of parametric and nonparametric Granger causality. Starting from a broad class of generative (state-space) models of neuronal dynamics, we show how their Volterra kernels prescribe the second-order statistics of their response to random fluctuations; characterised in terms of cross-spectral density, cross-covariance, autoregressive coefficients and directed transfer functions. These quantities in turn specify Granger causality — providing a direct (analytic) link between the parameters of a generative model and the expected Granger causality. We use this link to show that Granger causality measures based upon autoregressive models can become unreliable when the underlying dynamics is dominated by slow (unstable) modes — as quantified by the principal Lyapunov exponent. However, nonparametric measures based on causal spectral factors are robust to dynamical instability. We then demonstrate how both parametric and nonparametric spectral causality measures can become unreliable in the presence of measurement noise. Finally, we show that this problem can be finessed by deriving spectral causality measures from Volterra kernels, estimated using dynamic causal modelling.
•This paper describes the evaluation of expected Granger causality measures.•It uses these measures to quantify problems with dynamical instability and noise.•These problems are resolved by basing Granger measures on DCM estimates.
Granger causality; Dynamic causal modelling; Effective connectivity; Functional connectivity; Dynamics; Cross spectra; Neurophysiology
Covariate-specific ROC curves are often used to evaluate the classification accuracy of a medical diagnostic test or a biomarker, when the accuracy of the test is associated with certain covariates. In many large-scale screening tests, the gold standard is subject to missingness due to high cost or harmfulness to the patient. In this paper, we propose a semiparametric estimation of the covariate-specific ROC curves with a partial missing gold standard. A location-scale model is constructed for the test result to model the covariates’ effect, but the residual distributions are left unspecified. Thus the baseline and link functions of the ROC curve both have flexible shapes. With the gold standard missing at random (MAR) assumption, we consider weighted estimating equations for the location-scale parameters, and weighted kernel estimating equations for the residual distributions. Three ROC curve estimators are proposed and compared, namely, imputation-based, inverse probability weighted and doubly robust estimators. We derive the asymptotic normality of the estimated ROC curve, as well as the analytical form the standard error estimator. The proposed method is motivated and applied to the data in an Alzheimer's disease research.
Alzheimer's disease; covariate-specific ROC curve; ignorable missingness; verification bias; weighted estimating equations
Model misspecification can be a concern for high-dimensional data. Nonparametric regression obviates model specification but is impeded by the curse of dimensionality. This paper focuses on the estimation of the marginal mean response when there is missingness in the response and multiple covariates are available. We propose estimating the mean response through nonparametric functional estimation, where the dimension is reduced by a parametric working index. The proposed semiparametric estimator is robust to model misspecification: it is consistent for any working index if the missing mechanism of the response is known or correctly specified up to unknown parameters; even with misspecification in the missing mechanism, it is consistent so long as the working index can recover E(Y | X), the conditional mean response given the covariates. In addition, when the missing mechanism is correctly specified, the semiparametric estimator attains the optimal efficiency if E(Y | X) is recoverable through the working index. Robustness and efficiency of the proposed estimator is further investigated by simulations. We apply the proposed method to a clinical trial for HIV.
Dimension reduction; Inverse probability weighting; Kernel regression; Missing at random; Robustness to model misspecification
Sensitivity and specificity are common measures of the accuracy of a diagnostic test. The usual estimators of these quantities are unbiased if data on the diagnostic test result and the true disease status are obtained from all subjects in an appropriately selected sample. In some studies, verification of the true disease status is performed only for a subset of subjects, possibly depending on the result of the diagnostic test and other characteristics of the subjects. Estimators of sensitivity and specificity based on this subset of subjects are typically biased; this is known as verification bias. Methods have been proposed to correct verification bias under the assumption that the missing data on disease status are missing at random (MAR), that is, the probability of missingness depends on the true (missing) disease status only through the test result and observed covariate information. When some of the covariates are continuous, or the number of covariates is relatively large, the existing methods require parametric models for the probability of disease or the probability of verification (given the test result and covariates), and hence are subject to model misspecification. We propose a new method for correcting verification bias based on the propensity score, defined as the predicted probability of verification given the test result and observed covariates. This is estimated separately for those with positive and negative test results. The new method classifies the verified sample into several subsamples that have homogeneous propensity scores and allows correction for verification bias. Simulation studies demonstrate that the new estimators are more robust to model misspecification than existing methods, but still perform well when the models for the probability of disease and probability of verification are correctly specified.
Diagnostic test; Model misspecification; Propensity score; Sensitivity; Specificity
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified.
In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q0 in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable.
We present theoretical results for “collaborative double robustness,” demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, providing that g solves a specified score equation implied by the difference between the Q and the true Q0. This marks an improvement over the current definition of double robustness in the estimating equation literature.
We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth, and, as a consequence, can even be super efficient if the first stage density estimator does an excellent job itself with respect to the target parameter.
This research provides a template for targeted efficient and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. This research also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
asymptotic linearity; coarsening at random; causal effect; censored data; crossvalidation; collaborative double robust; double robust; efficient influence curve; estimating function; estimator selection; influence curve; G-computation; locally efficient; loss-function; marginal structural model; maximum likelihood estimation; model selection; pathwise derivative; semiparametric model; sieve; super efficiency; super-learning; targeted maximum likelihood estimation; targeted nuisance parameter estimator selection; variable importance
Given causal graph assumptions, intervention-specific counterfactual distributions of the data can be defined by the so called G-computation formula, which is obtained by carrying out these interventions on the likelihood of the data factorized according to the causal graph. The obtained G-computation formula represents the counterfactual distribution the data would have had if this intervention would have been enforced on the system generating the data. A causal effect of interest can now be defined as some difference between these counterfactual distributions indexed by different interventions. For example, the interventions can represent static treatment regimens or individualized treatment rules that assign treatment in response to time-dependent covariates, and the causal effects could be defined in terms of features of the mean of the treatment-regimen specific counterfactual outcome of interest as a function of the corresponding treatment regimens. Such features could be defined nonparametrically in terms of so called (nonparametric) marginal structural models for static or individualized treatment rules, whose parameters can be thought of as (smooth) summary measures of differences between the treatment regimen specific counterfactual distributions.
In this article, we develop a particular targeted maximum likelihood estimator of causal effects of multiple time point interventions. This involves the use of loss-based super-learning to obtain an initial estimate of the unknown factors of the G-computation formula, and subsequently, applying a target-parameter specific optimal fluctuation function (least favorable parametric submodel) to each estimated factor, estimating the fluctuation parameter(s) with maximum likelihood estimation, and iterating this updating step of the initial factor till convergence. This iterative targeted maximum likelihood updating step makes the resulting estimator of the causal effect double robust in the sense that it is consistent if either the initial estimator is consistent, or the estimator of the optimal fluctuation function is consistent. The optimal fluctuation function is correctly specified if the conditional distributions of the nodes in the causal graph one intervenes upon are correctly specified. The latter conditional distributions often comprise the so called treatment and censoring mechanism. Selection among different targeted maximum likelihood estimators (e.g., indexed by different initial estimators) can be based on loss-based cross-validation such as likelihood based cross-validation or cross-validation based on another appropriate loss function for the distribution of the data. Some specific loss functions are mentioned in this article.
Subsequently, a variety of interesting observations about this targeted maximum likelihood estimation procedure are made. This article provides the basis for the subsequent companion Part II-article in which concrete demonstrations for the implementation of the targeted MLE in complex causal effect estimation problems are provided.
causal effect; causal graph; censored data; cross-validation; collaborative double robust; double robust; dynamic treatment regimens; efficient influence curve; estimating function; estimator selection; locally efficient; loss function; marginal structural models for dynamic treatments; maximum likelihood estimation; model selection; pathwise derivative; randomized controlled trials; sieve; super-learning; targeted maximum likelihood estimation
Missing data is a very common problem in medical and social studies, especially when data are collected longitudinally. It is a challenging problem to utilize observed data effectively. Many papers on missing data problems can be found in statistical literature. It is well known that the inverse weighted estimation is neither efficient nor robust. On the other hand, the doubly robust (DR) method can improve the efficiency and robustness. As is known, the DR estimation requires a missing data model (i.e., a model for the probability that data are observed) and a working regression model (i.e., a model for the outcome variable given covariates and surrogate variables). Because the DR estimating function has mean zero for any parameters in the working regression model when the missing data model is correctly specified, in this paper, we derive a formula for the estimator of the parameters of the working regression model that yields the optimally efficient estimator of the marginal mean model (the parameters of interest) when the missing data model is correctly specified. Furthermore, the proposed method also inherits the DR property. Simulation studies demonstrate the greater efficiency of the proposed method compared with the standard DR method. A longitudinal dementia data set is used for illustration.
longitudinal data; missing data; optimal; surrogate outcome
Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic (“z-score”) of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive a “relative enrichment score” for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both they are genome-wide significant (p < 5x10-8). There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3.
Genome-wide association studies (GWAS) have thus far identified only a small fraction of the heritability of common complex disorders, such as schizophrenia. Here, we demonstrate that by using auxiliary information we can improve estimates of replication probabilities from GWAS summary statistics. The proposed Covariate-Modulated Mixture Model (CM3) incorporates auxiliary information to construct an “enrichment score” for each single nucleotide polymorphism (SNP). We show that a scale mixture of two Gaussians provides a good fit to the observed effect size distribution stratified by the predicted enrichment score when applied the method to a recent genome-wide association study (GWAS) of SCZ (n = 82,315). Compared to estimates performed not using auxiliary information, the CM3 more accurately models the observed replication rates by stratifying on covariate-modulated enrichment scores. We observed that SNPs with low enrichment scores replicate with a lower probability compared to SNPs with high enrichment scores, even when both are genome-wide significant (p < 5x10-8). At model-based replication rates ≥80% and ≥90% there were 693 and 219 independent loci, respectively. Increased out-of-sample yield for SNPs ranked according to CM3 demonstrate the utility of incorporating auxiliary information via CM3.
In statistical inference one has to make sure that the underlying regression model is correctly specified otherwise the resulting estimation may be biased. Model checking is an important method to detect any departure of the regression model from the true one. Missing data is a ubiquitous problem in social and medical studies. If the underlying regression model is correctly specified, recent researches show great popularity of the doubly robust estimates method for handling missing data because of its robustness to the misspecification of either the missing data model or the conditional mean model, i.e. the model for the conditional expectation of true regression model conditioning on the observed quantities. However, little work has been devoted to the goodness of fit test for doubly robust estimates method. In this paper, we propose a testing method to assess the reliability of the estimator derived from the doubly robust estimating equation with possibly missing response and always observed auxiliary variables. Numerical studies demonstrate that the proposed test can control type I errors well. Furthermore the proposed method can detect departures from model assumptions in the marginal mean model of interest powerfully. A real dementia data set is used to illustrate the method for the diagnosis of model misspecification in the problem of missing response with an always observed auxiliary variable for cross-sectional data.
Auxiliary; doubly robust; estimating equation; goodness of fit; missing data
Secondary trait genetic association provides insight into the genetic architecture of disease etiology but requires caution in estimation. Ignoring case-control sampling may introduce bias into secondary trait association. In this paper, we compare the efficiency and robustness of various inverse probability weighted (IPW) estimators and maximum likelihood (ML) estimators. ML methods have been proposed but require correct modeling of both the secondary and the primary trait associations for valid inference. We show that ML methods using a misspecified primary trait model can severely inflate the type I error. IPW estimators are typically less efficient than ML estimators but are robust against model misspecification. When the secondary trait is available for the entire cohort, the IPW estimator with selection probabilities estimated nonparametrically and the augmented IPW estimator improve efficiency over the simple IPW estimator. We conclude that in large genetic association studies with complex sampling schemes, IPW-based estimators offer flexibility and robustness, and therefore are a viable option for analysis.
case-control sampling; design consistency; inverse probability weighting; maximum likelihood
We present a semi-parametric deconvolution estimator for the density function of a random variable X that is measured with error, a common challenge in many epidemiological studies. Traditional deconvolution estimators rely only on assumptions about the distribution of X and the error in its measurement, and ignore information available in auxiliary variables. Our method assumes the availability of a covariate vector statistically related to X by a mean–variance function regression model, where regression errors are normally distributed and independent of the measurement errors. Simulations suggest that the estimator achieves a much lower integrated squared error than the observed-data kernel density estimator when models are correctly specified and the assumption of normal regression errors is met. We illustrate the method using anthropometric measurements of newborns to estimate the density function of newborn length.
density estimation; measurement error; mean–variance function model
This paper considers generalized linear quantile regression for competing risks data when the failure type may be missing. Two estimation procedures for the regression co-efficients, including an inverse probability weighted complete-case estimator and an augmented inverse probability weighted estimator, are discussed under the assumption that the failure type is missing at random. The proposed estimation procedures utilize supplemental auxiliary variables for predicting the missing failure type and for informing its distribution. The asymptotic properties of the two estimators are derived and their asymptotic efficiencies are compared. We show that the augmented estimator is more efficient and possesses a double robustness property against misspecification of either the model for missingness or for the failure type. The asymptotic covariances are estimated using the local functional linearity of the estimating functions. The finite sample performance of the proposed estimation procedures are evaluated through a simulation study. The methods are applied to analyze the ‘Mashi’ trial data for investigating the effect of formula-versus breast-feeding plus extended infant zidovudine prophylaxis on HIV-related death of infants born to HIV-infected mothers in Botswana.
Augmented inverse probability weighted; Auxiliary variables; Competing risks; Double robustness; Efficient estimator; Estimating equation; Inverse probability weighted; Local functional linearity; Logistic regression; Mashi trial; Missing at random; Quantile regression
Integrating genomic information with traditional clinical risk factors to improve the prediction of disease outcomes could profoundly change the practice of medicine. However, the large number of potential markers and possible complexity of the relationship between markers and disease make it difficult to construct accurate risk prediction models. Standard approaches for identifying important markers often rely on marginal associations or linearity assumptions and may not capture non-linear or interactive effects. In recent years, much work has been done to group genes into pathways and networks. Integrating such biological knowledge into statistical learning could potentially improve model interpretability and reliability. One effective approach is to employ a kernel machine (KM) framework, which can capture nonlinear effects if nonlinear kernels are used (Scholkopf and Smola, 2002; Liu et al., 2007, 2008). For survival outcomes, KM regression modeling and testing procedures have been derived under a proportional hazards (PH) assumption (Li and Luan, 2003; Cai et al., 2011). In this paper, we derive testing and prediction methods for KM regression under the accelerated failure time model, a useful alternative to the PH model. We approximate the null distribution of our test statistic using resampling procedures. When multiple kernels are of potential interest, it may be unclear in advance which kernel to use for testing and estimation. We propose a robust Omnibus Test that combines information across kernels, and an approach for selecting the best kernel for estimation. The methods are illustrated with an application in breast cancer.
Accelerated Failure Time Model; Kernel Machines; Omnibus Test; Resampling; Risk Prediction; Survival Analysis