The two-stage design is a well-known, cost-effective way of conducting biomedical studies when the exposure variable is expensive or difficult to measure. Recent research developments further allow one or both stages of the two-stage design to be sampled in an outcome-dependent manner with respect to a continuous outcome variable. This outcome-dependent sampling feature enables further efficiency gains in parameter estimation and overall cost reduction of the study (e.g. Wang, X. and Zhou, H., 2010. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics 66, 502–511; Zhou, H., Song, R., Wu, Y. and Qin, J., 2011. Statistical inference for a two-stage outcome-dependent sampling design with a continuous outcome. Biometrics 67, 194–202). In this paper, we develop a semiparametric mixed effect regression model for data from a two-stage design in which the second-stage data are sampled under an outcome-auxiliary-dependent sampling (OADS) scheme. Our method allows the cluster- or center-effects of the study subjects to be accounted for. We propose an estimated likelihood function to estimate the regression parameters. A simulation study indicates that greater efficiency gains can be achieved under the proposed two-stage OADS design with center-effects than under alternative sampling schemes. We illustrate the proposed method by analyzing a dataset from the Collaborative Perinatal Project.
Center effect; Mixed model; Outcome-auxiliary-dependent sampling; Validation sample
With the development of massively parallel sequencing technologies, there is a substantial need for powerful rare variant association tests. Common approaches include burden and non-burden tests. Burden tests assume that all rare variants in the target region affect the phenotype in the same direction and with similar magnitude. The recently proposed sequence kernel association test (SKAT) (Wu, M. C., and others, 2011. Rare-variant association testing for sequencing data with the SKAT. The American Journal of Human Genetics 89, 82–93), an extension of the C-alpha test (Neale, B. M., and others, 2011. Testing for an unusual distribution of rare variants. PLoS Genetics 7, 161–165), provides a robust test that is particularly powerful in the presence of protective, deleterious, and null variants, but it is less powerful than burden tests when a large number of variants in a region are causal and act in the same direction. As the underlying biological mechanisms are unknown in practice and vary from one gene to another across the genome, it is of substantial practical interest to develop a test that is optimal for both scenarios. In this paper, we propose a class of tests that includes burden tests and SKAT as special cases, and derive an optimal test within this class that maximizes power. We show that this optimal test outperforms burden tests and SKAT in a wide range of scenarios. The results are illustrated using simulation studies and triglyceride data from the Dallas Heart Study. In addition, we derive sample size/power calculation formulas for SKAT with a new family of kernels to facilitate the design of new sequence association studies.
Burden tests; Correlated effects; Kernel association test; Rare variants; Score test
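One hedged way to read "a class of tests that includes burden tests and SKAT as special cases" in the abstract above is as a convex combination of the two statistics, indexed by a parameter ρ (the notation here is a reconstruction for illustration, not quoted from the paper):

\[
Q_\rho \;=\; (1-\rho)\,Q_{\mathrm{SKAT}} \;+\; \rho\,Q_{\mathrm{burden}}, \qquad 0 \le \rho \le 1,
\]

so that ρ = 0 recovers SKAT, ρ = 1 recovers the burden test, and an "optimal" unified test can, for example, minimize the p-value over a grid of ρ values.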
In recent years, genome-wide association studies (GWAS) and gene-expression profiling have generated a large number of valuable datasets for assessing how genetic variations relate to disease outcomes. With such datasets, it is often of interest to assess the overall effect of a set of genetic markers assembled on the basis of biological knowledge. Genetic marker-set analyses have been advocated as more reliable and powerful approaches compared with traditional marginal approaches (Curtis and others, 2005. Pathways to the analysis of microarray data. TRENDS in Biotechnology 23, 429–435; Efroni and others, 2007. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS One 2, 425). Procedures for testing the overall effect of a marker-set have been actively studied in recent years. For example, score tests derived under an Empirical Bayes (EB) framework (Liu and others, 2007. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics 63, 1079–1088; Liu and others, 2008. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics 9, 292; Wu and others, 2010. Powerful SNP-set analysis for case-control genome-wide association studies. American Journal of Human Genetics 86, 929) have been proposed as powerful alternatives to the standard Rao score test (Rao, 1948. Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation. Mathematical Proceedings of the Cambridge Philosophical Society 44, 50–57). The advantages of these EB-based tests are most apparent when the markers are correlated, owing to the reduction in the degrees of freedom. In this paper, we propose an adaptive score test which up- or down-weights the contributions from each member of the marker-set based on the Z-scores of their effects. Such an adaptive procedure gains power over the existing procedures when the signal is sparse and the correlation among the markers is weak. By combining evidence from both the EB-based score test and the adaptive test, we further construct an omnibus test that attains good power in most settings. The null distributions of the proposed test statistics can be approximated well either via simple perturbation procedures or via distributional approximations. Through extensive simulation studies, we demonstrate that the proposed procedures perform well in finite samples. We apply the tests to a breast cancer genetic study to assess the overall effect of the FGFR2 gene on breast cancer risk.
Adaptive procedures; Empirical Bayes; GWAS; Pathway analysis; Score test; SNP sets
With the growing availability of omics data generated to describe different cells and tissues, the modeling and interpretation of such data have become increasingly important. Pathways are sets of reactions involving genes, metabolites, and proteins that highlight functional modules in the cell. Therefore, to discover activated or perturbed pathways when comparing two conditions, for example two different tissues, it is beneficial to use several types of omics data. We present a model that integrates transcriptomic and metabolomic data in order to make informed pathway-level decisions. Since metabolites can be seen as end-points of perturbations happening at the gene level, the gene expression data constitute the explanatory variables in a sparse regression model for the metabolite data. Sophisticated model selection procedures are developed to determine an appropriate model. We demonstrate that the transcript profiles can be used to informatively explain the metabolite data from cancer cell lines. Simulation studies further show that the proposed model offers better performance in identifying active pathways than, for example, enrichment methods performed separately on the transcript and metabolite data.
Enrichment; Integrated modeling; Metabolomics; Pathways; Transcriptomics
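As a rough illustration of the integration idea in the abstract above (gene expression as explanatory variables in a sparse regression for metabolite levels), the sketch below fits one lasso model per metabolite on simulated data. The variable names, the synthetic data, and the use of scikit-learn's LassoCV are assumptions for illustration, not the authors' model or model selection procedure.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic example: 40 samples, 200 transcripts, 5 metabolites
n_samples, n_transcripts, n_metabolites = 40, 200, 5
X = rng.normal(size=(n_samples, n_transcripts))           # transcript profiles
true_coef = np.zeros((n_transcripts, n_metabolites))
true_coef[:3, :] = rng.normal(size=(3, n_metabolites))    # a few active transcripts
Y = X @ true_coef + rng.normal(scale=0.5, size=(n_samples, n_metabolites))

# One sparse regression per metabolite: transcripts explain metabolite levels
selected = {}
for m in range(n_metabolites):
    fit = LassoCV(cv=5).fit(X, Y[:, m])
    selected[m] = np.flatnonzero(fit.coef_)                # transcripts retained for metabolite m

print({m: idx.tolist() for m, idx in selected.items()})
```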
The cross-odds ratio is defined as the ratio of the conditional odds of the occurrence of one cause-specific event for one subject, given the occurrence of the same or a different cause-specific event for another subject in the same cluster, over the unconditional odds of occurrence of the cause-specific event. It is a measure of the association between the correlated cause-specific failure times within a cluster. The joint cumulative incidence function can be expressed as a function of the marginal cumulative incidence functions and the cross-odds ratio. Assuming that the marginal cumulative incidence functions follow a generalized semiparametric model, this paper studies the parametric regression modeling of the cross-odds ratio. A set of estimating equations is proposed for the unknown parameters and the asymptotic properties of the estimators are explored. Non-parametric estimation of the cross-odds ratio is also discussed. The proposed procedures are applied to the Danish twin data to model the associations between twins in their times to natural menopause and to investigate whether the association differs between monozygotic and dizygotic twins and how these associations have changed over time.
Binomial modeling; Correlated cause-specific failure times; Danish twin data; Estimating equation; Generalized semiparametric additive model; Inverse censoring probability weighting; Joint cumulative incidence function; Large sample properties; Marginal cumulative incidence function; Parametric regression model
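To make the verbal definition in the abstract above concrete, one hedged rendering of the cross-odds ratio for two members of a cluster with failure times T_1, T_2 and causes ε_1, ε_2 is (the symbols are chosen for illustration and may differ from the paper's notation):

\[
\pi_{k \mid j}(t) \;=\; \frac{\operatorname{odds}\{T_1 \le t,\ \varepsilon_1 = k \mid T_2 \le t,\ \varepsilon_2 = j\}}{\operatorname{odds}\{T_1 \le t,\ \varepsilon_1 = k\}},
\qquad
\operatorname{odds}(A \mid B) \;=\; \frac{P(A \mid B)}{1 - P(A \mid B)},
\]

so that the joint cumulative incidence P(T_1 ≤ t, ε_1 = k, T_2 ≤ t, ε_2 = j) is determined by the two marginal cumulative incidence functions together with π_{k|j}(t).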
RNA-Seq is widely used in biological and biomedical studies. Methods for estimating transcript abundance from RNA-Seq data have been intensively studied, many of which are based on the assumption that short-reads are uniformly distributed along the transcripts. However, short-reads are in fact nonuniformly distributed along the transcripts, which can greatly reduce the accuracy of methods that rely on the uniformity assumption. Several methods have been developed to adjust for the biases induced by this nonuniformity, using the empirical distribution of short-reads along the transcript. As an alternative, we found that RNA degradation plays a major role in the formation of the nonuniform short-read distribution and therefore developed a new approach that quantifies this nonuniformity by explicitly modeling RNA degradation. Our model of RNA degradation fits RNA-Seq data quite well, and based on this model we further developed a new statistical method to estimate transcript expression levels, as well as RNA degradation rates, for individual genes and their isoforms. We show that our method improves the accuracy of transcript isoform expression estimation. The RNA degradation rates we estimate for individual transcripts are consistent across samples and/or experiments/platforms. In addition, the RNA degradation rate from our model is independent of RNA length, consistent with previous studies of RNA decay rates.
EM algorithm; Gene expression; Next generation sequencing; RNA degradation; RNA-Seq
Advances in human genetics have led to epidemiological investigations not only of the effects of genes alone but also of gene–environment (G–E) interaction. A widely accepted design strategy for studying how G–E interactions relate to disease risk is the population-based case–control study (PBCCS). For simple random samples, semiparametric methods for testing G–E interaction were developed by Chatterjee and Carroll in 2005. The use of complex sampling in PBCCS, involving differential probabilities of sample selection of cases and controls and possibly cluster sampling, is becoming more common. Two complexities are induced by such sampling: weighting for selection probabilities and intracluster correlation of observations. We develop pseudo-semiparametric maximum likelihood estimators (pseudo-SPMLE) that apply to PBCCS with complex sampling. We study the finite-sample performance of the pseudo-SPMLE using simulations and illustrate the pseudo-SPMLE with a US case–control study of kidney cancer.
Hardy–Weinberg equilibrium; Population weights; Selection probability; Stratified multistage cluster sampling; Taylor linearization
The meta-analytic approach to evaluating surrogate end points assesses how well the treatment effect on the surrogate predicts the treatment effect on the clinical end point, based on multiple clinical trials. Definition and estimation of the correlation of treatment effects were developed in linear mixed models and later extended to binary or failure time outcomes on a case-by-case basis. In a general regression setting that covers nonnormal outcomes, we discuss in this paper several metrics that are useful in the meta-analytic evaluation of surrogacy. We propose a unified 3-step procedure to assess these metrics in settings with binary end points, time-to-event outcomes, or repeated measures. First, the joint distribution of estimated treatment effects is ascertained by an estimating equation approach; second, the restricted maximum likelihood method is used to estimate the means and the variance components of the random treatment effects; finally, confidence intervals are constructed by a parametric bootstrap procedure. The proposed method is evaluated by simulations and applications to 2 clinical trials.
Causal inference; Meta-analysis; Surrogacy
A noninferiority (NI) trial is sometimes employed to show efficacy of a new treatment when it is unethical to randomize current patients to placebo because of the established efficacy of a standard treatment. Under this framework, if the NI trial determines that the treatment advantage of the standard over the new drug (i.e. S−N) is less than the historic advantage of the standard over placebo (S−P), then the efficacy of the new treatment (N−P) is established indirectly. We explicitly combine information from the NI trial with estimates from a random effects model, allowing study-to-study variability in the k historic trials. Existing methods under random effects, such as the synthesis method, fail to account for the variability of the true standard-versus-placebo effect in the NI trial. Our method effectively uses a prediction interval for the missing standard-versus-placebo effect rather than a confidence interval for its mean. The consequences are to increase the variance of the synthesis method by incorporating a prediction variance term and to approximate the null distribution of the new statistic by a t distribution with k−1 degrees of freedom instead of the standard normal. Thus, it is harder to conclude NI of the new treatment to (predicted) placebo, compared with traditional methods, especially when k is small or when between-study variability is large. When the between-study variances are nonzero, we demonstrate substantial Type I error rate inflation with conventional approaches; simulations suggest that the new procedure has only modest inflation, and it is very conservative when the between-study variances are zero. An example is used to illustrate practical issues.
Active control trial; Clinical trial; Meta-analysis; Noninferiority trial; Random effects; Synthesis method
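As a hedged sketch of the variance adjustment described in the abstract above, write \hat\mu_{SP} for the random-effects estimate of the mean standard-versus-placebo effect from the k historic trials and \hat\tau^2 for the estimated between-study variance (both symbols are illustrative, not the paper's notation). Predicting the unobserved standard-versus-placebo effect in the NI trial, rather than estimating its mean, adds a between-study term to the synthesis variance, roughly

\[
\widehat{\operatorname{Var}}_{\mathrm{pred}} \;=\; \widehat{\operatorname{Var}}(\hat\mu_{SP}) \;+\; \hat\tau^{2},
\]

and the resulting statistic is referred to a t distribution with k−1 degrees of freedom rather than the standard normal; the exact weighting is given in the paper.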
In high-throughput cancer genomic studies, markers identified from the analysis of single data sets often suffer from a lack of reproducibility because of small sample sizes. An ideal solution is to conduct large-scale prospective studies, which are extremely expensive and time consuming. A cost-effective remedy is to pool data from multiple comparable studies and conduct integrative analysis. Integrative analysis of multiple data sets is challenging because of the high dimensionality of genomic measurements and the heterogeneity among studies. In this article, we propose a sparse boosting approach for marker identification in the integrative analysis of multiple heterogeneous cancer diagnosis studies with gene expression measurements. The proposed approach can effectively accommodate the heterogeneity among multiple studies and identify markers with consistent effects across studies. Simulations show that the proposed approach has satisfactory identification performance and outperforms alternatives, including an intensity approach and meta-analysis. The proposed approach is used to identify markers of pancreatic cancer and liver cancer.
Cancer genomics; Marker identification; Sparse boosting
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than the other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.
Batch effect; Control gene; Differential expression; Factor analysis; SVA; Unwanted variation
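The two steps of RUV-2 described above, factor analysis restricted to negative control genes followed by adjustment for the estimated factors, can be sketched roughly as follows. The plain SVD standing in for the factor analysis, the choice of k, and the least-squares adjustment are placeholders for illustration rather than the authors' exact estimator.

```python
import numpy as np

def ruv2_sketch(Y, control_idx, X, k=2):
    """Rough two-step sketch of the RUV-2 idea.
    Y: samples x genes matrix of (log) expression values.
    control_idx: indices of negative control genes.
    X: samples x p design matrix for the biological factor of interest.
    k: number of unwanted-variation factors to estimate."""
    # Step 1: factor analysis (here, a plain SVD) restricted to control genes
    Yc = Y[:, control_idx]
    Yc = Yc - Yc.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    W = U[:, :k] * s[:k]                        # estimated unwanted factors per sample
    # Step 2: adjust for W when estimating the effect of X on every gene
    design = np.column_stack([X, W])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return coef[: X.shape[1]], W                # effects of interest, factor estimates

# Toy usage with simulated data: all genes null, one hidden batch factor
rng = np.random.default_rng(1)
n, g = 20, 500
batch = rng.normal(size=(n, 1))
X = rng.integers(0, 2, size=(n, 1)).astype(float)
Y = batch @ rng.normal(size=(1, g)) + rng.normal(size=(n, g))
controls = np.arange(400, 500)                  # genes assumed a priori not DE
beta_hat, W_hat = ruv2_sketch(Y, controls, X, k=1)
```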
Frailty models are useful for measuring unobserved heterogeneity in the risk of failure across clusters and for providing cluster-specific risk prediction. In a frailty model, the latent frailties shared by members within a cluster are assumed to act multiplicatively on the hazard function. In order to obtain parameter and frailty variate estimates, we consider the hierarchical likelihood (H-likelihood) approach (Ha, Lee and Song, 2001. Hierarchical-likelihood approach for frailty models. Biometrika 88, 233–243), in which the latent frailties are treated as "parameters" and estimated jointly with other parameters of interest. We find that the H-likelihood estimators perform well when the censoring rate is low; however, they are substantially biased when the censoring rate is moderate to high. In this paper, we propose a simple and easy-to-implement bias correction method for the H-likelihood estimators under a shared frailty model. We also extend the method to a multivariate frailty model, which incorporates a complex dependence structure within clusters. We conduct an extensive simulation study and show that the proposed approach performs very well for censoring rates as high as 80%. We also illustrate the method with a breast cancer data set. Since the H-likelihood is the same as the penalized likelihood function, the proposed bias correction method is also applicable to the penalized likelihood estimators.
Frailty model; Hierarchical likelihood; Multivariate survival; NPMLE; Penalized likelihood; Semiparametric
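For readers unfamiliar with the H-likelihood referenced above, a generic (and purely illustrative) form for a shared frailty model with observations y_{ij}, cluster frailties u_i, and parameters θ is

\[
h(\theta, u) \;=\; \sum_{i}\sum_{j} \log f(y_{ij} \mid u_i; \theta) \;+\; \sum_{i} \log f(u_i; \theta),
\]

which is maximized jointly in θ and the frailties u_i treated as parameters; this joint form is also why, as the abstract notes, the H-likelihood coincides with a penalized likelihood, so the bias correction carries over to penalized likelihood estimators.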
To estimate an overall treatment difference with data from a randomized comparative clinical study, baseline covariates are often utilized to increase the estimation precision. Using the standard analysis of covariance technique for making inferences about such an average treatment difference may not be appropriate, especially when the fitted model is nonlinear. On the other hand, the novel augmentation procedure recently studied, for example, by Zhang and others (2008. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics 64, 707–715) is quite flexible. However, in general, it is not clear how to select covariates for augmentation effectively. An overly adjusted estimator may inflate the variance and in some cases be biased. Furthermore, the results from the standard inference procedure by ignoring the sampling variation from the variable selection process may not be valid. In this paper, we first propose an estimation procedure, which augments the simple treatment contrast estimator directly with covariates. The new proposal is asymptotically equivalent to the aforementioned augmentation method. To select covariates, we utilize the standard lasso procedure. Furthermore, to make valid inference from the resulting lasso-type estimator, a cross validation method is used. The validity of the new proposal is justified theoretically and empirically. We illustrate the procedure extensively with a well-known primary biliary cirrhosis clinical trial data set.
ANCOVA; Cross validation; Efficiency augmentation; Mayo PBC data; Semi-parametric efficiency
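A hedged sketch of the direct augmentation described above: with the simple treatment contrast \hat\Delta (e.g. the difference in mean outcomes between arms) and baseline covariate means \bar Z_1, \bar Z_0 in the two arms, an augmented estimator of the generic form

\[
\hat\Delta_{\mathrm{aug}} \;=\; \hat\Delta \;-\; \hat\gamma^{\top}\!\left(\bar Z_{1} - \bar Z_{0}\right)
\]

remains consistent under randomization for any fixed γ, and \hat\gamma is chosen to reduce variance, here via lasso with cross validation to account for the selection step. The exact form of the augmentation term is an illustrative reconstruction rather than the paper's estimator.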
Methods for causal inference regarding health effects of air quality regulations are met with unique challenges because (1) changes in air quality are intermediates on the causal pathway between regulation and health, (2) regulations typically affect multiple pollutants on the causal pathway towards health, and (3) regulating a given location can affect pollution at other locations, that is, there is interference between observations. We propose a principal stratification method designed to examine causal effects of a regulation on health that are and are not associated with causal effects of the regulation on air quality. A novel feature of our approach is the accommodation of a continuously scaled multivariate intermediate response vector representing multiple pollutants. Furthermore, we use a spatial hierarchical model for potential pollution concentrations and ultimately use estimates from this model to assess validity of assumptions regarding interference. We apply our method to estimate causal effects of the 1990 Clean Air Act Amendments among approximately 7 million Medicare enrollees living within 6 miles of a pollution monitor.
Air pollution; Bayesian statistics; Causal inference; Principal stratification; Spatial data
Randomized trials with dropouts or censored data and discrete time-to-event outcomes are frequently analyzed using the Kaplan–Meier or product limit (PL) estimation method. However, the PL method assumes that the censoring mechanism is noninformative, and when this assumption is violated the inferences may not be valid. We propose an expanded PL method using a Bayesian framework to incorporate an informative censoring mechanism and to perform sensitivity analyses on estimates of the cumulative incidence curves. The expanded method uses a model, which can be viewed as a pattern-mixture model, in which the odds of having an event during the follow-up interval (t_{k−1}, t_k], conditional on being at risk at t_{k−1}, differ across the patterns of missing data. The sensitivity parameters relate the odds of an event for subjects in each missing-data pattern to those of the observed subjects within each interval. The large number of sensitivity parameters is reduced by treating them as random and assuming they follow a log-normal distribution with prespecified mean and variance. We then vary the mean and variance to explore the sensitivity of inferences. The missing at random (MAR) mechanism is a special case of the expanded model, allowing exploration of the sensitivity of inferences to departures from the MAR assumption. The proposed approach is applied to data from the TRial Of Preventing HYpertension.
Clinical trials; Hypertension; Ignorability index; Missing data; Pattern-mixture model; TROPHY trial
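One hedged way to write the pattern-mixture relation described above: let λ_k^{(p)} be the probability of an event in (t_{k−1}, t_k], given being at risk at t_{k−1}, for subjects in missing-data pattern p, and λ_k^{obs} the corresponding quantity for observed subjects. The sensitivity parameters then link the two on the odds scale,

\[
\frac{\lambda^{(p)}_{k}}{1-\lambda^{(p)}_{k}} \;=\; \xi_{pk}\,\frac{\lambda^{\mathrm{obs}}_{k}}{1-\lambda^{\mathrm{obs}}_{k}},
\qquad
\xi_{pk} \sim \mathrm{LogNormal}(\mu, \sigma^{2}),
\]

with μ = 0 and σ² → 0 recovering MAR; μ and σ² are the quantities varied in the sensitivity analysis. The symbols are reconstructions chosen for illustration, not the paper's notation.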
We investigate methods for regression analysis when covariates are measured with error. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies the classical measurement error model, but it may not have repeated measurements. In addition to the surrogate variables available among the subjects in the calibration sample, we assume that an instrumental variable (IV) is available for all study subjects. An IV is correlated with the unobserved true exposure variable and hence can be useful in the estimation of the regression coefficients. We propose a robust best linear estimator that uses all the available data and is the most efficient among a class of consistent estimators. The proposed estimator is shown to be consistent and asymptotically normal under very weak distributional assumptions. For Poisson or linear regression, the proposed estimator is consistent even if the measurement error from the surrogate or IV is heteroscedastic. The finite-sample performance of the proposed estimator is examined and compared with that of other estimators via extensive simulation studies. The proposed method and other methods are applied to a bladder cancer case–control study.
Calibration sample; Estimating equation; Heteroscedastic measurement error; Nonparametric correction
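For context on the abstract above, the classical measurement error structure for the surrogate can be written generically as

\[
W \;=\; X + U, \qquad E(U \mid X) = 0,
\]

where X is the true exposure, W is the surrogate observed only in the calibration sample, and U is the measurement error (possibly heteroscedastic, as noted for Poisson and linear regression). The instrumental variable, available on all subjects, is correlated with X but carries no information about the outcome beyond X and is unrelated to U. This display is a generic textbook formulation, not the paper's exact set of conditions.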
Given two variables that causally influence a binary response, we formalize the idea that their effects operate through a common mechanism, in which case we say that the two variables interact mechanistically. We introduce a mechanistic interaction relationship of “interference” that is asymmetric in the two causal factors. Conditions and assumptions under which such mechanistic interaction can be tested under a given regime of data collection, be it interventional or observational, are expressed in terms of conditional independence relationships between the problem variables, which can be manipulated with the aid of causal diagrams. The proposed method is able, under appropriate conditions, to test for interaction between direct effects, and to deal with the situation where one of the two factors is a dichotomized version of a continuous variable. The method is illustrated with the aid of a study on heart disease.
Biological mechanism; Causal inference; Compositional epistasis; Direct effects; Directed acyclic graphs; Excess risk
A generic random effects formulation for the Dirichlet negative multinomial distribution is developed together with a convenient regression parameterization. A simulation study indicates that, even when somewhat misspecified, regression models based on the Dirichlet negative multinomial distribution have smaller median absolute error than generalized estimating equations, with a particularly pronounced improvement when correlation between observations in a cluster is high. Estimation of explanatory variable effects and sources of variation is illustrated for a study of clinical trial recruitment.
Dirichlet negative multinomial; Longitudinal count data; Regression; Sources of variation
Recent developments in RNA-sequencing (RNA-seq) technology have led to a rapid increase in gene expression data in the form of counts. RNA-seq can be used for a variety of applications; however, identifying differential expression (DE) remains a key task in functional genomics. There have been a number of statistical methods for DE detection for RNA-seq data. One common feature of several leading methods is the use of the negative binomial (Gamma–Poisson mixture) model. That is, the unobserved gene expression is modeled by a gamma random variable and, given the expression, the sequencing read counts are modeled as Poisson. The distinguishing feature among the various methods is how the variance, or dispersion, of the Gamma distribution is modeled and estimated. We evaluate several large public RNA-seq datasets and find that the estimated dispersion in existing methods does not adequately capture the heterogeneity of biological variance among samples. We present a new empirical Bayes shrinkage estimate of the dispersion parameters and demonstrate improved DE detection.
Differential expression; Empirical Bayes; RNA sequencing; Shrinkage estimator
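The Gamma–Poisson (negative binomial) structure mentioned above implies the familiar mean–variance relation; writing μ_g for the mean count of gene g and φ_g for its dispersion,

\[
\operatorname{Var}(Y_{g}) \;=\; \mu_{g} \;+\; \phi_{g}\,\mu_{g}^{2},
\]

so the methods compared in the paper differ mainly in how φ_g is modeled and estimated, and the empirical Bayes proposal shrinks gene-wise dispersion estimates toward a common prior. The notation is illustrative rather than quoted from the paper.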
In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge when new data become available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefit of obtaining additional data. The underlying decision-theoretic framework guarantees that the design proceeds in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
Decision theory; Forward simulation; High-throughput experiments; Multiple testing; Optimal design; Sample size; Sequential design
The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade's worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.
Gene expression; Normalization; RNA sequencing
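A rough sketch of the two ingredients named above, removing a sample-specific GC-content effect and then quantile normalizing across samples. The per-sample polynomial fit on log counts stands in for the robust generalized regression of the published conditional quantile normalization method, and all names and data here are illustrative.

```python
import numpy as np

def cqn_sketch(counts, gc, degree=2):
    """counts: genes x samples matrix of read counts; gc: per-gene GC content.
    Returns log2 expression with a per-sample GC trend removed and then
    quantile-normalized across samples."""
    log_expr = np.log2(counts + 0.5)
    adjusted = np.empty_like(log_expr)
    # 1. Remove a sample-specific GC-content trend (polynomial fit as a simple
    #    stand-in for the robust generalized regression in the published method).
    for j in range(log_expr.shape[1]):
        trend = np.poly1d(np.polyfit(gc, log_expr[:, j], degree))(gc)
        adjusted[:, j] = log_expr[:, j] - trend + trend.mean()
    # 2. Full quantile normalization: map each sample's sorted values onto the
    #    mean quantile profile across samples.
    order = np.argsort(adjusted, axis=0)
    ranks = np.argsort(order, axis=0)
    mean_quantiles = np.sort(adjusted, axis=0).mean(axis=1)
    return mean_quantiles[ranks]

# Toy usage with simulated GC-dependent counts for 1000 genes and 4 samples
rng = np.random.default_rng(2)
gc = rng.uniform(0.3, 0.7, size=1000)
counts = rng.poisson(lam=np.exp(2 + 3 * gc)[:, None] * rng.uniform(0.5, 2, size=(1, 4)))
norm = cqn_sketch(counts, gc)
```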
Many public and private genome-wide association studies that we have analyzed include flaws in design, with avoidable confounding appearing as a norm rather than the exception. Rather than recognizing flawed research design and addressing that, a category of quality-control statistical methods has arisen to treat only the symptoms. Reflecting more deeply, we examine elements of current genomic research in light of the traditional scientific method and find that hypotheses are often detached from data collection, experimental design, and causal theories. Association studies independent of causal theories, along with multiple testing errors, too often drive health care and public policy decisions. In an era of large-scale biological research, we ask questions about the role of statistical analyses in advancing coherent theories of diseases and their mechanisms. We advocate for reinterpretation of the scientific method in the context of large-scale data analysis opportunities and for renewed appreciation of falsifiable hypotheses, so that we can learn more from our best mistakes.
Association studies; Bioinformatics; Experimental design; GWAS; Scientific method