Cause-specific proportional hazards models are commonly used for analyzing competing risks data in clinical studies. Motivated by the objective to assess differential vaccine protection against distinct pathogen types in randomized preventive vaccine efficacy trials, we present an alternative case-only method to standard maximum partial likelihood estimation that applies to a rare failure event, e.g. acquisition of HIV infection. A logistic regression model is fit to the counts of cause-specific events (infecting pathogen type) within study arms, with an offset adjusting for the randomization ratio. This formulation of cause-specific hazard ratio estimation permits immediate incorporation of host-genetic factors to be assessed as effect modifiers, an important area of vaccine research for identifying immune correlates of protection, thus inheriting the estimation efficiency, and cost benefits of the case-only estimator commonly used for assessing gene–treatment interactions. The method is used to reassess HIV genotype-specific vaccine efficacy in the RV144 trial, providing nearly identical results to standard Cox methods, and to assess if and how this vaccine efficacy depends on Fc-γ receptor genes.
Gene–treatment interaction; Sieve analysis; Vaccine efficacy
Blood and tissue are composed of many functionally distinct cell subsets. In immunological studies, these can be measured accurately only using single-cell assays. The characterization of these small cell subsets is crucial to decipher system-level biological changes. For this reason, an increasing number of studies rely on assays that provide single-cell measurements of multiple genes and proteins from bulk cell samples. A common problem in the analysis of such data is to identify biomarkers (or combinations of biomarkers) that are differentially expressed between two biological conditions (e.g. before/after stimulation), where expression is defined as the proportion of cells expressing that biomarker (or biomarker combination) in the cell subset(s) of interest. Here, we present a Bayesian hierarchical framework based on a beta-binomial mixture model for testing for differential biomarker expression using single-cell assays. Our model allows the inference to be subject specific, as is typically required when assessing vaccine responses, while borrowing strength across subjects through common prior distributions. We propose two approaches for parameter estimation: an empirical-Bayes approach using an Expectation–Maximization algorithm and a fully Bayesian one based on a Markov chain Monte Carlo algorithm. We compare our method against classical approaches for single-cell assays including Fisher’s exact test, a likelihood ratio test, and basic log-fold changes. Using several experimental assays measuring proteins or genes at single-cell level and simulations, we show that our method has higher sensitivity and specificity than alternative methods. Additional simulations show that our framework is also robust to model misspecification. Finally, we demonstrate how our approach can be extended to testing multivariate differential expression across multiple biomarker combinations using a Dirichlet-multinomial model and illustrate this approach using single-cell gene expression data and simulations.
Bayesian modeling; Expectation–Maximization; Flow cytometry; Hierarchical modeling; Immunology; Marginal likelihood; Markov Chain Monte Carlo; MIMOSA; Single-cell gene expression
Analyzing the failure times of multiple events is of interest in many fields. Estimating the joint distribution of the failure times in a non-parametric way is not straightforward because some failure times are often right-censored and only known to be greater than observed follow-up times. Although it has been studied, there is no universally optimal solution for this problem. It is still challenging and important to provide alternatives that may be more suitable than existing ones in specific settings. Related problems of the existing methods are not only limited to infeasible computations, but also include the lack of optimality and possible non-monotonicity of the estimated survival function. In this paper, we proposed a non-parametric Bayesian approach for directly estimating the density function of multivariate survival times, where the prior is constructed based on the optional Pólya tree. We investigated several theoretical aspects of the procedure and derived an efficient iterative algorithm for implementing the Bayesian procedure. The empirical performance of the method was examined via extensive simulation studies. Finally, we presented a detailed analysis using the proposed method on the relationship among organ recovery times in severely injured patients. From the analysis, we suggested interesting medical information that can be further pursued in clinics.
Multivariate survival analysis; Non-parametric Bayesian; Optional Pólya tree
Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method’s prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology.
Bayesian shrinkage; Confidence intervals; Ranking bias; Robust multiple estimation
Censored quantile regression provides a useful alternative to the Cox proportional hazards model for analyzing survival data. It directly models the conditional quantile of the survival time and hence is easy to interpret. Moreover, it relaxes the proportionality constraint on the hazard function associated with the popular Cox model and is natural for modeling heterogeneity of the data. Recently, Wang and Wang (2009. Locally weighted censored quantile regression. Journal of the American Statistical Association
103, 1117–1128) proposed a locally weighted censored quantile regression approach that allows for covariate-dependent censoring and is less restrictive than other censored quantile regression methods. However, their kernel smoothing-based weighting scheme requires all covariates to be continuous and encounters practical difficulty with even a moderate number of covariates. We propose a new weighting approach that uses recursive partitioning, e.g. survival trees, that offers greater flexibility in handling covariate-dependent censoring in moderately high dimensions and can incorporate both continuous and discrete covariates. We prove that this new weighting scheme leads to consistent estimation of the quantile regression coefficients and demonstrate its effectiveness via Monte Carlo simulations. We also illustrate the new method using a widely recognized data set from a clinical trial on primary biliary cirrhosis.
Censored quantile regression; Recursive partitioning; Survival analysis; Survival ensembles
Modern case–control studies typically involve the collection of data on a large number of outcomes, often at considerable logistical and monetary expense. These data are of potentially great value to subsequent researchers, who, although not necessarily concerned with the disease that defined the case series in the original study, may want to use the available information for a regression analysis involving a secondary outcome. Because cases and controls are selected with unequal probability, regression analysis involving a secondary outcome generally must acknowledge the sampling design. In this paper, the author presents a new framework for the analysis of secondary outcomes in case–control studies. The approach is based on a careful re-parameterization of the conditional model for the secondary outcome given the case–control outcome and regression covariates, in terms of (a) the population regression of interest of the secondary outcome given covariates and (b) the population regression of the case–control outcome on covariates. The error distribution for the secondary outcome given covariates and case–control status is otherwise unrestricted. For a continuous outcome, the approach sometimes reduces to extending model (a) by including a residual of (b) as a covariate. However, the framework is general in the sense that models (a) and (b) can take any functional form, and the methodology allows for an identity, log or logit link function for model (a).
Case–control studies; Generalized linear models; Statistical genetics; Secondary outcomes
We introduce an explicit set of metrics for human activity based on high-density acceleration recordings from a hip-worn tri-axial accelerometer. These metrics are based on two concepts: (i) Time Active, a measure of the length of time when activity is distinguishable from rest and (ii) AI, a measure of the relative amplitude of activity relative to rest. All measurements are normalized (have the same interpretation across subjects and days), easy to explain and implement, and reproducible across platforms and software implementations. Metrics were validated by visual inspection of results and quantitative in-lab replication studies, and by an association study with health outcomes
Activity intensity; Movelets; Movement; Signal processing; Time active; Tri-axial accelerometer
Estimation of the period length of time-course data from cyclical biological processes, such as those driven by the circadian pacemaker, is crucial for inferring the properties of the biological clock found in many living organisms. We propose a methodology for period estimation based on spectrum resampling (SR) techniques. Simulation studies show that SR is superior and more robust to non-sinusoidal and noisy cycles than a currently used routine based on Fourier approximations. In addition, a simple fit to the oscillations using linear least squares is available, together with a non-parametric test for detecting changes in period length which allows for period estimates with different variances, as frequently encountered in practice. The proposed methods are motivated by and applied to various data examples from chronobiology.
Circadian rhythms; Non-parametric testing; Period estimation; Resampling; Spectrum
The detection of areas in which the risk of a particular disease is significantly elevated, leading to an excess of cases, is an important enterprise in spatial epidemiology. Various frequentist approaches have been suggested for the detection of “clusters” within a hypothesis testing framework. Unfortunately, these suffer from a number of drawbacks including the difficulty in specifying a p-value threshold at which to call significance, the inherent multiplicity problem, and the possibility of multiple clusters. In this paper, we suggest a Bayesian approach to detecting “areas of clustering” in which the study region is partitioned into, possibly multiple, “zones” within which the risk is either at a null, or non-null, level. Computation is carried out using Markov chain Monte Carlo, tuned to the model that we develop. The method is applied to leukemia data in upstate New York.
Bayes factors; Markov chain Monte Carlo; Scan statistic; Spatial epidemiology
We consider in this paper testing for interactions between a genetic marker set and an environmental variable. A common practice in studying gene–environment (GE) interactions is to analyze one single-nucleotide polymorphism (SNP) at a time. It is of significant interest to analyze SNPs in a biologically defined set simultaneously, e.g. gene or pathway. In this paper, we first show that if the main effects of multiple SNPs in a set are associated with a disease/trait, the classical single SNP–GE interaction analysis can be biased. We derive the asymptotic bias and study the conditions under which the classical single SNP–GE interaction analysis is unbiased. We further show that, the simple minimum p-value-based SNP-set GE analysis, can be biased and have an inflated Type 1 error rate. To overcome these difficulties, we propose a computationally efficient and powerful gene–environment set association test (GESAT) in generalized linear models. Our method tests for SNP-set by environment interactions using a variance component test, and estimates the main SNP effects under the null hypothesis using ridge regression. We evaluate the performance of GESAT using simulation studies, and apply GESAT to data from the Harvard lung cancer genetic study to investigate GE interactions between the SNPs in the 15q24–25.1 region and smoking on lung cancer risk.
Asymptotic bias analysis; Gene–environment interactions; Genome-wide association studies; Score statistic; Single-nucleotide polymorphism; Variance component test
The Lasso shrinkage procedure achieved its popularity, in part, by its tendency to shrink estimated coefficients to zero, and its ability to serve as a variable selection procedure. Using data-adaptive weights, the adaptive Lasso modified the original procedure to increase the penalty terms for those variables estimated to be less important by ordinary least squares. Although this modified procedure attained the oracle properties, the resulting models tend to include a large number of “false positives” in practice. Here, we adapt the concept of local false discovery rates (lFDRs) so that it applies to the sequence, λn, of smoothing parameters for the adaptive Lasso. We define the lFDR for a given λn to be the probability that the variable added to the model by decreasing λn to λn−δ is not associated with the outcome, where δ is a small value. We derive the relationship between the lFDR and λn, show lFDR=1 for traditional smoothing parameters, and show how to select λn so as to achieve a desired lFDR. We compare the smoothing parameters chosen to achieve a specified lFDR and those chosen to achieve the oracle properties, as well as their resulting estimates for model coefficients, with both simulation and an example from a genetic study of prostate specific antigen.
Adaptive Lasso; Local false discovery rate; Smoothing parameter; Variable selection
Modern medicine has graduated from broad spectrum treatments to targeted therapeutics. New drugs recognize the recently discovered heterogeneity of many diseases previously considered to be fairly homogeneous. These treatments attack specific genetic pathways which are only dysregulated in some smaller subset of patients with the disease. Often this subset is only rudimentarily understood until well into large-scale clinical trials. As such, standard practice has been to enroll a broad range of patients and run post hoc subset analysis to determine those who may particularly benefit. This unnecessarily exposes many patients to hazardous side effects, and may vastly decrease the efficiency of the trial (especially if only a small subset of patients benefit). In this manuscript, we propose a class of adaptive enrichment designs that allow the eligibility criteria of a trial to be adaptively updated during the trial, restricting entry to patients likely to benefit from the new treatment. We show that our designs both preserve the type 1 error, and in a variety of cases provide a substantial increase in power.
Adaptive clinical trials; Biomarker; Cutpoint; Enrichment
We propose a beta product confidence procedure (BPCP) that is a non-parametric confidence procedure for the survival curve at a fixed time for right-censored data assuming independent censoring. In such situations, the Kaplan–Meier estimator is typically used with an asymptotic confidence interval (CI) that can have coverage problems when the number of observed failures is not large, and/or when testing the latter parts of the curve where there are few remaining subjects at risk. The BPCP guarantees central coverage (i.e. ensures that both one-sided error rates are no more than half of the total nominal rate) when there is no censoring (in which case it reduces to the Clopper–Pearson interval) or when there is progressive type II censoring (i.e. when censoring only occurs immediately after failures on fixed proportions of the remaining individuals). For general independent censoring, simulations show that the BPCP maintains central coverage in many situations where competing methods can have very substantial error rate inflation for the lower limit. The BPCP gives asymptotically correct coverage and is asymptotically equivalent to the CI on the Kaplan–Meier estimator using Greenwood’s variance. The BPCP may be inverted to create confidence procedures for a quantile of the underlying survival distribution. Because the BPCP is easy to implement, offers protection in settings when other methods fail, and essentially matches other methods when they succeed, it should be the method of choice.
Clopper–Pearson confidence interval; Exact confidence interval; Kaplan–Meierestimator; Median survival; Non-parametric methods; Survival analysis
In studies that compare several diagnostic or treatment groups, subjects may not only be measured on a certain set of feature variables, but also be matched on a number of demographic characteristics and measured on additional covariates. Linear discriminant analysis (LDA) is sometimes used to identify which feature variables best discriminate among groups, while accounting for the dependencies among the feature variables. We present a new approach to LDA for multivariate normal data that accounts for the subject matching used in a particular study design, as well as covariates not used in the matching. Applications are given for post-mortem tissue data with the aim of comparing neurobiological characteristics of subjects with schizophrenia with those of normal controls, and for a post-mortem tissue primate study comparing brain biomarker measurements across three treatment groups. We also investigate the performance of our approach using a simulation study.
Covariates; Linear discriminant analysis; Matched design; Post-mortem tissue studies; Schizophrenia
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
Conditional score; Cox regression; High-dimensional data; Risk prediction; Sample size; Training set
A major challenge in cancer epidemiologic studies, especially those of rare cancers, is observing enough cases. To address this, researchers often join forces by bringing multiple studies together to achieve large sample sizes, allowing for increased power in hypothesis testing, and improved efficiency in effect estimation. Combining studies, however, renders the analysis difficult owing to the presence of heterogeneity in the pooled data. In this article, motivated by a collaborative nested case–control (NCC) study of ovarian cancer in three cohorts from United States, Sweden, and Italy, we investigate the use of penalty regularized partial likelihood estimation in the context of pooled NCC studies to achieve two goals. First, we propose an adaptive group lasso (gLASSO) penalized approach to simultaneously identify important variables and estimate their effects. Second, we propose a composite agLASSO penalized approach to identify variables with heterogeneous effects. Both methods are readily implemented with the group coordinate gradient decent algorithm and shown to enjoy the oracle property. We conduct simulation studies to evaluate the performance of our proposed approaches in finite samples under various heterogeneity settings, and apply them to the pooled ovarian cancer study.
Cox's proportional hazards model; Group penalty; Heterogeneity; Nested case–control sampling; Ovarian cancer; Pooled studies; Shrinkage estimation
When some of the regressors can act on both the response and other explanatory variables, the already challenging problem of selecting variables when the number of covariates exceeds the sample size becomes more difficult. A motivating example is a metabolic study in mice that has diet groups and gut microbial percentages that may affect changes in multiple phenotypes related to body weight regulation. The data have more variables than observations and diet is known to act directly on the phenotypes as well as on some or potentially all of the microbial percentages. Interest lies in determining which gut microflora influence the phenotypes while accounting for the direct relationship between diet and the other variables A new methodology for variable selection in this context is presented that links the concept of q-values from multiple hypothesis testing to the recently developed weighted Lasso.
False discovery rate; Microbial data; q-Values; Variable selection; Weighted Lasso
The standard methods for detecting differential gene expression are mostly designed for analyzing a single gene expression experiment. When data from multiple related gene expression studies are available, separately analyzing each study is not ideal as it may fail to detect important genes with consistent but relatively weak differential signals in multiple studies. Jointly modeling all data allows one to borrow information across studies to improve the analysis. However, a simple concordance model, in which each gene is assumed to be differential in either all studies or none of the studies, is incapable of handling genes with study-specific differential expression. In contrast, a model that naively enumerates and analyzes all possible differential patterns across studies can deal with study-specificity and allow information pooling, but the complexity of its parameter space grows exponentially as the number of studies increases. Here, we propose a correlation motif approach to address this dilemma. This approach searches for a small number of latent probability vectors called correlation motifs to capture the major correlation patterns among multiple studies. The motifs provide the basis for sharing information among studies and genes. The approach has flexibility to handle all possible study-specific differential patterns. It improves detection of differential expression and overcomes the barrier of exponential model complexity.
Bayes hierarchical model; Correlation motif; EM algorithm; Microarray; Multiple datasets
We propose a class of estimation techniques for scalar-on-function regression where both outcomes and functional predictors may be observed at multiple visits. Our methods are motivated by a longitudinal brain diffusion tensor imaging tractography study. One of the study’s primary goals is to evaluate the contemporaneous association between human function and brain imaging over time. The complexity of the study requires the development of methods that can simultaneously incorporate: (1) multiple functional (and scalar) regressors; (2) longitudinal outcome and predictor measurements per patient; (3) Gaussian or non-Gaussian outcomes; and (4) missing values within functional predictors. We propose two versions of a new method, longitudinal functional principal components regression (PCR). These methods extend the well-known functional PCR and allow for different effects of subject-specific trends in curves and of visit-specific deviations from that trend. The new methods are compared with existing approaches, and the most promising techniques are used for analyzing the tractography data.
Diffusion tensor imaging; Functional principal components; Functional regression; Longitudinal functional principal components regression; Multiple sclerosis; Repeated measurements
Independent component analysis (ICA) is a widely used technique for blind source separation, used heavily in several scientific research areas including acoustics, electrophysiology, and functional neuroimaging. We propose a scalable two-stage iterative true group ICA methodology for analyzing population level functional magnetic resonance imaging (fMRI) data where the number of subjects is very large. The method is based on likelihood estimators of the underlying source densities and the mixing matrix. As opposed to many commonly used group ICA algorithms, the proposed method does not require significant data reduction by a 2-fold singular value decomposition. In addition, the method can be applied to a large group of subjects since the memory requirements are not restrictive. The performance of our approach is compared with a commonly used group ICA algorithm via simulation studies. Furthermore, the proposed method is applied to a large collection of resting state fMRI datasets. The results show that established brain networks are well recovered by the proposed algorithm.
Functional MRI; Signal processing; Source separation
Random effects models are commonly used to analyze longitudinal categorical data. Marginalized random effects models are a class of models that permit direct estimation of marginal mean parameters and characterize serial correlation for longitudinal categorical data via random effects (Heagerty, 1999). Marginally specified logistic-normal models for longitudinal binary data. Biometrics
55, 688–698; Lee and Daniels, 2008. Marginalized models for longitudinal ordinal data with application to quality of life studies. Statistics in Medicine
27, 4359–4380). In this paper, we propose a Kronecker product (KP) covariance structure to capture the correlation between processes at a given time and the correlation within a process over time (serial correlation) for bivariate longitudinal ordinal data. For the latter, we consider a more general class of models than standard (first-order) autoregressive correlation models, by re-parameterizing the correlation matrix using partial autocorrelations (Daniels and Pourahmadi, 2009). Modeling covariance matrices via partial autocorrelations. Journal of Multivariate Analysis
100, 2352–2363). We assess the reasonableness of the KP structure with a score test. A maximum marginal likelihood estimation method is proposed utilizing a quasi-Newton algorithm with quasi-Monte Carlo integration of the random effects. We examine the effects of demographic factors on metabolic syndrome and C-reactive protein using the proposed models.
Kronecker product; Metabolic syndrome; Partial autocorrelation
Assumptions regarding the true underlying genetic model, or mode of inheritance, are necessary when quantifying genetic associations with disease phenotypes. Here we propose new methods to ascertain the underlying genetic model from parental data in family-based association studies. Specifically, for parental mating-type data, we propose a novel statistic to test whether the underlying genetic model is additive, dominant, or recessive; for parental genotype–phenotype data, we propose three strategies to determine the true mode of inheritance. We illustrate how to incorporate the information gleaned from these strategies into family-based association tests. Because family-based association tests are conducted conditional on parental genotypes, the type I error rate of these procedures is not inflated by the information learned from parental data. This result holds even if such information is weak or when the assumption of Hardy–Weinberg equilibrium is violated. Our simulations demonstrate that incorporating parental data into family-based association tests can improve power under common inheritance models. The application of our proposed methods to a candidate-gene study of type 1 diabetes successfully detects a recessive effect in MGAT5 that would otherwise be missed by conventional family-based association tests.
Case–parents; Dominant; Mode of inheritance; Nuclear families; Recessive; Robust
Cost-effectiveness analysis (CEA) is an important component of the economic evaluation of new treatment options. In many clinical and observational studies of costs, censored data pose challenges to the CEA. We consider a special situation where the terminating events for the survival time and costs are different. Traditional methods for statistical inference offer no means for dealing with censored data in these circumstances. To address this gap, we propose a new method for deriving the confidence interval for the incremental cost-effectiveness ratio. The simulation studies and real data example show that our method performs very well for some practical settings, revealing a great potential for application to actual settings in which terminating events for the survival time and costs differ.
Censored data; Cost-effectiveness analysis; Different terminating events; Fieller method; Survival analysis