We consider studies for evaluating the short-term effect of a treatment of interest on a time-to-event outcome. The studies we consider are only partially controlled in the following sense: (1) subjects’ exposure to the treatment of interest can vary over time, but this exposure is not directly controlled by the study; (2) subjects’ follow-up time is not directly controlled by the study; and (3) the study directly controls another factor that can affect subjects’ exposure to the treatment of interest as well as subjects’ follow-up time. When factors (1) and (2) are both present in the study, evaluating the treatment of interest using standard methods, including instrumental variables, does not generally estimate treatment effects. We develop methodology for estimating the effect of the treatment of interest in this setting of partially controlled studies, under explicit assumptions, using the principal stratification framework for causal inference. We illustrate our methods with a study to evaluate the efficacy of the Baltimore Needle Exchange Program in reducing the risk of human immunodeficiency virus (HIV) transmission, using data on the distance of the program’s sites from the subjects.
Causal inference; HIV; Needle exchange; Partially controlled studies; Potential outcomes; Principal stratification
In genome-wide association studies, the primary task is to detect biomarkers in the form of single nucleotide polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and other important clinical/environmental factors. However, the extremely large number of SNPs compared to the sample size inhibits application of classical methods such as multiple logistic regression. The most commonly used approach is still to analyze one SNP at a time. In this paper, we propose to consider the genotypes of the SNPs simultaneously via a logistic analysis of variance (ANOVA) model, which expresses the logit-transformed mean of SNP genotypes as the summation of the SNP effects, effects of the disease phenotype and/or other clinical variables, and the interaction effects. We use a reduced-rank representation of the interaction-effect matrix for dimensionality reduction, and employ the L1-penalty in a penalized likelihood framework to filter out the SNPs that have no associations. We develop a Majorization-Minimization (MM) algorithm for computational implementation. In addition, we propose a modified BIC criterion to select the penalty parameters and determine the rank number. The proposed method is applied to a Multiple Sclerosis data set and simulated data sets, and shows promise in biomarker detection.
BIC; GWAS; MM Algorithm; L1-penalty; Penalized Bernoulli Likelihood; Simultaneous Modeling of SNPs
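The L1-penalty enters MM and coordinate updates of this kind through the scalar soft-thresholding operator. The following is a minimal illustrative sketch of that operator, not the authors' implementation:

```python
def soft_threshold(z, lam):
    """Closed-form minimizer of 0.5*(b - z)**2 + lam*|b|, the scalar
    building block of L1-penalized (lasso-type) coordinate/MM updates."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

In a penalized-likelihood MM scheme, each iteration majorizes the objective by a quadratic and applies this operator coordinate-wise; coefficients whose unpenalized updates fall inside [-lam, lam] are set exactly to zero, which is what filters out unassociated SNPs.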
In multicenter studies, one often needs to make inference about a population survival curve based on multiple, possibly heterogeneous survival data from individual centers. We investigate a flexible Bayesian method for estimating a population survival curve based on a semiparametric multiresolution hazard model that can incorporate covariates and account for center heterogeneity. The method yields a smooth estimate of the survival curve for “multiple resolutions” or time scales of interest. The Bayesian model used has the capability to accommodate general forms of censoring and a priori smoothness assumptions. We develop a model checking and diagnostic technique based on the posterior predictive distribution and use it to identify departures from the model assumptions. The hazard estimator is used to analyze data from 110 centers that participated in a multicenter randomized clinical trial to evaluate tamoxifen in the treatment of early stage breast cancer. Of particular interest are the estimates of center heterogeneity in the baseline hazard curves and in the treatment effects, after adjustment for a few key clinical covariates. Our analysis suggests that the treatment effect estimates are rather robust, even for a collection of small trial centers, despite variations in center characteristics.
Bayesian survival analysis; Breast cancer; Clinical trials; Hazard estimation; Kaplan–Meier estimator; Meta-analysis; Multicenter study; Multiresolution models; Posterior predictive check; Tamoxifen
We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed in the literature for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this article we pursue a coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty is ideally suited to this task, and we use it to demonstrate the performance of our algorithm. Certain technical derivations and experiments related to this article are included in the Supplementary Materials section.
Degrees of freedom; LASSO; Nonconvex optimization; Regularization surface; Sparse regression; Variable selection
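The MC+ penalty named above has a well-known univariate threshold function that interpolates between soft and hard thresholding. A sketch under the standard single-coordinate parameterization (`lam` is the penalty level, `gam > 1` the concavity parameter):

```python
def mcplus_threshold(z, lam, gam):
    """MC+ threshold function: gam -> infinity recovers the lasso's
    soft threshold, while gam -> 1 approaches hard thresholding."""
    az = abs(z)
    if az <= lam:
        return 0.0
    if az <= gam * lam:
        # sign(z) * (|z| - lam) / (1 - 1/gam)
        return (z - (lam if z > 0 else -lam)) / (1.0 - 1.0 / gam)
    return z  # large coefficients are left unshrunk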
We propose new, optimal methods for analyzing randomized trials, when it is suspected that treatment effects may differ in two predefined subpopulations. Such subpopulations could be defined by a biomarker or risk factor measured at baseline. The goal is to simultaneously learn which subpopulations benefit from an experimental treatment, while providing strong control of the familywise Type I error rate. We formalize this as a multiple testing problem and show it is computationally infeasible to solve using existing techniques. Our solution involves a novel approach, in which we first transform the original multiple testing problem into a large, sparse linear program. We then solve this problem using advanced optimization techniques. This general method can solve a variety of multiple testing problems and decision theory problems related to optimal trial design, for which no solution was previously available. In particular, we construct new multiple testing procedures that satisfy minimax and Bayes optimality criteria. For a given optimality criterion, our new approach yields the optimal tradeoff between power to detect an effect in the overall population versus power to detect effects in subpopulations. We demonstrate our approach in examples motivated by two randomized trials of new treatments for HIV.
In many studies with a survival outcome, it is often not feasible to fully observe the primary event of interest. This often leads to heavy censoring and thus difficulty in efficiently estimating survival or comparing survival rates between two groups. In certain diseases, baseline covariates and the event time of non-fatal intermediate events may be associated with overall survival. In these settings, incorporating such additional information may lead to gains in efficiency in estimation of survival and testing for a difference in survival between two treatment groups. If gains in efficiency can be achieved, it may then be possible to decrease the sample size of patients required for a study to achieve a particular power level or decrease the duration of the study. Most existing methods for incorporating intermediate events and covariates to predict survival focus on estimation of relative risk parameters and/or the joint distribution of events under semiparametric models. However, in practice, these model assumptions may not hold, and hence may lead to biased estimates of the marginal survival. In this paper, we propose a semi-nonparametric two-stage procedure to estimate and compare t-year survival rates by incorporating intermediate event information observed before some landmark time, which serves as a useful approach to overcome semi-competing risks issues. In a randomized clinical trial setting, we further improve efficiency through an additional calibration step. Simulation studies demonstrate substantial potential gains in efficiency in terms of estimation and power. We illustrate our proposed procedures using the AIDS Clinical Trial Protocol 175 dataset by estimating survival and examining the difference in survival between two treatment groups: zidovudine and zidovudine plus zalcitabine.
Efficiency Augmentation; Kaplan–Meier; Landmark Prediction; Semi-competing Risks; Survival Analysis
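The t-year survival rates above are anchored in the Kaplan–Meier estimator, which multiplies survival factors (1 - d/n) over distinct event times. A minimal stdlib sketch for illustration only (this is the classical estimator, not the paper's two-stage procedure):

```python
def km_curve(times, events):
    """Kaplan-Meier survival estimates at each distinct event time.
    times: observed times; events: 1 = event observed, 0 = censored."""
    event_times = sorted(set(t for t, e in zip(times, events) if e == 1))
    s, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)          # n at risk at t
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        s *= 1.0 - d / at_risk                             # survival factor
        curve.append((t, s))
    return curve
```

For example, with times [1, 2, 3] and event indicators [1, 1, 0] (the third subject censored), the estimated survival is 2/3 after the first event and 1/3 after the second.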
Under two-phase cohort designs, such as case-cohort and nested case-control sampling, information on observed event times, event indicators, and inexpensive covariates is collected in the first phase, and the first-phase information is used to select subjects for measurements of expensive covariates in the second phase; inexpensive covariates are also used in the data analysis to control for confounding and to evaluate interactions. This paper provides efficient estimation of semiparametric transformation models for such designs, accommodating both discrete and continuous covariates and allowing inexpensive and expensive covariates to be correlated. The estimation is based on the maximization of a modified nonparametric likelihood function through a generalization of the expectation-maximization algorithm. The resulting estimators are shown to be consistent, asymptotically normal and asymptotically efficient with easily estimated variances. Simulation studies demonstrate that the asymptotic approximations are accurate in practical situations. Empirical data from Wilms’ tumor studies and the Atherosclerosis Risk in Communities (ARIC) study are presented.
Case-cohort design; EM algorithm; Kernel estimation; Nested case-control sampling; Nonparametric likelihood; Semiparametric efficiency
Recently, increasing attention has focused on making causal inference when interference is possible. In the presence of interference, treatment may have several types of effects. In this paper, we consider inference about such effects when the population consists of groups of individuals in which interference is possible within groups but not between groups. A two-stage randomization design is assumed, where in the first stage groups are randomized to different treatment allocation strategies, and in the second stage individuals are randomized to treatment or control conditional on the strategy assigned to their group in the first stage. For this design, the asymptotic distributions of estimators of the causal effects are derived when either the number of individuals per group or the number of groups grows large. Under certain homogeneity assumptions, the asymptotic distributions provide justification for Wald-type confidence intervals (CIs) and tests. Empirical results demonstrate that the Wald CIs have good coverage in finite samples and are narrower than CIs based on either the Chebyshev or Hoeffding inequalities, provided the number of groups is not too small. The methods are illustrated by two examples which consider the effects of cholera vaccination and an intervention to encourage voting.
causal inference; confidence interval; interference; Normal mixture; randomization
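The Wald-type CIs referenced above take the usual estimate ± z·SE form. A toy sketch for a difference in group means with hypothetical data (this is the generic construction, not the paper's interference-specific variance estimators):

```python
import math

def wald_ci_diff_means(y1, y0, z=1.96):
    """Wald CI (est - z*se, est + z*se) for a difference in means,
    with se built from the two sample variances."""
    n1, n0 = len(y1), len(y0)
    m1, m0 = sum(y1) / n1, sum(y0) / n0
    v1 = sum((y - m1) ** 2 for y in y1) / (n1 - 1)
    v0 = sum((y - m0) ** 2 for y in y0) / (n0 - 1)
    est = m1 - m0
    se = math.sqrt(v1 / n1 + v0 / n0)
    return est - z * se, est + z * se
```

The interval is symmetric about the point estimate, and its width is 2·z·SE, which is what drives the comparison with the wider Chebyshev- and Hoeffding-based intervals.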
The nucleosome is the fundamental packing unit of DNA in eukaryotic cells, and its positioning plays a critical role in the regulation of gene expression and chromosome functions. Using a recently developed chemical mapping method, nucleosomes can potentially be mapped at an unprecedented single-base-pair resolution. The existence of overlapping nucleosomes due to cell mixture or cell dynamics, however, causes convolution of nucleosome positioning signals. In this paper, we introduce a locally convoluted cluster model and a maximum likelihood deconvolution approach, and illustrate the effectiveness of this approach for quantifying the nucleosome positioning signal in chemical mapping data.
Poisson cluster model; Deconvolution; Nucleosome positioning; EM algorithm; Chemical mapping; Negative Binomial cluster model
This paper is concerned with feature screening and variable selection for varying coefficient models with ultrahigh dimensional covariates. We propose a new feature screening procedure for these models based on the conditional correlation coefficient. We systematically study the theoretical properties of the proposed procedure, establishing its sure screening property and ranking consistency. To enhance the finite sample performance of the proposed procedure, we further develop an iterative feature screening procedure. Monte Carlo simulation studies are conducted to examine the performance of the proposed procedures. In practice, we advocate a two-stage approach for varying coefficient models, consisting of (a) reducing the ultrahigh dimensionality using the proposed procedure and (b) applying regularization methods to the dimension-reduced varying coefficient models to make statistical inferences on the coefficient functions. We illustrate the proposed two-stage approach with a real data example.
Feature selection; varying coefficient models; ranking consistency; sure screening property
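Screening procedures of this kind rank features by a correlation-type statistic and keep the top few. A stripped-down sketch using plain marginal Pearson correlation (the paper's procedure uses conditional correlations tailored to varying coefficient models; this only illustrates the generic rank-and-keep step):

```python
def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def screen(features, y, d):
    """Indices of the d features with largest |correlation| with y."""
    scores = [abs(pearson(col, y)) for col in features]
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return order[:d]
```

A second-stage penalized fit would then be run only on the d retained features, which is the two-stage logic the abstract advocates.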
In DAE (DNA After Enrichment)-seq experiments, genomic regions associated with certain biological processes are enriched/isolated by an assay and then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments (“enriched regions”) versus all other regions (“background”). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM, where the observations are not independent and the mean of each state-specific emission distribution is modeled by covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics.
High-throughput Sequencing; Hidden Markov Model; Mixture Regression; Autoregressive modeling; Variable Selection
We propose a novel two-step procedure to combine epidemiological data obtained from diverse sources with the aim to quantify risk factors affecting the probability that an individual develops a certain disease, such as cancer. In the first step, we derive all possible unbiased estimating functions, each based on one group of cases and one group of controls. In the second step, we combine these estimating functions efficiently in order to make full use of the information contained in the data. Our approach is computationally simple and flexible. We illustrate its efficacy through simulation, and apply it to investigate pancreatic cancer risks based on data obtained from the Connecticut Tumor Registry, a population-based case-control study, and the Behavioral Risk Factor Surveillance System, a state-based system of health surveys.
Spatial epidemiology; spatial point process; estimating equation
Causal inference with observational data frequently relies on the notion of the propensity score (PS) to adjust treatment comparisons for observed confounding factors. As decisions in the era of “big data” become increasingly reliant on large and complex collections of digital data, researchers are frequently confronted with decisions regarding which elements of a high-dimensional covariate set to include in the PS model in order to satisfy the assumptions necessary for estimating average causal effects. Typically, simple or ad hoc methods are employed to arrive at a single PS model, without acknowledging the uncertainty associated with the model selection. We propose three Bayesian methods for PS variable selection and model averaging that (1) select relevant variables from a set of candidate variables to include in the PS model and (2) estimate causal treatment effects as weighted averages of estimates under different PS models. The associated weight for each PS model reflects the data-driven support for that model’s ability to adjust for the necessary variables. We illustrate features of our proposed approaches with a simulation study, and ultimately use our methods to compare the effectiveness of surgical vs. nonsurgical treatment for brain tumors among 2,606 Medicare beneficiaries. Supplementary materials are available online.
Bayesian statistics; causal inference; comparative effectiveness; model averaging; propensity score
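The model-averaging step in (2) above has a simple generic form: the averaged causal effect is the weighted sum of per-model estimates, with weights proportional to each model's posterior support. A sketch with hypothetical numbers (the real weights would come from the Bayesian PS model fits):

```python
def model_averaged_effect(estimates, weights):
    """Average per-PS-model effect estimates with weights proportional
    to each model's (unnormalized) posterior support."""
    total = sum(weights)
    probs = [w / total for w in weights]  # normalize to posterior model probs
    return sum(p * e for p, e in zip(probs, estimates))
```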
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumptions and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameters of interest under a flexible semiparametric distribution family; computationally, it exploits a rank-based procedure and is as efficient as sparse PCA; empirically, it outperforms most competing methods on both synthetic and real-world datasets.
High dimensional statistics; Principal component analysis; Elliptical distribution; Robust statistics
The Seychelles Child Development Study (SCDS) examines the effects of prenatal exposure to methylmercury on the functioning of the central nervous system. The SCDS data include 20 outcomes measured on 9-year-old children that can be classified broadly into four outcome classes or “domains”: cognition, memory, motor, and social behavior. Previous analyses and scientific theory suggest that these outcomes may belong to more than one of these domains, rather than only a single domain as is frequently assumed in modeling. We present a framework for examining the effects of exposure and other covariates when each outcome may belong to more than one domain and where we also want to learn the assignment of outcomes to domains.
Each domain is defined by a sentinel outcome that is preassigned to that domain only. All other outcomes can belong to multiple domains and are not preassigned. Our model allows exposure and covariate effects to differ across domains and across outcomes within domains, and includes random subject-specific effects that model correlations between outcomes within and across domains. We take a Bayesian MCMC approach. Results from the Seychelles study and from extensive simulations show that our model can effectively determine sparse domain assignments and, at the same time, gives increased power to detect overall, domain-specific, and outcome-specific exposure and covariate effects relative to separate models for each endpoint. When fit to the Seychelles data, several outcomes were classified as partly belonging to domains other than their originally assigned domains. In retrospect, the new partial domain assignments are reasonable and, as we discuss, suggest important scientific insights about the nature of the outcomes. Model misspecification checks also improved relative to a model that assumes each outcome belongs to a single domain.
Bayesian variable selection; Latent variable model; Markov chain Monte Carlo; Methylmercury; Sparsity
The hypothalamic-pituitary-adrenal (HPA) axis is crucial in coping with stress and maintaining homeostasis. Hormones produced by the HPA axis exhibit both complex univariate longitudinal profiles and complex relationships among different hormones. Consequently, modeling these multivariate longitudinal hormone profiles is a challenging task. In this paper, we propose a bivariate hierarchical state space model, in which each hormone profile is modeled by a hierarchical state space model with both population-average and subject-specific components. The bivariate model is constructed by concatenating the univariate models based on the hypothesized relationship. Because of the flexibility of the state space form, the resulting models not only handle complex individual profiles but can also incorporate complex relationships between the two hormones, including both concurrent and feedback relationships. Estimation and inference are based on the marginal likelihood and posterior means and variances. Computationally efficient Kalman filtering and smoothing algorithms are used for implementation. Application of the proposed method to a study of chronic fatigue syndrome and fibromyalgia reveals that the relationships between adrenocorticotropic hormone and cortisol in the patient group are weaker than in healthy controls.
Circadian rhythm; Feedback Relationship; HPA axis; Kalman filter; Periodic splines
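For a scalar local-level model (state x with state noise variance q, observation z with noise variance r), the Kalman recursions underlying state space estimation reduce to a short predict/update cycle. A minimal sketch with hypothetical parameters, far simpler than the bivariate hierarchical model above:

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter for the
    local-level model x_t = x_{t-1} + w_t, z_t = x_t + v_t."""
    x_pred, p_pred = x, p + q          # predict: state unchanged, variance grows
    k = p_pred / (p_pred + r)          # Kalman gain
    x_new = x_pred + k * (z - x_pred)  # update toward the observation
    p_new = (1.0 - k) * p_pred         # posterior variance shrinks
    return x_new, p_new
```

Running this step over a hormone series (and a smoothing pass backward) is what makes likelihood evaluation computationally cheap for state space models.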
We propose a Bayesian generalized low rank regression model (GLRR) for the analysis of both high-dimensional responses and covariates. This development is motivated by searches for associations between genetic variants and brain imaging phenotypes. GLRR integrates a low rank matrix to approximate the high-dimensional regression coefficient matrix and a dynamic factor model to model the high-dimensional covariance matrix of the brain imaging phenotypes. Local hypothesis testing is developed to identify covariates with significant effects on the high-dimensional responses. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. A simulation study is performed to evaluate the finite sample performance of GLRR and to compare it with several competing approaches. We apply GLRR to investigate the impact of 1,071 SNPs, from the top 40 genes reported in the AlzGene database, on the volumes of 93 regions of interest (ROIs) obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI).
Generalized low rank regression; Genetic variant; High dimension; Imaging phenotype; Markov chain Monte Carlo; Penalized method
Functional principal component analysis (FPCA) has become the most widely used dimension reduction tool for functional data analysis. We consider functional data measured at random, subject-specific time points, contaminated with measurement error, allowing for both sparse and dense functional data, and propose novel information criteria to select the number of principal components for such data. We propose a Bayesian information criterion based on marginal modeling that can consistently select the number of principal components for both sparse and dense functional data. For dense functional data, we also develop an Akaike information criterion (AIC) based on the expected Kullback-Leibler information under a Gaussian assumption. In connection with factor analysis for multivariate time series data, we also consider the information criteria of Bai & Ng (2002) and show that they are still consistent for dense functional data if a prescribed undersmoothing scheme is undertaken in the FPCA algorithm. We perform intensive simulation studies and show that the proposed information criteria vastly outperform existing methods for this type of data. Surprisingly, our empirical evidence shows that the information criteria proposed for dense functional data also perform well for sparse functional data. An empirical example using colon carcinogenesis data is also provided to illustrate the results.
Akaike information criterion; Bayesian information criterion; Functional data analysis; Kernel smoothing; Principal components
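Information-criterion selection of the number of components has a simple generic skeleton: evaluate a penalized fit measure for each candidate and take the minimizer. A toy sketch of textbook BIC (−2·loglik + p·log n) with hypothetical numbers, not the marginal-modeling criterion derived in the paper:

```python
import math

def select_by_bic(logliks, n_params, n):
    """Index of the candidate model minimizing
    BIC_k = -2 * loglik_k + p_k * log(n)."""
    bics = [-2.0 * ll + p * math.log(n)
            for ll, p in zip(logliks, n_params)]
    return min(range(len(bics)), key=bics.__getitem__)
```

With candidate models indexed by the number of principal components, the log(n) penalty is what enforces consistency: extra components must buy enough likelihood to pay for their parameters.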
We propose a nested Gaussian process (nGP) as a locally adaptive prior for Bayesian nonparametric regression. Specified through a set of stochastic differential equations (SDEs), the nGP imposes a Gaussian process prior for the function’s mth-order derivative. The nesting comes in through including a local instantaneous mean function, which is drawn from another Gaussian process inducing adaptivity to locally-varying smoothness. We discuss the support of the nGP prior in terms of the closure of a reproducing kernel Hilbert space, and consider theoretical properties of the posterior. The posterior mean under the nGP prior is shown to be equivalent to the minimizer of a nested penalized sum-of-squares involving penalties for both the global and local roughness of the function. Using highly-efficient Markov chain Monte Carlo for posterior inference, the proposed method performs well in simulation studies compared to several alternatives, and is scalable to massive data, illustrated through a proteomics application.
Bayesian nonparametric regression; Nested Gaussian processes; Nested smoothing spline; Penalized sum-of-squares; Reproducing kernel Hilbert space; Stochastic differential equations
Multivariate matching with doses of treatment differs from treatment-control matching in three ways. First, pairs must not only balance covariates, but also must differ markedly in dose. Second, any two subjects may be paired, so that the matching is nonbipartite, and different algorithms are required. Finally, a propensity score with doses must be used in place of the conventional propensity score. We illustrate multivariate matching with doses using pilot data from a media campaign against drug abuse. The media campaign is intended to change attitudes and intentions related to illegal drugs, and the evaluation compares stated intentions among ostensibly comparable teens who reported markedly different exposures to the media campaign.
Coherent signed rank test; Equal percent bias reducing; Matching with doses; Nonbipartite matching; Observational studies; Ordinal logit model; Optimal matching; Propensity score
We use data collected in the National Comorbidity Survey - Adolescent (NCS-A) to develop a methodology to estimate the small-area prevalence of serious emotional distress (SED) in schools in the United States, exploiting the clustering of the main NCS-A sample by school. The NCS-A instrument includes both a short screening scale, the K6, and extensive diagnostic assessments of the individual disorders and associated impairment that determine the diagnosis of SED. We fitted a Bayesian bivariate multilevel regression model with correlated effects for the probability of SED and a modified K6 score at the individual and school levels. Our results provide evidence for the existence of variation in the prevalence of SED across schools and geographical regions. Although the concordance between the modified K6 scale and SED is only modest for individuals, the school-level random effects for the two measures are strongly correlated. Under this model we obtain a prediction equation for the rate of SED based on the mean K6 score and covariates. This finding supports the feasibility of using short screening scales like the K6 as an alternative to more comprehensive lay assessments in estimating school-level rates of SED. These methods may be applicable to other studies aiming at small-area estimation for geographical units.
Bayesian; bivariate; hierarchical models; National Comorbidity Survey; prediction; serious emotional distress; small-area estimation; survey
In evaluating familial risk for disease we have two main statistical tasks: assessing the probability of carrying an inherited genetic mutation conferring higher risk; and predicting the absolute risk of developing diseases over time, for those individuals whose mutation status is known. Despite substantial progress, much remains unknown about the role of genetic and environmental risk factors, about the sources of variation in risk among families that carry high-risk mutations, and about the sources of familial aggregation beyond major Mendelian effects. These sources of heterogeneity contribute substantial variation in risk across families. In this paper we present simple and efficient methods for accounting for this variation in familial risk assessment. Our methods are based on frailty models. We implemented them in the context of generalizing Mendelian models of cancer risk, and compared our approaches to others that do not consider heterogeneity across families. Our extensive simulation study demonstrates that when predicting the risk of developing a disease over time conditional on carrier status, accounting for heterogeneity results in a substantial improvement in the area under the curve of the receiver operating characteristic. On the other hand, the improvement for carriership probability estimation is more limited. We illustrate the utility of the proposed approach through the analysis of BRCA1 and BRCA2 mutation carriers in the Washington Ashkenazi Kin-Cohort Study of Breast Cancer.
familial risk prediction; frailty model; multivariate survival; ROC analysis; risk index; breast cancer
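The area under the ROC curve used above to compare risk predictions equals the probability that a randomly chosen case is scored above a randomly chosen non-case (the Mann-Whitney form), with ties counted as one half. A sketch with hypothetical scores:

```python
def auc(case_scores, control_scores):
    """ROC AUC as the proportion of case-control pairs in which the
    case's risk score exceeds the control's (ties count 0.5)."""
    concordant = 0.0
    for c in case_scores:
        for d in control_scores:
            if c > d:
                concordant += 1.0
            elif c == d:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to a useless score; the improvement reported in the abstract is an increase in this quantity when frailty-based heterogeneity is modeled.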
The goal of this paper is to model cognitive control related activation among predefined regions of interest (ROIs) of the human brain while properly adjusting for the underlying spatio-temporal correlations. Standard approaches to fMRI analysis do not simultaneously take into account both the spatial and temporal correlations that are prevalent in fMRI data, primarily because of the computational complexity of estimating the spatio-temporal covariance matrix. More specifically, they do not take into account multi-scale spatial correlation (between-ROIs and within-ROI). To address these limitations, we propose a spatio-spectral mixed effects model. Working in the spectral domain simplifies the temporal covariance structure because the Fourier coefficients are approximately uncorrelated across frequencies. Additionally, by incorporating voxel-specific and ROI-specific random effects, the model is able to capture the multi-scale spatial covariance structure: distance-dependent local correlation (within an ROI) and distance-independent global correlation (between-ROIs). Building on existing theory for linear mixed effects models to conduct estimation and inference, we apply our model to fMRI data to study activation in pre-specified ROIs in the prefrontal cortex and to estimate the correlation structure in the network. Simulation studies demonstrate that ignoring the multi-scale correlation leads to higher false positives.
Functional magnetic resonance imaging; Fourier transform; Multi-scale correlation; Local spatial covariance; Spectrum
Diffusion process models are widely used in science, engineering, and finance. Most diffusion processes are described by stochastic differential equations in continuous time. In practice, however, data are typically observed only at discrete time points. Except for a few very special cases, no analytic form exists for the likelihood of such discretely observed data. For this reason, parametric inference is often achieved by using discrete-time approximations, with accuracy controlled through the introduction of missing data. We present a new multiresolution Bayesian framework to address this inference difficulty. The methodology relies on the use of multiple approximations and extrapolation, and is significantly faster and more accurate than known strategies based on Gibbs sampling. We apply the multiresolution approach to three data-driven inference problems, one in biophysics and two in finance, one of which features a multivariate diffusion model with an entirely unobserved component.
autocorrelation; data augmentation; Euler discretization; extrapolation; likelihood; missing data; stochastic differential equation
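The Euler discretization named in the keywords approximates dX_t = mu(X_t) dt + sigma(X_t) dW_t on a time grid of step dt. A minimal Euler-Maruyama simulation sketch (illustrative only, not the paper's multiresolution sampler); the Ornstein-Uhlenbeck-style drift in the usage line is a hypothetical example:

```python
import math
import random

def euler_maruyama(x0, mu, sigma, dt, n_steps, rng=None):
    """Simulate X_{k+1} = X_k + mu(X_k)*dt + sigma(X_k)*sqrt(dt)*Z_k
    with Z_k ~ N(0, 1) standard normal increments."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x + mu(x) * dt
                  + sigma(x) * math.sqrt(dt) * rng.gauss(0.0, 1.0))
    return xs

# Mean-reverting drift toward 0 with zero noise: deterministic decay
path = euler_maruyama(1.0, lambda x: -x, lambda x: 0.0, 0.1, 3)
```

In Bayesian data-augmentation schemes for diffusions, the same update generates the imputed values between coarse observations; refining dt is exactly the resolution knob the multiresolution framework exploits.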