Genetic association studies have been a popular approach for assessing the association between common Single Nucleotide Polymorphisms (SNPs) and complex diseases. However, other genomic data involved in the mechanism from SNPs to disease, e.g., gene expressions, are usually neglected in these association studies. In this paper, we propose to exploit gene expression information to more powerfully test the association between SNPs and diseases by jointly modeling the relations among SNPs, gene expressions and diseases. We propose a variance component test for the total effect of SNPs and a gene expression on disease risk. We cast the test within the causal mediation analysis framework with the gene expression as a potential mediator. For eQTL SNPs, the use of gene expression information can enhance power to test for the total effect of a SNP-set, i.e., the combined direct and indirect effects of the SNPs mediated through the gene expression, on disease risk. We show that the test statistic under the null hypothesis follows a mixture of χ2 distributions, which can be evaluated analytically or empirically using a resampling-based perturbation method. We construct tests for each of three disease models: one determined by SNPs only, one by SNPs and gene expression, and one that also includes SNP-by-expression interactions. As the true disease model is unknown in practice, we further propose an omnibus test to accommodate different underlying disease models. We evaluate the finite sample performance of the proposed methods using simulation studies, and show that our proposed test performs well and that the omnibus test can almost reach the optimal power attained when the disease model is known and correctly specified. We apply our method to re-analyze the overall effect of the SNP-set and expression of the ORMDL3 gene on the risk of asthma.
doi:10.1214/13-AOAS690
PMCID: PMC3981558
Causal Inference; Data Integration; Mediation Analysis; Mixed Models; Score Test; SNP Set Analysis; Variance Component Test
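The mixture-of-χ2 null distribution described above can be evaluated empirically. The sketch below (Python with numpy assumed; function name and the simple linear kernel are our own illustrative choices, not the authors' implementation, which also handles covariates, gene expression and binary disease outcomes) shows a minimal variance-component score test for a SNP set with a continuous outcome under an intercept-only null:

```python
import numpy as np

def vc_score_pvalue(G, y, n_draws=2000, seed=1):
    """Monte Carlo p-value for a variance-component score test on a SNP set.

    Under the null of no SNP-set effect, Q = r' K r follows a mixture of
    chi-squared(1) variables weighted by eigenvalues of the projected
    kernel; the mixture is evaluated empirically here.
    G: (n, p) genotype matrix; y: (n,) continuous outcome.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    r = y - y.mean()                      # residuals under an intercept-only null
    sigma2 = r @ r / (n - 1)              # null variance estimate
    K = G @ G.T                           # linear kernel on the SNP set
    Q = float(r @ K @ r)                  # score-type test statistic
    H = np.eye(n) - np.ones((n, n)) / n   # centering projection
    lam = np.linalg.eigvalsh(H @ K @ H) * sigma2
    lam = lam[lam > 1e-10]                # keep numerically nonzero weights
    null = (lam * rng.chisquare(1.0, size=(n_draws, lam.size))).sum(axis=1)
    return (1 + np.sum(null >= Q)) / (1 + n_draws)
```

An analytic evaluation of the same mixture (e.g., Davies' method) could replace the Monte Carlo step.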
With the development of next generation sequencing technology, researchers are now able to study the microbiome composition using direct sequencing, whose output is a set of bacterial taxa counts for each microbiome sample. One goal of microbiome studies is to associate the microbiome composition with environmental covariates. We propose to model the taxa counts using a Dirichlet-multinomial (DM) regression model in order to account for overdispersion of the observed counts. The DM regression model can be used for testing the association between taxa composition and covariates using the likelihood ratio test. However, when the number of covariates is large, multiple testing can lead to loss of power. To deal with the high dimensionality of the problem, we develop a penalized likelihood approach to estimate the regression parameters and to select the variables by imposing a sparse group ℓ1 penalty that encourages both group-level and within-group sparsity. Such a variable selection procedure can lead to selection of the relevant covariates and their associated bacterial taxa. An efficient block-coordinate descent algorithm is developed to solve the optimization problem. We present extensive simulations to demonstrate that the sparse DM regression can result in better identification of the microbiome-associated covariates than models that ignore overdispersion or only consider the proportions. We demonstrate the power of our method in an analysis of a data set evaluating the effects of nutrient intake on human gut microbiome composition. Our results clearly show that nutrient intake is strongly associated with the human gut microbiome.
doi:10.1214/12-AOAS592
PMCID: PMC3846354
PMID: 24312162
Coordinate descent; Count data; Overdispersion; Regularized likelihood; Sparse group penalty
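As a concrete illustration of the objective components described above, the following sketch (Python with numpy/scipy assumed; function names are hypothetical) evaluates the Dirichlet-multinomial negative log-likelihood and the sparse group ℓ1 penalty that a block-coordinate descent algorithm would jointly minimize:

```python
import numpy as np
from scipy.special import gammaln

def dm_neg_loglik(Y, alpha):
    """Negative log-likelihood of the Dirichlet-multinomial model.
    Y: (n, q) taxa count matrix; alpha: (n, q) positive DM parameters,
    e.g. alpha = exp(X @ B) under a log-linear regression link.
    The multinomial coefficient, constant in alpha, is omitted."""
    n_i = Y.sum(axis=1)
    a0 = alpha.sum(axis=1)
    ll = (gammaln(a0) - gammaln(n_i + a0)
          + (gammaln(Y + alpha) - gammaln(alpha)).sum(axis=1))
    return -float(ll.sum())

def sparse_group_penalty(B, lam_group, lam_within):
    """Sparse group-l1 penalty on a (p, q) coefficient matrix: a group-lasso
    term over each covariate's row of taxa coefficients (group-level
    sparsity) plus an l1 term on individual entries (within-group
    sparsity)."""
    return (lam_group * np.linalg.norm(B, axis=1).sum()
            + lam_within * np.abs(B).sum())
```

Setting a covariate's entire row of coefficients to zero removes it from the model; the within-group term additionally zeroes individual taxa within a selected covariate.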
The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected sub-sample. It is natural to apply such a strategy for collecting genetic data in a sub-sample enriched for exposure to environmental factors for gene-environment interaction (G × E) analysis. In this paper, we consider two-phase studies of G × E interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phase I and phase II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions, and employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade off bias against efficiency in estimating the interaction parameters, through hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the non-parametric Bayes construction of Dunson and Xing (2009). We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (a drug used for lowering lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. The sub-sample of cases and controls on which these genetic markers were measured is enriched in terms of statin users.
The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.
doi:10.1214/12-AOAS599
PMCID: PMC3935248
PMID: 24587840
Biased sampling; Colorectal cancer; Dirichlet prior; Exposure-enriched sampling; Gene-environment independence; Joint effects; Multivariate categorical distribution; Spike and slab prior
Analogical reasoning depends fundamentally on the ability to learn and generalize about relations between objects. We develop an approach to relational learning which, given a set of pairs of objects S = {A(1) : B(1), A(2) : B(2), …, A(N) : B(N)}, measures how well other pairs A : B fit in with the set S. Our work addresses the following question: is the relation between objects A and B analogous to those relations found in S? Such questions are particularly relevant in information retrieval, where an investigator might want to search for analogous pairs of objects that match the query set of interest. There are many ways in which objects can be related, making the task of measuring analogies very challenging. Our approach combines a similarity measure on function spaces with Bayesian analysis to produce a ranking. It requires data containing features of the objects of interest and a link matrix specifying which relationships exist; no further attributes of such relationships are necessary. We illustrate the potential of our method on text analysis and information networks. An application on discovering functional interactions between pairs of proteins is discussed in detail, where we show that our approach can work in practice even if a small set of protein pairs is provided.
doi:10.1214/09-AOAS321
PMCID: PMC3935415
PMID: 24587838
Network analysis; Bayesian inference; variational approximation; ranking; information retrieval; data integration; Saccharomyces cerevisiae
The primary goal of randomized trials is to compare the effects of different interventions on some outcome of interest. In addition to the treatment assignment and outcome, data on baseline covariates, such as demographic characteristics or biomarker measurements, are typically collected. Incorporating such auxiliary covariates in the analysis of randomized trials can increase power, but questions remain about how to preserve type I error when incorporating such covariates in a flexible way, particularly when the number of randomized units is small. Using the Young Citizens study, a cluster randomized trial of an educational intervention to promote HIV awareness, we compare several methods to evaluate intervention effects when baseline covariates are incorporated adaptively. To ascertain the validity of these methods in small samples, we conducted extensive simulation studies. We demonstrate that randomization inference preserves type I error under model selection while tests based on asymptotic theory may yield invalid results. We also demonstrate that covariate adjustment generally increases power, except at extremely small sample sizes using liberal selection procedures. Although shown within the context of HIV prevention research, our conclusions have important implications for maximizing efficiency and robustness in randomized trials with small samples across disciplines.
doi:10.1214/13-AOAS679
PMCID: PMC3935423
PMID: 24587845
randomized trials; exact tests; covariate adjustment; model selection
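The randomization-inference idea above can be illustrated in its simplest form: re-randomize the treatment labels and compare the observed statistic against its re-randomization distribution. This numpy sketch omits the adaptive covariate adjustment and cluster structure of the actual study, and the function name is our own:

```python
import numpy as np

def randomization_pvalue(y, z, n_perm=5000, seed=0):
    """Two-sided randomization test for a treatment effect.
    y: (n,) outcomes; z: (n,) 0/1 treatment assignments. Re-permuting z
    mimics re-running the randomization under the sharp null of no effect."""
    rng = np.random.default_rng(seed)
    stat = lambda zz: y[zz == 1].mean() - y[zz == 0].mean()
    obs = stat(z)
    null = np.array([stat(rng.permutation(z)) for _ in range(n_perm)])
    # add-one correction keeps the p-value strictly positive and valid
    return (1 + np.sum(np.abs(null) >= abs(obs))) / (1 + n_perm)
```

In the covariate-adjusted version studied in the paper, the model-selection and fitting steps would be repeated inside each re-randomization so that the entire adaptive procedure is subjected to the permutation.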
Recent technological advances coupled with large sample sets have uncovered many factors underlying the genetic basis of traits and the predisposition to complex disease, but much is left to discover. A common thread to most genetic investigations is familial relationships. Close relatives can be identified from family records, and more distant relatives can be inferred from large panels of genetic markers. Unfortunately, these empirical estimates can be noisy, especially regarding distant relatives. We propose a new method for denoising genetically inferred relationship matrices by exploiting the underlying structure due to hierarchical groupings of correlated individuals. The approach, which we call Treelet Covariance Smoothing, employs a multiscale decomposition of covariance matrices to improve estimates of pairwise relationships. On both simulated and real data, we show that smoothing leads to better estimates of the relatedness amongst distantly related individuals. We illustrate our method with a large genome-wide association study and estimate the “heritability” of body mass index quite accurately. Traditionally, heritability, defined as the fraction of the total trait variance attributable to additive genetic effects, is estimated from samples of closely related individuals using random effects models. We show that by using smoothed relationship matrices we can estimate heritability using population-based samples. Finally, while our methods have been developed for refining genetic relationship matrices and improving estimates of heritability, they have much broader potential application in statistics, most notably for errors-in-variables random effects models and settings that require regularization of matrices with block or hierarchical structure.
doi:10.1214/12-AOAS598
PMCID: PMC3935431
PMID: 24587841
Covariance estimation; cryptic relatedness; genome-wide association; heritability; kinship
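Before any smoothing is applied, the genetically inferred relationship matrix itself is typically estimated from a marker panel. The sketch below shows a standard, unsmoothed estimator of this kind (numpy assumed; this is the conventional GRM construction, not the Treelet Covariance Smoothing procedure, which would then be applied to denoise the resulting matrix):

```python
import numpy as np

def genetic_relationship_matrix(G):
    """Empirical genetic relationship matrix from an (n, p) dosage matrix
    with entries in {0, 1, 2}: standardize each SNP by its estimated allele
    frequency and average the cross-products over SNPs. Off-diagonal noise
    in this estimate, largest for distant relatives, is what smoothing
    methods aim to reduce."""
    freq = G.mean(axis=0) / 2.0                       # allele frequencies
    Z = (G - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    return Z @ Z.T / G.shape[1]
```

For unrelated individuals the diagonal of this matrix is close to 1 and the off-diagonal entries fluctuate around 0 with variance of order 1/p, which is why large marker panels still yield noisy estimates of distant relatedness.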
Many investigations have used panel methods to study the relationships between fluctuations in economic activity and mortality. A broad consensus has emerged on the overall procyclical nature of mortality: perhaps counter-intuitively, mortality typically rises above its trend during expansions. This consensus has been tarnished by inconsistent reports on the specific age groups and mortality causes involved. We show that these inconsistencies result, in part, from the trend specifications used in previous panel models. Standard econometric panel analysis involves fitting regression models using ordinary least squares, employing standard errors which are robust to temporal autocorrelation. The model specifications include a fixed effect, and possibly a linear trend, for each time series in the panel. We propose alternative methodology based on nonlinear detrending. Applying our methodology to data for the 50 US states from 1980 to 2006, we obtain more precise and consistent results than previous studies. We find procyclical mortality in all age groups. We find clear procyclical mortality due to respiratory disease and traffic injuries. Predominantly procyclical cardiovascular disease mortality and countercyclical suicide are subject to substantial state-to-state variation. Neither cancer nor homicide shows a significant macroeconomic association.
doi:10.1214/12-AOAS624
PMCID: PMC3935433
PMID: 24587843
Estimates of the effects of treatment on cost from observational studies are subject to bias if there are unmeasured confounders. It is therefore advisable in practice to assess the potential magnitude of such biases. We derive a general adjustment formula for loglinear models of mean cost and explore special cases under plausible assumptions about the distribution of the unmeasured confounder. We assess the performance of the adjustment by simulation, in particular, examining robustness to a key assumption of conditional independence between the unmeasured and measured covariates given the treatment indicator. We apply our method to SEER-Medicare cost data for a stage II/III muscle-invasive bladder cancer cohort. We evaluate the costs for radical cystectomy vs. combined radiation/chemotherapy, and find that the significance of the treatment effect is sensitive to plausible unmeasured Bernoulli, Poisson and Gamma confounders.
doi:10.1214/13-AOAS665
PMCID: PMC3935434
PMID: 24587844
Sensitivity analysis; censored costs; SEER-Medicare
Tropospheric ozone is one of six criteria pollutants regulated by the US EPA, and has been linked to respiratory and cardiovascular endpoints and adverse effects on vegetation and ecosystems. Regional photochemical models have been developed to study the impacts of emission reductions on ozone levels. The standard approach is to run the deterministic model under new emission levels and attribute the change in ozone concentration to the emission control strategy. However, running the deterministic model requires substantial computing time, and this approach does not provide a measure of uncertainty for the change in ozone levels. Recently, a reduced form model (RFM) has been proposed to approximate the complex model as a simple function of a few relevant inputs. In this paper, we develop a new statistical approach to make full use of the RFM to study the effects of various control strategies on the probability and magnitude of extreme ozone events. We fuse the model output with monitoring data to calibrate the RFM by modeling the conditional distribution of monitoring data given the RFM using a combination of flexible semiparametric quantile regression for the center of the distribution where data are abundant and a parametric extreme value distribution for the tail where data are sparse. Selected parameters in the conditional distribution are allowed to vary by the RFM value and the spatial location. Also, due to the simplicity of the RFM, we are able to embed the RFM in our Bayesian hierarchical framework to obtain a full posterior for the model input parameters, and propagate this uncertainty to the estimation of the effects of the control strategies. We use the new framework to evaluate three potential control strategies, and find that reducing mobile-source emissions has a larger impact than reducing point-source emissions or a combination of several emission sources.
doi:10.1214/13-AOAS628
PMCID: PMC3935436
PMID: 24587842
Bayesian hierarchical modeling; Generalized Pareto distribution; Spatial data analysis; Statistical downscaling
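The parametric tail component described above is conventionally fitted by peaks-over-threshold. As a simplified, non-spatial illustration (scipy assumed; in the paper the tail parameters vary with the RFM value and spatial location, and everything sits inside a Bayesian hierarchy), an extreme-exceedance probability can be estimated as:

```python
import numpy as np
from scipy.stats import genpareto

def exceedance_probability(x, threshold, level):
    """Peaks-over-threshold estimate of P(X > level) for level > threshold:
    the empirical probability of exceeding the threshold times a generalized
    Pareto survival function fitted to the exceedances."""
    exc = x[x > threshold] - threshold
    zeta = exc.size / x.size                    # empirical P(X > threshold)
    shape, _, scale = genpareto.fit(exc, floc=0.0)
    return zeta * genpareto.sf(level - threshold, shape, loc=0.0, scale=scale)
```

This hybrid structure, empirical (or flexible semiparametric) where data are abundant and generalized Pareto where they are sparse, mirrors the center/tail split of the conditional distribution used in the paper.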
High resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation, and gene expression associated with a disease. An integrated genomic profiling approach measuring multiple omics data types simultaneously in the same set of biological samples would render an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), and fused lasso (Tibshirani et al., 2005) methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design (Fang and Wang, 1994) is used to seek “experimental” points scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compare our method to sparse singular value decomposition (SVD) and a penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic, and transcriptomic data for subtype analysis in breast and lung cancer data sets.
doi:10.1214/12-AOAS578
PMCID: PMC3935438
PMID: 24587839
Gaussian Graphical Models (GGMs) have been used to construct genetic regulatory networks, where regularization techniques are widely used since network inference usually falls into a high-dimension-low-sample-size scenario. Yet finding the right amount of regularization can be challenging, especially in an unsupervised setting where traditional methods such as BIC or cross-validation often do not work well. In this paper, we propose a new method, Bootstrap Inference for Network COnstruction (BINCO), to infer networks by directly controlling the false discovery rates (FDRs) of the selected edges. This method fits a mixture model for the distribution of edge selection frequencies to estimate the FDRs, where the selection frequencies are calculated via model aggregation. The method is applicable to a wide range of problems beyond network construction. When we applied it to build a gene regulatory network from breast cancer microarray expression data, we were able to identify high-confidence edges and well-connected hub genes that could potentially play important roles in understanding the underlying biological processes of breast cancer.
doi:10.1214/12-AOAS589
PMCID: PMC3930359
PMID: 24563684
high dimensional data; GGM; model aggregation; mixture model; FDR
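The edge selection frequencies at the heart of this approach can be illustrated with a simple aggregation loop. The sketch below uses scikit-learn's graphical lasso on bootstrap resamples as a stand-in for the paper's aggregation procedure; BINCO would then fit a mixture model to the resulting frequencies to estimate edge-level FDRs. Function and parameter choices here are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def edge_selection_frequencies(X, alpha=0.1, n_boot=40, seed=0):
    """Edge selection frequencies via model aggregation: refit a sparse
    Gaussian graphical model on bootstrap resamples of the (n, p) data
    matrix X and record how often each edge (nonzero off-diagonal entry of
    the estimated precision matrix) is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((p, p))
    for _ in range(n_boot):
        Xb = X[rng.integers(0, n, n)]           # bootstrap resample
        model = GraphicalLasso(alpha=alpha, max_iter=200).fit(Xb)
        freq += np.abs(model.precision_) > 1e-8
    freq /= n_boot
    np.fill_diagonal(freq, 0.0)                 # self-edges are not meaningful
    return freq
```

True edges tend to pile up near frequency 1 and null edges near a lower mode, which is the separation the mixture model exploits.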
In event-related functional magnetic resonance imaging (fMRI) data analysis, there is extensive interest in accurately and robustly estimating the hemodynamic response function (HRF) and its associated statistics (e.g., the magnitude and duration of the activation). Most methods to date are developed in the time domain and have utilized almost exclusively the temporal information of fMRI data without accounting for the spatial information. The aim of this paper is to develop a multiscale adaptive smoothing model (MASM) in the frequency domain that integrates the spatial and temporal information to adaptively and accurately estimate HRFs pertaining to each stimulus sequence across all voxels in a three-dimensional (3D) volume. We use two sets of simulation studies and a real data set to examine the finite sample performance of MASM in estimating HRFs. Our real and simulated data analyses confirm that MASM outperforms several other state-of-the-art methods, such as the smooth finite impulse response (sFIR) model.
PMCID: PMC3922314
PMID: 24533041
Frequency domain; Functional magnetic resonance imaging; Weighted least squares estimation; Multiscale adaptive smoothing model
Diffusion tensor imaging provides important information on tissue structure and orientation of fiber tracts in brain white matter in vivo. It results in diffusion tensors, which are 3×3 symmetric positive definite (SPD) matrices, along fiber bundles. This paper develops a functional data analysis framework to model diffusion tensors along fiber tracts as functional data in a Riemannian manifold with a set of covariates of interest, such as age and gender. We propose a statistical model with varying coefficient functions to characterize the dynamic association between functional SPD matrix-valued responses and covariates. We calculate weighted least squares estimators of the varying coefficient functions for the Log-Euclidean metric in the space of SPD matrices. We also develop a global test statistic to test specific hypotheses about these coefficient functions and construct their simultaneous confidence bands. Simulated data are further used to examine the finite sample performance of the estimated varying coefficient functions. We apply our model to study potential gender differences and find a statistically significant aspect of the development of diffusion tensors along the right internal capsule tract in a clinical study of neurodevelopment.
doi:10.1214/12-AOAS574
PMCID: PMC3922407
PMID: 24533040
Confidence band; Diffusion tensor imaging; Global test statistic; Varying coefficient model; Log-Euclidean metric; Symmetric positive definite matrix
For many neurological disorders, prediction of disease state is an important clinical aim. Neuroimaging provides detailed information about brain structure and function from which such predictions may be statistically derived. A multinomial logit model with Gaussian process priors is proposed to: (i) predict disease state based on whole-brain neuroimaging data and (ii) analyze the relative informativeness of different image modalities and brain regions. Advanced Markov chain Monte Carlo methods are employed to perform posterior inference over the model. This paper reports a statistical assessment of multiple neuroimaging modalities applied to the discrimination of three Parkinsonian neurological disorders from one another and healthy controls, showing promising predictive performance of disease states when compared to nonprobabilistic classifiers based on multiple modalities. The statistical analysis also quantifies the relative importance of different neuroimaging measures and brain regions in discriminating between these diseases and suggests that for prediction there is little benefit in acquiring multiple neuroimaging sequences. Finally, the predictive capability of different brain regions is found to be in accordance with the regional pathology of the diseases as reported in the clinical literature.
PMCID: PMC3918662
PMID: 24523851
Multi-modality multinomial logit model; Gaussian process; hierarchical model; high-dimensional data; Markov chain Monte Carlo; Parkinsonian diseases; prediction of disease state
In this paper, we propose a new method, remMap (REgularized Multivariate regression for identifying MAster Predictors), for fitting multivariate response regression models under the high-dimension-low-sample-size setting. remMap is motivated by investigating the regulatory relationships among different biological molecules based on multiple types of high-dimensional genomic data. In particular, we are interested in studying the influence of DNA copy number alterations on RNA transcript levels. For this purpose, we model the dependence of the RNA expression levels on DNA copy numbers through multivariate linear regressions and utilize proper regularization to deal with the high dimensionality as well as to incorporate desired network structures. Criteria for selecting the tuning parameters are also discussed. The performance of the proposed method is illustrated through extensive simulation studies. Finally, remMap is applied to a breast cancer study, in which genome-wide RNA transcript levels and DNA copy numbers were measured for 172 tumor samples. We identify a trans-hub region in cytoband 17q12–q21, whose amplification influences the RNA expression levels of more than 30 unlinked genes. These findings may lead to a better understanding of breast cancer pathology.
doi:10.1214/09-AOAS271SUPP
PMCID: PMC3905690
PMID: 24489618
sparse regression; MAP (MAster Predictor) penalty; DNA copy number alteration; RNA transcript level; v-fold cross validation
Multivariate time series (MTS) data such as time course gene expression data in genomics are often collected to study the dynamic nature of the systems. These data provide important information about the causal dependency among a set of random variables. In this paper, we introduce a computationally efficient algorithm to learn directed acyclic graphs (DAGs) based on MTS data, focusing on learning the local structure of a given target variable. Our algorithm is based on learning all parents (P), all children (C) and some descendants (D) (PCD) iteratively, utilizing the time order of the variables to orient the edges. This time series PCD-PCD algorithm (tsPCD-PCD) extends the previous PCD-PCD algorithm to dependent observations and utilizes composite likelihood ratio tests (CLRTs) for testing the conditional independence. We present the asymptotic distribution of the CLRT statistic and show that tsPCD-PCD is guaranteed to recover the true DAG structure when the faithfulness condition holds and the tests correctly reject the null hypotheses. Simulation studies show that the CLRTs are valid and perform well even when the sample sizes are small. In addition, the tsPCD-PCD algorithm outperforms the PCD-PCD algorithm in recovering the local graph structures. We illustrate the algorithm by analyzing time course gene expression data related to mouse T-cell activation.
PMCID: PMC3898602
PMID: 24465291
Bayesian network; Composite likelihood ratio test; Genetic network; PCD-PCD algorithm
Motivated by the increasing use of and rapid changes in array technologies, we consider the prediction problem of fitting a linear regression relating a continuous outcome Y to a large number of covariates X, e.g., measurements from current, state-of-the-art technology. For most of the samples, only the outcome Y and surrogate covariates, W, are available. These surrogates may be data from prior studies using older technologies. Owing to the dimension of the problem and the large fraction of missing information, a critical issue is appropriate shrinkage of model parameters for an optimal bias-variance tradeoff. We discuss a variety of fully Bayesian and Empirical Bayes algorithms which account for uncertainty in the missing data and adaptively shrink parameter estimates for superior prediction. These methods are evaluated via a comprehensive simulation study. In addition, we apply our methods to a lung cancer dataset, predicting survival time (Y) using qRT-PCR (X) and microarray (W) measurements.
doi:10.1214/13-AOAS668
PMCID: PMC3891514
PMID: 24436727
High-dimensional data; Markov chain Monte Carlo; missing data; measurement error; shrinkage
Studies of smoking behavior commonly use the time-line follow-back (TLFB) method, or periodic retrospective recall, to gather data on daily cigarette consumption. TLFB is considered adequate for identifying periods of abstinence and lapse but not for measurement of daily cigarette consumption, owing to substantial recall and digit-preference biases. With the development of the hand-held electronic diary (ED), it has become possible to collect cigarette consumption data using ecological momentary assessment (EMA), or the instantaneous recording of each cigarette as it is smoked. EMA data, because they do not rely on retrospective recall, are thought to more accurately measure cigarette consumption. In this article we present an analysis of consumption data collected simultaneously by both methods from 236 active smokers in the pre-quit phase of a smoking cessation study. We define a statistical model that describes the genesis of the TLFB records as a two-stage process of mis-remembering and rounding, including fixed and random effects at each stage. We use Bayesian methods to estimate the model, and we evaluate its adequacy by studying histograms of imputed values of the latent remembered cigarette count. Our analysis suggests that both mis-remembering and heaping contribute substantially to the distortion of self-reported cigarette counts. Higher nicotine dependence, white ethnicity and male sex are associated with greater remembered smoking given the EMA count. The model is potentially useful in other applications where it is desirable to understand the process by which subjects remember and report true observations.
doi:10.1214/12-AOAS557
PMCID: PMC3889075
PMID: 24432181
Bayesian analysis; heaping; latent variables; longitudinal data; smoking cessation
We develop a Bayesian model for the alignment of two point configurations under the full similarity transformations of rotation, translation and scaling. Other work in this area has concentrated on rigid body transformations, where scale information is preserved, motivated by problems involving molecular data; this is known as form analysis. We concentrate on a Bayesian formulation for statistical shape analysis. We generalize the model introduced by Green and Mardia for the pairwise alignment of two unlabeled configurations to full similarity transformations by introducing a scaling factor to the model. The generalization is not straightforward, since the model needs to be reformulated to give good performance when scaling is included. We illustrate our method on the alignment of rat growth profiles and a novel application to the alignment of protein domains. Here, scaling is applied to secondary structure elements when comparing protein folds; additionally, we find that one global scaling factor is not in general sufficient to model these data and, hence, we develop a model in which multiple scale factors can be included to handle different scalings of shape components.
doi:10.1214/12-AOAS615
PMCID: PMC3774796
PMID: 24052809
Morphometrics; protein bioinformatics; similarity transformations; statistical shape analysis; unlabeled shape analysis
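For labeled configurations, the similarity alignment (rotation, translation, and a single global scale) has a closed form via ordinary Procrustes analysis, a classical non-Bayesian analogue of the unlabeled model described above. A numpy sketch, with the function name our own:

```python
import numpy as np

def similarity_align(X, Y):
    """Align configuration Y to X under rotation, translation and one
    global scale (ordinary Procrustes analysis): minimize
    ||Xc - beta * Yc @ R||_F over rotations R and scale beta, where Xc, Yc
    are the centered configurations."""
    muX, muY = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - muX, Y - muY
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    if np.linalg.det(U @ Vt) < 0:        # enforce a proper rotation
        U[:, -1] *= -1
        s[-1] *= -1
    R = U @ Vt
    beta = s.sum() / (Yc ** 2).sum()     # optimal global scale
    return beta * Yc @ R + muX
```

The Bayesian model in the paper goes further: it handles unlabeled points (unknown correspondences) and, in the protein application, multiple scale factors for different shape components rather than this single global beta.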
Data used to assess acute health effects from air pollution typically have good temporal but poor spatial resolution or the opposite. A modified longitudinal model was developed that sought to improve resolution in both domains by bringing together data from three sources to estimate daily levels of nitrogen dioxide (NO2) at a geographic location. Monthly NO2 measurements at 316 sites were made available by the Study of Traffic, Air quality and Respiratory health (STAR). Four US Environmental Protection Agency monitoring stations have hourly measurements of NO2. Finally, the Connecticut Department of Transportation provides data on traffic density on major roadways, a primary contributor to NO2 pollution. Inclusion of a traffic variable improved performance of the model, and it provides a method for estimating exposure at points that do not have direct measurements of the outcome. This approach can be used to estimate daily variation in levels of NO2 over a region.
doi:10.1214/13-AOAS642
PMCID: PMC3856232
PMID: 24327824
Bayesian model; longitudinal model; nitrogen dioxide; EPA; air pollution
We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal clinical trial of low-income women with depression where nonignorably missing data were a concern. We show that different assumptions regarding the missing data mechanism can have a substantial impact on inferences. Our method provides a simple approach for formalizing subjective notions regarding nonresponse so that they can be easily stated, communicated, and compared.
doi:10.1214/12-AOAS555
PMCID: PMC3596844
PMID: 23503984
nonignorable; NMAR; MNAR; not missing at random; missing not at random
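The abstract above combines estimates across imputation models with nested multiple imputation rules. A minimal sketch of those combining rules, transcribed from the nested-MI literature (Shen, 2000; Rubin, 2003) rather than from this paper itself; `nested_mi_pool` and its inputs are illustrative names:

```python
import numpy as np

def nested_mi_pool(Q, U):
    """Pool point estimates Q[m, n] and their variances U[m, n] from M
    imputation models (nests) x N imputations per model, using nested
    multiple imputation combining rules (a hedged sketch)."""
    M, N = Q.shape
    Q_bar = Q.mean()                    # overall pooled point estimate
    Q_m = Q.mean(axis=1)                # per-model (per-nest) means
    U_bar = U.mean()                    # average within-imputation variance
    B = np.sum((Q_m - Q_bar) ** 2) / (M - 1)              # between-model variance
    W = np.sum((Q - Q_m[:, None]) ** 2) / (M * (N - 1))   # within-model variance
    T = U_bar + (1 + 1 / M) * B + (1 - 1 / N) * W         # total variance
    return Q_bar, T
```

The between-model component B is what captures uncertainty about the missing data mechanism: if the imputation models disagree, B inflates the total variance T.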
Two different approaches to analysis of data from diagnostic biomarker studies are commonly employed. Logistic regression is used to fit models for probability of disease given marker values, while ROC curves and risk distributions are used to evaluate classification performance. In this paper we present a method that simultaneously accomplishes both tasks. The key step is to standardize markers relative to the non-diseased population before including them in the logistic regression model. Among the advantages of this method are: (i) ensuring that results from regression and performance assessments are consistent with each other; (ii) allowing covariate adjustment and covariate effects on ROC curves to be handled in a familiar way; and (iii) providing a mechanism to incorporate important assumptions about structure in the ROC curve into the fitted risk model. We develop the method in detail for the problem of combining biomarker datasets derived from multiple studies, populations or biomarker measurement platforms, when ROC curves are similar across data sources. The methods are applicable to both cohort and case-control sampling designs. The dataset motivating this application concerns Prostate Cancer Antigen 3 (PCA3) for diagnosis of prostate cancer in patients with or without previous negative biopsy, where the ROC curves for PCA3 are found to be the same in the two populations. Constrained maximum likelihood and empirical likelihood estimators are derived. The estimators are compared in simulation studies and the methods are illustrated with the PCA3 dataset.
doi:10.1214/13-AOAS634SUPP
PMCID: PMC3817965
PMID: 24204441
constrained likelihood; empirical likelihood; logistic regression; predictiveness curve; ROC curve
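The key step described above, standardizing a marker to the non-diseased population before regression, can be sketched as follows. This is a minimal illustration using the empirical control CDF and an off-the-shelf logistic fit; the paper's constrained and empirical likelihood estimators are more involved, and the simulated data here are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def standardize_to_controls(y, y_controls):
    """Express each marker value as its percentile in the non-diseased
    (control) population: the empirical control CDF evaluated at y."""
    y_controls = np.sort(np.asarray(y_controls))
    return np.searchsorted(y_controls, y, side="right") / len(y_controls)

# Illustrative data: controls ~ N(0,1), cases shifted upward.
rng = np.random.default_rng(0)
controls = rng.normal(0.0, 1.0, 500)
cases = rng.normal(1.5, 1.0, 500)
y = np.concatenate([controls, cases])
d = np.concatenate([np.zeros(500), np.ones(500)])

# Standardize, then fit the risk model on the standardized marker.
pv = standardize_to_controls(y, controls)
model = LogisticRegression().fit(pv.reshape(-1, 1), d)
```

Because the regressor is a percentile of the control distribution, the fitted risk model and ROC-type performance summaries are tied to the same standardized scale, which is the consistency property the abstract highlights.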
Despite rapid advances in experimental cell biology, the in vivo behavior of hematopoietic stem cells (HSC) cannot be directly observed and measured. Previously we modeled feline hematopoiesis using a two-compartment hidden Markov process that had birth and emigration events in the first compartment. Here we perform Bayesian statistical inference on models which contain two additional events in the first compartment in order to determine if HSC fate decisions are linked to cell division or occur independently. A Pareto Optimal Model Assessment approach is used to cross-check the estimates from Bayesian inference. Our results show that HSC must divide symmetrically (i.e., produce two HSC daughter cells) in order to maintain hematopoiesis. We then demonstrate that the augmented model that adds asymmetric division events provides a better fit to the competitive transplantation data, and we thus provide evidence that HSC fate determination in vivo occurs both in association with cell division and at a separate point in time. Last, we show that assuming each cat has a unique set of parameters leads to either a significant decrease or a nonsignificant increase in model fit, suggesting that the kinetic parameters for HSC are not unique attributes of individual animals, but shared within a species.
doi:10.1214/09-AOAS269
PMCID: PMC3783006
PMID: 24078859
Stochastic two-compartment model; hidden Markov models; reversible jump MCMC; hematopoiesis; stem cell; asymmetric division
DNA copy number variation (CNV) has recently gained considerable interest as a source of genetic variation that likely influences phenotypic differences. Many statistical and computational methods have been proposed and applied to detect CNVs based on data generated by genome analysis platforms. However, most algorithms are computationally intensive, with complexity at least O(n²), where n is the number of probes in the experiments. Moreover, the theoretical properties of those existing methods are not well understood. A faster and better characterized algorithm is desirable for ultra-high-throughput data. In this study, we propose the Screening and Ranking algorithm (SaRa), which can detect CNVs fast and accurately with complexity down to O(n). In addition, we characterize theoretical properties and present numerical analysis for our algorithm.
doi:10.1214/12-AOAS539SUPP
PMCID: PMC3779928
PMID: 24069112
Change-point detection; copy number variations; high dimensional data; screening and ranking algorithm
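The screen-then-rank idea behind SaRa can be illustrated with a local mean-difference diagnostic computed in O(n) from cumulative sums. This is a hedged sketch of the general approach, not the paper's exact statistic; the bandwidth `h` and `threshold` below are illustrative choices, not recommended values:

```python
import numpy as np

def sara(y, h=10, threshold=1.0):
    """Sketch of screening-and-ranking change-point detection.
    Screening: compute |mean of the h points after i - mean of the h points
    before i| for every position, in O(n) via cumulative sums, and keep
    positions that exceed a threshold and are local maxima.
    Ranking: return candidates ordered by diagnostic strength."""
    y = np.asarray(y, float)
    n = len(y)
    cs = np.concatenate([[0.0], np.cumsum(y)])
    D = np.zeros(n)
    for i in range(h, n - h):
        left = (cs[i] - cs[i - h]) / h        # mean of y[i-h:i]
        right = (cs[i + h] - cs[i]) / h       # mean of y[i:i+h]
        D[i] = abs(right - left)
    # screening step: local maxima of the diagnostic above the threshold
    candidates = [i for i in range(h, n - h)
                  if D[i] >= threshold and D[i] == D[max(0, i - h):i + h + 1].max()]
    # ranking step: strongest candidate change-points first
    return sorted(candidates, key=lambda i: -D[i])
```

Each position is touched a constant number of times, which is what brings the complexity down to O(n) compared with segmentation methods that compare all pairs of positions.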
When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects’ identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from this model. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.
doi:10.1214/11-AOAS506
PMCID: PMC3753824
PMID: 23990852
Confidentiality; disclosure; dissemination; spatial; synthetic; tree
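The regression-tree tool described above can be sketched as: fit a tree of (latitude, longitude) on record attributes, then replace each record's location with one drawn from the same leaf. This is an illustrative simplification, assuming resampling within leaves stands in for drawing from the fitted leaf distribution; the function name and tuning values are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def synthesize_locations(attrs, lat, lon, rng):
    """Sketch of tree-based geographic synthesis for disclosure protection.
    A regression tree groups records with similar attributes; each record's
    (lat, lon) is then replaced by a coordinate pair drawn at random from
    the records sharing its leaf."""
    coords = np.column_stack([lat, lon])
    tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
    tree.fit(attrs, coords)
    leaves = tree.apply(attrs)              # leaf id for each record
    synthetic = np.empty_like(coords)
    for leaf in np.unique(leaves):
        idx = np.where(leaves == leaf)[0]
        donor = rng.choice(idx, size=len(idx), replace=True)  # resample within leaf
        synthetic[idx] = coords[donor]
    return synthetic
```

A larger `min_samples_leaf` gives coarser leaves and hence stronger geographic masking at the cost of analytic fidelity, which mirrors the risk-utility trade-off the abstract discusses.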