Background
Some environmental chemical exposures are lipophilic and must be adjusted for serum lipid levels before data analysis. Various strategies attempt to account for this problem, but all have drawbacks. To address these concerns, we propose a new method that uses Box-Cox transformations and a simple Bayesian hierarchical model to adjust for lipophilic chemical exposures.
Methods
We compared our Box-Cox method to existing methods. We ran simulation studies in which increasing levels of lipid-adjusted chemical exposure did and did not increase the odds of having a disease, and we examined both single-exposure and multiple-exposure cases. We also analyzed an epidemiology dataset that examined the effects of various chemical exposures on the risk of birth defects.
Results
Compared with existing methods, our Box-Cox method produced unbiased estimates, good coverage, similar power, and lower type-I error rates. This was the case in both single- and multiple-exposure simulation studies. Results from analysis of the birth-defect data differed from results using existing methods.
Conclusion
Our Box-Cox method is a novel and intuitive way to account for the lipophilic nature of certain chemical exposures. It addresses some of the problems with existing methods, is easily extendable to multiple exposures, and can be used in any analyses that involve concomitant variables.
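The abstract does not specify the form of the transformation or the hierarchical model, but the standard Box-Cox family it names can be sketched directly. Below is a minimal illustration of the transform itself; the example exposure values and the choice of λ are purely illustrative, not taken from the paper.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform: (x**lam - 1)/lam for lam != 0, log(x) at lam == 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

# Illustrative positive exposure measurements (hypothetical values)
exposures = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(exposures, 0.0))   # lam = 0 reduces to the natural log
print(box_cox(exposures, 0.5))
```

In practice λ would be estimated (e.g., given a prior within the Bayesian hierarchical model the abstract describes) rather than fixed by hand.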
doi:10.1097/EDE.0b013e3182a671e4
PMCID: PMC3812826
PMID: 24051893
We consider the problem of robust Bayesian inference on the mean regression function allowing the residual density to change flexibly with predictors. The proposed class of models is based on a Gaussian process prior for the mean regression function and mixtures of Gaussians for the collection of residual densities indexed by predictors. Initially considering the homoscedastic case, we propose priors for the residual density based on probit stick-breaking (PSB) scale mixtures and symmetrized PSB (sPSB) location-scale mixtures. Both priors restrict the residual density to be symmetric about zero, with the sPSB prior more flexible in allowing multimodal densities. We provide sufficient conditions to ensure strong posterior consistency in estimating the regression function under the sPSB prior, generalizing existing theory focused on parametric residual distributions. The PSB and sPSB priors are generalized to allow residual densities to change nonparametrically with predictors through incorporating Gaussian processes in the stick-breaking components. This leads to a robust Bayesian regression procedure that automatically down-weights outliers and influential observations in a locally-adaptive manner. Posterior computation relies on an efficient data augmentation exact block Gibbs sampler. The methods are illustrated using simulated and real data applications.
doi:10.1007/s10463-013-0415-z
PMCID: PMC3898864
PMID: 24465053
Data augmentation; exact block Gibbs sampler; Gaussian process; nonparametric regression; outliers; symmetrized probit stick breaking process
In many applications, it is of interest to study trends over time in relationships among categorical variables, such as age group, ethnicity, religious affiliation, political party and preference for particular policies. At each time point, a sample of individuals provide responses to a set of questions, with different individuals sampled at each time. In such settings, there tends to be abundant missing data and the variables being measured may change over time. At each time point, one obtains a large sparse contingency table, with the number of cells often much larger than the number of individuals being surveyed. To borrow information across time in modeling large sparse contingency tables, we propose a Bayesian autoregressive tensor factorization approach. The proposed model relies on a probabilistic Parafac factorization of the joint pmf characterizing the categorical data distribution at each time point, with autocorrelation included across times. Efficient computational methods are developed relying on MCMC. The methods are evaluated through simulation examples and applied to social survey data.
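The core object in this abstract, a Parafac factorization of a joint pmf, can be sketched in a few lines: each cell probability is a weighted sum over latent classes of products of per-variable category probabilities. The dimensions, random base measures, and variable names below are illustrative assumptions, not the paper's specification (which adds autocorrelation across time points).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: k latent classes, p categorical variables,
# d categories per variable (equal across variables for simplicity)
k, p, d = 3, 4, 5

nu = rng.dirichlet(np.ones(k))                 # class weights, sum to 1
psi = rng.dirichlet(np.ones(d), size=(k, p))   # psi[h, j] is a pmf over d categories

def joint_pmf(y, nu, psi):
    """P(y_1,...,y_p) = sum_h nu_h * prod_j psi[h, j, y_j] (Parafac form)."""
    probs = psi[:, np.arange(len(y)), y]        # shape (k, p)
    return float(nu @ probs.prod(axis=1))

# The factorized pmf sums to 1 over all d**p contingency-table cells
total = sum(joint_pmf(np.array(cell), nu, psi)
            for cell in np.ndindex(*([d] * p)))
print(round(total, 6))
```

The point of the factorization is exactly what the abstract exploits: a table with d**p cells is parameterized by only k*(1 + p*(d-1)) free quantities.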
doi:10.1080/01621459.2013.823866
PMCID: PMC3904485
PMID: 24482548
Dynamic model; Multivariate categorical data; Nonparametric Bayes; Panel data; Parafac; Probabilistic tensor factorization; Stick-breaking
Mixtures provide a useful approach for relaxing parametric assumptions. Discrete mixture models induce clusters, typically with the same cluster allocation for each parameter in multivariate cases. As a more flexible approach that facilitates sparse nonparametric modeling of multivariate random effects distributions, this article proposes a kernel partition process (KPP) in which the cluster allocation varies for different parameters. The KPP is shown to be the driving measure for a multivariate ordered Chinese restaurant process that induces a highly-flexible dependence structure in local clustering. This structure allows the relative locations of the random effects to inform the clustering process, with spatially-proximal random effects likely to be assigned the same cluster index. An exact block Gibbs sampler is developed for posterior computation, avoiding truncation of the infinite measure. The methods are applied to hormone curve data, and a dependent KPP is proposed for classification from functional predictors.
PMCID: PMC3903418
PMID: 24478563
Chinese restaurant process; Dirichlet process; discriminant analysis; local clustering; longitudinal data; nonparametric Bayes; random effects
We propose a generalized double Pareto prior for Bayesian shrinkage estimation and inferences in linear models. The prior can be obtained via a scale mixture of Laplace or normal distributions, forming a bridge between the Laplace and Normal-Jeffreys’ priors. While it has a spike at zero like the Laplace density, it also has a Student’s t-like tail behavior. Bayesian computation is straightforward via a simple Gibbs sampling algorithm. We investigate the properties of the maximum a posteriori estimator, as sparse estimation plays an important role in many problems, reveal connections with some well-established regularization procedures, and show some asymptotic results. The performance of the prior is tested through simulations and an application.
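The scale-mixture representation named in this abstract lends itself to a short sampling sketch: normal coefficients with exponential mixing on the variance and a gamma hyperprior. The hierarchy below follows that scale-mixture form, but the specific hyperparameter values and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gdp(n, alpha=3.0, eta=3.0, rng=rng):
    """Draw from a generalized double Pareto prior via its normal scale
    mixture: lambda ~ Gamma(alpha, rate=eta), tau ~ Exp(rate=lambda^2/2),
    beta ~ N(0, tau). Hyperparameters here are illustrative."""
    lam = rng.gamma(alpha, 1.0 / eta, size=n)   # numpy's gamma takes scale = 1/rate
    tau = rng.exponential(2.0 / lam**2)         # exponential with rate lam^2 / 2
    return rng.normal(0.0, np.sqrt(tau))

draws = sample_gdp(100_000)
print(float(draws.std()))  # spread under the illustrative hyperparameters
```

Because each draw is conditionally Gaussian given (lambda, tau), the same augmentation yields the simple Gibbs sampler the abstract mentions.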
PMCID: PMC3903426
PMID: 24478567
Heavy tails; high-dimensional data; LASSO; maximum a posteriori estimation; relevance vector machine; robust prior; shrinkage estimation
Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using pre-defined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, while the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression (ASPR), which relies on a two component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional predictors, as occur in studies with a large number of gene by environment interactions, through use of a flexible nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. The methods are evaluated using simulation studies and applied to a genetic epidemiology study of pregnancy outcomes.
doi:10.1002/sim.5520
PMCID: PMC3712761
PMID: 22825854
Bayesian; Genetic epidemiology; Latent class model; Logistic regression; Mixture model; Model averaging; Nonparametric; Variable selection
In parametric hierarchical models, it is standard practice to place mean and variance constraints on the latent variable distributions for the sake of identifiability and interpretability. Because incorporation of such constraints is challenging in semiparametric models that allow latent variable distributions to be unknown, previous methods either constrain the median or avoid constraints. In this article, we propose a centered stick-breaking process (CSBP), which induces mean and variance constraints on an unknown distribution in a hierarchical model. This is accomplished by viewing an unconstrained stick-breaking process as a parameter-expanded version of a CSBP. An efficient blocked Gibbs sampler is developed for approximate posterior computation. The methods are illustrated through a simulated example and an epidemiologic application.
PMCID: PMC3869464
PMID: 24363478
Dirichlet process; Latent variables; Moment constraints; Nonparametric Bayes; Parameter expansion; Random effects
In studies where data are generated from multiple locations or sources it is common for there to exist observations that are quite unlike the majority. Motivated by the application of establishing a reference value in an inter-laboratory setting when outlying labs are present, we propose a local contamination model that is able to accommodate unusual multivariate realizations in a flexible way. The proposed method models the process level of a hierarchical model using a mixture with a parametric component and a possibly nonparametric contamination. Much of the flexibility in the methodology is achieved by allowing varying random subsets of the elements in the lab-specific mean vectors to be allocated to the contamination component. Computational methods are developed and the methodology is compared to three other possible approaches using a simulation study. We apply the proposed method to a NIST/NOAA sponsored inter-laboratory study which motivated the methodological development.
doi:10.1198/TECH.2011.10041
PMCID: PMC3869467
PMID: 24363465
Bayesian robustness; Component-wise classification; Inter-laboratory studies; Mixtures
We describe a novel class of Bayesian nonparametric priors based on stick-breaking constructions where the weights of the process are constructed as probit transformations of normal random variables. We show that these priors are extremely flexible, allowing us to generate a great variety of models while preserving computational simplicity. Particular emphasis is placed on the construction of rich temporal and spatial processes, which are applied to two problems in finance and ecology.
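The weight construction this abstract describes, probit transformations of normal variables plugged into a stick-breaking scheme, is simple enough to sketch. The truncation level K and the mean mu below are illustrative choices, not values from the paper.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def psb_weights(mu=0.0, K=25, rng=rng):
    """Truncated probit stick-breaking weights:
    w_k = Phi(z_k) * prod_{l<k} (1 - Phi(z_l)), with z_k ~ N(mu, 1).
    The final break is set to 1 so the truncated weights sum to exactly 1.
    (K and mu are illustrative.)"""
    z = rng.normal(mu, 1.0, size=K)
    v = np.array([Phi(zk) for zk in z])
    v[-1] = 1.0                       # absorb the remaining stick at truncation
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return w

w = psb_weights()
print(round(float(w.sum()), 12))
```

Temporal or spatial dependence of the kind the abstract emphasizes enters by replacing the i.i.d. normals z_k with Gaussian processes over time or space, so that nearby locations share similar weights.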
doi:10.1214/11-BA605
PMCID: PMC3865248
PMID: 24358072
Nonparametric Bayes; Random Probability Measure; Stick-breaking Prior; Mixture Model; Data Augmentation; Spatial Data; Time Series
Summary
In studies involving functional data, it is commonly of interest to model the impact of predictors on the distribution of the curves, allowing flexible effects on not only the mean curve but also the distribution about the mean. Characterizing the curve for each subject as a linear combination of a high-dimensional set of potential basis functions, we place a sparse latent factor regression model on the basis coefficients. We induce basis selection by choosing a shrinkage prior that allows many of the loadings to be close to zero. The number of latent factors is treated as unknown through a highly efficient, adaptive-blocked Gibbs sampler. Predictors are included at the latent variable level, while allowing different predictors to impact different latent factors. This model induces a framework for functional response regression in which the distribution of the curves is allowed to change flexibly with predictors. The performance is assessed through simulation studies and the methods are applied to data on blood pressure trajectories during pregnancy.
doi:10.1111/j.1541-0420.2012.01788.x
PMCID: PMC3530663
PMID: 23005895
Factor analysis; Functional principal components analysis; Latent trajectory models; Random effects; Sparse data
There has been increasing interest in applying Bayesian nonparametric methods in large samples and high dimensions. As Markov chain Monte Carlo (MCMC) algorithms are often infeasible, there is a pressing need for much faster algorithms. This article proposes a fast approach for inference in Dirichlet process mixture (DPM) models. Viewing the partitioning of subjects into clusters as a model selection problem, we propose a sequential greedy search algorithm for selecting the partition. Then, when conjugate priors are chosen, the resulting posterior, conditional on the selected partition, is available in closed form. This approach allows testing of parametric models versus nonparametric alternatives based on Bayes factors. We evaluate the approach using simulation studies and compare it with four other fast nonparametric methods in the literature. We apply the proposed approach to three datasets, including one from a large epidemiologic study. Matlab codes for the simulation and data analyses using the proposed approach are available online in the supplemental materials.
doi:10.1198/jcgs.2010.07081
PMCID: PMC3812957
PMID: 24187479
Clustering; Density estimation; Efficient computation; Large samples; Nonparametric Bayes; Pólya urn scheme; Sequential analysis
We develop a model for stochastic processes with random marginal distributions. Our model relies on a stick-breaking construction for the marginal distribution of the process, and introduces dependence across locations by using a latent Gaussian copula model as the mechanism for selecting the atoms. The resulting latent stick-breaking process (LaSBP) induces a random partition of the index space, with points closer in space having a higher probability of being in the same cluster. We develop an efficient and straightforward Markov chain Monte Carlo (MCMC) algorithm for computation and discuss applications in financial econometrics and ecology. This article has supplementary material online.
doi:10.1198/jasa.2010.tm08241
PMCID: PMC3614377
PMID: 23559690
Nonparametric Bayes; Option pricing; Point-referenced counts; Random probability measure; Random stochastic processes
We propose a semiparametric Bayesian local functional model (BFM) for the analysis of multiple diffusion properties (e.g., fractional anisotropy) along white matter fiber bundles with a set of covariates of interest, such as age and gender. BFM accounts for heterogeneity in the shape of the fiber bundle diffusion properties among subjects, while allowing the impact of the covariates to vary across subjects. A nonparametric Bayesian LPP2 prior facilitates global and local borrowings of information among subjects, while an infinite factor model flexibly represents low-dimensional structure. Local hypothesis testing and credible bands are developed to identify fiber segments along which multiple diffusion properties are significantly associated with covariates of interest, while controlling for multiple comparisons. Moreover, BFM naturally groups subjects into more homogeneous clusters. Posterior computation proceeds via an efficient Markov chain Monte Carlo algorithm. A simulation study is performed to evaluate the finite sample performance of BFM. We apply BFM to investigate the development of white matter diffusivities along the splenium of the corpus callosum tract and the right internal capsule tract in a clinical study of neurodevelopment in newborn infants.
doi:10.1016/j.neuroimage.2012.06.027
PMCID: PMC3677778
PMID: 22732565
Confidence band; Diffusion tensor imaging; Fiber bundle; Infinite factor model; Local hypothesis; LPP2; Markov chain Monte Carlo
Factor analytic models are widely used in social sciences. These models have also proven useful for sparse modeling of the covariance structure in multidimensional data. Normal prior distributions for factor loadings and inverse gamma prior distributions for residual variances are a popular choice because of their conditionally conjugate form. However, such prior distributions require elicitation of many hyperparameters and tend to result in poorly behaved Gibbs samplers. In addition, one must choose an informative specification, as high variance prior distributions face problems due to impropriety of the posterior distribution. This article proposes a default, heavy-tailed prior distribution specification, which is induced through parameter expansion while facilitating efficient posterior computation. We also develop an approach to allow uncertainty in the number of factors. The methods are illustrated through simulated examples and epidemiology and toxicology applications. Data sets and computer code used in this article are available online.
doi:10.1198/jcgs.2009.07145
PMCID: PMC3755784
PMID: 23997568
Bayes factor; Covariance structure; Latent variables; Parameter expansion; Selection of factors; Slow mixing
A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.
PMCID: PMC3754453
PMID: 23990757
classification; incomplete data; expert; Dirichlet process; variational Bayesian; multitask learning
Gaussian factor models have proven widely useful for parsimoniously characterizing dependence in multivariate data. There is a rich literature on their extension to mixed categorical and continuous variables, using latent Gaussian variables or through generalized latent trait models accommodating measurements in the exponential family. However, when generalizing to non-Gaussian measured variables the latent variables typically influence both the dependence structure and the form of the marginal distributions, complicating interpretation and introducing artifacts. To address this problem we propose a novel class of Bayesian Gaussian copula factor models which decouple the latent factors from the marginal distributions. A semiparametric specification for the marginals based on the extended rank likelihood yields straightforward implementation and substantial computational gains. We provide new theoretical and empirical justifications for using this likelihood in Bayesian inference. We propose new default priors for the factor loadings and develop efficient parameter-expanded Gibbs sampling for posterior computation. The methods are evaluated through simulations and applied to a dataset in political science. The models in this paper are implemented in the R package bfa.
doi:10.1080/01621459.2012.762328
PMCID: PMC3753118
PMID: 23990691
Factor analysis; Latent variables; Semiparametric; Extended rank likelihood; Parameter expansion; High dimensional
This article considers a broad class of kernel mixture density models on compact metric spaces and manifolds. Following a Bayesian approach with a nonparametric prior on the location mixing distribution, sufficient conditions are obtained on the kernel, prior and the underlying space for strong posterior consistency at any continuous density. The prior is also allowed to depend on the sample size n and sufficient conditions are obtained for weak and strong consistency. These conditions are verified on compact Euclidean spaces using multivariate Gaussian kernels, on the hypersphere using a von Mises-Fisher kernel and on the planar shape space using complex Watson kernels.
doi:10.1007/s10463-011-0341-x
PMCID: PMC3439825
PMID: 22984295
Nonparametric Bayes; Density Estimation; Posterior consistency; Sample dependent prior; Riemannian manifold; Hypersphere; Shape space
Gaussian latent factor models are routinely used for modeling of dependence in continuous, binary, and ordered categorical data. For unordered categorical variables, Gaussian latent factor models lead to challenging computation and complex modeling structures. As an alternative, we propose a novel class of simplex factor models. In the single-factor case, the model treats the different categorical outcomes as independent with unknown marginals. The model can characterize flexible dependence structures parsimoniously with few factors, and as factors are added, any multivariate categorical data distribution can be accurately approximated. Using a Bayesian approach for computation and inferences, a Markov chain Monte Carlo (MCMC) algorithm is proposed that scales well with increasing dimension, with the number of factors treated as unknown. We develop an efficient proposal for updating the base probability vector in hierarchical Dirichlet models. Theoretical properties are described, and we evaluate the approach through simulation examples. Applications are described for modeling dependence in nucleotide sequences and prediction from high-dimensional categorical features.
doi:10.1080/01621459.2011.646934
PMCID: PMC3728016
PMID: 23908561
Classification; Contingency table; Factor analysis; Latent variable; Nonparametric Bayes; Nonnegative tensor factorization; Mutual information; Polytomous regression
Summary
Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically of order n³, where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.
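The computational idea can be sketched concretely: rather than inverting an n × n covariance matrix, project the data through a random m × n matrix and invert only an m × m matrix. The kernel, projection distribution, and predictive formula below are an illustrative sketch of this style of compression, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(3)

def sq_exp_kernel(a, b, ls=0.3):
    """Squared-exponential covariance between 1-D input vectors."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * ls**2))

# Toy 1-D regression data (hypothetical)
n, m, sigma2 = 500, 50, 0.01
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, np.sqrt(sigma2), n)

K = sq_exp_kernel(x, x)                          # n x n, O(n^3) to invert directly

# Random projection onto an m-dimensional subspace, m << n
Phi = rng.normal(0, 1.0 / np.sqrt(m), size=(m, n))
A = Phi @ (K + sigma2 * np.eye(n)) @ Phi.T       # m x m, cheap to invert
alpha = Phi.T @ np.linalg.solve(A, Phi @ y)

x_test = np.linspace(0, 1, 9)
f_hat = sq_exp_kernel(x_test, x) @ alpha         # compressed predictive mean
print(np.round(f_hat, 2))
```

Unlike subset-of-data schemes, every observation contributes to the projected summary Phi @ y, which is the property the abstract highlights.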
doi:10.1093/biomet/ass068
PMCID: PMC3712798
PMID: 23869109
Bayesian regression; Compressive sensing; Dimensionality reduction; Gaussian process; Random projection
Admixture mapping is a popular tool to identify regions of the genome associated with traits in a recently admixed population. Existing methods have been developed primarily for identification of a single locus influencing a dichotomous trait within a case-control study design. We propose a generalized admixture mapping (GLEAM) approach, a flexible and powerful regression method for both quantitative and qualitative traits, which is able to test for association between the trait and local ancestries in multiple loci simultaneously and adjust for covariates. The new method is based on the generalized linear model and uses a quadratic normal moment prior to incorporate admixture prior information. Through simulation, we demonstrate that GLEAM achieves a lower type I error rate and higher power than ANCESTRYMAP for qualitative traits and, to an even greater degree, for quantitative traits. We applied GLEAM to genome-wide SNP data from the Illumina African American panel derived from a cohort of black women participating in the Healthy Pregnancy, Healthy Baby study and identified a locus on chromosome 2 associated with the averaged maternal mean arterial pressure during 24 to 28 weeks of pregnancy.
doi:10.1534/g3.113.006478
PMCID: PMC3704244
PMID: 23665878
generalized linear model; local ancestry; mapping by admixture linkage disequilibrium; quadratic normal moment prior; quantitative traits
Summary
This paper focuses on the problem of choosing a prior for an unknown random effects distribution within a Bayesian hierarchical model. The goal is to obtain a sparse representation by allowing a combination of global and local borrowing of information. A local partition process prior is proposed, which induces dependent local clustering. Subjects can be clustered together for a subset of their parameters, and one learns about similarities between subjects increasingly as parameters are added. Some basic properties are described, including simple two-parameter expressions for marginal and conditional clustering probabilities. A slice sampler is developed which bypasses the need to approximate the countably infinite random measure in performing posterior computation. The methods are illustrated using simulation examples, and an application to hormone trajectory data.
doi:10.1093/biomet/asp021
PMCID: PMC3663599
PMID: 23710074
Dirichlet process; Functional data; Local shrinkage; Meta-analysis; Multi-task learning; Partition model; Slice sampling; Stick-breaking
Summary
High-dimensional and highly correlated data leading to non- or weakly identified effects are commonplace. Maximum likelihood will typically fail in such situations and a variety of shrinkage methods have been proposed. Standard techniques, such as ridge regression or the lasso, shrink estimates toward zero, with some approaches allowing coefficients to be selected out of the model by achieving a value of zero. When substantive information is available, estimates can be shrunk to nonnull values; however, such information may not be available. We propose a Bayesian semiparametric approach that allows shrinkage to multiple locations. Coefficients are given a mixture of heavy-tailed double exponential priors, with location and scale parameters assigned Dirichlet process hyperpriors to allow groups of coefficients to be shrunk toward the same, possibly nonzero, mean. Our approach favors sparse, but flexible, structure by shrinking toward a small number of random locations. The methods are illustrated using a study of genetic polymorphisms and Parkinson’s disease.
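The prior this summary describes, double-exponential coefficient priors whose location and scale atoms come from a Dirichlet process, can be sketched as a draw from a truncated stick-breaking mixture. The base-measure choices, truncation level, and function names below are illustrative assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(4)

def dp_laplace_prior_draw(p, alpha=1.0, K=50, rng=rng):
    """Draw p regression coefficients from a mixture of double-exponential
    (Laplace) priors whose (location, scale) atoms come from a truncated
    stick-breaking Dirichlet process. Coefficients allocated to the same
    atom are shrunk toward the same, possibly nonzero, location."""
    # Truncated stick-breaking weights
    v = rng.beta(1.0, alpha, size=K)
    v[-1] = 1.0                       # absorb the remaining stick
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    # Atoms from an illustrative base measure: normal locations, gamma scales
    mu = rng.normal(0.0, 1.0, size=K)
    b = rng.gamma(2.0, 0.5, size=K)
    # Allocate each coefficient to an atom, then add Laplace variation
    z = rng.choice(K, size=p, p=w)
    return mu[z] + rng.laplace(0.0, b[z])

beta = dp_laplace_prior_draw(20)
print(beta.round(2))
```

Because the DP concentrates mass on few atoms, a single draw typically groups many coefficients around a handful of shared locations, which is the "sparse but flexible" structure the summary emphasizes.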
doi:10.1111/j.1541-0420.2009.01275.x
PMCID: PMC3631538
PMID: 19508244
Dirichlet process; Hierarchical model; Lasso; MCMC; Mixture model; Nonparametric; Regularization; Shrinkage prior
Modeling of multivariate unordered categorical (nominal) data is a challenging problem, particularly in high dimensions and cases in which one wishes to avoid strong assumptions about the dependence structure. Commonly used approaches rely on the incorporation of latent Gaussian random variables or parametric latent class models. The goal of this article is to develop a nonparametric Bayes approach, which defines a prior with full support on the space of distributions for multiple unordered categorical variables. This support condition ensures that we are not restricting the dependence structure a priori. We show this can be accomplished through a Dirichlet process mixture of product multinomial distributions, which is also a convenient form for posterior computation. Methods for nonparametric testing of violations of independence are proposed, and the methods are applied to model positional dependence within transcription factor binding motifs.
doi:10.1198/jasa.2009.tm08439
PMCID: PMC3630378
PMID: 23606777
Bayes factor; Dirichlet process; Goodness-of-fit test; Latent class; Mixture model; Motif data; Product multinomial; Unordered categorical
We focus on Bayesian variable selection in regression models. One challenge is to search the huge model space adequately while identifying high posterior probability regions. In recent decades, the main focus has been on the use of Markov chain Monte Carlo (MCMC) algorithms for these purposes. In this article, we propose a new computational approach based on sequential Monte Carlo (SMC), which we refer to as particle stochastic search (PSS). We illustrate PSS through applications to linear regression and probit models.
doi:10.1016/j.spl.2010.10.011
PMCID: PMC3029030
PMID: 21278860
Bayes factor; Marginal inclusion probability; Model averaging; Model uncertainty; Sequential Monte Carlo; Stochastic search variable selection; Subset selection
This article considers a methodology for flexibly characterizing the relationship between a response and multiple predictors. Goals are (1) to estimate the conditional response distribution addressing the distributional changes across the predictor space, and (2) to identify important predictors for the response distribution change both within local regions and globally. We first introduce the probit stick-breaking process (PSBP) as a prior for an uncountable collection of predictor-dependent random distributions and propose a PSBP mixture (PSBPM) of normal regressions for modeling the conditional distributions. A global variable selection structure is incorporated to discard unimportant predictors, while allowing estimation of posterior inclusion probabilities. Local variable selection is conducted relying on the conditional distribution estimates at different predictor points. An efficient stochastic search sampling algorithm is proposed for posterior computation. The methods are illustrated through simulation and applied to an epidemiologic study.
doi:10.1198/jasa.2009.tm08302
PMCID: PMC3620660
PMID: 23580793
Conditional distribution estimation; Hypothesis testing; Kernel stick-breaking process; Mixture of experts; Stochastic search variable selection