Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3712761

Formats

Article sections

- Abstract
- 1. Introduction
- 2. Bayesian Adverse Subpopulation Regression
- 3. Posterior Computation
- 4. Simulations
- 5. Application to Pregnancy Outcomes
- 6. Discussion
- Supplementary Material
- References

Authors

Related links

Stat Med. Author manuscript; available in PMC 2013 December 20.

Published in final edited form as:

PMCID: PMC3712761

NIHMSID: NIHMS423359

Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using pre-defined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, while the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression (ASPR), which relies on a two component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional predictors, as occur in studies with a large number of gene by environment interactions, through use of a flexible nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. The methods are evaluated using simulation studies and applied to a genetic epidemiology study of pregnancy outcomes.

Biomedical studies routinely collect multiple quantitative health outcomes and investigate how the risk of having adverse values for these outcomes is associated with predictors. The typical approach in such setting is to 1) use multivariate normal linear regression in which the mean of the response distribution varies linearly with predictors; 2) first categorize the responses based on pre-specified cutoffs and then fit a logistic regression. The former approach is insufficiently flexible to accommodate settings in which the predictors do not simply shift the response by a fixed amount for all individuals, while the latter approach is extremely sensitive to cut-point choices. In this article, we propose a simple alternative approach for adverse subpopulation regression (ASPR) relying on a two component mixture model that incorporates a logistic regression for the risk of falling into the minority component in the mixture, with the logistic regression model accommodating high-dimensional predictors. We focus on the case when researchers are interested in dichotomizing the subjects into two classes: healthy and unhealthy group (corresponding to the majority and minority of the population); and each component is modeled by the multivariate normal distribution whose mean vector and covariance matrix change with latent class membership. This model is purposefully chosen to be simple to facilitate analyses and interpretations in settings involving high-dimensional predictors, though generalizations to multiple latent classes is straightforward as discussed in Section 6.

Our approach is motivated by the Healthy Pregnancy, Healthy Baby (HPHB) Study, a prospective cohort study of pregnant women residing within Durham County, NC with the goal of identifying environmental, social and genetic factors that contribute to racial disparities in birth outcomes [1]. Here we focus on assessing how predictors - a large number of maternal candidate gene single nucleotide polymorphisms (SNPs), environmental exposures, and their interactions - impact the risk of low values of infant birth weight and gestational age at delivery. Such research questions cannot be addressed by the standard linear regression with continuous responses, where one models the predictor effects on the response means. The standard approach is instead to dichotomize the quantitative outcomes into binary indicators, such as low birth weight (LBW, birth weight<2500g), preterm birth (PTB, gestational age at delivery<37weeks) and small for gestational age (SGA, birth weight less than 10th percentile for that gestational age), and then apply logistic regression. While such analyses are easily implemented, they rely on pre-defining thresholds with the analysis results varying significantly according to the threshold choice [2].

We propose an alternative method for adverse subpopulation regression, which relies on a two component latent class model [3, 4, 5], with the component weights dependent on predictors via logistic regression. Related approaches are considered by Gage [6], Gage et al. [7] and Schwartz et al. [8], but they focused on models with fixed component weights and with the means varying with predictors. In addition, our emphasis is on applications involving high-dimensional predictors in which maximum likelihood can be expected to have poor performance. Stegle et al. [9] presented a Bayesian model for mapping expression quantitative trait loci (eQTLs) jointly contributed from genotype as well as known and hidden confounding factors. This approach is suitable for the subjects sampled from one population group, while our approach focuses on the population with subpopulation admixture (such as healthy versus unhealthy groups).

In such settings, it has become quite common to rely on either Bayesian methods or penalized likelihoods with penalties incorporated to favor having many coefficients estimated at or near zero, leading to variable selection and an effectively lower dimensional model. In linear regression and generalized linear models, such methods have become standard, with the Lasso [10], elastic net [11] and relevance vector machine [12] providing popular examples. These methods have been applied in genome-wide association studies to cope with large number of SNPs and to select multiple SNPs simultaneously [13, 14, 15]. The penalized likelihood estimators have a Bayesian interpretation in corresponding to the mode of the posterior distribution obtained under carefully chosen priors on the coefficients, with the Laplace leading to the Lasso [16] and a *t*-distribution with low degrees of freedom leading to the relevance vector machine. MacLehose and Dunson [17] recently proposed a new class of multiple shrinkage priors that allow shrinkage towards not only zero but also other values, leading to improved performance in estimating non-zero coefficients. We will consider these and other shrinkage priors in the context of our ASPR model.

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed Bayesian adverse subpopulation regression model, describing both fully Bayes and fast two-stage approaches for inference. Section 3 provides details of an Markov chain Monte Carlo (MCMC) algorithm. Section 4 presents the simulation results, evaluating and comparing the proposed methods with existing methods. In Section 5, we apply the model to pregnancy outcome data. The article concludes with a discussion in Section 6.

Suppose we collect the data (* y_{i}*’

Denote *ω*_{1}(* x_{i}*) = Pr(

$$f({\mathit{y}}_{i}\mid {\mathit{x}}_{i})=\sum _{h=1}^{2}{\omega}_{h}\left({\mathit{x}}_{i}\right){\mathrm{N}}_{s}({\mathit{y}}_{i}\mid {\theta}_{h},{\Sigma}_{h}),$$

(1)

where **N*** _{s}*(

$${\omega}_{1}\left({\mathit{x}}_{i}\right)=\frac{\mathrm{exp}(\gamma +{\mathit{x}}_{i}^{\prime}\beta )}{1+\mathrm{exp}(\gamma +{\mathit{x}}_{i}^{\prime}\beta )},$$

(2)

where the coefficient vector *β* = (*β*_{1}, *β*_{2},…, *β _{p}*)’ characterizes the effect of predictors on the risk of falling in the minority subpopulation. Due to the logistic regression form, the exponentiated

For the ASPR model in (1)-(2), identifiability of the unhealthy subpopulation necessarily relies on prior information. In the absence of some prior knowledge, the two subgroups would be exchangeable, and we would encounter a label-ambiguity problem. Removal of this problem through appropriate priors is one of the advantages of the simple two component framework over more complex latent class regression models having unknown numbers of components. The most common approach would place restrictions on the means of the components; for example, ordering the components in advance by letting *θ*_{11} < *θ*_{21}. This approach assumes that low values of the first response variable are adverse, which may be reasonable for a given study but is not so in general. Moreover, placing restriction on the means will fail to solve the label ambiguity problem if the components are not separated sufficiently.

Thus, we consider alternative strategies depending on the application. The first is to elicit informative values for the means and covariance matrix in the two components from prior empirical knowledge of the typical distribution of the responses in healthy and unhealthy groups. In the absence of such extra knowledge, we may fit a mixture of two multivariate normals to the data using EM for maximum likelihood estimation, defining the minority component to be adverse. This can be done either from historical data or the current data. Then, fix the * θ_{h}*,

Another alternative is to specify conditionally conjugate prior distributions for * θ_{h}*,

$$({\theta}_{h},{\Sigma}_{h})\sim {\mathrm{NIW}}_{s}({\theta}_{h},{\Sigma}_{h}\mid {\theta}_{0},{\psi}_{0},{\rho}_{0},{\Sigma}_{0}),\phantom{\rule{1em}{0ex}}h=1,2,$$

where NIW* _{s}*(

As for the priors for *β*, if * x_{i}* is low-dimensional, we can rely on standard choices, such as independent Gaussian distributions with modest variance. However, as the number of predictors increases, we need some approach for addressing the high dimensionality. A common strategy in the frequentist literature is to use sparse penalized regression (e.g., Lasso, elastic net, etc) to favor many elements of

As an alternative approach that had excellent performance in high-dimensional logistic regression, MacLehose and Dunson [17] proposed a multiple shrinkage prior (MSP) for the *j*th coefficient *β _{j}* ~ ∫ DE(

$$\begin{array}{cc}\hfill \mathrm{P}(\cdot )& ={\pi}_{1}{\delta}_{(0,{\tau}_{1}^{\ast})}(\cdot )+\sum _{t=2}^{\infty}{\pi}_{t}{\delta}_{({\mu}_{t}^{\ast},{\tau}_{t}^{\ast})}(\cdot ),\hfill \\ \hfill {\mu}_{t}^{\ast}& \sim {\mathrm{N}}_{1}({\mu}_{t}^{\ast}\mid c,d),\hfill \\ \hfill {\tau}_{1}^{\ast}& \sim \text{Gamma}({\tau}_{1}^{\ast}\mid {a}_{0},{b}_{0}),\phantom{\rule{1em}{0ex}}{\tau}_{t}^{\ast}\sim \text{Gamma}({\tau}_{t}^{\ast}\mid {a}_{1},{b}_{1}),\hfill \\ \hfill {\pi}_{t}& ={V}_{t}\prod _{l<t}(1-{V}_{l}),\phantom{\rule{1em}{0ex}}{V}_{t}\sim \text{Beta}(1,\alpha ),\hfill \end{array}$$

(3)

where Gamma(*τ* | *a, b*) = 1/[*b ^{a}*Γ(

In describing an approach for posterior computation, we focus on the approach described in Section 2 that places a normalinverse-Wishart prior on the component-specific parameters, an informative prior on the intercept *γ* for identifiability, and a mixture of double exponential shrinkage prior on the high-dimensional vector of *β* coefficients. This approach is straightforward to modify to accommodate the other approaches described in Section 2. For example, to instead use the two-stage plug-in approach, we would run the EM algorithm first to estimate *μ _{h}*, Σ

If we observe the latent subpopulation index *z _{i}* directly for each individual and are interested in the coefficients

- Draw
and**θ**_{h}**Σ**from NIW_{h}_{s}$({\theta}_{h},{\Sigma}_{h}\mid {\widehat{\theta}}_{h},{\widehat{\psi}}_{h},{\widehat{\rho}}_{h},{\widehat{\Sigma}}_{h}),h=1,2$, wherewith ${n}_{1}={\sum}_{i=1}^{n}I({z}_{i}=1),{\stackrel{\u2012}{\mathit{y}}}_{1}=\frac{1}{{n}_{1}}{\sum}_{i:{z}_{i}=1}{\mathit{y}}_{i}$ and ${\mathit{S}}_{1}={\sum}_{i:{z}_{i}=1}({\mathit{y}}_{i}-{\stackrel{\u2012}{\mathit{y}}}_{1})({y}_{i}-{\stackrel{\u2012}{\mathit{y}}}_{1})\prime ;{n}_{2}={\sum}_{i=1}^{n}I({z}_{i}=0),{\stackrel{\u2012}{\mathit{y}}}_{2}=\frac{1}{2}{\sum}_{i:{z}_{i}=0}{\mathit{y}}_{i}$ and ${\mathit{S}}_{2}={\sum}_{i:{z}_{i}=0}({\mathit{y}}_{i}-{\stackrel{\u2012}{\mathit{y}}}_{2})({\mathit{y}}_{i}-{\stackrel{\u2012}{\mathit{y}}}_{2})\prime $.$$\begin{array}{cc}\hfill {\widehat{\theta}}_{h}& =\frac{{n}_{h}}{{n}_{h}+{\psi}_{0}}{\stackrel{\u2012}{\mathit{y}}}_{h}+\frac{{\psi}_{0}}{{n}_{h}+{\psi}_{0}}{\theta}_{0},\hfill \\ \hfill {\widehat{\psi}}_{h}& ={n}_{h}+{\psi}_{0},\hfill \\ \hfill {\widehat{\rho}}_{h}& ={n}_{h}+{\rho}_{0},\hfill \\ \hfill {\widehat{\Sigma}}_{h}& ={\Sigma}_{0}+{\mathit{S}}_{h}+\frac{{n}_{h}}{1+{n}_{h}{\psi}_{0}^{-1}}({\stackrel{\u2012}{\mathit{y}}}_{h}-{\theta}_{0}){({\stackrel{\u2012}{\mathit{y}}}_{h}-{\theta}_{0})}^{\prime},\hfill \end{array}$$ - Impute component indicator
*z*from the conditional Bernoulli distribution by setting_{i}*z*= 1 with probability $\frac{{\omega}_{1}\left({\mathit{x}}_{i}\right){\mathrm{N}}_{s}({\mathit{y}}_{i}\mid {\theta}_{1},{\Sigma}_{1})}{{\sum}_{h=1}^{2}{\omega}_{h}\left({\mathit{x}}_{i}\right){\mathrm{N}}_{s}({\mathit{y}}_{i}\mid {\theta}_{h},{\Sigma}_{h})}\text{for}\phantom{\rule{thinmathspace}{0ex}}i=1,2,\dots ,n$._{i} - Augment auxiliary variable
*g*,_{i}*i*= 1, 2,…,*n*, sampled from the normal distribution ${\mathrm{N}}_{1}({g}_{i}\mid \gamma +{\mathit{x}}_{i}^{\prime}\beta ,{\sigma}^{2}\u2215{\varphi}_{i})$, which is truncated above (below) by zero when*z*= 0 (_{i}*z*= 1)._{i} - Update
*ϕ*from Gamma $\left({\varphi}_{i}\mid \frac{\nu +1}{2},\frac{2}{\nu +{({g}_{i}-\gamma -{\mathit{x}}_{i}^{\prime}\beta )}^{2}\u2215{\sigma}^{2}}\right),i=1,2,\dots ,n$. - Update the regression coefficients
*β** = (*γ, β*’)’ given*g*,_{i}*ϕ*and other parameters, following the MCMC algorithm of MacLehose and Dunson [17]._{i}

Although we illustrate the algorithm focusing on the multiple shrinkage prior, the above algorithm could be easily modified for different shrinkage priors of *β* by using the corresponding sampling algorithm in the step (e). Moreover, we could combine the shrinkage and selection method (e.g. Lasso and elastic net) for logistic regression with step (a) and (b) to get a Monte Carlo EM algorithm [24] for the adverse subpopulation regression.

In this section, we examine the performance of our approach along with alternative simple two-stage methods through simulation studies. The two-stage methods generate the binary indicators for the adverse subpopulation in the first stage. For example, indicators can be chosen as the true binary indicators (known for simulation data), estimated by the maximum a posteriori (MAP) allocation from a simple two component mixture model with the EM algorithm, specified by using preselected cutoffs, or identified by K-means clustering for two clusters. In the second stage, we fit both the standard logistic regression model without penalization (Logit-Standard) and the penalized logistic regression models with the shrinkage methods Lasso and elastic net (Logit-Lasso and Logit-ElasticNet) [25].

One hundred datasets were simulated to represent the data observed in the HPHB data set. In particular, we simulated 813 women with two response variables corresponding to infant birth weight and gestational age at delivery. We used maternal genotype for 100 SNPs as predictors that were fixed across the simulations, with only the response variable generation varying. By using the real SNP data, we obtained simulated datasets with a realistic dependence structure among the predictors, which is important given that the dependence structure can have a fundamental impact on variable selection and estimation performance. We simulated data under the model proposed in Section 2.1, with *β* chosen so that the first ten elements were set equal to 0.5 (corresponding to odds ratios for the minor allele of exp(0.5) = 1.65) and the remaining elements were set equal to zero (corresponding to no association with risk of falling in the adverse subpopulation for SNPs 11,…, 100). In addition, we considered another scenario with the first ten coefficients equal to 0.8 (exp(0.8) = 2.23) and others to zero. We simulated * y_{i}* based on expression (1), where the

We applied the proposed ASPR model with default priors specified in Section 2.2 and the two-stage methods to the simulated datasets. For ASPR model, we implemented the data augmentation Gibbs sampling algorithm outlined in Section 3. The sampling ran for 11,000 iterations, 1,000 iterations were discarded as a burn-in and every 10th sample was saved to thin the chain. The trace and autocorrelation plots of the posterior samples were examined to determine the convergence. We used the cyclical coordinate descent algorithm by Friedman et al. [25] to find the Lasso or elastic net regularization paths for penalized logistic regression models. Ten-fold cross-validations were used to select the optimal shrinkage parameter which gave the minimal deviance.

We first compared the estimation performance measured by mean squared errors (MSEs) which were calculated for each coefficient across 100 datasets. Table 1 presents the averaged MSEs obtained across the first ten non-null coefficients and the remaining ninety null coefficients. The averaged MSEs of non-null coefficients by ASPR model are smaller than those given by the two-stage methods even with true indicators. More importantly, when the true indicators are unknown, a common scenario in practice, the two-stage methods rely on indicators generated either by classification algorithms (here the MAP allocation and K-means clustering) or by medical cutoffs will inflate the MSEs of non-null coefficients significantly. This observation indicates that if part of subjects are mis-classified in the two-stage methods, the coefficient estimates would be affected in the standard and penalized logistic regressions. A better solution is thus to avoid the classification or using cutoffs in the first place.

Simulation results to compare coefficient estimation and variable selection by the ASPR model, the standard logistic regression models without penalization and the penalized standard logistic regression models with shrinkage methods for true non-null **...**

We also compared the variable selection ability for different methods. Based on substantive knowledge, we choose an interval null hypothesis as *H*_{0j} : |*β _{j}*| ≤ ; we chose = 0.1 in practice as log odds ratios within ±0.1 of zero are clearly not significant from a public health viewpoint in our motivating applications. We can then calculate the posterior probability of the alternative ${H}_{1j}:\mid {\beta}_{j}\mid \ast \ast \ast \ast \u220a$ as ${\stackrel{~}{\pi}}_{j}={\sum}_{g=1}^{G}I(\mid {\beta}_{j}^{g}\mid \ast \ast \ast \ast \u220a)\u2215G$ under the ASPR model, with ${\beta}_{j}^{g}$ the

Based on above selection criteria, we are able to assess the variable selection ability by using the averaged true positive rate (TPR) and false positive rate (FPR) for different methods across multiple datasets. For a given method and simulated dataset, TPR is defined as the number of predictors correctly selected as significant predictors divided by the number of true non-null predictors, and similarly FPR as number of predictors falsely selected as significant predictors divided by the number of true null predictors. As listed in Table 1, the ASPR model achieves higher TPR and lower FPR (controlling FDR at 50%) than the alternative methods for most scenarios. The only exception is the Logit-ElasticNet method in the (unrealistic) case in which the true sub-population indicators are assumed known. One may argue that it is arbitrary to control FDR at 50% in the ASPR approach and use 90% confidence intervals in the two-stage method, with different values leading to different comparisons of TPRs and FDRs. To obtain a more fair comparison, we varied the values of FDR, confidence interval value, and threshold ’ (previously assumed to be zero) for penalized logistic regression and calculated a series of TPRs and FDRs, which are plotted in Figure 1 as receiver operating curves (ROCs). It it clear that the ROC curves by ASPR model stay above the other curves and are closer to the top left corner, indicating a better tradeoff between the TPRs and FPRs. In addition, we may calculate AUCs (area under the curves), which is approximated by applying the trapezoidal rule for a series of TPRs and FPRs. The AUCs by various methods under different scenarios are given in Table 1 with ASPR model showing the largest AUC. It suggests that in general the method by ASPR model achieves a better performance in variable selection than the alternative methods.

There is increasing appreciation that interactions between the genetic and environmental factors contribute to adverse birth outcomes. In this analysis, we investigated the effects of maternal genotype and their interaction with lead and tobacco exposure on adverse birth outcomes in the infant, adjusting for several confounding factors. The dataset included 813 non-Hispanic black pregnant women who had singleton pregnancy and were less than 28 weeks gestation at the time of enrollment in HPHB study. Based on published studies, we focused on 31 candidate genes which are involved with maladaptive inflammatory regulation, maternal-fetal circulation, stress response, and environmental contaminant metabolism. For those candidate genes, we selected 275 haplotype tagging SNPs which effectively capture the genetic diversity of these genes. Please see Swamy et al. [28] and Ashley-Koch et al. [29] for further details on genotyping approaches. A detailed description of the SNPs and genes used in this analysis can be found in the Web Appendix. For the purpose of this analysis, we assumed that the risk for adverse birth outcomes would be associated with minor alleles. The value of each SNP was recorded as one if the mother carried the less frequent allele and as zero otherwise. In addition to the genetic data, we measured maternal blood levels of lead and cadmium. The interaction of lead and cadmium with the SNPs in relation to gestational age and birth weight is an important research question. We also controlled confounding factors by including them in the analysis. These confounders are mother’s age, recorded as age group 18-20, 21-35 vs 35+; education, as no college vs some college; insurance, as private vs others; parity, as zero vs others; infant sex, as male vs female.

We fit the ASPR model with default priors after checking the composite normal assumption. The MCMC algorithm was run for 11,000 iterations with the first 1,000 iterations discarded as burn-in and every 10th remaining draw retained for analysis. The trace plots and the autocorrelation plots suggested the algorithm converged fast and mixes well. Table 2 presents the posterior summary for the component parameter * θ_{h}* = (

Scatter plots of birth weight (grams) and gestational age (days) overlaid with (a) the posterior mean of allocation weights *ω*_{2}(**x**_{i}) with the cutoffs (dash lines) at 257 days for the gestational age and 2500 grams for the birth weight; (b) posterior **...**

Posterior summary for component parameters in the ASPR model. *θ*_{·1} and *θ*_{·2} are the average gestational age and birthweight in the subgroup with corresponding variance and covariance denoted as Σ_{·11}, Σ **...**

Figure 3 shows the posterior means and 90% credible intervals for the coefficients of SNPs and their interactions with lead and cadmium. Although all the credible intervals cover zero, the posterior means of coefficients of SNPs rs2420620, rs10107390 and rs7017402, given in Table 3, stay clearly away from zero in the main effect (the top panel). This result is also supported by Web Figure 1, which plots ${\stackrel{~}{\pi}}_{j}={\sum}_{g=1}^{G}I(\mid {\beta}_{j}^{g}\mid \ast \ast \ast \ast \u220a)\u2215G$, the posterior probability that *j*th predictor have an effect based on G samples of *β _{j}*. The larger the posterior probability ${\stackrel{~}{\pi}}_{j}$ suggests a stronger effect. When we take = 0.1, coefficient of SNP rs2420620 stands out with the value 0.123 and coefficients of SNPs rs10107390 and rs7017402 with the value 0.098 and 0.092, while the majority of other posterior probabilities are around 0.05. SNP rs2420620 is located in gene GRK5 and SNPs rs10107390 and rs7017402 in gene NAT1. GRK5 is a member of the G protein-coupled receptor kinase family which is involved in regulating the activity levels of G protein-coupled receptors. Polymorphisms in GRK5 have been previously linked to risk for heart failure in African Americans [30]. The N-acetyltransferase genes (NAT1 and NAT2) are involved in the metabolism of xenobiotics. NAT1 has been shown to be expressed in early placenta [31]. We also analyzed the data using two-stage methods, in which the penalized logistic regressions were applied with the indicators generated by maximum a posteriori method. The results are presented in Table 3. Both penalized logistic regression methods identifies the SNP rs2420620 in GRK5 and SNP rs7017402 in NAT1, since their coefficients are not equal to zero. These two SNPs are also selected by the ASPR model. For ASPR model, additional SNPs in genes CR1 and IGF1 are interesting but not consistent across the three approaches. This may suggest that these are false positive results.

Posterior mean and 90% credible interval for coefficients *β* in ASPR model. The coefficients are illustrated for the SNPs main effects and their interactions with lead and cadmium respectively.

In this article, we propose an adverse subpopulation regression model for investigating the relationship among multiple quantitative outcomes and high-dimensional predictors. Unlike the traditional two-stage methods, the proposed method does not require dichotomizing the continuous outcomes into binary indicators and thus avoids information loss. Two stage methods are outperformed with smaller MSE and higher area under the ROC curve for variable selection as demonstrated by the simulation studies. The new model has been applied to examine the effect of gene and environment interaction on adverse pregnancy outcomes. The results suggest the gene GRK5 and NAT1 may influence the occurrence of low birth weight and preterm delivery. Our focus is on defining a simple approach for assessing the impact of high-dimensional predictors on the risk of an adverse outcome when data consist of multiple quantitative traits. By using a two component mixture model, we can use a binary response logistic regression model, a framework that is very familiar to epidemiologists, to characterize non-linear genetic and environment associations with potentially complex multivariate quantitative traits. The proposed framework provides a parsimonious alternative to normal linear regression and logistic regressions based on preliminary categorization of quantitative traits, and should be able to detect associations that would not be detected with these methods. The proposed ASPR framework has purposefully been chosen to be a simple and parsimonious model that is easy to interpret and is scalable to high-dimensional predictors. We are aware that dealing with hundreds of thousands or millions of SNPs will cause serious computational burden for ASPR model. This limitation is shared by most of existing shrinkage methods. The popular solution is to reduce the dimensionality of the predictors first by using sure independence screening [32] or strong rules for discarding predictors [33]. We choose Normal-inverse-Wishart distribution as a conjugate prior for the parameters of multivariate normal distribution. The closed forms in the MCMC step a) lead to computational efficiency. With a large number of traits, it however may be computational expensive to evaluate the posterior Normal-inverse-Wishart density. For sake of simplicity and parsimony, we have avoided fully nonparametric Bayesian density regression models [34] that allow unknown numbers of latent classes. Although generalizations in such directions are conceptually straightforward, for each additional latent class, one introduces an additional *p* regression coefficients and corresponding hyperparameters, and difficult issues in identifiability, label switching and computational complexity arise. For our pregnancy outcome application, the ASPR model provides a good fit to the data, as illustrated by the posterior predictive density.

This work was supported by Award Number R01ES017436 from the National Institute of Environmental Health Sciences, and by funding from the National Institutes of Health (5P2O-RR020782-O3) and the U.S. Environmental Protection Agency (RD-83329301-0). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences, the National Institutes of Health or the U.S. Environmental Protection Agency. We also thank the editor and two anonymous reviewers for very helpful comments and suggestions.

**7. Supplementary Materials** Web Appendix and Web Figure referenced in Sections 5 and the MCMC codes for ASPR model are available at the *Statistics in Medicine* website http://onlinelibrary.wiley.com/journal/10.1002/(ISSN)1097-0258.

1. Miranda ML, Maxson P, Edwards S. Environmental contributions to disparities in pregnancy outcomes. Epidemiologic Reviews. 2009;31(1):67. [PubMed]

2. Boucher KM, Slattery ML, Berry TD, Quesenberry C, Anderson K. Statistical Methods in Epidemiology: A Comparison of Statistical Methods to Analyze Dose-Response and Trend Analysis in Epidemiologic Studies. Journal of Clinical Epidemiology. 1998;51(12):1223–1233. [PubMed]

3. Clogg CC. Handbook of statistical modeling for the social and behavioral sciences. Springer; New York: 1995.

4. Lindsay BG. Mixture models: theory, geometry, and applications. IMS; Beachwood: 1995.

5. McLachlan GJ, Peel D. Finite mixture models. Wiley Blackwell; New York: 2000.

6. Gage TB. Classification of births by birth weight and gestational age: an application of multivariate mixture models. Annals of Human Biology. 2003;30(5):589–604. [PubMed]

7. Gage TB, Fang F, Stratton H. Modeling the pediatric paradox: Birth weight by gestational age. Biodemography and Social Biology. 2008;54(1):95–112. [PMC free article] [PubMed]

8. Schwartz SL, Gelfand AE, Miranda ML. Joint Bayesian analysis of birthweight and censored gestational age using finite mixture models. Statistics in Medicine. 2010;29(16):1710–1723. [PubMed]

9. Stegle O, Parts L, Durbin R, Winn J. A bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS computational biology. 2010;6(5):e1000 770. [PMC free article] [PubMed]

10. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. 1996;58(1):267–288.

11. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67(2):301–320.

12. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.

13. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25(6):714–721. [PMC free article] [PubMed]

14. Cho S, Kim H, Oh S, Kim K, Park T. BMC proceedings. vol. 3. BioMed Central Ltd; 2009. Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis; p. S25. [PMC free article] [PubMed]

15. Ayers KL, Cordell HJ. Snp selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology. 2010;34(8):879–891. [PMC free article] [PubMed]

16. Park T, Casella G. The bayesian lasso. Journal of the American Statistical Association. 2008;103(482):681–686.

17. MacLehose RF, Dunson DB. Bayesian Semiparametric Multiple Shrinkage. Biometrics. 2010;66(2):455–462. [PMC free article] [PubMed]

18. Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6(4):450. [PubMed]

19. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4(2):639–650.

20. Gelman A, Jakulin A, Grazia M. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics. 2008;2(4):1360–1383.

21. Holmes CC, Held L. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis. 2006;1(1):145–168.

22. O’Brien SM, Dunson DB. Bayesian multivariate logistic regression. Biometrics. 2004;60(3):739–746. [PubMed]

23. Kinney SK, Dunson DB. Fixed and random effects selection in linear and logistic models. Biometrics. 2007;63(3):690–698. [PubMed]

24. Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association. 1990;85(411):699–704.

25. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1. [PMC free article] [PubMed]

26. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. 1995;57(1):289–300.

27. Müller P, Parmigiani G, Robert C, Rousseau J. Optimal sample size for multiple testing. Journal of the American Statistical Association. 2004;99(468):990–1001.

28. Swamy GK, Garrett ME, Miranda ML, Ashley-Koch AE. Maternal vitamin D receptor genetic variation contributes to infant birthweight among black mothers. American Journal of Medical Genetics Part A. 2011;155(5):1264–1271. [PMC free article] [PubMed]

29. Ashley-Koch AE, Garrett ME, Edwards S, Quinn KS, Swamy GK, Miranda ML. Maternal genetic variation in genes involved in the inflammatory response interact with measures of air pollution exposure to affect infant birthweight among non-Hispanic black women. 2011. Submitted.

30. Liggett SB, Cresci S, Kelly RJ, Syed FM, Matkovich SJ, Hahn HS, Diwan A, Martini JS, Sparks L, Parekh RR, et al. A GRK5 polymorphism that inhibits *β*-adrenergic receptor signaling is protective in heart failure. Nature medicine. 2008;14(5):510–517. [PMC free article] [PubMed]

31. Smelt VA, Upton A, Adjaye J, Payton MA, Boukouvala S, Johnson N, Mardon HJ, Sim E. Expression of arylamine N-acetyltransferases in pre-term placentas and in human pre-implantation embryos. Human molecular genetics. 2000;9(7):1101. [PubMed]

32. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B. 2008;70(5):849–911. [PMC free article] [PubMed]

33. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B. 2012;74(2):245–266. [PMC free article] [PubMed]

34. Dunson DB, Pillai N, Park JH. Bayesian density regression. Journal of the Royal Statistical Society: Series B. 2007;69(2):163–183.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |