Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Med. Author manuscript; available in PMC 2013 December 20.
Published in final edited form as:
Stat Med. 2012 December 20; 31(29): 4102–4113.
Published online 2012 July 24. doi:  10.1002/sim.5520
PMCID: PMC3712761

Adverse Subpopulation Regression for Multivariate Outcomes with High-Dimensional Predictors


Biomedical studies have a common interest in assessing relationships between multiple related health outcomes and high-dimensional predictors. For example, in reproductive epidemiology, one may collect pregnancy outcomes such as length of gestation and birth weight and predictors such as single nucleotide polymorphisms in multiple candidate genes and environmental exposures. In such settings, there is a need for simple yet flexible methods for selecting true predictors of adverse health responses from a high-dimensional set of candidate predictors. To address this problem, one may either consider linear regression models for the continuous outcomes or convert these outcomes into binary indicators of adverse responses using pre-defined cutoffs. The former strategy has the disadvantage of often leading to a poorly fitting model that does not predict risk well, while the latter approach can be very sensitive to the cutoff choice. As a simple yet flexible alternative, we propose a method for adverse subpopulation regression (ASPR), which relies on a two component latent class model, with the dominant component corresponding to (presumed) healthy individuals and the risk of falling in the minority component characterized via a logistic regression. The logistic regression model is designed to accommodate high-dimensional predictors, as occur in studies with a large number of gene by environment interactions, through use of a flexible nonparametric multiple shrinkage approach. The Gibbs sampler is developed for posterior computation. The methods are evaluated using simulation studies and applied to a genetic epidemiology study of pregnancy outcomes.

Keywords: Bayesian, Genetic epidemiology, Latent class model, Logistic regression, Mixture model, Model averaging, Nonparametric, Variable selection

1. Introduction

Biomedical studies routinely collect multiple quantitative health outcomes and investigate how the risk of having adverse values for these outcomes is associated with predictors. The typical approach in such setting is to 1) use multivariate normal linear regression in which the mean of the response distribution varies linearly with predictors; 2) first categorize the responses based on pre-specified cutoffs and then fit a logistic regression. The former approach is insufficiently flexible to accommodate settings in which the predictors do not simply shift the response by a fixed amount for all individuals, while the latter approach is extremely sensitive to cut-point choices. In this article, we propose a simple alternative approach for adverse subpopulation regression (ASPR) relying on a two component mixture model that incorporates a logistic regression for the risk of falling into the minority component in the mixture, with the logistic regression model accommodating high-dimensional predictors. We focus on the case when researchers are interested in dichotomizing the subjects into two classes: healthy and unhealthy group (corresponding to the majority and minority of the population); and each component is modeled by the multivariate normal distribution whose mean vector and covariance matrix change with latent class membership. This model is purposefully chosen to be simple to facilitate analyses and interpretations in settings involving high-dimensional predictors, though generalizations to multiple latent classes is straightforward as discussed in Section 6.

Our approach is motivated by the Healthy Pregnancy, Healthy Baby (HPHB) Study, a prospective cohort study of pregnant women residing within Durham County, NC with the goal of identifying environmental, social and genetic factors that contribute to racial disparities in birth outcomes [1]. Here we focus on assessing how predictors - a large number of maternal candidate gene single nucleotide polymorphisms (SNPs), environmental exposures, and their interactions - impact the risk of low values of infant birth weight and gestational age at delivery. Such research questions cannot be addressed by the standard linear regression with continuous responses, where one models the predictor effects on the response means. The standard approach is instead to dichotomize the quantitative outcomes into binary indicators, such as low birth weight (LBW, birth weight<2500g), preterm birth (PTB, gestational age at delivery<37weeks) and small for gestational age (SGA, birth weight less than 10th percentile for that gestational age), and then apply logistic regression. While such analyses are easily implemented, they rely on pre-defining thresholds with the analysis results varying significantly according to the threshold choice [2].

We propose an alternative method for adverse subpopulation regression, which relies on a two component latent class model [3, 4, 5], with the component weights dependent on predictors via logistic regression. Related approaches are considered by Gage [6], Gage et al. [7] and Schwartz et al. [8], but they focused on models with fixed component weights and with the means varying with predictors. In addition, our emphasis is on applications involving high-dimensional predictors in which maximum likelihood can be expected to have poor performance. Stegle et al. [9] presented a Bayesian model for mapping expression quantitative trait loci (eQTLs) jointly contributed from genotype as well as known and hidden confounding factors. This approach is suitable for the subjects sampled from one population group, while our approach focuses on the population with subpopulation admixture (such as healthy versus unhealthy groups).

In such settings, it has become quite common to rely on either Bayesian methods or penalized likelihoods with penalties incorporated to favor having many coefficients estimated at or near zero, leading to variable selection and an effectively lower dimensional model. In linear regression and generalized linear models, such methods have become standard, with the Lasso [10], elastic net [11] and relevance vector machine [12] providing popular examples. These methods have been applied in genome-wide association studies to cope with large number of SNPs and to select multiple SNPs simultaneously [13, 14, 15]. The penalized likelihood estimators have a Bayesian interpretation in corresponding to the mode of the posterior distribution obtained under carefully chosen priors on the coefficients, with the Laplace leading to the Lasso [16] and a t-distribution with low degrees of freedom leading to the relevance vector machine. MacLehose and Dunson [17] recently proposed a new class of multiple shrinkage priors that allow shrinkage towards not only zero but also other values, leading to improved performance in estimating non-zero coefficients. We will consider these and other shrinkage priors in the context of our ASPR model.

The remainder of the paper is organized as follows. In Section 2, we introduce the proposed Bayesian adverse subpopulation regression model, describing both fully Bayes and fast two-stage approaches for inference. Section 3 provides details of an Markov chain Monte Carlo (MCMC) algorithm. Section 4 presents the simulation results, evaluating and comparing the proposed methods with existing methods. In Section 5, we apply the model to pregnancy outcome data. The article concludes with a discussion in Section 6.

2. Bayesian Adverse Subpopulation Regression

2.1. Model Formulation

Suppose we collect the data (yixi), for subject i, i = 1, 2,…, n, where yi is an s × 1 vector of outcomes and xi is a p × 1 vector of predictors. We make the simplifying assumption that there are two types of individuals, with zi = 0 denoting healthy individuals and zi = 1 for potentially unhealthy individuals. In addition, we assume for identifiability that the unhealthy individuals are in the minority, with the specific constraints and prior information included for identifiability discussed in detail in Section 2.2. This is a simplification which is made for ease in interpretation, assessment of risk, and scaling to higher dimensions while accommodating the curse of dimensionality that arises. In many cases, such a simplification is made in advance of the analysis by taking one or more response variables and defining cutoffs to dichotomize the data prior to analysis. However, it is well known that results are quite sensitive to the choice of cutoff [18], and hence we prefer allowing zi to be an adverse health status latent variable. By using a Bayesian approach, we can fully accommodate uncertainty in imputing zi and avoid forcing any hard threshold on the observed quantitative traits.

Denote ω1(xi) = Pr(zi = 1 | xi) as the probability of allocating subject i to the unhealthy population and let ω2(xi) = 1 – ω1(xi). We then express the conditional density of the response yi given predictors xi as


where Ns(yi | θh, Σh) is the s-dimensional Normal distribution with mean vector θh and covariance matrix Σh. We note that the traits will not have a multivariate normal distribution marginally, but will instead have a mixture of normal distributions. The primary assumption we have made, to facilitate interpretation and implementation, is that the conditional density of the response is well characterized as a mixture of two multivariate normals with the weights dependent on predictors. If we instead allowed many components, we could fit any conditional density but would lose interpretability and encounter challenging identifiability issues. Even if the mixture of two normals assumption is violated, we expect that the proposed approach will nonetheless be highly robust in terms of inferences on the impact of the predictors. We fully expect this to be a rough approximation but a better one than existing practice that dichotomizes or assumes a single normal regression. In practice, the assumption can be checked by first estimating posterior means of the standardized residuals, ei=Σh12(yiθzi), for each subject and then applying typical tests for normality. Standard transformations can be applied (e.g., Box-Cox) to improve fit. ω1(xi) in expression (1) depends on predictors xi through a logistic regression model:


where the coefficient vector β = (β1, β2,…, βp)’ characterizes the effect of predictors on the risk of falling in the minority subpopulation. Due to the logistic regression form, the exponentiated βj coefficients can be interpreted as odds ratios.

2.2. Prior Specification

For the ASPR model in (1)-(2), identifiability of the unhealthy subpopulation necessarily relies on prior information. In the absence of some prior knowledge, the two subgroups would be exchangeable, and we would encounter a label-ambiguity problem. Removal of this problem through appropriate priors is one of the advantages of the simple two component framework over more complex latent class regression models having unknown numbers of components. The most common approach would place restrictions on the means of the components; for example, ordering the components in advance by letting θ11 < θ21. This approach assumes that low values of the first response variable are adverse, which may be reasonable for a given study but is not so in general. Moreover, placing restriction on the means will fail to solve the label ambiguity problem if the components are not separated sufficiently.

Thus, we consider alternative strategies depending on the application. The first is to elicit informative values for the means and covariance matrix in the two components from prior empirical knowledge of the typical distribution of the responses in healthy and unhealthy groups. In the absence of such extra knowledge, we may fit a mixture of two multivariate normals to the data using EM for maximum likelihood estimation, defining the minority component to be adverse. This can be done either from historical data or the current data. Then, fix the θh, Σh at these estimates in the subsequent analysis. This runs the risk of under-estimating uncertainty but has the advantage of simplifying interpretation and completely eliminating identifiability concerns.

Another alternative is to specify conditionally conjugate prior distributions for θh, Σh and γ as follows,


where NIWs(θh, Σh | θ0, ψ0, ρ0, Σ0) is the Normal-inverse-Wishart distribution proportional to Σh(s+ρ0+2)2exp{12tr(Σ0Σh1)ψ02(θhθ0)Σ1(θhθ0)}. As weakly informative empirical Bayes priors, the hyperparameters θ0 and Σ0 are chosen to be the sample means and covariance matrix for all subjects, and we set ψ0 = 1 and ρ0 = s + 2 to reduce the prior information. Additionally, we place an informative prior on the intercept in the logistic regression model (2) after centering the predictors, γ ~ N1(γ | γ0, λ0). For example, by choosing γ0 = −2.20 and λ0 = 2.42 for the intercept, the expected baseline probability (a priori) of an adverse response is 10% and falls in the range between 3% and 30% with 0.95 probability.

As for the priors for β, if xi is low-dimensional, we can rely on standard choices, such as independent Gaussian distributions with modest variance. However, as the number of predictors increases, we need some approach for addressing the high dimensionality. A common strategy in the frequentist literature is to use sparse penalized regression (e.g., Lasso, elastic net, etc) to favor many elements of β that are equal to zero while shrinking the non-zero elements toward zero. In the Bayesian literature, a rich variety of shrinkage priors have been proposed for high-dimensional regression coefficients, with most approaches relying on priors that are centered at zero, potentially with a point mass incorporated to allow variable selection. Hierarchical shrinkage priors that are centered at zero can potentially lead to over-shrinkage of coefficients that are not close to zero. Such over-shrinkage can be reduced by choosing a prior which is concentrated near zero with very heavy tails, but in that case there is no borrowing of information or incorporation of prior knowledge in estimating the coefficients that are not close to zero.

As an alternative approach that had excellent performance in high-dimensional logistic regression, MacLehose and Dunson [17] proposed a multiple shrinkage prior (MSP) for the jth coefficient βj ~ ∫ DE(βj | μj, τj)dP(μj, τj), where DE(βj | μj, τj) is the double exponential (Laplace) distribution with location parameter μj and scale parameter τj; the mixture distribution P is assigned a modified Dirichlet process prior that incorporates a mass at μj = 0 for the first component. Specifically, the MSP is expressed as


where Gamma(τ | a, b) = 1/[baΓ(a)]τa−1 exp(−τ/b) with mean a × b; δθ (·) is the probability measure with all its mass at θ; following MacLehose and Dunson [17], we choose c = 0, a0 = b0 = 30, a1 = b1 = 6.5 and α = 1. Note that the MSP (3) is represented in the stick-breaking form [19], which starts with a unit probability “stick” and sequentially breaks off random proportions of the stick, with each of these pieces corresponding to the probability πt placed on one mixture component (μt,τt). This formulation allows infinitely many components, with only a relatively small number occupied by the p predictors, effectively bypassing the difficult issue of estimating the number of mixture components. In addition, the stick-breaking form facilitates MCMC sampling. The discrete form of P leads to ties between (μj, τj), j = 1, 2,…, p and hence clusters corresponding to multiple (μj, τj) equal to (μt,τt). Consequently, through MSP, the coefficients β will be shrunk to multiple locations μts, including zero in the first cluster (t = 1), corresponding to the usual Bayesian Lasso prior, while the other components are centered at unknown locations away from zero. For βj and βj’ belonging to the same cluster t, βjβj’ with E(βj)=E(βj)=μt and Var(βj)=Var(βj)=τt2. In our application of the ASPR model, it is unlikely that any of the predictors being considered have a log-odds ratio of falling in the adverse sub-group outside of βj [set membership] [−1, 1], corresponding to an interval of [0.37, 2.72] for the odds ratio. In most genetic epidemiology studies involving complex health conditions, one expects at most a modest deviation from a log-odds near zero for single SNPs or SNP × environment interactions. This small signal-to-noise ratio is one aspect that makes detection of important variants so challenging. To express this prior information, while inflating the prior variance somewhat to corresponding to a “weakly informative” prior [20], we let d = 0.1507, which leads to Pr(μt[1,1])=0.99 a priori.

3. Posterior Computation

In describing an approach for posterior computation, we focus on the approach described in Section 2 that places a normalinverse-Wishart prior on the component-specific parameters, an informative prior on the intercept γ for identifiability, and a mixture of double exponential shrinkage prior on the high-dimensional vector of β coefficients. This approach is straightforward to modify to accommodate the other approaches described in Section 2. For example, to instead use the two-stage plug-in approach, we would run the EM algorithm first to estimate μh, Σh for h = 1, 2 and then would hold these component-specific parameters fixed in the proposed data augmentation Gibbs sampling algorithm to be described below. In addition, if an alternative shrinkage prior were used for the coefficients βj, then one could simply modify the sampling steps for updating the βj appropriately. For scale mixture of normal priors, such as double exponentials, t priors or other standard choices, this is straightforward.

If we observe the latent subpopulation index zi directly for each individual and are interested in the coefficients β, then we could apply the MCMC algorithm of MacLehose and Dunson [17] directly. However, because we do not observe zi for any of the subjects, we instead modify their algorithm to include steps for imputing zi from the corresponding full Bernoulli conditional posterior distribution and sampling the mean and covariance specific to each component. We start by relating the latent subpopulation index zi = I(gi > 0) to an auxiliary random variable gi, where I(·) is the indicator function, which equals 1 when gi > 0 and 0 otherwise. To induce expression (2) through marginalizing gi, we assume gi follows a logistic distribution centered on γ+xiβ. Holmes and Held [21] proposed a data augmentation MCMC algorithm for posterior computation in logistic regression models relying on characterizing the latent gi as a scale mixture of normals, with the square root of the scale parameters following a Kolmogorov-Smirnov (KS) distribution. Due to lack of conjugacy of the conditional posteriors of scale parameters specific to each subject, they recommend using rejection sampling. However, use of a large number of rejection sampling steps can lead to inefficiencies, so we instead apply an alternative data augmentation scheme. Following O’Brien and Dunson [22], the logistic distribution can be almost exactly approximated by a noncentral t-distribution tv(giγ+xiβ,σ2)=0N1(giγ+xiβ,σ2ϕi)Gamma(ϕi | ν/2, 2/ν)i, when we set σ2 = π2 (ν − 2)/2ν with degree of freedom ν = 7.3. Kinney and Dunson [23] showed that posterior distributions of gi estimated with the Holmes and Held [21] and O’Brien and Dunson [22] algorithms are essentially completely indistinguishable given sufficient numbers of MCMC samples. We outline the Gibbs sampler for Bayesian adverse subpopulation regression in the following steps:

  1. Draw θh and Σh from NIWs (θh,Σhθ^h,ψ^h,ρ^h,Σ^h),h=1,2, where
    with n1=i=1nI(zi=1),y1=1n1i:zi=1yi and S1=i:zi=1(yiy1)(yiy1);n2=i=1nI(zi=0),y2=12i:zi=0yi and S2=i:zi=0(yiy2)(yiy2).
  2. Impute component indicator zi from the conditional Bernoulli distribution by setting zi = 1 with probability ω1(xi)Ns(yiθ1,Σ1)h=12ωh(xi)Ns(yiθh,Σh)fori=1,2,,n.
  3. Augment auxiliary variable gi, i = 1, 2,…, n, sampled from the normal distribution N1(giγ+xiβ,σ2ϕi), which is truncated above (below) by zero when zi = 0 (zi = 1).
  4. Update ϕ from Gamma (ϕiν+12,2ν+(giγxiβ)2σ2),i=1,2,,n.
  5. Update the regression coefficients β* = (γ, β’)’ given gi, ϕi and other parameters, following the MCMC algorithm of MacLehose and Dunson [17].

Although we illustrate the algorithm focusing on the multiple shrinkage prior, the above algorithm could be easily modified for different shrinkage priors of β by using the corresponding sampling algorithm in the step (e). Moreover, we could combine the shrinkage and selection method (e.g. Lasso and elastic net) for logistic regression with step (a) and (b) to get a Monte Carlo EM algorithm [24] for the adverse subpopulation regression.

4. Simulations

In this section, we examine the performance of our approach along with alternative simple two-stage methods through simulation studies. The two-stage methods generate the binary indicators for the adverse subpopulation in the first stage. For example, indicators can be chosen as the true binary indicators (known for simulation data), estimated by the maximum a posteriori (MAP) allocation from a simple two component mixture model with the EM algorithm, specified by using preselected cutoffs, or identified by K-means clustering for two clusters. In the second stage, we fit both the standard logistic regression model without penalization (Logit-Standard) and the penalized logistic regression models with the shrinkage methods Lasso and elastic net (Logit-Lasso and Logit-ElasticNet) [25].

One hundred datasets were simulated to represent the data observed in the HPHB data set. In particular, we simulated 813 women with two response variables corresponding to infant birth weight and gestational age at delivery. We used maternal genotype for 100 SNPs as predictors that were fixed across the simulations, with only the response variable generation varying. By using the real SNP data, we obtained simulated datasets with a realistic dependence structure among the predictors, which is important given that the dependence structure can have a fundamental impact on variable selection and estimation performance. We simulated data under the model proposed in Section 2.1, with β chosen so that the first ten elements were set equal to 0.5 (corresponding to odds ratios for the minor allele of exp(0.5) = 1.65) and the remaining elements were set equal to zero (corresponding to no association with risk of falling in the adverse subpopulation for SNPs 11,…, 100). In addition, we considered another scenario with the first ten coefficients equal to 0.8 (exp(0.8) = 2.23) and others to zero. We simulated yi based on expression (1), where the θh and Σh were set equal to the maximum likelihood estimates from the HPHB dataset by using a two component latent class model without predictors.

We applied the proposed ASPR model with default priors specified in Section 2.2 and the two-stage methods to the simulated datasets. For ASPR model, we implemented the data augmentation Gibbs sampling algorithm outlined in Section 3. The sampling ran for 11,000 iterations, 1,000 iterations were discarded as a burn-in and every 10th sample was saved to thin the chain. The trace and autocorrelation plots of the posterior samples were examined to determine the convergence. We used the cyclical coordinate descent algorithm by Friedman et al. [25] to find the Lasso or elastic net regularization paths for penalized logistic regression models. Ten-fold cross-validations were used to select the optimal shrinkage parameter which gave the minimal deviance.

We first compared the estimation performance measured by mean squared errors (MSEs) which were calculated for each coefficient across 100 datasets. Table 1 presents the averaged MSEs obtained across the first ten non-null coefficients and the remaining ninety null coefficients. The averaged MSEs of non-null coefficients by ASPR model are smaller than those given by the two-stage methods even with true indicators. More importantly, when the true indicators are unknown, a common scenario in practice, the two-stage methods rely on indicators generated either by classification algorithms (here the MAP allocation and K-means clustering) or by medical cutoffs will inflate the MSEs of non-null coefficients significantly. This observation indicates that if part of subjects are mis-classified in the two-stage methods, the coefficient estimates would be affected in the standard and penalized logistic regressions. A better solution is thus to avoid the classification or using cutoffs in the first place.

Table 1
Simulation results to compare coefficient estimation and variable selection by the ASPR model, the standard logistic regression models without penalization and the penalized standard logistic regression models with shrinkage methods for true non-null ...

We also compared the variable selection ability for different methods. Based on substantive knowledge, we choose an interval null hypothesis as H0j : |βj| ≤ [sm epsilon]; we chose [sm epsilon] = 0.1 in practice as log odds ratios within ±0.1 of zero are clearly not significant from a public health viewpoint in our motivating applications. We can then calculate the posterior probability of the alternative H1j:βj as π~j=g=1GI(βjg)G under the ASPR model, with βjg the gth MCMC draw from the posterior after discarding a burn-in and I(·) the indicator function. Larger values of π~j suggest a greater weight of evidence that the jth predictor is significant from a public health viewpoint based on our elicited [sm epsilon] value. We further chose a threshold c for the π~j with the goal of controlling the false discovery rate (FDR)[26]; selecting predictors having π~jc as statistically significant adjusting for multiple comparisons. For a given c, we could estimate the posterior expected FDA [27] as j=1p(1π~j)I(π~jc)j=1pI(π~jc), and chose a c as the smallest value such that the FDR was less than or equal to a desired level. For the two-stage methods, the predictor βj was selected if its 90% confidence interval did not contain zero for the standard logistic regression model, or if the estimate β^j0 for the penalized logistic regression models.

Based on above selection criteria, we are able to assess the variable selection ability by using the averaged true positive rate (TPR) and false positive rate (FPR) for different methods across multiple datasets. For a given method and simulated dataset, TPR is defined as the number of predictors correctly selected as significant predictors divided by the number of true non-null predictors, and similarly FPR as number of predictors falsely selected as significant predictors divided by the number of true null predictors. As listed in Table 1, the ASPR model achieves higher TPR and lower FPR (controlling FDR at 50%) than the alternative methods for most scenarios. The only exception is the Logit-ElasticNet method in the (unrealistic) case in which the true sub-population indicators are assumed known. One may argue that it is arbitrary to control FDR at 50% in the ASPR approach and use 90% confidence intervals in the two-stage method, with different values leading to different comparisons of TPRs and FDRs. To obtain a more fair comparison, we varied the values of FDR, confidence interval value, and threshold [sm epsilon]’ (previously assumed to be zero) for penalized logistic regression and calculated a series of TPRs and FDRs, which are plotted in Figure 1 as receiver operating curves (ROCs). It it clear that the ROC curves by ASPR model stay above the other curves and are closer to the top left corner, indicating a better tradeoff between the TPRs and FPRs. In addition, we may calculate AUCs (area under the curves), which is approximated by applying the trapezoidal rule for a series of TPRs and FPRs. The AUCs by various methods under different scenarios are given in Table 1 with ASPR model showing the largest AUC. It suggests that in general the method by ASPR model achieves a better performance in variable selection than the alternative methods.

Figure 1
Receiver Operating Characteristic (ROC) curves for different methods: —— by ASPR-MSP method, An external file that holds a picture, illustration, etc.
Object name is nihms-423359-ig0002.jpg by the two-stage methods with true indicators, An external file that holds a picture, illustration, etc.
Object name is nihms-423359-ig0003.jpg by the two-stage methods with subjects allocated by maximum a posteriori, An external file that holds a picture, illustration, etc.
Object name is nihms-423359-ig0004.jpg by the two-stage methods ...

5. Application to Pregnancy Outcomes

There is increasing appreciation that interactions between the genetic and environmental factors contribute to adverse birth outcomes. In this analysis, we investigated the effects of maternal genotype and their interaction with lead and tobacco exposure on adverse birth outcomes in the infant, adjusting for several confounding factors. The dataset included 813 non-Hispanic black pregnant women who had singleton pregnancy and were less than 28 weeks gestation at the time of enrollment in HPHB study. Based on published studies, we focused on 31 candidate genes which are involved with maladaptive inflammatory regulation, maternal-fetal circulation, stress response, and environmental contaminant metabolism. For those candidate genes, we selected 275 haplotype tagging SNPs which effectively capture the genetic diversity of these genes. Please see Swamy et al. [28] and Ashley-Koch et al. [29] for further details on genotyping approaches. A detailed description of the SNPs and genes used in this analysis can be found in the Web Appendix. For the purpose of this analysis, we assumed that the risk for adverse birth outcomes would be associated with minor alleles. The value of each SNP was recorded as one if the mother carried the less frequent allele and as zero otherwise. In addition to the genetic data, we measured maternal blood levels of lead and cadmium. The interaction of lead and cadmium with the SNPs in relation to gestational age and birth weight is an important research question. We also controlled confounding factors by including them in the analysis. These confounders are mother’s age, recorded as age group 18-20, 21-35 vs 35+; education, as no college vs some college; insurance, as private vs others; parity, as zero vs others; infant sex, as male vs female.

We fit the ASPR model with default priors after checking the composite normal assumption. The MCMC algorithm was run for 11,000 iterations with the first 1,000 iterations discarded as burn-in and every 10th remaining draw retained for analysis. The trace plots and the autocorrelation plots suggested the algorithm converged fast and mixes well. Table 2 presents the posterior summary for the component parameter θh = (θh1, θh2)’ and Σh=(Σh11Σh12Σh21Σh22). The table indicates that the healthy group in general has longer gestational age with higher birth weight, compared to the unhealthy group. In addition, the subjects in the healthy group are more homogeneous with the smaller values in the components of Σh. Figure 2(a) shows shaded circles at the raw data points, with the darkness of the shading being proportional to the estimated posterior probability of allocation to the healthy subgroup. Standard cutoffs for defining preterm birth and low birth weight are also shown. Although most of the children that are in the preterm and low birth weight bin have small posterior probabilities of allocation to the healthy subgroup, there is substantial uncertainty around the boundary region in particular. This uncertainty is taken into the account by the ASPR model but not by the other two-stage approaches. Figure 2(b) plots the raw data and also demonstrates the contours of posterior predictive density based on the MCMC samples of ASPR model. There seems no systematic discrepancy between the observations and the contours of posterior predictive density, suggesting the ASPR model fits the data well.

Figure 2
Scatter plots of birth weight (grams) and gestational age (days) overlaid with (a) the posterior mean of allocation weights ω2(xi) with the cutoffs (dash lines) at 257 days for the gestational age and 2500 grams for the birth weight; (b) posterior ...
Table 2
Posterior summary for component parameters in the ASPR model. θ·1 and θ·2 are the average gestational age and birthweight in the subgroup with corresponding variance and covariance denoted as Σ·11, Σ ...

Figure 3 shows the posterior means and 90% credible intervals for the coefficients of SNPs and their interactions with lead and cadmium. Although all the credible intervals cover zero, the posterior means of coefficients of SNPs rs2420620, rs10107390 and rs7017402, given in Table 3, stay clearly away from zero in the main effect (the top panel). This result is also supported by Web Figure 1, which plots π~j=g=1GI(βjg)G, the posterior probability that jth predictor have an effect based on G samples of βj. The larger the posterior probability π~j suggests a stronger effect. When we take [sm epsilon] = 0.1, coefficient of SNP rs2420620 stands out with the value 0.123 and coefficients of SNPs rs10107390 and rs7017402 with the value 0.098 and 0.092, while the majority of other posterior probabilities are around 0.05. SNP rs2420620 is located in gene GRK5 and SNPs rs10107390 and rs7017402 in gene NAT1. GRK5 is a member of the G protein-coupled receptor kinase family which is involved in regulating the activity levels of G protein-coupled receptors. Polymorphisms in GRK5 have been previously linked to risk for heart failure in African Americans [30]. The N-acetyltransferase genes (NAT1 and NAT2) are involved in the metabolism of xenobiotics. NAT1 has been shown to be expressed in early placenta [31]. We also analyzed the data using two-stage methods, in which the penalized logistic regressions were applied with the indicators generated by maximum a posteriori method. The results are presented in Table 3. Both penalized logistic regression methods identifies the SNP rs2420620 in GRK5 and SNP rs7017402 in NAT1, since their coefficients are not equal to zero. These two SNPs are also selected by the ASPR model. For ASPR model, additional SNPs in genes CR1 and IGF1 are interesting but not consistent across the three approaches. This may suggest that these are false positive results.

Figure 3
Posterior mean and 90% credible interval for coefficients β in ASPR model. The coefficients are illustrated for the SNPs main effects and their interactions with lead and cadmium respectively.
Table 3
List of SNPs which have largest estimated SNP effects for different methods.

6. Discussion

In this article, we propose an adverse subpopulation regression model for investigating the relationship among multiple quantitative outcomes and high-dimensional predictors. Unlike the traditional two-stage methods, the proposed method does not require dichotomizing the continuous outcomes into binary indicators and thus avoids information loss. Two stage methods are outperformed with smaller MSE and higher area under the ROC curve for variable selection as demonstrated by the simulation studies. The new model has been applied to examine the effect of gene and environment interaction on adverse pregnancy outcomes. The results suggest the gene GRK5 and NAT1 may influence the occurrence of low birth weight and preterm delivery. Our focus is on defining a simple approach for assessing the impact of high-dimensional predictors on the risk of an adverse outcome when data consist of multiple quantitative traits. By using a two component mixture model, we can use a binary response logistic regression model, a framework that is very familiar to epidemiologists, to characterize non-linear genetic and environment associations with potentially complex multivariate quantitative traits. The proposed framework provides a parsimonious alternative to normal linear regression and logistic regressions based on preliminary categorization of quantitative traits, and should be able to detect associations that would not be detected with these methods. The proposed ASPR framework has purposefully been chosen to be a simple and parsimonious model that is easy to interpret and is scalable to high-dimensional predictors. We are aware that dealing with hundreds of thousands or millions of SNPs will cause serious computational burden for ASPR model. This limitation is shared by most of existing shrinkage methods. The popular solution is to reduce the dimensionality of the predictors first by using sure independence screening [32] or strong rules for discarding predictors [33]. We choose Normal-inverse-Wishart distribution as a conjugate prior for the parameters of multivariate normal distribution. The closed forms in the MCMC step a) lead to computational efficiency. With a large number of traits, it however may be computational expensive to evaluate the posterior Normal-inverse-Wishart density. For sake of simplicity and parsimony, we have avoided fully nonparametric Bayesian density regression models [34] that allow unknown numbers of latent classes. Although generalizations in such directions are conceptually straightforward, for each additional latent class, one introduces an additional p regression coefficients and corresponding hyperparameters, and difficult issues in identifiability, label switching and computational complexity arise. For our pregnancy outcome application, the ASPR model provides a good fit to the data, as illustrated by the posterior predictive density.

Supplementary Material


This work was supported by Award Number R01ES017436 from the National Institute of Environmental Health Sciences, and by funding from the National Institutes of Health (5P2O-RR020782-O3) and the U.S. Environmental Protection Agency (RD-83329301-0). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Environmental Health Sciences, the National Institutes of Health or the U.S. Environmental Protection Agency. We also thank the editor and two anonymous reviewers for very helpful comments and suggestions.


7. Supplementary Materials Web Appendix and Web Figure referenced in Sections 5 and the MCMC codes for ASPR model are available at the Statistics in Medicine website


1. Miranda ML, Maxson P, Edwards S. Environmental contributions to disparities in pregnancy outcomes. Epidemiologic Reviews. 2009;31(1):67. [PubMed]
2. Boucher KM, Slattery ML, Berry TD, Quesenberry C, Anderson K. Statistical Methods in Epidemiology: A Comparison of Statistical Methods to Analyze Dose-Response and Trend Analysis in Epidemiologic Studies. Journal of Clinical Epidemiology. 1998;51(12):1223–1233. [PubMed]
3. Clogg CC. Handbook of statistical modeling for the social and behavioral sciences. Springer; New York: 1995.
4. Lindsay BG. Mixture models: theory, geometry, and applications. IMS; Beachwood: 1995.
5. McLachlan GJ, Peel D. Finite mixture models. Wiley Blackwell; New York: 2000.
6. Gage TB. Classification of births by birth weight and gestational age: an application of multivariate mixture models. Annals of Human Biology. 2003;30(5):589–604. [PubMed]
7. Gage TB, Fang F, Stratton H. Modeling the pediatric paradox: Birth weight by gestational age. Biodemography and Social Biology. 2008;54(1):95–112. [PMC free article] [PubMed]
8. Schwartz SL, Gelfand AE, Miranda ML. Joint Bayesian analysis of birthweight and censored gestational age using finite mixture models. Statistics in Medicine. 2010;29(16):1710–1723. [PubMed]
9. Stegle O, Parts L, Durbin R, Winn J. A bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS computational biology. 2010;6(5):e1000 770. [PMC free article] [PubMed]
10. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. 1996;58(1):267–288.
11. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67(2):301–320.
12. Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research. 2001;1:211–244.
13. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25(6):714–721. [PMC free article] [PubMed]
14. Cho S, Kim H, Oh S, Kim K, Park T. BMC proceedings. vol. 3. BioMed Central Ltd; 2009. Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis; p. S25. [PMC free article] [PubMed]
15. Ayers KL, Cordell HJ. Snp selection in genome-wide and candidate gene studies via penalized logistic regression. Genetic Epidemiology. 2010;34(8):879–891. [PMC free article] [PubMed]
16. Park T, Casella G. The bayesian lasso. Journal of the American Statistical Association. 2008;103(482):681–686.
17. MacLehose RF, Dunson DB. Bayesian Semiparametric Multiple Shrinkage. Biometrics. 2010;66(2):455–462. [PMC free article] [PubMed]
18. Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology. 1995;6(4):450. [PubMed]
19. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4(2):639–650.
20. Gelman A, Jakulin A, Grazia M. A weakly informative default prior distribution for logistic and other regression models. Annals of Applied Statistics. 2008;2(4):1360–1383.
21. Holmes CC, Held L. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis. 2006;1(1):145–168.
22. O’Brien SM, Dunson DB. Bayesian multivariate logistic regression. Biometrics. 2004;60(3):739–746. [PubMed]
23. Kinney SK, Dunson DB. Fixed and random effects selection in linear and logistic models. Biometrics. 2007;63(3):690–698. [PubMed]
24. Wei GCG, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association. 1990;85(411):699–704.
25. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1. [PMC free article] [PubMed]
26. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. 1995;57(1):289–300.
27. Müller P, Parmigiani G, Robert C, Rousseau J. Optimal sample size for multiple testing. Journal of the American Statistical Association. 2004;99(468):990–1001.
28. Swamy GK, Garrett ME, Miranda ML, Ashley-Koch AE. Maternal vitamin D receptor genetic variation contributes to infant birthweight among black mothers. American Journal of Medical Genetics Part A. 2011;155(5):1264–1271. [PMC free article] [PubMed]
29. Ashley-Koch AE, Garrett ME, Edwards S, Quinn KS, Swamy GK, Miranda ML. Maternal genetic variation in genes involved in the inflammatory response interact with measures of air pollution exposure to affect infant birthweight among non-Hispanic black women. 2011. Submitted.
30. Liggett SB, Cresci S, Kelly RJ, Syed FM, Matkovich SJ, Hahn HS, Diwan A, Martini JS, Sparks L, Parekh RR, et al. A GRK5 polymorphism that inhibits β-adrenergic receptor signaling is protective in heart failure. Nature medicine. 2008;14(5):510–517. [PMC free article] [PubMed]
31. Smelt VA, Upton A, Adjaye J, Payton MA, Boukouvala S, Johnson N, Mardon HJ, Sim E. Expression of arylamine N-acetyltransferases in pre-term placentas and in human pre-implantation embryos. Human molecular genetics. 2000;9(7):1101. [PubMed]
32. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B. 2008;70(5):849–911. [PMC free article] [PubMed]
33. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, Tibshirani RJ. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B. 2012;74(2):245–266. [PMC free article] [PubMed]
34. Dunson DB, Pillai N, Park JH. Bayesian density regression. Journal of the Royal Statistical Society: Series B. 2007;69(2):163–183.