Search tips
Search criteria 


Logo of biostsLink to Publisher's site
Biostatistics. 2012 January; 13(1): 74–88.
Published online 2011 September 10. doi:  10.1093/biostatistics/kxr024
PMCID: PMC3276274

Latent class models for joint analysis of disease prevalence and high-dimensional semicontinuous biomarker data


High-dimensional biomarker data are often collected in epidemiological studies when assessing the association between biomarkers and human disease is of interest. We develop a latent class modeling approach for joint analysis of high-dimensional semicontinuous biomarker data and a binary disease outcome. To model the relationship between complex biomarker expression patterns and disease risk, we use latent risk classes to link the 2 modeling components. We characterize complex biomarker-specific differences through biomarker-specific random effects, so that different biomarkers can have different baseline (low-risk) values as well as different between-class differences. The proposed approach also accommodates data features that are common in environmental toxicology and other biomarker exposure data, including a large number of biomarkers, numerous zero values, and complex mean–variance relationship in the biomarkers levels. A Monte Carlo EM (MCEM) algorithm is proposed for parameter estimation. Both the MCEM algorithm and model selection procedures are shown to work well in simulations and applications. In applying the proposed approach to an epidemiological study that examined the relationship between environmental polychlorinated biphenyl (PCB) exposure and the risk of endometriosis, we identified a highly significant overall effect of PCB concentrations on the risk of endometriosis.

Keywords: Categorical data, Chemical exposure biomarkers, Latent variables, Monte Carlo EM algorithm, Random effects


In many epidemiological studies, it is of interest to study the association between biological markers and a binary disease outcome. When only a few biomarkers are considered, traditional statistical modeling strategies can be directly used. Challenges arise, however, when a large number of biomarkers are involved. This paper proposes joint latent class models for high-dimensional biomarker data where numerous biomarkers and disease prevalence are linked through underlying latent classes of disease risk.

As an example, we consider an epidemiological study that assessed the relationship between exposures to 62 individual polychlorinated biphenyl (PCB) congeners (chemicals with a similar structure) and endometriosis, an estrogen-dependent gynecological disease (Buck Louis and others, 2005). These PCBs can be divided into 3 groups: estrogenic PCBs, antiestrogenic PCBs, and other PCBs. The study recruited a cohort of 79 eligible participants (between 18 and 40 years of age) with serum measurements who underwent laparoscopy. Participants were classified as having endometriosis or not based on laparoscopic visualization. For every participant, the concentration of each of the 62 PCB congeners was quantified by dual-column gas chromatography with electron capture from her serum specimens. The primary goal of the study was to investigate the relationship between exposure to PCB congeners and the risk of endometriosis. The top and bottom panels in Figure 1 show boxplots for each of 62 PCBs for subjects with and without endometriosis, respectively. Differences between the 2 profiles appear subtle and may be complex. For example, out of 62 PCB markers, approximately 82% have higher median exposure levels for subjects with endometriosis as compared with the subjects without endometriosis. Therefore, it is desirable to develop a flexible approach, which can identify potentially complex associations between high-dimensional PCB exposures and endometriosis diagnosis.

Fig. 1.

Boxplots of each of 62 PCBs for subjects with and without endometriosis.

Several challenges arise in analyzing these data. (1) “High dimensionality of biomarker data.” With 62 PCB congeners and a relatively small number of participants (I = 79), it may be difficult to assess the complex relationship between PCB concentrations and the risk of endometriosis with traditional statistical techniques such as logistic regression. (2) “Collective effect of numerous biomarkers.” PCB congeners may act on the risk of endometriosis in complex ways. For example, small changes in all or many PCBs may act simultaneously in increasing the risk of endometriosis. Alternatively, larger changes in a few PCBs may influence risk. It is desired to develop methodology that allows for these flexible association structures in risk modeling. (3) “Zero exposure level.” For many of the PCBs, there is a sizable proportion of zero values. These values can result from either true zero concentrations of PCB congeners or PCB values that are too small to detect in the laboratory. Moreover, the occurrence of zero PCB values is potentially related to the means of the positive values. Figure 2A shows the rate of observing nonzero PCB concentrations versus their means. It suggests that the probability of observing nonzero exposures increases with the nonzero PCB means in a linear fashion. (4) “Variance depending on mean.” Figure 2B shows the standard deviations (SDs) of the observed nonzero PCB concentration levels versus their means. It suggests that the PCB SDs are related in a linear fashion to their corresponding means. These features of the data need to be appropriately accounted for in the analysis.

Fig. 2.

Features of the PCBs and endometriosis data. (A) The rate of nonzero PCB concentrations versus the mean of nonzero PCB concentrations and (B) SD of nonzero PCB concentrations versus the mean of nonzero PCB concentrations. Each point in the plot represents ...

In this paper, we propose joint latent class models with random effects to account for the aforementioned data analysis challenges. In particular, we consider a class of joint latent class models for associating the pattern of high-dimensional exposure data with the risk of disease through the incorporation of latent classes. Specifically, patterns in PCB exposures and risk of endometriosis are associated through assumed latent classes, which are shared between the 2 modeling components. The latent class structure allows for the flexible identification of important patterns of exposure, which can place women at high risk for endometriosis. The formulation of the joint latent class models for semicontinuous biomarkers (i.e. continuous value with numerous zeros) link the patterns in the biomarker and the risk of disease through unobserved latent classes. The pattern in the mean biomarker values varies by class and is allowed to change across biomarkers by the incorporation of random effects (Laird and Ware, 1982). Our work has close connection to the latent class models discussed in Bartholomew (1987) and Bandeen-Roche and others (1997). These models, which are recognized as categorical variable analogs to factor analysis, postulate the joint distribution of multiple observed discrete variables as a mixture of a few distributions conditional on the value of a latent categorical variable. However, our model is for the disease biomarker data where the disease indicator is binary and the biomarker levels are all semicontinuous. Another related approach is the finite Gaussian mixture model discussed in Fraley and Raftery (2002) and McLachlan and Peel (2002). The finite Gaussian mixture model assumes that continuous data follow a finite normal mixture model with each component corresponding to a latent cluster or latent class. This model can be viewed as a latent variable model since the unobserved categorical variable characterizes classification of observations into clusters. But unlike the proposed latent class joint model, the finite Gaussian mixture model is a model-based clustering technique for biomarkers where clustering is not informed by the disease outcome. Our work is also related to Lin and others (2002) who used latent classes to link longitudinal biomarker data and event times.

The remainder of the paper is organized as follows. Section 2 presents the joint latent class models for high-dimensional semicontinuous biomarker data and the binary disease outcome. Section 3 derives the log likelihood of the proposed models and discusses strategies for estimation. In Section 4, we apply the proposed model to data from a study examining the effects of environmental PCB concentrations on the risk of endometriosis. In Section 5, simulation studies are performed to demonstrate the performance of the estimation and model selection methods. Finally, we conclude in Section 6.


2.1. Joint latent class models with 2 latent classes

Let Yi be a binary variable representing disease status for participant i, i = 1,2,…,I, where Yi = 1 indicates the ith participant has disease and Yi = 0 indicates no disease. Denote Xij as the level of the jth biomarker measured from the ith participant, j = 1,2,…,J. In the motivating example of PCB concentrations and endometriosis, Yi represents endometriosis status of the ith participant and Xij represents the concentration level of the jth PCB congener measured from the ith participant. Let Li denote the latent disease risk class for the ith participant. We first consider the case with only 2 latent classes: low-risk class, denoted by Li = 0, and high-risk class, denoted by Li = 1. In Section 2.2, we extend the model to the case of more than 2 classes using an ordinal latent variable. Let π1 = P(Li = 1) be the probability of a participant belonging to the high-risk class and let π0 = 1 − π1. The probability of disease given the latent disease class is modeled as:

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx1_ht.jpg

where β0 and β1are regression coefficients characterizing the relationship between the latent variable and the disease risk. In (2.1), the coefficient β1 is the positive increase of the log odds of disease prevalence from low-risk to high-risk class, and the positivity constraint ensures model identifiability related to the “label switching” problem common in latent class modeling.

Each biomarker level Xij, conditional on the latent class, is treated as a semicontinuous measurement (Olsen and Schafer, 2001) and can be denoted by 2 variables:

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx2_ht.jpg

where Uij is the binary nonzero biomarker value indicator and Vij is the value of the biomarker when it is nonzero. We model Vij using the normal distribution:

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx3_ht.jpg

where μij(Li,bj) = E(Vij|Li,bj) is the conditional mean of Vij that depends on latent class indicator Li and random effects vector bj, and g2 is a variance function that depends on ξ, a vector of variance parameters. In (2.7), we allow the variance of nonzero biomarker values to depend on the mean biomarker value through the variance function g2(μij(Li,bj),ξ). In the application in Section 4, the SD of PCB concentration is proportional to PCB mean (see Figure 2B). Thus, in this case, it is reasonable to let g2(μij(Li,bj),ξ) = ξ2μij2(Li,bj).

For the mean structure of each biomarker, we assume

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx4_ht.jpg

where α0 and α1 are parameters defining the mean levels of the biomarkers in the low- and high-risk groups, and An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx11_ht.jpg is a random effects vector that follows a multivariate normal distribution. The inclusion of the latent classes Li allows the multiple biomarkers from the same participant to be correlated, while the random effects bj0 and bj1 allow for individual biomarkers to deviate from the overall mean. If the 2 random effects degenerate, or equivalently σ02 = σ12 = ρ = 0, then μij(Li,bj) = α0 + α1Li, which means the average concentration of each biomarker is α0 for participants in the low-risk class and α0 + α1 for participants in the high-risk class (see Figure 3A). If σ02≠0 and σ12 = ρ = 0, then μij(Li,bj) = α0 + α1Li + bj0. In this case, there is a biomarker-specific shift bj0 in the average concentration of each biomarker, but the difference in biomarker between classes is the same across biomarker (see Figure 3B). With 2 nondegenerated random effects bj0 and bj1, or equivalently σ02≠0 and σ12≠0, we allow both biomarker-specific baseline shift bj0 and biomarker-specific difference α1 + bj1 between the 2 classes to vary across the J biomarkers (see Figure 3C). When α1 is large and σ12 is small, the modeling approach corresponds to the situation where cumulative PCB effects transmit risk. In this case, a simple logistic regression model where the independent variable is the sum of PCB exposures describes the association. However, when α1 is small and σ12 is large, the association is complex and a simple cumulative logistic model will miss this association. We illustrate this with a simulation study in the supplementary material available at Biostatistics online.

Fig. 3.

Simulated patterns of mean PCB concentrations with different random effects specifications in μij(Li,bj) = α0 + α1Li + bj0 + bj1Li: (A) μij(Li,bj) = α0 + α1Li; (B) μij(Li,bj) = α0 + α ...

Finally, the model for the positive biomarker indicator Uij is

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx5_ht.jpg

in which h(μij(Li,bj)) can be specified according to data features. The inclusion of μij(Li,bj) in function h allows for a flexible yet parsimonious way to model the probability of a positive biomarker concentration as a function of the mean of the nonzero biomarker values. The parameter η1 and function h in (2.9) characterize the relationship between the occurrence of a positive measurement and the mean of the measurement given that it is positive. In our motivating example, there is a positive linear relationship between the occurrence of a positive value and the mean of these values (see Figure 2A). This motivates the specification h(μij(Li,bj)) = μij(Li,bj).

Equations (2.1–2.9) characterize joint latent class models with random effects for high-dimensional biomarker data and disease prevalence, where the disease prevalence and biomarker levels are linked and jointly modeled through binary latent classes among participants. In the next section, we extend the model to the situation in which the latent risk classes are ordinal.

2.2. Joint latent class models with multiple latent classes

When more than 2 latent classes are considered, a variety of parameterizations could be pursued to extend the proposed 2-class model. A natural specification would be to treat each latent class as a categorical variable and to treat the latent classes as dummy variables in the disease prevalence model (2.1), the mean structure model (2.8), and the model for a positive biomarker value (2.9). This would require introducing fixed and random effects of dimension equal to the number of latent classes in each of the model components. Although rich and flexible, this modeling strategy is complicated and computationally demanding. Further, this approach does not exploit the inherent ordinal nature of disease severity. As a result, we build a parsimonious extension by treating Li = 0,1,…,K − 1 as an ordinal variable in (2.1–2.5). In essence, we model both the logit of disease prevalence and biomarker levels with linear functions of the ordinal latent class variable Li. For example, if 3 latent classes are considered, the model then yields probabilities P(Yi = 1|Li = 0) = exp(β0)/[1 + exp(β0)], P(Yi = 1|Li = 1) = exp(β0 + β1)/[1 + exp(β0 + β1)], and P(Yi = 1|Li = 2) = exp(β0 + 2β1)/[1 + exp(β0 + 2β1)]. These correspond to the disease prevalence of low-, middle-, and high-risk classes, respectively. Similarly, the means of biomarker levels for the 3 latent classes are α0, α0 + α1, and α0 + 2α1, respectively. We focus on this parameterization to achieve a parsimonious modeling framework for the case of multiple disease classes.

2.3. Covariates for joint latent class models

Participant-specific or biomarker-specific covariates can be incorporated into the disease probability (2.1) and biomarker distribution (2.7) as follows: β1 > 0; and Vij|Li,bj,Zij),ξ)),

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx6_ht.jpg


An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx7_ht.jpg

where Wi is a vector of participant-specific covariates that may affect disease prevalence and Zij is a vector of variance covariates whose components are either participant specific or biomarker specific. In our example described in Section 1, there are 3 groups of biomarkers among the 62 PCB congeners: estrogenic PCBs, antiestrogenic PCBs, and other PCBs. Thus, in Section 4, we incorporate the PCB grouping information into (2.11) with Zij comprising 2 indicator variables.

Covariates can also be introduced into (2.5):

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx8_ht.jpg

with covariate vector Tij and parameter vector ζ.


3.1. The likelihood

The proposed joint latent class model (with extended covariates in Section 2.3) contains unknown parameters: β0, β1, γ in (2.10), ξ, α0, α1, σ0, σ1, ρ in (2.11) and (2.8), and η0, η1, ζ in (2.12). If we define the probability that a participant belongs to the kth class as πk = P(Li = k), k = 0,1,…,K − 1, then the components of π = (π0,π1,…,πK − 1) are also unknown parameters. Let yi, uij, vij, wi, zij, and tij be the observed values of Yi, Uij, Vij, Wi, Zij, and Tij, respectively, and let θ = (β0,β1,γ,ξ,α0,α1,σ0,σ1,ρ,η0,η1,π,ζ). Note that the latent variable Li is crossed with the random effects bj in the proposed model. Therefore, the likelihood of the proposed model is given by

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx9_ht.jpg

where fVij|Li,bj is the probability density function of a univariate normal distribution with mean μij(Li,bj) and variance σ2g2(μij(Li,bj,zij),ξ) as in (2.7), μij(Li,bj) = α0 + α1Li + bj0 + bj1Li as in (2.8), and fb is the bivariate normal distribution probability density function of bj as described in Section 2.

3.2. Monte Carlo EM algorithm

We consider an expectation-maximization (EM) algorithm (Dempster and others, 1977) as follows. We treat the latent class variable L = (L1,…,LI) and the random effects b = (b1,…,bJ) as missing data in the EM algorithm. We denote Xobs as the observed data, Zmis = (L,b) as the missing data, and let Ycom = (Xobs,Zmis). At the (t + 1)th iteration of the EM algorithm, the E-step involves calculation of the Q-function Q(θ|θ(t)) = E[logf(Ycom|θ)|xobs,θ(t)] = ∫logf(ycom|θ)f(zmis|xobs,θ(t)) dzmis, where θ(t) denotes the parameter value from the tth iteration, f(zmis|xobs,θ(t)) is the conditional distribution of (L,b) given the observed data and θ(t), and f(ycom|θ) is the full likelihood. The M-step involves maximizing Q(θ|θ(t)) with respect to θ to yield the new update θ(t + 1). The process is iterated from a starting value θ(0) to convergence. Under regularity conditions, the value at convergence maximizes (3.1).

The dimensionality of integration and summations in likelihood function (3.1) increases with the number of study participants and biomarkers. As a result, the integration in the E-step is intractable with numerical integration techniques. Therefore, we conduct estimation using Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990). In this algorithm, we form the Monte Carlo approximation to the required expectation in Q(θ|θ(t)). Booth and Hobert (1999) proposed using a rejection sampling or an importance sampling scheme. Their method produces independent and identically distributed samples that may be used to assess Monte Carlo error at each iteration and hence suggest a rule for changing the sample size to enhance speed. However, their method may break down when the intractable integrals in the likelihood function are of high dimension. An alternative approach is to use the Metropolis–Hastings algorithm, as in McCulloch (1997), to sample from the conditional distribution using the density of latent variables as the proposal distribution. Here, we use the Metropolis–Hastings algorithm but with an adaptive proposal distribution. To generate N values (l(n),b(n)), n = 1,2,…,N, from the conditional distribution f(l,b|xobs,θ(t)) using the Metropolis–Hastings algorithm, we start from the starting values (l(0),b(0)) with f(l(0),b(0)) > 0, and then, at the nth Metropolis–Hastings step, sample a candidate (l*,b*) from f~(t)(l,b). Let (l(n + 1),b(n + 1)) = (l*,b*) with probability min(1,R), where R=f(l*,b*|xobs,θ(t))f~(t)(l(n),b(n))/f(l(n),b(n)|x o b s,θ(t))f~(t)(l*,b*), and (l(n + 1),b(n + 1)) = (l(n),b(n)) otherwise. The proposal distribution at the tth iteration of the EM algorithm is constructed to be An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx12_ht.jpg, where An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx13_ht.jpg is the empirical distribution of li's sampled from the (t − 1)th MCEM iteration and An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx14_ht.jpg is the product of 2 Student's t-distribution centered, respectively, at the means of bj0's and bj1's sampled from the (t − 1)th MCEM iteration (Gelman and others, 1995). The proposal distribution presented here dynamically incorporates the obtained information of conditional distribution f(l,b|xobs,θ(t)) from the previous MCEM step and thus greatly improves the performance of the Metropolis–Hastings Markov chain. The starting values (l(0),b(0)) of the Metropolis–Hastings algorithm are chosen to be the the M-step estimates of the previous step of the MCEM algorithm. We run 100 steps in the EM algorithm and 10 000 Monte Carlo iterations in each MCEM step. To incorporate the constraint β1 > 0 in the MCEM algorithm, we let β1 = eβ1* with − < β1* < + and implement the MCEM algorithm with respect to the new parameter β1*.

3.3. Empirical Bayes estimates of the random effects

Our estimation procedure consists of 2 steps: frequentist maximum likelihood estimation described in Section 3.2 for the unknown fixed parameters and empirical Bayes estimation for the random effects bj in this section. The maximum likelihood estimation procedure is not Bayesian since no priors are assigned to the fixed effects. In obtaining the empirical Bayes estimates of the random effects, the random effects distributions are treated as priors. This is commonly done in the literature of mixed effects models; see Laird and Ware (1982) for reference.

Viewing the bivariate normal distribution of bj as a prior distribution, approximate inferences for bj can be obtained by calculating a posterior mean E(bj|θ,xobs) and covariance matrix var(bj|θ,xobs) with the unknown parameter vector θ replaced by its maximum likelihood estimates Tnqboldθ^. Consider the posterior expectation of some function x2133 of bj, which are of the form E(x2133(bj)|θ,xobs) = ∫x2133(bj)p(bj|θ,xobs) dbj. The approximate inference for bj can be carried out by calculating E(x2133(bj)|θ,x o b s) with some form of x2133. For instance, x2133(bj) = bj results in the posterior mean E(bj|θ,xobs). The calculation is simply a by-product of the MCEM algorithm.


We fit the joint latent class models to data from the epidemiological study that examines the effect of environmental PCB concentrations on the risk of endometriosis. Details of the study have been previously reported in Buck Louis and others (2005) and have been summarized in Section 1. The primary interests of the data analysis are in examining the association between complex patterns of PCB concentrations and the risk of endometriosis. Using the methodology developed in the previous sections, we fit a series of models with different numbers of latent classes. As discussed in Section 2, there is evidence to suggest h(μij(Li,bj,Zij),Tij,ζ) = μij(Li,bj,Zij) (Figure 2A) and g2(μij(Li,bj,Zij),ξ) = ξ2μij2(Li,bj,Zij) (Figure 2B) when fitting the models. Two dummy variables Ij,e and Ij,a are created to represent estrogenic PCBs (Ij,e = 1 if a PCB is estrogenic; Ij,e = 0 otherwise) and antiestrogenic PCBs (Ij,a = 1 if a PCB is antiestrogenic; Ij,a = 0 otherwise), respectively. Thus, the PCB mean structure is:

An external file that holds a picture, illustration, etc.
Object name is biostskxr024fx10_ht.jpg

The MCEM algorithm was implemented as described in Section 3 and standard errors (SEs) were estimated using the nonparametric bootstrap method with 500 bootstrap samples (Efron and Tibshirani, 1993).

Table 1 shows parameter estimates and SEs of 4 random effects latent class models, with the number of latent classes being 2, 3, 4, and 5, respectively. The table also reports Bayesian information criterion (BIC) type ICQ values (see Ibrahim and others, 2008 and the supplementary material available at Biostatistics online) for the 4 candidate models. The model selection criterion ICQ chose the 3-class model as the best among the competing ones. This model divides participants into 3 disease classes according to their risks with respect to endometriosis: 71.40% of the participants belonged to the low-risk group, 25.27% to the middle-risk group, and 3.33% to the high-risk group. The endometriosis prevalences in the 3 groups were estimated as 30.42%, 53.16%, and 74.66%, respectively. Since β1 was restricted to be positive, we performed a one-sided Wald test of H0:β1 = 0 versus H1:β1 > 0 and obtained a p value of 0.0186. Tests of whether α1≠0, α1 + α1,e≠0, or α1 + α1,a≠0 examine whether the mean biomarker values vary with latent risk class in the other, estrogenic, and antiestrogenic PCBs, respectively; these 3 tests were all highly significant (less than 0.0001). These tests indicate that the PCB concentration pattern is associated with the risk of endometriosis, and that the overall means of PCBs are significantly different between classes. The parameter estimates η^0=0.8892 and η^1=42.8010 provide information on the probabilities of observing zero PCB concentration levels. The positive η1 estimate indicates that, for a given PCB and participant, the probability of observing a positive PCB level increases with the mean. We found a statistically significant relationship between the pattern of PCB concentrations and the risk of endometriosis using the proposed model. We now estimate features of the pattern in each of the risk groups.

Table 1.

Analysis of the endometriosis and PCB concentrations data: parameter estimates (SEs) and values of model selection criterion ICQ from joint latent class models with different number of latent classes

To investigate the features of the individual PCB congener concentration, we computed empirical Bayes estimators of random effects b~j0 and b~j1, j = 1,2,…,62. The nonzero concentration levels for the jth PCB congener in the kth latent class can be estimated by replacing random effects with their empirical Bayes estimators and replacing unknown parameters with their estimates in (4.1). The estimated rate of observing zero concentration levels for the jth PCB can be similarly derived. Figures 4 shows, in the selected 3-class model, the nonzero concentration levels of each PCB congener. There is a positive relationship between PCB concentration mean and the risk class for all PCBs (i.e. the pattern in mean concentration for all PCBs is highest for high-risk participants and lowest for low-risk participants). This suggests that there is a combined contribution of all 62 PCB concentrations in assessing the risk of endometriosis. The differences in the PCB patterns from the low risk to the high risk are largest for estrogenic PCBs among the 3 PCB groups. Moreover, we see a reverse pattern for the probability of obtaining a zero PCB. The rate of zero varied across PCB concentrations and across classes (range 0–0.7).

Fig. 4.

Estimated means of nonzero PCB concentrations by the three-class latent class model in the analysis of endometriosis and PCB concentrations data. The numbers on the lines are the latent class: 0 for low-risk class, 1 for middle-risk class, and 2 for high-risk ...

The inferences for β1, α1, α1,e, and α1,a were similar across the different models (2- through 5-class models), suggesting that even though the data are best described by the 3-class model, important inferences are robust to model misspecification. Further, empirical Bayes estimators of concentration patterns in each of the risk groups (similar to Figure 4, but for 2-, 4-, 5-class models) showed similar results to those presented in Figure 4. Namely, concentration levels increased with an increasing risk group for all PCBs, and the increasing concentration levels were largest for estrogenic markers as compared to other markers (data not shown).

The joint latent class model provided strong evidence for an association between high-dimensional PCB exposure and the risk and endometriosis. The primary goal of the proposed latent class model is to provide a way to associate high-dimensional exposure data with the risk of endometriosis. The fact that the nature of the association (β1 in Table 1 and parameters that characterize PCB biomarker exposures) is similar across the models with different numbers of latent class provides reassurance about the robustness of important inference to the number of latent classes. That said, the evidence for 3 latent classes of risk does suggest that there may be 3 types of exposure patterns that are associated with the risk of endometriosis. The fact that overall patterns are higher in the high-risk group and are lower in the other groups suggests that what distinguishes the different groups are small differences in a majority of PCBs.


To evaluate the performance of the MCEM algorithm and the model selection criterion ICQ for the joint latent class modeling approach, we conducted a simulation study by generating data similar to that in our application. We generated data according to a 2-, 3-, 4-, or 5-class model. The true parameter values were chosen to be similar to the estimated values of the corresponding models in our application example. Also, to create a scenario similar to the application, we used 79 participants and 62 biomarkers in each simulation. A total of 500 data sets were generated from each of the 4 latent class models. We assumed the number of latent classes was known and then fitted the model to the generated data. The results are reported in Table 2. For each model, the table presents the average and SD of the point estimates. We observed that the estimates were close to the true parameter values with reasonably small SDs. The parameters of most interest (e.g. β1, α1, α1,e, and α1,a) were nearly unbiased.

Table 2.

Simulation results for evaluating parameter estimation with MCEM in the joint latent class models, based on 500 simulation data sets

To examine the performance of model selection by ICQ, we fit models having 2, 3, 4, and 5 latent classes to each of the generated data sets and used BIC-type ICQ to select the best model. Supplementary Table 1 in the supplementary material available at Biostatistics online reports the frequencies of the selected models under each scenario. The correct selection rates are 92%, 81%, 79%, 77%, and 78%, respectively, for the no-association, 2-, 3-, 4-, and 5-class models. In addition, even when ICQ selected the wrong models, the majority of the misselected models are the adjacent ones (one fewer or more class). Furthermore, these simulations showed that even under a misspecified model (e.g. true model was a 3-class model and 2-class model was selected), estimates of β1, α1, α1,e, and α1,a had little bias (data not shown).


This paper proposed a joint latent class modeling approach for predicting a binary event from high-dimensional biomarker data. In this approach, the relationship between complex biomarker expression patterns and disease risk is established by incorporating ordinal latent risk classes, which link the 2 modeling components. Further, complex marker-specific differences are characterized through marker-specific random effects. The approach is flexible in that it easily incorporates a varying number of ordinal latent classes which will depend on the application of interest.

The joint latent class model presented in this paper supplements other methods for analyzing disease-biomarker data, especially when high-dimensionality and collective effects occur. Alternative analysis strategies have been proposed for analyzing the endometriosis and PCB concentration data set. Buck Louis and others (2005), Gennings and others (2010), and Herring (2010) analyzed the data with logistic–regression-based models, focusing on assessing the effect of individual PCB exposures on endometriosis risk. In general, these methods found only a small number of associations between the large number of individual PCB exposures and the risk of endometriosis. Our approach is more flexible in that we can identify complex patterns between the high-dimensional exposures and disease prevalence using latent class structure. The works of Buck Louis and others (2005), Gennings and others (2010), and Herring (2010) may have advantages when only a few biomarkers have very strong association with risk. However, when there is a large proportion of exposure measurements and each contributes a small amount to the risk, their methods may fail. Our approach is well suited for this latter case, while also allowing for the case where a few individual PCBs may contribute to the risk.

The application of the approach to the epidemiology study on PCB concentrations and the risk of endometriosis revealed some interesting findings from a substantive view point. First, we identified a highly significant overall relationship between PCB concentrations and the risk of endometriosis, a finding which was not found with other techniques. Second, the model allowed us to reveal features of the relationship, which were not available with other methods. Particularly, we found an overall positive association of all biomarkers with the risk of endometriosis, suggesting that all PCBs, to different degrees, are positively associated with the risk of endometriosis. Further, we identified that estrogenic PCBs may be particularly important in determining the risk of endometriosis.

The methodology was developed for a binary measure of risk assessment, which was consistent with the application presented here. However, the model could be adapted to other generalized linear model outcomes (e.g. counts, continuous, exponential). Our focus was primarily on assessing the role of high-dimensional biomarkers on disease risk and not on evaluating disease prediction per se. Future research will focus on comparing features of model performance such as the receiver operating characteristic curve and other measures of diagnostic accuracy between our proposed modeling framework and more conventional methods such as logistic regression with model selection.


Intramural Research Program of the National Institutes of Health; Eunice Kennedy Shriver National Institute of Child Health and Human Development; National Institute of Environmental Health Sciences (Grant 1R01ES09044-01).

Supplementary Material

Supplementary Material:


We thank associate editor and reviewers for their thoughtful and constructive comments, which have led to an improved article. We thank Drs Germaine Louis and Brian Whitcomb for their helpful comments and suggestions. We also thank the Center for Information Technology, the National Institutes of Health, for providing access to the high performance computational capabilities of the Biowulf Linux cluster. Conflict of Interest: None declared.


  • Bandeen-Roche K, Miglioretti DL, Zeger SL, Rathouz PJ. Latent variable regression for multiple discrete outcomes. Journal of the American Statistical Association. 1997;92:1375–1386.
  • Bartholomew DJ. Latent Variable Models and Factor Analysis. New York: Oxford University Press; 1987.
  • Booth JG, Hobert JP. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society. Series B, Statistical Methodology. 1999;61:265–285.
  • Buck Louis GM, Weiner JM, Whitcomb BW, Sperrazza R, Schisterman EF, Lobdell DT, Crickard K, Greizerstein H, Kostyniak PJ. Environmental PCB exposure and risk of endometriosis. Human Reproduction. 2005;20:279–285. [PubMed]
  • Cleveland WS. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association. 1979;74:829–836.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM Algorithm (with Discussion) Journal of the Royal Statistical Society. Series B, Statistical Methodology. 1977;39:1–38.
  • Efron B, Tibshirani R. An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC; 1993.
  • Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
  • Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. New York: Chapman and Hall/CRC; 1995.
  • Gennings C, Sabo R, Carneyb E. Identifying subsets of complex mixtures most associated with complex diseases. Epidemiology. 2010;21:S77–S84. [PubMed]
  • Herring A. Nonparametric bayes shrinkage for assessing exposures to mixtures subject to limits of detection. Epidemiology. 2010;21:S71–S76. [PMC free article] [PubMed]
  • Ibrahim JG, Zhu H, Tang N. Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association. 2008;103:1648–1658. [PMC free article] [PubMed]
  • Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed]
  • Lin H, Turnbull BW, McCulloch CE, Slate EH. Latent class models for joint analysis of longitudinal biomarker and event process data: application to longitudinal prostate-specific antigen readings and prostate cancer. Journal of the American Statistical Association. 2002;97:53–65.
  • McCulloch CE. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association. 1997;92:162–170.
  • McLachlan GJ, Peel D. Finite Mixture Model. New York: John Wiley & Sons, Inc; 2002.
  • Olsen MK, Schafer JL. A two-part random-effects model for semicontinuous longitudinal data. Journal of the American Statistical Association. 2001;96:730–745.
  • Wei G, Tanner MA. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association. 1990;85:699–704.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press