PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Conf Proc IEEE Eng Med Biol Soc. Author manuscript; available in PMC 2010 July 3.
Published in final edited form as:
PMCID: PMC2896560
NIHMSID: NIHMS200072

A Bayesian Based Functional Mixed-Effects Model for Analysis of LC-MS Data

Abstract

A Bayesian multilevel functional mixed-effects model with group specific random-effects is presented for analysis of liquid chromatography-mass spectrometry (LC-MS) data. The proposed framework allows alignment of LC-MS spectra with respect to both retention time (RT) and mass-to-charge ratio (m/z). Affine transformations are incorporated within the model to account for any variability along the RT and m/z dimensions. Simultaneous posterior inference of all unknown parameters is accomplished via Markov chain Monte Carlo method using the Gibbs sampling algorithm. The proposed approach is computationally tractable and allows incorporating prior knowledge in the inference process. We demonstrate the applicability of our approach for alignment of LC-MS spectra based on total ion count profiles derived from two LC-MS datasets.

I. Introduction

In proteomic studies, liquid chromatography coupled with mass spectrometry (LC-MS) is a common platform to identify and determine the abundance of various peptides that characterize particular proteins in biological samples [1]. Each LC-MS run generates data comprised of thousands of peak intensities for peptides with specific retention time (RT) and mass-to-charge ratio (m/z) values. In differential protein expression studies, multiple LC-MS runs are compared to identify differentially abundant peptides between distinct biological groups. This is a challenging task because of the following reasons: (1) substantial variation in RT across multiple runs due to the LC instrument conditions and the variable complexity of peptide mixtures, (2) variation in m/z values of the peptides due to occasional drift in the calibration of the mass spectrometry instrument, and (3) variation in peak intensities due to spray conditions. Thus, efficient and robust alignment algorithms are needed for qualitative comparison of multiple LC-MS runs.

Various alignment methods have been described in literature including dynamic time warping (DTW) [2], correlation optimized warping (COW) [2], vectorized peaks [3], statistical alignment [4], and clustering [5]. Most of these algorithms are either limited to a consensus pair-wise combination of spectra for alignment or may use reference (template) spectra to find matching among datasets. These limitations may lead to sub-optimal results compared to global alignment techniques. Methods that rely on optimization of global fitting functions provide an alternative solution to alignment of multiple LC-MS spectra representing distinct biological groups. For example, a recently introduced method called continuous profile model (CPM) has been applied for alignment of continuous time-series data and for detection of differences in multiple LC-MS data [6]. Although CPM is described as a naïve and computationally intensive method, the method has some limitations, such as the susceptibility to fall into local minimum solutions due to the sub-optimal problem formulation. Also, the method creates superfluous signal gaps, leading to nonuniform trace points across multiple LC-MS spectra. Another notable limitation of CPM algorithm is its poor performance with time complexity scales, requiring substantial computation time in modeling high resolution data. Thus, CPM is more suitable for low resolution of LC-MS data generated from less complex fractionations. Recently, Morris et al. developed a Bayesian-based method for analysis of matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) proteomics data [7]. Their motivation extends from earlier work on Bayesian implementation of the wavelet-based functional mixed effects models introduced by Morris and Carroll [8]. The approach is similar to the spline-based functional mixed effects models introduced by Guo [9], which involves a generalized mixed models equation to handle potentially irregular data. The method specifically deals with the identification of differentially expressed spectral regions across different experimental conditions assuming the alignment issue has already been taken care of.

In this paper, we introduce a Bayesian multilevel functional mixed effects model with group-specific random effects. The method provides the capability to account for population homogeneous behavior (i.e., fixed systematic changes across the entire LC-MS spectra representing distinct biological groups) while allowing for modeling heterogeneity within a group (i.e., random effects). Also, this paradigm allows us to incorporate additional hierarchies such as affine transformation within the model to account for any variability along the RT and m/z dimensions, while handling implicitly the normalization of peak intensities of peptides from multiple LC-MS spectra. The method is amenable to model both low and high resolution mass spectra, since it does not introduce superfluous signal gaps across multiple LC-MS spectra. We demonstrate this through two LC-MS datasets obtained from: (1) proteins of lysed E. coli cells, and (2) six groups of tryptic digests non-human proteins with different concentrations spiked into a complex sample background of human peptides.

The remainder of this paper is organized as follows. In Section II, we outline the Bayesian hierarchical model (BHM) that describes the data modeling mechanism, based on the functional mixed-effects model, for alignment of LC-MS spectra. This section explains the Markov chain Monte Carlo (MCMC) method using the Gibbs sampling algorithm for simultaneous posterior inference of all unknown parameters. Results and discussions demonstrating the applicability of the proposed method for alignment of LC-MS spectra are given in Section III. Finally, our findings are summarized in Section IV.

II. Methods

A. Bayesian Hierarchical Model (BHM)

We propose a functional mixed-effects model to align LC-MS spectra from multiple LC-MS runs. The idea behind this approach is two-fold: (1) to model the fixed effects as a realization of partially diffused integrated Gaussian processes which account for population homogeneous behaviors (i.e., fixed systematic changes in the LC-spectra across biological groups), and (2) to model the random effects as random realizations from the same partially integrated Gaussian processes with proper variances which, in turn, allow the modeling of heterogeneity within biological groups. The estimation procedure is implemented by taking advantage of the connection between B-splines (at the design points) and mixed effects models. Let the proposed functional mixed-effects model be represented mathematically as follows:

yi{zij=j}ni×1B1ini×pγjp×1+B2ini×qηijq×1+εijni×1ηijNq(μj,j)andεijNni(0,σ2I)
(1)

where i = 1, 2,… sj denote sample size in each group j; j = 1, 2, …, m are indices for the mixture components that identify group membership with m [double less-than sign] M, where M is the total number of observation spectra ( M=j=1msj); ni denotes the data length for each spectrum yi; B1i and B2i are B-spline basis matrices associated with fixed and random effects for the i-th sample, respectively; γj accounts for fixed systematic changes in group j; while ηij accounts for the random variation; εij’s correspond to measurement errors, where ηij and εij are assumed to be independent. Alignment is performed using the information from the matrices B1i and B2i as well as from fixed systematic changes γj and random variation ηij.

Let Θ be a vector consisting of all the unknown parameters in Eq. (1) and the priors. Let Y=(y1T,y2T,,yMT)T represent a set of LC-MS spectra. Then, according to the Bayes’ theorem

p(ΘY)=p(YΘ)p(Θ)p(Y)L(ΘY)×p(Θ)
(2)

Using the functional mixed-effects modeling of Eq. (1), the likelihood function assuming that the group information is known and that the samples are independent is given by

L(ΘY,Z)=j=1mi=1sjNn(yi,Zi;B1iγj+B2iηij,σ2I)
(3)

where Z denotes a matrix of indicator vectors Zi = (zi1,zi2, …, zim), such that zi1 = 1 for some j, and zit = 0, ∀tj. Hence, the joint posterior has the general form

p(ΘY)j=1mi=1sjNn(yi,Zi;B1iγj+B2iηij,σ2I)×p(Θ)
(4)

B. Prior Distributions and Implementation

The first step in fitting BHM is to specify all prior distributions. A list of the hierarchical priors assigned to the parameters of the model is given below. The list represents the standard choice of priors for mixture models:

γjNp(w,W)μjNq(0,V),j1RWq(ρ,(ρR)1)RWq(r,(rR0)1),σ2Γ(g,h)
(5)

where W(,), N(,) and Γ(,) signify the Wishart, multivariate normal and gamma distributions, respectively. In specifying the prior distribution p(Θ), a hierarchical structure with independence assumption is considered. Combining this structural information with prior beliefs, we obtain the following joint posterior for the unknown parameters:

p(ΘY)j=1m{i=1sjNn(yi,Zi;B1iγj+B2iηij,σ2I)×Np(γj;w,W)×Nq(ηij;μzij,zij)×Nq(μj;0,V)×Wq(j1;ρ,(ρR)1)}×Wq(R;r,(rR0)1)×Γ(σ2;g,h)
(6)

Using all prior and hyperprior distributions in Eq. (5), the full conditional distributions for the parameters are as follows:

p(ηijrest)Nq([σ2B2iTB2i+j1]1(σ2B2iT(yiB2iγj)+j1μj),[σ2B2iTB2i+j1]1)p(γjrest)Np([σ2i:zij=1B1iTB1i+W1]1(σ2i:zij=1B1iT(y¯iB2iη¯j)+W1w),[σ2i:zij=1B1iTB1i+W1]1)p(μjrest)Nq(sjj1+V1)1sjj1η¯j,[sjj1+V1]1)whereη¯j=i:zij=1ηijp(j1rest)Wq(sj+ρ,[rR+i:zij=1(ηijμj)(ηijμj)T]1)p(Rrest)Wq(r+mρ,[rR0+ρj=1mj1]1)p(σ2rest)Γ(j=1mi=1sini2+g,[1h+12j=1mi:zij=1(yiB1iγjB2iηij)(yiB1iγjB2iηij)T]1)

C. Gibbs Sampling Algorithm

Consider the Bayesian model of Eq. (4). Let the number of groups m be fixed and Θ denote all of the unknown parameters in the model, i.e.,

Θ=({{ηij}i=1sj,γj,μj,j}j=1m,R,σ2)

Then, using Θ(0) as starting value, the Gibbs sampling algorithm [10, 11] proceeds as follows for t = 1,2, … N iterations:

Drawηij(t+1)fromp(ηijY,γj(t),,σ2(t))fori=1,2,,sjandj=1,2,,mDrawγj(t+1)fromp(γjY,ηij(t+1),,σ2(t))forj=1,2,,mDrawμj(t+1)fromp(ηijY,ηj(t+1),γj(t+1),,σ2(t))forj=1,2,,mDrawσ2(t+1)fromp(σ2Y,ηij(t+1),γj(t+1),μj(t+1),j(t+1),R(t+1))

Note that the computations for the conditional probabilities are highly simplified due to the conjugacy of the prior distributions and their conditional independence.

D. Modeling of Variability along the RT or m/z Dimensions among Different LC-MS Datasets

The BHM presented in Eq. (1) can be easily extended to incorporate detail modeling. It is important to introduce priors that appropriately apportion the variability among the replicates and separating out the differing locations or scales along the RT or m/z dimensions. This provides a distinct interpretation of the LC-MS data. The alignment model and the associated parameters should allow each replicate sample to have its own affine warping transformation in RT or m/z dimensions. Let each spectrum yi (xij) be replaced by yi(d)(aixijbi), where ai and bi are the scaling and shifting parameters for the i-th replicate LC-MS spectrum along the dimension d=1,2,…, D corresponding to the m/z dimensions of each sample spectra. To treat alignment in the hierarchical structure, we include the priors with suitable hyperparametrs in Eq. (6). With the assumption of independence, the joint prior model for the time scaling and translation p(ai, bi) = p(ai) × p(bi) should encode the idea that the most likely translation is the affine warping translation and should also discount large and unlikely translations. A normal prior distribution is a good fit for this, i.e., biN(μb,σb2) where μb and σb2 are the mean and variance of the hyperparameters. Moreover, we assume a normal distribution aiN(μa,σa2)I{ai>1} for the time-scaling prior ai, since its most likely values for the mean and variance could be captured from the data. Therefore, the parameters μa, μb, σa2 and σb2 are estimated from the data within the ensuing MCMC algorithm. The corresponding directed acyclic graph (DAG) in Fig. 1 shows the dependences of all hierarchical parameters in the model.

Fig. 1
DAG of the Bayesian hierarchical model

Combining the affine warping transformation with our prior beliefs for ai and bi, the posterior distribution for the unknown parameters is modified as follows:

p(Θ/Y)d=1Dj=1m{i=1sjNn(yi(d),Zi;B1i(d)γj(d)+B2i(d)ηij(d),σ2(d)I)×Np(γj(d);wd,W(d))×Nq(ηij(d);μzij(d),zij(d))×N(μb;σb2)×N(μa;σa2)I{ai>1}×Nq(μj(d);0,V(d))×Wq(j1(d);ρ(d),(ρ(d)R(d))1)}×Wq(R(d);r,(rR0)1)×Γ(σ2(d);g,h)
(7)

where Θ denotes all unknown parameters in this new model.

Θ=({{{ηij(d)}i=1sj,γj(d),μj(d),j(d),R(d),σ2(d)}d=1D,μa,μb,σa2,σb2}j=1m)
(8)

The B-spline basis matrices associated with yi(d)(aixijbi) need to be updated at each iteration based on the estimates of RT transformation parameters ai and bi. Moreover, these parameters need to be shared over the D dimensions for each group data since the dynamic behavior for each dimension occurs over the same time scale. The Gibbs sampling algorithm for the modified BHM of Eq. (7) proceeds in the same manner as that of Section C. The algorithm continues to draw the other parameters in the order outlined below:

ηij(d)(t+1),γj(d)(t+1),μj(d)(t+1),j(d)(t+1),R(d)(t+1),σ2(d)(t+1),μa(t+1),μb(t+1),σa2(t+1),σb2(t+1),fori=1,2,,sj,j=1,2,,mandd=1,2,,D

III. Results and Discussions

We used BHM to align 11 replicate LC-MS spectra obtained from http://www.cs.toronto.edu/~jenn/LCMS. The spectra are generated from proteins of lysed E. coli cells by capillary-scale LC coupled on-line to an ion trap mass spectrometer (see Listgarten et al. [6] for details). Each spectrum was represented by two dimensions after calculating the total ion count (TIC) profiles for each RT point across the m/z values from the original 400×2400 data matrix corresponding to 400 RT points (~55 min.) and 2400 m/z bins spanning between 400 and 1600 Dalton (Da). Fig. 2 depicts these 11 two-dimensional replicate spectra. From this figure, we can see that the spectra show significant shifts along RT as well as distortions in the ion abundance measurement space. We applied our BHM method for alignment of LC-MS spectra with respect to RT. Fig. 3 depicts the aligned spectra. BHM reduced the coefficient of variation (CV) of the original TIC profiles from 82% to 66%. The CV of the spectra aligned by DTW, COW and CPM were 70%, 80% and 57%, respectively.

Fig. 2
Plots of TIC profiles before alignment
Fig. 3
TIC profiles after alignment by BHM.

The second dataset was obtained from http://prottools.ethz.ch/muellelu/web/Latin_Square_Data.php It consists of 18 LC-MS spectra generated from tryptic digests of six standard non-human proteins (myoglobin, carbonic anhydrase, cytochrome c, lysozyme, alcohol dehydrogenase, and aldolase A) spiked with different concentrations into a complex sample background of human peptides and isolated by solid-phase Nglycocapture from serum. The LC-MS spectra generation for these samples was performed using the Fourier transformed-linear trap quadrupole (FT-LTQ) mass spectrometer (see Mueller et al. [5] for details). The 18 spectra represent six groups based on the concentration of the proteins. We processed the raw spectra and obtained for each spectrum a 2000×1300 data matrix corresponding to 2000 RT points (~55 min.) and 1300 m/z bins between 300 and 1600 Da. We calculated the TIC for each RT point across the m/z values and obtained 18 two-dimensional TIC profiles (for each of the six groups). Figs. 4 and and55 depict TIC plots of the original and aligned LC-MS spectra, respectively. Fig. 6 shows the corresponding heat maps for the original and aligned LC-MS spectra. BHM reduced the average CV of the original TIC profile across the six groups from 18% to 13%. Both DTW and COW yielded a CV of 17%, while CPM resulted in a CV of 13%.

Fig. 4
TIC profiles before alignment.
Fig. 5
TIC profiles after alignment by BHM.
Fig. 6
Heat maps of the TIC profiles for Mueller et al. dataset.

IV. Conclusion

This paper utilizes a Bayesian hierarchical model for alignment of LC-MS spectra. Specifically, it presents a fully Bayesian mixed-effects model that effectively accounts for population homogeneous behavior across biological groups (i.e., fixed systematic changes) and for heterogeneity within groups (random effects). Bayesian inference of unknown parameters is carried out via MCMC method using the Gibbs sampling technique with conjugate priors. The proposed approach not only allows alignment with respect to RT and m/z dimensions, it also implicitly normalizes the peak intensities of peptides. The performance of the approach is assessed through two LC-MS datasets: replicate spectra generated from proteins of lysed E. coli cells and spectra representing six groups, where six proteins are spiked at different concentrations into a complex sample background of human peptides. Through these datasets, it is demonstrated that BHM achieves good performance in reducing coefficient of variation of replicate TIC profiles, while preserving the original experimental retention time (i.e., without introducing superfluous signal gaps across multiple LC-MS spectra). A limitation of BHM is that it requires considerable amount of computation time in aligning LC-MS data with respect to both RT and m/z dimensions. Future work will focus on addressing this limitation through optimization of the algorithm.

Acknowledgments

This work was supported in part by the National Science Foundation Grant IIS-0812246, the National Cancer Institute (NCI) R21CA130837 Grant, NCI R03CA119313 Grant, NCI Early Detection Research Network Associate Membership Grant, and the Prevent Cancer Foundation Grant awarded to HWR.

Contributor Information

Getachew K Befekadu, The Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA.

Mahlet G Tadesse, The Department of Mathematics, Georgetown University, 308 St. Mary’s Hall, Washington, DC 20057, USA.

Habtom W Ressom, The Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA.

References

1. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422(6928):198–207. [PubMed]
2. Tomasi G, van den Berg F, Andersson C. Correlation Optimized Warping and Dynamic Time Warping as Preprocessing Methods for Chromatographic Data. Journal of Chemometrics and Intelligent Laboratory Systems. 2004;18:231–241.
3. Hastings CA, Norton SM, Roy S. New algorithms for processing and peak detection in liquid chromatography/mass spectrometry data. Rapid Commun Mass Spectrom. 2002;16(5):462–7. [PubMed]
4. Wang P, Tang H, Fitzgibbon MP, et al. A statistical method for chromatographic alignment of LC-MS data. Biostatistics. 2007;8(2):357–67. [PubMed]
5. Mueller LN, Rinner O, Schmidt A, et al. SuperHirn - a novel tool for high resolution LC-MS-based peptide/protein profiling. Proteomics. 2007;7(19):3470–80. [PubMed]
6. Listgarten J. Department of Computer Science, vol. Ph.D. Toronto: University of Toronto; 2006. Analysis of sibling time series data: alignment and difference detection.
7. Morris JS, Brown PJ, Herrick RC, et al. Bayesian analysis of mass spectrometry proteomic data using wavelet-based functional mixed models. Biometrics. 2008;64(2):479–89. [PMC free article] [PubMed]
8. Morris JS, Carroll RJ. Wavelet-based functional mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2006;68:179–199. [PMC free article] [PubMed]
9. Guo W. Functional data analysis in longitudinal settings using smoothing splines. Statistical Methods in Medical Research. 2004;13(1):49–62. [PubMed]
10. Casella G, George EI. Explaining the Gibbs Sampler. The American Statistician. 1992;46(3):167–174.
11. Geman S, Geman D. Readings in uncertain reasoning. Morgan Kaufmann Publishers Inc; 1990. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images; pp. 452–472. [PubMed]