|Home | About | Journals | Submit | Contact Us | Français|
A Bayesian multilevel functional mixed-effects model with group specific random-effects is presented for analysis of liquid chromatography-mass spectrometry (LC-MS) data. The proposed framework allows alignment of LC-MS spectra with respect to both retention time (RT) and mass-to-charge ratio (m/z). Affine transformations are incorporated within the model to account for any variability along the RT and m/z dimensions. Simultaneous posterior inference of all unknown parameters is accomplished via Markov chain Monte Carlo method using the Gibbs sampling algorithm. The proposed approach is computationally tractable and allows incorporating prior knowledge in the inference process. We demonstrate the applicability of our approach for alignment of LC-MS spectra based on total ion count profiles derived from two LC-MS datasets.
In proteomic studies, liquid chromatography coupled with mass spectrometry (LC-MS) is a common platform to identify and determine the abundance of various peptides that characterize particular proteins in biological samples . Each LC-MS run generates data comprised of thousands of peak intensities for peptides with specific retention time (RT) and mass-to-charge ratio (m/z) values. In differential protein expression studies, multiple LC-MS runs are compared to identify differentially abundant peptides between distinct biological groups. This is a challenging task because of the following reasons: (1) substantial variation in RT across multiple runs due to the LC instrument conditions and the variable complexity of peptide mixtures, (2) variation in m/z values of the peptides due to occasional drift in the calibration of the mass spectrometry instrument, and (3) variation in peak intensities due to spray conditions. Thus, efficient and robust alignment algorithms are needed for qualitative comparison of multiple LC-MS runs.
Various alignment methods have been described in literature including dynamic time warping (DTW) , correlation optimized warping (COW) , vectorized peaks , statistical alignment , and clustering . Most of these algorithms are either limited to a consensus pair-wise combination of spectra for alignment or may use reference (template) spectra to find matching among datasets. These limitations may lead to sub-optimal results compared to global alignment techniques. Methods that rely on optimization of global fitting functions provide an alternative solution to alignment of multiple LC-MS spectra representing distinct biological groups. For example, a recently introduced method called continuous profile model (CPM) has been applied for alignment of continuous time-series data and for detection of differences in multiple LC-MS data . Although CPM is described as a naïve and computationally intensive method, the method has some limitations, such as the susceptibility to fall into local minimum solutions due to the sub-optimal problem formulation. Also, the method creates superfluous signal gaps, leading to nonuniform trace points across multiple LC-MS spectra. Another notable limitation of CPM algorithm is its poor performance with time complexity scales, requiring substantial computation time in modeling high resolution data. Thus, CPM is more suitable for low resolution of LC-MS data generated from less complex fractionations. Recently, Morris et al. developed a Bayesian-based method for analysis of matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) proteomics data . Their motivation extends from earlier work on Bayesian implementation of the wavelet-based functional mixed effects models introduced by Morris and Carroll . The approach is similar to the spline-based functional mixed effects models introduced by Guo , which involves a generalized mixed models equation to handle potentially irregular data. The method specifically deals with the identification of differentially expressed spectral regions across different experimental conditions assuming the alignment issue has already been taken care of.
In this paper, we introduce a Bayesian multilevel functional mixed effects model with group-specific random effects. The method provides the capability to account for population homogeneous behavior (i.e., fixed systematic changes across the entire LC-MS spectra representing distinct biological groups) while allowing for modeling heterogeneity within a group (i.e., random effects). Also, this paradigm allows us to incorporate additional hierarchies such as affine transformation within the model to account for any variability along the RT and m/z dimensions, while handling implicitly the normalization of peak intensities of peptides from multiple LC-MS spectra. The method is amenable to model both low and high resolution mass spectra, since it does not introduce superfluous signal gaps across multiple LC-MS spectra. We demonstrate this through two LC-MS datasets obtained from: (1) proteins of lysed E. coli cells, and (2) six groups of tryptic digests non-human proteins with different concentrations spiked into a complex sample background of human peptides.
The remainder of this paper is organized as follows. In Section II, we outline the Bayesian hierarchical model (BHM) that describes the data modeling mechanism, based on the functional mixed-effects model, for alignment of LC-MS spectra. This section explains the Markov chain Monte Carlo (MCMC) method using the Gibbs sampling algorithm for simultaneous posterior inference of all unknown parameters. Results and discussions demonstrating the applicability of the proposed method for alignment of LC-MS spectra are given in Section III. Finally, our findings are summarized in Section IV.
We propose a functional mixed-effects model to align LC-MS spectra from multiple LC-MS runs. The idea behind this approach is two-fold: (1) to model the fixed effects as a realization of partially diffused integrated Gaussian processes which account for population homogeneous behaviors (i.e., fixed systematic changes in the LC-spectra across biological groups), and (2) to model the random effects as random realizations from the same partially integrated Gaussian processes with proper variances which, in turn, allow the modeling of heterogeneity within biological groups. The estimation procedure is implemented by taking advantage of the connection between B-splines (at the design points) and mixed effects models. Let the proposed functional mixed-effects model be represented mathematically as follows:
where i = 1, 2,… sj denote sample size in each group j; j = 1, 2, …, m are indices for the mixture components that identify group membership with m M, where M is the total number of observation spectra ( ); ni denotes the data length for each spectrum yi; B1i and B2i are B-spline basis matrices associated with fixed and random effects for the i-th sample, respectively; γj accounts for fixed systematic changes in group j; while ηij accounts for the random variation; εij’s correspond to measurement errors, where ηij and εij are assumed to be independent. Alignment is performed using the information from the matrices B1i and B2i as well as from fixed systematic changes γj and random variation ηij.
Let Θ be a vector consisting of all the unknown parameters in Eq. (1) and the priors. Let represent a set of LC-MS spectra. Then, according to the Bayes’ theorem
Using the functional mixed-effects modeling of Eq. (1), the likelihood function assuming that the group information is known and that the samples are independent is given by
where Z denotes a matrix of indicator vectors Zi = (zi1,zi2, …, zim), such that zi1 = 1 for some j, and zit = 0, ∀t ≠ j. Hence, the joint posterior has the general form
The first step in fitting BHM is to specify all prior distributions. A list of the hierarchical priors assigned to the parameters of the model is given below. The list represents the standard choice of priors for mixture models:
where W(,), N(,) and Γ(,) signify the Wishart, multivariate normal and gamma distributions, respectively. In specifying the prior distribution p(Θ), a hierarchical structure with independence assumption is considered. Combining this structural information with prior beliefs, we obtain the following joint posterior for the unknown parameters:
Using all prior and hyperprior distributions in Eq. (5), the full conditional distributions for the parameters are as follows:
Consider the Bayesian model of Eq. (4). Let the number of groups m be fixed and Θ denote all of the unknown parameters in the model, i.e.,
Note that the computations for the conditional probabilities are highly simplified due to the conjugacy of the prior distributions and their conditional independence.
The BHM presented in Eq. (1) can be easily extended to incorporate detail modeling. It is important to introduce priors that appropriately apportion the variability among the replicates and separating out the differing locations or scales along the RT or m/z dimensions. This provides a distinct interpretation of the LC-MS data. The alignment model and the associated parameters should allow each replicate sample to have its own affine warping transformation in RT or m/z dimensions. Let each spectrum yi (xij) be replaced by , where ai and bi are the scaling and shifting parameters for the i-th replicate LC-MS spectrum along the dimension d=1,2,…, D corresponding to the m/z dimensions of each sample spectra. To treat alignment in the hierarchical structure, we include the priors with suitable hyperparametrs in Eq. (6). With the assumption of independence, the joint prior model for the time scaling and translation p(ai, bi) = p(ai) × p(bi) should encode the idea that the most likely translation is the affine warping translation and should also discount large and unlikely translations. A normal prior distribution is a good fit for this, i.e., where μb and are the mean and variance of the hyperparameters. Moreover, we assume a normal distribution for the time-scaling prior ai, since its most likely values for the mean and variance could be captured from the data. Therefore, the parameters μa, μb, and are estimated from the data within the ensuing MCMC algorithm. The corresponding directed acyclic graph (DAG) in Fig. 1 shows the dependences of all hierarchical parameters in the model.
Combining the affine warping transformation with our prior beliefs for ai and bi, the posterior distribution for the unknown parameters is modified as follows:
where Θ denotes all unknown parameters in this new model.
The B-spline basis matrices associated with need to be updated at each iteration based on the estimates of RT transformation parameters ai and bi. Moreover, these parameters need to be shared over the D dimensions for each group data since the dynamic behavior for each dimension occurs over the same time scale. The Gibbs sampling algorithm for the modified BHM of Eq. (7) proceeds in the same manner as that of Section C. The algorithm continues to draw the other parameters in the order outlined below:
We used BHM to align 11 replicate LC-MS spectra obtained from http://www.cs.toronto.edu/~jenn/LCMS. The spectra are generated from proteins of lysed E. coli cells by capillary-scale LC coupled on-line to an ion trap mass spectrometer (see Listgarten et al.  for details). Each spectrum was represented by two dimensions after calculating the total ion count (TIC) profiles for each RT point across the m/z values from the original 400×2400 data matrix corresponding to 400 RT points (~55 min.) and 2400 m/z bins spanning between 400 and 1600 Dalton (Da). Fig. 2 depicts these 11 two-dimensional replicate spectra. From this figure, we can see that the spectra show significant shifts along RT as well as distortions in the ion abundance measurement space. We applied our BHM method for alignment of LC-MS spectra with respect to RT. Fig. 3 depicts the aligned spectra. BHM reduced the coefficient of variation (CV) of the original TIC profiles from 82% to 66%. The CV of the spectra aligned by DTW, COW and CPM were 70%, 80% and 57%, respectively.
The second dataset was obtained from http://prottools.ethz.ch/muellelu/web/Latin_Square_Data.php It consists of 18 LC-MS spectra generated from tryptic digests of six standard non-human proteins (myoglobin, carbonic anhydrase, cytochrome c, lysozyme, alcohol dehydrogenase, and aldolase A) spiked with different concentrations into a complex sample background of human peptides and isolated by solid-phase Nglycocapture from serum. The LC-MS spectra generation for these samples was performed using the Fourier transformed-linear trap quadrupole (FT-LTQ) mass spectrometer (see Mueller et al.  for details). The 18 spectra represent six groups based on the concentration of the proteins. We processed the raw spectra and obtained for each spectrum a 2000×1300 data matrix corresponding to 2000 RT points (~55 min.) and 1300 m/z bins between 300 and 1600 Da. We calculated the TIC for each RT point across the m/z values and obtained 18 two-dimensional TIC profiles (for each of the six groups). Figs. 4 and and55 depict TIC plots of the original and aligned LC-MS spectra, respectively. Fig. 6 shows the corresponding heat maps for the original and aligned LC-MS spectra. BHM reduced the average CV of the original TIC profile across the six groups from 18% to 13%. Both DTW and COW yielded a CV of 17%, while CPM resulted in a CV of 13%.
This paper utilizes a Bayesian hierarchical model for alignment of LC-MS spectra. Specifically, it presents a fully Bayesian mixed-effects model that effectively accounts for population homogeneous behavior across biological groups (i.e., fixed systematic changes) and for heterogeneity within groups (random effects). Bayesian inference of unknown parameters is carried out via MCMC method using the Gibbs sampling technique with conjugate priors. The proposed approach not only allows alignment with respect to RT and m/z dimensions, it also implicitly normalizes the peak intensities of peptides. The performance of the approach is assessed through two LC-MS datasets: replicate spectra generated from proteins of lysed E. coli cells and spectra representing six groups, where six proteins are spiked at different concentrations into a complex sample background of human peptides. Through these datasets, it is demonstrated that BHM achieves good performance in reducing coefficient of variation of replicate TIC profiles, while preserving the original experimental retention time (i.e., without introducing superfluous signal gaps across multiple LC-MS spectra). A limitation of BHM is that it requires considerable amount of computation time in aligning LC-MS data with respect to both RT and m/z dimensions. Future work will focus on addressing this limitation through optimization of the algorithm.
This work was supported in part by the National Science Foundation Grant IIS-0812246, the National Cancer Institute (NCI) R21CA130837 Grant, NCI R03CA119313 Grant, NCI Early Detection Research Network Associate Membership Grant, and the Prevent Cancer Foundation Grant awarded to HWR.
Getachew K Befekadu, The Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA.
Mahlet G Tadesse, The Department of Mathematics, Georgetown University, 308 St. Mary’s Hall, Washington, DC 20057, USA.
Habtom W Ressom, The Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University, Washington, DC 20057, USA.