|Home | About | Journals | Submit | Contact Us | Français|
Motivation: LC-MS allows for the identification and quantification of proteins from biological samples. As with any high-throughput technology, systematic biases are often observed in LC-MS data, making normalization an important preprocessing step. Normalization models need to be flexible enough to capture biases of arbitrary complexity, while avoiding overfitting that would invalidate downstream statistical inference. Careful normalization of MS peak intensities would enable greater accuracy and precision in quantitative comparisons of protein abundance levels.
Results: We propose an algorithm, called EigenMS, that uses singular value decomposition to capture and remove biases from LC-MS peak intensity measurements. EigenMS is an adaptation of the surrogate variable analysis (SVA) algorithm of Leek and Storey, with the adaptations including (i) the handling of the widespread missing measurements that are typical in LC-MS, and (ii) a novel approach to preventing overfitting that facilitates the incorporation of EigenMS into an existing proteomics analysis pipeline. EigenMS is demonstrated using both large-scale calibration measurements and simulations to perform well relative to existing alternatives.
Availability: The software has been made available in the open source proteomics platform DAnTE (Polpitiya et al., 2008)) (http://omics.pnl.gov/software/), as well as in standalone software available at SourceForge (http://sourceforge.net).
Supplementary information: Supplementary data are available at Bioinformatics online.
Spectral peaks from LC-MS can be used to identify and quantify proteins in complex biological samples (Nesvizhskii et al., 2007). However, LC-MS experiments are susceptible to many sources of systematic bias, including non-constant instrument calibration, imperfect sample preparation, sample run order, etc. (Fig. 1; Callister et al., 2006, Jaitly et al., 2006, Petyuk et al., 2008). Here, we characterize typical biases in LC-MS peak intensities using large-scale calibration measurements with known internal controls. We compare several existing normalization methods and propose an algorithm for estimating and removing systematic biases using the singular value decomposition (SVD) of residual peak intensities. Existing methods are found to either be incapable of capturing complex bias trends or to overfit the data, resulting in invalid downstream statistical inference. The proposed algorithm, called EigenMS, is demonstrated to capture biases with great accuracy without overfitting (Tables 2 and and33).
Calibration datasets are extremely valuable for assessing the performance of computational methods, since they permit comparisons between the results of a method and known characteristics of the experimental design (Tseng et al., 2001). In order to address the normalization of peak intensities in LC-MS, we obtained a large-scale calibration dataset prepared in-house. Samples of Salmonella (S.typhi) proteins under two biologically distinct conditions were mixed together in five different concentrations, with equal concentrations of a mixture of quality control (QC) proteins added to each. Five replicates of each of the five concentration groups were obtained in five batches using a randomized block design. With this design, we are able to definitively separate biological signal from technical bias. In particular, the QC proteins can be used to evaluate a normalization method's ability to remove bias, and the Salmonella proteins can be used to evaluate the method's ability to preserve biological signal. We note, however, that maintaining perfect equality of control protein concentrations across samples is difficult in practice, an issue that we encountered here.
We examine several standard normalization methods for LC-MS peak intensities, including global scaling, peptide-specific ANOVA models (Kerr et al., 2000) and scatterplot smoothing techniques like lowess (Yang et al., 2002) and quantile normalization (Bolstad et al., 2003). Global scaling cannot capture complex bias trends like those commonly seen in high-throughput genomic or proteomic experiments. When the sources of bias are known exactly, ANOVA models can effectively estimate and remove systematic biases (Hill et al., 2008). However, the use of peptide-specific models for preprocessing may overfit the data and invalidate downstream statistical analysis (Dabney and Storey 2007a). Furthermore, it will not generally be possible to identify all of the relevant sources of bias to sufficiently model biases with ANOVA. Scatterplot smoothing techniques are widely used in the microarray literature, but they do not necessarily address all sources of systematic bias, and they can actually introduce additional biases in common settings (Dabney and Storey 2007b).
We present an algorithm, called EigenMS, that adapts the surrogate variable analysis (SVA) method (Leek and Storey 2007) to the problem of normalization of LC-MS peak intensities. SVA uses SVD on model residuals to find trends that are responsible for significant variation that is not explained by the experimental factors of interest. In our context, these ‘residual eigenpeptides’ can be used to flexibly capture and remove complex biases in LC-MS peak intensities. Our adaptations of SVA result in several beneficial features of EigenMS for the proteomics setting. First, EigenMS is applicable to data with widespread missing measurements, as is common in MS-based proteomics. Second, the EigenMS algorithm is well-suited for inclusion in an existing proteomics analysis pipeline, as it does not require any special downstream steps or housekeeping. EigenMS is shown to perform well relative to existing alternatives, based on the calibration data as well as simulations. The algorithm is available as part of the open source proteomics analysis platform DAnTE (Polpitiya et al., 2008). EigenMS is generally applicable in bottom-up MS-based experiments based on either tandem MS (Nesvizhskii et al., 2007), high-resolution LC-MS or hybrids of the two (Zimmer et al., 2006).
Salmonella samples grown under two biologically distinct conditions (Log—logarithmic phase cultures grown in a rich medium and MgM—acidic, magnesium-depleted minimal nutrients medium) were combined in five different concentrations, with 25 QC proteins added in equal concentrations to each as internal controls. The Log phase is what is typically achieved by Salmonella cells in rich medium, whereas MgM is thought to mimic the virulent conditions created within host organisms. The QC proteins are derived from a mixture of organisms, including horse, rabbit and cow. Five replicates of each of the five concentration groups were obtained in five batches. Concentration run order was randomized within each batch. Table 1 details the concentrations for each of the Log, MgM and QC proteomes by experiment number. We removed the first concentration group from our analysis due to its overall low intensity across all batches. The Figure 1A shows a heatmap of the raw peak intensities (see Fig. 5 in Supplementary Material for the corresponding picture of the QC peptides). There are 228 peptides identified from the 25 QC proteins, and 3627 Salmonella peptides, corresponding to 686 unique Salmonella proteins. Overall, about 50% of all peak intensities are missing, due to peptides that were identified in some samples but not in others.
While the peptides derived from the QC proteins should be identical, on average, across samples, there are practical difficulties (e.g. pipetting errors) in ensuring exactly equal concentrations. Another technical feature to note is that batches 4 and 5 were run using a different liquid chromatography column than batches 1, 2 and 3; differences between the data from the two columns are apparent in the heatmap of the raw intensities. Furthermore, as is the case with many high-throughput experiments, replication in these samples is technical in nature, rather than biological. This has the potential to underestimate the sources of variability present in the resulting samples, producing invalid P-values (Churchill 2002, Dabney and Storey 2006). Nevertheless, the data from the calibration measurements considered here proved to be valuable for building and validating our normalization algorithm. The data are included in the software download at SourceForge, for reference; simply search for ‘EigenMS’ at http://sourceforge.net.
We simulated data to mimic the calibration data described above, using the following model:
where yijk is the log-transformed peak intensity for peptide i, comparison group j and batch k; μi is the overall mean intensity for peptide i; C, B and T represent mean differences between comparison groups, batches and arbitrary additional sources of technical variation, respectively; and the ϵ represent random error. Usual sum-to-zero constraints apply to the C, B and T terms. We generated five comparison groups, five batches within each group and m = 200 peptides, of which 80 were differentially expressed across groups. The group effects were chosen as monotonic step functions, similar to what is expected in the calibration dataset. The batch effects were constructed to reflect the LC column effect expected in the calibration dataset, with batches 4 and 5 differing from batches 1–3. Finally, the T terms in the simulation model were generated completely randomly, which meant to reflect systematic differences between samples that are not related to any known aspect of the experimental design. For full details of the simulation, see the Supplementary Material. The Figure 4A shows a heatmap of the raw intensities.
Most widely used normalization techniques in high-throughput genomic or proteomic studies involve some variation of global scaling, scatterplot smoothing or ANOVA (Quackenbush 2002). Global scaling generally involves shifting all the measurements for a single sample by a constant amount, so that the means, medians or (in mass spectrometry) total ion currents (TICs) of all samples are equivalent. Scatterplot smoothing techniques are widely used in microarray studies (Bolstad et al., 2003, Yang et al., 2002). ‘M versus A’ plots (or MA plots, for short) compare two samples by forming a scatterplot with the averages of intensities on the x-axis and the differences of intensities on the y-axis; these are often used for highlighting intensity-dependent biases. As an example, in terms of model (1), an MA plot comparing batches k = 1 and k = 2 from group j = 1 would plot Ai = ( yi11 + yi12)/2 versus Mi = yi12 − yi11, i = 1, 2,…, m. Under certain conditions, any systematic deviations in the scatterplot away from the horizontal zero line can be interpreted as bias. MA smoothing would proceed by fitting a smooth curve to the relationship between the Ai and Mi, then shifting the points to remove the fitted values of that curve, forcing the scatter to follow the horizontal zero line. In the results that follow, we apply scatterplot smoothing of MA plots, arbitrarily using the first sample as the baseline to normalize all other samples too. ANOVA-based normalization can be carried out by fitting gene- or peptide-specific ANOVA models, with specific model terms included for expected sources of bias, then subtracting off the estimated biases from the model (Kerr et al., 2000). In the calibration dataset, we employ peptide-specific models that include batch effects:
similar to model (1), but only including terms that represent known aspects of the experimental design.
The EigenMS algorithm is an adaptation of the SVA method for detecting significant unmodeled factors in the gene expression microarray setting (Leek and Storey 2007). The basic idea of SVA is to (i) use knowledge of the experimental design to estimate effects of the experimental factors of interest, (ii) carry out SVD on the model residuals to capture remaining systematic trends, then (iii) use the estimated residual trends as additional factors to be adjusted for in the downstream, inferential model. This captures the cumulative effect of technical features without requiring knowledge of their exact sources. The SVA algorithm estimates ‘surrogate variables’ as projections of gene expression trends due to technical factors and requires that downstream statistical inference include the surrogate variables as covariates. In so doing, normalization is carried out simultaneously with inference, resulting in a clever work-around to the common problem of overfitting due to complex preprocessing.
EigenMS follows the general approach of SVA with a few modifications. Because EigenMS is intended to be implemented within a pipeline of informatics tools, we prefer to avoid requiring the user to keep track of surrogate variables in downstream analysis. Instead, we follow the general SVA approach of estimating systematic residual trends using SVD, then subtracting off those estimates from the raw data. To prevent overfitting, we employ a rescaling trick to the normalized data, resulting in valid downstream inference and no qualitative difference from the full SVA results. We adapt the SVA algorithm as follows:
The normalized intensities are then contained in , a m × n matrix of the same dimensions as the raw data. Our intention is that the users be able to explore and analyze the normalized data without considering the steps taken in preprocessing. However, as detailed in the SVA paper, a substantial number of degrees of freedom have been used up by the above algorithm. Thus, downstream inference on the normalized data without account for this will proceed as though there are more available degrees of freedom than there really are, resulting in underestimated P-values and anticonservative inference. With this in mind, we describe next a rescaling trick that avoids this problem.
To maintain our ability to compute valid P-values on the normalized data, we incorporate a rescaling algorithm that replaces the systematic variability removed by EigenMS with just enough random variability to approximately achieve the correct number of degrees of freedom. We first compute P-values for experimental factors of interest (‘group effects’ in the examples considered here) using a model that also includes the residual eigenpeptides computed by EigenMS, as would be done if carrying out normalization and inference simultaneously. We then compute P-values using the already-normalized data and a model that only includes the experimental factors of interest, as would be done if carrying out normalization in a preprocessing step, then inference on the already-normalized data, without any special steps to address the degrees of freedom used up in normalization. The rescaling algorithm then adds residual variability to the second model by spreading out its residuals, until its resulting P-values are indistinguishable from the first model's.
Let yij be the intensity for peptide i in sample j, i = 1, 2,…, m, j = 1, 2,…, n. Consider the model
where the wkj are the primary variables of interest, and the glj are secondary factors whose effects we would like to remove from downstream analysis. Significance analysis could be based on the null hypotheses H0i : α1i = … = αKi, i = 1, 2,…, m. A natural test statistic for peptide i is the ANOVA F-statistic, Fi = MSTri/MSEi, computed under model (3). Under standard ANOVA model assumptions, Fi follows the F(K−1),(n−K−L) distribution if the null hypothesis is true, and the corresponding P-value can be computed as P((K−1),(n−K−L) ≥ Fi), where (K−1),(n−K−L) is a random variable from the F(K−1),(n−K−L) distribution.
For the purposes of normalization, we would like to estimate and remove the effects of the glj to produce normalized data
Subsequent significance analysis on the normalized data would be conveniently carried out by computing F-statistics under the model assuming only primary factors are present:
However, without acknowledging the preprocessing steps taken to estimate and remove bias, the assumed null distributions for the resulting test statistics will be incorrect. For example, if model (5) were assumed to apply to the data after normalization, the null distribution for a resulting F-statistics would be assumed to be F(K−1),(n−K), instead of F(K−1),(n−K−L). This overestimate of the number of degrees of freedom would result in underestimated P-values.
Various approaches are available for avoiding this problem, including (i) Storing the estimated secondary factor effects and plugging them in as covariates in the significance-testing model, and (ii) manually adjusting the number of degrees of freedom in the null distribution. Both of these require the user to keep track of the preprocessing steps taken and incorporate them into any software used for downstream inference. To avoid this requirement, we instead consider adding random variability to the model residuals to effectively remove the appropriate number of degrees of freedom. This allows for significance analysis post-normalization with no special steps required on by the user; that is, no additional covariates need to be added to the inferential model, and no adjustment to the null sampling distribution of test statistics are required. The specific algorithm employed for rescaling is as follows:
The upper boundary S for the range of the scale factors is chosen such that the rescaled residual SDs are not permitted to exceed the residual standard deviations estimated from model (3). In our analyses, we have set nγ = 100.
By using the selected residual eigenpeptides to carry out inference and normalization simultaneously, we minimize issues with overfitting. Note, however, that the above procedure does not address the uncertainty associated with the computation and selection of the residual eigenpeptides themselves, although simulation results suggest this is a minor omission. P-values for the normalized data are computed without any need for inclusion of surrogate variables as model covariates, or adjustments to the degrees of freedom used in the null sampling distribution. Thus, for example, the complete peptides from a two-class problem with n samples in each comparison group could be tested for differential expression using a standard two-sample t-statistic and t2(n−1) null distribution (assuming equal variances) after EigenMS normalization; note that we define ‘EigenMS normalization’ to include both the SVD-based adjustment and the rescaling algorithm.
Peptides with missing measurements cannot be included in the matrix algebra required for SVD. However, since all peptides in the same sample were subjected to identical experimental conditions, complete peptides from that sample can be used to identify the residual eigenpeptides required for normalization by EigenMS. Normalization of the complete peptides proceeds exactly as described above. Incomplete peptides are treated similarly, but (i) the residual eigenpeptides found using the complete data are used, and (ii) the subsequent normalization and rescaling steps proceed using only the complete measurements. For an incomplete peptide with intensity vector y, let k0 = (k01, k02,…, k0n0) be the vector of indices for the n0 entries in y with missing measurements. Let y(k0) be the vector containing only the n − n0 complete measurements. Similarly, let e(k0) be the vector of n − n0 residuals from using y(k0) to fit the model in step one of the EigenMS algorithm, V0(k0) be the (n − n0) × H version of V0 after removal of rows corresponding to k0 and X(k0) be the similarly modified version of X. Normalization proceeds with e(k0) and V0(k0) in steps 4 and 5 of the EigenMS algorithm, then using y(k0), V0(k0), and X(k0) in the rescaling algorithm.
We applied EigenMS separately to both the Salmonella and QC peptides in the calibration dataset; in both cases, we fit the model with only concentration group included:
since concentration group is the only experimental factor of interest. Under the assumption that the QC peptides originated from proteins of the same concentration in all samples, identification and removal of any systematic trends present in their profiles across samples will reduce bias. As noted below, however, there are indications of variations from sample preparation that may have resulted in slightly unequal QC protein concentrations across concentration groups. Furthermore, the use of technical replication may have the effect here of underestimating sources of variability. While EigenMS is applicable to incomplete peptides (see Section 2.4.2), we only consider complete peptides in what follows, since: (i) our performance assessments are largely based on the validity of downstream statistical inference, and (ii) for peptides with missing values, special statistical methods are required to obtain unbiased model estimates and valid inference (Karpievitch et al., 2009), and this is not the focus of the present article. In the calibration data, this left us with 66 QC peptides and 425 Salmonella peptides.
In the Salmonella peptides, EigenMS identified three-residual eigenpeptides that explained a significant portion of residual variation in intensities. These eigenpeptides are shown in theFigure 2A (see Fig. 6 in Supplementary Material for the corresponding picture for the QC peptides). On the x-axis is sample index, with the first five ticks corresponding to batches 1–5 of concentration one, the second five ticks corresponding to concentration two and so on. The first residual eigenpeptide explains 49% of the variance and is evidently due to the different amounts of the Salmonella proteins in the mixture, as expected. Note that, while it is difficult to pull a coherent trend out of the remaining pictures, batch 4 appears to differ in all concentrations in eigenpeptide 2, as does batch 5 in eigenpeptide 3. These differences in eigenpeptides for batches 4 and 5 may be due to the use of a different LC column for these measurements. Note that each residual eigenpeptide has the same interpretation if reflected about the horizontal zero line. Thus, an eigenpeptide with an apparent negative effect of batch 4, for example, is equivalent to that with a corresponding positive effect. In the Salmonella peptides, for example, the step-function appearance of the first eigenpeptide reflects the experimental design. However, it is expected that some peptides will become more abundant and some less abundant as the concentration combinations are changed. The one step-function trend reflects both possibilities.
EigenMS removed the effects of the first three residual eigenpeptides (corresponding to eigenpeptides 2–4, the fourth of which is not shown in Fig. 2) in Salmonella, as described in Section 2. The Figure 1B shows a heatmap of the normalized Salmonella peptides, and it is qualitatively evident that much of the systematic within-group features in the raw data have been removed; a similar picture is seen for the QC peptides in the Supplementary Materials. We also carried out SVD analysis of the normalized Salmonella data, with the results shown in Figure 2B. We observed an increase in the relative variance explained by the first eigenpeptide from 49% in raw Salmonella peptides to 62% and a decrease in relative variance explained by the rest of the top eigenpeptides. Since the first eigenpeptide evidently reflects the pattern across samples expected by the experimental design, we take this as evidence that EigenMS has enhanced our ability to discern real concentration differences, an important goal of any normalization procedure.
As discussed in Section 2, EigenMS should produce valid P-values and, in particular, null P-values that are approximately uniformly distributed. We checked this by examining the distribution of P-values for the QC peptides. Figure 3 shows histograms of the P-values for the raw QC peptides, as well as for data normalized by EigenMS and ANOVA. The P-values for the raw QC intensities are not uniformly distributed, indicating that either the actual concentrations of the QC peptides changed from group to group, or technical replication substantially underestimated the variation that led to the observed data; either mechanism (or both) could lead to apparent differential expression in the QC peptides (Churchill 2002, Dabney and Storey 2006). The distribution of the P-values after EigenMS normalization does not look uniform either, although the P-values have been shifted in the right direction; see Figure 9 in Supplementary Data for a comparison of P-values before and after rescaling. This highlights the fact that normalization cannot undo systematic biases across comparison groups, cases in which bias and biological signal are confounded. The ANOVA normalization method produces extremely right-skewed P-value distributions, an indication of overfitting. The scatterplot smoothing normalization method, as well as global scaling, produce more uniformly distributed P-values than does ANOVA, but visual examination of the heatmaps confirm that these normalization methods are unable to capture the systematic biases present in the calibration data. Furthermore, global scaling or scatterplot smoothing to a single reference sample, as is often done, actually obscures the concentration differences (see Fig. 7 in Supplementary Material and the left portion of Table 2).
We then compared the normalization methods in terms of the number of Salmonella peptides selected as statistically significant over a range of false discovery rate (FDR) cutoffs. To estimate the FDR associated with a particular significance cutoff, we used the Q-value, an FDR analog of the P-value (Storey and Tibshirani 2003). At a Q-value cutoff of 5%, corresponding to an expected FDR of 5% among the selected peptides, the number of Salmonella peptides called significant increased by 77 over the raw data (see the left portion of Table 2). Furthermore, EigenMS results in a greater increase in the number of selected peptides than all other normalization methods considered, except ANOVA, although the performance of ANOVA is in large part due to its tendency to underestimate P-values.
We applied EigenMS to the simulated data, again based on model (6).Figure 4B shows the EigenMS-normalized intensities for the simulated peptides, and it is qualitatively evident that much of the systematic within-group trends have been successfully removed. EigenMS identified one-residual eigenpeptide that explained 80% of the variation in intensities after accounting for the concentration differences (Fig. 8 in Supplementary Material). This residual eigenpeptide incorporated the effects of both the B and T factors in the simulation model (1). The right portion of Table 2 shows the number of significant peptides by FDR cutoff for the different normalization methods. EigenMS consistently identifies all of the 80 truly differentially expressed peptides with high confidence. ANOVA normalization fails to identify as many as EigenMS. Scatterplot smoothing and TIC normalization select more peptides than EigenMS, but they underestimate their FDR levels. This is due to the fact that these methods underestimate P-values in the simulation, as seen in Table 3. Table 3 shows the distribution of P-values for the null peptides for the different methods. In the simulation, the T factors, representing technical features in addition to batch effects in Equation (1), were created in such a way that they account for much of the experimental variation. Thus, analysis of the raw data, failing to account for the sources of bias, results in underfitting and a left-skewed null distribution. ANOVA normalization only adjusts for batch effects and hence fails to capture the additional strong technical factors, again resulting in left-skewed null P-values. Scatterplot smoothing and TIC normalization are sample-specific and hence are able to remove much of the bias, but they underestimate P-values, resulting in many more type I errors than expected. The null P-values for EigenMS are approximately uniform as expected, being slightly conservative.
Normalization is an important, but difficult problem in high-throughput proteomics or any other -omics data. Failing to properly account for biases that result from uncontrolled technical aspects of an experiment can have serious adverse effects on the conclusions that can be drawn from the resulting data. On the other hand, overly aggressive normalization routines can easily do more harm than good by overfitting and thus invalidating downstream inferential decisions. In all cases, it is necessary to consider normalization in the larger context of the entire analysis of a dataset, as the steps taken to preprocess data will inevitably affect all subsequent analysis steps. EigenMS has been constructed with these issues in mind, and has been demonstrated to be able to flexibly capture complex biases while preserving the validity of downstream statistical inference.
The approach taken by EigenMS to the normalization of MS peak intensities could be applied to a wide variety of problems in MS-based proteomics, such as the normalization of mass measurements (Petyuk et al., 2008) or elution times (Jaitly et al., 2006). Similar approaches may be applicable to the alignment of LC-MS data, where elution time and mass images are warped across samples to maximize overlap of LC-MS feature clusters (Finney et al., 2008). More generally, systematic errors in alignment on additional dimensions, incorporating ion mobility drift times, for example, could be approached using similar strategies to those employed here.
Portions of this work were performed in the Environmental Molecular Science Laboratory, a United States Department of Energy (DOE) national scientific user facility at Pacific Northwest National Laboratory (PNNL) in Richland, WA.
Funding: NIH R25-CA-90301 training grant in biostatistics and bioinformatics at TAMU; the National Institute of Allergy and Infectious Diseases NIH/DHHS through interagency agreement Y1-AI-4894-01; National Center for Research Resources (NCRR grant no. RR 18522); PNNL is operated for the DOE Battelle Memorial Institute (contract DE-AC05-76RLO01830).
Conflict of Interest: none declared.