As has been previously stated, the goal of the Corra project was to create a single computational platform that was customizable, free and open source, for the enabling LC-MS-based proteomic workflows. The sections that follow below serve to illustrate the data flow through the Corra framework, and include discussions of the processing/analysis options available in the Corra implementation presented here, summarized in Figure . This is then followed by sections describing two, very different, biological pilot studies, chosen as 'real world' experimental examples, both to illustrate and validate the application of key aspects of the Corra workflow to quantitative LC-MS data processing and analysis, and it's use for informing subsequent identification of peptides/proteins of interest via targeted MS/MS.
Figure 6 Summary of the Corra framework data flow. In the flow chart, the rectangular boxes represent one or more software processing steps, parallelograms represent data, and the cylinder represents databases. The application of Corra begins with the input of (more ...)
Corra data input and software processing
At the outset, (multiple) raw LC-MS data files are first converted to mzXML [30
], prior to importing into the Corra framework. The data can then be processed for feature/peak picking and alignment. The current implementation of Corra uses SuperHirn for very high mass accuracy OrbiTrap or FT-MS data (and is the default tool setting for such data, unless otherwise specified by the user) and SpecArray for high mass accuracy TOF-MS data (similarly the default tool setting for these data). These default settings were, in fact, determined through testing the feature picking/alignment tools on multiple data types, where we observed that a given tool performed better with certain types of MS data, and in a somewhat instrument-dependent way. We reasoned that this effect likely resulted from the original data sets that were used during the initial stages of tool development and testing. SpecArray was initially developed for the analysis of ESI-TOF data [12
], and thus performed better than SuperHirn for the analysis of the ESI-TOF data, such as that obtained for the human type 2 diabetes plasma study shown below. On the other hand, SuperHirn was initially developed for the analysis of very high mass accuracy FT-MS data [15
], and thus performed better than SpecArray for the analysis of very high mass accuracy OrbiTrap FT-MS data, such as that in the yeast kinase knockout study, also shown below. Since it is necessary to have high mass accuracy data (i.e. from TOF or FT mass spectrometry platforms) in order to perform LC-MS profiling of complex samples, the implementation of SpecArray and SuperHirn in this initial version of Corra represents sufficient choice for anyone wanting to perform LC-MS profiling using Corra. However, the Corra platform, being open source, was designed so that additional tools could be integrated, according to project-specific needs, as described under Methods above.
Another feature of the Corra framework is that it facilitates the process of peak picking and alignment on the server side (thus not tying up the user's own computer) utilizing an underlying cluster environment with a job-scheduling system (in this case a Portable Batch System) to process all the LC-MS runs in a given data set. This mode of analysis also ensures that the process is not limited by the size of the data set (i.e. the number of LC-MS runs) as some stand-alone tools can be. Following feature detection, the user can display pertinent statistics to evaluate the data quality and usefulness, such as a distribution for the number of features extracted from each LC-MS run. Finally, the outputs of these analyses are converted to APML, both for storage, and for data visualization via Corra's APML viewer (see Figure and above under Methods).
Corra peak picking and alignment
As discussed above, the current implementation of the Corra framework allows for the use of either SpecArray or SuperHirn for feature picking and alignment purposes. However, in order to facilitate the analysis of large data sets, where MS signal intensity typically varies over time, Corra normalizes the MS signal intensity data, prior to importing into the peak picking and alignment tools, using LIMMA (Linear Models for Microarray Data) [21
]. Corra, via APML, also allows for subsequent annotation of aligned peaks where MS/MS data is available, for example via subsequent targeted MS/MS identification of differentially abundant peptide features.
It is worth noting here that Corra was designed primarily for the analysis of LC-MS-based (i.e. label free) quantitative proteomic data. It is thus highly desirable that the data itself should be acquired under conditions that maximize for reproducibility. To this end, in one of the studies discussed below, we instituted an automated calibration of the mass spectrometer via the inclusion of a calibration standard in the blank/wash cycle. This provided for very consistent mass resolution and accuracy, meaning that the major concerns to focus efforts on were the maintaining reproducibility for both chromatography and sensitivity. If sensitivity were to drop appreciably, then many features that were above signal-to-noise may no longer be detected. This effect could mislead the user into thinking that the absence of such features was related to biology, rather than machine performance. It is therefore important to be mindful of this issue when analyzing large LC-MS profiling datasets. Indeed, the use of the calibration standard between runs enabled us to closely follow MS sensitivity over the course of large-scale experiments, and several datasets were abandoned before we obtained the data presented in the diabetes example study below, due to the detection of MS sensitivity-related problems. Reproducibility of LC retention time, on the other hand, is somewhat more challenging, the main issues being sample carry-over and gradient drift. Fortunately, the short wash cycle between analyses greatly reduced carry-over, without significantly increasing the time required per sample analysis. Gradient drift can be harder to control. However, improvement in the alignment algorithms currently implemented, have meant that, for the higher-end LC systems commonly in use, this is rarely an issue, save for a major breakdown in the LC-system.
Statistical data analysis using CorraStatistics.R
To be complete, any single platform for LC-MS data analysis would need to include statistical algorithms, appropriate for the analysis of LC-MS data, to generate measures of significance for (peptide) features that appeared to be differentially expressed or abundant between two or more sample groups. Fortunately, many such algorithms have been established for the analysis of genomic and microarray data, now freely available via Bioconductor [20
]. Corra thus includes Bioconductor R statistical packages that are useful and appropriate for the interpretation LC-MS data to meet this need, and which are contained in the CorraStatistics.R module.
To perform statistical analyses, APML data sets are first imported into CorraStatistics.R, which first parses APML's dataset and sample information to create an annotated sample and feature intensity data format in the ExpressionSet [31
] object, the format required for application to the R's statistical packages from Bioconductor. The implementation of CorraStatistics.R presented here, uses LIMMA [21
] for processing data without time course information, and maSigPro [22
] for data that does contain time course information. Regardless of which is used, the final output is always the same: a ranked list of features that best discriminate between one or more biological/physiological/clinical groups. This list can then be used to generate an inclusion list for targeted MS/MS analysis and subsequent identification of the discriminatory peptides/proteins of interest (see Figure ). Finally, the MS/MS spectra, and resultant peptide sequences identified etc., can be annotated back into the aligned APML file for that particular data set.
One drawback of clustering analyses, and indeed many other statistical methods that could be applied, is that they can only use features that aligned across all LC-MS runs. However, it is possible that a given feature may not be present in one sample pool versus another due to a genuine biological effect, rather than it being below the limit of MS detection in one or more LC-MS runs, or due to an error made by either the feature picking or alignment tool. Thus, in order to work around the clustering limitations for such real-life situations, we included an optional function within Corra (called 'n/a replace'), where the user can replace missing intensity data (i.e. given features not aligned across all LC-MS-runs) with the minimum measured intensity for the entire dataset (the default setting), or a specified nominal value of their choosing. When missing intensity values are not replaced, only features that have intensities across all LC-MS runs will be used for supervised or unsupervised clustering analyses. LIMMA analysis of aligned features can then be performed to calculate fold-changes in intensity for each aligned feature across all LC-MS runs which, in turn, can be used to assign a measure of statistical significance for the observed fold changes, for the given dataset. When missing values are replaced prior to clustering, this will produce highly artificial ratios, which can be very misleading if interpreted improperly. Therefore, great care must be taken in applying this optional functionality prior to clustering analysis. For example, if a given feature in the 'control' population aligned across 19 of 20 runs, then replacing the missing feature could be beneficial, since this is likely a 'real' feature that was missed in just one run by the MS or software tools. However, it may be unwise to replace all 17 missing features for another feature that aligned across only 3 runs.
Nevertheless, there are clearly situations where the ability to replace missing features with a nominal value are of use, hence the provision of this function. An example of such a case is given below, where LC-MS profiling was performed on phosphopeptides isolated from a specific protein kinase knockout strain of yeast, in comparison to a wild-type control strain. In this case, we expected missing features in the knockout, when compared with the control. Thus, by using the missing feature replacement function wisely, we were able to successfully cluster the data to identify phosphopeptides that were not present in the profile from the knockout yeast strain. Similarly, in a typical biomarker discovery workflow, there may be markers only present (or absent) in the disease samples, due to a change in gene expression versus the control samples. Thus if one were to observe features that aligned across all (or most) cases, but not the controls, the judicious use of this function would similarly help identify such features. However, since the ratios it generates are highly artificial and therefore open to misinterpretation, it is up to the individual user to ensure that they use this particular function wisely, and to report it when they do so. Indeed, it should be stressed that there are many ways in which high-dimensional data, such as LC-MS data, can be validly analyzed. Thus it is always incumbent on the individual user to first consult with the literature, and/or a suitably qualified biostatistician, before embarking on such complex statistical analyses. Finally, the Bioconductor packages implemented so far were chosen for applicability to our current proteomics research. However, alternative approaches to both statistical data analysis and missing feature replacement are enabled by Corra's open software architecture. With the built-in converter to R's ExpressionSet file format, a user can readily extend or plug-in their own Bioconductor packages of preference into the Corra pipeline, as discussed under Methods above.
Examples of Corra application to biological studies
We next used Corra for the analysis of LC-MS data from two biological pilot studies, as examples of commonly performed proteomic LC-MS experiments. The first goal of these biological studies was to validate the Corra platform's capability to take LC-MS data all the way through to the identification of statistically credentialed, differentially abundant peptides for targeted MS/MS identification. The second goal was to provide 'real life' examples of discovery-based proteomic experiments to illustrate the type of experiments Corra is useful for analyzing, and to show the type of information it can provide for the biologist end user. These two studies were also chosen since they separately highlight different aspects of, and the flexibility of Corra.
The first of these is from a pilot type 2 diabetes biomarker discovery project using human plasma. Here we wanted to be able to initially classify the samples according to disease state via label-free LC-MS analysis, then subsequently identify differentially abundant peptides via MS/MS. The second is from a study to identify candidate protein kinase substrates in vivo via LC-MS phosphopeptide profiling, using kinase deletion strains of the yeast S. cerevisiae. Here we show an example using a yeast deletion strain for the kinase Ark1. In this case, unlike the type 2 diabetes study, we expected the phosphopeptides of interest to be completely absent from the LC-MS profiles compared with a wild-type strain, requiring a different analytical strategy using Corra. Again, we subsequently identified the missing phosphopeptides of interest by MS/MS reanalysis of the wild-type yeast strain.
Application of Corra to plasma biomarker discovery for human type 2 diabetes
The purpose of the pilot study presented here was to apply current LC-MS quantitative proteomics technology to try and identify potential type 2 diabetes candidate plasma biomarkers via profiling of (formerly) N-glycosylated peptides (N-glycosite peptides). To do this, N-glycosite peptides were isolated from plasma samples collected from control individuals with normal glucose tolerance (NGT), as well as from newly diagnosed cases of type 2 diabetes (DB).
-glycosite peptide isolates were thus prepared from 13 individual NGT plasma samples and 9 individual DB plasma samples, as previously described [25
] and summarized above under Methods. The 22 samples were then randomized for LC-MS analysis, followed by 2 additional technical LC-MS replicate analyses of all 22, each with a new randomization of sample run order to reduce potential bias, for a total of 66 LC-MS runs. Following conversion of the raw data to mzXML format, the 66 data files were input into Corra for feature picking and alignment by SpecArray. The aligned datasets were then imported into the CorraStatistics.R module for statistical analysis, as described above.
Figure shows an example of a Corra analysis output, in this case the result of an unsupervised hierarchical clustering of the 66 LC-MS runs for the 13 NGT and 9 DB patients. For this particular analysis, the clustering algorithm utilized the 588 multiply charged features (i.e. excluding 1+ ions) that had been aligned across all 66 runs. This unsupervised cluster dendrogram showed good separation between the two diagnostic groups. It also showed that, as would be expected, the 3 replicate LC-MS runs of each sample were consistently the most similar to each other. The cluster also identified one particular outlier (indicated in Figure ) that was explainable. The plasma sample from this particular individual was also annotated with a high blood triglyceride measurement. Thus this person had likely not fasted, as required, before the OGTT used to make the physiological state classification, and thus the results of the OGTT are not reliable. The cluster also shows that two additional NGT individuals were misclassified. However, this is more likely to be due to natural variation from one individual to the next when trying to diagnose a physiological state via LC-MS profiling alone, rather than due to the method of analysis. It is also worth noting that the OGTT itself, used to define the sample populations, while being the current 'gold-standard' test for diabetes, is itself < 70% reproducible [32
Figure 7 Corra-generated hierarchical clustering of human type 2 diabetes plasma analyses. N-glycosite peptides were isolated from human plasma samples and analyzed via LC-MS, as described under Methods. Randomized, triplicate analyses were performed for each (more ...)
An alternate Corra analysis output for this same data is shown in Figure . In this case, a "volcano" plot was generated, using LIMMA, to show the detection of differential abundance for the 4240 features that aligned across at least 3 of the 66 LC-MS runs (3 being chosen, in this case, since each individual sample was analyzed in triplicate). In this plot, the aligned features are ordered by log Odds (or B value) on the y-axis. The log Odds B value is, essentially, a measure of probability that the feature is differentially expressed/abundant (as opposed to being observed so by random chance), i.e. the higher the log Odds for each feature, the higher the probability that the feature is genuinely differentially expressed/abundant.
Figure 8 Corra-generated volcano plot of human type 2 diabetes plasma analyses. N-glycosite peptides were isolated for human plasma samples and analyzed via LC-MS, as described under Methods. Randomized, triplicate analyses were performed for each of 22 human (more ...)
We then ranked these 4240 aligned features according to log Odds for differential abundance to generate a list of the top 400 for targeted MS/MS, i.e. to identify the peptides, hence proteins, that best discriminated between the NGT and DB diagnostic groups. An inclusion list was thus made for these 400 features and two samples, each a pool of 4 randomly chosen plasma isolates for each disease state, NGT and DB, were analyzed, separately, in triplicate, for acquisition of MS/MS data on an FT-LTQ spectrometer. In this way, we were able to assign peptide sequence identity to over half of the 400 targeted features (data not shown). Table shows the top 20 most discriminatory peptides (i.e. highest log Odds, using a cut-off of ≥ 2.5) that were also successfully matched to a peptide sequence, with a PeptideProphet score of ≥ 0.9 via MS/MS, which corresponded to a false discovery rate of 2%. A log Odds of 2.5 represents a minimum probability of 92.4% for likelihood of non-random differential abundance for these 20 peptides, which, in turn, had a minimum probability 91% for correct identification by MS/MS (as determined by PeptideProphet [29
Top 20 most discriminatory peptides for diabetes identified by follow-up targeted MS/MS analysis.
Finally, it should be noted here that these data are from a pilot study, and as such, none of the proteins identified were, or should be considered as candidate markers for diabetes without further studies. Nevertheless, they do illustrate how the Corra framework was implemented to determine, and then identify, targets of interest in any LC-MS-based biomarker discovery workflow. In turn, it also shows how Corra could be of use to biologists or other researchers interested in LC-MS data analysis of any other, similar comparison of related physiological states.
Application of Corra to phosphopeptide profiling of an Ark1 kinase knockout yeast strain
As a second example of Corra application, we analyzed LC-MS profiles of phosphopeptides isolated from a wild-type yeast strain, for comparison to those obtained from a yeast strain lacking the protein kinase Ark1. The goal was to see whether we could identify potential Ark 1 target proteins, and phosphorylation sites, for this kinase. Unlike for the human diabetes pilot study above, here we fully expected to observe the 'missing features' effect in the Ark 1 deletion strain. Additionally, these analyses were performed on a very high mass accuracy OrbiTrap-LTQ spectrometer, and the SuperHirn tool was instead used for the feature picking and alignments within Corra.
We therefore prepared total phosphopeptide isolates from the two yeast strains, as previously described [27
] and summarized under Methods. The two samples were then analyzed on an OrbiTrap-LTQ spectrometer, in triplicate (6 LC-MS runs in total), limiting the LC-retention time range for data analysis to the 20 to 90 minute window, since this was the region where the peptides eluted for these analyses. On average, SuperHirn detected ~23,300 features per LC-MS run, with 54,059 total detected features. Of these, 6,840 aligned across all six LC-MS runs, with 22,562 that aligned across three or more LC-MS runs.
Since there were 2 biological samples, each analyzed in triplicate, we took the 22,562 features that aligned across three or more runs for importing into the CorraStatistics.R module, to search for the differentially abundant features between the control and the kinase knockout yeast strains. In this study, we were especially interested in phosphopeptides not detected in the knockout strain versus the control strain, since these would represent potential targets for the missing kinase. Thus to do this, we utilized the 'n/a replace' (for missing values) functionality within Corra to set missing features between all 6 runs to the minimum intensity value detected for the entire dataset. While we recognized that this would generate artificial ratios when we performed the analysis, it nevertheless provided us with the information we needed, since we were, ultimately, only interested in peptide identity here, and not a quantitative measure of differential abundance.
Figure shows a clustering analysis for all 22,562 features that aligned across 3 or more runs, and demonstrated that, as expected, the aligned features distinguished very well between the two yeast strains. The excellent separation observed between the replicate analyses of each sample was clearly enhanced by the large, artificial, ratios generated via use of the 'n/a replace' function. A volcano plot, shown in Figure , shows the log Odds distribution for differential abundance, for the same 22,562 features aligned in 3 or more runs. Those with a log Odds value of ≥ 2.2 (i.e. > 90.0% chance of non-random observation of differential abundance), and for which the 'n/a replace' function was used, are colored red. The smaller number of blue-colored features represent those also with a log Odds value of ≥ 2.2, but for which the 'n/a replace' was not required, and thus these generally showed lower ratios of differential abundance (i.e. not artificial) than the red-colored features. In comparing Figures and , we can also make a couple of general observations. In the yeast study, we observed much larger ratios, almost certainly due to the replacement of missing features. On the other hand, in the human diabetes study shown in Figure , we observed much larger log Odds values (i.e. increased confidence in differential abundance). This is almost certainly due to the much larger sample size (66 LC-MS analyses vs. 6 in the yeast study), therefore leading to better statistical confidence.
Figure 9 Corra-generated hierarchical clustering of yeast phosphopeptide analyses. Phosphopeptides were isolated from two yeast strains, one wild type, and the other an Ark1 protein kinase knockout, and analyzed in triplicate on a very high mass accuracy LC-MS (more ...)
Figure 10 Corra generated volcano plot of yeast phosphopeptide analyses. Phosphopeptides were isolated from two yeast strains, one wild type, and the other an Ark1 protein kinase knockout, and analyzed in triplicate on a very high mass accuracy LC-MS platform, (more ...)
From these data analyses, as with the diabetes study above, we next made an inclusion list for targeted MS/MS, to try and identify some of the phosphopeptides lost in the Ark1 knockout yeast versus the control. Table lists the top 12 most discriminatory peptides, with a log Odds of ≥ 2.2, and that also matched a peptide sequence by MS/MS, with a PeptideProphet score of ≥ 0.7 (representing a false discovery rate of 5%). Ark1 is known to be involved in endocytosis and actin reorganization, as also are 4 other proteins from Table (YOL109W, YBL037W, YMR109W, and YJR083C), demonstrating that Corra successfully enabled the generation of potentially biologically relevant information. Finally, for confirmation purposes, Figure shows extracted ion chromatograms, for all 6 LC-MS runs, for the identified YDR293 peptide, RHS*LGLNEAKK (where S* represents phosphoserine) at m/z = 444.895 [M+3H]3+, confirming it's detection in all 3 replicate analyses of the control strain, and its absence in all 3 replicate analyses of the Ark1 knockout strain.
Top 12 most discriminatory yeast phosphopeptides identified by follow-up targeted MS/MS.
Figure 11 Verification of a Corra-identified Ark1 kinase substrate peptide/protein. Following targeted MS/MS identification of the top-ranked Corra-identified discriminatory features (see Figure and Table ) ion chromatograms were (more ...)