In a recent study, in vivo metabolic labeling using 15N traced the rate of label incorporation among more than 1700 proteins simultaneously and enabled the determination of individual protein turnover rate constants over a dynamic range of three orders of magnitude (Price, J. C., Guan, S., Burlingame, A., Prusiner, S. B., and Ghaemmaghami, S. (2010) Analysis of proteome dynamics in the mouse brain. Proc. Natl. Acad. Sci. U. S. A. 107, 14508–14513). These studies of protein dynamics provide a deeper understanding of healthy development and well-being of complex organisms, as well as the possible causes and progression of disease. In addition to a fully labeled food source and an appropriate mass spectrometry platform, an essential and enabling component of such large scale investigations is a robust data processing and analysis pipeline capable of reducing large sets of liquid chromatography tandem MS raw data files to the desired protein turnover rate constants. The data processing pipeline described in this contribution comprises a suite of software modules that fulfills these requirements. This software platform includes established software tools, such as a mass spectrometry database search engine, together with several additional, novel data processing modules specifically developed for 15N metabolic labeling. These fulfill the following functions: (1) cross-extraction of 15N-containing ion intensities from raw data files at varying biosynthetic incorporation times, (2) computation of peptide 15N isotopic incorporation distributions, and (3) aggregation of relative isotope abundance curves for multiple peptides into single protein curves. In addition, processing parameter optimization and noise reduction procedures were found to be necessary in the processing modules in order to reduce propagation of errors through the long chain of processing steps in the entire workflow.
Protein turnover, including synthesis and degradation, is an essential element of homeostasis that is highly regulated and controlled at many levels in a cell. Rates of turnover for individual proteins depend on their physiological function, subcellular localization, tissue type, and degradation processes. They may also be affected by participation in protein machines, post-translational status and post-translational modification (PTM) dynamics, as well as host-pathogen interactions and other environmental factors. The precise factors affecting protein turnover are mostly unknown at this time. However, tissue-specific comparative measurements of protein turnover on a proteome scale would provide new system-level knowledge of protein dynamics. One might anticipate that gaining this insight would reveal a deeper and more systematic understanding of the healthy development and well-being of complex organisms. Furthermore, a better understanding of protein dynamics would provide clues to the progression of diseases caused by protein misprocessing, which include Alzheimer, Parkinson, Huntington, and the prion diseases as well as the tauopathies (2).
Both radioactive (3) and stable isotope labeling techniques (4) have been employed in metabolic research. 15N isotope labeling provided some of the first evidence that dietary nutrients are used to maintain cell homeostasis as well as to provide energy (5). 15N metabolic labeling has been applied to studies of a variety of model organisms (6) for several decades, but only recently has been employed for in vivo studies of mammals, such as Rattus norvegicus (rat) (7, 8). In one method reported by Yates and coworkers, termed stable isotope labeling of mammals (SILAM)1, 15N incorporation was achieved by feeding rats a diet of 15N-enriched algae. They found that after two rounds of labeling, more than 95% 15N incorporation was achieved in proteins in brain tissue (7).
Similarly, we chose isotopically labeled nitrogen for our in vivo proteome dynamics investigations, since each amino acid contains at least one, but no more than four, nitrogen atoms. The natural nitrogen-15 isotope abundance is 0.37%, ensuring a low natural background of “prelabeled” molecules. Because nitrogen in proteins is relatively inert chemically, the degree of isotope incorporation produces and retains a unique signature in cellular proteins. This isotopic signature can be followed kinetically and read out by mass spectrometric analysis, which modern instruments can perform under high resolution and high mass accuracy conditions. Metabolic 15N labeling generates more complex isotope patterns than those observed in a standard stable isotope labeling in cell culture (SILAC) experiment using a few 13C- and 15N-labeled amino acids. These more complex isotope patterns therefore require more sophisticated and powerful software for data processing.
Here we provide a comprehensive description of the extensive data processing pipeline required to analyze the in vivo 15N labeling results obtained from high resolution and high mass accuracy mass spectrometric analyses. Using this strategy for studies of protein dynamics and turnover in mammals, we measured the dynamic behavior of more than 1700 proteins in vivo in brain, liver, and blood of the mouse. As published elsewhere (1), these studies allowed us to extract mammalian protein turnover rate constants spanning three orders of magnitude and provided evidence that particular protein complexes and organelles turn over in a synchronized manner. Our dynamic labeling strategy dramatically expands the capabilities of both the stable isotope metabolic labeling of mammals (SILAM) (7) and the dynamic SILAC (9) methods used in cell culture.
In this contribution, we describe an essential and enabling component of these large-scale investigations: a robust data processing and analysis pipeline containing novel data-processing modules specifically developed for 15N metabolic labeling. It significantly extends the capabilities of the currently available quantification software platforms.
Experimental procedures are described in detail elsewhere (1). Briefly, 9-week-old FVB mice were fed with natural algae for 7 days. After this habituation period, a diet of 15N-enriched algae was introduced (t = 0) and mice were maintained on this isotopically enriched diet for the remainder of the experiment. At 15N-incorporation time points of 0, 0.38, 1, 2, 4, 8, 16, 24, and 32 days, tissue samples (liver, blood, and brain) were collected. Proteins were extracted from the samples for measurement of turnover.
Protein samples were reduced and alkylated before separation on one-dimensional SDS-PAGE. For a single tissue, gel lanes corresponding to the nine 15N incorporation time points were cut from the gel and each lane was further cut into nine slices, approximately corresponding to the molecular weight bands of Precision Plus Protein standard from Bio-Rad, Hercules, CA (10–15, 15–20, 20–25, 25–37, 37–50, 50–75, 75–100, 100–150, 150–250 kDa). A gel image was taken for measurement of optical density of the gel slices. A utility program (a separate code from the main body of the program described below) was written in MATLAB to calculate the darkness of the gel slices, subjected to blank gel background subtraction. The purpose of the utility program was to provide estimates of protein amount in each of the gel slices.
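The background-corrected darkness calculation can be sketched as follows. This is a minimal Python analogue of the MATLAB utility (the actual code is not reproduced here); the function name and the pixel conventions are illustrative assumptions.

```python
def band_darkness(band_pixels, blank_pixels):
    """Estimate relative protein amount in a gel slice from optical density.

    band_pixels : grayscale pixel values inside the slice (0 = black).
    blank_pixels: pixel values from a blank region of the same gel,
                  used for background subtraction.
    Returns total darkness: background level minus each pixel value,
    clipped at zero and summed over the slice.
    """
    background = sum(blank_pixels) / len(blank_pixels)
    return sum(max(background - p, 0.0) for p in band_pixels)
```

In this sketch darker pixels have lower grayscale values, so a slice on a uniform blank background contributes zero darkness.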
The gel slices were subjected to a standard in-gel digestion procedure with trypsin, and the resulting peptides were analyzed by a 1-hour LC MS/MS method on an LTQ FT instrument (ThermoFisher Scientific, Bremen, Germany), interfaced with a NanoAcquity LC system (Waters Corporation, Milford, MA).
Injection of a similar amount of peptides reduces retention time variation between different samples (gel slices). In order to inject a similar amount of peptides for each LC MS/MS analysis, the injection volume was adjusted according to the background-corrected darkness of the gel piece. Survey scans were acquired in the FTICR MS using a mass resolution of 100,000. Six MS/MS scans were acquired in the ion trap for each survey scan. See reference (1) for more details.
Peptide identification was obtained by use of the in-house Protein Prospector search engine. The search parameters included: database = Swiss Prot; species = Mus musculus; enzyme specificity = trypsin; allowed missed cleavages = 1; fixed modification = carbamidomethylation; variable modifications = acetylation of the protein N terminus, conversion of peptide N-terminal glutamine to pyroglutamate, methionine loss from the protein N terminus, methionine loss from the protein N terminus with acetylation, and methionine oxidation; maximal number of variable modifications = 2; parent mass tolerance = 50 ppm; fragment mass tolerance = 0.6 Da; peptide expectation cutoff = 0.05. Matches of sequences from the decoy database (normal + random) indicated that the false discovery rate at this expectation cutoff was 0.08%.
15N enrichment was measured for the algae used in the mouse diet. Approximately 30 μg of protein was extracted from each of the natural and 15N-enriched algae. The natural and 15N-labeled algae proteins were subjected to SDS-PAGE separation. Protein bands were cut from the gel and subjected to in-gel digestion and LC MS/MS analysis. The 15N relative abundance in the 15N-enriched algae was 99.56%, calculated from measurements of the molecular ion isotope distributions of the 19 most intense algal peptide ions.
A novel software strategy was developed to extract protein turnover rate constants from raw LC MS/MS data files. For each tissue, let NIT be the number of incorporation time points or gel lanes and NGB be the number of gel slices in each gel lane. There are NIT × NGB LC MS/MS data files. A log file was created in order to track each raw data file through the data processing workflow. The modules and algorithms of the data processing pipeline are described in detail in the supplemental material. The software described here and data files can be provided upon request.
For each 15N incorporation time point, proteins were separated in a single lane of SDS-PAGE. Each lane was divided into nine molecular weight slices. Because there were nine time points, a total of 81 LC MS/MS runs were carried out for each tissue sample. For each time point, we analyzed two biological replicates, for a total of 162 LC MS/MS runs for each tissue. MS/MS peaklists were extracted using the PAVA program (see the supplemental Material) from the 18 LC MS/MS raw data files corresponding to the zero 15N incorporation (t = 0) time point.
For each tissue sample, the MS/MS peaklist files of the 18 LC MS/MS runs corresponding to zero incorporation were submitted for a database search with Protein Prospector against Mus musculus (Mouse) protein sequences in Swiss Prot database release July 9, 2009 (16201/510076 entries searched). The search result was exported into a tab-delimited text file. The search results are summarized in Table I.
To determine the ion chromatographic peak retention times for the identified peptides, we created the fitXIC module (supplemental Material). The module first extracts an ion chromatogram according to the parent m/z value and the retention time obtained from the identification of the MS/MS scan. The mass accuracy allowed in the extraction was 20 ppm (see mass error distribution discussion below). The retention time window for the extraction is 2.25 min, which is an order of magnitude larger than the measured width of the distribution of differences between the extracted ion chromatographic peak location and the MS/MS retention time.
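The extraction step above can be sketched as follows. This is a simplified illustration, not the fitXIC code itself; the data layout (a list of centroided survey scans) and the interpretation of the 2.25-min window as centered on the MS/MS retention time are assumptions.

```python
def extract_xic(scans, target_mz, rt_center_min,
                ppm_tol=20.0, rt_window_min=2.25):
    """Extract an ion chromatogram for one identified peptide.

    scans: list of (retention_time_min, [(mz, intensity), ...]) survey scans.
    Returns [(rt, summed_intensity)] for scans whose retention time falls
    inside a window of total width rt_window_min centered on rt_center_min,
    summing centroid intensities within +/- ppm_tol of target_mz.
    """
    mz_tol = target_mz * ppm_tol * 1e-6          # ppm tolerance in m/z units
    lo = rt_center_min - rt_window_min / 2.0
    hi = rt_center_min + rt_window_min / 2.0
    xic = []
    for rt, peaks in scans:
        if lo <= rt <= hi:
            total = sum(i for mz, i in peaks if abs(mz - target_mz) <= mz_tol)
            xic.append((rt, total))
    return xic
```

The resulting (rt, intensity) trace would then be fit with a chromatographic peak shape to locate the apex, a step omitted here.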
As shown in Table II, about two-thirds of identified peptides gave a good fit in extracted ion chromatograms. The remaining one-third peptide parent ions had low signal-to-noise ratio and their extracted ion chromatograms were poorly fitted by this method. In those cases, the MS/MS retention times were used for further processing.
The accurate cross-extraction of 15N peaks from the different incorporation time points was possible because of the high mass accuracy of survey scans and the stability of retention times between LC MS/MS runs. This module searches for ion peaks in a two-dimensional space narrowly defined by retention time and m/z. The narrower the space, the more specific the cross-extraction becomes. However, one may lose signals using a space that is too narrow. In order to optimize the search space, we examined carefully the retention time variations and mass error distribution.
In a typical comparative label-free proteomic quantification experiment, the goal is to quantitatively detect differences in a small fraction of the components; quantities of the majority of peptides do not change dramatically. If the retention time varies between different LC MS/MS runs, an LC alignment procedure (10) can be applied based on many so-called “landmark” ions distributed throughout the retention time and m/z space. In our experiment, all peptides were subjected to extensive 15N incorporation, so no reliable “landmark” ions (11) were available, especially at the later incorporation time points. We took extra precautions experimentally to achieve maximal retention time stability. First, we used a Waters NanoAcquity LC system, which provides high flow stability and a controlled column temperature. Second, we chose a more robust 100 μm ID × 100-mm-long C18 column running at 0.8 μl/min instead of the high performance 75 μm ID × 150-mm-long UPLC column running at 0.3 μl/min. The trap and analytical columns dedicated to the experiment were properly “aged” so that the retention times for identical (or isotopically labeled) samples did not drift by more than the LC peak width. Third, the total peptide concentration injected onto the LC was carefully controlled. The absolute amount for injection was optimized (A) to permit identification of the maximal number of peptides and to achieve a high survey scan signal-to-noise ratio, (B) to preserve separation efficiency, and (C) to maintain long-term column stability. Injecting too much sample broadens the LC peak width; high sample concentrations may also permanently deposit material on the column and cause back pressure to increase. The stability of retention times between experiments was further improved when the amounts of sample injected were similar.
This normalization of sample concentrations was realized by “optical density” measurement using the utility program to calculate the darkness of gel bands mentioned in the experimental section. The procedure worked quite well as shown by the total ion chromatograms for MW 10–15 kDa gel slices of the brain sample (Fig. 1).
In order to evaluate the retention time stability, two identical samples were analyzed 100 h apart from each other. Between the two samples, 89 other samples were continuously analyzed. The fitXIC module (see the supplemental Material) was used to extract XIC peak locations for identical ions in the two runs. The ions selected for the comparison were required to have a good XIC fit in both runs; 277 ions common to both runs were used for comparison. The distribution of retention time differences of the ions indicated that the median retention time shift was 10.7 s and the spread of the distribution (2 sigmas) was 7.2 s. The retention time shift was very close to the median peak width (10.8 s) of the large-scale experiments for the three tissues. In order to allow for retention time variability in locating ions, the cross-extraction retention time window was chosen to be ±12 s from the 14N peptide retention time. The 24-s window was wide enough to accommodate the retention time variations, yet was sufficiently narrow to avoid extensive inclusion of noise peaks (from other peptides).
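The retention-time comparison above reduces to simple summary statistics over ions common to both runs. A sketch follows; the identifier-to-apex-time dictionary layout is an assumed representation.

```python
import statistics

def rt_shift_stats(apex_run1, apex_run2):
    """Summarize retention-time drift between two LC MS/MS runs.

    apex_run1, apex_run2: dicts mapping an ion identifier to its XIC
    peak apex time (seconds).  Only ions present in both runs are used.
    Returns (median_shift, two_sigma_spread) of the per-ion differences.
    """
    diffs = [apex_run2[k] - apex_run1[k] for k in apex_run1 if k in apex_run2]
    return statistics.median(diffs), 2.0 * statistics.stdev(diffs)
```

Applied to the 277 shared ions, this kind of computation yields the 10.7-s median shift and 7.2-s spread quoted above.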
Mass error distributions were constructed directly from the mass error lists of the database search results, and the histograms were fit with a Gaussian. The average mass error from all the peptides in the three sets of data was 2.4 ppm and the average distribution width (FWHH) was 8.5 ppm. The mass error could be reduced by performing internal calibration for every survey scan in which several identified ions are distributed evenly across the spectrum. This is easily done for the LC MS/MS runs at the zero 15N incorporation time point but is more difficult for the other 15N incorporation time points. For our current purpose, a wider (20 ppm) mass tolerance was therefore used for the extraction to guarantee the inclusion of the desired peaks; this tolerance is significantly larger than the width (7.9 ppm, measured as full width at half height) of the parent m/z error distribution for all the identified peptides. The drawback of the wider window is the introduction of more noise peaks.
To illustrate the 15N cross extraction process, a sample peptide ion, LYEQLSGK (z = 2), from phosphatidylethanolamine-binding protein 1 (P70296) in the brain sample was randomly selected. It was identified in the t = 0 sample with an m/z of 469.25274 and a chromatographic retention time of 13.3 min. Accounting for contributions from natural isotopic substitutions (13C), there are 14 possible distinct isotopomers to be cross-extracted (NPX, defined in the supplemental Material). For each 15N incorporation time point, the spectral segment in profile mode from m/z 469 to 476 was averaged over the retention time range 13.3 − 0.2 min to 13.3 + 0.2 min. The profile spectral segment was first centroided into a short peaklist. The monoisotopic ion intensity was the most intense peak within a window of theoretical m/z 469.25255 ± 20 ppm, and the remaining 13 peaks were extracted in the same manner.
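The per-isotopomer peak picking can be sketched as follows. As a simplification, the sketch assumes consecutive isotopomers are spaced by the 15N–14N mass difference divided by the charge; in reality 13C-containing peaks have a slightly different spacing, so treat this as illustrative only.

```python
MASS_DIFF_15N = 0.997035  # 15N - 14N mass difference, Da

def pick_isotopomer_peaks(centroids, mono_mz, charge, n_peaks, ppm_tol=20.0):
    """Pick the most intense centroid near each expected isotopomer m/z.

    centroids: list of (mz, intensity) from the averaged, centroided
    spectral segment.  Returns one intensity per expected peak
    (0.0 when nothing falls inside the ppm window).
    """
    intensities = []
    for k in range(n_peaks):
        target = mono_mz + k * MASS_DIFF_15N / charge
        tol = target * ppm_tol * 1e-6
        window = [i for mz, i in centroids if abs(mz - target) <= tol]
        intensities.append(max(window) if window else 0.0)
    return intensities
```

For LYEQLSGK (z = 2), this would be called with mono_mz = 469.25255 and n_peaks = 14.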
Shown in Fig. 3, the profile mode spectral segment (Fig. 3A) for peptide ion, LYEQLSGK, z = 2, at the incorporation time of 16 days was converted into the centroid peaks (Fig. 3B). The cross-extraction module (see the supplemental Material) must repeatedly access the ~100 MB raw files and was therefore the most time-consuming processing step. It required, on average, 1 s for each peptide ion (from 9 LC MS/MS raw data files).
Fig. 4A shows the spectral segments of LYEQLSGK, z = 2, peptide ions for all the nine incorporation time points.
The 15N incorporation distribution is the fraction of the total peptide population carrying a defined number of 15N atoms. The theoretical isotope distributions for peptide ion LYEQLSGK, z = 2, shown in Fig. 2, serve as the “template” matrix A of least-squares Equation (1) of the supplemental Material, in which pi are the experimental peaks of Fig. 3B. Because the fractions fi in the 15N incorporation distribution of Equation (1) of the supplemental Material should not be negative, we use a nonnegative least-squares (NNLS) algorithm (12) to solve the equation. Because the number of unknowns (11 in this example) is always less than the number of equations (14 in this example), the least-squares problem is over-determined. In this example (Fig. 3C), at incorporation day 16, ~50% of peptide LYEQLSGK in brain contains no 15N atoms, 45% contains 5 15N atoms, and 5% contains 6 15N atoms.
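The NNLS step can be illustrated with SciPy's implementation of the Lawson–Hanson algorithm. The template matrix below uses made-up isotope patterns purely for illustration; real templates come from isotope-distribution synthesis.

```python
import numpy as np
from scipy.optimize import nnls

# Toy "template" matrix A: each column holds the theoretical isotope
# pattern (relative peak intensities) of the peptide carrying a fixed
# number of 15N atoms.  These numbers are illustrative only.
A = np.array([
    [0.70, 0.00, 0.00],
    [0.25, 0.65, 0.00],
    [0.05, 0.30, 0.60],
    [0.00, 0.05, 0.35],
    [0.00, 0.00, 0.05],
])

# Simulated measured peak intensities p: an equal mixture of the
# first and third isotopomer species.
p = 0.5 * A[:, 0] + 0.5 * A[:, 2]

# nnls solves min ||A f - p|| subject to f >= 0
f, residual = nnls(A, p)
f = f / f.sum()   # normalize to a fractional 15N incorporation distribution
# f is approximately [0.5, 0.0, 0.5]
```

Because the nonnegativity constraint is built into the solver, no post hoc clipping of negative fractions is needed.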
In the genCurve processing module (supplemental Material), we introduced a noise-reduction algorithm before the construction of 15N incorporation curves. The 15N incorporation distributions for all the nine incorporation time points are shown in Fig. 4B.
A boundary for recognizing the signal region was constructed, running from the upper-left corner to the lower-right corner. The algorithm is very effective at eliminating noise peaks in the short incorporation time region.
The extracted 15N distributions were converted to the relative isotope abundance (RIA) of the newly synthesized peptide by use of Equation (2) of the supplemental Material. The RIA is essentially the relative abundance of all 15N-labeled peptides. The isotopic labeling allowed us to differentiate newly synthesized peptides from peptides present before the initiation of labeling.
This data reduction step only extracted the “first moment” from the 15N distribution and neglected the high order information, such as the distribution width. A relative abundance curve was constructed for every ion of all the identified peptides. Conversion of 15N distributions to the relative abundances also normalized the data and allowed the aggregation of peptide relative abundance curves into protein curves.
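The idea behind this conversion can be sketched as follows. Equation (2) itself is in the supplemental material and is not reproduced here; the sketch assumes, as the text states, that RIA is the fraction of the peptide population carrying any 15N label.

```python
def relative_isotope_abundance(f):
    """Sketch of the RIA idea: the fraction of the peptide population
    that carries one or more 15N atoms.

    f: 15N incorporation distribution, f[k] = fraction with k 15N atoms
    (f[0] is the unlabeled, pre-existing population).
    """
    total = sum(f)
    return sum(f[1:]) / total if total else 0.0
```

For the LYEQLSGK day-16 example (~50% unlabeled, 45% with 5 atoms, 5% with 6 atoms), this gives an RIA of ~0.5.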
The relative abundance curves for 15 observed peptides of phosphatidylethanolamine-binding protein 1 (P70296) in mouse brain, including LYEQLSGK, are shown in Fig. 5A.
The Pep2Prot module (see the supplemental Material) selects peptide relative abundance curves based on their similarity to the aggregated curve. When all peptide curves were exhausted, the aggregated curve became the protein relative abundance curve. The protein curve of phosphatidylethanolamine-binding protein 1 (P70296) in brain, constructed from 12 peptide curves, is shown in Fig. 5A.
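One way to realize such similarity-based selection is an iterative trimmed mean. The RMS criterion and cutoff below are assumptions for illustration, not the published Pep2Prot rules.

```python
def aggregate_curves(curves, max_rms_dev=0.1):
    """Aggregate peptide RIA curves into one protein curve.

    curves: equal-length RIA curves, one value per incorporation time point.
    Repeatedly drops the curve farthest (RMS distance) from the current
    mean until every remaining curve lies within max_rms_dev of the mean,
    then returns (mean_curve, retained_curves).
    """
    kept = [list(c) for c in curves]
    n = len(kept[0])
    while len(kept) > 1:
        mean = [sum(c[i] for c in kept) / len(kept) for i in range(n)]
        devs = [(sum((c[i] - mean[i]) ** 2 for i in range(n)) / n) ** 0.5
                for c in kept]
        worst = max(range(len(kept)), key=devs.__getitem__)
        if devs[worst] <= max_rms_dev:
            break
        kept.pop(worst)
    mean = [sum(c[i] for c in kept) / len(kept) for i in range(n)]
    return mean, kept
```

In this toy scheme, a protein observed via 15 peptide curves could end up with, say, 12 retained curves after outliers are trimmed, matching the spirit of the selection described above.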
A comprehensive list of the protein turnover rate constants and their biological implications is given in a separate account (1); here a few examples are given. As shown in Fig. 5B, the fit of the relative abundance curve of phosphatidylethanolamine-binding protein 1 in brain indicates that the half-life of the protein is 12.3 days.
For highly abundant proteins, tens and sometimes over a hundred corresponding peptide relative abundance curves are available. With large numbers of independently measured peptides, highly accurate protein curves can be constructed. As shown in Fig. 6, serum albumin (P07724) was observed in all three tissues. Because the vast majority of serum albumin is introduced by contaminating blood serum, we expected the serum albumin turnover to be the same regardless of the tissue in which it was detected. Indeed, we observed that serum albumin in the three tissues exhibited similar protein relative abundance curves.
The turnover rate constant of 0.36 day−1 for serum albumin in normal mouse blood was consistent with the previous reports of 0.33 and 0.30 day−1 obtained by a protein turnover study (13) with 2H- and 18O-labeled water, respectively.
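For first-order turnover kinetics, the rate constant and half-life are interchangeable via t1/2 = ln 2 / k, a relation used throughout when quoting half-lives:

```python
import math

def half_life_days(rate_constant_per_day):
    """First-order turnover half-life: t_1/2 = ln 2 / k."""
    return math.log(2) / rate_constant_per_day

# k = 0.36 /day (serum albumin in blood) corresponds to roughly
# a 1.9-day half-life; k = 0.33 /day to roughly 2.1 days.
```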
The computation times for each of the major processing steps in analysis of proteins from the brain are: 1 min for generation of MS/MS peaklists, 1.5 h for the Protein Prospector database search, 5.3 h for extraction and integration of identified ion chromatograms at zero 15N incorporation, 50.4 h for cross-extraction and construction of peptide relative abundance curves, and 1.5 h for nonlinear least-squares fitting of protein curves. The database search was carried out on an in-house server allowing up to six simultaneous searches; the search time varied depending on how many users were performing searches simultaneously. The rest of the data processing was carried out on a Windows XP computer with a Pentium 4 processor running at 3.2 GHz and 3.49 GB of RAM. Most of the processing time was spent in the cross-extraction of 15N-incorporated ions because the program repeatedly accessed 81 raw data files with an average size of 110 MB. We are investigating different algorithms to reduce the cross-extraction times.
SILAC (14) has been widely used for protein quantification. Labeled and unlabeled cells can be mixed before protein extraction and other downstream processing. The major advantage of SILAC is the elimination or reduction of discrimination between labeled and unlabeled peptides in sample handling, separation, and mass spectrometry analysis. Protein localization information can be extracted by use of a carefully designed SILAC experiment (6). As mentioned, protein turnover information can also be obtained by using the dynamic SILAC method (15) and its extension to a mammalian system (16). However, we have established that our dynamic SILAM strategy using 15N labeling has distinctive advantages. Algae, a natural food source for mammalian systems, can be labeled with 15N relatively easily. More importantly, all the amino acids are labeled with 15N, so that abundance ratios between labeled and unlabeled populations can be measured over a large dynamic range. This advantage is reflected in our protein turnover results, which span three orders of magnitude in rate constants (1). Unlike a typical SILAC experiment in which two samples are mixed together, we combined the labeling concept with a cross-comparative strategy adapted from label-free methodology to retrieve information on 15N labeling abundances from many different incorporation time points.
As mentioned above, the extraction of protein turnover information from many LC MS/MS runs generated from our dynamic SILAM experiments presented a set of unique challenges in both experimental execution and data processing. To meet these challenges, we developed robust protocols and instrument methods to achieve maximal LC retention time stability and to provide balanced trade-offs between sensitivity, mass resolution, and sampling density in the mass spectrometry. For data analysis, many processing steps were required to reduce a high volume of raw data into a small set of protein turnover rate coefficients. Some of the processing steps, such as database searching for peptide identification, are well established. Other processing steps, such as aggregation of peptide relative isotope abundance curves into protein curves, are specific for extracting protein turnover parameters. Because of the large number of processing steps, quality control at each step becomes quite important in order to reduce error propagation.
Our turnover data of serum albumin in different tissues are consistent with those from an independent study using labeled water (13). The small discrepancy in rate constants led to a calculated difference of 8 h in the serum albumin half-life. The sampling frequency for both studies was not dense enough to resolve such a small difference in half-life. This small discrepancy may also have resulted from use of different methods for isotope tracer introduction. Labeled water was administered by bolus injection in the short-term study and thus the labels can be incorporated to proteins immediately (13). In our case, use of the oral diet introduction involves effects from digestion to whole animal circulation. In an attempt to deal with these nutrient transport issues, a delay parameter was introduced during our modeling process. As indicated in Fig. 6, the curve fittings are not perfect. More sophisticated functions may be needed to fit the intricate details of the curves. However, for our current large scale proteome dynamics study, use of a simple function with the least fitting parameters capable of capturing the essence of the experimental curves seems more appropriate.
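The minimal fitting function with a delay parameter can be sketched as a single-exponential rise with an onset delay. The precise functional form used in ref. (1) is not given here; the form below is one plausible assumed realization consistent with the description.

```python
import math

def ria_model(t_days, k_per_day, delay_days):
    """Assumed minimal turnover model with a dietary onset delay:
    RIA(t) = 1 - exp(-k * (t - t0)) for t > t0, and 0 before the delay t0.
    """
    if t_days <= delay_days:
        return 0.0
    return 1.0 - math.exp(-k_per_day * (t_days - delay_days))
```

With only two kinetic parameters (k and the delay), such a function captures the essence of the curves while keeping the large-scale fitting tractable.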
We have described how the processing modules are chained together to produce the required comprehensive processing workflow. We have also illustrated how the key parameters employed in each module are selected by use of statistical analysis.
In developing this processing platform, we incorporated some ideas from published studies as noted below. Several programs for processing 15N labeling data, mostly for quantification purposes, have been described. MacCoss et al. (17) extracted 15N incorporation distributions from data using a Pearson correlation coefficient. Haegler et al. (18) introduced the “QuantiSpec” software for computing 15N incorporation distributions based on the Mercury isotope distribution synthesis algorithm (19). It should be noted that the computation of stable isotope incorporation has long been known in the field of physiology as mass isotopomer distribution analysis (MIDA) (20). We also adapted several key concepts from the field of label-free quantification (21) and from our own software development in this area (22), such as extraction and integration of ion chromatograms. A review of isotope labeling for general metabolic research can be found in (23), and protein turnover studies, especially in humans, are summarized in (24).
More importantly, we have developed several novel concepts in order to realize the complete data processing pipeline for proteome dynamics studies using 15N metabolic labeling. These include integration of ion chromatograms by a label-free method, combined with careful control of experimental conditions, allowing 15N peaks to be extracted reliably from a large set of LC MS/MS raw data files. Development and implementation of effective noise elimination procedures were critical for the construction of the consistent peptide incorporation curves that we obtained.
As commented on in a recent review (6), a lack of suitable software is one of the major roadblocks to wide adoption of quantification methods based on metabolic labeling protocols. In this study, we documented in detail our data processing workflow for extracting protein turnover information from raw LC MS/MS data. We anticipate that our suite of data processing algorithms will remove some of these earlier roadblocks. It is our expectation that the capability of our software for processing 15N-containing mass spectral data sets generated from dynamic proteomics experiments will provide new insight into, and a better understanding of, the inner workings of proteome complexity in mammalian biological systems.
Recently, an article describing a simple data processing algorithm for 15N labeling experiments was published (25). In that algorithm, the proportion of the labeled peptide signal is simply the fraction of the weighted average of relative isotopomers. In contrast, our algorithm uses a nonnegative least-squares method to deconvolve the 15N distribution from the natural isotope distributions. In addition, we have described detailed algorithms for noise reduction and protein incorporation curve generation.
In conclusion, we have developed a powerful data processing platform for studying proteome dynamics of mammalian tissues with 15N metabolic labeling in vivo. Reliable turnover data for over 1700 proteins from liver, blood, and brain were obtained. The software described here can be easily adapted for processing H/D exchange experiments and, of course, for metabolic labeling experiments with other stable isotopes such as 13C.
* This work was supported by National Institutes of Health (NIH) NCRR grants RR01614 (to ALB) and RR019934 (to ALB), as well as by a gift from the G. Harold and Leila Y. Mathers Charitable Foundation. J.C.P. was supported by the NIH (T32 AG00278) and by the Larry L. Hillblom Foundation.
This article contains supplemental Material.
1 The abbreviations used are: