The success of mass spectrometry based proteomics depends on efficient methods for data analysis. These methods require a detailed understanding of the information value of the data. Here, we describe how the information value can be elucidated by performing simulations using synthetic data.
Mass spectrometry based proteomics is a method of choice for identifying, characterizing, and quantifying proteins. Proteomics samples are often complex and the range of protein amounts is typically large (>10^6), while the dynamic range of mass spectrometers is limited (<10^3) (1). Because of this mismatch, it is necessary to process the protein samples so that the protein mixture that reaches the mass spectrometer at any given time is much less complex. This is often achieved by first separating the proteins, followed by digestion and separation of the peptides. The peptides are subsequently analyzed in the mass spectrometer.
With mass spectrometry it is possible to measure the mass and the intensity of peptide ions and their fragments. To identify proteins and to characterize their post-translational modifications, the mass measurements are used (2-4); to a lesser degree, the intensity measurements can also be used (5, 6). For quantification, the intensity measurements can be used, but only if the intensity scale is calibrated for each peptide, because the intensity of a peptide ion signal depends strongly on its sequence.
The two most common types of analysis are peptide mass fingerprinting and tandem mass spectrometry. In both approaches, the proteins are digested with an enzyme that has high digestion specificity (usually trypsin) prior to the mass spectrometric analysis. The digestion results in mixtures of proteolytic peptides. In peptide mass fingerprinting, the mass spectrometer detects ions of the proteolytic peptides and measures their masses. The mass of a proteolytic peptide is typically not unique (7), and therefore several proteolytic peptides from a single protein must be observed to generate a peptide mass fingerprint that is useful for protein identification. Peptide mass fingerprinting is usually used for samples where the protein of interest can be purified quite well, because peptide ion signals from different proteins can interfere with each other in an individual mass spectrum, and including mass values of peptides from more than one protein reduces the specificity of the peptide mass fingerprint. In tandem mass spectrometry, individual proteolytic peptide ion species are isolated in the mass spectrometer and subjected to fragmentation. The masses of the proteolytic peptides and their fragments are measured. This makes tandem mass spectrometry more applicable to complex mixtures, because a large amount of information is obtained for each peptide and the interference from peptides originating from other proteins is reduced.
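As a rough illustration of the digestion step described above, the following Python sketch cleaves a protein sequence in silico and computes monoisotopic peptide masses. It assumes the commonly used trypsin rule (cleave after K or R, but not before P), which the text does not spell out; the function names and the `missed_cleavages` parameter are hypothetical, and the residue masses are the standard monoisotopic values.

```python
# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per intact peptide


def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from an in-silico tryptic digest.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    # Positions (as slice boundaries) where the chain is cut.
    sites = [i + 1 for i, aa in enumerate(sequence[:-1])
             if aa in "KR" and sequence[i + 1] != "P"]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Join up to `missed_cleavages` additional adjacent pieces.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a neutral, unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER
```

Peptides with the same composition, such as `"AG"` and `"GA"`, receive the same mass in this sketch, which is one simple reason why a single peptide mass is typically not unique.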
Here we describe a few methods for generating synthetic mass spectra, including peptide mass fingerprints and tandem mass spectra. We also give a few examples of how these synthetic mass spectra can be used to better understand the dependence of the value of information in mass spectra on the nature and accuracy of the measurements.
In peptide mass fingerprinting, protein identification is achieved by comparing the experimentally obtained peptide mass fingerprint to masses calculated from theoretical proteolytic digests of protein sequences from a sequence collection. Each sequence in the collection that matches the experimental peptide mass fingerprint to some extent is given a score, the statistical significance of the high-scoring matches is tested, and the statistically significant proteins are reported. The statistical significance is tested by generating a distribution of scores for false and random matches. The scores of the high-scoring proteins are then compared to this distribution, and the significance level of each match is calculated. The distribution of scores for false and random matches can be obtained by direct calculations (8), by collecting statistics during the search (9, 10), or by simulations using random synthetic peptide mass fingerprints (11). Here we describe a method for generation of synthetic random peptide mass fingerprints to obtain a distribution of scores for false and random identification that can be used to test the significance of protein identification results (11) (Fig. 1):
For investigating other aspects of protein identification, it is useful to construct non-random peptide mass fingerprints. This can be achieved by modifying Step 3:
These non-random synthetic peptide mass fingerprints can be used, for example, to improve or compare algorithms and to investigate the effect of search parameters, including mass accuracy, enzyme specificity, number of missed cleavage sites, and size of the sequence collection searched (8, 12). Non-random synthetic peptide mass fingerprints have also been used to investigate the potential of identifying complex mixtures of proteins by peptide mass fingerprinting (13). It was concluded that mass fingerprinting can be applied to complex mixtures of a few hundred proteins, if the mass accuracy and the dynamic range of the measurement are sufficient (Fig. 2). In most practical cases, however, the dynamic range of the measurement is severely limiting and only a few proteins can be identified by peptide mass fingerprinting (14).
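The idea of using random synthetic peptide mass fingerprints for significance testing can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring used in the cited work: the score is a simple count of matching masses, the random fingerprints are drawn from the pooled peptide masses of the sequence collection (so that actual peptide masses are used), intensities are ignored, and all function and parameter names are hypothetical.

```python
import random
from bisect import bisect_left


def count_matches(fingerprint, sorted_masses, tol=0.05):
    """Hypothetical score: number of fingerprint masses that lie within
    `tol` Da of at least one peptide mass of the protein."""
    hits = 0
    for m in fingerprint:
        i = bisect_left(sorted_masses, m - tol)
        if i < len(sorted_masses) and sorted_masses[i] <= m + tol:
            hits += 1
    return hits


def null_scores(collection, n_peaks, n_trials, tol=0.05, seed=0):
    """Best score per random fingerprint, over all proteins.

    `collection` maps a protein name to its list of peptide masses.
    """
    rng = random.Random(seed)
    proteins = [sorted(masses) for masses in collection.values()]
    # Pool of actual peptide masses from which random fingerprints are drawn.
    pool = [m for masses in proteins for m in masses]
    scores = []
    for _ in range(n_trials):
        fingerprint = rng.sample(pool, n_peaks)
        scores.append(max(count_matches(fingerprint, p, tol) for p in proteins))
    return scores


def empirical_p_value(score, null):
    """Fraction of random scores at least as high as `score`
    (with the usual +1 correction so the estimate is never zero)."""
    return (1 + sum(s >= score for s in null)) / (1 + len(null))
```

In use, the score of a candidate protein for the experimental fingerprint would be passed to `empirical_p_value` together with the list returned by `null_scores`, and the number of trials limits the smallest significance level that can be resolved.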
The method of choice for complex protein mixtures is to search sequence collections using the observed mass of an intact individual peptide ion species together with the masses of the fragment ions observed upon inducing fragmentation of the peptide in the mass spectrometer. This method requires much lower sequence coverage, and in some cases even one peptide can be sufficient to identify a protein. Synthetic peptide tandem mass spectra can be generated by:
Random synthetic tandem mass spectra can be constructed by skipping Steps 3-8 above. These random synthetic tandem mass spectra can be used for significance testing in a similar way as for peptide mass fingerprinting (15).
Non-random synthetic tandem mass spectra can, for example, be used to answer the question: How many fragment ions are needed for identification? By generating non-random synthetic tandem mass spectra containing varying amounts of sequence information, the number of matching fragments needed for identification can be determined (see Note 5 and Fig. 4). In this way it is possible to investigate how many fragment ions are needed for identification depending on the precursor mass, precursor and fragment mass errors, background levels, and modification states (16).
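A non-random synthetic tandem mass spectrum with a controlled amount of sequence information could be sketched as below. The assumptions are ours rather than the authors': only singly charged b and y ions of an unmodified peptide are considered, fragment ions are picked at random, and background peaks are drawn uniformly between 100 Da and the precursor m/z as a simplification (an experimentally derived background distribution would be more realistic). All names are hypothetical.

```python
import random

# Standard monoisotopic residue masses (Da) of the 20 common amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Sorted m/z values of all singly charged b and y ions."""
    masses = [MONO[aa] for aa in peptide]
    ions = []
    for i in range(1, len(peptide)):
        ions.append(sum(masses[:i]) + PROTON)           # b ion, first i residues
        ions.append(sum(masses[i:]) + WATER + PROTON)   # y ion, remaining residues
    return sorted(ions)


def synthetic_spectrum(peptide, n_fragments, n_background, rng):
    """Return (precursor m/z, sorted peak list) for one synthetic spectrum.

    `n_fragments` true fragment ions are kept; `n_background` random
    peaks are added (uniform background is a simplifying assumption).
    """
    ions = fragment_ions(peptide)
    precursor = sum(MONO[aa] for aa in peptide) + WATER + PROTON
    chosen = rng.sample(ions, min(n_fragments, len(ions)))
    background = [rng.uniform(100.0, precursor) for _ in range(n_background)]
    return precursor, sorted(chosen + background)
```

Calling `synthetic_spectrum` repeatedly with increasing `n_fragments` gives a series of spectra that differ only in the amount of sequence information they carry, which is the quantity varied in the experiment described above.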
This work was supported by funding provided by the National Institutes of Health Grants RR00862 and RR022220.
1. The distribution of peptide masses is far from uniform, because peptides contain only a few different types of atoms, and it is, therefore, important to use actual peptide masses in simulations. The distribution of peptide masses consists of peaks with centroids approximately 1 Da apart, and regions in between the peaks that are devoid of peptide masses. Using a uniform mass distribution would therefore result in unrealistic synthetic peptide mass fingerprints.
2. The intensities are often set to the same value for all masses. Alternatively, an intensity distribution derived from experimental data can be used.
3. The number of peptides to pick can, for example, be determined by selecting a target coverage for the proteins, and then randomly picking peptides until that coverage is reached.
4. An example of the kind of information that can be extracted from experiments is shown in Fig. 3. First, the data acquired on an LTQ-Orbitrap were searched using X! Tandem, and all peptides with expectation value <10^-3 were used to characterize the data set. The average and the standard deviation of the number of ions that match the peptide sequence first increase with mass, and at masses above 1500 Da the average saturates (Fig. 3A-B). The average number of background peaks increases with the number of matching peaks up to about 15 matching peaks, and then saturates (Fig. 3C). The standard deviation of the number of background peaks is constant within the uncertainty of the measurement (Fig. 3D). The matching peaks dominate at high intensity. Although the majority of peaks with low relative intensity (<20% of the base peak) are background, a considerable number of low-intensity peaks still match the sequence (Fig. 3E-F).
5. Tryptic peptides were randomly selected from a proteome, and a set of fragment mass spectra was generated for each selected peptide assuming that they were unmodified or phosphorylated. These fragment mass spectra were constructed by randomly selecting fragment ions, and the number of fragments selected was varied over a wide range. The fragment mass spectra were searched against the proteome using X! Tandem, and the probability of successful peptide identification was obtained as a function of the number of fragment ions in the spectra. From these curves, the critical number of fragment masses was derived for a given experimental condition, i.e. the number of fragment masses needed for successfully identifying half of the peptides.
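Two of the notes above describe small procedures that may be easier to follow as code: picking peptides until a target coverage is reached (Note 3) and reading the critical number of fragment masses off an identification curve (Note 5). The sketch below is one possible reading of those notes. The representation of peptides as half-open residue ranges, the linear interpolation between measured points, and all names are assumptions, not details given in the text.

```python
import random


def pick_to_coverage(protein_length, peptides, target, rng):
    """Randomly pick peptides until the target sequence coverage is reached.

    `peptides` is a list of (start, end) half-open residue index ranges.
    Returns the picked peptides and the coverage actually reached.
    """
    covered = [False] * protein_length
    order = list(peptides)
    rng.shuffle(order)
    picked = []
    for start, end in order:
        if sum(covered) / protein_length >= target:
            break
        picked.append((start, end))
        for k in range(start, end):
            covered[k] = True
    return picked, sum(covered) / protein_length


def critical_number(n_fragments, success_fraction, level=0.5):
    """Number of fragments at which the identification curve first
    reaches `level`, by linear interpolation between adjacent points.

    Returns None if the curve never reaches `level`.
    """
    points = list(zip(n_fragments, success_fraction))
    if points and points[0][1] >= level:
        return points[0][0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < level <= y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    return None
```

With the default `level=0.5`, `critical_number` corresponds to the definition in Note 5: the number of fragment masses needed to identify half of the peptides.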