Mass spectrometry is a powerful tool with much promise in global proteomic studies. The discipline of statistics offers robust methodologies to extract and interpret high-dimensional mass-spectrometry data and will be a valuable contributor to the field. Here, we describe the process by which data are produced, characteristics of the data, and the analytical preprocessing steps that are taken in order to interpret the data and use them in downstream statistical analyses. Because of the complexity of data acquisition, statistical methods developed for gene expression microarray data are not directly applicable to proteomic data. Areas in need of statistical research for proteomic data include alignment, experimental design, abundance normalization, and statistical analysis.
Protein mass spectrometry is becoming a popular tool for biomarker discovery for the early detection of disease and for understanding disease prognosis and progression. Protein mass spectrometry allows one directly to assess protein expression as opposed to inferring protein expression from mRNA expression profiles. Proteins are potentially more interesting than DNA or RNA because proteins are more dynamic and reflective of cellular physiology (Aebersold and others, 2005). Proteins are the final product released by normal or tumor cells and therefore are the culmination of transcription, translation, and posttranslational modifications.
Mass spectrometry consists of a diverse range of technologies and techniques; our efforts have been directed towards accurately characterizing complex mixtures, such as the plasma proteome, which requires high-end instrumentation. Characterizing the plasma proteome is a challenging problem. Review papers by Diamandis (2004) and Anderson and Anderson (2002) delineate that not only is the plasma proteome complex but also it spans a wide dynamic range (>10^10). For example, prostate-specific antigen exists at approximately 1/4000000 of the concentration of albumin and approximately 1/300 of other common protein biomarkers such as C-reactive protein. As a result, it is difficult to detect and identify low-abundance proteins in plasma even though these may be the most informative molecules. However, with continued improvements in mass-spectrometry technology and statistical methods to analyze the resulting data, this potential is being more fully realized.
Due to the wide dynamic range of protein abundance associated with complex samples, the experimental preprocessing of samples consists of numerous steps prior to mass-spectrometry analysis with the goal of maximum proteome coverage. The protocols for mass-spectrometry experiments analyzing complex mixtures typically employ a divide-and-conquer strategy, dividing a sample into multiple subsamples, each of which is subsequently subjected to mass-spectrometry analysis. The divide-and-conquer strategy consists of multiple steps, each having the goal of reducing the volume and number of molecules in a subsample. These steps are depicted in Figure 1(a) and include, but are not limited to, removal of the most abundant proteins, fractionation, protein digestion, and separation. For example, for a plasma sample, the first step is depletion of the most abundant proteins (e.g. albumin, immunoglobulins, transferrin, alpha-1-antitrypsin, etc.). These abundant proteins constitute 90% of the plasma proteome, overload chromatographic columns, and preclude detection of low-abundance proteins because of ion suppression (Anderson and Anderson, 2002), (Bergen and others, 2004). Second, the proteins remaining after depletion are divided into multiple fractions (e.g. by strong cation exchange [SCX]) such that each fraction contains fewer proteins and a smaller dynamic range. Third, the proteins are digested using sequence-specific proteases (e.g. trypsin) that convert proteins into smaller peptides. Alternatively, proteins can be digested into peptides and subsequently fractionated. Fourth, peptides are temporally separated into multiple subfractions, typically hundreds to thousands (e.g. using reverse-phase liquid chromatography [LC]), where presumably each subfraction contains fewer peptides than the original fraction. Finally, peptides exiting the separation device are converted to gas-phase ions by one of 2 common methods.
When directly coupled to the mass spectrometer, electrospray ionization is utilized. Alternatively, the column effluent can be spotted on a target with an energy-absorbing matrix and converted to gas-phase ions by matrix-assisted laser desorption ionization (MALDI). Both methods are utilized to introduce gas-phase ions into the mass spectrometer. Molecules are distinguished from one another in the resulting mass spectrum based on the ratio of their mass to charge (m/z). An excellent review that describes sample preprocessing in more detail is Steen and Mann (2004).
Steps in the processing of a complex sample subjected to mass-spectrometry analysis with the goal of cataloguing all proteins in the sample using the (a) bottom–up and (b) top–down approach. A protocol typically employs a divide-and-conquer ...
Different types of mass analyzers vary in their resolution, mass accuracy, and sensitivity. Resolution refers to the ability of the mass analyzer to separate 2 molecules that have similar mass. Mass accuracy is the difference between the measured and the actual mass. Finally, sensitivity refers to the detection limit of the instrument; modern instruments allow detection in the low femtomole range. Four commonly used analyzers are time of flight (TOF), Fourier transform (FT)–based technologies (e.g. Fourier transform ion cyclotron resonance [FT-ICR] and FT-orbitrap), quadrupole mass filters, and ion traps. Additionally, tandem (hybrid) instruments that typically incorporate some combination of mass analyzers are available. Most proteomics laboratories utilize these hybrid combinations to isolate a peptide (parent ion) and fragment the parent into daughter ions from which the amino acid sequence can be read (referred to as tandem mass spectrometry [MS/MS]); these hybrid instruments include quadrupole-TOF or ion-trap-FT (either ICR or orbitrap).
Our focus is on the analytical preprocessing and interpretation of the resulting data rather than the experimental preprocessing of the individual samples. Ultimately, each mass analyzer produces an m/z and abundance pair for each detected molecule. However, the analytical preprocessing steps to create these data pairs are mass-analyzer dependent. For example, for FT-based mass analyzers, the instrumentation results in a sum of sinusoids (transient) that requires a Fourier transformation in order to obtain a set of frequency and abundance pairs. For TOF-based mass analyzers, the user obtains a list of flight time and abundance pairs directly. For both FT- and TOF-based mass analyzers, a mass-calibration transformation is applied in order to ultimately obtain a set of m/z and abundance pairs.
The purpose of this article is to provide an overview of mass-spectrometry data, in particular how m/z and abundance values are generated, and to highlight areas that deserve attention from the statistical community after the m/z and abundance pairs are generated. Although recent research has focused on improving mass-spectrometry technologies, insufficient attention has been given to proper statistical methods for optimally preprocessing and analyzing the acquired data. Thus, herein we aim to (1) provide an introduction to the resulting data and the multiple analytical steps that are required to obtain abundance and m/z pairs for each detectable molecule and (2) discuss places where statistical methods can play an important role in improving the quality of inferences derived from the data.
We begin in Section 2 with a description of bottom–up versus top–down proteomics and explain why the proteomics community makes samples more complicated by digestion of a protein to multiple peptides, that is, smaller chains of amino acids. In Section 3, we discuss data acquisition and the analytical preprocessing that is required to obtain m/z and abundance pairs from the data obtained from a mass analyzer. We have chosen to provide examples from the FT-based technology with which we are the most familiar; however, the general analytical preprocessing steps described herein apply to other mass analyzers. For a more thorough discussion of TOF-based technology, see the 2003 special edition on proteomics in Chance. After transforming to the m/z domain, the next step is data reduction via peak detection, which is discussed in Section 4. Section 5 introduces alignment, and Section 6 provides a discussion on how peptides and proteins are identified. Section 7 discusses important statistical considerations for experimental design and analysis.
Proteomics in the broad sense implies the identification and quantification of proteins and peptides present in a tissue or cell at a single point in time or under a set of conditions. Top down (protein level) and bottom up (peptide level) are 2 techniques that have been broadly utilized for this task (Figure 1).
A top–down approach requires accurate-mass, high-resolution mass spectrometers. When using a top–down approach, the protein sample is fractionated prior to introduction into the mass spectrometer and one or more of the charge states of a single intact protein are isolated in the gas phase (Kelleher, 2004). In order to identify the corresponding protein, fragmentation of the intact protein is subsequently performed on the isolated ion by tandem MS (e.g. using electron capture dissociation or infrared multiphoton dissociation) in order to determine the amino acid sequence. Although some exceptions exist, this methodology only works well on abundant proteins and on proteins with molecular weights less than 30 kDa (Han and others, 2006). As such, top–down proteomics is commonly used for focused testing but not for global proteomics.
The bottom–up approach has also been referred to as the shotgun approach and is the conventional approach for global proteomics (McDonald and Yates, 2002). In bottom–up proteomics, the biological sample is digested with a protease (usually trypsin) and subsequently analyzed in a mass spectrometer. With respect to identification, there are 2 techniques. First, using peptide mass fingerprinting, the observed masses for a set of peptides (the fingerprint) corresponding to a protein are used to identify the corresponding protein; better results are obtained with a high mass–accuracy instrument. Alternatively, tandem MS (MS/MS) can be utilized for identification. Fragmentation via MS/MS yields a ladder of ions corresponding to peptides differing in mass by the loss of individual amino acids. The difference in mass of each ion corresponds to the residue mass of the amino acids; the series of mass differences corresponds to the amino acid sequence of the peptide.
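The ladder-reading idea behind MS/MS identification can be sketched with a toy, idealized fragment series. The snippet below uses standard monoisotopic residue masses but is a simplification: real spectra contain noise, missing ions, and multiple ion series (b/y), all of which are ignored here, and the ladder values are invented.

```python
# Sketch: reading a peptide sequence from an idealized MS/MS fragment-ion
# ladder. Consecutive mass differences are matched against monoisotopic
# amino acid residue masses.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "E": 129.04259, "K": 128.09496, "F": 147.06841, "R": 156.10111,
}

def read_ladder(ladder_mz, tol=0.01):
    """Infer residues from consecutive mass differences in a fragment ladder."""
    sequence = []
    for lo, hi in zip(ladder_mz, ladder_mz[1:]):
        delta = hi - lo
        match = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) < tol]
        sequence.append(match[0] if match else "X")  # X = unassigned gap
    return "".join(sequence)

# Idealized ladder whose spacings spell out ...-P-E-K
ladder = [300.0, 300.0 + 97.05276, 300.0 + 97.05276 + 129.04259,
          300.0 + 97.05276 + 129.04259 + 128.09496]
print(read_ladder(ladder))  # PEK
```

In practice, search engines compare the observed ladder against theoretical fragment spectra of database peptides rather than reading residues off one at a time, but the mass-difference principle is the same.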
Clearly, the resulting sample that is analyzed using the bottom–up approach is much more complex than the top–down approach, since an average protein may have more than 30 tryptic peptides. Furthermore, the bottom–up approach typically does not provide complete protein sequence. However, the resulting peptides are more responsive to mass analysis. Additionally, every protein should have some peptides that are amenable to mass analysis, whereas intact heavily glycosylated proteins (e.g. alpha-1-antitrypsin) are not necessarily amenable to mass analysis. Steen and Mann (2004) provide a nice overview of mass spectrometry methods and available tools for various steps in the sample-processing pipeline.
A TOF mass analyzer accelerates molecules with an electric field and then measures the time it takes a molecule to travel through a flight tube and reach a detector; molecules with smaller mass traverse the flight tube more quickly than those with larger mass. The m/z-value of a molecule analyzed via a TOF mass analyzer is proportional to the square of the flight time. Alternatively, an FT-ICR mass analyzer keeps molecules confined in the high magnetic field of a superconducting magnet, where the molecules orbit with frequencies that are inversely proportional to their m/z-value (Comisarow and Marshall, 1974). The orbitrap is a newer FT-based mass analyzer that is less expensive and smaller than an FT-ICR; it traps ions in an electrostatic field about a central spindle electrode such that molecules precess along the axis of the spindle with frequencies that are inversely proportional to the square root of their m/z-value (Hu and others, 2005b). FT-based mass spectrometers currently provide the highest resolution and mass accuracy, whereas TOF mass analyzers generally have lower resolution and lower mass accuracy but are less expensive and less technically demanding (Mann and Kelleher, 2008). Although very versatile and commonly used for detecting a single known compound, quadrupole mass filters and quadrupole ion traps have the lowest mass accuracy and resolution and thus are not commonly used on their own for global proteomic investigations.
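These idealized analyzer relations can be sketched numerically. The accelerating voltage, flight-tube length, and magnetic-field strength below are invented for illustration, and the formulas ignore real-world effects such as space charge and field imperfections.

```python
import math

# Idealized analyzer relations:
# TOF:    m/z = 2 e U (t / L)^2  ->  flight time grows with sqrt(m/z)
# FT-ICR: f   = z e B / (2 pi m) ->  frequency is inversely proportional to m/z
E_CHARGE = 1.602176634e-19   # elementary charge (C)
DALTON = 1.66053906660e-27   # unified atomic mass unit (kg)

def tof_flight_time(mz, voltage=20e3, length=1.0):
    """Flight time (s) in a linear TOF: accelerating voltage U, tube length L."""
    v = math.sqrt(2 * E_CHARGE * voltage / (mz * DALTON))  # post-acceleration velocity
    return length / v

def icr_frequency(mz, b_field=9.4):
    """Unperturbed cyclotron frequency (Hz) in a 9.4 T magnet."""
    return E_CHARGE * b_field / (2 * math.pi * mz * DALTON)

# Doubling m/z halves the ICR frequency and stretches flight time by sqrt(2).
print(icr_frequency(500.0) / icr_frequency(1000.0))      # ~2.0
print(tof_flight_time(1000.0) / tof_flight_time(500.0))  # ~1.414
```

Note that for a 9.4 T magnet an ion of m/z 500 orbits at roughly 290 kHz, which is consistent with the frequency range shown in the FT-ICR spectrum of Figure 2.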
The 2003 special edition on proteomics in Chance focused solely on TOF data. Herein, we primarily focus on FT-ICR and FT-orbitrap technology, which has the advantage of extremely high resolving power, mass-measurement accuracy, precision, and wide dynamic range.
An FT-ICR measures the rotational frequency of ions as they orbit in the magnetic field of a superconducting magnet. Ions are introduced into an ICR cell and are subsequently excited. Ions of similar m/z orbit together as a packet and induce an electrical current that is detected by a set of detector plates. Similarly, the orbitrap operates by trapping ions about an electrode. The trapped ions then precess along the axis of the electric field.
Analysis of FT-ICR or orbitrap data is based on the fact that each charged molecule generates a single sine wave signal as it orbits (ICR) or precesses (orbitrap) within the chamber. The detected signal is thus a sum of sinusoids (Figure 2a),

y(ti) = Σj αj sin(2πγjti + δj),   j = 1,…,c,

where y(ti) is the transient signal at time ti, i = 0,1,…,n − 1; n is the number of sample points collected across the sampling interval, c is the number of (unknown) ions in the sample, αj is the amplitude (equivalently, abundance) of ion j, γj is the frequency (typically expressed in Hz = cycles/second) of rotation of ion j, and δj is the phase offset of ion j. The values αj and γj are the primary quantities of interest, whereas the phase offset δj merely reflects that ion j was not directly under the detection plate at time 0. Analytical preprocessing proceeds by taking the discrete Fourier transform (DFT) of the obtained transient signal y(ti), most commonly using the fast Fourier transformation (FFT). To improve local resolution, the transient data are typically zero filled prior to implementing the FFT. Assuming n is a power of 2, if the data are zero filled d times, then (2^d − 1)n zeros are placed at the end of the transient and the corresponding frequencies are fi = i/2^d, where i = 0,1,…,n2^d − 1. For example, if the data are zero filled d = 2 times, then 3n zeros are placed at the end of the transient and the frequencies are equal to f = 0, 1/4, 1/2, 3/4, 1, …, n − 1/4. Furthermore, before applying the FFT, the transient data are often windowed (referred to as apodization), which is equivalent to applying a kernel smoother to the FFT output.
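A minimal numerical sketch of this transient model and its FFT-based preprocessing follows; all parameter values (amplitudes, frequencies, phases, sampling rate) are invented for illustration.

```python
import numpy as np

# Simulate a transient with c = 2 sinusoids (amplitudes alpha_j, frequencies
# gamma_j in Hz, phase offsets delta_j), zero fill d = 2 times, and transform.
n, fs, d = 4096, 1_000_000.0, 2          # points, sampling rate (Hz), zero fills
t = np.arange(n) / fs
alpha, gamma, delta = [5.0, 3.0], [219_800.0, 250_300.0], [0.4, 1.1]
y = sum(a * np.cos(2 * np.pi * g * t + p)
        for a, g, p in zip(alpha, gamma, delta))

# Zero filling appends (2^d - 1) * n zeros, interpolating the spectrum.
y_filled = np.concatenate([y, np.zeros((2**d - 1) * n)])
magnitude = np.abs(np.fft.rfft(y_filled))
freqs = np.fft.rfftfreq(y_filled.size, 1 / fs)

def peak_freq(lo, hi):
    """Frequency of the magnitude maximum inside [lo, hi] Hz (windows chosen
    by eye for this toy example)."""
    sel = (freqs >= lo) & (freqs <= hi)
    return freqs[sel][np.argmax(magnitude[sel])]

# The magnitude maxima recover the two rotational frequencies to within the
# interpolated bin width, fs / (n * 2^d), about 61 Hz here.
print(peak_freq(200e3, 235e3), peak_freq(235e3, 270e3))
```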
(a) Transient and (b) unwindowed frequency spectrum from a FT-ICR mass analyzer with d=2 zero fills. The region of the spectrum from 200 to 300 kHz is shown. Each peak corresponds to a unique m/z ratio.
Figure 2(a) displays a transient signal for a biologically simple sample as obtained by a FT-ICR mass analyzer. Figure 2(b) displays the corresponding frequency spectrum (200–300 kHz range) after applying a FFT to the zero-filled data. Note that complex biological samples such as human blood plasma will have many more peaks than are shown in Figure 2(b). Each peak (vertical line) in the figure corresponds to a particular m/z ratio. Figure 3(a) expands the 219.5–220.5 kHz region of this spectrum and displays a single isotopic cluster of a doubly charged molecule. In Figure 3(b), the data have been transformed via a mass-calibration transformation to the mass domain and the peak spacing is 1/charge; mass calibration is further discussed in Section 3.2. The peaks in Figure 3(b) differ by a single neutron (1 Da), usually due to the replacement of one 12C by one 13C isotope. Note that because of the inverse relationship between frequency and mass, the peak with the highest frequency (denoted A in Figure 3(a)), corresponds to the peak with the smallest mass in Figure 3(b).
High mass accuracy is essential for peptide and protein identification; mass accuracy is the difference between the measured and the actual mass. Thus, whether using TOF- or FT-based technology, a mass-calibration transformation is necessary in order to transform from the time domain (TOF data) or frequency domain (FT data) to the mass-to-charge (m/z) domain.
As an informative example, consider FT-ICR data. In an ideal setting of a perfectly regular field (no interference between different ions, etc.), frequency is exactly inversely proportional to m/z; however, it is well known that the electric field that traps ions in the ICR cell causes frequency shifts, and thus, frequency and m/z are not exactly inversely proportional (Marshall and others, 1998). Consider an experiment that uses internal calibrants; internal calibration necessitates the existence of an internal standard in every spectrum with known m/z-values. A common calibration method proposed by Ledford and others (1984) that accounts for frequency shifts is to fit m/z = A/f + B/f^2, where f is the measured frequency and the calibration constants A and B are estimated from the calibrants.
As statisticians, we are immediately puzzled by the approach described above, as basic statistical principles suggest that the most accurate prediction is obtained when the measured response is on the left side of the equation and the predictor variable(s) are on the right side of the equation. That is, a statistician would naturally model the measured frequency as a function of the known quantities, for example f = β0 + β1(z/m).
In fact, using a statistically driven model, Muddiman and Oberg (2005) developed the calibration model, f = β0 + β1(z/m) + β2ATotal + β3AIon, where AIon denotes ion abundance (peak height) and ATotal denotes the total charge content in the ICR cell (summed abundance across the entire spectrum). The unknown calibration parameters β0, β1, β2, and β3 are typically estimated using either internal or external calibrants; β0 represents space-charge effects, β1 is proportional to the magnetic-field strength, β2 represents the effect of total charge, and β3 adjusts for ion abundance.
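The fit itself is ordinary least squares. In the sketch below, every numeric value (the β's, calibrant m/z-values, abundances, and noise level) is invented purely to illustrate the model structure.

```python
import numpy as np

# Fit f = b0 + b1*(z/m) + b2*A_total + b3*A_ion by least squares on
# simulated calibrants, then invert the fitted model to calibrate m/z.
rng = np.random.default_rng(0)
true_beta = np.array([12.0, 1.07e8, -2.0e-6, -5.0e-7])  # invented beta0..beta3

m_over_z = rng.uniform(400.0, 2000.0, size=40)   # known calibrant m/z values
a_ion = rng.uniform(1e4, 1e6, size=40)           # peak heights
a_total = rng.uniform(1e7, 3e7, size=40)         # total charge, varying by spectrum
X = np.column_stack([np.ones(40), 1.0 / m_over_z, a_total, a_ion])
f_obs = X @ true_beta + rng.normal(0.0, 0.05, size=40)  # frequencies + noise

beta, *_ = np.linalg.lstsq(X, f_obs, rcond=None)

def calibrated_mz(freq, ion_abund, total_abund, b=beta):
    """Invert f = b0 + b1/(m/z) + b2*A_total + b3*A_ion for m/z."""
    return b[1] / (freq - b[0] - b[2] * total_abund - b[3] * ion_abund)

# Check on a held-out "ion" generated from the same model:
f_new = true_beta @ np.array([1.0, 1.0 / 750.0, 2e7, 5e5])
print(calibrated_mz(f_new, 5e5, 2e7))  # close to 750
```

Pooling calibrants across spectra (so that ATotal varies) is what makes the β2 term estimable; within a single spectrum ATotal is constant and confounded with the intercept.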
The Muddiman and Oberg calibration model is one example where statistics has contributed to the field of mass spectrometry. By restructuring previously developed calibration models into a simple linear regression framework, mass-measurement accuracy was improved. The example provided was for FT-ICR data; however, a similar equation would apply for orbitrap data. Additionally, although the theoretical calibration formula for TOF data is different than FT-based technology, a calibration equation is necessary to transform from the time domain to the m/z domain and may benefit from statistical input as well.
Typically, the vendor software will have dealt with the analytical preprocessing steps to produce data in the form of m/z and abundance pairs. Additional preprocessing steps are then needed before statistical analysis can be performed to discover and identify proteins that are up- or down-regulated when comparing 2 or more disease groups of interest. These steps include peak detection, alignment, and identification.
After obtaining the full list of m/z and abundance pairs, the next step is data reduction via peak detection. Peak detection refers to the process of reducing an individual spectrum to a set of biologically meaningful peaks that denote potential molecules. There are 2 general approaches for peak detection, and the preferred approach depends on the mass resolution of the mass analyzer. With high-resolution devices, isotopic clusters of peaks can be resolved, whereas for low-resolution devices, these clusters will appear as a single peak in a mass spectrum. For illustration, Figure 3(b) displays an isotopic cluster of peaks for a doubly charged molecule with monoisotopic m/z approximately 713 (monoisotopic mass of 1426); the monoisotopic peak is defined to be the peak that contains only the most abundant isotope of each element (12C, 1H, 14N, 16O, 32S, etc.). Peaks in an isotopic cluster are separated by approximately 1 Da due to the fact that approximately 1% of naturally occurring carbon is the 13C isotope and 99% is 12C. Because 13C is far more common than all other heavy isotopes combined, the pattern of peak heights for an isotopic cluster approximately follows a Poisson distribution. Thus, for larger mass values, the shape of the isotopic cluster will look more Gaussian and the monoisotopic peak will no longer be the most abundant peak in the cluster. In fact, for large masses, the monoisotopic peak will exist below the noise threshold and will be undetectable. As such, it is sometimes more practical to retain the average mass of a cluster instead of the monoisotopic mass.
For data from high-resolution devices such as FT-based mass analyzers, peak-detection algorithms take advantage of resolved isotopic clusters to obtain more accurate determination of both the mass and the abundance by matching a theoretical spectral template to the observed cluster. Computationally, this can be approached in several ways. Yergey (1983) proposed using probability-generating functions, which are products of polynomials; the more often used Mercury algorithm (Rockwood and Van Orden, 1996) uses DFTs. Horn and others (2000) developed thorough high-resolution analysis of spectra by Horn (THRASH) for high-resolution FT data. This is a fully automated algorithm that uses the fictitious amino acid averagine (Senko and others, 1995) to approximate the theoretical distribution using only the mass of an observed isotopic cluster; averagine is defined as an average amino acid (C4.9384 H7.7583 N1.3577 O1.4773 S0.0417). That is, the number of averagine units can be estimated from the observed mass, and subsequently, an estimated theoretical isotopic distribution is obtained. Isolated noise peaks are then distinguished from true data by the lack of an isotopic distribution of peaks.
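A rough sketch of the averagine idea follows, using the Poisson approximation described above driven only by the estimated carbon count. The averagine constants are from Senko's model; restricting attention to 13C (ignoring H/N/O/S isotopes) is a deliberate simplification of the full template calculation.

```python
import math

# Approximate relative isotopic-cluster peak heights for a peptide of a
# given mass: estimate the carbon count via averagine, then use a Poisson
# distribution on the number of 13C atoms.
AVERAGINE_MASS = 111.1254      # average mass of the "average" amino acid
CARBONS_PER_AVERAGINE = 4.9384
P_C13 = 0.011                  # approximate natural abundance of 13C

def isotope_template(mass, n_peaks=5):
    n_carbons = mass / AVERAGINE_MASS * CARBONS_PER_AVERAGINE
    lam = n_carbons * P_C13    # expected number of 13C atoms per molecule
    weights = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(n_peaks)]
    total = sum(weights)
    return [w / total for w in weights]

# For a 1426 Da peptide the monoisotopic peak is still the tallest ...
print(isotope_template(1426.0))
# ... but by ~10 kDa the cluster has shifted and the monoisotopic peak
# is no longer the most abundant.
print(isotope_template(10000.0, n_peaks=10))
```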
Distinguishing between real and noise peaks is more problematic for low-resolution data where isotopic clusters are not resolved. For data that do not have resolved isotopic patterns, Morris and others (2005) proposed using a mean spectrum for peak detection, where the mean spectrum is simply the average of multiple spectra. The underlying assumption is that a peak that denotes a molecular feature should stand out above the noise as well as above the background, and these properties should be preserved in the mean spectrum. Morris and colleagues show that the mean spectrum has reduced noise, can find peaks that are present in only a few samples, and can find low-abundance peaks consistently. Furthermore, the mean spectrum eliminates the difficult task of matching peaks selected on individual spectra (defined as the alignment problem—discussed further in Section 5).
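A toy numerical illustration of why averaging helps is given below. The thresholding rule is a simple median-plus-MAD stand-in, not the actual procedure of Morris and others (2005), and all peak locations and amplitudes are invented.

```python
import numpy as np

# 20 spectra on a shared m/z grid: a common peak at 703.2 and a
# low-abundance peak at 707.8 that appears in only 5 of the 20 spectra.
rng = np.random.default_rng(1)
grid = np.linspace(700.0, 710.0, 2000)

spectra = []
for i in range(20):
    s = rng.normal(0.0, 0.5, grid.size) ** 2                  # noise floor
    s += 10.0 * np.exp(-0.5 * ((grid - 703.2) / 0.01) ** 2)   # common peak
    if i < 5:                                                 # rare peak
        s += 5.0 * np.exp(-0.5 * ((grid - 707.8) / 0.01) ** 2)
    spectra.append(s)
mean_spec = np.mean(spectra, axis=0)

# Robust background + threshold on the mean spectrum.
background = np.median(mean_spec)
mad = np.median(np.abs(mean_spec - background))
detected = grid[mean_spec > background + 8.0 * mad]

# Both peaks survive in the mean spectrum, including the one present in
# only a quarter of the individual spectra.
print(detected.min(), detected.max())
```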
LC separates molecules temporally into multiple subfractions, and the resulting data consist of a series of spectra that are obtained serially over time. More specifically, molecules continually exit the LC column according to their propensity for mixing with water (hydrophobicity) and are subsequently charged via an ionization method and subjected to mass analysis. For example, if the output is sampled once per second (the average cycle time of many MS instruments) across 60 min, then 3600 spectra will be obtained for a given fractionated sample. Depending primarily on the abundance and amino acid sequence of a molecule, the molecule might be present in few or many adjacent elution-time spectra. Figure 4(a) provides an example of data from a LC–MS experiment and displays the elution profile of isotopic clusters. Figure 4(b) provides an expanded view of one of the isotopic clusters in Figure 4(a), which reveals the resolved isotopic distribution of the molecule in addition to the elution profile. With respect to LC data, a recent peak-detection approach proposed by Yu and others (2008) utilizes the fact that in a chromatographic elution profile, a noise peak should not repeat itself in adjacent retention-time spectra, whereas a peak associated with a molecule is more likely to be seen at the same mass in at least 2 adjacent elution-time spectra.
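The core of this adjacent-scan rule can be sketched in a few lines; the m/z lists and the 0.01 tolerance below are invented for illustration, and the real method of Yu and others (2008) involves considerably more machinery.

```python
# Keep a candidate peak only when a peak at (nearly) the same m/z appears
# in at least one neighboring elution-time scan; isolated peaks are
# treated as noise.
def persistent_peaks(scans, tol=0.01):
    """scans: list of lists of m/z values, one inner list per retention-time scan."""
    kept = []
    for i, scan in enumerate(scans):
        neighbors = []
        if i > 0:
            neighbors += scans[i - 1]
        if i + 1 < len(scans):
            neighbors += scans[i + 1]
        for mz in scan:
            if any(abs(mz - other) <= tol for other in neighbors):
                kept.append((i, mz))
    return kept

scans = [[500.25, 612.80],          # 612.80 appears only once: likely noise
         [500.251, 731.40],
         [500.249, 731.401]]
print(persistent_peaks(scans))      # 612.80 is dropped; the rest persist
```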
(a) An example of an LC–MS experiment displaying the 3 dimensions: retention time, m/z, and abundance. The peak height for each m/z and retention-time pair denotes the abundance of the corresponding ion. (b) An expanded view of a single isotopic ...
In summary, careful peak detection is important because alignment, identification algorithms, and downstream statistical analyses all rely on accurate peak locations (m/z-values) and abundance values.
Statistical analysis of mass-spectrometry data can be approached either before (using unique mass values) or after peptide/protein identification; identification is discussed in Section 6. Often it is of interest to analyze all potential molecules derived from a peak-detection routine regardless of whether the molecules can be identified, and thus, unique mass values are used to classify molecules across spectra. Although a mass-calibration equation is applied, the equation is not perfect, and thus, there will be inherent variation in mass values across spectra for the same molecule. Assuming that a set of peak locations has been extracted from the spectra, Tibshirani and others (2004) proposed the implementation of complete-linkage clustering for matching molecules (mass values) across spectra. Alternatively, Morris and others (2005) neatly avoid the problem of matching peaks across spectra by using the mean spectrum to define unique masses.
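For 1-dimensional mass values, complete-linkage clustering with a merge tolerance can be sketched directly. This is a simplified stand-in for the procedure of Tibshirani and others (2004); the masses and the tolerance are invented.

```python
# Complete-linkage agglomerative clustering of sorted 1-D mass values:
# repeatedly merge the adjacent pair of clusters whose *farthest* members
# are closest, stopping when no merge stays within the tolerance.
def complete_linkage_1d(masses, tol):
    clusters = [[m] for m in sorted(masses)]
    while len(clusters) > 1:
        # complete-linkage distance between adjacent sorted clusters
        dists = [max(b) - min(a) for a, b in zip(clusters, clusters[1:])]
        i = min(range(len(dists)), key=dists.__getitem__)
        if dists[i] > tol:
            break
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# The "same" molecule's mass jitters slightly across spectra.
masses = [713.401, 713.403, 713.402, 715.210, 715.208, 720.005]
print(complete_linkage_1d(masses, tol=0.01))  # 3 clusters = 3 unique masses
```

Sorting first means only adjacent clusters ever need to be compared, which is a convenient property of 1-D complete linkage.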
In order to simplify a complex biological sample, LC is often utilized in the divide-and-conquer strategy to further separate molecules temporally into multiple subfractions. The elution time for a particular molecule is a property of the hydrophobicity of the molecule and thus should be similar across spectra. Even so, LC also has variation associated with it, and thus, the elution time of a molecule will vary across fractions within the same biological sample as well as across biological samples. With respect to classifying molecules across spectra, the elution time in addition to the mass value can be utilized to classify molecules more accurately, and thus, an alignment algorithm is typically applied to align masses across the elution profile.
Many retention-time alignment procedures apply techniques that delete or add data points (Wong and others, 2005), and many procedures align all spectra to a single reference spectrum in an ad hoc fashion (Wong and others, 2005), (Zhang and others, 2005), (Nielsen and others, 1998), (Johnson and others, 2003). With respect to gene expression microarrays, it is well understood that making use of all data is superior to standardizing to a single reference template (Bolstad and others, 2003). Taking a similar view, Wang and others (2007) proposed a method called peptide element alignment that simultaneously aligns features from multiple spectra instead of aligning spectra to an arbitrarily chosen template. More recently, Yu and others (2008) proposed an iterative feedback strategy between peak detection and retention-time alignment.
Returning to the topic of peak matching, when incorporating LC into the experimental process, molecules are binned together by defining a small mass threshold and a small retention-time threshold. For example, all masses within 5 parts-per-million of one another that elute within 10 seconds of one another are classified as a single unique molecule. The precision with which the mass and retention-time thresholds can be set, and hence the ability to classify molecules correctly, depends on obtaining accurate m/z-values as well as precise retention-time values.
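A greedy sketch of such binning follows; seeding each bin with the first unassigned feature is one of several reasonable conventions, and the feature values are invented. The 5 ppm and 10 s thresholds match the example above.

```python
# Group (mass, retention-time) features: a feature joins a bin when it is
# within ppm_tol of the bin seed's mass and rt_tol seconds of its elution.
def bin_features(features, ppm_tol=5.0, rt_tol=10.0):
    """features: list of (mass, retention_time_seconds) pairs."""
    bins = []
    for mass, rt in sorted(features):
        for b in bins:
            m0, rt0 = b[0]  # bin seed
            if abs(mass - m0) / m0 * 1e6 <= ppm_tol and abs(rt - rt0) <= rt_tol:
                b.append((mass, rt))
                break
        else:
            bins.append([(mass, rt)])
    return bins

features = [(1426.720, 1805.0), (1426.723, 1809.0),  # same molecule
            (1426.721, 1950.0),                      # same mass, later elution
            (1500.803, 1805.5)]
print(len(bin_features(features)))  # 3 unique molecules
```

Note how the mass-only view would merge the first three features; adding the retention-time dimension correctly separates the later-eluting species.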
A peptide's observed mass, even a highly accurate mass, is typically not sufficient to match to a single protein in a database. As a result, MS/MS is commonly utilized for identification (Figure 5). In MS/MS, a survey scan in a high mass–accuracy instrument (e.g. ICR or orbitrap) is used to determine a parent ion mass. The mass spectrometer subsequently isolates and then fragments the parent ion and produces corresponding product ion spectra (MS/MS spectra). Collision-induced dissociation (CID) is most commonly used to fragment ions in order to generate characteristic fragments in the MS/MS experiments. The MS/MS product ion spectra consist of a ladder of fragment ions differing in mass by the masses of individual amino acids, and the amino acid sequence is deduced from the spacing of these product ions. The parent ion mass and the sequence information obtained from the daughter ions are then utilized to interrogate a protein database so as to assign the best protein match. This combination of parent ion mass and sequence information provides a more accurate protein identification than the mass value alone (Hu and others, 2005b). Thousands of these MS/MS spectra are acquired during any mass-spectrometry run.
(a) The base peak ion chromatogram for a whole yeast trypsin digest and the extracted ion chromatogram of a single mass (735.92 Da). (b) A survey scan at the retention time of the ion with mass 735 Da. This mass (735 Da) is then isolated and selected ...
There is a broad array of software tools for the identification of peptides/proteins from the MS/MS spectra generated from the peptide parent ion. All utilize information on the precursor mass of the peptide from the first MS and the fragment ion spectra from the second MS, which contain information about the amino acid sequence of the peptide. All search engines will return identifications indicating the most likely protein of origin; however, as stated nicely by Johnson and others (2005), “…only the credulous are willing to accept them carte blanche.” Therefore, assigning a probability score to each identification is important. The ProteinProphet and PeptideProphet algorithms have been widely utilized to assign identification scores, as has the molecular weight search (MOWSE) score utilized in Mascot (Keller and others, 2002), (Nesvizhskii and others, 2003), (Pappin and others, 1993). More recently, false-positive identification rates have been determined by searching a reverse or randomized protein sequence database or both (Peng and others, 2003), (Cargile and others, 2004), (Huttlin and others, 2007). The number of hits emanating from one of these “nonsense” databases yields an estimate of the false discovery identification rate in the authentic sequence database. Although many refer to the false discovery identification rate as simply a false discovery rate, Fitzgibbon and others (2008) recently suggested that false identification rate be used instead to distinguish between its use as applied to evaluating peptide/protein sequence confidence versus the more common use to evaluate differential expression in gene expression microarrays. Regardless of the method utilized, only a small fraction of the collected MS/MS spectra are identified. This low rate of identification can likely be attributed to the presence of posttranslational modifications that are not catalogued in the database.
Automatic addition of common modifications to the search is possible but greatly increases the computer time, and no more than 3–5 minor changes are usually allowed. The use and constraints of algorithms utilized to identify proteins from mass-spectrometry data have been well reviewed (Yates, 1998), (Baldwin, 2004), (Rappsilber and Mann, 2002), (Sadygov and others, 2004). We also refer the reader to the 2008 Journal of Proteome Research special issue that is focused on statistical and computational proteomics and features numerous contemporary articles on identification (Volume 7, Issue 1).
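The decoy-based estimate described above is, at its core, a simple ratio: every hit in the reversed database above a score threshold is taken as evidence of one false hit among the target identifications. A sketch with invented search-engine scores:

```python
# Estimate the false identification rate at a score threshold from a
# target-decoy search. Scores are invented for illustration.
def fir_at_threshold(target_scores, decoy_scores, threshold):
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0

target_scores = [88, 75, 64, 52, 47, 33, 21]  # hits in the authentic database
decoy_scores = [41, 28, 19]                   # hits in the reversed database
print(fir_at_threshold(target_scores, decoy_scores, 40))  # 0.2
```

In practice the threshold is swept so that the reported identification list achieves a prespecified false identification rate (e.g. 1%).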
Protein mass spectrometry using human samples (plasma, tissue, etc.) is a promising technique being utilized for biomarker studies focusing on early detection and understanding prognosis and progression; however, there are important challenges that are hindering the full potential of the technology. Specifically, computational and statistical tools for capturing, interpreting, and quantifying mass-spectrometry data are deficient (Srivastava and Srivastava, 2005). The sheer dynamic range of proteins in human plasma and the need to detect the lower end of this dynamic range to capture proteins that are potentially the most interesting require improved analytic tools (Rifai and others, 2006), (Jacobs and others, 2005). The discipline of statistics offers robust methodologies to extract and interpret high-dimensional proteomics data and has been and will continue to be a valuable contributor to the field of mass spectrometry. Below, we discuss areas where statistics has contributed to the field and suggest areas that could still benefit from statistical input.
Experimental design is important in obtaining valid estimates of relative abundance that are not influenced by ancillary sources of variation. Despite its importance, however, minimal guidance exists in the mass-spectrometry literature with respect to optimal designs.
Due to the time constraints associated with complex sample processing, sample sizes are often limited, and therefore, studies lack statistical power to detect meaningful differences. For example, consider a mass-spectrometry experiment that consists of 50 disease/control patient pairs; this is a typical size for studies that attempt to detect a difference between groups using conventional laboratory tests. A typical protocol might separate each biological sample into 30 fractions (e.g. using SCX), and each of the fractions might be further separated by a 60-min reverse-phase LC separation that is sampled once per second, resulting in 108 000 spectra per biological sample. In total, this currently represents approximately half a year of instrument time with no allowance for maintenance. Clearly, the cost and time of running these experiments demand efficient allocation of resources across biological samples and fractionation as well as efficient analysis methods to extract the maximum amount of information possible from any given experiment.
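The instrument-time arithmetic above is easy to verify; the sketch below simply restates the numbers from the text (50 pairs, 30 fractions, 60-min runs sampled once per second):

```python
# Restating the design in the text: 50 disease/control pairs, 30 SCX
# fractions per biological sample, 60-min reverse-phase LC runs sampled
# once per second.
samples = 50 * 2                               # biological samples
fractions = 30
run_min = 60
spectra_per_sample = fractions * run_min * 60  # one spectrum per second
total_days = samples * fractions * run_min / (60 * 24)

print(spectra_per_sample)  # 108000 spectra per biological sample
print(total_days)          # 125.0 days of continuous acquisition
```

The 125 days of uninterrupted acquisition becomes roughly the half-year figure in the text once calibration, maintenance, and realistic duty cycles are factored in.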
In general, a mass-spectrometry experiment may allocate resources over 4 dimensions: (i) disease groups, (ii) biological replicates, (iii) fractionation, and (iv) technical replicates. Technical replicates are analyzed to increase proteome coverage for low-abundance proteins that exist near the level of detection of the mass analyzer (Liu and others, 2004). As with most studies, biological variability between subjects within the same disease group is typically the largest source of variability (Hunt and others, 2005) and arises from both systematic variability and phenotypic variation. Fraction-to-fraction variability includes variation due to chemical properties of a molecule as well as variation introduced during the fractionation process, for example, differences in total protein concentration across fractions. Differences introduced by replicate analyses of the same fraction usually account for the smallest component of the variability. The necessary trade-off between increased sample-run times associated with fractionating a sample into multiple subfractions and running technical replicates to increase proteome coverage in comparison to the number of biological samples has yet to be adequately addressed for mass-spectrometry technologies.
In order to evaluate biological replication, fractionation, technical replication variability, and the potential gains from expansion of effort at each level, we propose merging experiments to evaluate variation with current biomarker protocols. By doing so, one could consider variations in the sample protocol that utilize subsampling from the entire collection of fractions (e.g. mass analyze every other fraction) as well as creation of fewer coarsened fractions; to implement coarsening, adjacent fractions would be pooled together and then mass analyzed, resulting in more complex fractions. Subsampling is attractive as it maintains the same level of reduced sample complexity that full fractionation affords; however, by mass analyzing only some of the fractions, the entire proteome for the sample will not be interrogated. Alternatively, the creation of coarsened fractions is attractive as it interrogates the entire proteome with a reduction of instrument time; however, the coarsened fractions will be more complex. The primary motivation for investigating subsampling and coarsening is to reduce instrument time. A reduction in instrument time allows additional biological samples to be evaluated and thus increased statistical power to detect differences across groups of interest. When merging the aforementioned investigations into current biomarker protocols, variance estimates can be derived using variance component models and compared with those from the analysis of the full set of fractions. For example, let y_{gimr} denote the observed abundance for the rth technical replicate nested within the mth fraction nested within the ith biological subject nested within the gth disease group. A variance component model can be defined as

y_{gimr} = μ_g + α_{i(g)} + β_{m(ig)} + δ_{r(mig)},

where μ_g denotes the mean of the gth level of disease group, α_{i(g)} ~ N(0, σ_B²), β_{m(ig)} ~ N(0, σ_F²), and δ_{r(mig)} ~ N(0, σ_R²); σ_B² is the across-biological-sample variability, σ_F² is the across-fraction variability, and σ_R² is the across-replicate variability. In addition to comparing variance estimates, coverage of the proteome (i.e., detection of peptides and proteins) from the corresponding analyses can be compared with the full analysis. Ultimately, these types of empirical studies will allow evaluation of the trade-off between proteome coverage and efficiency of quantification for biomarker studies.
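A sketch of how the variance components σ_B², σ_F², and σ_R² could be estimated from such a design is given below, using simulated data with assumed (hypothetical) design sizes and method-of-moments estimators based on the expected mean squares of the balanced nested ANOVA:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical balanced design: 2 groups, 20 subjects per group,
# 5 fractions per subject, 3 technical replicates per fraction.
G, I, M, R = 2, 20, 5, 3
mu = np.array([10.0, 10.5])          # group means
sd_B, sd_F, sd_R = 1.0, 0.5, 0.25    # true component SDs (assumed)

# Simulate y_{gimr} = mu_g + alpha_{i(g)} + beta_{m(ig)} + delta_{r(mig)}
alpha = rng.normal(0, sd_B, (G, I))
beta = rng.normal(0, sd_F, (G, I, M))
delta = rng.normal(0, sd_R, (G, I, M, R))
y = mu[:, None, None, None] + alpha[..., None, None] + beta[..., None] + delta

# Nested means at each level of the hierarchy
fbar = y.mean(axis=3)          # fraction means
sbar = fbar.mean(axis=2)       # subject means
gbar = sbar.mean(axis=1)       # group means

# Mean squares for the balanced fully nested design
ms_rep = ((y - fbar[..., None]) ** 2).sum() / (G * I * M * (R - 1))
ms_frac = R * ((fbar - sbar[..., None]) ** 2).sum() / (G * I * (M - 1))
ms_subj = M * R * ((sbar - gbar[:, None]) ** 2).sum() / (G * (I - 1))

# Method-of-moments variance component estimates
sigma2_R = ms_rep                          # across-replicate variability
sigma2_F = (ms_frac - ms_rep) / R          # across-fraction variability
sigma2_B = (ms_subj - ms_frac) / (M * R)   # across-biological-sample variability
print(sigma2_B, sigma2_F, sigma2_R)
```

In practice the data would be unbalanced and incomplete, so restricted maximum likelihood rather than these closed-form moment estimators would typically be used; the sketch only illustrates how the three components separate.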
In addition to determining efficient allocation of resources across biological samples and fractionation in order to extract the maximum amount of information possible from any given experiment, more classical experimental design considerations are also important for mass-spectrometry experiments. Hu and others (2005a) provide 4 examples illustrating the importance of randomization, replication, and blocking in mass-spectrometry experiments. Their discussion focused on singly labeled experiments; however, analogous to 2-channel spotted microarray technology, there are also labeled techniques for mass spectrometry. Using isotopic labeling, samples are labeled using stable isotopes and subsequently mixed together and subjected to mass analysis (Mueller and others, 2008). Both reverse labeling (Yao and others, 2001), (Wang and others, 2001), better known as a dye swap in the microarray literature, and one-way labeling (Johnson and Muddiman, 2004), where each group is labeled with a different isotope, have been proposed for isotopic labeled experiments. In the statistical community, it is well-known that there are more efficient designs than reverse labeling (e.g. balanced block design or balanced incomplete block design). Furthermore, statisticians recognize that one-way labeling results in group effects being confounded with labeling. The statistical community can therefore add valuable contributions to the field of mass spectrometry, particularly with respect to optimizing experimental designs. Likewise, the statistical community can further contribute to peak detection and quantification with respect to stable-isotope labeled data (Eckel-Passow and others, 2006), (Mason and others, 2007).
The objective of global proteomics via mass spectrometry is to detect and quantify all proteins present in a biological sample; however, characterizing the plasma proteome is a challenging problem. Review papers by Diamandis (2004) and Anderson and Anderson (2002) state that the plasma proteome spans a wide dynamic range (>1010); however, current mass-spectrometry technologies can only achieve a dynamic range of approximately 103–104. Summarizing mass-spectrometry data is also complicated by the fact that low-abundance proteins are often of greatest interest as potential biomarkers. The fact that the detection dynamic range of global proteomic technologies is as many as 6 orders of magnitude less than the abundance dynamic range of complex samples results in a large subset of low-abundance proteins existing near the limit-of-detection threshold of the technology and leads to missing (incomplete) observations (Grove and others, 2006). Even with the divide-and-conquer strategy discussed in Section 1, the complexity of the human proteome is such that numerous proteins will still be at the limit of detection of the mass analyzer. Our group and others (Liu and others, 2004) have noted that the abundance profile of the proteome is approximately geometric. One consequence of this is that, no matter the detection limit of the instrument, approximately half of the detectable proteins will be at the lower limit of detection. In fact, Liu and colleagues found that approximately 25% of all detected proteins across 9 technical replicates were detected in only one of the replicates and approximately 40% in all 9 replicates, suggesting that low-abundance peptides, which are often of most interest for biomarker discovery purposes, will often have incomplete observations.
Currently, incomplete mass-spectrometry data are typically deleted from further analyses (Callister and others, 2006) or replaced with missing values or a fixed constant (Gillette and others, 2005). Subsequently, incorrect statistical methods are applied that do not account for the fact that the data are thresholded. Deleting incomplete observations leads to severe bias in results (Berkson and Gage, 1950), and treating the thresholded data as missing values leads to left-truncated data and results in severe estimation bias (Turnbull, 1976). Furthermore, replacing the thresholded data with a fixed constant is statistically inefficient, especially when the threshold is known to vary. Wang and others (2006) recognized the importance of accurately accounting for incomplete observations and proposed a probability model for abundance-dependent incomplete data noting that ignoring nonrandom incomplete data may introduce significant bias into subsequent analyses. They considered not only peaks that were incomplete due to the limited dynamic range of the mass analyzer but also recognized that even a detected signal may be difficult to distinguish from background noise. Our group is actively researching whether censoring methodologies might be an alternative solution that could result in improved abundance estimates.
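To illustrate why censoring methodology matters, the following sketch (synthetic data; the normal model, detection limit, and sample size are assumptions) compares dropping censored values and substituting the detection limit against a censored maximum-likelihood (Tobit-style) estimate of the mean abundance:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)

# Hypothetical setup: true log-abundance is N(5, 1); values below an
# assumed limit of detection (LOD) of 4.5 are not observed.
n, mu_true, sd_true, lod = 2000, 5.0, 1.0, 4.5
x = rng.normal(mu_true, sd_true, n)
obs = x[x >= lod]          # detected values
n_cens = n - obs.size      # number censored below the LOD

def negloglik(theta):
    """Censored-normal negative log-likelihood: density for detected
    values plus the cumulative probability below the LOD for each
    censored value."""
    mu, log_sd = theta
    sd = np.exp(log_sd)
    ll = stats.norm.logpdf(obs, mu, sd).sum()
    ll += n_cens * stats.norm.logcdf(lod, mu, sd)
    return -ll

res = optimize.minimize(negloglik, x0=[obs.mean(), 0.0], method="Nelder-Mead")
mu_mle = res.x[0]

drop_mean = obs.mean()                                       # delete censored
sub_mean = np.concatenate([obs, np.full(n_cens, lod)]).mean()  # substitute LOD
print("drop censored:", drop_mean)   # biased upward
print("substitute LOD:", sub_mean)   # still biased upward
print("censored MLE:", mu_mle)       # approximately unbiased
```

Both ad hoc approaches overestimate the mean because the missing values are exactly the low ones; the censored likelihood recovers the truth by modeling the thresholding mechanism directly, which is the spirit of the censoring methodologies under investigation.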
Differences in sample collection, sample characteristics (cellular concentration), variations in sample processing, and the experimental process may lead to unknown baseline abundances for each sample. Systematic biases can severely compromise biological results, and thus, it is important to eliminate these biases via abundance normalization before applying differential analysis methodologies. As described in Section 7.2, analysis of mass-spectrometry data is further complicated by the fact that the proteins of most interest will have incomplete observations as a result of the limit-of-detection threshold of the technology.
Callister and others (2006) applied and evaluated an assortment of normalization routines that were originally developed for gene expression microarray data on FT-ICR mass-spectrometry data. Although all the techniques (central tendency, linear regression, locally weighted regression, and quantile) removed a portion of the systematic bias, the linear regression normalization routines (equivalently, analysis of variance [ANOVA] models) performed the best. Even so, a lack of definitive trend among the normalization routines suggests the need for further investigation. Furthermore, Callister and colleagues only utilized peptides that were detectable in every mass-spectrometry analysis for normalization and thus ignored low-abundance peptides that exist at the limit of detection and are thought to be of most interest with respect to biomarker discovery. When ignoring incomplete data and only using complete data for normalization, Wang and others (2006) showed that the normalization parameters are severely biased. In response, Oberg and others (2008) discussed the use of an ANOVA model for incomplete data by considering peptide as an incomplete block factor. They showed that the observed base-2 logarithm transformed abundance can be decomposed as

log2(abundance) = experimental + peptide + group + error. (7.1)
“Experimental” denotes effects due to loading, mixing, and other sample handling effects; features of the experiment that would not exist in a world of perfectly reproducible hardware, procedures, and subjects. “Peptide” accounts for variability across peptides; the experimental term is biased toward zero if the peptide term is not included in the ANOVA model. “Group” denotes the effect of actual interest.
With respect to abundance normalization, the experimental component in (7.1) is meant to account for the variability in mass-spectrometry data due to experimental effects, which in an ideal world would not exist. For labeled data where multiple samples are mixed together and mass analyzed simultaneously, experimental = v_{q,l} + b_q + t_l, where v_{q,l} denotes the experimental effects of loading, mixing, and other sample handling; b_q denotes the effect due to the qth mass analysis run; and t_l denotes the effect due to the lth label. If a given patient sample is used only once in an experiment, then v_{q,l} could be considered as the patient effect since all later effects of sample handling are indistinguishable from baseline differences in the patient's total protein. It is usually assumed that the experimental effects are global effects, that is, they are expected to be the same for all peptides and are estimated using all the available data. As is also the case for gene expression microarray data, estimating global effects in mass-spectrometry data is a difficult task since there are usually thousands of identified peptides (unique mass values). Oberg and others (2008) discussed the use of subsetting, stagewise regression, and iterative regression techniques to confront this computationally challenging problem.
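As a small illustration of ANOVA-based normalization with global effects, the following sketch (simulated complete data with hypothetical run and peptide effects; real data would additionally require the incomplete-block treatment discussed above) estimates run effects jointly with peptide effects by least squares and then subtracts them:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 40 peptides measured across 4 mass analysis runs,
# with systematic run-level biases on the log2 scale.
n_pep, n_run = 40, 4
run_eff = np.array([0.0, 0.8, -0.5, 0.3])   # true run biases (run 0 = reference)
pep_eff = rng.normal(8, 2, n_pep)           # peptide-specific abundance levels
y = pep_eff[:, None] + run_eff[None, :] + rng.normal(0, 0.2, (n_pep, n_run))

# Two-way ANOVA by least squares: y_pq = mu + peptide_p + run_q + e,
# with reference-cell dummy coding (first peptide and first run dropped).
rows, obs = [], []
for p in range(n_pep):
    for q in range(n_run):
        x = np.zeros(1 + (n_pep - 1) + (n_run - 1))
        x[0] = 1.0                     # intercept
        if p > 0:
            x[p] = 1.0                 # peptide dummy
        if q > 0:
            x[n_pep + q - 1] = 1.0     # run dummy
        rows.append(x)
        obs.append(y[p, q])
X, yv = np.array(rows), np.array(obs)
beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
run_hat = np.concatenate([[0.0], beta[n_pep:]])   # run effects vs run 0

# Normalize by removing the estimated global run effect from each column
y_norm = y - run_hat[None, :]
print(run_hat)
```

Including the peptide term in the model is what keeps the run-effect estimates unbiased, mirroring the point in Section 7.3 that omitting the peptide term biases the experimental term toward zero; with thousands of peptides the design matrix becomes large, motivating the subsetting and stagewise strategies cited above.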
Characterization of the proteome is a resource of tremendous potential to inform biological research. Mass spectrometry is an attractive technology to aid in this crucial task as it has the promise to provide a noninvasive screening mechanism on easily accessible fluids. However, improved computational and statistical tools are necessary in order for mass-spectrometry techniques to reach their full potential. Although there is a large body of ongoing work with respect to improving the range and capacity of mass-spectrometry technologies, there has been less attention to the equally important problem of optimal use of and inference from the data. Here, we described the rationale for the divide-and-conquer strategy, characteristics of mass-spectrometry data, and processing steps that are required in order to obtain m/z and abundance data. We hope that readers will identify areas that can be improved either by incorporating classical statistical methodologies or by developing novel methods. Further statistical research is necessary to evaluate optimal allocation of resources to balance the trade-off between biological replication and proteome coverage and to benefit peak detection, alignment, and abundance normalization in the presence of incomplete observations. We would encourage more statisticians to work on mass-spectrometry methods with a view to realizing their potential to inform biological research and discover disease markers for early detection and for prognosis and progression of disease.
National Cancer Institute (R25 CA92049); Fraternal Order of Eagles Cancer Fund; David Woods Kemper Memorial Foundation; a generous gift from Gordon and Elizabeth Gilroy.
We are thankful to Dr David Muddiman for reviewing previous versions of this manuscript and Douglas Mahoney for his assistance in creating figures. Conflict of Interest: None declared.