|Home | About | Journals | Submit | Contact Us | Français|
Peak detection is a pivotal first step in biomarker discovery from mass spectrometry (MS) data and can significantly influence the results of downstream data analysis steps. We developed a novel automatic peak detection method for prOTOF MS data which does not require a priori knowledge of protein masses. Random noise is removed by an undecimated wavelet transform and chemical noise is attenuated by an adaptive short-time discrete Fourier transform. Isotopic peaks corresponding to a single protein are combined by extracting an envelope over them. Depending on the signal-to-noise ratio (SNR), the desired peaks in each individual spectrum are detected and those with the highest intensity among their peak clusters are recorded. The common peaks among all the spectra are identified by choosing an appropriate cut-off threshold in the complete linkage hierarchical clustering. To remove the 1Da shifting of the peaks, the peak corresponding to the same protein is determined as the detected peak with the largest number among its neighborhood. We validated this method using a dataset of serial peptide and protein calibration standards. Compared with MoverZ program, our new method detects more peaks and significantly enhances SNR of the peak after the chemical noise removal. We then successfully applied this method to a dataset from prOTOF MS spectra of albumin and albumin-bound proteins from serum samples of 59 patients with carotid artery disease to detect peaks with SNR ≥2. Our method is easily implemented and is highly effective to define peaks that will be used for disease classification or to highlight potential biomarkers.
Mass spectrometry (MS) is increasingly used to detect disease-related biomarkers from human plasma or serum for early diagnosis, prognosis and monitoring of disease progression or response to treatment. Peak detection is a critical early analytical step that will directly influence the identification of potential biomarkers from the MS spectra. Due to the different properties of spectra data from different types of MS instruments, consistency in peak detection methods are essential for data analysis.
Matrix assisted laser desorption ionization-time of flight (MALDI-TOF) is one of the most popular MS approaches for detecting proteins or peptides. The prOTOF 2000 (PerkinElmer), an orthogonal MALDI-TOF MS, has high mass accuracy and resolution over a wide mass range, which overcomes many key performance barriers in traditional MALDI-TOFs. However, prOTOF MS spectra show a unique noise pattern in the background. In addition to the random noise normally observed in other high resolution mass spectra, there are low frequency noise peaks at 1Da intervals [Figure 1(a)]. The exact source of this noise is unknown but it is hypothesized to originate from the matrix since the matrix is also ionized during the data acquisition process . Removal of this “chemical noise” may significantly improve the signal to noise ratio (SNR) in prOTOF MS spectra.
Several methods for peak detection in low resolution MALDI-TOF data have been described [2-5]. However for high resolution data, there are fewer published methods and each is specific to the MS platform used [6-9]. These methods are all based on the a priori knowledge of the masses of the peptides being queried and are applied to specifically match the known distributions of isotopic peptides in a spectrum. Yu W. et al  applied a Gaussian filter for smoothing and a Gabor quadrature filter to extract the envelope signal of each peptide. The main benefit of this method is that it combines all the peaks corresponding to one peptide together. However, the threshold for defining a peak was empiric and did not consider the “chemical noise” which may mask some small peaks. Jurgen K. et al  removed the “chemical noise” by Fourier transform with a fixed window size. This method was applied to the electrospray quadrupole TOF data with a maximum m/z of 2000 Da. However, Fourier transformation with a fixed window size cannot address high-frequency noise at higher m/z values in the operating range (>10 kDa) of the prOTOF 2000.
Peak alignment is the next step in MS data processing. Several methods have been applied to lower resolution MS data, but these are not applicable to the higher resolution prOTOF data [5, 12]. Johnson K. J. et al  applied a closest point matching method in peak alignment. This method is limited by false-positive peak detection. Tibshirani R. et al used complete linkage hierarchical clustering that depends on threshold to cut the hierarchical tree . Yu W. et al collected all the possible peaks into a super set to find the common peak set among all peak sets as a standard and then aligned all peak sets to the standard using the robust point matching approach in a sequential manner .
Based on these previous analytic approaches, we applied an adaptive, short-time, discrete Fourier transform combined with a wavelet transform to remove “chemical noise” and the higher frequency random noise from the prOTOF MS data. All of the single peaks in each spectrum were detected and those peaks corresponding to the same isotopic cluster were combined into an envelope. The desired peaks were then identified based on the SNR threshold. Complete linkage hierarchical clustering was used for peak alignment along all the spectra. The method was validated with a dataset with a series of peptide and protein calibration standards. Compared to the well known software MoverZ (Genomic Solutions) , it can detect small peaks with high signal to noise ratios. With the chemical noise removal, the peak SNR is greatly enhanced. The proposed peak detection approach (PDA) with chemical noise removal using short-time FFT (stFFT) for prOTOF MS data is abbreviated as PDA_stFFT. Finally, we applied the method to a dataset from patients with symptomatic carotid artery atherosclerotic cardiovascular disease to detect peaks with SNR≥2. This paper focuses on the development and validation of PDA_stFFT method for peak detection. The biomarker identification and classification on the patients with carotid artery disease is the subject of a separate communication.
Serum was obtained for proteomic analysis from a cohort of patients undergoing endarterectomy (EA) for occluded carotid arteries. The study was approved by institutional review board of the National Naval Medical Center, Bethesda, MD. Serum aliquots were stored at −70° C until used for this analysis. Patients from these cohorts were divided into four groups: symptomatic EA group (EAS, N=12), asymptomatic EA group (EAA, N=11), risk-factor matched control group (RFC, N=18), and control group (C, N=18). The EAS group was composed of patients who underwent EA surgery with evidence of prior clinical atherosclerotic plaque rupture in the form of a stroke or transient ischemic attack. The EAA group was composed of patients undergoing EA surgery with no clinical evidence of previous plaque rupture. The RFC group included patients without any evidence of atherosclerosis, but with risk factors, such as high cholesterol, smoking history, diabetes and/or hypertension. The C group was patients without any evidence or risk factors for atherosclerosis.
The samples were processed by ProXpression Biomarker Enrichment kit (PerkinElmer) according to the manufacturer's protocol . Albumin and albumin-bound proteins were captured by Cibachron blue dye. After binding and washing, the enriched proteins were eluted, concentrated and applied in triplicate directly on MALDI chip targets (PerkinElmer). Mass spectra were acquired with a prOTOF™ 2000 MALDI-O-TOF MS. Mass range for all spectra was set to 700-10000 Da. The raw data for all spectra were exported in two-column text format for further data processing.
To validate the peak detection method, we prepared a mixture of peptide and protein standards with protein calibration standard I and peptide calibration standard II (Bruker Daltonics). The protein or peptide names with their m/z values are listed in Table 1. Equal volumes of 10pmol/μl of both standards in dilution buffer (1% TFA in 50% aqueous acetonitrile) were mixed together to prepare stock solution of 5pmol/μl peptide and protein standards. The stock solution is then diluted 10x with dilution buffer to 500fmol/μl as starting solution which was further diluted in 3-fold with dilution buffer to prepare a series of 8 diluted standard solutions of 500, 170, 56, 19, 6, 2.1, 0.7 and 0.23fmol/μl. Mass spectra were acquired with PerkinElmer's prOTOF 2000 within m/z range of 1000 to 20000 Da. The instrument was calibrated first according to manufacturer's protocol. 1μl of 8 prepared standard solutions was manually spotted on the target in triplicate. Then 1μl of CHCA matrix is applied to each spot. Total 24 mass spectra were acquired as Ci_01, Ci_02 and Ci_03, i = 1, 2, …, 8.
The prOTOF 2000 used for clinical samples was located in the PerkinElmer facility in Boston, MA. The prOTOF 2000 used for peptide and protein standards was located at University of Arkansas for Medical Science, Little Rock, AK.
Assume we have m spectra in study, and each takes on the same equally-spaced TOF tk with k = 1,2,…T, where T is maximum number of m/z points. The intensity of the spectrum j is modeled as:
where Sj(tk) is the true intensity of the spectrum j at TOF tk and xj(tk) is the observed intensity at TOF tk. In this model, there are two different kinds of noise: cj(tk), the “chemical noise”, and nj(tk), the random noise which mainly results from the limitation of the detector. The “chemical noise” has an interval of approximately 1 Da. In typical prOTOF MS, multiply charged peptides are rarely observed, so we can assume that all peptides or proteins carry a single charge; thus the spacing between the isotopic peaks is 1 Da. We assume that the amplitude of the chemical noise is smaller than that of the true peaks in a local region.
Random noise has very high frequency and can be readily removed with a wavelet transform . We use the undecimated wavelet transform (UDWT) to remove the noise nj(tk). After computing the wavelet coefficients for the original signal, a hard threshold is applied. The threshold is set to be the product of a threshold parameter and a robust estimate of the noise: the median absolute deviation divided by 0.67. The software used in this study is the Rice Wavelet Toolbox (RWT) [http://www-dsp.rice.edu/software/rwt.shtml]. After removing the random noise, the model becomes:
where j(tk) is the intensity of the smoothed spectrum j at TOF tk. Now we are left with the smoothed signal with chemical noise.
To identify the true peaks from the smoothed spectrum, the “chemical noise” Cj(tk) is removed with a short-time discrete Fourier transform (stDFT). Then the processed MS dataset is normalized to correct for the different total amount of proteins desorbed from the sample plate represented by Nj. Finally we identify the locations of the desired peaks in each spectrum according to a defined SNR. One of the goals of this study is to remove the chemical noise so that the peaks with comparatively small amplitude can also be detected with high SNR.
As noted, the chemical noise has the property that the spacing between two neighboring peaks is about 1Da and it has the shape similar to the sinusoidal signal. So we consider using discrete Fourier transform, which is the most popular transform to obtain the information from a signal that is not obviously in the time domain. It decomposes a signal to complex exponential functions of different frequencies and tells whether a certain frequency component exists or not in the signal. But it only applies for the stationary signal whose frequency content is unchanged in time. Since transferring the TOF data into m/z follows a formula like m/z = a (t − t0)2 + b, where a and b are related to the length of the flight tube and the applied voltage, t − t0 is the flight time, the number of data points in a 1Da region is different and decreasing from lower m/z to higher m/z. The original DFT cannot be applied directly to the smoothed signal. To analyze the non-stationary signal of which some portion of its signal is stationary, DFT is extended to stDFT. In stDFT, the signal is divided into small enough segments, where each segment of the signal can be assumed to be stationary. For this purpose, a window function is chosen in such a way that the width of the window is equal to the segment of the signal where its stationarity is valid. Since the decreasing of the number of data points in 1Da region with increasing m/z is very slow, and it is found that the number of data points in 1Da region is nearly the same in a certain region within several Daltons, we resort to stDFT. It can be described as:
where w(k,u) is the window function which is defined as:
with u being the starting point of the window. The window of the stDFT for the smoothed signal j(tk) is defined as the region in which the number of data points in 1Da is approximately the same. Figure 1(b) shows the number of data points in approximate 1Da region. So we define the stDFT window as the regions with the same number of data points in 1Da shown in the Figure 1(b). To compute DFT, fast Fourier transform is applied which is an efficient algorithm and reduces the computational complexity from O(T2) to O(T log T), where T is the number of data points in the signal.
After stDFT, we restore the sinusoidal signal that has the highest amplitude in the stDFT frequency domain in each window to estimate the chemical noise. When m/z is less than a certain value (about 4000Da), the smoothed signal j(tk) has the property that the spacing between peaks is 1Da. So the estimated chemical noise still has the period of 1Da. But when m/z becomes larger, undecimated wavelet transform smoothed the spectrum. The restored signal becomes an estimate of the long distance smoothed signal. Figure 1(c, d) shows the smoothed signal and the estimate of the chemical noise.
To remove the chemical noise from the smoothed signal, we compared the estimated chemical noise with the smoothed signal. Here a hard threshold is applied. If the net signal, defined as the original smoothed signal minus the estimate of the chemical noise, is less than a certain threshold, it is set to be zero. After this step, we are left with an estimate of NjSj(tk).
After de-noising the signal, we get an estimate of the true intensity in each individual spectrum. Peak identification in each estimated signal is described below.
In our model, the parameter Nj represents the amount of proteins desorbed from the sample plate. This number may vary as the total amount of proteins desorbed from a sample plate may differ between experiments. In order to correct for these differences, we define the total ion current as the mean intensity of the processed spectrum, the number Nj. The intensity of the processed spectrum is normalized by dividing it by the total ion current, providing an estimate of the true signal Sj(tk).
SNR is defined as the ratio of the intensity of the true signal sj(tk) to the background noise. Background noise is considered to be represented by the random noise nj(tk). The noise level is defined as the mean intensity of the random noise in a certain region.
In the prOTOF data, each protein corresponds to a cluster of peaks that includes all the isotopic peaks. To combine these peaks together, an envelope signal is extracted by fitting a curve to the local maximum points in each 1Da of the signal Sj(tk). This envelope combines all the peaks corresponding to the same protein into a concave curve. Then the local maximum value of all such concave curves is determined among the cluster of isotope peaks. The m/z of all these local maximums determines the locations of the proteins. This approach differs from some other peak detection methods [6, 7] that detect the monoisotopic peak as the location of the protein. By selecting proper SNR threshold, we can detect the desired peaks. Figure 1(e) shows the envelope of the normalized signal and the peaks we detected by setting the SNR to be 2. The curve over the denoised signal is the envelope and the circles label the positions of the peaks.
In the peak detection, we locate the peak in the position where it achieves the highest intensity in its peak cluster. Since the position of the peak that has the highest intensity among its isotope peak cluster for all the spectra may vary about 1Da, the detected position of the peaks that correspond to the same protein may shift 1Da left or right too. Thus we need to choose a position that can reflect the proper m/z of the peaks along all the spectra. Before that, we first align the peaks that should correspond to the same m/z since the same peaks can be horizontally shifted within a small region in different spectra. We propose to use the complete linkage hierarchical clustering to do it. This method has been applied in , but for our dataset, several points require further explanation.
To do the clustering for all the peaks, the peaks in all the spectra are first ordered into a M × 1 vector, where M is the total number of peaks in all the spectra. The distance between each two peaks is computed. In complete linkage clustering, the distance between one cluster and another cluster is the greatest distance from any member of one cluster to any member of the other cluster. Here, Euclidian distance is applied. Starting by assigning each item to a cluster, we can find the closest pair of clusters and merge them into a single cluster. Then the distance between the new cluster and each of the old clusters is determined. The process will be repeated till all the clusters are merged. Finally, all the items are clustered into a single cluster of size M. In practice, we do not need all the items to be clustered into a single cluster, so we need to define a proper height to cut the tree to get the desired clustering precision. In , a predefined threshold 0.005 is applied to cut the tree. This value is too large to get a good estimation of the peak position in our dataset. If a large threshold is applied, some noisy peaks can be included into the true peaks cluster, which results in the shifting of the positions of the true peaks. In the prOTOF data, the m/z shift is very small. Even when m/z is about 4000Da, the shifting is still less than 0.2Da. Thus in our testing, we chose the cutoff threshold to be 0.002.
We estimate the position of the peaks using the mean value of all the peaks in the same cluster. If a peak is a noise peak, the cluster that it belongs to should be very small. Thus, a true peak is identified if the size of the cluster is greater than a certain value. The remaining peaks are considered as noise and not considered further. This results in a series of common peaks along all the spectra.
To find a proper value that can reflect the position where the protein achieves highest intensity, we do the following. For a single peak, if there are no peaks with m/z distance about 1Da around it, it is considered as the peak. If there are peaks with spacing about 1Da, we record the number of the peaks and choose the peak that has the largest number as the final peak.
MoverZ is a program for viewing and manipulating biopolymer mass spectra, capable of performing a wide variety of manual and automated tasks such as peak labeling, calibration, etc. It is an open software and can be downloaded from: ‘http://bioinformatics.genomicsolutions.com/MoverZDL.html’. MoverZ contains its own peak labeling algorithm that uses signal to noise and resolution as parameters to determine mono-isotopic peaks. First it attempts to assign average masses to all of the peak distributions that have a root-mean-squared SNR greater than a value set by the user. Then it attempts to find a resolved monoisotopic peak in the appropriate position relative to the average mass. The final peaks are determined by a fuzzy logic decision. Further details about MoverZ can be found at ‘http://bioinformatics.genomicsolutions.com/MoverZ.html’.
We first validated our method using a dataset acquired with a series of peptide and protein calibration standards. Next we showed the results of peak detection and alignment for all the four groups of spectra from the patients with carotid artery disease.
To validate PDA_stFFT, we tested all the datasets acquired with a series of peptide and protein standards and recorded the SNR of all the known peaks in the peptide and protein standards. The number of peaks detected with PDA_stFFT when SNR is greater than 2 is listed in Table 2. The SNR for the first replicate (Ci_01, i = 1, 2, …, 8) is shown in Table 3. The other two replicates have the similar results. As described in the Materials and Methods section, the peptide and protein standards were diluted in 3-fold to prepare a series of 8 diluted samples. The peptide and protein concentration for C1 (C1_01 or C1_02 or C1_03) was selected so that we expected to detect all known peaks with prOTOF 2000. As the concentration of peptide and protein was further diluted, we expected the SNR to reduce with dilution. Thus PDA_stFFT will detect fewer peaks as the dilution increases or no peak as the concentration of peptide or protein fails below the detection limit of the instrument. The data in Table 2 reflect this pattern.
Figure 2(a) shows the three raw spectra around the peak with m/z 757Da in sample C1_0i, i = 1, 2, 3 and Figure 2(b) shows the SNR of the denoised signal with smoothing and chemical noise removal. Figure 2(c) is the same as Figure 2(b) without the chemical noise removal. From Figure 2(b), we can see that the SNR is enhanced significantly after the chemical noise removal.
To see the efficiency of our method, PDA_stFFT, we compared the total number of the peaks detected with PDA_stFFT and MoverZ.
Table 2 shows the number of peaks with SNR≥ 2 detected with PDA_stFFT and MoverZ. Among all the cases, except in the sample C1_01, our method detects more peaks than MoverZ. This may be due to the removal of chemical noise which enhances the SNR of the peak. The SNR for a peak obtained with our method is much larger than that with MoverZ in most cases. We list the SNR of all the peaks in the sample C1_01, C1_02 and C1_03 in Table 1 as an example. Similar results were achieved with other datasets. In Table 1, “None” means the SNR is smaller than 2 in that position. When m/z is small, the SNR calculated with our method can be several times as that with MoverZ. When m/z becomes very large, MoverZ may detect the peaks with higher SNR than PDA_stFFT. This is because the detection sensitivity of prOTOF 2000 decreases with the increase of m/z, and the peak intensity of a true signal becomes weaker when m/z is very large. Since the chemical noise in high m/z region is similar to that in low m/z region, after the chemical noise removal, the signal may become even weaker, which may make the SNR smaller.
One advantage of PDA_stFFT over MoverZ is that it can discern some small peaks from the chemical noise. Figure 2(a) shows a peak at m/z 757 which belongs to the peptide “Bradykinin fragment 1-7 (m/z 757Da)”. PDA_stFFT detected this peak with a high SNR while MoverZ failed to detect it as demonstrated in Table 1. Bradykinin fragment 1-7 in C1_01, C1_02 and C1_03 was detected by PDA_stFFT as a peak with SNR of 101.9, 61.9 and 33.1 respectively (Figure 2(a)). However, MoverZ only detected 757Da peak in C1_01 with SNR of 2.3 and failed to detect 757Da peak in C1_02 and C1_03.
In the wavelet transform, the threshold parameter was set to 6, allowing smoothing of the signal. During the chemical noise removal, the threshold for determining the chemical noise is set to 80 percentiles of the value obtained by subtracting the estimated chemical noise from the smoothed signal in each adaptive region. If the number of peaks in one cluster is more than 15% of the number of spectra in one group in the peak alignment step, we take it as the true peak, and get the position of the peak by computing the mean m/z over all the peaks in its peak cluster. Otherwise, we consider them noise peaks.
Before the peak identification step, it is necessary to remove the outlier or noise data using the unsupervised information. To find the noise data, the mean of all the spectra is determined. If the mean value of the spectrum ranges between mean minus the deviation and mean plus the deviation, we consider the spectrum acceptable for further analysis. For example, the EAS group has 12 patients and 11 patients have three replicate spectra and one patient has two replicate spectra. Within these 35 spectra, one spectrum was found to be an outlier and was removed from the dataset. A total 894 of common peaks were detected from these 34 spectra after the alignment from 12265 detected peaks. After we combined the peaks that may correspond to the same protein together by choosing the peak with the maximum number in its neighborhood, 519 peaks were defined for further analysis.
While the peaks aligned well (Figure 3(a)), there was a shift in a very small region. We aligned the peaks with the complete linkage clustering and determined the position of each common peak as the average position. To show the peak alignment and detection processes, we focused on the region between 1700Da and 1900Da of the spectra. Figure 3(b) shows all the peaks detected and the aligned peaks between 1700Da and 1900Da. In total, 677 peaks were detected and 41 peaks defined as common peaks. After combining the peaks that correspond to the same protein together, 24 peaks were defined (Figure 3(c) with the solid line). Figure 3(d) shows all the smoothed spectra in this region and the detected peaks. The vertical lines correspond to the peaks detected by PDA_stFFT after alignment and the final determined peaks are pointed out in the Figure 3(d) with label of “Final peak”.
As observed in Figure 3(d), the PDA_stFFT can detect very small peaks, such as m/z 1762, 1833 and 1835 with SNR≥2. From the two enlarged regions within Figure 3(d), two to three detected peaks (with a vertical line at center of the peak) are 1Da apart and these peaks may correspond to the same protein. Figure 3(e, f) shows the peaks detected in these two small regions. The m/z 1762 peak was determined as the final peak from the peak cluster between m/z 1760 and 1765 and two peaks of m/z 1833 and 1835 were determined as final two peaks between m/z 1831 and 1839. These three peaks are plotted as solid lines in Figure 3(e) and (f). The peaks with dotted lines in Figure 3(e) and (f) are considered to be isotopic peaks.
With the same setting, we finally are able to detect 465 peaks in the EAA group, 589 peaks in the control group, and 334 peaks in the RFC group. These peaks will be used for classification of a disease model to find any potential biomarker or panel of biomarkers that can distinguish difference among the groups.
Peak detection plays an essential role in biomarker discovery from MS data. The peak detection method however depends significantly on the property of the data. An automatic peak detection method is proposed in this paper for prOTOF data. Random noise is removed by the undecimated wavelet transform. We have designed an adaptive short-time discrete Fourier transform to address chemical noise removal, which increases of the signal-to-noise ratio of the true peaks. The possible isotopic peaks corresponding to one protein are combined together by extracting an envelope over them. Depending on SNR, the desired peaks in each individual spectrum can be identified. In contrast to other methods for identifying peptides, this method is independent of the knowledge of the mass of the peptides. In addition, since we use the short-time discrete Fourier transform to remove the chemical noise, it can be applied to a wide range of mass. The final peaks are determined by choosing the peak which has the largest number among its neighbors with a distance of about 1Da in the aligned common peaks obtained by complete linkage hierarchical clustering for all the peaks detected in all the spectra. We believe this approach can be easily implemented and is very effective.
We validated our PDA_stFFT method with a dataset acquired with an analysis of the peptide and protein calibration standards. The results are consistent with our experiment settings. The SNR for known peaks is significantly increased after chemical noise removal. Compared to the popular software MoverZ, our method can detect more peaks in diluted samples and can detect peaks with low intensity with high SNR. Finally, we successfully applied our method to detect the peaks from the spectra of a large dataset derived from clinical serum samples.
The authors wish to thank Dr. Lisa Sapp in PerkinElmer (Boston MA) for acquiring all prOTOF MS for clinical samples. This work is partially funded by IBIS award (Zhou) and TMHRI scholarship award (Zhou). This research was supported in part by the Intramural Program of the NIH, Clinical Center as well. The authors would like to thank Dominik Beck for proofreading the manuscript.
The authors have declared no conflict of interest.