|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Peaks are the key information in mass spectrometry (MS) which has been increasingly used to discover diseases-related proteomic patterns. Peak detection is an essential step for MS-based proteomic data analysis. Recently, several peak detection algorithms have been proposed. However, in these algorithms, there are three major deficiencies: (i) because the noise is often removed, the true signal could also be removed; (ii) baseline removal step may get rid of true peaks and create new false peaks; (iii) in peak quantification step, a threshold of signal-to-noise ratio (SNR) is usually used to remove false peaks; however, noise estimations in SNR calculation are often inaccurate in either time or wavelet domain. In this article, we propose new algorithms to solve these problems. First, we use bivariate shrinkage estimator in stationary wavelet domain to avoid removing true peaks in denoising step. Second, without baseline removal, zero-crossing lines in multi-scale of derivative Gaussian wavelets are investigated with mixture of Gaussian to estimate discriminative parameters of peaks. Third, in quantification step, the frequency, SD, height and rank of peaks are used to detect both high and small energy peaks with robustness to noise.
Results: We propose a novel Gaussian Derivative Wavelet (GDWavelet) method to more accurately detect true peaks with a lower false discovery rate than existing methods. The proposed GDWavelet method has been performed on the real Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight (SELDI-TOF) spectrum with known polypeptide positions and on two synthetic data with Gaussian and real noise. All experimental results demonstrate that our method outperforms other commonly used methods. The standard receiver operating characteristic (ROC) curves are used to evaluate the experimental results.
Mass spectrometry (MS) is a crucial analytical tool in proteomics research to provide tremendous information for disease proteomics study and drug targets identification at the protein/peptide level. Due to measurement error, chemical and other background noise, MS usually contains high-frequency noise and consequently a multitude of misleading peaks. Peak detection is one of the most important steps in MS data analysis because its performance directly effects the final proteomics study results.
Because the noise in MS comes from different resources and cannot be estimated, false positive peak detection results are unavoidable. This makes peak detection as a challenging problem. In recent years, several peak detection methods have been proposed (Coombes et al., 2005; Du et al., 2006; Morris et al., 2005; Nguyen et al., 2009). Most previous algorithms have four common preprocessing steps: denoising, baseline correction, alignment of spectrograms and normalization. After preprocessing, local maxima is usually used to detect peak positions and design rules to quantify peaks. In this article, we will explore the limitations of existing peak detection methods and propose several new algorithms to solve them.
Most peak detection methods employed denoising step by removing noise in each scale of wavelet, such as commonly used Cromwell (Coombes et al., 2005; Morris et al., 2005) and continuous wavelet transform (CWT) (Du et al., 2006) methods. However, true peaks in MS could have large frequency response and be removed by denoising step. As a result, these true peaks cannot be detected. We propose using bivariate shrinkage model, which considers relationship of two neighbor scales, to remove noise in stationary wavelet domain. Because utilizing relationship between two neighbor coefficients or two scales of wavelets can keep high-frequency true signal (Selesnick et al., 2001). Stationary wavelet transform (SWT) utilizes spatial information of signals and suppress artifacts by redundant representations.
Baseline removal step was widely used in peak detection methods, but it often got rid of true peaks and created new false peaks. To avoid removing baseline, the CWT-based pattern-matching algorithm was introduced in study by Du et al. (2006). Using Mexican Hat wavelet in multi-scale, this method gave good results in peak detection with high sensitivity and low false discovery rate (FDR). However, the more important property of multi-scale in wavelet domain was not used in this method (Mallat, 2009). Instead of considering peaks as the sum of delta functions, more generally, we consider MS peaks as a mixture of Gaussian in which each peak corresponds to one Gaussian. We propose to use Gaussian derivative wavelet, instead of Mexican Hat wavelet which is only the second derivative of Gaussian wavelet. Zero-crossing lines which are robust to noise are also introduced to replace Ridge-lines in Du et al. (2006). We study the zero-crossing lines in multi-scale wavelet and provide new theoretical analysis.
In most peak detection methods, signal-to-noise ratio (SNR) was used to remove the small energy peaks with SNR values less than a threshold. But MS noise cannot be correctly estimated in either time domain or wavelet domain. Thus, in this article, instead of SNR, frequency response, height and SD of Gaussian peaks calculated by zero-crossing in Gaussian derivative wavelet domain are used to remove false peaks. In order to improve sensitivity, the Envelope analysis (Nguyen et al., 2009) is also used to save some important peaks with small energy.
In this article, we propose a new Gaussian derivative wavelet-based peak detection method (GDWavelet) for Surface-Enhanced Laser Desorption/Ionization Time-Of-Flight (SELDI-TOF) spectrum. Both simulated and real spectrum with known polypeptide positions and compositions are used to evaluate our method. With simulated data, we compare different peak detection algorithms by both Gaussian and real noise. All experimental results show that our new approach can detect more peaks (in both high and low amplitude) with a lower FDR than state-of-the-art methods.
In this section, our new GDWavelet method will be introduced. In GDWavelet, we utilize bivariate smoothing model, Gaussian derivative wavelet and envelope analysis. First, bivariate shrinkage estimator in SWT domain will be used to reduce noise and to keep whole true signal. Second, we will introduce how to detect peaks using Gaussian derivative wavelet through peak properties such as frequency response, SD and height. Finally, envelope analysis is performed to save true small energy peaks which will be missed if only peak properties are used.
Noise smoothing in MS is an important step which should remove noise and keep true peaks. In Myers et al. (2004), they tried to remove noise as much as possible, hence some true peaks were also removed. We propose to utilize bivariate shrinkage estimator in SWT domain to reduce noise and keep whole true signal. More precisely, we decrease the noise level without removing most of them. SWT is chosen due to its fast speed and redundant representations. The later step will further handle the remaining noise.
To estimate wavelet coefficients, the most well-known rules are universal thresholding and soft thresholding (Donoho et al., 1995) which was applied to Cromwell method (Coombes et al., 2005; Morris et al., 2005). These algorithms assume that wavelet coefficients are independent. Unfortunately, frequency response of peak is rather wide. Hard or soft thresholding only considers coefficients in a sub-band with narrow frequency. Recent research shows that algorithms utilizing the dependency between coefficients can give better results than those using the independency assumption (Sendur et al., 2002). Sendur et al. exploited this dependency between coefficients and proposed a non-Gaussian bivariate pdf for the child coefficient w1 and its parent w2 as follows
The marginal variance σ2 is dependent on the coefficients index k. By this bivariate pdf and the Bayesian estimation theory, the MAP estimator of w1 (Sendur et al., 2002) is derived as
where y1 is child noisy coefficient, y2 is parent noisy coefficient. This estimator is a bivariate shrinkage function. It has been used to smooth many kinds of signals such as image (Sendur et al., 2002), DNA copy number (Huang et al., 2008; Nguyen et al., 2010), etc. In this article, bivariate shrinkage estimator is used to smooth MS signals. An example of denoising result is shown in Figure 3a. This example will be discussed in Section 2.4.
In previous works (Coombes et al., 2005; Morris et al., 2005), MS peaks were considered as the sum of delta functions. That means only heights of peaks have been used for peak detection throughout SNR. Du et al. (2006) utilized width of peaks to improve peak detection results a lot. We consider MS peaks as a mixture of Gaussian in which each peak corresponds to one Gaussian:
With this assumption, four parameters providing intrinsic differences between true peaks and noise are peak position, SD, height and frequency response of peak. To find these parameters of a peak, we use zero-crossing lines in multi-scale of Gaussian derivative wavelet instead of ridge-lines in multi-scale of Mexican hat wavelet that was used by Du et al. (2006).
Scaling theory for zero-crossings has been studied and applied to many applications. Yuille et al. (1986) assumed that signal is the sum of delta functions. Another similar assumption of signal, bandlimited signal, has been studied in Vo et al. (1996). However, studying zero-crossing of signals with Gaussian mixture assumption still is a new and challenging problem. We will build new theory of zero-crossing lines in multi-scale in following sections. Through our theory, parameters (position, SD, height and frequency response) of a Gaussian peak can be accurately estimated.
We use the first derivative of fi(t) to locate local maxima corresponding peak position: fi′(t0) = 0 with t0 = μi. We continue using the second derivative and third derivative of fi(t) to estimate height and SD of Gaussian peak: fi′′(t0) = 0 with t0 = μi±σi, fi′′(t0) = 0 with t0 = μi and .
Since smoothing performed in denoising step only reduces noise and keeps small true peaks, multi-scales of Gaussian derivative wavelet are used to make local maxima and minima more robust to noise instead of only one Gaussian filter in Nguyen et al. (2009). The wavelet transform can be written as convolution product in (4):
In this article, the Gaussian wavelet is used. So, can be followed as (6):
If convoluting fi(t) and , we get result in (7)
where and .
The zero-crossing of W1f(u, s) and W2f(u, s) belong to connected curves that are never interrupted when the scale decreases.
If W2f(u, s) =0 we get , then 0 < u0(s+1)−u0(s) ≤ 1 with any scale s.
If W3f(u, s) = 0, we get . If we select s = 100 and σi = 0.1, then u0(100+1)−u0(100)=1.2247. In conclusion, 0≤u0(s+1)−u0(s)≤1 with the first and second derivative and zero-crossing lines belong to connected curves. Another conclusion is that zero-crossing lines is discontinuous lines if the third derivative Gaussian wavelet is used. Thus, only the first and second derivative Gaussian wavelets should be used in peak detection.
If fi is a discrete signal, (4) can be rewritten as follows:
We get f(k) by sampling fi(t) with Ts:
If W2f(u, s) = 0, we get . if W3f(u, s) = 0, we get .
Note: zero-crossing line is more robust to noise than ridge line. This conclusion is illustrated by an example in Figure 1. Figure 1c and e show that zero-crossing lines are easier to detect than ridge lines linking local maxima or local minima points.
From Section 2.2.1, parameters of a Gaussian peak could be estimated as follows:
Estimation of peak position: there are three ways to estimate peak positions throughout zero-crossing of three kind Gaussian derivative wavelets.
Estimation of peak's SD: Another parameter of Gaussian peak is SD σi. There are two ways to estimate σi as follows
After calculating σi−left(s) and σi−right(s) at all scales, σi should be estimated by
where Nl and Nr are length of left and right zero-crossing lines. However, zero-crossing lines at left and right sides of the third Gaussian derivative wavelet are disconnected lines, so it is not easy to estimate σi through (19, 20, 21).
Estimation of peak height: finally, we develop a way to estimate real height of Gaussian peak. With Gaussian peak , we have
We can use (22) to estimate height of Gaussian peak after knowing μi and σi.
An Example: to demonstrate the above theory, we assume we have a Gaussian peak as follows:
where Ax = 10, μx = 5 and σx = 0.5. This peak is added Gaussian noise and baseline as follows:
where b is constant, a representation of baseline, μ = 0 and σ = [0.25; 0.5; 0.75; 1]. With each σ value, 200 signals f(s) have been created. One sample f(t) is shown in Figure 1a. We will estimate μx, σx and Ax using above zero-crossing theory. Error rate which is defined in (25) will be used to compare accuracy of different estimation methods:
Figure 1b, c and e show zero-crossing lines in 128 scales using Gaus1, Gaus2 and Gaus3. These zero-crossing lines will be used to estimate μx, σx and Ax. Table 1 lists error rates of four methods to estimate peak position μx. With Gaus1, Gaus2, and Gaus3 methods, μx values are calculated by (13, 14, 15) correspondingly. The term ‘with denoise’ means bivariate shrinkage estimator is used to denoise Gaussian noise in signal f(t). The Mexh, Mexican hat wavelet corresponding to Gaus2, is used as core part to detect peak in CWT method (Du et al., 2006) and peak tree method (Zhang et al., submitted for publication). Based on result's in Table 1, the error rate when using Mexh wavelet (Du et al., 2006) is the largest. We note that we use package ‘MassSpecWavelet’ (Du et al., 2009) which uses denoising with DWT and finds peak position using ridge lines (Du et al., 2006) with Mexh wavelet. ‘Gaus1 with denoise’ has the smallest error rate. However, error rates in Gaus1 without denoising and in Gaus2 are still acceptable and much better than in Mexh wavelet.
We can estimate σx by (18) or (21). However, with Gaus3, zero-crossing lines are not continuous lines (see Remark in Section 2.2.1). Thus, estimation of zero-crossing in 128 scales of Gaus3 is a problem. This problem causes a larger error in calculating the σx. From result of Table 2, we can conclude that Gaus2 with denoising should be used to estimate σx because its error rate is the smallest.
By using (22) with zero-crossing lines of both Gaus2 and Gaus3, the height of Gaussian peak is estimated. In this case, baseline b which is used in (24) is a constant. From Table 3, Gaus2 with denoising gives the smallest error rate and should be used to calculate Ax.
From above example, the best way to estimate peak position μx is from the first Gaussian derivative wavelet, Gaus1. The second Gaussian derivative wavelet, Gaus2, should be used to estimate SD σx and height Ax of a Gaussian peak. Figure 1d and f shows Ridge lines which correspond to zero-crossing lines in Figure 1c and e. It is clearly that detecting Ridge lines is more difficult than detecting zero-crossing lines. Ridge lines in Du et al. (2006) are similar to Ridge lines in Figure 1f, corresponding to zero-crossing line in Gaus3 which should not be used because of its high error in calculating parameters of peaks.
Envelope analysis was introduced by Nguyen et al. (2009). Any finite energy signal y(t) can be analyzed into three envelope signals including MAX, MIN and MED envelops at the first level. Each of these envelops can be considered as a signal and will be decomposed into three envelopes. In this article, we use MAX and MED envelops to detect peaks because MIN envelops contain no peak. We decompose the original signal into one MAX envelop at level 1, one MAX and one MED envelops at level 2 and four envelops which comprise two MAX envelops and two MED envelops at level n > 2. Empirically, 5–7 are recommended as the number of levels to get significant peaks.
The framework of our proposed GDWavelet method is shown in Figure 2. First, raw MS data is smoothed by bivariate shrinkage estimator (2) in SWT domain to keep true signal and reduce noise. Note that, the lowest frequency detail scale and approximate scale which may include true signal should not be applied with any estimator, so that true signal is not removed. As a result, noise in signal is reduced and smoothed signal still has a little noise. Second, without applying baseline removal that often discards true peaks and creates new peaks, smoothed signal is used to estimate frequency response, position, height and SD of Gaussian peak by zero-crossing lines in multi-scale Gaussian derivative wavelet domain. Frequency response of Gaussian peak is proportional to the length of zero-crossing line if the first derivative Gaussian (Gaus1) is used. Peak position, μi, is estimated by (13). SD, σi, of Gaussian peak is calculated by (18). Result of (22) with Gaus2 gives heights of peaks. Using the first and the second derivative Gaussian wavelet, we can estimate all parameters of a Gaussian peak. After removing peaks with frequency response and SD less than a threshold, we get all peak candidates. Third, in peak quantification step, we use two rules to remove false peaks: (i) all peak candidates are quantified by peak rank (PR; Nguyen et al., 2009) in Envelop analysis. Peaks with PR = 1, even small peaks, are important peaks. (ii) Peak height is used to remove peaks with height smaller than threshold. We use peak height to substitute SNR that was used by Morris et al. (2005) and Du et al. (2006), because noise cannot be exactly estimated in either time domain or wavelet domain. Finally, the union results of two quantifying rules are the final detected peaks.
We randomly select 19-th sample of CAMDA, 2006 to illustrate how GDWavelet method detects MS peaks. In Figure 3a, blue signal represents raw signal and red one is signal smoothed by (2). A zoom in subfigure draws the peak which is used to show its zero-crossing lines in Figure 3b. Using one zero-crossing line in multi-scale of the Gaus1 and two zero-crossing lines in multi-scale of the Gaus2, position, height, SD, and frequency response of this peak are estimated. In Figure 3c, we quantify peaks by two rules: peak height and PR (in Envelope analysis). The circles are results from peak height-based quantification. The stars are from PR-based quantification. Union results include all peaks with heights larger than a threshold or PR one. Figure 3d shows receiver operating characteristic (ROC) curves of four related methods. GDWavlet gives the best performance.
Cruz-Marcelo et al. (2008) and Emanuele and Gurbaxani (2009) presented the extensive studies to compare the performance of state-of-the-art methods for SELDI data preprocessing, including CWT (Du et al., 2006), Cromwell (Coombes et al., 2005; Morris et al., 2005), PROcess (Li et al., 2006), Ciphergen and SpecAlign (Wong et al., 2005). They concluded that CWT (Du et al., 2006) has the best performance. Another method which also works well is Cromwell (Coombes et al., 2005; Morris et al., 2005). In this section, our GDWavelet method will be compared with the Cromwell (Coombes et al., 2005; Morris et al., 2005), the CWT (Du et al., 2006) and our previous method, GaborEnvelop (Nguyen et al., 2009). Cromwell method is implemented by MATLAB which can be downloaded from (UT-MD Anderson Cancer Centre, 2002). The CWT method (Du et al., 2006) was implemented in R (called ‘MassSpecWavelet’) and Version 1.12 can be downloaded from Du et al. (2009). GaborEnvelop (Nguyen et al., 2009) is implemented in MATLAB.
We evaluate the performance of above methods by the ROC curve. Both simulated and real data are used. The first simulated data was proposed by Morris et al. (2005) and Coombes et al. (2005) and is available for download at Simulated Proteomics Spectra (2005). In this data, hundreds of mean spectrum samples with hundreds of proteomics datasets are generated.
Based on the simulation engine developed by Morris et al. (2005) and code (R and MATLAB) to generate simulated data proposed by Cruz-Marcelo et al. (2008) and Zhang et al. (submitted for publication), we also create two new simulated datasets to investigate noise affection on different algorithms. The 100 spectrums with 20 − 30 true peaks are created first, and Gaussian and real noise are added separately to get two datasets. When Gaussian noise is added, each sample includes 20% of protein peaks which are below three time of SNR. Real noise is extracted from real data in which there is no true peaks. There is only noise from 26000 (index) to end in first sample of CAMDA, 2006. Real noise probability density function is built. Using this function, noise with different SD will be created. Based on this configuration, we create about 20–30 true peaks and more small energy peaks in simulated data.
The CAMDA dataset (2006) of all-in-1 Protein Standard II (Ciphergen Cat. # C100–007) is the real dataset. Because we know polypeptide composition and position in this dataset, we can estimate the sensitivity and the FDR. There are seven polypeptides which create seven true peaks at 7034, 12 230, 16 951, 29 023, 46 671, 66 433 and 147 300 of the m/z values.
The sensitivity and FDR of four methods are calculated for 60 real MS signals and three simulated MS datasets with 100 signals each. The SNR thresholding values are increased gradually to calculate the ROC curves of Cromwell and CWT methods. The SNR thresholding values are chosen from 0 to 20 for Cromwell method and from 0 to 120 for CWT method. In our GDWavelet method, the peak height ratio, which is defined as the ratio of current peak height over average height of peaks, is changed from 0 to 10 to build the ROC curve. We plot the average ROC curves in Figures 4 and and5.5. We should notice that we take average of all ROC points with the same SNR threshold (for Cromwell and CWT) and the same peak height rate (for our GDWavelet method).
Three simulated datasets and one real SELDI-TOF dataset are used to create ROC curves in Figures 4 and and5.5. In all four datasets, the performance of Cromwell is not stable and gets worse than CWT and GDWavelet. Between GaborEnvelop which used Envelope analysis and CWT which used ridge lines and SNR in peak quantification, GaborEnvelop is better than CWT in real data in Figure 4b and CWT is better than GaborEnvelop in simulated data. In all cases, our GDWavelet method has much better performance than GaborEnvelop and CWT methods. At the same FDR, the sensitivity of our method is consistently higher than GaborEnvelops and CWTs. It is clear that utilizing both of Envelope analysis and Gaussian derivative wavelet in peak quantification made significant contributions to detect both high energy and small energy peaks. Bivariate shrinkage estimator in wavelet domain guarantees that denoising step in our method saves more true signal than in Morris et al. (2005). Zero-crossing lines-based peak parameters estimations in our article is more accurate and robust to noise than ridge lines in Du et al. (2006). Envelope analysis is more efficiently used in GDWavelet than in GaborEnvelop. Therefore, the GDWavelet has better peak detection results than Cromwell, GaborEnvelop and CWT. Thus, it is an efficient and accurate method to detect peaks in both real and simulated MS data. In Figures 4 and and5,5, CWT's ROC curves is limited in small FDR because two thresholds of the length of ridge lines and the scale corresponding to the maximum amplitude on the ridge line are used as default in MassSpecWavelet (Du et al., 2009). Finally the runtime of GDWavelet algorithm is comparable to CWT method, because both methods need more computational time to decompose a signal to many scale using continuous wavelet transform.
In this article, we proposed new zero-crossing line theory in multi-scale of Gaussian derivative wavelet to estimate parameters of peaks in MS which has been assumed as a mixture of Gaussian. A novel GDWavelet method was proposed to efficiently and accurately detect MS peaks by integrating bivariate shrinkage model, Gaussian derivative and Envelope analysis. The bivariate shrinkage estimator in SWT domain was used to reduce noise and still keep true peaks. All parameters of a Gaussian peak, estimated by multi-scale in Gaussian derivative wavelet and Envelope analysis, have been used to remove false peaks. The peak height and PR were introduced as a nice substitution of the previous SNR method to identify true peaks. Both simulated data and real MS data are used to evaluate our GDWavelet method. Simulated data were created with both Gaussian noise and real noise. Our GDWavelet method gave out a much better performance in the ROC curves than three other state-of-the-art peak detection methods. GDWavelet algorithm will be extended and test with other kinds of MS (such as MALDI-TOF) as future work. Based on GDWavelet method, many MS data-related applications will be improved, such as protein identification, biomarker discovery, cancer classification, etc.
Funding: NSF-CCF 0830780; NSF-CCF 0939187; NSF-CCF 0917274; NSF-DMS 0915228; NSF-CNS 0923494; UTA-REP.
Conflict of Interest: none declared.