Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2788927

Formats

Article sections

Authors

Related links

Bioinformatics. 2009 December 15; 25(24): 3213–3220.

Published online 2009 October 9. doi: 10.1093/bioinformatics/btp582

PMCID: PMC2788927

Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building (MMC 303), Minneapolis, MN 55455-0378, USA

*To whom correspondence should be addressed.

Associate Editor: Alex Bateman

Received 2009 June 22; Revised 2009 October 4; Accepted 2009 October 6.

Copyright © The Author 2009. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

**Motivation:** The need to align spectra to correct for mass-to-charge experimental variation is a problem that arises in mass spectrometry (MS). Most of the MS-based proteomic data analysis methods involve a two-step approach, identify peaks first and then do the alignment and statistical inference on these identified peaks only. However, the peak identification step relies on prior information on the proteins of interest or a peak detection model, which are subject to error. Also numerous additional features such as peak shape and peak width are lost in simple peak detection, and these are informative for correcting mass variation in the alignment step.

**Results:** Here, we present a novel Bayesian approach to align the complete spectra. The approach is based on a parametric model which assumes that the spectrum and alignment function are Gaussian processes, but the alignment function is monotone. We show how to use the expectation–maximization algorithm to find the posterior mode of the set of alignment functions and the mean spectrum for a patient population. After alignment, we conduct tests while controlling for error attributable to multiple comparisons on the level of the peaks identified from the absolute mean spectra difference of two patient populations.

**Contact:** ude.nmu.tatsoib@rnavac

One approach to the characterization of the proteome of a tissue sample from an organism is to use the mass spectrum of the tissue sample. In mass spectrometry (MS), a biological sample is fragmented into ions, and the paired mass-to-charge ratio (*m*/*z*) versus the intensity of the resulting set of ions are measured. Several technical approaches in obtaining a representation of the spectrum are in common use, such as matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) MS and surface-assisted laser desorption-ionization time-of-flight (SALDI-TOF) MS, but all these approaches involve the alignment of the *m*/*z* axis to correct for experimental variations if one wants to compare the spectra from different tissues. What these different approaches have in common is that they produce a set of intensity measurements for *m*/*z* values, or since we will present examples using the MALDI-TOF technology which usually produces singly charged ions, we can just think of this as a set of intensity measurements as they depend on mass. One common goal is to compare mean spectra across different patient populations to identify biomarkers.

All alignment approaches fall into one of two very broad categories (Vandenbogaert *et al*., 2008): they are based either on peak data, where the data have been processed to identify important signals (peaks) and distinguish these signals from noise, or on profile data, where the MS spectra are taken as recorded. We will now briefly describe each of these general strategies and discuss several implementations of each strategy.

In the peak-based approach, one tries to distinguish between relevant signals from peptides and irrelevant noise in the data in an initial peak detection step, and then relies only on these peaks for the subsequent analysis. In the peak detection step, one applies a number of data preprocessing steps. Data preprocessing is considered a necessary step before characterization of the proteome of a set of samples for all applications of MS due to the noise in the measured intensity and also the slight variation in the measured locations. This data preprocessing typically consists of the following steps: baseline correction, smoothing, intensity normalization, peak identification and peak alignment. Following this data preprocessing, analysis of different patient populations can be carried out by using the detected and aligned peaks. Peak detection typically excludes a large amount of noise and thereby reduces the data considerably. However, the performance of the overall alignment process strongly depends on the performance of the peak detection step. Peak-based alignment methods typically only take *m*/*z* variation of aligned peaks into account; only a few of them also use intensity. Several peak-based alignment methods have been proposed, as we now discuss.

Tibshirani *et al*. (2004) aligned peaks extracted from the smoothed raw spectra via a dendrogram constructed using agglomerative hierarchical clustering with complete linkage. The idea is that tight clusters should represent the same biological peak that has been horizontally shifted in different spectra, thus one can extract the centroid of each cluster to represent the consensus position for that peak across all spectra. A peak in the individual spectra is deemed to be one of the common peaks (cluster), if its center is close enough to the estimated central position of the common peak. The assumption that motivates such a procedure is true only if global or non-linear local shifts in the *m*/*z* scale are uniformly small enough so that no peak deviates far from the mean spectrum of the population to keep the closeness to the true peak position. Finally, selection of the number of clusters (i.e. the dendrogram cutoff value) is in general difficult to determine.

Yu *et al*. (2006) proposed to address the multiple peak alignment problem via a sequential approach based on Gaussian scale-space theory. They assume that multiple sets of detected peaks are the observed samples of a set of common peaks. They convert the problem of estimating locations of the unknown number of common peaks from multiple sets of detected peaks into the much simpler problem of searching for local maxima in the scale-space representation. The optimization of the scale parameter σ is achieved through minimizing the energy function *E*(σ) = *D*(σ)+λ*R*(σ). The data-fitting term *D*(σ) is the sum-of-square differences between the sample peaks and the scale-space-based representation. The regularization term *R*(σ) = 1 − exp{−(σ−μ_{s})^{2}/(2σ_{s}^{2})} prevents overfitting (too small σ) and oversmoothing (too large σ). The parameter λ allows the user to adjust the relative importance between these two terms. From the Bayesian point of view, the data-fitting term is the log-likelihood, the regularization term is the log prior. They use the *m*/*z* distances between neighboring peaks in all samples to estimate σ_{s} and μ_{s} in the regulation term. In their simulations, the energy function achieved the same minimum, around , for all the λ values they tried: 0.5, 1, 1.5 and 2.0. They claimed that the value of the relative importance coefficient λ does not have much influence on the minimum energy and simply set λ = 1. The above claim is true if and only if , which corresponds to . The μ_{s} happened to be a good estimate for σ in their simulations, but this will not generally be the case. Thus, how to set the value of λ is still an unsolved problem. In the next step, they obtain the number and locations of the common peaks at the estimated-scale level by searching for a local maximum, and align peaks with respect to the common peaks using a closest point matching method (i.e. for every common peak, its counterpart in a sample has the smallest distance among all peaks in the same sample), which is valid only if the majority of peaks locate close to the true locations with only a few outliers. The fundamental assumption of the above Gaussian scale-space approach is that the locations of the observed peaks follow a symmetric unimodal distributions (e.g. normal distributions) with their means equal to the corresponding locations of the common peaks and variances reflecting the width of the peak, but that strongly depends on whether or not the preprocessing steps could deal well with frequently observed global and local non-linear shifts of peaks.

Alternatively, one can attempt to compare the complete MS spectra directly. Methods that use this approach usually attempt to find an alignment such that the overall difference between intensities of all MS spectra and the reference spectrum (e.g. the mean spectrum or any spectrum among the studied ones) after alignment is minimized.

Some try to find a polynomial warping function that applies to all *m*/*z* values using least squares, as proposed by Eilers (2004), which we call the parametric time warping (PTW) method. Typically, quadratic functions are used in practice, which are not flexible enough to capture non-quadratic *m*/*z* distortion, thereby introducing bias.

Listgarten *et al*. (2005) proposed what they call the continuous profile model (CPM). This model employs a hidden Markov model-based approach, in which each observed mass spectrum is a non-uniformly subsampled version of a latent spectrum, to which local rescaling and additive noise are applied. They estimated the latent spectrum using the expectation-maximization (EM) algorithm, then align each observed spectrum to the latent spectrum using the Viterbi algorithm, a dynamic programming (DP) algorithm for finding the most likely sequence of hidden states. One potential advantage of this approach is the consideration of the intensity information in the alignment, and another is to combine all the preprocessing, peak detection and peak alignment steps into one computation. On the other hand, one must manually set many parameters, and this approach only works for extracting the latent spectrum for one sample by aligning multiple technical replicates. The implicit assumption behind this approach is that there is one-to-one correspondence among peaks across sequences to be aligned (Yu *et al*., 2006). This is usually not true for samples from different persons. Thus, this method cannot be directly applied to clinical datasets which usually have no technical replicates for each sample. In addition, it allows very large cumulative shifts, which can potentially lead to alignment of unrelated peaks thereby creating artificial features. This method is successful in its original context, speech recognition, to align speech energy signal spectra with different speaking speeds from one person.

Here, we develop a profile-based alignment approach, using an explicit probability model in conjunction with a Bayesian inferential approach. The approach to alignment described here builds on a previously developed method for curve registration by Reilly *et al*. (2004). Not only does this approach avoid the peak selection step of peak-based methods, it strikes a balance between the profile-based methods described above in that, while being far more flexible than quadratic time warping method, it avoids the pitfalls of methods such as the CMP that require manual specification of many parameters.

The quantitative intensity measurements output by most MALDI-TOF mass spectrometers already have the baseline subtracted from the intensity by the machine's internal algorithm (Yasui *et al*., 2006). Thus, further baseline subtraction will not be considered here.

Due to variation in sample preparation and deposition on the target, matrix crystallization and ion detection, overall abundance measurements often vary by at least 4-fold for different samples (Wu, 2004). The proposed alignment method depends on the spectra's abundance; therefore, abundance variations need to be minimized. Many normalization methods have been proposed to solve this problem. Here, we simply apply one scaling factor to each MS run so that the total abundances under the curve is the same for all the samples (Wu, 2004). After this we take the logarithms of the abundance to reduce the variance's dependence on the mean level.

MALDI-TOF MS is a technique in which a co-precipitate of an UV-light absorbing matrix and a biomolecule is irradiated by a nanosecond laser pulse. Biomolecule *i* is turned into gas phase ions with initial velocity *v*_{0i} (the initial velocity approximately follows a normal distribution), and then are accelerated in an electric field with voltage *U* and enter a field-free flight tube. During the flight in this tube, different ionized biomolecules are separated according to their *m*/*z* ratio and reach the detector at different times. The *m*_{i}/*z*_{i} of an ion can be approximated by a quadratic function of its drift time *t*_{i}. Suppose *L* is the drift length, while β_{0}, β_{1} and β_{2} are calibration constants. Then,

The values of *m*/*z* in a TOF-MS are determined by estimating β_{0}, β_{1} and β_{2} using an internal or external calibration to known *m*/*z* values. Typically, the calibration involves measuring a standard sample that contains several species with known *m*/*z* values and fitting quadratic curves to the observed versus known *m*/*z* values of the several species to estimate the coefficients β_{0}, β_{1} and β_{2}. The estimates are then used to determine the corresponding *m*/*z* value of each TOF point. The errors in this approximation can become fairly large (>2%) when the calibration equation is extrapolated beyond the range of masses of the calibrants (Coombes *et al*., 2005).

Figure 1 shows a segment of MS spectra of bronchio-alveolar lavage samples of seven patients receiving lung transplants who experienced transplant rejections (solid lines) and five patients who did not reject for at least 5 years (dash lines). The *x*-axis of Figure 1a is the TOF scale, while the *x*-axis of Figure 1b is the calibrated *m*/*z* scale. Comparing Figure 1a and b, the proteins are not indexed properly by their calibrated *m*/*z* values across multiple spectra. Meanwhile, the calibration potentially introduces bias via mis-specification of the quadratic model and introduces variability through the use of the parameter estimates , and . It may be advantageous to forego calibration and align the spectra using the TOF scale directly. Also the TOF measurements are equal-spaced measurements, which is slightly faster for alignment with respect to computation, as we will note in Section 2.5.2.

Let *x*_{i}(*t*) represent the height (intensity on the log-scale) of the spectrum for subject *i* at time *t* (in the TOF measurement scale). If θ(*t*) is the average spectrum for this patient population at *t*, ξ_{i}(*t*) is the deforming function for this subject at *t* in TOF measurement, and ε_{i}(*t*) is random error, then we assume

for all *t*. Here, we assume that these deforming functions are bijective, continuous and strictly monotone increasing, otherwise ξ_{i}(*t*) is able to completely erase observed peaks in the spectra. These assumptions are natural since we do not think the observed spectra are cut up and are reassembled versions of the underlying signal; rather the need for alignment is that there are slight distortions in the TOF scale. Note that we only observed data on some finite set (*t*_{1}, *t*_{2},…, *t*_{T}) and so ξ_{i}(*t*) need only be defined for *t* , but given that ξ_{i} is assumed monotone, it has an inverse and so *E*[*x*_{i}(ξ_{i}^{−1}(*t*))]=θ(*t*). Thus, we need to define *x*_{i}(*t*) for all *t* in the range of ξ_{i}^{−1}(*t*). Since ξ_{i}(*t*) is also assumed continuous, we need to define *x*_{i}(*t*) for all *t*. Let *E*_{j} for *j* = 1,…, *T* − 1 be the partition of [*t*_{1}=min_{t}, *t*_{T}=max_{t}] defined by the locations of the knots of ξ_{i}(*t*) (use the same partition for all *i*), *t*_{1}, *t*_{2},…, *t*_{T}. So here we define *x*_{i}(*t*)=∑_{j=1}^{T−1} *x*_{i}(*t*_{j})*I*_{{tEj}}. In addition, θ(*t*) will be assumed constant within each *E*_{j} segment.

We parameterize ξ_{i}(*t*) as piecewise linear with knots positioned at the locations (or a subset of the locations) where we have observed data. We also restrict the range of ξ_{i}(*t*) to be in a finite set with cardinality strictly greater than the number of knots. Hence, there are only finitely many possible ξ_{i}(*t*), but since we can increase the cardinality of the range of ξ_{i}(*t*), we can obtain an approximation to any desired level of accuracy.

We assume that the *L*_{2} norms of *x*_{i}(*t*)−θ(ξ_{i}(*t*)) over *E*_{j}, ∫_{Ej} [*x*_{i}(*t*)−θ (ξ_{i}(*t*))]^{2} d*t*, are independently distributed as σ^{2}|*E*_{j}|χ_{1}^{2} (where |*I*| denotes the length of the interval *I*). We also assume that the *L*_{2} norms of ξ_{i}(*t*)−*t* over *E*_{j}, ∫_{Ej} [ξ_{i}(*t*)−*t*]^{2}d*t*, are independently distributed as τ^{2}|*E*_{j}|χ_{1}^{2}. Then the −2 log posterior is given up to a constant by

We then find the posterior modes of ξ_{1}(*t*), ξ_{2}(*t*),…, ξ_{n}(*t*) by minimizing

(1)

subject to

The first constraint will guarantee that ξ_{i} (*t*) is strictly monotone increasing. The motivation for the second constraint is that we want to maintain the shape of the peaks during the alignment process. Because our method aligns the spectra point by point, the peaks can be arbitrarily stretched or compressed along TOF axis as long as the objective function is minimized. By this constraint, the distance (along TOF axis) could be conserved between consecutive points with both intensities no less than *r*_{i}. Thus, the shape of those peaks are conserved. For example, if we believe that most of the biological meaningful peaks' intensities are >80 percent quantile of the local spectrum, then we should set *r*_{i} as the 80 percent quantile of the local spectrum here. The larger the *r*_{i} is, the less the shape of the spectrum will be conserved. Figure 1c shows the spectra after alignment without the second constraint, note that the bell shape of the peaks is lost.

For MALDI-TOF MS, setting *r*_{i} as a local quantile instead of a global quantile is necessary as shown in Figure 2 where the dashed line represents the measured spectrum of a lung transplant patient, the solid curve displays the 80 percent quantile of the nearest 400 points in a symmetric window, and the dotted horizontal line is the global 80 percent quantile. The amplitude of the mass spectrum intensity varies greatly along the TOF scale, only part of the spectrum (between around 10 000 to 20 000, in time order) have intensities higher than the global 80 percent quantile. If we set *r*_{i} as the global 80 percent quantile, the proposed algorithm can only guarantee to keep the shapes of these peaks during the alignment procedure. While if use the local 80 percentile quantile as *r*_{i}, we could keep the shapes of peaks across the whole spectrum.

The mass spectrum of a lung transplant patient. dashed line: spectrum in log scale. Solid curve: 80 percent quantile of the nearest 400 points in a symmetric window; dotted line: global 80 percent quantile.

If we treat θ as missing data and ξ_{i}(*t*) as the parameter, we can use the EM algorithm to search for posterior modes. At the (*k* + 1)-th iteration, in the E-step, we calculate the expectation of the log-likelihood with respect to deforming functions from the *k*-th iteration {ξ_{1}^{(k)},…, ξ_{n}^{(k)}},

in the M-step, maximize

with respect to ξ_{1},…, ξ_{n}, where

To do so we need only to compute the expected value of θ(*t*) and θ(*t*)^{2}, but given all the ξ_{i}(*t*)^{(k)} we can estimate these quantities using . Once these expectations are computed we maximize (minimize) the expected log posterior (expected −2 log posterior) with respect to ξ_{i}(*t*). Given the form of the expected log posterior, we can maximize it with respect to each ξ_{i}(*t*) independently of the rest. These maximization steps are carried out using the DP algorithm developed in Reilly *et al*. (2004), but in this case there is an expectation in the function to optimize that was not present there. The part of the expected log-likelihood relevant for the M-step is

Whereas in Reilly *et al*. (2004) there was no expectation. This makes only a slight difference since

and given ξ_{i}^{(k)} (*t*) we can estimate E[θ(ξ_{i}(*t*))] and E[θ(ξ_{i}(*t*))^{2}] using the corresponding moments of the *x*_{i}(*t*) averaging over *i* using the current values of the deforming function ξ_{i}^{(k)}(*t*) to first deform all curves.

If we treat each of the *n* observed spectrum as the spectrum to which all other spectra are aligned, then we can obtain *n* initial values for the EM algorithm, all of which are likely to be quite good. We could randomly pick any one as the initial value and this is in theory fine because the initial value should not impact the final solution one obtains. Typically, the EM algorithm obtains convergence within 20 iterations here.

One aspect of the spectra investigated here that makes the straightforward implementation of the previously outlined computational strategy difficult is the number of observations for each spectrum. On the other hand, we expect ξ(*t*) to be a slowly varying function, so we conduct our alignment procedure on a much rougher scale than we actually have data. Hence, we first obtain a kernel smoothed estimate of the spectrum at every *K* points (e.g. *K*=8), then apply the algorithm to estimate ξ(*t*) on this rough scale. Then we calibrate the data by defining ξ(*t*) on the fine scale via linear interpolation.

We use DP to obtain an approximation to the solution to the objective function (1). Our DP algorithm implementation only allows determination of an approximate solution because the values of the range of ξ(*t*) must be a discrete set, but since this discrete set can approach the set of real numbers the solution can reach any desired level of accuracy.

Here, we follow the approach of Reilly *et al*. (2004). The basic idea of the application of the DP algorithm to our problem is to relate the minimized value of the approximation to objective function (1), when there are *k* observations, _{k}, to the value of this discrete approximation when there are *k*−1 observations, _{k−1}.

The TOF observations are equally spaced, and if we label them from 1 to *T* and also assume that *x*_{i}(*t*) and θ(*t*) are constant within each *E*_{i} interval, then the objective function (1) could be simplified as

(2)

As we mentioned in Section 2.2, using the TOF scale and thus using objective function (2) instead of (1), we avoid calculating the length and related division for each *E*_{j}, and also the difference of *t*_{j+1}−*t*_{j}, which slightly speeds up the computation. Let _{m} be the set of piecewise linear, strictly increasing functions with nodes at 1, 2,…, *T* such that the values at the nodes are of the form with *v* taking value from set {0, 1,…, *m*(*T*−1)} and *m*>1. Suppose deformation function ξ_{i} maps the *k*-th point of the *i*-th spectrum to the value . To simplify the notation, we let *l*=(*l*_{1}, *l*_{2},…, *l*_{n}) be the vector of the deformed values of the *k*-th point. Such collection of ξ_{i} is denoted as , and the approximation to objective function (2) with *k* nodes in each spectrum is

The cardinality of the domain of ξ_{i} is *T* and the cardinality of the range of it is *mT* − *m* + 1. Because the only bijective map from one set to another with the same cardinality is the identity, we need to set *m*>1 to guarantee *mT* − *m* + 1 greater than *T*. As *m* increases, the gap between possible values of ξ_{i}(*j*) decreases, and this gap can be made arbitrarily small by choosing a sufficiently large value of *m*. For ξ_{i} to be bijective, we require ξ_{i}(1)=1 and ξ_{i}(*T*)=*T*.

Let be the optimal deformation for ξ_{i}_{m}. If , which means the optimal deformation function maps the *k*-th point in the *i*-th spectrum to the value , then the requirement that the deformation does not tear the predictions implies that for *j*=2,…, *k*−1. The DP principle demands that for any optimal deformation , if , then for *j*>*k* − 1 must be the optimal deformation for the initial condition which specifies . Hence,

for *k*=2,…, *T*−1 and . The ξ_{i} in this expression is just a function of *k*, *l*_{i}, *s*_{i} since the expression entails and .

In reported quality control experiments (Yasui *et al*., 2006), observed *m*/*z* values fluctuate from experiment to experiment by approximately ±0.1% to ±0.2% of the true *m*/*z* value. According to *m*/*z*≈β_{2}*t*^{2}, the relative mass error and the relative TOF error satisfy. The largest expected shift in TOF scale is about . Thus, we expect to see that the deformation function ξ_{i}(*t*) looks like Figure 3a instead of Figure 3b. That means the optimal solution path should be near the diagonal area of the plot. Thus, the DP algorithm could restrict its searching region within that area, instead of the whole rectangular array. A global constraint, the Sakoe–Chiba band (Sakoe and Chiba, 1978), could be used here. In our context, if we let , then the Saloe–Chiba band implies

Suppose . Without using this constraint, the DP algorithm will search *l*_{i} from *k* to *T*(*m* − 1) − (*T* − *k*), *k*=2,…, *T*−1 and *i*=1, 2,…, *n*. The DP algorithm's complexity is thus *O*(*mnT*^{2}). After using this constraint, the DP algorithm will search *l*_{i} from *m*[max(2, *k*−*b*)−1] to *m*[min(*b*+*k*, *T*−1)−1], thus the DP algorithm's complexity in each EM iteration is reduced to *O*(*bmnT*). In the MS context, *T* is length of the measurement sequence, around 50 000 measurement points each spectrum. Thus, using the Sakoe–Chiba band here speeds up the computation substantially.

Sakoe–Chiba Band. (**a**) A possible deformation path, (**b**) an non-allowed deformation path, (**c**) is too large, (**d**) is suitable and (**e**) is too small. In (c–e), diagonal lines define the Sakoe–Chiba band.

The parameter *b*, the largest shift allowed, could be estimated based on the mass accuracy () reported in the MALDI-TOF instrument manual or quality control experiments. For example, Bruker Biflex III MALDI-TOF Mass Spectrometer measures up to 50 000 spectrum points, and the dwell time (DW) per data point can be set as 1, 2, 4 and 10 ns. Suppose we measures 50 000 of spectrum points with DW 1 ns. If the mass accuracy, , is 0.15%, then the maximal allowed shift, *b*, is ns. In the algorithm, is the only parameter that needs to be set manually. Here, is the ratio of the intensity variance (on the log scale) to the deformation variance. If is relatively large, then only very small deformations (shifts) are allowed. In this case, the optimal solution will look like Figure 3c, and the deformation will not be large enough to find a suitable alignment. If is relatively small, then very large shifts are allowed. The optimal solution will look like Figure 3e, and the alignment will entail shifts that are inconsistent with the magnitude of shifts reported in MALDI-TOF quality control experiments. Figure 3d shows the ideal situation, where enough deforming is allowed within the predetermined band *b*. This motivates us to set as the smallest value so that the solution lies within the band defined by the parameter *b*. In practice, we start from a relatively small value. If one of the solution curves hits the band boundary, the algorithm will stop the alignment procedure, increase value by 10%, and then redo the alignment. The algorithm repeats the previous step until the solution lies within the band defined by the parameter *b*.

After aligning a set of MS spectra, researchers first attempt to compare the aligned mass spectra to identify features (peaks) corresponding to underlying differentially expressed proteins or protein fragments between different patient populations. Suppose the mean spectrum of the first population is θ_{1}(*t*), the mean spectrum of the second population is θ_{2}(*t*) and their absolute difference is θ_{diff}(*t*)=|θ_{1}(*t*)−θ_{2}(*t*)|. Clearly our scientific interest, the underlying differentially expressed proteins, corresponds to peaks in θ_{diff}(*t*).

Two natural values to quantify the size of a peak are its height and its area. The area of a peak is a more accurate measure of the corresponding protein's abundance than the height. If there were no variability in the initial velocities, then all ions with the same mass and charge would strike the detector at the same instant, the TOF spectrum would show a spike at the TOF corresponding to each protein, with a height equal to the protein's concentration. In actual MALDI-TOF spectra with the presence of a spread of initial velocities, we observe bell-shaped peaks rather than one-dimensional spikes (Coombes *et al*., 2005; House *et al*., 2006). Thus, our focus is on the area under the peak.

We identify peaks by the use of the continuous wavelet transform (CWT) method proposed by Du *et al*. (2006), which allows wavelet transforms at every scale with continuous translation. CWT uses the Mexican hat wavelet as the mother wavelet. The Mexican hat wavelet with scale *w* best matches the peaks with a width of 2*w*. Thus, for a peak in a MS spectrum, the corresponding CWT coefficient reaches a maximum around the peak center when the scale best matches the peak width. The peak width can be estimated based on the CWT scale corresponding to the maximum point of the CWT coefficient.

Suppose there are *n*_{1} patients from population 1 and *n*_{2} from population 2. The corresponding spectrum of each patient after alignment is *X*_{ij}^{aligned}(*t*), *i*=1, 2, *j*=1,…, *n*_{i}. The procedure of identifying peak features corresponding to differentially expressed proteins or protein fragments between two patient populations is as follows:

- After alignment (align all the spectra from different patient populations together), calculate the mean spectrum of each population, . And then calculate the absolute mean spectrum difference, .
- Identify peaks in the absolute mean spectrum difference by CWT method. Record the center location
*c*_{k}and the scale*w*_{k}of each peak from the CWT output, the corresponding peak area is approximately within [*c*_{k}−*w*_{k},*c*_{k}+*w*_{k}],*k*=1,…,*m*, where*m*is the number of identified peaks. - For each identified peak, calculate its abundance in each spectrum
*X*_{ij}^{aligned}(*t*),*a*_{ij}^{k}=∑_{l=ck−wk}^{ck+wk}*X*_{ij}^{aligned}(*l*),*i*=1, 2,*j*=1,…,*n*_{i},*k*=1,…,*m*. - For each identified peak, conduct a two sample
*t*-test (or some non-parametric test) to compare*a*_{1j}^{k},*j*=1,…,*n*_{1}and*a*_{2j}^{k},*j*=1,…,*n*_{2}, and record the corresponding*P*−value,*p*_{k},*k*=1,…,*m*. - Apply the Benjamini–Hochberg procedure (Benjamini and Hochberg, 1995) to
*p*_{1},…,*p*_{m}. The corresponding ordered*P*-values are*p*_{(1)}≤*p*_{(2)}≤···≤*p*_{(m)}. For controlling the FDR at level α, identify*k*, such that,Then report all peaks corresponding to*p*_{(j)},*j*=1,…,*k*as differentially expressed, which can subsequently be mapped to differentially expressed proteins.

We aligned the bronchio-alveolar lavage sample mass spectra of 12 patients receiving lung transplants. A segment of the spectra after alignment is shown in Figure 1d. We denote spectrum *j* in group *i* as *S*_{j}^{i}. Group 1 includes seven patients receiving lung transplants who experienced transplant rejections [i.e. bronchiolitis obliterans syndrome (BOS)] and Group 2 includes five controls (i.e. lung transplant recipients who did not reject for at least 5 years, most rejections occur by this point). We calculate the square of Pearson's correlation between each sample with the corresponding group mean spectrum before and after the alignment, denoted *r*_{raw}^{2} and *r*_{aligned}^{2}, respectively. The ratios of them, *r*_{aligned}^{2}/*r*_{raw}^{2}, range from 1.043 to 1.330 as shown in Table 1. The correlation between each patient spectrum and the corresponding mean spectrum after alignment has increased, thereby indicating that a better alignment has been achieved. We also used the PTW method to align the spectra. The PTW Matlab routine, by P. Eilers (Medical Centre, University Leiden, The Netherlands), is available in warpGUI Matlab program. The ratios of the square of Pearson's correlation between each sample with the corresponding group mean spectrum before and after the alignment using PTW, *r*_{aligned−PTW}^{2}/*r*_{raw}^{2}, range from 1.003 to 1.008 as shown in Table 1. The superior performance of the proposed method demonstrates, by its greater flexibility, to capture non-linear distortion of the TOF (or *m*/*z*) axis.

We also have data from another study that used a slightly different protocol for obtaining the spectra using a MALDI-TOF instrument. The protein profile was obtained in the Bruker Biflex III mass spectrometer operating in linear mode with external calibration to the +1 and +2 charge states of Cytochrome C. The mass accuracy in linear mode with external calibration is 1500 p.p.m. (). For each patient, 50 000 of spectrum points were measured. We have 32 chronic lung transplant recipients who experienced rejections and 47 who did not have a rejection. After alignment, we use the procedure described in Section 2.6 to identify peak features corresponding to differentially expressed proteins. Ninety-three peaks are identified, and 80 peaks are significant while controlling the FDR at level 0.05. Figure 4 shows a segment of the spectra. There are seven significant peaks identified within this segment (Fig. 4c). The three peaks with the highest intensities are Human neutrophil defensin peptides (HNP) peaks (*m*/*z* = 3371, 3441 and 3485), and have been shown to be highly expressed in lung transplant recipients with BOS, while infrequently detected in the control population that did not develop BOS (Zhang *et al*., 2005).

A segment spectra of 32 chronic lung transplant recipients who experienced rejection versus 47 recipients without rejection. (**a**) Spectra after alignment. Recipients with rejection: dashed curves. Recipients without rejection: dotted curves. (**b**) Mean spectrum **...**

We performed the 1000 permutation tests, each time randomly assigning 32 spectra to the disease group and 47 spectra to the control group. Out of 1000 permutations, 958 do not identify any significant peaks, while we expect to get false discoveries ~5% of the time because we are controlling the FDR at 5%.

We also did sensitivity analysis to investigate how the value of parameter *b* influenced the peak identification result. Since and the number of measured spectrum points is fixed, it only varies with mass accuracy, . In our experiment, the mass accuracy varied from 0% (no alignment) to 0.6%, and the results are shown in Table 2. When the value of used to determine the parameter *b* departs slightly (0.1% and 0.2%) from the mass accuracy value given by equipment manual (here 0.15%), the results show hardly any difference. When the value of is far away (like 0.5% and 0.6%) from 0.15%, some peaks are lost whereas some artificial peaks are created.

Most MS-based proteomic data analysis use a two-step approach: identify peaks first and then do the alignment and statistical inference on these identified peaks only. However, numerous additional features such as peak shape or width are lost in the peak detection step, which are informative for correcting TOF axis distortions in the alignment step. Also the performance of the overall alignment process and statistical analysis strongly depends on the performance of the peak detection step. It has been shown that the use of inadequate or ineffective methods in the peak detection step may make it difficult to extract meaningful biological information in the subsequent analysis (Morris *et al*., 2005). In fact, important differences in low-intensity peaks or on shoulders of peaks can be missed by the peak detection algorithms (Morris *et al*., 2007). An alternative way is to compare the data-rich MALDI-TOF MS spectra directly. A major technical difficulty in the direct comparison is the variation in the TOF axis, only after this alignment can individual features in different experiments be compared directly. Here, we present a novel Bayesian alignment approach to align the complete mass spectra. The proposed method is capable of accurately capturing various distortions of *m*/*z* (or TOF) axes and does not require manual specification of important parameters. It circumvents the limitations of current alignment methods and supplies researchers a tool to exploit the full potential of MS data.

After the alignment, researchers can directly compare the data-rich MALDI-TOF MS spectra. Pointwise comparison has been proposed (Datta and DePadilla, 2006; Morris *et al*., 2007). However, due to the large number (a typical acquisition involves tens of thousands of points) of simultaneous tests, controlling error for multiple comparisons must be addressed in such a situation. Morris *et al*., (2007) computed the pointwise posterior probabilities of at least a δ-fold intensity change at each spectral location. The threshold on the posterior probabilities for flagging a location as significant is based on setting the expected Bayesian FDR at a prespecified level α. Datta and DePadilla (2006) used marginal *P*-values obtained from *t*-tests for testing the intensity differences at each *m*/*z* ratio in the cancer versus non-cancer samples. They studied the effect of selecting a cutoff FDR on the performance of the clustering and classification algorithms using the significant features.

There is a problem with the above pointwise comparison and controlling error for multiple comparisons procedure [a similar problem arises in neuroimaging (Chumbley and Friston, 2009; Heller *et al*., 2006)]. First, the key issue here is that we are not making inferences about points but about the regions corresponding to an underlying differentially expressed protein. This means that controlling the FDR on the level of points is unrelated. The crucial thing to control is the FDR of the features we are making inferences about, which is the peaks identified from the mean spectra difference. Second, though the expected number of falsely discovered points can be controlled, the expected number of falsely discovered peaks cannot be controlled at all. Third, dealing with tens of thousands of simultaneous statistical tests requires adjusting the *P*-values for multiple comparisons, imposing high statistical thresholds that may uncover only the points with the very highest intensity difference but mask others that do differentiate across patients populations. For example, in our lung transplant data, if we control the pointwise FDR at level 0.05, none of the adjusted *p*-values is significant at all. Here, we propose to do comparison and multiple-comparison adjustment on the level of identified peaks in the absolute mean spectra difference. We think this is the best way to identify peak features corresponding to differentially expressed proteins. Also detecting peaks from mean spectra is superior than detecting peaks from individual spectrum, which has been thoroughly discussed in Morris *et al*. (2005). First, the noise in the mean spectrum decreases by , where *n* is the sample size, thus detecting peaks from the mean spectrum is intrinsically more sensitive. Second, small consistent peaks are more easily shown in the mean spectrum. Morris *et al*. (2005) had demonstrated that using the mean spectrum will increase the sensitivity of peak detection both in real data examples and simulation studies.

The proposed alignment method is very effective for low mass accuracy data with respect to reaching a better alignment. For the data with 5–50 p.p.m. accuracy levels of recent instruments, a better alignment of the spectra is still possible using our approach. This method is still relevant for old datasets and may have value for use by those using older machines or pushing the limits of MS with novel extensions. Additionally, sometimes we need to compare spectra from different laboratories (with machines that have different accuracy) or obtained at different times, which will frequently be severely misaligned. These misalignments will be handled with our method.

We would like to thank Chris Wendt and Gary Nelsestuen for the data.

*Funding*: National Institutes of Health (grant P01-AI074340); University of Minnesota graduate dissertation fellowship.

*Conflict of Interest*: none declared.

- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995;57:289–300.
- Chumbley JR, Friston KJ. False discovery rate revisited: FDR and topological inference using Gaussian random fields. Neuroimage. 2009;44:62–70. [PubMed]
- Coombes KR, et al. Understanding the characteristics of mass spectrometry data through the use of simulation. Cancer Inform. 2005;1:41–52. [PMC free article] [PubMed]
- Datta S, DePadilla L. Feature selection and machine learning with mass spectrometry data for distinguishing cancer and noncancer samples. Stat. Methodol. 2006;3:79–92.
- Du P, et al. Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching. Bioinformatics. 2006;22:2059–2065. [PubMed]
- Eilers PH. Parametric time warping. Anal. Chem. 2004;76:404–411. [PubMed]
- Heller R, et al. Cluster based analysis of fMRI data. Neuroimage. 2006;33:599–608. [PubMed]
- House L, et al. Discussion Paper 2006-24. Duke University Department of Statistical Science; 2006. Nonparametric models for peak identification and quantification in mass spectroscopy, with application to MALDI-TOF.
- Listgarten J, et al. Multiple alignment of continuous time series. In: Saul LK, et al., editors. Advances in Neural Information Processing Systems. Vol. 17. Cambridge, MA: MIT Press; 2005.
- Morris JS, et al. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics. 2005;21:1764–1775. [PubMed]
- Morris J, et al. Bayesian analysis of mass spectrometry proteomic data using wavelet-based functional mixed models. Biometrics. 2007;64:479–489. [PMC free article] [PubMed]
- Reilly C, et al. Using image and curve registration for measuring the goodness of fit of spatial and temporal predictions. Biometrics. 2004;60:954–964. [PubMed]
- Sakoe H, Chiba S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. ASSP. 1978;26:43–49.
- Tibshirani R, et al. Sample classification from protein mass spectrometry, ‘by peak probability contrasts’ Bioinformatics. 2004;20:3034–3044. [PubMed]
- Vandenbogaert V. Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics. 2008;8:650–674. [PubMed]
- Wu BL. Statistical methods in analyzing mass spectrometry dataset. Dissertation. 2004
- Yasui Y, et al. Profiling high-dimensional protein expression using MALDI-TOF mass spectrometry for biomarker discovery. In: Crowley J, Ankerst DP, editors. Handbook of Statistics in Clinical Oncology. 2. New York: Chapman-Hall/CRC; 2006.
- Yu W, et al. Multiple peak alignment in sequential data analysis: a scale-space based approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 2006;3:208–219. [PubMed]
- Zhang Y, et al. Analysis of chronic lung transplant rejection by MALDI-TOF profiles of bronchoalveolar lavage fluid. Proteomics. 2005;6:1001–1012. [PubMed]

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |