Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2772212

Formats

Article sections

Authors

Related links

Anal Chim Acta. Author manuscript; available in PMC 2010 August 26.

Published in final edited form as:

Published online 2009 July 5. doi: 10.1016/j.aca.2009.06.064

PMCID: PMC2772212

NIHMSID: NIHMS133219

Donald A. Barkauskas,^{}^{a} Scott R. Kronewitter,^{b} Carlito B. Lebrilla,^{b} and David M. Rocke^{c}

The publisher's final edited version of this article is available at Anal Chim Acta

See other articles in PMC that cite the published article.

Matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry is a technique for high mass-resolution analysis of substances that is rapidly gaining popularity as an analytic tool. Extracting signal from the background noise, however, poses significant challenges. In this article, we model the noise part of a spectrum as an autoregressive, moving average (ARMA) time series with innovations given by a generalized gamma distribution with varying scale parameter but constant shape parameter and exponent. This enables us to classify peaks found in actual spectra as either noise or signal using a reasonable criterion that outperforms a standard threshold criterion.

Matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI FT-ICR MS) is a technique for high mass-resolution analysis of substances that is rapidly gaining popularity as an analytic tool in proteomics. Typically in MALDI FT-ICR MS, a sample (the *analyte*) is mixed with a chemical that absorbs light at the wavelength of the laser (the *matrix* ) in a solution of organic solvent and water. The resulting solution is then spotted on a MALDI plate and the solvent is allowed to evaporate, leaving behind the matrix and the analyte. A laser is fired at the MALDI plate and is absorbed by the matrix. The matrix becomes ionized and transfers charge to the analyte, creating the ions of interest (with fewer fragments than would be created by direct ablation of the analyte with a laser). The ions are guided with a quadrupole ion guide into the ICR cell where the ions cyclotron in a magnetic field. While in the cell, the ions are excited and ion cyclotron frequencies are measured. The angular velocity, and therefore the frequency, of a charged particle is determined solely by its mass-to-charge ratio. Using Fourier analysis, the frequencies can be resolved into a sum of pure sinusoidal curves with given frequencies and amplitudes. The frequencies correspond to the mass-to-charge ratios and the amplitudes correspond to the concentrations of the compounds in the analyte. FT-ICR MS is known for high mass resolution, with separation thresholds on the order of 10^{−3} Daltons (Da) or better [1, 2].

The spectra analyzed in this article were recorded on an external source MALDI FT-ICR instrument (HiResMALDI, IonSpec Corporation, Irvine, CA) equipped with a 7.0 T superconducting magnet and a pulsed Nd:YAG laser 355 nm. In addition to hundreds of spectra generated as described above for a cancer study [3] using human blood serum as the analyte, we generated 56 spectra using neither analyte nor matrix. We will refer to the latter category of spectra as “noise spectra” and use them in Sections 2 and 3 to develop our model, then apply the model to a spectrum with known contents in Section 4.

We find that an autoregressive, moving average (ARMA) time series with innovations given by a generalized gamma distribution can closely model the properties of the noise spectra, and that this representation is useful for accurately identifying real substances in spectra produced using analyte. The modeling assumptions developed in this article are implemented in the *R* package
`FTICRMS`, available either from the Comprehensive *R* Archive Network (http://www.r-project.org/) or from the first author.

A typical noise spectrum is shown in Figure 1 with frequency in kilohertz (kHz) plotted on the horizontal axis. (In the mass spectrometry literature, it is more usual to see *m/z*—the mass-to-charge ratio—on the horizontal axis, but the actual process of measurement uses equally-spaced frequencies, and the *m/z* values are computed using one of several non-linear transformations on the frequencies [4]. Thus, the spectrum pictured in Figure 1 is how it appears after the Fast Fourier Transform is applied to the measured data.) The thick spike at a frequency of roughly 40 kHz is actually two peaks at frequencies 41.21 kHz and 42.21 kHz which extend upward to intensities of approximately 222.7 and 95.4, respectively, and are apparently instrumental noise—they appear in all 56 noise spectra at roughly the same spots and have no isotope peaks. In the analysis that follows, we set the values of the spectra at frequencies corresponding to these two peaks to be missing.

We start by considering two striking properties of the noise spectra. The first property is the special forms of the sample autocorrelation function (ACF) and sample partial autocorrelation function (PACF) of the noise spectra; Figure 2 displays the graphs of the sample ACF and sample PACF of the noise spectrum from Figure 1. Starting with lag 7, the sample ACF is nearly constant at roughly 0.0613. The sample PACF, on the other hand, oscillates between positive and negative values before decaying to a small positive value. As we show in Section 2.3, the sample ACF enables us to get information not only about the baseline but also about the coefficients to use in the ARMA representation of the spectrum. The sample PACF will be useful for evaluating the final ARMA model for accuracy.

The second property comes from looking at the sample “homogenized” cumulants ${\widehat{\kappa}}_{1}^{\prime},{\widehat{\kappa}}_{2}^{\prime}$, … of the spectrum. (The sample homogenized cumulants of a set of data are related to the mean, variance, skewness, kurtosis, etc., of the data and will be defined precisely in Section 2.4, Equation (5).) Figure 3 displays scatterplots of the running sample homogenized cumulants (with bandwidth 4001 points—other bandwidths give similar plots) of the noise spectrum from Figure 1. It is clear that the first three sample homogenized cumulants have strong relationships. As we show in Section 2.4, this enables us to get information about the proper parameters to use in the generalized gamma distribution for the innovations in the ARMA representation of the spectrum.

The sample ACF * _{k}* at lag

$${\widehat{r}}_{k}=\frac{{\sum}_{t=k+1}^{n}({y}_{t}-\overline{y})({y}_{t-k}-\overline{y})}{{\sum}_{t=1}^{n}{({y}_{t}-\overline{y})}^{2}},$$

(1)

where is the sample mean. This is usually defined for stationary time series, in which (among other criteria) the means {*μ _{t}*} of the underlying random variables {

Thus, suppose that
${Y}_{t}\sim ({\mu}_{t},{\sigma}_{t}^{2})$ with known means
${\{{\mu}_{t}\}}_{t=1}^{n}$ and suppose that the correlation between *Y _{t}* and

$$\begin{array}{l}{\stackrel{\sim}{\rho}}_{k}=\frac{\mathbb{E}\{({Y}_{t}-{\mu}_{t})({Y}_{t-k}-{\mu}_{t-k})\}}{\sqrt{\mathbb{E}\{{({Y}_{t}-{\mu}_{t})}^{2}\}}\xb7\sqrt{\mathbb{E}\{{({Y}_{t-k}-{\mu}_{t-k})}^{2}\}}}\\ {\stackrel{\sim}{\rho}}_{k}\sum _{t=1}^{n}{({y}_{t}-{\mu}_{t})}^{2}\approx \sum _{t=k+1}^{n}({y}_{t}-{\mu}_{t})({y}_{t-k}-{\mu}_{t-k}),\end{array}$$

(2)

where is the expected value operator. We subtract the right-hand side of Equation (2) from the left and add the result to the numerator in Equation (1). This gives us

$${\widehat{r}}_{k}\approx \frac{{\sum}_{t=k+1}^{n}({y}_{t}-\overline{y})({y}_{t-k}-\overline{y})+{\stackrel{\sim}{\rho}}_{k}{\sum}_{t=1}^{n}{({y}_{t}-{\mu}_{t})}^{2}-{\sum}_{t=k+1}^{n}({y}_{t}-{\mu}_{t})({y}_{t-k}-{\mu}_{t-k})}{{\sum}_{t=1}^{n}{({y}_{t}-\overline{y})}^{2}}.$$

(3)

We can use Equation (3) to approximate * _{k}* when the underlying correlations

$$\begin{array}{l}{\widehat{r}}_{k}\approx \frac{{\sum}_{t=k+1}^{n}({y}_{t}-\overline{y})({y}_{t-k}-\overline{y})}{{\sum}_{t=1}^{n}{({y}_{t}-\overline{y})}^{2}}\\ \approx \frac{{\sum}_{t=k+1}^{n}({\mu}_{t}+{\epsilon}_{t}-\overline{\mu})({\mu}_{t-k}+{\epsilon}_{t-k}-\overline{\mu})}{{\sum}_{t=1}^{n}{({y}_{t}-\overline{y})}^{2}}\\ \approx \frac{{\sum}_{t=k+1}^{n}({\mu}_{t}-\overline{\mu})({\mu}_{t-k}-\overline{\mu})+{\sum}_{t=k+1}^{n}{\epsilon}_{t}{\epsilon}_{t-k}}{{\sum}_{t=1}^{n}{({y}_{t}-\overline{y})}^{2}}\\ {\widehat{r}}_{k}\approx \frac{{\rho}_{\mu}(k)\xb7\text{Var}(\{{\mu}_{t}\})}{\text{Var}(\{{y}_{t}\})},\end{array}$$

(4)

where *ρ _{μ}*(

Furthermore, we note that the sample ACF pictured in Figure 2 reaches the estimated value of 0.0622 for *k* ≥ 7 but is larger than that for *k* ≤ 6. That suggests that the underlying correlations * _{k}* are nonzero for

The *cumulants κ _{n}*(

$$log(\mathbb{E}{e}^{\mathit{itX}})=\sum _{n=1}^{\infty}{\kappa}_{n}(X)\xb7\frac{{(it)}^{n}}{n!}.$$

The most important property that cumulants have is that for independent random variables *X* and *Y*, we have *κ _{n}*(

For ease of presentation, we also introduce “homogenized cumulants” ${\kappa}_{n}^{\prime}(X)$ as

$${\kappa}_{n}^{\prime}(X)=\text{sign}\{{\kappa}_{n}(X)\}\xb7\mid {\kappa}_{n}(X){\mid}^{1/n},$$

(5)

which are translation-equivariant for *n* = 1, translation-invariant for *n* ≥ 2, and homogeneous of degree 1 for all *n*. (It is these quantities, not the actual cumulants, that are plotted in Figure 3.)

For the remainder of this article, we will use capital letters to denote random variables and corresponding lower-case letters to denote realizations of those random variables. Thus, for example, the spectrum pictured in Figure 1 is {*y _{t}*} for the time series {

As Figure 3 shows,
${\widehat{\kappa}}_{2}^{\prime}({Y}_{t})/{\widehat{\kappa}}_{1}^{\prime}({Y}_{t}),{\widehat{\kappa}}_{3}^{\prime}({Y}_{t})/{\widehat{\kappa}}_{1}^{\prime}({Y}_{t})$, and
${\widehat{\kappa}}_{3}^{\prime}({Y}_{t})/{\widehat{\kappa}}_{2}^{\prime}({Y}_{t})$ are all nearly constant across the entire spectrum. Let {*Ỹ _{t}*} be the de-trended series obtained from {

Thus, there is good evidence that the de-trended series {*Ỹ _{t}*} is, in fact, stationary. If {

$$\begin{array}{l}{\kappa}_{n}({\stackrel{\sim}{Y}}_{t})={\kappa}_{n}\left(\sum _{k=0}^{\infty}{\psi}_{k}{X}_{t-k}\right)\\ =\sum _{k=0}^{\infty}{\psi}_{k}^{n}{\kappa}_{n}({X}_{t-k})\\ ={m}_{n}{\kappa}_{n}({X}_{t})\\ {\kappa}_{n}^{\prime}({\stackrel{\sim}{Y}}_{t})=\text{sign}({m}_{n})\mid {m}_{n}{\mid}^{1/n}{\kappa}_{n}^{\prime}({X}_{t}),\end{array}$$

where
${m}_{n}\equiv {\sum}_{k=0}^{\infty}{\psi}_{k}^{n}$. In particular, the innovations {*X _{t}*} will also have proportional cumulants.

Thus, we should look for a distribution for the innovations for which the second and third cumulants remain proportional to the mean as the mean varies. One such distribution is given by the *generalized gamma distribution* with exponent *α*, shape parameter *β*, and scale parameter ζ* _{t}*:
${X}_{t}^{1/\alpha}\sim \mathrm{\Gamma}(\beta ,{\zeta}_{t})$. This distribution has probability density function given by

$${f}_{\alpha ,\beta ,{\zeta}_{t}}(x)=\frac{\alpha {\zeta}_{t}^{-\beta}{x}^{\alpha \beta -1}exp(-{x}^{\alpha}/{\zeta}_{t})}{\mathrm{\Gamma}(\beta )},\phantom{\rule{0.38889em}{0ex}}x\ge 0,$$

(6)

where $\mathrm{\Gamma}(\beta )={\int}_{0}^{\infty}{u}^{\beta -1}{e}^{-u}du$ is the standard gamma function. Easy calculations show that

$$\begin{array}{l}{\kappa}_{1}({X}_{t})=\frac{{\zeta}_{t}^{1/\alpha}}{\mathrm{\Gamma}(\beta )}\xb7\mathrm{\Gamma}(\beta +1/\alpha )\\ {\kappa}_{2}({X}_{t})=\frac{{\zeta}_{t}^{2/\alpha}}{{\mathrm{\Gamma}}^{2}(\beta )}\xb7\{\mathrm{\Gamma}(\beta )\mathrm{\Gamma}(\beta +2/\alpha )-{\mathrm{\Gamma}}^{2}(\beta +1/\alpha )\}\\ {\kappa}_{3}({X}_{t})=\frac{{\zeta}_{t}^{3/\alpha}}{{\mathrm{\Gamma}}^{3}(\beta )}\xb7\{{\mathrm{\Gamma}}^{2}(\beta )\mathrm{\Gamma}(\beta +3/\alpha )-3\mathrm{\Gamma}(\beta )\mathrm{\Gamma}(\beta +1/\alpha )\mathrm{\Gamma}(\beta +2/\alpha )+2{\mathrm{\Gamma}}^{3}(\beta +1/\alpha )\}\end{array}$$

In particular, note that
${\kappa}_{2}^{\prime}({X}_{t})/{\kappa}_{1}^{\prime}({X}_{t}),{\kappa}_{3}^{\prime}({X}_{t})/{\kappa}_{1}^{\prime}({X}_{t})$, and
${\kappa}_{3}^{\prime}({X}_{t})/{\kappa}_{2}^{\prime}({X}_{t})$ are functions of *α* and *β* only and do not depend on *ζ _{t}*. By the homogeneity and additivity properties of cumulants, the homogenized cumulants for

In order to do this, we first need to find the causal representation of the ARMA process for the noise spectrum. We start by estimating the order of the process and the coefficients by looking at {
${y}_{t}^{\prime}$}. Using the ARMA-fitting function
`arima` in *R* (which does maximum-likelihood estimation), we tried all possible ARMA(*p*,*q*) models for *p* + *q* ≤ 7 and chose the one that maximized the modified Akaike’s information criterion (AIC* _{C}*) of [6]:

$${\text{AIC}}_{C}=-2lnL+\frac{2(p+q+1)n}{n-p-q-2},$$

where *L* is the log-likelihood and *n* is the number of data points. The best model using this version of AIC was an ARMA(1,5) process:

$${\stackrel{\sim}{Y}}_{t}={\phi}_{1}{\stackrel{\sim}{Y}}_{t-1}+{X}_{t}+{\theta}_{1}{X}_{t-1}+{\theta}_{2}{X}_{t-2}+{\theta}_{3}{X}_{t-3}+{\theta}_{4}{X}_{t-4}+{\theta}_{5}{X}_{t-5}.$$

(7)

For such a model, the causal representation

$${\stackrel{\sim}{Y}}_{t}=\sum _{k=0}^{\infty}{\psi}_{k}{X}_{t-k}$$

has coefficients given recursively by

$$\begin{array}{l}{\psi}_{0}=1\\ {\psi}_{n+1}={\phi}_{1}{\psi}_{n}+{\theta}_{n+1}.\end{array}$$

Note that since *θ _{k}* = 0 for

$$\begin{array}{l}{m}_{n}=\sum _{k=0}^{\infty}{\psi}_{k}^{n}\\ =\sum _{k=0}^{4}{\psi}_{k}^{n}+\sum _{k=5}^{\infty}{({\phi}_{1}^{k-5}{\psi}_{5})}^{n}\\ =\sum _{k=0}^{4}{\psi}_{k}^{n}+{\psi}_{5}^{n}\sum _{j=0}^{\infty}{({\phi}_{1}^{n})}^{j}\\ =\sum _{k=0}^{4}{\psi}_{k}^{n}+\frac{{\psi}_{5}^{n}}{1-{\phi}_{1}^{n}};\end{array}$$

and we can use the recursion to write *ψ*_{0}, …, *ψ*_{5} in terms of _{1}, *θ*_{1}, …, *θ*_{5}.

It should be noted that the
`arima` command in *R* assumes normal innovations, and therefore there might be some bias in our estimation of the ARMA coefficients. However, as Li and McLeod [7] observed, for large sample sizes the normal assumption introduces an extremely small amount of bias. We confirmed this by generating 50 spectra using the ARMA(1,5) model with generalized gamma innovations and mean derived from the running means with bandwidth 4001 points of the noise spectrum in Figure 1. We then compared the original ARMA coefficients to those obtained by dividing each simulated spectrum by its running means with bandwidth 4001 points and applying the
`arima` command in *R*. The average absolute bias was less than 1% for each of the six coefficients, and each of the six 95% confidence intervals as well as the joint 95% confidence interval contained the original parameters. Thus, any bias in the estimation of the ARMA coefficients introduced by not assuming generalized gamma innovations is probably minimal.

We can then use the ARMA coefficients _{1}, *θ*_{1}, …, *θ*_{5} along with the running cumulants of the spectrum to estimate the parameters *α* and *β*. Let *r _{jk}* be the least-squares estimate of
$\{{m}_{j}^{-1/j}{\widehat{\kappa}}_{j}^{\prime}({Y}_{t})\}/\{{m}_{k}^{-1/k}{\widehat{\kappa}}_{k}^{\prime}({Y}_{t})\}$. In Figure 3, it appears that

$${r}_{21}=\frac{\sqrt{\mathrm{\Gamma}(\beta )\mathrm{\Gamma}(\beta +2/\alpha )-{\mathrm{\Gamma}}^{2}(\beta +1/\alpha )}}{\mathrm{\Gamma}(\beta +1/\alpha )}$$

(8)

$${r}_{32}=\frac{\sqrt[3]{{\mathrm{\Gamma}}^{2}(\beta )\mathrm{\Gamma}(\beta +3/\alpha )-3\mathrm{\Gamma}(\beta )\mathrm{\Gamma}(\beta +1/\alpha )\mathrm{\Gamma}(\beta +2/\alpha )+2{\mathrm{\Gamma}}^{3}(\beta +1/\alpha )}}{\sqrt{\mathrm{\Gamma}(\beta )\mathrm{\Gamma}(\beta +2/\alpha )-{\mathrm{\Gamma}}^{2}(\beta +1/\alpha )}}$$

(9)

(Note that *r*_{21} is an estimate of the coefficient of variation of the innovations, and *r*_{32} is an estimate of the cube root of the skewness of the innovations.) The scale parameter *ζ _{t}* is then given by

$${\zeta}_{t}={\left\{\frac{{\mu}_{t}\xb7\mathrm{\Gamma}(\beta )}{{m}_{1}\xb7\mathrm{\Gamma}(\beta +1/\alpha )}\right\}}^{\alpha}.$$

We applied the methods from Section 2 to each of the 56 noise spectra. The values of the ARMA coefficients _{1}, *θ*_{1}, …, *θ*_{5} estimated from {*Ỹ _{t}*}; the homogenized cumulant ratios

In addition, we see that the sample ACF and sample PACF of the simulated spectrum match those of the actual noise spectrum quite well. The sample ACF of a spectrum simulated from an MA(6) model also closely matches the sample ACF of the noise spectrum, but the sample PACFs are noticeably different. Figure 5 shows the sample PACF of the simulated spectrum from Figure 4 along with the sample PACF of a spectrum simulated from an MA(6) model. It is clear that the sample PACF of the MA(6) model does not decay nearly as quickly as the sample PACF of the noise spectrum, but the sample PACF of the spectrum simulated from the ARMA(1,5) model matches very well.

Another result of the ARMA representation of the noise is the explanation of an interesting phenomenon that had been previously observed in MALDI FT-ICR spectra. Barkauskas, *et al*. [3] used a criterion for peak location and quantification that involved taking a shifted logarithm of baseline-corrected data, then finding five consecutive points which, when fitted with a quadratic function, had a negative coefficient for the quadratic term and a correlation satisfying *r*^{2} ≥ 0.98 (see Figure 6). They observed that typical MALDI FT-ICR spectra have roughly 104,000 such non-overlapping peaks, which clearly indicates that they must be mostly noise and not actual compounds. With spectra simulated as in this article, we get an average of approximately 98,000 such peaks in each spectrum, so it turns out that the proliferation of peaks is probably largely due to the combination of the ARMA(1,5) model and the choice of distribution for the innovations. (Spectra simulated from the same ARMA(1,5) model using normal innovations had only 70,000 such peaks on average, illustrating the dependence on the distribution used for the innovations. Spectra simulated from an MA(6) model with generalized gamma innovations had only 90,000 such peaks on average, which serves as further confirmation that the ARMA(1,5) model is superior.)

As an application of the techniques developed in the previous sections, we use them to detect peaks in a MALDI FT-ICR spectrum (shown in Figure 7) generated with beer as the analyte. (Beer was chosen because of its known composition with highly-structured mass patterns of glycans, the compounds of interest in the analysis in Barkauskas, *et al*. [3]). For a peak-detection criterion, we choose all “large” peaks, which we define as follows: we first take all local maxima in the spectrum which are *k* times the value of a baseline estimated using an improved version of a method of Xi and Rocke [8] (new version of algorithm submitted for publication), where *k* is a constant to be determined. We then use a logarithmic transformation on the data and for each maximum look for a set of five consecutive points containing that maximum which, when fitted with a quadratic function, has a negative coefficient for the quadratic term and a correlation satisfying *r*^{2} ≥ 0.98 as in Barkauskas, *et al*. [3]. (The taking of logarithms is justified because the data has constant coefficient of variation, so it follows from the *δ*-method (see, e.g., Bickel and Doksum [9, Theorem 5.3.3]) that taking the logarithm will approximately variance-stabilize the data, which will allow for the direct application of standard linear statistical models for analysis.)

For this article, we choose *k* to be such that the spectrum being larger than *k* times the estimated baseline is roughly equivalent to being *n* standard deviations above the mean for *n* {4, 4.25, 4.5, 4.75, 5} in an independent, identically-distributed normal situation (i.e., we want the Φ(*n*) quantile of the assumed distribution.) To estimate *k*, we ran 100 simulations of 10^{7} observations of ARMA(1,5) data generated with coefficients from Table 1 and innovations given by a generalized gamma distribution with the exponent and shape parameter from Table 1 and scale parameter equal to 1, then scaled the observations so each sample mean was 1. For each simulation we calculated the observed Φ(*n*) quantile of the data (i.e., *k*) for each of the five choices for *n*. The estimated values of *k* and their standard deviations are displayed in Table 2. The choice of *k* then boils down to a sensitivity/specificity debate. If the primary goal is discovery, then a lower threshold for “large” peaks might be useful; if it is necessary to limit the number of false discoveries, then a higher threshold for “large” peaks would be better.

For comparison, we also looked for “large” peaks using a simple threshold model with the threshold chosen using Tukey’s biweight with *K* = 9 to calculate robust measures of center *c* and scale *s* for the spectrum, then proceeding as above by starting with any local maximum which was at least *c* + 9*s* and looking for parabolic peaks.

We classified the peaks found by any of these methods as being either glycans, fragments of glycans, or unknown peaks. We then further subdivided the unknown peaks into those that had at least one isotope peak that was also detected by at least one method and those for which the main peak was the only peak ever detected. The presence of at least one detected isotope peak virtually guarantees that the peak is an actual compound, while a peak with no isotope peaks detected could be either (i) a real compound whose abundance is so low that its isotope peaks are lost in the noise, or (ii) some type of noise—for example, an electronic spike like those from the noise spectrum in Figure 1. The results of each of these procedures applied to the beer spectrum are summarized in Table 3, along with the same procedures applied to the noise spectrum from Figure 1.

By examining the masses of the peaks detected by each method, it is clear that the threshold method is preferentially selecting peaks at higher masses (lower frequencies). This is due to the fact that the mean levels of MALDI FT-ICR spectra are basically increasing functions of mass, so naturally peaks at the higher masses will have a greater chance of being above the chosen threshold value. What might be surprising at first glance, however, is that the *σ*-equivalent methods are preferentially selecting peaks at lower masses (higher frequencies). If the model is correct, the detected peaks should consist of signal, which should be detected no matter which method is used; and noise, which should be approximately uniformly distributed throughout the spectrum. This apparent contradiction can be resolved by observing that because of the form of the transformation used to calculate mass from frequency, the vast majority of the data points are at low masses. When the masses are translated back to frequencies, we see that the peaks detected in the noise spectrum by the *σ*-equivalent methods are at frequencies that are roughly uniformly distributed throughout the spectrum (Figure 8), as expected. In addition, the *σ*-equivalent methods are clearly doing a better job of not detecting peaks in the noise spectrum. Thus, the *σ*-equivalent methods appear to be the correct ones to use for peak detection.

One obvious future direction is to implement a maximum-likelihood estimation algorithm following the methods of Li and McLeod [7] to find simultaneous estimates of the ARMA coefficients in Equation (7) and the parameters *α* and *β* from Equation (6) as well as standard errors for the estimates of *α* and *β*. Another is to explore models for the innovations other than the generalized gamma distribution.

Another possible direction is based on the observation that all of the spectra analyzed in this article were generated on the same machine; it would be interesting to see how much of this is machine-dependent by analyzing spectra from other MALDI FT-ICR MS machines. The ideal situation would be if one could calculate *α*, *β*, and (possibly even) {*μ _{t}*} once for each machine and then use these values for analysis going forward. In any case, the framework provided in this article should allow other researchers to determine appropriate parameters for their own MALDI FT-ICR mass spectrometry setups.

This work was supported by the National Human Genome Research Institute (R01-HG003352 to DAB, DMR); National Institute of Environmental Health Sciences Superfund (P42-ES04699 to DAB, DMR); National Institutes of Health (R01-GM49077 to CBL); National Institutes of Health Training Program in Biomolecular Technology (2-T32-GM08799 to DAB); and the Ovarian Cancer Research Fund.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

1. Herbert CG, Johnstone RAW. Mass Spectrometry Basics. CRC Press; Boca Raton, FL: 2003.

2. Park Y, Lebrilla CB. Application of Fourier transform ion cyclotron resonance mass spectrometry to oligosaccharides. Mass Spec Rev. 2005;24(2):232–264. [PubMed]

3. Barkauskas DA, An HJ, Kronewitter SR, de Leoz ML, Chew HK, de Vere White RW, Leiserowitz GS, Miyamoto S, Lebrilla CB, Rocke DM. Detecting glycan cancer biomarkers in serum samples using MALDI FT-ICR mass spectrometry data. Bioinformatics. 2009;25(2):251–257. [PMC free article] [PubMed]

4. Zhang LK, Rempel D, Pramanik BN, Gross ML. Accurate mass measurements by Fourier transform mass spectrometry. Mass Spec Rev. 2005;24(2):286–309.

5. Shumway RH, Stoffer DS. Time Series Analysis and Its Applications with R Examples. 2. Springer; New York, NY: 2006.

6. Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika. 1989;76(2):297–307.

7. Li WK, McLeod AI. ARMA modelling with non-Gaussian innovations. J Time Series Analysis. 1988;9(2):155–168.

8. Xi Y, Rocke DM. Baseline correction for NMR spectroscopic metabolomics data analysis. BMC Bioinformatics. 2008 Jul;9 [PMC free article] [PubMed]

9. Bickel PJ, Doksum KA. Mathematical Statistics. 2. Vol. 1. Prentice Hall; Upper Saddle River, NJ: 2001.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |