We posit that the detector response is a member of the Natural Exponential Family with Quadratic Variance Function (NEF-QVF), which is a proper subset of the exponential family of distributions [28
]. Figures and show that assuming the detector response takes the form of a specific distribution is impractical, but that the detector response V
) has a QVF. The NEF-QVF family of distributions occur often in practice and have the following useful properties, characterized by Morris [28
1. If a random variable X
NEF-QVF, it is completely specified by its variance function V
2. If X
constants then aX
is also NEF-QVF
: If X1
NEF-QVF, then X1
4. Affine combinations of normal, Poisson, gamma, binomial, negative binomial, and generalized hyperbolic secant distributed random variables generate all possible distributions in the NEF-QVF family.
There are some physical reasons as to why the NEF-QVF assumption could be reasonable as well. Some plausible justifications for the first two terms in Eq. (1) are:
1. Constant Term
: This is possibly due to thermal noise (additive Gaussian noise) which is common to all electronic measurement devices [21
2. Linear Term
: The ability to detect an ion in a multiple stage electron multiplier, a common type of detector in MALDI-like instruments, is described by compound Poisson statistics [29
The existence of a plausible physical explanation for the quadratic variance term remains an open question. However its effect is measured in both BUFFER1 and BUFFER2 and cannot be neglected. While the QVF model explains the data well in the mass focused region between 3 and 30 kDa, it is likely to break down at lower masses around 2-2.5 kDA where the baseline reaches a maximum. In this region the detector often saturates, introducing a non-linearity into the data that we have not accounted for.
The success of our univariate model for SELDI may indicate that we have selected the most important feature to consider in the preprocessing of the data: namely, the fluctuations in the response of the ion detector subject to different inputs. The analysis of expression values of preprocessed data, on the other hand, requires multivariate methods as there are significant statistical dependencies between the peak heights corresponding to proteins that may be interacting. While these correlations are important in the analysis performed after the data is preprocessed, our results indicated it may be safe to ignore them during the preprocessing. While we have shown LibSELDI to be accurate for estimating peak m/z values, we have not assessed the usefulness approach for estimating peak intensities in this work. The utility of LibSELDI for accurately estimating peak intensities remains an open question and subject of future work.
It is entirely possible that the quadratic variance model could be applicable to other similar technologies such as MALDI and newer SELDI mass spectrometers. This, however, has not been confirmed.
Having buffer only spectra allows one to estimate the parameters of the detector response curve. Knowledge of the detector response curve enables us to apply the modified Antoniadis-Sapatinas denoising scheme described in the methods. Using this approach in our LibSELDI package yields excellent peak detection performance. We have proved this concept on HYBRID1 and HYBRID2 by estimating the QVF parameters of (1) using the buffer-only spectra that were randomly selected from BUFFER1 and BUFFER2 respectively. This implies that spots on SELDI chips should be reserved for buffer-only spectra. Thus, the trade-off for using our approach is increased cost in terms of the number of chips one must use. The modified Antoniadis-Sapatinas denoising is computationally intensive as well, taking approximately seven minutes per spectrum on a high-end workstation.
We argue that some of the cost is recovered by the potential for adaptive and accurate preprocessing, but not all. It may be possible to use QC and/or calibration samples to estimate the QVF as well rather than buffer-only spots. However, this would add in some additional variation due to the nature of the medium (serum, plasma, etc).
While LibSELDI outperforms MassSpecWavelet on the HYBRID1 and HYBRID2 test sets, the applicability of this comparison and of these results to purely real data remains an open question. There is some basic biological variability modeled in our test sets (see description in supplement of [6
]). However, data from complex biological samples such as serum or plasma likely contains more biological variation and artifacts than we have modeled in HYBRID1 and HYBRID2. The investigation of how biological variation affects the model in QC samples is a work in progress.
In addition to achieving a better mean OC curve at lower FDR values, LibSELDI consistently predicts fewer peaks than MassSpecWavelet, leading to protein predictions closer to the true number of proteins in the data, as shown in Figure . This is further evidence that the adaptive modified Antoniadis-Sapatinas denoising approach using the NEF-QVF model for the detector response is smoothing the spectra by close to the right amount.
Figure 6 Efficiency of peak/protein predictions. We show boxplots summarizing the number of peaks predicted for each program in the mean spectrum of each dataset from HYBRID1 and HYBRID2 before thresholding. LibSELDI consistently predicts around 250 peaks, while (more ...)