PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Anal Chim Acta. Author manuscript; available in PMC 2010 September 28.
Published in final edited form as:
PMCID: PMC2748225
NIHMSID: NIHMS138715

Swarm Intelligence Based Wavelet Coefficient Feature Selection for Mass Spectral Classification: an Application to Proteomics Data

Abstract

This paper introduces the ant colony algorithm, a novel swarm intelligence based optimization method, to select appropriate wavelet coefficients from mass spectral data as a new feature selection method for ovarian cancer diagnostics. By determining the proper parameters for the ant colony algorithm (ACA) based searching algorithm, we perform the feature searching process for 100 times with the number of selected features fixed at 5. The results of this study show: (1) the classification accuracy based on the 5 selected wavelet coefficients can reach up to 100% for all the training, validating and independent testing sets; (2) the eight most popular selected wavelet coefficients of the 100 runs can provide 100% accuracy for the training set, 100% accuracy for the validating set, and 98.8% accuracy for the independent testing set, which suggests the robustness and accuracy of the proposed feature selection method; and (3) the mass spectral data corresponding to the eight popular wavelet coefficients can be located by reverse wavelet transformation and these located mass spectral data still maintain high classification accuracies (100% for the training set, 97.6% for the validating set, and 98.8% for the testing set ) and also provide sufficient physical and medical meaning for future ovarian cancer mechanism studies. Furthermore, the corresponding mass spectral data (potential biomarkers) are in good agreement with other studies which have used the same sample set. Together these results suggest this feature extraction strategy will benefit the development of intelligent and real-time spectroscopy instrumentation based diagnosis and monitoring systems.

Keywords: wavelet, swarm intelligence, ant colony algorithm, mass spectrometry, feature selection, support vector machine

1. Introduction

Mass spectrometry can be a powerful tool for chemical analysis, and has been widely used for a variety of fields including environmental quality control, the food industry, and medical sciences [14]. Specifically, using mass spectrometry for early disease diagnosis has become an attractive research field within proteomics studies [35] and biomarker detection [6, 7]. Such diagnostic processes usually consist of two major steps: sample analysis using mass spectrometry and a mathematical classification model that is constructed based on the spectral data to attempt grouping control versus disease data. To reveal more information and rules hidden behind these complex disease processes, we need to generate high-throughput mass spectral data to represent the bio-system. However, this makes the classification model construction increasingly dependent on a robust and correct feature extraction method to select pertinent information from high dimensional mass spectral data.

Wavelet analysis is widely applied for spectral data analysis, and has the ability to remove noise, reduce data dimensions and concentrate information from within signals [810]. Although wavelet analysis significantly reduces data dimension, not all the wavelet coefficients contain pertinent information, so it is necessary and attractive to detect representative features from wavelet coefficients. Setting up a threshold value to discard small wavelet coefficients not only reduces the dimension of wavelet coefficients, but this also concentrates the most information on the remaining wavelet coefficients [11, 12]. However, in some cases the wavelet coefficients with smaller values do not necessarily mean they are unimportant. The Student’s t-test can be applied to examine the seperability of selected wavelet features [8], but this method can not be applied to non-classification problems such as quantitative estimation. For example, suppose we need to estimate blood sugar concentration from serum mass spectral data. We can not use the Student’s t-test to find pertinent feature variables for this estimation problem, because it is not a group differentiation problem. Genetic algorithms (GA) and simulated annealing (SA) have been applied to select features directly from mass spectral signals [13]. One advantage of applying an optimization method for feature selection is that this is an objective-driven searching process. Defining different objective functions, we can employ an optimization strategy to detect specific features for different objectives such as spectral data classification and quantitative estimation. For these reasons, it is encouraging to select wavelet features using these optimization methods.

Different from GA and SA, the ant colony algorithm (ACA) is a novel swarm intelligence based optimization method [14]. Mimicking collective behavior of a social system, ACA has been applied to different fields, but to the best of our knowledge there have been no reports of using ACA to select wavelet features for mass spectral data analysis especially in proteomics studies. Therefore, this paper aims to introduce ACA to the wavelet coefficient feature selection by employing a set of mass spectral data of blood sera from ovarian cancer patients and control healthy females. During the searching process, we use a support vector machine (SVM), a typical and efficient classification method, to examine the classification properties of selected features at each step. For a classification problem, we expect this wavelet analysis and ACA integrated feature selection method will significantly reduce data dimension and also yield a high classification accuracy. Meanwhile, we examine if the selected wavelet coefficients can help locate the corresponding segments in the original mass spectral data which can potentially provide physical and medical meaning for disease mechanism studies. Some suggestions to ensure the success of an ACA-based feature selection process is also given and discussed. By integrating wavelet transformation and ACA, this feature selection process helps to develop intelligent spectroscopy instrumentation based diagnosis and monitoring systems.

2. Material Description

Low-molecular-weight serum protein profiles have the potential to reflect the pathological state of organs and aid in the early detection of cancer, thus mass spectrometry is quickly becoming an efficient tool for disease diagnosis [3, 4]. The dataset for this study was composed of the mass spectral data of the sera from 162 ovarian cancer patients and 91 control healthy females. Each serum sample was thawed, added to a hydrophobic interaction protein chip, and analyzed using an upgraded protein biology system-2 surface-enhanced laser desorption and ionization time-of-flight ((PBSII SELDI-TOF) mass spectrometer. Each mass spectrum consists of 15, 154 raw m/z values, representing proteomic pattern of the serum. The detailed procedure of sample generation and analysis can be found at the website: http://home.ccr.cancer.gov/ncifdaproteomics/.

3. Method Description

3.1 Wavelet Transformation

The wavelet transform is a powerful tool for signal analysis, and works by examining the signal in different scales [15]. Basically, wavelet analysis can be viewed as a filtering process in which a group of low-pass and high-pass filters are used to split the signal into high frequency (detail, D) and low frequency (approximation, A) bands [16]. Usually low frequency domain signals will be further decomposed using a similar decomposition process. The detailed principle can be referred to in the literature [15, 17, 18]. An n level decomposition of original signal (S) can be represented by:

S=An+Dn+Dn1++D1
(1)

Daubechies-4 wavelet, “Db4” and Haar wavelet, “Haar” are two types of widely used wavelets [1921]. In this study, “Db4” was employed as the mother wavelet, because of its successful applications to spectral data analysis [19, 22, 23]. It is generally recommended that the 3rd or 4th level is a feasible decomposition level for spectral signal analysis and de-noising [16, 19, 23, 24] and further decomposition may be not more helpful [25], so a 4-level decomposition process was applied to the mass spectral data in this study. Meanwhile, considering that the coefficients in the high frequency domains contain more noise related information [19, 20] and that there have been some reports of successful classifications based purely on the low frequency domains [16, 20, 23], we selected pertinent features only from the low frequency coefficients at the 4th level by employing ACA. Searching in the low frequency domain can not only help ensure the physical meaning of the selected features and their corresponding parts in the original spectra but also reduces the searching time that otherwise could be much longer for a search on the entire wavelet coefficient set. The Student’s t-test might be able to exclude some non-differentiable variables from the searching process, but it may not work for a highly nonlinear classification problem or a non-classification problem. To provide a strategy for future wide applications, we directly applied ACA to the wavelet coefficients without any pre-screening.

3.2 Ant Colony Algorithm

ACA is an optimization strategy, mimicking group social behavior [14]. The basic concept of ACA is to use a large number of simple artificial agents (ants) to find the best numerical solution to a problem. Real ants collaborate during their food searching process, by depositing pheromones (a chemical that triggers a collective response of other members in the same colony) on the paths they selected towards a food source. A path which is selected by more ants has more pheromone deposited and consequently this path is more likely to be selected in the next searching step. Eventually, all the ants will choose the same path which usually is the shortest path from their nest to food. In an artificial ant colony based optimization process, a “path” is a candidate solution and the “score” of a selected “path” is determined by a user defined objective function. Generally, a “path” selected by all the artificial ants theoretically would be the optimal solution of this optimization problem.

Figure 1 shows the flowchart of an ACA oriented optimization process. In this study, ACA was used to find a set of discernable wavelet coefficients (i.e., features) for system characterization and classification. Supposing there are m variables for selection and k ants participating into the searching process, a brief description of ACA can be shown as below:

  1. Pheromone amount initialization: give each variable the same initial pheromone amount. τ i (0) = ϕ(i=1,2,…, m).
  2. Solution selection: Each ant selects features based on a probability function. A simple but effective estimate of the probability of each variable to be selected can be:
    Pi(t)=τi(t)i=1mτi(t)
    (2)
    where, τ i (t) is the amount of pheromone at time t for the ith variable. This selection process can be briefly described as: (1) Accumulate the probabilities of all the features in such a way that accu(i) = accu(i−1) + P(i), where accu(0) = 0. Clearly, accu(m) = 1 (m is the total feature number); (2) Generate a random number between 0 and 1, if the this random number is between accu(i−1) and accu(i), the ith feature is selected. This process is also widely used in Genetic Algorithm for member (chromosome) selection [26]. Clearly, the variable with more pheromone is more likely to be selected. One can adjust the selection probability of each variable by employing weighting functions.
  3. Solution score evaluation: Set up a criterion or objective function for solution evaluation. In this study, classification accuracy based on a SVM classifier is used as a criterion to evaluate the quality of the selected variables of each ant. The details of SVM will be given later.
  4. Pheromone amount updating: Pheromone amount of each variable is updated based on the equation below
    τi(t+1)=r·τi(t)+Δτi(t)
    (3)
    Where i is variable index, r is a constant between 0 and 1, called pheromone remaining rate which is complementary to evaporation rate e ( r = 1 − e ), and Δτi(t) represent pheromone increase, which is proportional to the classification accuracy of the selected features of each ant. For each ant, only the pheromone of the selected variables will increase. The higher the accuracy is, the more the pheromone increases. During the whole searching process, the pheromone of all the variables keeps evaporating based on a predefined evaporation rate e.
Figure 1
Flowchart of an ACA oriented optimization process

Repeat the above “selecting”, “scoring”, and “updating” steps for k ants, which is called one iteration. Ideally, after a number of iterations, all the ants will be convergent to the same set of features, which theoretically is the optimal solution of the optimization problem. More details of ACA can be referred to in the literature [14, 27].

3.3 Support vector machine (SVM)

ACA provides an effective search engine to find a set of representative or discernable features for mass spectral data characterization and classification. As discussed above, solution scoring and pheromone updating depend on the classification accuracy of the selected features. In this study, SVM is embedded into the ACA searching process to evaluate the quality of the selected features.

Basically, SVM is a statistical learning machine with a similar structure to radial basis function neural networks and has become an effective pattern recognition method [28, 29]. An SVM learning process aims to determine a number of training samples that are indispensable for classification and establish a classifier using these selected samples. Figure 2 shows a schematic illustration of support vector (black symbols) based classification. A brief description of SVM principle can be illustrated with a two-class linear separation problem. Suppose the linearly separable data are{xi, yi}, xi is a sample vector, and yi = {+1,−1}. A general two-class linear discrimination function is f(xi) = wTxi + b, where, w is the weight vector and b is a bias value. The class membership of sample xi depends on the sign of f(xi). The key concept of SVM is to construct an optimal classification boundary or hyperplane that can maximize the margin between two classes through the following optimization problem [30].

min:Φ(w)=12w2=12wTw
(4)

Figure 2
A schematic construction of SVM classifier

Introducing Lagrange multipliers to this quadratic optimization problem yields,

min:L(w,b,α)=12wTwiαi{yi[wxi+b]1}
(5)

where, αi are Lagrange multipliers. The optimal solution is wopt=iαioptyixi.bopt can be easily calculated by any support vector using wopt. The samples with non-zero Lagrange multipliers ( αi ≠ 0 ) are the support vectors for this classifier.

The linear separation process can be easily extended to nonlinear systems, by employing kernel functions to transform the original input space to a multi-dimensional linear transitional space. The Gaussian kernel function is widely used for SVM construction. The details of constructing a nonlinear SVM classifier can be referred to in the literature [16, 30]. Gaussian width, σ2, is an important parameter for Gaussian kernel functions and in determining responses of support vectors to input vectors, but an in-depth investigation of Gaussian width is beyond the scope of this paper. In this study, σ2 for all the employed SVM models during the searching process was set to be 1, a suggested value for general applications [31], to provide a fair comparison and evaluation.

In this study, the input of the SVM model is the ACA selected wavelet coefficients of each sample and the output is the index number (+1: control, −1: cancer) to represent the category of each sample. After the SVM being trained with the above setting, the category of a new sample depends on the sign of its SVM output. All the data analysis and modeling work in this study was performed on Matlab (version: 7.6.0).

4. Results

4.1 Data Pretreatment and Split

Baseline removal is an important pre-treatment for many types of signal analyses. A conventional baseline removal process for spectral data consists of three major steps [32]: (1) determination of baseline key points, (2) building a baseline model, and (3) signal correction by subtracting the baseline from original signal. In this study, a baseline correction function, called “msbackadj”, in Matlab (version: 7.6.0) was applied to each mass spectrum. After baseline correction, a local normalization (i.e., individually scale each spectrum into the range of [0, 1], using the sum of all signal intensities of the spectrum) process was applied to each spectrum.

Considering the large number of samples in this study, we randomly split the sample set into three sections. For both ovarian cancer samples and control samples, 1/3 of each group was used for training, 1/3 for validating, and the remaining 1/3 for independent testing. Figure 3 shows the methodology of the entire feature selection process. Briefly, after pre-processing and performing wavelet decomposition on all the samples, we use the training and validating data to detect the “optimal” features through an ACA based searching process and meanwhile establish a classification model based on the detected features. Then we apply this classification model to the corresponding feature values in the independent testing samples to examine the real differentiability of the selected features.

Figure 3
The methodology of the entire feature selection

4.2 Wavelet Decomposition of Mass Spectral Data

As discussed in “Wavelet Transformation” section, we use the “Db4” as a mother wavelet and apply a 4-level decomposition process to each mass spectrum. The feature selection then focuses on the low frequency domain coefficients at the 4th decomposition level (A4). The reasons for these choices have been explained in the “Method Description” section. For each spectrum, there are 953 wavelet coefficients in the low frequency domain at the 4th level. Therefore, the following ACA based feature selection process is a 953-dimensional optimization problem.

4.3 Determining pheromone remaining rate

The ACA parameters for this study are set to be: 30 for the number of ants, 20 for the initial pheromone density for each of 953 variables, 100 for the maximal iteration number in one single run. If not specified, these parameters are used for all the ACA experiments of this paper. As discussed above, pheromone remaining rate, r, controls the pheromone updating during the searching process, so the purpose of our first experiment was to choose a proper pheromone remaining rate. For the sake of convenience, we fixed the number of selected features at “5” in this task. A specific section to determine a proper and parsimonious feature number will be given afterwards. To evaluate the selected features, a SVM classifier created from the training data was used to examine the classification effect on the validating data. A weighting accuracy based on the training data and validating data was design as below to provide a reliable score for the selected features, considering a high validating accuracy may not necessarily indicate a good training accuracy.

Score(features)=0.3*trainingaccuracy+0.7*validatingaccuracy

For each remaining rate, we ran the ACA searching process for three times, in each of which 30 ants were employed to find 5 features that can make classification accuracy as high as possible. Figure 4 shows the relationship between the average score (i.e., average weighting classification accuracy) and pheromone remaining rates. It can be seen that the remaining rate of 0.8 shows the highest average classification accuracy (marked with “solid circle”). The lower remaining rates could result in an overly quick loss of the variable pheromone, which eventually lead to the loss of system diversity and the local optima of the searching process. The larger standard deviations of the classification accuracies for the lower remaining rates also indicate a poorer solution convergence. Contrarily, the larger remaining rates would prevent the pheromone differences between variable, which would result in the difficulty in feature selecting. In this study, we use 0.8 as the pheromone remaining rate.

Figure 4
Relationship of average classification accuracy (solid circle) versus Remaining rate

4.4 Determining an appropriate feature number for classification

For a classification problem based on feature selection, we expect to find a small number of features to characterize the whole system and to provide a satisfactory classification result. One method is to consider the number of selected features as another variable to be optimized along with feature selection. This is a multiple-objective optimization problem. Alternatively, another method is to start with a very small feature number and perform enough searching for each number. We can then increase the feature number by “1”, until there is no significant increase between the classification accuracies of two consecutive feature numbers. In this study, we chose the latter method.

Similarly, for each feature number we ran an ACA searching process for three times, in each of which 30 ants were employed to find the corresponding number of features to make classification accuracy as high as possible. Figure 5 shows the relationship between feature number and classification accuracy. The average classification accuracy increases with the feature number increasing from 3 to 5 and then remained relatively stable. The lower classification accuracies for three and four features could indicate that they are not enough to capture the pertinent information to establish a reliable classification model. Also, the larger standard deviations of the classification accuracies for these two cases mean a poor solution convergence. The relatively stable state for the average classification accuracies of 5–8 features could suggest including more features may not further increase the classification accuracy. Therefore, a proper parsimonious feature number was set to be 5 in this study.

Figure 5
Relationship of average classification accuracy (solid circle) versus feature number

4.5 Wavelet coefficient feature selection

Having determined a proper pheromone remaining rate and appropriate feature number, in this section we perform an extensive feature selection by running the ACA based search program for 100 times. For each time, 30 ants are employed to perform an iteration searching process. Like genetic algorithms, ACA is stochastic optimization method, so different initial values may generate different solutions. In some cases, the 30 ants could not find a same solution even at the end of the iteration searching process, and so we selected the solution with the highest classification accuracy to represent the effect on this case. All total, the 100 runs of the ACA based search program generate 100 sets of wavelet coefficient features, each of which had 5 features. Considering a low classification accuracy may indicate its corresponding features are not pertinent enough for system characterization, we focus our further study on the 54 selected feature sets whose classification accuracies were higher than 95%.

One of the 54 feature sets yielded 100% classification accuracy on all of the training, validating, and independent testing sets, suggesting an excellent seperability of the selected features and the feasibility of the proposed feature selection strategy. These results are listed in Table 1 along with the results of the next cases. To provide a more reliable and convincible result, we systematically examine the 54 feature sets. Figure 6 shows the selection frequencies of all the wavelet coefficients that were selected in the 54 feature sets. Many wavelet coefficients are selected as the features for classification model construction, indicating they are almost equivalently good features for data classification. Nevertheless, eight wavelet coefficients marked with “*” show higher selection frequencies and were selected for further analysis. The reason for collecting the popular features from all the 54 representative feature sets is to reveal sufficient information for disease mechanism studies and to construct a reliable and robust classification model for future prediction, because a single detected feature set may be not robust enough for future application, especially for a case with limited samples. Figure 7 shows the distributions of the wavelet coefficients of the two groups at these 8 features, suggesting the seperability of these 8 features. To quantitatively examine the classification accuracy of these 8 features, an SVM classifier was created using the data corresponding to these 8 features in the training data (1/3 of the whole sample set) and then applied to the corresponding feature data in the validating and testing sample sets. As shown in Table 1, the classification accuracies of this case for the training, validating, and independent testing sets were 100%, 100%, and 98.8%, respectively. Comparable to the previous results on the same proteomics data set [3, 4, 33], this result provided reliable and convincible evidence to support the feasibility and effect of the proposed feature selection method on the proteomics data based ovarian cancer diagnosis.

Figure 6
Selection frequencies of all wavelet coefficient features
Figure 7
Distribution of the eight most popular wavelet coefficients of the control and ovarian caner samples
Table 1
Classification accuracies [in percentage] based on the four different feature sets

4.6 Feature location in original mass spectra

Having demonstrated the classification effect of the selected wavelet coefficient features, our next step was to examine if these selected wavelet coefficients can locate the corresponding ranges in the original mass spectral data to help investigate the disease mechanism. A simple method was used in this study. The eight popular wavelet coefficients in the low frequency domain were replaced with zero values and then the modified low frequency coefficients along with the high frequency coefficients were used to reconstructed mass spectral data. Then, a paired Student’s t-test (α = 0.01) was applied to determine on which m/z sites the original spectral data and reconstructed spectral data have significant differences.

However, in some cases accumulated computation error during signal transformation and reverse transformation processes alone could be enough to generate minor differences between original and reconstructed data. Moreover these differences even can be quite significant (p<0.01). Here is an example. We generated 20 simulated spectral samples as shown in Figure 8, by using Gaussian functions. Each spectrum was composed of 2,000 artificial m/z values. We applied a 4-level wavelet transformation to each spectrum and then reconstructed each spectrum using reverse wavelet transformation. Applying a paired Student’s t-test to each m/z site indicated that at 167 m/z sites the original and reconstructed spectral data showed a significant difference (p<0.01). Therefore, in this study, we made a further screening onto the t-test (α = 0.01) detected spectral data points to ensure the differences between the original and reconstructed data on these points were not due to the possible accumulated computational errors or other confounding factors.

Figure 8
Simulated mass spectral data for accumulated computation error investigation

Actually, an intuitive method to locate the mass spectral ranges corresponding to the selected wavelet coefficients is to examine the relative error between the original and reconstructed data for each m/z site. Here, the m/z sites where more than half of the mass spectra in the training set showed larger than 15% differences between the original and reconstructed data were retained for analysis. Thus, totally 159 m/z sites were selected to examine their own classification ability. Most of the 159 spectral data points are clustered in the 4 peaks in Figure 9.

Figure 9
Mass spectral data corresponding to the eight selected wavelet coefficients

Based on the control and cancer samples in the training set, a Student’s t-test indicates the best p-values for the located spectral points in the 4 peaks are <0.001 (m/z: 245.2447, for peak 1), <0.001 (m/z: 261.8864, for peak 2), <0.001 (m/z: 559.7097, for peak 3) and <0.001 (m/z: 3927.2474, for peak 4), suggesting a good seperability of these four peaks. The mean values of these four biomarkers in two groups are shown in Table 3 along with their p-values. These detected peaks are in a good agreement with other studies using the same sample set. The first two peaks located in our study were also detected as the two most separable biomarkers in a logical analysis based detection study [3]. Another study based on multivariate projection method indicated an interesting range for possible biomarker location was 3,700–5,000 m/z, which exactly cover the fourth peak detected in our study [33]. Not reported in other studies, the third peak in Figure 9 may indicate a new contributive biomarker could be located in the range of m/z: 557–562. We also examined the interrelationship among these selected biomarkers. The R2 values for any two biomarkers in the control samples are 0.14 (biomarker 1 vs. biomarker 2), 0.00 (biomarker 1 vs. biomarker 3), 0.02 (biomarker 1 vs. biomarker 4), 0.22 (biomarker 2 vs. biomarker 3), 0.19 (biomarker 2 vs. biomarker 4), and 0.01 (biomarker 3 vs. biomarker 4). The R2 values for any two biomarkers in the cancer patient samples are 0.15 (biomarker 1 vs. biomarker 2), 0.11 (biomarker 1 vs. biomarker 3), 0.03 (biomarker 1 vs. biomarker 4), 0.06 (biomarker 2 vs. biomarker 3), 0.01 (biomarker 2 vs. biomarker 4), and 0.03 (biomarker 3 vs. biomarker 4). The low correlation values indicate these biomarkers are independent on each other in both categories.

Table 3
Distribution of the peak amplitudes (normalized) for the four selected biomarkers in the control and cancer samples

All of these results support the physical meaning and potential medical applications of the detected biomarkers that were located using the selected wavelet coefficient features. The discovery of novel biomarkers has the potential to not only revolutionize diagnosis measures, but they may also impact treatment and prevention of disease in clinical arena. These biomarkers also have the potential to accelerate and refine the drug discovery process as new targets of drug therapies [34]. By monitoring variables that are known to have clinical relevance, biomarkers are able to measure the delivery of drugs to their intended targets, understand pathophysiology, and predict therapy clinical effects [35].

We also quantitatively examined the classification results of these four detected biomarkers in the original mass spectral data, by establishing a SVM classifier using the biomarker data in the training set. As we can see, the biomarker detection and classification model construction were totally independent on the testing data. The accuracies of this SVM classifiers on the training, validating, and testing sets are 100%, 95.3%, and 97.6%, respectively. This result intrigued us to examine if employing all 159 located mass spectral points can further improve the classification accuracy, because a number of single biomarkers detected from the limited sample set using the t-test may not be robust enough for wide future prediction and also some statistical analysis may be unable to reveal the seperability for a nonlinear system. Similarly, a SVM classifier was created through the 159 mass spectral data in the training set, and then applied to the corresponding spectral data in the validating and testing sets. The classification accuracies of this model are 100% for the training set, 97.6% for the validating set, and 98.8% for the testing set, respectively. These results are better then the classification results purely based on the four selected mass spectral points (biomarkers), suggesting the robustness of the classification model based on the 159 located spectral points corresponding to the ACA selected wavelet coefficients and also further suggesting the effects of a wavelet analysis – ACA integrated feature searching strategy.

4.7 Classification based on peak selection strategy

In this section, we applied the ACA based feature selection strategy to the original spectral peaks to have a brief comparison of the classification effects between the selected wavelet features and the selected original peaks. A public software for spectral alignment, SpecAlign (Version 2.4) was employed for the peak and signal alignment. A total of 325 peaks were detected. Then the searching process based on ACA and SVM was applied to detect the distinguishable features from the 325 peaks. The parameters used for the wavelet feature selection, including the pheromone remaining ratio and a proper feature number, were also applied to this case. The selecting – validating – testing process was also based on the same sample set split method. Similarly, we also ran the searching process using the training and validation samples for 100 times. A brief comparison between the classification effects based on the selected original peaks and the selected wavelet coefficient features are shown in Table 2.

Table 2
Classification results based on the peak selection process and the wavelet coefficient feature selection

It can be seen from Table 2 that (1) among the 100 runs, the number of the runs that show 100% accuracy for all of the of the training, validating, and independent testing sets are “2” for the peak selection strategy and “1” for the wavelet coefficient selection strategy; and (2) focusing on the runs with their validation accuracies being higher than 95%, the average accuracy of these runs on the independent testing set are 93.1% for the wavelet feature selection strategy and 95.4% for peak selection strategy. The results of two searching strategies are almost equivalently good. However, the classification based on the peak selection process depends on some key data-pretreatment procedures such as peak match and alignment. Clearly, these pre-treatments may delay a real-time analysis and diagnosis process. In some cases, they may even need an off-line verification to have a correct peak (chemical) match with high certainty. Therefore, without the need of these extra pre-treatments and showing an almost equivalently good classification results to the classification based on the original peak selection, the proposed classification strategy based on the wavelet feature selection provides us a promising direction for real time high dimensional mass spectral data classification.

5. Discussion

This study demonstrated the feasibility and classification effect of integrating wavelet analysis and ACA on feature selection. Wavelet analysis helps the ACA base search program work on a much lower dimensional space that contains the major and pertinent information of the original signal. ACA provides an objective driven feature search process, in which we can detect specific feature for different objectives such as classification and quantitative estimation. And also, the objective function can be either linear or nonlinear.

In this study, we primarily aim to introduce a novel feature selection method for mass spectral data classification. While doing so, we elected to focus the feature selection process on the low frequency domain at the 4th decomposition level. As detailed in the section “3.1 Wavelet transformation”, this choice has been widely supported by previous wavelet analysis based classification studies [16, 19, 20, 22, 25]. This approach also helps us to locate the mass spectral ranges corresponding to the selected wavelet coefficients, by keeping the high frequency domain coefficients that contains more noise related information unchanged. It is possible that there could be subtle differences between the classification results based on the wavelet coefficients at different decomposition levels, but as the classification results based on the current wavelet decomposition level are already almost 100% for the whole sample set, it is beyond the major scope of this work to make an extensive study on wavelet decomposition levels.

Here we will use Fisher-ratio to further explain the feasibility of working on the low frequency domain at the 4th decomposition level. The Fisher ratio is a measure for the linear discriminating power of a variable. Generally, it can be defined as: F = (m1 − m2)2/v1+ v2, where m1 and m2 are the variable mean values of two groups and v1 and v2 are the variable variances of two groups. The larger the Fisher ratio is, the stronger the variable discriminating power. We calculated the Fisher ratio for each variable in the original mass spectra and each wavelet coefficient in the low frequency domain at the 4th level. The average Fisher ratio of all the mass spectral variables is 0.33 while that of all the wavelet coefficients at 4th level is 0.36, which is slightly higher. Although the Fisher ratio is a measure for linear systems, it is important to note that a higher Fisher ratio does not necessarily mean a better classification in a complex nonlinear classification problem. These two similar Fisher ratios can explain how there is sufficient discriminating power related information left in the 4th decomposition level.

In this study, we selected eight popular wavelet coefficients from Figure 6 to discuss the physical meaning and potential medical applications, expecting these popular selected coefficients can generate reliable results in the following investigations based on the selected coefficients. It is possible that including the wavelet coefficients with lower selection frequencies may locate more mass spectral ranges that could be used for additional mechanism studies.

Pheromone remaining rate is an important factor to control an ACA searching process. In this study, the remaining rate was fixed at a predefined value during the searching process, but it is promising to adaptively modify this rate and other parameters during the searching process to prevent possible solution premature or solution divergence, especially for cases where there are many almost equivalently good variables. Developing adaptive ACA searching process will be of considerable interest in our future work.

6. Conclusion

As a novel swarm intelligence based optimization method, ACA was integrated with wavelet analysis to detect differentiable features for proteomics data classification. Wavelet analysis provided a lower dimensional searching space for ACA program, helping expediting the searching of pertinent features. After determining the parameters for the ACA based searching program, we ran the feature selection program 100 times with the feature number fixed at 5. The results of the 100 runs show: (1) the classification accuracy of the 5 selected wavelet coefficients can reach up to 100% for all the training, validating and independent testing sets; (2) the eight most popular selected wavelet coefficients provide 100% accuracy for the training set, 100% accuracy for the validating set, and 98.8% for the independent testing set, which prove the robustness and accuracy of this feature selection method; (3) located by using reverse wavelet transformation, the mass spectral data corresponding to the eight popular wavelet coefficients not only still keep high classification accuracies (100% for the training set, 97.6% for the validating set, and 98.8% for the testing set) but also provide sufficient physical and medical meaning for future ovarian cancer mechanism studies. Furthermore, the corresponding mass spectral data (e.g., potential biomarkers) are in good agreement with other studies that have used the same sample set. Some suggestions to ensure the ACA searching efficiency such as pheromone remaining (evaporation) rate determination are also provided and discussed. Integrated the virtue of wavelet analysis and the flexibility of ACA, this feature extraction method would benefit establishing automatic and intelligent spectroscopy instrumentation based diagnosis and monitoring systems.

Acknowledgements

Partial support for this publication was made possible by Grant Number UL1 RR024146 from the National Center for Research for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NCRR or NIH.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Cozzolino D, Smyth HE, Cynkar W, Janik L, Dambergs RG, Gishen M. Analytica Chimica Acta. 2008;621:2. [PubMed]
2. Zhao WX, Hopke PK, Qin XY, Prather KA. Analytica Chimica Acta. 2005;549:179.
3. Alexe G, Alexe S, Liotta LA, Petricoin E, Reiss M, Hammer PL. Proteomics. 2004;4:766. [PubMed]
4. Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA. Lancet. 2002;359:572. [PubMed]
5. Hendriks M, Smit S, Akkermans WL, Reijmers TH, Eilers PHC, Hoefsloot HCJ, C.M. R, Koster CG, Aerts JM, Smilde AK. Proteomics. 2007;7:3672. [PubMed]
6. Ghosh S, Dennis WH, Petty NM, Melchert RB, Luo B, Grant DF, Dey DK. Journal of Chemometrics. 2006;20:87.
7. Kwon DW, Tadesse MG, Sha N, Pfeiffer RM, Vannucci M. Cancer Informatics. 2007;3:19. [PMC free article] [PubMed]
8. Kim SB, Wang Z, Oraintara S, Temiyasathit C, Wongsawat Y. Chemometrics and Intelligent Laboratory Systems. 2008;90:161.
9. Nicolai BM, Theron KI, Lammertyn J. Chemometrics and Intelligent Laboratory Systems. 2007;85:243.
10. Schleif FM, Villmann T, Kostrzewa M, Hammer B, Gammermand A. Artificial Intelligence in Medicine. 2009;45:215. [PubMed]
11. Kwon D, Vannucci M, Song JJ, Jeong J, Pfeiffer RM. Proteomics. 2008;8:3019. [PMC free article] [PubMed]
12. Pajares G, de la Cruz JM. Pattern Recognition. 2004;37:1855.
13. Llobet E, Gualdron O, Vinaixa M, El-Barbri N, Brezmes J, Vilanova X, Bouchikhi B, Gomez R, Carrasco JA, Correig X. Chemometrics and Intelligent Laboratory Systems. 2007;85:253.
14. Dorigo M, Maniezzo V, Colorni A. Ieee Transactions on Systems Man and Cybernetics Part B-Cybernetics. 1996;26:29. [PubMed]
15. Mallat SG. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE. 1989;11:674.
16. Zhao W, Sankaran S, Ibá˜nez AM, Dandekar AM, Davis CE. Analytica Chimica Acta. 2009;647:46. [PubMed]
17. Addison PS. The illustrated wavelet transform handbook – Introductory theory and applications in science, engineering, medicine, and finance. London: Institute of Physics Publishing; 2002.
18. Daubechies I. Communcations on Pure Applied Mathematics. 1988;41:909.
19. Felinger A, Kare M. Chemometrics and Intelligent Laboratory Systems. 2004;72:225.
20. Ferreira CBR, Borges DL. Pattern Recognition Letters. 2003;24:973.
21. Trygg J, Wold S. Chemometrics and Intelligent Laboratory Systems. 1998;42:209.
22. Kim CI, Yu IK, Song YH. Energy Conversion and Management. 2002;43:1839.
23. Zhao W, Davis CE. Chemometrics and Intelligent Laboratory Systems. 2009;96:252.
24. Pastia L, Walczak B, Massart DL, Reschiglian P. Chemometrics and Intelligent Laboratory Systems. 1999;48:21.
25. Morales E, Shih FY. Pattern Recognition. 2000;33:1611.
26. Goldberg DE. Kluwer Academic Publishers. Boston, MA: 1989.
27. Shelokar PS, Jayaraman VK, Kulkarni BD. Analytica Chimica Acta. 2004;509:187.
28. Cortes C, Vapnik V. Machine Learning. 1995;20:273.
29. Gualdron O, Brezmes J, Llobet E, Amari A, Vilanova X, Bouchikhi B, Correig X. Sensors and Actuators B-Chemical. 2007;122:259.
30. Kecman V. Learning and Soft Computing, Support Vector Machines, Neural Networks and Fuzzy Logic Models. Cambridge, MA, USA: The MIT Press; 2001.
31. Walczak B, Massart DL. Analytica Chimica Acta. 1996;331:177.
32. Zhao W, Bhushan A, Santamaria AD, Simon MG, Davis CE. Algorithms. 2008;1:130. [PMC free article] [PubMed]
33. Campbell J. The Application of Multivariate Projection Methods to the Analysis of Large-Scale Proteomic Data Sets, Young Bioinformaticians Forum 2004. Oxford, UK: 2004.
34. Proteomics-Based Biomarker & Drug Discovery, Genetic Engineering & Biotechnology News. 2006
35. Frank R, Hargreaves R. Nature Reviews Drug Discovery. 2003;2:566. [PubMed]