Using these two tools of a Bayesian belief network and mutual information, we have taken the output of our advanced signal processing methods [3
] and attempted to find diagnostic features, as well as build a classifier that could be used to separate future samples. More detail on the method can be found in Kuschner [10
The first step in our implementation is to determine all variables that show dependency with the class. We originally attempted to determine a threshold for mutual information significance by repeatedly and randomly permuting the class assignment of our data, computing the mutual information MI(class;feature) for each permutation, and finding the largest "random" mutual information which results. However, due to the mutual information maximization described below, strictly using this baseline mutual information value as a threshold to declare significance resulted in an unrealistic number of features with some connection to the class. The final methodology was empirical, increasing the threshold while monitoring the number of first level features, as well as the fraction of cases misclassified under cross-validation. Feature set sizes uniformly decreased, while error rates had a local minimum, pointing to a stable threshold value. Resource constraints prevented further investigation into a rule-based method of determining a significance threshold, and we hope to improve this aspect of our work at a later date. In the leukemia data described in the next section, a threshold of 3.2 times the "randomized" baseline provided stable feature sets with a reasonable number of features and minimal error rates.
The abundance values for each feature were discretized into 3 bins - high, medium, and low - by maximizing the mutual information of that feature with the class. Thus for each feature, bin boundaries are swept from the maximum to the minimum empirical values, the data is discretized, and MI(class; discretized feature) is found. The bin boundaries that maximize this value are noted and the discrete values of the variables are used for all further calculations.
The three bin method was selected in order to isolate central values which provide little or no discrimination between groups. If a protein has higher abundance in disease samples, for instance, the high bin will have a large difference in the probability of occurrence given disease vs. normal, as will the low bin. The central bin will have nearly the same probability of occurrence regardless of disease state. Diagnostic features will have few cases in the central bin, and the maximized MI will reflect this. Figure illustrates this technique.
Three bin discretization. A histogram of a variable when separated by group shows how central values with little discrimination are isolated, thereby maximizing mutual information between the variable's values and the class.
With a significance threshold set, all features with MI(class;feature) greater than the threshold are initially considered to have connection to the class variable. Graphically, directed arcs are created in the BN from the class node to each node representing a feature passing this test.
Once a set of features having significant mutual information with the class is established (the "first level features"), they are individually tested against all other features. The MI threshold and discrete values found previously are used (after adjusting for variations in the scale of maximum MI) to determine if connections between features exist. If MI(first-level feature; other feature) exceeds the significance threshold, an undirected arc is placed on the Bayesian network to represent this dependency.
In the case where this test is between a first-level feature and a non-first-level feature, the arc can be directed immediately, given our simplification to the BN based on the causality assumption. However, if this feature-feature link occurs between two first-level features, an additional test is required.
To direct such arcs, we used conditional mutual information. This measurement determines the mutual information remaining
between two variables when a third variable is known. In practice, the data is partitioned by the third variable's value, and mutual information is measured by
If the connection to the class C is of the form C→V1→V2, then the feature V2 will become independent of the class when the data is partitioned on the values of V1, as discussed previously. In that case, the mutual information MI(C;V2|V1) will drop to zero, indicating the independence and the initial link C→V2 is removed. If this connection accurately reflects the causal situation, however, it will not be true that MI(C;V1|V2) drops to zero--and this link is kept.
This provides a means to organize first-level features with dependencies between them. We compute the conditional MI values and look for MI(C;V2|V1)<<MI(C;V2). If such a drop occurs, we conclude a serial connection C→V1→V2 exists. While we did not find that the conditional mutual information vanished, indicating pure independence, significant drops (>75% of the original value) were often observed. Our results were relatively insensitive to the exact threshold used, but fewer expected connections (like multiply charged states) were organized correctly with this threshold above 90%.
First level feature pairs where such drops were not significant were maintained at the first level, but the arc between them was directed based on the greater of the conditional mutual information results.
These two simple tests resulted in a Markov blanket of features around the class variable and information about correlations between these and other features. The resulting DAG is recorded during each cross-validation trial and is used to classify the test cases. After k trials, all cases in the data are classified and an error rate for that trial is recorded. To find that error rate, the probabilistic classification given by the BN (e.g. "probability of disease given this data") is changed to a deterministic classification ("this sample comes from the disease group") using some value, which is 0.5 in all of our results.
By randomizing the list of samples in each of the k-fold cross-validation groups n times, n*k Bayesian networks are recorded, along with n cross-validated error rates. The stability of the Bayesian network is examined by noting the frequency with which various links appear in this set of results, and the stability of the direction of the arcs between features. An "average" network of most frequent connections can be built, thus enabling the classification of new samples with the most stable connections and parameters found.
Two data sets are examined, one artificial and one real. The real data set used is from an Eastern Virginia Medical School (EVMS) study aimed at discriminating between subsets of patients infected with Human T-cell Leukemia Virus type 1 (HTLV-1). Blood sera samples were collected under a protocol developed by the National Institute of Health and EVMS. The samples derived from three major clinical sites and were collected using centralized protocols and kept frozen until processed. The diagnosis and classification of Adult T-cell leukemia (ATL) was made using the World Health Organization classification and Shimoyama criteria. In addition to ATL and healthy individuals the cohort included HTLV-1-infected asymptomatic individuals from the same clinical sites. The acquisition of MS spectra was performed according to protocols described in [11
EVMS investigators employed an in-house program to assign samples in a randomized matrix pattern to prevent bias between replicates, or clinical status, and chip spot position. All samples were processed in triplicate and the arrayed chips were read in a 48-h period. The matrix codes were assigned by an individual separate from the team that processed the samples so that each phase of the study was blinded with respect to the operator. The code was broken during the classification stage.
Before analysis, the data were divided into two sets. The training set consisted of 145 different patients, of which 78 were classified during the clinical portion as "normal," and 67 with various stages of ATL. After removing spectra from the triplicate processing that had poor signal-to-noise due to instrument problems, we were left with 417 cases for the study. This constituted approximately two-thirds of the data taken; the remaining one-third (with all corresponding replicates) was withheld for validation until a final classifier was created. The data are available at ProteomeCommons.org, under the title "Leukemia'04 TOF spectra parsed into Rdata files."
Signal processing of the spectra was performed using tools created by our group and its collaborators and reported elsewhere [4
]. The procedure followed several steps: (1) Removal of an exponentially decaying pedestal (with a time constant of 10000 time steps) presumably created by the low mass matrix products; (2) Peak location and amplitude fitting for each spectra by using a Gaussian line shape with a full-width at half maximum of 11 time steps. This fit weighted each data point's squared error by the inverse of the expected signal to simulate the expected Poisson statistics. Whenever a peak amplitude exceeded 2 times the root-mean-square (RMS) noise in the spectra, that location and amplitude were recorded. (3) All the recorded peaks were aligned across the spectra by shifting the start time of each spectrum to minimize the variation in peak positions found in the spectra. This typically required shifts of as large as ± 6 time steps. (4) A master peak list was generated by including peak locations found in at least 5% of the spectra. (5) Fitted amplitudes were recorded for all spectra at the master peak locations, even those which did not exceed the earlier SNR threshold of 2. (6) The remaining background was removed by smoothing and interpolating each spectrum, after excluding those regions that were within 3 FWMH of a peak or an expected peak. (7) The peak amplitudes were corrected for a systematic decay that was observed in the QC (pooled serum) spectra by increasing each spectrum's amplitude by 0.04% in the order they were taken. This resulted in a net increase of approximately a factor of 2 by the end of the list of spectra. We believe that this was necessary to correct for a laser power decrease over the course of the experimental run. The final result was an array of abundance values for each case at a number of mass-to-charge positions (the peaks). This processing of leukemia data set led to detection of 96 aligned peaks in all 417 spectra in m/z range from 2 to 13 kDa.
The artificial data was created to test the ability of the method to reproduce known relationships. No signal processing is done, but the typical challenges associated with the analysis of real mass spectra are introduced, such as strong correlations between peaks, high variability, convolution of the values of nearby peaks, and the presence of many peaks that are non-diagnostic. Data is generated from Gaussian distributions with means and variances drawn from the leukemia data (which itself has no known underlying distribution). Specific features are chosen to be primary markers, and the mean values of the distributions the two groups are drawn from are one to two standard deviations apart, as was found in the leukemia data. The distribution between groups shown in Figure is from this data set, and simulates the most diagnostic feature found in the leukemia data set. Several other features are chosen to be modifications of the primary features, and random, but bounded, amounts of the primary features are placed in the secondary features. This simulates protein modifications, multiple ionizations, and other systematic events MALDI events [5
]. The remaining features are drawn from a single distribution, regardless of class. Other common TOF-MS problems are introduced. For example, one feature has a portion of its values moved to a neighboring feature to replicate signal convolution of nearby peaks. Detail of the creation of the generated data and the MATLAB code used to generate it is at http://kwkusc.people.wm.edu/dissertation/CreateGenData.html