PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bioinfoLink to Publisher's site
 
Bioinformatics. 2009 September 1; 25(17): 2256–2262.
Published online 2009 June 26. doi:  10.1093/bioinformatics/btp390
PMCID: PMC2734322

A genetic programming approach for Burkholderia Pseudomallei diagnostic pattern discovery

Abstract

Motivation: Finding diagnostic patterns for fighting diseases like Burkholderia pseudomallei using biomarkers involves two key issues. First, exhausting all subsets of testable biomarkers (antigens in this context) to find a best one is computationally infeasible. Therefore, a proper optimization approach like evolutionary computation should be investigated. Second, a properly selected function of the antigens as the diagnostic pattern which is commonly unknown is a key to the diagnostic accuracy and the diagnostic effectiveness in clinical use.

Results: A conversion function is proposed to convert serum tests of antigens on patients to binary values based on which Boolean functions as the diagnostic patterns are developed. A genetic programming approach is designed for optimizing the diagnostic patterns in terms of their accuracy and effectiveness. During optimization, it is aimed to maximize the coverage (the rate of positive response to antigens) in the infected patients and minimize the coverage in the non-infected patients while maintaining the fewest number of testable antigens used in the Boolean functions as possible. The final coverage in the infected patients is 96.55% using 17 of 215 (7.4%) antigens with zero coverage in the non-infected patients. Among these 17 antigens, BPSL2697 is the most frequently selected one for the diagnosis of Burkholderia Pseudomallei. The approach has been evaluated using both the cross-validation and the Jack–knife simulation methods with the prediction accuracy as 93% and 92%, respectively. A novel approach is also proposed in this study to evaluate a model with binary data using ROC analysis.

Contact: ku.ca.xe@gnay.r.z

1 INTRODUCTION

Altered patterns of blood or tissue borne biomarkers have been used as diagnostic indicators of some diseases for a few decades. These assays often measure the levels of antibodies or serum enzymes, and the reliability and accuracy of these tests to diagnose disease is dependent on the altered level of a specific biomarker in all cases of the associated disease. For many other diseases, the potential to use blood borne biomarkers for diagnosis certainly exists, but the configuration of these diagnostic tests is complicated because different individuals show different responses to disease. This heterogeneity of these responses means that disease is indicated by an altered pattern of biomarker expression rather than by changes in the levels of individual biomarkers.

A number of recent studies have identified changes in serum biomarker levels as a consequence of diseases such as cancer (Carlsson et al., 2008; Han et al., 2008; Hanash et al., 2008; Ingvarsson et al., 2008), cardiovascular disease (Balestieri et al., 2008; Gerszten and Wang, 2008), infectious diseases (Barbour et al., 2008; Benhnia et al., 2008; Eyles et al., 2007; Hodgetts et al., 2007) and sepsis (Lukaszewski et al., 2008). However, the diversity of the responses to disease and the number of biomarkers involved the development of diagnostic tests based on these changes is not simple. Methods for the rationalization and interpretation of data to provide diagnostic motifs of disease remains are one of the major challenges.

We have recently used a proteome array to map the antibody responses in individuals who have been diagnosed as suffering from melioidosis, and infectious disease which is endemic in Southeast Asia and is caused by the bacterium Burkholderia pseudomallei. Antibody responses to single antigens were not diagnostic for the disease. However, we report here a methodology using the genetic programming approach that allows the selection of combinations of antigens for the diagnosis of melioidosis with accuracies of >92%. This methodology could be applied to the identification of optimum panels of biomarkers for the diagnosis of other diseases where the biomarkers profiles show significant individual to individual variation.

In clinical practice, one important issue is to use a computer program to detect infection with a desired diagnostic accuracy as soon as possible and as easy as possible. Making decisions using binary data (yes/no or on/off) is the easiest way for clinical staff. The next important concern in clinical practice is that logic function such as ‘A is positive and B is positive’ would be easy to use compared with a complicated computer model. Meanwhile, it is much welcomed to use as fewer biomarkers as possible in practice. Because of this, we were motivated to design a novel approach to implement an easy-to-use decision-making system for infection detection. This approach is based on the use of genetic programming.

Genetic Programming is an extension to the conventional Genetic Algorithm (Goldberg, 1989). Both genetic algorithm and genetic programming have been treated as a type of models which can search for unknown functions through mimicking natural evolution in computers. Their basic component is to use a binary vector (BV) as code to represent a potential solution to a function or a problem. Unlike most other machine learning algorithms, genetic algorithm and genetic programming normally work with multiple solutions in parallel. This means that in one simulation, multiple potential solutions to an unknown function will be investigated, evaluated and selected in parallel. Unlike genetic algorithm, genetic programming does not use fixed-length BVs in simulation. This feature has made genetic programming a powerful tool for exploring coding applications, i.e. how to find best code with variable coding size for an unknown problem. Therefore most structural optimization applications can have their problems solved through the use of genetic programming where variable structure definitions can compete for being selected. The literature has provided much information on how to use genetic programming for optimization applications (Koza, 1992; Koza and Rice, 1992; Eskin and Siegel, 1999; Loveard and Ciesielski, 2001). Genetic programming has also been used in bioinformatics. For example, it was used for discriminating transmembranes domains and non-transmembranes (Koza and Andre, 1996a) and for identifying extracellular proteins (Koza and Andre, 1996b).

Except for genetic programming algorithm, random forest has been widely used for searching minimum set of biomarkers. For instance, it has been used in selecting biomarkers in breast cancer diagnosis (Rexhepaj et al., 2008), in Streptococcus pneumonia disgnosis (Williamson et al., 2008), and in pancreatic cancer diagnosis (Ge and Wong, 2008). However, random forest is not designed for implementing an evolvable coding machine as genetic programming method used in this study, hence no diagnostic pattern with variable alternatives can be found.

2 APPROACH

Suppose a set of testable antigens have been applied to two groups of patients, both Burkholderia Pseudomallei infected (namely positive) patients and non-infected (namely negative) patients. The set of testable antigens are denoted by a set of independent variables(BS1, BS2,…, BSM), where BS stands for Burkholderia Pseudomallei and M is the number of testable antigens. We denote by Θ the set of antigens.

Suppose there are N+ and N infected and non-infected patients, respectively. Each patient will have an antibody response value, either zero, small or large to each antigen. The response of the n-th infected/non-infected patient to the m-th antigen is denoted by a numerical value rnm[set membership]R, where R is the set of all real numbers. There are therefore N+ × M and N × M antibody responses from the infected patients and the non-infected patients, respectively. A data set of patients' antibody responses is collected. The data set is composed of two parts, i.e. responses from the infected patients Ω+ = {rn}n=1N+ and responses from the non-infected patients Ω = {rn}n=1N, where rn=(rn1, rn2,…, rnM). In practical applications, it is desirable to diagnose the disease in a simple way, i.e. detecting if antibody to pathogen is present or absent in a patient's serum. This means that we need to convert real numbers to binary numbers using a proper conversion function. Because both infected and non-infected patients can show positive response to a testable antigen, a proper threshold must also be determined in additive to a proper conversion function. We first apply logarithm on the raw data and then calculate the mean and the standard deviation to determine if one response is switched off to zero or turned on to one. The switching is made by a proposed conversion function defined as below

equation image
(1)

where the mean μ and the standard deviation σ > 0 are calculated from data, β > 0 is a positive value to determine the number of the standard deviations for the conversion, and δ(X) is a function to convert X to one if X > =0, otherwise zero. In other words, if the difference between the logarithm of the raw data [log(rnm)] and the mean (μ) is positive and larger than β standard deviations, the signal is switched on, otherwise off.

After conversion, the response to an antibody of a patient is either true (expressed by a value one) or false (expressed by a value zero). An antibody test with maximized number of true status in the infected patients and minimized number of true status in the non-infected patients is always preferred. We define the coverage as the percentage of true status over the number of patients. Moreover, we measure coverage separately in the infected and non-infected patients. Detail of the design of the conversion is discussed in ‘Methods’ section.

After this data pre-process, we then have two data sets, Φ+ = {xn}n=1N+ and Φ = {xn}n=1N for two groups of patients, where xn=(xn1, xn2,…, xnM) and xnm[set membership]{0, 1}. If xnm = 1, we say that the n-th patient (infected or non-infected) has a response to the m-th antigen in a medical test, otherwise no antibody response. In most situations, it is difficult to use a single antigen for the diagnosis. It is desirable to search for a number of antigens which can work cooperatively in a function for better diagnosis, i.e. for maximizing the coverage in the infected patients and minimizing the coverage in the non-infected patients. These selected antigens are then combined together in an equation referred to as a diagnostic pattern. Using such a pattern, it is then possible to deliver diagnosis in medical applications quickly and easily. Because a Boolean function can deliver either a true status or a false status, we prefer to use the Boolean functions in this study. A Boolean function is a function of a subset of testable antigens trained on converted binary data B(s(Θ), π; Φ+, Φ), where s(Θ) is a subset of Θ and π = {+, × } is a Boolean operator set with+representing an OR operator and × representing an AND operator.

In most real applications, the number of potential candidate subsets of antigens for use is huge. For instance, among 214 antigens in this study, the number of possible candidate subsets is 2214. Note that this is only the number of candidate subsets. If we consider how to combine them together using Boolean functions, the number can be even much larger! It is computationally infeasible to exhaust them to find the best.

Evolutionary computation then provides a powerful tool for this kind of optimization task. Among various evolutionary computation algorithms, the genetic programming approach is able to conduct a search and optimization task for a variable number of variables (antigens in this context) with a different function format (equations). In using the genetic programming approach, each potential solution to a problem (a Boolean function in this context) is expressed as a reverse Polish notation (Hamblin, 1957) in a BV. For instance, below is such a BV for the above-mentioned Boolean function of three antigens BS2BS5 × BS3+. Each BV is a chain of two basic components, i.e. variables and operators. Each element in a BV is either an antigen or a Boolean operator. The number of elements in a BV is the number of antigens plus the number of Boolean operators. In the above BV, we have five elements in order. They are BS2, BS5, ×, BS3 and +.

At the beginning, we may have a number of such randomly generated BVs with variable lengths. Because each Boolean function expressed using such a BV will have a specific combination of antigens and Boolean operators, different Boolean functions will have different output values when it is applied to the available data. A proper evaluation or scoring function is needed. For the non-infected patients, a binary indicator referred to as the false alarm is defined as An external file that holds a picture, illustration, etc.
Object name is btp390i1.jpg, where An external file that holds a picture, illustration, etc.
Object name is btp390i2.jpg is an OR function, Bi(xn) is the output of the i-th Boolean function for the n-th BV xn, Fi is the false alarm signal for the i-th Boolean function. It can be seen that whenever one non-infected patient has a positive output from a Boolean function, the false alarm signal is positive as well. The positive coverage (a percentage of correctly diagnosed infected patients) is defined as

equation image

The scoring function is defined as An external file that holds a picture, illustration, etc.
Object name is btp390i3.jpg. Here An external file that holds a picture, illustration, etc.
Object name is btp390i4.jpg is a negative function of X. When X=0, An external file that holds a picture, illustration, etc.
Object name is btp390i5.jpg and when X = 1, An external file that holds a picture, illustration, etc.
Object name is btp390i6.jpg. Obviously, the score (Si) must be maximized. However, it is no doubt that using all the available antigens, we may obtain a maximum score. The problem is to find a minimum set of antigens which can maximize the score. A fitness function which is commonly used in evolutionary computation algorithms is defined as a trade-off between the score and a measure relating to the number of used antigens (the complexity of a Boolean function). The fitness function is defined as

equation image

where Li is the length of the BV of the i-th Boolean function and ML is the maximum length of all the BVs. Importantly, α [set membership] (0, 1) is the trade-off parameter. If α > 0.5, the score of a BV has a larger weight than the size during selection. If α < 0.5, the large size of a BV will be more penalized. The reason that we use the length of a BV rather than the number of antigens used in a Boolean function is just for reducing computational cost during simulation.

In simulation, three genetic operators are applied. They are selection, crossover and mutation. Selection is based on the fitness function values. All the BVs in a pool in the current generation are sorted according to the fitness function values from the largest to the lowest. Only top 50% BVs are selected as elites. These elites are copied from the current generation to the next generation directly. The rest are discarded. Because the number of BVs in a pool is always kept as a constant, we need to generate another 50% BVs using another two genetic operations, i.e. mutation and crossover. The mutation is an easy one where we randomly select a number of genes in a BV for mutation. Because there are two types of genes, i.e. antigens and Boolean operators, the mutation operation will be restricted into these two types. It means that if an element selected for mutation is an antigen, the mutation will randomly select another antigen for the replacement. For instance, BS2BS5 × BS3+ may be mutated to BS2BS7 × BS3+ if the target for mutation is BS5. On the other hand, if the target for mutation is a Boolean operator, the mutation will randomly select another operator for the replacement. For instance, BS2BS5×BS3+ can be mutated to BS2BS5+BS3+.

The crossover operation is a little complicated. Two types of crossover operations are used. They are single-parent crossover and double-parent crossover. The former will generate a new BV by applying the crossover to one BV. The later will generate a new BV by applying the crossover to two BVs. The same principle is to find two complete sub-BVs from one or two BVs. Complete means that a selected sub-BV should have its self-closed Boolean function. For instance, In BS2BS5 × BS3+, BS2BS5 × is a complete sub-BV while BS2BS5 isn't. After two complete sub-BVs have been selected, their positions will be exchanged. For instance, BS2BS5 × BS3+ and BS7BS1BS4++ can be used to generate two new BVs. If the sub-BV selected from the first BV is BS2BS5× and the sub-BV selected from the second BV is BS1BS4+, two new BVs are BS1 BS4+BS3+ and BS7BS2BS5×+.

The simulation will continue till one of two following criteria is satisfied. First, a simulation will stop if there is little change on the fitness function values for the top 50% BVs. Second, a simulation will be terminated if the maximum evolutionary generations are reached. The procedure of the genetic programming approach for this application is shown as below:

  • Data pre-process:
    • ○ Apply logarithm to the raw data and calculate the mean and the standard deviation of the logarithm data;
    • ○ Convert the raw data to binary signal data.
  • Pattern discovery using different combinations of parameters (α and β):
    • ○ Randomly seed P (the number of BVs in a pool) Boolean functions of the testable antigens;
    • ○ Start evolutionary computation for selecting the best Boolean functions;
    • ○ Check if any of the pre-defined stopping criteria is satisfied, if so stop, otherwise continue.
  • Pattern extraction:
    • ○ Find the best combination of α and β;
    • ○ Extract the diagnostic patterns corresponding to the largest coverage and smallest number of used antigens as the final rules.

3 METHODS

3.1 Data

The data used in this study were obtained using a proteome array chip to measure antibody responses to a panel of 214 immunoreactive antigens. The construction of this array is being reported elsewhere (a paper describing serum data generation has been submitted to PNAS) and data can be seen in the web site: http://ecsb.ex.ac.uk/~zryang/pseudomallei/index.html. Sera from melioidosis positive or negative patients in Singapore were generally taken on admission to hospital or from walk-in clinics. Positive samples (n = 87) were taken from patients on admission to hospital and who had a diagnosis of melioidosis confirmed by blood culture. The negative sera (n = 59) were taken from patients who were either admitted to hospital or walk-in clinics but were negative for melioidosis.

We remove all the zero entries leaving only non-zero ones for data pre-process because logarithm cannot be applied to a zero value directly. We pull all the non-zero signals from both infected and non-infected patients together. The raw data show a typical single side long tail distribution, i.e. only a few response values demonstrating large values (the estimated density function is in http://ecsb.ex.ac.uk/~zryang/pseudomallei/explanation.html). We apply logarithm to the raw data generating a less skewed distribution (data are not shown). The mean value is 7.92 and the standard deviation is 1.70 for the logarithm data. Figure 1 shows the converted data with β=1. It can be seen that infected patients show a larger percentage (12.3%) of ones while non-infected patients show a smaller percentage (6.2%) of ones.

Fig. 1.
The converted binary data with β = 1. The white cells are value one and black cells are value zero. It can be seen that the infected group shows more ones compared with the non-infected group.

3.2 Experimental design

The pool size (P) is 1000. The maximum evolution generation (simulation cycles) is 1000. Using these large numbers is to ensure the optimization to be converged safely. In the result section, we can see that these large numbers do ensure the learning process converged. The α value is varied from 0.5 to 0.9 with the stepping value as 0.1. The value of β is varied from 0.6 to 1.6 with the stepping value 0.1. With these different values for α and β, the aim is to find the maximum positive with minimum diagnostic antigens. Note that the negative coverage (the percentage of collected identified non-infectious patients) is maintained as zero in model construction. It is found that no Boolean function is able to demonstrate the positive coverage over 90% when β > 1.6. It is obvious that when β value increases, the number of ones in non-infected patients will decrease, but the number of ones in infected patients will also decrease because of high threshold for switching on/off. The same reason applies to the situations with small β values. When β is too small, we will have more ones not only in infected patients, but also in non-infected patients.

During simulation, any newly generated BV will be checked if it is identical to any existing ones. If so, it will be removed and a new one will be generated.

After the simulation, diagnostic patterns are extracted. During extraction, we only focus on the top BVs (the best Boolean function) in the pool. The next few BVs may provide nearly equally good solutions. However, doing so may provide many diagnostic patterns making interpretation time consuming. In analysis, it has been found that most of these next few BVs have many overlapped antigens in the Boolean functions meaning that they may correlate to each other. So far, it is too complicated to avoid correlated Boolean functions in a pool except for completely identical BVs. We leave this as a future study.

We decide to use two approaches to evaluate the system including the selection of the best parameters, the k-fold cross-validation and the Jack–knife simulations (Efron, 1981). We set k as 10. In k-fold cross-validation, data are randomly divided into k-folds. There will be k models, each of which uses 9-folds for training and 1-fold for testing. The final system evaluation is based on the testing performance on the k-folds of testing data points. With this method, each data point will be tested only once but will be used for training for nine times. With the Jack–knife simulation, we will have N models if we have N data points. In each model, N − 1 data points are used for training and only one data point is used for testing. The final evaluation of the system will be based on N tests. It can be seen that each data point is also tested once, but will be used for N − 1 times in training.

To evaluate a system, we use two measurements, the fix-point evaluation and float-point evaluation. The fix-point evaluation uses the confusion matrix which measure how accurate a system is when a certain parameter for decision making is fixed. The float-point evaluation is normally testing how robust a system is. This means that the decision boundary is varied in a certain interval to observe how system performance is sensitive to the change of the boundary. The receiver operating characteristic (ROC) curve (Metz, 1978) is commonly used for the float-point evaluation. In this qualitative method, a two-dimensional space is generated by setting the horizontal axis as the false positive rate and the vertical axis as the true positive rate. The false positive rate is one minus the percentage of the accurately predicted negative data points and the true positive rate is the percentage of the accurately predicted positive data points. By changing the decision boundary, a set of points will be generated in this two-dimensional space. Each point is composed of two measures, the false positive rate and the true positive rate. Connecting all the points will generate an ROC curve. By comparing the ROC curves from different models, it is possible to qualitatively tell which model is robust.

Because the continuous data are converted to binary one, it is impossible to vary a decision boundary in our models. Actually, there is no decision boundary in a Boolean function. The output of the Boolean function is only delivering two values, false or true. We therefore have developed a novel technique to evaluate a Boolean function's robustness.

The conversion to a binary data for each antigen test taken from each patient depends on the number of the standard deviations. If we reduce or increase the number of the standard deviations, the false positive rate and the positive rate will be changed. This means that the change of the number of the standard deviations which is fixed in training will test how system performance is sensitive to the change of the number of the standard deviation. We therefore vary the number of the standard deviations from 0.1 to 1.6 to generate a series of models with different false positive and true positive rates. By plotting these points in a two-dimensional space, we then generate ROC curves for the analysis.

4 RESULTS

Table 1 shows the coverage (accuracy) of all 25 combinations of α and β. It can be seen that none of the Boolean function is able to have the positive coverage >90% when α < 0.9. This means that the scores are more important than the number of antigens selected. In other words, the correlation between antigens can be less important than the positive coverage measure. The correlation between antigens means that two or more antigens have similar contributions to a Boolean function.

Table 1.
The coverage for different combinations of α and β

The largest coverage is 96.55% for two combinations, one being the combination of α = 0.9 and β = 1.2 and other being the combination of α = 0.9 and β = 1.3.

When we look at the number of antigens used for diagnostic patterns, we will have a further wise selection of a good diagnostic pattern. Table 2 shows the number of diagnostic antigens of all 55 combinations of α and β. It can be seen that the combination of α = 0.9 and β = 1.3 uses a smaller number of antigens for the diagnosis compared with the combination of α = 0.9 and β = 1.2. The combination of α = 0.9 and β = 1.2 uses 21 antigens while the combination of α = 0.9 and β = 1.3 uses 17 antigens, being 19% less. In this sense, we would like to say that the Boolean function with the combination of α = 0.9 and β = 1.3 is the best diagnostic pattern.

Table 2.
The number of diagnostic antigens for different combinations of α and β

Although the Boolean function with the combination of α = 0.9 and β = 1.3 shows the largest positive coverage, being 96.55%, it uses 17 antigens. The more the antigens used for diagnosis, the more the cost may occur. Table 3 shows all the Boolean functions having the positive coverage over 90% with different combinations of α and β values. It can be seen that when α = 0.9 and β = 1.6, the coverage is 91.95% using only three antigens. This information provides practitioners a wide alternative choice in clinical diagnosis processes. The Boolean function with α = 0.9 and β = 1.6 can certainly be used as a primary diagnostic pattern. If a patient has been diagnosed having the disease using this diagnostic rule, the complicated rule may not be used. If a patient has not passed the test of this diagnostic rule, a further test may apply. Below are the BVs of all the Boolean functions having coverage over 90%. It can be seen that their lengths vary.

  • β = 0.6: BPSL1317 BPSS0933 * BPSL2520 + BPSS1512 BPSS0933 * BPSS1512 BPSL1902 * BPSS1544 BPSL1902 * + + BPSS0065 BPSS0663 + + BPSS1743 + + BPSS0893 + BPSL2615 BPSL1902 * BPSL2096 BPSS1525 + BPSL1661 BPSS1532 BPSS2331 * BPSL2096 BPSL3228 + + + + + BPSS1610 + +
  • β = 0.7: BPSL0999 BPSS2145 * BPSL2615 BPSS2194 + BPSL1661 * BPSS1778 BPSS0663 + BPSL1317 BPSS0933 * BPSS1743 + + BPSS1525 + + BPSL2520 + BPSL2096 + +
  • β = 0.8: BPSL2615 BPSL1902 * BPSS0477 BPSS1512 * BPSL2520 + BPSL1173 BPSS2043 * + BPSL2096 + BPSL3336 + +
  • β = 0.9: BPSS1525 BPSL2697 BPSS1534 * BPSL2697 BPSL2919 * + BPSS1993 + BPSL2096 + BPSL2697 BPSL0999 * BPSS2053 + + BPSS1652 + BPSL2520 + BPSS0663 + + BPSL0782 + BPSL2697 BPSS1512 * +
  • β = 1.0: BPSS0530 BPSS0663 + BPSL2017 + BPSL2520 BPSS1516 BPSL2697 * BPSL2615 BPSL2697 * BPSS2053 + + BPSS1512 BPSS0477 * BPSS0091 + + BPSL0999 BPSL2697 * + BPSL2096 + + +
  • β = 1.1: BPSL1317 BPSS1742 * BPSL0665 + BPSL2017 BPSL3319 BPSS1531 * BPSS1534 BPSS1512 BPSS0477 * BPSL2520 + BPSS2013 BPSS1516 BPSL2522 + BPSS0530 BPSL2615 BPSL2697 * + + + + + + + +
  • β = 1.2: BPSL1317 BPSS1434 * BPSS2053 BPSS0530 + + BPSS1534 BPSS2141 BPSS2053 + BPSL1661 * BPSS0477 BPSL1661 + BPSS2193 BPSL2615 + * + BPSS0477 BPSS2186 + BPSS1512 BPSL0999 + * + BPSL2522 + BPSL0665 + BPSL2520 + BPSL2017 BPSS1516 + + BPSL2096 + + +
  • β = 1.3: BPSL1901 BPSL2697 * BPSL2017 + BPSL1745 + BPSS2288 BPSL1317 + BPSS1512 + BPSS0477 * + BPSL3341 BPSL2697 * + BPSL2522 + BPSS0530 + BPSL2520 + BPSS1531 BPSL2697 * BPSS2053 + + BPSL2298 BPSL2615 BPSL2697 * BPSS1534 + + +
  • β = 1.4: BPSL2520 BPSS2053 + BPSL1173 + BPSS1534 + BPSS1599 + BPSL2298 BPSS0094 + + BPSL2697 BPSS0477 * BPSL0782 + +
  • β = 1.5: BPSL2697 BPSS1532 * BPSS1599 + BPSL1173 BPSL2697 BPSS0477 * BPSL1445 BPSS0477 * BPSS2053 + + + + BPSL2520 BPSS0094 + BPSL2697 BPSL3130 * + BPSL2298 BPSL3341 BPSL2765 * + BPSL0782 + + +
  • β = 1.6: BPSS1512 BPSL1445 + BPSL2697 +

Figure 2 shows the tree structure of the Boolean function of α = 0.9 and β = 1.3. The function is interpreted this way: a patient will have the disease if

  • BPSL1901 and BPSL2697 show positive signals or
  • BPSL3341 and BPSL2697 show positive signals or
  • BPSS1531 and BPSL2697 show positive signals or
  • BPSL2615 and BPSL2697 show positive signals or
  • either BPSS2288 or BPSL1317 or BPSS1512 shows a positive signal and BPSS0477 shows a positive signal or
  • either BPSL2017 or BPSL1745 or BPSL2522 or BPSS0530 or BPSL2520 or BPSS2053 or BPSL2298 or BPSS1534 shows a positive signal.

Figure 3 shows one evolutionary process where it can be seen that the simulation becomes stable very quickly after 200 evolutionary generations.

Fig. 2.
The tree structure of the Boolean function with α = 0.9 and β = 1.3.
Fig. 3.
The evolutionary history for the model with α = 0.9 and β = 1.0.
Table 3.
The Boolean functions with more than 90% coverage

Finally, we discuss the evaluation of the system. We have used the procedures described as above, i.e. the cross-validation and the Jack–knife simulation. Both evaluation procedures show that the number of standard deviation 1.6 gives the best evaluation. Table 4 shows the confusion matrices as well as the accuracies for both evaluation procedures. Both have the same sensitivity 90%, but the Jack–knife had one more misclassified non-infected patient. Figure 4 shows the ROC curves for the cross-validation models with the number of standard deviation as 1.6. It can be seen that they do show a good robustness.

Fig. 4.
ROC curves for the cross-validation models. Each line represents an ROC curve for one cross-validation model. CVi means i-th cross-validation model.
Table 4.
The confusion matrices

In using the Jack–knife evaluation procedure, we have checked the top Boolean function generated in different runs of the Jack–knife simulation to investigate their diversity. However, it has been found that such diversity is extremely small. For 146 runs of Jack–knife simulation, we have found that the rule BPSS1512 or BPSL1445 or BPSL2697 was repeated for 142 times (97%). The other three rules are overlapping with this major rule. They are BPSS1512 or BPSL1445 or BPSL2697 or BPSL2053 BPSS1512 or BPSL2697 BPSS1512 or BPSL1317 or BPSL2697 BPSS1512 and BPSL2697 appeared 100% in 146 top rules.

For a comparison, the support vector machine (Vapnik, 1995) is also used for building models. Specially, the free package SVMlight (Joachims, 1998) was used for the raw data with a logarithm conversion. Both cross-validation and Jack–knife simulations are used. The radial-basis kernel function of the package is used. In this kernel function, a smoothing parameter needs to be tuned. The other two parameters in the package are the C parameter for trading-off between training and testing errors and the J parameter for dealing with imbalanced data. These two parameters have also some impact on model performance. We have used a grid search. The C value starts from 100 to 100 000 with a stepping value 100. The J value starts from 1 to 100 with a stepping value 10. The smoothing parameter starts from 1.0e-6 to 1.0e-0 with a stepping value 1.0e + 1. It has been found that model performance has little difference when the C value is within the interval of 100 and 10 000 and the J value is within the interval 1 and 100. Beyond these intervals, model performance drops dramatically. It is found that when the smoothing parameter is > 1.0e−1 and <1.0e-4, no models can be constructed (the prediction accuracy is <55%). Table 5 shows the overall performance for using 10-fold cross-validation (CV) and Jack-knife (JK) simulations. It can be seen that the performance of SVM is slightly worse than the genetic programming approach. Bear in mind that the SVM models have used all the 214 antigens with a complicated decision-making system.

Table 5.
The confusion matrices using SVM

We must emphasise that the value of α is optimized in model construction. Because our data set has been very small, we are unaffordable to reserve a subset of it for blind test. The value of β can be altered if high specificity or high sensitivity is expected in real use. Meanwhile, a web tool has also been implemented for public use if the same serum test culture is used. Its URL address is http://ecsb.ex.ac.uk/~zryang/pseudomallei/index.html.

5 CONCLUSION

We have presented a methodology in this article for searching for the diagnostic patterns through an evolutionary computation process. For this specific application in fighting against the Burkholderia pseudomallei, we have proposed a novel scoring function, which aims to maximize the coverage of the antibody responses of the infected patients while maintaining zero or minimal coverage of the antibody responses of the non-infected patients (no false alarm signal). We have used different trade-off strategies for searching for the diagnostic patterns with maximized coverage and minimum number of antigens used for the diagnosis. We have also tested different threshold values for switching the raw data to binary signals. The simulation shows that the maximum coverage is 96.55% using 17 antigens (7.9% of all the available antigens). The simulation has also found out nine Boolean functions having the coverage over 90% with variable number of antigens. The most important issue of this study is that the search of the diagnostic patterns can be automatic employing an evolutionary computation approach. The system is evaluated by two procedures, namely the cross-validation and the Jack–knife simulations. Both evaluation procedures demonstrated similar performance and robustness of the system. In conclusion, if we convert the serum tests of BPSS1512, BPSL1445 and BPSL2697 using the conversion function [Equation (1)] with μ = 7.92, σ = 1.7, and β = 1.6, we have the expected prediction accuracy at least 92% for this specific disease. The method proposed in this study can be easily extended to other applications where disease diagnostic patterns of biomarkers need to be found.

ACKNOWLEDGEMENTS

The authors thank Eng-Eong Ooi and Jimmy Loh (DMERI, Singapore) for providing serum samples. The protein microarrays were fabricated by Jozelyn Pablo and Matt Kayala; probing and analysis was done by Chad Burk.

Funding: NIH/NIAID grant numbers AI61363 and U5065359 to P.F.

Conflict of Interest: none declared.

REFERENCES

  • Balestrieri ML, et al. Proteomics and cardiovascular disease: an update. Curr. Med. Chem. 2008;15:555–572. [PubMed]
  • Barbour AG, et al. A genome-wide proteome array reveals a limited set of immunogens in natural infections of humans and white-footed mice with Borrelia Burgdorferi. Infect Immun. 2008;76:3374–3389. [PMC free article] [PubMed]
  • Benhnia MR, et al. Redundancy and plasticity of neutralizing antibody responses are cornerstone attributes of the human immune response to the smallpox vaccine. J. Virol. 2008;82:3751–3768. [PMC free article] [PubMed]
  • Carlsson A, et al. Borrebaeck CA.Serum proteome profiling of metastatic breast cancer using recombinant antibody microarrays. Eur. J. Cancer. 2008;44:472–480. [PubMed]
  • Efron B. Nonparametric estimates of standard error: the jackknife, the bootstrap and other methods. Biometrika. 1981;68:589–599.
  • Eskin E, Siegel EV. Genetic programming applied to Othello: introducing students to machine learning research. In: Joyce D, editor. 30th Technical Symposium of the ACM Special Interest Group in Computer Science Education. Vol. 31.1. New Orleans, LA, USA: ACM Press; 1999. pp. 242–246.
  • Eyles JE, et al. Immunodominant Francisella tularensis antigens identified using proteome microarray. Proteomics. 2007;7:2172. [PubMed]
  • Gerszten RE, Wang TJ. The search for new cardiovascular biomarkers. Nature. 2008;451:949–952. [PubMed]
  • Ge G, Wong GW. Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics. 2008;9:275. [PMC free article] [PubMed]
  • Goldberg DE. Genetic Algorithms in Search, Optimisation and Machine Learning. Reading, MA: Addison-Wesley; 1989.
  • Hamblin CL. Computer languages. Australian J. Sci. 1957;20:135–139.
  • Han KQ, et al. Identification of lung cancer patients by serum protein profiling using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry. Am. J. Clin. Oncol. 2008;31:133–139. [PubMed]
  • Hanash SM, et al. Mining the plasma proteome for cancer biomarkers. Nature. 2008;452:571–579. [PubMed]
  • Hodgetts A, et al. Biomarker discovery in infectious diseases using SELDI. Future Microbiol. 2007;2:35–49. [PubMed]
  • Ingvarsson J, et al. Detection of pancreatic cancer using antibody microarray-based serum protein profiling. Proteomics. 2008;8:2211–2219. [PubMed]
  • Joachims T. Text categorization with support vector machines: learning with many relevant features. In: Nédellec C, Rouveirol C, editors. Proceedings of the European Conference on Machine Learning. Vol. 1398. London, UK: Springer; 1998. pp. 137–142.
  • Koza JR. Genetic Programming on the Programming of Computers by Means of Natural Selection. Cambridge, MA: MIT Press; 1992.
  • Koza JR, Rice JP. Proceedings of Tenth National Conference on Artificial Intelligence. Menlo Park, CA, USA: AAAI Press/MIT Press; 1992. Automatic programming of robots using genetic programming; pp. 194–207.
  • Koza JR, Andre D. Classifying protein segments as transmembrane domains using architecture-altering operations in genetic programming. In: Angeline PJ, Kinnear KE Jr., editors. Advances in Genetic Programming II. Cambridge, MA: MIT Press; 1996a. pp. 155–176.
  • Koza JR, Andre D. Automatic discovery of protein motifs using genetic programming. In: Yao X, editor. Evolutionary Computation: Theory and Applications. Singapore: World Scientific; 1996b. pp. 171–197.
  • Loveard T, Ciesielski V. Representing classification problems in genetic programming. Proc. Congress Evolut. Comput. 2001;2:1070–1077.
  • Lukaszewski RA, et al. The pre-symptomatic prediction of sepsis in intensive care unit patients: a pilot study. Clin Vaccine Immunol. 2008;15:1089–1094. [PMC free article] [PubMed]
  • Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Med. 1978;8:283–298. [PubMed]
  • Rexhepaj E, et al. Novel image analysis approach for quantifying expression of nuclear proteins assessed by immunohistochemistry: application to measurement of estrogen and progesterone receptor levels in breast cancer. Breast Cancer Res. 2008;10:R89. [PMC free article] [PubMed]
  • Vapnik V. The Nature of Statistical Learning Theory. New York: Springer-Verlag; 1995.
  • Williamson YM, et al. Differentiation of Streptococcus pneumoniae conjunctivitis outbreak isolates by matrix-assisted laser desorption ionization-time of flight mass spectrometry. Appl. Environ. Microbiol. 2008;74:5891–5897. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press