|Home | About | Journals | Submit | Contact Us | Français|
Biological sequence usually contains yet to find knowledge, and mining biological sequences usually involves a huge dataset and long computation time. Common tasks for biological sequence mining are pattern discovery, classification and clustering. The newly developed model, Plausible Neural Network (PNN), provides an intuitive and unified architecture for such a large dataset analysis. This paper introduces the basic concepts of the PNN, and explains how it is applied to biological sequence mining. The specific task of biological sequence mining, exon/intron prediction, is implemented by using PNN. The experimental results show the capability of solving biological sequence mining tasks using PNN.
Research in computational biology has grown rapidly in the past ten years due to advances in molecular biology techniques yielding high-throughput data. One of the most important research interests in computational biology is sequence mining. It is much harder to extract knowledge hidden in the sequences than to generate biological sequences. With biological mutation and evolution, sequence datasets usually are enormous and complex. Analysis models become a critical factor for biological sequence mining. Several tasks related to sequence mining such as pattern discovery, classification, prediction, and clustering, can be implemented by statistical, neural network, or data mining models . Those models can be used to capture the knowledge or patterns in order to predict, classify, or analyze sequence data.
A newly developed neural network model, plausible neural network (PNN), which combines probabilistic and possibilistic inferences for binary random variables was introduced in  and subsequently patented by Chen in 2003 . PNN is similar to Hopfield neural networks in that both are bidirectional and symmetrical. But in inference interpretation, PNN is similar to Bayesian neural networks. Specifically the synaptic weights in PNN and Bayesian neural networks have probabilistic meanings and thus are transparent. However, PNN with the weights based on mutual information content as described in section II provides two important features. First, mutual information content weights put the PNN architecture of winner-take-all activation of competing neurons on a solid statistical inference foundation, in which no prior distribution is required as in Bayesian neural networks. Second, mutual information content weights can be used to compute Shannon mutual information between attributes with little added cost.
Several advantages of PNN lead to applying it to biological sequence mining. The PNN model is very intuitive in that the user can easily train it with different initial states to discover patterns hidden in large-scale, complex data. In addition, the model is very general that it can be applied to clustering, prediction, classification, and pattern discovery with the same architecture. Moreover, PNN represents missing data as a null vector so that they do not participate in the PNN inference process. Most of other clustering and classification methods handle missing data by adding artificial data or removing the incomplete data. Since the learning process and activation method in PNN are simple, the computation for training or activation can be finished in a short period of time.
This paper applied PNN to one of the biological sequence mining tasks, the prediction of exon/intron boundaries. The accuracy rate comparing to other computational intelligent models shows the quality result of PNN.
A basic PNN architecture for biological sequence mining consists of two layers (input layer and hidden layer) of cooperative and competitive neurons. The connection between input neurons and hidden neurons is complete, bidirectional, and symmetric.
The competitive neurons are grouped into a winner-take-all (WTA) ensemble and, in turn, WTA ensembles cooperate in decision making. Fig.1 illustrates the architecture for DNA sequence data. The hidden neuron WTA ensemble represents un-labeled (thus unknown or hidden) patterns of similar sequences. Each input WTA ensemble encodes the values of an input attribute, which is either a sequence base or the sequence class. For example, in Fig. 1 the input WTA ensemble (IE, EI, N) encodes the values of attribute Class (i.e. DNA sequence class label) and each WTA ensemble (A, C, G, T) encodes a DNA base in the sequence.
PNN encodes categorical or continuous attributes into WTA ensembles. For sequence mining, only the categorical attribute encoding is required. For a categorical attribute with k categories, its value is encoded as a WTA ensemble of k binary-valued (i.e. 0 or 1) neurons (X1, X2,… , Xk). For example, splice-junction gene sequences datasets may have a class attribute with three possible values IE, EI, and N representing “intron to exon boundary”, “exon to intron boundary”, and “None”, respectively. For such examples, the class attribute can be encoded in a three-neuron WTA ensemble (IE, EI, N). The input vector (1, 0, 0) for the class attribute WTA ensemble indicates the sequence is of the IE class, (0, 1, 0) indicates EI, and (0, 0, 1) indicates N.
Each base in a DNA sequence is encoded by a four-neuron WTA ensemble (A, C, G, T). This DNA encoding can allow for uncertain base value given in the IUPAC code as well. For example, if the base value is R (means G or C), the input vector for the base is (0, 0.5, 0.5, 0). Missing attribute values are represented by a null vector, i.e. all Xi’s in the attribute ensemble are zero.
Given the coding scheme, an input WTA ensemble (X1, X2,… ,Xk) representing an attribute value satisfies the following conditions:
Furthermore, missing attribute values are represented by a null vector, i.e. all Xi’s in the attribute ensemble are zero. With this coding, since the input from every neuron in the attribute ensemble is zero, activation potential (i.e. input times the weight) contributed by the attribute is zero. As a consequence, the attribute with missing value will not participate in the WTA activation of the competing hidden neurons.
Assume each neuron in the PNN is a binary random variable. The connection weight between an input neuron Xi and a hidden neuron Yj is given as follows
In (2), P denotes the probability of the enclosed event. The weight (2) contains the firing history or mutual information content of two connected neurons. It is clear that Xi and Yj are positively associated (excitatory), statistically independent, or negatively associated (inhibitory) if ωij is positive, 0, or negative, respectively. Note that binary random variable is assumed to define (2), but the estimation of (2) as described in section E and PNN applications also apply to neurons satisfying (1).
The PNN is bidirectional since the firing in PNN can be carried out in both directions between input neurons and hidden neurons. For convenience, forward firing is referred to the activation of hidden neurons given the states of the input neurons and reverse firing is referred to the activation of input neurons given the states of hidden neurons. Suppose an ensemble of competitive WTA hidden neurons Y1, Y2,…, Yn receive input signals from the input neurons X1, X2,…, Xm. The firing states of Y1, Y2,…, Yn are computed in the following steps:
Step 3 is a soft WTA competition since there could be multiple winners with different activation level. Alternatively, one winner could be chosen with the largest activation level. The latter is called hard WTA. Step 4 is a normalization step to make sum of Yj equal to one.
In the reverse firing, the states of the hidden neurons Y1, Y2,…, Yn are given and the computation of the activation of input neurons is carried out for each input WTA ensemble separately using steps 1-4 above with some modifications. Specifically, for each input WTA ensemble (X1, X2,…, Xk), the activation of neurons Xj’s is computed using steps 1-4 with these modifications:
The PNN process of input data rows is via the WTA competition among the hidden neurons. Assuming the hard WTA competition is used, one hidden neuron will become the winner of an input data row. The outcome of the WTA competition is judged by the activation level of each hidden neuron from the forward firing of a data row as described in subsection D. The hidden neuron with the highest activation level will become the winner (i.e. the data row in question belongs to the cluster represented by the hidden neuron.) This competition process can be justified by the principle of inverse inference, which is stated as “Given the evidence, the more probable a hypothesis can produce such evidence the more likely it to be true” . To paraphrase the principle in PNN terms, given an input data row (i.e. given input neurons’ values), the more probable hidden neuron can response to such an input the more likely the data row belongs to the cluster represented by the hidden neuron. Suppose X1, X2,…, and Xm denote the states (0 or 1) of the input neurons in a PNN, the “evidence” (or the possibility) of a hidden neuron, Yj, responding to the input can be measured by the following conditional probability:
Therefore, to justify the principle of inverse inference as applied to PNN, it needs to show that the winning hidden neuron Yj judged by the forward firing using the weight (2) has indeed the highest quantity (5). Using (2) to compute the activation potential in (3), yields
If the input attributes are statistically independent, (6) can be reduced to:
Comparing (8) and (5), Zj is equal to the conditional probability (5) divided by a factor common to all hidden neurons. As a result, using (8) to determine the winning hidden neuron yields the same result as using (5). This concluded the justification of the principle of inverse inference used in PNN. Furthermore since the sum of (5) over all Yj is one, the normalization actually computes (5) as the activation level for the hidden neuron Yj.
The statistical independence assumption on input attributes is a limitation of PNN as that in the naïve Bayesian neural networks. However, in data exploration without any prior knowledge of the data, the limitation would not be a big problem in PNN applications. One important note about PNN is no prior distribution is needed due to the uniform prior present in the denominator of (8). This is a huge advantage over Bayesian neural networks, which require a difficult task of estimating prior distribution.
To apply PNN, a training (learning) procedure is needed to estimate the weight (2). The PNN training algorithm estimates the weights from a given dataset. Given the past co-firing history of two neurons X and Y, (xk, yk), k = 1,2,…,n, the weight (2) between X and Y in a PNN can be estimated by
Suppose that a training set of DNA sequences is given. First, fire the hidden neurons randomly for each sequence to produce an initial fire matrix. Based on the initial fire matrix and training sequences, calculate the new weights between input neurons and hidden neurons using (3). In turn, use the new weights in the forward firing of training sequences to get a new fire matrix. Repeat this process until the PNN is stable, i.e. until the fire matrix is not changed. Fig.2 shows the PNN training algorithm.
Two datasets are used to test the performance of PNN. The first dataset is downloaded from . This dataset contains 3190 sequences which are taken from the GenBank 64.1. Each data sample contains a class attribute (I/E, E/I, or N), instance name, and 60-base sequence with 30 bases on either side of the boundary. 767 sequences are classified as E/I. 768 sequences are classified as I/E. 1655 sequences are classified as N.
In order to determine PNNs could be used to determine splice site prediction with large dataset, a set of sequences used to benchmark gene prediction programs was downloaded from . The second dataset contains 9204 sequences extracted from 570 different genes. From this set of sequences, 2629 Exon/Intron (E/I) boundaries and 2634 Intron/Exon (I/E) were extracted, with 30 bases of sequence on either side of the boundary reported. In addition, 3941 60-base sequences not occurring in an E/I or I/E (reported as an N) were extracted. This dataset was then separated randomly into two parts. One was used to train a PNN with 8 hidden neurons to distinguish between these three classes. The other one was used to test the performance of the PNN for I/E and E/I recognition.
Eight different classification algorithms are tested using the first sequence dataset in . The algorithms are listed as follows:
In order to compare PNN to the above algorithms, same dataset is used to test the PNN and the accuracy rate is evaluated. The tested PNN contains 8 hidden neurons and the soft WTA competition is used for the network. Table 1 shows the accuracy rates of PNN and the other algorithms obtained from . The result shows the PNN achieved higher accuracy rate than any algorithms listed above.
The second dataset is employed to test the PNN performance of handling the large dataset. From the total of 9204 sequences, 6204 randomly selected sequences were used to train the PNN, and the other 3000 sequences were used to test the PNN. The PNN contains 8 hidden neurons and 243 input neurons (3 for the class attribute and 4 for each base). Alpha cut was set to 0.8 in order to have more precise prediction. The PNN was trained for 200 iterations and the training was completed in minutes but yet to meet the stable condition. The 3000 testing sequences then input to the trained PNN to test the performance. The result showed that 2830 out 3000 sequences were correctly classified. In other words, 94.3% accuracy rate was achieved. The performance parameters: sensitivity (Sn), specificity (Sp), and correlation co-efficiency (CC) were calculated to assess the performance of the PNN. The definitions of Sn, Sp and CC are as follows:
where TP stands for True Positive, FN: for False Positive, TN for True Negative, and FN for False Negative
Table 2 shows the performance parameters of this experiment. Notice Sn, Sp, and CC have rather high percentages for both the I/E and E/I predictions using PNN even the PNN had not yet reached an optimal state. However, the training was extended for couple more hundreds of iterations, but the performance was not improved significantly. To improve training performance, the next experiment used a larger training dataset.
In this experiment, the training dataset was increased to 7204 randomly selected sequences. The rest of the 2000 sequences were used to test the performance of the PNN. The same PNN from the previous experiment was configured for this experiment. The training of the PNN was finished surprisingly in 58 iterations and 1883 out of 2000 testing sequences were correctly classified. Therefore, the PNN achieved about 94% accuracy rate in a very short period of training time. Table 3 also shows the performance parameters of this experiment.
The measurement of accuracy for this experiment shows the similar result from the previous experiment. Additionally, for misclassification, the PNN provided some useful information (e.g. 0.45: N 0.55:E/I) which could be used for further analysis.
Since the training of PNN is based on mutual information instead of error correction, to improve the performance, a simple tuning algorithm can be applied. It simply repeats the training process and records the configuration of the PNN, which has the best accuracy rate for the classification.
Table 4 shows the measurement of the PNN with the tuning algorithm. Comparing to 117 misclassified testing sequences from experiment 2, only 89 misclassified testing sequences were found out of 2000 sequences. Furthermore, the I/E prediction had significantly improved in Sn and CC as shown in Table 4.
PNN not only shows good performance for the prediction of exon/intron boundaries but also can discover the patterns of exon/intron boundaries. For a trained PNN, each hidden neuron represents a found pattern. The pattern can be easily extracted by reverse-firing the particular hidden neuron. The extracted knowledge then provides better understanding of the dataset. Thus, the number of hidden neurons is crucial to the PNN performance. In this paper, different numbers of hidden neurons are tested in the experiments. The PNN with 8 hidden neurons shows the best performance regarding to the accuracy rate and the training time. However, to decide the minimum number of hidden neurons to achieve desired performance criteria is still an open issue.
PNN has shown the abilities to solve problems such as classification, clustering, pattern recognition, function approximation, etc  . In this paper, PNN has shown promise for gene splice-junction recognition. The fast convergence of PNN makes it feasible for large biological sequence analysis. For future work, the study of new junction pattern discovery and improvement of PNN for sequence mining in general are interesting to be implemented. In addition, the possibility of applying PNN to promoter prediction and protein profile pattern recognition will also be considered.
Kuochen Li, CECS, University of Louisville, Louisville, KY 40292, USA.
Dar-jen Chang, CECS, University of Louisville, Louisville, KY 40292, USA.
Eric Rouchka, CECS, University of Louisville, Louisville, KY 40292, USA.
Yuan Yan Chen, PNN Technologies Inc, PO Box 7051, Woodbridge, VA 22195, USA.