PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bmcsbBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Structural Biology
 
BMC Struct Biol. 2010; 10(Suppl 1): S2.
Published online 2010 May 17. doi:  10.1186/1472-6807-10-S1-S2
PMCID: PMC2873825

Prediction of protein long-range contacts using an ensemble of genetic algorithm classifiers with sequence profile centers

Abstract

Background

Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein foldings, and therefore advancing the annotation of protein functions.

Results

In this paper, we propose a novel ensemble of genetic algorithm classifiers (GaCs) to address the long-range contact prediction problem. Our method is based on the key idea called sequence profile centers (SPCs). Each SPC is the average sequence profiles of residue pairs belonging to the same contact class or non-contact class. GaCs train on multiple but different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, having roughly the same sizes as the positive ones, are constructed by random sampling over the original imbalanced negative data. As a result, about 21.5% long-range contacts are correctly predicted. We also found that the ensemble of GaCs indeed makes an accuracy improvement by around 5.6% over the single GaC.

Conclusions

Classifiers with the use of sequence profile centers may advance the long-range contact prediction. In line with this approach, key structural features in proteins would be determined with high efficiency and accuracy.

Background

Proteins have complicated three-dimensional (3D) structures. It is often cost-expensive and speed-slow for proteins to be resolved by experimental techniques, such as x-ray crystallography and nuclear magnetic resonance (NMR). This is why more than ten million proteins are sequenced, while only 62,000 protein structures are stored in PDB. As protein structures are the basis for understanding protein functions and rational molecules design, associative studies between protein sequences and 3D structures by computational techniques have received increasing research interests.

It is known that non-local interactions of residue pairs are crucial for proteins to attain their native state [1,2]. Fariselli and Casadio reported that if residue contacts for a protein are known, the major features of its 3D structure could be determined by combining the correctly predicted motifs of secondary structures [3]. To a more extend, even a corrupted map with nonphysical contacts of a protein could lead to the recovery of its 3D structure by projecting the contact map onto its closest physical structural counterpart [4]. Previous results also indicate that 50% correctly predicted contacts ought to suffice that reconstruction [5] at least for proteins with less than 150 amino acids and with 8Å distance cutoff.

There have emerged various methods addressing the inter-residue contact prediction problem, such as methods with the use of evolutionary information [6], a Self-Organizing Map (SOM) integrated by genetic programming (GP) [7], neural networks (NN) [8,9], general input-output hidden Markov models (GIOHMMs) [5], support vector machines (SVMs) [10,11] and so on. Punta and Rost reported that about 30% of the predicted contacts were correct (in accuracy) with the residue separation at least six residues, where about 10% of the observed contacts are predicted (in coverage) [12]. Vullo's two-stage predictor achieved 19.8% prediction accuracy for the minimum contact separation of 24 residues, when evaluating on the top L/5 predicted contacts [13]. Wu and Zhang conducted a comprehensive assessment on sequence-based and template-based methods for contact map prediction and achieved an accuracy around 20% for long-range contacts [14]. Currently, the most accurate contact predictor, named NNcon, achieved 18% accuracy based on the CASP8 dataset [15]. In spite of great progress in the prediction of inter-residue contacts, the development of computational approaches is still at its embryonic stage. Therefore, fully exploring inter-residue contacts in proteins and designing novel approaches is highly demanded.

In this paper, we propose an ensemble of genetic algorithm classifiers (GaCs) to study the problem of long-range contact prediction. The input to this ensemble classifier are sequence profile centers (SPCs). Each SPC represents the average sequence profiles of residue pairs that belong to the same long-range contact class or long-range non-contact class [16]. One sequence profile is an encoding vector for a residue pair whose spatial distance between the two members falls into one distance interval ≤ 8Å or > 8Å, and whose residue separation is ≥ 24 in the sequence. Our GaCs train on multiple but different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, having roughly the same sizes as the positive ones, are constructed by random sampling over the original imbalanced negative data. As a result, about 21.5% long-range contacts are correctly predicted. We also found that the ensemble of GaCs indeed makes an accuracy improvement by around 5.6% over the single GaC.

Results and discussion

Our dataset involves 480 protein chains with 83307 residues, and it consists of 97639 residue pairs in long-range contact. Some proteins have more residue pairs in long-range contact and some ones contain fewer such residue pairs with respect to protein sequence length. However, protein sequence length has an approximately linear relationship with the number of long-range contacts [17]. Figure Figure11 shows such relationship (the red line in Figure Figure11).

Figure 1
Relationship of the number of long-range contacts versus the corresponding protein sequence length. The red line denotes the fitting curve of sequence length against the number of long-range contacts for protein chains.

Gromiha et al found that residue Ile is the most possible residue occurring in long-range contacts followed by Cys, Val, Tyr, Trp, Phe and Leu [2], indicating that hydrophobic residues mainly influence the long-range contacts. This also holds roughly true for our dataset. Figure Figure22 illustrates the amino acid composition for long-range contacts based on our dataset. Residues Val, Leu, Ala, Gly and Ile are more likely to be in long-range contact than other residue types. Conversely, residue Trp has the least long-range contact possibility followed by Cys, His, Met and Gln. In Figure Figure2,2, amino acids have almost the same long-range contact preference for multi-chain proteins as those with single-chain, which suggests that the statistics of amino acid composition is very consistent.

Figure 2
Amino acid compositions for multi-chain proteins and those with single-chain. The sky-blue bar stands for protein with single chain and the grey one denotes multi-chain protein. In addition, all protein chains are colored in brown.

We also investigate the propensities of amino acid types in long-range contacts to the types in long-range non-contacts. Each amino acid type in long-range contacts or in long-range non-contacts is respectively counted. The composition of amino acid type in long-range contacts or in long-range non-contacts is the ratio of the amount of individual amino acid type to that of total amino acids in long-range contacts or in long-range non-contacts. The propensity difference of each amino acid type is the ratio of the percentage of the type in contacts to that in non-contacts. These propensities for all amino acid types are shown in a logarithm (log2) scale in Figure Figure3.3. We can note that the larger the values of amino acid types are, the more possible they are in long-range contacts, while those with smaller values are more possibly in long-range non-contacts. Amino acids with larger propensity values, such as 'C', 'V', 'I', and 'L', representing hydrophobicity, are always in long-range contacts. However hydrophilic amino acids 'E', 'D', and 'K' often appear in long-range non-contacts. More importantly, cysteine and valine are the most frequently occurring residue in long-range contacts, but the glutamic acid appears in long-range non-contacts mostly.

Figure 3
Propensities of amino acid types in long-range contacts to the types in long-range non-contacts for multi-chain proteins and those with single-chain. The sky-blue bar stands for protein with single chain and the grey one denotes multi-chain protein. In ...

Transformation of sample vectors

GA was applied to reduce the dimensionality of input vector, the reduced (transformed) input vectors were then used as the input vectors of our classifier. In the transformation, some input variables were removed or merged, but it was done without decreasing the information of input. One computational benefit of this transformation is that the computational cost was dropped dramatically. For the two-class problem in our experiments, two optimal non-linear transformations for the two corresponding sub-classifiers were achieved. Therefore, there are two-discards for representing the ratio of the number of removed variables to the total one. For instance, one discard ratio 34.348% for a sub-classifier is illustrated in Figure Figure4.4. In this case, there are 143 original variables removed, and 65 variables merged together into only 40 variables. A transformed vector can be obtained by normalizing itself after removing or merging the original variables. Other transformed vectors can be got in the similar way illustrated in Figure Figure4.4.

Figure 4
A transformation case of input vector by one classifier. (a) A case of original input vector with 460 dimensions, (b) Transformed input vector by the transformation for contact class 1 classifier. All the two vectors are equalized by normalized themselves. ...

Performance of the GaCs ensemble

One GaC predictor can predict whether or not one sample vector belongs to long-range contact class. In this work there are 20 individual GaCs based on pairs of positive training sample set and different negative training sample subsets without overlap (refer to Methods section for details, where N = 20 in this work). A set of parameters is used to implement each individual GaC. As discussed in the method section, each protein in the dataset is represented as a vector of features. Feature vectors in the training dataset are first encoded into a string of chromosome. Each individual (or chromosome string) of each genetic classifier is a vector of the same size as the feature vector (460 dimensions in this work). 150 individuals compose the initial population of our genetic algorithms. Individuals of top 5% fitness values are selected to the next generation directly, while the others will go through the crossover and mutation procedures based on the preset crossover (0.95%) and mutation (0.01%) probabilities. After a number of iterations, each genetic algorithm terminates when the best fitness score did not change over 120 iterations and the best individuals are then obtained. The best individual is considered to be the best feature transformer for the prediction of long-range contacts. Other parameters associated with the implementation of genetic algorithm classifier are listed here: the crossover type is single point and the roulette wheel technique is used as the selector type.

To show the performance of the combination of several GaCs, four levels of ensembles of GaCs were constructed by the number of GaCs. Ensemble 5 denotes that there are five GaCs to be combined, while ensembles 10, 15, and 20 stand for 10, 15, and 20 GaCs to be combined to evaluate the improvement of classifiers ensemble, respectively. Table Table11 shows the comparison of the four levels of GaCs ensembles. Performance of the best individual GaC is also shown in Table Table1.1. It can be found in Table Table11 that ensemble 20 outperforms other ensembles and achieves an increase of accuracy by 2%, while the best individual GaC performs worse than the four ensembles.

Table 1
Comparison of four levels of GaCs ensembles

Since the contact prediction accuracy varies significantly with individual proteins and their structure classes [7], we calculate accuracy for each test protein based on GaCs ensemble 20. For each protein chain, we select four levels of the number of predicted contacts in the order of predicted distance between SP vector and SP center of long-range contact. The reason in doing so is that the total number of true contacts has approximately a linear relationship with the protein length [17]. The relationship was also shown in Figure Figure11 for our dataset. In detail, the four levels are '2L', 'L', 'L/2', and 'L/5', respectively, where L denotes protein sequence length. Results show that in many cases (e.g. 1hh7A, 1bxaA, 1gpr_, 1cewI, 1cznA, 1gn0A, 1igd_, 1tif_, 1s8nA), the prediction accuracies are larger than 30%. However, the prediction accuracies for some protein chains such as 1cv8_ and 1c7kA are pretty low. We found that the contact prediction accuracy heavily depends on the calculation of SP centers, on the number of long-range contacts and on the quality of multiple sequence alignment as well as the proportion of beta-sheets. Furthermore, in order to understand the distribution of our GaC long-range contact prediction with respect to CATH [18] domain classes, we compute the average accuracies on the five CATH structure classes (Table (Table2).2). According to Table Table2,2, the contact prediction accuracies on proteins belonging to β-sheets (α — β, all β) is higher than that of all α-helical proteins, which is consistent with other previous observations [7,10]. In Table Table2,2, the average accuracy is about 21.5% when evaluating the top L/5 predicted contacts and the residue pair with 24 apart. Taking into account the inherent physical restraints of protein structures, this prediction performance may be helpful for reconstructing an ab initio low-resolution structure since previous experiments showed that only L/5 or even less residues contacts are required to reconstruct a low resolution structure for a small protein [19-23]. However, the hard challenge is how to reconstruct a protein structure from even a corrupted predicted contact map [4], where contact restraints are much less reliable than the experimental contacts determined by NMR techniques.

Table 2
Performance comparison on CATH classes

Performance with respect to the number of long-range contacts in proteins

Protein chains often have different numbers of long-range contacts. However, it can be seen from Table Table22 that there are no evidences to conclude that the performance on proteins containing more long-range contacts can be better than that containing less long-range contacts. On the other hand, protein chains with too few long-range contacts may worsen the classification performance. Interestingly, our model on protein chains with shorter sequence length performs better probably because proteins with shorter sequence contain larger ratio of the number of residue pairs in long-range contact to the total long-range residue pairs. For instance, the classification performance increases slightly from long sequence proteins to short proteins when considering the top L/5 classified contacts. The detailed accuracies are from 28.23% (the corresponding protein sequences are from 200 to 300), 30.81% (the corresponding protein sequences are from 100 to 200) to 35.44% (the corresponding protein sequences are less than 100) for proteins in the beta class in CATH.

Moreover, it can be seen that the larger long-range contact ratios the tested proteins have, the better predictions of inter-residue long-range contact our model probably makes. In particular, our model evaluated on the proteins in the beta and alpha-beta classes, which consist of larger long-range ratios than those in other classes, outperforms that on other proteins.

Performance comparison based on CASP7 evaluation

The CASP7 evaluation procedure is focused on inter-residue contact predictions with linear sequence separation ≥ 12 and ≥ 24, respectively [24,25], while in this work we only focus on long-range contact prediction with linear sequence separation ≥ 24 and with assessing the top L/5 predicted contacts, where L is protein sequence length. These evaluation metrics are also similar to those used in the previous Critical Assessment of Fully Automated Structure Prediction Methods (CAFASP) [26,27] and in the EVA contact evaluation server [28]. We use the similar procedure and the same test proteins to evaluate the accuracy and coverage for our GaCs ensemble.

Contact map predictors participating in CASP7 include BETApro [29], Distill [30], GPCPRED [7], PROFcon [12], Possum [31], SAM_T06_server [32], SVMcon [10] and so on. Table Table33 reports the performance of the seven automated contact map predictors in the CASP7 experiment. The performance of our GaCs ensemble is appended at the right end of Table Table3.3. It can be seen that its accuracy is 20.7%, overall just slightly behind Possum. Its coverage at a sequence separation threshold of 24 is 3.1%, which is less than SAM-T06 and BETApro.

Table 3
Performance comparison based on CASP7 evaluation

As previously discussed [10], one thing should be noted that in the CASP7 experiment, methods being made predictions for part of domains, such as PROFcon, can not be directly compared with other methods. Here we include its results for completeness in Table Table3.3. Additionally, since the evaluation dataset and scheme we used may be slightly different from the official CASP7 evaluation, our purpose is to try to evaluate the current state of the art of long-range contact predictors instead of ranking them. Previous works also indicate that prediction accuracy of 50% for distant contacts with 8Å distance cutoff ought to suffice to reconstruct 3D protein structure, at least for proteins with less than 150 amino acids [5,24]. Other results showed that the accuracy level of about 30% is required for deriving moderately accurate (low resolution) 3D protein structures from scratch [19-23]. Despite the lower accuracy and coverage made by protein contact predictors, it is an important step towards reaching the accurate level [5,12,24]. From previous CASP prediction results, it can found that in a word, these predictors tend to perform more and more better [10].

Actually, in this work, the contact prediction accuracy is related to the SP centers, the number of long-range contacts and the quality of multiple sequence alignment as well as the proportion of beta-sheets. However, it is extremely difficult to build a specific non-linear expression based on the relationship.

Prediction results with respect to sequence profile centers

Prediction results on SP centers suggest that many LRCs are located around their SP centers. To illustrate the prediction results of the GaCs ensemble on SP centers, chain 'A' of protein PDB:1sjw with 142 residues was taken as an example and shown in Figure Figure5.5. Protein chain 1sjwA belongs to alpha-beta class in CATH and consists of 5 helices and 5 strands, and it contains 138 long-range contacts inter-residues and 6883 long-range non-contacts. In Figure Figure5,5, only residue pairs in long-range contact were displayed and those with sequence separation less than 24 residues were discarded. From Figure Figure5,5, most long-range contacts are predicted and a high classification accuracy is obtained. Mapping the input vectors (SP vectors) of these long-range contact residues onto one high-dimensional space, most contacts are neighboring around their corresponding high-dimensional SP center. In a similar manner, SPs for long-range non-contacts are also clustering together and distantly separated from their SP center but not shown in this work. In this case, the top L/5 (28) predicted contacts, whose residue members are separated no less than 24 residues in sequence, are selected and the corresponding prediction accuracy is 46.43%. In the cases of selecting the top 2L, L, L/2, L/10 and L/20 predicted contacts, our model achieves accuracies of 21.13%, 31%, 42.25%, 57.14%, and 63.39%, respectively. Many long-range residues located in beta sheets are successfully predicted to be in contact. It is consistent with the discovery that residues in long-range contact in proteins belonging to beta class and alpha-beta class in CATH are more easily identified than those in other classes in CATH, as discussed in most of previous work. It is also shown that the predicted contacts are clustered around the true contacts (see from Figure 5(a) to Figure 5(f)). It is of interest that many false positive contacts are also near to true contacts. Therefore, even these contacts may be helpful for reconstructing protein structure.

Figure 5
Contact map and predicted long-range contacts for protein chain PDB:1sjwA. (a) The comparison of contact map and predicted contacts when the top 2L predictions are selected; (b), (c), (d), (e) and (f) illustrate the similar comparisons when the top L ...

Conclusions

As pointed out by Baldi [33], a machine learning algorithm adopting a simple representation of a sequence space can be much more powerful and useful than using the original data containing all details. We found that most long-range contacts or long-range non-contacts are near to their SP centers after the feature transformation by our technique. In this work, we developed a GaC ensemble to reduce the dimensionality of input features based on SP centers. The purpose of the GaC method is to transform input vectors and select a number of features, while the combination of several GaCs is used in order to achieve higher contacts prediction. As expected, the ensemble of GaCs outperforms individual GaC and achieves an increase of accuracy by around 5.6%.

We have also made the following observations: (1) Most long-range contacts or long-range non-contacts are clustered around their SP centers by using our GaCs ensemble, when selecting the top L/5 classified contacts. Furthermore, it was found that about 21.5% long-range contacts can be correctly predicted under the same condition; (2) Clustering proteins with long-range contacts into a few number of clusters may lead to a higher classification performance. Likewise, long-range non-contacts also behave the similar manner. Therefore, for contact class or one non-contact class, integrating a set of SP centers with the information of predicted secondary structures or hydrophobicity might improve classification prediction; (3) Alternative classifiers can be proposed to predict long-range contacts based on predicted SP centers by radial basis function neural network or support vector regression for each protein chain.

Methods

Datasets and cross-validation

We obtained the protein chain set from PDB-REPRDB [34], which selects protein chains from PDB based on PDB Rel. 2007_11_14, and updated on 15 April 2009. We selected protein chains that are resolved by X-ray crystallography with resolution ≤2.0Å. The sequence identity between each two chains is less than 25%. As a result, we achieved 480 protein chains which have corresponding Consurf-Hssp files [35]. The dataset can be found at our website: http://mail.ustc.edu.cn/~bigeagle/BMCStructBio2010/index.htm. To validate our approach, a two-fold cross-validation strategy was employed to conduct our experiments. In this case, predictor was trained on one subset and tested on another one and vice versa.

Feature spaces

We firstly encode input vectors for each pair of residues, (i,j), then respectively stretch the two residues from N- to C-termini along a protein chain. Meanwhile, two corresponding sliding windows with an odd size of window length are used to encode input vectors. They are respectively centered at residue i and j, where window length is set to 9 in our work. Due to the improvement of contacts prediction by the application of segment connecting the residues of i and j [10,12,17,36], we took a third central window with five consecutive residues centered at the residue site int ((i+j)/2).

We used the property of residue sequence profiles (SP) obtained from HSSP database [37] at very beginning, where each residue was represented by 20 elements whose values were evaluated from multiple sequence alignment and their potential structural homologs. As discussed above, the three windows contain (9+5+9)=23 residues, where each residue corresponds to a sequence profile vector with 20 elements. In total, the input vector for one residue pair contains 20×23=460 elements, that is, one input vector consists of 460 features or variables.

Definition of long-range contacts

Usually, the contact map of a polypeptide chain with sequence length N is represented by an N×N matrix, CM. It is defined in terms of spatial distances between C-alpha atoms of residues and a predefined cutoff distance d. Usually, d is set as 8Å. So contact map for two-class long-range contact can be defined as:

equation image

(1)

where d(i, j) denotes the spatial distance between residues i and j.

In this case, two residues separated at least 24 residues in sequence are named as long-range residue pair. A pair of long-range residues in contact (class 1) is regarded as a positive sample while a pair of long-range residues in non-contact (class 0) is a negative sample.

Description of sequence profile center

The sequence profile center (SPC) is the average of all the sequence profile vectors belonging to one contact class of a protein. The definition of sequence profile center Ci in one protein chain j for contact class i is given as follows:

equation image

(2)

where Si (l) denotes the l – th sequence profile whose corresponding residue pair is to belong to contact class i, and mi is the number of residue pairs in contact class i.

For testing our method, SP centers for test protein chains, due to unknown 3D structures, need to be extracted from the training protein chains. All test chains use the same SP centers. So, the definition of SP center Ci of contact class i for test chains is given as follows:

equation image

(3)

where m denotes the number of training protein chains.

Given a test residue pair, we calculate the distance between the SP of the residue pair and every SPC. Generally, label i of SPC Ci is assigned to a SP if the SPC Ci is the nearest to the SP than the other SPC. Some other representations for profile center or centroid can be found in literature [38].

Genetic algorithm classifiers

In this paper, GaC predictor aims to transform original input vectors in such a way that the classification rate is significantly enhanced while retaining the efficiency and simplicity of the original vectors. It proceeds to search for an optimal transformation for the variables of input vectors based on genetic optimization. After obtaining the optimal transformation, classifier based on distance dissimilarity is used to classify test samples.

Chromosome encoding

Genetic algorithm [39] behaves in an analogous manner to Darwinian evolution by maintaining a population of solutions based on a fitness function, and strives to obtain the individuals with the maximum or minimum fitness value within the population. A string represents each candidate in the population, which is associated with a fitness value that reflects the capability to survive into the next generation during the evolution process.

To identify long-range contacts, let V be a feature space set V = (v1,v2,... ,vm), where vi is a feature variable and m is the dimension of feature vectors. Each residue pair within a protein is represented as a feature vector of V. We want to train a GA-based classifier that can correctly classify the feature vectors into classes C1 (long-range contact) and C0 (long-range non-contact). Our goal is to search for an optimal feature transformer T that maximizes the classification rate based on the corresponding selected features. To obtain the optimal T, GA is applied to search through the feature space V with a fitness function. To do that, firstly, a vector vi of the feature space V is represented as a chromosome string Si. A chromosome is composed of three kinds of expressors represented by characters a, b, and c, and the size of a chromosome is the same as a feature vector.

The schema for chromosome encoding is as follows: (1) Character a in a chromosome indicates that the values in the corresponding position in all feature vectors in V will be removed; (2) Two consecutive b's or c's indicate that the values in the corresponding positions will merged together. For instance, Figure Figure66 illustrates the transformation process for a feature vector (x1, x2, x3, x4, x5, x6, x7, x8). In this case, the corresponding chromosome is 'cbbabcca'. After being applied the transformers, the elements of the sample feature vector being removed or merged are concatenated and normalized to form a new vector with four elements. The new normalized vector will be used for long-range contacts classification.

Figure 6
Transformation for input vector. One histogram bin in each original normalized vector denotes one feature or variable and the height corresponds to the magnitude of the feature. The transformed vector should be also normalized but not shown here for the ...

Definition of fitness function

For each transformation Tm associated with the string Sm, we can construct the transformed input functionAn external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i4.gif for input training vector xi . For class Ck(k = 0,1), we can define the following centroid function based on Tm as:

equation image

(4)

where |Ck| is the cardinality of class Ck(k = 0, 1). C1 represents for contact class and C0 is for non-contact class.

Given these centroid functions, a new class structure An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i6.gif can be imposed on the input vector x as follows:

equation image

(5)

where f is the function of the unknown model x and d(*,*) is a measure of dissimilarity between two functions.

Recall that the k-th classifier is trained to identify whether or not an input vector (unknown model) comes from class k. To implement the task, one fitness function for each k-th classifier is used to measure the discrepancies between the original class structure, C1, C0 and the imposed class structure, An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i8.gif based on the chromosome string Sm. The fitness function is defined as:

equation image

(6)

where An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i10.gif and An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i11.gif denote the complements of Ck and An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i12.gif, respectively. The first term counts the number of correct positive classifications, while the second term counts the number of correct negative classifications. Particularly, the maximal value of An external file that holds a picture, illustration, etc.
Object name is 1472-6807-10-S1-S2-i13.gif will be obtained when the two contact class structures exactly coincide, and its value will decrease as their discrepancy increases.

Outputs ensemble

Statistically, the number of negative samples is much smaller than that of positive sample, which leads to a rather imbalance between positive and negative samples. To avoid the influence of imbalance problem, negative training samples are divided into several subsets without overlap by random sample selection, which have roughly the same sizes as that of positive samples. In theory, combining the outputs of a number of independent classifiers can improve classification rate since the errors made by a classifier may be corrected by the others [40-42].

A majority voting rule was adopted to combine the outputs of genetic algorithm classifiers. By using the majority voting, one input vector was predicted as positive class 1 if at least one output was labeled as positive class 1, otherwise the corresponding residue pair was in long-range non-contact class 0.

The flow diagram of our prediction method

Figure Figure77 illustrates the flowchart of our prediction method. Each GA trains on a pair of positive training dataset and a negative training subset based on corresponding SPCs (calculated by Equation 2). As a result, N feature transformations are achieved. Those feature transformations are applied to reduce the dimensionality of testing SPs and SPCs to be tested (calculated by Equation 3). Then, calculating the distances between transformed SPs and SPCs is to determine whether or not the corresponding residue pair is in long-range contact, and more near distance means more possible in contact inter-residue. Finally, final decision would be made by combining outputs from N feature transformations.

Figure 7
The flow diagram of our prediction method. SPs in the figure denote input vectors, while SPC for a training subset stand for the average corresponding SPs calculated by Equation 2 and SPCs for test protein chains are calculated by Equation 3.

Performance indexes

To evaluate the performance of our classifiers, we applied the criteria of accuracy (Acc) and coverage (Cov), which were adopted at CASP/CAFASP [12,43] and defined as follows:

equation image

(7)

where TP denotes the number of true positives, FP denotes the number of false positives, and FN is the number of false negatives.

List of abbreviations used

Acc: Accuracy; CATH: Class, Architecture, Topology, and Homologous superfamily; CAFASP: Critical Assessment of Fully Automated Structure Prediction; CASP: Critical Assessment of Techniques for Protein Structure Prediction; Cov: Coverage; FN: False Negative; FP: False Positive; GA: Genetic Algorithm; GaC: Genetic Algorithm Classifier; LRC: Long-Range Contact; PDB: Protein Data Bank; SP: Sequence Profile; SPC: Sequence Profile Center; TP: True Positive;

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

Peng Chen carried out the implementation and wrote the manuscript. Jinyan Li read, revised, and approved the final manuscript.

Acknowledgements

This work was supported in part by the Singapore MOE ARC Tier-2 funding grant T208B2203 and the National Science Foundation of China (No. 60803107).

This article has been published as part of BMC Structural Biology Volume 10 Supplement 1, 2010: Selected articles from the Computational Structural Bioinformatics Workshop 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1472-6807/10?issue=S1.

References

  • Niggemann M, Steipe B. Exploring local and non-local interactions for protein stability by structural motif engineering. J Mol Biol. 2000;296:181–195. [PubMed]
  • Gromiha MM, Selvaraj S. Inter-residue interactions in proteins folding and stabiligy. Progress in Biophysics and Molecular Biology. 2004;86:235–277. [PubMed]
  • Fariselli P, Casadio R. A neural network based predictor of residue contacts in proteins. Protein Engineering. 1999;12:15–21. [PubMed]
  • Vendruscolo M, Kussell E, Domany E. Recovery of protein structure from contact maps. Fold Des. 1997;2:295–306. [PubMed]
  • Pollastri G, Baldi P. Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics. 2002;18:S62–S70. [PubMed]
  • Vicatos S, Boojala VBR, Kaznessis Y. Prediction of distant residue contacts with the use of evolutionary information. Proteins. 2005;58:935–949. [PubMed]
  • MacCallum RM. Striped sheets and protein contact prediction. Bioinformatics. 2004;20(Suppl 1):i224–i231. [PubMed]
  • Fariselli P, Olmea O, Valencia A, Casadio R. Prediction of contact maps with neural networks and correlated mutations. Protein Engineering. 2001;14:835–843. [PubMed]
  • Chen P, Huang DS, Zhao XM, Li X. Predicting Contact Map Using Radial Basis Function Neural Network with Conformational Energy Function. Int J Bioinform Res Appl. 2008;4:123–136. [PubMed]
  • Cheng J, Baldi P. Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics. 2007;8:113. [PMC free article] [PubMed]
  • Chen P, Han K, Li X, Huang DS. Predicting Key Long-Range Interaction Sites by B-Factors. Protein Pept Lett. 2008;15(5):478–483. [PubMed]
  • Punta M, Rost B. PROFcon: novel prediction of long-range contacts. Bioinformatics. 2005;21:2960–2968. [PubMed]
  • Vullo A, Walsh L, Pollastri G. A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics. 2006;7:180. [PMC free article] [PubMed]
  • Wu S, Zhang Y. A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics. 2008;24:924–931. [PMC free article] [PubMed]
  • Tegge AN, Wang Z, Eickholt J, Cheng J. NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Research. 2009;37:W515–W518. [PMC free article] [PubMed]
  • Chen P, Wang B, Wong HS, Huang DS. International Joint Conference on Neural Networks. Orlando, FL, USA; 2007. Prediction of Long-range Contacts from Sequence Profile. pp. 938–943.
  • Lund O, Frimand K, Gorodkin J, Bohr H, Bohr J, Hansen J, Brunak S. Protein distance constraints predicted by neural networks and probability density functions. Protein Eng. 1997;10:1241–1248. [PubMed]
  • Orengo CA, Machie AD, Jones S, Jones DT, Swindells MB, Thornton SM. CATH - a hierarchic classification of protein domain structures. Structure. 1997;5:1093–1108. [PubMed]
  • Aszodi A, Gradwell MJ, Taylor WR. Global Fold Determination from a Small Number of Distance Restraints. Journal of Molecular Biology. 1995;251(2):308–326. [PubMed]
  • Zhang Y, Skolnick J. Automated structure prediction of weakly homologous proteins on a genomic scal. PNAS. 2004;101:7594–7599. [PubMed]
  • Skolnick J, Kolinski A, Ortiz A. MONSTER: A method for folding globular Proteins with a small number of distance restraints. J Mol Biol. 1997;265:217–241. [PubMed]
  • Ortiz AR, Kolinski A, Rotkiewicz P, Ilkowski B, Skolnick J. Ab initio folding of proteins using restraints derived from evolutionary information. Proteins. 1999;Suppl 3:177–185. [PubMed]
  • Zhang Y, Kolinski A, Skolnick J. TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophysical Journal. 2003;85:1145–1164. [PubMed]
  • Grana O, Baker D, MacCallum R, Meiler J, Punta M, Rost B, Tress M, Valencia A. CASP6 assessment of contact prediction. Proteins. 2005;61:214–224. [PubMed]
  • Moult J, Fidelis K, Tramontano A, Rost B, Hubbard T. Critical assessment of methods of protein structure prediction (CASP) - round VI. Proteins. 2005;61:3–7. [PubMed]
  • Lesk A, Conte LL, Hubbard T. Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts. Proteins. 2001;Suppl 5:98–118. [PubMed]
  • Fischer D, Elofsson A, Rychlewski L, Pazos F, Valencia A, Godzik A, Rost B, Ortiz A, Dunbrack R. CAFASP-2: the second critical assessment of fully automated structure prediction methods. Proteins. 2001;45:171–183. [PubMed]
  • Grana O, Eyrich V, Pazos F, Rost B, Valencia A. EVAcon: a protein contact prediction evaluaton. Nucleic Acid Res. 2005;33:W347–W351. [PMC free article] [PubMed]
  • Cheng J, Baldi P. Three-Stage Prediction of Protein Beta-Sheets by Neural Networks, Alignments, and Graph Algorithms. Bioinformatics. 2005;21:i75–i84. [PubMed]
  • Bau D, Martin AJ, Mooney C, Vullo A, Walsh I, Pollastri G. Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins. BMC Bioinformatics. 2006;7:402. [PMC free article] [PubMed]
  • Hamilton N, Burrage K, Ragan M, Huber T. Protein contact prediction using patterns of correlation. Proteins. 2004;56:679–684. [PubMed]
  • Karplus K, Barrett C, Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998;14:846–856. [PubMed]
  • Baldi P, Brunak S. Bioinformatics: The machine learning approach. London, England: The MIT Press; 2001.
  • Noguchi T, Akiyama Y. PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB) in 2003. Nucleic Acids Res. 2003;31:492–493. [PMC free article] [PubMed]
  • Glaser F, Rosenberg Y, Kessel A, Pupko T, Ben-Tal N. The ConSurf-HSSP Database: The Mapping of Evolutionary Conservation Among Homologs Onto PDB Structures. Proteins. 2005;58:610–617. [PubMed]
  • Gorodkin J, Lund O, Andersen CA, Brunak S. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. Heidelberg, Germany; 1999. Using sequence motifs for enhanced neural network prediction of protein distance constraints. pp. 95–105. [PubMed]
  • Dodge C, Schneider R, Sander C. The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Res. 1998;26:313–315. [PMC free article] [PubMed]
  • Leary RHJ, Rosen B, Jambeckz P. An Optimal Structure-Discriminative Amino Acid Index for Protein Fold Recognition. Biophysical Journal. 2004;86:411–419. [PubMed]
  • Goldberg DE. Genetic Algorithms in Search Optimization and Machine Learning. Reading, MA: Addison-Wesley; 1989.
  • Kittler J, Alkoot FM. Sum versus vote fusion in multiple classifier systems. IEEE Trans Pattern Anal Mach Intell. 2003;25:110–115.
  • Kuncheva LI. Combing pattern classifiers: methods and algorithms. USA: Wiley; 2004.
  • Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell. 1990;12:993–1001.
  • Fischer D, Rychlewski L, Dunbrack RL, Ortiz AR, Elofsson A. CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins. 2003;53(Suppl 6):503–516. [PubMed]

Articles from BMC Structural Biology are provided here courtesy of BioMed Central