Many biologically active proteins are intrinsically disordered. Knowing the disorder status of these proteins helps in understanding their structures and functions. The disorder content of proteins varies dramatically, with fully ordered and fully disordered proteins as the two extremes. Often, it is necessary to perform a binary classification and classify a whole protein as ordered or disordered. Here, an improved error estimation technique was applied to develop cumulative distribution function (CDF) algorithms for several established disorder predictors. A consensus binary predictor based on artificial neural networks, NN-CDF, was developed by using the outputs of the individual CDFs. The consensus method outperforms the individual predictors by 4–5% in averaged accuracy.
The number of proteins that lack rigid 3D structures under physiological conditions in vitro yet fulfill key biological functions is rapidly increasing [1–10]. These proteins are known as intrinsically disordered proteins (IDPs), among other names. They are highly abundant in nature [11–13], are typically involved in signaling, recognition, and regulation [7,8,14–18], and are strongly associated with human diseases . IDPs typically possess highly dynamic structures in solution, with high mobility at different timescales, and therefore such proteins almost never form crystals. Hence, the existence of these proteins represents a substantial challenge to the structural genomics initiative .
IDPs and intrinsically disordered regions (IDRs) differ from structured globular proteins and domains with regard to many attributes, including amino acid composition, sequence complexity, hydrophobicity, charge, flexibility, and type and rate of amino acid substitutions over evolutionary time [4,21–23]. Based on these differences between IDPs and ordered proteins, numerous disorder predictors have been developed (reviewed in [24–26]). Nearly all of the predictive tools developed so far provide disorder prediction on a per-residue basis; i.e., they give the likely disorder status of each amino acid residue. Often, in the analysis of a given dataset, it is useful to carry out a binary classification of whole proteins, indicating whether a protein is likely to fold or likely to remain unstructured. Such a classification is not a simple task, as the extent to which a sequence is ordered or disordered and the nature of the disorder vary widely among proteins. In fact, the structural variability of IDPs is extremely high, and native coils, native pre-molten globules, and native molten globules have been described in the literature [4,9,10,14,16,18,27]. A protein can be completely unstructured or contain some elements of tertiary and/or secondary structure. In multi-domain proteins, domains might be connected by highly flexible linkers, and one or several domains might be completely disordered. Some proteins might have long disordered loops or tails. Because of this great variability, there is no strict boundary between ordered and intrinsically disordered proteins.
Two distinct binary classification methods have been reported previously [3,11,13]. One of these approaches uses charge-hydropathy plots (CH-plots), where ordered and disordered proteins are plotted in CH-space and a linear boundary separates them . The other method is based on the predictor of natural disordered regions (PONDR®) VLXT [21,28], which predicts an order-disorder score for every residue in a protein. The cumulative distribution function (CDF) distinguishes ordered and disordered proteins based on the distribution of these prediction scores [11,13]. A CDF curve gives the fraction of the outputs that are less than or equal to a given value. According to the CDF analysis, fully disordered proteins have a very low percentage of residues with low predicted disorder scores, as the majority of their residues possess high predicted disorder scores. In contrast, the majority of residues in ordered proteins are predicted to have low disorder scores. Hence, in theory, all fully disordered proteins should lie in the lower right corner of the CDF plot, whereas all fully ordered proteins should lie in the upper left corner of this plot [11,13].
Because several per-residue predictors have shown significant improvement in prediction accuracy, it was of interest to determine whether CDF analysis based on these predictors would give improved binary classifications. An additional question was whether new methods could be used to optimize the CDF boundary line to achieve higher prediction accuracy. In this paper, the CDF method was developed for two other members of the PONDR® family of disorder predictors, VSL2 [29,30] and VL3 , for a simplified predictor based on the TOP-IDP scale , as well as for IUPred [33,34] and FoldIndex . We also propose a new method for optimizing the order-disorder boundary line in the CDF plots. Finally, a consensus method was elaborated by using a neural network based on CDF values from the outputs of PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex, and this method appears to be more accurate than any of the methods based on individual predictors.
Four groups of datasets were used in this study. The first group included the ‘original datasets’ from Ref. : (i) an ordered dataset of 105 wholly ordered proteins and (ii) a disordered dataset of 54 fully disordered proteins. These two datasets were used to take advantage of their high quality and to provide an unambiguous comparison of the new methods developed in this paper with the previously developed method . The second group consisted of new fully ordered and fully disordered datasets. The new set of fully ordered proteins had 554 chains derived from the PDB database as of July 20, 2008. It included sequences of non-homologous, single-chain, non-membrane proteins that had no ligands, no disulfide bonds, and no missing residues, and that were characterized by unit cells with primitive space groups. The new dataset of fully disordered proteins had 84 chains extracted from DisProt (release 4.5 of July 17, 2008) to include non-homologous proteins without structured regions. Each of these new datasets was randomly and equally split into training and testing sets. The third group consisted of the sequence datasets for Escherichia coli K12, Archaeoglobus fulgidus, and Methanobacterium thermoautotrophicum, generated from the UniProt database after removing all fragments. The last group was a dataset of 64 partially disordered proteins with less than 25% sequence identity, which were also extracted from PDB and had missing electron density for at least 30 residues, as in Ref. .
PONDR® VLXT [21,28] is composed of three neural networks, two for the termini of the sequence and one for the internal region. The final output is an average over the outputs of these three networks. The inputs of the neural networks are residue composition-related quantities. PONDR® VL3 employs majority voting over an ensemble of neural networks, which take composition, complexity, and entropy as inputs. PONDR® VSL2 [29,30] is built on support vector machines, with sequence composition, evolutionary information, and predicted secondary structure as inputs. TOP-IDP is a new amino acid scale developed to discriminate ordered and disordered residues with the highest accuracy. IUPred [33,34] applies a sequence-based pairwise potential energy, estimated from globular proteins, to distinguish disordered residues and proteins from ordered ones. FoldIndex uses the relationship between net charge and normalized hydrophobicity, which originates from the CH-plot, to partition ordered and disordered residues.
CDF analysis summarizes the per-residue predictions by plotting predicted disorder scores against their cumulative frequency, which allows ordered and disordered proteins to be distinguished based on the distribution of prediction scores [11,13]. At any given point on the CDF curve, the ordinate gives the proportion of residues with a disorder score less than or equal to the abscissa. To develop the corresponding CDF algorithms, the outputs of all the above-mentioned predictors were unified to produce per-residue disorder scores ranging from 0 (ordered) to 1 (disordered). In this way, CDF curves for the various disorder predictors always began at the point (0, 0) and ended at the point (1, 1), because disorder predictions were defined only in the range [0, 1], with values less than 0.5 indicating a propensity for order and values greater than or equal to 0.5 indicating a propensity for disorder. As a result, fully ordered proteins yield convex curves because a high proportion of the prediction outputs are below 0.5, while fully disordered proteins typically yield concave curves because a high proportion of the prediction outputs are above 0.5. In practice, the range of prediction scores (from 0 to 1) was divided into 20 bins [11,13]. It is therefore expected that an approximately diagonal boundary line can separate the ordered and disordered proteins with acceptable accuracy.
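As an illustration of the CDF construction described above, the following minimal sketch computes a 20-bin CDF curve from a list of per-residue disorder scores. The function name `cdf_curve` and the choice to evaluate the curve at the upper edge of each equal-width bin are assumptions made for this example; the text does not specify the original implementation.

```python
def cdf_curve(scores, n_bins=20):
    """Return the CDF of per-residue disorder scores on [0, 1].

    The k-th returned value is the fraction of residues whose score is
    less than or equal to k / n_bins (assumed: equal-width bins, curve
    sampled at each bin's upper edge).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    curve = []
    for k in range(1, n_bins + 1):
        threshold = k / n_bins
        curve.append(sum(1 for s in scores if s <= threshold) / n)
    return curve


# Hypothetical scores: two residues predicted ordered, two disordered.
example = cdf_curve([0.12, 0.22, 0.91, 0.97])
```

For a mostly ordered protein the curve rises quickly at low scores, whereas for a mostly disordered protein it stays low until the high-score bins.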
The original datasets were divided into training sets and testing sets. The boundary line for each CDF was optimized on the training set and evaluated on the testing set. Bootstrap resampling (1000 iterations) was also applied to estimate the confidence region of the accuracy.
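The bootstrap step can be sketched as follows. This is only an illustrative percentile bootstrap under stated assumptions: the input is a hypothetical list of per-protein correctness flags, and the function name, the 95% interval, and the fixed seed are choices made for this example rather than details given in the text.

```python
import random


def bootstrap_accuracy(correct_flags, n_resamples=1000, alpha=0.05, seed=0):
    """Estimate mean accuracy and a percentile confidence interval.

    correct_flags: one boolean per protein, True if it was classified
    correctly (hypothetical input format).
    """
    rng = random.Random(seed)
    n = len(correct_flags)
    accuracies = []
    for _ in range(n_resamples):
        # Resample proteins with replacement and recompute the accuracy.
        sample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(sample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    mean = sum(accuracies) / n_resamples
    return mean, lower, upper


# Hypothetical test set: 90 of 100 proteins classified correctly.
mean_acc, low, high = bootstrap_accuracy([True] * 90 + [False] * 10)
```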
A quantity termed CDF distance was also applied to assess whether the protein is ordered or disordered. The CDF distance is defined as:
where dCDF is the averaged CDF distance of the protein from the CDF boundary line. Ks and Ke are the starting and ending bins of the CDF boundary line. CDFi is the CDF value of i-th bin, while CDF0i is the value of CDF boundary at that bin.
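The displayed equation is not present in this copy of the text. One form consistent with the definitions above is sketched below; the normalization by the number of bins (inclusive of both end bins) and the sign convention (protein CDF minus boundary CDF) are assumptions, not details confirmed by the source.

```latex
d_{\mathrm{CDF}} = \frac{1}{K_e - K_s + 1} \sum_{i=K_s}^{K_e} \left( \mathrm{CDF}_i - \mathrm{CDF}^{0}_{i} \right)
```

Under this assumed convention, a positive value places the protein's curve above the boundary on average (the ordered side), and a negative value places it below (the disordered side).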
By combining the CDFs based on PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex, a neural network-based consensus method for predicting order/disorder status was developed. The neural network was fully connected, with twenty inputs (three from the PONDR® VLXT-based CDF, four from the PONDR® VSL2-based CDF, three from the PONDR® VL3-based CDF, three from the TOP-IDP-based CDF, four from the IUPred-based CDF, and three from the FoldIndex-based CDF), one hidden layer with ten hidden units, and one output. A sigmoidal curve was used as the activation function at each node. Inputs from the CDF of each predictor were selected from the bins having the highest separating accuracies. The above-mentioned fully disordered and fully ordered datasets were randomly separated into eight groups, with each group containing one eighth of both the original training and testing sets. In each round, seven groups were used for training, while one group was held out for testing. The training sets were further randomly split into two parts. One part, with 90% of the training data, was used for training. The remaining 10% was used to protect against over-fitting: weight parameters in the neural networks were chosen by maximizing the accuracy on these 10% of samples. The accuracy was evaluated on the testing datasets. This process was repeated eight times to implement eight-fold cross-validation. The final accuracy was the average over the eight testing sets.
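The network topology described above (twenty inputs, ten sigmoidal hidden units, one sigmoidal output) can be sketched as a plain forward pass. This is a minimal illustration only: the weights below are random placeholders rather than trained values, the bias terms are an assumption, and the function names are hypothetical. Training, the 90%/10% split, and cross-validation are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_in=20, n_hidden=10, seed=0):
    """Create placeholder weights; the last entry of each row is a bias."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return w_hidden, w_out


def forward(inputs, w_hidden, w_out):
    """Fully connected pass: inputs -> hidden layer -> single output."""
    if len(inputs) != len(w_hidden[0]) - 1:
        raise ValueError("unexpected number of inputs")
    # zip stops at the last input, so w[-1] is used only as the bias.
    hidden = [sigmoid(w[-1] + sum(wi * x for wi, x in zip(w, inputs)))
              for w in w_hidden]
    return sigmoid(w_out[-1] + sum(wi * h for wi, h in zip(w_out, hidden)))


# Twenty hypothetical CDF values selected from the six predictors' curves.
w_hidden, w_out = init_network()
score = forward([0.5] * 20, w_hidden, w_out)
```

The output lies strictly between 0 and 1, so a threshold on it (for example 0.5, an assumption here) would give the binary ordered/disordered call.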
Originally, a statistical method, in which the accuracy of separation is calculated by summation over both ordered and disordered proteins, was applied to locate the CDF boundary line . Here we describe an alternative approach. First, the average CDF values of ordered and disordered proteins were calculated separately for 20 bins along the X-axis. Next, for each bin, the vertical distance between the averaged ordered and disordered CDF values was divided into 30 parts, irrespective of the size of the distance between the two values. Then, the position of the boundary point was varied, and the prediction accuracies for both the ordered and disordered proteins were determined for each choice of boundary point. The accuracies of ordered and disordered proteins for all the boundary choices in all the bins gave an accuracy distribution matrix. Based on this matrix, the location and length of the boundary line were determined.
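The per-bin scan described above can be sketched as follows for a single bin. The function name `scan_bin` and the decision rule (a protein counts as ordered if its CDF value in this bin lies above the candidate boundary point, and as disordered otherwise) are assumptions for illustration. The sketch produces only one row of the accuracy distribution matrix and does not select the final boundary.

```python
def scan_bin(ordered_vals, disordered_vals, n_parts=30):
    """Scan candidate boundary points in one CDF bin.

    ordered_vals / disordered_vals: CDF values of the ordered and
    disordered training proteins at this bin (hypothetical inputs).
    Returns a list of (boundary, accuracy_ordered, accuracy_disordered).
    """
    mean_ord = sum(ordered_vals) / len(ordered_vals)
    mean_dis = sum(disordered_vals) / len(disordered_vals)
    # Divide the distance between the two averages into n_parts steps.
    step = (mean_ord - mean_dis) / n_parts
    rows = []
    for j in range(n_parts + 1):
        boundary = mean_dis + j * step
        acc_ord = sum(1 for v in ordered_vals if v > boundary) / len(ordered_vals)
        acc_dis = sum(1 for v in disordered_vals if v <= boundary) / len(disordered_vals)
        rows.append((boundary, acc_ord, acc_dis))
    return rows


# Hypothetical CDF values at one bin for four ordered and four disordered proteins.
rows = scan_bin([0.8, 0.9, 0.7, 0.85], [0.2, 0.3, 0.1, 0.25])
```

Repeating the scan for all 20 bins would give the full matrix of ordered and disordered accuracies from which a boundary line can then be chosen.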
To identify a boundary line made up of one continuous segment for which the low accuracy ends are removed and the high accuracy central region is kept, the following criteria were used:
Table 1 shows that the new PONDR® VLXT-based boundary achieved averaged accuracies of 88% and 89% for the ordered and disordered datasets, respectively. The new boundary outperforms the previous boundary by 2% for disordered proteins but is 2% less accurate for ordered proteins. However, the difference in accuracy between the ordered and disordered datasets was only 1% for the new method, compared to 3% for the previous method. This decreased discrepancy means an improved balance between ordered and disordered protein predictions, which is useful for reducing the overall false positive rate. Although this advantage is less prominent after the errors are taken into account, the new results are still comparable to the previous ones. The PONDR® VSL2-based boundary reached an accuracy similar to that of the PONDR® VLXT-based boundary, whereas the VL3-based boundary surpassed the PONDR® VLXT-based boundary by 2% on the ordered dataset. The IUPred-based boundary had the highest accuracy on the disordered dataset, 91%, which is about 6% higher than its accuracy on the ordered dataset. The TOP-IDP-based CDF boundary was the least accurate. The FoldIndex-based boundary showed slightly better results than the TOP-IDP-based one (see Table 1). However, on the partially disordered dataset, all the accuracies decreased significantly. For this dataset, the PONDR® VSL2-based CDF had the best accuracy, 84%, followed by the PONDR® VL3 CDF at 81%. FoldIndex ranked third at 80%. The accuracies of all other CDFs were around 70% or below (Table 1).
The reasons why some boundaries achieved higher accuracy are explored in Figure 1, which shows the averaged CDF curves for each dataset and the corresponding boundaries. Figure 1A shows that for the disordered proteins, the shapes of the PONDR® VSL2-CDF and PONDR® VL3-CDF curves are almost identical. The averaged PONDR® VLXT-CDF curve for the disordered proteins starts with noticeably higher values. This implies that the percentage of residues predicted to be ordered by PONDR® VLXT is relatively high, suggesting that this predictor has a tendency to over-predict order. The IUPred-CDF is lower than the PONDR® VLXT-CDF at small prediction scores but higher than the PONDR® VLXT-CDF at scores larger than 0.4. That is, IUPred assigned scores of about 0.4 to many residues of fully disordered proteins. For the ordered dataset, the PONDR® VSL2 CDF is always the lowest. When the prediction score is higher than 0.25, the IUPred CDF ranks the highest, followed by the PONDR® VL3 CDF. This is an expected result, because IUPred was created using data obtained from globular proteins. However, when the prediction score is less than 0.25, the PONDR® VLXT CDF ranks the highest, whereas the IUPred CDF and the PONDR® VL3 CDF are similar to each other. Figure 1B shows the averaged CDF curves and the boundaries for TOP-IDP and FoldIndex for the fully ordered and fully disordered datasets. The CDF curves for these two predictors have very unusual sigmoidal shapes. These two predictors therefore tended to assign intermediate scores to all residues and separated ordered and disordered proteins poorly. This indicates that neither TOP-IDP nor FoldIndex is very suitable on its own for binary classification.
Figure 1C shows the distribution of the distances between the ordered and disordered CDF curves for the six predictors. The PONDR® VLXT data are skewed toward low disorder scores, the PONDR® VSL2 data are somewhat skewed toward high disorder scores, the TOP-IDP and FoldIndex data are distributed in a very narrow interval, the IUPred data are also shifted toward the low-score region, and the PONDR® VL3 data are the most evenly distributed over the entire interval of disorder scores. This suggests that PONDR® VL3 could produce one of the best separations. In agreement with this conclusion, the average CDF differences between the ordered and disordered datasets in the boundary bins were 0.33, 0.47, 0.54, 0.06, 0.49, and 0.24 for PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex, respectively. Taken together, these observations indicate that PONDR® VL3 has the most accurate boundary for separating the ordered and disordered datasets.
The data shown in Figure 1 were used to generate CDF boundary points, which were then fit by the following linear equations:
where CDFVLXT, CDFVSL2, CDFVL3, CDFTOP-IDP, CDFIUPred, and CDFFoldIndex correspond to the CDF boundary values based on the PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex predictors, respectively, whereas DO corresponds to the disorder score. The PONDR® VL3-based boundary is parallel to the PONDR® VLXT-based boundary but is shifted to lower disorder scores; all other boundaries are steeper than the PONDR® VLXT-based boundary and are also shifted to lower disorder scores. The values of the disorder score at the low end of each boundary line are 0.6, 0.4, 0.4, 0.5, 0.3, and 0.25 for the PONDR® VLXT-, PONDR® VSL2-, PONDR® VL3-, TOP-IDP-, IUPred-, and FoldIndex-CDFs, respectively.
Figure 2A shows the PONDR® VLXT-, PONDR® VSL2-, PONDR® VL3-, TOP-IDP-, IUPred-, and FoldIndex-based CDF curves for partially disordered proteins. It is important to emphasize that all the partially disordered proteins in this study were collected from PDB. As a result, all of them have a significant number of ordered residues, suggesting that the current set of partially disordered proteins is highly biased toward order. One can therefore expect that the majority of partially disordered proteins in the current dataset will be predicted by CDF analyses to be ordered. In agreement with this hypothesis, all CDF curves in Figure 2A are rather similar to the CDF curves calculated for the fully ordered proteins (cf. Figure 1).
Next, to understand whether the prediction tendencies differ between partially disordered proteins with long disordered regions and proteins with several short disordered regions, the original partially disordered dataset (PDD) was divided into two groups: one with proteins having disordered regions longer than 50 residues (PDD-L), and another with proteins having shorter disordered regions (PDD-S). The results of the analysis of these subsets by the various CDFs are shown in Figure 2B and Table 3, which show that proteins in the PDD-S set are predicted to be more ordered than proteins in the PDD-L set. This conclusion follows from the fact that partially disordered proteins with long disordered regions are generally located closer to the boundary than proteins with several short disordered regions (see Table 3).
At the final stage, the outputs from the PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex CDFs were used to build a neural network-based consensus method, NN-CDF, for binary disorder classification. The data were divided into 8 subsets to implement 8-fold cross-validation. Table 2 shows that, compared to the individual PONDR® VSL2, PONDR® VL3, and IUPred CDF predictions, the new consensus predictor gave an increase of about 4% in the averaged prediction accuracy over the fully ordered and fully disordered datasets. Its accuracy on the ordered dataset is 2% higher than that of the PONDR® VL3 CDF predictor, which is the second best among all the methods. For the disordered dataset, this method has an accuracy similar to that of the IUPred CDF, around 90%. The larger error observed for the consensus NN may result from the small number of samples in the testing subsets. For partially disordered proteins, the accuracy of the consensus NN is around 10% higher than that of the second-best method, the PONDR® VSL2 CDF.
Table 4 shows the percentages of fully disordered proteins in three genomes, Escherichia coli K12, Archaeoglobus fulgidus, and Methanobacterium thermoautotrophicum, as evaluated by CDF predictors based on PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, FoldIndex, and NN. The PONDR® VLXT-based CDF predicts 2 to 3 times more disordered sequences in all three species than the PONDR® VSL2-, PONDR® VL3-, TOP-IDP-, IUPred-, and FoldIndex-based CDF methods. Even when only proteins whose whole CDF curve lies completely below the boundary line are counted (data in brackets in Table 4), the PONDR® VLXT CDF still predicts many more disordered sequences, especially for Archaeoglobus fulgidus and Methanobacterium thermoautotrophicum. The results for PONDR® VSL2, PONDR® VL3, TOP-IDP, and FoldIndex are broadly similar to each other, although TOP-IDP gives a slightly lower percentage of disordered proteins for Escherichia coli and higher values for Archaeoglobus fulgidus. IUPred gives a higher percentage of disordered proteins for Escherichia coli and an extremely low disordered ratio for Archaeoglobus fulgidus. With the consensus method, the percentage of disordered proteins decreases further, to 4–9%.
We developed a new error-estimation method for identifying the boundary line in CDF graphs containing CDF curves for both ordered and disordered proteins. This method does not require the assumption that CDF values are normally distributed around the average in the corresponding datasets. By using this new method, we generated CDF-based prediction tools for the PONDR® VLXT, PONDR® VSL2, PONDR® VL3, TOP-IDP, IUPred, and FoldIndex predictors. All of them achieved reasonable prediction accuracy. We also developed a neural network-based consensus method that uses the outputs of all the above-mentioned CDFs. This consensus method was 4–5% more accurate than any of the individual predictors. We further performed a series of experiments in which one or two of the less accurate CDF predictors were removed from the input of the consensus method. To our surprise, even the less accurate predictors were useful for improving the final prediction accuracy. The influence of the various components on the performance of the final tool will be analyzed further in the future. It is also worth noting that although the consensus method achieved high accuracy on the partially disordered dataset, the identification and classification of partially disordered proteins is not a trivial task. By definition, partially disordered proteins should have an “evenly increased” curve or a “flat central region” on the CDF plots. The peculiarities of the CDF predictions for partially disordered proteins need to be studied more carefully.
The numbers of wholly disordered proteins predicted in Escherichia coli K12, Archaeoglobus fulgidus, and Methanobacterium thermoautotrophicum by the PONDR® VLXT-based CDF were higher than previously reported . Furthermore, the PONDR® VLXT-CDF predictor identified a significantly larger number of disordered sequences in all three species than the other CDF predictors. This is because the new PONDR® VLXT boundary line is located higher than the PONDR® VLXT-based CDF boundary line calculated in the previous study . This shift was determined by the need to balance the false positives in the wholly ordered and fully disordered sets. Since the same method was used for the other CDF predictions, it could be expected that the other boundary lines are also shifted to higher positions. The final consensus prediction revealed that the percentages of disordered proteins in Escherichia coli K12, Archaeoglobus fulgidus, and Methanobacterium thermoautotrophicum are 4.2%, 7.5%, and 8.4%, respectively. These results are very similar to the previously reported ratios of 4.6%, 6.3%, and 8.0% . The discrepancy among individual predictors indicates that there is still an urgent need for new prediction protocols and for precise estimation of the disorder content of whole genomes.
This work was supported in part by the grants R01 LM007688-01A1 (to A.K.D and V.N.U.) and GM071714-01A2 (to A.K.D and V.N.U.) from the National Institutes of Health and the Program of the Russian Academy of Sciences for the “Molecular and cellular biology” (to V. N. U.). We gratefully acknowledge the support of the IUPUI Signature Centers Initiative.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.