PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
 
BMC Bioinformatics. 2010; 11: 198.
Published online 2010 April 21. doi:  10.1186/1471-2105-11-198
PMCID: PMC2877690

Predictors of natively unfolded proteins: unanimous consensus score to detect a twilight zone between order and disorder in generic datasets

Abstract

Background

Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding.

Results

In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins.

Conclusions

Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated.

Background

For long time it has been thought that the existence of a stable three dimensional structure is necessary for a protein molecule to be functional. Nevertheless, the evidence that many proteins are unfolded in their functional state has induced, in the last decade, a re-assignment of the structure-function paradigm [1]. Natively unfolded proteins, also known as "intrinsically unstructured" or "intrinsically disordered" [1-3] lack a well defined tertiary structure, being functional in states made of an ensemble of flexible conformations. It is known that these proteins are involved in important cellular processes like signalling, targeting and DNA binding [1-8]. It has been suggested that they may play critical roles in the development of cancer [9,10] and in some amyloidotic diseases [10-12]; moreover, the absence of a rigid structure allows them to bind different targets with high specificity and low affinity, suggesting that they may be hubs in protein interaction networks [13-16]. The current release 5.0 of the Disprot database (http://www.disprot.org/; see also [17,18]) contains 517 proteins with different "flavours" of disorder [19].

A great effort has been made to predict natively unfolded proteins through an ab-initio analysis of their amino acidic sequences. Many methods have been proposed (for reviews see [20,21]), both to predict disordered amino acids of a protein, i.e. amino acids whose atoms are hard to localize by X-ray crystallography or NMR spectroscopy [22,23], and to define synthetic binary scalar indexes that express the probability that a protein has a global tendency to fold or to remain unfolded. It has been observed that predictors of natively unfolded proteins are, basically, functionals of the amino acidic composition [24]. Szilagyi et al. reported that hydrophobicity and charge distributions of folded and natively unfolded proteins overlap, and, on a hydrophobicity/charge plane, there is an area occupied by proteins from both groups [24]. This suggests that, in the vector space of amino acidic compositions, it is not possible to find a hyper-plane separating the points corresponding to sequences of folded proteins and those corresponding to sequences of unfolded proteins, but there exists an overlap volume, that is reasonable to identify with a "twilight zone" between order and disorder [24,25].

In this work we are interested in finding an operational method to identify proteins in this twilight zone.

In the following paragraphs of this section we outline the perspective of our work and anticipate the main results we shall present and discuss in the next sections.

In the first part of this study we revisit several methods to predict natively unfolded proteins through binary scalar indexes of folding. Uversky et al. propose to analyse the mean hydrophobicity and mean net charge of the protein sequences [26]; following their method, Prilusky et al. developed a web-based server named FoldIndex [27]. Galzitskaya et al. propose to use the mean packing of a protein sequence as an indicator of its folding status [28-30]. Recently, mean pairwise energy was used to effectively discriminate folded proteins from natively unfolded ones in a peculiar set of 39 protein complexes [31,32]. A more refined approach is Poodle-W by Shimizu et al. [33]; they analyse the amino acid composition of the protein sequences with a spectral graph transducer [34].

We compared the performances of these predictors on several datasets. We also defined a global binary folding index, named here gVSL2, using VSL2 [35,36], a predictor of disordered amino acids that excellently performed in the CASP7 experiment [37] and which is an evolution of the predictors that evidenced different flavours of disorder [19].

We observed that in many cases a number of sequences were differently classified by different indexes (figure (figure1).1). To identify these proteins we introduced, by combining several indexes, a strictly unanimous consensus score SSU that leaves unclassified a protein if at least two indexes disagree in classifying it. We verified that proteins unclassified by SSU span, on a hydrophobicity/charge plane, an overlap area occupied both by folded and natively unfolded proteins (figures (figures22 and and3).3). We checked also that these proteins have amino acidic frequencies intermediate between those of proteins that SSU predicts either as folded or natively unfolded (figures (figures4,4, ,5),5), so we have concluded that they belong to a twilight zone in the space of amino acidic composition. Proteins in this twilight zone have a flexibility intermediate between that of folded and that of natively unfolded proteins and have a dependence of chain length similar to that reported by Szilagyi et al. [24].

Figure 1
Overlaps between the predictions of single indexes. Comparison of the predictions by mean pairwise energy, gVSL2 and Poodle-W, on set C. The three indexes agree on 1785 proteins predicting them in the same class; 224, 210 and 231 proteins are singly classified ...
Figure 2
Hydrophobicity/charge plot of folded, unfolded and unclassified proteins in set B. Hydrophobicity/charge plot of proteins in set B, experimentally identified as folded (red) and unfolded (blue). Upper green triangles refer to folded proteins unclassified ...
Figure 3
Hydrophobicity/charge plot of folded, unfolded and unclassified proteins in set C. Hydrophobicity/charge plot of proteins in set C, experimentally identified as folded (red) and unfolded (blue). Upper green triangles refer to folded proteins unclassified ...
Figure 4
Amino acidic percent composition of folded, unfolded and unclassified proteins. Amino acidic composition histograms of proteins in set C that are predicted as folded (red, left bars) and unfolded (blue, right bars) by SSU; the central bars in green refer ...
Figure 5
Distribution of the log-odds ratio S in folded, unfolded and unclassified proteins (twilight zone). Distribution of log-odds ratios in predicted folded (red bars), unfolded (blue bars) and unclassified proteins (green bars), as evaluated by SSU on set ...

We used SSU to scan several genomes from Archaea, Bacteria and Eukarya looking for natively unfolded proteins; the obtained percentages are similar to those already reported in the literature [38,39]; through SSU we found robust scaling laws both in the classified and unclassified sets of proteins, of possible significance for studies in molecular evolution [40,41].

Results

Performances of single indexes of folding

To compare the performance of the synthetic predictors we computed them on set A, that Prilusky et al. selected to test FoldIndex [27]. The results are reported in table table1.1. The index HQ is our implementation of the algorithm described by Uversky and co-workers [26]. Mean packing and mean pairwise energy were computed following the original protocols [29,31] (see Methods). We classified a protein as natively unfolded if it had a mean packing below 20.55 and a mean pairwise energy above -0.37 arbitrary energy unit (a.e.u.); these threshold values were optimized to reach a sensitivity of at least 0.80 and a level of false predictions as low as possible. gVSL2 is the arithmetic mean of the disorder scores obtained by the predictor VSL2 [35,36] (see Methods for details).

Table 1
Performance of single folding indexes on set A

From the AUC values shown in table table11 it is possible to rank the single indexes. We observe that mean pairwise energy, gVSL2 and Poodle-W have about the same performance and are the best three.

To test the stability of these results over different datasets, we repeated the experiment using two other groups of proteins: set B, used by Shimizu et al. to test Poodle-W [33] and set C, our own selection of proteins. Set B contains 526 folded proteins with a level of disorder below 5% and 81 natively unfolded proteins with a level of disorder above 70%; it has been built as a very discriminative set, aiming at including either fully ordered or fully disordered proteins. Set C was compiled to test the ability of previously proposed methods to separate, on one side, fully ordered folded proteins but also proteins containing a percentage of disordered amino acids above 5%, from, on the other side, fully disordered proteins containing more than 70% of disordered residues. It contains 2369 folded proteins, 1573 fully ordered with a level of disorder below 5% and the remaining 796 proteins with higher percentage of disordered amino acids. Set C contains also 81 natively unfolded sequences with a level of disorder above 70% (see Methods). The performances of the single folding indexes on set B and C are reported in tables tables22 and and3,3, respectively. The AUC values confirm that Poodle-W, gVSL2 and mean pairwise energy <Ec>are the best performing on both sets. Note also that Poodle-W and <Ec>have a low level of false positives fp, whereas gVSL2 tends to have a higher sensitivity Sn.

Table 2
Performance of single folding indexes on set B
Table 3
Performance of single folding indexes on set C

It is evident the lower performance of all the indexes on set C, with respect to set B. As said above, the protocol followed by Shimizu et al. [33] is more restrictive; their set contains folded proteins whose average composition is different from that of unfolded ones (as shown in the hydrophobicity/charge plot of figure figure3)3) whereas set C, being more generic, contains a large number of untypical folded sequences. Poodle-W confirms itself as the relatively best single index, both on the optimized dataset B and on the more generic dataset C.

We checked that set C includes 506 proteins that are complexed with peptides or other macromolecules. These proteins may be partially disordered, particularly if they are not linked with their targets. We controlled however that the changes in the performance of the indexes are negligible if we exclude these proteins. The performance of the folding indexes computed on set C purged from these 506 complexed proteins are included in the [Additional file 1: Table ST1].

A consensus score to detect a twilight zone of amino acidic composition

In figure figure11 we compare, in a Venn diagram, the results of the predictions obtained by mean pairwise energy <Ec>, gVSL2 and Poodle-W for the 2450 proteins in set C. We see that the three indexes predict in the same folding class 1785 proteins, and that the indexes partly overlap, i.e. <Ec>and gVSL2 predict in the same folding class 231 proteins, and there are other 434 proteins on which at least two indexes of folding agree. On the other hand, there are proteins that are uniquely predicted by each single folding index: 224 by <Ec>, 210 by gVSL2 and 231 by Poodle-W. So, globally, there are 665 proteins on which at least two indexes disagree.

It is reasonable to think that proteins differently classified by two single folding indexes have amino acidic frequencies different from those typical of folded and natively unfolded proteins, then they are natural candidates to belong to a twilight zone in the vector space of amino acidic compositions. To identify these proteins we used a consensus score. The use of combination scores is not new [39,42] and a meta-server is also available [43]. We introduced a combination rule that we called strictly unanimous score, SSU. We decided to combine mean pairwise energy, gVSL2 and Poodle-W because, among the global indexes here considered, they are the best performing ones (see tables tables1,1, ,22 and and3).3). SSU classifies a protein as folded if all the indexes agree in predicting it as folded; conversely, it classifies a protein as natively unfolded if all the indexes agree in predicting it as natively unfolded; proteins are left unclassified when two indexes disagree. SSU classifies the majority of the sequences analysed (see nc in table table4):4): of the 2369 folded proteins, 646 are left unclassified, whereas of the 81 natively unfolded proteins, 19 are left unclassified; overall, the unclassified proteins are therefore 665, 27% of all proteins in set C. In set B the percentage of unclassified proteins is 11%.

Table 4
Performance of each of the indexes combined into SSU on the sets B and C purged by the unclassified proteins

Note that, due to the clause of unanimity, the performance of the single indexes on sets B and C purged from the unclassified proteins coincides with that of SSU and it is improved (compare tables tables22 and and33 with table table4).4). This improvement after filtering can be obtained only if one combines in SSU a set of indexes of comparable performances, as discussed in the second subsection of the Discussion. We observe, in table table4,4, a relatively high sensitivity, and a low level of false positives, suggesting that removing the unclassified proteins is a filter of false predictions. We checked that 68% of the false predictions of <Ec>, 74% of those of gVSL2 and 54% of those of Poodle-W are unclassified by SSU. Note also the small fraction of unclassified unfolded proteins; that is essentially due to the fact that unfolded proteins are less numerous in the datasets.

Since unclassified proteins by SSU are often false predictions of the single indexes, they should have an untypical amino acidic composition. We checked that on the hydrophobicity/charge plane, which is a projection of the space of amino acidic compositions (figures (figures22 and and3).3). Figure Figure22 refers to set B, figure figure33 to set C. Points in blue and in red correspond to experimentally determined unfolded and folded proteins, respectively. Points in green and in black correspond to the proteins unclassified by SSU. To give all the information we have distinguished with green points folded unclassified proteins, and with black points unfolded unclassified proteins. It is evident that in set C folded and unfolded proteins are largely mixed up. This is due to the inclusion in the dataset of a number of protein complexes (about 22%), as well as of binding, signalling, nucleic acids associated and membrane proteins (about 40%). Many of them could have a composition typical of unfolded proteins, but nevertheless are folded and crystallizable after binding with a ligand. We checked that the overlap decreases if we exclude these proteins from the dataset and the hydrophobicity and charge distributions tend to those of set B [Additional file 2: Figure S1 (A)]. Note, however, that in both figures figures22 and and33 (see insets) the centroids of the areas spanned by the three groups of proteins are well separated and aligned; those referring to unclassified proteins (green) are intermediate. Interestingly, in the more curated set B, the proteins unclassified by SSU fall in a more restricted and well defined overlap area. A similar result is obtained from set C purged from complexes and ligand binding proteins [Additional file 2: Figure S1 (A)]. In the supplemental figure S1(B) we plot, on the hydrophobicity/charge plane, folded proteins in set C that are protein complexes or ligand binding proteins. Therefore, supplemental figure S1(B) reports those folded proteins that have been excluded in supplemental figure S1(A); the unfolded proteins in set C are reported both in the supplemental figures S1 (A) and (B). Let us note that unclassified proteins are not all or mostly membrane proteins, signalling and DNA binding proteins. It is also clear that the centroids of the unclassified proteins are always intermediate both in the whole set C (figure (figure3),3), in the set purged by untypical proteins (supplemental figure S1 (A)) and in its complement (supplemental figure S1 (B)). These observations confirm that proteins we assign to the twilight zone belong to an overlap volume in the amino acidic composition space.

Amino acidic composition of proteins in the twilight zone

It is interesting to check directly whether proteins which are unclassified by SSU have amino acidic frequencies intermediate between those predicted as folded and those predicted as natively unfolded. We computed the frequencies of the amino acids for the three groups of proteins in set C. The results are reported in figure figure4.4. We see that W, C, F, I, Y, V, L, H and N are more present in folded than in unfolded proteins, whereas A, R, Q, S, P, E, K are more frequent in unfolded than in folded proteins; this result is consistent with that reported in the literature [19,44]. Following Romero et al. [44], we indicate the first group of amino acids as order-promoting and the second group as disorder-promoting. We observe also that the unclassified proteins by SSU have a peculiar composition with respect to folded and natively unfolded ones; more precisely, the frequencies of W, F, I, Y, V, L, Q, S, P, E and K are comprised between those of proteins predicted either ad folded or natively unfolded. M, T, A, N and D have frequencies similar in the three groups of sequences. By looking at figure figure44 one would conclude that the amino acidic composition of the unclassified proteins is substantially intermediate between those of proteins predicted as folded and those of natively unfolded. However, this could only signals that the set of unclassified proteins is a balanced mixture of folded and unfolded proteins. To exclude this hypothesis we checked that only a small fraction of unclassified proteins belong to the set of natively unfolded sequences (see previous section and table table44).

We further characterized the amino acidic composition of the proteins in set C by computing the log-odds ratio S of the likelihoods, for a sequence, of being composed mainly by order-promoting or disorder-promoting amino acids (see Methods for details). Positive (negative) values of S indicate that the sequence is composed mainly by order-promoting (disorder-promoting) amino acids [45]. The distributions of log-odds ratios are reported in figure figure55 for sequences that are either classified by SSU or left unclassified. Note that, visually, the S scores are not normally distributed in each group of sequences; their statistics are different as checked by Wilcoxon's rank-sum test (p-values of the order of 10-16). As expected, proteins classified as folded have positive log-odds ratios (mean S equal to 6.98 ± 0.16, the median is 5.20). Proteins classified as unfolded have negative log-odds ratios (mean S is -6.87 ± 1.04 and the median is -3.28). Unclassified proteins have values of the log-odds ratio distributed around zero (mean 0.93 ± 0.16, median 0.52), between those of folded and natively unfolded proteins. This means that they are composed approximately by the same number of order-promoting and disorder-promoting amino acids, i.e. their amino acidic composition is intermediate between that of folded and natively unfolded proteins. We found a similar result for set B. Note that, in figure figure4,4, the frequencies of C, H, R and G in the unclassified proteins, are not comprised between those in the other two classes but are even higher than those. This means that the amino acidic composition of the unclassified proteins is not strictly in between those of folded and natively unfolded proteins. The important point to make here is: the composition in the twilight zone is different, possibly corresponding to different folding propensities.

Discussion

General remarks

Most global predictors of disorder rely on the amino acidic composition of protein sequences [24]. The distribution of the amino acids in folded proteins partly overlaps with that in natively unfolded proteins, moreover, as also noted in [24], the existence of two-state homodimers, that are disordered as monomers but fold on homo-dimerization, indicates that composition alone does not determine the folding status of a sequence. In this paper we have presented the unanimous consensus score SSU; this score effectively selects proteins that, in the vector space of amino acidic composition, belong to an overlap volume occupied both by folded and natively unfolded proteins. This overlap volume has been theoretically investigated by Szilagyi et al. [24] as a twilight zone between order and disorder. Proteins in the twilight zone, as identified by SSU, have a distribution of the log-odds ratios S concentrated around zero, statistically different and intermediate between the S distributions of folded and unfolded proteins. This suggests that proteins in the twilight zone are those with approximately the same number of order-promoting and disorder-promoting amino acids.

It is important to point out that SSU assigns proteins to the class of folded, unfolded or unclassified considering only their amino acidic composition and not the percentage of disordered amino acids they contain. So, proteins unclassified by SSU generally do not have a higher fraction of disordered amino acids than proteins classified as folded. In figure figure66 we report the distributions of disordered amino acids in proteins assigned by SSU to the class of folded and unclassified proteins; it is evident that both classes have the same distribution. We checked that, in set C, of the 796 proteins with a percentage of disordered amino acids higher than 5%, 541 are classified by SSU as folded, 45 as unfolded and 210 are unclassified. We also checked that of the 506 complexed proteins alluded to in the results section and in the supplemental table ST1 [Additional file 1: Table ST1], 361 are classified as folded, 30 as unfolded and 115 are unclassified. So, proteins in the twilight zone cannot directly be identified with proteins with long flexible segments or loops. In figure figure77 we report the distributions of mean flexibility in predicted folded, unfolded and unclassified proteins (see Methods for definitions). Interestingly, unclassified proteins exhibit a mean flexibility (-0.465 ± 0.001) which is intermediate between that of proteins predicted folded (-0.486 ± 0.001) and natively unfolded (-0.431 ± 0.002). This suggests that the mean flexibility of proteins in the twilight zone is not directly related to the number of disordered amino acids. It will be interesting, on a case-by-case study, to investigate how the distribution of disordered amino acids correlates with the localization of flexible tracts in the proteins of the twilight zone.

Figure 6
Fraction of disordered amino acids in folded, unfolded and unclassified proteins (twilight zone). Fraction of disordered amino acids in predicted folded (red bars) and unclassified (green bars) proteins, as evaluated by SSU from set C. A residue is disordered ...
Figure 7
Distribution of mean flexibility in folded, unfolded and unclassified proteins (twilight zone). Distribution of mean flexibility in predicted folded (red bars), unfolded (blue bars) and unclassified protein (green bars), as evaluated by SSU on set C.

An interesting point is raised by Szilagyi et al. [24], who underscore the relevance of the chain length for the tendency of a protein to be folded or unfolded. This suggested us to look at the distribution of lengths in the proteins of the three classes: folded, unfolded and unclassified (twilight zone, see figure figure8).8). It is interesting to note that in all cases there is a scaling power law, but the scaling exponents are quite different in the three classes; remarkably the twilight zone has the more negative scaling exponent (-3.3 ± 0.2), indicating that proteins in this class tend to be shorter than folded and unfolded ones. We subdivided proteins of set B into different bins of length, corresponding to sequences of 50-99, 100-199, 200-299 and up to 300 amino acids, respectively. Then, we estimated the extent of the projection of the twilight zone in the hydrophobicity/charge plane for each bin considered. The results are reported in graphs of supplemental figure S2 [Additional file 3: Figure S2], where dashed lines refer to a twilight zone determined by a logistic regression similar to that in ref. [24], whereas solid lines delimit the twilight zone as detailed in the Methods section; our estimates of the projection of the twilight zone are given by areas delimited by parallel solid lines drawn to include, for each bin of length, not less than 75% of the dots referring to proteins left unclassified by SSU. We observe a narrowing of the extent with the increase of chain lengths, consistently with the observations in Szilagyi et al. (see figure figure22 in ref [24]) based on a logistic regression. Moreover, we have observed that the narrowing is quite independent from the choice of the threshold of 75% we use to delimit the zone. It is important to observe that the projection of the twilight zone, as selected by SSU (the area between solid lines in supplemental figure S2), does not coincide with the twilight zone determined by the logistic regression in the hydrophobicity/charge plane, as done in [24] (the area between dashed lines in the same supplemental figure S2). It is out of scope here to study how many and which of the proteins left unclassified by SSU are also recognized by the logistic regression in a twilight zone of the hydrophobicity/charge plane; let us only remark again that we have shown here that unclassified proteins are characterized by an S score close to zero, related to a mean flexibility which is intermediate between that of folded and unfolded proteins (figures (figures55 and and77).

Figure 8
Distribution of lengths in folded, unfolded and unclassified proteins (twilight zone). Log-log plots of the distribution of lengths in the three classes of proteins extracted by SSU from set C. The scaling exponents, evaluated from a regression of the ...

We believe that SSU can be useful as a starting point for further studies of the proteins lying in the region between order and disorder. We already used SSU in a previous work [41], scanning several genomes from different kingdoms; we obtained percentages of natively unfolded proteins of about 0.8% for Archaea, 3.7% for Bacteria and 23.4% for Eukarya, consistent with those previously reported [38,39]. The percentage of unclassified proteins is of 5.1% for Archaea, 7.4% for Bacteria and 15.8% for Eukarya. We also found scaling laws: the scaling exponent of the percentage of disordered proteins as a function of the number of proteins in the genome is 1.81 ± 0.10, whereas the percentage of unclassified proteins scales with an exponent 1.29 ± 0.05 [41]. In that analysis we combined mean packing, mean pairwise energy and gVSL2, but we did not use Poodle-W, since it was then unavailable to us. We are planning to extend that research using the operational combination of indexes proposed in the present study.

In the next section, at last, we propose a generic protocol on how to combine indexes into a strictly unanimous score.

A protocol for the combination of indexes able to select a reliable subset of proteins in the twilight zone

The extent of the twilight zone made of the unclassified proteins depends on the particular choice of the combined indexes. We checked that if one combines indexes with relatively low performance with more performing ones then only a small fraction of unclassified proteins corresponds to the false predictions of most single indexes. In this case many mistakenly classified proteins by the poor performing indexes are just unrecognized true positives and true negatives of the best performing ones, and then the twilight zone is poorly characterized. To avoid that we propose the following simple protocol: the folding indexes are initially ordered by decreasing values of the AUC; then one or more indexes are subsequently used to form consensus scores, monitoring the parallel decrease of fp. SSU (1) is just the best performing single index; SSU (2) is the combination of the first with the second best index; SSU (3) the combination of the first three indexes, and so on. An enhancement of the performance of the indexes on the more and more polarized subsets is expected at each step, together with a parallel increase in the percentage of unclassified proteins. In fact, these trends are shown in table table55 for the indexes considered in this work; by looking at this table it is clear that, from step 3 on, fp tends to saturate, whereas the fraction of unclassified proteins increases. As a rule of thumb we propose to add indexes until the fp starts to saturate and we adopted this criterion in selecting Poodle-W, gVSL2 and mean pairwise energy <Ec>in the present study.

Table 5
Performance of strictly unanimous scores, as a function of the number of combined single indexes, tested on set C

Conclusions

It has been pointed out that the separation between order and disorder in proteins is not sharp [24,25], but there exists a twilight zone between them. In Szilagyi et al. [24] the twilight zone is identified as an overlap volume in the space of amino acidic compositions, occupied both by folded and natively unfolded proteins. In this work we used the consensus score SSU to select, operationally, proteins belonging to the twilight zone, out of a generic dataset. The method can be easily adopted for large database screening, since it is easy to implement and computationally efficient. Our results show that proteins unclassified by SSU: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of predicted folded proteins and that of predicted unfolded proteins. These last remarks indicate these unclassified proteins belong to a class that possibly has physical properties intermediate between those of folded and those of natively unfolded proteins.

Methods

Datasets

Set A was selected by Prilusky et al. to test FoldIndex [27]. It is composed by 151 folded proteins and 39 natively unfolded proteins. Folded proteins are extracted from OCA Protein Data Bank http://bioportal.weizmann.ac.il/oca, selecting those with less than 10% of sequence identity and lengths comprised between 50 and 200 residues. Proteins with disulfide bridges, hetero-groups or other non-protein elements are excluded; moreover, only X-ray structures with a percentage of disordered amino acids below 5% are considered. Unfolded proteins are selected by searching the literature for experimentally certified cases. Set B was selected by Shimizu et al. to train and test Poodle-W [33]. It is composed by 526 folded proteins selected from the PDB, and 81 natively unfolded proteins selected from the DisProt database [17,18]. Folded proteins are extracted from the PDB, among those with: i) resolution better than 2 Å; ii) R-factor below 20% and refined through Refmac5, SHELXL97 or CNS; iii) a percentage of disorder below 5% and no disordered amino acids in the central area (i.e. between the 10th residue from the N-terminal and the 10th residue from the C-terminal). Natively unfolded proteins are extracted from DisProt version 3.3 among those with a percentage of disordered amino acids above 70%. Both folded proteins and natively unfolded ones have a sequence identity below 30%, determined using BLASTClust.

Set C was selected for the present work. Folded proteins are extracted from PDBSelect25 [46,47], version October 2007, which contains 3694 proteins with sequence identity lower than 25%. Structures with a resolution above 2 Å and an R-factor above 20% are excluded. A restricted list of 2369 is obtained, 1573 of which with a percentage of disordered amino acids below 5%. Operationally, a residue is considered as disordered if it is present in the SEQRES but not in the ATOM field of the PDB file [37,38]. A list of 81 natively unfolded proteins, with at least 70% of disordered amino acids and sequence identity below 25%, are extracted from the DisProt database [17,18], version 3.6. Note that the 81 unfolded proteins in this set, coincide only in part with the 81 unfolded proteins selected, with a more refined procedure, in [33]. The list of PDB and DisProt entries of the proteins in set C are available in Additional files 4 and 5.

HQ index

We followed the method by Uversky et al. [26], revisited by Prilusky et al. [27]. From mean hydrophobicity <H>and mean net charge <Q>of the protein sequences, the following index is evaluated:

equation image

We considered a protein as natively unfolded if HQ is negative or zero, otherwise we considered it as folded.

Mean packing

The mean packing of a protein sequence is the arithmetic mean of the packing values of each amino acid. The packing index has been introduced by Galzitskaya et al. [29], it is evaluated by counting, for each type of residue, the residues located within a distance of 8 Å, and averaging over a large dataset of structures. The packing index is here assigned to the central residue as the average over a sliding window of length 11. We considered a protein as natively unfolded if its mean packing is below 20.55, otherwise we considered it as folded. We did not observe a significant improvement of the performance changing the length of the sliding window.

Mean pairwise energy

We followed the method by Dosztanyi et al. [31], as implemented in IUPred [48]. The pairwise energy of an amino acid expresses its "contact interaction" with the amino acids located, downward and upward, from 2 to 100 position apart, along a given sequence. The pairwise energy of amino acid i at position p is given by:

equation image

where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-198-i3.gif is the frequency of amino acid j in a window of length up to 100 around position p, taking into account the limitations on both sides, due to the length of the protein. The generic element Pij of the "energy predictor matrix" P expresses the expected contact interaction energy between amino acid i and j. Pairwise energy values are averaged over a window of 21 amino acids and the average is assigned to the central residue at position p in the sequence. Finally, the arithmetic mean of the pairwise energy values of all the amino acids gives the global mean pairwise energy of the protein.

We considered a protein as natively unfolded if its mean pairwise energy is above -0.37 a.e.u., otherwise we considered it as folded.

Index gVSL2 derived from VSL2

VSL2 [35,36] is a disorder predictor that assigns to each amino acid of a protein its propensity to be disordered, estimated using a combination of support vector machines. The score from VSL2 is normalized between 0 and 1 and an amino acid is considered disordered if its value is above 0.5.

We evaluated the global index gVSL2 by assigning to the central residue the average of VSL2 scores (variant VSL2B) over sliding windows of 11 residues, and then taking the average value over all the residues. We classified a protein as natively unfolded if its gVSL2 value is above 0.5.

Poodle-W

Poodle-W [33] is a predictor of natively unfolded proteins based on a Spectral Graph Transducer semi-supervised learning machine [34]. It analyses the amino acid compositions of both proteins with known structural properties and with unknown structural properties, and it constructs a k-nearest neighbour graph with sequences as vertices and similarity among sequences as edges; then it cuts the graph into two sub-graphs so as to minimize the misclassification of the sequences with known structure and the sum of the edges weights of the graph; the resulting sub-graphs identify the predicted folded and natively unfolded proteins.

Statistics for the amino acid composition of protein sequences

We followed the approach of Romero et al. [44]. In a dataset containing Nsequ sequences, let nj be the number of the amino acids of the j-th sequence and nij the number of occurrence of the i-th amino acid in the j-th sequence. The frequency of the i-th amino acid in the j-th sequence is given by:

equation image

with variance

equation image

The frequency of the i-th amino acid in the dataset analysed is computed as:

equation image

where Naa is the number of the amino acids in the dataset. The corresponding variance is given by:

equation image

Log-odds ratio of the likelihoods that a sequence has amino acidic composition typical of folded and unfolded proteins

To compute this score, one usually refers to a probabilistic model. One assumes to have reliable estimates of the probability of occurrence of each amino acid a in folded and unfolded proteins {πa(F)} and {πa(U)}, respectively. A folded protein can be thought of as if its sequence is sampled from {πa(F)} through a sequence of independent extractions. The likelihood that a sequence has amino acidic composition typical of a folded protein is:

equation image

where An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-198-i9.gif is the probability of amino acid a (estimated from the occurrences of a in a convenient sample of folded proteins) and na is the occurrence of amino acid a in the sequence. The probabilistic model implicit in the above definition is a 0-order Markov chain. Similarly we can define LU by using πa(U), with a simple extension of the notation. Then, LF/LU is the likelihood ratio associated to a given sequence, through its amino acidic composition {na}, where a runs over all amino acid labels.

The log-odds ratio of a given sequence is then:

equation image

Order-promoting amino (disorder-promoting) acids contribute with positive (negative) terms to S, since their ratios πa(F)a(U) are bigger (lower) than one. Therefore, S is positive (negative) if the protein is composed mainly by order-promoting (disorder-promoting) amino acids. When a protein is composed by approximately the same number of order-promoting and disorder-promoting amino acids, its S score has a value close to zero. Likelihood ratio tests are treated in different textbooks [45,49]

Mean flexibility

The mean flexibility of a protein is the arithmetic mean of the flexibility values of its amino acids. We used the flexibility scale by Smith et al. [50]; they derive their flexibility scale through an analysis of B-factors of 290 proteins selected from PDB Select 25 version August 1998. The flexibility index is here assigned to the central residue as the average over a sliding window of length 11. We did not observe a significant difference in the behaviour of figure figure77 changing the length of the sliding window.

Identification of the twilight zone in the hydrophobicity/charge plane

To characterize the extension of the twilight zone found by SSU we used the following procedure. First, we identified two discriminative lines to separate predicted folded and predicted unfolded proteins from those unclassified by SSU. To this end we considered two hydrophobicity/charge planes: one for predicted folded and unclassified proteins and the other one for predicted unfolded and unclassified proteins. For all pairs of points, in each plane we traced a line and evaluated its performance in separating the two groups of proteins; among all the discriminative lines which separate unclassified proteins with sensitivity above 0.80, we chose, in each plane, the one with lowest fraction of false positives. Of the discriminative lines found in the two planes, we chose, in turn, the one with lower fraction of false positives. Second, to delimit the twilight zone, we plotted a second line by translating the first one so to include, between the two lines, not less than 75% of unclassified proteins. In supplemental figure S2 [Additional file 3: Figure S2] we report, for different bins of protein length, the twilight zones determined by the procedure above described. The projections of the twilight zone, in each bin, are represented by the area of the plane delimited by solid lines.

Indicators of performance

We evaluated the performance of the predictors by using very common indicators [37,51]:

• Sensitivity: An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-198-i11.gif

• Specificity: An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-198-i12.gif

• False predictions: An external file that holds a picture, illustration, etc.
Object name is 1471-2105-11-198-i13.gif

• Area Under Curve (AUC) of a Receiver Operating Characteristic (ROC) curve

where TP stands for the true positives, TN the true negatives, FP the false positives and FN for the false negatives.

Sn and Sp express the fraction of correctly predicted proteins in a dataset; more precisely, Sn is the fraction of correctly identified natively unfolded proteins, whereas Sp is the percentage of correctly identified folded proteins. fp expresses the fraction of folded proteins that are classified by the index as natively unfolded. Receiver operating characteristic (ROC) curves are widely used to evaluate the performance of binary folding indexes and are obtained by plotting the fraction of true positive predictions versus the fraction of false negative predictions for each threshold value of the probability to be disordered. The performance of the predictor is evaluated through the area under the curve (AUC), generally computed by the trapezoid rule; the AUC is independent from discriminative thresholds. The AUC values were computed using ROCR [52].

Authors' contributions

Both authors conceived the study and critically discussed each stage of the work. AD collected the sequences in the dataset, conducted the numerical work and produced the first draft of the paper. AG did subsequently substantial editorial work. Both authors critically read and discussed the final manuscript.

Supplementary Material

Additional file 1:

Supplemental table ST1. This file contains supplemental Table ST1, with caption.

Additional file 2:

Supplemental figure S1. This file contains supplemental figure S1, with caption.

Additional file 3:

Supplemental figure S2. This file contains supplemental figure S2, with caption.

Additional file 4:

PDB entries of folded proteins in set C. This file collects the PDB entries of the 2369 proteins in set C.

Additional file 5:

DISPROT entries of unfolded proteins in set C. This file collects the DISPROT entries of the 81 natively unfolded proteins in set C.

Acknowledgements

The authors gratefully thank Dr. K. Shimizu for her kindness in sending the list of proteins used to train and evaluate Poodle-W and for her collaboration in classifying for us, with Poodle-W, proteins in dataset C. The authors also thank Dr. Z. Dosztanyi for sending a draft of paper [21], and for providing the IUPred code. The critical remarks of three anonymous referees were of great help in focussing the scope of the paper.

References

  • Wright P, Dyson HJ. Intrinsically unstructured proteins: re-assessigning the protein structure-function paradigm. J Mol Biol. 1999;293:321–331. doi: 10.1006/jmbi.1999.3110. [PubMed] [Cross Ref]
  • Dyson HJ, Wright P. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6:197–208. doi: 10.1038/nrm1589. [PubMed] [Cross Ref]
  • Dunker A, Lawson J, Brown C, Romero P, Oh J, Oldfield C, Campen A, Ratliffl C, Hipps K, Ausio J, Nissen M, Reeves R, Kang C, Kissinger C, Bailey R, Griswold M, Chin W, Garner E, Obradovic Z. Intrinsically disordered proteins. J Mol Graph Model. 2001;19:26–59. doi: 10.1016/S1093-3263(00)00138-8. [PubMed] [Cross Ref]
  • Demchenko AP. Recognition between flexible protein molecules: induced and assisted folding. J Mol Recognit. 2001;14:42–61. doi: 10.1002/1099-1352(200101/02)14:1<42::AID-JMR518>3.0.CO;2-8. [PubMed] [Cross Ref]
  • Uversky VN. Natively unfolded proteins: a point where biology waits for physics. Protein Sci. 2002;11:739–756. doi: 10.1110/ps.4210102. [PubMed] [Cross Ref]
  • Tompa P. Intrinsically unstructured proteins. TRENDS Biochem Sci. 2002;27:527–533. doi: 10.1016/S0968-0004(02)02169-2. [PubMed] [Cross Ref]
  • Fink AL. Natively unfolded proteins. Curr Opin Struct Biol. 2005;15:35–41. doi: 10.1016/j.sbi.2005.01.002. [PubMed] [Cross Ref]
  • Uversky VN, Oldfield CJ, Dunker AK. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signalling. J Mol Recognit. 2005;18:343–384. doi: 10.1002/jmr.747. [PubMed] [Cross Ref]
  • Iakoucheva LM, Brown CJ, Lawson JD, Obradovic Z, Dunker AK. Intrinsic disorder in cell-signalling and cancer-associated proteins. J Mol Biol. 2002;323:573–584. doi: 10.1016/S0022-2836(02)00969-5. [PubMed] [Cross Ref]
  • Uversky VN, Oldfield CJ, Dunker AK. Intrinsically disordered proteins in human diseases: introducing the D2 concepts. Annu Rev Biophys. 2008;37:215–246. doi: 10.1146/annurev.biophys.37.032807.125924. [PubMed] [Cross Ref]
  • Uversky VN. Protein folding revisited. A polypeptide chain at the folding-misfolding-nonfolding cross-roads: which way to go? Cell and Mol Life Sciences. 2003;60:1852–1871. doi: 10.1007/s00018-003-3096-6. [PubMed] [Cross Ref]
  • Uversky VN, Fink AL. Conformational constraints for amyloid fibrillation: the importance of being unfolded. Biochimica and Biophysica Acta. 2004;1698:131–153. [PubMed]
  • Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN. The roles of intrinsic disorder in protein interaction networks. FEBS Journal. 2005;272:5129–5148. doi: 10.1111/j.1742-4658.2005.04948.x. [PubMed] [Cross Ref]
  • Haynes C, Oldfield C, Ji F, Klitgord N, Cusick M, Radivojac P, Uversky VN, Vidal M, Iakoucheva L. Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. Plos Computational Biology. 2006;2:890–901. doi: 10.1371/journal.pcbi.0020100. [PubMed] [Cross Ref]
  • Dosztanyi Z, Chen J, Dunker AK, Simon I, Tompa P. Disorder and sequence repeats in hub proteins and their implication for network evolution. Journal of proteome research. 2006;5:2985–2995. doi: 10.1021/pr060171o. [PubMed] [Cross Ref]
  • Kim PM, Sboner A, Xia Y, Gerstein M. The role of disorder in interaction networks: a structural analysis. Molecular System Biology. 2008;4:179–186. [PMC free article] [PubMed]
  • Vucetic S, Obradovic Z, Vacic V, Radivojac P, Peng K, Iakoucheva L, Cortese M, Lawson J, Brown C, Sikes J, Newton C, Dunker AK. DisProt: a database of protein disorder. Bioinformatics. 2005;21:137–140. doi: 10.1093/bioinformatics/bth476. [PubMed] [Cross Ref]
  • Sickmeier M, Hamilton J, LeGall T, Vacic V, Cortese M, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, Dunker AK. DisProt: the database of disordered proteins. Nucl Acid Res. 2007;35:D786–D793. doi: 10.1093/nar/gkl893. [PMC free article] [PubMed] [Cross Ref]
  • Vucetic S, Brown CJ, Dunker AK, Obradovic Z. Flavors of protein disorder. Proteins. 2003;52:573–584. doi: 10.1002/prot.10437. [PubMed] [Cross Ref]
  • Ferron F, Longhi S, Canard B, Karlin D. A practical overview of protein disorder prediction methods. Proteins. 2006;65:1–14. doi: 10.1002/prot.21075. [PubMed] [Cross Ref]
  • Dosztanyi Z, Sandor M, Tompa P, Simon I. Prediction of protein disorder at the domain level. Curr Prot and Pept Science. 2007;8:161–170. doi: 10.2174/138920307780363406. [PubMed] [Cross Ref]
  • Daughdrill G, Pielak G, Uversky VN, Cortese M, Dunker AK. In: Protein folding handbook. Buchner J, Kiefhaber T, editor. Weinheim, Wiley-VCH; 2005. Natively unfolded proteins; pp. 275–337. full_text.
  • Rose G. Ed. Unfolded proteins. Advances in protein chemistry. 2002;62:1–398. full_text. [PubMed]
  • Szilagyi A, Gyorffy D, Zavodszky P. The twilight zone between protein order and disorder. Biophys J. 2008;95:1612–1626. doi: 10.1529/biophysj.108.131151. [PubMed] [Cross Ref]
  • Zhang Y, Stec B, Godzik A. Between order and disorder in protein structures-analysis of "dual personality" fragments in proteins. Structure. 2007;15:1141–1147. doi: 10.1016/j.str.2007.07.012. [PMC free article] [PubMed] [Cross Ref]
  • Uversky VN, Gillespie JR, Fink AL. Why are "natively unfolded" proteins unstructured under physiological conditions? Proteins. 2000;41:415–427. doi: 10.1002/1097-0134(20001115)41:3<415::AID-PROT130>3.0.CO;2-7. [PubMed] [Cross Ref]
  • Prilusky J, Felder C, Zeev-Ben-Mordehai T, Rydberg E, Man O, Beckmann J, Silman I, Sussman J. FoldIndex: a simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005;21:3435–3438. doi: 10.1093/bioinformatics/bti537. [PubMed] [Cross Ref]
  • Garbuzynskiy SO, Lobanov M, Yu, Galzitskaya OV. To be folded or to be unfolded? Protein Sci. 2004;13:2871–2877. doi: 10.1110/ps.04881304. [PubMed] [Cross Ref]
  • Galzitskaya OV, Garbuzynskyi SO, Lobanov MY. Prediction of natively unfolded regions in protein chain. Molecular Biology. 2006;40:298–304. doi: 10.1134/S0026893306020166. [PubMed] [Cross Ref]
  • Galzitskaya OV, Garbuzynskyi SO, Lobanov MY. Prediction of amyloidogenic and disordered regions in protein chains. PLoS Comput Biol. 2006;2:e177. doi: 10.1371/journal.pcbi.0020177. [PubMed] [Cross Ref]
  • Dosztanyi Z, Csimok V, Tompa P, Simon I. The pairwise energy content estimated from amino acid composition discriminate between folded and intrinsically unstructured proteins. J Mol Biol. 2005;347:627–639. doi: 10.1016/j.jmb.2005.01.071. [PubMed] [Cross Ref]
  • Meszaros B, Tompa P, Simon I, Dosztanyi Z. Molecular principles of the interactions of disordered proteins. J Mol Biol. 2007;372:549–561. doi: 10.1016/j.jmb.2007.07.004. [PubMed] [Cross Ref]
  • Shimizu K, Muraoka Y, Hirose S, Tomii K, Noguchi T. Predicting mostly disordered proteins by using structure-unknown protein data. BMC Bioinformatics. 2007;8:78–92. doi: 10.1186/1471-2105-8-78. [PMC free article] [PubMed] [Cross Ref]
  • Joachims T. Transductive learning via Spectral Graph Transducer. Proceeding of International Conference on Machine Learning. 2003. pp. 143–151.
  • Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;7:176–182. doi: 10.1002/prot.20735. [PubMed] [Cross Ref]
  • Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006;7:208–225. doi: 10.1186/1471-2105-7-208. [PMC free article] [PubMed] [Cross Ref]
  • Bordoli L, Kiefer F, Schwede T. Assessment of disorder predictions in CASP7. Proteins. 2007;69:129–136. doi: 10.1002/prot.21671. [PubMed] [Cross Ref]
  • Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DJ. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004;337:635–645. doi: 10.1016/j.jmb.2004.02.002. [PubMed] [Cross Ref]
  • Oldfield C, Cheng Y, Cortese M, Brown C, Uversky VN, Dunker AK. Comparing and combining predictors of mostly disordered proteins. Biochemistry. 2005;44:1989–2000. doi: 10.1021/bi047993o. [PubMed] [Cross Ref]
  • Bogatyreva NS, Finkelstein AV, Galzitskaya OV. Trend of amino acid composition of proteins of different taxa. J Bioinform Comput Biol. 2006;4:597–608. doi: 10.1142/S0219720006002016. [PubMed] [Cross Ref]
  • Deiana A, Giansanti A. Number of natively unfolded proteins scales with genome size. Biophysics and Bioengineering Letters. 2008. p. 1.http://padis2.uniroma1.it:81/ojs/index.php/CISB-BBL/article/view/2842/2918
  • Kumar S, Carugo O. Consensus prediction of protein conformational disorder from amino acidic sequence. The Open Biochemistry Journal. 2008;2:1–5. doi: 10.2174/1874091X00802010001. [PMC free article] [PubMed] [Cross Ref]
  • Lieutaud P, Canard B, Longhi S. MeDor: a metaserver for predicting protein disorder. BMC Genomics. 2008;9:S25. doi: 10.1186/1471-2164-9-S2-S25. [PMC free article] [PubMed] [Cross Ref]
  • Romero P, Obradovic Z, Xiaohong L, Gamer EC, Brown CJ, Dunker AK. Sequence complexity of disorder proteins. Proteins. 2001;42:38–48. doi: 10.1002/1097-0134(20010101)42:1<38::AID-PROT50>3.0.CO;2-3. [PubMed] [Cross Ref]
  • Higgs PG, Attwood TK. Bioinformatics and molecular evolution. Blackwell Publishing; 2006.
  • Hobohm U, Scharf M, Schneider R, Sander C. Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Sci. 1992;1:409–417. [PubMed]
  • Hobohm U, Sander C. Enlarged representative set of protein structure. Protein Sci. 1994;3:522–524. [PubMed]
  • Dosztanyi Z, Csizmok V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21:3433–3434. doi: 10.1093/bioinformatics/bti541. [PubMed] [Cross Ref]
  • Hoel PG. Introduction to mathematical statistics. John Wiley and sons; 1971.
  • Smith DK, Radivojac P, Obradovic Z, Dunker AK, Zhu G. Improved amino acid flexibility parameters. Protein Sci. 2003;12:1060–1072. doi: 10.1110/ps.0236203. [PubMed] [Cross Ref]
  • Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16:412–424. doi: 10.1093/bioinformatics/16.5.412. [PubMed] [Cross Ref]
  • Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21:3940–3941. doi: 10.1093/bioinformatics/bti623. [PubMed] [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central