Definition of disorder
Protein disorder can be defined in many ways, depending on the research focus and the experimental method used. As a baseline, we used the definition adopted in the Critical Assessment of protein Structure Prediction (CASP) experiments: disordered residues are those marked by the REMARK465 tag in experimentally determined protein structures deposited in the Protein Data Bank (PDB) [20], which indicates regions with missing coordinates in crystal structures determined by X-ray crystallography or residues with highly variable coordinates in ensembles of Nuclear Magnetic Resonance (NMR) structures. This definition was extended to include proteins deposited in the DisProt database [11], in which disorder has been validated by a variety of experimental methods such as circular dichroism (CD) spectroscopy, mass spectrometry, immunochemistry, SDS-PAGE, and small-angle X-ray scattering (SAXS); the database currently holds over 1300 regions. The advantage of DisProt is that it includes proteins without a known three-dimensional structure, especially entirely disordered proteins, whose structure typically cannot be determined by high-resolution methods (X-ray crystallography and NMR). Thus, we treat all disorder types as a single class.
Primary methods used in the meta-method
The MetaDisorder series of predictors combines, via a machine-learning approach, the predictions of 13 primary disorder predictors that performed well in CASP and are freely available as standalone applications or as stable web servers able to process large numbers of queries: DisEMBL [21], DISOPRED2 [22], DISpro [23], Globplot [24], iPDA [25], IUPred [26], Pdisorder [27], Poodle-s [28], Poodle-l [29], PrDOS [30], Spritz [31], DisPSSMP [32], and RONN [33]. Additionally, the meta-predictors designed for CASP9 used six subjectively selected methods for protein fold recognition: HHSEARCH run over the PDB70 and CDD databases [34], FFAS [35], mGenThreader [36], PSI-BLAST run in two different modes (with and without masking of regions with low sequence complexity) over the culled PDB database [37], PHYRE [38], and PCONS [39] (a consensus method that takes as input models generated by MODELLER [40] from alignments produced by the previously mentioned fold-recognition methods). For a short description of each method, see the tables describing the disorder predictors and the fold-recognition methods. Additionally, two methods for secondary structure prediction, JNET [41] and PSIPRED [42], and one solvent accessibility predictor, JNET [41], were used.
Description of disorder predictors analyzed in this work
Description of fold recognition methods used by MetaDisorder
Training and testing datasets
To train the meta-predictors, two independent datasets were used. The first dataset was built from the combined DisProt database (version 3.6) and CASP7 targets. Sequences longer than 1000 residues were omitted because they exceed the length limit of some of the primary methods and could not be processed automatically without arbitrary manipulations. Overall, this procedure yielded 566 proteins comprising 232,664 residues, of which 23.45% were disordered. The second dataset, called pdbRemark465, was based on structures in the PDB. Representative structures were extracted using the PISCES server [43
] and filtered according to the following criteria: experimental technique: X-ray crystallography, resolution
0.2, length 50–1000 aa residues, and mutual sequence similarity
20%. The resulting dataset contained 1147 proteins (289,008 residues, of which 6.28% were disordered according to the REMARK465 tag in the PDB files, see Additional file 1
). In the final version of the meta-predictor, we combined these two datasets and used them to assess the disorder prediction accuracy by standard 10-fold cross validation. All amino acid residues were randomly assigned to 10 bins of nearly equal size. Nine bins were used as the source of training data and the remaining bin as the source of testing data. This procedure was repeated 10 times, with each of the 10 bins used exactly once for validation. The results of the 10 analyses were then averaged to produce the final scores.
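The residue-level 10-fold cross validation described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the fixed random seed, and the `train_and_score` callback (which stands for training on one index set and scoring on the other) are assumptions introduced here.

```python
import random


def ten_fold_bins(n_residues, n_bins=10, seed=0):
    """Randomly assign residue indices to n_bins bins of nearly equal size."""
    indices = list(range(n_residues))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives bin sizes that differ by at most 1.
    return [indices[i::n_bins] for i in range(n_bins)]


def cross_validation_splits(n_residues, n_bins=10, seed=0):
    """Yield (train, test) index lists; each bin is the test set exactly once."""
    bins = ten_fold_bins(n_residues, n_bins, seed)
    for k in range(n_bins):
        test = bins[k]
        train = [i for j, b in enumerate(bins) if j != k for i in b]
        yield train, test


def cross_validated_score(n_residues, train_and_score, n_bins=10, seed=0):
    """Average a user-supplied train_and_score(train, test) over all folds."""
    scores = [train_and_score(train, test)
              for train, test in cross_validation_splits(n_residues, n_bins, seed)]
    return sum(scores) / len(scores)
```

Because bins are drawn over residues rather than over proteins, residues of the same protein can fall into both the training and the testing bins, which is how the text describes the procedure.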
To assess the predictive power of our methods as objectively as possible and to compare them fairly with other methods, we tested all predictors described in this article in the truly blind tests of CASP8 and CASP9. As mentioned earlier, disorder prediction in CASP is defined as the ability to identify regions with missing coordinates in crystal structures determined by X-ray crystallography or residues with highly variable coordinates in ensembles of NMR structures.
For training the GSmetaDisorder3D and GSmetaDisorderMD predictors, we used proteins from CASP8 (122 proteins, 27,614 residues, of which 11.11% were disordered; 19 of these proteins were solved by NMR and comprise 2,515 residues, of which 47.95% were disordered). Again, 10-fold cross validation was used. Detailed statistics for each dataset are provided in the table summarizing the datasets employed in this study.
Summary of the datasets employed in this study
Measures used for training and evaluation
The results of predictions can be divided into four categories: true positives (TP) – residues correctly predicted as disordered, true negatives (TN) – residues correctly predicted as ordered, false positives (FP) – ordered residues misclassified as disordered, and false negatives (FN) – disordered residues misclassified as ordered.
The first assessment criterion was the receiver operating characteristic (ROC). The ROC curve plots the sensitivity against the false positive rate of a classifier as its discrimination threshold is varied. The resulting area under the curve (AUC) summarizes the overall robustness of an algorithm: 1 corresponds to a perfect predictor (all true positives are found without any false positives) and 0.5 to a random one.
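As an illustration, the AUC can be computed without drawing the curve, using the standard equivalence between the area under the ROC curve and the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered one (ties counting one half). The sketch below assumes per-residue scores where higher means more disordered and labels where 1 means disordered; the function name is an assumption, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: per-residue disorder scores (higher means more disordered).
    labels: 1 for disordered residues, 0 for ordered residues.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one disordered and one ordered residue")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # disordered residue ranked above ordered one
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(positives) * len(negatives))
```

For datasets of the size used here, a rank-based implementation (sorting once and summing ranks) would be preferable to the pairwise loop.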
The second criterion is the weighted score, Sw, which rewards a correct disorder prediction more than a correct order prediction [44]. This avoids over-prediction of the ordered state, which would otherwise be favored because ordered regions are more common in known proteins. The Sw score is defined as:
$$S_w = \frac{W_{disorder} \cdot TP - W_{order} \cdot FP + W_{order} \cdot TN - W_{disorder} \cdot FN}{W_{disorder} \cdot N_{disorder} + W_{order} \cdot N_{order}}$$
where Wdisorder equals the fraction of ordered residues, Worder equals the fraction of disordered residues, and Ndisorder and Norder are the numbers of disordered and ordered residues, respectively. Sw ranges from −1 to 1, where 0 corresponds to a random prediction. Maximization of Sw was the main criterion of the optimization procedure, and it was also used to assess the relative value of individual primary disorder predictors to be incorporated into our meta-servers. The Sw score was directly used as the weight of the prediction returned by each such method.
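A minimal sketch of the Sw computation follows. It assumes the form of the weighted score used in CASP disorder assessments, Sw = (Wdisorder·TP − Worder·FP + Worder·TN − Wdisorder·FN) / (Wdisorder·Ndisorder + Worder·Norder), with the weights defined as in the text; the function names are introduced here for illustration only.

```python
def confusion_counts(predicted, observed):
    """Count TP, TN, FP, FN for per-residue labels (truthy means disordered)."""
    tp = tn = fp = fn = 0
    for p, o in zip(predicted, observed):
        if p and o:
            tp += 1
        elif not p and not o:
            tn += 1
        elif p and not o:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def sw_score(tp, tn, fp, fn):
    """Weighted score Sw (assumed CASP form, see lead-in)."""
    n_disorder = tp + fn
    n_order = tn + fp
    total = n_disorder + n_order
    w_disorder = n_order / total      # fraction of ordered residues
    w_order = n_disorder / total      # fraction of disordered residues
    denominator = w_disorder * n_disorder + w_order * n_order
    if denominator == 0:
        raise ValueError("Sw is undefined when only one class is present")
    numerator = (w_disorder * tp - w_order * fp
                 + w_order * tn - w_disorder * fn)
    return numerator / denominator
```

With these weights, predicting every residue as disordered (or every residue as ordered) gives a numerator of zero, which matches the statement that 0 corresponds to a random prediction.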
The third commonly used measure, which was not used while developing the consensus methods but was used for their evaluation, is the Matthews correlation coefficient (MCC) [45]. MCC was among the measures used during CASP to assess disorder predictors.
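For reference, the Matthews correlation coefficient has the standard form MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it from the four counts; returning 0 when the denominator vanishes is a common convention adopted here as an assumption, not something stated in the text.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumption): report 0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```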
Finally, we used our own measure, called Sww, which combines AUC and Sw score in the following way: it is calculated using the Sw formula, but the discrimination threshold is changed incrementally from 0 to 1, by steps of 0.01, giving sets of TP, TN, FP, FN values that are used to calculate a series of Sw scores. Sww is the average value of these scores. This score was used only in the GSmetaDisorderMD2 method during CASP9.
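The threshold sweep behind Sww can be sketched as below. Several details are assumptions: the Sw form is the one used in CASP disorder assessments, a residue is called disordered when its score is strictly above the threshold, both endpoints 0 and 1 are included (101 thresholds), and the function names are illustrative.

```python
def sw_from_counts(tp, tn, fp, fn):
    """Weighted score Sw (assumed CASP form) from confusion-matrix counts."""
    n_disorder, n_order = tp + fn, tn + fp
    total = n_disorder + n_order
    w_disorder, w_order = n_order / total, n_disorder / total
    denominator = w_disorder * n_disorder + w_order * n_order
    numerator = w_disorder * tp - w_order * fp + w_order * tn - w_disorder * fn
    return numerator / denominator


def sww_score(scores, observed, n_steps=100):
    """Average Sw over thresholds 0, 1/n_steps, ..., 1 (0.01 steps by default)."""
    values = []
    for k in range(n_steps + 1):
        threshold = k / n_steps
        tp = tn = fp = fn = 0
        for score, is_disordered in zip(scores, observed):
            predicted = score > threshold   # assumption: strict inequality
            if predicted and is_disordered:
                tp += 1
            elif predicted:
                fp += 1
            elif is_disordered:
                fn += 1
            else:
                tn += 1
        values.append(sw_from_counts(tp, tn, fp, fn))
    return sum(values) / len(values)
```

Unlike Sw at a single cutoff, this average rewards predictors whose scores separate the two classes over a wide range of thresholds, which is the sense in which it combines AUC and Sw.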
The statistical significance of the evaluation scores was determined by the bootstrap confidence interval method [19]: 80% of the targets were randomly selected 1000 times, and the mean absolute error of scores was calculated. The ROC statistics were compared by using the Wilcoxon signed rank test and by calculating standard errors of ROC statistics.
Binary consensus and continuous consensus versions of MetaDisorder predictors
In general, two categories of predictors exist. The simplest predictors are binary: they only classify the predicted feature into separate categories (here, disordered and ordered residues). More advanced methods return continuous scores, e.g. between 0 and 1, that indicate how certain the prediction is, and the final call is made according to an arbitrarily chosen threshold. The lower the threshold, the higher the number of both true and false positives. Accordingly, we initially constructed two versions of the MetaDisorder predictor, named BinCons and FloatCons. These two methods were tested within the framework of the CASP8 benchmark as groups 153 and 297, respectively [19]. BinCons uses only binary predictions from the primary methods: each disorder prediction for a residue is counted as 1 and each order prediction as 0.01 (0 was avoided to prevent possible division by zero). FloatCons uses all the information available: if a given method returns a continuous prediction, its score is used in the final consensus calculation. The consensus score for each residue is calculated by multiplying the score from each primary method by the accuracy of that method and summing the products. The result is normalized, i.e. divided by the maximal possible score. For simplicity, the measure of a method's accuracy used as its weight was Sw calculated for our combined datasets. This was possible because Sw does not depend on the predictor output type.
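The consensus calculation can be sketched as a weighted average. In this illustration the method names, the weights, and the threshold value are placeholders, the binary predictions are strings of "D" (disorder) and "-" (order), and the "maximal possible score" is taken to be the sum of the weights, which assumes that every per-residue score lies between 0 and 1.

```python
ORDER_VALUE = 0.01  # value assigned to an "ordered" call in BinCons (from the text)


def bincons_inputs(binary_prediction):
    """Convert a 'D'/'-' string into BinCons scores (1 for disorder, 0.01 for order)."""
    return [1.0 if symbol == "D" else ORDER_VALUE for symbol in binary_prediction]


def weighted_consensus(per_method_scores, weights):
    """Per-residue consensus: Sw-weighted sum of scores, divided by the maximum.

    per_method_scores: dict mapping method name -> list of per-residue scores.
    weights: dict mapping method name -> weight (the method's Sw in the text).
    """
    methods = list(per_method_scores)
    length = len(per_method_scores[methods[0]])
    max_score = sum(weights[m] for m in methods)  # assumes scores are at most 1
    return [
        sum(weights[m] * per_method_scores[m][i] for m in methods) / max_score
        for i in range(length)
    ]


def to_binary(consensus, threshold):
    """Call a residue disordered when its consensus score is above the threshold."""
    return "".join("D" if score > threshold else "-" for score in consensus)
```

FloatCons would pass the continuous scores of a method directly to `weighted_consensus` and fall back on `bincons_inputs` only for methods that return binary output.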
In the next step, a correcting function is applied. It takes into account the fact that residues located at the protein termini are on average more disordered than residues in the middle of the protein chain. This function is based on the statistics of disorder in the 15 residues closest to each terminus, calculated on both datasets, and provides a corrective factor by which the original predictive score is multiplied.
Finally, the decision whether a residue is ordered or disordered is made. If a residue scores above the threshold, it is predicted as disordered; otherwise it is predicted as ordered. The threshold for classifying the residue as ordered or disordered was based on Sw scores obtained during 10-fold cross validation tests.
At the end, a repair procedure is employed to improve the prediction. To a predicted string (e.g. “DDD‒‒‒D‒‒…”, with “D” indicating disorder and “‒” indicating order), a simple smoothing filter with a window of five residues is applied. It eliminates short (up to 3 residues) stretches of predicted disorder within long regions of predicted order (converting the previous example to “DDD‒‒‒‒‒‒…”).
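The effect of this filter on the example string can be reproduced with a short regular-expression sketch. It is an approximation under stated assumptions: a stretch of disorder is removed only when it is at most three residues long and flanked by predicted order on both sides (so terminal stretches are kept), the exact role of the five-residue window is not modelled, and the ASCII hyphen "-" stands for order.

```python
import re


def remove_short_disorder(prediction, max_len=3):
    """Turn runs of 'D' of length <= max_len flanked by '-' on both sides into order."""
    # Lookbehind/lookahead require an ordered residue on each side without
    # consuming it, so neighbouring short runs are handled independently.
    pattern = re.compile(r"(?<=-)D{1,%d}(?=-)" % max_len)
    return pattern.sub(lambda match: "-" * len(match.group()), prediction)
```

For the example in the text, `remove_short_disorder("DDD---D--")` returns `"DDD------"`: the isolated "D" is removed, while the N-terminal "DDD" is kept because it is not preceded by order.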
GSmetaDisorder3D – a template-matching method
Apart from disorder predictors, many other bioinformatics tools yield implicit or explicit information about order and disorder. In the course of other protein sequence analysis projects, we noticed a clear correlation between disorder in the target protein sequence and the presence of gaps in alignments to structurally characterized templates calculated by protein fold-recognition methods. Although implementing a method that uses this type of information may seem trivial, dealing with different types of fold-recognition methods was not straightforward: it was not obvious which method should be used or, if many methods were used, how to rank them. Additionally, a template-matching method should take into account that matches to homologous proteins differ in reliability and that in some cases homologous sequences cannot be found. To address these questions, we compared the results from arbitrarily chosen fold-recognition methods that were relatively fast and performed well in the framework of CASP: HHSEARCH, FFAS, mGenThreader, PSI-BLAST, PHYRE, and PCONS5 (see Methods for details and references). To optimize the weights assigned to individual methods depending on the alignment quality, we used a genetic algorithm implemented in Pyevolve [47]. The genome optimized by the genetic algorithm was a one-dimensional vector of length 24 (the 8 method variants mentioned above, counting the two HHSEARCH databases and the two PSI-BLAST modes separately, multiplied by 3 thresholds for well-, moderately-, and poorly-scored templates; see the table of thresholds used in fold recognition programs for details). In this way, weights for all methods were obtained for incorporation into a combined template-matching method. The resulting predictor was tested in CASP9 as group number 421 (GSmetaDisorder3D).
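One way to picture the 24-element weight vector is as a table of 8 method variants by 3 alignment-quality classes, flattened into a list. The sketch below is purely illustrative: the variant labels, the class names, and in particular the scoring rule (a weighted fraction of alignments that leave a target residue unaligned) are assumptions made here, since the text specifies only the layout of the vector, not how the weights are applied.

```python
# Hypothetical labels for the 8 method variants and 3 quality classes.
METHODS = ["hhsearch_pdb70", "hhsearch_cdd", "ffas", "mgenthreader",
           "psiblast_masked", "psiblast_unmasked", "phyre", "pcons"]
QUALITY_CLASSES = ["good", "medium", "poor"]


def weight_index(method, quality):
    """Position of a (method, quality class) weight in the flat 24-element vector."""
    return METHODS.index(method) * len(QUALITY_CLASSES) + QUALITY_CLASSES.index(quality)


def template_disorder_scores(genome, alignments, length):
    """Weighted fraction of alignments in which each target residue falls in a gap.

    genome: list of 24 weights (one per method variant and quality class).
    alignments: list of (method, quality, gapped) tuples, where gapped is a list
        of booleans marking target residues not aligned to the template.
    """
    totals = [0.0] * length
    weight_sum = 0.0
    for method, quality, gapped in alignments:
        weight = genome[weight_index(method, quality)]
        weight_sum += weight
        for i, is_gap in enumerate(gapped):
            if is_gap:
                totals[i] += weight
    if weight_sum == 0:
        return [0.0] * length   # no usable template: no evidence either way
    return [t / weight_sum for t in totals]
```

A genetic algorithm would then search over `genome` for the weights that maximize the chosen score (Sw in the text) on the training set.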
Thresholds used in fold recognition programs for classification of potentially good, medium and poor alignments
GSmetaDisorderMD and GSmetaDisorderMD2 – combined disorder consensus and template-matching method
The next method in the MetaDisorder series, GSmetaDisorderMD, was developed by combining FloatCons (the consensus method with continuous scoring) with GSmetaDisorder3D (the method based on analysis of gaps in fold-recognition alignments). The same genetic algorithm was used as in the training of GSmetaDisorder3D, but a second dimension was added to the vector to optimize the relationship between these two components. This method was tested in CASP9 as group number 374.
GSmetaDisorderMD2 is a variant of GSmetaDisorderMD in which the genetic algorithm used for training optimized the Sww score instead of the Sw score. This predictor was tested in CASP9 as group number 147.
Implementation and availability
MetaDisorder is a web interface to our series of disorder meta-predictors and can be accessed at http://iimcb.genesilico.pl/metadisorder/
] is used. Additionally, the results of the analyses can also be obtained as simple text output (for details see Figure 1).
Figure 1 MetaDisorder web-server interface. a) User-friendly web interface: the main plot can be easily zoomed in and out, and the results reported by all primary methods can be downloaded in the CASP format. b) Simple text output format suitable for machine ...