EVA (http://cubic.bioc.columbia.edu/eva/) is a web server for evaluation of the accuracy of automated protein structure prediction methods. The evaluation is updated automatically each week, to cope with the large number of existing prediction servers and the constant changes in the prediction methods. EVA currently assesses servers for secondary structure prediction, contact prediction, comparative protein structure modelling and threading/fold recognition. Every day, sequences of newly available protein structures in the Protein Data Bank (PDB) are sent to the servers and their predictions are collected. The predictions are then compared to the experimental structures once a week; the results are published on the EVA web pages. Over time, EVA has accumulated prediction results for a large number of proteins, ranging from hundreds to thousands, depending on the prediction method. This large sample assures that methods are compared reliably. As a result, EVA provides useful information to developers as well as users of prediction methods.
The goal of EVA is to evaluate the sustained performance of protein structure prediction servers through a battery of objective measures for prediction accuracy. While the biennial CASP (Critical Assessment of Techniques for Protein Structure Prediction) meetings address the question ‘how well can experts predict protein structures with the help of machines?’, the question addressed by EVA is ‘how well can automatic servers predict protein structures?’. Conceptually, this is similar to CAFASP (Critical Assessment of Fully Automated Structure Prediction), but there is a major difference: EVA provides a continuous, fully automatic and statistically more significant analysis of structure prediction servers, whereas CAFASP covers only a limited number of proteins determined in a period of about 4 months every 2 years. Fewer than 10 proteins were available for the non-homology category at CAFASP3 in 2002. This implies that it is, at best, extremely difficult to infer statistically significant differences from the CAFASP/CASP data sets. For example, the assessor for secondary structure prediction in 2002 concluded that there was no improvement in secondary structure predictions with respect to the CAFASP/CASP in 2000, although the numerical values differed by over six percentage points.
EVA helps developers of structure prediction methods to improve their approaches, and helps users of prediction servers to apply methods judiciously. The ranking of each prediction method is analysed and updated on the web every week. Ranking is a non-trivial task because of the non-uniformity in data sets and in the measures for accuracy. Another complication is that methods are compared most reliably when they are tested under identical conditions, i.e. with identical sets of proteins (1–3). Here, we sketch the EVA mechanisms that enable such large-scale assessment of prediction servers automatically and continuously.
The analysis of prediction methods involves the following steps (Fig. 1): (i) select a set of suitable test sequences; (ii) apply prediction methods to those sequences; (iii) assess prediction methods by measuring prediction quality using certain scoring functions; (iv) determine criteria for statistically significant differences, and rank the methods accordingly; (v) merge results of the current week with those accumulated in the past, publish results on the web, and communicate with and gather results from the EVA satellites [at Centro Nacional de Biotecnologia (CNB) in Madrid, Spain and at University of California, San Francisco (UCSF)].
Every day, EVA downloads the sequences for the newest experimentally determined protein structures from the Protein Data Bank (PDB) (4) web site. Sequences are dissected into protein chains that constitute the basic units for EVA. Very short sequences (<30 residues) and proteins containing a significant number of unresolved residues are excluded. The remaining sequences are sent by META-PredictProtein (5,6) (META-PP) to prediction servers that consented to the evaluation by EVA. Threading/fold recognition servers constitute an exception to this ‘send-all’ rule: in order to reduce the load on these servers, we submit only sequences without clearly homologous structures (i.e. novel proteins) (7,8). These novel sequences have no hits in the previous version of the PDB below a PSI-BLAST (9) E-value of 10⁻³ and/or an HSSP-distance <0 (8). Over the last 3 years, this filtering step reduced the number of chains to about 8%; threading servers therefore have to handle <10 submissions from EVA per week. While secondary structure prediction methods handle all proteins, currently EVA publishes results only for the subset of the novel proteins every week. For contact predictions, proteins with homologous structures are considered separately from proteins without structurally defined homologues. Obviously, most results analysed in the comparative modelling category (EVA-CM) are based on proteins that are not novel. However, EVA-CM currently does not apply any particular threshold: all models are evaluated.
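The selection rules described above can be illustrated with a minimal sketch. It is not the EVA implementation: the `Chain` record, the function names and the 20% cut-off for unresolved residues are hypothetical (the text only speaks of a ‘significant number’), and the HSSP-distance criterion is left out because its exact combination with the E-value is not spelled out here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

MIN_LENGTH = 30                 # from the text: chains with <30 residues are excluded
MAX_UNRESOLVED_FRACTION = 0.2   # ASSUMPTION: the text only says "a significant number"
EVALUE_CUTOFF = 1e-3            # from the text: PSI-BLAST E-value threshold


@dataclass
class Chain:
    """Hypothetical record for one PDB chain (not an EVA data structure)."""
    chain_id: str
    length: int                    # number of residues in the chain
    unresolved: int                # residues without resolved coordinates
    best_evalue: Optional[float]   # lowest PSI-BLAST E-value against the previous PDB; None = no hit


def is_evaluable(chain: Chain) -> bool:
    """Drop very short chains and chains with many unresolved residues."""
    if chain.length < MIN_LENGTH:
        return False
    return chain.unresolved / chain.length <= MAX_UNRESOLVED_FRACTION


def is_novel(chain: Chain) -> bool:
    """A chain is treated as novel when no hit falls below the E-value cut-off."""
    return chain.best_evalue is None or chain.best_evalue >= EVALUE_CUTOFF


def route(chains: Iterable[Chain]) -> Dict[str, List[str]]:
    """Send every evaluable chain to all servers, but only novel ones to threading servers."""
    evaluable = [c for c in chains if is_evaluable(c)]
    return {
        "all_servers": [c.chain_id for c in evaluable],
        "threading_servers": [c.chain_id for c in evaluable if is_novel(c)],
    }
```

Calling `route` on a day's list of chains returns the identifiers destined for each group of servers; the chain identifiers themselves are arbitrary labels.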
Once a day, META-PP (5,6) submits sequences to prediction servers and collects the results. Once a week, these results are sent to EVA satellites for evaluation, namely to Columbia University for secondary structure prediction and fold recognition/threading, to UCSF for comparative modelling and to CNB for inter-residue distances/contacts.
Prediction quality is evaluated using a battery of scoring functions sketched below for all four categories.
Ranking is most reliable when prediction methods are tested under identical circumstances. The best way to rank two methods is to assess their performance based on the identical test sets. Two ranking methods are currently available in EVA. The first one is based on sub-sets of all proteins that are common to all methods. The limitations of this approach are that: (i) not all methods exist at the same time; and (ii) not all sequences are predicted by all methods at any given time due to server downtime and errors. In practice, these two effects reduce the size of the common sub-sets dramatically. The second ranking approach relies on pairwise method comparisons that depend on the sub-set of proteins common to the two compared methods (3). This pairwise ranking approach determines for each pair of participating servers whether or not it is possible to discriminate their accuracies, given the size of the test set and the particular accuracy measure used. The downside of this approach is that the overall ranking list obtained by averaging the pairwise results may be ‘frustrated’ due to the different testing sets for the different pairs of methods.
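The two ranking schemes can be sketched as follows. This is an illustration under stated assumptions, not the EVA procedure: per-protein scores are assumed to be stored as a dictionary per method, and the decision whether two methods can be discriminated uses a simple paired standard-error criterion as a stand-in for the statistical test of reference (3), which is not detailed here. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def common_subset(scores):
    """Proteins for which every method delivered a prediction."""
    sets = [set(per_protein) for per_protein in scores.values()]
    return set.intersection(*sets) if sets else set()


def rank_on_common_subset(scores):
    """Rank methods by their mean score on the proteins shared by all methods."""
    shared = common_subset(scores)
    if not shared:
        return []
    averages = {m: mean(s[p] for p in shared) for m, s in scores.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def pairwise_comparisons(scores, z=1.96):
    """Compare each pair of methods on the proteins common to that pair.

    Returns {(a, b): (n_shared, mean_difference, winner)}; winner is None when
    the two methods cannot be discriminated on the shared proteins.
    ASSUMPTION: |mean difference| > z * standard error is only a stand-in test.
    """
    results = {}
    for a, b in combinations(sorted(scores), 2):
        shared = sorted(set(scores[a]) & set(scores[b]))
        if len(shared) < 2:
            results[(a, b)] = (len(shared), None, None)
            continue
        diffs = [scores[a][p] - scores[b][p] for p in shared]
        d = mean(diffs)
        se = stdev(diffs) / sqrt(len(diffs))
        winner = None
        if abs(d) > z * se:
            winner = a if d > 0 else b
        results[(a, b)] = (len(shared), d, winner)
    return results
```

The sketch makes the trade-off visible: the common subset shrinks with every method added, whereas each pairwise comparison uses its own, usually larger, set of shared proteins, so the pairwise outcomes need not combine into a single consistent ordering.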
The central EVA site at Columbia University collects either the assessments or the html pages with assessments from the satellites every week and presents them on the web. The central EVA site is mirrored at all EVA satellites (Fig. 1).
EVA currently addresses the following protein structure prediction categories (Table 1): comparative modelling (EVA-CM); inter-residue contact prediction (EVA-con); secondary structure prediction (EVA-sec); and threading (EVA-FR). In the following, we sketch the measures for accuracy employed for each category. Note that the detailed definitions of the scores are available through the EVA web sites.
EVA-CM implements a small number of criteria, arranged hierarchically from coarser to finer, that measure the accuracy of a comparative model. The assessed aspects of a model include fold type, alignment, whole structure, core structure, loops and side-chains. Final ranking is reported using the ‘pairwise’ comparison of prediction servers (3). From May 2000 to January 2003, predictions were collected from five different servers, resulting in 20957 submitted models for 9050 different PDB chains. On average, 2.3 models were predicted per chain.
EVA-con evaluates inter-residue contact/distance predictions. A number of servers predict contacts directly, using neural networks of different kinds trained on contact maps (10,11). There are also predictions of contacts based on assembled structures (12). The current evaluation criteria implemented in EVA-con include: (i) accuracy: the number of correctly predicted contacts divided by the total number of predicted contacts (13); (ii) improvement over random: the calculated accuracy divided by the random accuracy (13); (iii) distance distribution of the predicted contacts: the weighted harmonic average difference between the predicted contact distance distribution and the all-pairs distance distribution (14); and (iv) delta evaluation: the percentage of correctly predicted contacts that are within a certain number (delta) of residues of the experimental contact, measured along the sequence (15). EVA-con may also be used to evaluate ab initio, fold recognition and comparative modelling servers by transforming models into intra-molecular contacts between the corresponding C-beta atoms (C-alpha for Gly) with an 8 Å cut-off.
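Criteria (i), (ii) and (iv) can be illustrated with a minimal sketch. It is not the EVA-con code; the detailed definitions are those of references (13) and (15). Two points are assumptions made only for illustration: the random accuracy is taken as the fraction of all residue pairs that are in contact, and a predicted contact (i, j) counts as correct within delta when some experimental contact (k, l) satisfies |i − k| ≤ delta and |j − l| ≤ delta. Contacts are represented as index pairs (i, j) with i < j.

```python
from itertools import combinations


def contact_accuracy(predicted, observed):
    """Correctly predicted contacts divided by all predicted contacts."""
    predicted, observed = set(predicted), set(observed)
    if not predicted:
        return 0.0
    return len(predicted & observed) / len(predicted)


def random_accuracy(observed, n_residues, min_separation=1):
    """ASSUMPTION: chance level = fraction of eligible residue pairs in contact."""
    eligible = {(i, j) for i, j in combinations(range(n_residues), 2)
                if j - i >= min_separation}
    if not eligible:
        return 0.0
    return len(set(observed) & eligible) / len(eligible)


def improvement_over_random(predicted, observed, n_residues):
    """Accuracy divided by the random accuracy."""
    baseline = random_accuracy(observed, n_residues)
    if baseline == 0.0:
        raise ValueError("no observed contacts, random accuracy is zero")
    return contact_accuracy(predicted, observed) / baseline


def delta_accuracy(predicted, observed, delta):
    """Fraction of predicted contacts lying within delta residues of an observed contact."""
    predicted, observed = set(predicted), set(observed)
    if not predicted:
        return 0.0
    near = sum(
        any(abs(i - k) <= delta and abs(j - l) <= delta for k, l in observed)
        for i, j in predicted
    )
    return near / len(predicted)
```

With `delta=0` the delta evaluation reduces to the plain accuracy, which gives a quick consistency check on any implementation.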
EVA-sec evaluates protein secondary structure predictions. Secondary structures are assigned from 3D structures through DSSP (16) and STRIDE (17). EVA-sec measures accuracy by: (i) per-residue accuracy (Q3) (18): the percentage of residues correctly predicted in one of the three states (helix, strand or other); (ii) per-segment accuracy (SOV) (18,19): the average overlap between segments (methods that get most of the segment cores right are generally more useful than those that get some of the entire segments right); and (iii) accuracy of predicting structural class: the percentage of proteins correctly assigned to one of the classes all-alpha, all-beta, alpha/beta and others (20,21). Rankings are presented using both the ‘common subset’ and ‘pairwise’ comparison approaches.
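The per-residue score Q3 is simple enough to illustrate directly. The sketch below is not the EVA-sec code. The reduction of the eight DSSP states to three states (H, G, I to helix; E, B to strand; everything else to other, written `L`) is a commonly used convention and is an assumption here, since the text does not state which mapping EVA applies.

```python
HELIX_STATES = "HGI"    # ASSUMPTION: DSSP states counted as helix
STRAND_STATES = "EB"    # ASSUMPTION: DSSP states counted as strand


def reduce_dssp(dssp_string):
    """Map a DSSP assignment string to three states: H (helix), E (strand), L (other)."""
    reduced = []
    for state in dssp_string:
        if state in HELIX_STATES:
            reduced.append("H")
        elif state in STRAND_STATES:
            reduced.append("E")
        else:
            reduced.append("L")
    return "".join(reduced)


def q3(predicted, observed):
    """Percentage of residues whose predicted three-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings differ in length")
    if not observed:
        raise ValueError("empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)
```

For example, a ten-residue chain with one mislabelled residue gives `q3("HHHEEEEELL", "HHHHEEEELL") == 90.0`. The segment-based SOV score is considerably more involved and is not sketched here.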
EVA-FR currently evaluates models only for novel sequences (i.e. proteins for which PSI-BLAST searches do not reveal similarity to a known structure). Since there is no single measure that can comprehensively assess the quality of threading models, EVA-FR employs an array of alignment-dependent and alignment-independent measures (22–24). For most of the measures used, two aspects of server performance are considered: (i) the ability to produce good models for each target (rank analysis); and (ii) the ability to assign reliable scores to its models, measured through Receiver Operating Characteristic (ROC) curves (note that this aspect is often referred to as ‘fold recognition’). Methods are ranked through both the ‘common subset’ and ‘pairwise’ comparison approaches.
Every week, test sequences are automatically submitted to prediction servers and results are evaluated and posted on the EVA web sites. The test sets are constructed so that methods could not have been trained based on the sequences in the test sets. Moreover, the test sets are as large as possible. In addition, the reliability of the comparisons between methods is maximised by using only test sets common to the methods assessed.
Since 1994, the development of structure prediction methods has been influenced by the CASP meetings. While EVA uses well-defined numerical criteria to evaluate sustained performance, expert evaluations are still needed to learn what measures are most useful. However, human assessors are not likely to be able to handle many more test sequences than those at CASP. At the same time, there are problems with ranking methods based on test sets that are too small (1–3). EVA rankings are statistically more significant than those at CASP, because EVA assesses prediction methods continuously on as many proteins every month as CASP in 2 years (1). We believe that CASP needs to be supplemented by a large-scale, automated and continuous assessment, such as that by LiveBench (25) (assessment for threading methods only) and EVA. In fact, EVA may replace certain CASP categories in the future. For example, it was proposed at the last 2002 CASP meeting to eliminate secondary structure predictions from CASP. Instead, EVA-sec will replace CASP/CAFASP for users interested in those methods. This decision was partially influenced by the fact that the evaluation of secondary structure prediction methods has matured and this matured analysis has demonstrated beyond doubt that the set of proteins at CASP5 (2002) was not representative and too small.
The best secondary structure prediction methods have reached a sustained level of 76% accuracy for the last 2 years (2), which indicates a substantial improvement in secondary structure prediction over the last 4 years. While it is always difficult to choose an appropriate set of measures, EVA uses standard criteria that have been widely used by experts in the area. For secondary structure prediction, these criteria are well established. For all other categories, we are currently experimenting with new criteria; others will be incorporated into EVA upon request from users. The precise definitions of the criteria are available on the web. While we can make our original scripts available upon request, we currently do not have the resources to cast the whole EVA code into a form that guarantees portability or ease-of-use. Overall, EVA allows developers to focus on the development of better methods, rather than on the generally time-consuming evaluation.
In principle, the concepts implemented in EVA could and should be generalised to evaluating a larger variety of prediction methods. Often, the problem is the availability of new high-resolution data. We intend to explore extensions that cover the predictions of protein–protein interactions, membrane regions, signal peptides, cleavage sites, structural/functional motifs and sub-cellular localisation.
Thanks to Jinfeng Liu and Megan Restuccia (Columbia) for computer assistance. We are grateful to members of the Protein Design Group (PDG). The contribution of the PDG is supported in part by a grant from the Spanish Ministry of Science and Technology (PDG, CNB-CSIC). I.Y.Y.K. was supported by the grant 5-P20-LM7276 from the National Institutes of Health (NIH); D.P. was supported by the NIH grant RO1-GM63029-01; A.S., M.A.M.R., M.S.M. and N.E. by the NIH grants R01 GM54762 and P50 GM62529; and B.R. by the NIH grant 1-P50-GM62413-01 and the NSF grant DBI-0131168. Thanks to Phil Bourne (UCSD) and the RCSB crews for maintaining an excellent PDB and to all experimentalists who enabled this analysis by making their data publicly available. Last, but not least, thanks to all those developers who support EVA by going through the trouble of making their methods publicly available.