Comprehensive generation of three-dimensional structures with the resolution or reliability of those determined by X-ray crystallography or nuclear magnetic resonance (NMR) is currently beyond the capabilities of any protein structure prediction method. These methods can, however, play an important role in generating structural annotations for whole genomes, because they require a much lower investment of resources per protein domain. In this work, we have shown that it is possible to: (1) generate protein structure models on a genome-wide scale, (2) automate the assessment of structure prediction quality, (3) convert the results into pre-existing encodings of structure in the form of SCOP superfamily classifications, and (4) augment the model-based assignment of SCOP superfamilies by integrating it with pre-existing function, process, and component information encoded in the GO database.
We were able to assign SCOP superfamilies to 7,094 of the 14,934 predicted domains in yeast using PSI-BLAST and fold recognition methodology. A total of 4,006 of the remaining 7,840 domains were short enough (fewer than 150 amino acids) for de novo structure prediction. Of these, 668 were omitted because they contained at least one predicted transmembrane helix. Low-resolution structure models were built for the remaining domains using Rosetta; of these, 404 were confidently assigned to superfamilies using MCM, and an additional 177 were confidently assigned after integration with GO process, component, and function annotations.
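The counts in the preceding paragraph can be cross-checked with a short tally. The sketch below is only bookkeeping on the numbers stated in the text; the variable and function names are illustrative, and the derived figures (domains modeled with Rosetta, overall coverage) are computed from those counts rather than reported in the text.

```python
# Domain counts as stated in the text; all names here are illustrative.
TOTAL_DOMAINS = 14_934
ASSIGNED_BY_HOMOLOGY = 7_094    # PSI-BLAST and fold recognition
SHORT_ENOUGH = 4_006            # fewer than 150 amino acids
TRANSMEMBRANE_OMITTED = 668     # at least one predicted transmembrane helix
ASSIGNED_BY_MCM = 404           # confident de novo assignments
ASSIGNED_WITH_GO = 177          # additional assignments after GO integration


def tally():
    """Derive the intermediate and total counts from the stated figures."""
    remaining_after_homology = TOTAL_DOMAINS - ASSIGNED_BY_HOMOLOGY
    # Assumes every short, non-transmembrane domain was modeled.
    modeled_de_novo = SHORT_ENOUGH - TRANSMEMBRANE_OMITTED
    assigned_de_novo = ASSIGNED_BY_MCM + ASSIGNED_WITH_GO
    total_assigned = ASSIGNED_BY_HOMOLOGY + assigned_de_novo
    return {
        "remaining_after_homology": remaining_after_homology,
        "modeled_de_novo": modeled_de_novo,
        "assigned_de_novo": assigned_de_novo,
        "total_assigned": total_assigned,
        "coverage": total_assigned / TOTAL_DOMAINS,
    }


if __name__ == "__main__":
    for name, value in tally().items():
        print(f"{name}: {value}")
```

Under these assumptions, the remaining-domain count reproduces the 7,840 quoted in the text, and the combined homology-based and de novo assignments cover roughly half of the predicted domains.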
A significant challenge in carrying out this work was the magnitude of the computation required for generating de novo structure predictions for large numbers of domains. Robust and fast methodology, efficient data storage, analysis tools, and data organization were required. Our use of distributed computing (http://wcgrid.org), innovative database architecture [39], and fully automatic methods was essential for this full-genome annotation. Yeast is particularly interesting because it is the focus of a vast global research effort. Future work will include an ongoing effort to scale this procedure to more than 150 completely sequenced genomes, as well as to employ recently developed higher-resolution structure prediction methods [41] that produce more accurate and reliable models but require significantly greater computational resources per protein domain.
The information content in the predicted structures may be further leveraged by integration with other data, such as global quantitative measurements of mRNA and protein expression levels and of DNA–protein and protein–protein interactions. Such datasets are available for yeast and several other organisms as part of ongoing functional genomics efforts, and integration of these data types with the predicted structures should contribute to the annotation of protein functions.