Knowledge of a protein’s 3D structure can provide critical insights into aspects of its biological function: from catalytic mechanisms and protein–protein interactions to the reasons that specific gene mutations cause harmful disruptions.
Genome3D collates and presents data from resources that use the domain structures from the Structural Classification of Proteins (SCOP) and CATH classifications to provide predictions on sequences for which a structure may not yet have been solved. The SCOP and CATH databases classify protein domains derived from structurally characterized proteins deposited in the Protein Data Bank (PDB) (1). Domains are classified into homologous superfamilies and fold groups (see ‘Materials and Methods’ for more information on SCOP and CATH).
Although both resources combine automated methods with manual curation to detect homologous domains, SCOP relies more heavily on manual curation. For CATH, homologs are automatically recognized using in-house structure comparison methods [SSAP (2), CATHEDRAL (3) and Hidden Markov Model (HMM)-based strategies (4)]. In addition to differences in protocols for structure classification, SCOP and CATH use somewhat different criteria for recognizing domain boundaries in multi-domain structures. SCOP only recognizes domains that have been observed to recur in different multi-domain contexts, whereas CATH also uses physical considerations such as globularity and compactness.
Although all five domain prediction methods in Genome3D exploit homology-based approaches to predict domain structures in uncharacterized protein sequences, they use different strategies (see ‘Materials and Methods’ section). Some (Gene3D, SUPERFAMILY) exploit HMM-based strategies for recognizing relatives of SCOP or CATH superfamilies. Others use more sensitive threading-based strategies, which detect much more remote homologues of SCOP (FUGUE, Phyre) or CATH (FUGUE, pDomTHREADER) superfamilies.
All five resources are widely used by the biology community to obtain structure predictions and annotations for their sequences. However, it is clear that, especially in the cases of remote homologues, none of the methods is guaranteed to provide the correct answer. Therefore, a major aim of the resource is to display predictions from all the groups so that users can identify regions that are more likely to be correct because there is extensive agreement between the resources. This information is displayed in a highly intuitive fashion. Furthermore, users can easily follow links from Genome3D to any of the individual resources if they need more information.
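The idea of highlighting regions where the resources agree can be illustrated with a minimal sketch. It is not the Genome3D implementation: the function names, the input layout (1-based, inclusive domain ranges per resource) and the example coordinates are all hypothetical, and the sketch measures only positional overlap, ignoring whether the resources assign the same superfamily.

```python
def residue_agreement(predictions, seq_len):
    """For each residue, count how many resources predict a domain covering it.

    `predictions` maps a resource name to a list of (start, end) domain
    ranges, 1-based and inclusive (a hypothetical layout for illustration).
    """
    counts = [0] * seq_len
    for domains in predictions.values():
        covered = set()
        for start, end in domains:
            # Clip each range to the sequence so bad input cannot overflow.
            covered.update(range(max(start, 1), min(end, seq_len) + 1))
        # A resource contributes at most one vote per residue.
        for pos in covered:
            counts[pos - 1] += 1
    return counts


def consensus_regions(counts, min_resources):
    """Return (start, end) stretches covered by at least `min_resources`."""
    regions = []
    start = None
    for pos, count in enumerate(counts, start=1):
        if count >= min_resources:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(counts)))
    return regions


# Made-up domain boundaries for a 220-residue sequence (illustrative only).
example = {
    "Gene3D": [(10, 120)],
    "SUPERFAMILY": [(12, 118)],
    "FUGUE": [(5, 125)],
    "Phyre": [(15, 110), (140, 200)],
    "pDomTHREADER": [(20, 115)],
}
counts = residue_agreement(example, seq_len=220)
print(consensus_regions(counts, min_resources=4))
```

Raising `min_resources` narrows the result to the core region supported by more resources, while lowering it to 1 also reports regions predicted by a single method, such as the second Phyre domain in this made-up example.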
Thus, Genome3D is analogous to InterPro in providing comparisons between domain family annotations supplied by different resources. A major difference from InterPro, however, is that Genome3D provides structural annotations for very remote homologues in domain families (i.e. predictions from FUGUE, pDomTHREADER, Phyre). These annotations are not provided in InterPro. Furthermore, <50% of the structure annotations provided by Gene3D/SUPERFAMILY are displayed in InterPro, and no predicted 3D models are provided by InterPro, which focuses mainly on sequence-based family resources. Thus, Genome3D is an important complementary resource.
Structure data, and in particular the 3D models provided by Genome3D, are important in understanding the mechanisms by which proteins function. For example, 3D structure can help identify highly conserved residues clustering in active site regions. Structure data are also becoming increasingly important in interpreting the impacts of genetic variants identified by next-generation sequencing projects. These variants correspond to non-synonymous single nucleotide polymorphisms (nsSNPs) and alternative splice variants that can affect the structure of the protein and its ability to perform its function. For example, mutations of residues in or close to the active site have been implicated in some cancers (5). By providing 3D models and information on regions of high and low confidence in the domain predictions, Genome3D can help biologists and biomedical researchers determine whether genetic variations, e.g. nsSNPs, are likely to damage the structure and thereby affect the proper functioning of the protein.