CATH version 3.3 (class, architecture, topology, homology) contains 128688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.
CATH (class, architecture, topology, homology) is a hierarchical protein domain classification system (1), where class reflects the secondary structure composition, architecture the general shape of the protein domain and topology the way in which the protein folds into this architecture. Homology captures evolutionary relationships between protein domains. Protein structures are taken from the Protein Data Bank (PDB) and decomposed into chains which, in turn, are split into domains. Domains are classified into homologous superfamilies using a combination of in-house algorithms which exploit structure, sequence and functional information. Fold groups and remote homologues (<80% sequence identity) are validated by manual curation. The class and architecture of the protein are manually specified (1).
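The four-level hierarchy described above maps naturally onto a dotted four-number code. The following is a minimal sketch, not CATH's own software: the `CathNode` class and its method names are hypothetical, and `3.40.50.620` is used purely as an illustrative code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CathNode:
    """One homologous superfamily, addressed by its four hierarchy levels."""
    class_id: int       # C: class
    architecture: int   # A: architecture within the class
    topology: int       # T: topology (fold group) within the architecture
    superfamily: int    # H: homologous superfamily within the topology

    @classmethod
    def parse(cls, code: str) -> "CathNode":
        """Parse a dotted code such as '3.40.50.620' (hypothetical helper)."""
        parts = code.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated numbers, got {code!r}")
        return cls(*(int(p) for p in parts))

    def lineage(self) -> List[str]:
        """Return the codes of the C, A, T and H levels, from root to leaf."""
        numbers = [self.class_id, self.architecture, self.topology, self.superfamily]
        return [".".join(str(n) for n in numbers[:depth]) for depth in range(1, 5)]


node = CathNode.parse("3.40.50.620")
print(node.lineage())
```

Each prefix of the code identifies one level, so walking up the hierarchy only requires dropping trailing numbers.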
The latest version of CATH (CATH v3.3) has expanded by 123 new folds, 199 new superfamilies and 14473 new domains over the previous release. Table 1 shows the current population of different levels in the CATH hierarchy. CATH v3.4 has 22988 more domains than CATH v3.3.
Figure 1(a) shows a ‘CATHerine wheel’ plot of the population of non-homologous structures, i.e., the structures representing each homologous superfamily, within the different hierarchical layers in CATH v3.3. Figure 1(b) shows the increase in the number of superfamilies between CATH v3.2 and CATH v3.3. Folds with the greatest increase in superfamily numbers include the α–β plaits, four α-helix bundles and SH3 type β-barrels.
In version 3.3, 36.2% of the new domains classified into CATH superfamilies fall within the top 10 most highly populated folds which currently account for 35.7% of all non-homologous domain structures in CATH.
The curation of both CATH v3.3 and CATH v3.4 has largely focused on classifying structures solved by structural genomics (SG) initiatives. One of the principal aims of the SG initiatives is to discover all the folds that exist in the protein structure universe (3). Previous analyses by our group have shown that a large proportion of structural superfamilies in nature are likely to be already represented in CATH, i.e., CATH superfamilies already account for a large proportion of domain sequences (up to 80%) in completely sequenced genomes (4). Indeed, there has been a gradual decrease in the number of new folds identified over the last decade (5). Currently, <2% of structures solved by traditional structural biology represent novel fold groups (5,6). By contrast, various studies (6–8) have shown that a higher proportion of SG structures represent novel folds. Using a normalized root mean square deviation (RMSD) of 5 Å to determine structural novelty, a recent study has shown that 28% of SG domains have novel structures compared with only 3% of non-SG domains (6).
In CATH v3.4, 1633 new SG structures have been classified, resulting in 99 new superfamilies and 39 new fold groups.
A significant proportion of domain sequences in completely sequenced genomes, currently unrepresented in CATH, are predicted to be transmembrane proteins (9). Structural classification of membrane proteins is more difficult than for soluble proteins due to the limited number of structural arrangements and their tendency to be structurally similar regardless of evolutionary history or function (9). The structures of transmembrane proteins are also difficult to determine experimentally, and some SG centres are specifically focusing on these types of proteins as targets (10). Many are α-helical proteins, comprising a single transmembrane helix, a helix hairpin or a 4 α-helix transmembrane bundle (9).
CATH v3.4 includes 2274 new transmembrane proteins, accounting for 71 new superfamilies and 22 new fold groups. Most of the newly classified superfamilies (62%) are α-helical in nature, with some 24% being single transmembrane helix superfamilies (see Table 2). A list of membrane-associated CATH superfamilies, with links to their individual superfamily pages, is now available to view at http://www.cathdb.info/sfam/membrane/ and will be added as an option on the CATH portal for CATH version 3.4.
Historically, CATH has provided information on protein structures only. Information on CATH superfamily sequence relatives is currently obtainable from CATH’s ‘sister’ site, Gene3D (11). Multi-domain architectures (MDA) and taxonomic distribution for CATH superfamilies are also provided through Gene3D as are a number of protein functional annotations. These include protein–protein interaction (PPI) data (12), GO functional assignments (13), KEGG pathways (14) and FunCAT functional descriptions (15).
Current work on the CATH website includes the development of a single web-based portal through which users can access the data provided by both CATH and Gene3D. All the usability of the original site is being maintained, including the CATHEDRAL (16) and SSAP (17,18) web servers for structural comparison. Users are able to browse through the hierarchy in the same manner as in previous incarnations of the website.
The CATH superfamily pages, however, have been completely redesigned in order to provide the functional information previously only available through Gene3D and to show the structural diversity known to exist within some superfamilies (see Figure 2). Beta pages for CATH version 3.4 for the HUP superfamily (CATH code 3.40.50.620) are available for viewing (http://beta.cathdb.info/cathnode/3.40.50.620). Previous research carried out by our group has shown that the 100 most structurally diverse superfamilies in CATH are also the most highly populated, accounting for around 40% of the domain sequences in the genomes (see Figure 3) (19). Integrating sequence data more seamlessly with the structural data allows us to identify the most structurally and functionally diverse superfamilies and the most highly populated (see Figure 4).
The largest and smallest domains within any particular superfamily are now displayed to give a snapshot of the structural variation across the superfamily (see Figure 3). A more thorough understanding of structural diversity across the superfamily can be obtained by viewing plots of structural similarity scores between pairs of relatives (see Figure 4).
Within each superfamily, we also provide information on structurally coherent groups of relatives. Structurally similar groups are identified by comparing domain structures using our in-house structure comparison algorithm [SSAP (17,18)] and using multi-linkage clustering to generate groups of ‘close’ structural clusters [superposing with normalized RMSD (20) <5 Å] and clusters of structurally more distant relatives (superposing with RMSD <9 Å).
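The two-threshold grouping can be illustrated with a small sketch. Several things here are assumptions rather than the CATH implementation: the text does not specify the linkage rule, so complete linkage is used as a stand-in; the normalization formula (RMSD scaled by the larger domain length over the number of aligned residues) is assumed; and the domain names and distances are invented for illustration.

```python
from itertools import combinations


def normalized_rmsd(rmsd, len_a, len_b, n_aligned):
    """Assumed normalization: penalize alignments covering little of the larger domain."""
    return rmsd * max(len_a, len_b) / n_aligned


def cluster_by_threshold(ids, dist, cutoff):
    """Agglomerative clustering with complete linkage (an assumed stand-in).

    Two clusters are merged only while the largest pairwise distance
    between their members stays below `cutoff`.
    """
    clusters = [[i] for i in ids]

    def linkage(a, b):
        return max(dist[frozenset((x, y))] for x in a for y in b)

    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = linkage(clusters[i], clusters[j])
            if d < cutoff and (best is None or d < best[0]):
                best = (d, i, j)
        if best is None:
            return sorted(sorted(c) for c in clusters)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


# Hypothetical normalized RMSD values (in angstroms) between four domains.
pairs = {("d1", "d2"): 2.0, ("d1", "d3"): 6.0, ("d2", "d3"): 7.0,
         ("d1", "d4"): 12.0, ("d2", "d4"): 11.0, ("d3", "d4"): 8.5}
dist = {frozenset(k): v for k, v in pairs.items()}
domains = ["d1", "d2", "d3", "d4"]

close_clusters = cluster_by_threshold(domains, dist, cutoff=5.0)
distant_clusters = cluster_by_threshold(domains, dist, cutoff=9.0)
print(close_clusters)
print(distant_clusters)
```

With these invented distances, the stricter cutoff keeps only `d1` and `d2` together, while the looser cutoff also absorbs `d3`; `d4` stays separate in both cases because it is too far from `d1` and `d2`.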
In our previous analyses, a superfamily with five or more close structurally similar groups (SSGs) was considered to be structurally diverse (19). By including predicted CATH domains in our CATH resource, we can see from Figure 2 that there is a correlation between structural diversity, measured by the number of close structural clusters, and sequence diversity, measured by the number of sequence clusters (domains clustered at 30% sequence identity).
As regards functional annotations, users can explore the degree of functional diversity across the superfamily by examining the range of annotations provided for all the predicted CATH sequence relatives (integrated from Gene3D) by the Enzyme Classification (21), UniProt (22), FunCAT (15), KEGG (14) and GO resources (13). Over-represented keywords are also extracted from domain and protein annotations using the Solr search engine (see section below) to give an indication of the overall functionality and functional diversity of the superfamily being viewed.
CATH v3.4 provides multiple structural alignments [displayed using the Jalview applet (23)] for both close and distant structural clusters showing conserved residues [as calculated by scorecons (24)] and functional residue data downloaded from WSsas (catalytic residues and ligand binding residues) (25). Multiple sequence alignments will also be provided for a recently established functional family subclassification within the superfamily [GeMMA clusters (26)]. Individual three-dimensional (3D) domain structures from these alignments can be selected for display using the Jmol applet (27), complete with annotated functional and conserved residues (see Figure 5). This resource will be expanded in the future to include other functional data, for example, relating to protein interactions and also mutation data from OMIM.
In addition to providing multiple structural alignments, the structural clusters, representing close and distant structural relatives, respectively, are used to generate structure guided multiple sequence alignments of the sequence domains from Gene3D associated with the cluster. The alignments are presented on the CATH superfamily pages and are utilized by a new resource (FunTree) being developed in collaboration with the Thornton Group at the European Bioinformatics Institute. The expanded structure-based sequence alignments are used to generate a phylogenetic tree of the relatives. For superfamilies that contain known enzymes, functional data are displayed, including assigned enzyme commission (E.C.) number.
In addition, comparative analysis of the enzyme’s reactions, using comparisons between bond order changes and substrate substructure similarity (28), is carried out. The results of the reaction and small-molecule analysis are presented in conjunction with the sequence comparison analysis. Furthermore, the multi-domain architecture information from Gene3D is taken into consideration. This allows the analysis of the evolution of enzyme function within a CATH superfamily.
Since 2006, the content for the CATH website has been generated through a series of standalone CGI scripts written in the Perl programming language. Although these were sufficient for the original purpose of displaying simple webpages, many extra requirements have been added since. As more features were included, the inefficiencies inherent in serving requests from individual scripts (i.e. rather than serving webpages from a persistent web framework) severely restricted the possibilities for further development from the perspective of both hardware resources and code maintenance. As a result, the first step in facilitating any future web development was to migrate the existing code base to a more modern web framework. As a great deal of the existing group code was already in Perl, the Catalyst MVC (Model-View-Controller) Web Framework (29) was identified as a suitably mature and well-supported Perl project. Moving the code across to run in a persistent environment did require a significant amount of tidying and sanity checking because a persistent environment is far less tolerant than single-run scripts; however, the refactoring process also provided an opportunity to improve the organization, modularity and general efficiency of the code.
Under the persistent environment of the Catalyst web framework, webpages could be served up to several orders of magnitude faster than by standalone scripts. Analysis of optimization results demonstrated that this improvement was mainly due to lengthy initialization events (such as loading support libraries, creating database connections, etc.) only occurring once when the server is started, rather than at the beginning of every request. Also, since resources are shared across a number of server threads running in parallel, and the processing time of each request is much shorter, the general load on the server (process and disk I/O) is significantly reduced. This contributed to allowing a greater number of concurrent requests to be processed over a given period of time and ultimately provides a more satisfying user experience.
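The difference between per-request and once-only initialization can be shown with a toy model. This sketch is in Python rather than Perl and does not use Catalyst; the class and function names are hypothetical, and a counter stands in for the real cost of loading libraries and opening database connections.

```python
class ExpensiveResources:
    """Stands in for loading support libraries and opening database connections."""
    init_count = 0  # how many times the costly setup has run

    def __init__(self):
        ExpensiveResources.init_count += 1
        self.pages = {"/superfamily/example": "example superfamily page"}


def handle_cgi_style(path):
    """Standalone-script model: setup is repeated for every request."""
    resources = ExpensiveResources()
    return resources.pages.get(path, "not found")


class PersistentApp:
    """Persistent-framework model: setup happens once, at server start."""

    def __init__(self):
        self.resources = ExpensiveResources()

    def handle(self, path):
        return self.resources.pages.get(path, "not found")


def count_initializations(n_requests):
    """Return (setups under the CGI model, setups under the persistent model)."""
    ExpensiveResources.init_count = 0
    for _ in range(n_requests):
        handle_cgi_style("/superfamily/example")
    cgi_setups = ExpensiveResources.init_count

    ExpensiveResources.init_count = 0
    app = PersistentApp()
    for _ in range(n_requests):
        app.handle("/superfamily/example")
    persistent_setups = ExpensiveResources.init_count
    return cgi_setups, persistent_setups


print(count_initializations(100))
```

The number of setups grows with the request count in the first model and stays constant in the second, which is the effect the optimization analysis above attributes the speed-up to.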
An additional advantage of moving across to a modern web framework, such as Catalyst, is the built-in support for extra features such as SOAP or REST-based web services. This has minimized the amount of code required to be written (and maintained) to provide informational web services such as the CATH SOAP DataServices (30).
In order to improve CATH’s searchability, we have built a dedicated search engine which indexes structures and CATH classifications according to the keywords and entity IDs associated with them and the full text of their descriptions and annotations. This uses Solr (31), a search server based on the popular Lucene toolkit, enabling complex queries across various fields which support Boolean operators, phrase searches, wildcards and many other advanced search features. For each CATH release, related data from CATH, Gene3D, PDB, UniProt and other external sources are aggregated and flattened into a single ‘document’ per CATH entity, which is added to a Solr index. This enables even highly complicated queries to be answered very quickly.
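The aggregation step, in which records from several sources are flattened into one document per entity, might look roughly like the sketch below. The field naming scheme (`source_field`), the catch-all `text` field and the sample records are all assumptions made for illustration; the actual CATH index schema is not described in the text.

```python
def flatten_entity(entity_id, sources):
    """Merge per-source annotation records into one flat, indexable document.

    `sources` maps a source name (e.g. 'cath', 'uniprot') to a record whose
    values are either single items or lists. Every value is also copied into
    a catch-all 'text' field to support full-text search.
    """
    doc = {"id": entity_id, "text": []}
    for source_name, record in sources.items():
        for field, value in record.items():
            values = value if isinstance(value, list) else [value]
            doc.setdefault(f"{source_name}_{field}", []).extend(values)
            doc["text"].extend(str(v) for v in values)
    return doc


# Hypothetical records for a single entity.
example_doc = flatten_entity(
    "example_superfamily",
    {
        "cath": {"name": "Example superfamily"},
        "uniprot": {"keyword": ["Ligase", "ATP-binding"]},
    },
)
print(example_doc)
```

Because each entity becomes a single flat record, a query never needs to join across sources at search time, which is one plausible reason such an index answers complicated queries quickly.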
The Solr index can also be queried for the most significant terms associated with a given entity. We use this feature to annotate search results with lists of the most representative keywords for each entity. Solr is entirely web-services based. Queries are answered via a RESTful interface allowing data to be returned in a variety of formats including XML, JSON and CSV. It drives the search functionality on the CATH website, and we plan to make it publicly available for external users to query programmatically.
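A programmatic query against such an interface could be assembled as in the sketch below. The `q`, `wt` and `rows` parameters and the `/select` handler are standard Solr features; the base URL and the field names `keyword` and `text` are hypothetical, since no public CATH endpoint or schema is given in the text.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SUPPORTED_FORMATS = {"xml", "json", "csv"}


def build_select_url(base_url, query, fmt="json", rows=10):
    """Build a Solr /select URL for `query`, asking for `fmt` output."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response format: {fmt!r}")
    params = {"q": query, "wt": fmt, "rows": rows}
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


def read_params(url):
    """Decode the query string of a URL back into a dict of value lists."""
    return parse_qs(urlsplit(url).query)


# Hypothetical core URL and field names; Boolean operators and phrase
# searches use the Lucene query syntax.
url = build_select_url(
    "http://localhost:8983/solr/cath",
    'keyword:ligase AND text:"Rossmann fold"',
)
print(url)
print(read_params(url))
```

Switching `fmt` to `"xml"` or `"csv"` changes only the `wt` parameter, so the same query can feed different downstream tools.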
In summary, CATH has expanded over the last 2 years to include 365 new superfamilies (176 new fold groups), 29% of which came from the SG initiatives (30% fold groups) and 28% (22%) of which were membrane families (folds). We have extended the functional information available for each CATH superfamily by integrating domain sequence relatives from Gene3D and displaying their functional annotations from various public resources (e.g. GO, EC, KEGG and FunCAT). We now provide more detailed information on structural and functional diversity across each superfamily and multiple structure alignments for clusters of close and distant structural relatives. The FunTree display presents a phylogenetic perspective of enzyme superfamilies derived from a multiple sequence alignment and annotated by functional characteristics such as EC number and reaction mechanisms. Finally, access to the data in CATH has been made easier by building the webpages within the Catalyst MVC framework and the search facilities have been significantly improved by exploiting Solr and Lucene.
A. Cuff, M. Pellegrini-Calace, N. Furnham (BBSRC); A. Clegg (EMBRACE); T. Lewis (IMPACT, E.U); R. Rentzsch (ENFIN, E.U); I. Sillitoe (Wellcome Trust). Funding for open access charge: Wellcome Trust.
Conflict of interest statement. None declared.