The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath/) currently contains 43229 domains classified into 1467 superfamilies and 5107 sequence families. Each structural family is expanded with sequence relatives from GenBank and completed genomes, using a variety of efficient sequence search protocols and reliable thresholds. This extended CATH protein family database contains 616470 domain sequences classified into 23876 sequence families. This results in the significant expansion of the CATH HMM model library to include models built from the CATH sequence relatives, giving a 10% increase in coverage for detecting remote homologues. An improved Dictionary of Homologous superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) containing specific sequence, structural and functional information for each superfamily in CATH considerably assists manual validation of homologues. Information on sequence relatives in CATH superfamilies, GenBank and completed genomes is presented in the CATH associated DHS and Gene3D resources. Domain partnership information can be obtained from Gene3D (http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/). A new CATH server has been implemented (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) providing automatic classification of newly determined sequences and structures using a suite of rapid sequence and structure comparison methods. The statistical significance of matches is assessed and links are provided to the putative superfamily or fold group to which the query sequence or structure is assigned.
The CATH database is a hierarchical classification of domains into sequence- and structure-based families and fold groups. Table 1 shows the population of the latest release of CATH (Version 2.5.1, released January 2004). At the lowest level of the hierarchy, sequences are clustered according to significant sequence similarity (35% identity and above, the S-level). At higher levels, domains are grouped according to whether they share significant sequence, structural and/or functional similarity (homologous superfamilies, the H-level) or just structural similarity (fold or topology group, the T-level). Fold groups sharing similar architectures, i.e. similarities in the arrangements of their secondary structures regardless of connectivity, are then merged into common architectures (the A-level). At the top of the hierarchy, domains are clustered depending on their class, i.e. the percentage of α-helices or β-strands (the C-level).
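The five levels described above can be pictured as a path from class down to sequence family. The following is a minimal sketch, not CATH's own data model: the `Classification` record, its integer node numbers and the `deepest_shared_level` helper are all assumptions introduced here for illustration. It reports the most specific level at which two domains are classified together.

```python
from typing import NamedTuple, Optional

# Levels ordered from the top of the hierarchy (class) to the bottom
# (sequence family), as described in the text.
LEVELS = ("C", "A", "T", "H", "S")


class Classification(NamedTuple):
    """Hypothetical record holding one node number per CATH level."""
    cls: int
    architecture: int
    topology: int
    superfamily: int
    seq_family: int


def deepest_shared_level(a: Classification, b: Classification) -> Optional[str]:
    """Return the lowest level at which two domains share a node, or None."""
    shared = None
    for level, node_a, node_b in zip(LEVELS, a, b):
        if node_a != node_b:
            break  # once the paths diverge, lower levels cannot be shared
        shared = level
    return shared
```

Two domains that agree on class, architecture and topology but differ in superfamily would return `"T"`, meaning that they share a fold but are not classified as homologues.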
Below we describe some new CATH associated resources and protocols that increase the speed and reliability of classifying newly determined protein structures in the CATH database.
The CATH-associated Dictionary of Homologous Superfamilies (DHS) (http://www.biochem.ucl.ac.uk/bsm/dhs/) was established in 1997 (1) and contains a variety of sequence, structural and functional information for each superfamily in CATH. It was updated recently for CATH version 2.5.1, which contains 1467 homologous superfamilies, 334 of which are populated with three or more remote homologues (<35% sequence identity). The DHS contains information on the pairwise sequence and structural similarities for all pairs of relatives in each superfamily. Sequence similarity is recorded by sequence identity and E-value. Structural similarity is recorded by pairwise SSAP score (2) and also by E-values determined against a distribution of scores obtained by comparing all non-redundant structures with each other.
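One simple way to turn a raw score into an E-value against a background distribution is to count how often unrelated comparisons score at least as well. The sketch below illustrates that idea only; the text does not specify how the DHS E-values are computed, and the function name, the add-one smoothing and the `n_comparisons` scaling are assumptions made here.

```python
from bisect import bisect_left
from typing import Sequence


def empirical_e_value(score: float,
                      background: Sequence[float],
                      n_comparisons: int) -> float:
    """Estimate the expected number of background scores >= `score`.

    `background` holds scores from comparisons of non-redundant
    structures; `n_comparisons` is the number of comparisons made in
    the search being assessed (an assumed scaling factor).
    """
    ordered = sorted(background)
    # Number of background scores greater than or equal to the query score.
    n_at_least = len(ordered) - bisect_left(ordered, score)
    # Add-one smoothing keeps the tail probability above zero.
    p_value = (n_at_least + 1) / (len(ordered) + 1)
    return p_value * n_comparisons
```

An empirical tail of this kind is only as fine-grained as the background sample, which is one reason a fitted distribution is often preferred for very high scores.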
Multiple structure alignments are derived for structurally coherent subgroups of relatives, having a pairwise SSAP score of >85 against all relatives in the subgroup. These are generated using the CORA algorithm (3) and displayed using CORAplot (3). The current DHS contains 671 structural alignments from 416 superfamilies. Highly conserved sequence positions, which may be associated with functionally important sites, are highlighted.
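The grouping criterion above (every member scoring >85 against every other member) can be sketched with a simple greedy pass. This is an illustrative assumption, not the procedure used to build the DHS subgroups: the function name, the dictionary of pairwise scores and the greedy assignment order are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence


def coherent_subgroups(domains: Sequence[str],
                       ssap: Dict[FrozenSet[str], float],
                       threshold: float = 85.0) -> List[List[str]]:
    """Greedily partition domains so that all pairs within a group
    have a pairwise score above `threshold`.

    `ssap` maps an unordered pair of domain identifiers to its score;
    missing pairs are treated as scoring 0.
    """
    groups: List[List[str]] = []
    for domain in domains:
        for group in groups:
            # The domain joins a group only if it is similar to every member.
            if all(ssap.get(frozenset((domain, member)), 0.0) > threshold
                   for member in group):
                group.append(domain)
                break
        else:
            # No existing group accepts the domain: start a new one.
            groups.append([domain])
    return groups
```

Because the pass is greedy, the result can depend on the input order; a full complete-linkage clustering would remove that dependence at greater cost.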
Two new methods have been devised to illustrate the degree of structural divergence across the superfamily. Both exploit a multiple structure alignment to identify equivalent secondary structures across the superfamily and inserted secondary structures. Plots give information on highly conserved secondary structures that are diagnostic for the particular superfamily and on the degree of structural embellishment occurring in diverse relatives. Putative homologues to a particular CATH superfamily can be aligned against structural relatives in order to determine whether their structural characteristics fall within the range of structural diversity observed across the superfamily. Information on the population of the superfamily is also provided so that users can gauge how well the superfamily has been sampled to date.
Functional annotations are also provided for each superfamily in the DHS by recruiting relevant functional data from the Protein Data Bank (PDB) (4), GenBank (5), ENZYME (6), KEGG (7) and Gene Ontology (8) databases. The more than 10-fold expansion in the extended CATH database (from 43229 CATH structural domain sequences to 616470 by including related GenBank sequences and genome sequences) has significantly increased the amount of functional data available for a particular superfamily.
Expansion in the functional information together with more informative descriptions of structural variability in each CATH superfamily considerably assists in validating new homologues classified in CATH. Furthermore, links to the DHS are provided for structural matches identified using the CATH server.
Profile-based methods for sequence comparison were developed in the early 1980s and allowed recognition of more distant homologues than pairwise approaches (9). Benchmarking of several publicly available methods, including those using position-specific scoring matrices and hidden Markov models (HMMs), has been undertaken by several groups (10,11). These approaches used datasets of distant homologues selected from the structural classifications, such as SCOP and CATH, to determine the sensitivity of various profile-based methods, e.g. HMMs (12) and PSI-BLAST (13).
We recently used a dataset of remote structural homologues from the CATH database (<35% sequence identity), which had been validated by structure comparison and manual inspection, to assess the performance of several HMM-based strategies (Strategies for Improved Fold and Superfamily Recognition in Genome Annotation; I. Sillitoe, personal communication). HMMs were built using the SAM-T technology developed by Karplus et al. (14). A total of 23876 HMM models were built for representative sequences from each sequence family in the extended CATH database (containing 616470 domain sequences). The extended model library gives a 10% increase in coverage for remote homologue detection compared to the standard CATH HMM model library, with a low error rate (0.1%) (I. Sillitoe, personal communication).
It can be seen from Figure 1 that, on average, nearly 87% of homologues classified in CATH over the last two years could be recognized using sequence comparison methods, both pairwise sequence alignment and scans against the more sensitive extended CATH-HMM model library.
We have recently devised protocols for identifying sequence relatives to CATH superfamilies in completed genomes (15). To date, nearly one million sequences from 150 completed genomes have been scanned against the CATH-HMM model library (15). Between 40 and 60% of sequences or partial sequences from each genome could be assigned to a CATH superfamily. Genome sequences were also scanned against libraries of HMM models from the Pfam database (release 10) (16) in order to extend the domain annotation of each genome sequence and provide more comprehensive information on domain partnerships.
Sequence relatives to CATH superfamilies identified in this way are displayed in the CATH-related DHS and Gene3D resources. Gene3D displays the domain composition of each gene annotated by CATH and Pfam domains. CATH family data in the Gene3D resource has revealed some intriguing insights into the expansion of superfamilies involved in metabolism and regulation in bacterial genomes (17).
Figure 2 shows that the power-law like trends first detected in the structural classifications are mirrored when sequence relatives from the genomes are also included. Considering the structural data alone, it can be seen from Figure 2a that fewer than 10 of the most highly populated folds in the CATH database account for nearly 25% of all superfamilies in the PDB. These folds were previously described as superfolds as they are adopted by many diverse homologous superfamilies (18). When genome sequences are included, it can be seen from Figure 2b that the same fold groups dominate the genomes, as they are adopted by nearly 45% of all close sequence families (relatives have 35% or more sequence identity), of known structure, in the genomes.
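The share of superfamilies accounted for by the most populated folds is a simple ranked-sum calculation. The sketch below shows that arithmetic under stated assumptions: the function name and the mapping from fold identifier to superfamily count are hypothetical, and any counts passed in would need to come from the CATH release itself.

```python
from typing import Mapping


def top_fold_share(superfamilies_per_fold: Mapping[str, int], n: int) -> float:
    """Fraction of all superfamilies that adopt one of the `n` most
    populated folds."""
    # Rank folds by the number of superfamilies adopting them.
    counts = sorted(superfamilies_per_fold.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total
```

Applied to per-fold counts of close sequence families rather than superfamilies, the same function would give the corresponding share for the genome data.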
A new protocol has been developed for searching CATH with a newly determined protein structure. Structures submitted to the server (http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl) are first processed by the DDMake suite of programs that generate derived data from the PDB coordinate files (e.g. secondary structure data, residue accessibilities and ψ data, sequence data in the FASTA format, etc.). The query sequence is scanned against the CATH-HMM model library to identify more remote homologues. Threshold E-values used to recognize homologues are predetermined by benchmarking with validated structural homologues from CATH (I. Sillitoe, personal communication).
If the sequence returns a significant match to any relative in one or more CATH superfamilies, representatives from all close sequence families within those superfamilies are structurally compared with the query structure using the SSAP structure alignment program (2). The top 10 structural matches, sorted by SSAP score, are then displayed together with information on the degree of sequence and structural similarity and with links to the CATH page and the DHS page for each CATH superfamily identified. Rasmol images are also provided for the top 10 matches.
Any query structure unmatched by the CATH-HMM library is scanned against a library of representative structures from each close sequence family in CATH using the rapid structure comparison algorithm, CATHEDRAL (19). CATHEDRAL uses a robust statistical framework based on the extreme value distributions observed for random similarities to assess significance. If the query structure significantly matches one or more CATH superfamilies, SSAP comparisons are performed for all sequence representatives in those superfamilies and the top 10 matches are displayed, as before.
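The control flow of the server protocol described above can be summarized in a short sketch. It is not the server's implementation: the function name and all four callables (`hmm_scan`, `cathedral_scan`, `representatives`, `ssap_score`) are hypothetical stand-ins for the CATH-HMM scan, the CATHEDRAL scan, the lookup of sequence family representatives and the SSAP comparison.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Match = Tuple[str, float]  # (representative identifier, SSAP score)


def classify_structure(
    query: str,
    hmm_scan: Callable[[str], Sequence[str]],
    cathedral_scan: Callable[[str], Sequence[str]],
    representatives: Callable[[str], Sequence[str]],
    ssap_score: Callable[[str, str], float],
    top_n: int = 10,
) -> Tuple[Optional[str], List[Match]]:
    """Return the method that found candidate superfamilies and the
    top-scoring structural matches within them."""
    # Step 1: sequence-based scan against the HMM library.
    superfamilies = hmm_scan(query)
    method = "CATH-HMM"
    if not superfamilies:
        # Step 2: fall back to the rapid structure comparison.
        superfamilies = cathedral_scan(query)
        method = "CATHEDRAL"
    if not superfamilies:
        return None, []
    # Step 3: structural comparison against every representative
    # in the matched superfamilies, keeping the best scores.
    matches = [(rep, ssap_score(query, rep))
               for sf in superfamilies
               for rep in representatives(sf)]
    matches.sort(key=lambda m: m[1], reverse=True)
    return method, matches[:top_n]
```

Passing the scanning and scoring steps in as callables keeps the sketch independent of any particular program interface; each would wrap whatever tool performs that step.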
F.P., I.S., M.D., A.G., T.L., A.A. and C.O. all acknowledge the Medical Research Council for their funding. A.T., D.L. and R.M. are currently supported by funding from the National Institutes of Health. G.R., O.R. and T.D. acknowledge support from the Biotechnology and Biological Sciences Research Council, and C.B. acknowledges support from the Wellcome Trust for the research described in this manuscript.