As we have expanded the CATH superfamilies, we have also been able to use increasing functional information from public resources [e.g. Gene Ontology (GO) (7), Enzyme Commission (EC) (8)] to develop our knowledge of functional divergence within them. We have aimed to use this expanding knowledge to provide functional sub-classification of relatives, which can help biologists understand the structural mechanisms by which functions evolve.
SCOP sub-classifies superfamilies into functionally coherent families through manual analysis using information available in the literature and functional annotation databases [e.g. SwissProt (9), GO, EC]. However, recent analyses by Gough et al. suggest that these groupings correspond more closely to taxonomic groups than to functional groups (10).
A functional family (FunFam) layer within all CATH superfamilies was first introduced in CATH-Gene3D v10 (11). Predicted domain sequences for CATH superfamilies from Gene3D are now explicitly included in CATH. Domain sequences identified in UniProt (12) and Ensembl (13) currently expand CATH from 173 536 domain structure entries to 16 297 076 known and predicted domain structure entries. CATH sequence data within each superfamily are sub-classified into FunFams to group together relatives likely to have similar structures and functions.
The original protocol to establish these functional families used a profile-based sequence clustering algorithm together with a fixed generic granularity threshold (14). This corresponds to vertically ‘cutting’ the domain sequence similarity tree of a superfamily at a fixed level to derive a set of FunFams; we refer to this as the ‘unsupervised’ protocol.
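The idea of cutting at a fixed level can be illustrated with a minimal sketch. It is not the profile-based algorithm of the protocol itself: it assumes simple single-linkage grouping over hypothetical pairwise E-values, and the function name, domain identifiers and scores are all invented for illustration.

```python
from itertools import combinations


def cut_at_fixed_threshold(pairwise_evalues, members, threshold):
    """Group members by single linkage at a fixed E-value cutoff.

    pairwise_evalues maps frozenset({a, b}) to a hypothetical E-value;
    pairs missing from the map are treated as unrelated.
    """
    parent = {m: m for m in members}

    def find(x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(members, 2):
        evalue = pairwise_evalues.get(frozenset((a, b)), float("inf"))
        if evalue <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    families = {}
    for m in members:
        families.setdefault(find(m), set()).add(m)
    return sorted(sorted(f) for f in families.values())


# Toy data (invented): four domains with three scored pairs.
members = ["d1", "d2", "d3", "d4"]
evalues = {
    frozenset(("d1", "d2")): 1e-30,
    frozenset(("d2", "d3")): 1e-12,
    frozenset(("d3", "d4")): 1e-4,
}

print(cut_at_fixed_threshold(evalues, members, 1e-10))
print(cut_at_fixed_threshold(evalues, members, 1e-20))
```

With this toy input, the same superfamily yields fewer, larger families at the more permissive cutoff and more, smaller families at the stricter one, which is the sense in which a single generic threshold fixes the granularity for every superfamily.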
A modified version of the FunFam protocol, which exploits available GO annotation data to determine an appropriate ‘cut’ of the sequence tree instead of using a fixed threshold, was used to generate domain families for protein function prediction in Critical Assessment of Protein Function Annotation (CAFA) 2010 (BMC Bioinformatics submitted). This protocol has since been extended with a mechanism to detect and account for instances of functional ‘chaining’ in the clustering dendrogram, that is, cases of incongruence between domain sequence similarity and overall protein function similarity. Together, these steps constitute the ‘supervised’ protocol.
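As a rough illustration of an annotation-guided cut, the sketch below descends a binary tree and stops splitting as soon as all annotated leaves under a node agree. This is an assumed, simplified criterion, not the published protocol: chaining detection is not modelled, unannotated leaves are simply carried along, and the tree encoding (nested tuples), function names and domain identifiers are hypothetical.

```python
def leaves(node):
    """Return the leaf identifiers under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def go_guided_cut(node, go_terms):
    """Cut the tree wherever a subtree is consistent in its GO annotation.

    go_terms maps a leaf identifier to a frozenset of GO term IDs;
    leaves absent from the map are treated as unannotated.
    """
    members = leaves(node)
    annotations = {go_terms[m] for m in members if m in go_terms}
    # Stop at a leaf, or when the annotated members all share one term set.
    if isinstance(node, str) or len(annotations) <= 1:
        return [sorted(members)]
    left, right = node
    return go_guided_cut(left, go_terms) + go_guided_cut(right, go_terms)


# Toy tree and annotations (invented); d3 has no annotation.
tree = ((("d1", "d2"), "d3"), ("d4", "d5"))
go = {
    "d1": frozenset({"GO:0005515"}),
    "d2": frozenset({"GO:0005515"}),
    "d4": frozenset({"GO:0016301"}),
    "d5": frozenset({"GO:0016301"}),
}

print(go_guided_cut(tree, go))
```

Here the cut depth differs between branches of the same tree, in contrast to a fixed threshold, because each branch is split only as far as its annotations require.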
When dealing with families of (domain) sequences, it quickly becomes apparent that different use cases often require entirely different levels of family granularity. For example, in the large superfamily that represents the PDZ domain (CATH 2.30.42.10), a promiscuous peptide-binding module, two entirely different sets of families can be identified depending on the ‘point of view’. On the one hand, all domain sequences could be put into a single family, given that the domain always fulfils the same partial function within a diverse set of parent proteins and their different overall functions. On the other hand, the PDZ domains appearing in parent proteins of the same type (e.g. an orthologous group of proteins) will commonly be more similar to each other than to all other domains in the superfamily. These observations lead to two possible sets of families for the same superfamily, one ‘coarse’ and one ‘fine’.
Coarse FunFams in the above-described sense primarily lend themselves to broad evolutionary studies, for example, to track instances of domain shuffling (15). They are also the most intuitive kind of families in the context of a domain-based resource such as CATH-Gene3D, as they clearly focus on domain function, not whole-protein function. At the same time, applications such as the detailed study of conserved residues (e.g. in active and binding sites) may require the use of finer FunFams. Ultimately, the choice depends on the user's needs, and this realization is what governed our strategy.
As a pragmatic attempt to account for the above-described dichotomy, the current Gene3D FunFam protocol uses a hybrid approach: FunFams are first identified in a given superfamily using the latest supervised protocol, including the detection of chaining. As the latter feature is still somewhat experimental, and as finer families may sometimes be required regardless of whether domain function is conserved (see above), a second set of families (‘FineFams’) is then identified, using the original unsupervised protocol. For this, a generic threshold setting of 1e−10
) was determined in benchmarking EC4 conservation on over 400 enzyme-domain containing superfamilies in Gene3D (data not shown), underlining the focus on whole-protein function at this level. Whenever no high-quality GO annotation data are available for a superfamily, only the FineFam layer is generated.
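The decision logic of the hybrid approach can be summarised in a short sketch. It assumes the two clustering protocols are available as callables; `build_family_layers`, its parameters and the stub functions are hypothetical names used only for illustration and do not correspond to any actual Gene3D code.

```python
def build_family_layers(superfamily_id, has_go_annotation,
                        supervised_cut, unsupervised_cut):
    """Return the family layers generated for one superfamily.

    FineFams are always produced with the unsupervised protocol; FunFams
    are added only when high-quality GO annotation data are available.
    """
    layers = {"FineFams": unsupervised_cut(superfamily_id)}
    if has_go_annotation:
        layers["FunFams"] = supervised_cut(superfamily_id)
    return layers


# Stub protocols (invented) standing in for the real clustering steps.
def toy_supervised(superfamily_id):
    return [["d1", "d2", "d3"]]


def toy_unsupervised(superfamily_id):
    return [["d1", "d2"], ["d3"]]


print(build_family_layers("sf_with_go", True, toy_supervised, toy_unsupervised))
print(build_family_layers("sf_without_go", False, toy_supervised, toy_unsupervised))
```

Keeping the two layers as separate entries, rather than merging them, mirrors the point made above that the appropriate granularity depends on the use case.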