Protein domains are distinct units of molecular evolution, usually associated with particular aspects of molecular function such as catalysis or binding. In general, they represent discrete units of three-dimensional (3D) structure. The identification of functionally characterized domains in protein sequences may give the first clues as to their molecular and cellular function.
Protein domains come in families. A dazzling array of functional diversity, and a large number of clusters grouped by obvious sequence similarity, can be reduced to anywhere between several hundred and a few thousand domain superfamilies, depending on how aggressively one groups clusters based on 3D-structural and/or functional similarities. In many cases, a single or a few search models are sufficient to uniquely identify all members of a large, diverse superfamily in a sequence database. In fact, it is possible to identify and label domains in more than two-thirds of the known protein sequences with only a few thousand domain models, as exemplified by the comprehensive collection Pfam (1). However, even a compact collection such as Pfam cannot help but create separate models for what are truly homologous families. Overlapping regions in protein sequences will sometimes be annotated by more than one model. The Conserved Domain Database (CDD) also mirrors other collections, which are largely redundant with Pfam: SMART (2) and COG (3). This, of course, aggravates the annotation problem. Users of the CD-Search resource (4) may face multiple overlapping annotations, sometimes with very similar scores but distinct functional associations. This often-confusing redundancy is a necessary but undesirable property of a multiple-source collection such as CDD.
One can take certain obvious steps to reduce the redundancy, and this is what we have begun to do in CDD v2.00. Search models are clustered based on overlapping hits in the protein database. Members of a cluster that do not significantly add to the cluster's total coverage are removed from Entrez's default CDD collection. We have also removed search models that annotated very few or no sequences, as well as search models that seem to be specific for proteins and/or domains found only in narrow phylogenetic lineages.
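The coverage-based pruning step can be illustrated with a minimal Python sketch. It is not the procedure used to build CDD: it assumes that each model's hits are reduced to a set of protein identifiers (the real clustering works on overlapping hit regions), and the function name `prune_cluster`, the greedy selection order and the `min_gain` threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def prune_cluster(hits: Dict[str, Set[str]], min_gain: float = 0.05) -> List[str]:
    """Greedily keep models that still add coverage to a cluster.

    hits     : hypothetical mapping of model name -> set of protein IDs it hits
    min_gain : assumed threshold; a model is kept only if it adds at least this
               fraction of the cluster's total coverage
    """
    total: Set[str] = set()
    for proteins in hits.values():
        total |= proteins
    if not total:
        return []

    covered: Set[str] = set()
    kept: List[str] = []
    remaining = dict(hits)
    while remaining:
        # Pick the model that adds the most uncovered proteins;
        # iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[best] - covered)
        if gain / len(total) < min_gain:
            break  # no remaining model adds significant coverage
        kept.append(best)
        covered |= remaining.pop(best)
    return kept


# Toy cluster with made-up model names and protein IDs.
cluster = {
    "pfamA": {"p1", "p2", "p3", "p4"},
    "smartA": {"p1", "p2", "p3"},
    "cogA": {"p4", "p5", "p6"},
}
kept_models = prune_cluster(cluster)
```

In this toy cluster, `smartA` hits only proteins that `pfamA` already covers, so it is the one member that is dropped. A greedy pass is only one possible design; it favors the broadest models first, which may or may not match how a curated collection would prioritize its sources.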
However, redundancy can be a good thing, if it provides more specific functional annotations, and the relationships between related models are clear and well explained to the user. There are practical limits to subdividing domain superfamilies: a large number of domain models will affect the database search time, and experimentally backed functional annotation is sparse in many cases. For CDD, we have adopted a principle of creating subfamilies only for ancient conserved domains, present in diverse organisms. We create subfamilies only when the phylogenetic distribution of member sequences suggests an origin of a domain ‘orthology’ group by gene duplication occurring ~0.5 Byr in the past or earlier. This principle helps us to maintain what we hope will be a uniform and understandable level of granularity. In subfamilies, we attempt to identify function from the sequence annotation and the published literature. Alignment models are kept consistent throughout superfamily hierarchies. The core model in a subfamily alignment can be mapped onto the often less extensive alignment in the ‘parent’ model, greatly facilitating updates to include novel representative structures and sequences.
To identify ancient subfamilies for splitting out individual search models, we perform phylogenetic analysis on the multiple sequence alignments and construct sequence trees. This procedure requires fairly accurate alignments, and frequently we do revise alignment models imported from outside sources. In alignment curation, we consider information from 3D structure and structure superposition, when possible, to define structurally conserved cores, accurately delineate domain boundaries and resolve conflicts between sequence-based alignment methods and structure superposition (5). Alignments curated at the NCBI conform to a simple block-structure, with uniformly aligned, gap-less, structurally conserved blocks separated by unaligned regions, which capture length variation.
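One way to picture the block-structure is as a list of gap-less blocks, each recording a shared length and one start position per sequence. The Python sketch below is an assumption-laden illustration, not the NCBI data model: the `Block` class, the 0-based coordinates and both helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """A gap-less aligned block (hypothetical representation)."""
    length: int             # number of aligned columns, same for every sequence
    starts: Dict[str, int]  # sequence ID -> 0-based start of the block


def validate_blocks(blocks: List[Block]) -> bool:
    """Check that blocks cover the same sequences, are ordered and do not overlap."""
    if not blocks:
        return True
    seq_ids = set(blocks[0].starts)
    for block in blocks:
        if block.length <= 0 or set(block.starts) != seq_ids:
            return False
    for prev, nxt in zip(blocks, blocks[1:]):
        for sid in seq_ids:
            # The next block must begin at or after the end of the previous one.
            if nxt.starts[sid] < prev.starts[sid] + prev.length:
                return False
    return True


def unaligned_lengths(blocks: List[Block], seq_id: str) -> List[int]:
    """Lengths of the unaligned regions between consecutive blocks for one sequence."""
    return [
        nxt.starts[seq_id] - (prev.starts[seq_id] + prev.length)
        for prev, nxt in zip(blocks, blocks[1:])
    ]


# Two blocks over two made-up sequences; "seqA" has a 2-residue insertion
# between the blocks, "seqB" has none.
alignment = [
    Block(length=5, starts={"seqA": 0, "seqB": 2}),
    Block(length=4, starts={"seqA": 7, "seqB": 7}),
]
```

Because gaps are never placed inside a block, all length variation between sequences shows up as differing values from `unaligned_lengths`, which is the property the block-structure is meant to guarantee.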
Alignment models from both curated and imported sets are converted into position-specific scoring matrices, and the latter are assembled into search databases for use with RPS-BLAST (6