Starting with the development of the CDART resource (7), domain models in CDD have been clustered in order to deal with redundancy. CDD has been a redundant collection ever since its conception, as it contains sets of models imported from several sources with overlapping scope, has re-curated many models for major domain families, and has also inherited redundancy that is intrinsic to the imported collections.
A domain annotation resource may represent a set of homologous sequence fragments as two or more separate models, for various reasons. The molecular functions within that set may be quite diverse, for example, or a single model may be ineffective in database search applications when the sequence fragments are too dissimilar. Two or more redundant models may match overlapping regions on a query sequence, and the annotation derived from those models may be in conflict. In many such cases, CDD will now provide annotation with the name and description of a conserved domain superfamily, which is defined as a set of evolutionarily related single-domain models.
Before domain models can be clustered into superfamilies, a subset of models must be flagged as multi-domain models and exempted from clustering. Multi-domain models are defined as those whose footprints overlap with two or more sequential single-domain footprints, so that they might merge the corresponding single domains into a single cluster. With multi-domain models excluded, single domain models are subjected to single-linkage clustering, where two models are considered related if they annotate a set of protein sequences with diverse taxonomic origins in significantly overlapping intervals. The RPS-BLAST E-value threshold for clustering is set to 1E-05, and sequences must be from three or more diverse taxonomy nodes. The resulting single-domain clusters contain a mixture of models from various sources.
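The clustering step can be illustrated with a minimal Python sketch. It is not the CDD production code: the input format (a list of `(model_a, model_b, evalue, n_taxa)` tuples), the function name, and the model names in the usage note are hypothetical, and the set of multi-domain models is assumed to have been determined beforehand. Only the two thresholds are taken from the text, and treating the E-value cutoff as inclusive is an assumption.

```python
from collections import defaultdict

E_VALUE_THRESHOLD = 1e-5   # RPS-BLAST E-value cutoff stated in the text
MIN_TAXONOMY_NODES = 3     # "three or more diverse taxonomy nodes"


def single_linkage_clusters(models, links, multi_domain):
    """Single-linkage clustering of single-domain models via union-find.

    links: iterable of (model_a, model_b, evalue, n_taxa) tuples
           (hypothetical summary of overlapping annotation).
    multi_domain: set of models flagged as multi-domain (excluded).
    """
    singles = [m for m in models if m not in multi_domain]
    parent = {m: m for m in singles}

    def find(m):
        # Follow parent pointers to the cluster root, halving paths on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b, evalue, n_taxa in links:
        if a not in parent or b not in parent:
            continue  # a link touching a multi-domain model never merges clusters
        if evalue <= E_VALUE_THRESHOLD and n_taxa >= MIN_TAXONOMY_NODES:
            parent[find(a)] = find(b)  # one qualifying link is enough to merge

    clusters = defaultdict(set)
    for m in singles:
        clusters[find(m)].add(m)
    return sorted(sorted(c) for c in clusters.values())
```

Because the linkage is single, two models end up in the same cluster as soon as a chain of qualifying pairwise links connects them, even if they share no direct link. This is also why multi-domain models have to be removed first: a single model spanning two domains would chain otherwise unrelated clusters together.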
The cluster names and descriptions are generated automatically, by picking a representative model and copying its name and description. If the cluster contains an NCBI-curated domain or domain hierarchy, the model or the hierarchy's parent model is selected as the superfamily representative. If the cluster contains more than one NCBI-curated hierarchy, the hierarchy with the highest coverage of a nonredundant set of proteins is selected. If the cluster contains no NCBI-curated model, a model imported from the Pfam collection is selected (the model with highest coverage, if more than one). If the cluster does not contain a model imported from Pfam either, a model imported from the SMART collection is selected, and so on.
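The selection rules above can be sketched as a priority lookup. The `Model` record, its field names, and the source labels are hypothetical. The sketch assumes that a hierarchy's coverage is stored on its parent model and that the parent is a member of the cluster. The text does not specify the source order beyond SMART, so the final fallback (highest coverage among the remaining models) is a placeholder rather than the documented rule.

```python
from dataclasses import dataclass
from typing import Optional

# Order given in the text; what follows SMART is not specified there.
SOURCE_PRIORITY = ["NCBI", "Pfam", "SMART"]


@dataclass
class Model:
    accession: str
    source: str                   # hypothetical label: "NCBI", "Pfam", "SMART", ...
    coverage: float               # coverage of a nonredundant protein set
    parent: Optional[str] = None  # parent accession within an NCBI-curated hierarchy


def pick_representative(cluster):
    """Return the model whose name and description label the superfamily."""
    for source in SOURCE_PRIORITY:
        candidates = [m for m in cluster if m.source == source]
        if source == "NCBI":
            # Only stand-alone curated models or hierarchy parents qualify.
            candidates = [m for m in candidates if m.parent is None]
        if candidates:
            return max(candidates, key=lambda m: m.coverage)
    # Assumed fallback for sources the text leaves unspecified.
    return max(cluster, key=lambda m: m.coverage)
```

For a cluster holding two curated hierarchy parents and a Pfam model with higher coverage than either, this returns the curated parent with the higher coverage, since source priority is applied before coverage.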
Not all superfamily cluster names are generated computationally. A small batch of superfamilies has been reviewed, and in some cases names and descriptions have been modified by CDD curators. Larger superfamily clusters have also been reviewed for putative errors in the clustering procedure, and some of the clusters have been split up accordingly. To this end, we maintain a black-list for clustering, which specifies pairs of models that should not end up in the same cluster but have been observed to co-cluster occasionally. This can happen when one type of domain occurs as an insert in another type of domain, and the RPS-BLAST alignment tool over-extends N- or C-terminal partial alignments into the inserted domain for several sequences in the Entrez protein database. The type II fibronectin domains (cd00062) and zinc metallopeptidases (cd00203) are one example.
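How the black-list is applied during clustering is not described here. The sketch below covers only the detection side: given a set of clusters, it reports every black-listed pair that ended up in the same cluster, which is the information a curator would need before splitting it. The function name and data layout are hypothetical.

```python
from itertools import combinations


def blacklist_violations(clusters, blacklist):
    """Report black-listed model pairs that co-cluster.

    clusters:  list of sets of model accessions.
    blacklist: iterable of (accession_a, accession_b) pairs, order-insensitive.
    Returns a list of (cluster_index, accession_a, accession_b) tuples.
    """
    forbidden = {frozenset(pair) for pair in blacklist}
    violations = []
    for index, cluster in enumerate(clusters):
        for a, b in combinations(sorted(cluster), 2):
            if frozenset((a, b)) in forbidden:
                violations.append((index, a, b))
    return violations
```

With the pair named in the text, `blacklist_violations([{"cd00062", "cd00203"}], [("cd00203", "cd00062")])` flags cluster 0 regardless of the order in which the pair was listed.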
Curation of superfamily descriptions and content will be an ongoing activity in CDD curation. Representing groups of sequences or sequence fragments related by common ancestry as superfamilies is a frequent practice in many protein classification resources, of course, and the results of such classification efforts will not always coincide. In general, superfamily clusters in CDD will coincide with ‘clans’ in the Pfam resource and with superfamilies in the SCOP classification (8), for example, although we have not attempted to quantify the agreement.
Superfamilies are recorded as explicit CD models, which do not contain a multiple sequence alignment but rather a list of single-domain accessions. Most of the conserved domain superfamilies contain only a single model, and often represent domains found in relatively narrow taxonomic lineages. Only superfamilies that represent two or more individual models have been indexed in the Entrez/CDD database. The accessions assigned to superfamily models start with ‘cl’, so that they can be distinguished from regular single-domain models, whose accessions start with ‘cd’, ‘pfam’, ‘smart’, ‘COG’, ‘PRK’, etc., depending on the source database. With each CDD release, single domains will be clustered anew, and the results of that clustering will vary as they depend on the set of single-domain models tracked by the database as well as on the content of NCBI's Entrez/protein database. Superfamily accessions will be preserved if the composition of the associated cluster does not change by 50% or more.
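The accession conventions and the preservation rule can be sketched as follows. The text does not say how a change in cluster composition is measured, so the measure used here (size of the symmetric difference divided by the size of the union of old and new members) is an assumption, as are the function names.

```python
def is_superfamily_accession(accession):
    """Superfamily models carry the 'cl' prefix; single-domain models do not."""
    return accession.startswith("cl")


def composition_change(old_members, new_members):
    """Fraction of members gained or lost between releases (assumed measure)."""
    old, new = set(old_members), set(new_members)
    union = old | new
    if not union:
        return 0.0
    return len(old ^ new) / len(union)


def keeps_accession(old_members, new_members, threshold=0.5):
    """Preserve the accession unless composition changes by 50% or more."""
    return composition_change(old_members, new_members) < threshold
```

Under this measure, a four-member cluster that swaps one member for a new one changes by 2/5 = 40% and keeps its accession, while a two-member cluster that doubles in size changes by exactly 50% and does not.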