The annotation of conserved domain footprints on protein sequences often serves as the first step toward characterizing protein function in silico. Protein domains may be viewed as units in the molecular evolution of proteins and can be organized into an evolutionary classification. The set of protein domains characterized so far appears to describe no more than a few thousand superfamilies, where members of each superfamily are related to each other by common descent. NCBI's conserved domain database (CDD) attempts to collate that set and to organize related domain models in a hierarchical fashion, meant to reflect major ancient gene duplication events and subsequent functional diversification.
Computational annotation of protein function is generally obtained via sequence similarity: once a close neighbor with known function has been identified, its annotation is copied to the sequence with unknown function. This strategy may work very well in functionally homogeneous families and when applied only to very close neighbors or suspected orthologs, but it often fails when domain or protein families are sufficiently diverse and when no close neighbors with known function are available.
To this end, the CDD (1) provides a strategy toward a more accurate assessment of such neighbor relationships, similar to approaches termed ‘phylogenomic inference’ (2). CDD acknowledges that protein domain families may be very diverse and that they may contain sets of related subfamilies. Of these, only a few may have been characterized experimentally, and within this set function may have diverged considerably. While it may be possible, and certainly efficient, to represent such a set of subfamilies with just a single family model, that model could only provide very generic annotation. In CDD curation, we attempt to detect evidence for duplication and functional divergence in domain families by means of phylogenetic analysis. We record the resulting subfamily structure as a set of explicit models, but limit the analysis to ancient duplication events, several hundred million years in the past, as judged by the taxonomic distribution of protein sequences with particular domain subfamily footprints.
CDD provides a search tool employing reverse position-specific BLAST (RPS-BLAST), where query sequences are compared to databases of position-specific score matrices (PSSMs), and E-values are obtained in much the same way as in the widely used PSI-BLAST application (3). When CDD is scanned with protein query sequences, a region on a query may pick up more than one overlapping footprint from a set of related models. One of those models provides the best score or lowest E-value, but that alone may not be sufficient to indicate that the query sequence is a bona fide member of the corresponding subfamily. Since the CDD collection also contains imported models, which have not been curated at NCBI, search results may present a mixture of curated models (accessions starting with ‘cd..’) and uncurated models (accessions starting with ‘pfam’, ‘smart’ or ‘COG’). By default, overlapping domain hits are sorted by E-value, but curated models are listed first, if their E-values exceed a secondary significance threshold of 1e-05. Default displays are presented in a concise fashion, where domain hits that overlap with the top-ranked domain hits are hidden.
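The default ordering and the concise display described above can be illustrated with a minimal Python sketch. It is not the CDD implementation. The `DomainHit` record, the function names and the accessions used below are hypothetical placeholders, and two simplifying assumptions are made: a curated model counts as meeting the secondary threshold when its E-value is at or below 1e-05, and ranking is applied to all hits on a query at once rather than separately within each group of overlapping hits.

```python
from dataclasses import dataclass

CURATED_PREFIX = "cd"        # curated NCBI models, as stated in the text
SECONDARY_THRESHOLD = 1e-05  # secondary significance threshold from the text


@dataclass
class DomainHit:
    """Hypothetical record for one domain footprint on a query sequence."""
    accession: str
    start: int     # first query residue covered (inclusive)
    end: int       # last query residue covered (inclusive)
    evalue: float


def is_curated(hit):
    """Curated models are recognized by their accession prefix."""
    return hit.accession.startswith(CURATED_PREFIX)


def overlaps(a, b):
    """True if the two footprints share at least one query residue."""
    return a.start <= b.end and b.start <= a.end


def rank_hits(hits, threshold=SECONDARY_THRESHOLD):
    """Sort by E-value, but list sufficiently significant curated models first.

    Assumption: 'meeting the threshold' means evalue <= threshold.
    """
    def key(hit):
        promoted = is_curated(hit) and hit.evalue <= threshold
        return (0 if promoted else 1, hit.evalue)
    return sorted(hits, key=key)


def concise_view(hits, threshold=SECONDARY_THRESHOLD):
    """Keep each top-ranked hit and hide lower-ranked hits that overlap it."""
    shown = []
    for hit in rank_hits(hits, threshold):
        if not any(overlaps(hit, kept) for kept in shown):
            shown.append(hit)
    return shown


# Made-up hits for illustration only; the accessions are placeholders.
example_hits = [
    DomainHit("pfam00001", 10, 250, 1e-40),
    DomainHit("cd00001", 12, 255, 1e-35),
    DomainHit("smart00001", 15, 248, 1e-30),
    DomainHit("COG0001", 300, 400, 1e-08),
    DomainHit("cd00002", 305, 390, 1e-03),
]

for hit in concise_view(example_hits):
    print(hit.accession, hit.start, hit.end, hit.evalue)
```

In this example the curated model `cd00001` is listed ahead of `pfam00001` despite its higher E-value, because it meets the secondary threshold, whereas `cd00002` does not meet it and is hidden behind the overlapping `COG0001` hit.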
We have started to distribute CDTree, a helper application for the web browser. CDTree allows users to examine the results of simple phylogenetic analysis on the sequences from a curated domain hierarchy, and view their query sequence in the context of such a phylogenetic sequence tree.