The annotation of protein sequences with the location of domains is a common practice in the analysis of sequence data. The identification of a conserved domain footprint may be the only clue towards the cellular or molecular function of a protein, as it indicates local or partial similarity to other proteins, some of which may have been characterized experimentally. Furthermore, the study of domain architectures in multi-domain protein families often reveals their evolutionary history and is a common tool in sequence classification. To this end, we released the first version of the Conserved Domain Database (CDD) to the public in August 2000, more than 10 years ago, as a collection of 2738 multiple sequence alignment models, based on the content of the Pfam and SMART databases, and derived database search tools to support the rapid computation of sequence annotation. Since then, CDD has grown significantly both in volume and in scope. CDD now imports domain and protein family alignment models from Pfam (1) (currently mirroring version 24), SMART (2), COG (3), TIGRFAM (4) and the NCBI Protein Clusters database (5). It also contains a set of models curated by NCBI, many of which are organized into explicit hierarchies of homologous domain families that reflect functional divergence and divergent evolutionary processes. In addition, NCBI-curated domain models use 3D structure information explicitly to define domain boundaries, guide multiple sequence alignment and provide insights into the relationship between sequence conservation and molecular function.
CDD is updated several times a year, with occasional updates initiated by new versions of imported data sets, and with most incremental updates reflecting additions to the NCBI-curated set of models. The current version of CDD, v2.25, contains 37 632 alignment models, of which 6056 have been curated by NCBI. Various aspects of CDD have been highlighted in earlier manuscripts (6); here we give a brief summary of major functionality pertaining to sequence annotation, some of which has been presented in greater detail in previous descriptions of CDD, and we introduce a novel tool, Batch CD-Search, that facilitates computation of annotation for large sets of protein queries.
SPECIFIC HITS, DOMAIN SUPERFAMILIES AND MULTI-DOMAIN MODELS
CDD is one of the many databases in NCBI’s Entrez query and retrieval system and can be searched, using the common Entrez interface, for keywords and terms indexed from names, titles and descriptions of the records. CDD is cross-linked with other databases such as Entrez Protein, PubMed and NCBI BioSystems, to name a few. However, most users of CDD encounter CDD records by following Conserved Domains links from Entrez Protein sequence records, or while executing protein BLAST and PSI-BLAST searches via NCBI’s web BLAST interface. The conserved domain model database can be scanned quickly with protein queries, and results showing domain annotation may already be available while BLAST continues to scan the significantly larger non-redundant protein database. The application that visualizes live or pre-computed search results has been termed CD-Search (7), and the underlying algorithm is Reverse Position-Specific BLAST (RPS-BLAST), a variation of the commonly used PSI-BLAST method (8). Figure 1 illustrates the layout of a page reporting conserved domain annotation. Live searches against the CDD will reproduce pre-computed search results unless the search parameters are modified from their default settings. Detailed descriptions of search result pages have been given previously (6). A concise domain annotation, as shown by default, will provide the locations of top-scoring domain footprints plus the locations of functional sites, which can be derived from the domain footprints. The locations are shown graphically, and detailed alignments are available as an option. Both CDD and CD-Search come with up-to-date help documentation, thoroughly revised in the past year, that explains formatting and interpretation of output in detail. Domain footprints are shown as either:
- Specific hits, indicating high confidence in the annotation with an NCBI-curated model, where the alignment score between query and model exceeds a model-specific threshold (10).
- Superfamily annotation, where each superfamily is a collection of models representing homologous protein fragments, often quite redundant.
- Annotation by multi-domain models, which have been excluded from the superfamily clustering as they tend to group non-homologous fragments into the same cluster.
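The three categories above can be read as a simple decision rule. The following Python sketch is illustrative only and is not the CD-Search implementation: the `Hit` record, its field names and the example values are hypothetical. The only rules taken from the text are that a specific hit requires an NCBI-curated model whose model-specific threshold is exceeded, and that multi-domain models are kept apart from the superfamily clustering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one query-to-model alignment."""
    model: str                                   # model accession (illustrative)
    score: float                                 # alignment score
    ncbi_curated: bool                           # True for NCBI-curated models
    multi_domain: bool                           # True for multi-domain models
    specific_threshold: Optional[float] = None   # model-specific threshold, if any


def classify(hit: Hit) -> str:
    """Assign a hit to one of the three footprint categories."""
    # Multi-domain models are excluded from superfamily clustering,
    # so they are reported in their own category.
    if hit.multi_domain:
        return "multi-domain"
    # A specific hit needs a curated model and a score that exceeds
    # that model's own threshold.
    if (
        hit.ncbi_curated
        and hit.specific_threshold is not None
        and hit.score > hit.specific_threshold
    ):
        return "specific"
    # Everything else is reported at the superfamily level.
    return "superfamily"
```

For example, under these assumptions `classify(Hit("model_A", 120.0, True, False, 100.0))` yields `"specific"`, whereas the same hit with a score of 80.0 falls back to `"superfamily"`.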
Figure 1. Conserved domain annotation on a well-characterized protein sequence. Shown here is the default concise view generated by the CD-Search tool, using pre-calculated alignment information. The view is divided into two panels: a graphical summary and a table [...]
By default, CD-Search displays only the highest-ranking domain superfamily annotation for a given region on the query (and there can be no more than one specific hit, if any). The default display also shows only the highest-ranked multi-domain model for a given query region, and only if that alignment is nearly complete with respect to the model. An alternative view shows the full alignment results, listing the individual models from all source databases that could be aligned to the query with significant scores. Often, the full alignment results are quite redundant.
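The reduction from the full results to the concise view can be sketched as a greedy filter. This is a minimal illustration under stated assumptions, not the algorithm used by CD-Search: the `Footprint` record is hypothetical, any overlap on the query is treated as "the same region", ranking is by score alone, and the 0.8 coverage cut-off standing in for "nearly complete" is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Footprint:
    """Hypothetical record for one domain footprint on the query."""
    model: str
    kind: str              # "superfamily" or "multi-domain"
    start: int             # query coordinates, 1-based, inclusive
    end: int
    score: float
    model_coverage: float  # fraction of the model covered by the alignment


def overlaps(a: Footprint, b: Footprint) -> bool:
    """True if the two footprints share at least one query position."""
    return a.start <= b.end and b.start <= a.end


def concise_view(hits: List[Footprint],
                 min_multi_coverage: float = 0.8) -> List[Footprint]:
    """Keep the top-scoring footprint of each kind per query region."""
    kept: List[Footprint] = []
    # Visit hits from best to worst so the first hit seen in a region wins.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        # Multi-domain models are shown only if nearly complete
        # (assumed cut-off; the real criterion is not given in the text).
        if hit.kind == "multi-domain" and hit.model_coverage < min_multi_coverage:
            continue
        # Skip hits that overlap a better hit of the same kind.
        if any(k.kind == hit.kind and overlaps(k, hit) for k in kept):
            continue
        kept.append(hit)
    # Report in query order.
    return sorted(kept, key=lambda h: (h.start, h.kind))
```

Dropping the overlap and coverage filters would correspond to the full view, in which every significant alignment is listed and redundancy is expected.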