We have set up a protocol to import and present alignment data from external sources as well as from in-house collaborators. We attempt to identify the sequence fragments used by the alignments’ authors, so that we can link to full-length sequences in Entrez (10). If accession codes supplied by the source databases cannot be identified, BLAST searches are run for the fragments in order to find identical or very similar sequences in NCBI’s databases, requiring at least 90% sequence identity across the aligned fragment. Particular attention is paid to close matches with structure-linked sequences, and we substitute alignment rows with such sequences when possible. For substitution, we require at least 75% sequence identity in the aligned region and allow no more than 5% of that region to be lost to insertions and deletions.
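The substitution criteria above can be illustrated with a minimal sketch. The function names, the gap character `-`, and the exact definitions are assumptions made for illustration: identity is computed over columns where both rows carry a residue, and the "lost" fraction is taken as the share of columns in which exactly one row has a gap. The text does not specify how these quantities are actually measured.

```python
def percent_identity(row_a, row_b):
    """Percent identity over columns where both rows have a residue (assumed definition)."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def can_substitute(fragment_row, candidate_row, min_identity=75.0, max_lost=5.0):
    """Check the 75% identity / 5% loss thresholds for a pairwise-aligned candidate.

    Both arguments are hypothetical inputs: two equal-length aligned rows
    covering the aligned region, with '-' marking gaps.
    """
    if len(fragment_row) != len(candidate_row):
        raise ValueError("aligned rows must have the same length")
    if not fragment_row:
        return False
    # Columns where exactly one row has a gap count as lost (assumption).
    gapped = sum((x == "-") != (y == "-") for x, y in zip(fragment_row, candidate_row))
    lost = 100.0 * gapped / len(fragment_row)
    return percent_identity(fragment_row, candidate_row) >= min_identity and lost <= max_lost
```

Under these assumptions, a candidate that differs from a 20-residue fragment by a single gap still qualifies, while one with two gaps does not.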
Upon import, multiple alignments are deconstructed into sets of pairwise alignments. A representative common to all the pairwise alignments is chosen as the sequence with the fewest deletions relative to other sequences, so that the loss of alignment information is minimal. In fact, most of the alignments imported from Pfam or SMART have a very pronounced block structure, which reduces the risk of losing information in this step. Structure-linked sequences are picked as representatives whenever available.
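The deconstruction step might look roughly like the following sketch. It assumes that the alignment is held as a mapping from sequence name to gapped row, that "fewest deletions" can be approximated by the number of gap characters in a row, and that columns where the representative has a gap are dropped from each pairwise alignment. `pick_representative` and `to_pairwise` are hypothetical names, not part of any CDD software.

```python
def pick_representative(rows, structure_linked=frozenset()):
    """Choose the row with the fewest gaps, preferring structure-linked rows.

    rows: dict mapping sequence name -> gapped alignment row (equal lengths).
    structure_linked: names of rows linked to a 3D structure (may be empty).
    """
    candidates = [name for name in rows if name in structure_linked] or list(rows)
    return min(candidates, key=lambda name: rows[name].count("-"))


def to_pairwise(rows, representative):
    """Split a multiple alignment into representative-vs-other pairwise alignments.

    Columns in which the representative has a gap are discarded, which is why
    a representative with few deletions loses little alignment information.
    """
    keep = [i for i, c in enumerate(rows[representative]) if c != "-"]
    rep_row = "".join(rows[representative][i] for i in keep)
    return {
        name: (rep_row, "".join(row[i] for i in keep))
        for name, row in rows.items()
        if name != representative
    }
```

For example, with rows `{"a": "AC-DE", "b": "ACGDE", "c": "A--DE"}` the sketch picks `b` by default; if only `a` and `c` are structure-linked, it picks `a` and the `G` column of `b` is lost from the pairwise alignments.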
Imported domain alignments can be retrieved by accession or searched by keyword at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The server generates summary pages from which several alignment visualization styles are available. If three-dimensional structure information is available, Cn3D 3.0, a molecular structure viewer distributed by NCBI, can be used to display integrated views of the domain’s multiple alignment and its conservation patterns, as well as the three-dimensional structure of a representative member. This display allows interactive highlighting and feature annotation. Figure 1 shows an example of how this capability can be used to illustrate how genotypes may be linked to disease.
Figure 1. Cn3D 3.0 view showing a subset of aligned sequences from the CD sm (ATPase domain of DNA mismatch repair MUTS family). Residues corresponding to the P-loop motif around the ADP/Mg2+ binding site have been annotated in Cn3D and are highlighted.
Domain alignments in CDD are used to calculate position-specific score matrices (PSSMs) for database searching. For the representation of PSSM models, a consensus sequence is calculated for each conserved domain, reporting the most frequent residues in aligned columns. Although visible in alignment displays, the consensus sequence is not used directly in PSSM calculations. However, it determines the length of the PSSMs, as only columns with >50% aligned states are included in the consensus and PSSM calculation.
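A minimal sketch of such a consensus calculation is given below. It assumes rows of equal length with `-` as the gap character, and it breaks ties between equally frequent residues by order of first appearance; neither detail is specified in the text, and `consensus` is a hypothetical helper name.

```python
from collections import Counter


def consensus(rows, min_occupancy=0.5):
    """Most frequent residue per column, keeping only columns in which
    more than min_occupancy of the rows have an aligned residue."""
    if not rows:
        return ""
    out = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        # Strictly greater than the threshold, matching the ">50%" rule.
        if len(residues) / len(column) > min_occupancy:
            out.append(Counter(residues).most_common(1)[0][0])
    return "".join(out)
```

The length of the returned string corresponds to the number of columns that would enter the PSSM under this rule, so a column occupied in exactly half of the rows is excluded.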
The search engine making use of CDD’s collection of PSSM models is reverse-position-specific BLAST (RPS-BLAST), a variant of PSI-BLAST. It inverts the role of query and subject, comparing a single sequence against a database of PSSM models instead of searching a database of sequences with a single PSSM model. A web-based interface to RPS-BLAST, CD-Search, is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi.
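The inversion of query and subject can be illustrated with a toy example. This is not the RPS-BLAST algorithm, which relies on BLAST heuristics, gapped alignment and statistical significance estimates; the sketch only shows a single sequence being compared against a collection of PSSMs, using an exhaustive ungapped window scan. The PSSM representation (a list of per-column dictionaries from residue to score), the default score for unlisted residues, and all function names are assumptions made for illustration.

```python
def best_ungapped_score(query, pssm, default=-1):
    """Slide a PSSM along the query; return (best score, start offset) or None.

    pssm: list of dicts, one per model column, mapping residue -> score.
    Residues absent from a column receive the default score (assumption).
    """
    width = len(pssm)
    if width == 0 or len(query) < width:
        return None
    best = None
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def search_models(query, models):
    """Compare one sequence against a 'database' of PSSMs (name -> pssm).

    Returns (model name, score, start) tuples sorted by descending score.
    """
    hits = []
    for name, pssm in models.items():
        hit = best_ungapped_score(query, pssm)
        if hit is not None:
            hits.append((name, hit[0], hit[1]))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

The outer loop runs over models rather than over database sequences, which is the role reversal described above.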
The ASN.1 data specification for domain multiple alignment data in CDD is available through the NCBI toolbox distribution at ftp://ncbi.nlm.nih.gov/toolbox, together with C program code that can be used to read, write and compute with CDD data in the context of the NCBI toolkit. The content of the CDD can be downloaded from NCBI’s FTP site in machine-readable ASN.1 format, by following instructions on the CDD home page.