Here, we have shown that profile–profile alignment with well-structured alignment constraints can achieve high alignment accuracy and work well in detecting homologous relationships between conserved core regions of domain families. The core constraint exploits relationships between profile columns by prohibiting insertions or deletions within blocks, rather than pursuing improvements through refinement of the column scoring function. Our proposed method is a simple interpretation of a framework in which gap penalties vary according to local conservation, requiring only two different gap penalties. The core constraint may be incorporated into other alignment algorithms as well.
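The two-penalty idea can be illustrated with a minimal sketch. This is not the CORAL implementation: the function name `junction_gap_penalties`, the half-open `(start, end)` block representation, and the numerical value of the between-block penalty are assumptions introduced here for illustration only. The sketch assigns a prohibitive penalty to every junction between two columns of the same block and a single finite penalty everywhere else.

```python
import math


def junction_gap_penalties(length, blocks, between=1.0, within=math.inf):
    """Return one gap penalty per junction between adjacent profile columns.

    length  -- number of columns in the profile
    blocks  -- list of half-open (start, end) column ranges (hypothetical format)
    between -- finite penalty outside blocks (illustrative value, an assumption)
    within  -- penalty inside a block; infinite here, so gaps are prohibited
    """
    # Map each column to the index of the block that contains it, if any.
    block_of = [None] * length
    for idx, (start, end) in enumerate(blocks):
        for col in range(start, end):
            block_of[col] = idx

    penalties = []
    for col in range(length - 1):
        # A junction lies inside a block only if both neighbouring columns
        # belong to the same block.
        same_block = block_of[col] is not None and block_of[col] == block_of[col + 1]
        penalties.append(within if same_block else between)
    return penalties
```

A dynamic programming alignment that reads its gap cost from such a list could never open a gap inside a block, whereas gaps in loops and at block boundaries would remain available at the single finite cost.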
We benchmarked CORAL on core regions from NCBI-curated domains in CDD. Blocks in curated domains reflect sequence and structural conservation and approximate the structural core of the family. However, curators may define blocks to be longer or shorter than in structure alignments, and merge, split or delete the blocks suggested by structure alignments. They may also introduce additional blocks to record conserved features and sites outside the structural core, such as binding sites and motifs.
E-values identify 70% of all domain pairs from the same hierarchy with E-value < 0.05, compared with 3.0% of domain pairs from different superfamilies. Ranking scores from the same family, as in the homology recognition test, achieves even higher performance. In general, the CDD superfamily classification used to define homologs is comparable in specificity to SCOP superfamilies, the basis for remote homology in previous benchmark studies (Marchler-Bauer et al.). Nevertheless, that curated domains in CDD are easier to classify is unsurprising, because many previous studies aligned noisier profiles constructed by PSI-BLAST, and the hierarchical organization of CDD families suggests that many domains have similar conserved cores.
Constructing high-quality alignments between well-defined core regions, in contrast, benefits tremendously from the core constraint. CORAL aligns more families with a high balanced score, produces better alignments with respect to the balanced score than COMPASS or HHalign across all similarity ranges, and returns a higher developer's score for almost all groups of data. Perhaps even more importantly, by respecting block boundaries, it produces alignments that may be easier to revise. Automated alignments of sequences or profiles with low similarity often require manual correction to produce optimal results. Reducing error to a small number of block shifts simplifies manual analysis. Although the core constraint reduces the space of possible alignment solutions, it does not necessarily constrain the alignment to only one good solution. Our results demonstrate that weak sequence similarity between corresponding core regions increases errors in all methods. Additionally, even in the more constrained setting of global alignment, differences in profile and block lengths permit more than one possible alignment between many blocks.
The clear shortcoming of the core constraint is that, at some level of divergence, core regions cannot be aligned correctly without insertions or deletions; methods without the core constraint are therefore better suited to remote homolog recognition and alignment. One way to ameliorate shift errors is to split long blocks into shorter units, either randomly or by inspecting the block structure or preliminary alignments of core regions. The curated domain models already contain breaks within blocks where the sequences naturally split. In unreported experiments, we aligned the curated domains using this alternative block definition, with similar though slightly worse overall performance. Further development of this algorithm will allow for cases where additional blocks have been inserted into a sub-family model relative to its parent.
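Splitting blocks at known break positions can be sketched as follows. The function name `split_blocks`, the half-open `(start, end)` block ranges, and the representation of breaks as column indices at which a new sub-block begins are all assumptions made for illustration; the source does not specify how the breaks in the curated models are stored.

```python
def split_blocks(blocks, breaks):
    """Split each block at every break position that falls strictly inside it.

    blocks -- list of half-open (start, end) column ranges (hypothetical format)
    breaks -- iterable of column indices where a new sub-block should start
    """
    unique_breaks = set(breaks)
    result = []
    for start, end in blocks:
        # Only breaks strictly inside the block create a new sub-block;
        # breaks on a block boundary or outside the block are ignored.
        cuts = sorted(b for b in unique_breaks if start < b < end)
        edges = [start, *cuts, end]
        result.extend(zip(edges[:-1], edges[1:]))
    return result
```

For example, `split_blocks([(0, 10), (12, 20)], [4, 12, 15])` yields four shorter blocks, so that an alignment under the core constraint may place a gap at columns 4 and 15 as well as in the loop between the two original blocks.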
CORAL will be made available to the public as an alignment tool bundled into a future release of the NCBI Cn3D/CDTree software. This user-friendly implementation will provide fast and accurate alignment of core regions, along with access to protein family alignments from CDD. Although we tested only alignments between pre-computed protein family models, core regions may be inferred from the continuous regions of any protein family alignment. However, effective use of CORAL requires high overlap between the conserved regions of two families, for example, in the case of a common structural core, and additional processing may be needed to identify putative conserved core regions. The core constraint may also be incorporated into profile alignment algorithms with more sophisticated scoring methods to improve on both CORAL and the original method for aligning conserved cores.
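One simple way to infer candidate core regions from an arbitrary family alignment is sketched below. It assumes that a "continuous region" means a maximal run of columns in which no sequence has a gap; that interpretation, the function name `infer_blocks`, the `min_len` filter and the `-` gap character are assumptions of this sketch and not a description of how CDD blocks are defined.

```python
def infer_blocks(alignment, min_len=1, gap="-"):
    """Return maximal runs of gap-free columns as half-open (start, end) ranges.

    alignment -- list of equal-length aligned sequences (strings)
    min_len   -- discard runs shorter than this (hypothetical parameter)
    gap       -- character that marks a gap in the alignment
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    start = None
    # Iterate one position past the last column so that a run reaching the
    # end of the alignment is closed as well.
    for col in range(width + 1):
        gap_free = col < width and all(row[col] != gap for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks
```

Blocks obtained in this way would still need to be checked for overlap with the conserved regions of the second family before alignment under the core constraint is attempted.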