Advances in DNA sequencing will bring significant growth in the number and diversity of available genomes over the next few years. More than 100 animals and 50 plants have been sequenced to various degrees of completion, and more are slated to be sequenced (http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi). With costs falling sharply and sequencing technologies becoming more accessible, it will soon be feasible for individual investigators to sequence their species of interest. To be useful to researchers, these genomes will need to be analyzed to determine genes and other functional elements. However, while new genome sequencing projects are progressing at a fast pace, the generation of expressed DNA (cDNA, EST, mRNA) and protein sequences needed to annotate them has been slow (1). Moreover, sequencing of full-length mRNA sequences, which are critical for annotation quality, has focused on a handful of high-priority species (2–5). An economical and increasingly popular approach is to generate mixed collections of resources from multiple closely related organisms and share them across several projects (http://www.fagaceae.org). Mapping gene information already available in databases provides an efficient means to annotate the new genomes, but it requires fast and accurate alignment tools that can be readily used, with little or no human intervention, for a variety of comparisons.
Tools for aligning cDNA and genomic sequences have typically been designed for high sequence similarity and either lose power in comparisons across species or are too slow to handle large annotation tasks. Indeed, programs such as sim4 (6), Spidey (7), BLAT (8), MgAlignIt (9), ESTmapper (10) and GMAP (11) use heuristic alignment methods to align sequences of the same species efficiently and with high accuracy, but their performance drops significantly as the sequence similarity decreases. Only a few of these programs have been adapted to cross-species alignment. For instance, BLAT translates both the query and the database into protein sequences before matching and GMAP uses an adjusted parameter set, but the quality of the output is below what is required for automated annotation. Other tools, such as GeneSeqer (12), EST_GENOME (13) and EXALIN (14), employ probabilistic or exact dynamic programming methods and are capable of aligning sequences across species, but lack the speed required for whole-genome annotation and are still limited in the range of evolutionary distances they address.
Computationally, aligning a cDNA with a genomic sequence containing that gene entails partitioning the cDNA into exons and the genomic sequence into exons and introns, such that exons are similar between the two sequences except for a few differences caused by sequencing errors and polymorphisms. Additionally, introns must start and end with specific splice signals (GT–AG is the most common). In comparisons between species, evolutionary mutation and gap patterns compound the differences, increasing the difficulty of alignment. Thus, a cross-species spliced alignment tool must be able to handle sequence differences arising from a variety of sources and to correctly identify the splice junctions, and it must do so efficiently and without user intervention to allow application to large automated genome annotation projects. By far the main challenge that confronts existing cross-species alignment tools is their low sensitivity, leading in turn to incomplete gene models and poor splice junction accuracy. Further challenges arise from differences in gene models caused by evolutionary block insertion and deletion events.
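As a minimal sketch of the partitioning constraint described above, the following Python fragment checks whether a proposed set of genomic exon intervals reproduces a cDNA and whether each implied intron carries the GT–AG signal. The sequences, the function names and the exact-match requirement are assumptions made for illustration only; a real spliced aligner tolerates mismatches and gaps within exons and scores non-canonical splice signals rather than rejecting them.

```python
# Toy sequences invented for illustration (not real genes).
EXON1 = "ATGGCC"
INTRON = "GTAAGTTTTCAG"  # begins with GT and ends with AG
EXON2 = "GATTGA"
GENOMIC = EXON1 + INTRON + EXON2
CDNA = EXON1 + EXON2


def splice(genomic, exons):
    """Concatenate the genomic exon intervals (0-based, half-open)."""
    return "".join(genomic[start:end] for start, end in exons)


def intron_signals(genomic, exons):
    """Return the (donor, acceptor) dinucleotides of each intron
    lying between consecutive exons."""
    signals = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genomic[prev_end:next_start]
        signals.append((intron[:2], intron[-2:]))
    return signals


def is_valid_partition(cdna, genomic, exons):
    """Simplified check: the exons must reproduce the cDNA exactly
    and every intron must be GT-AG."""
    exons_match = splice(genomic, exons) == cdna
    canonical = all(sig == ("GT", "AG") for sig in intron_signals(genomic, exons))
    return exons_match and canonical


good = [(0, 6), (18, 24)]     # correct exon boundaries
shifted = [(0, 5), (17, 24)]  # boundaries shifted by one base

print(intron_signals(GENOMIC, good), is_valid_partition(CDNA, GENOMIC, good))
print(intron_signals(GENOMIC, shifted), is_valid_partition(CDNA, GENOMIC, shifted))
```

Shifting the exon boundaries by a single base breaks both conditions at once, which illustrates why splice junction accuracy and exon similarity have to be optimized jointly.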
Among the most important factors for program sensitivity is the match pattern used to identify exact or near-exact word matches between the sequences, called the seed. The traditional blast (15) seed requires an exact match of 11 contiguous positions (11111111111), and is called continuous. This seed was adopted by most alignment algorithms until its limitations were recently revealed (16). To improve sensitivity, spaced seeds allow mismatches at specified positions in the seed pattern. Judiciously chosen spaced seeds that take into account the characteristics of the alignment achieve significantly higher sensitivity than continuous seeds (17), and some have already been successfully implemented in whole-genome alignment programs such as PatternHunter (16) and blastz (19). Alignments of gene sequences have characteristics that differentiate them from genomic alignments, including higher order dependencies between positions (20), transition-transversion biases (21) and 3-periodicity due to their codon structure. We recently incorporated these features into new mathematical models and were able to design improved seeds for cross-species cDNA-to-genome alignment (22). An additional practical consideration for developing alignment tools, especially as the number of species-to-species comparisons increases, is their applicability range. Designing seeds for even one comparison is computationally expensive. An economical alternative is to identify a small number of program parameters that perform well on a large number of comparisons and thus can be seamlessly used without regard to the species compared. We recently characterized and identified such seeds, which we termed universal, for a large number of vertebrate comparisons (23), and have incorporated them into our program sim4cc.
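The difference between continuous and spaced seeds can be illustrated with a short Python sketch. The two toy sequences below differ at every third position, mimicking codon wobble; the seed patterns 1111 and 110110 (both of weight 4) and the function name are hypothetical choices made for this example, and are not the universal seeds used by sim4cc.

```python
def seed_hits(a, b, seed):
    """Return the offsets at which two equal-length, ungapped sequences
    agree at every position marked '1' in the seed pattern."""
    required = [j for j, c in enumerate(seed) if c == "1"]
    hits = []
    for i in range(len(a) - len(seed) + 1):
        if all(a[i + j] == b[i + j] for j in required):
            hits.append(i)
    return hits


# Toy coding-like sequences that differ at every third position.
a = "ATGGCTGATCGT"
b = "ATAGCCGACCGA"

continuous = "1111"  # four contiguous matching positions
spaced = "110110"    # same weight, third codon positions ignored

print(seed_hits(a, b, continuous))  # no offset has four matches in a row
print(seed_hits(a, b, spaced))      # hits at each codon-aligned offset
```

Both patterns require the same number of matching positions, yet only the spaced pattern finds the conserved region, because its unconstrained positions coincide with the periodic mismatches.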
Starting from the design principles above, we developed an algorithm and software tool, called sim4cc (sim4 for cross-species comparisons), for aligning cDNA and genomic sequences between species at various evolutionary distances. Sim4cc is built on the foundation of our earlier program sim4 (6), one of the earliest spliced-alignment tools, but incorporates significant changes to adapt it to cross-species comparisons, including universal spaced seeds designed for a wide range of species comparisons, more sophisticated splice site models and evolutionarily aware alignment algorithms. Like its predecessor, sim4cc is designed to align a cDNA with a genomic region containing a homolog of that gene, but it can be incorporated easily into a high-throughput genome annotation engine. Moreover, with its small memory footprint and user-friendly interface, it is well suited for use by individual researchers who wish to analyze their genomic sequence of interest on their local computer. Source code for the program is available free of charge from our web site, http://www.cbcb.umd.edu/software/sim4cc