To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increases, it becomes desirable to identify a small set of universal seeds that perform optimally or near-optimally on a large range of comparisons.
Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that with high confidence good seeds for more distant comparisons perform well on closer comparisons, within 98–99% of the optimal seeds, and thus represent universal good seeds.
We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.