The next few years are expected to bring a significant increase in the number of available genomes, driven by advances in sequencing technologies [
1]. As genome sequencing projects outpace the generation of native mRNA and protein sequences, gene annotation projects for these genomes will need to rely instead on cDNA information from other species. While existing alignment programs align cDNA and the corresponding genomic sequences accurately, they are inadequate for cross-species comparisons [
2]. Beginning with blast [
3,
4], most alignment programs have used a
seed-and-extend technique to produce local alignments, starting from exact or near-exact word matches
(seeds) between the two sequences and extending them to a local alignment in several stages. Blast uses an exact match of 11 contiguous positions, represented by a vector of 1s (11
c = 11111111111). Such a seed is called
continuous. More recently,
spaced seeds have been introduced, which allow wildcard positions in the seed pattern, marked with 0s. For instance, Kent and Zahler [
5] used a seed that allowed for mismatches at the wobble codon position (WABA
12 = 11011011011011011) to increase the sensitivity of cDNA-genomic sequence alignment. For the same
weight, or number of 1 positions, spaced seeds were shown to perform better than continuous seeds in most cases [
6].
Recently, Ma
et al. [
7] introduced a framework for predicting the sensitivity of a spaced seed given a model of alignment, and showed how to determine the theoretical seed sensitivity with an efficient dynamic programming method [
8]. Given a Bernoulli model of alignments with the parameter
p (
i.e., the probability that any one alignment position is a match) estimated from the average sequence identity, they determined optimal seeds for human-mouse genomic sequence alignment. These seeds were later implemented in the programs PatternHunter and blastz [
9]. Subsequent studies used increasingly complex alignment models, such as the order 1 inhomogeneous 3-periodic Markov models in [
6,
10] or the higher order (3–5) inhomogeneous Markov models of [
11], often in the specialized context of coding sequences. Extensions of the seed models were proposed to further improve sensitivity or extend the range of applicability to non-nucleotide sequences, for instance multiple spaced seeds [
12], or vector seeds [
13]. In the latter, each position in the seed pattern is a real value representing the weight of a match or substitution at that position in the total match score, and a seed match is declared when the total score exceeds an a priori fixed threshold. However, these methods increase the memory and running time of searches, both of which are critical for practical high-throughput applications.
While early alignment and seed models used the traditional {0, 1} alphabet, to further improve the seed sensitivity transition-only wildcards were introduced to differentiate between transitions and transversions, first implemented in blastz [
9]. Later, Noé and Kucherov [
14] formalized the concept and showed that spaced seeds with transitions extend the sensitivity range of {0, 1} seeds, and Zhou and Florea [
11] additionally provided a framework for specificity calculation, and showed that they offer better sensitivity-specificity tradeoffs than {0, 1} seeds in practice.
When comparing coding sequences, such as in cDNA-to-genome alignments, the codon organization, higher-order position dependencies [
15] and specific transition-transversion biases [
16] inherent in the gene sequences are likely to be reflected in the alignment patterns. In [
6,
10], a three-state Markov model of order 1 is used to simulate the three codon positions, while Brejová
et al. [
17] introduce two models, the first a three-state Markov model of order 0, and the second a more complex Bernoulli formulation in which each codon is modeled independently. Recently, we proposed to use higher order (3–5) inhomogeneous Markov models with transitions [
11] to capture both transition-transversion biases and sequence compositional patterns.
As the number of sequenced organisms increases, the range of possible pairwise species comparisons will grow quadratically with the number of species. For practical reasons it becomes desirable to identify a small number of seeds that would perform well for a wide range of comparisons, at varying evolutionary distances. Earlier, Choi
et al. [
18-
20] determined and analyzed best seeds for genomic sequence comparisons for several sequence identity levels under a Bernoulli model of alignment, in a greatly simplified model of evolutionary distance. At low weights (10–12), seeds performed optimally or near-optimally across a wide range of sequence similarity levels. At higher weights, however, there was significant fluctuation in seed sensitivity, which led them to the hypothesis that different seeds will be needed for each type of comparison. This simple
p-level representation of evolutionary sequence divergence is likely inadequate for coding sequences, which are under a more diverse set of evolutionary pressures. We approach the question of designing
universal good seeds for cDNA-to-genome comparisons,
i.e. seeds that perform well for a large number of comparisons, starting from a complex representation of alignment as a high order 3-periodic inhomogeneous Markov model incorporating transitions [
11]. Using comparisons between human and four others species (mouse, dog, chicken and zebrafish) to sample a wide range of evolutionary distances, we analyze the distributions of seed sensitivities statistically to characterize and identify universal good seeds. In the remainder of this section, we provide a brief introduction to the problem of designing optimal spaced seeds.
Spaced Seeds
{0,1} spaced seeds
Alignment programs typically use short exact or approximate matches of an
a priori specified pattern
(seed) to detect local alignments between two sequences. A
continuous seed, such as the one used in blast [
3,
4], requires an exact match of a fixed length
k between the sequences, and is represented as a vector of 1s (11
c = 11111111111). In contrast, spaced seeds [
7] allow for approximate matches by including wildcard positions in the pattern, marked with 0s (
e.g., WABA
12 = 11011011011011011). The number of 1 positions in the pattern is called the
weight of the seed, and the length of the pattern is called the
span. Conventionally, seeds must start with a 1 position. In keeping with previous studies, we will use a fixed seed span
k = 22 [
6,
10,
11].
An alignment is represented as a string of 0s (mismatches) and 1s (matches) generated from a model

, for instance a Bernoulli or a Markov model. A seed
S =
s1 ...
sk,
s1 = 1, is said to detect the alignment
w =
w1 ...
wL ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
{0, 1}
L if there is an approximate match for the pattern in the alignment string such that all 1 positions in the seed pattern map to 1 positions in the alignment,
i.e. there exists
i = 1..
L -
k + 1 such that
wi+l-1 = 1 for all
l with
sl = 1. If such an
i exists, the seed is said to occur in the alignment
w at position
i. The theoretical sensitivity of a seed is then defined as the probability that it will detect a random alignment of length
L generated from the model

, or equivalently:
Sn(
S) =
P({
w
{0, 1}
L|
S detects
w}) [
6,
8]. Traditionally, the alignment length
L is 64, determined as the average length of a gap-free alignment in human-mouse comparisons. By definition, an
optimal seed is a seed with the highest sensitivity. For a given seed, its sensitivity can be computed exactly using dynamic programming [
8,
10,
21]. Optimal seeds can then be produced by exhaustively searching the seed space [
8], while close approximations can be obtained with fast heuristics, such as hill-climbing [
6,
10] or exploiting the seed structure to reduce the search space [
20].
{0, 1, x} spaced seeds for cDNA-to-genome alignment
Unlike genomic sequences, gene sequences exhibit higher order dependencies between positions [
15], more pronounced transition-transversion biases [
16], and distinct manifestations at the three codon positions, which are likely to translate into alignment patterns. To account for these characteristics, in previous work [
11] we proposed a framework that differentiated between transitions and transversions, by using an additional alphabet symbol
x, as well as among the three codon positions, using a third order inhomogeneous 3-periodic Markov model of cDNA-to-genome alignments. In the seed pattern,
x marks a position that allows transitions but not transversions, while in the alignment model it simply represents a transition. The weight of the new symbol is 0.5. We denote (
n1,
n0,
nx) the class of seed patterns with
n1,
n0 and
nx symbols of 1, 0 and
x, respectively,
n1 +
n0 +
nx =
k. For a given weight
W, there may be multiple (
n1,
n0,
nx) combinations with
n1 + 0.5·
nx =
W . For instance, for the weight
W = 10 and span
k = 22, the following combinations are possible: (10, 12, 0), (9, 11, 2), ..., (0, 2, 20).
In the following sections, we investigate the behavior of seeds, and in particular best seeds, among four comparisons or cDNA-to-genome alignment models (human-mouse, human-dog, human-chicken and human-zebrafish) to obtain a robust characterization of universal good seeds.