Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Bioinformatics**|**v.9; 2008**|**PMC2375135

Formats

Article sections

- Abstract
- Background
- Results
- Discussion
- Conclusion
- Availability
- Methods
- Authors' contributions
- Supplementary Material
- References

Authors

Related links

BMC Bioinformatics. 2008; 9: 36.

Published online 2008 January 23. doi: 10.1186/1471-2105-9-36

PMCID: PMC2375135

Received 2007 June 20; Accepted 2008 January 23.

Copyright © 2008 Zhou et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be implemented into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increases, it becomes desirable to identify a small set of *universal *seeds that perform optimally or near-optimally on a large range of comparisons.

Using statistical regression methods, we investigate the sensitivity of seeds, in particular good seeds, between four cDNA-to-genome comparisons at different evolutionary distances (human-dog, human-mouse, human-chicken and human-zebrafish), and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. In addition, we find that with high confidence good seeds for more distant comparisons perform well on closer comparisons, within 98–99% of the optimal seeds, and thus represent universal good seeds.

We show for the first time that optimal and near-optimal seeds for distant species-to-species comparisons are more generally applicable to a wide range of comparisons. This finding will be instrumental in developing practical and user-friendly cDNA-to-genome alignment applications, to aid in the annotation of new model organisms.

The next few years are expected to bring a significant increase in the number of available genomes, driven by advances in sequencing technologies [1]. As genome sequencing projects outpace the generation of native mRNA and protein sequences, gene annotation projects for these genomes will need to rely instead on cDNA information from other species. While existing alignment programs align cDNA and the corresponding genomic sequences accurately, they are inadequate for cross-species comparisons [2]. Beginning with blast [3,4], most alignment programs have used a *seed-and-extend *technique to produce local alignments, starting from exact or near-exact word matches *(seeds) *between the two sequences and extending them to a local alignment in several stages. Blast uses an exact match of 11 contiguous positions, represented by a vector of 1s (11_{c }= 11111111111). Such a seed is called *continuous*. More recently, *spaced seeds *have been introduced, which allow wildcard positions in the seed pattern, marked with 0s. For instance, Kent and Zahler [5] used a seed that allowed for mismatches at the wobble codon position (WABA_{12 }= 11011011011011011) to increase the sensitivity of cDNA-genomic sequence alignment. For the same *weight*, or number of 1 positions, spaced seeds were shown to perform better than continuous seeds in most cases [6].

Recently, Ma *et al*. [7] introduced a framework for predicting the sensitivity of a spaced seed given a model of alignment, and showed how to determine the theoretical seed sensitivity with an efficient dynamic programming method [8]. Given a Bernoulli model of alignments with the parameter *p *(*i.e*., the probability that any one alignment position is a match) estimated from the average sequence identity, they determined optimal seeds for human-mouse genomic sequence alignment. These seeds were later implemented in the programs PatternHunter and blastz [9]. Subsequent studies used increasingly complex alignment models, such as the order 1 inhomogeneous 3-periodic Markov models in [6,10] or the higher order (3–5) inhomogeneous Markov models of [11], often in the specialized context of coding sequences. Extensions of the seed models were proposed to further improve sensitivity or extend the range of applicability to non-nucleotide sequences, for instance multiple spaced seeds [12], or vector seeds [13]. In the latter, each position in the seed pattern is a real value representing the weight of a match or substitution at that position in the total match score, and a seed match is declared when the total score exceeds an a priori fixed threshold. However, these methods increase the memory and running time of searches, both of which are critical for practical high-throughput applications.

While early alignment and seed models used the traditional {0, 1} alphabet, to further improve the seed sensitivity transition-only wildcards were introduced to differentiate between transitions and transversions, first implemented in blastz [9]. Later, Noé and Kucherov [14] formalized the concept and showed that spaced seeds with transitions extend the sensitivity range of {0, 1} seeds, and Zhou and Florea [11] additionally provided a framework for specificity calculation, and showed that they offer better sensitivity-specificity tradeoffs than {0, 1} seeds in practice.

When comparing coding sequences, such as in cDNA-to-genome alignments, the codon organization, higher-order position dependencies [15] and specific transition-transversion biases [16] inherent in the gene sequences are likely to be reflected in the alignment patterns. In [6,10], a three-state Markov model of order 1 is used to simulate the three codon positions, while Brejová *et al*. [17] introduce two models, the first a three-state Markov model of order 0, and the second a more complex Bernoulli formulation in which each codon is modeled independently. Recently, we proposed to use higher order (3–5) inhomogeneous Markov models with transitions [11] to capture both transition-transversion biases and sequence compositional patterns.

As the number of sequenced organisms increases, the range of possible pairwise species comparisons will grow quadratically with the number of species. For practical reasons it becomes desirable to identify a small number of seeds that would perform well for a wide range of comparisons, at varying evolutionary distances. Earlier, Choi *et al*. [18-20] determined and analyzed best seeds for genomic sequence comparisons for several sequence identity levels under a Bernoulli model of alignment, in a greatly simplified model of evolutionary distance. At low weights (10–12), seeds performed optimally or near-optimally across a wide range of sequence similarity levels. At higher weights, however, there was significant fluctuation in seed sensitivity, which led them to the hypothesis that different seeds will be needed for each type of comparison. This simple *p*-level representation of evolutionary sequence divergence is likely inadequate for coding sequences, which are under a more diverse set of evolutionary pressures. We approach the question of designing *universal *good seeds for cDNA-to-genome comparisons, *i.e*. seeds that perform well for a large number of comparisons, starting from a complex representation of alignment as a high order 3-periodic inhomogeneous Markov model incorporating transitions [11]. Using comparisons between human and four others species (mouse, dog, chicken and zebrafish) to sample a wide range of evolutionary distances, we analyze the distributions of seed sensitivities statistically to characterize and identify universal good seeds. In the remainder of this section, we provide a brief introduction to the problem of designing optimal spaced seeds.

Alignment programs typically use short exact or approximate matches of an *a priori *specified pattern *(seed) *to detect local alignments between two sequences. A *continuous *seed, such as the one used in blast [3,4], requires an exact match of a fixed length *k *between the sequences, and is represented as a vector of 1s (11_{c }= 11111111111). In contrast, spaced seeds [7] allow for approximate matches by including wildcard positions in the pattern, marked with 0s (*e.g*., WABA_{12 }= 11011011011011011). The number of 1 positions in the pattern is called the *weight *of the seed, and the length of the pattern is called the *span*. Conventionally, seeds must start with a 1 position. In keeping with previous studies, we will use a fixed seed span *k *= 22 [6,10,11].

An alignment is represented as a string of 0s (mismatches) and 1s (matches) generated from a model $\mathcal{M}$, for instance a Bernoulli or a Markov model. A seed *S *= *s*_{1 }... *s*_{k}, *s*_{1 }= 1, is said to detect the alignment *w *= *w*_{1 }... *w*_{L } {0, 1}^{L }if there is an approximate match for the pattern in the alignment string such that all 1 positions in the seed pattern map to 1 positions in the alignment, *i.e*. there exists *i *= 1..*L *- *k *+ 1 such that *w*_{i+l-1 }= 1 for all *l *with *s*_{l }= 1. If such an *i *exists, the seed is said to occur in the alignment *w *at position *i*. The theoretical sensitivity of a seed is then defined as the probability that it will detect a random alignment of length *L *generated from the model $\mathcal{M}$, or equivalently:
*Sn*(*S*) = *P*({*w *{0, 1}^{L}|*S *detects *w*}) [6,8]. Traditionally, the alignment length *L *is 64, determined as the average length of a gap-free alignment in human-mouse comparisons. By definition, an *optimal seed *is a seed with the highest sensitivity. For a given seed, its sensitivity can be computed exactly using dynamic programming [8,10,21]. Optimal seeds can then be produced by exhaustively searching the seed space [8], while close approximations can be obtained with fast heuristics, such as hill-climbing [6,10] or exploiting the seed structure to reduce the search space [20].

Unlike genomic sequences, gene sequences exhibit higher order dependencies between positions [15], more pronounced transition-transversion biases [16], and distinct manifestations at the three codon positions, which are likely to translate into alignment patterns. To account for these characteristics, in previous work [11] we proposed a framework that differentiated between transitions and transversions, by using an additional alphabet symbol *x*, as well as among the three codon positions, using a third order inhomogeneous 3-periodic Markov model of cDNA-to-genome alignments. In the seed pattern, *x *marks a position that allows transitions but not transversions, while in the alignment model it simply represents a transition. The weight of the new symbol is 0.5. We denote (*n*_{1}, *n*_{0}, *n*_{x}) the class of seed patterns with *n*_{1}, *n*_{0 }and *n*_{x }symbols of 1, 0 and *x*, respectively, *n*_{1 }+ *n*_{0 }+ *n*_{x }= *k*. For a given weight *W*, there may be multiple (*n*_{1}, *n*_{0}, *n*_{x}) combinations with *n*_{1 }+ 0.5·*n*_{x }= *W *. For instance, for the weight *W *= 10 and span *k *= 22, the following combinations are possible: (10, 12, 0), (9, 11, 2), ..., (0, 2, 20).

In the following sections, we investigate the behavior of seeds, and in particular best seeds, among four comparisons or cDNA-to-genome alignment models (human-mouse, human-dog, human-chicken and human-zebrafish) to obtain a robust characterization of universal good seeds.

The repertoire of species to be sequenced is expected to increase dramatically over the next few years, driven by new, more effective and increasingly reliable sequencing technologies. As the number and phylogenetic diversity of genomes increases, designing optimized seeds for each pair of compared species quickly becomes impractical. It becomes desirable to identify a limited number of seeds that perform well, if not optimally, for a large number of comparisons. We call these *universal *good seeds. We examine seeds for four comparisons between species with significantly different evolutionary distances and mutation patterns (human-mouse, human-dog, human-chicken and human-zebrafish). For simplicity, we will refer to each comparison by the name of the second organism (DOG, MUS, CHK, ZFS). By analyzing the distribution of seed sensitivities between models statistically, we identify strategies for designing universal good seeds.

We address two questions: First, are there groups of comparisons, or equivalently alignment models, that produce similar behavior of all seeds? We call such models *seed-equivalent*, and one optimal seed would then satisfy all comparisons in a seed-equivalence group. Second, are there *universal *good seeds, *i.e*. seeds that are optimal or near-optimal for a large range of comparisons and evolutionary distances?

To answer these questions, we started by calculating seed sensitivities exhaustively in the four models for three weights, *W *= 12, 14, 16, for the (12, 10, 0), (14, 8, 0) and (16, 6, 0) combinations. We then compared the distributions of values between any two models using statistical regression methods to determine seed trends, by measuring the 95% confidence interval for a seed when projected via the regression curve, and to identify outliers, or seeds that have significantly different behavior (at > 2.5*σ *from the regression curve) between the two models. Although linear and non-linear regression models showed similar goodness of fit, an order 3 non-linear method was chosen because it provided a more conservative estimate of predicted values among the top scoring seeds, which are particularly relevant to this study. The scatter plot of seed values between the CHK and DOG distributions and the two regression curves, linear and non-linear, produced with the R statistical analysis package [22] are shown in Figure Figure1.1. Scatterplots for all pairwise comparisons are shown in [Additional file 1].

Scatterplot of seed sensitivity values between the CHK and DOG comparisons, for weight *W *= 16 and for the (16, 6, 0) combination. The linear and non-linear regression (solid) and the 95% confidence interval (dotted) curves are shown.

For most pairs of models, the 95% confidence interval for the predicted seed values (2·*t*_{α/2}*σ*_{y }in Table Table1)1) is less than or close to 0.01, meaning that with high probability the observed seed sensitivity in the second model is within close vicinity ($\widehat{y}$ - *t*_{α/2}*σ*_{y}, $\widehat{y}$ + *t*_{α/2}*σ*_{y}) of the predicted value, $\widehat{y}$. When seed values can be so closely predicted between both (*X*, *Y*) and (*Y*, *X*), the two models *X *and *Y *can be deemed seed-equivalent. In our case, DOG and MUS, and CHK and ZFS, respectively, are seed-equivalent despite the difference in their sensitivity ranges. Interestingly, seed values appear to also be accurately projected from a reference model *X *describing a more distant evolutionary relationship, to a compared model *Y *representing a closer comparison, but not conversely. We explore this observation below.

With the same argument as above, when the margin of error *t*_{α/2}*σ*_{y }is small, with high confidence high-scoring seeds in the reference model are expected to lie at the top of the sensitivity range in the compared model. Hence, good seeds will largely be shared between the models. To measure how closely the predicted value of the optimal seed in the reference model (*x*_{max}) approaches the optimal (real) seed in the compared model (*y*_{max}), we used the statistical lower bound of the predicted interval for *x*_{max}, as a worst case scenario, and compared it with *y*_{max}. These ratios are shown as *T *(*x, y*) in Table Table1.1. As expected, the values are consistently above 0.98 between the seed-equivalent models, but also when a seed in a more distant model is projected onto a closer model, and the trends are more pronounced with the larger seed weights. This level of accuracy in predicting near-optimal seeds by regression projection, albeit statistical, is comparable with that obtained with the heuristic algorithm proposed in [20]. To rule out the effects of outliers among the top-scoring candidates, we identified those seeds that fall significantly outside of the 95% confidence range (Table (Table1,1, columns 7 and 8). Although roughly 1–2% of seeds are expected to place outside the predicted range, they are scattered along the sensitivity range for the model.

Collectively, these findings suggest that simply selecting top scoring seeds in the most distant model, in our case ZFS, would lead to good seeds for all other models, and therefore gives a simple strategy for determining universal good seeds.

To probe the universality of seeds, we evaluate seeds optimized for one comparison on the three others. For each comparison, we determine best seeds for most weights of practical interest, *W *= 10..16, for all (*n*_{1}, *n*_{0}, *n*_{x}) combinations. Table Table22 lists the optimal seed under our fixed span model (*k *= 22) for each weight for the four comparisons, and the complete list including all combinations is in [Additional file 2]. Figure Figure22 shows the sensitivity maxima that can be achieved by seeds for each combination. For each comparison (DOG, MUS, CHK, ZFS) the maximum overall sensitivity declines steadily as the seed weight increases, at a rate roughly proportional to the percentage of matches in the alignment (*p*_{d }= 89.0%, *p*_{m }= 85.0%, *p*_{c }= 75.5% and *p*_{z }= 68.7%). Within each weight group, the sensitivity maxima vary, sometimes dramatically, between (*n*_{1}, *n*_{0}, *n*_{x}) combinations: seeds with the largest number of transitions consistently score the lowest, and seeds with a small number of transitions, depending on the species (2–4 for CHK and ZFS, 6–8 for MUS and DOG), perform the best. We hypothesize that the optimal number of seed transitions for each comparison is determined by the transition-transversion ratio (*κ*) in the alignment model, and that more transition positions are expected in the optimal seeds as the ratio increases. Thus, DOG (*κ*_{d }= 1.97) and MUS (*κ*_{m }= 1.73) are the most likely to see the effects of transitions, while the effects on CHK (*κ*_{c }= 1.17) and ZFS (*κ*_{z }= 0.95) will be smaller. This ratio also determines the gain in sensitivity that can be obtained by seeds incorporating transitions compared to {0,1} seeds.

Seeds optimized for the CHK, DOG, MUS, ZFS comparisons, for weight W = 10..16, using hill-climbing. For large weights (*e.g*., W ≥ 16), the fixed span *k *= 22 may significantly constrain the range of seeds, and therefore the seeds produced under **...**

Sensitivity maxima for DOG, MUS, CHK and ZFS comparisons, for seeds of weight *W *= 10..18. For each weight *W*, (*n*_{1}, *n*_{0}, *n*_{x}) combinations are shown right-to-left starting with *n*_{x }= 0 and subsequently increasing *n*_{x}. Sensitivity drop rates between consecutive **...**

To validate the universal seeds, we measured the performance of seeds optimized for one comparison on each of the other three, both theoretically and on real data. For empirical evaluations, we tested the ability of each seed to detect orthologous exons between human and each of the other species. Table Table33 lists the theoretical (T) and empirical (E) seed sensitivity values, calculated as described in Methods, averaged over each weight group. Theoretical values show very similar sensitivities for all four groups of seeds on all four comparisons. Empirical values are somewhat more varied, but they also indicate that all seeds perform optimally or near-optimally for multiple comparisons. In all cases, seeds designed for the more distant comparisons have near optimal performance for the closer ones. In particular, ZFS and CHK optimal seeds are widely applicable, and thus are good candidates for universal good seeds.

While significant effort has gone in designing optimal seeds for comparing human and mouse sequences, a remarkably few other comparisons received any attention at all. With more species being sequenced over the next few years, identifying a small number of seeds that would perform optimally or near-optimally for a large range of comparisons is becoming essential. For genomic sequences, the sequence identity level has been used as a simple measure of evolutionary distance. Since the sequence composition of genes is significantly more complex, best seeds for comparing coding sequences are expected to depend not only on the sequence identity level, but also on the pattern of mutations, in particular transition-transversion biases. Intuitively, similar alignment models should produce similar behavior of seeds. Thus, one solution to seed proliferation is to group comparisons that exhibit similar behavior of seeds into *seed-equivalent *classes, such that all comparisons in one class are well served by the same optimal seed. Furthermore, it would be even more desirable to identify one or a small set of seeds suitable for a wide variety of evolutionary distances and mutation patterns, herein called *universal *seeds.

Our statistical regression analyses of seed sensitivities among four comparisons, chosen at various evolutionary distances, showed that for some pairs of comparisons seed behavior can be predicted closely with high-confidence. In particular, it was possible to identify two sets of models that have relatively different sensitivity ranges but very similar seed behavior, and therefore can be deemed seed-equivalent. Moreover, even at larger evolutionary distances more distant comparisons are predictive of closer ones. This conclusion is corroborated by several factors, including the length of the prediction interval, the pattern of outliers, and the ratio of predicted to optimal sensitivity (Table (Table1).1). For instance, at 2.5*σ *regression error, all but a few of the outliers perform better than predicted in the distant model compared to the closer one. In other words, distant comparisons contain more information than closer ones. One outstanding question is how to determine whether two models are close enough to be seed-equivalent. While we have not yet found a theoretical formulation, we investigated the relationship between alignment Markov models using a conventional distance measure between their probability distributions. The Kullback-Leibler Divergence (KLD) [23] can be applied on the space of alignment words $\mathcal{X}$ = {0, 1, *x*}^{64 }to produce a distance between the two models [see Additional file 3]:

$$KLD(P,Q)={\displaystyle \sum _{w\in \mathcal{X}}p(w)\mathrm{log}\frac{p(w)}{q(w)}}$$

*KLD*(*P, Q*) represents the relative entropy of *P *over *Q*, or the information gain about $\mathcal{X}$ when *P *is used instead of *Q*. The KLD measure is non-symmetrical, *i.e. KLD*(*P, Q*) ≠ *KLD*(*Q*, *P*), and therefore can capture unidirectional relationships. In Table Table4,4, the more distant comparisons contain consistently more information than closer ones. (Note that we have judged comparisons based on the relative sequence similarity levels, for instance DOG (*p*_{d }= 89.0%) is closer than MUS (*p*_{m }= 85.0%).) The table also suggests that a possible cutoff for deciding seed-equivalence may be set between 1.0 and 2.0. Additional models and analyses will be needed, however, to validate and fine tune this criterion.

Lastly, one natural question that arises is whether universal seeds exist for other types of comparisons, such as between genomic sequences. Our preliminary experiments using a {0, 1} Bernoulli model of alignment [7,19] and four different sequence similarity levels to represent different evolutionary distances (*p *= 0.65, 0.75, 0.85 and 0.95) indicate that, again, good seeds for the more distant comparisons will perform well on the closer ones [see Additional file 4]. Moreover, for weight 12, *p *= 0.65, 0.75, 0.85 form a seed-equivalent cluster and the optimal seed is shared among the four models, whereas for the larger weights (14, 16) the seed-equivalent groups are sparser and no one seed is optimal for all comparisons. These findings are consistent with the observations in [19]. Thus, although more analyses are needed to test it on various models, our simple strategy for selecting universal good seeds may be more widely applicable to a variety of sequence comparison problems.

We performed a statistical analysis of seed sensitivities for four species-to-species cDNA-to-genome comparisons, spanning a wide range of evolutionary distances and mutation patterns, with the goal to determine criteria for selecting a small set of seeds that would perform well on a wide range of comparisons. In particular, grouping models that exhibit similar behavior of seeds into seed-equivalence classes could significantly reduce the number of optimal seeds. Most important for practical applications, the analyses showed that with high probability optimal and near-optimal seeds for the most distant available comparison will translate into good seeds for a wide range of comparisons. These insights, and the sets of optimal seeds predicted for the four comparisons and for a wide array of weights, represent a useful resource in guiding the selection of seeds for developing practical applications.

Material referenced in the paper can be found in the Additional files below and at our website [24].

Given an alignment model, we calculate seed sensitivity in the {0, 1, x} model recursively as described in [11]. For convenience, we include a summary here.

Let $\mathcal{M}$ be a homogeneous order 2 Markov model of alignment, and $\mathcal{D}(S)=\{\mathcal{Q},\Sigma ,\mathcal{A},{q}_{0},{q}_{a},{q}_{f}\}$ a deterministic finite automoton (DFA) that accepts all and only alignment strings that contain a seed match, built with the Aho-Corasick algorithm [25]. Here, $\mathcal{Q}$ denotes the set of states, Σ = {0, 1, *x*} is the alignment alphabet, $\mathcal{A}$ is the set of state transitions, and *q*_{0}, *q*_{a}, *q*_{f }are the *start*, *accept *and *fail *states. We write $q\underset{}{\overset{b}{\to}}{q}^{\prime}$ if there is a DFA state transition from state *q *to state *q' *on the symbol *b * Σ. The sensitivity of the seed *Sn*(*S*) then is the probability of all alignment words of length *L *= 64 accepted by the DFA. We calculate recursively the probabilities *P*(*q*, *t*, *α*), *q * $\mathcal{Q}$, *α * Σ^{3 }that the automaton reaches state *q *after reading *t *input symbols ending in suffix *α *randomly generated from the alignment model. Then, the sensitivity of seed *S *is $Sn(S)={\displaystyle \sum _{\alpha \in {\Sigma}^{3}}P({q}_{a},L,\alpha )}$:

$$\begin{array}{lll}P(q,0,\epsilon )\hfill & =\hfill & \begin{array}{cc}1,& \text{if}q={q}_{0}\text{and}0,\text{otherwise}\end{array}\hfill \\ P(q,1,{b}_{1})\hfill & =\hfill & \begin{array}{cc}{P}_{0}({b}_{1}){\displaystyle \sum _{{q}^{\prime}\underset{}{\overset{{b}_{1}}{\to q}}}P({q}^{\prime},0,\epsilon ),}& \forall {b}_{1}\in \{0,1,x\}\end{array}\hfill \\ P(q,2,{b}_{1}{b}_{2})\hfill & =\hfill & \begin{array}{cc}{P}_{1}({b}_{2}|{b}_{1}){\displaystyle \sum _{{q}^{\prime}\underset{}{\overset{{b}_{2}}{\to q}}}P({q}^{\prime},1,{b}_{1})},& \forall {b}_{1},{b}_{2}\in \{0,1,x\}\end{array}\hfill \\ P(q,3,{b}_{1}{b}_{2}{b}_{3})\hfill & =\hfill & \begin{array}{cc}{P}_{2}({b}_{3}|{b}_{1}{b}_{2}){\displaystyle \sum _{{q}^{\prime}\underset{}{\overset{{b}_{3}}{\to q}}}P({q}^{\prime},2,{b}_{1}{b}_{2})},& \forall {b}_{1},{b}_{2},{b}_{3}\in \{0,1,x\}\end{array}\hfill \\ P(q,t,\delta b)\hfill & =\hfill & \begin{array}{cc}{P}_{2}(b|\delta ){\displaystyle \sum _{{q}^{\prime}\underset{}{\overset{b}{\to q}}}{\displaystyle \sum _{{b}_{0}\in \{0,1,x\}}P({q}^{\prime},t-1,{b}_{0}\delta )}},& \forall \delta \in {\{0,1,x\}}^{2}\end{array},\hfill \end{array}$$

where *P*_{0}, *P*_{1 }and *P*_{2 }are the marginal probabilities of the order 2 Markov model $\mathcal{M}$, and ε is the empty word.

To determine optimal seeds, we use a hill-climbing heuristic [6,11], starting from a random seed and swapping any two distinct symbols with the goal to optimize the score locally. If a 'better' seed is found, it becomes the start seed for variations in the next cycle, until there are no changes to the optimal score. The procedure is applied 20 times for each weight *W *and for each combination (*n*_{1}, *n*_{x}, *n*_{0}), where *n*_{i }is the number of *i *symbols in the seed, satisfying *n*_{1 }+ *n*_{0 }+ *n*_{x }= 22 and *n*_{1 }+ 0.5·*n*_{x }= *W*.

We evaluate seed sensitivity both theoretically and empirically to identify good seeds for practical applications. To determine the effects of species divergence on the choice of seeds, we analyze comparisons between human and four other species (dog, mouse, chicken and zebrafish), which coarsely sample the range of vertebrate evolutionary distances.

Given a seed and an alignment model, the theoretical seed sensitivity can be calculated recursively as described above. To train the alignment models for the four sets of comparisons, exons in the human genome were determined from spliced genomic alignments of human mRNA and EST sequences, produced with the programs Sim4/ESTmapper [2,26]. A subset of coding exons and their reading frames were then determined by Fastx [27] matches against the SwissProt database [28]. Lastly, match statistics within the coding exon regions were collected from whole-genome alignments downloaded from the UCSC Genome Browser [29] and used to train the alignment models.

This evaluation component tests the ability of a seed to accurately match dog, mouse, chicken and zebrafish coding exons with their orthologs in the human genome. For this purpose, reference sets of orthologous exons were constructed from homologous RefSeq gene pairs [30,31] as follows. Exon coordinates in the human mRNA and in the human genome were determined from Sim4 [26] alignments of the human mRNA sequence on the human genome HG17 produced with the high-throughput program ESTmapper [2]. These coordinates were then projected onto the mRNA of the other species via the pairwise mRNA-mRNA alignment. Starting with roughly 500 gene pairs for each comparison, this procedure produced 2,408 (dog), 4,198 (mouse), 4,869 (chicken) and 4,543 (zebrafish) exon pairs for use in our empirical analysis.

During the evaluation, dog, mouse, chicken and zebrafish coding sequences are searched against the human genome. The search program scans the input sequences, looking for seed matches of length 22 in the human genome. For efficiency, the search uses a bit-encoded index of 22 bp words (22-mers) in the genome, from which non-informative bits corresponding to 0 or *x *positions in the seed are removed. Empirical sensitivity is calculated as the fraction of human exons in the reference set that are detected by the seed: *Sn*^{e }= *TP/*(*TP *+ *FN*) [32].

We compare the distributions of seed sensitivities between any two comparisons (DOG, MUS, CHK and ZFS) using statistical regression methods. We used a *t*-test to determine the 95% confidence interval ($\widehat{y}$ - *t*_{α/2}*σ*_{y}, $\widehat{y}$ + *t*_{α/2}*σ*_{y}) for the projection $\widehat{y}$ = *R*(*x*) of a sensitivity value *x *of a seed from the reference model to the compared model. Here, *α *= 0.05, *t*_{α/2 }= 1.96, and *σ*_{y }is the estimated regression standard error of prediction for a given *x *value. Because the number of values is large, *σ*_{y } *σ *for all *y*. A non-linear regression method with an order 3 polynomial was applied to the data: *R*(*x*) = (*a *+*bx*_{max }+ $c{x}_{\mathrm{max}}^{2}$ + $d{x}_{\mathrm{max}}^{3}$). Lastly, outliers were determined as data points *x *whose values *y *in the compared model fall below or above the projected interval: *y *- $\widehat{y}$ > 2.5*σ *or $\widehat{y}$ - *y *> 2.5*σ*.

LZ performed the analyses and contributed to the drafting of the manuscript. JS provided computational support and critically reviewed the manuscript. LF designed the experiments, directed the project, and drafted the article. All authors have read and approved the final version of the manuscript.

Scatterplots of seed sensitivity values between pairs of comparisons. This collection of figures shows the scatterplots of seed sensitivity values and the fitted regression curves between any two models in (DOG, MUS, CHK, ZFS).

Click here for file^{(944K, pdf)}

Optimal seeds for the CHK, DOG, MUS and ZFS comparisons. This table lists the seeds optimized for the DOG, MUS, CHK and ZFS comparisons, for weights *W *= 10..16 and for all (*n*_{1}, *n*_{0}, *n*_{x}) combinations, obtained using hill-climbing. For larger weights (*e.g*., *W *≥ 16), the fixed span *k *= 22 may significantly constrain the range of seeds, and therefore the seeds produced under this model may not be optimal in practice.

Click here for file^{(48K, pdf)}

Efficient calculation of the KLD for two Markov models. This appendix describes a recursive procedure for calculating the Kullback-Leibler divergence between two Markov models efficiently.

Click here for file^{(32K, pdf)}

Regression analysis of seed sensitivity distributions for genomic sequence comparison. This table contains the results from the statistical regression analysis of seed sensitivity distributions, similarly to Table Table1,1, but for the case of genomic sequence comparisons. The four models are characterized by the sequence identity levels: *p *= 0.65, 0.75, 0.85, 0.95.

Click here for file^{(18K, pdf)}

Sensitivity calculations were performed on the "Herd" Scientific Computing Cluster at The George Washington University (NSF grant CLS20163A to JS). This work was supported in part by a Sloan Research Fellowship to LF.

- National Human Genome Research Institute, Listing of Genome Sequencing Proposals http://www.genome.gov/10002154
- Florea L, Di Francesco V, Miller J, Turner R, Mobarry C, Yao A, Harris M, Walenz B, Dew I, Merkulov G, Charlab R, Deng Z, Istrail S, Li P, Sutton G. Gene and alternative splicing annotation with AIR. Genome Res. 2005;15:54–66. [PubMed]
- Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990;215:403–410. [PubMed]
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
- Kent W, Zahler A. Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res. 2000;10:1115–1125. [PubMed]
- Buhler J, Keich U, Sun Y. Designing seeds for similarity search in genomic DNA. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB 2003) 2003;7:67–75.
- Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics. 2002;18:440–445. [PubMed]
- Keich U, Li M, Ma B, Tromp J. On spaced seeds for similarity search. Discrete Appl Math. 2004;138:253–263.
- Schwartz S, Kent W, Smit A, Zhang Z, Baertsch R, Hardison R, Haussler D, Miller W. Human-mouse alignments with BLASTZ. Genome Res. 2003;13:103–107. [PubMed]
- Buhler J, Keich U, Sun Y. Designing seeds for similarity search in genomic DNA. J Comput Syst Sci. 2005;70:342–363.
- Zhou L, Florea L. Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J Comput Biol. 2007;14:113–130. [PubMed]
- Sun Y, Buhler J. Designing multiple simultaneous seeds for DNA similarity search. J Comput Biol. 2005;12:847–861. [PubMed]
- Brejová B, Brown D, Vinar T. Vector seeds: an extension to spaced seeds. J Comput Syst Sci. 2005;70:364–380.
- Noé L, Kucherov G. Improved hit criteria for DNA local alignment. BMC Bioinformatics. 2004;5:149. [PMC free article] [PubMed]
- Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268:78–94. [PubMed]
- Nei M, Kumar S. Molecular evolution and phylogenetics. New York: Oxford University Press; 2000.
- Brejová B, Brown D, Vinar T. Optimal spaced seeds for homologous coding regions. J Bioinform Comput Biol. 2004;1:595–610. [PubMed]
- Choi K, Zhang L. Sensitivity analysis and efficient method for identifying optimal spaced seeds. J Comp Syst Sci. 2004;68:22–40.
- Choi K, Zeng F, Zhang L. Good spaced seeds for homology search. Bioinformatics. 2004;20:1053–1059. [PubMed]
- Preparata F, Zhang L, Choi KP. Quick, practical selection of effective seeds for homology search. J Comput Biol. 2005;12:1137–1152. [PubMed]
- Kucherov G, Noé L, Roytberg M. A unifying framework for seed sensitivity and its application to subset seeds. J Bioinform Comp Biol. 2006;4:553–569.
- The R Project for Statistical Computing http://www.r-project.org
- Cover T, Thomas J. Elements of Information Theory. New York: John Wiley & Sons, Inc; 1991.
- George Washington University, Computational Molecular Biology Lab http://dna.cs.gwu.edu
- Aho A, Corasick M. Efficient string matching: an aid to bibliographic search. Comm ACM. 1975;18:333–340.
- Florea L, Hartzell G, Zhang Z, Rubin G, Miller W. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 1998;8:967–974. [PubMed]
- Pearson W, Lipman D. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988;85:2444–2448. [PubMed]
- Wu C, Apweiler R, Bairoch A, Natale D, Barker W, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin M, Mazumder R, O'Donovan C, Redaschi N, Suzek B. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006:D187–191. [PMC free article] [PubMed]
- UCSC Genome Browser http://genome.ucsc.edu
- Pruitt K, Tatusova T, Maglott D. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007:D61–65. [PubMed]
- Wheeler D, Barrett T, Benson D, Bryant S, Canese K, Chetvernin V, Church D, DiCuccio M, Edgar R, Federhen S, Geer L, Kapustin Y, Khovayko O, Landsman D, Lipman D, Madden T, Maglott D, Ostell J, Miller V, Pruitt K, Schuler G, Sequeira E, Sherry S, Sirotkin K, Souvorov A, Starchenko G, Tatusov R, Tatusova T, Wagner L, Yaschenko E. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007:D5–12. [PMC free article] [PubMed]
- Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics. 1996;34:353–367. [PubMed]

Articles from BMC Bioinformatics are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |