Overview

We are initially given a multiple alignment, a hypothetical query sequence and a phylogenetic tree of all species in the alignment as well as the query. We convert the multiple alignment into a probabilistic profile, where each position in the profile is a tuple of six numbers (*p*_{present}, *p*_{A}, *p*_{C}, *p*_{T}, *p*_{G}, *p*_{id}). The first number, *p*_{present}, is the existence probability. It represents the probability that a homologue of the position is present in the query or, equivalently, the probability that the query would align to the position without a gap. The next four values indicate the respective conditional probabilities that the homologous position contains an A, C, G or T, given that there exists a homologous position in the query. The nucleotide with the highest such value is called the consensus character; gaps are ignored when determining this. Note that the consensus character in principle depends on the position of the query in the tree. The final value, *p*_{id}, is the conditional similarity level (12) of the position given that there exists a homologous position in the query. In other words, it represents the probability that the corresponding position in the query sequence contains the consensus character of the multiple alignment. For the remainder of this paper, we will use the terms profile and probabilistic profile interchangeably.

To begin with, we define several terms. A seed of weight *w* is a sequence of possibly non-consecutive positions (*i*_{1} < ··· < *i*_{w}); by convention, *i*_{1} = 0. The span of a seed is defined as *i*_{w} − *i*_{1} + 1. We define an un-gapped homology *h* of length *l* beginning at position *p* in sequence *s* and position *q* in sequence *t* as two sub-sequences (*s*[*p*], …, *s*[*p* + *l*]) and (*t*[*q*], …, *t*[*q* + *l*]) that have descended from a common ancestor; one can think of such a homology as a bit string of length *l* with *h*[*i*] = 1 if and only if *s*[*p* + *i*] = *t*[*q* + *i*]. A seed is said to match the homology at offset *j* if *h*[*j* + *i*_{1}] = 1, …, *h*[*j* + *i*_{w}] = 1, and indexing a seed at position *p* in a sequence corresponds to recording in the index the presence of (*s*[*p* + *i*_{1}], …, *s*[*p* + *i*_{w}]) at position *p*. A seed matches a homology if we index the seed at every position in the homology and the seed matches the homology at at least one offset. A set of seeds matches a homology if we index every seed in the set at every position in the homology and at least one seed in the set matches the homology.
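The seed-matching definitions above can be made concrete with a short sketch (the function names and example values are ours, for illustration only):

```python
def seed_matches(seed, h):
    """Check whether a spaced seed (sorted list of offsets, i1 = 0) matches
    an un-gapped homology h, given as a list of 0/1 match indicators.

    The seed matches at offset j if every sampled position of h is 1;
    it matches the homology if it matches at at least one offset.
    """
    span = seed[-1] - seed[0] + 1
    for j in range(len(h) - span + 1):
        if all(h[j + i] == 1 for i in seed):
            return True
    return False

# The weight-3, span-5 seed (0, 2, 4) against a homology bit string:
h = [1, 0, 1, 1, 1, 0, 1]
print(seed_matches([0, 2, 4], h))   # True: at offset 0, h[0], h[2], h[4] are all 1
```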

We extend these notions to a multiple alignment in a simple manner; homologies are defined to exist between the query and the alignment. An un-gapped homology beginning at position *p* in the alignment and position *q* in the query is a set of sub-sequences, one from each species in the alignment as well as the query, that have all descended from a common ancestor. In this case, we define *h*[*i*] = 1 if and only if the consensus character at position *p* + *i* in the alignment matches the query character at position *q* + *i*. Indexing a seed at a position in a multiple alignment corresponds to recording the presence of the string consisting of the consensus characters of the multiple alignment. The notion of a seed matching a homology follows from these definitions. We observe that more complex definitions of homology between a query and an alignment are possible, but we do not address them here.

With these definitions, we can formulate our problem as follows:

Given a probabilistic profile, a set of candidate seeds and a budget *B*, index a subset of the candidate set at each position in the profile such that the average number of seeds indexed per position of the database does not exceed *B*. The goal is to maximize the expected number of homologies matched by at least one seed.

Without a budget constraint, this value would obviously be maximized by indexing every seed in the candidate set at every position in the alignment. The value of the budget determines the size of the index and, therefore, the expected number of seed matches; higher values will result in more seed extensions and therefore lead to larger running times. Algorithms that build indexes from sequence databases assign the same set of seeds, with cardinality equal to the budget, to each position in the database. Because we have extra information in the multiple alignment, we can be more flexible. Intuitively, we would like to assign more seeds to positions where this increase is most likely to result in additional detected homologies. We must respect the constraint that the average number of seeds indexed per position does not exceed our budget.

Within a multiple alignment, local rates of conservation vary widely due to both random fluctuations in the number of accumulated mutations as well as differential selective pressures. Both of these effects cause some portions of the multiple alignment to be naturally less likely to contain a match to a homologue in a query sequence. Our algorithm exploits this property to determine how to vary the subset of candidate seeds indexed at each position. Specifically, we use local conservation rates to partition the multiple alignment into regions, which are contiguous blocks of positions whose boundaries reflect changes in the conservation level among species in the alignment ().

Our approach, then, is to change the set of seeds assigned on a region-by-region basis. By assigning fewer seeds to unpromising regions, we can pay more attention to promising regions and increase sensitivity while still respecting our budget. A high-level outline of our algorithm is shown in . First, we convert the multiple alignment into a probabilistic profile. We then use a hidden Markov model to partition the profile into a set of regions where each region gives us the necessary information to evaluate the probability that a homology will exist in that region as well as the probability that a seed will match the homology. Finally, we choose the set of seeds to assign to each position based upon the region to which it belongs. We aim to assign enough seeds to ensure a high probability of finding a match, but not so many that we waste budget that could be used more effectively elsewhere.

We note that in theory one could use a different candidate set of seeds for each position. Such an approach would be particularly useful in cases where highly variable similarity levels strongly influence the optimal shape of the candidate seeds. A typical case is coding exons, where seeds with significant 3-periodicity such as the (0,1,2,8,9,11,12,14,15,17,18) seed (17) have been shown to perform well because they accommodate 3rd-base wobble positions. We do not pursue these options here; rather, we fix the candidate set for all regions and vary only the number of seeds assigned to each position.

Generation of the profile

Our first goal is to convert the alignment into a probabilistic profile. We assume that we are given a phylogenetic tree and the position in the tree at which we expect the query to lie; we root this tree at the query. For each position, we work our way up from the leaves, obtaining *p*_{present} and *p*_{N} for each nucleotide N ∈ {A, C, T, G} at each node in the tree; we obtain *p*_{id} only at the root. A leaf has *p*_{present} = 1 if the corresponding sequence is un-gapped at the position in the multiple alignment and 0 otherwise. Furthermore, it has probability *p*_{N} = 1 for the nucleotide present in the sequence; if *p*_{present} is 0, then we set *p*_{N} = 0 for all N.

As we work up the tree, we obtain *p*_{present} and *p*_{N} independently. For each internal node, we obtain *p*_{N} by applying Felsenstein's algorithm with a Kimura rate matrix (34–36). Since some children can have *p*_{N} = 0 for each nucleotide, we consider only children for which at least one *p*_{N} is positive. The task of obtaining *p*_{present} is somewhat more problematic because there is not as well-developed an evolutionary theory for insertions and deletions as there is for nucleotide substitutions. For a node *n*, our method sets *p*_{present}(*n*) to be a weighted average ∑_{i} *w*_{i} × *p*_{present}(*n*_{i}), where the sum is taken over all children of *n*. We choose the weight assigned to a child to be proportional to the inverse of the branch length between the parent and the child, and normalize the weights to sum to one.
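The bottom-up propagation of the existence probability can be sketched as follows (the tree encoding is our own; a real implementation would share traversal with Felsenstein's algorithm):

```python
def p_present(node):
    """Propagate the existence probability up the tree: a leaf reports 1 if
    un-gapped at this column and 0 otherwise; an internal node takes a
    weighted average of its children, with weights proportional to the
    inverse of the branch length and normalized to sum to one."""
    if "children" not in node:                    # leaf
        return 1.0 if node["ungapped"] else 0.0
    weights = [1.0 / bl for _, bl in node["children"]]
    total = sum(weights)
    return sum((w / total) * p_present(child)
               for w, (child, _) in zip(weights, node["children"]))

# Two leaves at branch lengths 0.1 and 0.4 from the parent: the nearer,
# un-gapped leaf dominates (normalized weights 0.8 and 0.2).
tree = {"children": [({"ungapped": True}, 0.1),
                     ({"ungapped": False}, 0.4)]}
print(p_present(tree))   # 0.8
```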

When we have reached the root, we choose *p*_{id} = max_{N ∈ {A, C, T, G}} *p*_{N}. This is because the maximum *p*_{N} corresponds to the conditional probability that the consensus character of the multiple alignment is present at the homologous position in the query, given that there is a homologous position in the query. Because at a given position in the multiple alignment we choose the consensus character as the character to record in the index, the average value of *p*_{id} obtained in such a way over a region will correspond to the conditional similarity level of the alignment of that region to a homologue in the query.

It turns out that reconstructing the profile in this manner results in empirically slightly inaccurate values. In particular, it tends to overestimate actual values of *p*_{id}, which presents difficulties because small changes in the value of *p*_{id} can have large effects on the estimated hit probability of a seed (12,18). Furthermore, such an estimate for *p*_{present} is heuristic and is not guaranteed to yield accurate predictions.

A complete treatment of profile reconstruction is beyond the scope of this paper. For our purposes, we plotted the predicted versus experimental values of *p*_{present} and *p*_{id} for several species and assessed accuracy. To do this, we began with a multiple alignment, removed one species as the test species, and converted the remaining alignment to a probabilistic profile using the method described above. We then grouped the values of *p*_{present} and *p*_{id} as predicted by the profile into a finite set of buckets, each representing a discrete value.

For each discrete value of *p*_{present} and *p*_{id} as predicted by the profile, we calculated the experimental values of *p*_{present} and *p*_{id} for the test species as follows. To obtain the experimental *p*_{present} for a given discrete predicted value, we first counted the total number of positions in the profile at which *p*_{present} was that value. Of those positions, we counted the number of positions that were un-gapped in the test species. The experimental value was then determined as the latter number divided by the former. The experimental values of *p*_{id} were obtained similarly.
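The bucketing procedure for the experimental *p*_{present} values can be sketched like this (the function name and example data are ours):

```python
from collections import defaultdict

def experimental_p_present(predicted, ungapped, buckets=10):
    """For each discretized predicted p_present value, return the fraction
    of those profile positions that are actually un-gapped in the held-out
    test species.

    predicted : list of predicted p_present values, one per profile column
    ungapped  : parallel list of booleans for the test species
    """
    counts = defaultdict(lambda: [0, 0])          # bucket -> [ungapped, total]
    for p, u in zip(predicted, ungapped):
        b = min(int(p * buckets), buckets - 1)    # clamp p == 1.0 into last bucket
        counts[b][1] += 1
        counts[b][0] += int(u)
    return {b / buckets: hit / tot for b, (hit, tot) in sorted(counts.items())}

print(experimental_p_present([0.92, 0.95, 0.91, 0.15],
                             [True, True, False, False]))
```

The experimental *p*_{id} values are computed the same way, counting positions where the test species carries the consensus character instead of counting un-gapped positions.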

If our predictions were perfectly accurate, resulting plots of experimental versus predicted values would show a linear relationship with slope 1. Plots for *p*_{present} had an extremely high variance and did not appear to follow an obvious pattern, and, based upon these results, we kept our predictions for *p*_{present} unchanged. Improved predictions are likely possible and can only improve the performance of our algorithm; however, they do not appear immediately available. The plots for *p*_{id}, on the other hand, did appear to follow a pattern; shows sample plots for *p*_{id} for two different test species, cat and chicken, obtained from an alignment consisting of human, chimp, baboon, dog, cat, pig, cow and chicken. These plots are similar to plots obtained using other species as test cases.

We do not attempt to address any possible theoretical foundations for the above plots here, as that would take us far from our current focus. Currently, our chief concern is only to convert our initial predictions into values that will work well when given as input to our indexing algorithm, and we found that fitting an exponential curve of the form *g*(*x*) = αe^{βx} + δ to our data was more than adequate for this purpose in practice. We chose α and δ to fix *g*(0) = 0.25 and *g*(1) = 1; this leaves β as an adjustable parameter. Based upon our observations, a value of β = 4 seems to work fairly well for a variety of species, and we fixed this parameter for all of our tests.
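The two boundary conditions determine α and δ in closed form: α + δ = 0.25 and αe^{β} + δ = 1 give α = 0.75/(e^{β} − 1) and δ = 0.25 − α. A minimal sketch of the correction curve:

```python
import math

def make_correction(beta=4.0):
    """Correction curve g(x) = alpha * exp(beta * x) + delta, with alpha and
    delta chosen so that g(0) = 0.25 and g(1) = 1; beta stays adjustable
    (the paper fixes beta = 4 for all tests)."""
    alpha = 0.75 / (math.exp(beta) - 1.0)
    delta = 0.25 - alpha
    return lambda x: alpha * math.exp(beta * x) + delta

g = make_correction()
print(g(0.0), g(1.0))   # 0.25 and 1.0, by construction
```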

Region decomposition

As mentioned above, one advantage of a multiple alignment is that it delineates different regions, each of which can be characterized by a conservation level among the species in the alignment. Therefore, before building an index, our algorithm partitions the probabilistic profile into a set of such regions. For simplicity, we group regions into a finite number of region classes; a region class is a pair of characteristics (*p̄*_{present}, *p̄*_{id}), which represent typical values of *p*_{present} and *p*_{id}, respectively, for each region in the region class. All regions in a region class index the same set of seeds.

Partitioning a profile into regions can be done with the aid of a simple Hidden Markov Model, where the states are region classes that emit values of *p*_{present} and *p*_{id}. Similar ideas have been explored before (19,36); for our purposes here, it is enough that all regions in a region class possess roughly the same properties. It is important that the cardinalities of the region classes be roughly equal so that we have maximum flexibility when assigning seeds; if one region class is enormous, then in order to free enough budget to assign extra seeds to it we must choose to assign fewer seeds to a large number of smaller region classes, which may be undesirable.

We begin this section by considering a conceptually straightforward approach for decomposing a profile in order to introduce the basic ideas of our method. We then describe how our particular approach extends this idea.

Suppose that we build an HMM consisting of one state for each region class (*p̄*_{present}, *p̄*_{id}). Each state emits a position of the profile with values (*p*_{present}, *p*_{id}) with probability proportional to the agreement between those values and the class characteristics, and transitions to every state other than itself with equal probability. This probability can be chosen to ensure that the optimal Viterbi parse (33) gives no region shorter than a minimum length; this length should be large enough to ensure that every region is at least larger than the span of our seeds, and we found that a minimum region length of 64 works reasonably well for seeds of span ~20.

Once this HMM has been constructed, each position can be assigned to the region class corresponding to the state that emits it in the optimal parse obtained via the Viterbi algorithm. Region boundaries then occur between two positions that belong to different region classes.
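A minimal two-state Viterbi decode illustrates the idea. The emission term here is a placeholder squared-distance score over *p*_{id} only, and the fixed `switch_cost` stands in for the transition probabilities that enforce a minimum region length; both are our simplifications, not the paper's exact model.

```python
def viterbi_two_state(obs, centers, switch_cost=8.0, sigma=0.1):
    """Assign each profile position (a single p_id value here, for
    simplicity) to one of two region-class states via Viterbi decoding
    in log space."""
    def emit(state, x):
        return -((x - centers[state]) ** 2) / (2 * sigma ** 2)

    score = [emit(s, obs[0]) for s in (0, 1)]
    back = []
    for x in obs[1:]:
        ptr, new = [], []
        for s in (0, 1):
            stay, move = score[s], score[1 - s] - switch_cost
            ptr.append(s if stay >= move else 1 - s)
            new.append(max(stay, move) + emit(s, x))
        score = new
        back.append(ptr)
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):      # trace back the optimal state path
        state = ptr[state]
        path.append(state)
    return path[::-1]

obs = [0.90, 0.92, 0.88, 0.30, 0.35, 0.32]
print(viterbi_two_state(obs, centers=(0.9, 0.3)))   # [0, 0, 0, 1, 1, 1]
```

The region boundary falls between positions 2 and 3, where the decoded state changes.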

This basic algorithm suffers from the problem that we must determine the set of region classes at the beginning of the algorithm in order to be able to construct the HMM. The weakness of this approach is shown in . If we choose to represent each position in the profile as a point (*p*_{present}, *p*_{id}), then choosing a set of region classes is conceptually related to partitioning the plane in which the positions lie. All points contained in a rectangle in the partition are closest to the center of a specific region class. By fixing the region classes a priori, we will likely make choices that do not fit the structure of the profile. The chief problem occurs when several region classes are empty and one region class contains many more regions than the others; in this case partitioning achieves little.

An alternative method is to adaptively choose region classes to match the manner in which the positions are distributed, as shown in . One way of doing this is to use k-means clustering (37) and choose region classes corresponding to the resulting clusters. This does not translate directly to our problem, however, as choosing region classes in this manner cannot predict how the profile will actually be decoded by the HMM. Instead, we use a related algorithm that somewhat corrects this problem. This is related to the fundamental idea behind wavelets (38), which can analyze data dynamically by decomposing a signal into pieces that can be represented at different scales of resolution.

Our algorithm is shown graphically in . At a high-level, we progressively split the profile into regions belonging to one of two region classes. We perform the decoding at each stage using an HMM exactly as described previously but consisting of only two states. We then decompose each region class further as long as the number of total region classes is less than the maximum number of region classes.

In detail, at each stage we are faced with decomposing a set of regions, all of which belong to the same region class; in the first stage, there is one region consisting of the entire profile. We construct a simple HMM consisting of two states, which correspond to two new region classes; the transition probabilities are set as described above. In principle, we can apply any training algorithm, such as the Baum–Welch or Viterbi training algorithm (33), to learn the emission probabilities of each of the two characteristics *p̄*_{present} and *p̄*_{id}. Once we have determined the values of *p̄*_{present} and *p̄*_{id} for the states in the HMM, we can apply the Viterbi algorithm as previously described to partition the original region class into two region classes. We then decompose each of the two new region classes, stopping when the total number of region classes reaches a user-specified limit; based on our experience a set of region classes of cardinality 40 works well in practice.
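The recursive splitting has a simple skeleton. Below, the two-state HMM decode is abstracted as a callable, and the toy `split_by_mean` stand-in (ours, for demonstration) just separates positions around the class mean:

```python
from collections import deque

def decompose(profile, split_in_two, max_classes=40):
    """Progressively split the profile into region classes: each round,
    split_in_two (the two-state HMM decode in the paper) divides one class
    into two, until the user-specified limit on classes is reached."""
    classes = deque([profile])
    done = []
    while classes and len(classes) + len(done) < max_classes:
        cls = classes.popleft()
        a, b = split_in_two(cls)
        if a and b:                 # genuine split: keep refining both halves
            classes.extend([a, b])
        else:                       # class is homogeneous; keep it as-is
            done.append(cls)
    return done + list(classes)

def split_by_mean(cls):
    mean = sum(cls) / len(cls)
    return [x for x in cls if x < mean], [x for x in cls if x >= mean]

print(decompose([0.1, 0.2, 0.8, 0.9], split_by_mean, max_classes=4))
```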

In practice, learning only the parameter with greater variance before applying the Viterbi algorithm and estimating the other parameter afterwards seems to give slightly better results than learning both parameters at once; this is due in large part to the greater ease in determining the initial parameters of the training algorithm in this case. Such an approach splits region classes more evenly, which is our chief goal in this process.

Seed indexing

Once we have the set of regions and region classes, we can proceed to build the index. Our algorithm must determine for each region which subset of the candidate set of seeds will be indexed. Rather than considering each region individually, we consider each region class individually and index the same subset of seeds for every region in the same region class. When it comes time to actually index the set of seeds that has been assigned to a region class, each region is indexed independently; i.e. we never index positions that cause a seed to span two regions. Our goal is to maximize the expected number of homologies in the multiple alignment matched to a homologue in a query.

We must first infer probable locations of homologies in the alignment, so that such positions can be indexed more aggressively. Our profile is partitioned into a set of regions, not homologies, and we therefore adjust our profile so that maximizing the expected number of regions matched to a homologue in the query approximates the number of homologies matched between the alignment and the query. The change that we must make concerns the length of regions; namely, very long regions are likely to contain more than one homology. In this case, there should be a reward for matching a long region to a homologue that is proportional to the expected number of homologies it contains. In accordance with the idea in (12) that 64 is a good characteristic length for homologies, we split regions of length longer than 64 into a set of contiguous regions of size 64; the last region is allowed to be slightly larger. This enables us to approximate the number of homologies matched within the original region by the number of new regions matched to a homologue.
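The splitting rule is simple enough to state in a few lines (half-open coordinates and the function name are ours):

```python
def split_region(start, end, target=64):
    """Split a region [start, end) into contiguous chunks of size `target`
    (the characteristic homology length); the final chunk absorbs any
    remainder, so it may be slightly larger than `target`."""
    chunks = []
    while end - start >= 2 * target:
        chunks.append((start, start + target))
        start += target
    chunks.append((start, end))     # final chunk: size in [target, 2*target)
    return chunks

print(split_region(0, 200))   # [(0, 64), (64, 128), (128, 200)]
```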

There is only one technical problem with this approach: having many small regions creates many boundaries between the regions. As mentioned above, we do not index positions that cause seeds to span two regions, and, therefore, having many boundaries can decrease our sensitivity if we are not careful. We can circumvent this problem by allowing seeds to span region boundaries if the regions on both sides of the boundary are part of the same region class. Therefore, for the rest of this section we consider only the problem of maximizing the number of regions in the profile matched to a homologue in the query, with the understanding that this approximates the number of homologies matched between the alignment and the query.

We now begin by describing a general method for solving the seed assignment problem, and we then introduce a greedy approximation that we use in practice. To begin, we determine for each region class and for each number of seeds *j* the expected number of regions within that class matched to a homologue in the query if every position in that region class indexes the *j* candidate seeds that optimize the hitting probability for the region class. We will say that a region class is assigned *j* seeds if every region in that class indexes *j* seeds. This can be represented as a table with a row corresponding to each region class and a column corresponding to each possible subset size. Entry (*i*, *j*) represents the expected number of regions in region class *i* matched to a homologue if region class *i* is assigned *j* seeds.

We compute this table as follows. For a region class (*p̄*_{present}, *p̄*_{id}), the probability that a region within that class matches a homologue is the probability that a homologue exists multiplied by the conditional probability that a set of seeds matches the region and its homologue, given that the homologue exists. The former probability can be approximated by *p̄*_{present}, which gives the average probability that a typical region in the region class has a homologue in a query. The latter is the hitting probability, *p*_{hit}, which we can compute using dynamic programming. Algorithms for calculating this (18,20) require that we specify a region length over which the seeds will be indexed as well as the similarity level of the region. Because all regions are of approximate size 64, we can specify the same length for every region; furthermore, the characteristic *p̄*_{id} of a region class approximates the similarity level of a typical region in the region class.

We fill in the first column of the table by computing for each region class *i* the optimal seed. This seed matches any region in the region class to its homologue with approximate conditional probability *p*_{hit}(1), given that the homologue exists; the seed therefore matches each region to a homologue with unconditional probability *p̄*_{present} × *p*_{hit}(1). If the region class has |*C*| regions, the seed is expected to match |*C*| × *p̄*_{present} × *p*_{hit}(1) regions to their homologues; we fill in entry (*i*, 1) with this value. Next, we fill in successive columns using the greedy covering procedure described by (20). Briefly, to fill in column *j* for a region class *i*, we compute the optimal set of seeds of size *j* by augmenting the optimal set of seeds of size *j* − 1 with the seed that optimizes *p*_{hit} given that the previous *j* − 1 seeds do not match. This set of seeds matches with conditional probability *p*_{hit}(*j*), so we fill in the table entry (*i*, *j*) with the value |*C*| × *p̄*_{present} × *p*_{hit}(*j*).
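Filling the table is then a straightforward product per entry. In this sketch, the hitting probabilities are hypothetical inputs (in the paper they come from the dynamic programming of refs 18 and 20), and the class descriptions are our own encoding:

```python
def build_table(classes, max_seeds):
    """Entry (i, j): expected number of regions in class i matched to a
    homologue when j seeds are indexed, i.e. |C| * p_present * p_hit(j).

    classes : list of (region count |C|, characteristic p_present,
              [p_hit(1), ..., p_hit(max_seeds)]) per region class
    """
    table = {}
    for i, (count, p_present, p_hit) in enumerate(classes):
        for j in range(1, max_seeds + 1):
            table[i, j] = count * p_present * p_hit[j - 1]
    return table

classes = [
    (100, 0.9, [0.60, 0.75, 0.85]),   # well-conserved class
    (120, 0.4, [0.30, 0.45, 0.55]),   # poorly conserved class
]
table = build_table(classes, max_seeds=3)
print(table[0, 1], table[1, 3])       # 54.0 and 26.4
```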

Once the table is full, we assign weights and values to each entry. The weight of entry (*i*, *j*) is the total length of all of the regions in region class *i* multiplied by *j*, the number of seeds corresponding to this entry; this represents the amount of budget we use by assigning *j* seeds to region class *i*. The value is the entry in the cell, which represents the expected number of regions in region class *i* matched to a homologue if *j* seeds are indexed. In this manner, each entry can be viewed as an object of a certain weight and value. Any set consisting of at most one object from each row in the table specifies a number of seeds to assign to each region class; if entry (*i*, *j*) is selected, then region class *i* is assigned *j* seeds. The sum of the weights of the objects is equal to the budget used by such an assignment, and the sum of the values of the objects is the expected number of regions matched. Therefore, such a set of objects that has maximum value and total weight less than our budget specifies the optimal set of seeds to index for each region class.

This problem can be solved exactly in time linear in the size of the database by a solution closely related to that of the Knapsack Problem (39). In practice, we use a different algorithm to obtain an approximate solution that is both faster and allows us to avoid computing the entire table of values, which is inefficient since we typically index only a small subset of the candidate seeds for many region classes. We define the density of an object as its value divided by its weight and select objects in order of decreasing density, disallowing more than one object from a given row.

In detail, we proceed as follows. In order to choose the set of objects, we begin with a candidate list of objects corresponding only to the first column in the table. At each step in the algorithm, we examine the candidate list and select the object of highest density. If the object we choose corresponds to cell (*i*, *j*), then we remove it from the candidate list and add the object corresponding to cell (*i*, *j* + 1) to the candidate list. Rather than setting the value of the new object to be the value in cell (*i*, *j* + 1) of the table, though, we set its value to be the difference between the values in the cell (*i*, *j* + 1) and cell (*i*, *j*). Similarly, rather than setting its weight to the total length of all regions in region class *i* multiplied by the size of the seed set, we set its weight to be merely the total length of the regions.

Viewing this step in terms of seed selection, choosing object (*i*, *j*) corresponds to choosing to index *j* seeds for each region in region class *i*. The value of the object added to the candidate set corresponds to the additional number of regions matched to a homologue if *j* + 1 rather than *j* seeds are indexed for region class *i*, and its weight is the amount of budget used by indexing one additional seed. This object can only be added after object (*i*, *j*) is added because the first *j* seeds must be indexed before seed *j* + 1 is indexed. If an object cannot be added because its weight is too high, we remove it from consideration and try to add the object with the next highest density. We stop this process right before we exceed our budget.
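The greedy selection can be sketched with a max-heap over densities; the marginal-gain bookkeeping mirrors the description above (data structures and the toy inputs are ours):

```python
import heapq

def greedy_assign(table, lengths, budget):
    """Greedy seed assignment by decreasing density (value / weight).

    table   : {(i, j): expected regions matched when class i gets j seeds}
    lengths : total region length per class; the weight of one extra seed
    budget  : total index budget, in seed-positions
    Returns the number of seeds assigned to each region class.
    """
    max_j = max(j for _, j in table)
    assigned = {i: 0 for i in lengths}
    heap = []
    for i, w in lengths.items():        # candidates: first column of the table
        heapq.heappush(heap, (-table[i, 1] / w, i, 1, w))
    while heap:
        _, i, j, w = heapq.heappop(heap)
        if w > budget:                  # too heavy: drop, try the next densest
            continue
        budget -= w
        assigned[i] = j
        if j < max_j:                   # marginal object for one more seed
            gain = table[i, j + 1] - table[i, j]
            heapq.heappush(heap, (-gain / lengths[i], i, j + 1, lengths[i]))
    return assigned

table = {(0, 1): 54.0, (0, 2): 67.5, (1, 1): 14.4, (1, 2): 21.6}
assigned = greedy_assign(table, lengths={0: 6400, 1: 7680}, budget=16000)
print(assigned)   # {0: 2, 1: 0}: the denser class gets both seeds
```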

The quality of this approximation depends on the relative weights and values of the objects. For instance, if all objects are of similar weight, the algorithm will perform well, since it is exact in the case that all objects have the same weight. Similarly, if all objects except those with small values can be chosen without exceeding our budget, the greedy algorithm will perform well, because it matters little which subset of low-valued objects it chooses to fill out the budget. In our case, we attempt to partition the region classes into roughly equal sizes, which, in practice, is mostly successful. When some region classes are very large, they typically contain regions with many gaps and consequently have low hitting probabilities. Therefore, in practice either objects of similar weight fill our entire budget, or we are faced with choosing among objects of low value. In both cases our algorithm works adequately.