PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of cmbMary Ann Liebert, Inc.Mary Ann Liebert, Inc.JournalsSearchAlerts
Journal of Computational Biology
 
J Comput Biol. 2009 January; 16(1): 1–18.
PMCID: PMC2858568
NIHMSID: NIHMS183983

Exact Calculation of Distributions on Integers, with Application to Sequence Alignment

Abstract

Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.

Key words: algorithms, computational molecular biology, statistics, alignment, dynamic programming, genomics

1. Introduction

Confidence intervals and other interval estimation problems, and hypothesis testing are two of the most important topics in statistical inference. To perform either of these task requires the obtaining of the probability distribution function or, equivalently, the cumulative distribution function (CDF) of a scalar quantity, or at least the tails of such a distribution. The cumulative distribution function for the random variable S is defined as

equation M1
(1)

where Θ is the set of parameters of the distribution.

In genomics and computational molecular biology, attention has been focused to date primarily on tests of statistical significance. We say that a score is statistically significant if obtaining a score as good as or better than the one observed is unlikely under a specified null distribution with parameters Θ0, i.e., if equation M2 is not more than some threshold. A recursive algorithm has been presented to calculate the cumulative distribution function for tests of significance in motif finding when the scores are integers (Staden, 1989; Moses et al., 2004; Carmack et al., 2007). While in this case integer scores are employed to facilitate computation, integer scores arise naturally in obtaining confidence intervals for many of the most important problems in computational molecular biology, because many of our field's inference problems concern discrete high-dimensional binary unknowns. For example, in RNA secondary structure prediction, a binary variable Aij is set to 1 if base i is predicted to pair with base j, and 0 otherwise. Similarly, in sequence alignment, binary variables are used to predict the alignment of a specific residue in one sequence to a specific residue in another sequence, and in network inference, binary variables are used to predict the edges of a graph. For these problems, a count of the number of differences between a reference solution and an alternate solution is necessarily an integer.

Here we present an approach, applicable to a wide range of problems, for computing the exact probability distribution for a random variable defined on integers and for which dynamic programming algorithms are available. In the context of local sequence alignments, we apply the approach both to the calculation of credibility limits (also known as Bayesian confidence limits), described in the following, and to the calculation of the distribution of scores obtainable by alignment of any given pair of sequences.

1.1. Credibility limits

Hypothesis testing has received much attention in computational biology, but there has been considerably less focus on how to assess the global uncertainty/reliability of the prediction of the many discrete high-dimensional problems. For example, when a BLAST search is completed, not only is an E-value returned, but also an alignment; however, no overall assessment of the uncertainty/reliability of the reported alignment is given. In any discrete setting, a procedure delivering a single answer delivers one feasible solution, a point estimate, selected from the solution ensemble, i.e., from the set of all possible solutions. For high-dimensional discrete space, these ensembles are immense, and thus there is likely to be considerable uncertainty. For example, in the alignment of two sequences of length m and n there are

equation M3
(2)

possible effective alignments (see, e.g., Waterman [1995], p. 188). The number of alignments, i.e., the size of the alignment ensemble, grows rapidly with the length of the sequences being aligned; for example, two small sequences of a length of only 50 residues permit 1029 possible alignments. Thus, it is very unlikely that any procedure can confidently identify the true alignment. In fact, for the local alignments of the human/rodent pairs of sequences described below, the probability of each of the maximum similarity alignments ranged from 1.3 × 10−1 to 3.4 × 10−52, with a median value of 3.1 × 10−18. Even for algorithms that return near-optimal solutions, say the 1000 alignments with the highest alignment scores, the total probability represented can be vanishingly small. Accordingly, it is more appropriate to think of a prediction or an estimate, e.g., an estimating alignment, as a point in the solution ensemble that is selected as a representative solution, a point estimate, around which there can be considerable variability and thus uncertainty.

To asses the uncertainty/reliability of such point estimates, Webb-Robertson et al. (2008) have recently introduced credibility limits. For a hyper-sphere centered on a point estimate, a credibility limit is the radius of the minimum-sized hyper-sphere containing at least a specified fraction of the posterior distribution. When the distance between solutions is the Hamming distance that counts the number of differing discrete unknowns, the credibility limits will always be nonnegative integers. Webb-Robertson et al. (2008) also employ a Bayesian statistical approach to introduce a sampling procedure for estimating the cumulative distribution function of the Hamming distances from a point estimate for discrete high-dimensional inference problems; they use these cumulative distribution functions to obtain the desired credibility limits. As a concrete illustration of their approach, they obtained estimated credibility limits for the local sequence alignment problem. Here, our interest is in finding exact distribution functions for integer scalar-valued functions, such as those defined on alignments.

In the Methods section, we describe the general approach for calculating a distribution on integers. We specialize this to the problem of credibility limits for local sequence alignments, where the distance between two sequence alignments is a count of the number of pairings that the sequences do not have in common. We also specialize the general approach to the problem of giving the distribution of alignment scores for any given pair of sequences. In the Results section, we apply the credibility limits algorithm to obtain exact credibility limits on the alignment of 20 human-rodent sequence pairs, and on a set of alignments of pairs of Drosophila sequences.

2. Methods

We begin this section by describing how we use a dynamic programming algorithm to compute probability distributions on integers. We then specialize the algorithm, for the computation of credibility limits for local sequence alignments, and for the computation of alignment-score distributions.

2.1. General dynamic programming algorithms

Dynamic programming algorithms can be used to find solutions to a problem when there exists a hierarchy of a reasonable number of subproblems, with the feature that any problem or subproblem can be easily computed from solutions to its subproblems. A generic example with a probabilistic interpretation computes a value Z(n) by starting with Z(0) = 1 and then computing equation M4 in order, by combination of the previous values (Table 1).Generally, the final computed value, Z(n), is proportional to the Bayesian marginal posterior probability of a data set R. Via Bayes' rule, Z(n) serves as a normalizing divisor in a manner that shall become apparent presently.

Table 1.
General Probabilistic Dynamic Programming Algorithm

The run time for such an algorithm is often linear, for instance, when computation of the transition probability t(k, R|j) (Table 1)can be achieved in constant time, and when the transition probability is nonzero only when kj is less than some constant. In this case, the memory requirement can be rendered constant as well, since early terms in the sequence equation M8 can be discarded as k grows larger. Also common is a quadratic run time, for instance, when the number of terms in the sum grows with the problem size. Usually in this case, the entire Z array must be retained, and the memory requirement is linear in the problem size.

Note that we allow a transition probability to depend on the data R. In the language of finite-state machines, such a transition often depends upon a subset of the data Rk corresponding to the destination stage k, and a distinction is made between a transition probability and an emission probability. Specifically, we can write

equation M9
(3)

where the first factor is called the transition probability from stage j to stage k and is independent of R, and the second factor is called the emission probability of Rk for that transition. We will not make this distinction in the following; instead we will refer to t(k, R|j) as a transition probability.

2.1.1. General case: the direct approach

Suppose that we want the probability distribution for a score that counts an integer cost s(k, R) for a visit to stage k, and an integer cost s(j, k, R) for a transition from stage j to stage k. A direct way to achieve this is to add a second dimension to the array Z (Table 2).Here the cumulative score is tracked via the second index of the array, and the final unnormalized probability distribution over scores is stored in the last matrix row, equation M10, where smin and smax are, respectively, the minimum and maximum possible scores. These equation M11 values are normalized to probabilities by dividing by their sum, which is the Z(n) value of Table 1.

Table 2.
General Direct Approach to Integer Score Distributions

The original algorithm (Table 1)requires less time and memory than does the newer algorithm (Table 2), because the latter has the loop over values of s. When the requirements for the original algorithm are equation M17 memory and equation M18 time, where α is often 0 or 1, the newer algorithm requires equation M19 memory and equation M20 time, where s# = smaxsmax + 1. Fortunately, with the Fourier transform approach, we will be able to restore the original memory requirement and, in the case of a parallel processing environment, the original time requirement.

2.1.2. General case: the polynomial approach

Instead of adding a second dimension to a Z array that stores real (floating-point) numbers, we can have each element of the original array store a polynomial. That is, for each k, the values equation M21 will instead be stored as the polynomial

equation M22
(4)

where p is the polynomial indeterminant (also known as a variable). (Note that we have slightly generalized the strict mathematical definition of polynomial, in that we are not restricting the integers s to be nonnegative.) The sums in Table 2 can be expressed as operations on these polynomials (Table 3).

Table 3.
General Polynomial Approach to Integer Score Distributions

This approach was applied for motif-significance calculations (Staden, 1989; Moses et al., 2004; Carmack et al., 2007) and falls into the general category of generating-function techniques (Wilf, 2006). Unfortunately, the memory requirement for storage of an array of polynomials is likely to be as costly as is that for storage of a two-dimensional array of real numbers, and the time requirement is not improved, either.

Note that with a slight change of notation, substitution of eit for p, Equation (4) becomes

equation M25
(5)

which, as a function of the indeterminant t, is known as the characteristic function of the probability distribution on s. This association is closely related to the Fourier transform approach, described next.

2.1.3. General case: the Fourier transform approach

Fortunately, the expanded memory requirement can be avoided by use of Fourier transforms. If smin is the known, smallest possible score and smax is the known, largest possible score, and if s# = smaxsmin + 1 then the polynomial can be reconstructed through knowledge of its value at s# distinct values of p. In particular, if the choices for p are the s# complex roots of unity equation M26, then the polynomial can be reconstructed by Fourier transform in equation M27 time and equation M28 memory, a cost which is typically insignificant.

Through its use of a specific complex number value for p (e.g., ωs, for some s) rather than an indeterminant p, the Z array of Table 3 stores complex numbers instead of polynomials. This usage reduces memory and time requirements to the original equation M29 and equation M30, respectively. However, such a computation must be run equation M31 times, once for each equation M32, so the overall run-time remains equation M33. Fortunately, the intermediate values of these calculations need not be kept in memory simultaneously, and thus the overall memory requirement remains at the reduced level of equation M34 (Table 4).

Table 4.
General Fourier Transform Approach to Integer Score Distributions

Note that the the calculations within the outer loop of Table 4 do not depend on the calculations from previous iterations for that loop; thus, the run time can revert to the reduced equation M40 in a parallel-processing environment, with a typically insignificant post-processing time of equation M41 for the Fourier transform.

Note also that some software implementations of Fourier transforms run considerably faster if s# has only small prime factors. To accommodate these implementations, it is reasonable to slightly increase smax so that s# = smaxsmin + 1 has only small prime factors.

2.2. Credibility limits for sequence alignment

In this section, we specialize the general algorithm described above, for the purpose of computing credibility limits for pairwise local sequence alignments. Since the size of the ensemble of alignments of two sequences is typically immense, the probability of any individual alignment, including the optimal alignment, is typically very small. Accordingly, there will usually be substantial uncertainty about the alignment of the two sequences, and credibility limits, also known as Bayesian confidence limits, provide a means by which to represent the uncertainty/reliability of a reported estimating alignment (Webb-Robertson et al., 2008). Specifically, the x% credibility limit of an alignment is the radius of the smallest hyper-sphere that contains x% of the posterior ensemble of alignments. That is, when a pairing distance from the estimating alignment s satisfies the relationship

equation M42
(6)

for some x% (e.g., x% = 90%, 95%, 99%), then s is said to be the x% credibility limit for the estimating alignment A. A small value of s indicates that the probability mass of the ensemble of alignments is tightly compact around the estimating alignment. Accordingly, since there will be at most the short pairing distance s to any alignment within this radius, the estimating alignment quite reliably represents the recommendations of the data. On the other hand, a large value of s indicates that A is not a reliable representative of the distribution. In the Results section, we will present the probability distribution for s in analyses of some human-rodent sequence data and in analyses of some fly sequence data.

In this section, we will define a probability model on alignments. We will then provide an algorithm that computes the probability distribution on pairing distances from an estimating alignment A. That is, for all s, we simultaneously compute the probability that an alignment drawn randomly from the probability distribution will have exactly s pairing differences from the estimating alignment A.

2.2.1. Probability distribution for local sequence alignments

A probability distribution is an essential component of nearly all statistical inference procedures, including hypothesis testing and the identification of confidence limits. Several methods are available for obtaining these distributions. The use of a probabilistic model, such as a pair-HMM model, is the most direct means to obtain these distributions (Durbin et al., 2006). Also, these distributions can be obtained using the implicit relationship of a null distribution to parameters of a scoring matrix (Karlin and Altschul, 1990). Here we employ an approach that allows near-direct interpretation of alignment scores as logarithmic-odds ratios.

To compute local pairwise alignment scores, we employ the default nucleotide sequence alignment scoring scheme of SSEARCH (Pearson, 1991). SSEARCH uses the scoring scheme that assigns Spair(match) = +5 for each nucleotide pairing of identical nucleotides, Spair(mismatch) = −4 for each nucleotide pairing that is a mismatch, sopen = −16 to start an insertion, and sextend = −4 to extend an insertion. Because we are looking at local sequence alignments, an insertion penalty is applicable neither for insertions before the first nucleotide pairing, nor for insertions after the last nucleotide pairing. We convert alignment scores to posterior probabilities via a partition function approach; for a pair of input sequences R = (x, y) and a proposed alignment A, the alignment score sssearch (R, A) is converted to a probability via the Boltzmann distribution:

equation M43
(7)

where T, the temperature, (sometimes written as 1/λ or 1/β) is a scaling factor, and Zend is the partition function that normalizes the probabilities so that the sum over all possible alignments is 1.0:

equation M44
(8)

The distribution of Equation (7) is also known as a Boltzmann distribution.

To show that consideration of the ensemble of alignments is valuable even when a maximum-score alignment is weighted heavily, we have chosen for our experiments a temperature of T = 5/ ln(10) ≈ 2.17, which corresponds to λ = β ≈ 0.46. With this choice, an alignment with a score better than that of another alignment by +5 (e.g., one more nucleotide matching) will be 10 times as probable. However, other choices of T are viable. For instance, we could have selected T = 5.22113, the value we derive for the SSEARCH pairing alignment score matrix, via the approach of Karlin and Altschul (1990) for gapless alignments. Such a choice would have significantly de-emphasized maximum-score alignments.

2.2.2. Basic forward algorithm for Zend

The basic forward algorithm for pairwise sequence alignments is a two-dimensional dynamic programming algorithm, in that values for an alignment of the first i positions of sequence x to the first j positions of sequence y are computed from the values for i′ ≤ i and j′ ≤ j. The form of these computations depends on the state of the pair (xi, yj), e.g., whether xi and yj are paired, or not. For local sequence alignment, we find it convenient to define a BEGIN state, an END state, and seven intermediate states for a finite-state machine that recognizes the regular expression in Table 5.

Table 5.
Regular Expression for Local Sequence Alignment

We need to provide the (unnormalized) transition probabilities from one state to another. Because the contributions to an alignment score sssearch (R, A) are added together, the corresponding contributions to the unnormalized probability exp(sssearch (R, A)/T) are multiplied together. Thus, to model the default SSEARCH parameters, we set a transition probability of topen = exp(sopen/T), with sopen = −16, for a legal transition into the XINS state from a state other than XINS, regardless of the type of nucleotide being inserted. A transition from the XINS state to itself has a transition probability of textend = exp(sextend/T), with sextend = −4. We use the same transition probabilities for legal transitions into the YINS state; these transitions similarly depend on whether the transition is from the YINS state and, in this application, we do not include the common convention that excludes a visit to the YINS state directly from a visit to the XINS state. For legal transitions into the PAIR state, we use a transition probability of tpair(x+, y+) = exp(spair(x+, y+)/T), where spair(x+, y+) = +5 if x+ and y+ are the same nucleotide, and spair(x+, y+) = −4 when x+ and y+ are different nucleotides. Regardless of the identity of the nucleotide being inserted, each legal transition into each of the XBEGIN, YBEGIN, XEND, and YEND states is assigned a transition probability of tfreeopen = exp(sfreeopen/ T), with sfreeopen = 0, unless it is a transition from the same state, in which case it is assigned a transition probability of tfreeextend = exp(sfreeextend/ T), with sfreeextend = 0.

See Table 6 for the algorithm. The final computed value, Zend, is the sum in Equation (8). Thus, via its score sssearch(R, A), we can compute Pr[A|R] for any alignment A from Equation (7).

Table 6.
Sequence Alignment: The Basic Forward Algorithm

The run time of the algorithm is equation M50, and the algorithm can be made to run with equation M51 memory, where m and n are the lengths of the input sequences, because computed values can be discarded once they are no longer needed.

2.2.3. Credibility limits: the Fourier transform approach

The preceding basic forward algorithm can be adapted to compute exact credibility limits. For this problem, we wish to precisely determine the probability that an alignment drawn from the Boltzmann distribution of Equation (7) will differ from an estimating alignment A by exactly some number s of pairings not common to the two alignments. The key to the adaptation is the replacement of the real numbers topen(n), textend(n), tpair(n, n′), tfreeopen(n), and tfreeextend(n) (for nucleotides n and n′) with complex numbers.

There is some latitude in when to count pairing differences between alignments, and we choose to count an (xi, yj) pair when xi is considered. That is, in the dynamic programming algorithm, when xi is an insertion via the XBEGIN, XINS, or XEND state, the transition will contribute s = 1 to the pairing distance if xi is paired in the estimating alignment A (regardless of the yj to which it is paired in A); otherwise, the transition will contribute s = 0 to the pairing distance. On the other hand, when xi is paired with yj via the PAIR state, it will contribute s = 0 to the pairing distance if A has the same pairing, it will contribute s = 1 if A has xi unpaired, otherwise it will contribute s = 2 if A has xi paired to something other than yj.

The algorithm is given in Table 7. The sequential run time of this algorithm is equation M52, because the number of possible alignment scores s# is equation M53. However, the run time of the Fourier approach can be reduced to that of the basic forward algorithm, equation M54, if the s# calculations for Zendk) are done in parallel on separate processors.

Table 7.
Credibility Limits: Fourier Transform Approach

The algorithm utilizes the fact that the maximum number of pairing differences from an estimating alignment A is bounded by the number of pairings in the estimating alignment plus the shorter of the two sequence lengths. Implementations that adapt the generic direct approach of Table 2 and the generic polynomial approach of Table 3 are provided in Section A.1 of the Supplementary Materials (see www.liebertonline.com). Also, note that the values of smin and smid that appear in Table 4 are zero for this algorithm, because zero is the minimum achievable pairing distance; it is the pairing distance achieved for the estimating alignment itself.

2.2.4. Software specifics

Because of its improved memory requirement, we implemented the Fourier transform approach described in Section 2.2.3. We checked its accuracy on small problem instances via an implementation of the direct approach described in Section A.1 of the Supplementary Materials, and we double-checked the smallest instances by hand.

For each sequence pair, we computed the exact credibility distribution for two distinct choices of an estimating alignment: a maximum-score alignment, and the centroid alignment. (For a centroid alignment, termed a “probability alignment” by Miyazawa [1995], two nucleotides are paired if and only if the marginal posterior probability of their being paired exceeds 50%. Also see Ding et al. [2005] and Carvalho and Lawrence [2008] for descriptions of the theoretical and practical advantages exhibited by centroid estimators over maximum-score estimators.) As a first step, we computed maximum-score and centroid alignments themselves. These computations require use of a “backward algorithm,” which is a little more complex than the forward algorithm of Table 6, and which is run from the opposite end of the sequences (see, e.g., Section 4.4 in Durbin et al. [2006]). However, a straightforward implementation using a backward algorithm requires equation M61 memory, because the values that previously were discardable in the forward algorithm implementation now must be retained for use in the backward algorithm. To avoid use of too much computer memory, we employed the optimal checkpointing approach of Newberg (2008a).

See Section B of the Supplementary Materials for additional software considerations.

2.2.5. Problem instances

We analyzed 24 pairs of human-rodent DNA sequences. The first member in each of the 24 pairs was created from 3,000 nucleotides of sequence immediately upstream of a human gene, truncated so as to avoid inclusion of the upstream gene as necessary, and the second member of each pair was created from 3,000 nucleotides of sequence immediately upstream of the orthologous rodent gene, truncated as necessary. We applied RepeatMasker (Smit et al., 19962004) to these sequences and deleted each identified repeat, joining the flanking sequences together under the plausible hypothesis that the sequences were adjoining prior to the time of the insertion of the repeat. The resulting sequences varied in length from 973 to 3000 with a mean length of 2100. See Section C.1 of the Supplementary Materials for more information about this data set.

We also analyzed 20×4 sets of orthologous fly data. For each of 20 genes, we compared 1000 nucleotides of sequence immediately upstream of the gene in Drosophila melanogaster, truncated to avoid including the upstream gene as necessary, with 1000 nucleotides of sequence immediately upstream of the orthologous gene in D. pseudoobscura, truncated as necessary. Similarly, we compared the 20 D. melanogaster sequences with orthologous upstream sequences from D. erecta, D. yakuba, and D. simulans. As with the human-rodent data, we deleted each repeat identified by RepeatMasker, joining flanking sequences. The resulting sequence lengths varied from 139 to 1000 with a mean of 634. See Section C.2 of the Supplementary Materials for more information about this data set.

For each of the 24 + (20 × 4) = 104 sequence pairs, we computed both a maximum-score local sequence alignment and the centroid local sequence alignment, and we then computed the exact credibility distribution for these two estimating alignments.

2.3. Alignment score distribution

We can compute the distribution of integer-valued alignment scores (also known as the density of states) for a given pair of sequences. In Table 8, we have inserted the complex values equation M62, and equation M63into the basic forward algorithm, so as to track the contributions to an alignment score.

Table 8.
Alignment Score Distribution: Fourier Transform Approach

We can compute the alignment score distribution under the assumption that alignments are distributed according to the posterior distribution of Equation (7), or under the null assumption that alignments are equally probable. For the former, we set

equation M76
(9)

For the latter, we set

equation M77
(10)

The value of smid that appears in Table 4 is zero for this algorithm because the alignment score of zero is achievable, for the “empty” alignment, and thus smin ≤ 0 ≤ smax.

See Section A.2 of the Supplementary Materials for direct-approach and polynomial-approach implementations of this algorithm.

3. Results

Even though they score well, there are not many alignments that are nearly the same as a given estimating alignment. Thus, generally, if an alignment is chosen at random according to its Boltzmann probability, there is a low probability that the number of differences from the optimum is small. At the other extreme, alignments that have almost nothing in common with a given estimating alignment tend to score quite poorly, and accordingly the probability that an alignment chosen from the Boltzmann distribution will be a large pairing distance from an estimating alignment is also low. In between these extremes, there is often a single peak in the probability distribution. For instance, Figure 1 shows the exact credibility distributions for the maximum-score alignment and the centroid alignment, for the eighth pair of human-rodent sequences. The example demonstrates a superiority of the centroid alignment over the maximum-score alignment as well as an odd-even parity. See the figure legend for more information.

FIG. 1.
The exact credibility distributions, the distribution of the pairing distances of the posterior weighted ensemble of alignments from the centroid alignment and from the maximum-score alignment, for the eighth pair of sequences in the human-rodent data ...

Multi-modality is not uncommon. See Figure 2 for the example of the second pair of sequences from the human-rodent data. The multi-modality arises from multiple clusters of quality alignments. In this case, the centroid of the smaller cluster has the 205 pairings from the whole-ensemble centroid plus a fairly contiguous block that contains an additional 59 pairings. Ding et al. (,2006), in the context of RNA secondary structures, observe and more deeply analyze the phenomenon of multiple solution clusters.

FIG. 2.
The exact credibility distributions for the second pair of sequences in the human-rodent data set. The distribution for the centroid alignment is slightly left of that for the maximum-score alignment. This example demonstrates the occurrence of multiple ...

Figure 3 depicts the credibility distributions for the fourth pair of sequences from the human-rodent data; this is a case where the centroid and maximum-score alignments fall into different clusters. The whole-ensemble centroid, which is also the centroid of the cluster containing it, has 450 pairings; the other cluster's centroid has 459 pairings; and 363 of the pairings belong to both cluster centroids. The pairings not shared arise from a slight offset in the two clusters' centroid alignments, due to a weakly repeating pattern of approximately ten nucleotides in length. Ding et al. (,2005), in the context of RNA secondary structures, observe and more deeply analyze the phenomenon of a maximum-score solution falling into an improbable cluster.

FIG. 3.
The exact credibility distributions for the fourth pair of sequences in the human-rodent data set. The distribution for the centroid alignment is strongly to the left of that for the maximum-score alignment. The order of the peaks for the two estimating ...

Similar examples exist among the 4 × 20 pairs of fly sequences. Use of fly data set also permits us to compare differing evolutionary distances, because D. pseudoobscura, D. erecta, D. yakuba, and D. simulans have different evolutionary separations from Drosophila melanogaster (Table 9).

Table 9.
A Phylogenetic Tree of Drosophila Species

In Figure 4, for each of the 20 gene ortholog groups, are plotted four points, one for the pairwise alignment of D. melanogaster sequence to sequence from each of the four species D. pseudoobscura, D. erecta, D. yakuba, and D. simulans. Plotted are the 95% credibility limits for the centroid and maximum-score alignments; i.e., a point has an x coordinate indicating the radius from the centroid alignment that is needed if there is to be a 95% probability that an alignment chosen from the Boltzmann distribution will occur within a hyper-sphere of that radius; and the point has a y coordinate for the 95% credibility radius for the maximum-score alignment. Perhaps the most striking feature in this plot is the wide ranges of both the centroid and the maximum-score alignment credibility limits. Even the 95% credibility limits for the sequence alignments of D. melanogaster to D. simulans, the two most closely related species, range from nearly zero base pair differences to over 300 base pair differences. Also, notice that the ranges of the credibility limits of the alignment of D. melanogaster sequence with the other four species' sequences broadly overlap, indicating that, on average, the credibility limits only modestly increase as the paired species become more divergent. Further, observe that most of the points are above the y = x diagonal, indicating that the radius (i.e., the error bar) is tighter for the centroid alignment than it is for the maximum-score alignment. Theory guarantees that the mean pairing distance to the centroid alignment will be lower than the mean pairing distance to the maximum-score alignment, but it makes no such guarantee for fixed percentile credibility limits (Webb-Robertson et al., 2008).

FIG. 4.
For pairwise alignments of D. melanogaster sequences with each of the orthologous sequences in D. pseudo-obscura, D. erecta, D. yakuba, and D. simulans, for each of 20 intergenic regions—80 plotted points in total—we have plotted the 95% ...

Because statistical significance, as measured by p-value, is often used as an indicator of the reliability of a pairwise sequence alignment, we compare significance with credibility in Figure 5. For the 80 pairwise alignments of D. melanogaster sequences with the four other fly species, we depict the relationship of significance to the relative 95% credibility limit, where the relative 95% credibility limit is the ratio of the 95% credibility limit to the number of pairings in the maximum score alignment, and thus is an indication of the fraction of the maximum score alignment that is not reliable. In the first panel of Figure 5, which depicts all sequence pairs for which the p-value is greater than 10−7 (and includes all sequence pairs for which the relative credibility limit exceeds 100%), notice that credibility limits are quite variable and are frequently quite high, and that there is little or no relationship between relative credibility and statistical significance. In the second panel of Figure 5, which depicts all sequence pairs for which the relative credibility limit is less than 100% (and includes all sequence pairs for which the p-value is less than 10−7), notice that relative credibility limits are variable and can be non-negligible even in cases of p-values with extreme significance. Although relative credibility continues to exhibit sizable variability, the mean relative credibility does diminish gradually with diminishing p-value. Statistical significance calculations were made using the technique of Newberg (2008b).

FIG. 5.
Relative credibility versus significance. For each of the 80 pairwise alignments of fly sequence data, we have plotted the relative 95% credibility limit (the ratio of the 95% credibility limit for the maximum score alignment to the number of pairings ...

4. Discussion

Results from our exact credibility limit computations for pairwise local sequence alignments indicate that, in many cases, these distributions are not unimodal, and we see evidence that this deviation from “pretty” results is quite general. In such situations, the choice of a single solution does not capture the wide diversity of the solution space. Instead, a sample of solutions, from the probability distribution over the ensemble of all solutions, can be used to represent the solution space as a whole. For instance, the probability that a solution from the entire ensemble will have a particular feature can be estimated as the fraction of the solutions in the representative sample that have that feature. Ding et al. (2005) find that a sample size of 1000 is sufficient for many computations for RNA secondary structure.

Even if only a single estimating solution is required, exact credibility limit calculations allow scientists to readily appreciate the quality of such a solution. Experiments or subsequent computational analyses connected to data sets that yield solutions with tight credibility limits can be treated as reliable since the data indicate that only small deviations from the estimating alignment are likely. Thus, subsequent analyses dependent upon such alignments can be prioritized ahead of experiments and computational analyses connected to data sets that demonstrate inferior credibility limits. For instance, the 95% credibility limit for the Figure 1 centroid solution is 182 nucleotides, which is 182/1099 ≈ 17% of the number of pairings in the centroid solution. The input sequence lengths are 1769 and 1575 nucleotides. The corresponding limits for Figures 2 and and33 are 114 (114/205 ≈ 56% of centroid solution pairings, with input sequence lengths 1691 and 2219) and 160 (160/438 ≈ 37% of centroid solution pairings, with input sequence lengths 1677 and 1666). These numbers can be used in prioritizing experiments on these sequence pairs.

The use of Fourier transforms significantly reduces memory requirements for the computation of an integer distribution, making feasible calculations that previously required a prohibitive amount of memory. Furthermore, the necessary computations for the integer distribution calculation can, in many circumstances, be done in parallel, even when an underlying algorithm cannot easily be parallelized; thus, there may be little or no run-time penalty for the computation of an integer distribution.

In the example described here we intentionally chose a conservative scaling factor (i.e., a low temperature) that puts strong emphasis on the highest scoring alignments. While this choice makes the point that there is considerable uncertainly in alignments even under very conservative assumptions, this choice yields unrealistic features. For example, using the recipe for conversion of alignment scores to pair-HMM parameters from Durbin et al. (2006, pp. 98–99) this choice of temperature gives the expected lengths of indels as the unrealistically short 1.18 nucleotides. (However, note that Durbin et al. [2006] cautions the reader that scores that work well with maximum-score algorithms may not yield reasonably valued pair-HMM parameters.)

For poor significance values (first panel of Figure 5)there is little or no relationship between relative credibility limits and significance; in particular, some pairwise alignments have a reasonably low credibility limit despite having inferior p-values. For better significance values (second panel of Figure 5)we see a downward trend of mean relative credibility with diminishing p-value, but several alignments exhibit poor relative credibility despite having extremely good p-values. Thus, we caution that the quality of the p-value of an alignment is not a strong indicator of the reliability of that alignment. This should not be too surprising; interval testing and hypothesis testing are solutions to different problems.

For most of these calculations, numerical underflow is an issue. See Section B of the Supplementary Materials for some information on how we addressed this issue.

5. Conclusion

We have presented a memory-efficient approach for calculation of the exact distribution for integer scores and, in the context of pairwise local sequence alignments, we have applied it to find the alignment score distribution and Bayesian confidence limits. We have discovered that these distributions of credibility limits are often not unimodal, and we hypothesize that an analysis of their features will lead to a deeper understanding of the underlying science.

Supplementary Material

Supplemental data:

Acknowledgments

We thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center for the computing resources to make these calculations. This research was supported by the National Institutes of Health/National Human Genome Research Institute (grants K25 HG003291 to L.A.N. and R01 HG001257 to C.E.L.), and by the U.S. Department of Energy (grant DE-FG02-04ER63924 to C.E.L.).

Disclosure Statement

No competing financial interests exist.

References


Articles from Journal of Computational Biology are provided here courtesy of Mary Ann Liebert, Inc.