Needleman–Wunsch pairwise sequence alignment [1
] is known to be sensitive to parameter choices. To illustrate the problem, consider the eighth intron of the Drosophila melanogaster CG9935-RA
gene (as annotated by FlyBase [2
]) located on chr4:660,462–660,522 (April 2004 BDGP release 4). This intron, which is 61 base pairs long, has a 60 base pair–ortholog in Drosophila pseudoobscura
. The ortholog is located at Contig8094_Contig5509:4,876–4,935 in the August 2003 freeze 1 assembly, as produced by the Baylor Genome Sequencing Center.
Using the basic 3-parameter scoring scheme (match M, mismatch X, and space penalty S), these two orthologous introns have the following optimal alignment when the parameters are set to M = 5, X = −5, S = −5:
However, if we change the parameters to M = 5, X = −6, and S = −4, then the following alignment is optimal:
Note that a relatively small change in the parameters produces a very different alignment of the introns. This problem is exacerbated with more complex scoring schemes, and is a central issue with whole genome alignments produced by programs such as MAVID [3
] or BLASTZ/MULTIZ [4
]. Indeed, although whole genome alignment systems use many heuristics for rapidly identifying alignable regions and subsequently aligning them, they all rely on the Needleman–Wunsch algorithm at some level. Dependence on parameters becomes an even more crucial issue in the multiple alignment of more than two sequences.
was introduced by Waterman, Eggert, and Lander [5
] and further developed by Gusfield et al. [6
] and Fernandez-Baca et al. [8
] as an approach for overcoming the difficulties in selecting parameters for Needleman–Wunsch alignment. See [9
] for a review and [10
] for an algebraic perspective. Parametric alignment amounts to partitioning the space of parameters into regions. Parameters in the same region lead to the same optimal alignments. Enumerating all regions is a non-trivial problem of computational geometry. We solve this problem on a whole genome scale for up to five free parameters.
Our approach to parametric alignment rests on the idea that the score of an alignment is specified by a short list of numbers derived from the alignment. For instance, given the standard three-parameter scoring scheme, we summarize each alignment by the number m of matches, the number x of mismatches, and the number s of spaces in the alignment. The triple (m,x,s) is called the alignment summary. As an example, consider the above pair of orthologous Drosophila introns. The first (shorter) alignment has the alignment summary (33,23,9) while the second (longer) alignment has the alignment summary (36,10,29).
Remarkably, even though the number of all alignments of two sequences is very large, the number of alignment summaries that arise from Needleman–Wunsch alignment is very small. Specifically, in the example above, where the two sequences have lengths 61 and 60, the total number of alignments is 1,511,912,317,060,120,757,519,610,968,109,962,170,434,175,129 ≈ 1.5×1046.
There are only 13 alignment summaries that have the highest score for some choice of parameters M,X,S. For biologically reasonable choices, i.e., when we require M > X and 2S < X, only six of the 13 summaries are optimal. These six summaries account for a total of 8,362 optimal alignments ().
The 8,362 Optimal Alignments for Two Drosophila Intron Sequences
Note that the basic model discussed above has only d = 2 free parameters, because for a pair of sequences of lengths l,l all the summaries (m,x,s) satisfy
This relation holds with l + l′ for the six summaries in . shows the alignment polygon, as defined in the section “Alignment polytopes,” in the coordinates (x,s).
The Alignment Polygon for Our Two Introns Is Shown on the Left
In general, for two DNA sequences of lengths l
the number of optimal alignment summaries is bounded from above by a polynomial in l
of degree d
+ 1), where d
is the number of free parameters
in the model [9
]. For d
= 2, this degree is 0.667, and so the number of optimal alignment summaries has sublinear growth
relative to the sequence lengths. Even for d
= 5, the growth exponent d
+ 1) is only 3.333. This means that all
optimal alignment summaries can be computed on a large scale for models with few parameters.
The growth exponent d
+ 1) was derived by Gusfield et al. [6
] for d
= 2 and by Fernandez-Baca at al. [8
] and Pachter-Sturmfels [10
] for general d
. can be computed using the software XPARAL [7
]. This software works for d
= 2 and d
= 3, and it generates a representation of all optimal alignments with respect to all reasonable choices of parameters. Although XPARAL has a convenient graphical interface, it seems that this program has not been widely used by biologists, perhaps because it is not designed for high throughput data analysis and the number of free parameters is restricted to d
In this paper, we demonstrate that parametric sequence alignment can be made practical on the whole-genome scale, and we argue that computing output such as can be very useful for comparative genomics applications where reliable alignments are essential. To this end, we introduce a mathematical point of view, based on the organizing principle of convexity,
which was absent in the earlier studies [5
]. Our advances rely on new algorithms, which are quite different from what is implemented in XPARAL, and which perform well in practice, even if the number d
of free parameters is greater than three.
Convexity is the organizing principle that reveals the needles in the haystack. In our example, the “haystack” consists of more than 1046
alignments, and the “needles” are the 8,362 optimal alignments. The summaries of the optimal alignments are the vertices of the alignment polytope
. The alignment polytope is the convex hull
of the summaries of all (exponentially many) alignments. Background on convex hulls and how to compute the alignment polytopes are provided in the section From Genomes to Polytopes (see also [11
]). Thus, parametric alignment of two DNA sequences relative to some chosen pair hidden Markov model (PHMM) means constructing the alignment polytope of the two sequences. The dimension of the alignment polytope is d,
the number of free model parameters. For d
= 2 (the basic model), the polytope is a convex polygon, as shown in for the pair of introns above.
The basic model is insufficient for genomics applications. More realistic PHMMs for sequence alignment include gap penalties. We consider three such models. The symmetries of the scoring matrices for these models are derived from those of the evolutionary models known as Jukes–Cantor (d = 3), Kimura-2 (d = 4), and Kimura-3 (d = 5). The models are reviewed in the section “Models, alignment summaries, and robustness cones.”
Our contribution is the construction of a whole genome parametric alignment in all four models for D. melanogaster and D. pseudoobscura. Our methods and computational results are described in the next section. Three biological applications are presented in the section From Polytopes to Biology. A discussion follows in the Discussion section.
From Genomes to Polytopes
Our main computational result is the construction of a whole genome parametric alignment for two Drosophila genomes. This result depended on a number of innovations. By adapting existing orthology mapping methods, we were able to divide the genomes into 1,999,817 pairs of reliably orthologous segments, and among these we identified 877,982 pairs for which the alignment is uncertain. We computed the alignment polytopes of dimensions two, three, and four for each of these 877,982 sequence pairs, and of dimension five for a subset of them. The methods are explained in the section “Alignment polytopes.” The vertices of these polytopes represent the optimal alignment summaries and the robustness cones. These concepts are introduced in the section “Models, alignment summaries, and robustness cones.” Computational results are presented in the section “Computational results.”
The orthology mapping problem
for a pair of genomes is to identify all orthologous segments between the two genomes. These orthologous segments, if selected so as not to contain genome rearrangements, can then be globally aligned to each other. This strategy is frequently used for whole genome alignment [12
], and we adapted it for our parametric alignment computation.
MERCATOR is an orthology mapping program suitable for multiple genomes that was developed by Dewey et al. [14
]. We applied this program to the D. melanogaster
and D. pseudoobscura
genomes to identify pieces for parametric alignment. The MERCATOR strategy for identifying orthologous segments is as follows. Exon annotations in each genome are translated into amino acid sequences and then compared with each other using BLAT [15
]. The annotations are based on reference gene sets, and on ab initio predictions (see Materials and Methods
). The resulting exon hits are then used to build a graph whose vertices correspond to exons, and with an edge between two exons if there is a good hit. A greedy algorithm is then used to select edges in the graph that correspond to runs of exons that are consistent in order and orientation.
The MERCATOR orthology map for D. melanogaster
and D. pseudoobscura
has 2,731 segments. However, in order to obtain a map suitable for parametric alignment, further subdivision of the segments was necessary. This subdivision was accomplished by the additional step of identifying and fixing exact matches of length at least 10 bp (see Materials and Methods
We derived 1,116,792 constraints, which are of four possible types: 1) exact matching non-coding sequences, 2) ungapped high scoring aligned coding sequences, 3) segment pairs between two other constraints where one of the segments has length zero, so the non-trivial segment must be gapped, and 4) single nucleotide mismatches that are squeezed between other constraints.
We then removed all segments where the sequences contained the letter N (which means the actual sequence is uncertain). This process resulted in 877,982 pairs of segments for parametric alignment. The lengths of the D. melanogaster segments range from one to 80,676 base pairs. The median length is 42 bp and the mean length is 99 bp. In all, 90.4% of the D. melanogaster genome and 88.7% of the D. pseudoobscura genome were aligned by our method.
Models, alignment summaries, and robustness cones.
For each of the 877,982 pairs of orthologous segments, we constructed all optimal Needleman–Wunsch alignments, with respect to various scoring schemes that are derived from PHMMs. We considered models with two, three, four, and five parameters. See [16
] for a review of PHMMs and their relationship to the scoring schemes typically used for aligning DNA sequences. In what follows, we only refer to the scores, which are the logarithms of certain ratios of the parameters of the PHMM. Our four models are specializations of the general 33-parameter model [11
] that incorporates mutations, insertions, and deletions of DNA sequences. It is customary to reduce the dimension by assuming that many of the 33 parameters are equal to each other.
The basic model, discussed in the Introduction, has three natural parameters, namely, M for match, X for mismatch, and S for space. If the numbers M, X, and S are fixed, then we seek to maximize M · m + X · x + S · s, where (m,x,s) runs over the summaries of all alignments. In light of Equation 1, this model has only two free parameters and there is no loss of generality in assuming that the match score M is zero. From now on we set M = 0 and take X and S as the free parameters. We define the alignment summary to be the pair (x,s).
Following the convention of Pachter and Sturmfels [11
], we summarize a scoring scheme with a 5 × 5 matrix w
whose rows and columns are both indexed by A, C, G, T, and -. The matrix w
for the basic model is the leftmost matrix in , and it corresponds to the Jukes–Cantor model of DNA sequence evolution.
The Jukes–Cantor Matrix, Kimura-2 Matrix, and Kimura-3 Matrix
Our 3d model is the most commonly used scoring scheme for computing alignments. This model includes the number g of gaps. A gap is a complete block of spaces in one of the aligned sequences; it either begins at the start of the sequence or is immediately preceded by a nucleotide, and it either follows the end of the sequence or is succeeded by a nucleotide. The 3d alignment summary is the triple (x,s,g). The score for a gap, G, is known as the affine gap penalty. If X, S, and G are fixed, then the alignment problem is to maximize X · x + S · s + G · g where (x,s,g) runs over all 3d alignment summaries. The parametric version is implemented in XPARAL. Introducing the gap score G does not affect the matrix w, which is still the leftmost matrix in .
Our 4d model
is derived from the Kimura-2 model
of sequence evolution. The 4d alignment summary
is the vector (x
) where s
are as above, x
is the number of transversion mismatches
(between a purine and a pyrimidine or vice versa) and y
is the number of transition mismatches
(between purines or between pyrimidines). The four parameters are X, Y, S, and G
. The matrix w
of scores, as specified in [11
], is now the middle matrix in .
Our 5d scoring scheme
is derived from the Kimura-3 model
. Here the matrix w
is the rightmost matrix in . The 5d alignment summary
is the vector (x
), where s
counts spaces, g
counts gaps, x
is the number of mismatches y
is the number of mismatches
is the number of mismatches
. Thus, the 5d alignment summaries of the two Drosophila
intron alignments at the beginning of the Introduction are (4,10,9,9,8) and (3,3,4,29,17). Even the 5d model does not encompass all scoring schemes that are used in practice. See the section “Assessment of the BLASTZ alignment” for a discussion of the BLASTZ scoring matrix [17
] and its proximity to the Kimura-2 model.
Suppose we are given a specific alignment of two DNA sequences. Then the robustness cone of that alignment is the set of all parameter vectors that have the following property: any other alignment that has a different alignment summary is given a lower score. As a mathematical object, the robustness cone is an open convex polyhedral cone in the space Rd of free parameters.
An alignment summary is said to be optimal, relative to one of our four models, if its robustness cone is not empty. Equivalently, an alignment summary is optimal if there exists a choice of parameters such that the Needleman–Wunsch algorithm produces only that alignment summary. Such a parameter choice will be robust, in the sense that if we make a small enough change in the parameters then the optimal alignment summary will remain unchanged. Each robustness cone is specified by a finite list of linear inequalities in the model parameters.
For example, consider the first alignment in the Introduction. Its 2d alignment summary is the pair (x,s) = (23,9), labeled D in . The robustness cone of this summary is the set of all points (X,S) such that the score 23X + 9S is larger than the score of all other alignments summaries other than (23,9). This cone is specified by the two linear inequalities S > X and 4S < 3X.
If we fix two DNA sequences, then the robustness cones of all the optimal alignments define a partition of the parameter space, Rd. That partition is called the alignment fan of the two DNA sequences. shows the (biologically relevant part of the) alignment fan of two Drosophila introns in the 2d model. While this alignment fan has only 13 robustness cones, the alignment fan of the same introns has 76 cones for the 3d model, 932 cones for the 4d model, and 10,009 cones for the 5d model. These are the vertex numbers in .
Face Numbers of the Alignment Polytopes for the Intron Sequences from the Beginning of the Introduction
The convex hull
of a finite set S
of points in Rd
is the smallest convex set containing these points. It is denoted conv(S
) and called a convex polytope
. There exists a unique smallest subset V S
for which conv(S
) = conv(V
). The points in V
are called the vertices
of the convex polytope. The vertices lie in higher-dimensional faces
on the boundary of the polytope. Faces include edges,
which are one-dimensional, and facets,
which are (d
− 1)-dimensional. Introductions to these concepts can be found in the textbooks [18
]. By computing the convex hull
of a finite set S Rd,
we mean identifying the vertices and the facets of conv(S
), and, if possible, all faces of all dimensions.
Consider one of the four models discussed in the previous section. The alignment polytope
of two DNA sequences is the convex polytope conv(S
is the set of alignment summaries of all alignments of these two sequences. For instance, the 3d alignment polytope of two DNA sequences is the convex polytope in R3
that is formed by taking the convex hull of all alignment summaries (x
). shows the 3d alignment polytope for the two sequences in the Introduction. Its projection onto the (x
)-plane is the polygon depicted in .
The Third Alignment Polytope of Our Two Drosophila Introns Has 76 Vertices
It is a basic fact of convexity that the maximum of a linear function over a polytope is attained at a vertex. Thus, an alignment of two DNA sequences is optimal if and only if its summary is in the set V of vertices of the alignment polytope. The Needleman–Wunsch algorithm efficiently solves the linear programming problem over this polytope. For instance, for the 3d model with fixed parameters, the alignment problem is the linear programming problem
For a numerical example, consider the parameter values X = −200, S = −80, and G = −400, which represent an approximation of the BLASTZ scoring scheme (see the “Assessment of the BLASTZ alignment” section). The solution to Equation 2 is attained at the vertex (x,s,g) = (30,5,2), which is the 3d summary of the following alignment of our two Drosophila introns
The 3d summary of this alignment is the marked vertex in .
The problem of computing a parametric alignment is now specified precisely. The input consists of two DNA sequences. The output is the set of vertices and the set of facets of the alignment polytope conv(S
). See also [11
]. We note that the robustness cone of an optimal alignment summary is the normal cone
of the polytope at that vertex. The alignment fan is the normal fan
of the alignment polytope. See [11
] for definitions of these concepts.
We considered two different methods for constructing all alignment polytopes for the two Drosophila genomes: polytope propagation and incremental convex hull. In our study we found that polytope propagation was outperformed by the incremental convex hull algorithm, especially for the higher dimensional models.
We now briefly outline the two methods. Polytope propagation for sequence alignment is the Needleman–Wunsch algorithm with the standard operations of plus and max replaced by Minkowski sum and polytope merge (convex hull of union). The polytope propagation algorithm was introduced in [10
The incremental convex hull algorithm, on the other hand, gradually builds the alignment polytope by successively finding new optimal alignment summaries, the vertices of the polytope. To find the new optimal summaries, the algorithm repeatedly calls a Needleman–Wunsch (NW) subroutine that is an efficient implementation of the classical Needleman–Wunsch algorithm. For fixed values of the parameters, this subroutine returns an optimal alignment summary. For instance, for the 3d model, the input to the NW subroutine is a parameter vector (X,S,G) and the output is an optimal summary (x,s,g).
Suppose we have already found a few optimal alignment summaries, by running the NW subroutine with various parameter values. We let P
be the convex hull of the summaries in Rd,
and we assume that P
is already d
-dimensional. We maintain a list of all vertices and facets of P
. Each facet is either tentative
where being confirmed means that its affine span is already known to be a facet-defining hyperplane of the final alignment polytope. In each iteration, we pick a tentative facet of P
and an outer normal vector U
of that facet. We then call the NW subroutine with U
as the input parameter. The output of the NW subroutine is an optimal summary v
. If the optimal score U
equals the maximum of the linear function U
over all w
then we declare the facet to be confirmed. Otherwise, the score U
is greater than the maximum and we replace P
by the convex hull of P
. This convex hull computation utilizes the beneath–beyond construction
], which erases some of the tentative facets of the old polytope and replaces them with new tentative facets.
The algorithm terminates when all facets are confirmed. The current polytope P at that iteration is the final alignment polytope. The number of iterations of this incremental convex hull algorithm equals the number of vertices plus the number of facets of the final polytope P. So for a given model, the running time of the incremental convex hull algorithm scales linearly in the size of the output. This was confirmed in practice by our computations ().
Observed Running Times of the Incremental Convex Hull Algorithm for Computing Alignment Polytopes
Given an alignment polytope, there are various subsequent computations one may wish to perform. For instance, we may be interested in the robustness cones at the vertices. To get an irredundant inequality representation of a robustness cone, it suffices to know the edges emanating from the corresponding vertex. Thus it is useful to also compute the edge graph of each of our polytopes.
Using our implementation of the incremental convex hull algorithm described above, we computed the 2d, 3d, and 4d polytopes for each of the 877,982 segment pairs. We also computed 5d polytopes in many cases. These polytopes are available for downloading and viewing at http://bio.math.berkeley.edu/parametric/
We empirically determined the expected CPU time to construct alignment polytopes. The results are reported in . As expected, the running time of the incremental convex hull algorithm scales linearly with the number of vertices plus facets. The running time of a single Needleman–Wunsch subroutine call scales linearly with the product ll′ of the sequence lengths l and l′.
To effectively compute these polytopes, not only must we have an algorithm that runs quickly as a function of the number of vertices and facets, but also the number of vertices and facets must themselves be small. The theoretical bounds discussed in the Introduction ensure that these numbers grow polynomially, for any fixed d. In our computations we found that the numbers of vertices and facets of alignment polytopes are quite manageable even in dimensions 4 and 5. Averages of the numbers we actually observed are reported in .
Averages of the Number V of Vertices and the Number F of Facets of Alignment Polytopes
We conclude our summary of computational results with a look at the alignment polytopes of the orthologous pair of introns in the Introduction. The sequence lengths are l = 60 and l′ = 61. The 2d alignment polytope is shown in . The 3d alignment polytope is shown in . reports statistics for the 2d, 3d, 4d, and 5d alignment polytopes for this sequence pair.
From Polytopes to Biology
We describe three applications of our whole genome parametric alignment. First, we discuss how alignment polytopes are useful for parameter selection, and we assess the BLASTZ alignment of D. melanogaster
and D. pseudoobscura
. We then revisit the cis-regulatory element study in [20
], and we determine alignments that identify previously missed conserved binding sites. Finally, we examine the problem of branch length estimation and provide a quantitative analysis of the dependence of branch length estimates on alignment parameters.
Assessment of the BLASTZ alignment.
A key problem in sequence alignment is to determine appropriate parameters for a scoring scheme. The standard approach is to select a model and then to identify parameter values that are effective in producing alignments that correctly align certain features. For example, the BLASTZ scoring matrix [17
] was optimized for human–mouse alignment by finding parameters that were effective in aligning genes in the HOXD
region. The BLASTZ scoring scheme is given by a scoring matrix called HOXD70 [17
together with a space score of −30, and a gap score of −400. Although the HOXD70 matrix has six distinct entries, it can be approximated by a Kimura-2 matrix (), since 91 is close to 100, and 114 is close to 123 and 125.
The alignment polytope of a pair of DNA sequences is a representation of all possible alignments organized according to a scoring scheme. Thus, our results and methodology make it possible, for the first time, to identify parameters that are guaranteed to optimize the alignment according to desired criteria. Moreover, our results offer biologists a mathematical tool for systematically assessing whether a proposed single alignment is suitable for its intended purpose.
We initiated such a study for the BLASTZ alignment [4
] of D. melanogaster
and D. pseudoobscura,
which is available at http://genome.ucsc.edu
. This alignment is widely used by biologists who study Drosophila
. Although the BLASTZ alignment procedure is based on an initial “seeding” procedure (similar to our identification of constrained segment pairs), the alignments are then constructed using the Needleman–Wunsch algorithm with the HOXD70 matrix.
Recall that our orthology map consisted of 1,999,817 segment pairs: 1,116,792 consisted of segments for which we fixed the alignment (constrained segment pairs), and 883,025 were unconstrained segment pairs for which we constructed alignment polytopes. We found that the BLASTZ alignment agreed with 623,710 of our unconstrained segment pairs, of which 622,173 did not contain Ns. For each of these 622,173 segment pairs, we computed the 2d, 3d, and 4d alignment summaries for the BLASTZ alignments, and we determined whether or not they are optimal for some choice of model parameters.
We found that 269,186 (43.3%) of the BLASTZ alignments are vertices of the 3d polytope, but not the 2d polytope, and 201,982 (32.5%) are vertices of both the 2d and 3d polytopes. Only 151,004 (24.3%) of the BLASTZ alignments are not vertices of either the 2d or 3d alignment polytopes. In summary, our computations show that 32.5% of the BLASTZ alignments correspond to vertices of the 2d polytope and 75.7% correspond to vertices of the 3d polytope. These numbers are even higher for the 4d and 5d alignment polytopes.
Curiously, there is precisely one sequence pair where the BLASTZ alignment is a vertex of the 2d polytope but not the 3d polytope. This alignment is
The 3d summary of this alignment is (11,3,2), which is the midpoint of the edge with vertices (11,3,1) and (11,3,3) on the 3d polytope. Hence this alignment is not optimal for any choice of parameters (X,S,G). However, it is optimal for the 2d model since the edge maps onto the vertex (11,3) of the 2d polygon.
Our results show that not only does the BLASTZ alignment agree well with our constrained segment pairs, but also the BLASTZ alignments are mostly vertices of the three dimensional polytopes, even on the unconstrained segment pairs. This suggests that there may be a statistical advantage to working with one of the lower dimensional models, and also indicates that the polytopes may be useful for finding parameters. We illustrate this point of view in the next section, where we identify vertices in the alignment polytope (and therefore parameter robustness cones) that are suitable for the alignment of cis-regulatory elements. Any user of the BLASTZ alignments may now use the alignment polytopes we provide to assess whether or not the fixed choice of the HOXD70 matrix is the right one for their particular biological application.
Conservation of cis-regulatory elements.
A central question in comparative genomics is the extent of conservation of cis-regulatory elements and the implications for genome function and evolution. Using our parametric alignment, we discovered that cis-regulatory elements may be more conserved between D. melanogaster
and D. pseudoobscura
than previously thought. Specifically, we used our alignment polytopes to examine the degree of conservation for 1,346 transcription factor binding sites [21
] available at http://www.flyreg.org
(we excluded 16 sites that were located in segment pairs containing N
s). The 1,346 sites include the 142 sites examined by Richards et al. [20
] in their comparison of D. pseudoobscura
and D. melanogaster
Specifically, for each of the 1,346 elements, we identified the orthologous segment pairs from our orthology map that contained the elements. We then extracted the polytopes from our whole genome parametric alignment. For each polytope, we determined an optimal alignment for which the number of matching bases of the corresponding element was maximized.
As an example, consider the transcription factor Adf1. It binds to a cis-regulatory element at chr3R:2,825,118–2,825,144 in D. melanogaster (Adf1-> Antp:06447 in the flyreg database). The BLASTZ alignment for this element is
This alignment suggests that the D. melanogaster cis-regulatory element is not conserved in D. pseudoobscura. However, there are many optimal alignments that indicate that this element is conserved. Examining our constrained segment pairs, we found that the prefix TGTG was at the end of a 13-bp exact match. The remaining D. melanogaster element was part of a segment pair which has 813 distinct optimal alignments in the 3d model. Among these, we found the following alignment with parameters G = −3, S = −8, X = −18:
Note that we include the TGTG prefix to show a complete alignment of the cis-regulatory element. The second alignment has 24 matches instead of the BLASTZ alignment with eight. The number of matches can be used to calculate the percent identity for an element as follows:
Percent identity was used in [20
] as a criterion to determine whether binding sites are conserved. The BLASTZ alignment has 30% identity and the alignment with 24 matches has 89% identity. It is an optimal alignment with the highest possible percent identity. After examining all 813 optimal alignments, it appeared to us that the following alignment (obtained with G
= −882, S
= −87, X
= −226) is more reasonable, even though it has a lower percent identity (67%):
This alternative alignment suggests that the percent identity criterion may not be the best way to judge the conservation of elements. Regardless, we believe our parametric alignment indicates that in this particular case, the D. melanogaster cis-regulatory element is likely to have been conserved in D. pseudoobscura.
Our overall results are summarized in . We found that parameters can be chosen so as to significantly increase the number of matches for cis-regulatory elements. The “optimal parameters” row in the table shows results for the case where parameters were chosen separately for each segment pair so as to maximize the percent identity of the cis-regulatory elements. The “fixed parameters” row shows results when one parameter was selected (optimally) for all segment pairs simultaneously (this was only computed for the 2d model). Note that the mean per-site percent identity reported in [20
] was 51.3%, considerably lower than what we found using the whole genome parametric alignment (even for the 2d model).
Cis-Regulatory Element Conservation
Our results seem to indicate that cis-regulatory elements are more conserved between D. melanogaster and D. pseudoobscura than previously thought. The alignment polytopes should be a useful tool for further investigation of the extent of conservation of cis-regulatory elements among the Drosophila genomes.
The Jukes–Cantor distance function.
An important problem in molecular evolution is the estimation of branch lengths from aligned genome sequences. A widely used method for estimating branch lengths is based on the Jukes–Cantor model of evolution [22
]. Given an alignment of two sequences of lengths l,l′,
with 2d alignment summary (x
), one computes the Jukes–Cantor distance
of the two genomes as follows:
] for a derivation of this expression, which is also known as the Jukes–Cantor correction
of the two aligned sequences. The Jukes–Cantor distance can be interpreted as the expected number of mutations per site.
Because the Jukes–Cantor distance dJC
) depends on the underlying pairwise sequence alignment summary, which in turn depends on the alignment parameters, it is natural to ask how the branch length estimate depends on the parameters in a 2d-scoring scheme. We therefore introduce the Jukes–Cantor distance function
which is the function JC
→ [0,∞) given by (X
is the alignment summary maximizing X
We computed the Jukes–Cantor distance function JC for the entire genomes of D. melanogaster and D. pseudoobscura. As the result of this computation, we now know the Jukes–Cantor distances for all whole genome alignments that are optimal for some choice of biologically reasonable parameters (X,S).
The notion of “optimal” used here rests on the following precise definitions. Given parameters (X,S), the optimal 2d alignment summary (x,s) for the two genomes is the sum of the optimal summaries of all 877,982 unconstrained segment pairs plus the sum of the alignment summaries of the non-coding constrained segment pairs (which do not depend on the parameters). We determined that the constrained segment pairs contained 91,355 mismatches and 16,339,305 matches. The genome alignment polytope is the Minkowski sum of the 877,982 alignment polytopes. The vertices of the genome alignment polytope correspond to optimal summaries of whole genome alignments.
We computed the genome alignment polytope for the 2d model. Remarkably, this convex polygon, which is the Minkowski sum of close to one million small polygons as in , was found to have only 1,183 vertices. Moreover, of the 1,183 vertices of the genome alignment polytope, only 838 correspond to biologically reasonable parameters (X
< 0, 2S
). The finding that there are so few vertices constitutes a striking experimental validation of Elizalde's Few Inference Functions Theorem [11
] in the context of real biological data.
The Jukes–Cantor distance function JC of D. melanogaster and D. pseudoobscura is a piecewise constant function on the (X,S)-plane. Indeed, JC is constant on the cones in the normal fan of the genome alignment polygon. Note that JC is undefined when (X,S) is perpendicular to one of the 1,183 edges of the genome alignment polygon. For such (X,S), the Jukes–Cantor distance function jumps between its values on the two adjacent cones in the normal fan.
The graph of the Jukes–Cantor distance function is shown in . The function ranges in value from 0.1253 to 0.2853, is monotonically decreasing as a function of S, and monotonically increasing as a function of X. We found it interesting that at the line X = S, there is a large “Jukes–Cantor jump” where the value of the function increases from 0.1683 to 0.2225.
The Jukes–Cantor Distance Function of Two Drosophila Genomes
The Jukes–Cantor distance function is a new tool for parametric reconstruction of phylogenetic trees. Instead of estimating a single distance between each pair of genomes in a multiple species phylogenetic reconstruction, one can now evaluate the Jukes–Cantor function at vertices of the Minkowski sum of the whole genome alignment polytopes. These can be used for parametric phylogenetic reconstruction using distance-based methods such as neighbor joining.