Algorithm

Model Summary

While traditional alignment algorithms assume identical mutation rates throughout the sequence, we introduce an alternative evolutionary model allowing for rate heterogeneity by modeling the evolution of both quickly and slowly evolving regions. In our model, a two-state HMM emits a sequence of conserved (slow evolution) and non-conserved (fast evolution) states at the root of the tree. This defines an alternating series of conserved and non-conserved segments, allowing our model to represent both neutral sequences expected to exhibit higher mutation rates and functional sequences undergoing purifying selection. Each segment evolves independently along a phylogenetic tree according to a pairwise alignment model which allows for insertions, deletions, and substitutions on each branch of the tree. While the StatAlign package jointly estimates both the tree and the alignment, we condition the analysis on a user-inputted phylogenetic tree in order to estimate the alignment and locations of quickly and slowly evolving regions more efficiently.

Our model is a reformulation of the SAPF multiple HMM with two main differences. As in [

5], we model the difference between fast and slow states by scaling down the branch lengths of the phylogenetic tree in slow states, reducing the evolutionary time - and thus the expected divergence - in these regions. While SAPF models slow states by modifying the rate parameters of the mutation models, BigFoot uses a branch scaling approach in order to support multiple substitution models with different numbers of parameters. Two different scaling factors, both of which are model parameters endowed with user-defined priors and constrained to be less than one, are used for substitution and indel events.

Another minor difference is that the BigFoot model does not allow for insertions at the exact boundaries of functional regions. This modification was necessary to create unambigious and reversible MCMC proposals, and is a biologically relevant modification since an insertion event in one sequence should not define the beginning (or end) of a conserved region.

Modeling molecular evolution

Our alignment model is expressed as a pairwise HMM transducer [

29,

30], a conditionally normalized HMM representing the evolution of an ancestral sequence into a descendent sequence, and is similar to the transducer model in [

20]. More complete details describing the transducer are presented in the

**Methods **section and supplementary sections S1.1 and S1.2 [see Additional file

1].

We base our transducer on the 1992 approach of Thorne, Kishino, and Felstenstein (TKF92) [

14]. TKF92 models the birth and death of fragments with geometrically-distributed lengths in order to represent long indel events. Our model can be viewed as an extension of the TKF92 approach, allowing these fragments to exist in either quickly or slowly evolving regions. In our fast states, as in the TKF92 model, the length of indel fragments is modeled by a geometric distribution with the same expected length as a fragment of matched bases, resulting in an expectation of long indel events. In annotated functional regions, however, we noticed that most indel events were very short (1-3 bp). To represent this, we create a separate parameter to specify the expected lengths of indel events. Thus, we not only expect functional regions to have fewer indel events, we also predict that these events will be shorter.

We place a pairwise transducer on each branch of the phylogenetic tree in order to model the evolution from each ancestor to each descendent. Transducer theory [

29,

30] shows how the concatenation of these transducers results in a multiple HMM describing the evolution from the ancestral root node to all leaves in the tree. We place a separate HMM on the root sequence, allowing it to switch between emitting slow-evolving characters and fast-evolving characters with specific probabilities. This models fast and slow regions with geometric distributed lengths, set by model parameters.

One limitation of our approach is that our model does not allow for the creation or deletion of conserved regions along the tree. Since the annotation of fast or slow characters emitted by the root is conserved in the descendent sequences, the model cannot detect the loss or gain of binding sites. For this reason, when testing BigFoot we discarded sequences with long deletions from analysis.

The full likelihood of a tree is equivalent to the full emission probability of the multiple-HMM. Unfortunately, this likelihood cannot be calculated quickly, as the time complexity of the Forward algorithm for a multiple HMM grows exponentially with the number of sequences. Instead of direct computation via dynamic programming, we apply a Bayesian MCMC method with data augmentation.

Bayesian MCMC

All model parameter densities are estimated using MCMC sampling. Exponential priors with expectation 1 have been used as priors for insertion-deletion parameters in the alignment transducer and for all free parameters in substitution models provided with the software package. For parameters responsible for annotation (branch scalings and expected lengths for fast/slow regions, as described above), we allow the user to input either Beta or uniform priors on these parameters. This allows the user to tailor the analysis to their specific needs. For example, the user can set an informative prior to search for longer weakly conserved regions, or for very short and highly conserved regions. Alternatively, the user can set uninformative priors and allow the MCMC to estimate parameter distributions freely.

The joint posterior distribution of alignments, trees and evolutionary parameters forms a high dimensional and complex distribution from which efficient direct sampling is most likely impossible. Therefore, we applied Markov chain Monte Carlo to converge to this prescribed distribution. After convergence, samples from the Markov chain provide correlated samples from the posterior distribution.

The likelihood of a tree under the multiple HMM can only be easily calculated when we augment the tree with additional data known as *extended alignments*. This data contains information on how the observed sequences are aligned to ancestral sequences associated with internal nodes of the tree. We represent the unobserved ancestral sequences as a collection of gaps and Felsenstein wildcards in order to sum over all possible nucleotide values when calculating the total likelihood.

Our MCMC walks on the joint distribution of the extended alignments, locations of fast and slow regions, and model parameters. The random walk comprises the following components:

• Changing model parameters

• Changing extended alignment

• Shifting the boundary of an existing fast or slow region

• Creating a new (or deleting an old) fast or slow region

The first two types of moves are described in [

31], and the last two are described in the

**Methods **section. In each MCMC iteration, we apply a Metropolis-Hastings move to alter one of these components, and select moves with fixed probabilities that were chosen to enhance mixing.

Postprocessing the samples from the Markov chain

The Markov chain provides correlated samples from the posterior distribution of alignments, locations of fast and slow regions, and evolutionary parameters. To report posterior probabilities for phylogenetic footprinting predictions we take the approach of [

4,

5,

20], collapsing our predictions onto one axis and reporting posterior probabilities for a single species. Our results thus represent the posterior probability of each nucleotide having been generated from a slow state. These probabilities are simply the proportion of samples in which each nucleotide appears inside the boundaries of a slow region.

Multiple sequence alignment samples can be summarized in several ways. Unlike other authors [

17], we found the MAP (Maximum a Posteriori) alignment estimation drawn from MCMC samples to be very unstable, especially when there is autocorrelation between samples from the chain. We chose instead to estimate the MPD (Maximum Posterior Decoding) alignment [

32], which maximizes the product of alignment column posterior probabilities. We found this estimation to be more stable as it allows the uncertainty in each alignment column to be assessed independently. We present a complete algorithm for calculating the MPD alignment in supplementary section S1.3 [see Additional file

1].

Testing

As a first test of the accuracy of the MCMC results, we ran BigFoot on a relatively small dataset to compare the results with the exact dynamic programming predictions of SAPF. The two methods were expected to return similar, though not identical, results. This is because SAPF and BigFoot use slightly different alignment HMMs on each tree branch. We analyzed a

*cis*-regulatory module in four

*Drosophila *species:

*D. melanogaster*,

*D. erecta*,

*D. pseudoobscura*, and

*D. willistoni*. This 485 base pair region has been found to regulate the expression of the homeodomain encoding protein

*eve *in the second stripe of the developing

*Drosophila *embryonic blastoderm [

33]. The REDFly database provided the sequence coordinates of the biologically verified regulatory module in the

*D. melanogaster *sequence [

34], and the FlyReg database provided locations for 19 experimentally discovered binding sites [

35]. Figure exhibits the close agreement between the MCMC and dynamic programming predictions, with the locations of the known binding sites displayed above the posterior probabilities, and provides strong evidence that BigFoot is sampling from and converging to the true joint distribution. Both programs identify 14 of the 19 binding known binding sites with high posterior probabilities. Of the remaining five sites, none have homologous instances in either

*D. melanogaster *or

*D. pseudoobscura*. Four were biologically characterized as "weak-affinity" binding sites [

36] which could indicate reduced functionality and a loss of evolutionary pressure, and the last was postulated to be recently evolved in

*D. melanogaster *due to an absence of orthologous sequence in both closely and distantly related

*Drosophila *species [

33,

37].

The

*Drosophila *12 genome consortium has completed the sequencing of 12

*Drosophila *genome sequences exhibiting a large range of evolutionary distances [

1,

38]. For example, the evolutionary distance separating

*D. melanogaster *and

*D. grimshawi *is greater than that between any pair of mammals when generation time is taken into account [

1], while other sequence pairs are very closely related. The large number of sequenced genomes and the diversity in their evolutionary distances make this an ideal dataset for implementing phylogenetic footprinting techniques. We tested the

*eve *stripe 2 enhancer region using ten of the 12 genomes in this dataset. We chose to remove two species,

*D. mojavensis *and

*D. virilis*, as both sequences contained numerous long deletions and were thus too divergent to be informative. Sequence data for all species were obtained from a set of pre-computed whole-genome alignments [

39].

Figure compares MCMC predictions generated by BigFoot when analyzing either four sequences or ten sequences. The predictions made when analyzing the larger dataset correspond more closely to the locations of experimentally validated binding sites. The improvements can be summarized in two main categories:

**• Higher sensitivity to verified binding sites**. While Figure exhibits a close agreement between the two sets of analyses for many of the experimentally verified binding sites, the addition of more species does improve the conservation signal in some TFBS. In particular, peaks corresponding to a Kruppel binding site (kr7), a Bicoid binding site (bcd4), and a joint site (kr9/bcd5) are all more strongly identified as evolving slowly when ten species are analyzed. Additionally, one Kruppel binding site (kr5) is only detected, albeit weakly, when using the larger dataset. This demonstrates that while imperfectly conserved regions may be reasonably likely to occur by chance in neutral sequence when only a few species are analyzed, additional sequences may provide stronger evidence of purifying selection. Heightened sensitivity is also observed at previously unannotated peaks from bases 137-148 and 227-232. Both regions are adjacent to a verified TFBS and have a high posterior probability of being emitted from a slow state, and thus should be candidate regions for future experimental study. This heightened sensitivity does not result in a general loss of specificity, as low probability peaks (bases 298-306 and 383-421) in previously un-annotated regions disappear when using the larger dataset.

**• Finer nucleotide resolution for TFBS start/stop positions**. When analyzing a small number of species, it may be difficult to identify the boundaries between quickly and slowly evolving regions, especially in a region where TFBS may be grouped close together. The results shown in Figure demonstrate how adding additional sequence data can result in a clearer signal at the boundaries of binding sites. In the Kruppel site kr7, the distance in bases from the limits of the predicted conserved region (defined as the peak region with probability greater than 0.5) to the limits of the laboratory-identified regulatory element decreases by 3 bp when additional sequences are analyzed. Though this difference is small, it corresponds to 27% of the 11 bp binding site. A similar effect is observed in the Kruppel site kr8, for which additional sequence data decreases the boundary error by 4 bp.

Additionally, in a closely spaced group of functional elements (bases 440-482) separated only by a small number of neutral bases, the small dips in posterior probability correspond more closely to the neutral regions when more sequences are added to the analysis. While the agreement is not perfect, these results are consistent with previous findings showing an increase in nucleotide resolution as more species are analyzed [

23].

In order to quantify the predictive accuracy of our results, we calculated receiver operating characteristic (ROC) curves for both sets of BigFoot results. The area under the curve (AUC), which has a maximum value of 100%, is a summary statistic that accounts for both the sensitivity and the specificity of the predictions. A value of 50% implies that the predictions are no better than random guessing. The methodology used for creating the ROC curves is described in [

20]. The curves are displayed in Figure , which exhibits the small but noticeable predictive improvement when additional sequences are added to the dataset.

To demonstrate that BigFoot can be applied to sequence data from vertebrates, we analyzed a 256 base pair region previously identified as the major regulatory element of human

*α*-globin (

*α*MRE) [

40]. This region was sequenced in 22 species, analyzed with the TRANSFAC database, and found to contain seven TFBS. These TFBS range from 8-15 bp in length, and include recognition elements for the Maf protein and GATA-1, both important in globin gene regulation [

40,

41].

To analyze this region, we downloaded the multiz28way alignment of the region from the UCSC genome browser [

42]. This alignment provided sequence information for 15 vertebrate species, three of which (cat, shrew, hedgehog) contained long deletions and were therefore removed from the analysis. The results of the analysis are shown in Figure , where we display the results of two independent MCMC runs initialized at independent starting points. The first run was started using the UCSC alignment to initialize the Markov chain, while the second was initialized with a random alignment proposed by BigFoot. Both runs were also initialized with independent and randomly selected evolutionary parameters. Despite these differences, Figure exhibits that there is extremely close agreement - in many regions near exact correspondence - between the two sets of results, demonstrating the mixing and convergence of the MCMC sampling.

When analyzing the 12 vertebrate species, BigFoot detects six of the seven known binding sites with posterior probabilities of greater than 95%. However, the seventh binding site, notated in Figure as bs2, is poorly conserved and the binding site peak probabilities do not exceed 50%. In fact, this region was only detected due to the presence of a previously unannotated and weakly conserved adjacent region, notated as bsAlt, which is incorrectly aligned in the multiz28 way alignment. BigFoot corrects this alignment error and annotates this region with a peak exceeding 80% posterior probability. This example demonstrates the importance of calculating and correcting for alignment ambiguity and error. By doing this, BigFoot not only discovers a previously undetected conserved region in a well-annotated regulatory module, but also enables the detection of a weakly conserved but previously identified regulatory element.

Comparison with phastCons

One of the most widely used alignment-based phylogenetic footprinting tools is the phastCons program, used to create the conservation track in the University of California at Santa-Cruz (UCSC) genome browser [

5,

24]. The conservation track makes predictions from a single multiz alignment and does not incorporate indel information. This puts the method at a significant disadvantage to BigFoot, as indel information has been shown to be extremely valuable in the detection of functional elements [

43]. Indeed, we observed that predictions from the UCSC conservation track had significantly lower specificity and sensitivity when compared to the BigFoot results in both the

*Drosophila *and the

*α*MRE datasets.

In the command-line version of phastCons, however, it is possible to set an option for the program to incorporate indel information [

44]. We set this option and ran phastCons on the BigFoot test datasets, using the UCSC alignment as input. The two programs returned almost identical results for the majority of the binding sites, since the majority of the binding sites are well conserved and thus perfectly aligned in the multiz alignments. In these cases there is little alignment uncertainty, so BigFoot and phastCons are expected to return similar results. However, in shorter binding sites exhibiting weaker conservation, BigFoot outperforms the single-alignment method. The most drastic example is the Kruppel site kr7 in the

*eve *stripe 2 enhancer. The core of the TFBS is very well conserved and BigFoot predicts the site with high probability, but there are substitutions towards the edges and short indels in some species, and the multiz alignment incorrectly aligns the binding site. As a result, phastCons detects all nucleotides in the TFBS with less than 5% probability. The ROC curve for the phastCons predictions is displayed alongside the BigFoot ROC curves in Figure . Comparing the AUC values demonstrates how the increase in predictive accuracy caused by adding more sequence data is less than the corresponding increase caused by analyzing a distribution of alignments instead of a single alignment.

A similar error in the multiz alignment of the *α*MRE enhancer results in phastCons failing to annotate any nucleotide the weakly conserved region discussed in the previous section (bAlt) with greater than 5% probability. While we cannot know if this region is a functional TFBS without further experimental analysis, examining a distribution of alignments ensures that this region is not incorrectly passed over during the conservation analysis.

Since phastCons analyzes only one single alignment, the computational time to analyze a single region, around 30 seconds on a 2 ghz macbook computer, was substantially less than the 12-16 hours needed to analyze a single region with BigFoot. As a result, phastCons can be used to compute functional element predictions for the entire genome, while BigFoot can only be used to analyze individual regions of interest. However, for users who have identified specific genomic regions to study in detail, the benefit of controlling for alignment error and uncertainty by using BigFoot may justify the additional computational time needed for the analysis.

Implementation

User input

BigFoot requires the user to input a set of homologous DNA sequences (in FASTA format) and an evolutionary tree (in newick format) describing the phylogenetic relationships between the inputted species. BigFoot can construct an initial alignment of the sequences, or if the user has a previously computed starting alignment in FASTA format, they can set it as the starting alignment in the Markov chain. The user can also place either Beta or uniform priors on parameters modeling the difference between fast and slow states.

Substitution models

Our aim was to build a software package for an insertion-deletion model that can be coupled with an arbitrary substitution model. Therefore we would like to give users the option to implement their own substitution models. In the software help file, we describe how users can extend this class to create their own substitution models. We currently provide a large selection of eight nucleotide substitution models including the Jukes-Cantor model [

45], the Kimura three parameter model [

46], and the HKY85 model [

47].

Postprocessing

Our program provides random samples from a Markov chain whose stationary distribution is the joint Bayesian distribution of sequence alignments, locations of fast and slow regions, and model parameters. This high dimensional joint distribution can be analyzed in several ways, ranging from an analysis of the posterior distribution of a single rate parameter to an investigation of Markov chain convergence using a log-likelihood trace or a separate multidimensional autocorrelation analysis. We implemented a set of postprocess plugins which analyze data from the Markov chain and display in the graphical interface. In the software help file, we also describe how the user can implement their own postprocess plugin by extending the abstract class.

We implemented the following plugins, each of which represents a tab in the graphical interface:

• **Log-likelihood trace **This plugin plots the log-likelihood trace and writes the log-likelihood values into a text file when the analysis is complete.

• **Current alignment **This plugin shows the multiple sequence alignment in the current state of the Markov chain, along with the locations of slow and fast regions. Capital letters in the alignment represent slowly evolving regions in the current state of the chain, while lower case letters represent quickly evolving regions.

• **MPD alignment **This plugin calculates and performs running updates of the Maximum Posterior Decoding alignment based on all previous samples of the Markov chain. Figure pictures this alignment display, along with two curves overlaying the sequence information. The blue curve represents the posterior probability of each alignment column: higher values indicate greater confidence in the alignment. The red curve represents the phylogenetic footprinting results: higher values indicate a greater posterior probability of purifying selection. The final MPD alignment, and all footprinting results, are written into a text file after the analysis is complete.

• **Current tree **This plugin graphically displays the tree inputted by the user.