Home | About | Journals | Submit | Contact Us | Français |

**|**BMC Evol Biol**|**v.10; 2010**|**PMC2928787

Formats

Article sections

- Abstract
- Background
- Methods
- Results
- Discussion
- Conclusions
- Software availability
- Authors' contributions
- Supplementary Material
- References

Authors

Related links

BMC Evol Biol. 2010; 10: 244.

Published online 2010 August 10. doi: 10.1186/1471-2148-10-244

PMCID: PMC2928787

Guy Baele: eb.tnegu.bsp@eleab.yug; Yves Van de Peer: eb.tnegu-biv.bsp@reepednav.sevy; Stijn Vansteelandt: eb.tnegu@tdnaleetsnaV.njitS

Received 2010 March 30; Accepted 2010 August 10.

Copyright ©2010 Baele et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chains models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence.

We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies as presented in previous work, yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences.

We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.

Over the past decades, context-dependent evolutionary models have been the subject of an increasing number of studies. These studies demonstrate that the assumption of site-independent evolution is overly restrictive and that evolutionary models that take into account context-dependencies greatly improve model fit. Most context-dependent models have a temporal aspect in that the evolution of a given site over a branch is assumed to depend upon its immediate neighbours at the start of the branch. Specifically, context-dependent models are useful when studying mammalian genomes, due to the extensive methylation of C in CG doublets, which could make such Cs hotspots for mutation (see e.g. [1], for a review).

By allowing for mutations of both single nucleotides and pairs of neighbouring nucleotides, Arndt et al. [2] considered the evolution of an initial random sequence of nucleotides in discrete time steps, according to a set of update rules. Arndt et al. [2] further computed dinucleotide odds ratios to measure whether a given dinucleotide pair is over- or underrepresented and concluded that these correctly capture the strong underrepresentation of the CpG-dinucleotides in the human chromosome 21.

Hwang and Green [3] not only allow the evolution of a given site over a branch to depend upon its immediate neighbours at the start of the branch but further partition each branch into two or more small discrete time units, when the length of that branch is sufficiently long, in order to accurately approximate the continuous-time Markov substitution process. The authors estimate separate context-dependent rate matrices for five different clades and for a sixth group comprising the remaining ancestral branches. Hwang and Green [3] further model a dependency pattern along the sequence at the ancestral root using a second-order Markov chain.

Empirical studies (see e.g. [4-8]) have shown that the preceding or succeeding bases in a sequence have a large influence on the occurrence of a base, both in coding and non-coding sequences. This suggests the importance of modelling dependencies along the sequence. For example, Erickson and Altman [9] studied the nucleotide sequence of the RNA of the bacteriophage MS2 using Markov chains found that the total nucleotide sequence of the RNA of the bacteriophage MS2 showed very significant second-order dependence. Blaisdell [10] studied sixty-four eukaryotic nuclear DNA sequences, half of them coding and half non-coding using first-, second- and third-order Markov chains and found significant statistical evidence in favour of a third-order Markov chain in 5 of the 10 non-coding sequences longer than 1400 bases, suggesting that most sequences required at least second-order Markov chains for their representation while some required chains of third order.

In previous work [1], we have introduced a context-dependent model of evolution which allows the evolution of a site across a branch to depend upon the identities of its two immediate flanking bases at the start of the branch. Since the root of the tree does not have any ancestral sequence, we have assumed a site-independent distribution at the ancestral root sequence using a set of context-independent model frequencies, which is also used in the estimation of the context-dependent substitution rates. In this paper, we relax this assumption by modelling Markov chains of varying orders at the ancestral root sequence, thus modelling a context-dependency scheme across the entire tree. We also show that the assumption of using a single set of model frequencies for analyzing context-dependencies is overly restrictive. We allow for a set of context-dependent model frequencies which aims to accurately model the context-specific sequence composition throughout the underlying tree and test the performance of such a model by calculating the appropriate (log) Bayes Factors using thermodynamic integration [11]. We show that context-dependent model frequencies drastically increase the model fit and allow for accurate estimation of the parameters involved in specifying the ancestral root distribution. We assess the performance of an ancestral sequence composition that is coupled to the context-dependent model, inspired by the approach of Arndt et al. [2] and provide evidence that a decoupled root distribution outperforms a coupled root distribution. We compare the parameter estimates of our model for two approaches, i.e. a first approach that allows the evolution of a given site to depend only upon the identities of its immediate flanking bases at the start of a branch versus a second approach which partitions each branch into two or more parts to better accommodate evolving flanking bases across the branch.

We have used a first dataset consisting of 10 vertebrate species (human, chimpanzee, gorilla, orang-utan, baboon, macaque, vervet, marmoset, dusky titi and squirrel monkey), as in Baele et al. [1]. In terms of the sequences used, this dataset is a subset of the alignment analyzed in the work of Hwang and Green [3] and Margulies et al. [12], containing sequences all orthologous to a ~1.9-Mb region on human chromosome 7q31.3 which contains the cystic fibrosis transmembrane conductance regulator gene (CFTR). We have extracted the ancestral repeats, transposons that appear to have been dispersed, and then become inactive, prior to the divergence of the species in question, and that are believed to have been evolving more or less neutrally since that time ([13]; see [1], for details on the preparation of the dataset). The resulting ancestral repeats dataset consists of the same type of sites as in the work of Siepel and Haussler [13], but contains only primate sequences. The dataset consists of 114,726 sites for each of the 10 sequences and is analyzed using the following rooted tree topology (((((human, chimpanzee), gorilla), orang-utan), ((baboon, macaque), vervet)), ((marmoset, dusky titi), squirrel monkey)).

A second dataset consists of the ψη-globin pseudogene sequences of six primates (human *[Homo Sapiens]*, chimpanzee *[Pan Troglodytes]*, gorilla *[Gorilla Gorilla]*, orang-utan *[Pongo Pygmaeus]*, rhesus monkey *[Macaca Mulatta] *and spider monkey *[Ateles Geoffroyi])*, containing 6,166 nucleotides in each sequence. We have used the fixed consensus tree shown in the work of Yang [14]: ((((human, chimpanzee), gorilla), orang-utan), (rhesus monkey, spider monkey)).

Bayesian inference of phylogeny is based on a quantity called the posterior probability function of a tree, in the same way as maximum-likelihood inference is based on the likelihood function. While the posterior probability is generally tedious to calculate, simulating from it is relatively easy through the use of Markov chain Monte Carlo (MCMC) methods ([15,16]). Relaxing the assumption of independent evolution leads to computational difficulties, which we handle via a data augmentation scheme [17]. Let *θ *be the collection of unknown parameters indexing the evolutionary model of interest, *Y _{obs }*the observed nucleotide sequences (i.e., the observed data) and

We provide a model-based approach to test the assumptions put forward in the empirical research of Blaisdell [10] and assume zero-, first-, second- and third-order Markov chains to specify the probability that a given base appears at a given site in the (ancestral) root sequence, given the identities of its preceding sites (i.e. the nucleotides occupying the preceding sites in the ancestral root sequence). For example, a second-order Markov chain is a temporal or spatial sequence of events characterized by the property that the outcome of any event in the chain may be dependent upon (e.g. influenced by) the two events immediately preceding it, but has no residual dependence on any further events preceding those. In other words, in the ancestral root sequence the probability that a given site occurs depends on the identities of the sites preceding it (i.e. which nucleotides precede the given site) in that same ancestral sequence. Arndt et al. [2] used a two-cluster approximation to calculate dinucleotide frequencies, yielding an exact solution if the stationary state of the mutation process is a first-order Markov chain, while Hwang and Green [3] modelled the distribution of bases at the ancestral root as a second-order Markov chain.

In the case of a zero-order Markov chain, one set of frequencies *π*_{ROOT }= {*π*_{A}, *π*_{C}, *π*_{G}, *π*_{T}} is used to specify the distribution at the root sequence. When a first-order Markov chain is used, four independent sets of frequencies are required: *π*_{ROOT }= {*π*_{X|A}, *π*_{X|C}, *π*_{X|G}, *π*_{X|T}}, with *X * {*A*, *C*, *G*, *T*} the identity of the site preceding the given site. Likewise, a second-order *π*_{ROOT }= {*π*_{YX|A}, *π*_{YX|C}, *π*_{YX|G}, *π*_{YX|T}} Markov chain and a third-order *π*_{ROOT }= {*π*_{ZYX|A}, *π*_{ZYX|C}, *π*_{ZYX|G}, *π*_{ZYX|T}} Markov chain is specified with *Y * {*A*, *C*, *G*, *T*} the identity of the site 2 positions prior to the given site and *Z * {*A*, *C*, *G*, *T*} the identity of the site 3 positions prior to the given site. Zero-, first-, second- and third-order Markov chains increase the parameter space with respectively 1, 4, 16 and 64 sets of base frequencies (i.e., 4, 16, 64 and 256 parameters). Note that our notation for the distribution at the root sequence differs from the typical notation for conditional probabilities in that we put the conditional part before the 'pipe' symbol ('|').

To study the context-dependent substitution process across the remainder of the underlying tree, we estimate a general time-reversible model (GTR; 5 free evolutionary parameters and 3 free base frequency parameters; [19]) in each of the 16 possible neighbouring base combinations, yielding a context-dependent model consisting of 160 parameters (or 5 free evolutionary parameters and 3 free base frequency parameters to describe the GTR model in each context). In each context (i.e. in each of the 16 neighbouring base combinations), the general time-reversible model has the following substitution rates in the case of context-dependent model frequencies *π*_{XY }= {*π*_{X|A|Y}, *π*_{X|C|Y}, *π*_{X|G|Y}, *π*_{X|T|Y}} and rate exchangeability parameters *q _{X|W→Z|Y}*, with

.

Let *θ *be the collection of terms of the off-diagonal elements of the matrix above (with the starting base indicated by the row index and the resulting base indicated by the column index), each multiplied by the probability that one would start with that starting base (see e.g. [20]). This context-dependent model, although consisting of a collection of reversible models, allows for describing non-reversible evolutionary phenomena, such as the CpG-methylation-deamination process (see e.g. [21]). The context-dependent evolutionary model and the Markov chain(s) used to estimate the ancestral root distribution are hence decoupled (i.e. estimated independently from one another). Note that in the case of context-independent model frequencies, the model simplifies to the model in [1]. Throughout the remainder of this paper, we will use the notation *r _{X|AC|Y}*,

Further, we assume a fixed underlying phylogenetic tree structure (see the Data section and [12]) for the evolutionary relationship between the species in our dataset. The branch lengths of the tree are estimated in our MCMC-run, along with all the other parameters of the context-dependent evolutionary model.

Let *T *be the set of branch lengths with *t _{b}*(

For the model parameters of each context (i.e. neighbouring base combination) independently, the following prior distribution is assumed (see e.g. [22] and Additional file 1):

For each of the frequencies in each neighbouring base combination, with *X *the 5' and *Y *the 3' neighbouring base at the ancestral sequence of the branch (i.e. *X*, *Y * {*A*, *C*, *G*, *T*}), the following prior distribution is assumed:

Further, branch lengths are assumed i.i.d. given *μ*:

*t _{b}*|

and

*μ *~Inv-gamma (2.1, 1.1), , *μ *> 0.

By allowing for context-dependent evolution, evolutionary models become more parameter-rich. As previously discussed [23], consistency problems may arise with such high-dimensional models, along with potential computational burdens. In view of this, a model-selection approach should be used that penalizes the addition of extra parameters unless there is a sufficiently impressive improvement in fit between model and data [23]. To compare the different assumptions concerning the root distribution, we have calculated (log) Bayes Factors [24]. Log Bayes Factors are typically divided into 4 categories depending on their value: from 0 to 1, indicating nothing worth reporting; from 1 to 3, indicating positive evidence of one model over the other; from 3 to 5, indicating strong evidence of one model over the other; and larger than 5, indicating very strong evidence of one model over the other [24]. We have chosen to calculate Bayes Factors using thermodynamic integration [11], since the traditional harmonic mean estimator of the marginal likelihood systematically favors parameter-rich models and is hence unfit to compare these complex context-dependent models. We have used the model-switch integration method and have performed bidirectional checks, i.e., we have calculated both annealing and melting integrations under various settings to obtain very similar runs, as suggested in the work of Rodrigue et al. [25]. When comparing different models, we report (log) Bayes Factor estimates for both annealing and melting integrations, as well as their mean. More details concerning the Bayes Factor calculation using a data augmentation approach can be found in Baele et al. [18].

Our approach for estimating context-dependent evolutionary patterns across a phylogenetic tree separates the estimation of the root distribution probabilities from the estimation of the context-dependent evolutionary patterns. This approach is similar to that of Hwang and Green [3] but clearly different from the approach of Arndt et al. [2] and Jensen and Pedersen [26]. Arndt et al. [2] derive the root distribution probabilities from the substitution process. Our substitution probabilities are different from theirs, as the dependencies cascade from the leaves of the tree up to its root, whereas the substitution probabilities in Arndt et al. [2] do not. In view of this, we integrated both approaches by means of a two-cluster approximation (see e.g. [27]) to derive the root distribution probabilities from our context-dependent evolutionary model. We focus here on deriving formulas for calculation of a first-order root distribution (with the 'pipe' symbol '|' denoting a traditional conditional probability) using the context-dependent model frequencies present in the context-dependent evolutionary model, with *a _{i }*the identity of the base at a given site

with:

.

This last equation can be rewritten using the two-cluster approximation as:

Substitution in the expression for *p*(*a _{i}*|

We have performed two series of simulation experiments to examine the accuracy of our MCMC-approach and to assess whether our results might be influenced by the choice of prior distributions. The first series of simulations focuses on a second-order ancestral root distribution decoupled from the context-dependent evolutionary model while the second series of simulations focuses on a first-order ancestral root distribution coupled to the context-dependent model. Both simulations show that our method is able to reliably infer root distribution estimates, context-dependent model parameters and frequencies, as well as branch lengths from a large dataset. For more information on the set-up of the simulations, as well as the results of both simulations, we refer to Additional file 1, Additional file 2 and Additional file 3.

In view of the computational complexity, we make the weak assumption that the identities of the immediate flanking neighbors remain constant across each branch of the tree. As the dataset analyzed in this paper contains closely related organisms, i.e. short internal and terminal branches will be estimated for the given tree topology, we expect that this assumption will not lead to drastically different parameter estimates compared to a branch-partitioning approach such as the one proposed by Hwang and Green [3]. Indeed, while this assumption makes our approach less biologically realistic, branch-partitioning is most likely only required for longer branches (i.e. branches longer than or equal to 0.005 using the approach of Hwang and Green [3,20]). To assess the need for partitioning the branches in our dataset, we split each branch into two or more parts of equal length such that the average substitution rate per time unit is smaller than or equal to 0.005 [3] and compare all the parameter estimates of our context-dependent model with a second-order Markov chain at the ancestral root to our approach without branch partitioning. Instead of estimating context-dependent substitution patterns on 18 branches, these patterns now have to be estimated on 52 branch parts, which leads to an almost 3-fold increase in computation time. The branch-splitting approach of Hwang and Green is a discrete-time method. For a comparison of both continuous and discrete-time methods to sample substitution histories along branches, in terms of performance and accuracy, we refer to the work of de Koning et al. [29]. Apart from comparing the parameter estimates, we have also calculated an additional (log) Bayes Factor of our context-dependent model with a second-order Markov chain against an independent GTR model, with the branch lengths partitioned.

In Figure Figure1,1, we calculate the increase in model fit vis-à-vis the general time-reversible model, brought about by relaxing the assumption of site-independent evolution at the ancestral root sequence (the numerical results are reported in Table S1 in Additional file 1). The assumption of a separate set of base frequencies at the root already provides a significant increase in terms of model fit over the model of Baele et al. [1], even though such a zero-order Markov chain still assumes a site-independent distribution at the ancestral root sequence. The first- and second-order Markov chain at the root sequence yield a phenomenal increase in terms of model fit. The assumption of a third-order Markov chain at the root sequence yields no further improvement, which may be related to the corresponding drastic increase in number of parameters.

We have calculated (log) Bayes Factors for our context-dependent model with context-dependent model frequencies and Markov chains of different orders at the ancestral root sequence in the same way as reported above, with the general time-reversible model as the reference model. The results are shown in Figure Figure22 (the numerical results are reported in Table S2 in Additional file 1). Even when assuming a site independent distribution at the ancestral root sequence, a drastic increase in model fit is realized through context-dependent model frequencies. As when assuming independent model frequencies, optimal fit is reached when assuming a second-order Markov chain at the ancestral root sequence, although the increase in model fit is now higher than when assuming independent model frequencies.

Inspired by the work of Arndt et al. [2], we have calculated the log Bayes Factor for this first-order successive approximation approach with coupled root distribution against the independent GTR model, resulting in an annealing estimate of 6648.61 ([6628.69; 6668.52]) and a melting estimate of 6682.52 ([6659.95; 6705.09]), yielding a bidirectional mean log Bayes Factor of 6665.56 log units. We conclude that our approach to couple the root distribution to the evolutionary model, inspired by the work of Arndt et al. [2], yields a suboptimal model fit for this data.

As mentioned above, the context-dependent model with context-dependent model frequencies and a second-order Markov chain at the ancestral root sequence yields a a bidirectional mean log Bayes Factor of 7221.47 log units, compared to the independent GTR model (7218.01 ([7200.34; 7235.68]) log units for the annealing scheme and 7224.94 ([7207.16; 7242.71]) log units for the melting scheme). We have recalculated the log Bayes Factor for this model, this time splitting the branches where appropriate. Due to the use of a full data likelihood (as a consequence of using a data augmentation approach), the branches under both the context-dependent and independent model being compared need to be split into parts as otherwise the two likelihoods corresponding to the two models would be incomparable. The independent model will however not benefit from the branch partitioning, as it is an approach aimed at approximating the continuous distribution of the context-dependent Markov substitution process. The split branches model comparison yields a bidirectional mean log Bayes Factor of 7626.70 log units, compared to the independent GTR model (7617.66 ([7600.11; 7635.21]) log units for the annealing scheme and 7635.74 ([7617.94; 7653.55]) log units for the melting scheme). This difference of approximately 405 log units cannot solely be attributed to the difference in evolutionary model estimates, given that these differ only slightly between both approaches. The main reason for this significant difference is hence likely the more accurate approximation of the context-dependent Markov substitution process by allowing the ancestral sequences to change in between the internal speciation nodes (which is also responsible for the small differences in model estimates).

From previous work [1], we have obtained the stationary distribution given (with accompanying 95% credibility intervals) by: *π*_{A }= 0.300 ([0.297; 0.302]), *π _{C }*= 0.192 ([0.190; 0.194]),

Modelling a zero-order Markov chain along the ancestral root sequence still assumes a site-independent distribution but relaxes the stationarity assumption by allowing for a different set of base frequencies to estimate the evolutionary models that describe substitutions in the remainder of the tree. By comparing the two sets of base frequencies, one can assess the restrictions imposed by assuming a stationary distribution across the entire tree. The estimates for the base frequencies along the ancestral root sequence (with accompanying 95% credibility intervals) *π _{A }*= 0.295 ([0.293; 0.298]),

Allowing for a first-order Markov chain along the sequence yields the four sets of base frequencies for the ancestral root sequence in Table Table1.1. The estimates from the first-order Markov chain at the ancestral root sequence clearly show an important effect of mammalian evolution, i.e. the presence of a so-called CpG-effect (see e.g. [21]). Indeed, the probability of encountering a G at the ancestral root sequence when it is preceded by a C is extremely low (0.69%), which leads to increased probabilities of observing either A, C or T at the root sequence when the preceding site is a C. This confirms the result of Arndt et al. [2], who calculated dinucleotide frequencies and odds ratios and showed that the CpG dinucleotide is underrepresented in mammalian sequences as a result of the CpG-methylation-deamination process, although the estimate in Table Table11 is much lower than theirs. This underscores the importance of (at least) a first-order Markov chain when modelling context-dependent evolution. The comparison of the remaining rows in Table Table11 illustrates the presence of other evolutionary patterns. For example, while the probability of observing a C or a G in the ancestral root sequence is quite similar regardless of whether the preceding site is G or T, there is a higher probability of observing an A when the preceding site is G than when it is T.

To assess whether the probability of observing a given nucleotide at a given site in the ancestral root sequence depends only on its immediate preceding site, we have also modelled a second-order Markov chain at the ancestral root sequence. This was also used by Hwang and Green [3], but without testing whether it was supported by the data. The estimates for the sixteen sets of base frequencies for the ancestral root sequence are shown in Table Table22.

The estimates from the second-order Markov chain at the ancestral root sequence show additional differences in base frequencies when compared to the estimates from the first-order Markov chain. For example, there are large relative differences in the probabilities of observing a G when the preceding site is a C. When that C is preceded by an A, the probability of observing a G is more than twice as much as when C is preceded by G. Likewise, from Table Table22 we see that the probability of observing an A equals 31.14% when its preceding site is an A. However, the site two positions away causes variation in this estimate, i.e. from 24.37% up to 34.96%. This illustrates the importance of using a second-order Markov chain at the ancestral root sequence.

We have shown that the addition of a second-order Markov chain at the ancestral root sequence increases model fit drastically over a site-independent distribution at the root. To determine whether the inferred context-dependent parameters are influenced by modelling a second-order Markov chain at the ancestral root sequence we have performed two separate MCMC runs of 100,000 iterations, one for the independent general time-reversible model and one for the context-dependent model (using independent model frequencies) with a second-order Markov chain at the root. After discarding the first 20,000 iterations as the burn-in sequence, we have constructed posterior difference densities for each of the 192 entries in the context-dependent matrices. The resulting posteriors were used to determine a Bayesian p-value (see e.g. [30]) for each of the 192 entries to test whether the parameter estimates were significantly different between both MCMC runs. This approach resulted in the detection of 24 significantly differing evolutionary parameters (at a 5% level). Interestingly, all of these were found to include one of the transition parameters. Indeed, 4 out of 16 neighbouring base combinations showed different estimates for *r _{AG }*(specifically:

As a second-order Markov chain at the ancestral root sequence yields the largest increase in model fit, we have re-estimated the corresponding root distribution probabilities when assuming context-dependent model frequencies. The results can be seen in Table Table3.3. We compare them with the results reported in Table Table2,2, which were obtained using a single set of independent model frequencies. Large relative differences occur in those combinations of sites involved in the CpG-methylation-deamination process. Indeed, the root distribution probabilities differ in the following combinations of preceding sites: AC, CC, GC, TC and CG. As each row in Table Table33 sums to one, a difference in one probability estimate influences the three remaining probability estimates given the same combination of preceding sites.

Each of the four preceding site combinations AC, CC, GC, TC has a drastically different estimate for the probability of a G occurring at the succeeding site. This is in accordance with Arndt et al. [2], who considered a more approximate development based on dinucleotide frequencies and odds ratios conditional on the left preceding base (rather than the two left preceding bases). The estimates indicate a dependence of the probability of observing a G on the second left-most base. Indeed, the probability of observing a G at a given site is respectively 4.62% and 4.06% when AC and CC precede that site, but only 2.91% and 2.90% respectively, when GC and TC precede that site.

The decreased probabilities of observing a G at a given site when a C precedes it, is caused by the hypermutability of CpG in human and other species, which, in turn, is due to the fact that cytosine is methylated only in CpG dinucleotides (in vertebrates). Both cytosine and 5-methylcytosine undergo high rates of spontaneous hydrolytic deamination, but deamination of 5-methylcytosine produces thymine, and mismatch repair of *C *→ *T *transitions is less efficient than that of *C *→ *U *transitions (for more information, see [31]). Further, when CG precedes a given site, the probability of observing a C at that site drops from 19.60% when assuming a single set of independent model frequencies to 15.98% when assuming context-dependent model frequencies. These differences demonstrate the importance of relaxing the assumption of independent model frequencies.

Based on the estimated 192 off-diagonal matrix entries of our context-dependent model assuming context-dependent model frequencies, we constructed Bayesian p-values (see e.g. [30]) to test whether the parameters were significantly different between an MCMC assuming independence at the ancestral root (i.e., a zero-order Markov chain) and a second-order Markov chain. For matters of comparison, Figure Figure44 shows all the substitution rates of our context-dependent model using context-dependent model frequencies, assuming a zero-order Markov chain to describe the sequence composition of the ancestral root. It shows large variation in substitution rates across different neighbouring base combinations. Given the presence of the CpG-methylation-deamination process in mammals, we focus on the substitution rates from C to T (*r _{CT}*) in the AXG, CXG, GXG and TXG contexts as well as their compensating substitution rates from G to A (

When comparing these estimates to the estimates of the matrix entries when assuming a second-order Markov chain at the ancestral root sequence, 27 significantly differing substitution rates are observed out of the 192 tested at the 5% significance level. Specifically, the following substitution rates were found to differ most (in terms of their p-values): *r _{GA }*in contexts CXA, CXC, CXG and CXT;

However, as can be seen from Figure Figure2,2, a second-order Markov chain at the ancestral root sequence along with context-dependent model frequencies offers the largest increase in model fit when compared to the independent general time-reversible model. We show the substitution rates for the 192 off-diagonal elements in our context-dependent model using context-dependent model frequencies and a second-order Markov chain at the ancestral root sequence in Figure Figure5.5. The difference with employing a zero-order Markov chain at the ancestral root is staggering, specifically in terms of the substitution parameters that describe the CpG-effect. The mean substitution rates *r _{CT }*are now 11.69, 11.22, 9.82 and 7.53 for the AXG, CXG, GXG and TXG contexts, respectively, versus 11.23, 8.84, 8.30 and 7.21 for the compensating substitution rates

The context-dependent substitution rates *r _{CT }*clearly illustrate the hypermutability of CpG in mammals, as the

These results are clearly more indicative of the effect of compensating mutations at the opposite side of a stem region and are supported by the largest increase in model fit. However, as discussed by Green et al. [32] and Siepel and Haussler [13], three of the nine studied genes are located on the opposite strand from the other six so our data contains a mixture of bases that correspond to the transcribed and non-transcribed strands. This supports our findings of compensating substitution rates at opposing stem regions, although we find that the *r _{CT }*substitution rate is between 4.1% and 26.9% higher than the compensating

Finally, in Table S3 (in Additional file 1), we report the estimates for the context-dependent model frequencies for our optimal model. Note that these probabilities are conditional on the identities of the two immediate flanking bases, while those at the ancestral root sequence are dependent upon the two left flanking bases. The table reveals two sets of frequency estimates related to the CpG-methylation-deamination process: *π*_{A|C|G}, *π*_{C|C|G}, *π*_{G|C|G}, *π*_{T|C|G }and *π*_{C|G|A}, *π*_{C|G|C}, *π*_{C|G|G}, *π*_{C|G|T}. The estimates of *π*_{A|C|G }and *π*_{C|G|T }are nearly identical, which can be attributed to the fact that these frequencies represent compositional aspects of opposing sides of a stem region. This also holds for *π*_{T|C|G }and *π*_{C|G|A }but not for the other two pairs. The reason for *π*_{C|C|G }and *π*_{C|G|G }as well as *π*_{G|C|G }and *π*_{C|G|C }to differ is reflected in the estimated substitution rates in Figure Figure5.5. Indeed, for the CXG and GXG neighbouring base combinations there is a discrepancy between the *r*_{CT }and *r*_{GA }substitution rates corresponding to the difference in estimates for the context-dependent model frequencies.

While Bayes Factor calculations did not support the assumption of a third-order Markov chain at the ancestral root, this may be partly related to the inflation in number of parameters in such models. To gather additional information on the usefulness of a third-order Markov chain at the ancestral root sequence, we have therefore applied a four-dimensional (i.e. clustering *π _{A}*,

Figure 6 (a) shows a comparison of the substitution patterns obtained using our context-dependent model with context-dependent model frequencies, with both a first-order Markov chain and the first-order successive approximation approach to model the ancestral root sequence. There are many differences (but also many similarities) between the two sets of parameter estimates and here we focus on the magnitude of the CpG-effects in both approaches (for a more thorough comparison, we refer to Additional file 1). While Figure 6 (a) shows a clear tendency for the four *r _{GA }*and

The underperformance of the coupled root distribution is quite unexpected as the substitution process along the lineages of the tree should give rise to the root distribution (which is always the case for independent evolutionary models). Lineage-dependent substitution processes, when unaccounted for, can disturb the estimation of the coupled root distribution (but not of the decoupled root distribution). To assess whether the context-dependent substitution process is lineage-dependent, we have divided our observed sequences in three clades: a first clade, called the 'human clade', consists of the sequences for human, chimpanzee, gorilla and orang-utan; a second clade, called the 'baboon clade', consists of the sequences for baboon, macaque and vervet; and a third clade, called the 'marmoset clade', consists of the sequences for marmoset, dusky titi and squirrel monkey. We have performed one MCMC run for each of these three datasets to estimate the context-dependent substitution parameters. Figure 6 (b) to 6 (d) compares these 192 parameters between the three clades using XY plots (if all the parameter means are equal to one another for the two datasets, then the parameter estimates will lie on the diagonal). Large differences can be seen on all three figures, indicating that many context-dependent substitution rates are in fact lineage-dependent, which may explain the underperformance of the coupled root distribution approach.

We have performed 100,000 iterations (discarding the first 20,000 iterations as burn-in) to estimate the parameters of our context-dependent evolutionary model with a second-order Markov chain at the root, when the branches are split so that the average substitution rate per time unit is smaller than or equal to 0.005 [3]. We compare all the estimates involved with those of the same model without splitting the branches in Figure Figure7.7. The evolutionary parameters show no differences and only minor differences can be seen for the estimates of the model frequencies. No significant differences could be found for the root distribution estimates nor for the branch length estimates. For more details, we refer to Additional file 1.

We have again compared the performance of the four different Markov chains (zero-, first-, second- and third-order) along the root sequence by calculating the appropriate (log) Bayes Factors. This time, the general time-reversible model (GTR) was used as the reference model. In Figure Figure8,8, we calculate the increase in model fit brought about by relaxing the assumption of site-independent evolution at the ancestral root sequence, while assuming a context-dependent model with independent model frequencies (the numerical values are shown in Table S4 in Additional file 1). A first-order Markov chain at the ancestral root sequence yields the largest increase in model fit, outperforming a second-order Markov chain at the ancestral root, which also drastically increases model fit compared to the reference model. All other assumptions in terms of the ancestral root distribution yield negative log Bayes Factors compared to the GTR model and are hence outperformed by a site-independent evolutionary model. The assumption of a third-order Markov chain at the root sequence yields a drastic decrease in model fit compared to the first- and second-order Markov chains, which may be related to the increase in number of parameters.

Figure Figure99 shows (log) Bayes Factors for our context-dependent model with context-dependent model frequencies and Markov chains of different orders at the ancestral root sequence (see also Table S5 in Additional file 1). Even when assuming a site-independent distribution at the ancestral root sequence, a drastic increase in model fit is realized through context-dependent model frequencies. As when assuming independent model frequencies, optimal fit is reached when assuming a first-order Markov chain at the ancestral root sequence. Once again, the drastic decrease (both in absolute terms and compared to the first- and second-order Markov chains) of a third-order Markov chain is apparent.

As a first-order Markov chain at the ancestral root sequence, combined with context-dependent model frequencies in the evolutionary model, yields the largest increase in model fit, we here report the root distribution and parameter estimates for this model. The first-order ancestral root distribution estimates can be seen in Table Table5,5, from which the decreased probability of observing a G at a given site when its preceding site is a G is immediately apparent due to the 5-methylcytosine deamination process (i.e., the CpG effect). This is in accordance with the results for the ancestral repeats dataset, although in that dataset a second-order Markov chain was used. These differences once again demonstrate the importance of relaxing the assumption of independent model frequencies.

Figure Figure1010 shows all the substitution rates of our context-dependent model using context-dependent model frequencies, assuming a first-order Markov chain to describe the sequence composition of the ancestral root. As for the ancestral repeats dataset, a large variation in substitution rates across different neighbouring base combinations can be observed. Given the presence of the CpG-methylation-deamination process in mammals, we once again focus on the substitution rates from C to T (*r _{CT}*) in the AXG, CXG, GXG and TXG contexts as well as their compensating substitution rates from G to A (

In Table S6 in Additional file 1, we report the estimates for the context-dependent model frequencies for our optimal model for the pseudogenes dataset. The decreased *r _{CT }*substitution rate in the GXG context mentioned above is reflected in the first set of frequency estimates related to the CpG-methylation-deamination process:

In this paper we have demonstrated the importance of modelling non-independent root distributions at the ancestral root sequence in a phylogenetic tree. This completes the dependency scheme across the entire tree so that the evolution at every nucleotide, ancestral or observed, is allowed to depend on other nucleotides. We have found that the use of a second-order Markov chain at the ancestral root sequence results in the largest increase in terms of model fit for a large dataset of primate ancestral repeat sequences. For a smaller dataset of primate pseudogenes, we have found that a first-order Markov chain at the ancestral root sequence yields the largest increase in model fit. We have calculated the differences in model fit using computationally demanding (see Additional file 1 for computation time requirements), but accurate Bayes Factor calculations and report drastic increases in model fit by modelling context-dependence, especially compared to other evolutionary assumptions such as among-site rate variation. Indeed, as shown in previous work [18], the increase in model fit brought about by assuming among-site rate variation equals about 355 log units, which is drastically lower than the more than 7.000 log units increase reported here. Further, for the pseudogenes dataset, the increase in model fit brought about by the optimal context-dependent model and root distribution equals about 150 log units, compared to an increase in model fit for among-site rate variation with a bidirectional mean log Bayes Factor of 5.4 units (annealing: [5.5; 5.7]; melting: [5.0; 5.2]), corresponding to the findings of Yang [14]. As far as we know and given the known presence of CpG-effects in our dataset, context-dependent models seem to be the best choice for increasing the evolutionary model's realism.

Our approach is probably best compared to that of Hwang and Green [3], although we have not employed branch-specific context-dependent models due to the fact that our dataset consisted only of primate sequences. We have shown, using careful and computationally intensive model selection, that the choice of the ancestral root distribution heavily influences the substitution rates for the CpG-methylation-deamination process and that first- and second-order Markov chains at the root sequence, independent of the context-dependent evolutionary model, are better fit to describe the data than a first-order approximation coupled to the context-dependent evolutionary model. Hwang and Green [3] do not employ a model selection approach in their study on a branch-specific context-dependent model.

Even though Hwang and Green [3] use more data (longer sequences and more species), statistical support for their modelling choices in terms of observed increase in model fit, which we have shown here to be a crucial aspect of examining complex evolutionary models, should be used to further strengthen their findings [23]. Given the complexity of branch-specific context-dependent models, model comparison is however seriously hampered by the immense computational demands. Our estimates of the substitution rates for the CpG-effect seem to be lower than those reported by Hwang and Green [3], although the authors do not report the mean estimates so only a visual comparison is possible. The *r _{CT }*and

We have also shown that the discrete approximation to the continuous-time Markov substitution process, as used by Hwang and Green [3], yields a significant increase in model fit in terms of the log Bayes Factors calculated. As the evolutionary model parameters were unaffected by partitioning the branches where appropriate, the branch partitioning approach itself is responsible for more accurately modelling the ancestral states throughout the underlying phylogenetic tree. As mentioned in the Methods section, we have split each branch into two or more parts of equal length such that the average substitution rate per time unit is smaller than or equal to 0.005 [3]. However, by no means does this approach guarantee that the optimal number of branch partitions are used and that the corresponding increase in model fit over an independent evolutionary model is maximized. Further research on such discrete approximations is reported in the work of de Koning et al. [29].

The design and study of context-dependent evolutionary models is also of interest when studying coding sequence evolution, as shown by recent publications on mutation-selection models (see e.g. [33-35]). Indeed, while current mutation-selection models use the general-time reversible model (GTR) for modelling mutation bias (see e.g. [34,35]), the context-dependent model presented in this manuscript may be better suited to model mutation bias at either one of the three codon positions.

Designing accurate context-dependent models is a complex process, with many different assumptions that require testing using an accurate procedure for model testing, which are computationally very demanding. Not only the context-dependent evolutionary model itself requires estimation and evaluation, we have also shown that the choice of an adequate ancestral root distribution is essential to accurately estimate the model's parameters. Moreover, we have also shown that the choice of an ancestral root distribution, changes across different datasets. In other words, there is no single context-dependent model or ancestral root distribution that can be argued to work well for a given dataset, without having to actually perform a series of model comparisons.

We have shown in this paper that context-dependent models are very useful when analyzing primate sequences, due to the presence of the CpG-methylation-deamination process. Since this is a process known to occur in mammalian evolution, context-dependent evolutionary models should also prove useful in the analysis of non-primate mammalian datasets. Even though the context-dependent models presented in this paper greatly increase model fit over independent models in the two datasets studied, additional work on context-dependent evolutionary models is required. For example, we have shown that an ancestral root distribution coupled to the context-dependent model does not improve model fit as much as a decoupled ancestral root distribution does. While we have provided a possible explanation for this result, additional work is required to clarify this issue.

Additional file 5 contains the necessary software routines, evolutionary models, ancestral root distributions and several example input files for the software to reproduce the various results presented in this paper. More information on how to use the software can be found in Additional file 1.

GB initiated the study, designed the context-dependent evolutionary model, implemented the Markov chain root distribution approach, co-designed the coupled root distribution approach, performed all the analyses, programmed the software routines and wrote a first complete version of the manuscript. YVdP contributed biological expertise to the analyses and edited the manuscript. SV contributed statistical expertise to the analyses, co-designed the coupled root distribution approach and edited the manuscript. All authors read and approved the final manuscript.

**File containing supplementary material and information that was not included in the main document**.

Click here for file^{(192K, DOC)}

**File containing supplementary figure S1**. Encapsulated PostScript figure.

Click here for file^{(200K, EPS)}

**File containing supplementary figure S2**. Encapsulated PostScript figure.

Click here for file^{(10K, EPS)}

**File containing supplementary figure S3**. Encapsulated PostScript figure.

Click here for file^{(2.0M, EPS)}

**File containing software classes, alignments used and example input files**.

Click here for file^{(2.6M, JAR)}

Yves Van de Peer acknowledges support from an Interuniversity Attraction Pole (IUAP) grant for the BioMaGNet project (Bioinformatics and Modelling: from Genomes to Networks, ref. p6/25). Stijn Vansteelandt acknowledges support from IAP research network grant nr. P06/03 from the Belgian government (Belgian Science Policy). We acknowledge the support of Ghent University (Multidisciplinary Research Partnership "Bioinformatics: from nucleotides to networks").

- Baele G, Van de Peer Y, Vansteelandt S. A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences. Syst Biol. 2008;57:675–692. doi: 10.1080/10635150802422324. [PubMed] [Cross Ref]
- Arndt PF, Burge CB, Hwa T. DNA sequence evolution with neighbor-dependent mutation. J Comp Biol. 2003;10:313–322. doi: 10.1089/10665270360688039. [PubMed] [Cross Ref]
- Hwang DG, Green P. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci USA. 2004;101:13994–14001. doi: 10.1073/pnas.0404142101. [PubMed] [Cross Ref]
- Morton BR. Neighbouring base composition and transversion/transition bias in a comparison of rice and maize chloroplast noncoding regions. Proc Natl Acad Sci. 1995;92:9717–9721. doi: 10.1073/pnas.92.21.9717. [PubMed] [Cross Ref]
- Morton BR. The influence of neighbouring base composition on substitutions in plant chloroplast coding sequences. Mol Biol Evol. 1997;14:189–194.
- Morton BR, Clegg MT. Neighbouring base composition is strongly correlated with base substitution bias in a region of the chloroplast genome. J Mol Evol. 1995;41:597–603. doi: 10.1007/BF00175818. [PubMed] [Cross Ref]
- Morton BR, Oberholzer VM, Clegg MT. The influence of specific neighbouring bases on substitution bias in noncoding regions of the plant chloroplast genome. J Mol Evol. 1997;45:227–231. doi: 10.1007/PL00006224. [PubMed] [Cross Ref]
- Salser W. Globin messenger - RNA sequences - analysis of base-pairing and evolutionary implications. Cold Spring Harbor Symp Quant Biol. 1997;42:985–1103. [PubMed]
- Erickson JW, Altman G. A search for patterns in the nucleotide sequence of the MS2 genome. J Math Biol. 1979;7:219–230. doi: 10.1007/BF00275725. [Cross Ref]
- Blaisdell BE. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eukaryotic nuclear DNA sequences both protein-coding and noncoding. J Mol Evol. 1985;21:278–288. doi: 10.1007/BF02102360. [PubMed] [Cross Ref]
- Lartillot N, Philippe H. Computing Bayes factors using thermodynamic integration. Syst Biol. 2006;55:195–207. doi: 10.1080/10635150500433722. [PubMed] [Cross Ref]
- Margulies EH, Chen CW, Green ED. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons. Trends Genet. 2006;22:187–193. doi: 10.1016/j.tig.2006.02.005. [PubMed] [Cross Ref]
- Siepel A, Haussler D. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol. 2004;21:468–488. doi: 10.1093/molbev/msh039. [PubMed] [Cross Ref]
- Yang Z. Estimating the pattern of nucleotide substitution. J Mol Evol. 1994;39:105–111. [PubMed]
- Gilks WR, Richardson S, Spiegelhalter DJ. Markov chain Monte Carlo in practice. Chapman & Hall, London, UK; 1996.
- Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Bayesian inference of phylogeny and its impact on evolutionary biology. Science. 2001;294:2310–2314. doi: 10.1126/science.1065889. [PubMed] [Cross Ref]
- Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. J Am Stat Assoc. 1987;82:528–540. doi: 10.2307/2289457. [Cross Ref]
- Baele G, Van de Peer Y, Vansteelandt S. Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences. BMC Evol Biol. 2009;9:87. doi: 10.1186/1471-2148-9-87. [PMC free article] [PubMed] [Cross Ref]
- Lanave C, Preparata G, Saccone C, Serio G. A new method for calculating evolutionary substitution rates. J Mol Evol. 1984;20:86–93. doi: 10.1007/BF02101990. [PubMed] [Cross Ref]
- Felsenstein J. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts, USA; 2004.
- Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol. 2000;17:1371–1383. [PubMed]
- Zwickl DJ, Holder MT. Model parameterization, prior distributions, and the general time-reversible model in Bayesian phylogenetics. Syst Biol. 2004;53:877–888. doi: 10.1080/10635150490522584. [PubMed] [Cross Ref]
- Steel M. Should phylogenetic models be trying to 'fit an elephant'? Trends in Genetics. 2005;21:307–309. doi: 10.1016/j.tig.2005.04.001. [PubMed] [Cross Ref]
- Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90:773–795. doi: 10.2307/2291091. [Cross Ref]
- Rodrigue N, Philippe H, Lartillot N. Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol. 2006;23:1762–1775. doi: 10.1093/molbev/msl041. [PubMed] [Cross Ref]
- Jensen JL, Pedersen A-MK. Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Prob. 2000;32:499–517. doi: 10.1239/aap/1013540176. [Cross Ref]
- ben Avraham D, Köhler J. Mean-field (n, m)-cluster approximation for lattice models. Physical Rev. 1992;45:8358. doi: 10.1103/PhysRevA.45.8358. [PubMed] [Cross Ref]
- Larson R, Hostetler RP, Edwards BH. Calculus with analytic geometry. eighth. Brooks Cole; 2005.
- de Koning APJ, Gu W, Pollock DD. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories. Mol Biol Evol. 2010;27:249–265. doi: 10.1093/molbev/msp228. [PMC free article] [PubMed] [Cross Ref]
- Schafer JL. Analysis of Incomplete Multivariate Data (Monographs on Statistics and Applied Probability) Chapman & Hall/CRC; 1997.
- Fryxell KJ, Zuckerkandl E. Cytosine deamination plays a primary role in the evolution of mammalian isochores. Mol Biol Evol. 2000;17:1371–1383. [PubMed]
- Green P, Ewing B, Miller W, Thomas PJ. NISC Comparative Sequencing Program. Green ED. Transcription-associated mutational asymmetry in mammalian evolution. Nat Genet. 2003;33:514–517. doi: 10.1038/ng1103. [PubMed] [Cross Ref]
- Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL. Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 2003;20:1692–1704. doi: 10.1093/molbev/msg184. [PubMed] [Cross Ref]
- Yang Z, Nielsen R. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol. 2008;25:568–579. doi: 10.1093/molbev/msm284. [PubMed] [Cross Ref]
- Rodrigue N, Philippe H, Lartillot N. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles. Proc Natl Acad Sci USA. 2010;107:4629–4634. doi: 10.1073/pnas.0910915107. [PubMed] [Cross Ref]

Articles from BMC Evolutionary Biology are provided here courtesy of **BioMed Central**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |