Molecular evolutionary rates can show significant variation among lineages, complicating the task of estimating substitution rates and divergence times using phylogenetic methods. Accordingly, relaxed molecular clock models have been developed to accommodate such rate heterogeneity, but these often make the assumption of rate autocorrelation among lineages. In this paper, I examine the validity of this assumption.
Rates of molecular evolution can vary substantially among sites, genes and lineages. In the past two decades, phylogenetic methods have been modified to take these forms of rate heterogeneity into account. For example, a number of ‘relaxed-clock’ models have been developed, which allow substitution rates to vary among lineages in a phylogenetic tree, without the need to assign a separate rate parameter for each branch (for a general overview, see Rutschmann 2006). These models enable the estimation of divergence times and lineage-specific substitution rates from sequence data that do not conform to a strict molecular clock. To make the estimation procedure tractable, relaxed-clock models place limitations on how rates are able to vary throughout the tree. Many of the widely used models assume that the substitution rate is indirectly heritable because it is correlated with a variety of inherited characteristics, including those associated with cellular environment, physiology and life history. Such patterns are then assumed to lead to some degree of autocorrelation between molecular rates in adjacent branches of the tree.
In practice, the assumption of rate autocorrelation is applied in one of several ways. In autocorrelated relaxed-clock models, the various biological factors are encapsulated in a single function describing the behaviour of rates throughout the tree. Some relaxed-clock methods employ an algorithm to minimize the rate changes between adjacent branches (Sanderson 1997, 2002), while others implement an explicit model of rate variation in which substitution rates can change or ‘evolve’ along branches (e.g. Huelsenbeck et al. 2000; Kishino et al. 2001; Aris-Brosou & Yang 2002; Lepage et al. 2006; Rannala & Yang 2007). After reviewing the various relaxed-clock models in detail, Lepage et al. (2006, 2007) proposed that the Cox–Ingersoll–Ross process possesses a number of desirable statistical properties that make it suitable for describing rate evolution. In this model, the mean rate at time t, E[R(t)], given an initial rate R(0), is equal to
$$\mathrm{E}[R(t)] = \mu + \bigl(R(0) - \mu\bigr)\,e^{-\theta t},$$
where μ is the stationary mean of the rate and θ determines the speed of the decay in rate autocorrelation.
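As a rough illustration of this mean-reverting behaviour, the following Python sketch evaluates the standard conditional mean of a Cox–Ingersoll–Ross process, μ + (R(0) − μ)e^(−θt), together with the corresponding decay of autocorrelation at stationarity, e^(−θt). The function names and the numerical values are illustrative assumptions and are not taken from Lepage et al. (2006, 2007).

```python
import math


def expected_rate(r0, mu, theta, t):
    """Conditional mean of a CIR process at time t, given rate r0 at time 0.

    mu is the stationary mean and theta controls how quickly the expected
    rate relaxes from r0 towards mu.
    """
    return mu + (r0 - mu) * math.exp(-theta * t)


def autocorrelation(theta, t):
    """Autocorrelation of the stationary process at time lag t."""
    return math.exp(-theta * abs(t))


# Hypothetical values: a lineage starting at twice the stationary mean rate.
for t in (0.0, 1.0, 5.0, 20.0):
    mean_rate = expected_rate(r0=2.0, mu=1.0, theta=0.5, t=t)
    rho = autocorrelation(theta=0.5, t=t)
    print(f"t={t:5.1f}  E[R(t)]={mean_rate:.4f}  autocorrelation={rho:.4f}")
```

Under these assumptions, a large θ means that the rate on a descendant branch quickly loses any resemblance to the ancestral rate, whereas a small θ preserves autocorrelation over long time spans.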
Studies of simulated and real data have demonstrated that estimates of substitution rates and divergence times are sensitive to the choice of relaxed-clock model (Ho et al. 2005; Drummond et al. 2006; Lepage et al. 2007), highlighting the need for careful model selection.
The biological motivation behind autocorrelated relaxed clocks can be summarized in the form of two key assumptions. The first assumption is that mutation rates are influenced by life-history characteristics such as generation time, metabolic rate and DNA repair efficiency (Gillespie 1991; Baer et al. 2007). For example, herbaceous plants generally have shorter generation times than woody plants, and so exhibit higher rates of molecular evolution (Smith & Donoghue 2008).
The second assumption is that rates of mutation and substitution are correlated. Unless evolution is proceeding in an effectively neutral manner, substitution rates are somewhat removed from mutation rates. It is possible, however, that closely related species experience similar selection intensities, with comparable fitness distributions for mutations. In some empirical studies of rate variation among lineages, the two steps are not separated, and substitution rates are taken as a proxy for mutation rates. This is primarily due to the difficulty in obtaining reliable estimates of mutation rates. In any case, the substitution rate in each lineage depends on the interplay of mutation, selection and drift.
Among mammals, substitution rates have been found to be correlated with body size (Lanfear et al. 2007) or metabolic rate (Gillooly et al. 2005), synonymous rates with generation time (Nikolaev et al. 2007) and maximum lifespan (Welch et al. 2008), and non-synonymous rates with population size (Nikolaev et al. 2007) and several other traits (Welch et al. 2008). It is not clear, however, whether such patterns extend to other taxonomic groups; Lanfear et al. (2007) found no evidence of a metabolic rate effect on substitution rates across a variety of metazoan taxa. Investigations of the correlations between rates and biological traits in plants have yielded mixed results (e.g. Barraclough & Savolainen 2001; Davies et al. 2004; Smith & Donoghue 2008).
The biological assumptions underlying autocorrelated relaxed clocks warrant closer examination. The first assumption, that mutation rates are closely linked to heritable traits, receives support from studies of mammalian data. Nevertheless, even these trends differ between mammalian mitochondrial and nuclear genomes (Welch et al. 2008). Studies of other taxa have indicated that the correlations observed in mammals cannot be readily extended to other metazoans (Thomas et al. 2006; Lanfear et al. 2007).
Another pertinent question, related to the first biological assumption, concerns the taxonomic scale of the sequence data that are being analysed with autocorrelated relaxed-clock models. In a study of the cytochrome b gene in mammals, Nabholz et al. (2008) found that family-level categorization explained the greatest amount of rate variation. Overall, one would predict the highest degree of autocorrelation to be observed at intermediate levels of the taxonomic hierarchy. At one extreme, we would expect a very high degree of underlying rate autocorrelation within a species, such that any rate variation among lineages would be primarily due to stochastic, uninherited factors (Drummond et al. 2006); indeed, many population genetic and coalescent-based approaches assume a strict molecular clock.
At the other end of the continuum, autocorrelation in life-history traits (or any other factor that might be strongly correlated with mutation/substitution rates) would inevitably break down at higher taxonomic levels (Gittleman & Kot 1990; Drummond et al. 2006). The magnitude of the differences among lineages would be amplified by very incomplete taxon sampling, and the degree of autocorrelation would decrease as taxon sampling becomes sparser. In cases where a dataset consists of distantly related taxa, there is little reason to expect any appreciable autocorrelation among the rates on different lineages. Consequently, it would be difficult to defend the validity of making a priori assumptions about the manner in which the rates vary among lineages.
Autocorrelated rate methods have been used to analyse sequences at various taxonomic scales, ranging from viral sequences obtained from a single host, to sequences acquired from representatives of different kingdoms of life. To investigate the trends in the application of autocorrelated relaxed clocks, a survey was conducted of all 46 studies that used such methods and were published in Royal Society journals prior to November 2008 (table 1).
The sequence data examined in these studies spanned a broad range of taxonomic levels (figure 1). Five studies analysed datasets in which the majority of nodes in the tree represented ordinal divergences or higher. At the other extreme, nine studies involved the analyses of datasets that included large numbers of sequences from conspecific individuals, with three conducted entirely at the population level.
For the methods of analysis to be applicable to all of these datasets, they would need to be sufficiently flexible such that they could accommodate widely varying levels of rate change and autocorrelation. For small, sparsely sampled datasets, it is doubtful whether there should be any expectation of rate autocorrelation at all.
The second assumption behind autocorrelated relaxed-clock models is that mutation and substitution rates are strongly correlated. This is reasonable for sequences that are evolving neutrally. In analyses of sequences under selection, however, such an assumption is far more questionable. This relates particularly to non-mammalian mitochondrial sequence data, of which the evolutionary history appears to have been driven substantially by adaptive evolution (Bazin et al. 2006). If rates of adaptive substitution are not tied to inherited factors, then the presence of such substitutions can seriously weaken the link between life-history traits and substitution rates. As mentioned above, however, closely related species could experience similar selection intensities, as implied under covarion models of sequence evolution (e.g. Tuffley & Steel 1998). The extent to which such processes could lead to rate autocorrelation among lineages is not known.
In a comprehensive study of mammals, no correlation was found between non-synonymous mitochondrial rates and life-history traits (Welch et al. 2008). This suggests that autocorrelated relaxed-clock models might be inappropriate for analyses of amino acid sequences. It might therefore be desirable to employ an autocorrelated model of among-lineage rate variation for non-coding sequences and an uncorrelated model for coding or amino acid sequences.
Rate autocorrelation among branches can be detected in a Bayesian framework, e.g. by Bayes factor comparison of autocorrelated and uncorrelated relaxed clocks (Lepage et al. 2007). Using this approach, Lepage et al. (2007) found that autocorrelated models provided a significantly better fit than uncorrelated models to three protein-coding DNA alignments, but not when the number of taxa was small.
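The arithmetic behind such a comparison can be sketched as follows. This minimal Python example assumes that log marginal likelihoods for the two clock models have already been estimated by some other means. The numerical values are hypothetical, and the verbal categories follow the conventional scale of Kass & Raftery (1995), which is an assumption here and not necessarily the criterion used by Lepage et al. (2007).

```python
def log_bayes_factor(log_ml_autocorrelated, log_ml_uncorrelated):
    """Natural-log Bayes factor in favour of the autocorrelated clock."""
    return log_ml_autocorrelated - log_ml_uncorrelated


def strength_of_evidence(log_bf):
    """Verbal category for 2 ln BF on the Kass & Raftery (1995) scale."""
    two_ln_bf = 2.0 * abs(log_bf)
    if two_ln_bf < 2.0:
        return "not worth more than a bare mention"
    if two_ln_bf < 6.0:
        return "positive"
    if two_ln_bf < 10.0:
        return "strong"
    return "very strong"


# Hypothetical log marginal likelihoods for a single alignment.
log_bf = log_bayes_factor(-10234.7, -10240.2)
favoured = "autocorrelated" if log_bf > 0 else "uncorrelated"
print(f"ln BF = {log_bf:.1f}; {strength_of_evidence(log_bf)} support "
      f"for the {favoured} model")
```

A positive log Bayes factor favours the autocorrelated model and a negative value favours the uncorrelated model; the reliability of the comparison depends on how accurately the marginal likelihoods were estimated.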
Rate autocorrelation can also be measured by using relaxed-clock models that do not make an a priori assumption of autocorrelation (Drummond et al. 2006; Rannala & Yang 2007). In these models, the rate on each branch is sampled from a single underlying distribution (such as a lognormal distribution), of which the parameters are estimated in the analysis. Comparison of the posterior and prior distributions of the covariance in (estimated) rates in neighbouring branches can then be used to provide an indication of rate autocorrelation in a dataset. Analyses performed in this framework have failed to detect rate autocorrelation in DNA/RNA sequences of influenza virus, dengue virus, marsupials (Drummond et al. 2006), plants (Moore & Donoghue 2007) and birds (Brown et al. 2008).
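To give a sense of what the covariance statistic measures, the following Python sketch simulates branch rates on a balanced binary tree under two simplified schemes and computes the sample covariance between the log-rates of parent and child branches. It is a toy simulation under stated assumptions, not the inference procedure of Drummond et al. (2006): the tree shape, the parameter values, the use of log-rates and the function names are all illustrative, and no posterior or prior distributions are estimated.

```python
import math
import random


def simulate_rate_pairs(depth, sigma, autocorrelated, rng):
    """Simulate branch rates on a balanced binary tree.

    Returns a list of (parent_rate, child_rate) pairs for adjacent branches.
    """
    pairs = []
    # Rates of the two branches descending from the root.
    current = [rng.lognormvariate(0.0, sigma) for _ in range(2)]
    for _ in range(depth - 1):
        next_level = []
        for parent in current:
            for _ in range(2):
                if autocorrelated:
                    # Child log-rate is centred on the parent log-rate.
                    child = rng.lognormvariate(math.log(parent), sigma)
                else:
                    # Child rate is drawn independently of its parent.
                    child = rng.lognormvariate(0.0, sigma)
                pairs.append((parent, child))
                next_level.append(child)
        current = next_level
    return pairs


def log_rate_covariance(pairs):
    """Sample covariance between parent and child log-rates."""
    xs = [math.log(parent) for parent, _ in pairs]
    ys = [math.log(child) for _, child in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(1)
uncorrelated = log_rate_covariance(simulate_rate_pairs(10, 0.5, False, rng))
autocorrelated = log_rate_covariance(simulate_rate_pairs(10, 0.5, True, rng))
print(f"uncorrelated: {uncorrelated:.3f}, autocorrelated: {autocorrelated:.3f}")
```

In this simulation the covariance is close to zero when rates are drawn independently and clearly positive when each child rate is centred on its parent rate. In a real analysis the rates are estimated rather than known, which is why the posterior distribution of the covariance is compared with its prior.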
Given the considerations described above, it is clear that further investigations of among-lineage rate variation are critically required. In view of the rapidly growing amount of sequence data, it should become possible to employ mixed relaxed-clock models when analysing datasets that comprise both selected and putatively neutral sites. For example, it might be preferable to employ relaxed-clock models that separate synonymous and non-synonymous rates (Seo et al. 2004; Lemey et al. 2007). Such an approach has the potential to provide a better fit to the data, and to be more illuminating with respect to the molecular evolutionary process.
I am grateful to Rob Lanfear and three anonymous referees for their constructive comments. This research was supported by the Australian Research Council.
One contribution of 11 to a Special Feature on ‘Whole organism perspectives on understanding molecular evolution’.
Table 1. List of studies using autocorrelated relaxed-clock methods and published in Royal Society journals, complete as of 19 November 2008.