|Home | About | Journals | Submit | Contact Us | Français|
During mitochondrial replication, spontaneous mutations occur and accumulate asymmetrically during the time spent single stranded by the heavy strand (DssH). The predominant mutations appear to be deaminations from adenine to hypoxanthine (A → H, which leads to an A → G substitution) and cytosine to thymine (C → T). Previous findings indicated that C → T substitutions accumulate rapidly and then saturate at high DssH, suggesting protection or repair, whereas A → G accumulates linearly with DssH. We describe here the implementation of a simple hidden Markov model (HMM) of among-site rate correlations to provide an almost continuous profile of the asymmetry in substitution response for any particular substitution type. We implement this model using a phylogeny-based Bayesian Markov chain Monte Carlo (MCMC) approach. We compare and contrast the relative asymmetries in all 12 possible substitution types, and find that the observed transition substitution responses determined using our new method agree quite well with previous predictions of a saturating curve for C → T transition substitutions and a linear accumulation of A → G transitions. The patterns seen in transversion substitutions show much lower among-site variation, and are nonlinear and more complex than those seen in transitions. We also find that, after accounting for the principal linear effect, some of the residual variation in A → G/G → A response ratios is explained by the average predicted nucleic acid secondary structure propensity at a site, possibly due to protection from mutation when secondary structure forms.
Vertebrate mitochondrial DNA (mtDNA) have an asymmetric replication mechanism that leads to asymmetry in nucleotide frequencies (Clayton, 1992a; Tanaka and Ozawa, 1994; Reyes et al., 1998; Yang et al., 2002). Recent studies suggest that these nucleotide asymmetries are caused by asymmetries in the probabilities of various, specific substitutions types (Reyes et al., 1998; Bielawski and Gold, 2002; Faith and Pollock, 2003). The major enzyme responsible for DNA replication in mitochondria is gamma polymerase (Copeland and Longley, 2003; Copeland et al., 2003), which initiates DNA synthesis directionally and asymmetrically from origins of heavy- and light-strand replication (OH and OL) that are separated by about two-thirds of the length of the genome (Clayton, 1992b; Graziewicz et al., 2002).
According to the classic model, replication of the heavy strand begins first, and proceeds from the OH towards Cytochrome b (Cyt-b) along the circular genome (Fig. 1). When the replication fork reaches the OL, a short ~30-bp stem-loop structure, replication of the light strand starts in the opposite direction, back towards Cytochrome c oxidase I (COI). After the heavy-strand replication fork has passed, the original heavy strand remains single-stranded until the light strand replication fork passes, and as a result, different portions of the genome spend different amounts of time single stranded, depending on their location on the circular mitochondrial chromosome and their distance from the OL. Since COI is close to the OL and the first gene passed by the light-strand replication fork, it spends the least amount of time single stranded, whereas Cyt-b is farthest and spends the longest time single stranded. Assuming constant average movement of the replication forks, estimates of the times spent single stranded (DssH) can easily be made (see Materials and Methods for further details). Hydrolytic deaminations occur in the single-stranded state at much higher rates than in the double-stranded state (Frederico et al., 1990, 1993; Francino and Ochman, 1997), causing a notable increase in substitutions of A → G and C → T on the heavy strand. These accumulate over the amount of time the heavy strand spends single stranded (Reyes et al., 1998; Tanaka and Ozawa, 1994), resulting in an asymmetric skew in base frequencies within single genomes, and cause asymmetry in substitution processes over evolutionary time (Bielawski and Gold, 2002; Faith and Pollock, 2003).
Different types of substitutions respond differently to time spent single stranded (Faith and Pollock, 2003). The number of A → G substitutions appears to increase almost linearly with DssH (Faith and Pollock, 2003), and we have used linear models to estimate the slope and intercept for A → G gradients, and to provide credible intervals for these estimators (Krishnan et al., 2004a; Raina et al., in review). For C → T substitutions, there is apparently a steep initial rise with increasing DssH followed by saturation for the rest of the genome (Faith and Pollock, 2003). The gene-level analysis in this study did not provide a clear description of the C → T substitution response curve because the quick increase in substitution rates occurs mainly within COI.
Here, we introduce a method that allows the asymmetric component of the substitution probability matrix to differ among sites without imposing a linear relationship (or any other prespecified relationship) on the DssH response curve. We assume that sites with similar DssH values tend to evolve at similar rates, and this is embodied in our model through a simple hidden Markov model (HMM) component, with the strength of the asymmetric component as the hidden state, and a transition probability such that the difference between substitution probabilities at adjacent sites in the alignment is distributed as a Gaussian with mean zero. Since our purpose is to understand the substitution process, we employed a Bayesian analysis that assumed a constant phylogeny, and which relied on likely distributions for ancestral reconstructions at each node and each site considered (Krishnan et al., 2004c). The model used to obtain ancestral reconstructions was the simpler general-time-reversible (GTR) model, which introduces an unknown but conservative degree of bias towards reversibility and equal substitution probabilities among sites. The use of these simply generated ancestral state distributions allowed us to build more complicated models that varied at each site with relatively little computational costs. The base model (Bielawski and Gold, 2002) assumed symmetric rates on both strands (Lobry and Sueoka, 2002; Sueoka, 1995), and asymmetry (Bielawski and Gold, 2002) was incorporated into probabilities for specific substitution types as the “hidden” component that was variable among sites. We evaluated results using available complete primate mitochondrial genomes (plus two near outgroups) for all eight transversion types as well as the four transition substitutions.
We used 18 complete mitochondrial genomes for our study, mostly primates, 15 of which (13 primates and two outgroups) were available from GenBank when this study was initiated. These are: Cebus albifrons (NC_002763, Arnason et al., 2000), Gorilla gorilla (NC_001645, Horai et al., 1995); Homo sapiens (NC_001807, Ingman et al., 2000); Hylobates lar (NC_002082, Arnason et al., 1996); Lemur catta (NC_004025, Arnason et al., 2002); Macaca sylvanus (NC_002764, Arnason et al., 2000); Nycticebus coucang (NC_002765, Arnason et al., 2000); Pan paniscus (NC_001644, Horai et al., 1995); Pan troglodytes (NC_001643, Horai et al., 1995); Papio hamadryas (NC_001992, Arnason et al., 1998); Pongo pygmaeus pygmaeus (NC_001646, Horai et al., 1995); Pongo pygmaeus abelii (NC_002083, Xu and Arnason, 1996); and Tarsius bancanus (NC_002811, Schmitz et al., 2002) and outgroups Tupaia belangeri (NC_002521, Schmitz et al., 2000) and Cynocephalus variegatus (NC_004031, Arnason and Janke, 2002). Our colleagues (Raaum et al., in review) provided us with three other primate genomes, Cercopithecus aethiops, Colobus guereza, and Trachypithecus obscurus.
Gene alignments were obtained using ClustalW (Thompson et al., 1994), and a neighbor-joining (NJ) tree was obtained in PAUP* 4.0 (Swofford, 2001) using a GTR model applied to concatenated alignments of all tRNAs, rRNAs, and protein-coding genes. This tree (Fig. 2) was used in all further analyses. To analyze substitution rates that were as unaffected by selective processes as possible, we used the third codon positions in all 13 aligned regions of protein-coding genes and only from codons that were in the fourfold redundancy class in all species in the alignment. Since asymmetric mutations appear to occur on the heavy strand, we considered the substitution process for the heavy strand, rather than the coding strand (in contrast to Faith and Pollock, 2003, and many other publications).
Each aligned site, s, in each genome, g, is associated with a position number, pg, which is based on the arbitrary designation of the beginning of the tRNA adjacent to the control region as position 1. The calculation of time spent single stranded at a site in a genome ( ) depends upon whether that site is located before (case 1) or after (case 2) the light-strand origin of replication, with respect to the direction of movement of the heavy-strand replication fork (Fig. 1). Calculations used the following rules:
where OLg is the position of the origin of light-strand replication in genome g, Lg is the length of genome g, and units are proportional to genome length (that is, in units of genomes divided by the unknown replication rate per genome). If the first position in the arbitrary numbering system of the circular genome lies between OLg and a position under consideration (Fig. 1), then a correction must be made to get the distance in nucleotides between OLg and the position. For each site, the DssHp were then averaged over all 18 species. For simplicity’s sake, in future reference we will define DssH to mean DssHp.
Bayesian analyses that simplify computational complexity by assuming two substitutions per branch have been developed and used to obtain a posterior distribution of ancestral sequences, and have been tested and described elsewhere (Krishnan et al., 2004c). Here, we used this method with a GTR model and the NJ phylogeny to reconstruct a posterior distribution of ancestral states for all nodes for the third codon positions of conserved fourfold redundant codons (see Materials and Methods section). The ancestral states mapped onto internal nodes were treated as augmented data along with the original sequence data and updated during MCMC runs using a Gibbs Sampling scheme (Krishnan et al., 2004c). The distribution of ancestral sequences obtained in this fashion has been compared with other methods (i.e., ML and parsimony in PAUP*; Swofford, 2000), and relatively low frequency biases were found (Krishnan et al., 2004c).
Samples from the posterior probability distribution of these augmented data then served as the first stage of a posterior predictive distribution to relatively easily calculate our more complex model, in which the substitution matrix varied among all the sites (Krishnan et al., 2004a, 2004b; Nielsen, 2002). A full posterior predictive approach involves evaluating the accuracy of the model-based data by obtaining distributions of test-statistics such as likelihood ratio (Gelman et al., 1996). Previous studies on among-site or among-gene variation of average rates assume context dependence (Pedersen and Jensen, 2001; Robinson et al., 2003), correlation of rates among genes (Thorne and Kishino, 2003), or adjust for rate variation among genes and lineages (Hasegawa et al., 2003). None of these actually vary the substitution probability matrix (as opposed to the average rate) among sites, since with most standard methods there would be an exorbitant computational expense involved in doing so. Here, the calculation of likelihoods is feasible since the ancestral states at internal nodes are known (that is, the posterior predictive ancestral states from the simpler model are known) and used for all calculations involving the complex model.
Our complex model starts with a symmetric “base” model that is the same at all sites and assumes strand symmetric rates of evolution (Bielawski and Gold, 2002; Lobry and Sueoka, 2002; Sueoka, 1995). This symmetric model is not necessarily reversible (in contrast to most commonly used models of evolution), and has fewer free parameters than the reversible model (Yang, 1994), since it assumes equal rates of complementary substitutions. An asymmetric component was included by adding a site-specific parameter, cp, to a particular prespecified substitution rate (e.g., A → G). This “hidden” component was also subtracted from the rate of self-change at each site for the appropriate nucleotide (e.g., A → A). The values of cp at each site were not dependent upon the magnitudes of DssH, but values of cp at consecutive sites considered were related by a Markovian component dependent on Δ, the difference in DssH at consecutive sites (when sites are ordered according to DssH), such that
where N is the normal distribution and α is a variable parameter that determined the magnitude of the variance; it is adjusted over the course of the Markov chain, and is constant among sites. Thus, the probability distribution of the asymmetric component at a site was a normally distributed random variable with mean equal to the asymmetric component at the previous site, and variance estimated as a function of Δ and the free parameter a. There was no specific a priori linear or nonlinear response built into this HMM. An MCMC analysis was run on the parameters of the symmetric model, the site-specific hyperparameters cp, and the HMM component α. Posterior distributions of each of these parameters were obtained using flat, uninformative priors for all parameters and symmetric, uniform, and bounded proposal distributions centered around the previous parameter value for each step in the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970). To determine the optimal proposal ranges, simulations were run from the starting values based on the average of a short preliminary MCMC run, and using different proposal ranges. The range for each parameter proposal distribution that corresponded to 60–80% acceptance was then fixed for the subsequent runs (Krishnan et al., 2004c).
MCMC analyses with this HMM model were performed on all 12 substitution types in 12 separate runs, and site-specific response curves of substitution rates (relative to the symmetric rate for the same substitution type) versus DssH were obtained (Fig. 3).
The online interface mfold (Zuker, 2003) was used to predict secondary structures for each predicted mRNA for each of the eighteen species, and the “loopiness” of each site in the alignment considered (see Materials and Methods section) was estimated as the proportion of alternative structures for which that site forms a “loop” rather than a “stem.” Although biological effects may occur based on DNA structure, we used RNA structure predictions because they are more developed and probably more accurate, and take into account higher order interactions among sites (Zuker, 2003). We considered all predicted alternative secondary structures that were at least half as stable as the optimal structure, and the “loopiness” of a site was averaged across the 18 species. For analysis of the correlation between loopiness and the asymmetric substitution component for A → G substitutions, the expectation for the component was calculated based on a linear regression of the posterior average A → G/G → A ratio at each site, treating DssH as an independent variable. Residuals were then calculated as deviations from expectation. Sites were grouped into 17 categories based on degree of average loopiness, and residuals were averaged for all sites in a category. We tested for association between loopiness and average residual A → G/G → A ratio using a weighted regression analysis with loopiness as the independent variable.
The approach outlined here is innovative in that it does not specify the precise nature of the relationship between substitution rates and time spent single stranded, and in that it allows for variation in the rate of individual substitution types at every point in the genome. We were able to evaluate a continuous response that did not require arbitrary choice of window sizes for averaging and which enabled us to detect both site-specific deviations and regional trends. The nature of each substitution response is not built into the HMM model a priori, and there is no equation, linear or otherwise, which determines the type of response visualized. Therefore, any linear or nonlinear response observed in our results is a direct reflection of trends in the data. Our use of a posterior distribution of ancestral states from a simpler model makes it feasible to create this site-specific model complexity with relatively little computational effort, and thus allow exploratory statistical analysis of complex site-specific substitution behavior.
We first demonstrated the utility of this method by analyzing transition substitution probabilities. As previously predicted (Faith and Pollock, 2003), substitution probabilities from A → G increase linearly with time spent single stranded, and C → T substitutions increase rapidly at low single-strandedness values and then remain approximately constant over the rest of the genome (Fig. 3a and b). In contrast, the reverse substitutions (G → A and T → C) remain relatively constant and the asymmetric components remain at relatively low levels (Fig. 3a and 3b). Expanded views of the number of reverse transition substitutions relative to the symmetric base model (Fig. 4a and 4b) show that there are trends in these substitutions: G → A substitutions decrease in approximately linear fashion with increasing time spent single stranded, whereas T → C substitutions increase approximately linearly, then appear to level off at about DssH = 0.9. We must caution here that relative rates below 1.0 do not make clear biological sense, since this implies that fewer mutations occur due to time spent in the single-stranded state; the observed trend in G → A substitutions may instead be due to a tendency to confound backward (G → A) substitutions with the much more prevalent forward (A → G) substitutions.
In addition to the main trends, there is also variation in the ratios of substitution probabilities in the form of local dips and rises in the average posterior probability. It is likely that various factors, including codon bias, dinucleotide bias, and nucleic acid secondary structural features, affect substitution rates in addition to the effect of time spent single stranded. We are currently addressing these factors by combining them into even more complex phylogeny-based evolutionary models; we show preliminary results on the effect of one of these factors (nucleic acid secondary structure) below.
The relative levels of transversion substitutions tend to be much closer to 1.0 than for transitions (Fig. 3c–f), with the notable exceptions of C → G and A → C substitutions, which average around 3.3 and 1.6 times as much (respectively) as their rates in the symmetric base model and vary somewhat along the genome (Fig. 3c and 3d). Arbitrarily referring to the transversion substitutions with a greater asymmetric component as “forward” substitutions (C → G, A → C, T → G, and A → T), all four reverse substitutions are nearly constant along the genome. Furthermore, the forward T → G and A → T relative substitution rates, although on average close to 1.0, can be seen to vary considerably with time spent single stranded when the scale is expanded (Fig. 4e and 4f). Posterior means for α (the parameter that controls correlation between adjacent sites) are much higher for forward transitions (C → T and A → G) compared to backward transitions (T → C and G → A) and all transversions.
The variation in forward transversion response curves with time spent single stranded is intriguing in that they tend to increase after a short lag, peak around DssH = 0.9, then decrease (Fig. 4). Again, we are cautious about overinterpreting this, since the observed trend may be due to a tendency to confound the transversion substitutions with the more strongly biased and variable G → A and C → T substitutions. It is also hard to see a plausible biological reason why mutation rates should decrease with longer times spent single stranded, and a complicated interactive bias caused by both transition types seems more likely. Transcription could be invoked, but there is no clear difference in the transition responses for ND6, which is transcribed on the opposite side as the other protein-coding genes (Fig. 3; Faith and Pollock, 2003). This interpretation is supported by the shape of the T → G curve, which is almost exactly the opposite of the A → T curve, and reaches a minimum at the same point the others reach a maximum. This point corresponds to the ND4 gene, which has been previously noted to have unusually strong asymmetric features for undetermined reasons (Bielawski and Gold, 2002). If taken at face value, the situation with transversions appears to be complex, and in future studies we will incorporate simulations to determine the strength of biases that strongly asymmetric substitution process may have on inferring other substitution processes.
To determine whether nucleic acid secondary structure has a detectable effect on asymmetric transition rates, we looked for a correlation between secondary structure and residual transition asymmetry. Since the average A → G/G → A ratio (ρ) has an approximately linear relationship with time spent single stranded, we performed a linear regression on this average, treating DssH as an independent variable. Residuals were then taken as differences from the expectation, ρ = 1.2839 * DssH + 1.2242. The residuals are larger when there is more loopiness (less secondary structure) at a site (Fig. 5). A possible explanation for this preliminary result is that formation of secondary structure in the “single-stranded” DNA decreases the effective time spent single stranded (Seligmann et al., in review), thus decreasing the A → G mutation rate.
We have shown here that individual asymmetric mutation processes can be detected and evaluated at a site-specific level along the mitochondrial genome. Such evaluation will be important in developing a more complete and unified model of evolution in mitochondrial genomes. We have obtained an indirect indication that secondary structure may modify mutation and substitution processes, and in other work we are also incorporating the effects of adjacent nucleotides (which can strongly modify substitution rates) and codon bias. Many transversions show no indication of asymmetric substitution bias due to single strandedness, and only two of them show strong asymmetric bias. Interestingly, the two most biased transversions (C → G and A → C) do not correspond to the best-characterized lesion from oxidative damage, the conversion of the guanine to 8-oxoguanosine, which can mispair with adenine leading to a G → T transversion substitution. There are indications, however, that methylglyoxyl (a major product of DNA oxidation), readily produces C → G transversions in vivo (Murata-Kamiya et al., 2000). The variation in the transversion substitution response curves is not simple, and may well be the result of fairly simple mutation response curves (linear or saturating) combined with inference bias introduced by the much stronger transition response curves.
Future work should determine how much inference bias can be expected with these methods, and whether such bias can be corrected for in a combined analysis. It will also be important to expand the analysis to do comparative analysis of other taxon groups and larger taxon groups. The primates were chosen partly because they were the vertebrate family with the largest representation of complete genomes, but presuming that the rapid rate of increase in complete genomes continues (from 67 to over 300 in the last 4 years), expanded comparative analyses will soon be feasible. The results of the work presented here may be useful for improving phylogenetic analysis, for carrying out refined comparative analysis of substitution processes and replicative mechanisms, and in improving estimates of synonymous DNA substitution processes for incorporation into and comparison with amino acid substitution models, which may allow more accurate detection of selection and functional divergence.
We thank T. Disotell’s group for sharing genomes prior to publication. This work was supported by grants from the National Institutes of Health (GM065612-01 and GM065580-01), and the State of Louisiana Board of Regents [Research Competitiveness Subprogram LEQSF (2001-04)-RD-A-08 and the Millennium Research Program’s Biological Computation and Visualization Center] and Governor’s Biotechnology Initiative.