|Home | About | Journals | Submit | Contact Us | Français|
Geminiviruses are devastating viruses of plants that possess single-stranded DNA (ssDNA) DNA genomes. Despite the importance of this class of phytopathogen, there have been no estimates of the rate of nucleotide substitution in the geminiviruses. We report here the evolutionary rate of the tomato yellow leaf curl disease-causing viruses, an intensively studied group of monopartite begomoviruses. Sequences from GenBank, isolated from diseased plants between 1988 and 2006, were analyzed using Bayesian coalescent methods. The mean genomic substitution rate was estimated to be 2.88 × 10−4 nucleotide substitutions per site per year (subs/site/year), although this rate could be confounded by frequent recombination within Tomato yellow leaf curl virus genomes. A recombinant-free data set comprising the coat protein (V1) gene in isolation yielded a similar mean rate (4.63 × 10−4 subs/site/year), validating the order of magnitude of genomic substitution rate for protein-coding regions. The intergenic region, which is known to be more variable, was found to evolve even more rapidly, with a mean substitution rate of ~1.56 × 10−3 subs/site/year. Notably, these substitution rates, the first reported for a plant DNA virus, are in line with those estimated previously for mammalian ssDNA viruses and RNA viruses. Our results therefore suggest that the high evolutionary rate of the geminiviruses is not primarily due to frequent recombination and may explain their ability to emerge in novel hosts.
While most research on plant virus evolution has concentrated on RNA viruses (27, 73), the biggest emerging threat to crops worldwide are the single-stranded DNA (ssDNA) geminiviruses, especially those of the dicot-infecting, whitefly-transmitted Begomoviridae (47, 78). One reason for the emphasis on plant RNA viruses has been the desire to assess the extent and structure of genetic variation created by their use of error-prone RNA polymerase, which in turn is necessary to understand their molecular epidemiology (54) and which can complicate the creation of resistant crop lines (35). However, multiple genera of geminiviruses also display high levels of within-host genetic variation (37, 74, 84), suggesting that plant ssDNA viruses might exhibit genetic diversity similar to that seen in plant RNA viruses, even though they utilize host DNA polymerases (8). More recently, a controlled experimental study on within-host variability of the begomovirus Tomato yellow leaf curl China virus found an average mutation frequency of 3.5 × 10−4 per base in 60 days of infection on a tobacco plant (30), a value similar to those found in plant RNA viruses (76, 77). Together, these results imply that the mutation rate (and, by extension, long-term rates of nucleotide [nt] substitution) of the geminiviruses might be more similar to those of plant RNA viruses (as high as ~10−5 nt/generation) (37, 46, 60) or to the ssDNA bacteriophage X174 (~7 × 10−6 nt/generation) (70) than to double-stranded DNA viruses (~10−8 nt/generation) (17).
The idea that ssDNA viruses exhibit substitution rates similar to those seen in RNA viruses (between 10−3 and 10−5 substitutions per site per year) (38) has been previously proposed. Most notably, studies of two mammalian parvoviruses, canine parvovirus and B19 erythovirus, found substitution rates similar to those of RNA viruses, at ~10−4 substitutions per site per year (subs/site/year) (79, 80). However, to date, no such equivalent analyses have been conducted on any plant ssDNA virus.
Rather than considering substitution dynamics, most work on evolutionary mechanisms in the geminiviruses has focused on the causes and consequences of recombination (34). Geminiviruses can replicate by rolling-circle and recombination-dependent mechanisms (69) and show evidence of both homologous and heterologous recombination. Strikingly, recombinant strains occur frequently (21, 26), emerge often (4, 52, 62), and occasionally yield virulent strains that cause pandemics (95). However, although recombination is undoubtedly an important evolutionary mechanism in these viruses, it cannot create genetic variation de novo, so background mutation pressure must also be a key element of geminivirus evolution (30, 37).
To explore the dynamics of nt substitution in the geminiviruses, we undertook an analysis of the rate of molecular evolution in one of the most devastating and well-studied tomato pathogens, Tomato yellow leaf curl virus (TYLCV) (65), for which a relatively abundant data set of whole-genome sequences exists. This virus was first identified in 1939 (65). It caused the complete destruction of the tomato crop in the Jordan River valley in 1959 (10) and a near-complete loss in Hispaniola in the early 1990s (66). It was also the first begomovirus described with an unusual monopartite genome—analogous to the DNA-A strand—that is in some, but not all instances, accompanied by a DNA-β or a typical DNA-B (30, 71).
We used these data to estimate the rate of nt substitution and its determinants in the whole DNA-A-like genome of tomato yellow leaf curl disease (TYLCD)-causing viruses. The DNA-A codes for two proteins in the sense strand: a pre-coat (V2) and coat protein (V1), the latter of which interacts with the whitefly vector (65) and is shown to be conserved among geminiviruses (61). The complement of the DNA-A codes for four proteins: the replication-associated protein (Rep), a transcriptional activator of the coat protein (TrAP), a replication enhancer (REn), and the C4 protein that assists in intercell movement and which is entirely encoded in a different reading within the Rep gene (65). To increase the rigor of our analysis, we also examined the substitution rate in the coat protein-coding gene, V1, and in the “hypervariable” intergenic region (30, 61).
Complete genome sequences of viruses casing TYLCD were downloaded from GenBank (Table (Table1).1). When the viruses were bipartite or were accompanied by a DNA-β, only the DNA-A was considered in the analysis. Viruses were abbreviated as most recently codified (22) or as suggested in publications referencing the sequence. For the viruses submitted directly to GenBank without accompanying publications, we named the sequences in accordance with current conventions and according to the isolate designation information contained in the GenBank file.
Since we were interested in the rate of molecular evolution, our analyses only considered sequences that could be dated. We therefore collected the dates (years) of isolation for as many of these TYLCD-causing viruses as possible from the GenBank files, from related publications, or by contacting the authors directly (Table (Table1).1). Sequences from viruses that had been reared for a substantial period of time on plants in the laboratory were excluded from analysis, since they were not necessarily reflective of evolutionary dynamics in nature. We also excluded the distantly related Tomato yellow leaf curl Indonesia virus and Tomato yellow leaf curl Aragua virus.
All genome sequences were organized to begin at the nick site in the invariant nonanucleotide sequence in the origin of replication (TAATATT-3′/5′-AC ). The data sets were then aligned by using MUSCLE (http://www.drive5.com/muscle/ ) and manually adjusted by using SE-AL (http://evolve.zoo.ox.ac.uk). In the coat protein-coding region, two sequence regions were excised from the alignment because they were present only in the TYLCV-[IL] isolate and would have generated three missense residues and the insertion of two residues in the coat protein-coding region (6 nt in total, nt numbers 1109 to 1114 and 1123 in GenBank accession X15656). Since TYLCV-[IL] was the first TYLCV sequenced, we considered it likely that these “mutations” were sequencing errors and so excluded them from our study. The final full genome alignment was 2,830 nt in length and consisted of 56 taxa (Table (Table22).
In addition to the full genome, the 56 genome data set was trimmed to create a smaller data set (780 nt), corresponding to the coding region for the coat protein (V1 or CP). We did not conduct an equivalent analysis for the gene encoding the replication-associated protein (Rep) because the length of the Rep protein was highly variable.
We also created two data sets to explore the rate of evolution in the rapidly evolving intergenic region (61) for two clades of TYLCV: the severe phenotype of TYLCV sensu stricto and the mild phenotype of TYLCV sensu stricto. We supplemented the dated full-length sequences with additional partial genome sequences of TYLCV with the published dates of isolation. For the severe TYLCV sensu stricto, 19 additional “intergenic region” sequences from Italy and Spain sampled between 2000 and 2003 (DQ121479 to DQ121485, DQ317727 to DQ317771) were added to the data set (excluding two highly divergent isolates: TYLCV-Gez and TYLCV-IR). The resultant 312 nt, 71-taxa alignment was trimmed to exclude all coding regions. The mild clade of TYLCV sensu stricto included all 12 whole-genome sequences and nine additional intergenic sequences that have dates of isolation between 2000 and 2003 (DQ121474 to DQ121478, DQ317772 to DQ317775). Trimming these to just the intergenic region yielded a 319-nt alignment.
Putative recombinant genomes were identified by using the RDP3 package (http://darwin.uvigo.es/rdp/rdp.html), which contains six recombination detection programs: RDP, GENECONV, MaxChi, Chimeara, Bootscan, and SiScan (49). The default detection thresholds were used in all cases.
Maximum-likelihood (ML) phylogenetic trees for each data set were estimated by using PAUP* 4.0 (86) with tree-bisection-recombination branch-swapping in each case, with the best-fit model of nt substitution for each data set determined using MODELTEST (68) (Table (Table2).2). To assess the support for individual nodes on the phylogenies, we undertook a bootstrap analysis utilizing 1,000 replicate neighbor-joining trees under the ML substitution model in each case. This analysis was also conducted in PAUP*.
To estimate rates of nt substitution per site and the “times to most recent common ancestor” (TMRCA), we used the Bayesian Markov chain Monte Carlo (MCMC) method available in the BEAST package (http://evolve.zoo.ox.ac.uk). Both strict and relaxed (uncorrelated exponential and uncorrelated lognormal) molecular clocks were utilized for each data set (19), as well as five demographic models (constant population size, expansion, exponential and logistic population growth, and a piecewise Bayesian skyline plot), which were used as coalescent priors. The best-fitting models were determined by a comparison of posterior probabilities and are shown in Table Table2.2. MCMC chains were run for sufficient length to ensure convergence, with 10% of the MCMC chains discarded as burn-in, and statistical uncertainty in the estimates provided by the 95% highest probability density (HPD) values (Table (Table22).
The ratio of nonsynonymous (dN) to synonymous (dS) nt substitutions per site (ratio dN/dS) in the coat protein alignment was estimated by using the ML single-likelihood ancestor counting method available through the DATAMONKEY web server (44), with the nt substitution model chosen by MODELTEST and the input phylogenetic tree inferred by using neighbor joining.
Finally, we used a phylogenetic approach to assess the extent of mutational bias in the full genome data set. These data were first aligned with their probable closest outgroup (African cassava mosaic virus, Kenya isolate, GenBank accession number J02057 [see, for example, references 45 and 63]), and an ML phylogeny, rooted with this outgroup, was inferred by using the procedure described above (data not shown, but available from the authors on request). The number of unambiguous nt changes of each type (e.g., A→C and A→G), as well as the putative ancestral sequence, were then inferred by mapping all substitutions onto the ML phylogeny using the parsimony method available in PAUP*. These observed proportions were then compared, using χ2 tests of significance, to those expected by summing the forward and reverse of each substitution type (i.e., the sum of the number of A→C and C→A transversions), assuming they were equally likely by dividing that sum in half, and then adjusting that value by the relative frequency of each base in the inferred ancestral sequence (i.e., frequency of C/[frequency of C + frequency of A]).
For the full genome data set, a phylogenetic signal compatible with recombination (i.e., topological incongruence) was detected in nearly all of the isolates analyzed. Hence, removing recombinants would have left too few sequences for a viable estimation of substitution rates. However, while the occurrence of recombination violates the assumptions of coalescent analysis, this process is most likely to increase genetic variation and hence elevate both the average apparent substitution rate and its variability.
Significant phylogenetic evidence for recombination (P < 0.05) was also detected in the coat protein gene (V1) alignment, but in only in 2 of the 56 taxa: TYLCTHV-SaNa by RDP, Geneconv, and Bootscan and TYLCTHV-ChMai by Geneconv, Bootscan, MaxChi, Chimaera, and SciScan. Removing these two taxa was sufficient to remove detectable recombinants, producing a data set of 54 taxa for subsequent analysis. No recombination was also detected in either of the intergenic data sets.
The ML phylogenetic tree estimated for the full genome alignment is shown in Fig. Fig.1.1. Despite the high levels of recombination detected in this data set, the monophyly of nearly all TYLCD-causing viral species was strongly supported (the placement of TYLCV-Gez, a known interspecific recombinant  disrupted the monophyly of TYLCV sensu stricto). Further, the phylogenies estimated from the whole-genome alignment and the coat protein (V1) alignment were qualitatively congruent (Fig. (Fig.11 and and2).2). One notable exception was the grouping of the three Asian species: TYLCCNV, TYLCTHV, and TYLCKaV. The V1 tree strongly suggests that the coat protein of these Asian species is most closely related to that of TYLCSV and its recombinant relatives (Fig. (Fig.2),2), which are predominantly found in the Mediterranean (25), whereas there is no bootstrap support for this in the full genome phylogeny (Fig. (Fig.1).1). Support for a clade containing TYLCSV and the Asian TYLCV has been previously observed with a smaller coat protein data set (53) but is unsupported in all full genome (DNA-A) phylogenies published to date (see, for example, references 52 and 64). The phylogenies for the two intergenic alignments are provided in the supplemental material.
To further reveal the processes shaping TYLCD-causing virus evolution, we determined whether specific kinds of mutational changes are overrepresented relative to others in the full-genome alignment. After adjusting for the initial nt frequency, 8 of the 12 substitution patterns were significantly more (C→A, C→T, G→A, and G→T) or less (A→C, A→G, T→C, and T→G) likely than expected (Table (Table3).3). The most significant of these were C→T (P = 6.9 × 10−4) and G→A (P = 1.1 × 10−6). Therefore, these results indicate there are major biases in the evolution of the TYLCD-causing viruses, even within the same substitution types. These biases also favor a shift from CG→AT genome content, most notably in a bias in C→T and G→A transitions.
Although a variety of different demographic models were supported, in all cases the relaxed molecular clock was a better fit than the strict molecular clock to the data analyzed here (Table (Table2).2). Notably, for all four data sets, qualitatively (and, for the mean values, quantitatively) similar values were found with all demographic models and for both uncorrelated lognormal and exponential relaxed molecular clock models. The mean rate of nt substitution estimated for the full genome was high, at 2.88 × 10−4 subs/site/year (Table (Table2)2) and hence within the range documented in RNA viruses (38). As we had expected given recombination, the 95% HPD values around that average estimate were extensive, representing the widest HPD calculated for any of the alignments in the present study.
Despite the occurrence of recombination, that similarly high rates were observed in the coat protein gene data set where recombinants were excluded reveals that this process has not significantly biased estimates. Indeed, the mean substitution rate in the coat protein gene data set was higher than that of the whole genome, at 4.63 × 10−4 subs/site/year. It was also notable that the high substitution rate for the coat protein gene was largely due to synonymous substitutions (dN/dS = 0.12). This low dN/dS ratio is typical of the coat proteins of vector-borne plant RNA viruses and most likely reflects strong purifying selection resulting from the highly specific interaction between the virus and the insect vector (27, 78). Indeed, in our data set, 44% of the codons (114 of 259) were determined to be under statistically significant negative selection, with no sites under positive selection (P < 0.05), so that most fixations are likely to be neutral.
Substitution rates were even higher for the noncoding intergenic regions (severe, 1.75 × 10−3 subs/site/year; mild, 1.37 × 10−3 subs/site/year; Table Table2).2). Notably, the severe intergenic region evolved significantly faster than V1 such that the 95% HPD values around the average estimates did not overlap. The mild TYLCV analysis had wider HPD values, presumably because there were fewer sequences in this data set compared to the TYLCV severe intergenic analysis (21 compared to 71; Table Table2).2). Others have also noted the increased genetic variability of the intergenic region relative to genes in geminiviruses, likely due to relaxed purifying selection since there are fewer critical functions coded in this region (30, 61).
Full genome sequences from the eight divergent TYLCD-causing virus species (Table (Table1)1) were estimated to have a mean TMRCA of approximately 10,000 years ago, with a very wide range of 95% HPD values of between 290 and 40,211 years ago (Table (Table2).2). A more recent mean TMRCA of 1,308 years was estimated for the coat protein data set, although this falls within the HPD values estimated for the genomic data set (Table (Table2).2). The TMRCAs estimated for the severe and mild phenotypes from the intergenic regions were even more recent, at approximately 30 and 20 years ago, respectively (Table (Table2).2). Despite this wide range of date estimates, our analyses reveal that the TYLCD-causing viruses diversified from a common ancestor long before TYLCD became a significant agricultural problem, since the symptoms associated with this virus were not noted until 1939 (65). On the basis of the intergenic region, we infer that the severe phenotype TYLCV sensu stricto isolates shared a common ancestor within the last 44 years (upper HPD value). This suggests that the diversification of TYLCV sensu stricto occurred after the first major outbreak of TYLCD in the Jordan River valley in 1959 (65).
The mild TYLCV clade was well supported in the full genome phylogeny, and all but two mild isolates formed a well-supported clade in the V1 phylogeny. The mild phenotype of TYLCV is due to incomplete base pairing in a stem-loop motif in the intergenic region (57), and recombination could have led to “mild” isolates with different V1 genes (as is likely the case for TYLCV-Mld[RE] and TYLCV-Mld[JO]; Fig. Fig.2).2). Our data suggest that the intergenic region of the mild phenotype TYLCV evolved once, and the TMRCA of the isolates included in the present study is within the last 35 years (Table (Table2).2). Since the first mild phenotype TYLCV was isolated from a mixed infection maintained in the laboratory from an initial isolation in the late 1960s (2), it could be that the mild genotype existed in mixed infection with severe TYLCV for some time before the virus spread worldwide.
We were able to elucidate the evolutionary history and dynamics of the TYLCD-causing viruses by using dated genomic sequences. Most notably, we observed that these geminiviruses evolve as rapidly as some RNA viruses. Also of note was that the mild phenotype of TYLCV appears to have evolved only once from severe TYLCV (13, 21, 36, 43, 52, 57, 88) and within the last 35 years has spread to places as distant as Japan (88), Reunion Island (15), and Venezuela (94).
While there have been few estimates of substitution rates in plant viruses, there is some evidence that plant RNA viruses are more genetically stable than animal RNA viruses, leading to lower rates of evolution (24, 27, 28) attributable to a lack of immune selection on plant viruses (27). Indeed, there are prior suggestions that geminiviruses evolve faster than some plant RNA viruses (60), including tobamoviruses (40, 42, 72) and criniviruses (48). We present here the first evidence that this might be the case: the coat protein gene evolves at least as rapidly as plant RNA viruses (40, 85), as quickly as animal ssDNA viruses (79, 80, 89), and as fast as some animal RNA viruses (38). Most notably, the substitution rates inferred here are many orders of magnitude greater than those observed in double-stranded DNA viruses (5, 50). It is also likely that the high substitution rate of the begomovirus TYLCV is typical of geminiviruses as a whole. High levels of genetic variation have been observed in the curtovirus Beet curly top virus (84), and a high mutation frequency has been observed in the mastrevirus Maize streak virus (37).
Hence, we propose that there is a specific feature of geminivirus genomes, and perhaps of ssDNA viruses as a whole, that leads to these high, RNA virus-like rates of evolution. Although positive selection may greatly elevate substitution rates compared to background (neutral) mutation rates, the very high evolutionary rates observed in the intergenic regions and the low dN/dS ratios in the coat protein gene strongly suggest that the high substitution rates observed in TYLCV are a reflection of rapid mutational dynamics rather than frequent adaptive evolution.
It is currently unclear what mechanisms lead to the elevated substitution rates in ssDNA viruses (79). Unlike RNA viruses that use their own error-prone polymerases, geminiviruses use the cellular machinery of their host plants, implying that they should have the same replication fidelity as their host. However, it is possible that geminivirus genomes cannot be repaired by host exonucleases because they do not possess the proper methylation patterns, which will increase mutation rates during replication of viral DNA (74). Alternatively, even if geminivirus genomes are correctly methylated, base-excision repair might not proceed on replicating geminivirus genomes because they are only transiently double stranded during rolling-circle replication. It is also possible that geminivirus genomes recruit a more error-prone polymerase (such as one lacking a base-excision repair mechanism altogether) from the host's nucleus for its own replication (29).
The biased pattern of mutation, favoring C→T and G→A transitions, suggests another potential mechanism for rapid mutation. Both of the most significant transition biases commonly occur due to deamination of the bases: from cytosine to uracil and from guanine to xanthine (which base pairs to thymine ). Base deamination can occur spontaneously and is more likely to occur when the nt is not based paired, as for instance when DNA spends a prolonged period of time in a single-stranded state (9, 23, 91). Biases consistent with deaminated nt are often observed in viral genomes, such as human immunodeficiency virus, because animals have specific enzymes to deaminate ssDNA and ssRNA, such as the APOBEC family (90). Deaminating enzymes have been identified in plant nuclei (82) that could be responsible for the biased substitutions observed in TYLCD-causing viruses, in a process unrelated to mutations due to DNA polymerase error.
Consequently, deamination could represent a significant component of mutation in TYLCD-causing viruses, particularly since a recent mutation frequency study in TYLCCNV revealed the same transition biases as our genomic analysis (30). Deamination of bases, whether by enzymatic or spontaneous reactions, is a mutational mechanism that occurs before the template replicates; it is therefore separate from the polymerase-induced mutation that is measured by mutation rate assays. This additional source of mutation likely experienced by ssDNA viruses may help explain how geminiviruses generate variation while using fidelitous plant DNA polymerases. However, without direct experimental tests, we cannot be certain of the role that the significant base deamination found in TYLCD-causing virus genomes plays in shaping mutation rates.
RNA viruses mutate quickly (18) and usually have high substitution rates (38). While these traits are not necessarily adaptive, they are thought to facilitate viral emergence on novel hosts (55, 83). The frequent emergence of geminiviruses on novel hosts has previously been explained by the fact that they are more vector specific than host specific (78) and by the creation of genetic variation by recombination (62). We demonstrate here that the geminivirus TYLCV also has a high, RNA-like rate of nt substitution, independent of its frequent recombination, and this could be an important factor in its emergence on novel crops.
This study was supported by National Science Foundation grant DBI-0630707 to S.D.
We thank Laura Shackelton for technical assistance. We thank the geminivirus research community for supplying dates of isolation for many TYLCD-causing virus sequences and an anonymous reviewer for the suggestion that host error correction mechanisms might not function during rolling-circle replication.
Published ahead of print on 31 October 2007.
†Supplemental material for this article may be found at http://jvi.asm.org/.