|Home | About | Journals | Submit | Contact Us | Français|
Despite their importance as agents of emerging disease, the time scale and evolutionary processes that shape the appearance of new viral species are largely unknown. To address these issues, we analyzed intra- and interspecific evolutionary processes in the Luteoviridae family of plant RNA viruses. Using the coat protein gene of 12 members of the family, we determined their phylogenetic relationships, rates of nucleotide substitution, times to common ancestry, and patterns of speciation. An associated multigene analysis enabled us to infer the nature of selection pressures and the genomic distribution of recombination events. Although rates of evolutionary change and selection pressures varied among genes and species and were lower in some overlapping gene regions, all fell within the range of those seen in animal RNA viruses. Recombination breakpoints were commonly observed at gene boundaries but less so within genes. Our molecular clock analysis suggested that the origin of the currently circulating Luteoviridae species occurred within the last 4 millennia, with intraspecific genetic diversity arising within the last few hundred years. Speciation within the Luteoviridae may therefore be associated with the expansion of agricultural systems. Finally, our phylogenetic analysis suggested that viral speciation events tended to occur within the same plant host species and country of origin, as expected if speciation is largely sympatric, rather than allopatric, in nature.
Although RNA viruses are the most common agents of emerging disease, key aspects of their evolution are still only partly understood. This is of both academic and practical importance, as virus evolution may compromise disease control strategies, including the rapid generation of genotypes that are able to evade host immune responses or of those that are resistant to antivirals or crop genetic resistance (20, 34, 47).
Most of our knowledge of the rapidity of RNA virus evolution comes from the study of animal viruses, for which estimates of rates of nucleotide substitution normally fall within 1 order of magnitude of 1 × 10−3 nucleotide substitutions per site per year (subs/site/year) and largely reflect the background mutation rate (10, 13, 29, 37, 53). Equivalent studies on plant RNA viruses have reported more heterogeneous rates. Early studies suggested that some plant RNA viruses evolved more slowly than RNA viruses that infect animals. For example, estimates of the nucleotide substitution rate in the range of ~1 × 10−6 to 1 × 10−8 subs/site/year have been obtained for Turnip yellow mosaic virus (4, 23) and some tobamoviruses (21, 25). In contrast, more recent estimates using Bayesian coalescent methods applied to sequences with known dates of sampling and allowing for rate variation among lineages have reported substitution rates in the same range as those of animal RNA viruses (22, 63) and therefore suggest relatively high rates of mutation, as expected, given the intrinsically error-prone nature of RNA replication (15, 65). As well as the differences in how these rates are estimated, a reasonable biological explanation for such a diversity of rate estimates is that they are increased in the short term due to the presence of mutational polymorphisms but lower in the long term because any such deleterious mutations would have then been removed by purifying selection (17, 22). In particular, severe population bottlenecks at transmission would allow deleterious mutations to rise to a high frequency due to strong genetic drift. Such effects make it dangerous to extrapolate long-term rates of evolutionary change from the analysis of intraspecific sequence data (31). Differences in the strength of adaptive evolution could also cause rate heterogeneity, including such processes as competition for susceptible individuals and the colonization of new host species (22, 65).
Although there is a growing body of data on intraspecific evolutionary processes in plant RNA viruses, including rates of nucleotide substitution, there has been a general neglect of long-term evolutionary patterns, including the determinants of viral speciation. Exceptions are recent analyses of the Potyviridae and the Sobemovirus, which associated viral speciation with the development of agriculture (17, 24). Although RNA viruses reproduce asexually, it is informative to consider as analogies the two major forms of speciation used in studies of sexually reproducing eukaryotes: allopatric speciation, in which reproductive isolation follows geographic separation, and sympatric speciation, in which reproductive isolation occurs within an interbreeding population (67). In the context of RNA viruses, allopatric speciation can be thought of as the genetic diversification that occurs when viruses jump to new host species and thereafter evolve independently, as is commonly associated with the process of viral “emergence.” In contrast, sympatric speciation would occur when viruses diversify within a single host species, perhaps by exploiting different cell types (34). Despite the importance of these processes for our understanding of the macroevolution of RNA viruses, their respective roles are currently unknown.
To better understand the nature of long-term evolutionary processes in plant RNA viruses, we undertook an extensive molecular evolutionary analysis of the family Luteoviridae, a heterogeneous family of plant viruses divided into three genera, Luteovirus, Polerovirus, and Enamovirus, containing five, nine, and one classified species, respectively, as well as a number of unclassified species (18). The Luteoviridae possess positive-sense single-stranded RNA genomes of 5,600 to 6,000 nucleotides (nt). These genomes can harbor five or six open reading frames (ORFs). 5′-proximal partially overlapping ORF1 and -2 encode proteins P1 and P2, which are involved in virus replication. Low-frequency −1 ribosomal frameshifting in the overlapping region results in the P1-P2 fusion RNA-dependent RNA polymerase protein (RdRp). ORF3 encodes the coat protein (CP) and completely contains ORF4, which is not found in Enamovirus and is needed for virus movement in the plant (the movement protein, MP). ORF5, which is necessary for aphid transmission (6, 27, 49) and is also involved in virus movement (57) and Luteoviridae phloem limitation (58), is translated through in-frame read-through of the ORF3 stop codon, existing as a read-through domain (RTD) fused to the CP. Members of the genus Polerovirus have an extra ORF0 in the 5′ end of the genome partially overlapping ORF1. Its translation product (P0) acts as a repressor of the RNA-silencing plant defense response (44, 59). Finally, some Luteovirus species have an additional ORF6 with an unknown function in the 3′ end of the genome (18, 48, 70). As a consequence of this particular genomic organization, approximately one-third of the Polerovirus genome, and a smaller fraction in the Luteovirus genome, is composed of overlapping regions.
Due to their agronomic importance, gene sequence data, together with information on host range and geographical distribution, are available for a relatively large number of members of the family Luteoviridae. However, to date the only luteovirus for which rates of evolutionary change have been estimated is Barley yellow dwarf virus (BYDV). In this case, an analysis of substitution rates based on viral RNA extracted from herbarium specimens produced estimates of between 6.2 × 10−4 and 9.7 × 10−5 subs/site/year (43). Similarly, only one estimate of the point at which genetic diversity arose in the family Luteoviridae has been obtained, i.e., approximately 9,000 years ago, and therefore it is perhaps associated with the rise of agriculture (17). However, only a limited number of Luteoviridae species and sequences were included in this analysis. No studies have yet considered the mechanisms of speciation in the family Luteoviridae.
The family Luteoviridae also represents a useful data set to study two other evolutionary phenomena: the pattern and determinants of recombination, which appears to be commonplace within the family Luteoviridae (26, 49, 51, 70, 71), and the differing evolutionary dynamics in genes with overlapping reading frames. There are contrasting hypotheses as to why overlapping reading frames are so commonly used in RNA viruses. According to one view, gene overlapping maximizes the genetic information in smaller genomes (1, 39). Alternatively, it has been suggested that gene overlap generates mutational robustness (i.e., the ability to preserve phenotypes despite the genomic mutational load) at the population level (2, 16, 42). Under the latter hypothesis, gene overlapping generates hypersensitivity to deleterious mutations, as these affect more than one gene. Although this hypersensitivity reduces the capacity of each individual to buffer mutation effects, it represents a selective advantage for wild-type genotypes, which then bolsters robustness at the population level (16, 42). As a consequence of this elevated burden of deleterious mutation, RNA viruses with larger proportions of their genomes present as overlapping reading frames are expected to exhibit lower rates of nucleotide substitution (41, 50). Such a rate reduction has been observed in many animal DNA and RNA viruses (for example, see references 35, 36, 55, 77, and 78), although only a few studies have considered plant RNA viruses in this context (28, 61).
Available sequences representing the 15 members of the family Luteoviridae were compiled from GenBank. As temporal information is central to the analyses undertaken here, the year of isolation of each sequence was obtained either from GenBank or from the associated publications or was kindly provided by the relevant authors. Sequences from extensively passaged isolates in nonnatural hosts were excluded from the analyses. For a full list of the accession numbers, origins, and years of isolation of the sequence used, see Tables S1 to S9 in the supplemental material).
To determine long-term evolutionary patterns in the Luteoviridae, we focused our analysis on the CP gene, as this is the gene with the largest data set available. Species with more than 20 available CP sequences were retained for this analysis, which meant that we were able to utilize 11 out of the 15 classified taxa in this analysis, i.e., the three BYDV species (BYDV-MAV, BYDV-PAS, and BYDV-PAV), Soybean dwarf virus (SbDV), Beet chlorosis virus (BChV), Beet mild yellowing virus (BMYV), Cucurbit aphid-borne yellows virus (CABYV), Cereal yellow dwarf virus-RPV (CYDV-RPV [from here on CYDV]), Potato leafroll virus (PLRV), Sugarcane yellow leaf virus (ScYLV), and Turnip yellows virus (TuYV). Among the unclassified members of the family Luteoviridae, sufficient sequence data were available only for BYDV-GAV. No members of the genus Enamovirus were available for study. Because of the very large size of our complete luteovirus data set, which greatly inhibits computational tractability, 20 CP sequences were randomly chosen for each viral species. This resulted in a total of 240 CP sequences, with dates of isolation ranging from 1921 to 2008. Sequences were aligned according to the amino acid sequence, and three sets of data were generated: (i) CP, comprising the complete CP nucleotide sequence (636 nt); (ii) CP3′, including the 421 nt of the 3′ end of the CP, as the first 215 nt is the most divergent CP region among the members of the family Luteoviridae; and (iii) CPov, which includes 363 nt from the center of the CP sequence (nt 216 to 579) and allows us to consider a region that completely overlaps ORF4.
Phylogenetic analyses of these 12 Luteoviridae species were carried out individually, except for the four BYDV species, which were analyzed together. BYDV is currently classified into four species (BYDV-GAV, BYDV-MAV, BYDV-PAS, and BYDV-PAV) (18, 38). However, as these taxa are very similar, we clustered them in one group, as proposed previously (49), from here on referred to as BYDV. Protein-coding alignments were derived for the RdRp (ORF1+ORF2), CP (ORF3), and RTD (ORF5) genes of each species and the P0 (ORF0) gene of the Polerovirus species. In addition, the two ORFs that make up the RdRp were analyzed separately. As similar evolutionary rates were obtained for ORF1, ORF2, and ORF1+ORF2, only estimates for ORF1+ORF2 analyses are presented here. Unfortunately, the number of full-length genome sequences was insufficient for most of the species (i.e., <15 sequences), so that a reliable analysis could not be undertaken. For each data set, sequence alignments were obtained using MUSCLE 3.7 (14) and adjusted manually according to amino acid sequences using Se-Al (http://evolve.zoo.ox.ac.uk/).
For each species, sequence alignments were also obtained for the P0/RdRp and CP/MP overlapping regions and for the corresponding nonoverlapping fragments. Sequence alignments of the overlapping regions were adjusted according to the amino acid sequence of each of the two genes involved, thus generating two data sets for each overlapping region. Since no differences in nucleotide substitution rate were observed between these two data sets in any overlapping fragment, only those with the least statistical error are presented here.
For each data set, rates of nucleotide substitution per site and the time to the most recent common ancestor (TMRCA) were estimated using the Bayesian Markov Cain Monte Carlo (MCMC) method available in the BEAST package (11). The best-fit model of nucleotide substitution in each case was determined using Modeltest 3.7 (60), and all data sets were subsequently run using the general time-reversible substitution model with invariant sites and a gamma distribution of among-site rate variation (GTR+I+Γ4), with three partitions into individual codon positions. These sequence data were analyzed using a relaxed (uncorrelated, log-normal) molecular clock (values for the coefficient of variation were always >0, indicative of non-clock-like evolution; see reference 12) and a Bayesian skyline model as a coalescent prior, as estimating demographic parameters was not the aim of this study. To gauge the robustness of these estimates, we repeated the analysis first by using a strict molecular clock and the simpler Hasegawa, Kishino, and Yano (HKY85) model of nucleotide substitution (30) and second by excluding putative intraspecific recombinant sequences (see below for a description of our recombination detection methodology). Estimation of amino acid substitution rates was performed using the Whelan and Goldman substitution model (76) utilizing the same clock and demographic parameters as described above. In all cases, the BEAST analyses were run until all relevant parameters converged, with 10% of the MCMC chains discarded as burn-in. Statistical confidence is represented by values for the 95% highest probability density (HPD).
Our analyses of substitution rates and TMRCAs for the family Luteoviridae as a whole using either nucleotide or amino acid sequences of the CP failed to produce stable estimates (results available on request). This is likely a function of rate variation among viral species (see Results) and the short time scale of sampling relative to the total depth of the tree. We therefore estimated these parameters for the family Luteoviridae as a whole by using an empirical prior distribution on the substitution rate that is based on the lowest mean value (6 × 10−4 subs/site/year) of the substitution rates estimated for each virus (Table (Table1).1). The lowest substitution rate was chosen to be as conservative as possible. We also assessed the possible effect of excessive multiple substitutions at single nucleotide sites by analyzing (i) the first and second codon positions and (ii) the second codon positions independently. Finally, maximum clade credibility trees, with Bayesian posterior probability values providing a measure of statistical support at each node, were also inferred using BEAST.
To test the strength of the temporal signal in these data, essential to the accurate estimation of substitution rates, the BEAST analyses described above were repeated on data sets in which sampling times were randomized such that they lack any temporal structure. For computational tractability, this randomization was undertaken using the gene with the smallest number of taxa for each viral species. Runs for randomized data were repeated 10 times. The mean and 95% HPDs of the substitution rate estimates for the randomized data were then compared with those obtained from the real data; major differences in these estimates indicate the presence of temporal structure.
We determined the occurrence of recombination within and between the RdRp and CP genes for each of the Luteoviridae species. In each case, sequences of different genes belonging to the same isolate were concatenated and recombination breakpoints were detected by using three different methods available in the RDP3 package (http://darwin.uvigo.es/rdp/rdp.html), i.e., RDP, GENECONV, and Bootscan, and employing the default parameters (45). To be as conservative as possible, only recombination signals detected by all of the methods were considered (P < 0.05). The lack of sufficient isolates with both the P0 and RTD genes prevented us from including these two genes in the analyses.
Selection pressures for each gene in each luteovirus species were measured as the mean number of nonsynonymous (dN)-to-synonymous (dS) nucleotide substitutions per site (dN/dS ratio) using the single-likelihood ancestor counting method implemented in the HYPHY package (40). In all cases, dN/dS ratio estimates were based on neighbor-joining trees inferred under the GTR substitution model, with 95% confidence intervals (CI) calculated assuming a χ2 distribution.
To determine the respective roles of allopatric versus sympatric speciation, we assessed the strength of clustering by host species and country of sample origin within the interspecific Luteoviridae phylogeny. If allopatric speciation following host jumping were the dominant process, we would expect no significant association between phylogeny and host species, and perhaps between phylogeny and geography. In contrast, if sympatric speciation were the most important macroevolutionary process in these data, we would expect a significant association between phylogeny and both the host species and the country of sampling.
To undertake this analysis, we utilized the 240-sequence CPov Luteoviridae data set described above. However, to avoid biasing this analysis toward intraspecific evolutionary patterns, the data were subsampled such that only one sequence was chosen randomly for each species, unless a specific virus was located from multiple hosts or multiple countries, in which case one representative sequence was chosen for each host/country. This subsampling resulted in a data set of 89 isolates representing 45 different host species isolated from 32 different countries, on which we inferred BEAST trees as described above. We then employed the association index (AI) (75) and the parsimony score (PS) (19) to determine whether particular traits (host, country) are more strongly associated with the underlying phylogeny than expected by chance. These analyses were undertaken by using the BaTS method (54), which utilizes the posterior sample of trees produced by BEAST (with the first 10% removed as burn-in), thereby incorporating phylogenetic uncertainty. Null distributions for both statistics were generated by using 1,000 data replications.
We first estimated rates of nucleotide substitution within each luteovirus species (Table (Table1).1). Reliable estimates could not be obtained for the RTD in BChV, BWYV, CABYV, and TuYV and for the P0 in CABYV and CYDV due to insufficient data numbers. In addition, some rate estimates exhibited a very wide range of 95% HPD values, with lower values of ≤1 × 10−7, strongly suggesting that they are unreliable. Those data sets with very wide 95% HPD intervals under both models were excluded from further analysis. However, no major differences in evolutionary rates were observed when the data were reanalyzed using the HKY85 substitution model and a strict molecular clock. Similarly, the exclusion of putative intraspecific recombinants had no significant impact on rate estimates (results available on request).
Mean evolutionary rates across all of the genes in all of the species analyzed ranged over 2 orders of magnitude, from 3.5 × 10−2 to 1.4 × 10−4 subs/site/year (Table (Table1),1), with the highest rates recorded in CABYV and CYDV (at >1 × 10−2 subs/site/year). Substitution rates for the CP gene ranged from 3.5 × 10−2 to 6 × 10−4 subs/site/year. Importantly, mean evolutionary rates in amino acids covered the same range as those for nucleotides, from 1.1 × 10−2 to 4.1 × 10−4 subs/residue/year, indicating that our results are robust to the effect of multiple substitution (see Table S10 in the supplemental material). Examining rates across all of the genes revealed no significant variation in substitution rates between species for the P0 and RTD genes, although differences were found for the RdRp and CP genes. For example, in the RdRp, BYDV and CYDV showed significantly higher substitution rates (i.e., nonoverlapping HPD values) than BMYV, ScYLV, and TuYV.
In all cases, estimates of the mean substitution rate differed by at least 1 order of magnitude between the real and randomized data sets (Fig. (Fig.1).1). More importantly, the 95% HPD values for all of the randomized controls excluded the mean substitution rates estimated for the real data, indicating that they are significantly different. In addition, the lower 95% HPD values in the randomized data sets differed by at least 3 orders of magnitude from the corresponding mean estimates, and all were equal to or lower than 1 × 10−7 and hence strongly indicative of insufficient temporal structure. Far narrower 95% HPD results were observed for the real data. Hence, the sequence data analyzed here contain sufficient temporal structure for reliable estimation.
Nucleotide substitution rates were estimated for the P0/RdRp and CP/MP overlapping regions, and their corresponding nonoverlapping fragments, for those species for which sufficient sequence data were available (Table (Table2).2). All PLRV rate estimates, as well as that for the CP/MP overlapping region of BMYV, exhibited a very wide range of 95% HPD values, indicating that there is insufficient temporal signal in these data for meaningful analysis.
For the P0/RdRp region, nucleotide substitution rates did not vary between the overlapping and nonoverlapping fragments within each virus species, with the exception of the P0 nonoverlapping fragment of ScYLV, which showed a higher rate than the corresponding overlapping fragment. In the case of CP/MP, the nonoverlapping region tended to exhibit higher evolutionary rates than the overlapping region in all of the Luteoviridae analyzed, this difference being significant in four of six species (Table (Table2).2). Thus, the theory that overlapping regions reduce the rate of evolutionary change holds for some, but not all, gene-species combinations in the Luteoviridae.
Substitution rate estimates for the Luteoviridae as a whole—that is, at the interspecific level—using the CP gene resulted in unreliable estimates (i.e., a very wide range of 95% HPD values), irrespective of the substitution model or whether nucleotides or amino acids were utilized. We therefore estimated substitution rates using an empirical prior distribution on the mean substitution rate that conservatively reflects the lowest mean estimate of the substitution rate at the intraspecific level (i.e., 6 × 10−4 subs/site/year) and which effectively stabilizes rate estimates. The mean rates of nucleotide substitution obtained from these analyses were ~4 × 10−4 subs/site/year (lowest HPD value = 2.6 × 10−4 subs/site/year) and hence lower than those seen in the majority of individual species (Table (Table3),3), perhaps reflecting the inclusion of transient deleterious mutations in the intraspecies comparisons (see Discussion). To remove any effect of site saturation at third codon positions, we repeated our analyses of the luteoviruses as a whole by using only the first and/or second codon positions of the CP. This resulted in estimates similar to those obtained by using all of the codon positions, again suggesting that our overall rate is robust (Table (Table33).
Our dN/dS ratio analysis revealed that although genes evolve under different pressures within and between species, all luteovirus genes are subject to purifying selection (dN/dS ratio of <1.0), with the exception of the MP gene (Table (Table4).4). Mean dN/dS ratios were also systematically lower for the RdRp and RTD genes (0.19 to 0.22) than for the other genes (0.31 to 1.18). In the case of the MP gene, the dN/dS ratios indicated purifying selection only for BYDV and CYDV (mean, ≤0.69; CI = 0.56 to 0.84), while the mean values for the remaining species were close to or greater than 1 (mean = 0.82 to 2.61; CI = 0.58 to 3.53), indicating that this gene is either evolving strictly neutrally or exhibits some localized positive selection.
We also investigated the nature of selection pressures in regions of gene overlap. Interestingly, we observed no significant difference in the dN/dS ratio between overlapping and nonoverlapping fragments of the CP gene (which overlaps the MP gene) in five of the nine species analyzed. In the remaining species, the dN/dS ratio was higher for the overlapping than for the nonoverlapping region (mean = 0.46 to 0.58 versus 0.04 to 0.18; CI = 0.35 to 0.70 versus 0.00 to 0.28, respectively), indicative of increased purifying selection in the latter (Table (Table4).4). Unfortunately, dN/dS ratios for overlapping and nonoverlapping fragments of the P0 and the RdRp genes could only be obtained for PLRV and ScYLV due to a lack of sequence data on the remaining species. In these two species, the dN/dS ratio was also higher in the overlapping than in the nonoverlapping regions of both genes (mean = 0.41 to 0.67 versus 0.03 to 0.13; CI = 0.35 to 0.95 versus 0.00 to 0.18; not shown in Table Table44).
The maximum clade credibility tree for the CPov data set reveals that the members of the family Luteoviridae can be divided into two clusters largely corresponding to the Luteovirus and Polerovirus genera (very similar trees and TMRCAs were observed for all CP data sets) (Fig. (Fig.2).2). The Luteovirus cluster contained the BYDV species in isolation. All of the remaining species fell into the Polerovirus group, although support for the position of ScYLV was very low (posterior probability = 0.15 to 0.16). As noted previously (18), SbDV, the other virus classified within the genus Luteovirus, in fact clusters with the Polerovirus group.
To determine the time scale of this evolutionary history, we estimated TMRCAs for each gene-virus species combination (Table (Table1).1). Most of these estimates indicated that the sampled genetic diversity arose no earlier than the second half of the 19th century. The only major differences in TMRCA estimates among genes were observed for BYDV and CYDV, where the RdRp gene diverged significantly later than the CP gene (mean estimates of 42 and 45 years for the RdRp gene versus 456 and 247 years for the CP gene), while the opposite pattern was observed for CABYV and ScYLV (TMRCAs of 233 and 347 versus 21 and 26 years, for the RdRp and CP genes, respectively). Importantly, no significant differences in TMRCA values were found in parallel analyses using the amino acid sequences (see Table S10 in the supplemental material). Hence, all estimates suggest that the sampled diversity within the Luteoviridae species arose during the last 500 years.
Our mean estimate for the TMRCA of the members of the family Luteoviridae sampled as a whole using the empirical prior distribution on the substitution rate was approximately 2,000 years (1,732 to 2,007 years), with upper 95% HPD values, representing the oldest credible age, of 3,088 to 3,583 years (Table (Table4).4). Similarly, mean estimates of TMRCAs for the Luteovirus and Polerovirus genera were 1,010 and 897 years, respectively (95% HPD = 403 to 1,606 and 300 to 1,051 years; Fig. Fig.2).2). Finally, mean TMRCAs estimated for the different species within each genus were also very similar, ranging from the end of the 19th century to the beginning of the 20th century, and comparable to the results obtained for each species individually. Interestingly, BYDV retains the earliest genetic diversity, with coalescent events between 1 and 7 centuries before the other species (mean = 1,010 years ago; 95% HPD = 403 to 1,606 years ago). However, differentiation into the various BYDV “species” did not occur before the 18th century.
The TMRCA differences between genes noted above were also reflected in the topological incongruence of the RdRp and CP gene trees. Specifically, phylogenetic analyses of BYDV, CABYV, CYDV, and ScYLV revealed that the clustering of specific viral isolates differed between the RdRp and CP genes. In particular, the CP gene phylogeny divided isolates of CYDV and ScYLV into three clusters, as previously described (62, 74), while analyses of the RdRp gene resulted in only two clusters (see Fig. S1 and S2 in the supplemental material). Overall, these results are suggestive of recombination between the RdRp and CP genes in a number of Luteoviridae species.
To better characterize the frequency and distribution of recombination events, we analyzed the concatenated RdRp and CP gene sequences of all of the species in the family by using three different recombination detection methods. This analysis revealed clusters of recombination breakpoints—putative recombination “hot spots”—in BYDV, CABYV, CYDV, ScYLV, and TuYV, although not in the remaining species (Fig. (Fig.3).3). Notably, the region comprising the end of the RdRp gene and the beginning of the CP gene and the overlapping region between RdRp ORF1 and -2 exhibited a concentration of breakpoints in these five species. Recombination signals were also detected in the 5′ end of the RdRp gene and the 3′ end of the CP gene in BYDV, ScYLV, and TuYV, which suggests that recombination may also occur in the boundaries between the P0 and RdRp genes and between the CP and RTD genes. Although most putative recombination events were observed at gene boundaries, we also observed recombinant breakpoints within the RdRp gene of BYDV and TuYV and in the CP gene of CYDV (Fig. (Fig.33).
Both the AI and PS statistics revealed a significant association between the interspecies phylogeny of the Luteoviridae and the plant host each virus species naturally infects (P < 0.001); hence, virus species that infect the same hosts tend to cluster together on the tree. Similarly, a statistically significant association was observed between the luteovirus phylogeny and the country of sampling (P < 0.001 in both tests), such that virus species that infect plant hosts in the same geographic region tend to cluster together. Together, these data suggest that viral speciation tends to occur within the same host species and in the same geographic area, which is compatible with genetic diversification through sympatric speciation.
Although there was some gene- and species-specific variation, our large-scale comparative analysis revealed that members of the family Luteoviridae generally experience rates of nucleotide substitution ranging from 10−3 to 10−4 subs/site/year, comparable to those observed in animal RNA viruses (13, 37) and some other plant RNA viruses (22, 63). Most cases of gene-specific rate variation seem to reflect differences in selection pressure across the luteovirus genome, particularly as equivalent differences were found in the dN/dS ratio. A good example is provided by the RdRp and RTD genes, in which purifying selection was stronger (i.e., the dN/dS ratio was lower) for the Luteovirus species than for the Polerovirus species. In the case of the RdRp gene, this variation may reflect differences in the gene arrangement of the 5′ half of the genome. Specifically, Polerovirus species have an extra ORF0 encoding the P0 protein in the 5′-terminal region of the genome. Since the P0 gene is located in the −1 reading frame of the RdRp gene, synonymous nucleotide changes in the third codon position of the P0 gene result in amino acid changes in the overlapping region of the RdRp, increasing its overall dN/dS ratio. The analysis of the dN/dS ratio between the overlapping and nonoverlapping fragments of the RdRp supports this hypothesis. In the case of the RTD, which together with the CP is involved in specificity of vector-mediated transmission (6, 27, 49), as well as in long-distance movement through the phloem in a host-specific manner (57, 58), differences in the dN/dS ratio between the Luteovirus and Polerovirus genera may reflect differences in the vector species responsible for virus transmission and/or in the host range. Finally, the CP gene and the N-terminal region of RTD are highly conserved in BYDV and CYDV, which is suggestive of strong functional constraints (7, 27, 48). Thus, both the CP and RTD genes are expected to evolve at lower rates than other genes, which is in agreement with our observations.
The very high substitution rates observed for the CP gene of CABYV (at >1 × 10−2 subs/site/year) merit special attention, as they represent some of the highest rates observed to date in a plant RNA virus. However, such an elevated rate is likely to be a function of the fact that these sequences were only sampled over a very short time period (5 years). Sequences sampled over a short time scale tend to produce artificially inflated rate estimates, reflecting short-term mutation rates that include the circulation of transient deleterious mutations (i.e., polymorphisms), rather than more meaningful long-term rates of nucleotide substitution that measure evolutionary dynamics following the action of purifying selection (13). Indeed, that our substitution rate estimates are higher within (mean = 6 × 10−3 subs/site/year) than among (mean = 4.4 × 10−4 subs/site/year) luteovirus species suggests that transient deleterious mutations are a common component of intraspecific comparisons. It is clear that the time scale of sampling has a major impact on the reliability of substitution rate estimates.
Although overlapping reading frames are predicted to result in a reduced rate of nucleotide substitution (41, 50), this effect was only apparent in some virus-gene combinations. In particular, although substitution rates were indeed generally higher in the nonoverlapping region of the CP gene than in the portion that overlaps the MP gene, no such rate differences were observed in the overlapping and nonoverlapping regions of the P0/RdRp genes. Again, this difference in evolutionary dynamics may reflect the specific structural-functional constraints of the genes in question. For example, the P0 gene contains sites important for RNA-silencing suppression and for viral replication (44, 56, 59), and mutations in these regions are expected to drastically affect virus fitness, which in turn constrain genetic diversity and hence impact on dN/dS ratio estimates (33, 68). Similarly, the putative catalytic sites for serine proteinase activity, the VPg domain, and the frameshift-inducing pseudoknot of poleroviruses are all located in the nonoverlapping region of the RdRp gene (8, 64, 73).
Luteoviruses are thought to be characterized by relatively frequent intra- and interspecific recombination (3, 26, 49, 51, 71). Indeed, we observed clusters of recombination breakpoints at the boundaries of the RdRp and CP genes of BYDV, CABYV, CYDV, ScYLV, and TuYV, which strongly suggest that these two genes have experienced different evolutionary histories in these species. This is supported by a marked incongruence between the RdRp and CP gene phylogenetic trees (see Fig. S1 and S2 in the supplemental material). Strikingly, between 66 and 100% of the recombination breakpoints detected here were located at gene boundaries, similar to results reported previously (51, 52, 72).
Such widespread intergenic recombination can also be considered a form of “modular evolution.” Modular evolution has been proposed as a major mechanism in generating genetic variability in animal and plant RNA viruses (5, 9). It has also been proposed that speciation in the family Luteoviridae may have resulted from gene module exchange with members of the Sobemovirus and Tombusvirus genera and between Luteovirus and Polerovirus species (3, 26, 49). However, despite the evident frequency of intergenic recombination, it is unclear whether the clustering of breakpoints at gene boundaries occurs because certain RdRp-CP-RTD combinations confer a selective advantage on particular host/vector combinations (46, 72) or because the recombination breakpoints that occur within genes are strongly injurious such that they are rapidly purged by purifying selection.
Our time-structured phylogenetic analysis revealed three periods of evolutionary diversification in the family Luteoviridae (Fig. (Fig.2).2). First, on the basis of these data, the origin of the sampled Luteoviridae may have occurred within the last 4,000 years. Such a time scale of evolutionary change is rather shorter than that proposed for the Potyviridae at ~6,500 years or even than that previously proposed for the Luteoviridae using Rice yellow mottle virus (a Sobemovirus) in isolation, both of which were linked to the rise of agriculture (17, 24). However, the probability distributions of these and our time estimates overlap, and difficulties in dating deep events in RNA virus evolution mean that all such conclusions should be drawn with caution (32). Indeed, it is noteworthy that we were unable to obtain reliable estimates of the substitution rate of the family Luteoviridae as a whole without using an empirical prior distribution and that our substitution rate estimates were higher than those for the family Potyviridae (~1 × 10−4 subs/site/year) (24). Employing such lower rates would result in an evolutionary time scale for the luteoviruses that is highly comparable to that of the family Potyviridae and which similarly implies a role for early agriculture in viral speciation. It is also important to recall that the TMRCA estimates given here are for the sampled Luteoviridae species and so do not rule out the existence of extinct species of far greater antiquity, which might explain this difference from previous analyses. Finally, although recombination is known to influence estimates of divergence time (66), we observed no such adverse effects in our study.
The second key evolutionary period relates to the origin of the Luteovirus and Polerovirus genera, which we estimate took place no earlier than 1,500 years ago (although noting the same caveats as above). It is intriguing that both genera seemingly emerged at approximately the same time, although the ecological reasons (if any) underlying this are unclear. Finally, we observed that all individual luteovirus species appeared within the last 500 years. It is striking that the dates estimated for the majority of these speciation events fall into the same range as those of the Potyviridae and of other families of plant RNA viruses (17, 24, 43, 69). Given this overall similarity in evolutionary patterns, we hypothesize that the intensification of agriculture in the modern era, which resulted in an increase in cultivated plant populations, as well as the establishment of global communication networks, had a major effect on the extent and structure of genetic variation in multiple plant RNA viruses.
Finally, we have, for the first time, been able to shed light on the patterns and processes of speciation within the family Luteoviridae. In particular, our observation of a statistically significant association between phylogeny and both host specificity and geographical origin suggests that many species of luteovirus may have arisen within the same host species and in a restricted geographic area, as is expected under a process of “sympatric” speciation. Such a predominance of sympatric speciation may be expected, given that both plants and the aphid vectors that transmit plant viruses clearly have a limited ability to move large geographic distances and that the anthropogenic factors that would assist allopatric speciation (such as increased transportation of crop plants) have generally occurred too recently to greatly influence speciation processes. Indeed, most luteovirus species are restricted to one plant family (27) and it has been previously suggested that specialization to increase within-host multiplication plays a major role in virus evolution (46). However, it is important to note that the presence of as-yet-unsampled Luteoviridae infecting different plant species may be biasing this analysis against the detection of allopatric processes. Similarly, it is clear that allopatric speciation can occur, particularly in the context of those plant species that are only infected by a single luteovirus, such as sugarcane infected by ScYLV (see reference 52). Despite these caveats, we believe that equivalent phylogenetic studies of speciation processes in other families of RNA viruses may prove to be equally informative.
This work was supported by Marie Curie fellowship PIOF-GA-2009-236470 to I.P. and NIH grant R01 GM080533 awarded to E.C.H.
We thank three reviewers for useful suggestions.
Published ahead of print on 7 April 2010.
†Supplemental material for this article may be found at http://jvi.asm.org/.