|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: MDP DUB. Performed the experiments: JCV. Analyzed the data: MDP JCV. Contributed reagents/materials/analysis tools: DUB. Wrote the paper: MDP JCV DUB.
Potato virus Y (PVY) is a major agricultural disease that reduces crop yields worldwide. Different strains of PVY are associated with differing degrees of pathogenicity, of which the most common and economically important are known to be recombinant. We need to know the evolutionary origins of pathogens to prevent further escalations of diseases, but putatively reticulate genealogies are challenging to reconstruct with standard phylogenetic approaches. Currently available phylogenetic hypotheses for PVY are either limited to non-recombinant strains, represent only parts of the genome, and/or incorrectly assume a strictly bifurcating phylogenetic tree. Despite attempts to date potyviruses in general, no attempt has been made to date the origins of pathogenic PVY. We test whether diversification of the major strains of PVY and recombination between them occurred within the time frame of the domestication and modern cultivation of potatoes. In so doing, we demonstrate a novel extension of a phylogenetic approach for reconstructing reticulate evolutionary scenarios. We infer a well resolved phylogeny of 44 whole genome sequences of PVY viruses, representative of all known strains, using recombination detection and phylogenetic inference techniques. Using Bayesian molecular dating we show that the parental strains of PVY diverged around the time potatoes were first introduced to Europe, that recombination between them only occurred in the last century, and that the multiple recombination events that led to highly pathogenic PVYNTN occurred within the last 50 years. Disease causing agents are often transported across the globe by humans, with disastrous effects for us, our livestock and crops. Our analytical approach is particularly pertinent for the often small recombinant genomes involved (e.g. HIV/influenza A). In the case of PVY, increased transport of diseased material is likely to blame for uniting the parents of recombinant pathogenic strains: this process needs to be minimised to prevent further such occurrences.
Potato virus Y (PVY) afflicts potato producers worldwide –, causing loss of yield ranging from 10% to complete crop failure. The extent of yield reduction depends on a range of factors (including the viral load, the time of infection, temperature during growth and tuber storage and the cultivar of potato that is infected , ), but the strain of PVY involved is particularly important: some are considerably more pathogenic than others , .
Whilst potatoes have been in cultivation outside the New World since the mid 16th Century, PVY was first discovered and has developed into a major crop disease only within the last 80 years. All PVY infections reduce yield, but under warmer growing conditions (such as in the potato growing regions of southern Europe and South Africa) the most detrimental strains can entirely compromise the economic viability of a crop by inducing Potato Tuber Necrotic Ringspot Disease (PTNRD). In this respect, the earliest known strains were relatively innocuous, with symptoms largely restricted to mosaic patterns or stipple streaks on leaves (PVYC ; PVYO ), and/or venal leaf necrosis and only rarely PTNRD (PVYN , ).
More recently, genetic recombinants between PVYO and PVYN that induce PTNRD much more frequently have been identified. Recombination is prevalent in viruses – and its impact on the virulence of disease may be considerable. Foremost amongst the recently identified strains are PVYNTN (N-tuber necrotic)  and PVYN-W (N-Wilga) , described in 1984 and 1991 respectively . Both PVYNTN and PVYN-W have spread rapidly, causing severe reductions in yields worldwide , . In order to both limit the impact of existing strains on their hosts and, if possible, avoid creating the conditions that drive further escalation of pathogenicity, we need to understand the circumstances under which pathogenic recombinant virus strains such as these evolve.
However, evolutionary scenarios involving recombination are challenging to reconstruct. Individual ‘gene trees’ (phylogenies of non-recombinant regions of genomes) deviate from one another and from the underlying ‘species tree’ (representing the historical sequence of speciation events) due to differing underlying processes that are notoriously difficult to discern. Besides the various potential sources of analytical error (such as incorrect assessment of homology; model misspecification etc.), these include biologically meaningful processes such as reticulation (recombination between the branches of the species tree; i.e. between different species) and coalescent stochasticity (resulting from recombination within those branches, i.e. between individuals of the same species). It is important to distinguish reticulation from coalescent stochasticity in order to correctly infer species trees under the current methods that assume exclusively the latter process (e.g. ). The problem is compounded in viruses because despite generally high evolutionary rates  that might favour the availability of the necessary informative sequence variation, their diminutive genomes (e.g. PVY, 9.7 kb in length; HIV, 9.8 kb ; influenza A, 16.6 kb and polio, 7.4 kb ) represent a limited total source of data. Currently available phylogenetic hypotheses for PVY are either restricted to non-recombinant strains ,  (i.e. excluding the most pertinent pathogenic ones), represent only parts of the PVY genome , , and/or incorrectly assume a strictly bifurcating phylogenetic tree , –. Despite attempts to date potyviruses in general , no attempt has been made to date the origins of pathogenic PVY.
In this study we reconstruct and date the phylogeny of PVY by means of a phylogenetic approach to analysing DNA sequence data in the presence of reticulation ,  that we extend to address multiple recombination events between whole genomes. Unlike existing approaches, ours neither assumes a bifurcating species tree nor assumes prior knowledge of processes underlying deviations between individual gene trees. We use the resulting robust, time calibrated phylogeny to place patterns of divergence and recombination in PVY in the historical context of human cultivation of potatoes. In particular, we test whether diversification of the major strains of PVY and recombination between them occurred within the time frame of potato domestication and/or modern cultivation.
We sampled PVY isolates from Africa, Asia, Europe, and both North and South America, covering all known recombinant strains for which whole genome sequences were available (Table S1). Fifteen new genome sequences were generated following direct amplification RT-PCR protocols described in –, and 29 further sequences , , , – were obtained from GenBank. Outgroups were Pepper Mottle Virus (PMV) and two isolates of Sunflower Chlorotic Mottle Virus (SCMV); the latter more closely related to PVY than those included in previous analyses (cf. ). Sequences were aligned using BioEdit 22.214.171.124 . Short regions of uncertain homology between outgroup and ingroup sequences were treated as insertions and excluded from analyses (Dataset S1).
Our analytical approach is illustrated in Fig. 1. We used multiple recombination detection methods as implemented in RDP3  and SimPlot  to identify breakpoints followed by testing those breakpoints using phylogenetic analyses under parsimony and ML (as below) of non-recombinant regions to confirm the changing phylogenetic signal observed when progressing from the 5′ to the 3′ end of the linear PVY genome. Using RDP3, five methods were applied: RDP , , GENECONV , , MaxChi , , BootScan ,  and SiScan . Sequences were treated as linear. The threshold P-Value was set at 0.05, using Bonferroni correction. Following the RDP3 manual this should give few false positives but will still allow detection of most recombination events. The SEQGEN parametric simulations and phylogenetic evidence options were selected. For method-specific settings we followed the RDP3 manual. The results of the subsequent phylogenetic analyses of putatively non-recombinant regions were assessed for topological conflict subject to bootstrap support (BS) ≥70% under both parsimony and ML. The process was then repeated with breakpoints that did correspond to such conflict tentatively assumed to be correct until all topological differences between the ‘gene trees’ could be explained by specific recombination events and vice versa. The phylogenetic analyses represent a conservative test of the (not necessarily unanimous) results of the recombination detection methods. It will tend to reject recombination both when it has been incorrectly inferred and where it is real but involves little sequence variation (generally corresponding to very short regions and/or very recent events). We regard the latter as effectively impossible to address using phylogenetic approaches and assume that it will be of low impact on the subsequent analyses.
A supermatrix was subsequently constructed in which recombinant sequences were split into multiple taxa using the ‘taxon duplication’ approach , ; equivalent to the ‘compatibility matrix’ output format implemented in RDP3  (Dataset S2). Following this approach, taxa that exhibit conflicting phylogenetic positions according to different gene regions are duplicated in the matrix and the (different) conflicting gene regions re-coded as missing data for each duplicate. The resulting supermatrix can be analyzed using standard phylogenetic techniques to produce a single ‘multi-labelled’ tree in which conflicting taxa – e.g. putative hybrids or recombinants – are represented more than once. This approach has previously been applied to phylogenetic analyses of conflicting gene trees in various groups of flowering plants –, including an extension in a coalescence framework ( under the assumption that reticulation could be discerned from coalescent stochasticity). To our knowledge, the approach has not previously been applied to multiple recombinants or whole genomes of viruses. In order to determine which non-contiguous genome regions should be combined as single taxa in the supermatrix we first identified homologous recombination patterns, on the basis of common breakpoints and/or (given the possibility for nested recombinants) monophyly in gene trees; and then identified shared phylogenetic signal across non-contiguous genome regions (i.e. those interrupted by recombinant regions) on the basis of gene tree topological congruence and the logical sequence of homologous recombination events (Fig. 2). Where the evidence for combining non-contiguous genome regions was equivocal, we excluded from the analyses the shorter regions from the taxa in question, recoding them as unknown in the matrix (Fig. 2).
Phylogenetic analyses were performed under parsimony using PAUP* 4.0b10  and under likelihood using RAxML . Under parsimony, the following heuristic search options were employed: 500 random addition sequences (RAS) with tree bisection and reconnection (TBR) branch swapping saving a maximum of 25 trees of minimal length in each replicate. Clade support was estimated using 10,000 replicates of non-parametric bootstrapping each comprising a single RAS and TBR, saving a single tree in each replicate. RAxML analyses were performed using the CIPRES Science Gateway (http://www.phylo.org/portal2/) , , assuming a gamma model of rate heterogeneity. The supermatrix was analysed as above both with and without monophyly constraints on eight clades descendent from specific recombination events. This topological constraint represents the strong phylogenetic evidence of genome-scale processes that is further demonstrated by the monophyly of the clades in the separate analyses of non-recombinant gene regions. It may be important in preventing arbitrary groupings of closely related recombinant (duplicated) taxa that in the supermatrix do not share overlapping sequences. Under the tentative assumption of a reticulation scenario, phylogenetic networks were summarised using SplitsTree 4.12  and Dendroscope 3 , 1) from the trees resulting from individual analyses of non-recombining regions and 2) from the single multi-labelled tree (in which recombinant taxa are represented more than once) resulting from analysis of the supermatrix. In both cases, nodes subject to <70% BS were first collapsed to form polytomies. Using Splitstree, consensus splits of trees were computed using the Consensus Network method  and splits were transformed into a reticulate network using the RECOMB2007 method ; cluster-based rooted networks were computed using Dendroscope.
Path-o-gen 1.3  was used to investigate the ‘clocklikeness’ of the PVY phylogeny with an ML tree obtained using RAxML and asynchronous tip ages (as below). The supermatrix was analysed with the above topological constraint plus a monophyly constraint for the ingroup (to root the phylogeny) using BEAST 1.7.2  on the CIPRES Science Gateway . We applied fixed age constraints to tips to calibrate the rate of molecular evolution , . For sequences not produced for this study this information was obtained from the authors of the original studies; where this was not possible the isolates were omitted from the analyses (Table S1). Age estimates are more precise when the range of tip age constraints spans a higher proportion of the total age of the group . The total age range of the tips in these analyses spans the years 1982 to 2010 (i.e. 28 years), which is 35% of the 80 year putative timeframe for the observation of the disease in crops, but is likely to represent a much smaller proportion of the total age of PVY root node (which is unknown). Using the taxon duplication approach, the ages of recombinant isolates contribute to calibration in multiple branches of the tree (similar to the calibration of multiple homeologues in polyploids in ). We applied the general time reversible (GTR) substitution model with gamma distributed rates and a proportion of invariable sites. We applied strict clock (SC) and relaxed clock models, the latter assuming lognormal (LN) and exponential distributions (EX) of rates across the phylogeny, in order to assess the sensitivity of age estimates to assumptions regarding patterns of molecular rate variation, particularly given the potential (though, to our knowledge, as yet untested) impact of missing data. Under SC two MCMC runs of 10 million generations each were performed, each sampling trees every 1,000 generations. Under LN and EX three runs of 50 or 100 million generations each were performed, sampling trees every 10,000 generations. Shorter runs excluding the recombinant sequences were performed as a joint sensitivity test for the impact of calibrations and missing data. Removing taxa from the matrix might be expected to reduce the precision of age estimates due to the directly associated loss of information regarding rate calibration (i.e. both tip ages and molecular variation). However, should the wider confidence intervals of such age estimates not contain those inferred in the presence of significant proportions of missing data this might provide evidence for some form of bias. Likelihood and topological convergence and adequate sampling of the runs were confirmed using AWTY  and Tracer . Randomisations of tip ages as a further test of the validity of rate estimates  were not feasible given the lengths of runs necessary to reach convergence. We note however that datasets shown to fail such tests are generally characterised by low levels of sequence variation and/or only produce precise age estimates when constrained by informative priors, e.g. on demographic parameters or on the age of the root . Neither is the case here.
We analysed 44 PVY genomes with recombination detection software in order to identify recombinant isolates and locate the recombinant regions of their genomes. Just 19 of these isolates can be regarded as non-recombinant. Five are single recombinants and 20 show two or more recombination events. We located 17 breakpoints and inferred phylogenetic trees for the corresponding sequence of non-recombining regions under parsimony and Maximum Likelihood (ML) to confirm topological differences between trees (Fig. 3). We then identified homologous recombination patterns on the basis of common breakpoints and/or (given the possibility for nested recombinants) monophyly in gene trees. Both were evident for PVYNTN (18 isolates) and PVYN-W B-type (four isolates) separately and for PVYN-W A- and B-types plus PVYNTN together (Figs. 2 and and3).3). A further 10 recombination events (of 13 in total) were represented by single isolates (Table S2). Recombinant regions, the strains/isolates that exhibit them and the genome type (PVYNONPOT: a strain found in various plants other than potatoes; PVYO or PVYN) involved are illustrated in Fig. 4.
In order to simultaneously infer the phylogenetic relationships of both non-recombinant and recombinant whole genomes we identified shared phylogenetic signal across non-contiguous genome regions (i.e. those interrupted by recombinant regions) on the basis of gene tree topological congruence and the logical sequence of homologous recombination events (Figs. 2 and and3).3). On this basis, the whole genome sequences of single recombinants were subdivided between two taxa each in the matrix, double recombinants PVYN-W B-type and PVYNTN between three, and triple recombinant PVYNTN1 between four; each taxon representing a subset of the alignment with distinct phylogenetic signal with the rest of the alignment coded as missing data (Fig. 2). In the case of double recombinants Fr and NNP, the phylogenetic signals were not sufficiently strong to discern congruence and conflict, and ancestral-type single recombinants are unknown. Recombinant regions 9052–10654 and 10172–10654 of Fr and NNP respectively were therefore excluded from further analyses by means of recoding as missing data (Fig. 2).
Phylogenetic analyses were performed on the resulting supermatrix under parsimony and ML. Of 9,723 characters included in the analyses (reduced from an alignment of 10,685 with outgroups), 5,683 were variable and 4,267 parsimony informative. In order to avoid potential loss of phylogenetic resolution between taxa with entirely non-overlapping sequences a topological constraint was designed to enforce the monophyly of each clade of homologous recombinants. This corresponded to eight nodes of a total of 89 given a 90 taxa bifurcating tree (Fig. 5). The results were entirely congruent irrespective of whether this topological constraint was applied or not, but the constrained analysis resulted in considerably higher resolution both within and between constrained clades without leading to an increase in shortest tree length (Fig. 5). Networks summarised from the individual gene trees (non-recombinant regions separately; Fig. S1 A) and the multi-labelled tree (resulting from analysis of the supermatrix; Fig. 6, Fig. S1 B) were broadly comparable, but the former considerably more complex. In both cases network structure was revealed within monophyletic PVYNTN and PVYN-W B clades.
Path-o-gen was used to calculate a regression of root-to-tip distances against dates of sampling. The slope of the regression (representing the rate) was 0.00237; with correlation coefficient (variation in rate) 0.1885; R squared 0.0355; and residual mean squared 0.0073; indicating deviation from a strict molecular clock. In order to infer simultaneously phylogenetic relationships and the ages of clades and recombination events we analysed the supermatrix under Bayesian inference with both strict and relaxed clock models, including only the 28 (of 44) PVY isolates for which we could obtain accurate asynchronous sampling dates. Tree samples from independent Bayesian runs of the supermatrix under each model showed consistent and stable posterior probability (PP) clade support and effective sampling sizes for model parameters >200. The maximum clade credibility tree of the LN analysis is illustrated in Fig. 7; nodes subject to ≥0.95 PP were consistent with those ≥70% BS inferred under parsimony and ML (data not shown). In general, strict clock (SC) based estimates for deeper nodes are older than those based on either lognormal (LN) or exponential (EX) models, but the discrepancy is smaller for more recent nodes. The 95% highest posterior density interval (95% HPD) for the clock/mean rate were 0.0001–0.0003 (SC); 0.0003–0.0012 (LN); and 0.0005–0.0015 (EX). The 95% HPD for the standard deviation of the LN relaxed clock was 0.6128–1.0086, which as it does not include zero further indicates the rejection of the strict molecular clock. The 95% HPD for the crown node of PVY dates to 619-161 (LN)/436-123 (EX)/970-525 (SC) years ago. Crown nodes ages of PVYO, PVYN and PVYNONPOT are similar according to the different methods, falling between around 150 and 30 years (although PVYNONPOT is older under the SC model: 365-201 years; Table 1). Excluding the recombinant isolates (leaving just 13 of 66 taxa in the supermatrix) resulted in much broader but overlapping age ranges, e.g. for the PVY crown node: 4,738-84 (LN); 3,498-58 (EX); and 133,080-801 (SC). Hence the confidence intervals extended considerably further back in time (there was no prior constraint for the age of the root node in any of the analyses), but with the exception of the SC model also increased towards the present (which is by definition constrained by the ages assigned to the tips). In the relaxed clock results based on full taxon sampling with missing data there was thus no obvious bias towards either older or more recent age estimates. The range of crown and stem node age estimates for recombinant clades (i.e. PVYN-W/PVYNTN, PVYNTN and PVYN-W B) place the corresponding recombination events between 48 and 20; 47 and 19; and 47 and 6 years ago, respectively (Table 2). Recombination events represented by single isolates have stem ages in years as follows: PVYNTN1: 35-17; NN300_60: 44-32; and Fr: 112-93.
The most serious consequences of PVY infection are yield reduction and PTNRD. Specific genes are currently under investigation for their potential to cause pathologies , , ,  but the symptoms are generally worse in the recombinant strains PVYNTN and PVYNW and recombination has also been directly implicated as a cause of pathogenicity . Our results show that recombination is widespread amongst both highly pathogenic and less pathogenic PVY strains. We recovered essentially the same breakpoints identified for individual isolates in previous work (suggesting that our recombination detection approach was not overly conservative), as well as identifying a novel recombination pattern in NN300_60, a South African isolate similar to the previously described NE-11 . The resulting phylogeny shows the major groupings of PVY strains , – and qualifies the phylogenetic affinities of the isolate Chile3 (apparently not the sister group to PVY, contra ), whilst simultaneously identifying the multiple phylogenetic relationships of the recombinants.
Our results confirm the single origins of recombinant PVYNTN and PVYNW strains, whilst at the same time providing evidence for recombination events within those strains. Both the original recombination events and some of those occurring subsequently must be associated with more or less identical breakpoints. This phenomenon has been reported for other viruses, such as begomoviruses, which have been shown to recombine at non-random breakpoints , due at least in part to selection . The extreme consequence of such a process would be where structural genes in viruses represent ‘functionally interchangeable modules with effectively independent evolution’ . Our results nevertheless imply tractable sequences of recombination events within an otherwise effectively tree-like underlying phylogeny. With the supermatrix approach, the phylogenetic affinities of even relatively short (non-recombinant) genome regions can be assessed within the strong phylogenetic ‘scaffold’ provided by the other data , . The improved overall resolution that we obtained for the PVY phylogeny using the taxon duplication-based supermatrix compared to the separate analyses of non-recombinant genome regions is reflected in a simpler network with far fewer alternative connections between recombinants and non-recombinants (Fig. S1).
Further investigation of our approach (and of supermatrix analyses in general) should address the potential impact of missing data on age estimation. Although no obvious bias was apparent here (with the exception of the arguably inappropriate SC model), the power of the assessment was limited by the small number of non-recombinant genomes analysed. Assuming that the models implemented in BEAST do adequately reflect the uncertainty involved, the advantages of our approach are clear: by including both non-recombinant and recombinant isolates in single molecular dating analyses in this manner we were able to estimate the ages of both clades and recombination events between them, with improved precision relative to analysis of non-recombinants alone.
The potato, Solanum tuberosum L., originated in the New World and its centre of diversity is in the Andes. All cultivated varieties descend from a single domestication , with the first archaeological evidence of potato use dating to ca. 750 BC in Peru and the first likely cultivation of potatoes dating to 400 AD . However, our age estimates for the most recent common ancestor (mrca) of PVY correspond more closely to the timing of the first introductions of potatoes to Europe (between 1540 and 1565 to Spain and in 1565 to Britain ), consistent with the recent origins inferred for various plant viruses . The age estimates could indicate origin either in the New World before the European introduction, or thereafter, but given that our oldest age estimates were produced under the SC model, which appears less appropriate for this data, the latter seems more plausible. Age estimates for deeper nodes in the phylogeny are imprecise, which is inevitable given the relatively recent asynchronous tip ages used for the molecular clock calibration. However, our calibration is independent of any assumptions regarding correlations of particular phylogenetic relationships with historical isolation and outbreak events (as used by  in their study of potyviruses). This logical independence is crucial given our aim to test exactly these kinds of hypotheses. Despite the wide confidence intervals, even the oldest estimates for the ages of the crown nodes of PVYO and PVYN fall within the last c. 150 years. This places the timeframe of the radiation of PVY clades known from potato crops, as well as all known recombination events between those clades, well within that of modern potato cultivation.
In fact, most recombination events inferred here are conspicuously recent. PVYN-W A-type descends from a common recombination between PVYO and PVYN which dates to 48-20 years ago, and PVYNTN and PVYN-W B-type are the results of two subsequent recombination events that we date to between 47 and 19 years ago and 47 and 6 years ago, respectively. These ages are consistent with the first observations of symptoms associated with specific strains in potato crops. PVY strains isolated from non-potato hosts are restricted to the PVYNON-POT clade and isolate Chile3  and gene flow (in the form of recombination) between these and other PVY clades appears to be rare compared to that between and within PVYO and PVYN. Overall, these suggest that the strains of PVY currently infecting crops have evolved as specialists of potato cultivars and not, as might have been the case, by lateral transfer from other hosts. They are consistent with recent (recombinant) origins of pathogenic strains of PVY within modern potato crops. Given the age estimates, these may have been particularly associated with increased 20th Century international trade; there is no evidence for earlier recombination between the major PVY strains.
Advances in transport have inevitably led to the increasingly rapid distribution of material infected with different strains of PVYO and PVYN. It is clear from the age estimates presented here, as well as the high genetic diversity of PVY strains found in individual countries such as South Africa , that measures to control such movement were implemented subsequent to the origin and spread of the most damaging PVY recombinant strains. Our results illustrate how recombination between both distantly and closely related strains of PVY has contributed to the origin of pathogenic strains such as PVYNTN. They also provide evidence for ongoing recombination within PVYNTN, as might be expected from patterns observed in other viruses –. This process is a cause for concern in the context of disease prevention, as it could facilitate combinations of sequence variants that increase virus fitness, for example by improving transmission. For crop plants, it is possible at least in principle to reduce the spread of diseased material by stringent testing as part of national certification schemes and monitoring of imports. Our results serve to further highlight the importance of such efforts.
Phylogenetic networks summarised using SplitsTree a) from the 12 70% BS consensus trees in Fig. 3; b) from a single 70% BS consensus of the multi-labelled tree in Fig. 5 b. Nodes recovered in one network but contradicted in the other are indicated with red dots on the corresponding branches. Major PVY strains and recombinant clades are indicated.
Accessions details, including details regarding tip dates used in BEAST analyses and from who the information was obtained.
Summary of break points identified in the genomes of the isolates analysed in this study.
Standard alignment of sequences used in this study.
Supermatrix used in this study.
Aelys Humphreys provided comments on a draft of the manuscript. Two anonymous reviewers and Academic Editor Darren Martin provided further constructive criticism. The authors are grateful to various researchers who provided additional information regarding their published sequences.
This research was supported in part by funding granted to DUB from Potatoes South Africa, and by a Claude Leon Foundation Postdoctoral Fellowship to MDP. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.