|Home | About | Journals | Submit | Contact Us | Français|
Potato virus Y (PVY) is a major pathogen of potatoes and other solanaceous crops worldwide. It is most closely related to potyviruses first or only found in the Americas, and it almost certainly originated in the Andes, where its hosts were domesticated. We have inferred the phylogeny of the published genomic sequences of 240 PVY isolates collected since 1938 worldwide, but not the Andes. All fall into five groupings, which mostly, but not exclusively, correspond with groupings already devised using biological and taxonomic data. Only 42 percent of the sequences are not recombinant, and all these fall into one or other of three phylogroups; the previously named C (common), O (ordinary), and N (necrotic) groups. There are also two other distinct groups of isolates all of which are recombinant; the R-1 isolates have N (5′ terminal minor) and O (major) parents, and the R-2 isolates have R-1 (major) and N (3′ terminal minor) parents. Many isolates also have additional minor intra- and inter-group recombinant genomic regions. The complex interrelationships between the genomes were resolved by progressively identifying and removing recombinants using partitioned sequences of synonymous codons. Least squared dating and BEAST analyses of two datasets of gene sequences from non-recombinant heterochronously-sampled isolates (seventy-three non-recombinant major ORFs and 166 partial ORFs) found the 95% confidence intervals of the TMRCA estimates overlap around 1,000 CE (Common Era; AD). We attempted to identify the most accurate datings by comparing the estimated phylogenetic dates with historical events in the worldwide adoption of potato and other PVY hosts as crops, but found that more evidence from gene sequences of non-potato isolates, especially from South America, was required.
Potato virus Y (PVY) is the type species of the genus Potyvirus, one of the largest, most widespread, and economically important genera of plant viruses. It is a major world pathogen of potatoes, and other solanaceous crops, such as tobacco, tomato, and pepper, and is probably the most damaging virus of the world’s potato crop (Loebenstein et al. 2001; Stevenson et al. 2001; Kerlan 2006; Kerlan and Moury 2008; Gray et al. 2010; Ogawa et al. 2012; Karasev and Gray 2013; Jones 2014). PVY was first described by Smith (1931), its C strain by Bawden (1936) who considered it to be a distinct but related virus (potato virus C; PVC), and its N strain by Nobrega and Silberschmidt (1944). PVC was subsequently recognized as the common strain of PVY by Bawden and Sheffield (1944) and PVY’s original or “ordinary” strain then became its O strain. The O and C strains from potato were distinguished by their reactions with the potatoes Ny and Nc hypersensitivity genes, and the third strain, the N strain, caused systemic veinal necrosis in tobacco. The biological strains named O, C, and N were later adopted (De Bokx and van der Want 1987; Jones 1990) and were mostly congruent with their phylogenetics (Boonham et al. 2002; Moury et al. 2002). Isolates from pepper and tomato mostly belong to the C phylogenetic group (Moury 2010), whereas those from tobacco were mostly placed in the N and O phylogenetic groups (Tian et al. 2011). However, as more genomes have been sequenced and more potato hypersensitivity genes found (Singh et al. 2008; Moury 2010; Karasev et al. 2011; Karasev and Gray 2013; Jones 2014; Rowley, Gray, and Karasev 2015; Kehoe and Jones 2016), continued use of the same names for biological and phylogenetic groups has proved to be increasingly confusing due to the lack of complete coincidence between them. A new strain nomenclature system for sub-dividing the phylogenetic groups using Latinised numerals has therefore been proposed (Jones 2014; Jones and Kehoe 2016; Kehoe and Jones 2016). Their groupings correlate with the five broad ‘phylogroups’ we discuss in this article; the traditional C, O, and N groups plus two groups of recombinants, R1 and R2.
The genomic sequences of more than 200 isolates of PVY are now publicly available in the international databases. They come from all continents except Antarctica but none have been collected from crops in the potato’s main center of domestication in the Andean region of Bolivia and Peru, where nine potato species are cultivated in contrast to only one (Solanum tuberosum) outside the Andean region (Hawkes 1978; Brown 1993; Brown and Henfling 2014). The phylogenetic history of PVY is complex. It has undoubtedly involved spread within and between wild and domesticated potato species, wild ancestors of tomato, pepper, tobacco, and other solanceous crops in South or Central America, and also within plantings of potato and other cultivated solanaceous plants throughout the world (Klinkowski and Schmelzer 1960; Silberschmidt 1960; Brücher 1969; Jones 1981; Spetz et al. 2003). It has also involved recombination between lineages (Glais, Tribodet, and Kerlan 2002; Moury et al. 2002; Lorenzen et al. 2006; Schubert, Fomitcheva, and Sztangret-Wisniewska 2007; Ogawa et al. 2008; Singh et al. 2008; Hu et al. 2009a, b; Visser and Bellstedt 2009; Moury and Simon 2011; Cuevas et al. 2012; Karasev and Gray 2013). Many of these recombinants cause tuber necrosis (Beczner et al. 1984; Boonham et al. 2002; Karasev and Gray 2013).
Here, we report an analysis updating the phylogenetic history of PVY. It is based on the genome sequences available in the international databases in January 2016. They were from isolates collected from around the world, but none were from the potato crop’s domestication center in the Andes, where greater diversity would be expected. The majority of the isolates were from potatoes, but several also came from pepper, tomato, or tobacco. As recombination confounds phylogenetic analysis, our strategy was to separate the genomic sequences that are mostly non-recombinant (n-rec) from those that had major recombinant (rec) regions. To simplify the separation of rec from n-rec sequences, the PVY ORF sequences were partitioned into alignments of codons that only varied synonymously (syn codons), and those that had also varied non-synonymously (n-syn codons), and these were then analyzed separately. This strategy identified two sets of n-rec sequences, one of full-length ORFs and the other partial ORFs, for dating the PVY phylogeny. Our dating analyses, like those of Visser, Bellstedt, and Pirie (2012), find that the PVY population outside the Andean region mostly diverged over the past few centuries. This was after the Spanish colonization of South America following the defeat of the Inca empire in 1,532, and after the introduction of one species of potato (S. tuberosum) to Europe in the second half of the 16th century and later to other continents (Salaman and Hawkes 1949; Salaman 1954; Brown 1993; Hawkes and Fransisco-Ortega 1993; Brown and Henfling 2014). We found, like Visser, Bellstedt, and Pirie (2012), that the damaging rec variants emerged and spread only during the past century.
Two hundred and forty complete genomic sequences of PVY (Supplementary Table S1) were obtained from Genbank in January 2016. Each was edited using BioEdit (Hall 1999) to extract its main ORF. These were aligned, using the encoded amino acids as guide, by the TranslatorX online server (Abascal, Zardoya, and Telford 2010; http://translatorx.co.uk) with its MAFFT option (Katoh and Standley 2013) to give an alignment of 9,201 nucleotides (available at http://184.108.40.206/_resources/e-texts/blobs/240PVYORFs.zip). BlastN and BlastP (Altschul et al. 1990) online facilities of Genbank were used with the Chile 3 sequence, and representative PVY sequences from the N and C phylogroups, to search for related sequences. A simple pairwise sliding-window method DnDscan (Gibbs et al. 2006), available at http://220.127.116.11/_resources/e-texts/blobs/DnDscan1.ZIP, was used to identify codons in the alignments that had only evolved synonymously or had also evolved non-synonymously. These were partitioned using SEQSPLIT v1.0, a Fortran program written by, the late John Armstrong (available at http://18.104.22.168/_resources/e-texts/blobs/SeqSplit.ZIP). Sequences were tested for the presence of phylogenetic anomalies using the full suite of options in RDP4 (Maynard Smith 1992; Holmes, Worobey and Rambaut 1999; Padidam, Sawyer and Fauquet 1999; Gibbs et al. 2000; Martin and Rybicki 2000; McGuire and Wright 2000; Posada and Crandall 2001; Martin et al. 2005; Boni, Posada and Feldman 2007; Lemey et al. 2009; Martin et al. 2015); regions found to be anomalous by three or fewer methods and <10−6 random probability were ignored. For one test, codons in the aligned sequences with gaps were removed using POSORT (available at http://22.214.171.124/_resources/e-texts/README-POSORT.pdf).
Models for ML analysis were tested using TOPALi (Milne et al. 2009) and the ProtTest 3 server at http://darwin.uvigo.es (Darriba et al. 2011); the best fit models were found to be GTR+Г4+I (Tavaré 1986) for nucleotide sequences and LG+Г4+I (Le and Gascuel 2008) for amino acid sequences. Phylogenetic trees were inferred using the neighbor-joining (NJ) facility in ClustalX (Jeanmougin et al. 1998), the SplitsTree method (Huson and Bryant 2006) and PhyML 3.0 (ML) (Guindon and Gascuel 2003), and the support for their topologies assessed using the log-likelihood support for the trees and the SH-support (Shimodaira and Hasegawa 1999) for their nodes. Nucleotide diversities (Nei and Li 1979) were computed using DAMBE5 online (Xia 2013). Trees were drawn using Figtree Version 1.3 (http://tree.bio.ed.ac.uk/software/figtree/), and pairs of trees were compared using PATRISTIC (Fourment and Gibbs 2006) to test for mutational saturation and were confirmed by the method of Xia (2013). Most virus isolate collection dates (Supplementary Table S1) were obtained from Genbank files, or from Visser, Bellstedt, and Charleston (2012), and dates for some N and NTN isolates were not previously published (Ohshima K—unpublished data); the dating of sequence EU563512 is mentioned in the Section 4. The temporal signal in sets of aligned sequences, and the dates of the most common recent ancestor (TMRCA) and other nodes of inferred phylogenies were estimated by TempEst (Rambaut et al. 2016), the ‘Least Squares Dating’ method of To et al (2015) using Version lsd-0.3beta, and by the probabilistic methods of BEAST v1.8.2 (Drummond et al. 2012). In BEAST analyses Bayes factors were used to select the best-fitting molecular-clock model and coalescent priors for the tree topology and node times, and we compared strict and relaxed (uncorrelated exponential and uncorrelated lognormal) molecular clocks, as well as five demographic models (constant population size, expansion growth, exponential growth, logistic growth, and the Bayesian skyline plot). Posterior distributions of parameters, including the tree, were estimated from Markov Chain Monte Carlo samples taken every 104 from 108 steps after discarding the first 10 percent, and checked using Tracer v1.6 (http://tree.bio.ed.ac.uk/software/tracer/). These provided the TMRCAs, and other dates were obtained from the ‘maximum clade credibility tree’, namely the tree that was commonest among those observed. The adequacy of the temporal signal in our data was also checked by using ten independently date-randomized replicates in both the least squared dating (LSD) and BEAST analyses (Ramsden, Holmes and Charleston, 2009; Duchêne 2015).
The principal ORFs of 240 genomic sequences of PVY gave an alignment 9,201 nucleotides long with 6.7 percent of the 3,067 codons invariate. The ORFs formed five major groupings in a mid-point rooted NJ tree (Fig. 1A). Their relationships closely resembled those found in a maximum likelihood phylogeny reported by Kehoe and Jones (2016) who examined seventy-three isolates from among the 240 that we examined. In both our NJ phylogeny and the published ML phylogeny, the C phylogroup had the longest branches and formed the basal tree. One of its lineages diverges to give the O group and three others, N, R-1, and R-2; see Fig. 1 legend for the correspondences between our phylogroups and those of Kehoe and Jones (2016). In both phylogenies, many sister sequences have very asymmetric relationships; when summed to the midpoint root, the shortest branches are less than 25 percent of the longest. This may indicate that evolutionary rates have varied or, more likely, that some of the sequences are recombinant. The same arrangement of the same phylogroups was inferred by SplitsTree analysis (Fig. 2A) with the C and O phylogroups linked to others by a complex web of interrelationships indicating that many of the ORFs are recombinants, some of which are closely similar as shown by clustering of the links between the groups.
The ORFs were also directly searched for phylogenetic anomalies using the RDP suite of programs, and more than half gave significant evidence of recombination but with the assignment of ‘parental’ sequences often uncertain as many alternative combinations of ORFs were identified as possible ‘parents’.
The search for the n-rec ORFs was simplified by using alignments obtained by partitioning the variable codons into sequences of the syn codons (50.4 percent) and the n-syn codons (42.9 percent) they contained. NJ trees of these two alignments were closely similar topologically to those of the complete sequences (Fig. 1A) and, when compared in a patristic-distance graph (Fig. S1), they showed no evidence of mutational saturation along the main axis of the tree, and this was confirmed using the binary test for saturation (Xia 2013). However, there were discrete clusters of points off the main diagonal, as would be expected for rec sequences, and the main diagonal was very broad with the spread being greater in the n-syn axis (X) than in the syn axis (Y), again indicating the presence of recombinants.
ML trees obtained from the complete ORF sequences, and from their syn and n-syn codons, were also closely similar however the syn codon tree had a greater statistical support than those calculated from the n-syn codons or the complete sequences; log-likelihoods of −45390.6 and −60825.8 and −109855.39165, respectively. Furthermore, more nodes in the syn sequence tree had SH-support >0.9 than the same nodes in the other trees.
A preliminary search of the syn codon sequences using RDP, as well as NJ trees and patristic distance graphs, showed that all members of the R-1 and R-2 phylogroups had some large rec regions with the same ‘parents’. A representative selection of 48 sequences (Supplementary Table S1) was used to resolve these groupings. The selection included all sixteen sequences of the C phylogroup together with eight sequences from each of the other phylogroups, chosen to be as representative as possible of the basal divergences of those phylogroups (i.e. those with the shortest branch lengths, and therefore least changed after the phylogroup had been established). Sequences with the same pattern of recombination and parentage were progressively identified and removed. The most strongly supported phylogenetic anomaly was shared by all eight R-2 sequences. They had an R-1 phylogroup sequence (EF026076) as the major parent and an N phylogroup minor parent (AJ890346); the latter region was from around nt 5707 to the 3′ terminus of the complete ORFs (38 percent of the ORF). It was identified by seven out of nine RDP methods with probabilities ranging from 10−27 to 10−112. All R-2 sequences were therefore removed, and the remaining forty sequences again analyzed by RDP, which then found the most strongly supported and shared anomaly to be in all R-1 sequences. These had an O phylogroup major parent (EF026074) and an N phylogroup minor parent (X97895), that had, in most (see below), been provided from the 5′ terminus to nt 2205 (24 percent of the ORF); this region was identified by seven out of nine RDP methods with probabilities ranging from 10−23 to 10−84. The syn and n-syn sequences of the remaining thirty-two C, O, and N phylogroup ORFs gave almost identical ML and NJ phylogenetic trees, and a patristic distance graph of those trees had most points close to a single diagonal; the statistical support for the syn ML tree was again greater than for the n-syn ML tree (log-likelihood −24723.4 and −29299.9, respectively).
Finally, the syn codon sequences from C and O phylogroup isolates, that were not among the forty-eight representative sequences, were added to the remaining thirty-two C and O syn sequences to produce an alignment of 120 sequences. This was then searched using RDP for other more specific rec regions (i.e. sequence specific, not phylogroup specific), and seventeen sequences were found to have significant inter- and intra-phylogroup rec regions (i.e. three C phylogroup sequences, seven from the O phylogroup, and seven from the N phylogroup). One of those N phylogroup sequences was X97895 from isolate N605, which has been used in several studies as a reference sequence, but was found by us to be a recombinant between AJ890346 and DQ157180, representing the two main lineages of the N phylogroup. Figure 1B shows the NJ tree of the complete ORFs of the remaining 103 sequences. This phylogeny is much more symmetrical than that of the original 240 sequences (Fig. 1A) and its shortest branches (tip to root) are around 70 percent of the length of the longest. Figure 2B shows their SplitsTree graph, which again has a much simpler linkage structure in its main branches than the 240 sequence tree (Fig. 2A). Figure S1 also shows the patristic distance graph comparing the NJ trees of the 103 syn and n-syn sequences, and this also confirms that these n-rec sequences have evolved in a biologically coherent manner with most points aligned with, and close to, a diagonal that has a slope around 0.8, and shows no evidence of mutational saturation.
The genome maps of the three rec phylogroups are shown in Fig. 3. Phylogroup N is the sister lineage to the C and O phylogroups and most genomes of these three phylogroups are non-recombinant, whereas all the R-1 and R-2 phylogroup sequences are recombinants and have parents in the O and N phylogroups.
The ORFs of all forty-three R-1 phylogroup sequences are most closely related to ORFs of the O phylogroup, especially that of sequence EF026074. However all have a 5′ terminal region (24 percent of the complete ORF) that is closely related to the homologous region of N phylogroup ORFs, especially that of X97895. In around half the ORFs (e.g. DQ157179, HQ912870, JF927762) the N phylogroup region is from its 5′ end to around nt 2205, and in others (e.g. AJ890349, HQ912863, JN935419, and KJ801915) it is from nts 301 to 2,205, and is preceded by a short region closest to O sequences. In two of the sequences (HM991454 and JQ969040) the N phylogroup region is most closely related to AB331517, not X97896. These differences indicate that more than one event and O parent was involved in establishing the R-1 phylogroup. Some of the R-1 sequences (e.g. AJ889868 and KJ634023) also have a short rec region around nts 5,600–6,300 that is most closely related to that region of R-2 isolates.
The sequences of all seventy-six R-2 ORFs are closest to that of an R-1 phylogroup sequence, EF026076, but all have a 3′ region (nts 5707-end; 38 percent of the ORF) that is most closely related to the homologous region of AJ890346, an N phylogroup sequence. In addition around one-fifth of the R-1 and R-2 sequences also have smaller individual regions that in RDP analyses are recorded as significant inter- and intra-phylogroup recombinations.
The C phylogroup ORFs sequences, including that of the Chile 3 isolate, are the most variable; nucleotide diversity π=0.0974± 0.0460. Those of the other four phylogroups are much less variable, indicating that they have probably diverged more recently; the N phylogroup population is the most variable, π=0.0256± 0.0121, compared with the O phylogroup, 0.0211± 0.0100, R-1 phylogroup, 0.0196± 0.0093, and R-2 the least variable, 0.0188±0.0089.
BlastN and BlastP searches using representative sequences from all of the n-rec PVY phylogroups found the most closely related genomic sequences to be those of sunflower chlorotic mottle virus (JN863233), pepper severe mosaic virus (AM181350) and bidens mosaic virus (KF649336). In both ML and NJ trees of seventy-three dated n-rec complete ORFs (see below), including these three viruses as an outgroup, the N phylogroup was placed as sister to all other PVY phylogroups, as in Fig. 1B. However, when ML or NJ trees were calculated from the encoded amino acid sequences, or from the 166 core nucleotide sequences (see below) together with the homologous regions of the three outgroup sequences, the Chile 3 sequence (FJ214726) was placed as sister to all other PVY lineages. This difference was found not only when complete aligned sequences were used, but also when all indels (9.5 percent of the sequences) had been removed. Nonetheless all the major nodes in trees found from complete and degapped sequences of both ORF and amino acid trees, had>0.97 SH-type statistical support. Thus the exact position of the basal node of the present PVY phylogeny is unresolved but is either side of a single robustly supported node linking the Chile 3 sequence and the N phylogroup to all other PVYs. The reason for this rooting uncertainty is unknown; the individual mutational differences between the Chile 3, N phylogroup and all other sequences are spread throughout the full length of the sequences. Furthermore the points representing the pairwise syn and n-syn comparisons of the Chile 3 sequence and other PVY isolates in a patristic distance graph (Fig. S1) were close to the main diagonal trend, indicating that the Dn/Ds ratios of the Chile 3 isolate are not unusual, but similar to those of other PVY isolates.
The 240 PVY genomic sequences we analyzed provided two sets of heterochronously dated sequences (Supplementary Table S1). Seventy-three non-recombinant isolates from the C, O, and N phylogroups provided a set of full length ORFs; the earliest sample was collected in 1938 CE (Common Era or AD), the most recent in 2013 CE and their mean sample collection date was 1999.4 CE. As described above the R-1 and R-2 genomes are recombinants that share the central core region of their genomes (nts 2,206–5,706) with those of the O phylogroup. Sample collection dates are known for ninety-eight R-1 and R-2 isolates (Supplementary Table S1). Thus, together with the core regions of the seventy-three dated recombinant sequences, their shared central region provides another dataset of 171 sequences for estimating PVY phylogenetic divergence dates; it covers the same range of sampling dates, but with a mean of 2004.1 CE. Although there are twice as many sequences, they are only 38 percent of the length of the complete ORFs. The alignment of the 171 core regions was re-examined by RDP, and five found to have a short 5′ terminal region of N phylogroup sequence (closest to JQ969036), so these were removed to leave 166 sequences in the dataset. ML and NJ trees inferred from these core regions differed only in minor details and resembled closely those of the seventy-three n-rec ORFs (Fig. 1), except that they placed all the O, R-1, and R-2 sequences in single tight cluster that did not cleanly resolve into O, R-1, and R-2 lineages (Fig. 4). The poor resolution within this cluster of O, R-1, R-2 sequences probably results from the small diversity within the cluster; nucleotide diversity π=0.018± 0.008 compared with 0.048± 0.023 in all 166 core sequences, and 0.075± 0.035 in the seventy-three n-rec complete ORFs. Nonetheless the distribution of O, R-1 and R-2 sequences within the lineages (Fig. 4) indicate clearly that O isolates are ancestral to R-1 isolates, and R-2 isolates are the most recent.
TempEst analyses of the two datasets (73 n-rec ORFs and 166 cores) found that both had a strong temporal signal; correlation coefficients 0.436 and 0.228, P=0.00012 and 0.0032, for the complete and core sequences, respectively. They had small residuals with no trend, and these were mostly associated with the C phylogroup sequences and that of Chile 3. The estimated TMRCAs were 1411.6 CE and 1076.8 CE, respectively (Fig. 5). Separate TempEst analyses of the seventy-three n-rec cores and ninety-three rec cores in the 166 core dataset found that the seventy-three n-rec core sequences had a temporal correlation of 0.448 (P=0.00007) with a TMRCA of 1459.8 CE, whereas the regression for the ninety-three rec core sequences was 0.0356 (P=0.73). Thus the temporal signal detected by TempEst in the 166 core sequences seems to be mostly, if not exclusively, in its seventy-three n-rec core sequences
The LSD method was used in its ‘constrained’ mode (i.e. an ancestral node must be older than its daughter nodes) to calculate the dates, evolutionary rates, and confidence intervals, for the TMRCAs of the ML and NJ trees of the seventy-three ORF sequences, and also their encoded amino acid sequences (Table 1). Estimates of the TMRCA dates obtained from complete sequences were close to the mean of those from partitioned syn and n-syn codon sequences (Table 1). However, the TMRCA date estimates from ML trees were 80–120 years earlier than those obtained from NJ trees, and there were even larger differences between the TMRCA date estimates from ORF versus amino acid sequences; the former were 250–290 years more ancient than those from amino acid sequences. Table 1 also records the estimated dates of the five principal nodes in the phylogenies. Randomizing the collection dates confirmed that the data contained a clear temporal signal as the ML phylogeny of seventy-three n-rec complete sequences had a TMRCA date of 1085 CE, but gave a mean TMRCA date from twenty collection date randomizations of 12,555 BCE (standard deviation 7,076 years; range 10 BCE–17,036 BCE). Thus the TMRCA date calculated with the true sampling dates is outside the range of dates obtained with randomized collection dates, and 1.9 standard deviations from their mean.
To et al. (2015) stated that although LSD can analyze trees obtained by any method, ‘more accurate results are expected from trees obtained using maximum-likelihood methods’, which, with our data, gave the earliest TMRCA estimates, and these were most precise in that their 95 percent confidence limits were much smaller (Table 1). They estimate that the present PVY population diverged from a common ancestor around 1085 CE with commensurately more recent dates for the divergences within the phylogeny (Fig. 6).
LSD analysis of the 166 core sequences gave TMRCA dates (Table 1) for the ML and NJ trees close to those estimated for the seventy-three complete sequences, and all major nodes had>0.9 SH-type statistical support including that of the combined cluster of O, R-1, and R-2 phylogroup sequences and its major nodes. The mean root date from twenty collection date randomizations of the ML data was 11,491 BCE (standard deviation 5,716 years; range 1292 BCE–15,269 BCE), again confirming the strong temporal signal.
The lineages within the cluster of O, R-1, and R-2 sequences in the ML and NJ trees of core sequences, when collapsed, had identical groupings (Fig. 4), although the dates estimated for these lineages differed (Table 1 and Fig. 4). For example the ML tree estimated the origins of O isolates, R-1 isolates and R-2 isolates to be 1918.2 CE, 1946.7 CE, and 1969.4 CE, respectively (Fig. 4), whereas the NJ tree estimates were all earlier and were, respectively, 1906.2 CE, 1933.3 CE, and 1960.3 CE.
In BEAST analyses of the two datasets the best-supported model for mutational substitution was GTR+I+Γ4, while the best model for the rate of substitution was the ‘relaxed uncorrelated lognormal’ clock, and a population of constant size. All datasets passed date-randomization tests. The phylogenies inferred by BEAST had essentially the same topology as those inferred by ML and NJ, however the dates (Table 2) were significantly earlier than those estimated by the regression methods. The TMRCA of the seventy-three n-rec ORF sequences was 1590 BCE, and was twenty-four CE for the 166 cores. The topology and dates of the other nodes were obtained from the maximum clade credibility tree, were commensurately earlier and showed exactly the same pattern of slightly unresolved clustering of the cluster of O, R-1, R-2 core sequences as in the ML and NJ trees (Fig. 4A) with their origins estimated to be 1832.9 CE, 1880.6, and 1932, respectively (Fig. 4B). The 166 core sequences were partitioned into those from the seventy-three from n-rec sequences and those from the ninety-three from rec sequences, they gave TMRCA estimates of 214 BCE and 7 CE, respectively, and passed the date randomization test, indicating that they contained significant temporal signals, in contrast to the results of the TempEst tests.
Part of Fig. 5 summarizes and compares the TMRCAs and confidence intervals estimated by the different methods, and shows that despite the TMRCA estimates covering a six-fold range, the 95% confidence limits of most overlap between 500 CE and 1500 CE.
Finally, a simple comparison was made of the basal branch length of the PVY cluster as a proportion of the basal branch length of the potyviruses. A ML phylogeny was calculated from four representative PVY ORF sequences [AB331515 (N phylogroup), EU563512 (C), FJ214726 (C), HM367075 (O)] aligned with those of the other 103 representative potyviruses and rymoviruses used for Fig. 1 of Gibbs, Nguyen, and Ohshima (2015). The mean patristic distances of the sequences connected through the root of the phylogeny (i.e. from Narcissus degeneration virus, NC_008824; Onion yellow dwarf virus, NC_005029; Shallot yellow stripe virus, NC_007433; Vallota speciosa virus isolate Marijiniup 7, JQ723475 to all the other potyviruses) was 2.258 substitutions/site, and through the root of the PVY sequences (i.e. from AB331515 (N) to all the other PVYs) was 0.227 substitutions/site, a ratio of 9.94:1.
This article reports that using syn codon sequences simplified the resolution of relationships in a complex population of PVY genomes. The phylogenies and RDP analyses calculated from the syn codon sequences were simpler to interpret than those calculated from the comparable n-syn codon or complete sequences, and had greater statistical support. Syn codon changes reflect the temporal component of phylogenetic change of gene populations better than n-syn codons, as changes of the latter are also linked to changes of the encoded protein, and incoherent changes of syn and n-syn codons in complete sequences may confound analysis of either, especially when rec sequences are present. Significantly Ohshima et al (2016) also found that Bayesian estimates of the ages of cucumber mosaic cucumovirus (CMV) populations were more precise if made using syn codon sequences rather than complete sequences; CMV has three genomic segments, which encode five ORFs and they found that the age estimates for different ORFs using syn codon sequences had a coefficient of variation around half that of complete sequences, and the 95% CI ranges of those estimates was around 25–50 percent smaller than those of the original complete sequences.
The strategy of progressively removing the most strongly supported cluster specific recombinants resolved the complexity of the relationships of the rec and n-rec ORFs. In Fig. 1A (i.e. a NJ tree), and in Fig. 2 of Kehoe and Jones (2016) (i.e. an ML tree), the N phylogroup forms an outlier of the R-2 cluster. We have shown that this topology is false, as the dominance of the recombinant linkages between the phylogroups seriously compromise agglomerative methods of tree-building in the following way. When sequences of all PVY phylogroups are present in an analysis, the R-1 sequences are more closely related to their O parent than their N parent, as the former provided a greater percentage (76 percent) of their sequence, and therefore the R-1 and O phylogroups link. Likewise the R-2 sequences are more closely related to their R-1 parent than their N parent, as the former provided 62 percent of their sequence, and so again the R-2 phylogroup links to the R-1 phylogroup rather than the N parent. Finally the N phylogroup clusters with the R-2 ORFs as they share two-thirds of their sequence, whereas the true links of the N phylogroup as the sister lineage to all the other phylogroups are more distant. Thus the inferred inter-phylogroup relationships, when all are present, are false (Figs 1A and 2A), and the true topology of the phylogroups is only revealed when the recombinants are removed.
In the second part of this project we attempted to date the major features of the PVY phylogeny using different heterochronous tip-dating methods. The resulting estimates, although overlapping as judged by their 95% CIs, differed significantly in scale, so that the TMRCAs estimated, especially for the basal nodes, by the probabilistic method were between two to four times earlier than those estimated by the regression methods. So here we discuss whether events of the history of PVY and its hosts are congruent with those of the PVY phylogeny, and provide, in essence, independent node dating that supports some estimates more than others.
All our analyses supported a single PVY phylogeny that has several distinctive features (Fig. 6). PVY is a recent member of the ‘PVY lineage’ of potyviruses, most of whose members were isolated first, or only, from plants in the Americas (Gibbs and Ohshima 2010), and although PVY was first identified in the UK, it almost certainly came from the Americas. Furthermore the basal divergences of the PVY phylogeny produce three lineages, and one is the Chile 3 isolate found in Capsicum baccatum only in South America (Moury 2010). Thus the basal (TMRCA) node in the PVY phylogeny most likely represents an event that occurred in South America. The other two basal PVY lineages have been found throughout the world, but not yet in South America. The flora and fauna of South America was biologically isolated from Europe until the start of the ‘Columbian Exchange’ in the late 15th century CE (Crosby 1972) when the early trans-Atlantic maritime traders first took American animals and plants, including potato tubers, to Europe and vice versa. Thus the basal nodes of the PVY phylogeny are likely to represent events that predate the late 15th century, and this conclusion agrees with all our date estimates of those nodes; the most recent is 1411 CE with estimates from regression analyses ranging from 1085 CE to 1411 CE, and those from probabilistic analyses from 1590 BCE to 24 BCE.
The second basal lineage, the C phylogroup, probably diverged in Europe, and consists mostly (7/12 isolates) of isolates found in tobacco, tomato and pepper crops. Potato, pepper, tomato and tobacco were first introduced to Europe during the Columbian Exchange from their domestication centres in the Andean regions of Peru and Bolivia (potato), more widely in the Andes (tomato, tobacco) and further north in the Americas (pepper) (Bai and Lindhout 2007; Pickersgill 2007; Brown and Henfling 2014; Kraft et al. 2014). There is no evidence that PVY is transmitted by sexually produced seed, so it was most likely introduced to Europe in the small numbers of S. tuberosum tubers first taken from South America to the Canary Islands in 1562 CE, to Spain in 1570 CE and the UK in 1588 CE, if so then all the initial PVY population of Europe probably originated from infected S. tuberosum tubers in those cargoes. The divergence to several non-potato hosts probably occurred outside South America, although it is unknown to what extent different C phylogroup isolates are ecologically adapted to particular hosts. The dates estimated by the regression methods are congruent with this interpretation whereas those estimated by the probabilistic methods are not (Fig. 6; nodes 3 and 4).
The third basal lineage, the N phylogroup, is known only from a cluster of closely related isolates, so it probably radiated recently, and at present we have no idea of when and how it first migrated from the Americas. It has been isolated worldwide, mostly from potatoes (13/15 isolates).
The earliest potato breeding programs of any size began in 1810, but did not become widespread until the second half of the 19th century. Initially there were very few potato introductions to Europe, and for the first two centuries the crops had little genetic diversity. Virus diseases that seriously debilitated established cultivars, but did not pass through sexually produced potato seed, probably encouraged selection and, as a result, more cultivars were grown in the late 18th and early 19th centuries. These were taken from natural berries produced by self pollination. Most were selfed derivatives so selection reduced the gene pool (Glenndinning 1983). Potato blight disease caused by the oomycete Phytophthora infestans appeared in the mid 19th century and eliminated almost all cultivars as they were so inbred, further reducing the gene pool. Breeding among blight survivors and new introductions from the Andean region of South America led to many new cultivars being grown by early 20th century (Glenndinning 1983). This introduction of new potato germplasm led to genetic diversity in the crop, which apparently introduced the selection pressure in the form of PVY resistance genes that diversified the PVY population in the early 20th century. Our analyses indicate that the parents of the recombinant R-1 and R-2 strains were members of existing O and N populations, therefore probably European. As the Nc resistance gene, which C strains ellicit, was widely distributed in early potato cultivars (Bawden 1936; Cockerham 1943; Bawden and Sheffield 1944), selection for avirulence to this gene was likely to be an important factor in the diversification of the O phylogroup population. Similarly, N strains, but not O strains, infect plants with both Nc and the less common Ny and Nz genes, so selection for avirulence would again have favored diversity in the N phylogroup population (Cockerham 1970; Jones 1990;Chikh-Ali et al. 2014; Kehoe and Jones 2016).
PVY strains causing veinal necrosis symptoms in tobacco were first reported in 1935 in the UK (Smith and Dennis 1940) and soon afterwards in Brazil, Europe and North America (Nobrega and Silberschmidt 1944; Bawden and Kassanis 1947, 1951; Richardson 1958; Klinkowski and Schmelzer 1960; Silberschmidt 1960; Todd 1961; Kahn and Monroe 1963; Brücher 1969). However, whether these strains included ones in both the N and R1 phylogroups is unknown. Strains inducing this symptom in tobacco produce mild symptoms in potato plants so they soon became widely distributed because infected plants were often missed when seed potato crops were rogued (Todd 1961; Jones 1990). Possibly, this increased spread can be attributed to emergence of the R1 population at that time, but evidence for that is lacking, and none of the sequences we analyzed came from early isolates. Although the R-1 population is not very damaging to potato, it is difficult to control as its symptoms are so mild and it overcomes resistance genes Nc, No, and Nz. By contrast the R-2 population is more damaging as it often causes tuber necrosis, and still overcomes these three resistance genes.
The tuber necrosis isolates, R-2 phylogroup, were first sighted in the early 1980s in Europe and shown to be recombinants (Beczner et al. 1984; Le Romancer, Kerlan, and Nedellec 1994; Boonham et al. 2002; Lorenzen et al. 2006). These reports are likely to be accurate as R-2 infections are so noticeable and, by the 1980s, potato crops in Europe were intensively monitored for disease. Our estimates of the origin of R-2 isolates were 1969.4 CE, 1960.3 CE, and 1932.4 CE for the ML-LSD, NJ-LSD and BEAST analyses, respectively. Thus the ML-LSD estimate agrees most closely with the crop record, and the BEAST estimate agrees least well, but all are possible. Similarly, PVY infections that may have been caused by R-1 isolates were first reported in 1935, and soon confirmed to be widespread. The ML-LSD, NJ-LSD and BEAST methods placed the origin of R-1 as 1946.7 CE, 1933.3 CE and 1880.6 CE respectively. Thus the NJ-LSD estimate is the most likely, the BEAST estimate less likely, and the ML-LSD impossible.
All our date estimates agree that the near simultaneous radiation of the O and N populations of PVY (Figs 4 and 6; nodes 5 and 6) coincided with the earliest potato breeding programs of any size. These programs increased the genetic diversity of available potato cultivars and this would have included the Ny and Nz resistance genes. They also coincided with cultural and agronomic changes to seed and ware potato crop production (Singh et al. 2008; Gray et al. 2010; Jones 2014). A combination of these factors is likely to have led to conditions favoring recombinants. This scenario fits better with the LSD dates, 1918.2 CE and 1906.2 CE, respectively, than with the earlier BEAST date of 1832.9 CE.
Most of our PVY date estimates are congruent with those of the recent history of the international potato crop, but this is not surprising as the estimated dates have very broad overlapping 95% CI estimates. One crucial event might however inform us whether some TMRCA estimates are more accurate than others, and that is whether the divergence of the C phylogroup (Fig. 6; nodes 3 and 4) occurred before or after PVY was introduced to Europe in the late 16th century; the TMRCAs estimated by the regression methods are congruent with a European divergence of the C phylogroup, whereas those estimated by the probabilistic methods indicate a much earlier divergence of the C phylogroup. This dilemma will only be resolved when the genomic sequences of more Chile 3-like isolates or of the C phylogroup are known.
Our estimates of the dates for the PVY phylogeny are somewhat earlier than those reported by Visser, Bellstedt, and Pirie (2012) who analyzed rec, n-rec, and partitioned PVY genomic sequences by probabilistic methods, although our 95% CI estimates overlap. The difference may merely be the result of population sampling differences as a much larger set of sequences was available to us, and it included the early dated isolate KP691327 (1943 CE). A further difference between our analyses is the date given to sequence EU563512, which as Visser, Bellstedt, and Pirie (2012) reported, was obtained ‘from a potato plant vegetatively propagated since 1938’ and sequenced in 2007 (Dullemans et al. 2011). We dated this sequence as 1938, the earliest of our sequences, whereas Visser, Bellstedt, and Pirie (2012) dated it as 2007. In making this choice we checked how this difference affected the date estimates of the seventy-three n-rec dataset, and found that using the 1938 date gave an ML-LSD TMRCA of 1085.4 CE (95% CI 1267 CE–321 CE), whereas using the 2007 date gave a TMRCA of 363 CE, and very much broader 95% CIs (918 CE to 1.8 × 109 BCE) and, when the sequence was omitted, an intermediate TMRCA of 687 CE (1016 CE–9503 BCE). Thus the 95% CI range was smallest using the 1938 date, and justified our choice.
Finally, we explored the possibility of extrapolating dates over an even larger timescale. Gibbs et al. (2008) suggested that the Neolithic invention of agriculture, which in Eurasia occurred in the northern Levantine/Mesopotamian area around 12,000 BCE (Bellwood 2005; Pinhasi, Fort, and Ammerman 2005), produced the conditions for the starburst potyvirus diversiﬁcation, and this provides another possible dating point for potyviruses. We estimated that in an ML phylogeny of the major ORFs of a large representative set of potyviruses and four PVY isolates (one N, one O, and two C phylogroup isolates) the ratio of the mean patristic distances of the sequences connected through the root of the phylogeny and through the root of the PVY sequences was 9.94:1 indicating that a TMRCA of all potyviruses of around 10,000 BCE would give a TMRCA of PVY around 1000 CE.
Supplementary data are available at Virus Evolution online.
We are very grateful to Hien To and Olivier Gascuel for great help with use of their LSD software.
Conflict of interest: None declared.