A phylogenetic analysis of 14 complete simian virus 40 (SV40) genomes was conducted in order to determine strain relatedness and the extent of genetic variation. This analysis included infectious isolates recovered between 1960 and 1999 from primary cultures of monkey kidney cells, from contaminated poliovaccines and an adenovirus seed stock, from human malignancies, and from transformed human cells. Maximum-parsimony and distance methods revealed distinct SV40 clades. However, no clear patterns of association between genotype and viral source were apparent. One clade (clade A) is derived from strain 776, the reference strain of SV40. Clade B contains isolates from poliovaccines (strains 777 and Baylor), from monkeys (strains N128, Rh911, and K661), and from human tumors (strains SVCPC and SVMEN). Thus, adaptation is not essential for SV40 survival in humans. The C terminus of the T-antigen (T-ag-C) gene contains the highest proportion of variable sites in the SV40 genome. An analysis based on just the T-ag-C region was highly congruent with the whole-genome analysis; hence, sequencing of just this one region is useful in strain identification. Analysis of an additional 16 strains for which only the T-ag-C gene was sequenced indicated that further SV40 genetic diversity is likely, resulting in a provisional clade (clade C) that currently contains strains associated with human tumors and human strain PML-1. Four other polymorphic regions in the genome were also identified. If these regions were analyzed in conjunction with the T-ag-C region, most of the phylogenetic signal could be captured without complete genome sequencing. This report represents the first whole-genome approach to establishing phylogenetic relatedness among different strains of SV40. 
It will be important in the future to develop a more complete catalog of SV40 variation in its natural monkey host, to determine if SV40 strains from different clades vary in biological or pathogenic properties, and to identify which SV40 strains are transmissible among humans.
It has recently been recognized that naturally occurring genetic variants of simian virus 40 (SV40) exist (18, 23, 26, 27, 31, 34, 37-40, 46). As more genomic sequences became available, it was apparent that isolates differed from the reference strain SV40-776 and from each other. Major genetic variation is localized in two regions of the viral genome: the noncoding regulatory region and the approximately 300 nucleotides at the carboxy terminus of the T-antigen gene (T-ag-C), referred to as the variable domain (39, 40).
Variations in the T-ag-C gene region were frequently detected in human tumor-associated sequences, ruling out the possibility that positive findings were the result of laboratory contamination of specimens (4, 7, 26, 27, 40, 44-46). The T-ag-C sequence was shown to be stable during tissue culture passage of SV40 isolates (24). In contrast, the SV40 regulatory region may contain large insertions, deletions, or duplications (25), and rearrangements have been observed to occur within individual infected monkeys (23, 31) and during passage of SV40 in certain cultured cells (32).
Comparative study of entire genomic sequences is one of the best methods for determining the evolutionary relationships among organisms (11, 48). In the detailed analysis reported here, we examined the known complete SV40 genomic sequences, as well as described partial sequences from the SV40 T-ag-C region.
The specific goals of this study were (i) to examine the patterns of genetic variation in the complete genomes of SV40 isolated from human, nonhuman, and vaccine sources, (ii) to determine if phylogenies based on the T-ag-C gene are congruent with phylogenies based on whole-genomic sequences, and (iii) to examine the genetic variability of T-ag-C genes across available samples for which the entire genomic sequences are unknown.
The viral strains and associated sequences that were analyzed are listed in Table 1. The origin of each sequence and its GenBank accession number are shown. There are two entries for strains 776 and 777, because each was sequenced twice by using different source materials and varied slightly in sequence. SV40 reference strain 776 (SV40-776) was a BamHI clone in pBR322 (pSVB-3) obtained from G. Khoury. That sequence was previously reported. SV40-776* was an EcoRI clone (pWTSV40) prepared from a laboratory stock of SV40-776 (22). Strain 777 was obtained from A. M. Lewis, Jr. (see below). SV40-777* was a BamHI clone of SV40-777 in pBR322 made in 1983, which was obtained from M. Bastin. Because SVCPC and SVMEN are identical in sequence, they were treated as a single entry for this study and are sometimes designated SVCPC(SVMEN).
Several viral isolates were cloned and sequenced as part of this study. SV40 strains Rh911 and N128 were independent isolates recovered from uninoculated rhesus monkey kidney cultures in the early 1960s and were from samples held in storage at Baylor College of Medicine for over 25 years. SV40-Rh911 was an American isolate from 1964 (13). The virus used for cloning was from passage 2 in CV-1 cells made in 1970 at The Wistar Institute (Philadelphia, Pa.); it was obtained from A. Girardi in 1971 and remained archived at −70°C at Baylor College of Medicine until April 1996, when an aliquot was seeded into CV-1 cells to prepare a virus stock for cloning. The passage history of SV40-N128 is unclear; it was isolated in Russia in 1965 (10) and was obtained from M. Nachtigal in 1971. SV40-777, recovered originally from an inactivated poliovaccine, was from an archived stock that was obtained from A. M. Lewis, Jr., in July 1998. SV40-GM00637H cells (virus-producing SV40-transformed human fibroblast cells containing integrated and episomal SV40 genomes) were obtained in November 1999 from the National Institute of General Medical Sciences Human Genetic Mutant Cell Repository through Coriell Cell Repositories (Camden, N.J.). These cells were chosen to determine whether SV40 developed adaptive genetic changes for growth in human fibroblast cells (17).
Cloning of SV40 strains Rh911, N128, 777, and GM00637H was performed as described in detail previously (23). Briefly, when the cytopathic effect was advanced in SV40-infected CV-1 cells, a Hirt extraction (15) was performed; the resulting cleared lysate was digested with proteinase K and extracted with phenol, and the DNA was precipitated with isopropanol. After being washed once with 70% ethanol, the DNA was suspended in Tris-EDTA buffer (pH 8), cut with restriction enzyme EcoRI, and cloned into EcoRI-digested pUC-19 plasmid that had been pretreated with shrimp alkaline phosphatase (USB Corp., Cleveland, Ohio). DNA sequences were determined from plasmid clones that were purified by using a Qiagen (Valencia, Calif.) Plasmid Midi-prep kit. Double-stranded DNA sequencing using automated ABI Prism primer extension technology was performed commercially (SeqWright, Inc., or Lone Star, Inc., both in Houston, Tex.); both DNA strands were sequenced. The primers used for SV40 genomic DNA sequencing were purchased from Invitrogen Corp. (Carlsbad, Calif.), and their sequences are shown in Table 2.
A multiple sequence alignment of 14 whole-genomic SV40 sequences was performed with ClustalX V1.81 (43) by using the default parameters. The noncoding regulatory region of SV40 is subject to insertions, deletions, and/or duplications (INDELs) (23, 25, 31, 32). Alignment gaps were treated as missing data in the whole-genome analysis; therefore, part of the regulatory region (nucleotides 29 to 246 [according to SV40-776 numbering]) was excluded from the analysis. These nucleotides (29 to 246) encompass part of the early promoter and part of the enhancer (24, 25); the core ori was essentially intact (24, 25). Under these conditions, there were few alignment gaps and no ambiguous positions in the alignment. Phylogenetic trees were constructed by the maximum-parsimony and neighbor-joining methods. Maximum parsimony was implemented by use of PAUP* (version 4.0b; Illinois Natural History Survey, Champaign, Ill.), and gaps were treated as missing data; however, treating gaps as a fifth character did not significantly alter the outcome. The data were bootstrapped with 1,000 replicates by using the full heuristic search option.
The neighbor-joining method (36) was implemented by using MEGA-2 (Molecular Evolutionary Genetics Analysis, version 2.1; Arizona State University, Tempe). Gaps were excluded from the analysis; however, including gaps under the “pairwise deletion” option did not alter the tree topology. Distances were calculated by using the two-parameter model of Kimura (20). The same analyses were performed for a 315-position alignment of T-ag-C sequences on the same taxa.
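As an illustration of the distance calculation named above, the following Python sketch computes the Kimura two-parameter distance for one pair of aligned sequences. It is not the MEGA-2 implementation; the function name `k2p_distance` and the choice to skip any position where either sequence lacks an A, C, G, or T (a pairwise-deletion style of gap handling) are assumptions made for this example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Assumption for this sketch: positions where either sequence has a
    symbol other than A, C, G, or T (for example a gap) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # gap or ambiguous base: not compared
        compared += 1
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        same_class = (a in PURINES and b in PURINES) or (
            a in PYRIMIDINES and b in PYRIMIDINES
        )
        if same_class:
            transitions += 1
        else:
            transversions += 1

    if compared == 0:
        raise ValueError("no comparable positions")

    p = transitions / compared    # proportion of transitions (P)
    q = transversions / compared  # proportion of transversions (Q)
    # d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For two sequences that differ by a single transition over 10 compared positions, P = 0.1 and Q = 0, which gives a distance of about 0.112.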
Congruence between the whole-genome and the T-ag-C data sets was assessed by the partition metric (8) and the quartets method (9) implemented in COMPONENT (version 2.00a; University of Auckland, Auckland, New Zealand). The partition metric indicates the level of disagreement between sets of trees; specifically, the number of clusters found in one or the other tree, but not in both. In contrast, the quartets measure is an indication of the number of quartets (smallest unrooted sets of four taxa) that are clearly resolved and explicitly agree. One hundred unrooted maximum-parsimony trees were generated by nonparametric bootstrap replication for the following data sets: the whole-genomic data set (G); an alternate whole-genomic data set (G+); the T-ag-C data set for the same taxa (TAG); the T-ag-C data set for the same taxa, treating gaps as a fifth character (TAG5th); and two sets of randomly generated trees (RND and RND+). The following sets of 100 trees were compared in a pair-wise manner (100 × 100 = 10,000 total comparisons) by using COMPONENT (version 2.00a) for both the partition metric and quartets: (G) × (G+), (G) × (TAG), (G) × (TAG5th), and (RND) × (RND+). These comparisons provide an indication of the level of congruence between data sets, relative to the internal congruence in the whole genomic data set as opposed to sets of randomly generated trees.
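The partition metric can be illustrated with a short Python sketch. It is not the COMPONENT implementation; trees are written here as nested tuples of taxon names, and the helper names (`leaf_set`, `splits`, `partition_metric`) and the toy taxon labels used with them are hypothetical. Each internal branch of an unrooted tree divides the taxa into two clusters, and the metric counts the divisions present in one tree but not in the other.

```python
def leaf_set(tree):
    """All taxon names in a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial bipartitions (splits) of the tree, treated as unrooted.

    Each split is stored as the side that does NOT contain an arbitrary
    anchor taxon, so the same split always has the same representation
    no matter where the nested tuples happen to be rooted.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if anchor in leaves else leaves
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return leaves

    visit(tree)
    return found


def partition_metric(tree_a, tree_b):
    """Number of splits found in one tree or the other, but not in both."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))
```

With four hypothetical taxa, `(("t1", "t2"), ("t3", "t4"))` and `(("t1", "t3"), ("t2", "t4"))` each contain one internal split that the other lacks, so the metric is 2; comparing a tree with itself gives 0.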
The maximum-parsimony and neighbor-joining methods were implemented as in the whole-genome analysis, except that gaps were treated as fifth characters in the preliminary survey of the genetic variation in all available T-ag-C sequences (14 sequences were acquired by entire genome sequencing, and an additional 16 were obtained by PCR from a variety of sources). In contrast to the regulatory region, gaps in the T-ag-C region consist of small INDELs and are less likely to be phylogenetically problematic; therefore, they were included in the analysis. Treating gaps as missing data resulted in similar, yet less-resolved, tree topologies.
The whole-genome alignment consisted of 5,303 positions, 98.9% of which were invariant. When gaps were excluded from the analysis, 26 positions were parsimony informative and 32 were uninformative. The strict-consensus maximum-parsimony tree was highly consistent (consistency index [Ci], 0.967; rescaled consistency index [Rc], 0.942) and had an overall length of 60 steps. Two clades were consistently resolved, by both parsimony and distance methods (Fig. 1): clade A (containing 776, 776*, and H328) and clade B [containing 777, 777*, N128, GM00637H, Rh911, SVCPC(SVMEN), Baylor, and K661]. Isolates that grouped to clade B were derived from all three source populations (monkeys, contaminated vaccines, and humans). The polymorphisms detected in viruses within clades A and B and their locations are itemized in Table 3. Several isolates were ungrouped (PML-1, MC028846B, and VA45-54).
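The site categories reported above can be sketched in Python as follows. This is an illustrative classification under the standard definition (a site is parsimony informative when at least two character states each occur in at least two sequences), not the PAUP* code; the function names `classify_column` and `summarize` are hypothetical, and gaps or other non-ACGT symbols are ignored as missing data.

```python
from collections import Counter


def classify_column(column):
    """Label one alignment column as invariant, informative, or uninformative."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    if len(counts) <= 1:
        return "invariant"
    # States carried by two or more sequences can support a grouping.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "uninformative"


def summarize(alignment):
    """Count site categories over a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return Counter(classify_column("".join(col)) for col in zip(*alignment))
```

The reported totals are consistent with one another: 26 + 32 = 58 variable positions leave 5,303 - 58 = 5,245 invariant positions, and 5,245/5,303 is approximately 98.9%.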
Analyses of the SV40 C-terminal T-ag variable domain sequences were performed to determine if the T-ag-C sequences are capable of resolving relationships that are congruent with the whole-genome analysis. If the T-ag-C sequence phylogeny is highly congruent with the whole-genome analysis, then rapid and cost-effective strain identification becomes possible, as does a preliminary analysis of a larger number of samples. A partial listing of T-ag-C sequences is shown in Fig. 2. The T-ag-C alignment consisted of 315 positions; when gaps were treated as missing data, 93% of the positions were invariant, with 9 parsimony-informative and 10 uninformative sites (Ci = 0.558; Rc = 0.400) and an overall length of 34 steps (data not shown). When gaps were included as a fifth character, then only 68% of the positions were invariant, with 51 parsimony-informative and 49 uninformative sites (Ci = 0.966; Rc = 0.93) and an overall length of 59 steps.
The major groupings were the same regardless of whether gaps were treated as missing data or as fifth characters, but treating gaps as missing data resulted in very little resolution. The parsimony consensus trees resulting from treating gaps as fifth characters are presented because of higher resolution, higher Ci index, and slightly higher congruence to the whole-genome data as measured by quartet(s) (see “Congruence between data sets” below [Fig. 3]). Both distance and parsimony methods resolved the same groupings as the whole-genome data, although the topology is slightly different for intermediate taxa.
The partition metric is a measure of dissimilarity (the number of clusters in one tree or the other tree but not in both). Penny and Hendy (33) computed exact probabilities for comparing two trees up to 16 taxa by using the partition metric. The probability that two trees with 14 taxa have a partition metric <6 is very small (P < 0.01). The whole-genomic and T-ag-C data sets have few partitions that disagree, in contrast to comparisons between randomly generated trees (Fig. 4A). The comparison between the T-ag-C data set (TAG) and the whole-genome data set (G) showed a level of congruence similar to that observed between two alternate whole-genomic data sets (G × G+). The quartets metric(s) indicates strict agreement between resolved quartets between trees. The whole-genomic and T-ag-C data sets have a large number of quartets that agree (Fig. 4B). Both metrics indicate that the T-ag-C data set is highly congruent with the whole-genome data set. The levels of congruence relative to randomly generated trees and to internal congruence of the whole-genomic data set are also indicated (Fig. 4A and B).
The consensus of maximum-parsimony trees of all available T-ag-C sequences (Ci = 0.718; Rc = 0.587; total length = 142) is indicated in Fig. 5A. The previously identified clusters (clades A and B) were clearly resolved. In addition, a third group (clade C) was resolved by the majority of trees (Fig. 5B). All three clades were resolved by both distance and parsimony methods. Clade C consists of human isolate PML-1 and sequences that were associated with human tumors.
The whole-genome sequences were further analyzed to identify other polymorphic regions in addition to the T-ag-C terminal domain. The entire viral genome was divided into 100-bp intervals and scored for different types of mutations (Fig. 6). “Singletons” occur only once and might include PCR errors or base identification errors, “confirmed” mutations occur in more than one sequence at that nucleotide position, and “unique INDEL” events represent unique gaps. In addition to T-ag-C, other polymorphic regions exist, as indicated in Fig. 6. The use of existing sequencing primers (Table 2) covering positions 671 to 1012, 1664 to 2223, 3682 to 4413, and 4527 to 5056 (SV40-776 numbering), in addition to T-ag-C, would capture 89% of the confirmed known polymorphisms and 63% of the singletons.
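The interval scoring described above can be sketched in Python. This is a simplified illustration, not the procedure used to produce the figure: the function name `score_windows` is hypothetical, differences are counted relative to a single reference sequence, windows follow alignment column numbering, and gap positions are skipped, so "unique INDEL" events are not scored.

```python
from collections import Counter, defaultdict


def score_windows(reference, others, window=100):
    """Tally singleton and confirmed substitutions per fixed-width window.

    Returns a dict keyed by the 1-based start position of each window that
    contains at least one substitution. A position is a "singleton" when
    exactly one sequence differs from the reference there, and "confirmed"
    when more than one sequence differs.
    """
    if any(len(seq) != len(reference) for seq in others):
        raise ValueError("all sequences must match the reference length")

    tallies = defaultdict(Counter)
    for index, ref_base in enumerate(reference):
        if ref_base == "-":
            continue  # gap in the reference: not scored in this sketch
        carriers = sum(1 for seq in others if seq[index] not in ("-", ref_base))
        if carriers == 0:
            continue
        kind = "singleton" if carriers == 1 else "confirmed"
        window_start = (index // window) * window + 1
        tallies[window_start][kind] += 1
    return {start: dict(counts) for start, counts in tallies.items()}
```

Windows with high "confirmed" counts would be the candidates for targeted sequencing, since singletons may reflect PCR or base identification errors.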
This study is the first to examine evolutionary relationships in polyomavirus SV40. Although the overall level of genetic variation in SV40 is minimal, distinct clades were found, and these are strongly supported by whole-genome analysis using neighbor-joining and maximum-parsimony methods. The same groups were resolved when T-ag-C data alone were used, and the two data sets did not appear to be significantly incongruous. The survey of T-ag-C sequences from all available sources indicated that additional genetic variation is likely to exist; the same clades were resolved as with the whole-genome analysis, as well as a group that exclusively contains human isolates (PML-1 and sequences associated with human cancers).
A portion of the regulatory region was excluded from this survey of SV40 genetic variation, because it is unclear whether the large insertions and deletions that are a characteristic of the region should be treated as single or multiple mutation events. The regulatory region contains enhancer and promoter sequences as well as the viral origin of DNA replication (ori). Genetic changes can occur within the SV40 regulatory region during viral growth in vivo (23, 31) and possibly in vitro under certain conditions (32) through mechanisms that are not understood. Genetic changes at the regulatory region typically consist of duplication or deletion (or both) of enhancer and/or promoter sequences. These changes occur relative to the structure of a viral species-specific basic regulatory region termed archetypal or protoarchetypal (3, 25, 35). In contrast, the T-ag-C region of SV40 strains can vary in sequence and in length but does not appear to change in response to growth in vitro or in vivo (18, 23, 24, 31). We and others have previously proposed the identification of SV40 strains based on T-ag-C sequences (18, 23, 24, 31, 40). The analyses described here confirmed the validity of the use of the T-ag-C domain for genotyping SV40 isolates and tissue-associated sequences.
The data suggest that several SV40 genotypes are common to different population sources (monkeys, contaminated vaccines, and humans). A possible explanation for this pattern is that the variation present in the simian population of SV40 was sampled during the manufacture of the poliovaccines and was transferred to the human population. Data from the present study based on known isolates and genomic fragments indicate that monkey and vaccine populations contain A and B clade viruses, whereas the human population contains representatives in the B and C clades. It is of interest that vaccine-derived viruses appear to overlap with the monkey and human populations, supporting the hypothesis that contaminated vaccines played a role in the introduction of SV40 into humans (Fig. 5B). It should be noted that clade B isolates from humans have not diverged from monkey isolates of clade B genotype, showing that adaptation is not essential for viral survival in humans. It is possible that clade C is representative of viruses that have undergone genetic change during adaptation to human hosts. However, since DNA viruses evolve slowly, it is more likely that clade C occurs at some frequency in monkey populations but remains undiscovered. It will be necessary to have a more comprehensive knowledge of SV40 genetic variation in simian hosts to be able to interpret the significance of the relative distribution of strains found in humans. If a particular genotype has a higher frequency in humans than in monkeys, it could be the result of selection, creating a founder effect in the new host environment. It is also possible that some viral sequences detected in human tumors represent dead-end infections unable to be transmitted from human to human. Additional sampling is necessary to determine if any particular genetic variant is found exclusively in human populations.
Prior to this analysis, phylogenetic studies of whole genomes of polyomaviruses had been performed only for JC virus (1-3, 19). The results obtained for SV40 are similar in that genotypes can be distinguished by analyzing a genomic region with sequence variability outside the viral regulatory region. This includes the T-ag-C region for SV40 and the V-T intergenic region between the T-antigen and VP1 coding regions for JC virus.
The terminology used for describing SV40 isolates is currently inconsistent and becomes more difficult with the realization that SV40 clades exist. Based on this study, we recommend the following terminology. The term “strain” should describe particular isolates. The word “genotype” should be utilized when distinguishing among SV40 isolates based on T-ag-C sequences. Viruses with the same genotype can vary at the regulatory region or the viral coat protein coding regions and should be referred to as “variants.” A “clade” (or “genogroup”) is a phylogenetic group of viruses with similar genotypes. In general, viruses within a clade differ very slightly. For example, SVCPC and SV40-777 differ by 3 of 5,180 nucleotides (in the VP2 gene) and are 99.94% similar. The T-ag-C region appears to be useful as a rapid means of classifying genotypes of SV40, as the whole-genome and T-ag-C data sets are highly congruent (disregarding rearrangements in the viral regulatory region).
This analysis, establishing SV40 genetic diversity, raises new questions. It emphasizes the need to learn more about the natural distribution of transmissible SV40 strains in monkeys. The biological significance of these groupings is unclear but warrants further investigation. It is of interest that preliminary findings from hamster studies indicate that isolate SVCPC (clade B) is more tumorigenic in vivo than VA45-54 (ungrouped) (R. A. Vilchez, C. Brayton, C. Wong, P. Zanwar, D. E. Killen, J. L. Jorgensen, and J. S. Butel, unpublished data). It will be important to determine the relative pathogenicity and tissue tropisms of representative viruses of clades A, B, and C. It would be informative, also, to obtain sequence data from additional polymorphic regions in the viral genome (Fig. 6) to increase phylogenetic resolution of known sequences. The ability to incorporate additional full-length genomes from humans, as well as monkeys, would also contribute to higher phylogenetic resolution. Greater numbers of T-ag-C sequences amplified from monkey and human tissues would help determine whether any human-specific viral clades do, in fact, exist. Finally, direct evidence of human-to-human transmission of SV40 would indicate if the sequences associated with human tumors represent transmissible variants or if they are at an evolutionary dead end.
This work was supported in part by a grant from the Center for Biologics Evaluation and Research, Food and Drug Administration, to J.S.B.; by grant CA096951 from the National Cancer Institute to J.S.B.; and by grants from the National Space Biomedical Research Institute (NASA Cooperative Agreement NCC-9-58) to G.E.F., R.C.W., and J.S.B.
We thank X. M. Dai for excellent technical assistance.