This study is the first to examine evolutionary relationships in polyomavirus SV40. Although the overall level of genetic variation in SV40 is minimal, distinct clades were found, and these are strongly supported by whole-genome analysis using neighbor-joining and maximum-parsimony methods. The same groups were resolved when T-ag-C data alone were used, and the two data sets did not appear to be significantly incongruent. The survey of T-ag-C sequences from all available sources indicated that additional genetic variation is likely to exist: it resolved the same clades as the whole-genome analysis, together with an additional group that exclusively contains human isolates (PML-1 and sequences associated with human cancers).
A portion of the regulatory region was excluded from this survey of SV40 genetic variation, because it is unclear whether the large insertions and deletions that are a characteristic of the region should be treated as single or multiple mutation events. The regulatory region contains enhancer and promoter sequences as well as the viral origin of DNA replication (ori). Genetic changes can occur within the SV40 regulatory region during viral growth in vivo (23, 31) and possibly in vitro under certain conditions (32) through mechanisms that are not understood. Genetic changes at the regulatory region typically consist of duplication or deletion (or both) of enhancer and/or promoter sequences. These changes occur relative to the structure of a viral species-specific basic regulatory region termed archetypal or protoarchetypal (3, 25, 35). In contrast, the T-ag-C region of SV40 strains can vary in sequence and in length but does not appear to change in response to growth in vitro or in vivo (18, 23, 24, 31). We and others have previously proposed the identification of SV40 strains based on T-ag-C sequences (18, 23, 24, 31, 40). The analyses described here confirmed the validity of the use of the T-ag-C domain for genotyping SV40 isolates and tissue-associated sequences.
The data suggest that several SV40 genotypes are common to different population sources (monkeys, contaminated vaccines, and humans). A possible explanation for this pattern is that the variation present in the simian population of SV40 was sampled during the manufacture of the poliovaccines and was transferred to the human population. Data from the present study based on known isolates and genomic fragments indicate that monkey and vaccine populations contain A and B clade viruses, whereas the human population contains representatives of the B and C clades. It is of interest that vaccine-derived viruses appear to overlap with the monkey and human populations, supporting the hypothesis that contaminated vaccines played a role in the introduction of SV40 into humans (Fig. ). It should be noted that clade B isolates from humans have not diverged from monkey isolates of clade B genotype, showing that adaptation is not essential for viral survival in humans. It is possible that clade C is representative of viruses that have undergone genetic change during adaptation to human hosts. However, since DNA viruses evolve slowly, it is more likely that clade C occurs at some frequency in monkey populations but remains undiscovered. A more comprehensive knowledge of SV40 genetic variation in simian hosts will be necessary to interpret the significance of the relative distribution of strains found in humans. If a particular genotype has a higher frequency in humans than in monkeys, it could be the result of selection, creating a founder effect in the new host environment. It is also possible that some viral sequences detected in human tumors represent dead-end infections that cannot be transmitted from human to human. Additional sampling is necessary to determine whether any particular genetic variant is found exclusively in human populations.
Prior to this analysis, phylogenetic studies of whole genomes of polyomaviruses had been performed only for JC virus (1-3, 19). The results obtained for SV40 are similar in that genotypes can be distinguished by analyzing a genomic region with sequence variability outside the viral regulatory region. This includes the T-ag-C region for SV40 and the V-T intergenic region between the T-antigen and VP1 coding regions for JC virus.
The terminology used for describing SV40 isolates is currently inconsistent, and the recognition that SV40 clades exist makes consistent usage more difficult. Based on this study, we recommend the following terminology. The term “strain” should describe particular isolates. The word “genotype” should be used when distinguishing among SV40 isolates based on T-ag-C sequences. Viruses with the same genotype can vary at the regulatory region or the viral coat protein coding regions and should be referred to as “variants.” A “clade” (or “genogroup”) is a phylogenetic group of viruses with similar genotypes. In general, viruses within a clade differ very slightly. For example, SVCPC and SV40-777 differ by 3 of 5,180 nucleotides (in the VP2 gene) and are 99.94% similar. The T-ag-C region appears to be useful as a rapid means of classifying genotypes of SV40, as the whole-genome and T-ag-C data sets are highly congruent (disregarding rearrangements in the viral regulatory region).
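The similarity figure quoted above follows from simple arithmetic on the number of differing positions. The following is a minimal sketch of that calculation, not the procedure used in this study: the function names `count_differences` and `percent_identity` are hypothetical, and the sketch assumes two already-aligned sequences of equal length in which every mismatched position counts as one difference (gaps receive no special treatment).

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Case-insensitive comparison, one difference per mismatched column.
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)


def percent_identity(differences: int, aligned_length: int) -> float:
    """Return the percentage of aligned positions that are identical."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    if not 0 <= differences <= aligned_length:
        raise ValueError("differences must lie between 0 and aligned_length")
    return 100.0 * (aligned_length - differences) / aligned_length


if __name__ == "__main__":
    # Values from the text: 3 differences across 5,180 nucleotides.
    print(f"{percent_identity(3, 5180):.2f}%")
```

With the values given in the text (3 differences in 5,180 nucleotides), the result rounds to 99.94%, matching the stated similarity.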
This analysis, establishing SV40 genetic diversity, raises new questions. It emphasizes the need to learn more about the natural distribution of transmissible SV40 strains in monkeys. The biological significance of these groupings is unclear but warrants further investigation. It is of interest that preliminary findings from hamster studies indicate that isolate SVCPC (clade B) is more tumorigenic in vivo than VA45-54 (ungrouped) (R. A. Vilchez, C. Brayton, C. Wong, P. Zanwar, D. E. Killen, J. L. Jorgensen, and J. S. Butel, unpublished data). It will be important to determine the relative pathogenicity and tissue tropisms of representative viruses of clades A, B, and C. It would also be informative to obtain sequence data from additional polymorphic regions in the viral genome (Fig. ) to increase phylogenetic resolution of known sequences. The ability to incorporate additional full-length genomes from humans, as well as monkeys, would also contribute to higher phylogenetic resolution. Greater numbers of T-ag-C sequences amplified from monkey and human tissues would help determine whether any human-specific viral clades do, in fact, exist. Finally, direct evidence of human-to-human transmission of SV40 would indicate whether the sequences associated with human tumors represent transmissible variants or are at an evolutionary dead end.