The hepatitis E virus (HEV) polyproline region (PPR) is an intrinsically disordered region (IDR). This relaxed structure allows IDRs, which are implicated in the regulation of transcription and translation, to bind multiple ligands. Originally the nucleotide variability seen in the HEV PPR was assumed to be due to high rates of insertion and deletion. This study shows that the mutation rate is about the same in the PPR as in the rest of the nonstructural polyprotein. The difference between the PPR and the rest of the polyprotein is due to the higher tolerance of the PPR for substitutions at the first and second codon positions. With this higher promiscuity there is a shift in nucleotide occupation of these codon positions toward cytosine, which leads to more proline, alanine, serine, and threonine being encoded rather than histidine, phenylalanine, tryptophan, and tyrosine. This pattern of amino acid usage is typical of proline-rich IDRs. Increased usage of cytosine also leads to >22% of all amino acids in the PPR being prolines. Alignments of PPR sequences from HEV strains representing all genotypes indicate that all zoonotic isolates share an ancestor, and the carboxyl half of the PPR is more tolerant of mutations than the amino half. The evolution of HEV PPR, in contrast with that of the rest of the nonstructural polyprotein, is molded by pressures that lead toward increased proline usage with a corresponding decrease in the usage of aromatic amino acids, favoring formation of IDR structures.
Hepatitis E virus (HEV) is a single-stranded, positive-sense RNA virus. The genome, which is 5′ capped and has a 3′ poly(A) tail, consists of three overlapping open reading frames (ORFs). The 5′-most ORF (ORF1) encodes the nonstructural proteins, the next 5′-most ORF (ORF3) encodes a phosphoprotein involved in viral regulation, and the 3′-most ORF (ORF2) encodes the viral capsid (1, 11). HEV causes both epidemic and sporadic jaundice (15, 21, 27). It is classified as belonging to four recognized mammalian genotypes (1–4, 21). Genotypes 1 and 2 infect only humans and are transmitted fecal-orally. Genotypes 3 and 4 infect several animals, including humans, swine, boar, deer, and mongooses (19). Besides these four genotypes there are additional mammalian HEV strains that have been isolated from rabbits (33), rats (14), and wild boars (24). The relationship between these more recently characterized strains and the recognized genotypes is still a matter of research and debate. Moreover, nonmammalian HEV strains have been found in chickens (13) and cutthroat trout (4).
The HEV nonstructural genes are most closely related to a group of viruses called the rubi-like viruses because of homology between the nonstructural genes of HEV and those of rubivirus (16). From the amino to the carboxyl terminus of the ORF1 polyprotein, these genes are the viral transferase, the Y domain, a papain-like cysteine protease, a region of unknown function, the polyproline region (PPR), the macro domain (also called the X domain), the helicase, and the RNA-directed RNA polymerase.
The PPR is also called the hypervariable region because it has higher genetic diversity than any other region in the genome (20, 23, 29). Koonin et al. (16) suggested that this region serves as a proline hinge. More recently it was determined that the region is intrinsically disordered and may regulate transcription and translation (23). Intrinsically disordered regions (IDRs) do not have stable tertiary structure (6, 9, 28). They have lower amino acid complexity, with a high proportion of polar and charged amino acids (Ala, Gly, Pro, and Ser), and a low content of bulky hydrophobic amino acids (Ile, Met, Phe, Trp, and Tyr) (7, 10). This disordered structure allows IDRs to assume several configurations, thereby expediting the binding of this region to multiple ligands and facilitating its regulatory role (9, 10).
Because of the hypervariable sequence in the PPR some researchers avoid this region or exclude it from phylogenetic analysis of the ORF1 polyprotein (22), although it does contain phylogenetic information that has been used to genotype HEV strains (2, 3, 5, 18). The discovery of insertions and deletions in HEV genotype 3 PPR led to the assumption that the evolution of the PPR was too complex to model because of the difficulty of reconstructing its indel history (23). This assumption is questioned by data from the current study.
Sequences from HEV genotypes 1, 3, and 4, avian HEV and rubivirus were examined (see Table S1 in the supplemental material). ORF1 sequences were split into two regions. The first region was the PPR. The second was the rest of the ORF1 polyprotein without the PPR (here, the nonpolyproline region [nPPR]). In genotypes 1, 3, and 4, the PPR was located using the conserved sequences that flank it (23). The flanking sequences for avian HEV were those obtained from the NCBI alignment for CDD:152960 (http://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?ascbin=8&maxaln=10&seltype=2&uid=152960). In rubivirus, the PPR was estimated to be situated between amino acid 702 and amino acid 813 from a plot of the Shannon entropy for the nonstructural genes using the longest continuous region for which the entropy value was >0.1 (23).
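The entropy-based localization described for rubivirus can be sketched as follows. This is a minimal illustration, not the author's procedure: the function names, the toy alignment, the gap handling, and the use of the natural logarithm are all assumptions made for the example.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (natural log, an assumption) of one alignment column; gaps are ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def longest_run_above(values, threshold):
    """Return (start, end), 0-based inclusive, of the longest contiguous run with
    value > threshold, or None if no value exceeds the threshold."""
    best = None
    start = None
    # A trailing sentinel closes a run that reaches the end of the list.
    for i, v in enumerate(list(values) + [float("-inf")]):
        if v > threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best

# Hypothetical toy protein alignment (not real HEV or rubivirus data).
toy_alignment = ["MKPAPLT", "MKPSPLT", "MKATPLT", "MKPAALT"]
entropies = [column_entropy(col) for col in zip(*toy_alignment)]
variable_region = longest_run_above(entropies, 0.1)
```

In the toy alignment only the three middle columns vary, so the run returned covers those columns. With real data the coordinates would need to be mapped back from alignment columns to residue numbers.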
Sequences were aligned in Clustal X (version 2.1) (17) and adjusted manually to optimize the alignment of purines and pyrimidines.
Sequences were segregated by codon position, and nucleotide counts were done using a Perl script. The number of segregating sites, nucleotide diversity, transition/transversion bias, and codon usage were calculated in Mega5 (version 5.05) (25). Shannon entropy was calculated by codon position in BioEdit (12).
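The original counts were made with a Perl script that is not reproduced here. As a rough sketch of the same idea in Python, the following splits in-frame sequences by codon position and pools nucleotide fractions per position. The function names and the toy sequences are hypothetical.

```python
from collections import Counter

def split_by_codon_position(seq):
    """Return the first-, second-, and third-position nucleotides of an in-frame sequence.
    Trailing nucleotides that do not complete a codon are dropped."""
    usable = len(seq) - len(seq) % 3
    seq = seq[:usable]
    return seq[0::3], seq[1::3], seq[2::3]

def nucleotide_fractions(sequences):
    """Pooled nucleotide fractions at each codon position across in-frame sequences.
    Gaps and ambiguity codes are skipped."""
    counts = [Counter(), Counter(), Counter()]
    for seq in sequences:
        for pos, nts in enumerate(split_by_codon_position(seq.upper())):
            counts[pos].update(nt for nt in nts if nt in "ACGT")
    fractions = []
    for c in counts:
        total = sum(c.values())
        fractions.append({nt: (c[nt] / total if total else 0.0) for nt in "ACGT"})
    return fractions

# Hypothetical toy sequences (Pro-Ala-Pro in two synonymous spellings).
toy = ["CCTGCACCG", "CCCGCTCCA"]
fractions = nucleotide_fractions(toy)
```

Here `fractions[1]["C"]` is the fraction of C at the second codon position, the quantity that the comparison between the PPR and the nPPR turns on.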
Bayesian estimates of the mean substitution rate and the relative substitution rates at each codon position in the PPR and the nPPR were calculated using BEAST (version 1.6.1) (8). A general time-reversible (GTR) substitution model was used with estimated base frequencies. A site-heterogeneity model with invariant sites and four gamma categories was used with codon positions segregated into three partitions by position. Substitution-rate parameters, the rate-heterogeneity model, and base frequencies were unlinked across codon positions. A relaxed, uncorrelated, lognormal, molecular clock was used. A constant-size tree prior was used, with the initial tree generated by unweighted-pair group method using average linkages (UPGMA). This series of analyses was conducted on HEV genotype 1, 3, and 4 sequences and rubiviruses because of the number of sequences available but not on avian HEV because of the limited number of sequences available. Because of the indels observed in subgenotypes 3a, 3e, and 3f (23), the insertions in 3a sequences were removed from an alignment of genotype 3 sequences. Additionally, because of the 27-amino-acid insertion seen in some 3f sequences all other sequences were aligned against the 3f repeat closer to the carboxyl terminus of the nPPR, and the 27 amino acids closer to the amino terminus were deleted from all 3f sequences containing the insertion.
An examination of a variety of codon properties shows that the expected levels of substitution by codon position are maintained because of codon degeneracy (30) in the nPPR (Table 1). The second position is the most conserved position in the codons followed by the first position, and the third position is the least conserved (Table 1, conserved). This pattern is also reflected by the number of segregating sites per codon position, S. The lowest conservation at the third codon position is seen in genotypes 3 and 4, which may be due to the wider host range seen in these genotypes compared with the other viruses in Table 1 (19). Nucleotide diversity also reflects codon degeneracy, with more nucleotide divergence seen in the third codon position, and, as with position conservation and S, genotypes 3 and 4 exhibit the highest nucleotide divergence at the third codon position. The values for the first and second codon positions are more similar among all the viruses examined. The data for the third codon position in avian HEV suggest that it is intermediate between genotypes 3 and 4 and between genotype 1 and rubivirus, suggesting further that avian HEV has a wider host range than that seen in genotype 1 and rubivirus but not as wide as that seen in genotypes 3 and 4.
Lower nucleotide conservation of the nPPR with higher numbers of segregating sites, and increased nucleotide diversity and entropy, implies a higher tolerance for nucleotide substitutions and thus a higher rate of substitution. A review of the corresponding data for the PPR shows that the first and second codon positions are more tolerant of substitutions than in the nPPR, but the bias toward higher substitution rate at the third codon position compared with the first and second codon positions is still maintained, although not at the levels seen in the nPPR (Table 1). Further, the lower levels of substitution seen at the second codon position versus the first in the nPPR are not as pronounced in the PPR. This leveling of values among all three codon positions in the PPR is observed across all the variables analyzed. The higher nucleotide diversity seen in the nPPR in genotypes 3 and 4 is also seen in the PPR but is not as pronounced.
The relative substitution rates seen in genotype 4 and rubivirus confirm this tolerance for substitutions (Table 1). The relative rates of substitution in the nPPR for genotype 4 and rubivirus, respectively, are 61 and 25 times higher at the third codon position than at the second, and the relative substitution rates are 5.7 and 2.7 times higher at the first codon position than at the second (Table 1, μ). However, in the PPR for genotype 4 and rubivirus, the relative substitution rates are twice as high at the third codon position as at the second, and the rates at the first and second codon positions are about equal. The relative substitution rates for genotype 1 are similar to those for rubivirus, and the genotype 3 rates are similar to those for genotype 4 (Table 1). The mean substitution rates as calculated in BEAST using a relaxed, uncorrelated lognormal clock for the nPPR are 1.6 × 10⁻³ and 5.7 × 10⁻⁴ for genotype 4 and rubivirus, respectively, and those for the PPR are 3.7 × 10⁻³ and 1.1 × 10⁻³ for genotype 4 and rubivirus, respectively. These results indicate that the overall substitution rate is about twice as high in the PPR as in the nPPR. However, if the relative substitution rate by codon position is taken into account, the estimated substitution rate for the third codon position is about the same in both regions of the ORF1 polyprotein (1.4 × 10⁻³ versus 1.8 × 10⁻³ for genotype 4 and 5.0 × 10⁻⁴ versus 5.4 × 10⁻⁴ for rubivirus). These data suggest that the difference between the nPPR and the PPR is not due to a difference in the rate of mutation but to higher promiscuity at the first and second codon positions in the PPR.
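The third-position figures quoted above can be approximately reproduced under one assumption that the text does not spell out: the mean rate is apportioned among the three codon positions in proportion to their relative rates. The sketch below applies that weighting to the values reported in the text. The 1:1:2 ratio used for the PPR is a rounding of "about equal" and "twice as high", and the function name is hypothetical.

```python
def position_rate(mean_rate, relative_rates, position):
    """Share of the mean rate attributed to one codon position, assuming the mean is
    apportioned in proportion to the relative rates (an assumption, see the lead-in)."""
    return mean_rate * relative_rates[position] / sum(relative_rates.values())

# Mean rates and relative rates (second position = 1) as reported in the text.
# The PPR ratios are rounded to 1:1:2 for illustration.
reported = {
    ("genotype 4", "nPPR"): (1.6e-3, {1: 5.7, 2: 1.0, 3: 61.0}),
    ("genotype 4", "PPR"): (3.7e-3, {1: 1.0, 2: 1.0, 3: 2.0}),
    ("rubivirus", "nPPR"): (5.7e-4, {1: 2.7, 2: 1.0, 3: 25.0}),
    ("rubivirus", "PPR"): (1.1e-3, {1: 1.0, 2: 1.0, 3: 2.0}),
}

third_position = {
    key: position_rate(mean, rel, 3) for key, (mean, rel) in reported.items()
}

for (virus, region), rate in third_position.items():
    print(f"{virus:10s} {region:5s} third-position rate ~ {rate:.2e}")
```

Under this weighting the third-position rates for the nPPR and the PPR land close to each other within each virus, even though the mean rates differ about twofold. That is the comparison the paragraph draws.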
Codon usage also differs between the nPPR and the PPR (Table 2). The most frequently used codons in the PPR are used at higher rates than in the nPPR. This is probably due to lower amino acid complexity in the PPR (23). The difference is also seen with those codons used the least. The PPR has more codons that are not used than does the nPPR, and the frequency of occurrence is lower for the least-used codons in the PPR (Table 2). Another difference is the higher usage of codons with C at the second codon position in the PPR. That leads to a higher content of Pro, Ala, Ser, and Thr. The bias toward Pro is further increased by the preference for codons with C in the first codon position, while the nPPR shows a preference for G in this position. The preference for C in the PPR is so high that the most highly used codon in avian HEV and in genotypes 3 and 4 is a Pro codon, and >22% of the PPR codons in all the viruses examined encode Pro. Among the least used codons in the PPR are codons encoding His, Phe, Trp, and Tyr (Table 2). This pattern of codon usage is what would be expected for intrinsically disordered proline-rich regions (9, 10).
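The link between C at the second codon position and Pro, Ala, Ser, and Thr follows directly from the standard genetic code. The short sketch below builds the standard codon table and lists the amino acids reachable with a given nucleotide at a given codon position. The helper name is hypothetical, and DNA letters (T rather than U) are used for convenience.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def amino_acids_with(base, position):
    """Amino acids encoded by codons carrying `base` at codon position 1, 2, or 3."""
    return {
        aa
        for codon, aa in CODON_TABLE.items()
        if codon[position - 1] == base and aa != "*"
    }

second_c = amino_acids_with("C", 2)  # Ser, Pro, Thr, Ala
proline_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "P")
aromatic_and_his = set("HFWY")  # His, Phe, Trp, Tyr
```

Every Pro codon has C at both the first and the second position, which is why a preference for C at both positions pushes usage toward Pro in particular. None of the codons for His, Phe, Trp, or Tyr has C at the second position.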
The distribution of nucleotides by codon position in the nPPR and PPR shows that specific changes lead to the shift in codon usage. Table 3 shows that there is a significant GC bias at codon positions 1 and 3 (P < 0.07) but not at position 2 (P > 0.5) of the nPPR. However, in the PPR the GC bias is seen in positions 1 and 2 (P < 0.001) but not at position 3 (P > 0.2). The nucleotide preference by codon position in the nPPR is for G at position 1, C at position 2, and a pyrimidine at position 3. In the PPR at position 1 this preference is for G in rubivirus and genotypes 1 and 4 but for C in genotype 3 and avian HEV; at position 2 for these viruses, the preference is for C, and at position 3, the preference is for a pyrimidine except for rubivirus, where it is for C. The greatest nucleotide bias is seen in the second codon position of the PPR, where C is preferred at significantly higher levels than any of the other three bases (P < 0.0001). A comparison of the second codon position between the PPR and the nPPR shows that although C is the preferred nucleotide, the nucleotide fraction of C is almost twice as large in the PPR, the other three nucleotides exhibiting decreases of 18% to 69% except for G in avian HEV, which shows almost no change. These differences indicate that although there is not much change in nucleotide preferences at codon position 3 in the PPR, there is an increase in the fractional content of C at positions 1 and 2, with the greatest shift in nucleotide preference occupying position 2 thus leading to a preference for Pro in the PPR.
The estimated transition/transversion bias for these viruses ranges from 6.2 to 11 in the nPPR and from 3.2 to 17 in the PPR. This bias suggests that transitional mutations are much more favored among these viruses than are transversions. One explanation is that transitions are less likely to result in the generation of stop codons (31), and transversions result in more diverse amino acid substitutions and significantly different chemical composition (31, 32). Given the high transition/transversion bias it might be possible to discern evolutionary patterns in the PPRs of these viruses.
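The statement that transitions are less likely than transversions to create stop codons can be checked against the standard genetic code by enumerating every single-nucleotide change from a sense codon. The sketch below does only that. It ignores codon usage and mutational bias, so it illustrates the property of the code itself rather than the analysis in references 31 and 32. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes."""
    return a != b and ((a in PURINES) == (b in PURINES))

def stop_creating_counts():
    """Count all single-nucleotide changes from sense codons by class,
    and how many of each class produce a stop codon."""
    totals = {"transition": 0, "transversion": 0}
    stops = {"transition": 0, "transversion": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos, new in product(range(3), BASES):
            old = codon[pos]
            if new == old:
                continue
            kind = "transition" if is_transition(old, new) else "transversion"
            totals[kind] += 1
            mutant = codon[:pos] + new + codon[pos + 1:]
            if CODON_TABLE[mutant] == "*":
                stops[kind] += 1
    return totals, stops

totals, stops = stop_creating_counts()
stop_fraction = {kind: stops[kind] / totals[kind] for kind in totals}
```

Under the standard code, 5 of the 183 possible transitions from sense codons create a stop codon, compared with 18 of the 366 possible transversions, so the per-change risk is lower for transitions.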
Examining the amino half of the PPR of zoonotic HEVs shows that there is homology among them, suggestive of descent from a common ancestor. As expected from phylogenetic trees, genotype 4 and the Japanese wild boar sequences exhibit more similarity, and genotype 3 and the Chinese rabbit sequences are more similar to each other at the amino end of the PPR (Fig. 1). In the carboxyl end of the PPR, this clustering is still evident; however, the similarity seen among all of these sequences at the amino end of the PPR is less evident at the carboxyl end, suggesting that the amino half of the PPR may not tolerate mutations as well as the carboxyl half. Further examination of the genotype 3 sequences and the Chinese rabbit sequences shows similarities and differences between sequences. Like the 3a, 3e, and 3f sequences, the rabbit sequences have a deletion in the carboxyl half of the PPR although not where insertions and deletions in 3a, 3e, and 3f occur (Fig. 2). The PPR sequence alone is not enough to determine whether or not the rabbit sequences belong to a separate genotype from genotype 3. An examination of genotypes 1 and 2 shows that their amino ends are not similar to those from the zoonotic HEVs (Fig. 3A), and unlike the situation in the zoonotic sequences, the amino ends of the PPR in genotypes 1 and 2 are not similar enough to suggest descent from an ancestor common to the two of them or to the zoonotic HEVs, due perhaps to the low numbers of sequences (there being only one sequence from genotype 2). Some similarity is seen when out-of-frame shifts are allowed, implying the existence of an anthroponotic ancestor. Like the zoonotic HEVs, genotypes 1 and 2 are less similar at the carboxyl end of the PPR than the amino end (Fig. 3B), suggesting that the carboxyl end of the PPR is more susceptible to substitution.
Because of the indels seen in the HEV PPR genotype 3 (Fig. 2), it was assumed that much of the hypervariability seen in the PPR is due to insertions and deletions (23). The current study shows instead that much of the variability seen in the PPR is due to higher rates of nucleotide substitution at the first and second codon positions in the PPR.
Although the PPR is hypervariable, this hypervariability is not due to a higher underlying mutation rate in the PPR compared to the nPPR. The same mutation rate appears to be operational in both regions, as indicated by the similar third-codon-position rates (Table 1). The difference is that fewer mutations in the first and second codon positions are lethal in the PPR. Most likely this higher promiscuity, seen at the first and second codon positions in the PPR, is due to its intrinsically disordered structure. The lack of a well-defined tertiary protein structure means that substitutions in the first and second codon positions, which are more likely to result in nonsynonymous amino acid switches, are allowed more often than in the nPPR, where a tertiary structure must be maintained constitutively for proper function. However, the PPR does have constraints, as suggested by the higher usage of structure-breaking Pro codons (6, 7). The bias toward transitional substitutions may be because these substitutions are less prone than transversional substitutions to generate stop codons and because transversions lead to more diverse amino acid substitutions and significantly different chemical composition in the resultant peptide (31, 32).
Codon usage in the nPPR and the PPR shows there is a shift toward using C at the first and second codon positions in the PPR (Table 2). This is due to a shift away from using A and T at these positions and a reduction in the use of G at the second codon position (Table 3). Although the usage of G at the first codon position does not change much, the usage of C increases with the decreases in A and T (Table 3). The shift at codon position 2 is even more dramatic: from about equal usage of all nucleotides at the second codon position in the nPPR to C occurring at >50% of the second codon positions in the PPR (Table 3). This in turn results in a shift toward high usages of Pro, Ala, Ser, and Thr in the PPR, so marked that the most frequently used codon in genotypes 3 and 4 and avian HEV is Pro (Table 2). Even in genotype 1 and rubivirus, >22% of all codons in this region are Pro codons (Table 2). The decrease in A/T usage leads to a decrease in His, Phe, Trp, and Tyr. These are the patterns of amino acid usage typical of IDRs (7, 28). The decrease in A at the first codon position and A and G at the second codon position of the PPR means that transversional substitutions have occurred; these transversions appear to be more common among genotypes and subgenotypes (Fig. 1, 2, and 3).
Although the first and second codon positions are more promiscuous in the PPR than the nPPR, alignments of zoonotic HEVs suggest that this promiscuity is greater in the carboxyl half of the PPR than in the amino half (Fig. 1). The carboxyl half of the PPR is also where most of the recognized indel activity occurs in the PPR (Fig. 2). This difference suggests that the carboxyl half of the PPR is more mutable than the amino end, and the carboxyl half of the PPR may be more involved in binding multiple ligands (23).
Evolution is more easily traced in the nPPR because of the tertiary structural constraints required by the nonstructural genes for them to function properly. In contrast, because of the higher promiscuity toward substitutions and the lack of intrinsic structure or active-site amino acids, it is much more difficult to trace evolution in the PPR alone. However, an alignment of zoonotic HEVs shows that there is a similarity in purine/pyrimidine (transitional substitution) banding in the amino half of the PPR, suggesting that these isolates share an ancestor (Fig. 1). This commonality is not seen in the carboxyl half of these PPR sequences, perhaps due to higher mutability in that domain. An alignment of the PPR for the anthroponotic genotypes 1 and 2 does not exhibit an easily recognized similarity of purine/pyrimidine banding, perhaps because only one example of the genotype 2 PPR sequence exists; nonetheless, out-of-frame shifting of the alignment implies a common ancestor (Fig. 3).
The similarity of sequence (Fig. 3) and lower nucleotide diversity (Table 1) seen in genotype 1 suggest that less substitution occurs in genotype 1 than in genotypes 3 or 4. This could be because the zoonotic HEVs have a wider host range, and higher nucleotide diversity is required for adaptation of these strains to their hosts. Another explanation is that modern genotype 1 is actually composed of a subset of subgenotypes from a genotype 1 ancestor. Paleoepidemiological research indicates that epidemic HEV was more common in Australia, North America, and Europe in the 18th and 19th centuries than today (26). An analysis of the evolution of HEV suggests further that genotype 1 went through an evolutionary bottleneck about 80 to 90 years ago (22). Improvements in sanitation in developed countries from the early 20th century could have forced genotype 1 through an evolutionary bottleneck that led to the extinction of genotype 1 in Australia, North America, and Europe, with the only surviving subtypes of genotype 1 being found in developing countries. More isolates of genotypes 1 and 2 are needed to better define the evolution of these genotypes and of the PPR in mammalian HEVs.
The hypervariability seen in the HEV PPR appears to be due to increased rates of substitution in the PPR compared to the nPPR, but the impetus for this hypervariability is increased promiscuity toward substitution at the first and second codon positions in the PPR. In conjunction with this promiscuity is a shift in nucleotide usage toward increased usage of C such that Pro codons are among the most favored in the PPR, and the decreased usage of A and T results in decreased use of His, Phe, Trp, and Tyr codons. This shift leads to a region with a high number of structure-breaking Pro residues and few aromatic residues, thereby accounting for the proline richness seen in IDRs.
I thank Chong-Gee Teo for discussions and review of this paper, and I acknowledge the helpful suggestions of reviewers from CDC and the journal.
The findings and conclusions in this article are those of the author and do not necessarily represent the views of the Centers for Disease Control and Prevention/Agency for Toxic Substances and Disease Registry.
Published ahead of print 18 July 2012
Supplemental material for this article may be found at http://jvi.asm.org/.