|Home | About | Journals | Submit | Contact Us | Français|
Background.Infection with hepatitis C virus (HCV) is a burgeoning worldwide public health problem, with 170 million infected individuals and an estimated 20 million deaths in the coming decades. While 6 main genotypes generally distinguish the global geographic diversity of HCV, a multitude of closely related subtypes within these genotypes are poorly defined and may influence clinical outcome and treatment options. Unfortunately, the paucity of genetic data from many of these subtypes makes time-consuming primer walking the limiting step for sequencing understudied subtypes.
Methods.Here we combined long-range polymerase chain reaction amplification with pyrosequencing for a rapid approach to generate the complete viral coding region of 31 samples representing poorly defined HCV subtypes.
Results.Phylogenetic classification based on full genome sequences validated previously identified HCV subtypes, identified a recombinant sequence, and identified a new distinct subtype of genotype 4. Unlike conventional sequencing methods, use of deep sequencing also facilitated characterization of minor drug resistance variants within these uncommon or, in some cases, previously uncharacterized HCV subtypes.
Conclusions.These data aid in the classification of uncommon HCV subtypes while also providing a high-resolution view of viral diversity within infected patients, which may be relevant to the development of therapeutic regimens to minimize drug resistance.
Hepatitis C virus (HCV) is an enveloped, positive-strand RNA virus belonging to the Hepacivirus genus in the Flaviviridae family. Seven confirmed genotypes (1–7) are generally distinguished by phylogenetic methods and pair-wise distance calculations [1, 2]. On the basis of full genome nucleotide sequences, HCV genotypes diverged from each other by a pair-wise distance of >30%. Individual genotypes can be further divided into more closely related subtypes that diverged by a pair-wise distance of 15%–30%. All viral genotypes retain their repertoire of colinear structural and nonstructural genes, thereby facilitating preliminary genotype classification on the basis of partial genome sequences of short fragments (approximately 300–400 nucleotides) in the structural core/E1 region and the nonstructural NS5B region . However, full genome sequences remain indispensible for the detection of genome recombination events.
While only recombinant HCV genotype 2 k/1b viruses have been found to actively circulate in the population so far [4–7], recombinants of genotypes 2/5, 2b/1b, 2b/1a, and 2i/6p have been detected in single isolates from humans [8–11]. Furthermore, the frequency of HCV intergenotype and intragenotype recombination may be underestimated because of the lack of robust detection methods [12, 13]. Although genotyping based on core/E1 or NS5B sequences has resulted in the provisional classification of a large number of subtype variants within each genotype, at least 1 but preferably ≥2 full genome sequences are required to confirm subtype designation [2, 12].
Accurate genotype and subtype classification is clinically important because major genotypes differ considerably in their response rates to treatment with pegylated interferon and ribavirin and with directly acting antiviral drugs (DAAs) that are designed mainly against genotype 1 isolates. The limited efficacy of DAAs against other genotypes has been shown for genotype 3 , and treatment response rates for drug regimens containing boceprevir (an NS3 protease inhibitor) or BMS-790052 (an NS5A inhibitor) were found to vary even between subtypes 1a and 1b [15, 16]. Different baseline frequencies of drug resistance mutations may account for these differences , and such variation might become more significant if interferon-free regimens containing 1 or more DAAs could indeed displace the current standard of care . Until clinically tolerable DAAs with high barriers to resistance against the complete spectrum of HCV genotypes become available [18, 19], accurate subtype determination and, possibly, drug resistance profiling of the individual's viral population may guide the optimal choice of drugs. Full genome deep sequencing, as performed in this study, allows the identification of drug resistance mutations across the genome that may exist as dominant or minor variants in treatment-naive patients, thereby informing the design of therapeutic regimens.
Plasma samples obtained from HCV-infected patients in Canada and the United Kingdom were received with no accompanying patient-associated demographic or clinical information. This study was approved by the Massachusetts General Hospital Review Board and was granted exemption by the Massachusetts Institute of Technology and Massachusetts General Hospital Review Boards as discarded diagnostic samples. Preliminary HCV genotypes were assigned by the clinical centers, and all samples were thought to represent poorly characterized variants of genotypes 2–4.
RNA isolation, complementary DNA generation, and PCR amplification within the 5′ untranslated region (UTR) and 3 small sequence islands, followed by amplification of 2 hemi-genome fragments that covered the entire protein coding region, were performed as previously described , using primers described in Supplementary Table 1. The 5′ UTR and small sequence islands underwent conventional Sanger sequencing. Pooled hemi-genome PCR products were sequenced by Roche/454, and a de novo assembly was generated as previously described . Any ambiguities in 454 assemblies were resolved using Sanger sequencing. All full genome sequences used in this study have been submitted to Genbank (accession numbers JX227949–JX227979).
All unique, nonrecombinant, full-length HCV genomes found in the Los Alamos Hepatitis C Sequence Database (http://hcv.lanl.gov), along with sequences generated in this study, were used for analyses. For core/E1-NS5B analyses, the Los Alamos National Laboratory (LANL) HCV database was searched for all isolates with available sequences for both core/E1 and NS5B (nucleotide positions 869–1292 and 8276–8615, according to genotype 1a reference strain H77). The intersection of isolates containing both of these sequences was examined on a genotype and subtype level.
Multiple alignments were performed in MUSCLE v3.8.31 , using default parameters with 5 iterative refinements. A total of 193 full-length genomes (excluding recombinants) were trimmed to the coding sequence producing a master alignment of 9323 bases (including gaps). Any gaps in the triplet of nucleotides underlying codons were removed from the entire alignment in a greedy fashion, thereby conservatively eliminating any alignment artifacts. For the genotyping fragment trees that included 300 viral strains for which both regions of core/E1 (869–1292) and NS5B (8276–8615) were available, these regions were concatenated into an alignment of 779 nucleotides (including gaps). No gap stripping was performed on the genotyping fragment sequences.
The recombination detection program (RDP)  was run on all full-length genome sequences representing all available genotypes and subtypes. Any sequences that produced consistently low P values among the RDP's multiple tests for recombination were subject to further analysis. These potential recombinant sequences were further validated with SimPlot v3.5.1 , in which all possible combinations of major and minor parent strains were run against the recombinant strain.
The randomized axelerated maximum likelihood program  was run with 1000 bootstraps to construct a single tree for the 193 genomes mentioned above, using the cleaned alignment containing 3753 bases. The smaller genotyping trees of 300 concatenated fragments spanning 779 nucleotides were used in the same manner. The GTRGAMMA model was chosen as the model of nucleotide substitution. The single best tree from the 1000 bootstraps is shown, along with the total percentage (0%–100%) of bootstrap support for each branch.
Pair-wise distances for all 193 full-length nonrecombinant genomes were calculated in MEGA v.5.05 .
The PhyloPlace tool was used to compute the branching index over the length of our 2 putative subtype 4v genomes (http://hcv.lanl.gov/content/sequence/phyloplace/PhyloPlace.html).
Prior to variant calling, correction of known error modes in Roche/454 read data was performed with the RC454 program . Reliable and sensitive detection of variants in a mixed population and analysis of intrahost diversity was then accomplished using the VPhaser tool .
To better characterize global HCV genotype diversity, sequences representing the entire HCV protein-coding region and portions of the 5′ and 3′ UTR were generated from plasma samples acquired from 54 patients in Canada or the United Kingdom. Preliminary HCV genotyping based on core/E1 or NS5B sequences indicated that these individuals were infected with poorly characterized subtypes of genotypes 2, 3, or 4. Use of a hemi-genome amplification approach , coupled with high-throughput pyrosequencing technology, enabled generation of sequence reads covering the entire coding region without the need for traditional primer-walking methods. The success rate for amplification of both hemi-genomes from the 54 samples attempted was 83%, with 31 samples being used for sequencing. A recently developed algorithm allowed de novo assembly of all 31 genomes, despite the lack of full-length reference sequences for these poorly characterized genotypes . Genome sequences were aligned to all available full-length sequences from the National Center for Biotechnology Information and LANL public databases, including select representatives of the commonly found 1a or 1b subtypes.
Phylogenetic analysis revealed that several of the new sequences clustered with subtypes for which only 1 full-length genome sequence had been reported (2c, 2k, 3i, 4g, 4l, 4m, 4n, 4o, and 4r), thereby strengthening these phylogenetic groupings (Figure (Figure1),1), with pair-wise distance measurements confirming their distinction from other subtypes (Table (Table1).1). Sequences provisionally assigned to the 2 m and 3 g subtypes on the basis of partial sequence grouped accordingly, formally establishing these as distinct HCV subtypes (Figures (Figures11 and and2).2). Likewise, quantification of relatedness to all known subtypes by branching index analysis (Supplementary Figure 1), coupled with 100% bootstrap support by phylogeny (Figure (Figure1)1) and pair-wise distance measurements (Table (Table1),1), clearly grouped 2 newly described genotype 4 sequences, G1248/PB66742 and G1249/PB65568, to near full-length genomes from the recently identified 4v provisional subtype , thereby confirming 4v as a distinct subtype of genotype 4.
In addition to validating previously identified HCV genetic subtypes, phylogenetic analysis also identified a novel genotype 4 variant. While isolate G1253/PB53414 grouped with genotype 4 sequences in the full-genome phylogenetic tree (Figure (Figure1),1), bootstrap resampling strongly supported G1253/PB53414 as a unique subtype as compared to genotypes 4g and 4k (Figure (Figure1).1). Similarly, G1253/PB53414 was genetically distinct from subtypes 4h, 4g, and 4k in the core/E1-NS5B tree (Figure (Figure2).2). In addition, nucleotide sequence divergence of G1253/PB53414 from all other genotype 4 full genome sequences was >17.7% (Table (Table1).1). These lines of evidence strongly suggest that G1253/PB53414 represents a novel subtype of a genotype 4. This new genotype 4 variant could be designated as subtype 4w, but since only 1 example of this subtype has been sequenced, accepted convention for assigning provisional subtypes calls for designation of G1253/PB53414 as an unclassified genotype 4 variant.
One recombinant sequence was also identified. SimPlot analysis showed G1241/PB41880 to be a recombinant between subtypes 2k (K1-S2/AB031663) and 1b (Vat-96/D50485), sharing a recombination break point with 2k/1b recombinants previously identified in St. Petersburg, CRF_01_1b2k (Figure (Figure3).3). Despite high sequence identity (93%–94%), G1241/PB41880 is not identical to previously reported 2k/1b recombinants. Nevertheless, separate phylogenetic analysis of 2k and 1b components of the new strain identified here along with other nonrecombinant 2k and 1b sequences confirm that there is 100% bootstrap support for G1241/PB41880 grouping with all other reported 2k/1b strains (Figure (Figure44).
In the case of subtypes 2 m and 3 g, for which no full genome sequences were available in public databases, classification of these new sequences on the basis of genotyping regions alone was clearly supported (Figure (Figure2).2). As expected, in instances where both full genome and partial genome data were available, bootstrap support for subtype classification was stronger in full-genome trees as compared to partial genome trees (Figures (Figures11 and and22).
Alignments of these full genome sequences allowed us to investigate the prevalence of dominant amino acid substitutions among rare HCV subtypes and identify those known to be associated with resistance to NS3 protease and NS5B polymerase inhibitors (Table (Table2).2). In NS3, the V36L substitution that confers resistance to telaprevir  was found to be dominant in 100% of genotype 2–5 sequences analyzed, the single genotype 7 sequence analyzed, and 20% of genotype 6 sequences analyzed, while the A39V substitution responsible for resistance to ACH-806 , was dominant in 100% of genotype 2 sequences analyzed. Moreover, all genotype 3 sequences and the 1 available genotype 7 sequence analyzed contained a D to Q substitution at amino acid 168, a site at which other mutations conferring up to 10 000-fold resistance to TMC-435 have been described in HCV genotype 1 . Additionally, a D168E substitution known to confer 40-fold resistance to TMC-435 was found in 1 of 2 genotype 5 sequences and in 2 of 62 genotype 6 sequences . In rare instances, we also identified genomes with dominant drug resistance mutations at positions C16, T54, A156, or V170 (Table (Table22).
In NS5B, the M414L substitution conferring resistance to the polymerase inhibitor A-782759 was found in 53% of genotype 4 sequences spanning multiple subtypes. The S282T mutation associated with resistance to the nucleoside analog mericitabine and the nucleotide analog PSI-7977  was found to be dominant in only one of the 191 sequences analyzed, a previously described genotype 4a isolate (ED43) from Egypt . A small number of genomes harboring dominant drug resistance mutations at positions N411, M423, P495, and G554 were also identified (Table (Table22).
Combinations of dominant resistance mutations in NS3 and/or NS5B were found in genotype 2 (NS3:V36L and A39V), genotype 3 (NS3:V36L, and D168Q), and genotype 4 (NS3:V36L and M414L) sequences, suggesting possible resistance to multiple antiviral drugs (telaprevir/ACH806, telaprevir/TCM-435, and telaprevir/A-782759, respectively). Further examination of the highly polymorphic residue 414 of NS5B in genotype 4 sequences revealed a subtype-specific pattern whereby all subtype 4g, 4k-4r, and 4v genomes and 3 of 4 subtype b sequences had the M414L mutation reported to confer resistance to A-782759, while the remaining subtypes (a, c, d, f, and t) harbored multiple variants of unknown phenotype (M414I, M414Q, or M414V; Table Table22 and Supplementary Table 2).
In addition to known resistance changes, other dominant substitutions of unknown resistance status were also common at key sites in NS3 and NS5B (Table (Table2).2). The dominant residue at amino acid 16 of NS3 was cysteine, in genotype 1, but was alanine, in genotypes 2 (74.4%) and 5 (100%), and threonine, in genotypes 4 and 6 (both 100%). The dominant NS3 V170 residue found in genotype 1 was an isoleucine, in 95% of genotype 2 and 71% of genotype 3 sequences. Isoleucine was also found at this position in genotypes 4–6, although at lower frequency (3%–50%). All genotype 2 and 3 sequences had a valine at residue 489 in NS3, rather than the serine found in genotype 1. Finally, the M414 in NS5B was replaced with glutamine, in 100% of genotype 2 sequences, and with isoleucine, in 32% of genotype 4 sequences.
The average depth of coverage generated by pyrosequencing across all 31 samples ranged from 2 to 3200 across the complete HCV genome and from 22 to 3041 at the drug resistance sites analyzed in NS3 and NS5B. This led to the identification of minor variants at drug resistance sites in NS3 in 7 of 31 samples. Minor variants ranged in frequency, from 0.10% to 7.78% (Table (Table3).3). Only the dominant amino acid was detected in the remaining 24 sequences, despite 500–900-fold sequence coverage. Even within genomes with extremely high sequence coverage, no minor variants were detected at known drug resistance sites in NS5B. It is possible that sequencing genomes to higher coverage could result in identification of additional low-frequency mutations at these drug resistance sites. Furthermore, we cannot exclude the possibility that some mutations were not detected because of primer selection bias.
The viral population of 3 individuals contained single low-frequency polymorphisms in NS3 that have been associated with resistance to BI-201335 or SCH-900518 (G1251/PB80852:D168E, G1314/QC104:A156V, and G2035/PB83515:D168E). One individual (G2036/PB84975) had multiple mutations (A29V, T54A, A156T, and V170A) associated with resistance to the protease inhibitors ACH806, telaprevir, boceprevir, SCH-900518, or R7227. The viral populations also contained variants of unknown phenotype at key sites in NS3 (Table (Table3).3). One individual, G2037/PB57078, harbored low-frequency HCV variants L36P/R altering the dominant leucine that is associated with drug resistance to mutations of unknown drug response status.
In this study, we report the use of pyrosequencing combined with a hemi-genome amplification approach to rapidly sequence, assemble, and classify 30 complete HCV genome sequences of poorly characterized HCV subtypes (2c, 2k, 2m, 3g, 3i, 4g, 4l, 4m, 4n, 4o, 4r, and 4v) and 1 intergenotypic recombinant (2k/1b) for a better view of global sequence diversity, enabling formal confirmation of the provisionally classified HCV subtypes 2c, 3g, and 4v and the identification of a novel variant of genotype 4. Together with sequences available from public databases, these new data enhance our knowledge of potential drug resistance mutations, especially for subtypes of genotypes 2 and 4 that are mainly found on the African continent . Pyrosequencing of HCV hemi-genomes, without previous knowledge of target sequence or use of time-consuming and labor-intensive methods such as primer walking, allowed a high-resolution assessment of both high-frequency and low-frequency drug resistance mutations across the full viral genome in infected individuals.
The importance of accurate HCV genotyping, with at least 1 full genome sequence required to resolve discrepancies in HCV subtype classification, has been emphasized for the study of HCV evolution and potential associations with treatment response to DAAs [2, 12, 28, 33–37]. Unlike core-E1/NS5B genotyping, analysis of full genome sequences allows identification of divergence across all regions of the genome, thereby facilitating parsing of closely related, yet distinct subtype sequences. This is further evidenced in our phylogenetic analysis by the increase in bootstrap support for subtype assignments based on full genome sequences as compared to assignments based on core-E1/NS5B sequences. As a result, we are able to provide definitive confirmation of several provisionally classified subtypes and to identify a potential novel subtype of genotype 4. We expect that as more full genome sequences become available, new HCV subtypes will be teased apart from existing ones and that additional nomenclature criteria will be necessary.
The ability to robustly identify and classify genotype recombinants in full genome sequences is also critical in determining the prevalence and clinical relevance of such genetic events. The genotype 2k/1b recombinant sequenced here, G1241/PB41880, appears to be a member of the same circulating recombinant form as that identified in St. Petersburg [4, 5]. However, despite sharing the same mosaic structure as the St. Petersburg isolate, phylogenetic and break point site evidence suggests that G1241/PB41880 is likely to be an independently transmitted and diverged copy of the previously observed recombinant type, rather than a novel recombinant of similar parental strains.
Dominant drug resistance mutations in response to treatment with DAAs have been extensively studied in genotype 1 infection, but data on drug resistance mutations in other genotypes remain scarce [38–40]. Furthermore, very few studies explore sensitive techniques for the identification of minor drug resistance variants in viral populations . The application of pyrosequencing to poorly characterized HCV genotypes allowed the identification of naturally occurring genotype and subtype-specific drug resistance mutations to mutations conferring resistance to DAAs, as well as a high-resolution view of intrapatient sequence diversity. In these poorly described HCV genotypes, we find a genotype and subtype-specific distribution of dominant changes in key resistance sites in NS3 and NS5B proteins, as well as minor variants in NS3. While mutations at drug resistance sites in NS3 in genotypes 2–5 have been previously reported [39, 40], this study extends the analysis to resistance mutations in NS5B and includes genotype 6 and 7 sequences in the analysis.
The availability of full genome sequences also allowed the identification of genotype 4 subtypes that harbor resistance mutations in both NS3 (V36L) and NS5B (M414L) that have been associated with increased resistance to treatment with telaprevir and A-782759. The presence of both of these dominant resistance mutations in subtype 4g, 4k-4r, and 4v virus isolated from infected patients suggests that these patients might respond poorly to treatment with both drugs, although this remains to be tested. Furthermore, the finding that these 2 changes are found only in a subset of genotype 4 subtypes reinforces the importance of subtype classification in helping to predict clinical response to treatment, as has recently been found for treatment response rates to DAAs for HCV subtypes 1a and 1b [15, 16].
It is important to note that for a significant proportion of amino acid changes identified here, their impact on drug resistance is unknown. Moreover, given the significant nucleotide differences between HCV genotypes and subtypes, it is unclear whether the drug resistance mutations that have been extensively characterized in genotype 1 would have the same impact on resistance in other genotypic backgrounds. A recent in vitro pilot study on the activity of the NS5A inhibitor BMS-790052 in genotype 4, however, found overlapping resistance profiles between HCV genotypes 1 and 4 , suggesting that mutations analyzed here may also be important for treatment-response rates in areas with a high HCV prevalence, such as Africa, where HCV genotype 4 subtypes are prevalent. It is also possible that previously uncharacterized mutations such as the D168Q mutation prevalent in 100% of HCV genotype 3 sequences may help explain the poor treatment response of this genotype to TMC-435 . It will be important to study the consequence of these changes in the various genetic backgrounds on replication and resistance to antiviral drugs.
Taken together, our results indicate that deep sequencing of full genomes from HCV-infected patients can be an efficient method to classify virus subtypes and may help to inform treatment strategies. Appraisal of consensus full genome sequence from HCV-infected individuals can be used to identify the presence of dominant antiviral drug resistance mutations before initiation of a treatment regimen, while deep sequence data can be used to identify resistance mutations present at low frequency in the viral population of the same individual. Our approach identified minor drug-resistant variants in a subset of the treatment-naive patients tested, while a recent report describing the use of Illumina sequencing to identify DAA resistance mutations in treatment-naive genotype 1b–infected patients showed the presence of DAA resistant mutations in both NS3 and NS5B in up to 74.1% of the patients tested , suggesting that these low-level mutations may be highly prevalent in selected HCV subtypes. This information can be used to study the impact of dominant or subdominant mutations on treatment response rates for future interferon-free treatment regimens with DAAs and to eventually steer selection of appropriate single or combinatorial antiviral therapies in order to minimize treatment failure due to preexisting drug resistance mutations among viral subtypes.
Supplementary materials are available at The Journal of Infectious Diseases online (http://jid.oxfordjournals.org/). Supplementary materials consist of data provided by the author that are published to benefit the reader. The posted materials are not copyedited. The contents of all supplementary data are the sole responsibility of the authors. Questions or messages regarding errors should be addressed to the author.
Acknowledgments.We thank the Broad Institute's Sequencing and Biological Samples Repository Platforms for their assistance in generation of genomic data and sample handling.
Disclaimer.The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Financial support.This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services (contract HHSN272200900018C to B. W. B., contract HHSN272200900006C to B. W. B., grant R01-AI067926 to T. M. A., and grant U19-AI082630 to T. M. A.; and the Deutsche Forschungsgemeinschaft (grant DFG KU2250/1-1 to T. K.).
Potential conflicts of interest.All authors: No reported conflicts.
All authors have submitted the ICMJE Form for Disclosure of Potential Conflicts of Interest. Conflicts that the editors consider relevant to the content of the manuscript have been disclosed.