HCV samples studied. Table S1 in the supplemental material summarizes the HCV samples studied and the numbers of sequence reads determined for each. Three patients (P1 to P3) diagnosed close to the time of initial infection were studied longitudinally. Longitudinal HCV viral-load data are shown in Fig. , together with serum alanine aminotransferase (ALT) levels, a marker for hepatic inflammation. These patients were tracked from the time of clinical presentation, with newly diagnosed positive HCV RNA (all genotype 1) and elevated ALT levels consistent with acute hepatitis C virus infection, as previously defined (
15). All had negative HCV tests preceding diagnosis. Patients P1 and P2 developed apparently chronic infection over 6 and 2 years of follow-up, respectively. The long-term outcome is not known for patient P3, who was lost to follow-up 10 weeks after clinical presentation.
For the cross-sectional study, eight HIV-HCV-coinfected patients were studied. With the exception of one patient, all patients had CD4 counts below 300 cells/μl, enabling us to investigate the consequences of decreased immune function during HIV coinfection. As controls, the HCV populations in two additional monoinfected patients were characterized.
We also analyzed plasmid DNA carrying the HCV genome (pFL-H77/JFH1) (
26), which provided a homogeneous control template of known sequence. Any diversity detected in sequence reads from the plasmid template were due to either mutation during PCR amplification or errors in the sequence determination steps, providing data for optimization of our error control procedures.
Pyrosequence data acquisition. HCV RNA was purified from plasma, reverse transcribed, and then PCR amplified using primers complementary to HCV sequences. Amplicons were targeted to four regions (Fig. ): the 5′ UTR, the core nucleocapsid protein-coding region, and two regions of the envelope genes. Amplicons were chosen based on previous literature to sample both highly conserved (5′ UTR and core) and diverse (envelope) regions. The amplicons were numbered 1 to 4. For the two envelope amplicons, two primer pairs were used in an effort to minimize recovery biases possibly arising due to mismatches in the primer binding sites (indicated as A and B, with a and b indicating the direction of sequence determination). The primer sequences are shown in Table S2 in the supplemental material. All primers were composites containing the required primer binding sites for the 454/Roche sequencing procedure linked to the HCV-specific primers (Fig. ). An 8-base DNA barcode, which indexed the patient and time point from which the sample was derived, was positioned between the two (
3,
12). This allowed PCR products to be pooled for sequencing, and then reads were parsed into individual amplicons after sequence determination.
We analyzed 288 patient-time-point-amplicon combinations (see Table S1 in the supplemental material). A total of 174,185 pyrosequence reads passed quality control (~40 million bases of DNA sequence). The quality control criteria (
14) included (i) a perfect match to the barcode and primer sequences, (ii) >200 bases in length, and (iii) no Ns. For HCV monoinfection, 69,825 sequences were recovered; for HIV-HCV coinfection, 73,656 sequences were recovered; and for the plasmid DNA, 30,704 sequences were recovered.
Data processing using PyroNoise. Sequencing error can artificially inflate the numbers of sequence variants. This is particularly severe at homopolymeric runs, where the 454/Roche method can have difficulty calling the correct number of identical bases. Quince et al. have shown that improved accuracy can be obtained by first clustering the quantitative data from the initial light intensities generated during pyrosequencing using a distance measure that models pyrosequencing noise. The method has been implemented in the PyroNoise program (
29). In addition, in this study, we incorporated a second stage of sequence clustering to remove PCR base pair substitutions, in which denoised sequences are clustered into OTUs with 98.5% identity (accepting three differences over 200 nt). PyroNoise preclustering greatly suppressed inflation of OTUs due to errors in published model studies (
29), and similar results were obtained here.
Figure presents an example, using the data from amplification of the HCV-bearing control plasmid pFL-H77/JFH1. At the top are the sequences after conventional processing and alignment by the left PCR primer sequence. As can be seen, many reads go out of alignment as the sequence extends from left to right due to errors at homopolymers. At the bottom is the alignment after error control using PyroNoise, which removed homopolymer errors.
The use of PyroNoise processing prior to formation of OTUs was then compared to a widely used program for making OTUs (Dotur), which uses conventional percent identity of the base calls as the basis for clustering sequence reads (
33). In this analysis, OTUs were (i) clustered by Dotur using a 97% identity threshold or (ii) first preclustered using PyroNoise and then clustered at 97% identity in a second step. To allow comparison, the number of OTUs and the distribution of reads within OTUs were summarized using the Shannon Diversity Index for output from Dotur and PyroNoise. Values for the Shannon Index are increased by either adding more species (in this case, OTUs) to a community or distributing individuals (in this case, sequence reads) more evenly among the species (OTUs) present. Figure shows that Shannon values for control plasmid sequence reads were much higher for the OTUs made using Dotur than for OTUs made using PyroNoise. We also compared the two methods over the HCV sequence data from patient samples, though unlike the case of the plasmid control, we did not know the true number of OTUs in the patient samples. Figure shows that values were higher with Dotur than with PyroNoise, particularly for low Shannon values, which we infer to be erroneous inflation of the diversity values in the OTUs generated with Dotur. This general conclusion was not sensitive to variations in the parameters used in each program for data processing (data not shown). Thus, subsequent analyses were carried out on the data after PyroNoise processing.
Diversity along the HCV genome. Variation in the HCV sequences is summarized in Fig. . Each base is color coded, and all 174,185 sequences are shown as “stacks” aligned on the HCV genome. The x axis indicates the position on the HCV genome, corresponding to the indicated positions on the genetic map. The y axis indicates the number of sequence reads, showing a cumulative sum over all samples and amplicons. The samples were analyzed separately for reads from the plasmid control (top), HCV-monoinfected (bottom), and HCV/HIV-dual-infected (middle) subjects.
For the plasmid control, each of the four amplicons showed almost exclusively continuous vertical colored bands. This indicates that the same base was found in each position in almost every read after PyroNoise processing.
For the two sets of HCV samples, corresponding to HCV monoinfection and HIV-HCV coinfection, the results varied similarly by amplicon. For the leftmost 5′-UTR amplicon, little diversity in sequence was seen. Previous literature has established that this region is extremely highly conserved. For the core amplicon, the sequence composition was generally similar, though some variation could be discerned. For the two amplicons in the envelope region (rightmost in Fig. ), much more variation was observed. The variation was higher throughout the read but was particularly high around base position 1500, in HVR1 of the E2 envelope. Both the HCV monoinfection and HIV-HCV coinfection samples showed a high degree of variation in this region (discussed further below).
Longitudinal sequence variation in early HCV infection. We next investigated longitudinal trends in sequence diversity for patients 1 to 3. Samples from each time point were condensed into OTUs after PyroNoise processing, and the diversity was summarized using the Shannon Index. Figure shows longitudinal plots for each of the amplicons for patients 1 to 3. Each subject is shown by a single color. The raw numbers of sequence reads for each sample, and the Shannon values for each, are presented in Table S1 in the supplemental material.
The Shannon values for the 5′-UTR (
1) and core (
2) regions were low (<0.3) and did not increase significantly over time. These findings provide a critical control for other amplicons with greater diversity. Mutation during RT-PCR or errors in sequence determination could falsely inflate diversity, but results with the 5′-UTR and core amplicons show that the magnitude of such errors was comparatively small.
Shannon Index values for the envelope amplicons were much higher than for the 5′ UTR and core (P < 0.02 for pairwise comparisons) and tended to increase over time (P < 0.05 for comparison of the first and last points of any amplicon). Patient 1 showed the most extreme behavior, with a high Shannon Index value (up to 2.5) for the envelope HVR1 amplicon at 215 weeks, indicating either diversification of the envelope sequences or a second infection. Patient 2 also showed a notable increase in diversity in HVR1 over time, reaching values up to 2.0. Patient 3, who was sampled over the shortest time (from 1 to 10 weeks following the initial clinical presentation), showed a longitudinal increase in diversity for some but not all the envelope amplicons.
Changes in sequence diversity were not strictly linked between amplicons. While the envelope-coding regions varied, the 5′-UTR and core regions did not. Even within the envelope-coding region, patterns of diversity varied. For example, in the HVR1 amplicon (3Aa) for patient 1, there was a sharp increase in diversity over time, whereas in the 4Aa amplicon, the diversity increased much less. Evidently different parts of the genome, or even different parts of the envelope region, diversified at different rates after infection.
The longitudinal patterns of PyroNoise-processed OTUs were analyzed further using phylogenetic trees. The analysis of diversity using the Shannon Index presented above is not sensitive to rare sequence variants, such as those due to mutation from PCR error or chimera formation. However, the phylogenetic survey, which emphasizes rare variants, would be. For that reason, in this and subsequent studies, we excluded variants supported by only one or two sequence reads. Control experiments showed that these reads were enriched in sequences with low sequence similarity to a collection of all available HCV sequences. Inspection of individual sequences indicated that several were apparent chimeras of HCV sequences with unidentified sequences. After being processed by this approach, 11 of 12 plasmid control samples showed single OTUs, with the 12th showing two (see Table S1, column K, in the supplemental material).
For the 5′-UTR amplicon (Fig. ), all the sequences clustered in a single OTU. In the figure, each rectangle indicates a single sequence variant, and the extent of the black bar indicates the percentage of the total population contributed by that variant. The plasmid control also formed a single OTU. Longitudinal analysis of the core amplicon showed similar highly conserved structure with only rare variation (data not shown).
In contrast, the amplicons for the envelope region showed much greater variation (Fig. ). For patient 1, three related OTUs predominated for the first three time points sampled (5 to 40 weeks after clinical presentation). Over this time period, the principal OTUs differed by about 5 base substitutions out of ~200 total. By 215 weeks after infection, a different branch of the lineage predominated, separated from the earlier OTUs by an edit distance of at least 12 substitutions. For patient 2, the population was relatively stable for the first 17 weeks, dominated by a single OTU. Diversification was seen at 34 weeks, and by 107 weeks, the population had changed over to a divergent lineage. Patient 3 was sampled over only 10 weeks, and relatively little diversification was seen. Similar patterns were seen for the three patients for the other envelope amplicons analyzed (data not shown).
HCV diversity during HIV-HCV coinfection. Approximately one-third of HIV-infected patients are coinfected with HCV (
35), and HIV coinfection increases HCV-related liver diseases (
10), raising the question of how HIV coinfection affects the structure and evolution of HCV populations. One possibility is that immune impairment due to advanced HIV infection results in reduced immune pressure on HCV populations (
4,
18) and thus in diminished HCV diversity. To investigate this issue, we compared HCV sequence diversities in HIV-HCV-coinfected and HCV-monoinfected patients. We analyzed plasma HCV RNAs from eight HIV-HCV-coinfected subjects. For HCV monoinfection, we analyzed the last time points for patients 1 to 3 described above and two additional chronically infected HCV subjects, providing five HCV-monoinfected patients for comparison.
The monoinfected and coinfected individuals were compared over each genomic region to identify reproducible differences between the two patient groups. The Shannon Index value was calculated for each subject individually, and then the mean Shannon Index values for the two groups were compared (Fig. ). For the 5′-UTR amplicon, almost no diversity was detected, and there was no difference between monoinfected and coinfected subjects. Higher diversity levels were seen for the core and the E2 envelope (excluding HVR1) amplicons for both groups, but there also were no significant differences in the mean diversity values.
In contrast, significantly higher diversity was seen for the monoinfected subjects in the HVR1 envelope region than for coinfected subjects (P = 0.02616; Mann-Whitney comparison of means for pooled 3Aa, 3Ab, 3Ba, and 3Bb). This is consistent with the idea that loss of immune function due to advanced HIV infection may diminish selective pressure on HCV populations, so that the HCV population diversifies to a lesser extent. However, the magnitude of the effect was modest, and we note that there are possible complications in this analysis that could influence the outcome. We return to these issues in the Discussion below.
Estimating the number of OTUs in an HCV population. We investigated how close we were to completely sampling the HCV sequence populations using rarefaction analysis. In this method, random subsamples of sequences are drawn from the sequence pool for each amplicon from each individual, and the numbers of OTUs found in the subsample are plotted (Fig. ). The numbers of OTUs are then plotted for many different subsample sizes. If the curve reaches a plateau at values below the total number of sequences in the sample, it indicates that further sampling would yield few or no new OTUs. For this analysis, we chose the conservative approach of excluding variants supported by only one or two sequence reads.
For the 5′-UTR sequences (Fig. ), no samples from any subject showed more than four OTUs, even at a sampling depth of 1,500 sequences per sample. For most of the samples, the values were near saturation at low sampling depths. We thus concluded that most or all of the variation in the 5′-UTR sequence pool had been sampled. For the core region (Fig. ), all of the samples appeared to be near plateau, with the highest OTU value <20.
For the envelope region, many of the OTU values were much higher (Fig. ), with a maximum value of ~100. Nevertheless, the curves all reached clear plateaus, indicating that most or all of the variants in the sample were detected.
How many undetected OTUs are there in the HVR1 sequence populations? We investigated this using the Chao1 estimator for species richness, which allows the total number of species in a population to be inferred from counts of individuals from that population. If many species (OTUs) are represented by low counts of individuals (sequences), then it is likely that further sampling will yield new species (OTUs). Conversely, if most OTUs are represented by many sequences, then few additional OTUs will be recovered by further sampling. The Chao1 method uses such frequency of isolation information to estimate of the number of unseen species.
The Chao1 values for all HCV amplicons studied are shown in Table S1 in the supplemental material. When Chao1 was used to analyze the data in Fig. (after PyroNoise clustering and removal of OTUs supported by 1 or 2 sequences), there were only slightly higher estimates. The highest value for any of the HVR1 samples was 124, only slightly higher than the measured number of OTUs.
The number of virions transmitted early during infection. For HIV, exhaustive sequencing effort has suggested that typically only one or a few virions initiate infections during initial transmission (
17,
32). Our data allow us to address this for HCV. Although we do not know the precise time of transmission for the patients studied longitudinally here, ALT values indicated that infection was probably close to the time of diagnosis. We thus assessed the diversity, and hence the number of virions transmitted, at the earliest available time points for patients 1 to 3.
For this, we used the HVR1 amplicons (3Ba and 3Bb), since they are the most diverse and so are most suited to reporting the number of HCV variants. We used the conservative Chao1 analysis described above to estimate population sizes (see Table S1 in the supplemental material). For the two HVR amplicons, the numbers of sequence reads ranged from 148 to 551, but the numbers of OTUs ranged from only 1 to 4 for the first time points. The higher numbers of OTUs were from patient 2, who was first sampled the longest time after diagnosis (9 weeks). For all three patients, OTU numbers were higher for the HVR1 amplicons at the last time point than for the first time point (see Table S1 in the supplemental material). Thus, these data indicate that during transmission or shortly after infection, the HCV sequence population was comprised of only one or a few OTUs.
The viral population structure strongly affects estimation of the size of the transmitted inoculum. In previous literature, detection of low viral sequence diversity immediately after transmission has been taken as evidence that only one or a few viral particles seeded the initial infection. However, this conclusion is highly dependent on the structure of the viral population in the individual virus donor. For example, in the extreme case in which the viral population in the donor is homogeneous, finding a single variant in the recipient does not establish transmission of only a single virion. Conversely, if the donor harbors a highly diverse viral population and only a single viral variant is observed in the recipient, then a single particle probably initiated the productive infection. Here, we developed a quantitative framework to analyze the probability of transmitting one viral sequence variant given known structure of the donor population and used the pyrosequence data to model the size of the initial HCV inoculum transmitted in the patients studied here.
The probability of receiving only one OTU from the donor is as follows: Prtotal = p1n + p2n + … + pin, where p1, p2, … pi are the proportions of OTU1, OTU2, … OTUi in the donor, Prtotal is the total probability of transmission of a single variant, and n is the number of virions transmitted initially. For example, if OTU1 comprises 90% of the population in the donor, then the first virion transmitted to the recipient has a 90% chance of being OTU1. The probability of transmission of two virions belonging to OTU1 is thus 0.9 × 0.9, or 0.81. The squares of each probability are then summed over all OTUs in the sample to yield the total probability of transmission of a single variant given infection by two particles. If three viral particles started the infection, the probability would be the proportion of each OTU cubed, and so forth for larger numbers of virions transmitted initially. Thus, either increasing the diversity of the viral population in the donor or increasing the numbers of viral particles transmitted decreases the probability of obtaining a single variant in the recipient.
We used the HVR1 amplicon (3Ba) to represent proportions of OTUs in experimentally measured HCV populations (see Table S1 in the supplemental material). For subjects for whom multiple time points were available, the last time point was studied. Figure shows the calculated probability of obtaining a single OTU from a single transmission of n virions from the viral populations in the subjects studied. The x axis shows the number (n) of virions in hypothetical transmission events, comparing n values of 1 to 100 virions. The y axis shows the probability of obtaining only one OTU given the numbers of virions transmitted. Data for each HCV donor (modeled as individuals from our study) are shown by a separate curve on the graph.
Inspection of the graph shows that most curves fall quickly, indicating that transmission of even small numbers of virions would result in the appearance of multiple sequence variants in the recipient. For most subjects, transmission of 20 functional viral particles would likely result in accumulation of more than one HVR1 variant. However, there is clearly also considerable variation among the subjects studied. The most extreme case (the rightmost curve in Fig. ) is patient 8, 96% of whose viral population was made up of a single sequence variant. Thus, in general, the measured structures of the HCV populations in the subjects studied were sufficiently complex that transmission of >20 particles would be unlikely to result in a single OTU appearing in the recipient, but the variation among subjects was considerable.