We analyzed HIV-1 subtype B evolution in 11 individuals, including four transmitter-recipient pairs and three independent seroconverters. At the time of transmission, three transmitters were chronically infected while one was acutely infected. Sequences from transmission pair 1 (subjects T1 and R1) have been described previously (41); from the nine other individuals, 475 nearly full-length viral genome sequences (“genomes”) were generated, representing an average of 10 genomes at up to 12 serial time points. No evidence of dual infection was found. All viruses were predicted by PSSM (30) to use the CCR5 coreceptor, as expected for early HIV-1 infection (63).
Fig. 1. Phylogenetic trees of HIV-1 genome sequences. The large (left side, unboxed) tree contains nine individuals and is based on 9 to 15 full genome sequences from each sampled time point (up to 12 time points extending up to 350 days after the onset of symptoms).
Longitudinal follow-up of subjects
HIV-1 infection is typically founded by a single variant.
Eight of the nine acutely infected individuals had infections founded by a single HIV-1 lineage, while one individual (R4) replicated two lineages. Founder viral populations were remarkably homogeneous based on Hamming distances and pairwise diversity measures. For single-founder infections, the mean pairwise diversity among genomes at visit 1 was 0.32% (range, 0.17 to 0.44%). For R4, the individual with two founder lineages identified at 13 days postonset of symptoms of acute retroviral infection (referred to as “days”), the two distinct variant lineages differed by a mean of 1.12%, whereas each individual lineage was nearly homogeneous.
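The diversity measures mentioned above can be illustrated with a minimal sketch. It assumes aligned, equal-length sequences and computes the uncorrected proportion of differing sites; the function names are hypothetical, and the diversity values reported for this study were corrected with a substitution model, so this simplified version would not reproduce them exactly.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_diversity(seqs: list[str]) -> float:
    """Mean uncorrected pairwise distance over all sequence pairs, in percent."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    total_differences = sum(hamming(a, b) for a, b in pairs)
    return 100.0 * total_differences / (len(pairs) * length)
```

For a homogeneous founder population, nearly all pairwise Hamming distances are zero or close to it, so the mean stays well below 1%, as observed at visit 1.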
Transmitter-recipient transmission pairs.
Transmission pair 2 corresponds to two individuals with acute infections. Since the transmitter, T2, had a single variant with little diversity among sequences, infection was founded by a single strain in the recipient partner (R2). Sequences from both individuals were intermingled in the tree, and interhost genome pairwise distances ranged from 0.18% to 0.43%, a range conforming to the variation seen with a single founder within one host at the earliest time point. Transmissions in two other pairs resulted in infections established by a single founder strain (pairs 1 and 3) even though each of the transmitting partners was chronically infected, with extensive diversity among their sequences (41). Transmitting partner T4 had been enrolled during primary infection, and little viral genetic variation was observed, but 9 years later, at the time of transmission to R4, genomes from T4 contained extensive variation, and two variants were found in primary infection in the recipient (see the material posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/).
For the four transmission pairs, we compared sequences from the recipient to sequences from the respective transmitter. There were exact matches (100% similarity) between transmitter and recipient sequences when we considered the conserved genes gag or pol. However, over the whole genome there were no exact matches between recipient sequences and those from the transmitting partner; the divergence between the closest transmitter and recipient sequences ranged from 0.18% (T2-R2) to 1.69% (T3-R3).
An important question is whether the founder variant in the recipient can be distinguished from sequences in the transmitter due to properties advantageous for the establishment of infection. That is, is the founder variant rare or common (typical) in the transmitter? To address this question for each transmission pair, we compared the consensus of the recipient population to each sequence in the transmitter. From the resulting ranked pairwise distances, we identified an approximate transmitted variant, i.e., the transmitter sequence that is most closely related to the recipient consensus (founder variant). Next, we compared all transmitter sequences to the transmitter consensus sequence under the hypothesis that an approximate transmitted variant that is rare would have a greater distance to the transmitter consensus than most, if not all, other transmitter sequences. Figure 2 shows the distribution of genetic distances for all the sequences from the transmitter. For each transmission pair, the transmitter sequence that matches most closely the consensus sequence in the recipient at visit 1 (i.e., our best approximation of the transmitted/founder virus) was found to be representative of the sequences in the transmitter. Indeed, it was generally very close to the mean genetic distance corresponding to all the transmitter sequences. Thus, the approximate founder variants did not appear to be unusual or rare in the transmitter.
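The two-step comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: sequences are assumed to be aligned and of equal length, distances are plain Hamming counts, the consensus is a simple per-column majority, and all function names are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def consensus(seqs: list[str]) -> str:
    """Majority character at each alignment column (ties go to the first seen)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def approximate_transmitted_variant(transmitter: list[str], recipient: list[str]):
    """Return the index of the transmitter sequence closest to the recipient
    consensus, plus every transmitter sequence's distance to the transmitter
    consensus."""
    # Step 1: rank transmitter sequences by distance to the recipient consensus.
    recipient_consensus = consensus(recipient)
    to_recipient = [hamming(s, recipient_consensus) for s in transmitter]
    best = to_recipient.index(min(to_recipient))
    # Step 2: distance of each transmitter sequence to the transmitter consensus.
    transmitter_consensus = consensus(transmitter)
    to_own_consensus = [hamming(s, transmitter_consensus) for s in transmitter]
    return best, to_own_consensus
```

A rare transmitted variant would sit in the upper tail of `to_own_consensus`; a typical one would lie near the mean of that list, which is the pattern reported here.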
Fig. 2. The recipient founder virus is typical of the viral population in the transmitter. Distribution of genetic distances between the 10 transmitter sequences obtained near the time of transmission and the corresponding consensus sequence in the transmitter
Stochastic versus selective processes in the first weeks of HIV-1 infection.
Viral genome diversity increased over time across all individuals at a yearly rate of 0.55% for all nucleotide sites (Fig. 3A); the rate of accumulation of selected sites corresponded to an average of 40 sites in each subject in the first year of infection (Fig. 3B). However, the evolutionary rate at the genome level masks differing rates among genes. When we examined individual genes, as expected, the average rates of diversification were lower in gag (0.33%) and pol (0.31%) and higher in env or C2V5 (1.07%) and nef (1.34%) (all values are pooled estimates for the four individuals chosen because they had five or more time points evaluated).
Fig. 3. Trends in genetic diversity, positive selection, potential N-linked glycosylation, and epitope number. (A) Mean pairwise nucleotide diversity across genomes (corrected with the Hasegawa-Kishino-Yano [HKY] substitution model). (B) Cumulative number of
At visits in the first month of infection, we observed a transient decrease (a dip) in nucleotide diversity for both genomes (Fig. 3C) and independent gene sequences (data not shown). This suggested a contraction in diversity following the establishment of infection. We also noted a decrease in APOBEC3F/G-mediated mutations that coincided with the dip in nucleotide diversity (see the material posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/), yet the dip in nucleotide diversity was of substantially larger magnitude and thus not due to the decrease in APOBEC-induced mutations (see the material posted at the URL mentioned above).
To evaluate potential factors behind the dip in nucleotide diversity and assess the forces acting on the viral population very early in infection, we assessed how the data conformed to the neutral theory. The dramatic, several-order-of-magnitude change in plasma viremia that occurs during acute infection suggests that changes in HIV-1 population size (i.e., demographic processes) might influence genetic diversity in this time period. Trends in genome diversity and divergence are plotted along with viral load data in Fig. 4. We performed neutrality tests on genomes from the four individuals with five or more sequential visits (R3, S1, S2, and S3). Both Tajima's D and Fu and Li's D* tests (18) revealed negative deviations from neutral evolution, suggesting positive selection, demographic events, or both (69) (see Table 2 posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/). The most significant negative deviations (P < 0.001) were observed in the earliest time points after infection, specifically before ~50 days, coinciding with the rapid viral population growth and contraction during acute infection (shaded in Fig. 4). Next, to distinguish demographic and selective processes, we calculated D* separately for individual genes, including env and pol; there was no evidence of selection acting specifically on a particular gene, as genomes and individual genes showed similar patterns, implying the existence of demographic processes acting uniformly across genomes. Significant negative deviations were again more common at the first time points, and the most significant P values in the gene-specific analyses coincided with negative deviations in the whole-genome analyses. Since sequential visits are not independent due to shared evolutionary history, the number of independent tests can be reduced (compared to strict Bonferroni correction for 144 tests), thus revealing significant deviations from neutrality in the early time points (see Table 3 posted at the URL mentioned above). In addition, in pairwise comparisons of genes for each time point, the Hudson-Kreitman-Aguadé (HKA) tests (28) revealed no sign of adaptive positive selection.
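To make the direction of these deviations concrete, the following is a minimal sketch of the Tajima's D statistic using its standard textbook formula. It is illustrative only: it assumes aligned, gap-free sequences of equal length, does not compute P values, and is not the software used for the analyses reported here; the function name is hypothetical.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs: list[str]):
    """Tajima's D for aligned sequences; returns None if no site is variable."""
    n = len(seqs)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    # S: number of segregating (variable) alignment columns.
    s_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if s_sites == 0:
        return None
    # pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants from the standard derivation of the variance of (pi - S/a1).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - s_sites / a1) / sqrt(e1 * s_sites + e2 * s_sites * (s_sites - 1))
```

A sample dominated by low-frequency variants, as expected shortly after a founder event followed by rapid expansion, gives a negative value, whereas variants held at intermediate frequency give a positive one.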
Fig. 4. Stochastic processes predominate during acute HIV-1 infection. Plots for four individuals followed from 3 up to 350 days postonset of symptoms. Trends in pairwise diversity and divergence from the first visit consensus for genome nucleotide alignments
The significant negative deviations observed for both genomes and separate genes persisted until the rapid decline in viral loads. Importantly, negative D values that are due to demographic processes can result from a founder effect or from a recent population expansion with a subsequent delay in the population reaching neutral equilibrium (69). Evolution of HIV-1 during acute infection is therefore marked by both a founder effect and subsequent population expansion. We conclude that the observed early dip in viral diversity is likely caused by rapid viral population expansion; the process of population expansion can result in decreased mean population diversity as most lineages in the growing population are descended from a limited number of ancestral lineages.
Indicators of selection begin to appear in the first week after onset of symptoms.
While demographic processes predominated at the earliest time points (before ~50 days), later visits revealed the role of positive selection in HIV-1 primary infection. Using a comparative dN/dS approach and a simulation approach that identifies directional selection (41), we identified amino acid sites under positive selection for the four individuals whose data are shown in Fig. 3B. Over the whole proteome, an average of 24 sites were under positive selection for each individual (range, 20 in R3 with 222 days of follow-up to 37 in S2 with 346 days of follow-up) (see Table 4 posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/). No significant change in the number of potential N-linked glycosylation sites (PNGS) was seen over these time periods or between transmitters and recipients (Fig. 3E). The mean number of PNGS ranged between 27 and 34 per sequence. However, only two to five PNGS had variation (of which only one site, in S1, had a positively selected mutation).
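As an illustration of how PNGS can be counted, the sketch below scans a protein sequence for the standard N-X-S/T sequon, where X is any residue except proline. It is a simplified example with a hypothetical function name; the tool and settings used for the counts reported here are not specified in this section.

```python
import re

# N, followed by any residue except P, followed by S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") each be counted.
SEQUON = re.compile(r"N(?=[^P][ST])")


def count_pngs(protein: str) -> int:
    """Count potential N-linked glycosylation sites in one protein sequence."""
    ungapped = protein.upper().replace("-", "")  # drop alignment gaps first
    return len(SEQUON.findall(ungapped))
```

Applying `count_pngs` to each sequence at each visit and averaging per time point would give the kind of per-sequence trend summarized above.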
To assess T cell-mediated pressure on HIV-1 evolution, we analyzed CTL responses and predicted epitopes based on each individual's HLA type. Akin to the dip in viral diversity, we noted that the average number of predicted epitopes also decreased in the first ~50 days after infection (Fig. 3D). However, with the exception of subject S2, these dips occurred later and for a more prolonged period than the dips in viral diversity for the same individuals. The above data along with CTL response data are illustrated for four newly infected individuals: three enrolled in Fiebig stage I (Fig. 5A; see also the material posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/) and one in Fiebig stage V (see the material posted at the URL mentioned above). Overall, mutations accumulated gradually over the genome through time. The initial appearance of a mutation that later came to fixation in the Tat protein was detected at 7 days in subject S1 (Fig. 5A). The earliest fixations of mutant amino acids were at 21 days (in Tat from S1) (Fig. 5A) and 33 days (in Nef from R3) (see the material posted at the URL mentioned above), although the mutation in Tat was not identified as positively selected by the two algorithms used here due to the extremely abrupt change in the population. By ~6 months postonset of symptoms (181 to 210 days), positively selected mutations were much more frequent, ranging from 9 in subject S2 to 18 in S3. Selected loci were more frequent in the 3′ half of the genome, which includes the most variable HIV-1 genes.
Fig. 5. InSites diagrams of genomes from longitudinal samples. The figure shows the alignment of phylogenetically informative sites identified in genome sequences relative to the visit 1 consensus sequence in the recipient. Genome sequences from different time
When examining mutations in CTL epitopes (recognized and predicted), we noted several instances of the initial mutations being replaced by secondary mutations located nearby and usually in mutually exclusive sequence patterns. These patterns were seen in each of the four individuals followed for more than 180 days at one to eight sites across the proteome (Fig. 5A; see also the material posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/). Similar to the first amino acid mutations, the second mutations noted were most often to amino acids of low database frequency. These mutual exclusion patterns were seen in epitopes corresponding to three CTL responses against Env in S1 (outlined in green boxes). In this complex case, the original Env epitopes were replaced by day 68 by two to four variants harboring mutually exclusive mutations. A response was detected against the known epitope SFNCGGEFF (C04; residues 375 to 383) (SFC of 620 at day 127; not measured at day 7), which had been replaced by day 68 by two mutually exclusive variants, SVNCGGEFF (F-to-V at position 2) and SFNCRGEFF (G-to-R at position 5). The epitope RRGWEILKY (A01; residues 787 to 795) represented >90% of sequences until day 13 and only 27% at day 21 and was not detected afterwards, while the variant RRGWETLKY became the consensus (the ELISPOT assay response was 15 SFC at day 7 and 715 at day 127). A stronger response at day 127 (SFC of 1,310; not detected at day 7) was elicited against RQGLERALL (B08; residues 848 to 856), which was the predominant variant until day 44, when it was replaced by RQGLERVL. Five more CTL responses were detected by ELISPOT assay in subject S1; however, their targeted epitopes showed no sequence variation over 181 days of follow-up. Three other examples of mutually exclusive mutations were observed in this subject. In Pol, two mutations were 7 amino acids (aa) apart in a predicted B*0801-restricted epitope: an R-to-K mutation found at position 1 was in mutual exclusion with an S-to-P mutation at position 8 of the epitope. In Rev, the mutually exclusive mutations were 7 aa apart, but only one site was found within a predicted epitope. In one case, the known Gag B08-restricted epitope DCKTILKAL (residues 197 to 205) was transiently replaced between days 68 and 96 by DCRTILKAL (Fig. 5A). This corresponded to a switch in database frequency from 94% to 3% for the K331R mutation (a previously documented HLA-B08-associated polymorphism [50]). The resurgence of the original amino acid at day 127 was accompanied by a mutually exclusive A-to-S mutation located 2 aa downstream of the epitope, corresponding to a 76% decrease in database frequency (from 87% to 11%).
Escape versus reversion.
We assessed the direction of mutations by comparing the conservation level of the founder and mutant amino acids in a database of circulating HIV-1 sequences (41). We defined forward (likely to be escape) mutations as those that reflected a decrease in database frequency of at least 50% (shown in orange) and reverse (likely to be reversion) mutations as those that reflected an increase of at least 50% (shown in turquoise; see also the material posted at http://mullinslab.microbiol.washington.edu/publications/herbeck_2011/). Amino acids with less substantial changes in database frequency are highlighted in green. A predominance of forward mutations was observed in all individuals. When we counted the mutations that became fixed, the majority corresponded to forward mutations with a drastic switch to amino acids with lower database frequencies. The ratio of forward to reverse mutations was 37/2 for subject S7, 6/0 for T4, 53/7 for S1, 91/7 for S2, 55/10 for R3, and 4/1 for R4 (for R4, in whom two founder variants were identified, we classified mutations relative to the consensus corresponding to the respective founder variant).
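The classification rule can be written as a short sketch. It assumes that the 50% threshold refers to a difference in percentage points, which is how the text elsewhere describes a change from 87% to 11% as a 76% decrease; the function name and the label for the middle category are hypothetical.

```python
def classify_mutation(founder_freq: float, mutant_freq: float,
                      threshold: float = 50.0) -> str:
    """Classify a substitution by the change in database frequency.

    Frequencies are percentages (0 to 100) of circulating sequences that
    carry the founder and the mutant amino acid at the site, respectively.
    """
    change = mutant_freq - founder_freq  # in percentage points (assumption)
    if change <= -threshold:
        return "forward"       # toward a rarer residue: likely escape
    if change >= threshold:
        return "reverse"       # toward a more common residue: likely reversion
    return "intermediate"      # less substantial change in frequency


# Examples taken from frequencies quoted in the text:
print(classify_mutation(94.0, 3.0))     # K-to-R change in the Gag epitope
print(classify_mutation(0.005, 96.0))   # D-to-N change in the Env epitope of S2
```

Under this rule the first example is classified as forward and the second as reverse, matching how the text interprets them.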
Reverse mutations were also rare when we analyzed only mutations located in targeted/predicted epitopes. For S2, only one reverse mutation among 11 potential epitopes was suggested (Env; C08-QFEDKTIIF replaced by QFENKTIIF at day 98); the original residue D was found in 0.005% of sequences in the HIV database, while N was found in 96%. For S3, two putative reversions were found, both in Env, including the mutation of IYAPPIQGL to MYAPPIQGL, corresponding to a switch from residues found in 1% (I) to 98% (M) of database sequences. In contrast, several possible escape mutations were seen in Env, Nef, and Gag, including some complex patterns with, for example, four different amino acid mutations in the known Nef epitope VLMWKFDSHL (A02); all were found in less than 8% of circulating sequences. Six independent ELISPOT assay responses were detected in R3, all against invariant epitopes, except for one Env response in which the original DPNPQEIRL epitope was replaced by DPNPQEIGL from day 144 onward.