The primary objective of this study was to develop, validate, and implement a new experimental strategy for analyzing complete HIV-1
env genes and, eventually, complete HIV-1 genomes from plasma RNA in a manner that would accurately reflect their identity and composition in vivo. To this end, we adapted methods previously described by other investigators (27, 43-45) and tested them using in vitro-synthesized HIV-1 RNA transcripts of known sequence identity and plasma specimens from subjects with acute and early infections. Using an equal mixture of T7-synthesized RNA transcripts from two related but distinct HIV-1
env clones (BORId9.4F12 and 4F8), we carried out SGA-direct sequencing to estimate the rates of nucleotide misincorporation and recombination associated with this method. We observed 3.4 assay-related errors per 10,000 nucleotides, indicating a misincorporation rate of 0.034%. In addition, we observed a 0.01% rate of elongation errors within runs of the same nucleotide. We attribute these rates to a combination of T7 polymerase and Superscript III reverse transcriptase errors and note that Palmer and coworkers reported very similar values (0.011% nucleotide substitution errors and 0.022% elongation errors) for T7 RNA transcripts of HIV-1
pro-pol genes (27). We also performed SGA analysis on an equal mixture of vRNA from two transfection-derived HIV-1 strains, YU2 and SG3. Here, human RNA Pol II, and not T7 polymerase, catalyzes the RNA synthesis step. In this case, we observed a nucleotide misincorporation rate of 0.0068%, or 0.68 assay-related errors per 10,000 amplified nucleotides. In addition, we observed a 0.0015% rate of elongation errors within runs of A or T residues. We attribute these errors to a combination of human Pol II and Superscript III reverse transcriptase errors and note that Mansky and Temin reported a similar value of 0.0034% for the overall HIV-1 reverse transcriptase-plus-human Pol II error rate on a
lacZ template (23). Thus, the total error rate of the SGA-direct sequencing method as described here for the analysis of HIV-1
env sequences is no more than 8 × 10⁻⁵, due mostly to Superscript III errors. This is not a negligible error rate even when
Taq polymerase errors are avoided altogether (by direct sequencing of uncloned amplicons), since it can lead to single-nucleotide misincorporations in as many as 1 in 5
env sequences. Nonetheless, we suspect that 8 × 10⁻⁵ is an upper limit for the Superscript III error rate in our analyses, since in some patients with very early infection (Fiebig stages I and II), we found as many as 48 of 52 (92%) plasma virion
env sequences to be identical, with the remaining four varying by only five nucleotides altogether; this yields an inferred rate of Superscript III error of <4 × 10⁻⁵ (G. M. Shaw and B. H. Hahn, unpublished data). For certain applications, such as identifying transmitted or early founder sequences, infrequent nucleotide misincorporations are of no consequence since many independently generated sequences are analyzed together and all coalesce to a single consensus. However, if SGA-derived
env genes from chronically infected subjects, in whom most circulating viruses are unique (27), are analyzed, then single-base-pair misincorporations due to Superscript III error can be a confounding variable. Finally, we examined whether cDNA synthesis using Superscript III generated recombinant viral sequences in vitro. Among 109 complete
env sequences, corresponding to 278,000 nucleotides, we observed no recombinants. We also observed no instances of intragenic recombination or of insertion, deletion, or duplication. Our findings are thus in agreement with those of Palmer and coworkers, who also found no evidence of Superscript III-mediated recombination, insertion, deletion, or duplication in any of 50 genomes (66,000 nucleotides) analyzed (27). Although we did not formally evaluate the substitution rates and template-switching frequencies for
Taq polymerase when using the bulk amplification method, we frequently encountered
Taq-induced recombinants and/or misincorporations in bulk-amplified sequences (Fig. and ). Moreover, in a separate study where we obtained functional
env clones from SGA-derived amplicons (B. F. Keele, unpublished data), we identified (and discarded) numerous clones that contained
Taq-induced errors and thus did not correspond to the
env consensus sequences. These clones were excluded from subsequent biological analyses since they represented in vitro artifacts rather than viruses present in the patient. We thus conclude that only SGA-based strategies can unequivocally identify genetically linked mutations and that assay-related nucleotide misincorporation and recombination frequencies are much lower for SGA approaches than for other strategies.
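The error-rate arithmetic quoted in this paragraph can be retraced with a short calculation. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the env amplicon length is an assumption derived from the statement that 109 complete env sequences correspond to 278,000 nucleotides (about 2,550 nucleotides each).

```python
# Assumption: average env amplicon length, derived from the text's
# "109 complete env sequences, corresponding to 278,000 nucleotides".
ENV_LENGTH_NT = 278_000 / 109  # about 2,550 nt


def per_10k_to_fraction(errors_per_10k: float) -> float:
    """Convert errors per 10,000 nucleotides into a per-nucleotide rate."""
    return errors_per_10k / 10_000


def expected_errors_per_sequence(rate: float, length_nt: float = ENV_LENGTH_NT) -> float:
    """Expected number of assay-related errors in one amplicon."""
    return rate * length_nt


def inferred_rate(differences: int, n_sequences: int,
                  length_nt: float = ENV_LENGTH_NT) -> float:
    """Upper-bound error rate if every observed difference were an assay error."""
    return differences / (n_sequences * length_nt)


t7_rate = per_10k_to_fraction(3.4)     # T7 transcripts: 0.034%
pol2_rate = per_10k_to_fraction(0.68)  # Pol II-derived vRNA: 0.0068%

# At the stated ceiling of 8e-5, roughly one env sequence in five
# is expected to carry a single misincorporation.
errors_per_env = expected_errors_per_sequence(8e-5)

# Five differences among 52 early-infection env sequences.
early_infection_rate = inferred_rate(differences=5, n_sequences=52)

print(f"T7 rate: {t7_rate:.4%}, Pol II rate: {pol2_rate:.4%}")
print(f"Expected errors per env at 8e-5: {errors_per_env:.2f}")
print(f"Inferred rate from 5 differences in 52 sequences: {early_infection_rate:.2e}")
```

Treating all five observed differences as assay errors gives an upper bound, since some of them may reflect genuine in vivo variation; this is why the text reports the inferred rate as an inequality.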
A second objective of this study was to evaluate in a field trial setting the ability of SGA-direct sequencing strategies to decipher transmitted clade C (or other non-clade B) viruses and their early evolution in a time frame typical of vaccine trial follow-up schedules (every 3 months). In a companion study of acute and early subtype B infections (B. F. Keele, unpublished data), we studied 51 subjects in Fiebig stages I/II and 26 subjects in Fiebig stages III/IV; with such early sampling, we found that we could infer transmitted or early founder env sequences in most patients, including those infected by more than one virus. A mathematical model of early HIV-1 replication and diversification described in that study provided the theoretical basis for identifying transmitted or early founder viral genomes. Here, we were less certain whether this approach would be applicable since patients were sampled less frequently, samples were obtained from the majority of subjects (9/12) weeks to months after infection (Fiebig stage V or VI), and the genetic subtypes analyzed were non-clade B. Nonetheless, we show here for three subjects (ZM249M, ZM247F, and ZM246F) studied prior to seroconversion that the phylogenetic trees and Highlighter analyses allow for an unambiguous identification of the transmitted or early founder virus(es). For six homogeneous-transmission cases studied later in the infection process (ZM178F, ZM180M, ZM184F, ZM206F, ZM231F, and ZM235), the env sequences also coalesced in a time frame consistent with transmitted or early founder viruses (Fig. , Table ). However, this was not the case for individuals who were infected by more than one virus and were sampled for the first time at later time points (e.g., Fiebig stages V/VI); in these instances, identification of the transmitted viruses was precluded by more-extensive nucleotide substitutions, as well as in vivo recombination.
This limitation notwithstanding, our findings for primary clade C infections mirror data obtained for primary clade B infections (B. F. Keele, unpublished data): sequences of transmitted or early founder env genes can be readily inferred from SGA-derived sequences if subjects are sampled sufficiently early (Fiebig stages I to IV) and, in some cases, also at later time points (Fiebig stages V to VI) but only if the infection was initiated by a single virus.
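The idea of sequences coalescing to a single consensus, with differences then displayed per sequence as in a Highlighter plot, can be illustrated with a small sketch. It is a simplified illustration only and is not the Highlighter tool or the mathematical model used in the companion study: it assumes pre-aligned sequences of equal length, uses a simple majority rule, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned: list[str]) -> str:
    """Return the most common base at each column of an alignment.

    Ties are resolved in favor of the base seen first in that column.
    """
    if not aligned:
        raise ValueError("alignment is empty")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def mismatches(seq: str, reference: str) -> list[tuple[int, str, str]]:
    """List (1-based position, reference base, observed base) for each difference."""
    return [
        (pos, ref, base)
        for pos, (ref, base) in enumerate(zip(reference, seq), start=1)
        if ref != base
    ]


# Toy alignment (hypothetical data): most sequences are identical,
# and two carry a single substitution each.
toy_alignment = [
    "ACGTACGT",
    "ACGTACGT",
    "ACGAACGT",
    "ACGTACGT",
    "GCGTACGT",
]

consensus = majority_consensus(toy_alignment)
for index, seq in enumerate(toy_alignment, start=1):
    print(index, mismatches(seq, consensus))
```

A majority rule of this kind is only meaningful when the sequences share a single founder; in infections initiated by more than one virus, each lineage would need its own consensus.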
A third study objective was to determine if insights into selection pressures on virus replication could be inferred from SGA-derived sequences from single time points distant from the transmission event. We show two examples of this. In Fig. , the results of Highlighter analysis of 24
env sequences from a subject at Fiebig stage V (ZM180M) are shown, illustrating a heavy concentration of nucleotide substitutions in the region of
env that overlaps the second exon of
rev. The actual nucleotide substitutions are shown in the middle panel. Each of the 24 sequences was found to contain one or more of seven different mutations when compared to the consensus sequence. Because of the large number of different mutations, it was possible to infer the consensus sequence in this region and across the entire
env gene. Moreover, all of the nucleotide substitutions concentrated within this 9-codon stretch of the Rev open reading frame were nonsynonymous (Fig. , right panel). Statistical analysis ruled out the possibility that this cluster of mutations arose by chance, and the observation most likely reflects selection for sequences with amino acid differences. Although viably frozen lymphocytes were not available for cytotoxic T-lymphocyte studies, this subject's HLA profile was typed as A*2901, A*3002, B*1510, B*4201, Cw*0304, Cw*17(01-03). The Rev sequence under selection pressure is LAEPV
PLPLPPIERLNIGD, with the variable region underlined. There are several HLA-B42 and HLA-C motifs that overlap this region of interest, where potential second-position and C-terminal anchor motifs are indicated as follows: XPXXXXXXXXL (B*4201); XPXXXXXXL (B*4201); and XAXXXXXXL (Cw*1701, Cw*1702, and Cw*0304). The potential B*4201 epitopes are embedded directly in the region that is the focus of the mutations, while the potential HLA-C epitope is slightly offset. Individuals who carry Cw*03 respond to the peptide spanning this region more often than individuals who lack Cw*03, suggesting that a Cw*03 epitope is present and recognized in many subtype C infections (B. T. Korber, unpublished data). A similar pattern of mutations in a Rev 9-mer was recently identified in a subtype B-infected subject, and in this individual, HLA-restricted cytotoxic T-lymphocyte reactivity was confirmed by enzyme-linked immunospot assay and gamma interferon induction (G. M. Shaw and P. Borrow, unpublished data). In the sample from subject ZM206F, obtained in Fiebig stage VI, there was equally strong evidence of selection, again within a 9-amino-acid fragment but this time within the variable loop 1 (V1) region of Env (Fig. ). Remarkably, 33 out of 35 sequences had 1 or more of 16 different point mutations within this region, while the other 2 sequences had deletions. This again allowed for the identification of a consensus sequence that likely corresponds to the transmitted or early founder sequence. As a result, every sequence sampled encoded an amino acid sequence that differed from the consensus, and again, the likelihood of such a concentration of mutations occurring by chance was estimated to be extremely low. The HLA profile of this subject was A*0202, A*2301; B*1510, B*180101; Cw*0501, Cw*1601, and the region of extreme selection is GSS
KANDNNVNITSD. There are no obvious anchor motifs for the relevant HLAs in this sequence, although KANDNNVNI could fit an A*0202 binding pocket (P. Goulder, personal communication). Alternatively, the observed cluster of V1 mutations could be the result of neutralizing-antibody escape (34). Taken together, these results indicate that molecular patterns of virus adaptation can be inferred even in samples obtained several months after transmission from subjects for whom earlier specimens are not available for comparison.
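The overlap between the anchor motifs listed above and the Rev sequence under selection can be checked with a simple scan. This is a minimal sketch under stated assumptions: the Rev fragment is taken as the contiguous string LAEPVPLPLPPIERLNIGD, "X" is treated as a wildcard for any residue, and the function name is hypothetical. Matching anchor residues shows only where a motif could fit; it is not an epitope prediction.

```python
# Rev fragment quoted in the text, treated here as one contiguous string.
REV_REGION = "LAEPVPLPLPPIERLNIGD"

# Anchor motifs quoted in the text; "X" stands for any residue.
MOTIFS = {
    "B*4201 (11-mer)": "XPXXXXXXXXL",
    "B*4201 (9-mer)": "XPXXXXXXL",
    "Cw*1701/Cw*1702/Cw*0304 (9-mer)": "XAXXXXXXL",
}


def find_motif(sequence: str, motif: str) -> list[tuple[int, str]]:
    """Return (1-based start, matching peptide) for every window fitting the motif.

    Overlapping windows are all examined, so nested candidates are not missed.
    """
    width = len(motif)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(m == "X" or m == residue for m, residue in zip(motif, window)):
            hits.append((start + 1, window))
    return hits


for allele, motif in MOTIFS.items():
    print(allele, find_motif(REV_REGION, motif))
```

With the sequence as given, each motif matches exactly once: the two B*4201 candidates (VPLPLPPIERL and LPLPPIERL) end at the same leucine, and the HLA-C candidate (LAEPVPLPL) begins at the first residue of the fragment, consistent with the description of the HLA-C epitope as slightly offset.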
The SGA-direct sequencing approach is ideally suited to the evaluation of genetic linkages, as described by Palmer et al. for the analysis of drug resistance mutations in the
pro-pol genes (27). We sought to determine if SGA-direct sequencing might reveal
env gene recombination in subjects acutely infected by more than one virus and then to compare recombination frequencies between SGA and bulk amplification methods. Figure illustrates multiple examples of viral recombination in the two subjects at Fiebig stage VI, ZM229M and ZM215F. Interestingly, in subject ZM215F, the recombination involved not only two principal transmitted virus lineages but also additional sequences not otherwise represented in the sequence set. Exhaustive phylogenetic analyses indicated that subjects ZM215F and ZM229M had each been infected by four or more viruses (Table ). Thus, in these two heterogeneous infections, viral diversification was accelerated by extensive recombination.
Although viral recombination assessed by SGA methods can be complicated and nearly indecipherable in multiply infected individuals at later time points, this problem is magnified if bulk PCR methods are used. In Fig. , we show results for a subject at Fiebig stage II who was infected by two variants. Analysis of 44 SGA-derived
env sequences reveals no evidence of recombination, but 8 of 34 bulk PCR-derived sequences are mosaic, each exhibiting different breakpoint patterns. Artifactual
Taq-mediated template switching was also demonstrated in an example of a cross-contaminated plasma specimen from two subjects (ZM246F and ZM246M) who were infected by unrelated viruses (Fig. ). In the contaminated specimen, the SGA method clearly distinguished a single ZM246F lineage from two ZM246M lineages with no recombinant viruses among them. Conversely, the bulk method generated two mosaic sequences in vitro that did not exist in vivo. From the results of our analyses and those of other reports (27, 43-45, 54), we conclude that SGA methods do not generate in vitro recombinants, whereas bulk methods commonly do. Bulk amplification-cloning-sequencing strategies are also susceptible to
Taq-induced nucleotide misincorporation, template resampling, and cloning bias. These limitations may not be problematic for certain applications. However, if a goal is to obtain sequences of appreciable length that correspond to HIV-1 genomes that exist in vivo, then SGA-direct sequencing approaches have distinct advantages.
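How a mosaic sequence can be recognized when two parental variants are known may be sketched as follows. This is a deliberately simplified illustration and not the phylogenetic or Highlighter analysis used in this study: it assumes aligned sequences of equal length, considers only sites where the two parents differ, and all names and toy sequences are hypothetical.

```python
def parental_pattern(query: str, parent_a: str, parent_b: str) -> str:
    """Label each informative site of the query as matching parent A or B.

    Sites where the parents agree are uninformative and skipped; sites where
    the query matches neither parent (possible point mutations) are also skipped.
    """
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    for q, a, b in zip(query, parent_a, parent_b):
        if a == b:
            continue
        if q == a:
            labels.append("A")
        elif q == b:
            labels.append("B")
    return "".join(labels)


def count_breakpoints(pattern: str) -> int:
    """Count switches between parental labels along the sequence."""
    return sum(1 for prev, cur in zip(pattern, pattern[1:]) if prev != cur)


# Toy data (hypothetical): the parents differ at four positions.
parent_a = "ACGTACGTAC"
parent_b = "TCGAACCTAG"
pure_a = "ACGTACGTAC"
mosaic = "ACGTACCTAG"  # first half resembles parent A, second half parent B

for name, seq in [("pure_a", pure_a), ("mosaic", mosaic)]:
    pattern = parental_pattern(seq, parent_a, parent_b)
    print(name, pattern, count_breakpoints(pattern))
```

A single isolated switch in such a pattern could also arise from a point mutation at an informative site, so in practice a call of recombination would rest on runs of several consecutive informative sites rather than on one site alone.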
The results of the present study, together with those of Palmer et al. (27) and B. F. Keele (unpublished data), illustrate new scientific avenues for deciphering HIV-1 transmission and patterns of early virus diversification. Based on the data reported here, it is likely that these new approaches will help to clarify the genetic and biological complexity of viruses transmitted by different routes and under various clinical circumstances, all factors that may be important in the design and assessment of candidate vaccines, antiretroviral drugs, or microbicides. In samples from clinical trial participants, it may be possible to use SGA-based methods to generate not only transmitted and evolved
env genes but all HIV-1 genes of interest. Recently, we have shown for seven clade B- or C-infected subjects that complete (9 kb) HIV-1 genomes corresponding to the transmitted or early founder virus can be identified by SGA-direct sequencing methods (39). Such approaches may be useful in mapping linked mutations conferring escape from cellular and humoral immune responses in naïve or vaccinated individuals.