Protease and RT subtypes of completely sequenced HIV-1 genomes
We analyzed complete HIV-1 genome sequences from 117 persons. On the basis of previously published analyses, 77 of these sequences belonged to 1 subtype throughout their entire genome, and 40 sequences were recombinants. Of the 77 sequences of 1 subtype, 10 were subtype A, 29 were subtype B, 15 were subtype C, 6 were subtype D, 6 were subtype F (4 subsubtype F1 and 2 subsubtype F2), 4 were subtype G, 3 were subtype H, 2 were subtype J, and 2 were subtype K isolates. The 40 recombinant isolates included 29 that contained regions belonging to 2 subtypes and 11 with regions belonging to ≥3 subtypes.
To determine whether the protease and RT gene sequences clustered into groups similar to those of the complete genomes, we created phylogenetic trees by using the complete protease gene and the first 300 codons (the 5′ polymerase coding region) of the RT gene. Figure 1 shows neighbor-joining trees constructed by using the HKY85+Γ nucleotide substitution model; nearly identical trees were obtained by use of maximum parsimony and maximum likelihood methods. The complete genome subtypes of the sequences are indicated, and bootstrap values are shown for the ancestral nodes of each subtype. Highly mosaic variants and incidental viral variants that did not fit into the revised classification were not included in the bootstrap analysis.
Figure 1. Neighbor-joining trees constructed from reverse-transcriptase and protease sequences from 117 completely sequenced human immunodeficiency virus type 1 genomes. Trees were constructed with the HKY85 nucleotide-substitution model, with a Γ distribution …
On the basis of the clustering pattern in the neighbor-joining trees, we assigned protease and RT subtypes to 103 of the 117 published sequences. For the nonrecombinant viruses, the subtypes assigned to the protease and RT were, in all cases, the same as the subtype of the complete genome. For the 2-subtype recombinants, the subtypes assigned to the protease and RT were, in all cases, 1 of the 2 subtypes present in the complete genome. For 14 of 26 2-subtype recombinants (11 CRF01_AE, 2 AC, and 1 AD), both the protease and RT had the same subtype as gag. For 6 of 26 2-subtype recombinants (6 CRF02_AG), the protease and RT were mosaic but closest to G in the protease and in the first 300 codons of RT. For 4 of 26 2-subtype recombinants, the protease had the same subtype as gag and the RT had the same subtype as env. In 1 recombinant, the RT had the same subtype as gag and the protease had the same subtype as env. In another recombinant, both the protease and RT had the same subtype as env.
Distribution of isolates based on neighbor-joining phylogenetic analysis
The branching patterns of the protease and RT trees were similar, but the bootstrap values were lower in the protease tree. In both trees, there was a starlike branching structure, with the highest bootstrap values at the ancestral nodes for each subtype and at the node for the B-D ancestor. In the RT tree, the bootstrap values at the nodes of each of the recognized subtypes were as follows: A (99%), B (99%), C (100%), D (94%), F1 (90%), F2 (99%), G (70%), H (100%), J (100%), and K (100%). In the protease tree, bootstrap values at the ancestral nodes were, in all but 1 case, lower than those in the RT tree: A (48%), B (42%), C (62%), D (22%), G (85%), H (97%), J (100%), and K (74%). In the protease tree, sequences of F1 and F2 isolates did not form a clade; their most recent common ancestor was shared with isolates belonging to subtypes B, D, and J. The higher bootstrap values in the RT tree were probably due to the greater number of phylogenetically informative sites in the RT (364) than in the protease (103). The proportion of informative sites was also higher in the RT (364 [40.4%] of 900) than in the protease (103 [34.7%] of 297; P = .03).
Each CRF formed clades; however, only CRF01_AE and CRF02_AG sequences formed lineages that were independent of other subtypes in both trees. In contrast, CRF03_AB protease sequences were within the subtype A clade, and CRF03_AB RT sequences were within the subtype B clade. CRF04_cpx RT sequences clustered with an incidental variant. CRF05_DF protease sequences were within the subtype D clade. CRF06_cpx protease and RT sequences clustered with incidental variants.
To determine the extent to which differences between and within subtypes were due to amino acid substitutions or to silent nucleotide substitutions, we measured the synonymous and nonsynonymous nucleotide distances between the sequenced isolates. About 21% of the RT nucleotides and 23% of the protease nucleotides had the potential for synonymous substitutions. In the RT, the mean intersubtype ratio of synonymous to nonsynonymous substitutions (DS:DN) was 7.1, and the mean intrasubtype DS:DN was 11.0. In the protease, the mean intersubtype DS:DN was 5.1, and the mean intrasubtype DS:DN was 6.8.
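The proportion of nucleotides with "the potential for synonymous substitutions" can be illustrated with a codon-by-codon count of synonymous sites. The sketch below is a minimal illustration, not the procedure used in this analysis: it assumes the standard genetic code and a Nei-Gojobori-style count in which each codon position contributes the fraction of its 3 possible single-base changes that leave the amino acid unchanged, with changes to stop codons counted as nonsynonymous. The function names `synonymous_sites` and `synonymous_fraction` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Number of synonymous sites (0 to 3) in one codon.

    Each position contributes (synonymous single-base changes) / 3.
    Assumption: a change to a stop codon is counted as nonsynonymous.
    """
    amino_acid = CODON_TABLE[codon]
    total = 0.0
    for pos in range(3):
        synonymous_changes = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == amino_acid:
                synonymous_changes += 1
        total += synonymous_changes / 3
    return total


def synonymous_fraction(sequence):
    """Fraction of nucleotide sites in an in-frame sequence that are synonymous.

    Codons containing characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    codons = [c for c in codons if c in CODON_TABLE]
    if not codons:
        raise ValueError("no complete unambiguous codons in sequence")
    return sum(synonymous_sites(c) for c in codons) / (3 * len(codons))
```

Under this counting scheme, a DS:DN ratio is the synonymous distance (synonymous differences per synonymous site) divided by the nonsynonymous distance (nonsynonymous differences per nonsynonymous site), so a ratio of 7.1 means that synonymous sites differed about 7 times as often as nonsynonymous sites.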
Amino acid patterns in non-subtype B isolates
Although most variation between subtypes was caused by synonymous nucleotide substitutions, there also were significant differences in the amino acid patterns between isolates belonging to different subtypes. To further characterize this variation, we searched GenBank for additional published sequences of non-B protease and RT group M isolates from untreated persons. We found 236 protease and 139 RT isolates that were not included in the original set of 117 completely sequenced isolates, for a total of 353 protease and 256 RT sequences.
The 353 protease sequences included 93 subtype A, 48 subtype C, 35 subtype D, 31 subtype F1/F2, 11 subtype G, 3 subtype H, 6 subtype J, 2 subtype K, 29 CRF01_AE, 86 CRF02_AG, and 9 sequences that belonged to CRF03-CRF06 or were incidental variants. Among these sequences, there were 14 positions at which ≥1 non-B consensus sequence differed from the consensus B sequence (positions 12-15, 19, 20, 35, 36, 41, 57, 69, 82, 89, and 93) [19]. Each of these 14 nonconsensus positions, however, was at least somewhat polymorphic among subtype B isolates (ranging from 1% at positions 82 and 89 to 21% at position 35).
Figure 2 illustrates the protease amino acid variation in those subtypes for which the most sequences were available. In this figure, the amino acid variation within each subtype is compared with the variation within 991 subtype B isolates from untreated persons. Among positions associated with drug resistance, variants at positions 10, 20, 36, and 82 were more common in non-B isolates, whereas variants at positions 63, 77, and 93 were more common in subtype B isolates. Variants at other positions associated with drug resistance were uncommon. There were 3 variants at position 46 (2 subtype A and 1 subtype F) and 1 at position 30 (subtype C).
Figure 2. Comparison of non-subtype B protease sequences with subtype B consensus sequence. White circles, percentage of subtype B sequences (n = 991) that differ from consensus B sequence; gray circles, percentage of non-B sequences that differ from subtype B …
Variation from subtype B at positions 20 (47% vs. 2%, P < .001) and 36 (94% vs. 14%, P < .001) was more common in each non-B subtype than in subtype B isolates. A variant in the substrate cleft, V82I, was found in 8 (73%) of 11 subtype G sequences, compared with 8 (1%) of 991 subtype B sequences (P < .001). Variants at position 10 occurred in 14% of non-B isolates and in 10% of subtype B isolates (P = .04) but not in the consensus sequence of any of the subtypes.
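The comparison of V82I frequencies (8 of 11 subtype G sequences vs. 8 of 991 subtype B sequences) can be checked with a test for a 2 × 2 table. The text does not state which test produced the reported P values, so the sketch below assumes a two-sided Fisher's exact test, chosen here because one of the groups is very small. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups; columns are (variant present, variant absent).
    The P value sums the hypergeometric probabilities of all tables with
    the same margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Probability that x of the col1 variant sequences fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        probability(x)
        for x in range(lowest, highest + 1)
        if probability(x) <= observed * tolerance
    )


# V82I: 8 of 11 subtype G sequences vs. 8 of 991 subtype B sequences.
p_v82i = fisher_exact_two_sided(8, 11 - 8, 8, 991 - 8)
print(f"V82I, subtype G vs. subtype B: P = {p_v82i:.2e}")
```

Under these assumptions the P value falls well below .001, which agrees with the reported result for V82I.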
Variants at position 63 (L has long been considered to be the consensus, even though P is now reported more frequently) were more common in subtype B (68%) than in non-B isolates (33%; P < .001). V77I occurred in 24% of subtype B isolates and in 5% of non-B isolates (P < .001). I93L was more common in subtype C than in subtype B isolates (94% vs. 24%, P < .001) but was more common in subtype B than in the remaining non-B isolates.
The 256 RT sequences included 27 subtype A, 38 subtype C, 21 subtype D, 33 subtype F1/F2, 16 subtype G, 3 subtype H, 6 subtype J, 2 subtype K, 29 CRF01_AE, 72 CRF02_AG, and 9 sequences that belonged to CRF03-CRF06 or were incidental variants. Among these sequences, there were 26 positions at which ≥1 non-B consensus sequence differed from the consensus B sequence (positions 11, 35, 36, 39, 40, 48, 49, 60, 122, 135, 162, 173, 174, 177, 178, 200, 207, 211, 245, 248, 272, 277, 286, and 291-293). Each of the nonconsensus positions was at least somewhat polymorphic in subtype B isolates (from 1% at positions 36, 40, and 48 to >40% at positions 122 and 211). None of the 26 nonconsensus positions has been associated with RT inhibitor drug resistance.
HIV-1 protease and RT subtypes of isolates from northern California
Between July 1997 and June 2000, the SUH Diagnostic Virology Laboratory sequenced HIV-1 protease and RT genes from 2246 patients. For 801 patients, treatment histories were known. Of these 801, 556 had received RT inhibitors and 403 had received protease inhibitors. To assess how well these sequences clustered with sequences of known subtype, neighbor-joining trees were constructed by using the SUH sequences and the 103 previously published noncomplex sequences. In the RT, sequences from 2233 (99.42%) of 2246 persons clustered with subtype B, 9 (0.40%) with subtype C, 3 (0.13%) with subtype A, and 1 (0.04%) with subtype D. The protease sequences of isolates from subjects with non-B RT sequences were of the same subtype as the RT sequence.
We also determined the drug resistance-adjusted nucleotide distance of the 103 published sequences and the 2246 northern California sequences by comparing them with a set of reference isolates and disregarding positions associated with drug resistance [1]. In all cases, the subtype of the most similar reference sequence was the same as the subtype with which the sequence clustered in the neighbor-joining tree. For both the protease and RT, the mean adjusted nucleotide distance between the northern California sequences and the reference sequence of the same subtype was between 3.9% and 6.6%, with the highest intrasubtype distances occurring within subtype A. The mean adjusted nucleotide distance between the northern California sequences and the reference sequences of different subtypes (subtypes A-D) was 6.5%-10.9%; the lowest intersubtype distances occurred between subtypes B and D.
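The adjusted-distance comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method [1]: it assumes that sequences are already aligned in frame to the references, that the distance is the simple proportion of differing nucleotides, and that "disregarding positions associated with drug resistance" means skipping every nucleotide of the listed codons. The function names, the toy sequences, and the masked codon are hypothetical.

```python
def adjusted_distance(sequence, reference, masked_codons=frozenset()):
    """Proportion of differing nucleotides, ignoring masked codons.

    masked_codons holds 1-based codon numbers (for example, positions
    associated with drug resistance). Sites at which either sequence has
    a gap or an ambiguous base are also skipped.
    """
    if len(sequence) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for index, (a, b) in enumerate(zip(sequence.upper(), reference.upper())):
        codon_number = index // 3 + 1
        if codon_number in masked_codons:
            continue
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def closest_subtype(sequence, references, masked_codons=frozenset()):
    """Return (subtype, distance) for the most similar reference sequence.

    references maps a subtype label to a list of aligned reference
    sequences. Ties are broken by subtype label.
    """
    distance, subtype = min(
        (adjusted_distance(sequence, ref, masked_codons), label)
        for label, refs in references.items()
        for ref in refs
    )
    return subtype, distance


# Toy example with made-up 3-codon sequences; codon 2 is masked.
toy_references = {"B": ["ATGCCCGGG"], "C": ["ATGCCTGGA"]}
toy_result = closest_subtype("ATGCCTGGG", toy_references, masked_codons={2})
print(toy_result)
```

In the toy example, the query differs from the subtype B reference only at the masked codon, so the adjusted distance to B is 0 and the query is assigned to subtype B even though its unadjusted distance to both references is the same.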