A molecular epidemiological survey of human immunodeficiency virus type 1 (HIV-1) conducted in Cyprus in 1994 revealed a surprising degree of genetic variation among locally circulating strains (14). Kostrikis and colleagues screened 25 HIV-1-infected individuals attending an outpatient clinic in Nicosia by performing phylogenetic analyses of C2-V3 env gene sequences amplified from their uncultured peripheral blood mononuclear cells (PBMCs). This identified representatives of several different HIV-1 group M clades, including subtype A, C, and F strains, that are not commonly found in European populations. Moreover, both members of a heterosexual couple with a history of intravenous drug use and documented travel outside of Cyprus were found to be infected with similar viruses that could not be assigned to any of the previously defined HIV-1 subtypes. These viruses formed an independent lineage roughly equidistant from all other group M subtypes, and so it was proposed to classify them as members of a new clade, termed subtype I (14).
At about the same time as this initial description of subtype I, it was realized that numerous HIV-1 strains are mosaics of sequences from more than one clade (18, 19). Subsequent confirmation of the widespread occurrence of such hybrid viruses, often with evidence of multiple recombination crossovers along the genome (3, 4, 6, 9, 10, 17, 23), indicated that classification and definition of new subtypes should be based on complete genomic sequences (9, 16, 17). This is particularly important for viruses originating from geographic regions where multiple subtypes cocirculate since these have a high probability of being recombinant. To characterize subtype I in greater detail, we thus cloned a full-length provirus from a short-term-cultured, primary isolate established from one of the two individuals (HO32) infected with this subtype (14). Using primers corresponding to the tRNA primer binding site (5′-TCTCTacgcgtGGCGCCCGAACAGGGAC-3′, lowercase letters indicate an MluI site) and the polyadenylation signal in the 3′ long terminal repeat (LTR) (5′-ACCAGacgcgtACAACAGACGGGCACACACTACTT-3′), we used long-range PCR to amplify nearly full-length genomic fragments (9, 10, 21) that contained all coding and regulatory regions except for 102 bp of 5′ unique LTR sequences (U5). Amplification products were subcloned into a plasmid vector and mapped by restriction enzyme digestion. One clone, termed 94CY032.3, was selected for further analysis. A 694-bp fragment spanning the remainder of the LTR was amplified separately using a seminested approach (10).
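The primer design described above can be checked with a few lines of code. The following is a minimal sketch, not part of the original protocol: it takes the two primer sequences exactly as printed and confirms that the lowercase portion of each is a single MluI recognition site (ACGCGT). The dictionary keys and function names are hypothetical labels chosen for illustration.

```python
import re

MLUI_SITE = "ACGCGT"  # MluI recognition sequence

# Primer sequences as printed in the text; the dictionary keys are
# hypothetical labels, not names used by the authors.
PRIMERS = {
    "pbs_forward": "TCTCTacgcgtGGCGCCCGAACAGGGAC",
    "polyA_reverse": "ACCAGacgcgtACAACAGACGGGCACACACTACTT",
}


def lowercase_runs(primer):
    """Return (start_index, text) for every run of lowercase letters."""
    return [(m.start(), m.group()) for m in re.finditer(r"[a-z]+", primer)]


def has_single_mlui_tag(primer):
    """True if the primer has exactly one lowercase run and it spells the MluI site."""
    runs = lowercase_runs(primer)
    return len(runs) == 1 and runs[0][1].upper() == MLUI_SITE


for name, seq in PRIMERS.items():
    print(name, lowercase_runs(seq), has_single_mlui_tag(seq))
```

In both primers the site follows five 5′ flanking nucleotides, which is why the lowercase run starts at index 5.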
The complete sequence of 94CY032.3 was determined by the primer-walking approach. Examination of potential coding regions revealed the expected reading frames for gag, pol, vif, vpr, tat, rev, vpu, env, and nef (data not shown). None of the genes contained major deletions, insertions, or rearrangements. However, both env and vif genes contained single in-frame stop codons. There was also a single-base-pair insertion at position 5199 which caused a frameshift and altered six amino acid residues at the C terminus of the Vpr protein. All other protein domains of known function as well as major regulatory sequences, including the primer binding site, the packaging signal, and major splice sites, appeared to be intact. Similarly, the number, position, and consensus sequences of promoter and enhancer elements in the 94CY032.3 LTR were indistinguishable from those of most other HIV-1 strains, except for the presence of an unusual TATA sequence (TAAAA), thus far only found in subtype E (A/E) viruses from Thailand and the Central African Republic (4, 10).
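An in-frame stop codon of the kind noted for env and vif can be located by reading a gene in triplets from its start codon. The sketch below is illustrative only: the function name and the toy sequences are assumptions, and no 94CY032.3 sequence data are reproduced here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(orf):
    """Return 0-based codon indices of in-frame stop codons that occur
    before the final codon of the reading frame."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is the expected terminator, so it is not reported.
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


# Toy reading frames (hypothetical, for illustration only).
intact = "ATGAAAGGGTAA"        # ATG AAA GGG TAA
interrupted = "ATGAAATAGGGGTAA"  # ATG AAA TAG GGG TAA

print(premature_stops(intact))
print(premature_stops(interrupted))
```

Only stops in the reading frame are reported; the out-of-frame TGA spanning the first two codons of both toy sequences is deliberately ignored.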
To compare 94CY032.3 to previously reported subtype I sequences, we constructed a phylogenetic tree from C2-V3 sequences, including representatives of all 10 known group M subtypes (Fig. ). As expected, 94CY032.3 clustered most closely with CYHO321 and CYHO322, sequences amplified from uncultured PBMC DNA of the same individual (HO32) from whom the 94CY032 isolate was derived. 94CY032.3 also clustered very closely with CYHO311, a sequence derived from the sexual partner of HO32 (14), strongly suggesting that the two infections were epidemiologically linked. Finally, as observed in the past (14), all subtype I sequences clustered independently, forming a distinct lineage roughly equidistant from all other subtypes, including subtype J (15). These findings thus confirmed the authenticity of the 94CY032.3 clone and validated it as a representative of subtype I in the C2-V3 region of the viral envelope.
To characterize the remainder of the 94CY032.3 genome, we next performed pairwise sequence comparisons with recently reported nonmosaic reference sequences for subtypes A through H (9, 16) as well as selected intersubtype recombinants (17). We have used this approach in the past to screen newly derived sequences for regions of unusual sequence similarity or dissimilarity that might indicate recombination (9, 10). Briefly, 94CY032.3 was added to a multiple genome alignment (available upon request) which included a total of 28 sequences from the database (13), representing subtypes A (U455 and 92UG037.1), B (LAI, RF, OYI, MN, and SF2), C (C2220 and 92BR025.8), D (NDK, Z2Z3, ELI, 84ZR085.1, and 94UG114.1), F (93BR020.1), and H (90CF056.1) as well as A/C (ZAM184 and 92RW009.6), A/G (92NG083.2, 92NG003.1, Z321, and IBNG), A/D (MAL), A/E (93TH253.3, CM240, and 90CF402.1), and B/F (93BR029.4) recombinants. Simian immunodeficiency virus SIVcpzGAB was included in the alignment to provide an outgroup. All sites with a gap in any of the sequences were removed from the alignment to ensure that all comparisons were made across the same sites. The percent nucleotide sequence diversity between 94CY032.3 and the other viruses was then calculated for sequence pairs by moving a window of 400 bp in steps of 10 bp along the genome. The resulting distance profiles for the various pairwise comparisons were very similar (data not shown), suggesting that 94CY032.3 was roughly equidistant from all other subtypes in most regions of its genome. However, careful inspection of the graphs revealed several small areas of disproportionate sequence similarity involving sequences from subtypes A and G. For example, at the 5′ ends of gag and vif and at the 5′ and 3′ ends of env, diversity plots indicated a relatively greater similarity of 94CY032.3 to 92UG037.1 and U455. Similarly, at the 3′ end of gag and the 3′ end of pol, a relatively greater similarity of 94CY032.3 to 92NG083.2 was noticed. Together, these results suggested that 94CY032.3 contained subtype A- and G-like segments, in addition to regions that appeared to be equidistant from the other subtypes. Note that in the absence of a nonmosaic, full-length subtype G genome (9, 16), we used 92NG083.2 as a subtype G reference, although it contains a small A segment in the vif/vpr region.
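The gap-stripping and sliding-window steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: diversity is computed as the uncorrected percentage of mismatched sites (the text does not specify the distance measure), and the function names and toy alignment are hypothetical.

```python
def strip_gap_columns(alignment):
    """Drop every alignment column in which any sequence has a gap ('-').

    `alignment` maps sequence names to equal-length aligned strings.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if "-" not in col]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


def window_diversity(query, ref, window=400, step=10):
    """Percent mismatched sites between two aligned sequences per window.

    Returns a list of (window_start, percent_difference) tuples.
    Assumption: uncorrected mismatch percentage, no substitution model.
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(query) - window + 1, step):
        pairs = zip(query[start:start + window], ref[start:start + window])
        mismatches = sum(a != b for a, b in pairs)
        profile.append((start, 100.0 * mismatches / window))
    return profile


# Toy alignment (hypothetical); a small window is used so it fits.
toy = {"q": "ACGT-ACGT", "r": "ACGTAACGA", "o": "AC-TAACGT"}
clean = strip_gap_columns(toy)
print(clean)
print(window_diversity(clean["q"], clean["r"], window=4, step=1))
```

With the default arguments the function uses the 400-bp window and 10-bp step given in the text; a dip in one profile relative to the others marks a region of disproportionate similarity.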
Relative differences in the extent of sequence similarity, as determined by diversity plots (9, 10) or other methods of distance measurement (24), are not always an indicator of recombination, but can reflect variations in the evolutionary rates of the lineages compared. To determine whether 94CY032.3 was truly mosaic, we thus performed an exploratory tree analysis, looking for significantly discordant phylogenetic positions for different parts of its genome (Fig. ). The multiple genome alignment described above was used, but only three representatives of subtypes B and D were included, and all known recombinants other than the near-full-length subtype G viruses 92NG083.2 and 92NG003.1 were excluded. We constructed unrooted trees for overlapping fragments of 400 bp, moved in 10-bp increments along the alignment. Inspection of the resulting topologies revealed that the branching order of 94CY032.3 changed a total of 10 times, with all of the discordant positions supported by significant bootstrap values. 94CY032.3 alternated between subtype A (Fig. , panels 201-600, 4241-4640, 5071-5470, and 6821-7220) and subtype G (Fig. , panels 1101-1500, 3841-4240, and 5471-5870), as well as an independent position (Fig. , panels 1751-2150, 4641-5040, 5901-6300, and 7901-8300). The 5901-6300 segment corresponds approximately to the C2-V3 region that has served as the basis for the definition of subtype I. It is thus most parsimonious to assume that all four non-subtype A, non-subtype G segments within 94CY032.3 share a common origin and represent subtype I. These analyses indicate that 94CY032.3 is a mosaic of sequences belonging to three different group M subtypes.
To map the boundaries of the putative A, G, and I segments, we performed bootstrap plot analyses as previously described (9, 10, 22), plotting the magnitude of the bootstrap values that supported the clustering of 94CY032.3 with 92UG037.1 (subtype A), as well as that of 94CY032.3 with 92NG083.2 (subtype G). The results of these analyses allowed us to map the approximate location and boundaries of the various subtype A and G segments along the 94CY032.3 genome (Fig. ). Bearing in mind the window size of 400 nucleotides and considering only peaks of significant bootstrap values (>80%), we identified two A/G crossovers around positions 1200 and 5600 and one G/A crossover around position 4100. The bootstrap plots also outlined regions with no peaks or peaks below 80%, which coincided with segments that clustered independently in the exploratory tree analysis and thus were likely of subtype I origin. Delineating the boundaries of these regions suggested five additional breakpoint positions: G/I at 1500, I/G at 3800, G/I at 6000, I/A at 6900, and A/I at 7200. Because full-length, nonmosaic reference sequences for the parental lineages (G and I) were not available, most of the breakpoints could not be mapped with certainty. However, the A/G breakpoints at positions 1200 and 5600 were confirmed by informative site analysis (data not shown). The recombinant nature of 92NG083.2 precluded reliable breakpoint analysis between positions 4200 and 4800 (9, 16) (highlighted in Fig. ).
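The thresholding logic behind the bootstrap plots can be illustrated with a short sketch. It assumes that per-window bootstrap values for clustering with the subtype A and subtype G references are already available from some tree-building step, which is not shown. The function names, the "I?" label for windows without significant support, and the toy values are all hypothetical and are not data from this study.

```python
def call_windows(boot_a, boot_g, threshold=80.0):
    """Label each window 'A', 'G', or 'I?' from two bootstrap profiles.

    A window is assigned to A or G only if the corresponding bootstrap
    value exceeds the threshold; otherwise it is left as 'I?', meaning
    no significant clustering with either reference.
    """
    labels = []
    for a, g in zip(boot_a, boot_g):
        if a > threshold and a >= g:
            labels.append("A")
        elif g > threshold:
            labels.append("G")
        else:
            labels.append("I?")
    return labels


def label_changes(labels, starts):
    """Return (window_start, previous_label, new_label) at each change."""
    return [
        (starts[i], labels[i - 1], labels[i])
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]


# Invented bootstrap values for six consecutive windows (illustration only).
boot_a = [95, 90, 40, 10, 5, 20]
boot_g = [3, 5, 50, 92, 88, 30]
starts = [0, 10, 20, 30, 40, 50]

labels = call_windows(boot_a, boot_g)
print(labels)
print(label_changes(labels, starts))
```

A change in window label only brackets a crossover to within roughly one window length, which is why the breakpoints in the text are given as approximate positions.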
To map the recombination breakpoints in this remaining region, we made use of four recently reported, partial but nonmosaic subtype G sequences from Mali which spanned the vif/vpr region and thus bridged the subtype A gap of 92NG083.2 (2). Figure A illustrates a set of distance plots that compare 94CY032.3 to one of these newly derived G sequences (95ML045) as well as representatives of subtypes A (U455), B (MN), and D (ELI). Consistent with the results from the exploratory tree analysis (Fig. ), 94CY032.3 was disproportionately more closely related to U455 in the 5′ and 3′ thirds of this fragment, suggesting the presence of subtype A-like segments. However, in the middle of the fragment, 94CY032.3 was clearly equidistant from U455 and the other subtypes, suggesting an independent position. Thus, noting the points at which the “A” distance increased and decreased relative to the other distances allowed us to map the two remaining breakpoints, one near position 4650 and the other near position 5000. Trees constructed from sequences surrounding these two breakpoints (Fig. B) confirmed that 94CY032.3 switched position from subtype A (panel 4255-4650) to subtype I (4651-5000) and back to subtype A (5001-5300). Note that the new subtype G sequences only cover the region between positions 4255 and 5300.
Figure summarizes the results from all phylogenetic analyses, depicting a schematic representation of the mosaic genome structure of 94CY032.3. Segments of different subtype origin are color coded, and there are a total of 10 recombination breakpoints between the 5′ end of gag and the 3′ end of nef. LTR sequences were not separately analyzed for mosaicism, but the discordant subtype assignments of the gag and nef regions necessitate at least one more breakpoint within either the viral LTR or the gag leader sequence. Given this extent of mosaic complexity, 94CY032.3 must be the result of multiple recombination events, either in the same or different individuals.
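The breakpoint positions reported in the preceding paragraphs can be assembled into a segment map and checked for internal consistency. The sketch below is illustrative: the positions are the approximate values stated in the text, each X/Y crossover is read as subtype X on the 5′ side and subtype Y on the 3′ side (an assumption consistent with the exploratory tree panels), and the genome ends are left open as `None` because their coordinates are not given here.

```python
# (approximate position, subtype 5' of breakpoint, subtype 3' of breakpoint)
BREAKPOINTS = [
    (1200, "A", "G"), (1500, "G", "I"), (3800, "I", "G"), (4100, "G", "A"),
    (4650, "A", "I"), (5000, "I", "A"), (5600, "A", "G"), (6000, "G", "I"),
    (6900, "I", "A"), (7200, "A", "I"),
]


def segments(breakpoints):
    """Convert (position, left, right) breakpoints into
    (start, end, subtype) segments; None marks an open genome end."""
    bps = sorted(breakpoints)
    # Each breakpoint's 3' subtype must match the next one's 5' subtype.
    for (pos, _, right), (next_pos, left, _) in zip(bps, bps[1:]):
        if right != left:
            raise ValueError(f"inconsistent subtypes between {pos} and {next_pos}")
    result = []
    start = None
    for pos, left, _ in bps:
        result.append((start, pos, left))
        start = pos
    result.append((start, None, bps[-1][2]))
    return result


for seg in segments(BREAKPOINTS):
    print(seg)
```

The 10 breakpoints yield 11 segments, beginning with subtype A in gag and ending with subtype I in nef, which matches the discordant gag and nef assignments noted above.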
Having identified several fragments of subtype I in 94CY032.3, we next determined whether there was any evidence for its presence in other full-length recombinants from the database. Two known mosaics, MAL (1, 19) and Z321 (5), were of particular interest, because previous analyses had indicated that these viruses contain regions of uncertain subtype assignment (17, 19). For example, MAL has long been known to represent a mosaic of subtypes A and D (color coded as red and blue, respectively, in Fig. A) but also contains a sizable pol fragment that has defied previous subtype classification (white in Fig. A) (17, 19). Similarly, Z321 is a known mosaic of subtypes A and G (red and green, respectively, in Fig. A) (5), but a recent reanalysis of its recombination breakpoints also identified regions that could not be assigned to any known subtype (17). To determine whether any of these regions represented subtype I, we again performed distance plot analysis, looking for dips in the diversity profiles of 94CY032.3 with MAL and Z321 as an indication of relatively greater sequence similarity. Indeed, two such dips were identified, one in the pol region of MAL and another in the vif/vpr region of Z321, both of which coincided with previously unclassified segments of their genomes (data not shown). Subsequent phylogenetic tree analysis confirmed that these regions were indeed of subtype I origin, since MAL and Z321 clustered with very high bootstrap values with the subtype I domains of 94CY032.3 (Fig. B and C). However, subtype I did not account for all of the unclassifiable regions in MAL and Z321 (17), and it thus remains unclear whether these represent still other, as yet unidentified, subtypes or regions of multiple breakpoints that cannot be mapped by using current methods.
In summary, we show here that a strain of HIV-1, proposed in 1995 as a prototypic subtype I isolate (14), in fact represents a complex mosaic comprising subtypes A, G, and I. We also show that two of the oldest known isolates from Africa, MAL isolated in 1984 (1) and Z321 isolated in 1976 (11, 25), contain short segments of sequence closely related to the subtype I domains of 94CY032.3. These findings support the following conclusions. (i) Although initially detected in Cyprus, subtype I must have existed in Africa as early as 1976; it is unknown whether full-length, nonmosaic representatives of subtype I still exist but have not yet been sampled or whether this subtype is represented only by fragments in present-day recombinants. (ii) The ancestry of 94CY032.3 has likely involved multiple recombination events; it remains unclear whether these occurred in Africa and/or in Cyprus, where a number of different subtypes have also been documented (14). (iii) Subtype I, along with subtypes A and G, must have diverged substantially earlier than the 1970s in order to be detectable as a distinct segment in the Z321 genome; this is consistent with the recent molecular characterization of a virus from 1959 which in phylogenetic analyses appears to have postdated the group M radiation (27). (iv) Finally, the finding of subtype I in several different recombinants, including one from an intravenous drug user (14), suggests that this subtype may be more widespread than previously thought, at least in the form of mosaic genome fragments. It will be interesting to screen additional viruses from drug user populations and their contacts in Cyprus and Greece to determine the current prevalence and geographic distribution of subtype I-containing viruses.
Nucleotide sequence accession numbers. The complete sequence of 94CY032.3 has been deposited at GenBank under accession nos. AF049337 and AF049338.