|Home | About | Journals | Submit | Contact Us | Français|
Known human immunodeficiency virus (HIV) transmission histories are invaluable models for investigating the evolutionary and transmission dynamics of the virus and to assess the accuracy of phylogenetic reconstructions. Here we have characterized an HIV-1 transmission chain consisting of nine infected patients, almost all of whom were treated with antiviral drugs at later stages of infection. Partial pol and env gp41 regions of the HIV genome were directly sequenced from plasma viral RNA for at least one sample from each patient. Phylogenetic analyses in pol using likelihood methods inferred an evolutionary history not fully compatible with the known transmission history. This could be attributed to parallel evolution of drug resistance mutations resulting in the incorrect clustering of multidrug-resistant virus. On the other hand, a fully compatible phylogenetic tree was reconstructed from the env sequences. We were able to identify and quantify the molecular footprint of drug-selective pressure in pol using maximum likelihood inference under different codon substitution models. An increased fixation rate of mutations in the HIV population of the multidrug-resistant patient was demonstrated using molecular clock modeling. We show that molecular evolutionary analyses, guided by a known transmission history, can reveal the presence of confounding factors like natural selection and caution should be taken when accurate descriptions of HIV evolution are required.
The evolution of human immunodeficiency virus (HIV) within and across hosts is known to be remarkably different (11, 31). Within hosts, the viral population is subjected to natural selection as a result of a continuous effort to evade the immune response. This process is frequently reflected in temporal phylogenetic structures showing the continual appearance and extinction of strains through time (11, 34). Across hosts, positive selective pressure does not seem to have an important impact. Instead, HIV genetic diversity is predominately shaped by spatial and temporal factors in the demographic history (11, 31). Understanding how intrahost evolution translates into HIV evolution at the population level is a key factor in determining how drug resistance and immune escape mutations might spread. Treating HIV-1-infected patients with highly active antiretroviral therapy (HAART) has led to a remarkable reduction in HIV-related morbidity and mortality (28). However, no antiretroviral is resistance proof, and when HIV replication is not fully suppressed, drug resistance inevitably appears. In countries where antiretrovirals are widely used, growing concern exists about the transmission of resistant viruses. Many efforts are currently being undertaken to monitor and control its spread. Accurately tracking HIV transmission in the population can provide useful data to investigate this issue. Only recently, an HIV transmission chain provided the proof of principle for transmission of drug resistance (38).
Well-characterized HIV transmission chains are much appreciated in a phylogenetic context. Assuming that the viral phylogeny should be consistent with the “true” transmission history, transmission chains allow assessment of the accuracy and reliability of phylogenetic reconstructions (19). Unlike simulation studies, which cannot fully capture the biological complexity of HIV evolution, transmission chains are informative on the performance of methods for real data. One of the most carefully studied examples involves a Swedish HIV-1 transmission chain consisting of nine individuals from whom 13 samples were obtained (19). Using sequences sampled in the gag and env genes, several tree reconstruction methods were tested for their ability to reconstruct the true transmission history. Except for one mother-to-child transmission event, the viral tree, reconstructed using maximum likelihood methods and realistic nucleotide substitution models, agreed with the known transmission history (19, 21). In addition, molecular clock analysis revealed that genetic divergence correlated well with the isolation dates of the samples and that significant genetic divergence, referred to as ancestral divergence, existed between the donor and recipient lineage at the time of transmission (17).
Although the characterization of a transmission chain of this scope—in terms of patients involved, separation times between samples, and sequence data obtained—is unique in the epidemic history of HIV, it has been taken as a strong argument for the accuracy of phylogenetic reconstructions (10, 19, 20). It remains, however, uncertain how robust evolutionary analyses are with respect to biological complications, like recombination and natural selection. Unfortunately, known transmission chains allowing us to address these issues are rare. Here, we present an HIV transmission cluster consisting of nine infected patients for whom the time and direction of each virus transmission were determined by in-depth patient interviews. Reconstructed phylogenies based on pol and env gp41 gene sequences were evaluated for their compatibility with the known transmission history. A particular clustering in the pol tree, topologically incongruent with the transmission history, is attributed to drug-selective pressure in a multidrug-resistant patient. Natural selection in the HIV transmission chain, and in the multidrug-resistant patient in particular, was investigated using codon substitution models; the impact on the evolutionary rate is demonstrated using molecular clock modeling.
The epidemiological relationships between nine HIV-1-infected patients attending the University Hospitals Leuven were established through in-depth interviews by physicians experienced in HIV care. A time interval for each transmission event was determined by the following contact tracing criteria (if available): (i) the patient reporting a high-risk contact, (ii) the patient's most recent negative HIV test, and (iii) a history of an acute viral syndrome (which indicates an infection in the range of several days up to 10 weeks in the past). Each patient provided written informed consent, and at least one blood sample was obtained between 1990 and 2002. Epidemiologically unrelated control sequences, having the same subtype as the strains constituting the transmission chain, were obtained from a local database (local controls) and from GenBank using BLAST (2). Control sequences with drug resistance mutations were retrieved from the HIV Drug Resistance Database (http://hivdb.stanford.edu/).
Plasma was isolated from the blood sample, and HIV RNA was extracted using a QIAamp Viral RNA Mini kit (Westburg, Leusden, The Netherlands). cDNA synthesis and PCR amplification of the pol gene region were performed using an in-house protocol (40). env gp41 cDNA synthesis and PCR amplification were performed as previously described (41). Direct sequencing of the purified nested PCR products was performed using the ABI PRISM BigDye Terminator v3.1 Ready Reaction Cycle Sequencing kit and analyzed on the ABI3100 genetic analyzer (Applera, Nieuwerkerk a/d Ijssel, The Netherlands). Sequence fragments of 1,069 bp for pol and 951 bp for env were assembled and analyzed using Sequence Analysis version 3.7 and SeqScape version 2.0 (Applera, Nieuwerkerk a/d Ijssel, The Netherlands).
Sequences were aligned using CLUSTAL X (39) and manually edited according to their codon-reading frame in Se-Al (http://evolve.zoo.ox.ac.uk). Regions that could not be unambiguously aligned in the env gp41 gene were deleted from the alignment. For a representation of the pol alignment, see the supplemental material. Hypermutation and recombination were investigated using Hypermut and Simplot v2.5, respectively (22, 33). Appropriate nucleotide substitution models were determined with Modeltest v3.06 (29). Maximum likelihood phylogenetic trees were reconstructed in PAUP* (v4b10) using three different heuristic branch-swapping algorithms (37). Bootstrapping was performed using the stepwise addition algorithm for 1,000 replicates. Maximum a posteriori trees were inferred using MrBayes (v3.0) (13). Synonymous and nonsynonymous trees were reconstructed using the neighbor-joining method based on distance matrices estimated in Syn-Scan, which uses a model that includes allelic mixtures (8). Different tree topologies were compared using the Kishino-Hasegawa test and the Shimodaira-Hasegawa test (16, 36).
Molecular adaptation at individual sites (26, 44), along specific lineages (45), and at individual sites along specific lineages (43) was investigated using codon substitution models as implemented in PAML (v3.13) (42). To test for diversifying selection at individual sites, different models were compared that allow for heterogeneous nonsynonymous/synonymous substitution rate ratios (ω = dN/dS) among sites (models M0, M1, M2, M3, M7, and M8; see reference 44). Likelihood ratio testing (LRT) was used to test whether allowing for sites with ω > 1 significantly improves the fit of the model to data (M1-M2, M0-M3, and M7-M8, where M2, M3, M7 can accommodate positively selected sites). To investigate different selective pressure along specific lineages, we tested a model that allows only a single ω for all branches in the tree (M0) against a two-ratio model that allows an additional ω for specific branches in the tree. Positively selected sites along the lineages of interest were identified using branch site models (43). Branch site model A is an extension of the neutral model (M1) because it allows for sites being positively selected along a prespecified lineage while belonging to the class with ω0 = 0 or ω1 = 1 in the background. In branch site model B, ω0 and ω1 are estimated as free parameters, and thus it is an extension of the discrete model (M3 with two discrete classes of sites).
Molecular clock analysis was performed using maximum likelihood methods implemented in PAML (v3.13b) (42). Evolutionary rates were estimated under the assumption of a constant rate of evolution for sequences that were serially sampled over time (single rate dated tip model [SRDT]) (30). The molecular clock was tested by comparing the SRDT model against an unconstrained different rates (DR) model using LRT (30).
The sequences described here have been submitted to the GenBank database and assigned accession numbers AF338984, AF338990, AF338992, AF338997, AF339013, AF339017, and AY749169 to AY749196 for the pol sequences and AY749197 to AY749208 for the env gp41 sequences.
In this study, we identified a heterosexual HIV-1 transmission chain consisting of nine individuals. The epidemiological information obtained by patient interviews and the clinical data, the time of sampling and the treatment history of the patients are summarized in Fig. Fig.1.1. The direction of transmission and a relatively narrow time interval were determined for all transmission events, with the exception of patients A and B. Although there was clearly a transmission event between both patients, they reported each other as the original donor and a time interval could not be defined. Since several viral phylogenies might have resulted from this transmission history, we did not attempt to reconstruct a single known phylogenetic tree. Instead, the scheme of the “true” transmission history was considered as a pathway along which the viral population is assumed to have evolved. We have further used the term “compatible” if the viral phylogeny could have been generated under the known transmission history. Similar to the Swedish transmission chain (19), this transmission history spans more than a decade of HIV-1 evolution; however, the transmission chain reported here is situated almost exactly one decade later than the Swedish transmission chain. Reflecting the progress of HIV research and treatment during this decade, almost all patients of our transmission chain have received antiretroviral therapy. We chose to obtain sequence data for the relatively conserved pol gene region, which is used to test for resistance against commonly available drugs, and the more variable env gp41 gene region, which is anticipated to be used for testing resistance against fusion inhibitors (41).
In a first analysis, we tested whether the sequences obtained from the transmission chain patients were more closely related to each other than to unrelated control sequences. This represents the general hypothesis test for molecular investigations in forensic settings (5, 18, 25, 27). Phylogenetic analysis of 16 pol sequences from the nine transmission chain patients, as well as a set of unrelated controls extracted from a local database and GenBank, confirmed with statistical significance that the transmission chain sequences constituted a monophyletic cluster in the subtype C phylogeny (Fig. (Fig.2).2). A similar analysis of the reverse transcriptase (RT) gene, including additional control sequences with matched drug resistance mutations, still identified the transmission chain as a monophyletic cluster (see the supplemental material). To test if the evolutionary history of the virus was compatible with the known transmission history, molecular phylogenies were reconstructed based on pol and env gp41 sequences. Maximum likelihood reconstructions based on both gene regions are depicted in Fig. Fig.3.3. Both maximum likelihood and Bayesian inference resulted in the exact same topologies. In Fig. Fig.3,3, the most likely host transmission history is superimposed onto the viral tree, indicating that the evolutionary history in env gp41 was perfectly compatible with the known transmission history, while the pol tree revealed a particular incompatibility. According to the known transmission history, the sequences for patients A, F, and G were expected to be monophyletic, as correctly inferred in the env gp41 tree. However, the pol phylogeny did not suggest patient A as the donor for F. As indicated in Fig. Fig.3,3, this inconsistency could be resolved by a single branch swapping of (A00pop, A96pop) or (F02pop, G02pop) in either direction. It is interesting to note that the ML bootstrap support for the nodes relevant to the incompatible tree was weak compared to the posterior probabilities, suggesting that the more conservative bootstraps might be less misleading in this case. Phylogenetic reconstruction based on the concatenated pol and env gp41 alignment resulted in the same topology as inferred for the env gp41 gene only. For consistency, we have further referred to this topology as the env gp41 tree. Using the Kishino-Hasegawa test and the Shimodaira-Hasegawa test, we compared the env gp41 tree topology, congruent with the known transmission history, with the pol tree (Table (Table1).1). Although the pol reconstruction yielded a better likelihood for the pol alignment, the env gp41 reconstruction could not be significantly rejected by any test independent of the evolutionary model used. For the env gp41 data, however, the pol tree, not fully compatible with the known transmission history, gave a significantly worse fit than the env gp41 tree.
To investigate whether the unexpected clustering for the pol gene could have resulted from drug-selective pressure, we identified the drug resistance mutations for all samples based on the mutation list available from the International AIDS Society (Table (Table2)2) (14). Except for some natural polymorphisms, there was no evidence that resistance mutations were transmitted in this transmission chain. The table reveals that the virus in patient A had a significant number of resistance mutations, and at the second time of sampling, this patient was classified as multidrug resistant under ongoing exposure to therapy. Since such amino acid-altering mutations might have caused an incompatible phylogeny in pol, we reconstructed trees for pol based on synonymous (silent) and nonsynonymous (amino acid altering) distances separately (Fig. (Fig.4).4). While the synonymous tree was now fully compatible with the known transmission history, the nonsynonymous tree showed again that the sequences from patient A did not cluster with F and G. Instead, the patient A virus clustered with the patient I virus, which had developed resistance mutations that were also all present in the patient A virus (Table (Table2).2). Therefore, we could argue that drug-selective pressure had resulted in a pattern of parallel evolution. After exclusion of the codon positions in the RT at which resistance mutations were identified, a pol phylogeny fully compatible with the known transmission history could also be inferred (data not shown). For these data, which contained only 10 codon sites less than the original pol alignment, the tree incongruence test results were inverted (Table (Table1).1). The env gp41 topology resulted now in a higher likelihood than the original pol topology, but the latter was again not significantly rejected. Comparison of the pol and env gp41 phylogenies also indicated that different tree topologies could be compatible with the same transmission history. For example, the patient H strain clustered with the patient I strain in pol while the former clustered with a patient B isolate in env gp41; however, both scenarios were compatible with the known transmission history. In contrast to the pol tree, the tree topology reconstructed for env gp41 based on nonsynonymous distances was compatible with the known transmission history (data not shown).
Since we have identified a topological incompatibility in the pol reconstruction and attributed it to drug-selective pressure, we could test whether this has left a footprint of positive selection in the viral lineages of interest. The nonsynonymous/synonymous substitution rate ratio (ω = dN/dS) provides a qualitative measure of natural selection at the protein level, with ω = 1, ω < 1, and ω > 1, indicating neutral evolution, purifying selection, and positive selection, respectively. Since an averaging approach failed to detect positive selection in a similar case of parallel evolution under drug-selective pressure (4), we used more sensitive codon substitution models to identify positive selection at individual sites (26, 44), along specific lineages (45), and at individual sites along specific lineages (43). As genealogy, we used the tree reconstructed in the env gp41 gene (or pol plus env gp41 concatenated alignment), which was consistent with the transmission history (Fig. (Fig.3).3). The results of the maximum likelihood inference under these models are summarized in Table Table3.3. The two-ratio model that allowed for a different ω for the patient A branches (1) compared to all other branches (0) fitted significantly better than the one-ratio model (M0) (P = 0.0001). Although this provided evidence for a different selective pressure in the patient A lineages, the ω estimate for these lineages (1 = 1.6230) was not significantly higher than 1 (P = 0.3078).
The results of the site-specific models indicated that the selective pressure on the protein varied greatly among amino acid sites (Table (Table3).3). All models allowing for positively selected sites (M2, M3, and M8) provided a significantly better fit to the data than their neutral counterparts (M1, M0, and M7, respectively). Interestingly, the positively selected sites identified by the empirical Bayes' criterion included several drug resistance-associated mutations, mostly observed in more than one patient.
We further tested for sites under positive selection along the lineage of interest using branch site models. Parameter estimates under model A suggested that 65% of sites were highly conserved across all lineages, with ω0 = 0, and 18% of sites were nearly neutral with ω1 = 1, whereas the additional 17% of sites were under strong positive selection along the patient A branches, with 2 = 4.74. In comparison with the simpler neutral model (M1), model A showed a statistically significant improvement (P = 0.0045). Parameter estimates under model B suggested that 4.7% of sites were under positive selection in all lineages, with 1 = 4.34, whereas 18% of sites were only under positive selection in patient A, with 2 = 2.91. The LRT indicated that the branch site model B was significantly better than the site-specific model M3 (K = 2) at the 0.05 confidence level (P = 0.025), suggesting positive selection in the patient A lineages, in addition to positively selected sites in all lineages. Interestingly, the positively selected sites identified for the background included several positions at which drug resistance mutations were identified in more than one patient (RT, 41, 184, 210, and 215) while the positively selected sites identified for the foreground lineage included additional positions at which drug resistance mutations were only found in patient A (Pro, 10, 33, 48, 54, and 82; RT, 67 and 219). A similar analysis of the env gp41 gene region also identified sites under positive selection, but differential selective pressure in the specified lineages was not observed (data not shown).
Our analyses indicated that several mutations have been fixed under drug-selective pressure in patient A, which could have resulted in a faster evolutionary rate along this lineage. Exploratory linear regression analysis appeared to confirm this effect in the terminal branch of taxon A00pop (data not shown). To test this more formally, we applied molecular clock modeling to the heterochronous sequence data. The results of the maximum likelihood inference are shown in Table Table4.4. The SRDT model, which constrains the tips of the tree to be proportional to the sampling dates, resulted in a nucleotide substitution rate of 0.00121 substitutions/site/year. The single-rate model, which makes no accommodation for the temporal sampling of the isolates (30), was significantly rejected by the LRT in favor of the SRDT model (P < 0.0001). This comparison suggested that incorporating isolation dates into a single-rate model significantly improved the likelihood. However, the SRDT model is rejected in favor of the more general different rates (DR) model (P < 0.0001), indicating that the assumption of a global molecular clock was violated. A local clock dated tips model, relaxing the molecular clock along the branch leading to the taxon A00pop (Fig. (Fig.3),3), fitted the data significantly better than the SRDT model (P < 0.0001). The rate estimated for the branch leading to the multidrug-resistant isolate was significantly higher than the background lineages (0.00616 substitutions/site/year; Table Table4).4). It should be noted that the local clock model was still significantly rejected in favor of the DR model (P = 0.0014). Also for the env gp41 gene, a global molecular clock was rejected (DR versus SRDT, P = 0.0007), but no lineage effect in patient A was observed (local clock dated tips versus SRDT, P = 0.99).
In this study, we extensively documented a known HIV-1 transmission history almost a decade later than the Swedish transmission chain (19). Although the number of people carrying HIV-1 is estimated to be around 40 million (www.unaids.org), the long time span between the identification of both transmission chains of comparable size reflects the rarity of identifying such data. Sampling genetic data from known transmission histories of fast-evolving pathogens provides the opportunity to investigate the accuracy of phylogenetic reconstructions and the rate and mode at which genetic variation is accumulated (17, 19, 21). Here, we have sequenced the pol and env gp41 regions of the HIV-1 genome from plasma RNA in at least one sample of nine infected patients. Phylogenetic reconstructions using likelihood methods revealed that, in contrast to the env gp41 tree, the pol tree was incompatible with the known transmission history. The difference between transmission history and reconstructed topology in pol could be attributed to strong drug-selective pressure resulting in a pattern of parallel evolution. Analysis of the pol data without the positions at which known drug resistance mutations were identified resulted in a fully compatible tree.
Although the selective pressure in patient A, due to various drug regimens and eventually leading to multidrug resistance, might be an extreme case of selection in molecular evolution, it had a remarkable impact on our inference. Natural selection, even if confined to a few positions in the sampled gene fragment, can provide sufficient counterweight to the remaining phylogenetic information resulting in incorrect clustering. This observation demands caution when the pol gene is used for testing transmission hypotheses in forensic investigations (e.g., references 23 and 25). The presence of resistance mutations does not necessarily invalidate analyses of the pol gene for forensic purposes because the hypothesis test of epidemiological relatedness still provided convincing results (Fig. (Fig.1).1). This was also confirmed in the RT gene by using additional drug-resistant control sequences (see the supplemental material). However, it is highly recommended to assess the impact of sites with resistance-associated mutations by removing them from the alignment in a parallel analysis (12). As accessibility to antiretroviral treatment will increase, the footprint of drug-selective pressure might become a common feature in pol sequences. Reconstructions based on the env gp41 gene region, known to be subject to host immune-selection pressure as reflected in a higher overall dN/dS, did not seem to be severely influenced by parallel substitutions. On the contrary, the env gp41 data had the power to reject the incompatible pol evolutionary history. This showed that obtaining sequence data for multiple HIV genome regions is not a redundant recommendation in forensic investigations (18). It is interesting to note that patient A is now successfully being treated with the fusion inhibitor T20, which targets the HIV-1 gp41 transmembrane glycoprotein (15). Therefore, future sampling of this patient might indeed also reveal an effect of drug-selective pressure in the env gp41 gene (32).
A common convention for detecting the action of positive selection is the nonsynonymous/synonymous substitution rate ratio, a ratio greater than 1 indicating that nonsynonymous mutations offer fitness advantages and have a higher fixation rate than synonymous mutations. Unfortunately, an average ratio usually has little power to detect positive selection (e.g., references 1, 7, and 35). For the evolution of HIV drug resistance in particular, Crandall et al. (4) have shown that this ratio is a poor indicator of natural selection. This is not surprising, since for the conventional drugs, resistance mutations are fixed in the most functionally conserved protein of the HIV genome. Using codon substitution models, we were able to explicitly test differential selective pressure in patient A. Although this approach confirmed a significantly higher dN/dS along these lineages compared to the background, we still could not demonstrate any positive selection in the patient A lineages with statistical significance. Only the most complex model for detecting molecular adaptation at individual sites along specific lineages (branch site model B) revealed positively selected sites in patient A, in addition to positively selected sites throughout the complete genealogy. Interestingly, the latter included several positions with resistance mutations identified in more than one patient, while the former included several positions with resistance mutations exclusively seen in patient A. An exception to this was position 44, for which a drug resistance mutation was observed in both patient A and patient B but was identified as a positively selected site only in patient A. However, this mutation is only present as an allelic mixture in the patient B population sequence and therefore was not fully fixed in the population at that time of sampling. It remains to be elucidated whether the remaining sites identified to be under positive selection are resistance associated, selected by the host immune system or simply false positives. For example, it is interesting to note that mutations at positions 20, 203, and 218 in the RT, identified as positively selected in the foreground lineage, were significantly associated with nucleoside RT inhibitor therapy in a recent comprehensive statistical analysis (9).
Finally, we have demonstrated that the increased fixation of resistance mutations in patient A resulted in a significantly higher evolutionary rate along a terminal branch in patient A. Again, this could be explicitly tested by maximum likelihood modeling. However, the local molecular clock model with dated tips was still significantly rejected in favor of the different rates model, which does not assume a molecular clock. A molecular clock in the env gp41 gene was also rejected. This was in contrast with the Swedish transmission chain that supported the existence of a molecular clock (17). The difference might have resulted from a different testing approach or from the impact of antiretroviral treatment. On the one hand, it has been shown that effective treatment can result in a slowing down or even cessation of viral evolution (6). On the other hand, suboptimal therapy can lead to drug resistance, which can result in increased mutation rates, in turn increasing the likelihood of further resistance under ongoing therapy (24). The latter might have been the evolutionary pathway in patient A.
In conclusion, we were able to uncover the molecular footprint of drug-selective pressure in a case where the HIV transmission history was known. If the aim would have been to estimate the transmission history, which is much more common in HIV phylogenetics, we would have drawn incorrect conclusions on the basis of the pol gene region. Therefore, we suggest that caution should be taken when accurate reconstructions of HIV evolution are required.
This work was supported by the Flemish Fonds voor Wetenschappelijk Onderzoek (FWO G.0288.01). P.L. was supported by the Flemish Institute for the Promotion and Innovation through Science and Technology in Flanders (IWT Vlaanderen). A.R. was supported by the Royal Society. K.V.L. and Y.S. were supported by the Belgian Ministry of Social Affairs through a fund within the Health Insurance System.
We thank Y. Schrooten and B. Maes for expert laboratory assistance.
†Supplemental material for this article may be found at http://jvi.asm.org.