|Home | About | Journals | Submit | Contact Us | Français|
Human immunodeficiency virus type 1 (HIV-1) sequences that pre-date the recognition of AIDS are critical to defining the time of origin and the timescale of virus evolution1,2. A viral sequence from 1959 (ZR59) is the oldest known HIV-1 infection1. Other historically documented sequences, important calibration points to convert evolutionary distance into time, are lacking, however; ZR59 is the only one sampled prior to 1976. Here we report the amplification and characterization of viral sequences from a Bouin's-fixed paraffin-embedded lymph node biopsy specimen obtained in 1960 from an adult female in Léopoldville, Belgian Congo (now Kinshasa, Democratic Republic of the Congo [DRC]), and we use it to conduct the first comparative evolutionary genetic study of early pre-AIDS epidemic HIV-1 group M viruses. Phylogenetic analyses position this viral sequence (DRC60) closest to the ancestral node of subtype A (excluding A2). Relaxed molecular clock analyses incorporating DRC60 and ZR59 date the M group's most recent common ancestor near the beginning of the 20th century. The sizeable genetic distance between DRC60 and ZR59 directly demonstrates that diversification of HIV-1 in west-Central Africa occurred long before the recognized AIDS pandemic. The recovery of viral gene sequences from decades-old paraffin-embedded tissues opens the door to a detailed paleovirological investigation of the evolutionary history of HIV-1 that is not accessible by other methods.
We screened twenty-seven tissue blocks (8 lymph node, 9 liver, and 10 placenta) obtained from Kinshasa between 1958 and 1960 by RT-PCR; one lymph node biopsy specimen contained HIV-1 RNA. Viral nucleic acids were extracted from this specimen using protocols optimized for the recovery of nucleic acids from ancient or degraded samples3,4. After reverse transcription, 12 out of the 14 short HIV-1 cDNA fragments attempted (Fig. 1A) were amplified by PCR using a panel of conserved primer pairs from different regions of the viral genome (Table S1). Each PCR-product DNA was cloned and sequenced. Sequences were reproducible after repeated extractions and not the result of PCR contamination (see Fig. 1A and Table S1 for fragment designations). The results were confirmed independently in two laboratories (Figs. 1B and S1), with the second laboratory successfully identifying the positive 1960 specimen in a blinded assay. The short fragments of the 1960 sample were found to be subtype A and not to be a mosaic of contemporary sequences (see supplementary information for a detailed discussion of the authenticity of the 1960 sequences). Consensus nucleotide sequences from these short HIV-1 fragments were concatenated for study. The analyses included reference sequences from the Los Alamos National Laboratory HIV sequence database and sequences recovered as part of this study from three paraffin-embedded tissue specimens collected from AIDS patients in Belgium and Canada between 1981 and 1997.
HIV-1 sequences were analyzed in MrBayes v3.1.25 using an unconstrained (no molecular clock enforced) Bayesian Markov chain Monte Carlo (BMCMC) method. The phylogenetic analyses confirmed that the DRC60 consensus sequences from the two laboratories were derived from a single patient (uncorrected pairwise distance 1.4%). The sequences were positioned close to the ancestral node of the subtype A lineage (excluding sub-subtype A2), forming a monophyletic clade with three modern sequences from the DRC (Figs. 1B and S1). Assuming a similar rate of evolution along all branches on a tree, the divergence between two sequences reflects the time elapsed since their shared ancestor. As predicted, the DRC60 sequences had a shorter branch length to the A/A1 ancestral node than the contemporary subtype A viruses sampled from the same geographic region (P = 1.0).
We validated the time of origin of the 1960 sequence by comparisons of the predicted date to the documented date. With the DRC60 date treated as an unknown, we calculated an evolutionary rate based on the distribution of branch lengths on the unconstrained phylogenetic trees sampled by MrBayes. To limit the effects of evolutionary rate differences between clades and uncertainties in rooting the HIV-1 M group phylogeny, we focused on the subtype A/A1 subtree (Fig. S1) and analyzed root-to-tip branch lengths relative to the sampling year. The mean estimates for the year of origin of the DRC60 consensus sequences from the University of Arizona and Northwestern University laboratories were 1959 (95% highest probability distribution 1902-84) and 1959 (95% HPD 1915-85), respectively, corroborating the authenticity of the DRC60 sequences and the existence of clock-like signal in our data set (see below). Despite initial indications that recombination might seriously confound phylogenetic dating estimates6, subsequent work has suggested that recombination is not likely to systematically bias HIV-1 dates in one direction or the other, though it is expected to increase variance7. The close match between the predicted and actual dates of both ZR592 and DRC60 provides support for this view and an unambiguous indication that HIV-1 evolves in a fairly reliably clock-like fashion.
The uncorrected pairwise distance between DCR60 and ZR59 in their overlapping env region was 11.7% (Fig. 1C). This genetic distance is greater than 99.2% of within-subtype comparisons (within-subtype difference range 0.01-0.15; between-subtype difference range 0.05-0.18). Since each subtype represents several decades of independent evolution in the human population2,8, the extensive divergence between DRC60 and ZR59 indicates that the HIV-1 M group founder virus began to diversify in the human population (and that HIV-1 likely entered Kinshasa) decades before 1960.
We applied a relaxed clock BMCMC coalescent framework as implemented in BEAST v1.4.79 to estimate the time to the most recent common ancestor (TMRCA) of the HIV-1 M group. This approach robustly incorporates phylogenetic uncertainty and accounts for the possibility of variable substitution rates among lineages and differences in the demographic history of the virus, sampling phylogenies and parameter estimates in proportion to their posterior probability10. As with other studies of HIV-111, comparisons of the marginal likelihoods of strict versus relaxed clock models (both of which are implemented in BEAST) indicated overwhelming support for relaxed clocks (data available upon request). Hence, the use of strict clock models with these data would be inappropriate and would likely yield misleadingly small error estimates with regard to both timing and substitution rates.
Using substitution rates calibrated with sequences sampled at different time points, we obtained a posterior distribution of rooted tree topologies with branch lengths in unit time (Figs. 2 and S2). The median estimated substitution rate for the concatenated subregions of the gag-pol-env genes was 2.47 × 10−3 substitutions/site/year (95% HPD 1.90-2.95 × 10−3). The inclusion of the 1959 and 1960 sequences appeared to improve estimation of the TMRCA of the M group (Table 1), limiting the influence of the coalescent tree prior on the posterior TMRCA distributions compared with the data set that excluded these earliest cases of HIV-1. With DRC60 and ZR59 included, the different demographic/coalescent models gave highly consistent results, with tighter and more similar date ranges compared to the analyses that excluded them and 95% HPDs that extend no later than 1933. The best-fit model incorporated a constant population size demographic model (TMRCA 1921, 95% HPD 1908-1933). The model with a general, non-parametric prior (the Bayesian skyline plot tree prior)12,13 that indicated a more complex (and biologically plausible) demographic history (Fig. S3) had a statistically indistinguishable degree of support (TMRCA 1908, 95% HPD 1884-1924). Moreover, the population expansion demographic model9, which was a slightly worse fit to the data compared to the constant population and Bayesian skyline plot models, could not be rejected given the Bayes factors comparison of models (Table 1). The inability to strongly reject the model with a constant population size prior is counterintuitive since it is clear that the HIV-1 population size has increased dramatically. We speculate that this finding might be due to the simplest model providing a good fit to a relatively short, information-poor alignment, in comparison with more parameterized models.
Acid-containing fixatives such as Bouin's solution can cause base modifications of nucleic acids, leading to the generation of erroneous bases in sequences derived from such samples3. However, the replication of all sequences from independent PCRs and the uncorrected pairwise distance between the consensus sequences from the two laboratories (1.4%) suggest that few of the mutations on the DRC60 lineage are damaged-induced. Moreover, our relaxed clock methods are likely to be fairly robust to the presence of such mutations in one lineage9. Nevertheless, additional old sequence data would be helpful for resolving what impact, if any, this possible source of error had on the slightly earlier dates we calculated compared to previous estimates that did not include early calibration points2,8,14,15. Interestingly, the best-fit model for the data set that excluded ZR59 and DRC60 (Table 1) gave a TMRCA estimate, 1933 (1919-1945), very similar to that of Korber et al.2. This suggests that the inclusion of the old sequences, rather than the vagaries associated with a much shorter alignment than that analyzed by Korber et al., might explain the discrepancy. It is also worth noting in this context that one earlier study, using sequences from the DRC only16, produced dating and demography estimates very similar to ours. Overall, there is broad agreement between all these studies in spite of differences in data and methods.
Our estimation of divergence times, with an evolutionary timescale spanning several decades, together with the extensive genetic distance between DRC60 and ZR59, indicate that these viruses evolved from a common ancestor circulating in the African population near the beginning of the 20th century; TMRCA dates later than the 1930s are strongly rejected by our statistical analyses. The topology of the HIV-1 group M phylogeny provides further support for this conclusion. Unlike ZR59, which is basal to subtype D1, DRC60 branches off from the ancestral node of subtype A/A1 (Figs. 2, S1, and S2). Thus, phylogenetically distinct subtypes (and/or their progenitors) were clearly already present in the DRC by this early time point (Fig. 2). Notably, DRC60 and ZR59 cluster with other strains from the same geographical region, and basal to other members of their respective subtypes, a pattern consistent with the hypothesis that the subtypes spread through lineage founder effects worldwide, while a more diverse array of forms remained at the site of origin in Africa17,18.
The reservoir of the ancestral virus still exists among wild chimpanzee communities in the same area on the African continent19. Humans acquired a common ancestor of the HIV-1 M group by cross-species transmission under natural circumstances20, most likely predation21. The Bayesian skyline plot (Fig. S2), which tracks effective population size through time, suggests that HIV-1 group M experienced an extensive period of relatively slow growth in the first half of the 20th century. A similar pattern has been inferred using sequences sampled only in the DRC16. This pattern, and the short duration between the first presence of urban agglomerations in this area and the timing of the most recent common ancestor of HIV-1 group M (Fig. 3), suggests that the rise of cities may have facilitated the initial establishment and the early spread of HIV-1. Hence, the founding and growth of colonial administrative and trading centers like Kinshasa22 may have enabled the region to become the epicenter of the HIV/AIDS pandemic23.
The archival banks of Bouin's-fixed paraffin-embedded tissue specimens accumulated by many hospitals in west-Central Africa provide a vast source of clinical material for viral genetic analysis. As with the 1918 Spanish Influenza pandemic virus24,25, a deep perspective on the evolutionary history of HIV-1 using sequences resurrected from the earliest cases in Africa could yield important insights into the pathogenesis, virulence, and evolution of pandemic AIDS viruses.
Each individual block carried an original paper identification number permanently embedded in the paraffin. Laboratory books listed the corresponding identification numbers sequentially, and included the patient's age, sex, department of hospitalization, tissue type, and date of sampling. Block identification number, sampling date, and tissue type were transcribed onto an Excel spreadsheet, and the blocks were indexed, transferred into plastic boxes, and photographed.
The results of the quantitative RT-PCR assay indicated that the integrity of the RNA preserved in these 27 samples ranged from moderate to undetectable, a range typical of Bouin's-fixed specimens3. The human RNA found in the 1960 lymph node biopsy sample that was found to be HIV-1 RNA positive was of relatively good quality. The Ct values (quantitative PCR data available from the authors upon request) were as low or lower (better) than more recent (1980-90) paraffin-embedded tissues that have yielded short HIV-1 RNA amplicons3.
Three formalin-fixed paraffin-embedded necropsy specimens were obtained from (i) a Canadian patient who died in 1997 (CAN97), (ii) a Congolese woman who died in Belgium in 1981 and was retrospectively identified as an AIDS patient (BE81), and (iii) a Congolese man who died in Belgium in 1985 (BE85). The latter two cases were presumably infected in Zaire (now DRC). The phylogenetic reconstruction shows that their viral sequences are most closely related to modern sequences from the Democratic Republic of the Congo, while the Canadian specimen yielded subtype B sequence, as predicted (Figs. 1, 22, S1, and S2).
Between 5 and 10 microtome sections, 5-10 μm in thickness, or an approximately equivalent amount of tissue shaved from each block with a disposable scalpel blade, were used for each digestion/extraction, as described3. Rigorous attention was given to preventing cross-contamination between samples by cleaning the outer surface of each block with a bleach solution, using fresh microtome/scalpel blades for each sectioning of each block, discarding the first few (exposed-surface) sections, and by performing the work in a room physically isolated from any human or HIV-1 PCR-product DNA. A 48-hour digestion period (24h at 65°C, 24h at 75°C) was used. Post extraction nucleic acids were eluted into 100 μL elution buffer AE and stored frozen at minus 80°C until required for analyses.
Reverse transcription was performed simultaneously for (i) the gag, pol, and human B2M RNA fragments, (ii) env fragments 3-10, and (iii) env fragments 11-14 (Table S1), with SuperScript III used according to the manufacturer's instructions. The protocol was as described3 except that alternating 50°C and 55°C incubation periods of 30 minutes were used for a total of 6 hours.
The cDNA was PCR amplified in 25μL reactions, using 0.1μL Platinum Taq HiFi enzyme (Invitrogen), 250μM DNTP mix, 2mM MgSO4, 1x PCR buffer, 0.4μM per primer, and 2μL cDNA for the gag and pol reactions, or 1μL for the env ones, with annealing temperatures of 60°C (gag [55 cycles]) or 55°C (pol [50 cycles]; env [60 cycles]). Full details are available from the authors upon request.
After amplification, the PCR-product DNA was visualized by agarose gel electrophoresis then purified using Zymoclean DNA Clean and Concentrator – 25 spin tubes (Zymo Research Corp, Orange, CA). PCR-product DNA was inserted into vector pCR 2.1-TOPO using the TOPO TA Cloning Kit (Invitrogen). The University of Arizona Genomic Analysis and Technology Core Facility resolved the DNA sequence of the vector inserts on an Applied Biosystem 3730xl DNA Analyzer using ABI Big Dye 3.1 chemistry (Applied Biosystems, Foster City, CA). Nearly identical protocols were followed for the independent replication of the DRC60 results at Northwestern University.
We downloaded the 2006 full-length HIV-1 sequence alignment from the Los Alamos National Laboratories HIV sequence database26. We retained only non-recombinant HIV-1 group M A-K subtype sequences (excluding G) and removed sequences suspected a priori of unusual evolutionary dynamics (such as those associated with the IV drug user epidemic in Eastern Europe and those with nef-deletions, both of which exhibit abnormally slow evolutionary rates). We also reduced the size of the subtype B and C clades, which are heavily over-sampled relative to the others, by keeping only the first 5 sequences from any year/country then randomly removing sequences until the sample size was similar to other subtypes. This procedure left a total of 156 sequences. We then manually aligned the consensus sequence from the 12 regions amplified from DRC60, plus the 4 regions available for ZR59, to the full-length sequences. These short regions (Fig. 1A and Table S1) were then concatenated into an alignment 994 nucleotides in length (Table 1). The four env fragments from DRC60 that overlapped with available data from ZR59 were concatenated into an alignment 163 nucleotides in length. Matching alignments with DRC60 and ZR59 removed were also constructed. All the alignments are available from the authors upon request.
We used a general time-reversible nucleotide substitution model with gamma-distributed rate heterogeneity among sites and performed four independent runs of 20 million steps, sampling every 2000 steps. Examination of the MCMC samples with Tracer v1.49 indicated convergence and adequate mixing of the Markov chain with estimated sample sizes in the thousands. We discarded the first 2 million steps from each run as burn-in, and combined the resulting MCMC samples for subsequent estimation of posteriors. The 50% majority rule consensus tree (Fig. S1) is shown rooted on the branch identified by the rooted-tree method in BEAST v1.4.7, described below; however, the group M rooting was not relevant to any dating analysis. We also estimated phylogenies using the same data set under neighbor-joining and maximum likelihood methods and the same substitution model. The DRC60 sequences fell in the same topological position as with the BMCMC methods, with short root-to-tip genetic distances, consistent with the MrBayes results (Fig. S1). All data and trees available from the authors upon request.
We used the posterior tree sample to test the hypothesis that the terminal nodes of the DRC60 sequences were closer to the inferred A/A1 ancestral node by calculating the proportion of sampled trees where the A/A1 node-to-tip distances were smaller for these sequences than for the three modern sequences from the DRC in the same clade (Fig. 1B).
To predict the date of sampling based on the phylogenetic properties of the DRC60 sequences we also plotted the branch lengths (A/A1 node to tips) against the time of sampling for all A/A1 sequences excluding DRC60 and calculated the best fit for the linear regression of genetic divergence against the year of sampling of the viruses27. We calculated the mean and 95% HPD of the predicted sampling date of each DRC60 consensus sequence based on its node-to-tip distance and the inferred regression line calculated for each of 100 trees sampled by MrBayes.
We used the Bayesian methods described by Drummond et al.9,10, which allow for the co-estimation of phylogeny and divergence times under a “relaxed” molecular clock model, as implemented in BEAST v1.4.79. All analyses were performed under an uncorrelated lognormal relaxed molecular clock model, using a general time-reversible nucleotide substitution model with heterogeneity among sites modeled with a gamma distribution. We investigated each demographic model (constant population, exponential growth, expansion growth, logistic growth) as well as a Bayesian skyline plot coalescent tree prior13, a general, non-parametric prior that enforces no particular demographic history. We employed a piecewise-linear skyline model with 10 groups. We then compared the marginal likelihoods for each model using Bayes factors estimated in Tracer v 1.4 as described12,15. Bayes factors represent the ratio of the marginal likelihoods of the models being compared. A large ratio can indicate that one model is a significantly better fit to the data than another. We assessed the strength of the evidence that the best-fit model was superior to the others as described15.
For each analysis, two independent runs of 50 million steps were performed. Examination of the MCMC samples with Tracer v1.4 indicated convergence and adequate mixing of the Markov chains, with estimated sample sizes in the 100s or 1000s. After inspection with Tracer, we discarded an appropriate number of steps from each run as burn-in, and combined the resulting MCMC tree samples for subsequent estimation of posteriors. We summarized the MCMC samples using the maximum clade credibility (MCC) topology found with TreeAnnotator v1.4.79 with branch length depicted in years (median of those branches that were present in at least 50% of the sampled trees) (Fig. 2). The Bayesian skyline plot was reconstructed using the posterior tree sample and Tracer v1.4.
We thank Joel Wertheim and Mike Sanderson for computational assistance and Larry Jewel for providing the Canadian control specimen. The NIH/NIAID and the David and Lucile Packard Foundation funded the research.
Author Contributions M.W., D.E.T., S.M.W., and M.T.P.G designed the study. M.G., T.H., K.K. and M.T.P.G. performed digestion/extraction, PCR, qPCR, cloning, and sequencing experiments. M.G. and M.B. designed primers. D.E.T., J-J.M., E.V.M., J-M.M.K and R.M.K. organized and provided samples. M.W. analyzed the data, performed the phylogenetic analyses, and wrote the paper. S.M.W. contributed to the analyses and writing. All authors discussed the results and commented on the manuscript.
The authors declare no competing financial interests.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
The sequences reported in this study have been deposited in GenBank under accession numbers EU580739-EU580854 and EU589211-EU589236.