Human immunodeficiency virus type 1 (HIV-1) is pandemic, but its contemporary global transmission network has not been characterized. A better understanding of the properties and dynamics of this network is essential for surveillance, prevention, and eventual eradication of HIV. Here, we apply a simple and computationally efficient network-based approach to all publicly available HIV polymerase sequences in the global database, revealing a contemporary picture of the spread of HIV-1 within and between countries. This approach automatically recovered well-characterized transmission clusters and extended other clusters thought to be contained within a single country across international borders. In addition, previously undescribed transmission clusters were discovered. Together, these clusters represent all known modes of HIV transmission. The extent of international linkage revealed by our comprehensive approach demonstrates the need to consider the global diversity of HIV, even when describing local epidemics. Finally, the speed of this method allows for near-real-time surveillance of the pandemic's progression.
human immunodeficiency virus; transmission network; molecular epidemiology
Since its identification in 1983, HIV-1 has been the focus of a research effort unprecedented in scope and difficulty, whose ultimate goals, a cure and a vaccine, remain elusive. One of the fundamental challenges in accomplishing these goals is the tremendous genetic variability of the virus, with some genes differing at as many as 40% of nucleotide positions among circulating strains. Because of this, the genetic bases of many viral phenotypes, most notably the susceptibility to neutralization by a particular antibody, are difficult to identify computationally. Drawing upon open-source general-purpose machine learning algorithms and libraries, we have developed a software package, IDEPI (IDentify EPItopes), for learning genotype-to-phenotype predictive models from sequences with known phenotypes. IDEPI can apply learned models to classify sequences of unknown phenotypes, and also identify specific sequence features which contribute to a particular phenotype. We demonstrate that IDEPI achieves performance similar to or better than that of previously published approaches on four well-studied problems: finding the epitopes of broadly neutralizing antibodies (bNab), determining coreceptor tropism of the virus, identifying compartment-specific genetic signatures of the virus, and deducing drug-resistance associated mutations. The cross-platform Python source code (released under the GPL 3.0 license), documentation, issue tracking, and a pre-configured virtual machine for IDEPI can be found at https://github.com/veg/idepi.
HIV-1 dual infection (DI) and CXCR4 (X4) coreceptor usage are associated with accelerated disease progression, but the frequency and dynamics of coreceptor usage during DI are unknown. Ultradeep sequencing was used to test for DI and infer coreceptor usage in longitudinal blood samples of 102 subjects. At baseline, X4 usage was high (23 subjects harbored X4 variants) and was not associated with infection duration or DI. Coreceptor usage changed over time in 12 of 47 participants, and X4 usage emerged in 4 of 41 monoinfections vs 2 of 5 superinfections (P = .12), suggesting a weak statistical trend toward an association between superinfection and the acquisition of X4 usage.
HIV-1 dual infection; HIV-1 coinfection; HIV-1 superinfection; coreceptor tropism; coreceptor usage; ultradeep pyrosequencing; next-generation sequencing; genotypic tropism prediction; genotypic coreceptor usage prediction
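The comparison of X4 emergence in 4 of 41 monoinfections versus 2 of 5 superinfections can be checked with a small calculation. The abstract does not name the test that was used; the sketch below assumes a two-sided Fisher's exact test, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column events in row 1,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Rows: monoinfection, superinfection; columns: X4 emerged, X4 did not emerge.
p_value = fisher_exact_two_sided(4, 37, 2, 3)
print(round(p_value, 2))  # 0.12
```

Under this assumption the result agrees with the reported P = .12, which is consistent with, but does not prove, the use of an exact test in the original analysis.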
To reconstruct the local HIV-1 transmission network from 1996 to 2011 and use network data to evaluate and guide efforts to interrupt transmission.
HIV-1 pol sequence data were analyzed to infer the local transmission network.
We analyzed HIV-1 pol sequence data to infer a partial local transmission network among 478 recently HIV-1 infected persons and 170 of their sexual and social contacts in San Diego, California. A transmission network score (TNS) was developed to estimate the risk of HIV transmission from a newly diagnosed individual to a new partner and target prevention interventions.
HIV-1 pol sequences from 339 individuals (52.3%) were highly similar to sequences from at least one other participant (i.e., clustered). A high TNS (top 25%) was significantly correlated with baseline risk behaviors (number of unique sexual partners and insertive unprotected anal intercourse; p = 0.014 and p = 0.0455, respectively) and predicted risk of transmission (p<0.0001). Retrospective analysis of antiretroviral therapy (ART) use, and simulations of ART targeted to individuals with the highest TNS, showed significantly reduced network level HIV transmission (p<0.05).
Sequence data from an HIV-1 screening program focused on recently infected persons and their social and sexual contacts enabled the characterization of a highly connected transmission network. The network-based risk score (TNS) was highly correlated with transmission risk behaviors and outcomes, and can be used to identify and target effective prevention interventions, like ART, to those at a greater risk for HIV-1 transmission.
Investigating the incidence and prevalence of HIV-1 superinfection is challenging due to the complex dynamics of two infecting strains. The superinfecting strain can replace the initial strain, be transiently expressed, or persist along with the initial strain in distinct or in recombined forms. Various selective pressures influence these alternative scenarios in different HIV-1 coding regions. We hypothesized that the potency of the neutralizing antibody (NAb) response to autologous viruses would modulate viral dynamics in env following superinfection in a limited set of superinfection cases. HIV-1 env pyrosequencing data were generated from blood plasma collected from 7 individuals with evidence of superinfection. Viral variants within each patient were screened for recombination, and viral dynamics were evaluated using nucleotide diversity. NAb responses to autologous viruses were evaluated before and after superinfection. In 4 individuals, the superinfecting strain replaced the original strain. In 2 individuals, both initial and superinfecting strains continued to cocirculate. In the final individual, the surviving lineage was the product of interstrain recombination. NAb responses to autologous viruses that were detected within the first 2 years of HIV-1 infection were weak or absent for 6 of the 7 recently infected individuals at the time of and shortly following superinfection. These 6 individuals had detectable on-going viral replication of distinct superinfecting virus in the env coding region. In the remaining case, there was an early and strong autologous NAb response, which was associated with extensive recombination in env between initial and superinfecting strains. This extensive recombination made superinfection more difficult to identify and may explain why the detection of superinfection has typically been associated with low autologous NAb titers.
Model-based analyses of natural selection often categorize sites into a relatively small number of site classes. Forcing each site to belong to one of these classes places unrealistic constraints on the distribution of selection parameters, which can result in misleading inference due to model misspecification. We present an approximate hierarchical Bayesian method using a Markov chain Monte Carlo (MCMC) routine that ensures robustness against model misspecification by averaging over a large number of predefined site classes. This leaves the distribution of selection parameters essentially unconstrained, and also allows sites experiencing positive and purifying selection to be identified orders of magnitude faster than by existing methods. We demonstrate that popular random effects likelihood methods can produce misleading results when sites assigned to the same site class experience different levels of positive or purifying selection—an unavoidable scenario when using a small number of site classes. Our Fast Unconstrained Bayesian AppRoximation (FUBAR) is unaffected by this problem, while achieving higher power than existing unconstrained (fixed effects likelihood) methods. The speed advantage of FUBAR allows us to analyze larger data sets than other methods: We illustrate this on a large influenza hemagglutinin data set (3,142 sequences). FUBAR is available as a batch file within the latest HyPhy distribution (http://www.hyphy.org), as well as on the Datamonkey web server (http://www.datamonkey.org/).
evolutionary model; coding sequence evolution; approximate Bayesian inference; parallel algorithms
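The idea of averaging over a large grid of predefined site classes can be illustrated with a minimal sketch. It assumes that the per-site likelihoods under each grid point have already been computed and that the class weights are fixed; in FUBAR the weights are themselves estimated by MCMC, which is not shown here. The grid values, the toy likelihoods, and the function name are all hypothetical.

```python
def prob_positive_selection(site_likelihoods, grid, weights):
    """Posterior probability that beta > alpha at a single site.

    site_likelihoods[k] is the likelihood of the site's data under grid[k],
    where grid[k] = (alpha, beta) is a predefined site class (synonymous and
    non-synonymous rates), and weights[k] is the weight of that class.
    """
    joint = [w * lik for w, lik in zip(weights, site_likelihoods)]
    total = sum(joint)
    positive = sum(j for j, (alpha, beta) in zip(joint, grid) if beta > alpha)
    return positive / total


# Hypothetical 3 x 3 grid of (alpha, beta) site classes with equal weights.
rates = (0.5, 1.0, 2.0)
grid = [(a, b) for a in rates for b in rates]
weights = [1 / len(grid)] * len(grid)

# Toy likelihoods: an uninformative site, and a site whose data strongly
# favor the class alpha = 0.5, beta = 2.0.
flat = [1.0] * len(grid)
peaked = [0.01] * len(grid)
peaked[grid.index((0.5, 2.0))] = 1.0

print(prob_positive_selection(flat, grid, weights))
print(prob_positive_selection(peaked, grid, weights))
```

Because each site's posterior is a weighted sum over precomputed likelihoods, no site is forced into a single class, which is the property the abstract relies on for robustness.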
Evolutionary models that make use of site-specific parameters have recently been criticized on the grounds that parameter estimates obtained under such models can be unreliable and lack theoretical guarantees of convergence. We present a simulation study providing empirical evidence that a simple version of the models in question does exhibit sensible convergence behavior and that additional taxa, despite not being independent of each other, lead to improved parameter estimates. Although it would be desirable to have theoretical guarantees of this, we argue that such guarantees would not be sufficient to justify the use of these models in practice. Instead, we emphasize the importance of taking the variance of parameter estimates into account rather than blindly trusting point estimates; this is standardly done by using the models to construct statistical hypothesis tests, which are then validated empirically via simulation studies.
We present a case of sexual transmission of HIV-1 predicted to have CXCR4-tropism during male-to-male sexual exposure. Phylogenetic analyses exclude cell-free virus in the seminal plasma of the source partner and possibly point to the seminal cells as the origin of the transmission event.
Coronaviruses are found in a diverse array of bat and bird species, which are believed to act as natural hosts. Molecular clock dating analyses of coronaviruses suggest that the most recent common ancestor of these viruses existed around 10,000 years ago. This relatively young age is in sharp contrast to the ancient evolutionary history of their putative natural hosts, which began diversifying tens of millions of years ago. Here, we attempted to resolve this discrepancy by applying more realistic evolutionary models that have previously revealed the ancient evolutionary history of other RNA viruses. By explicitly modeling variation in the strength of natural selection over time and thereby improving the modeling of substitution saturation, we found that the time to the most recent common ancestor of all coronaviruses is likely far greater (millions of years) than the previously inferred range.
Motivation: Statistical methods for comparing relative rates of synonymous and non-synonymous substitutions maintain a central role in detecting positive selection. To identify selection, researchers often estimate the ratio of these relative rates (dN/dS) at individual alignment sites. Fitting a codon substitution model that captures heterogeneity in dN/dS across sites provides a reliable way to perform such estimation, but it remains computationally prohibitive for massive datasets. By using crude estimates of the numbers of synonymous and non-synonymous substitutions at each site, counting approaches scale well to large datasets, but they fail to account for ancestral state reconstruction uncertainty and to provide site-specific dN/dS estimates.
Results: We propose a hybrid solution that borrows the computational strength of counting methods, but augments these methods with empirical Bayes modeling to produce a relatively fast and reliable method capable of estimating site-specific dN/dS values in large datasets. Importantly, our hybrid approach, set in a Bayesian framework, integrates over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty about site-specific dN/dS estimates. Simulations demonstrate that this method competes well with more-principled statistical procedures and, in some cases, even outperforms them. We illustrate the utility of our method using human immunodeficiency virus, feline panleukopenia and canine parvovirus evolution examples.
Availability: Renaissance counting is implemented in the development branch of BEAST, freely available at http://code.google.com/p/beast-mcmc/. The method will be made available in the next public release of the package, including support to set up analyses in BEAUti.
Supplementary data are available at Bioinformatics online.
Standard methods used to estimate HIV-1 population diversity are often resource intensive (e.g., single genome amplification, clonal amplification and pyrosequencing) and not well suited for large study cohorts. Additional approaches are needed to address the relationships between intraindividual HIV-1 genetic diversity and disease. With a small cohort of individuals, we validated three methods for measuring diversity: Shannon entropy and average pairwise distance (APD) using single genome sequences, and counts of mixed bases (i.e., ambiguous nucleotides) from population-based sequences. In a large cohort, we then used the mixed base approach to determine associations between measured HIV-1 diversity and HIV-associated disease. Normalized counts of mixed bases correlated with Shannon entropy at both the nucleotide (rho=0.72, p=0.002) and amino acid level (rho=0.59, p=0.015), and with APD (rho=0.75, p=0.001). Among participants who underwent neuropsychological and clinical assessments (n=187), increased HIV-1 population diversity was associated with both a diagnosis of AIDS and neuropsychological impairment.
HIV; AIDS; genetic diversity; neuropsychological impairment; viral population dynamics
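The three diversity measures named above (Shannon entropy, APD, and normalized counts of mixed bases) can be sketched on toy data as follows. The abstract does not give exact definitions, so this sketch assumes per-column entropy averaged over an alignment, an uncorrected proportion of differing sites for pairwise distance, and IUPAC two- and three-base codes as "mixed bases" with N excluded. All names and sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

# IUPAC codes for 2- or 3-base ambiguities; excluding N is an assumption.
MIXED_BASES = set("RYSWKMBDHV")


def mean_shannon_entropy(aligned_seqs):
    """Mean per-column Shannon entropy (bits) of equal-length sequences."""
    entropies = []
    for column in zip(*aligned_seqs):
        n = len(column)
        counts = Counter(column)
        entropies.append(-sum(c / n * log2(c / n) for c in counts.values()))
    return sum(entropies) / len(entropies)


def average_pairwise_distance(aligned_seqs):
    """Mean proportion of differing sites over all sequence pairs."""
    pairs = list(combinations(aligned_seqs, 2))
    total = sum(
        sum(x != y for x, y in zip(a, b)) / len(a) for a, b in pairs
    )
    return total / len(pairs)


def normalized_mixed_bases(population_seq):
    """Fraction of positions called as a mixed (ambiguous) base."""
    seq = population_seq.upper()
    return sum(ch in MIXED_BASES for ch in seq) / len(seq)


# Toy single-genome sequences and a toy population-based sequence.
single_genomes = ["ACGT", "ACGA", "ACGT", "ACGA"]
print(mean_shannon_entropy(single_genomes))       # 0.25
print(average_pairwise_distance(single_genomes))  # about 0.1667
print(normalized_mixed_bases("ACGRYA"))           # about 0.3333
```

The first two functions need several sequences per individual, whereas the third needs only one population-based sequence, which is why the mixed-base count scales to large cohorts.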
The genital tract of individuals infected with HIV-1 is an anatomic compartment that supports local HIV-1 and CMV replication. This study investigated the association of seminal CMV replication with changes in HIV-1 clonal expansion, evolution and phylogenetic compartmentalization between blood and semen. Fourteen paired blood and semen samples were analyzed from four untreated subjects. Clonal sequences (n=607) were generated from extracted HIV-1 RNA (env C2-V3 region), and HIV-1 and CMV levels were measured in the seminal plasma by real-time PCR. Sequence alignments were evaluated for: (i) viral compartmentalization between semen and blood samples using Slatkin-Maddison and FST methods, (ii) different nucleotide substitution rates in semen and blood, and (iii) association between proportions of clonal HIV-1 sequences in each compartment and seminal CMV levels. Half of the semen samples had detectable CMV DNA, with at least one CMV positive sample for each patient. Seminal CMV DNA levels correlated positively with seminal HIV-1 RNA levels (Spearman p=0.05). A trend towards an association between compartmentalization of HIV-1 sequences sampled from blood and semen and presence of seminal CMV was observed (Cochran Q test p=0.12). Evolutionary rates between semen and blood HIV-1 populations did not differ significantly, and there was no significant association between seminal CMV DNA levels and the frequency of non-unique clonal HIV-1 sequences in the semen. In conclusion, the effects of CMV replication on HIV-1 viral and immunologic dynamics within the male genital tract are not significant enough to perturb evolution or disrupt compartmentalization in the genital tract.
HIV-1; Cytomegalovirus; compartmentalization; evolution; semen
To investigate the susceptibilities to and consequences of HIV-1 dual infection (DI).
We compared clinical, virologic, and immunologic factors between participants who were dually infected with HIV-1 subtype B, and monoinfected (MI) controls who were matched by ongoing HIV risk factor.
The viral load and CD4 progressions of dually and singly infected participant groups were compared with linear mixed-effects models, and individual dynamics before and after superinfection were assessed with a structural change test (Chow test). Recombination breakpoint analysis (GARD), HLA frequency analysis, and cytotoxic T-lymphocyte (CTL) epitope mapping were also performed (HIV LANL Database).
The viral loads of DI participants increased more over 3 years of follow-up than the viral loads of MI controls, while CD4 progressions of the two groups did not differ. Viral escape from CTL responses following superinfection was observed in two participants whose superinfecting strain completely replaced the initial strain. This pattern was not seen among participants whose superinfecting virus persisted in a recombinant form with the initial virus or was only detected transiently. Several HLA types were overrepresented in DI participants as compared to MI controls.
These results identify potential factors for DI susceptibility and further define its clinical consequences.
HIV-1 dual infection; viral load; CD4 count; HLA; CTL
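The structural change (Chow) test mentioned in the methods compares one regression fitted to the whole series with separate regressions fitted before and after a candidate break point. The following is a minimal sketch for a simple linear trend with an intercept and a slope; the time points, viral load values, and function names are hypothetical, and converting the F statistic to a p-value (which needs the F distribution) is left out.

```python
def residual_sum_of_squares(xs, ys):
    """RSS of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def chow_f_statistic(xs, ys, break_index, k=2):
    """Chow F statistic for a break before position break_index.

    k is the number of regression parameters (intercept and slope).
    """
    rss_pooled = residual_sum_of_squares(xs, ys)
    rss_split = (
        residual_sum_of_squares(xs[:break_index], ys[:break_index])
        + residual_sum_of_squares(xs[break_index:], ys[break_index:])
    )
    n = len(xs)
    return ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))


# Hypothetical log10 viral loads at eight visits; the second series jumps
# by one log after the fourth visit (e.g., at a superinfection event).
visits = [0, 1, 2, 3, 4, 5, 6, 7]
stable = [4.0, 4.1, 3.9, 4.05, 4.0, 4.1, 3.95, 4.05]
shifted = [4.0, 4.1, 3.9, 4.05, 5.0, 5.1, 4.95, 5.05]

print(chow_f_statistic(visits, stable, 4))
print(chow_f_statistic(visits, shifted, 4))
```

A large F statistic indicates that the two-segment fit explains the data much better than a single line, which is the pattern expected if viral load dynamics change at the break point.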
Viral suppressors of RNAi (VSRs) are proteins that actively inhibit the antiviral RNA interference (RNAi) immune response, providing an immune evasion route for viruses. It has been hypothesized that VSRs are engaged in a molecular ‘arms race’ with RNAi pathway genes. Two lines of evidence support this. First, VSRs from plant viruses display high sequence diversity, and are frequently gained and lost over evolutionary time scales. Second, Drosophila antiviral RNAi genes show high rates of adaptive evolution. Here, we investigate whether VSRs diversify faster than other genes and, if so, whether this is a result of positive selection, as might be expected in an arms race. By analysis of 12 plant RNA viruses, we show that the relative rate of protein evolution is higher for VSRs than for other genes, but that this is not attributable to pervasive positive selection. We argue that, because evolutionary time scales are extremely different for viruses and eukaryotes, it is improbable that viral adaptation (as measured by the ratio of non-synonymous to synonymous change) will be dominated by one-to-one coevolution with eukaryotes. Instead, for plant virus VSRs, we find strong evidence of episodic selection—diversifying selection that acts on a subset of lineages—which might be attributable to frequent shifts between different host genotypes or species.
molecular evolution; positive selection; evolutionary arms race; RNA interference; viral suppressor of RNAi; RNA silencing suppressors
To subvert host defenses, some microbial pathogens produce proteins that interact with conserved motifs in variable regions of B-cell antigen receptor shared by large sets of lymphocytes, which define the properties of a superantigen. As the clonal composition of the lymphocyte pool is a major determinant of immune responsiveness, this study was undertaken to examine the in vivo effect on the host immune system of exposure to a B-cell superantigen, protein L (PpL), a product of the common commensal bacterial species, Finegoldia magna, which is one of the most common pathogenic species amongst Gram-positive anaerobic cocci. Libraries of variable kappa (Vκ) light chain transcripts were generated from the spleens of control and PpL-exposed mice, and the expressed Vκ rearrangements were characterized by high-throughput sequencing. A total of 120,855 sequencing reads could be assigned to a germline Vκ gene, with all 20 known Vκ subgroups represented. In control mice, we found a recurrent and consistent hierarchy of Vκ gene usage, as well as patterns of preferential Vκ-Jκ pairing. PpL exposure induced significant targeted global shifts in repertoire with reduction of Vκ that contain the superantigen binding motif in all exposed mice, with significant targeted reductions in the expression of clonotypes encoded by 14 specific Vκ genes with the predicted PpL binding motif. These rigorous surveys document the capacity of a microbial protein to modulate the composition of the expressed lymphocyte repertoire, which also has broad potential implications for host-microbiome and host-pathogen relationships.
High-throughput sequencing; BCR repertoire; Protein L; Immunoglobulin kappa light chain; 454 sequencing
Specific sequence changes of human immunodeficiency virus type 1 (HIV-1) in the presence of specific HLA molecules may alter the composition and processing of viral peptides, leading to immune escape. Persistence of these mutations after transmission may leave the genetic fingerprint of the transmitter's HLA profile. Here, we evaluated the associations between HLA profiles and the phylogenetic relationships of HIV sequences sampled from a cohort of recently infected individuals in San Diego, California.
We identified transmission clusters within the study cohort, using phylogenetic analysis of sampled HIV pol genotypes at a genetic distance of <1.5%. We then evaluated the association of specific HLA alleles, HLA homozygosity, HLA concordance, race and ethnicity, and mutational patterns within the clustering and nonclustering groups.
From 350 cohort participants, we identified 162 clustering individuals and 188 nonclustering individuals. We identified trends for enrichment of particular alleles within individual clusters and evidence of viral escape within those clusters. We also found that discordance of HLA alleles was significantly associated with clustering individuals.
Some transmission clusters demonstrate HLA enrichment, and viruses in these HLA-associated clusters often show evidence of escape to enriched alleles. Interestingly, HLA discordance was associated with clustering in our predominantly MSM population.
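Grouping sequences into transmission clusters at a genetic distance of <1.5% can be sketched as a connected-components problem. The study used phylogenetic analysis of pol genotypes; the sketch below swaps in a simpler stand-in, an uncorrected proportion of differing sites with single-linkage grouping, purely for illustration. The sequences, identifiers, and function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def find_clusters(seqs, threshold=0.015):
    """Group sequence ids linked by pairwise distance below threshold."""
    parent = {name: name for name in seqs}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    # Individuals linked to nobody are reported as nonclustering (dropped).
    return [members for members in groups.values() if len(members) > 1]


def mutate(seq, positions):
    """Return seq with a different base at each listed position."""
    chars = list(seq)
    for i in positions:
        chars[i] = "A" if chars[i] != "A" else "C"
    return "".join(chars)


base = "ACGT" * 50  # 200-nt toy sequence
toy = {
    "p1": base,
    "p2": mutate(base, [10]),               # 0.5% from p1
    "p3": mutate(base, [10, 50]),           # 1.0% from p1
    "p4": mutate(base, range(0, 200, 10)),  # 10% from p1
}
print(find_clusters(toy))  # one cluster containing p1, p2 and p3
```

With single linkage, two individuals can end up in the same cluster through a chain of intermediate links even if their own pairwise distance exceeds the threshold, which is worth keeping in mind when interpreting cluster membership.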
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
molecular evolution; statistical inference; phylogenetics; evolutionary tree; statistical bias; variance
Rate heterogeneity among lineages is a common feature of molecular evolution, and it has long impeded our ability to accurately estimate the age of evolutionary divergence events. The development of relaxed molecular clocks, which model variable substitution rates among lineages, was intended to rectify this problem. Major subtypes of pandemic HIV-1 group M are thought to exemplify closely related lineages with different substitution rates. Here, we report that inferring the time of most recent common ancestor of all these subtypes in a single phylogeny under a single (relaxed) molecular clock produces significantly different dates for many of the subtypes than does analysis of each subtype on its own. We explore various methods to ameliorate this problem. We conclude that current molecular dating methods are inadequate for dealing with this type of substitution rate variation in HIV-1. Through simulation, we show that heterotachy causes root ages to be overestimated.
molecular clock; rate variation; HIV-1
Statistical methods for molecular dating of viral origins have been used extensively to infer the time of most recent common ancestor for many rapidly evolving pathogens. However, there are a number of cases in which epidemiological, historical, or genomic evidence suggests much older viral origins than those obtained via molecular dating. We demonstrate how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures. We use codon-based substitution models to infer the length of branches in viral phylogenies; these models produce estimates that are often considerably longer than those obtained with traditional nucleotide-based substitution models. Correcting the apparent underestimation of branch lengths suggests substantially older origins for measles, Ebola, and avian influenza viruses. This work helps to reconcile some of the inconsistencies between molecular dating and other types of evidence concerning the age of viral lineages.
measles virus; rinderpest virus; Ebola virus; avian influenza virus; molecular clock; substitution rate; codon model; purifying selection
Standard genotypic antiretroviral resistance testing, performed by bulk sequencing, does not readily detect variants that comprise <20% of the circulating HIV-1 RNA population. Nevertheless, it is valuable in selecting an antiretroviral regimen after antiretroviral failure. In patients with poor adherence, resistant variants may not reach this threshold. Therefore, deep sequencing would be potentially valuable for detecting minority resistant variants. We compared bulk sequencing and deep sequencing to detect HIV-1 drug resistance at the time of a second-line protease inhibitor (PI)-based antiretroviral regimen failure. Eligibility criteria were virologic failure (HIV-1 RNA load of >500 copies/ml) of a first-line nonnucleoside reverse transcriptase inhibitor-based regimen, with at least the M184V mutation (lamivudine resistance), and second-line failure of a lopinavir/ritonavir (LPV/r)-based regimen. An amplicon-sequencing approach on the Roche 454 system was used. Six patients with viral loads of >90,000 copies/ml and one patient with a viral load of 520 copies/ml were included. Mutations not detectable by bulk sequencing during first- and second-line failure were detected by deep sequencing during second-line failure. Low-frequency variants (>0.5% of the sequence population) harboring major protease inhibitor resistance mutations were found in 5 of 7 patients despite poor adherence to the LPV/r-based regimen. In patients with intermittent adherence to a boosted PI regimen, deep sequencing may detect minority PI-resistant variants, which likely represent early events in resistance selection. In patients with poor or intermittent adherence, there may be low evolutionary impetus for such variants to reach fixation, explaining the low prevalence of PI resistance.
Adaptive evolution frequently occurs in episodic bursts, localized to a few sites in a gene, and to a small number of lineages in a phylogenetic tree. A popular class of “branch-site” evolutionary models provides a statistical framework to search for evidence of such episodic selection. For computational tractability, current branch-site models unrealistically assume that all branches in the tree can be partitioned a priori into two rigid classes—“foreground” branches that are allowed to undergo diversifying selective bursts and “background” branches that are negatively selected or neutral. We demonstrate that this assumption leads to unacceptably high rates of false positives or false negatives when the evolutionary process along background branches strongly deviates from modeling assumptions. To address this problem, we extend Felsenstein's pruning algorithm to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework. This enables us to model the process at every branch-site combination as a mixture of three Markov substitution models—our model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch. When benchmarked on a previously published set of simulated sequences, our method consistently matched or outperformed existing branch-site tests in terms of power and error rates. Using three empirical data sets, previously analyzed for episodic selection, we discuss how modeling assumptions can influence inference in practical situations.
episodic selection; random effects model; evolutionary model; branch-site model
The imprint of natural selection on protein coding genes is often difficult to identify because selection is frequently transient or episodic, i.e. it affects only a subset of lineages. Existing computational techniques, which are designed to identify sites subject to pervasive selection, may fail to recognize sites where selection is episodic: a large proportion of positively selected sites. We present a mixed effects model of evolution (MEME) that is capable of identifying instances of both episodic and pervasive positive selection at the level of an individual site. Using empirical and simulated data, we demonstrate the superior performance of MEME over older models under a broad range of scenarios. We find that episodic selection is widespread and conclude that the number of sites experiencing positive selection may have been vastly underestimated.
Identifying regions of protein coding genes that have undergone adaptive evolution is important to answering many questions in evolutionary biology and genetics. In order to tease out genetic evidence for natural selection, genes from a diverse array of taxa must be analyzed, only a subset of which may have undergone adaptive evolution; the same gene region may be under stabilizing or relaxed selection in lineages leading to other taxa. Most current computational methods designed to detect the imprint of natural selection at a site in a protein coding gene assume the strength and direction of natural selection is constant across all lineages. Here, we present a method to detect adaptive evolution, even when the selective forces are not constant across taxa. Using a variety of well-characterized genes, we find evidence suggesting that natural selection is generally episodic and that modeling it as such reveals that many more sites are subject to episodic positive selection than previously appreciated.
During the late 1980s and early 1990s, an estimated 10,000 Romanian children were infected with HIV-1 subtype F nosocomially through contaminated needles and blood transfusions. However, the geographic source and origins of this epidemic remain unclear.
Here we used phylogenetic inference and “relaxed” molecular clock dating analysis to further characterize the Romanian HIV-1 subtype F epidemic.
These analyses revealed a major lineage of Romanian HIV sequences consisting nearly entirely of virus sampled from adolescents and children and a distinct cluster that included a much higher ratio of adult sequences. Divergence time estimates inferred the time of most recent common ancestor of subtype F1 sequences to be 1973 (1966–1980) and that of all Angolan sequences to be 1975 (1968–1980). The most recent common ancestor of the Romanian sequences was dated to 1978 (1972–1983), with pediatric and adolescent sequences interspersed throughout the lineage. The phylogenetic structure of the entire subtype F epidemic suggests that multiple introductions of subtype F into Romania occurred either from the Angolan epidemic or from more distant ancestors. Since the historical records note that the Romanian pediatric epidemic did not begin until the late 1980s, the inferred time of most recent common ancestor of the Romanian lineage of 1978 suggests that multiple introductions of subtype F into the pediatric population occurred from HIV already circulating in Romania.
Analysis of the subtype F HIV-1 epidemic in an historical context allows for a deeper appreciation of how the HIV pandemic has been influenced by socio-political events.
Phylogeography; Romania; Subtype F; Socio-political; HIV
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance, and neither a test for episodic diversifying selection nor a test for constant directional selection is able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance.
When exposed to treatment, HIV-1 and other rapidly evolving viruses have the capacity to acquire drug resistance-associated mutations (DRAMs), which limit the efficacy of antivirals. There are a number of experimentally well characterized HIV-1 DRAMs, but many mutations whose roles are not fully understood have also been reported. In this manuscript we construct evolutionary models that identify the locations and targets of mutations conferring resistance to antiretrovirals from viral sequences sampled from treated and untreated individuals. While the evolution of drug resistance is a classic example of natural selection, existing analyses fail to detect the majority of DRAMs. We show that, in order to identify resistance mutations from sequence data, it is necessary to recognize that in this case natural selection is both episodic (it only operates when the virus is exposed to the drugs) and directional (only mutations to a particular amino acid confer resistance while allowing the virus to continue replicating). The new class of models that allow for the episodic and directional nature of adaptive evolution performs very well at recovering known DRAMs, can be useful for identifying unknown resistance-associated mutations, and is generally applicable to a variety of biological scenarios where similar selective forces are at play.