A critical question in biology is the identification of functionally important amino acid sites in proteins. Because functionally important sites are under stronger purifying selection, site-specific substitution rates tend to be lower than usual at these sites. A large number of phylogenetic models have been developed to estimate site-specific substitution rates in proteins and the extraordinarily low substitution rates have been used as evidence of function. Most of the existing tools, e.g. Rate4Site, assume that site-specific substitution rates are independent across sites. However, site-specific substitution rates may be strongly correlated in the protein tertiary structure, since functionally important sites tend to be clustered together to form functional patches. We have developed a new model, GP4Rate, which incorporates the Gaussian process model with the standard phylogenetic model to identify slowly evolved regions in protein tertiary structures. GP4Rate uses the Gaussian process to define a nonparametric prior distribution of site-specific substitution rates, which naturally captures the spatial correlation of substitution rates. Simulations suggest that GP4Rate can potentially estimate site-specific substitution rates with a much higher accuracy than Rate4Site and tends to report slowly evolved regions rather than individual sites. In addition, GP4Rate can estimate the strength of the spatial correlation of substitution rates from the data. By applying GP4Rate to a set of mammalian B7-1 genes, we found a highly conserved region which coincides with experimental evidence. GP4Rate may be a useful tool for the in silico prediction of functionally important regions in the proteins with known structures.
To understand how a protein functions, a critical step is to know which regions in its tertiary structure may be functionally important. Functionally important protein regions are typically more conserved than other regions because mutations in these regions are more likely to be deleterious. A number of phylogenetic models have been developed to identify conserved sites or regions in proteins by comparing protein sequences from multiple species. However, most of these methods treat amino acid sites independently and do not consider the spatial clustering of conserved sites in the protein tertiary structure. Therefore, their power to identify functional protein regions is limited. We develop a new statistical model, GP4Rate, which combines the information from the protein sequences and the protein tertiary structure to infer conserved regions. We demonstrate that GP4Rate outperforms Rate4Site, the most widely used phylogenetic software for inferring functional amino acid sites, in simulations and in a case study of B7-1 genes. GP4Rate is a potentially useful tool for guiding mutagenesis experiments or providing insights into the relationship between protein structures and functions.
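The idea of a spatially correlated prior on site-specific rates can be illustrated with a small sketch. It is not the GP4Rate implementation: the squared-exponential kernel, the log link that keeps rates positive, the length scale of 8 (in the same units as the coordinates), and the function names are all assumptions chosen for illustration. It only shows how sites that are close in the tertiary structure receive strongly correlated rates under a Gaussian process prior.

```python
import numpy as np

def squared_exponential_kernel(coords, length_scale=8.0, variance=1.0):
    """Covariance between sites as a function of their 3D Euclidean distance.

    The kernel form and default length scale are assumptions, not taken from GP4Rate.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / length_scale ** 2)

def sample_site_rates(coords, length_scale=8.0, variance=1.0, seed=0):
    """Draw one set of positive site rates from a GP prior on the log-rate."""
    rng = np.random.default_rng(seed)
    cov = squared_exponential_kernel(coords, length_scale, variance)
    cov = cov + 1e-8 * np.eye(len(coords))  # jitter for numerical stability
    latent = rng.multivariate_normal(np.zeros(len(coords)), cov)
    return np.exp(latent)  # log link (assumed) keeps rates positive

# Three hypothetical sites: two neighbours and one distant site.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
K = squared_exponential_kernel(coords)
rates = sample_site_rates(coords)
```

In this toy setting the prior correlation between the two neighbouring sites is close to 1, while the distant site is nearly independent of both. The length scale plays the role of the strength of spatial correlation, which the abstract says GP4Rate estimates from the data.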
Identifying the source of transmission using pathogen genetic data is complicated by numerous biological, immunological, and behavioral factors. A large source of error arises when there is incomplete or sparse sampling of cases. Unsampled cases may act as either a common source of infection or as an intermediary in a transmission chain for hosts infected with genetically similar pathogens. It is difficult to quantify the probability of common source or intermediate transmission events, which has made it difficult to develop statistical tests to either confirm or deny putative transmission pairs with genetic data. We present a method to incorporate additional information about an infectious disease epidemic, such as incidence and prevalence of infection over time, to inform estimates of the probability that one sampled host is the direct source of infection of another host in a pathogen gene genealogy. These methods enable forensic applications, such as source-case attribution, for infectious disease epidemics with incomplete sampling, which is usually the case for high-morbidity community-acquired pathogens like HIV, Influenza and Dengue virus. These methods also enable epidemiological applications such as the identification of factors that increase the risk of transmission. We demonstrate these methods in the context of the HIV epidemic in Detroit, Michigan, and we evaluate the suitability of current sequence databases for forensic and epidemiological investigations. We find that currently available sequences collected for drug resistance testing of HIV are unlikely to be useful in most forensic investigations, but are useful for identifying transmission risk factors.
Molecular data from pathogens may be useful for identifying the source of infection and identifying pairs of individuals such that one host transmitted to the other. Inference of who acquired infection from whom is confounded by incomplete sampling, and given genetic data only, it is not possible to infer the direction of transmission in a transmission pair. Given additional information about an infectious disease epidemic, such as incidence of infection over time, and the proportion of hosts sampled, it is possible to correct for biases stemming from incomplete sampling of the infected host population. It may even be possible to infer the direction of transmission within a transmission pair if additional clinical, behavioral, and demographic covariates of the infected hosts are available. We consider the problem of identifying the source of infection using HIV sequence data collected for clinical purposes. We find that it is rarely possible to infer transmission pairs with high credibility, but such data may nevertheless be useful for epidemiological investigations and identifying risk factors for transmission.
We present a case of sexual transmission of HIV-1 predicted to have CXCR4-tropism during male-to-male sexual exposure. Phylogenetic analyses exclude cell-free virus in the seminal plasma of the source partner and possibly point to the seminal cells as the origin of the transmission event.
Coronaviruses are found in a diverse array of bat and bird species, which are believed to act as natural hosts. Molecular clock dating analyses of coronaviruses suggest that the most recent common ancestor of these viruses existed around 10,000 years ago. This relatively young age is in sharp contrast to the ancient evolutionary history of their putative natural hosts, which began diversifying tens of millions of years ago. Here, we attempted to resolve this discrepancy by applying more realistic evolutionary models that have previously revealed the ancient evolutionary history of other RNA viruses. By explicitly modeling variation in the strength of natural selection over time and thereby improving the modeling of substitution saturation, we found that the time to the most recent common ancestor of all coronaviruses is likely far greater (millions of years) than the previously inferred range.
Motivation: Statistical methods for comparing relative rates of synonymous and non-synonymous substitutions maintain a central role in detecting positive selection. To identify selection, researchers often estimate the ratio of these relative rates (dN/dS) at individual alignment sites. Fitting a codon substitution model that captures heterogeneity in dN/dS across sites provides a reliable way to perform such estimation, but it remains computationally prohibitive for massive datasets. By using crude estimates of the numbers of synonymous and non-synonymous substitutions at each site, counting approaches scale well to large datasets, but they fail to account for ancestral state reconstruction uncertainty and to provide site-specific dN/dS estimates.
Results: We propose a hybrid solution that borrows the computational strength of counting methods, but augments these methods with empirical Bayes modeling to produce a relatively fast and reliable method capable of estimating site-specific dN/dS values in large datasets. Importantly, our hybrid approach, set in a Bayesian framework, integrates over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty about site-specific dN/dS estimates. Simulations demonstrate that this method competes well with more-principled statistical procedures and, in some cases, even outperforms them. We illustrate the utility of our method using human immunodeficiency virus, feline panleukopenia and canine parvovirus evolution examples.
Availability: Renaissance counting is implemented in the development branch of BEAST, freely available at http://code.google.com/p/beast-mcmc/. The method will be made available in the next public release of the package, including support to set up analyses in BEAUti.
Supplementary data are available at Bioinformatics online.
The genital tract of individuals infected with HIV-1 is an anatomic compartment that supports local HIV-1 and CMV replication. This study investigated the association of seminal CMV replication with changes in HIV-1 clonal expansion, evolution and phylogenetic compartmentalization between blood and semen. Fourteen paired blood and semen samples were analyzed from four untreated subjects. Clonal sequences (n=607) were generated from extracted HIV-1 RNA (env C2-V3 region), and HIV-1 and CMV levels were measured in the seminal plasma by real-time PCR. Sequence alignments were evaluated for: (i) viral compartmentalization between semen and blood samples using Slatkin-Maddison and FST methods, (ii) different nucleotide substitution rates in semen and blood, and (iii) association between proportions of clonal HIV-1 sequences in each compartment and seminal CMV levels. Half of the semen samples had detectable CMV DNA, with at least one CMV positive sample for each patient. Seminal CMV DNA levels correlated positively with seminal HIV-1 RNA levels (Spearman p=0.05). A trend towards an association between compartmentalization of HIV-1 sequences sampled from blood and semen and presence of seminal CMV was observed (Cochran Q test p=0.12). Evolutionary rates between semen and blood HIV-1 populations did not differ significantly, and there was no significant association between seminal CMV DNA levels and the frequency of non-unique clonal HIV-1 sequences in the semen. In conclusion, the effects of CMV replication on HIV-1 viral and immunologic dynamics within the male genital tract are not significant enough to perturb evolution or disrupt compartmentalization in the genital tract.
HIV-1; Cytomegalovirus; compartmentalization; evolution; semen
To investigate the susceptibilities to and consequences of HIV-1 dual infection (DI).
We compared clinical, virologic, and immunologic factors between participants who were dually infected with HIV-1 subtype B, and monoinfected (MI) controls who were matched by ongoing HIV risk factor.
The viral load and CD4 progressions of dually and singly infected participant groups were compared with linear mixed-effects models, and individual dynamics before and after superinfection were assessed with a structural change test (Chow test). Recombination breakpoint analysis (GARD), HLA frequency analysis, and cytotoxic T-lymphocyte (CTL) epitope mapping were also performed (HIV LANL Database).
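For readers unfamiliar with the structural change test mentioned above, the following is a minimal sketch of a Chow F statistic. The source names only the test itself; the assumptions here are that each participant's trajectory is modelled as a straight line against time and that the candidate break is placed at a known index (for example, the estimated time of superinfection). The function names are hypothetical.

```python
import numpy as np

def residual_sum_of_squares(x, y):
    """RSS of an ordinary least-squares line y = a + b*x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def chow_f_statistic(x, y, break_index, n_params=2):
    """Chow test statistic for a break between observations break_index-1 and break_index.

    Returns the F statistic and its (numerator, denominator) degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if break_index <= n_params or n - break_index <= n_params:
        raise ValueError("each segment needs more points than parameters")
    rss_pooled = residual_sum_of_squares(x, y)
    rss_split = (residual_sum_of_squares(x[:break_index], y[:break_index])
                 + residual_sum_of_squares(x[break_index:], y[break_index:]))
    df_den = n - 2 * n_params
    f_stat = ((rss_pooled - rss_split) / n_params) / (rss_split / df_den)
    return f_stat, (n_params, df_den)
```

A large F relative to the F distribution with the returned degrees of freedom indicates that two separate lines fit the data better than a single line, i.e. that the trajectory changed at the break.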
The viral loads of DI participants increased more over 3 years of follow-up than the viral loads of MI controls, while CD4 progressions of the two groups did not differ. Viral escape from CTL responses following superinfection was observed in two participants whose superinfecting strain completely replaced the initial strain. This pattern was not seen among participants whose superinfecting virus persisted in a recombinant form with the initial virus or was only detected transiently. Several HLA types were overrepresented in DI participants as compared to MI controls.
These results identify potential factors for DI susceptibility and further define its clinical consequences.
HIV-1 dual infection; viral load; CD4 count; HLA; CTL
Despite environmental, social and ecological dependencies, emergence of zoonotic viruses in human populations is clearly also affected by genetic factors which determine cross-species transmission potential. RNA viruses pose an interesting case study given their mutation rates are orders of magnitude higher than any other pathogen – as reflected by the recent emergence of SARS and Influenza for example. Here, we show how feature selection techniques can be used to reliably classify viral sequences by host species, and to identify the crucial minority of host-specific sites in pathogen genomic data. The variability in alleles at those sites can be translated into prediction probabilities that a particular pathogen isolate is adapted to a given host. We illustrate the power of these methods by: 1) identifying the sites explaining SARS coronavirus differences between human, bat and palm civet samples; 2) showing how cross species jumps of rabies virus among bat populations can be readily identified; and 3) de novo identification of likely functional influenza host discriminant markers.
Moving away from genome scan methods used for human GWAS (ultimately inappropriate for the short highly polymorphic genomes of RNA viruses), our work shows the power and potential of multi-class machine learning algorithms in inferring the functional genetic changes associated with phenotypic change (e.g. crossing a species barrier). We show that even distantly related viruses within a viral family share highly conserved genetic signatures of host specificity; reinforce how fitness landscapes of host adaptation are shaped by host phylogeny; and highlight the evolutionary trajectories of RNA viruses in rapid expansion and under great evolutionary pressure. We do so by (for each dataset) unveiling a set of phenotype characteristic mutations which are shown to be functionally relevant, thus providing new insights into phenotypic relationships between RNA viruses. These methods also provide a solid statistical framework with which the degree of host adaptation can be inferred, thus serving as a valuable tool for studying host transition events with particular relevance for emerging infectious diseases. These methods can then serve as rigorous tools of emergence potential assessment, specifically in scenarios where rapid host classification of newly emerging viruses can be more important than identifying putative functional sites.
Phylogenetic trees are used to analyze and visualize evolution. However, trees can be imperfect datatypes when summarizing multiple trees. This is especially problematic when accommodating biological phenomena such as horizontal gene transfer, incomplete lineage sorting, and hybridization, as well as topological conflict between datasets. Additionally, researchers may want to combine information from sets of trees that have partially overlapping taxon sets. To address the problem of analyzing sets of trees with conflicting relationships and partially overlapping taxon sets, we introduce methods for aligning, synthesizing and analyzing rooted phylogenetic trees within a graph, called a tree alignment graph (TAG). The TAG can be queried and analyzed to explore uncertainty and conflict. It can also be synthesized to construct trees, presenting an alternative to supertree approaches. We demonstrate these methods with two empirical datasets. In order to explore uncertainty, we constructed a TAG of the bootstrap trees from the Angiosperm Tree of Life project. Analysis of the resulting graph demonstrates that areas of the dataset that are unresolved in majority-rule consensus tree analyses can be understood in more detail within the context of a graph structure, using measures incorporating node degree and adjacency support. As an exercise in synthesis (i.e., summarization of a TAG constructed from the alignment trees), we also construct a TAG consisting of the taxonomy and source trees from a recent comprehensive bird study. We synthesized this graph into a tree that can be reconstructed in a repeatable fashion and where the underlying source information can be updated. The methods presented here are tractable for large scale analyses and serve as a basis for an alternative to consensus tree and supertree methods. Furthermore, the exploration of these graphs can expose structures and patterns within the dataset that are otherwise difficult to observe.
Phylogenetic trees are the most common datatype by which we examine evolutionary patterns. However, biological and practical considerations require the exploration of other models. Here, we address a problem concerning the representation of conflicting and partially overlapping datasets in phylogenetics. We examine the problem of aligning many source trees from independent phylogenetic analyses into a structure that can be analyzed and synthesized but retains all of the original structure and source information. We present methods to map trees into a common graph structure using a graph database. This allows the information in the trees to be stored and synthesized in several ways. Specifically, we demonstrate how these graphs can be used to construct enormous trees as an alternative to labor-intensive grafting exercises and other methods that make the synthetic tree difficult to update. We also show how examination of the relationships in the graph allows patterns to emerge concerning support and information that are difficult to discern with existing methods. Because these methods scale well into the millions of nodes, these techniques should lead to the construction and maintenance of even larger phylogenies and new techniques for analyzing graphs that maintain the structure of the underlying trees.
Viral suppressors of RNAi (VSRs) are proteins that actively inhibit the antiviral RNA interference (RNAi) immune response, providing an immune evasion route for viruses. It has been hypothesized that VSRs are engaged in a molecular ‘arms race’ with RNAi pathway genes. Two lines of evidence support this. First, VSRs from plant viruses display high sequence diversity, and are frequently gained and lost over evolutionary time scales. Second, Drosophila antiviral RNAi genes show high rates of adaptive evolution. Here, we investigate whether VSRs diversify faster than other genes and, if so, whether this is a result of positive selection, as might be expected in an arms race. By analysis of 12 plant RNA viruses, we show that the relative rate of protein evolution is higher for VSRs than for other genes, but that this is not attributable to pervasive positive selection. We argue that, because evolutionary time scales are extremely different for viruses and eukaryotes, it is improbable that viral adaptation (as measured by the ratio of non-synonymous to synonymous change) will be dominated by one-to-one coevolution with eukaryotes. Instead, for plant virus VSRs, we find strong evidence of episodic selection—diversifying selection that acts on a subset of lineages—which might be attributable to frequent shifts between different host genotypes or species.
molecular evolution; positive selection; evolutionary arms race; RNA interference; viral suppressor of RNAi; RNA silencing suppressors
To subvert host defenses, some microbial pathogens produce proteins that interact with conserved motifs in variable regions of B-cell antigen receptor shared by large sets of lymphocytes, which define the properties of a superantigen. As the clonal composition of the lymphocyte pool is a major determinant of immune responsiveness, this study was undertaken to examine the in vivo effect on the host immune system of exposure to a B-cell superantigen, protein L (PpL), a product of the common commensal bacterial species, Finegoldia magna, which is one of the most common pathogenic species amongst Gram-positive anaerobic cocci. Libraries of variable kappa (Vκ) light chain transcripts were generated from the spleens of control and PpL-exposed mice, and the expressed Vκ rearrangements were characterized by high-throughput sequencing. A total of 120,855 sequencing reads could be assigned to a germline Vκ gene, with all 20 known Vκ subgroups represented. In control mice, we found a recurrent and consistent hierarchy of Vκ gene usage, as well as patterns of preferential Vκ-Jκ pairing. PpL exposure induced significant targeted global shifts in repertoire with reduction of Vκ that contain the superantigen binding motif in all exposed mice, with significant targeted reductions in the expression of clonotypes encoded by 14 specific Vκ genes with the predicted PpL binding motif. These rigorous surveys document the capacity of a microbial protein to modulate the composition of the expressed lymphocyte repertoire, which also has broad potential implications for host-microbiome and host-pathogen relationships.
High-throughput sequencing; BCR repertoire; Protein L; Immunoglobulin kappa light chain; 454 sequencing
Specific sequence changes of human immunodeficiency virus type 1 (HIV-1) in the presence of specific HLA molecules may alter the composition and processing of viral peptides, leading to immune escape. Persistence of these mutations after transmission may leave the genetic fingerprint of the transmitter's HLA profile. Here, we evaluated the associations between HLA profiles and the phylogenetic relationships of HIV sequences sampled from a cohort of recently infected individuals in San Diego, California.
We identified transmission clusters within the study cohort, using phylogenetic analysis of sampled HIV pol genotypes at a genetic distance of <1.5%. We then evaluated the association of specific HLA alleles, HLA homozygosity, HLA concordance, race and ethnicity, and mutational patterns within the clustering and nonclustering groups.
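Distance-threshold clustering of this kind can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the abstract does not say which distance measure or linkage rule was applied, so the sketch uses the raw proportion of differing sites (p-distance) on pre-aligned sequences and joins any two sequences closer than the threshold (single linkage, via union-find). Function names are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0  # no comparable sites: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def cluster_sequences(seqs, threshold=0.015):
    """Group aligned sequences linked by pairwise distance below threshold.

    seqs maps a sequence name to an aligned sequence string.
    Returns a list of sets, one per cluster with at least two members.
    """
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if p_distance(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return [g for g in groups.values() if len(g) > 1]
```

Sequences that end up in no returned set correspond to the nonclustering group. Comparing all pairs is quadratic in the number of sequences, which is workable for a cohort of a few hundred participants.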
From 350 cohort participants, we identified 162 clustering individuals and 188 nonclustering individuals. We identified trends for enrichment of particular alleles within individual clusters and evidence of viral escape within those clusters. We also found that discordance of HLA alleles was significantly associated with clustering individuals.
Some transmission clusters demonstrate HLA enrichment, and viruses in these HLA-associated clusters often show evidence of escape to enriched alleles. Interestingly, HLA discordance was associated with clustering in our predominantly MSM population.
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that the assessment of the robustness of results to biological factors, that may systematically mislead (bias) the outcomes of statistical estimation, will be a key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
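The interplay between sample size, effect size, and P value can be made concrete with a toy calculation. It is not drawn from the paper: it assumes a simple two-sided z-test on a mean with known standard deviation, and the numbers are arbitrary. It shows that a fixed, very small effect produces an arbitrarily small P value once the data set is large enough.

```python
import math

def two_sided_p(effect, sd, n):
    """Two-sided P value of a z-test for a mean shift of size `effect`."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (1% of a standard deviation) at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(effect=0.01, sd=1.0, n=n))
```

The effect size is identical in all three cases; only the sample size changes. This is why a highly significant P value from a genome-scale data set says little by itself about the magnitude or the biological relevance of a difference.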
molecular evolution; statistical inference; phylogenetics; evolutionary tree; statistical bias; variance
Rate heterogeneity among lineages is a common feature of molecular evolution, and it has long impeded our ability to accurately estimate the age of evolutionary divergence events. The development of relaxed molecular clocks, which model variable substitution rates among lineages, was intended to rectify this problem. Major subtypes of pandemic HIV-1 group M are thought to exemplify closely related lineages with different substitution rates. Here, we report that inferring the time of most recent common ancestor of all these subtypes in a single phylogeny under a single (relaxed) molecular clock produces significantly different dates for many of the subtypes than does analysis of each subtype on its own. We explore various methods to ameliorate this problem. We conclude that current molecular dating methods are inadequate for dealing with this type of substitution rate variation in HIV-1. Through simulation, we show that heterotachy causes root ages to be overestimated.
molecular clock; rate variation; HIV-1
Statistical methods for molecular dating of viral origins have been used extensively to infer the time of the most recent common ancestor for many rapidly evolving pathogens. However, there are a number of cases in which epidemiological, historical, or genomic evidence suggests much older viral origins than those obtained via molecular dating. We demonstrate how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures. We use codon-based substitution models to infer the length of branches in viral phylogenies; these models produce estimates that are often considerably longer than those obtained with traditional nucleotide-based substitution models. Correcting the apparent underestimation of branch lengths suggests substantially older origins for measles, Ebola, and avian influenza viruses. This work helps to reconcile some of the inconsistencies between molecular dating and other types of evidence concerning the age of viral lineages.
measles virus; rinderpest virus; Ebola virus; avian influenza virus; molecular clock; substitution rate; codon model; purifying selection
Standard genotypic antiretroviral resistance testing, performed by bulk sequencing, does not readily detect variants that comprise <20% of the circulating HIV-1 RNA population. Nevertheless, it is valuable in selecting an antiretroviral regimen after antiretroviral failure. In patients with poor adherence, resistant variants may not reach this threshold. Therefore, deep sequencing would be potentially valuable for detecting minority resistant variants. We compared bulk sequencing and deep sequencing to detect HIV-1 drug resistance at the time of a second-line protease inhibitor (PI)-based antiretroviral regimen failure. Eligibility criteria were virologic failure (HIV-1 RNA load of >500 copies/ml) of a first-line nonnucleoside reverse transcriptase inhibitor-based regimen, with at least the M184V mutation (lamivudine resistance), and second-line failure of a lopinavir/ritonavir (LPV/r)-based regimen. An amplicon-sequencing approach on the Roche 454 system was used. Six patients with viral loads of >90,000 copies/ml and one patient with a viral load of 520 copies/ml were included. Mutations not detectable by bulk sequencing during first- and second-line failure were detected by deep sequencing during second-line failure. Low-frequency variants (>0.5% of the sequence population) harboring major protease inhibitor resistance mutations were found in 5 of 7 patients despite poor adherence to the LPV/r-based regimen. In patients with intermittent adherence to a boosted PI regimen, deep sequencing may detect minority PI-resistant variants, which likely represent early events in resistance selection. In patients with poor or intermittent adherence, there may be low evolutionary impetus for such variants to reach fixation, explaining the low prevalence of PI resistance.
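The two frequency thresholds quoted above (about 20% for bulk sequencing and 0.5% for the deep-sequencing variants reported) can be expressed as a small classification helper. This is only an illustration of the thresholds: the function name and category labels are hypothetical, and real variant calling would also need to model sequencing error, which is not attempted here.

```python
def classify_variant(variant_reads, total_reads, deep_min=0.005, bulk_min=0.20):
    """Classify a resistance mutation by its frequency among sequencing reads.

    Thresholds follow the figures quoted in the text; labels are illustrative.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    freq = variant_reads / total_reads
    if freq >= bulk_min:
        return freq, "detectable by bulk sequencing"
    if freq > deep_min:
        return freq, "minority variant (deep sequencing only)"
    return freq, "below reporting threshold"
```

For example, a mutation seen in 30 of 1,000 reads (3%) falls in the minority-variant category: above the 0.5% reporting threshold but well below the level that bulk sequencing readily detects.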
Adaptive evolution frequently occurs in episodic bursts, localized to a few sites in a gene, and to a small number of lineages in a phylogenetic tree. A popular class of “branch-site” evolutionary models provides a statistical framework to search for evidence of such episodic selection. For computational tractability, current branch-site models unrealistically assume that all branches in the tree can be partitioned a priori into two rigid classes—“foreground” branches that are allowed to undergo diversifying selective bursts and “background” branches that are negatively selected or neutral. We demonstrate that this assumption leads to unacceptably high rates of false positives or false negatives when the evolutionary process along background branches strongly deviates from modeling assumptions. To address this problem, we extend Felsenstein's pruning algorithm to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework. This enables us to model the process at every branch-site combination as a mixture of three Markov substitution models—our model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch. When benchmarked on a previously published set of simulated sequences, our method consistently matched or outperformed existing branch-site tests in terms of power and error rates. Using three empirical data sets, previously analyzed for episodic selection, we discuss how modeling assumptions can influence inference in practical situations.
episodic selection; random effects model; evolutionary model; branch-site model
The imprint of natural selection on protein coding genes is often difficult to identify because selection is frequently transient or episodic, i.e. it affects only a subset of lineages. Existing computational techniques, which are designed to identify sites subject to pervasive selection, may fail to recognize sites where selection is episodic: a large proportion of positively selected sites. We present a mixed effects model of evolution (MEME) that is capable of identifying instances of both episodic and pervasive positive selection at the level of an individual site. Using empirical and simulated data, we demonstrate the superior performance of MEME over older models under a broad range of scenarios. We find that episodic selection is widespread and conclude that the number of sites experiencing positive selection may have been vastly underestimated.
Identifying regions of protein coding genes that have undergone adaptive evolution is important to answering many questions in evolutionary biology and genetics. In order to tease out genetic evidence for natural selection, genes from a diverse array of taxa must be analyzed, only a subset of which may have undergone adaptive evolution; the same gene region may be under stabilizing or relaxed selection in lineages leading to other taxa. Most current computational methods designed to detect the imprint of natural selection at a site in a protein coding gene assume the strength and direction of natural selection is constant across all lineages. Here, we present a method to detect adaptive evolution, even when the selective forces are not constant across taxa. Using a variety of well-characterized genes, we find evidence suggesting that natural selection is generally episodic and that modeling it as such reveals that many more sites are subject to episodic positive selection than previously appreciated.
During the late 1980s and early 1990s, an estimated 10,000 Romanian children were infected with HIV-1 subtype F nosocomially through contaminated needles and blood transfusions. However, the geographic source and origins of this epidemic remain unclear.
Here we used phylogenetic inference and “relaxed” molecular clock dating analysis to further characterize the Romanian HIV-1 subtype F epidemic.
These analyses revealed a major lineage of Romanian HIV sequences consisting nearly entirely of virus sampled from adolescents and children and a distinct cluster that included a much higher ratio of adult sequences. Divergence time estimates placed the time of the most recent common ancestor of subtype F1 sequences at 1973 (1966–1980) and that of all Angolan sequences at 1975 (1968–1980). The most recent common ancestor of the Romanian sequences was dated to 1978 (1972–1983), with pediatric and adolescent sequences interspersed throughout the lineage. The phylogenetic structure of the entire subtype F epidemic suggests that multiple introductions of subtype F into Romania occurred either from the Angolan epidemic or from more distant ancestors. Since the historical records note that the Romanian pediatric epidemic did not begin until the late 1980s, the inferred time of the most recent common ancestor of the Romanian lineage, 1978, suggests that multiple introductions of subtype F into the pediatric population occurred from HIV already circulating in Romania.
Analysis of the subtype F HIV-1 epidemic in an historical context allows for a deeper appreciation of how the HIV pandemic has been influenced by socio-political events.
Phylogeography; Romania; Subtype F; Socio-political; HIV
Much molecular-evolution research is concerned with sequence analysis. Yet these sequences represent real, three-dimensional molecules with complex structure and function. Here I highlight a growing trend in the field to incorporate molecular structure and function into computational molecular-evolution work. I consider three focus areas: reconstruction and analysis of past evolutionary events, such as phylogenetic inference or methods to infer selection pressures; development of toy models and simulations to identify fundamental principles of molecular evolution; and atom-level, highly realistic computational modeling of molecular structure and function aimed at making predictions about possible future evolutionary events.
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance. Neither a test for episodic diversifying selection nor a test for constant directional selection is able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance.
When exposed to treatment, HIV-1 and other rapidly evolving viruses have the capacity to acquire drug resistance mutations (DRAMs), which limit the efficacy of antivirals. There are a number of experimentally well characterized HIV-1 DRAMs, but many mutations whose roles are not fully understood have also been reported. In this manuscript we construct evolutionary models that identify the locations and targets of mutations conferring resistance to antiretrovirals from viral sequences sampled from treated and untreated individuals. While the evolution of drug resistance is a classic example of natural selection, existing analyses fail to detect the majority of DRAMs. We show that, in order to identify resistance mutations from sequence data, it is necessary to recognize that in this case natural selection is both episodic (it only operates when the virus is exposed to the drugs) and directional (only mutations to a particular amino-acid confer resistance while allowing the virus to continue replicating). The new class of models that allow for the episodic and directional nature of adaptive evolution performs very well at recovering known DRAMs, can be useful at identifying unknown resistance-associated mutations, and is generally applicable to a variety of biological scenarios where similar selective forces are at play.
Reports of a high frequency of the transmission of minority viral populations with drug-resistant mutations (DRM) are inconsistent with evidence that HIV-1 infections usually arise from mono- or oligoclonal transmission. We performed ultradeep sequencing (UDS) of partial HIV-1 gag, pol, and env genes from 32 recently infected individuals. We then evaluated overall and per-site diversity levels, selective pressure, sequence reproducibility, and presence of DRM and accessory mutations (AM). To differentiate biologically meaningful mutations from those caused by methodological errors, we obtained multinomial confidence intervals (CI) for the proportion of DRM at each site and fitted a binomial mixture model to determine background error rates for each sample. We then examined the association between detected minority DRM and the virologic failure of first-line antiretroviral therapy (ART). Similar to other studies, we observed increased detection of DRM at low frequencies (average, 0.56%; 95% CI, 0.43 to 0.69; expected UDS error, 0.21 ± 0.08% mutations/site). For 8 duplicate runs, there was variability in the proportions of minority DRM. There was no indication of increased diversity or selection at DRM sites compared to other sites and no association between minority DRM and AM. There was no correlation between detected minority DRM and clinical failure of first-line ART. It is unlikely that minority viral variants harboring DRM are transmitted and maintained in the recipient host. The majority of low-frequency DRM detected using UDS are likely errors inherent to UDS methodology or a consequence of error-prone HIV-1 replication.
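The reasoning in this abstract, that a low-frequency variant is only meaningful if it exceeds what sequencing error alone would produce, can be illustrated with a simple one-sided binomial test. This is a minimal sketch and not the authors' procedure: they report multinomial confidence intervals and a fitted binomial mixture model, whereas the code below assumes a single fixed background error rate. The 0.21% error rate is taken from the abstract; the read depth of 5000 and the variant counts are hypothetical.

```python
import math

def log_binom_pmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly over the upper tail."""
    return min(1.0, sum(math.exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1)))

def exceeds_background(variant_reads, depth, error_rate, alpha=0.001):
    """True if this many variant reads is unlikely under sequencing error alone."""
    return binomial_tail(variant_reads, depth, error_rate) < alpha

# Hypothetical site: 5000 reads, background error of 0.21% per site.
depth, error_rate = 5000, 0.0021
for variant_reads in (11, 28):
    print(variant_reads, exceeds_background(variant_reads, depth, error_rate))
```

In practice the threshold `alpha` would need correction for the number of sites tested, and a per-sample error rate would be estimated from the data rather than fixed in advance.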
Effective population screening of HIV and prevention of HIV transmission are only part of the global fight against AIDS. Community-level effects, for example those aimed at thwarting future transmission, are potential outcomes of treatment and may be important in stemming the epidemic. However, current clinical trial designs are incapable of detecting a reduction in future transmission due to treatment. To address this shortcoming, we took advantage of the fact that HIV is an evolving pathogen whose transmission network can be reconstructed using genetic sequence information. Here, we use an HIV transmission network inferred from recently infected men who have sex with men (MSM) in San Diego, California. We developed and tested a network-based statistic for measuring treatment effects using simulated clinical trials on our inferred transmission network. We compared the statistical power of this network-based statistic with that of conventional efficacy measures and found that, when future transmission is reduced, the network-based statistic can offer greater statistical power. Furthermore, our simulations demonstrate that the network statistic is able to detect community-level effects (e.g., reduction in onward transmission) of HIV treatment in a clinical trial setting. This study demonstrates the potential utility of a network-based statistical metric when investigating HIV treatment options as a method to reduce onward transmission in a clinical trial setting.
Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites that rapidly escape host-immune pressure, and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analysis options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.
Availability and documentation: http://www.datamonkey.org
The endangered Przewalski's horse is the closest relative of the domestic horse and is the only true wild horse species surviving today. The question of whether Przewalski's horse is the direct progenitor of the domestic horse has been hotly debated. Studies of DNA diversity within Przewalski's horses have been sparse but are urgently needed to ensure their successful reintroduction to the wild. In an attempt to resolve the controversy surrounding the phylogenetic position and genetic diversity of Przewalski's horses, we used massively parallel sequencing technology to decipher the complete mitochondrial and partial nuclear genomes for all four surviving maternal lineages of Przewalski's horses. Unlike single-nucleotide polymorphism (SNP) typing, which is usually affected by ascertainment bias, the present method is expected to be largely unbiased. Three mitochondrial haplotypes were discovered: two similar ones, haplotypes I/II, and one substantially divergent from the other two, haplotype III. Haplotypes I/II versus III did not cluster together on a phylogenetic tree, rejecting the monophyly of Przewalski's horse maternal lineages, and were estimated to split 0.117–0.186 Ma, significantly preceding horse domestication. In the phylogeny based on autosomal sequences, Przewalski's horses formed a monophyletic clade, separate from the Thoroughbred domestic horse lineage. Our results suggest that Przewalski's horses have ancient origins and are not the direct progenitors of domestic horses. The analysis of the vast amount of sequence data presented here suggests that Przewalski's and domestic horse lineages diverged at least 0.117 Ma but since then have retained ancestral genetic polymorphism and/or experienced gene flow.
wild horse; next-generation sequencing; mitochondrial DNA; nuclear DNA; phylogeny