Despite environmental, social and ecological dependencies, emergence of zoonotic viruses in human populations is clearly also affected by genetic factors which determine cross-species transmission potential. RNA viruses pose an interesting case study given that their mutation rates are orders of magnitude higher than those of any other pathogen – as reflected by the recent emergence of SARS and Influenza, for example. Here, we show how feature selection techniques can be used to reliably classify viral sequences by host species, and to identify the crucial minority of host-specific sites in pathogen genomic data. The variability in alleles at those sites can be translated into prediction probabilities that a particular pathogen isolate is adapted to a given host. We illustrate the power of these methods by: 1) identifying the sites explaining SARS coronavirus differences between human, bat and palm civet samples; 2) showing how cross-species jumps of rabies virus among bat populations can be readily identified; and 3) de novo identification of likely functional influenza host discriminant markers.
Moving away from genome scan methods used for human GWAS (ultimately inappropriate for the short highly polymorphic genomes of RNA viruses), our work shows the power and potential of multi-class machine learning algorithms in inferring the functional genetic changes associated with phenotypic change (e.g. crossing a species barrier). We show that even distantly related viruses within a viral family share highly conserved genetic signatures of host specificity; reinforce how fitness landscapes of host adaptation are shaped by host phylogeny; and highlight the evolutionary trajectories of RNA viruses in rapid expansion and under great evolutionary pressure. We do so by (for each dataset) unveiling a set of phenotype characteristic mutations which are shown to be functionally relevant, thus providing new insights into phenotypic relationships between RNA viruses. These methods also provide a solid statistical framework with which the degree of host adaptation can be inferred, thus serving as a valuable tool for studying host transition events with particular relevance for emerging infectious diseases. These methods can then serve as rigorous tools of emergence potential assessment, specifically in scenarios where rapid host classification of newly emerging viruses can be more important than identifying putative functional sites.
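The idea of ranking alignment sites by how well they discriminate hosts can be illustrated with a small sketch. It scores each column of an alignment by its mutual information with the host label; this is a simple stand-in for feature selection in general and is not claimed to be the algorithm used in the study. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def mutual_information(column, labels):
    """Mutual information (in bits) between one alignment column and the host labels."""
    n = len(labels)
    joint = Counter(zip(column, labels))
    px = Counter(column)
    py = Counter(labels)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

def rank_sites(seqs, hosts):
    """Return (score, site_index) pairs, most host-informative site first.

    Assumes all sequences are aligned and of equal length (illustrative assumption).
    """
    length = len(seqs[0])
    scores = [
        (mutual_information([s[i] for s in seqs], hosts), i)
        for i in range(length)
    ]
    return sorted(scores, reverse=True)

# Toy data (hypothetical): site 1 separates the two hosts perfectly.
toy_seqs = ["AAG", "AAG", "ATG", "ATC"]
toy_hosts = ["human", "human", "bat", "bat"]
ranking = rank_sites(toy_seqs, toy_hosts)
```

A site whose alleles split cleanly by host receives the highest score, while an invariant site scores zero. A real analysis would also need to account for shared ancestry among sequences, which this sketch ignores.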
Phylogenetic trees are used to analyze and visualize evolution. However, trees can be imperfect datatypes when summarizing multiple trees. This is especially problematic when accommodating biological phenomena such as horizontal gene transfer, incomplete lineage sorting, and hybridization, as well as topological conflict between datasets. Additionally, researchers may want to combine information from sets of trees that have partially overlapping taxon sets. To address the problem of analyzing sets of trees with conflicting relationships and partially overlapping taxon sets, we introduce methods for aligning, synthesizing and analyzing rooted phylogenetic trees within a graph, called a tree alignment graph (TAG). The TAG can be queried and analyzed to explore uncertainty and conflict. It can also be synthesized to construct trees, presenting an alternative to supertree approaches. We demonstrate these methods with two empirical datasets. In order to explore uncertainty, we constructed a TAG of the bootstrap trees from the Angiosperm Tree of Life project. Analysis of the resulting graph demonstrates that areas of the dataset that are unresolved in majority-rule consensus tree analyses can be understood in more detail within the context of a graph structure, using measures incorporating node degree and adjacency support. As an exercise in synthesis (i.e., summarization of a TAG constructed from the alignment trees), we also construct a TAG consisting of the taxonomy and source trees from a recent comprehensive bird study. We synthesized this graph into a tree that can be reconstructed in a repeatable fashion and where the underlying source information can be updated. The methods presented here are tractable for large scale analyses and serve as a basis for an alternative to consensus tree and supertree methods. Furthermore, the exploration of these graphs can expose structures and patterns within the dataset that are otherwise difficult to observe.
Phylogenetic trees are the most common datatype by which we examine evolutionary patterns. However, biological and practical considerations require the exploration of other models. Here, we address a problem concerning the representation of conflicting and partially overlapping datasets in phylogenetics. We examine the problem of aligning many source trees from independent phylogenetic analyses into a structure that can be analyzed and synthesized but retains all of the original structure and source information. We present methods to map trees into a common graph structure using a graph database. This allows the information in the trees to be stored and synthesized in several ways. Specifically, we demonstrate how these graphs can be used to construct enormous trees as an alternative to labor-intensive grafting exercises and other methods that make the synthetic tree difficult to update. We also show how examination of the relationships in the graph allows patterns to emerge concerning support and information that are difficult to discern with existing methods. Because these methods scale well into the millions of nodes, these techniques should lead to the construction and maintenance of even larger phylogenies and new techniques for analyzing graphs that maintain the structure of the underlying trees.
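As a rough illustration of mapping several rooted trees into one shared graph, the sketch below identifies each node by the set of tips beneath it and records which source trees support each parent-to-child edge. It is a deliberate simplification: it assumes every tree has the same taxon set, uses an in-memory dictionary rather than a graph database, and all names are hypothetical. It is not the TAG construction procedure itself.

```python
from collections import defaultdict

def add_tree(tree, edges, source):
    """Record the edges of a nested-tuple tree; return the tip set of its root.

    A tip is a string; an internal node is a tuple of children.
    """
    if isinstance(tree, str):
        return frozenset([tree])
    child_sets = [add_tree(child, edges, source) for child in tree]
    node = frozenset().union(*child_sets)
    for child in child_sets:
        edges[(node, child)].add(source)  # remember which tree supports this edge
    return node

def build_graph(trees):
    """Map {source_name: tree} into {(parent, child): set of supporting sources}."""
    edges = defaultdict(set)
    for name, tree in trees.items():
        add_tree(tree, edges, name)
    return edges

def nodes_with_alternative_parents(edges):
    """Nodes attached to more than one distinct parent across the source trees."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {child: ps for child, ps in parents.items() if len(ps) > 1}

# Two toy source trees (hypothetical) that disagree on the placement of A, B and C.
toy_trees = {
    "t1": ((("A", "B"), "C"), "D"),
    "t2": ((("A", "C"), "B"), "D"),
}
graph = build_graph(toy_trees)
alternatives = nodes_with_alternative_parents(graph)
```

Edges supported by both trees carry both source labels, and nodes with more than one parent mark the places where the source trees disagree. Handling partially overlapping taxon sets, which the methods above are designed for, needs a more careful node-matching rule than equality of tip sets.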
Viral suppressors of RNAi (VSRs) are proteins that actively inhibit the antiviral RNA interference (RNAi) immune response, providing an immune evasion route for viruses. It has been hypothesized that VSRs are engaged in a molecular ‘arms race’ with RNAi pathway genes. Two lines of evidence support this. First, VSRs from plant viruses display high sequence diversity, and are frequently gained and lost over evolutionary time scales. Second, Drosophila antiviral RNAi genes show high rates of adaptive evolution. Here, we investigate whether VSRs diversify faster than other genes and, if so, whether this is a result of positive selection, as might be expected in an arms race. By analysis of 12 plant RNA viruses, we show that the relative rate of protein evolution is higher for VSRs than for other genes, but that this is not attributable to pervasive positive selection. We argue that, because evolutionary time scales are extremely different for viruses and eukaryotes, it is improbable that viral adaptation (as measured by the ratio of non-synonymous to synonymous change) will be dominated by one-to-one coevolution with eukaryotes. Instead, for plant virus VSRs, we find strong evidence of episodic selection—diversifying selection that acts on a subset of lineages—which might be attributable to frequent shifts between different host genotypes or species.
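The distinction between non-synonymous and synonymous change that underlies this ratio can be made concrete with a short sketch. It classifies codon differences between two aligned coding sequences using the standard genetic code. It returns raw counts only: a proper dN/dS estimate also normalizes by the numbers of non-synonymous and synonymous sites and corrects for multiple substitutions, which are omitted here. Function and variable names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_differences(seq1, seq2):
    """Count (synonymous, nonsynonymous, multi_hit) codon differences.

    Sequences must be aligned, in frame, and of equal length. Codons that
    differ at more than one position are counted separately as multi_hit,
    because their classification depends on the assumed mutational path.
    """
    syn = nonsyn = multi = 0
    for pos in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[pos:pos + 3], seq2[pos:pos + 3]
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue  # identical codon, or gap/ambiguity code
        n_diff = sum(x != y for x, y in zip(c1, c2))
        if n_diff > 1:
            multi += 1
        elif CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, multi

# Toy example: TTT->TTC keeps phenylalanine; CCC->CAC changes proline to histidine.
counts = classify_differences("ATGTTTCCC", "ATGTTCCAC")
```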
molecular evolution; positive selection; evolutionary arms race; RNA interference; viral suppressor of RNAi; RNA silencing suppressors
To subvert host defenses, some microbial pathogens produce proteins that interact with conserved motifs in variable regions of B-cell antigen receptor shared by large sets of lymphocytes, which define the properties of a superantigen. As the clonal composition of the lymphocyte pool is a major determinant of immune responsiveness, this study was undertaken to examine the in vivo effect on the host immune system of exposure to a B-cell superantigen, protein L (PpL), a product of the common commensal bacterial species, Finegoldia magna, which is one of the most common pathogenic species amongst Gram-positive anaerobic cocci. Libraries of variable kappa (Vκ) light chain transcripts were generated from the spleens of control and PpL-exposed mice, and the expressed Vκ rearrangements were characterized by high-throughput sequencing. A total of 120,855 sequencing reads could be assigned to a germline Vκ gene, with all 20 known Vκ subgroups represented. In control mice, we found a recurrent and consistent hierarchy of Vκ gene usage, as well as patterns of preferential Vκ-Jκ pairing. PpL exposure induced significant targeted global shifts in repertoire with reduction of Vκ that contain the superantigen binding motif in all exposed mice, with significant targeted reductions in the expression of clonotypes encoded by 14 specific Vκ genes with the predicted PpL binding motif. These rigorous surveys document the capacity of a microbial protein to modulate the composition of the expressed lymphocyte repertoire, which also has broad potential implications for host-microbiome and host-pathogen relationships.
High-throughput sequencing; BCR repertoire; Protein L; Immunoglobulin kappa light chain; 454 sequencing
Specific sequence changes of human immunodeficiency virus type 1 (HIV-1) in the presence of specific HLA molecules may alter the composition and processing of viral peptides, leading to immune escape. Persistence of these mutations after transmission may leave the genetic fingerprint of the transmitter's HLA profile. Here, we evaluated the associations between HLA profiles and the phylogenetic relationships of HIV sequences sampled from a cohort of recently infected individuals in San Diego, California.
We identified transmission clusters within the study cohort, using phylogenetic analysis of sampled HIV pol genotypes at a genetic distance of <1.5%. We then evaluated the association of specific HLA alleles, HLA homozygosity, HLA concordance, race and ethnicity, and mutational patterns within the clustering and nonclustering groups.
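Distance-threshold clustering of this kind can be sketched as follows. The example links any two sequences whose pairwise distance falls below 1.5% and reports the connected groups. It uses the simple proportion of differing sites as the distance, which is an assumption: the distance measure used in the study is not stated here, and a model-based distance may give different results. All names are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, ignoring positions with gaps or ambiguity codes."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0  # no comparable sites: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)

def transmission_clusters(seqs, threshold=0.015):
    """Group sequence names into clusters linked by distances below `threshold`.

    Uses union-find, so A and C share a cluster if A links to B and B links to C.
    Singletons (non-clustering individuals) are not returned.
    """
    parent = {name: name for name in seqs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) > 1]

# Toy alignment (hypothetical): p1 and p2 differ at 2 of 200 sites (1%).
base = "ACGT" * 50
toy = {
    "p1": base,
    "p2": "TT" + base[2:],
    "p3": "TTTT" * 5 + base[20:],
}
clusters = transmission_clusters(toy)
```

Because linkage is transitive, a cluster can contain pairs of sequences that are individually further apart than the threshold, which is worth keeping in mind when interpreting cluster membership.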
From 350 cohort participants, we identified 162 clustering individuals and 188 nonclustering individuals. We identified trends for enrichment of particular alleles within individual clusters and evidence of viral escape within those clusters. We also found that discordance of HLA alleles was significantly associated with clustering individuals.
Some transmission clusters demonstrate HLA enrichment, and viruses in these HLA-associated clusters often show evidence of escape to enriched alleles. Interestingly, HLA discordance was associated with clustering in our predominantly MSM population.
Phylogenomics refers to the inference of historical relationships among species using genome-scale sequence data and to the use of phylogenetic analysis to infer protein function in multigene families. With rapidly decreasing sequencing costs, phylogenomics is becoming synonymous with evolutionary analysis of genome-scale and taxonomically densely sampled data sets. In phylogenetic inference applications, this translates into very large data sets that yield evolutionary and functional inferences with extremely small variances and high statistical confidence (P value). However, reports of highly significant P values are increasing even for contrasting phylogenetic hypotheses depending on the evolutionary model and inference method used, making it difficult to establish true relationships. We argue that assessing the robustness of results to biological factors that may systematically mislead (bias) the outcomes of statistical estimation will be key to avoiding incorrect phylogenomic inferences. In fact, there is a need for increased emphasis on the magnitude of differences (effect sizes) in addition to the P values of the statistical test of the null hypothesis. On the other hand, the amount of sequence data available will likely always remain inadequate for some phylogenomic applications, for example, those involving episodic positive selection at individual codon positions and in specific lineages. Again, a focus on effect size and biological relevance, rather than the P value, may be warranted. Here, we present a theoretical overview and discuss practical aspects of the interplay between effect sizes, bias, and P values as it relates to the statistical inference of evolutionary truth in phylogenomics.
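The interplay between effect size and P value can be seen in a small numerical sketch. It holds a modest effect fixed (51% of sites favoring one hypothesis, against a null of 50%) and shows how the P value of a normal-approximation test depends on the amount of data. The scenario, the choice of test and the function names are illustrative assumptions, not taken from the text above.

```python
import math

def effect_size(successes, n, p0=0.5):
    """Difference between the observed proportion and the null proportion."""
    return successes / n - p0

def two_sided_p(successes, n, p0=0.5):
    """Two-sided P value for H0: proportion == p0, using the normal approximation."""
    z = effect_size(successes, n, p0) / math.sqrt(p0 * (1 - p0) / n)
    return math.erfc(abs(z) / math.sqrt(2))

# Same 1% effect at three data set sizes (hypothetical site counts).
for n in (100, 10_000, 1_000_000):
    k = round(0.51 * n)
    print(n, effect_size(k, n), two_sided_p(k, n))
```

The effect size is identical in every row, yet the P value moves from unremarkable to extreme as the number of sites grows. With genome-scale data, even a small systematic bias of this magnitude would therefore be reported with very high apparent confidence.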
molecular evolution; statistical inference; phylogenetics; evolutionary tree; statistical bias; variance
Rate heterogeneity among lineages is a common feature of molecular evolution, and it has long impeded our ability to accurately estimate the age of evolutionary divergence events. The development of relaxed molecular clocks, which model variable substitution rates among lineages, was intended to rectify this problem. Major subtypes of pandemic HIV-1 group M are thought to exemplify closely related lineages with different substitution rates. Here, we report that inferring the time of most recent common ancestor of all these subtypes in a single phylogeny under a single (relaxed) molecular clock produces significantly different dates for many of the subtypes than does analysis of each subtype on its own. We explore various methods to ameliorate this problem. We conclude that current molecular dating methods are inadequate for dealing with this type of substitution rate variation in HIV-1. Through simulation, we show that heterotachy causes root ages to be overestimated.
molecular clock; rate variation; HIV-1
Statistical methods for molecular dating of viral origins have been used extensively to infer the time of most recent common ancestor for many rapidly evolving pathogens. However, there are a number of cases in which epidemiological, historical, or genomic evidence suggests much older viral origins than those obtained via molecular dating. We demonstrate how pervasive purifying selection can mask the ancient origins of recently sampled pathogens, in part due to the inability of nucleotide-based substitution models to properly account for complex patterns of spatial and temporal variability in selective pressures. We use codon-based substitution models to infer the length of branches in viral phylogenies; these models produce estimates that are often considerably longer than those obtained with traditional nucleotide-based substitution models. Correcting the apparent underestimation of branch lengths suggests substantially older origins for measles, Ebola, and avian influenza viruses. This work helps to reconcile some of the inconsistencies between molecular dating and other types of evidence concerning the age of viral lineages.
measles virus; rinderpest virus; Ebola virus; avian influenza virus; molecular clock; substitution rate; codon model; purifying selection
Standard genotypic antiretroviral resistance testing, performed by bulk sequencing, does not readily detect variants that comprise <20% of the circulating HIV-1 RNA population. Nevertheless, it is valuable in selecting an antiretroviral regimen after antiretroviral failure. In patients with poor adherence, resistant variants may not reach this threshold. Therefore, deep sequencing would be potentially valuable for detecting minority resistant variants. We compared bulk sequencing and deep sequencing to detect HIV-1 drug resistance at the time of a second-line protease inhibitor (PI)-based antiretroviral regimen failure. Eligibility criteria were virologic failure (HIV-1 RNA load of >500 copies/ml) of a first-line nonnucleoside reverse transcriptase inhibitor-based regimen, with at least the M184V mutation (lamivudine resistance), and second-line failure of a lopinavir/ritonavir (LPV/r)-based regimen. An amplicon-sequencing approach on the Roche 454 system was used. Six patients with viral loads of >90,000 copies/ml and one patient with a viral load of 520 copies/ml were included. Mutations not detectable by bulk sequencing during first- and second-line failure were detected by deep sequencing during second-line failure. Low-frequency variants (>0.5% of the sequence population) harboring major protease inhibitor resistance mutations were found in 5 of 7 patients despite poor adherence to the LPV/r-based regimen. In patients with intermittent adherence to a boosted PI regimen, deep sequencing may detect minority PI-resistant variants, which likely represent early events in resistance selection. In patients with poor or intermittent adherence, there may be low evolutionary impetus for such variants to reach fixation, explaining the low prevalence of PI resistance.
Adaptive evolution frequently occurs in episodic bursts, localized to a few sites in a gene, and to a small number of lineages in a phylogenetic tree. A popular class of “branch-site” evolutionary models provides a statistical framework to search for evidence of such episodic selection. For computational tractability, current branch-site models unrealistically assume that all branches in the tree can be partitioned a priori into two rigid classes—“foreground” branches that are allowed to undergo diversifying selective bursts and “background” branches that are negatively selected or neutral. We demonstrate that this assumption leads to unacceptably high rates of false positives or false negatives when the evolutionary process along background branches strongly deviates from modeling assumptions. To address this problem, we extend Felsenstein's pruning algorithm to allow efficient likelihood computations for models in which variation over branches (and not just sites) is described in the random effects likelihood framework. This enables us to model the process at every branch-site combination as a mixture of three Markov substitution models—our model treats the selective class of every branch at a particular site as an unobserved state that is chosen independently of that at any other branch. When benchmarked on a previously published set of simulated sequences, our method consistently matched or outperformed existing branch-site tests in terms of power and error rates. Using three empirical data sets, previously analyzed for episodic selection, we discuss how modeling assumptions can influence inference in practical situations.
episodic selection; random effects model; evolutionary model; branch-site model
The imprint of natural selection on protein coding genes is often difficult to identify because selection is frequently transient or episodic, i.e. it affects only a subset of lineages. Existing computational techniques, which are designed to identify sites subject to pervasive selection, may fail to recognize sites where selection is episodic: a large proportion of positively selected sites. We present a mixed effects model of evolution (MEME) that is capable of identifying instances of both episodic and pervasive positive selection at the level of an individual site. Using empirical and simulated data, we demonstrate the superior performance of MEME over older models under a broad range of scenarios. We find that episodic selection is widespread and conclude that the number of sites experiencing positive selection may have been vastly underestimated.
Identifying regions of protein coding genes that have undergone adaptive evolution is important to answering many questions in evolutionary biology and genetics. In order to tease out genetic evidence for natural selection, genes from a diverse array of taxa must be analyzed, only a subset of which may have undergone adaptive evolution; the same gene region may be under stabilizing or relaxed selection in lineages leading to other taxa. Most current computational methods designed to detect the imprint of natural selection at a site in a protein coding gene assume the strength and direction of natural selection is constant across all lineages. Here, we present a method to detect adaptive evolution, even when the selective forces are not constant across taxa. Using a variety of well-characterized genes, we find evidence suggesting that natural selection is generally episodic and that modeling it as such reveals that many more sites are subject to episodic positive selection than previously appreciated.
During the late 1980s and early 1990s, an estimated 10,000 Romanian children were infected with HIV-1 subtype F nosocomially through contaminated needles and blood transfusions. However, the geographic source and origins of this epidemic remain unclear.
Here we used phylogenetic inference and “relaxed” molecular clock dating analysis to further characterize the Romanian HIV-1 subtype F epidemic.
These analyses revealed a major lineage of Romanian HIV sequences consisting nearly entirely of virus sampled from adolescents and children and a distinct cluster that included a much higher ratio of adult sequences. Divergence time estimates inferred the time of most recent common ancestor of subtype F1 sequences to be 1973 (1966–1980) and that of all Angolan sequences to be 1975 (1968–1980). The most recent common ancestor of the Romanian sequences was dated to 1978 (1972–1983) with pediatric and adolescent sequences interspersed throughout the lineage. The phylogenetic structure of the entire subtype F epidemic suggests that multiple introductions of subtype F into Romania occurred either from the Angolan epidemic or from more distant ancestors. Since the historical records note that the Romanian pediatric epidemic did not begin until the late 1980s, the inferred time of most recent common ancestor of the Romanian lineage of 1978 suggests that multiple introductions of subtype F into the pediatric population occurred from HIV already circulating in Romania.
Analysis of the subtype F HIV-1 epidemic in an historical context allows for a deeper appreciation of how the HIV pandemic has been influenced by socio-political events.
Phylogeography; Romania; Subtype F; Socio-political; HIV
Much molecular-evolution research is concerned with sequence analysis. Yet these sequences represent real, three-dimensional molecules with complex structure and function. Here I highlight a growing trend in the field to incorporate molecular structure and function into computational molecular-evolution work. I consider three focus areas: reconstruction and analysis of past evolutionary events, such as phylogenetic inference or methods to infer selection pressures; development of toy models and simulations to identify fundamental principles of molecular evolution; and atom-level, highly realistic computational modeling of molecular structure and function aimed at making predictions about possible future evolutionary events.
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance, and neither a test for episodic diversifying selection nor a test for constant directional selection is able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance.
When exposed to treatment, HIV-1 and other rapidly evolving viruses have the capacity to acquire drug resistance mutations (DRAMs), which limit the efficacy of antivirals. There are a number of experimentally well characterized HIV-1 DRAMs, but many mutations whose roles are not fully understood have also been reported. In this manuscript we construct evolutionary models that identify the locations and targets of mutations conferring resistance to antiretrovirals from viral sequences sampled from treated and untreated individuals. While the evolution of drug resistance is a classic example of natural selection, existing analyses fail to detect the majority of DRAMs. We show that, in order to identify resistance mutations from sequence data, it is necessary to recognize that in this case natural selection is both episodic (it only operates when the virus is exposed to the drugs) and directional (only mutations to a particular amino-acid confer resistance while allowing the virus to continue replicating). The new class of models that allow for the episodic and directional nature of adaptive evolution performs very well at recovering known DRAMs, can be useful at identifying unknown resistance-associated mutations, and is generally applicable to a variety of biological scenarios where similar selective forces are at play.
Reports of a high frequency of the transmission of minority viral populations with drug-resistant mutations (DRM) are inconsistent with evidence that HIV-1 infections usually arise from mono- or oligoclonal transmission. We performed ultradeep sequencing (UDS) of partial HIV-1 gag, pol, and env genes from 32 recently infected individuals. We then evaluated overall and per-site diversity levels, selective pressure, sequence reproducibility, and presence of DRM and accessory mutations (AM). To differentiate biologically meaningful mutations from those caused by methodological errors, we obtained multinomial confidence intervals (CI) for the proportion of DRM at each site and fitted a binomial mixture model to determine background error rates for each sample. We then examined the association between detected minority DRM and the virologic failure of first-line antiretroviral therapy (ART). Similar to other studies, we observed increased detection of DRM at low frequencies (average, 0.56%; 95% CI, 0.43 to 0.69; expected UDS error, 0.21 ± 0.08% mutations/site). For 8 duplicate runs, there was variability in the proportions of minority DRM. There was no indication of increased diversity or selection at DRM sites compared to other sites and no association between minority DRM and AM. There was no correlation between detected minority DRM and clinical failure of first-line ART. It is unlikely that minority viral variants harboring DRM are transmitted and maintained in the recipient host. The majority of low-frequency DRM detected using UDS are likely errors inherent to UDS methodology or a consequence of error-prone HIV-1 replication.
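One way to ask whether a low-frequency mutation is distinguishable from sequencing error is a binomial tail probability. The sketch below computes the probability of seeing at least the observed number of mutant reads at a site if every one of them were an error occurring at a fixed background rate. This is a simplified stand-in: the study fitted a binomial mixture model to estimate per-sample error rates, whereas the sketch takes the error rate as given. The read counts in the example are hypothetical; the 0.21% rate is the expected UDS error quoted above.

```python
import math

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p) + (n - i) * math.log1p(-p)
        for i in range(k, n + 1)
    ]
    m = max(log_terms)
    # Factor out the largest term to avoid underflow when probabilities are tiny.
    return min(1.0, math.exp(m) * sum(math.exp(t - m) for t in log_terms))

ERROR_RATE = 0.0021  # expected UDS error per site, from the text above

# Hypothetical site with 2,000 reads: compare 6 versus 30 mutant reads.
p_few = binom_upper_tail(6, 2000, ERROR_RATE)
p_many = binom_upper_tail(30, 2000, ERROR_RATE)
```

A count close to the expected number of errors (about four reads in this example) gives a large tail probability, so it cannot be separated from background error, while a much larger count gives a very small one. In practice the result is sensitive to the assumed error rate, which varies between samples and sites.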
Effective population screening of HIV and prevention of HIV transmission are only part of the global fight against AIDS. Community-level effects, for example those aimed at thwarting future transmission, are potential outcomes of treatment and may be important in stemming the epidemic. However, current clinical trial designs are incapable of detecting a reduction in future transmission due to treatment. We took advantage of the fact that HIV is an evolving pathogen whose transmission network can be reconstructed using genetic sequence information to address this shortcoming. Here, we use an HIV transmission network inferred from recently infected men who have sex with men (MSM) in San Diego, California. We developed and tested a network-based statistic for measuring treatment effects using simulated clinical trials on our inferred transmission network. We explored the statistical power of this network-based statistic against conventional efficacy measures and find that when future transmission is reduced, the potential for increased statistical power can be realized. Furthermore, our simulations demonstrate that the network statistic is able to detect community-level effects (e.g., reduction in onward transmission) of HIV treatment in a clinical trial setting. This study demonstrates the potential utility of a network-based statistical metric when investigating HIV treatment options as a method to reduce onward transmission in a clinical trial setting.
Summary: Datamonkey is a popular web-based suite of phylogenetic analysis tools for use in evolutionary biology. Since the original release in 2005, we have expanded the analysis options to include recently developed algorithmic methods for recombination detection, evolutionary fingerprinting of genes, codon model selection, co-evolution between sites, identification of sites that rapidly escape host-immune pressure, and HIV-1 subtype assignment. The traditional selection tools have also been augmented to include recent developments in the field. Here, we summarize the analysis options currently available on Datamonkey, and provide guidelines for their use in evolutionary biology.
Availability and documentation: http://www.datamonkey.org
The endangered Przewalski's horse is the closest relative of the domestic horse and is the only true wild horse species surviving today. The question of whether Przewalski's horse is the direct progenitor of domestic horse has been hotly debated. Studies of DNA diversity within Przewalski's horses have been sparse but are urgently needed to ensure their successful reintroduction to the wild. In an attempt to resolve the controversy surrounding the phylogenetic position and genetic diversity of Przewalski's horses, we used massively parallel sequencing technology to decipher the complete mitochondrial and partial nuclear genomes for all four surviving maternal lineages of Przewalski's horses. Unlike single-nucleotide polymorphism (SNP) typing usually affected by ascertainment bias, the present method is expected to be largely unbiased. Three mitochondrial haplotypes were discovered—two similar ones, haplotypes I/II, and one substantially divergent from the other two, haplotype III. Haplotypes I/II versus III did not cluster together on a phylogenetic tree, rejecting the monophyly of Przewalski's horse maternal lineages, and were estimated to split 0.117–0.186 Ma, significantly preceding horse domestication. In the phylogeny based on autosomal sequences, Przewalski's horses formed a monophyletic clade, separate from the Thoroughbred domestic horse lineage. Our results suggest that Przewalski's horses have ancient origins and are not the direct progenitors of domestic horses. The analysis of the vast amount of sequence data presented here suggests that Przewalski's and domestic horse lineages diverged at least 0.117 Ma but since then have retained ancestral genetic polymorphism and/or experienced gene flow.
wild horse; next-generation sequencing; mitochondrial DNA; nuclear DNA; phylogeny
Rapidly evolving viruses such as HIV-1 display extensive sequence variation in response to host-specific selection, while simultaneously maintaining functions that are critical to replication and infectivity. This apparent conflict between diversifying and purifying selection may be resolved by an abundance of epistatic interactions such that the same functional requirements can be met by highly divergent sequences. We investigate this hypothesis by conducting an extensive characterization of sequence variation in the HIV-1 nef gene that encodes a highly variable multifunctional protein. Population-based sequences were obtained from 686 patients enrolled in the HOMER cohort in British Columbia, Canada, from which the distribution of nonsynonymous substitutions in the phylogeny was reconstructed by maximum likelihood. We used a phylogenetic comparative method on these data to identify putative epistatic interactions between residues. Two interactions (Y120/Q125 and N157/S169) were chosen to further investigate within-host evolution using HIV-1 RNA extractions from plasma samples from eight patients. Clonal sequencing confirmed strong linkage between polymorphisms at these sites in every case. We used massively parallel pyrosequencing (MPP) to reconstruct within-host evolution in these patients. Experimental error associated with MPP was quantified by performing replicates at two different stages of the protocol, which were pooled prior to analysis to reduce this source of variation. Phylogenetic reconstruction from these data revealed correlated substitutions at Y120/Q125 or N157/S169 repeated across multiple lineages in every host, indicating convergent within-host evolution shaped by epistatic interactions.
coevolution; epistasis; HIV-1; next-generation sequencing; ancestral reconstruction; sequencing error
Over time, natural selection molds every gene into a unique mosaic of sites evolving rapidly or resisting change, an “evolutionary fingerprint” of the gene. Aspects of this evolutionary fingerprint, such as the site-specific ratio of nonsynonymous to synonymous substitution rates (dN/dS), are commonly used to identify genetic features of potential biological interest; however, no framework exists for comparing evolutionary fingerprints between genes. We hypothesize that protein-coding genes with similar protein structure and/or function tend to have similar evolutionary fingerprints and that comparing evolutionary fingerprints can be useful for discovering similarities between genes in a way that is analogous to, but independent of, discovery of similarity via sequence-based comparison tools such as BLAST.
To test this hypothesis, we develop a novel model of coding sequence evolution that uses a general bivariate discrete parameterization of the evolutionary rates. We show that this approach provides a better fit to the data using a smaller number of parameters than existing models. Next, we use the model to represent evolutionary fingerprints as probability distributions and present a methodology for comparing these distributions in a way that is robust against variations in data set size and divergence. Finally, using sequences of three rapidly evolving RNA viruses (HIV-1, hepatitis C virus, and influenza A virus), we demonstrate that genes within the same functional group tend to have similar evolutionary fingerprints. Our framework provides a sound statistical foundation for efficient inference and comparison of evolutionary rate patterns in arbitrary collections of gene alignments, clustering homologous and nonhomologous genes, and investigation of biological and functional correlates of evolutionary rates.
adaptive evolution; codon models; evolutionary distance; machine classification
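The idea of treating an evolutionary fingerprint as a discrete bivariate distribution over rate classes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `(alpha, beta, weight)` triples (synonymous rate, nonsynonymous rate, proportion of sites), the shared bin `edges`, the toy gene values, and the use of total variation distance as the comparison. It is not the distance measure or inference procedure of the framework described above.

```python
from collections import defaultdict


def bin_fingerprint(classes, edges):
    """Collapse (alpha, beta, weight) rate classes onto a shared 2-D grid.

    alpha = synonymous rate, beta = nonsynonymous rate (illustrative names).
    Weights are renormalized so each fingerprint sums to 1.
    """
    def index(x):
        # Bin index = number of edges that x meets or exceeds.
        return sum(x >= e for e in edges)

    total = sum(w for _, _, w in classes)
    grid = defaultdict(float)
    for alpha, beta, w in classes:
        grid[(index(alpha), index(beta))] += w / total
    return dict(grid)


def total_variation(p, q):
    """Total variation distance between two binned fingerprints (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical fingerprints: genes A and B are mostly conserved,
# gene C has most of its weight in a beta > alpha class.
edges = [0.5, 1.0, 2.0]
gene_a = bin_fingerprint([(1.0, 0.1, 0.7), (1.0, 0.8, 0.2), (0.6, 2.5, 0.1)], edges)
gene_b = bin_fingerprint([(1.1, 0.2, 0.6), (0.9, 0.9, 0.3), (0.7, 3.0, 0.1)], edges)
gene_c = bin_fingerprint([(1.0, 0.1, 0.2), (1.0, 2.5, 0.8)], edges)

d_ab = total_variation(gene_a, gene_b)
d_ac = total_variation(gene_a, gene_c)
print(f"A vs B: {d_ab:.2f}, A vs C: {d_ac:.2f}")
```

Hard binning is sensitive to where the edges fall, so a comparison intended to be robust to data set size and divergence would need something smoother; the sketch only shows the data structure and the notion of a distance between fingerprints.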
Although it is known that most HIV-1 infections worldwide result from exposure to virus in semen, it has not yet been established whether transmitted strains originate as RNA virions in seminal plasma or as integrated proviral DNA in infected seminal leukocytes. We present phylogenetic evidence that among six transmitting pairs of men who have sex with men, blood plasma virus in the recipient is consistently more closely related to the seminal plasma virus in the source. All sequences were subtype B, and the env C2V3 of transmitted variants tended to have higher mean isoelectric points, contain potential N-linked glycosylation sites, and favor CCR5 co-receptor usage. A statistically robust phylogenetically corrected analysis did not detect genetic signatures reliably associated with transmission, but further investigation of larger samples of transmitting pairs holds promise for determining which structural and genetic features of viral genomes are associated with transmission.
Although vaccines offer the best means of preventing influenza infection, strain selection and optimal implementation remain difficult due to antigenic drift and a limited understanding of global spread. Detecting viral movement by sequence analysis is complicated by skewed geographic and seasonal distributions in viral isolates. We propose a probabilistic method that accounts for sampling bias through spatiotemporal clustering and modeling regional and seasonal transmission as a binomial process. Analysis of H3N2 not only confirmed East-Southeast Asia as a source of new seasonal variants, but also increased the resolution of observed transmission to a country level. H1N1 data revealed similar viral spread from the tropics. Network analysis suggested China and Hong Kong as the origins of new seasonal H3N2 strains and the United States as a region where increased vaccination would maximally disrupt global spread of the virus. These techniques provide a promising methodology for the analysis of any seasonal virus, as well as for the continued surveillance of influenza.
As evidenced by several historic vaccine failures, the design and implementation of the influenza vaccine remains an imperfect science. The virus's rapid rate of evolution makes the selection of representative strains for vaccine composition a difficult process. From a global health viewpoint, how to optimally implement a limited stockpile of vaccines is another fundamental question that remains unanswered. An understanding of how influenza spreads around the world would greatly aid the design and implementation process, but regional and seasonal bias in collected virus samples hampers epidemiologic analysis. Here, we show that it is possible to counter this data bias through probabilistic modeling and represent the global viral spread as a network of seeding events between different regions of the world. On a local scale, our technique can output the most likely origins of a virus circulating in a given location. On a global scale, we can pinpoint regions of the world that would maximally disrupt viral transmission with an increase in vaccine implementation. We demonstrate our method on seasonal H3N2 and H1N1 and foresee similar application to other seasonal viruses, including swine-origin H1N1, once more seasonal data is collected.
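To make the notion of a binomial model for seeding events concrete, the following is a minimal sketch under stated assumptions, not the authors' actual procedure. It assumes that, for a given target region and season, `n` isolates were examined and `k` of them were most closely related to isolates from a candidate source region, and that under the null hypothesis each isolate matches that source with probability equal to the source's share of all sampled isolates (`sampled_share`). The region names, counts, and the 0.05 threshold are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def seeding_edges(matches, sampled_share, alpha=0.05):
    """Return (source, target, p_value) for pairs enriched beyond sampling bias.

    matches: {(source, target): (k, n)} with k source-like isolates out of n.
    sampled_share: {source: fraction of all sampled isolates from that source}.
    """
    edges = []
    for (source, target), (k, n) in matches.items():
        p_value = binomial_tail(k, n, sampled_share[source])
        if p_value < alpha:
            edges.append((source, target, p_value))
    return edges


# Hypothetical counts: RegionA accounts for 9 of 20 isolates seen in RegionB,
# although it contributes only 20% of all sampled isolates.
matches = {("RegionA", "RegionB"): (9, 20), ("RegionC", "RegionB"): (3, 20)}
sampled_share = {"RegionA": 0.2, "RegionC": 0.2}

edges = seeding_edges(matches, sampled_share)
for source, target, p_value in edges:
    print(f"{source} -> {target}: p = {p_value:.4f}")
```

The retained pairs can be read as directed edges of a seeding network. A real analysis would also need a correction for multiple comparisons across region pairs, which this sketch omits.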
Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of rate classes, where the number of classes is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths, are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene- and organism-specific multi-rate models that incorporate amino acid substitution biases are preferred. We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.
Evolution in protein-coding DNA sequences can be modeled at three levels: nucleotides, amino acids or codons that encode the amino acids. Codon models incorporate nucleotide and amino acid information, and allow the estimation of the rate at which amino acids are replaced (dN) versus the rate at which they are preserved (dS). The ratio dN/dS has been used in thousands of studies to detect molecular footprints of natural selection. A serious limitation of most codon models is the unrealistic assumption that all non-synonymous substitutions occur at the same rate. Indeed, amino acid models have consistently demonstrated that different residues are exchanged more or less frequently, depending on incompletely understood factors. We derive and validate a computational approach for inferring codon models which combine the power to investigate natural selection with data-driven amino acid substitution biases from alignments. The addition of amino acid properties can lead to more powerful and accurate methods for studying natural selection and the evolutionary history of protein-coding sequences. The pattern of amino acid substitutions specific to a given alignment can be used to compare and contrast the evolutionary properties of different genes, providing an evolutionary analog to protein family comparisons.
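As a rough illustration of the single-rate view that the multi-rate models relax, the sketch below computes a crude counting proxy for the nonsynonymous to synonymous ratio between two aligned coding sequences. It is a simplified Nei-Gojobori-style count, not a codon model fitted by maximum likelihood: it uses the standard genetic code, ignores codons that differ at more than one position when counting differences, applies no multiple-hit correction, and lumps every amino acid replacement into one nonsynonymous class. The two short sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                sites += 1 / 3  # each of the 3 possible changes counts 1/3
    return sites


def pn_ps(seq1, seq2):
    """Return (pN, pS): proportions of nonsynonymous and synonymous differences."""
    n_codons = min(len(seq1), len(seq2)) // 3
    syn_diff = nonsyn_diff = 0
    syn_sites = 0.0
    for i in range(0, 3 * n_codons, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        # Simplification: only codons with exactly one change are classified.
        if sum(a != b for a, b in zip(c1, c2)) == 1:
            if CODE[c1] == CODE[c2]:
                syn_diff += 1
            else:
                nonsyn_diff += 1
    nonsyn_sites = 3 * n_codons - syn_sites
    return nonsyn_diff / nonsyn_sites, syn_diff / syn_sites


# Hypothetical pair: GCT->GCC is synonymous (Ala), AAA->AGA is not (Lys->Arg).
p_n, p_s = pn_ps("ATGGCTAAAGGT", "ATGGCCAGAGGT")
ratio = p_n / p_s
print(f"pN = {p_n:.3f}, pS = {p_s:.3f}, pN/pS = {ratio:.3f}")
```

A ratio below 1 is conventionally read as purifying selection and a ratio above 1 as diversifying selection. Because the count treats a conservative replacement such as Lys to Arg the same as any other, it cannot capture the amino acid substitution biases that motivate the multi-rate approach.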
Noroviruses are the most common cause of viral gastroenteritis. An increase in the number of globally reported norovirus outbreaks was seen in the past decade, especially for outbreaks caused by successive genogroup II genotype 4 (GII.4) variants. Whether this observed increase was due to an upswing in the number of infections, or to a surveillance artifact caused by heightened awareness and concomitant improved reporting, remained unclear. Therefore, we set out to study the population structure and changes thereof of GII.4 strains detected through systematic outbreak surveillance since the early 1990s. We collected 1383 partial polymerase and 194 full capsid GII.4 sequences. A Bayesian MCMC coalescent analysis revealed an increase in the number of GII.4 infections during the last decade. The GII.4 strains included in our analyses evolved at a rate of 4.3–9.0 × 10⁻³ mutations per site per year, and share a most recent common ancestor in the early 1980s. Determinants of adaptation in the capsid protein were studied using different maximum likelihood approaches to identify sites subject to diversifying or directional selection and sites that co-evolved. While a number of the computationally determined adaptively evolving sites were on the surface of the capsid and possibly subject to immune selection, we also detected sites that were subject to constrained or compensatory evolution due to secondary RNA structures relevant to virus replication. We highlight codons that may prove useful in identifying emerging novel variants, and, using these, indicate that the novel 2008 variant is more likely to cause a future epidemic than the 2007 variant. While norovirus infections are generally mild and self-limiting, more severe outcomes of infection frequently occur in elderly and immunocompromised people, and no treatment is available. The observed pattern of continually emerging novel variants of GII.4, causing elevated numbers of infections, is therefore a cause for concern.
Noroviruses, known as the viruses that cause the ‘stomach flu’ or as the ‘cruise ship virus’, cause sporadic cases and large outbreaks of gastrointestinal illness in humans. An increase in norovirus outbreaks was reported globally around 2002. Doubts remained as to whether this increase was real, or caused by improved detection techniques and increased awareness. This study was performed to address this ambiguity, and to determine the possible virological causes for such changes. Using a population genetic approach, we studied sequences of epidemic norovirus strains collected through time and demonstrated that epidemic dynamics were indeed expanding. Global epidemics were caused by successive variants of norovirus, observed in 2002, 2004 and 2006 and at a smaller scale in 1996, whereas no evidence was found for such epidemic evolutionary patterns prior to these peaks. Based on the sequences analyzed, the strains of the genotype under study here were shown to have circulated at least since the early 1980s, and likely earlier. We showed that selective pressure acted not only on surface-exposed sites on the outside of the virus shell, which are involved in avoiding host immune responses, but also on codons that are apparently conserved for the purpose of virus replication.