Emerging and re-emerging infectious diseases of zoonotic origin like highly pathogenic avian influenza pose a significant threat to human and animal health due to their elevated transmissibility. Identifying the drivers of such viruses is challenging and complicates the estimation of spatial diffusion because the variability of viral spread from locations could be caused by a complex array of unknown factors. Several techniques exist to help identify these drivers including bioinformatics, phylogeography, and spatial epidemiology, but these methods are generally evaluated separately and their complementary nature is not considered. Here we studied an approach that integrates these techniques and identifies the most important drivers of viral spread by focusing on H5N1 in Egypt because of its recent emergence as an epicenter for the disease. We used a Bayesian phylogeographic generalized linear model (GLM) to reconstruct spatiotemporal patterns of viral diffusion while simultaneously assessing the impact of factors contributing to transmission. We also calculated the cross-species transmission rates among hosts in order to identify the species driving transmission. Densities of both human and avian species were supported contributors along with latitude, longitude, elevation, and several meteorological variables. Also supported was the presence of a genetic motif found near the hemagglutinin cleavage site. Various genetic, geographic, demographic, and environmental predictors each play a role in H5N1 diffusion. Further development and expansion of phylogeographic GLMs such as this will enable health agencies to identify variables that can curb virus diffusion and reduce morbidity and mortality.
Survival analysis endures as an old, yet active research field with applications that spread across many domains. Continuing improvements in data acquisition techniques pose constant challenges in applying existing survival analysis methods to these emerging data sets. In this paper, we present tools for fitting regularized Cox survival analysis models on high-dimensional, massive sample-size (HDMSS) data using a variant of the cyclic coordinate descent optimization technique tailored for the sparsity that HDMSS data often present. Experiments on two real data examples demonstrate that efficient analyses of HDMSS data using these tools result in improved predictive performance and calibration.
Big data; Cox proportional hazards; Regularized regression; Survival analysis
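A minimal sketch of the cyclic coordinate descent idea named in the abstract above. To keep each coordinate update in closed form, it is applied here to an L1-penalized least-squares objective rather than to the Cox partial likelihood, so it illustrates the optimization pattern and not the paper's actual algorithm. The function names, the objective scaling and the dense list-of-lists storage are assumptions made for illustration.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator used by the L1 (lasso) coordinate update."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_ccd(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cycling over coordinates.

    X is a list of n rows with p entries each (hypothetical dense storage).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date incrementally
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (coordinate j excluded).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, n * lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                # Only rows with a nonzero entry in column j change the residual,
                # which is where sparsity in the design can save work.
                for i in range(n):
                    if X[i][j] != 0.0:
                        resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 0.2, 1.1]
    print(lasso_ccd(X, y, lam=0.1))
```

For a Cox model the same loop structure would apply, but each coordinate update would need a step derived from the partial likelihood (for example a one-dimensional Newton step) instead of the closed-form least-squares update used here.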
Birth-death processes (BDPs) are continuous-time Markov chains that track the number of “particles” in a system over time. While widely used in population biology, genetics and ecology, statistical inference of the instantaneous particle birth and death rates remains largely limited to restrictive linear BDPs in which per-particle birth and death rates are constant. Researchers often observe the number of particles at discrete times, necessitating data augmentation procedures such as expectation-maximization (EM) to find maximum likelihood estimates. For BDPs on finite state-spaces, there are powerful matrix methods for computing the conditional expectations needed for the E-step of the EM algorithm. For BDPs on infinite state-spaces, closed-form solutions for the E-step are available for some linear models, but most previous work has resorted to time-consuming simulation. Remarkably, we show that the E-step conditional expectations can be expressed as convolutions of computable transition probabilities for any general BDP with arbitrary rates. This important observation, along with a convenient continued fraction representation of the Laplace transforms of the transition probabilities, allows for novel and efficient computation of the conditional expectations for all BDPs, eliminating the need for truncation of the state-space or costly simulation. We use this insight to derive EM algorithms that yield maximum likelihood estimation for general BDPs characterized by various rate models, including generalized linear models. We show that our Laplace convolution technique outperforms competing methods when they are available and demonstrate a technique to accelerate EM algorithm convergence. We validate our approach using synthetic data and then apply our methods to cancer cell growth and estimation of mutation parameters in microsatellite evolution.
Birth-death process; EM algorithm; MM algorithm; maximum likelihood estimation; continuous-time Markov chain; microsatellite evolution
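To make the definition of a general birth-death process concrete, the sketch below simulates one trajectory with arbitrary state-dependent rates using the standard Gillespie scheme. It only illustrates the process itself; it is not the Laplace convolution technique or the EM algorithm described in the abstract. The function name and the representation of the rates as Python callables are assumptions.

```python
import random


def simulate_bdp(x0, birth_rate, death_rate, t_end, rng):
    """Simulate a general birth-death process up to time t_end.

    birth_rate(n) and death_rate(n) give the total rates when n particles exist.
    Returns the jump history as a list of (time, particle_count) pairs.
    """
    t, x = 0.0, x0
    path = [(0.0, x0)]
    while True:
        lam = birth_rate(x)
        mu = death_rate(x) if x > 0 else 0.0  # no deaths from an empty system
        total = lam + mu
        if total <= 0.0:
            break  # absorbing state: nothing more can happen
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t > t_end:
            break
        x += 1 if rng.random() < lam / total else -1
        path.append((t, x))
    return path


if __name__ == "__main__":
    # Linear BDP as a special case: per-particle birth rate 0.5, death rate 0.3.
    rng = random.Random(2024)
    history = simulate_bdp(10, lambda n: 0.5 * n, lambda n: 0.3 * n, 5.0, rng)
    print(len(history), history[-1])
```

In the discretely observed setting the abstract describes, only the particle counts at the observation times would be available, and the unobserved jumps between them are what the E-step has to average over.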
Reconstructing the biogeographic history of groups present in continuous arid landscapes is challenging due to the difficulties in defining discrete areas for analyses, and even more so when species largely overlap both in terms of geography and habitat preference. In this study, we use a novel approach to estimate ancestral areas for the small plant genus Centipeda. We apply continuous diffusion of geography by a relaxed random walk where each species is sampled from its extant distribution on an empirical distribution of time-calibrated species-trees. Using a distribution of previously published substitution rates of the internal transcribed spacer (ITS) for Asteraceae, we show how the evolution of Centipeda correlates with the temporal increase of aridity in the arid zone since the Pliocene. Geographic estimates of ancestral species show a consistent pattern of speciation of early lineages in the Lake Eyre region, with a division into more northerly and southerly groups since ∼840 ka. Summarizing the geographic slices of species-trees at the time of the latest speciation event (∼20 ka) indicates no presence of the genus in Australia west of the combined desert belt of the Nullarbor Plain, the Great Victoria Desert, the Gibson Desert, and the Great Sandy Desert, or beyond the main continental shelf of Australia. The result indicates that all western occurrences of the genus are a result of recent dispersal rather than ancient vicariance. This study contributes to our understanding of the spatiotemporal processes shaping the flora of the arid zone, and offers a significant improvement in inference of ancestral areas for any organismal group distributed where it remains difficult to describe geography in terms of discrete areas.
Australia; BEAST; biogeography; Centipeda; continuous diffusion; Pliocene; species-tree
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units, relatively inexpensive highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety.
Clinical studies that use observational databases to evaluate the effects of medical products have become commonplace. Such studies begin by selecting a particular database, a decision that published papers invariably report but do not discuss. Studies of the same issue in different databases, however, can and do generate different results, sometimes with strikingly different clinical implications. In this paper, we systematically study heterogeneity among databases, holding other study methods constant, by exploring relative risk estimates for 53 drug-outcome pairs and 2 widely used study designs (cohort studies and self-controlled case series) across 10 observational databases. When holding the study design constant, our analysis shows that estimated relative risks range from a statistically significant decreased risk to a statistically significant increased risk in 11 of 53 (21%) of drug-outcome pairs that use a cohort design and 19 of 53 (36%) of drug-outcome pairs that use a self-controlled case series design. This exceeds the proportion of pairs that were consistent across databases in both direction and statistical significance, which was 9 of 53 (17%) for cohort studies and 5 of 53 (9%) for self-controlled case series. Our findings show that clinical studies that use observational databases can be sensitive to the choice of database. More attention is needed to consider how the choice of data source may be affecting results.
database; heterogeneity; methods; population characteristics; reproducibility of results; surveillance
There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes approximately 6kya. An alternative hypothesis claims the languages spread from Anatolia with the expansion of farming 8–9.5kya. Here we use Bayesian phylogeographic approaches together with basic vocabulary data from 103 ancient and contemporary Indo-European languages to explicitly model the expansion of the family and test between the homeland hypotheses. We find decisive support for an Anatolian over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning in the 9th millennium BP. These results highlight the critical role phylogeographic inference can play in resolving longstanding debates about human prehistory.
Despite decades of research, the roles of climate and humans in driving the dramatic extinctions of large-bodied mammals during the Late Quaternary remain contentious. We use ancient DNA, species distribution models and the human fossil record to elucidate how climate and humans shaped the demographic history of woolly rhinoceros, woolly mammoth, wild horse, reindeer, bison and musk ox. We show that climate has been a major driver of population change over the past 50,000 years. However, each species responds differently to the effects of climatic shifts, habitat redistribution and human encroachment. Although climate change alone can explain the extinction of some species, such as Eurasian musk ox and woolly rhinoceros, a combination of climatic and anthropogenic effects appears to be responsible for the extinction of others, including Eurasian steppe bison and wild horse. We find no genetic signature or any distinctive range dynamics distinguishing extinct from surviving species, underscoring the challenges associated with predicting future responses of extant mammals to climate and human-mediated habitat change.
The investigation of infectious disease outbreaks relies on the analysis of increasingly complex and diverse data, which offer new prospects for gaining insights into disease transmission processes and informing public health policies. However, the potential of such data can only be harnessed using a number of different, complementary approaches and tools, and a unified platform for the analysis of disease outbreaks is still lacking. In this paper, we present the new R package OutbreakTools, which aims to provide a basis for outbreak data management and analysis in R. OutbreakTools is developed by a community of epidemiologists, statisticians, modellers and bioinformaticians, and implements classes and methods for storing, handling and visualizing outbreak data. It includes real and simulated outbreak datasets. Together with a number of tools for infectious disease epidemiology recently made available in R, OutbreakTools contributes to the emergence of a new, free and open-source platform for the analysis of disease outbreaks.
Software; Free; Bioinformatics; Epidemiology; R; Epidemics; Public health; Infectious disease
Simulated nucleotide or amino acid sequences are frequently used to assess the performance of phylogenetic reconstruction methods. BEAST, a Bayesian statistical framework that focuses on reconstructing time-calibrated molecular evolutionary processes, supports a wide array of evolutionary models, but lacked matching machinery for simulation of character evolution along phylogenies.
We present a flexible Monte Carlo simulation tool, called πBUSS, that employs the BEAGLE high performance library for phylogenetic computations to rapidly generate large sequence alignments under complex evolutionary models. πBUSS sports a user-friendly graphical user interface (GUI) that allows combining a rich array of models across an arbitrary number of partitions. A command-line interface mirrors the options available through the GUI and facilitates scripting in large-scale simulation studies. πBUSS may serve as an easy-to-use, standard sequence simulation tool, but the available models and data types are particularly useful to assess the performance of complex BEAST inferences. The connection with BEAST is further strengthened through the use of a common extensible markup language (XML), which also allows more advanced evolutionary models to be specified. To support simulation under the latter, as well as to support simulation and analysis in a single run, we also add the πBUSS core simulation routine to the list of BEAST XML parsers.
πBUSS offers a unique combination of flexibility and ease-of-use for sequence simulation under realistic evolutionary scenarios. Through different interfaces, πBUSS supports simulation studies ranging from modest endeavors for illustrative purposes to complex and large-scale assessments of evolutionary inference procedures. Applications are not restricted to the BEAST framework, or even time-measured evolutionary histories, and πBUSS can be connected to various other programs using standard input and output format.
Simulation; Monte Carlo; Phylogenetics; BEAST; BEAGLE; Evolution
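The kind of Monte Carlo sequence simulation described above can be sketched in a few lines for the simplest case. The example below evolves a root sequence down a small tree under the Jukes-Cantor (JC69) model, with branch lengths in expected substitutions per site. It does not use or reproduce the πBUSS, BEAST or BEAGLE interfaces; the nested-tuple tree format and the function names are hypothetical.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def evolve_jc69(parent_seq, branch_length, rng):
    """Evolve a sequence along one branch under JC69.

    branch_length is measured in expected substitutions per site.
    """
    # Probability that a site shows the same base at both ends of the branch.
    p_same = 0.25 + 0.75 * math.exp(-4.0 * branch_length / 3.0)
    child = []
    for base in parent_seq:
        if rng.random() < p_same:
            child.append(base)
        else:
            # Under JC69 the three other bases are equally likely.
            child.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
    return "".join(child)


def simulate_alignment(node, parent_seq, rng, out=None):
    """Recursively simulate tip sequences.

    node is a (name, branch_length, children) tuple; tips have no children.
    Returns a dict mapping tip names to simulated sequences.
    """
    if out is None:
        out = {}
    name, length, children = node
    seq = evolve_jc69(parent_seq, length, rng)
    if not children:
        out[name] = seq
    for child in children:
        simulate_alignment(child, seq, rng, out)
    return out


if __name__ == "__main__":
    rng = random.Random(7)
    root = "".join(rng.choice(NUCLEOTIDES) for _ in range(60))
    tree = ("root", 0.0, [("A", 0.1, []),
                          ("inner", 0.05, [("B", 0.1, []), ("C", 0.2, [])])])
    for tip, seq in simulate_alignment(tree, root, rng).items():
        print(tip, seq)
```

Richer models (codon models, rate heterogeneity, multiple partitions) change only the per-branch transition step; the traversal from root to tips stays the same.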
The branching structure of biological evolution confers statistical dependencies on phenotypic trait values in related organisms. For this reason, comparative macroevolutionary studies usually begin with an inferred phylogeny that describes the evolutionary relationships of the organisms of interest. The probability of the observed trait data can be computed by assuming a model for trait evolution, such as Brownian motion, over the branches of this fixed tree. However, the phylogenetic tree itself contributes statistical uncertainty to estimates of rates of phenotypic evolution, and many comparative evolutionary biologists regard the tree as a nuisance parameter. In this article, we present a framework for analytically integrating over unknown phylogenetic trees in comparative evolutionary studies by assuming that the tree arises from a continuous-time Markov branching model called the Yule process. To do this, we derive a closed-form expression for the distribution of phylogenetic diversity (PD), which is the sum of branch lengths connecting the species in a clade. We then present a generalization of PD which is equivalent to the expected trait disparity in a set of taxa whose evolutionary relationships are generated by a Yule process and whose traits evolve by Brownian motion. We find expressions for the distribution of expected trait disparity under a Yule tree. Given one or more observations of trait disparity in a clade, we perform fast likelihood-based estimation of the Brownian variance for unresolved clades. Our method does not require simulation or a fixed phylogenetic tree. We conclude with a brief example illustrating Brownian rate estimation for 12 families in the mammalian order Carnivora, in which the phylogenetic tree for each family is unresolved. [Brownian motion; comparative method; Markov reward process; phylogenetic diversity; pure-birth process; quantitative trait evolution; trait disparity; Yule process.]
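As a small numerical companion to the definition of phylogenetic diversity (PD) above, the sketch below simulates PD as the total branch length of a Yule tree and compares the Monte Carlo mean with a simple expectation. It assumes a tree grown from a single lineage for a fixed time, in which case the expected lineage count at time s is exp(λs) and the expected total branch length is (exp(λt) − 1)/λ. The paper's own setup and conditioning may differ, and its closed-form distribution is not reproduced here. Function names are hypothetical.

```python
import math
import random


def simulate_yule_pd(birth_rate, t_end, rng):
    """Total branch length of a Yule (pure-birth) tree grown from one lineage."""
    t, n, pd = 0.0, 1, 0.0
    while birth_rate > 0.0:
        wait = rng.expovariate(n * birth_rate)  # time to the next split among n lineages
        if t + wait >= t_end:
            break
        pd += n * wait  # every one of the n lineages grows by `wait`
        t += wait
        n += 1
    return pd + n * (t_end - t)  # remaining growth up to the observation time


def expected_yule_pd(birth_rate, t_end):
    """Integral of exp(birth_rate * s) over [0, t_end]."""
    if birth_rate == 0.0:
        return t_end
    return math.expm1(birth_rate * t_end) / birth_rate


if __name__ == "__main__":
    rng = random.Random(11)
    draws = [simulate_yule_pd(1.0, 1.0, rng) for _ in range(20000)]
    print(sum(draws) / len(draws), expected_yule_pd(1.0, 1.0))
```

Simulation of this kind is what an analytic expression for the PD distribution makes unnecessary; here it serves only as a sanity check on the mean.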
Dengue virus and its four serotypes (DENV-1 to DENV-4) infect 390 million people and are implicated in at least 25,000 deaths annually, with the largest disease burden in tropical and subtropical regions. We investigated the spatial dynamics of DENV-1, DENV-2 and DENV-3 in Brazil by applying a statistical framework to complete genome sequences. For all three serotypes, we estimated that the introduction of new lineages occurred within 7 to 10-year intervals. New lineages were most likely to be imported from the Caribbean region to the North and Northeast regions of Brazil, and then to disperse at a rate of approximately 0.5 km/day. Joint statistical analysis of evolutionary, epidemiological and ecological data indicates that aerial transportation of humans and/or vector mosquitoes, rather than Aedes aegypti infestation rates or geographical distances, determines dengue virus spread in Brazil.
Dengue virus serotypes are associated with millions of infections and thousands of deaths globally each year, primarily in tropical and subtropical regions. We investigated the spatial dynamics of DENV (serotypes 1–3) in Brazil by applying a statistical framework to complete genome sequences. Co-circulation of distinct genotypes, lineage extinction and replacement and multiple viral introduction events were found for all three serotypes. New lineages were typically introduced from the Caribbean into Northern Brazil and dispersed thereafter at a rate of ≈0.5 km/day. Our analysis indicates that aerial transportation is a more important determinant of viral dispersal than Aedes aegypti infestation rates or geographical distance.
We present a new open source, extensible and flexible software platform for Bayesian evolutionary analysis called BEAST 2. This software platform is a re-design of the popular BEAST 1 platform to correct structural deficiencies that became evident as the BEAST 1 software evolved. Key among those deficiencies was the lack of post-deployment extensibility. BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform. This package architecture is showcased with a number of recently published new models encompassing birth-death-sampling tree priors, phylodynamics and model averaging for substitution models and site partitioning. A second major improvement is the ability to read/write the entire state of the MCMC chain to/from disk, allowing it to be easily shared between multiple instances of the BEAST software. This facilitates checkpointing and better support for multi-processor and high-end computing extensions. Finally, the functionality in new packages can be easily added to the user interface (BEAUti 2) by a simple XML template-based mechanism because BEAST 2 has been re-designed to provide greater integration between the analysis engine and the user interface so that, for example, BEAST and BEAUti use exactly the same XML file format.
Bioinformatics and phylogeography models use viral sequence data to analyze spread of epidemics and pandemics. However, few of these models have included analytical methods for testing whether certain predictors such as population density, rates of disease migration, and climate are drivers of spatial spread. Understanding the specific factors that drive spatial diffusion of viruses is critical for targeting public health interventions and curbing spread. In this paper we describe the application and evaluation of a model that integrates demographic and environmental predictors with molecular sequence data. The approach parameterizes evolutionary spread of RNA viruses as a generalized linear model (GLM) within a Bayesian inference framework using Markov chain Monte Carlo (MCMC). We evaluate this approach by reconstructing the spread of H5N1 in Egypt while assessing the impact of individual predictors on evolutionary diffusion of the virus.
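A minimal sketch of how a GLM can parameterize the rates of a discrete phylogeographic diffusion process. It assumes the commonly used log-linear form, in which the log of the rate from location i to location j is a sum over predictors of coefficient × inclusion indicator × predictor value. The abstract does not spell out this formula, so the exact form, the function name and the toy predictor are assumptions. In a full analysis the coefficients and indicators would be sampled by MCMC rather than fixed.

```python
import math


def glm_rate_matrix(predictors, coefficients, indicators):
    """Build a CTMC rate matrix between locations from pairwise predictors.

    predictors: list of K square matrices; predictors[k][i][j] is the value of
        predictor k for the pair (i, j), assumed already transformed and scaled.
    coefficients: K effect sizes (beta).
    indicators: K values in {0, 1}; a 0 switches the predictor off.
    """
    n = len(predictors[0])
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            log_rate = sum(b * d * x[i][j]
                           for b, d, x in zip(coefficients, indicators, predictors))
            Q[i][j] = math.exp(log_rate)
        Q[i][i] = -sum(Q[i])  # diagonal is still 0.0 here, so rows sum to zero
    return Q


if __name__ == "__main__":
    # Toy log-distance predictor between three locations (illustrative numbers).
    log_distance = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
    for row in glm_rate_matrix([log_distance], [-0.5], [1]):
        print(row)
```

With every indicator set to 0 all rates are equal, which corresponds to a model in which no predictor explains the diffusion; the posterior frequency with which an indicator equals 1 is what measures support for a predictor.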
Transmission lies at the interface of human immunodeficiency virus type 1 (HIV-1) evolution within and among hosts and separates distinct selective pressures that impose differences in both the mode of diversification and the tempo of evolution. In the absence of comprehensive direct comparative analyses of the evolutionary processes at different biological scales, our understanding of how fast within-host HIV-1 evolutionary rates translate to lower rates at the between-host level remains incomplete. Here, we address this by analyzing pol and env data from a large HIV-1 subtype C transmission chain for which both the timing and the direction are known for most transmission events. To this end, we develop a new transmission model in a Bayesian genealogical inference framework and demonstrate how to constrain the viral evolutionary history to be compatible with the transmission history while simultaneously inferring the within-host evolutionary and population dynamics. We show that accommodating a transmission bottleneck affords the best fit to our data, but the sparse within-host HIV-1 sampling prevents accurate quantification of the concomitant loss in genetic diversity. We draw inference under the transmission model to estimate HIV-1 evolutionary rates among epidemiologically-related patients and demonstrate that they lie in between fast intra-host rates and lower rates among epidemiologically unrelated individuals infected with HIV subtype C. Using a new molecular clock approach, we quantify and find support for a lower evolutionary rate along branches that accommodate a transmission event or branches that represent the entire backbone of transmitted lineages in our transmission history. Finally, we recover the rate differences at the different biological scales for both synonymous and non-synonymous substitution rates, which is only compatible with the ‘store and retrieve’ hypothesis positing that viruses stored early in latently infected cells preferentially transmit or establish new infections upon reactivation.
Since its discovery three decades ago, the HIV epidemic has unfolded into one of the most devastating pandemics in human history. When HIV replication cannot be completely inhibited, the fast-evolving retrovirus continuously evades intra-host immune and drug selective pressure, but diversifies according to more neutral epidemiological dynamics at the interhost level. Limited evidence suggests that the virus may evolve faster in a single host than in a population of hosts, and various hypotheses have been put forward to explain this phenomenon. Here, we develop a new computational approach aimed at integrating host transmission information with pathogen genealogical reconstructions. We apply this approach to comprehensive sequence data sets sampled from a large HIV-1 subtype C transmission chain, and in addition to providing several insights into the reconstruction of HIV-1 transmission histories and their associated population dynamics, we find that transmission decreases the HIV-1 evolutionary rate. The fact that we also identify this decline for substitutions that do not alter amino acids provides evidence against hypotheses that invoke selection forces. Instead, our findings support earlier reports that new infections start preferentially with less evolved variants, which may be stored in latently infected cells, and this may vary among different HIV-1 subtypes.
The factors that determine the origin and fate of cross-species transmission events remain unclear for the majority of human pathogens, despite being central for the development of predictive models and assessing the efficacy of prevention strategies. Here, we describe a flexible Bayesian statistical framework to reconstruct virus transmission between different host species based on viral gene sequences, while simultaneously testing and estimating the contribution of several potential predictors of cross-species transmission. Specifically, we use a generalized linear model extension of phylogenetic diffusion to perform Bayesian model averaging over candidate predictors. By further extending this model with branch partitioning, we allow for distinct host transition processes on external and internal branches, thus discriminating between recent cross-species transmissions, many of which are likely to result in dead-end infections, and host shifts that reflect successful onwards transmission in the new host species. Our approach corroborates genetic distance between hosts as a key determinant of both host shifts and cross-species transmissions of rabies virus in North American bats. Furthermore, our results indicate that geographical range overlap is a modest predictor for cross-species transmission, but not for host shifts. Although our evolutionary framework focused on the multi-host reservoir dynamics of bat rabies virus, it is applicable to other pathogens and to other discrete state transition processes.
Bayesian diffusion models; branch partitioning; cross-species transmission; rabies virus
Bayesian phylogeographic methods simultaneously integrate geographical and evolutionary modelling, and have demonstrated value in assessing spatial spread patterns of measurably evolving organisms. We improve on existing phylogeographic methods by combining information from multiple phylogeographic datasets in a hierarchical setting. Consider N exchangeable datasets or strata consisting of viral sequences and locations, each evolving along its own phylogenetic tree and according to a conditionally independent geographical process. At the hierarchical level, a random graph summarizes the overall dispersion process by informing which migration rates between sampling locations are likely to be relevant in the strata. This approach provides an efficient and improved framework for analysing inherently hierarchical datasets. We first examine the evolutionary history of multiple serotypes of dengue virus in the Americas to showcase our method. Additionally, we explore an application to intrahost HIV evolution across multiple patients.
Bayesian statistics; phylodynamics; phylogenetics; random graphs; HIV; dengue
Molecular phylogenetic and phylogeographic reconstructions generally assume time-homogeneous substitution processes. Motivated by computational convenience, this assumption sacrifices biological realism and offers little opportunity to uncover the temporal dynamics in evolutionary histories. Here, we propose an evolutionary approach that explicitly relaxes the time-homogeneity assumption by allowing the specification of different infinitesimal substitution rate matrices across different time intervals, called epochs, along the evolutionary history. We focus on an epoch model implementation in a Bayesian inference framework that offers great modeling flexibility in drawing inference about any discrete data type characterized as a continuous-time Markov chain, including phylogeographic traits. To alleviate the computational burden that the additional temporal heterogeneity imposes, we adopt a massively parallel approach that achieves both fine- and coarse-grain parallelization of the computations across branches that accommodate epoch transitions, making extensive use of graphics processing units. Through synthetic examples, we assess model performance in recovering evolutionary parameters from data generated according to different evolutionary scenarios that comprise different numbers of epochs for both nucleotide and codon substitution processes. We illustrate the usefulness of our inference framework in two different applications to empirical data sets: the selection dynamics on within-host HIV populations throughout infection and the seasonality of global influenza circulation. In both cases, our epoch model captures key features of temporal heterogeneity that remained difficult to test using ad hoc procedures. [Bayesian inference; BEAGLE; BEAST; Epoch Model; phylogeography; Phylogenetics.]
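The core computation behind an epoch model can be sketched as follows. When a branch crosses one or more epoch boundaries, its transition probability matrix is the ordered product of the matrix exponentials of each epoch's rate matrix over the time spent in that epoch (the Chapman-Kolmogorov property). The code below is a plain-Python illustration under that assumption, with a simple scaled Taylor series for the matrix exponential. It is not the BEAST/BEAGLE implementation and makes no attempt at the GPU parallelization described above; all names are hypothetical.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_exp(Q, t, terms=20, squarings=10):
    """exp(Q * t) via a Taylor series on a scaled matrix, then repeated squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[q * scale for q in row] for row in Q]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, A)]  # A^k / k!
        result = [[r + v for r, v in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)  # undo the scaling
    return result


def epoch_branch_probs(segments):
    """Transition probabilities along a branch that spans several epochs.

    segments: list of (rate_matrix, duration) pairs ordered from the parent
    end of the branch to the child end.
    """
    P = identity(len(segments[0][0]))
    for Q, duration in segments:
        P = mat_mul(P, mat_exp(Q, duration))
    return P


if __name__ == "__main__":
    Q_early = [[-1.0, 1.0], [1.0, -1.0]]
    Q_late = [[-2.0, 2.0], [0.5, -0.5]]
    for row in epoch_branch_probs([(Q_early, 0.3), (Q_late, 0.7)]):
        print(row)
```

Because rate matrices from different epochs generally do not commute, the order of the product matters, and each branch that crosses a boundary needs its own sequence of exponentials; this is the extra cost that motivates parallel evaluation.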
Effective population size is fundamental in population genetics and characterizes genetic diversity. To infer past population dynamics from molecular sequence data, coalescent-based models have been developed for Bayesian nonparametric estimation of effective population size over time. Among the most successful is a Gaussian Markov random field (GMRF) model for a single gene locus. Here, we present a generalization of the GMRF model that allows for the analysis of multilocus sequence data. Using simulated data, we demonstrate the improved performance of our method to recover true population trajectories and the time to the most recent common ancestor (TMRCA). We analyze a multilocus alignment of HIV-1 CRF02_AG gene sequences sampled from Cameroon. Our results are consistent with HIV prevalence data and uncover some aspects of the population history that go undetected in Bayesian parametric estimation. Finally, we recover an older and more reconcilable TMRCA for a classic ancient DNA data set.
coalescent; smoothing; effective population size; Gaussian Markov random fields
Information on global human movement patterns is central to spatial epidemiological models used to predict the behavior of influenza and other infectious diseases. Yet it remains difficult to test which modes of dispersal drive pathogen spread at various geographic scales using standard epidemiological data alone. Evolutionary analyses of pathogen genome sequences increasingly provide insights into the spatial dynamics of influenza viruses, but to date they have largely neglected the wealth of information on human mobility, mainly because no statistical framework exists within which viral gene sequences and empirical data on host movement can be combined. Here, we address this problem by applying a phylogeographic approach to elucidate the global spread of human influenza subtype H3N2 and assess its ability to predict the spatial spread of human influenza A viruses worldwide. Using a framework that estimates the migration history of human influenza while simultaneously testing and quantifying a range of potential predictive variables of spatial spread, we show that the global dynamics of influenza H3N2 are driven by air passenger flows, whereas at more local scales spread is also determined by processes that correlate with geographic distance. Our analyses further confirm a central role for mainland China and Southeast Asia in maintaining a source population for global influenza diversity. By comparing model output with the known pandemic expansion of H1N1 during 2009, we demonstrate that predictions of influenza spatial spread are most accurate when data on human mobility and viral evolution are integrated. In conclusion, the global dynamics of influenza viruses are best explained by combining human mobility data with the spatial information inherent in sampled viral genomes. The integrated approach introduced here offers great potential for epidemiological surveillance through phylogeographic reconstructions and for improving predictive models of disease control.
What explains the geographic dispersal of emerging pathogens? Reconstructions of evolutionary history from pathogen gene sequences offer qualitative descriptions of spatial spread, but current approaches are poorly equipped to formally test and quantify the contribution of different potential explanatory factors, such as human mobility and demography. Here, we use a novel phylogeographic method to evaluate multiple potential predictors of viral spread in human influenza dynamics. We identify air travel as the predominant driver of global influenza migration, whilst also revealing the contribution of other mobility processes at more local scales. We demonstrate the power of our inter-disciplinary approach by using it to predict the global pandemic expansion of H1N1 influenza in 2009. Our study highlights the importance of integrating evolutionary and ecological information when studying the dynamics of infectious disease.
Influenza viruses undergo continual antigenic evolution allowing mutant viruses to evade host immunity acquired to previous virus strains. Antigenic phenotype is often assessed through pairwise measurement of cross-reactivity between influenza strains using the hemagglutination inhibition (HI) assay. Here, we extend previous approaches to antigenic cartography, and simultaneously characterize antigenic and genetic evolution by modeling the diffusion of antigenic phenotype over a shared virus phylogeny. Using HI data from influenza lineages A/H3N2, A/H1N1, B/Victoria and B/Yamagata, we determine patterns of antigenic drift across viral lineages, showing that A/H3N2 evolves faster and in a more punctuated fashion than other influenza lineages. We also show that year-to-year antigenic drift appears to drive incidence patterns within each influenza lineage. This work makes possible substantial future advances in investigating the dynamics of influenza and other antigenically-variable pathogens by providing a model that intimately combines molecular and antigenic evolution.
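Pairwise HI measurements are commonly turned into antigenic distances before any mapping is attempted. The sketch below assumes the usual cartography convention, in which the distance between a virus and a serum is the log2 drop in titer relative to the highest titer observed for that serum. The function name and the toy titers are hypothetical, and the paper's own model may treat titers differently.

```python
import math

def antigenic_distances(titers):
    """Convert HI titers to log2 distances, per serum.

    titers: dict mapping (virus, serum) -> positive HI titer.
    Distance = log2(max titer for that serum) - log2(observed titer).
    """
    max_by_serum = {}
    for (_virus, serum), titer in titers.items():
        max_by_serum[serum] = max(max_by_serum.get(serum, 0.0), titer)
    return {
        (virus, serum): math.log2(max_by_serum[serum]) - math.log2(titer)
        for (virus, serum), titer in titers.items()
    }

# Made-up titers of three viruses against one serum.
titers = {
    ("virus_A", "serum_A"): 1280,
    ("virus_B", "serum_A"): 320,
    ("virus_C", "serum_A"): 40,
}
distances = antigenic_distances(titers)
```

Each unit of distance corresponds to one two-fold dilution step, so a four-fold lower titer maps to a distance of about 2 units.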
Every year, seasonal influenza, commonly called flu, infects up to one in five people around the world, and causes up to half a million deaths. Even though the human immune system can detect and destroy the virus that causes influenza, people can catch flu many times throughout their lifetimes because the virus keeps evolving in an effort to avoid the immune system. This antigenic drift—so-called because the antigens displayed by the virus keep changing—also explains why influenza vaccines become less effective over time and need to be reformulated every year.
It is possible to determine which antigens are displayed by a new strain of the virus by observing how blood samples that respond to known strains respond to the new strain. This information about the “antigenic phenotype” of the virus can be plotted on an antigenic map in which strains with similar antigens cluster together. Gene sequencing has shown that there are four subtypes of the flu virus that commonly infect people; but the relationship between changes in antigenic phenotype and changes in gene sequences of the influenza virus is poorly understood.
Bedford et al. have now developed an approach to combine antigenic maps with genetic information about the four subtypes of the human flu virus. This revealed that the antigenic phenotype of H3N2—a subtype that is becoming increasingly common—evolved faster than the other three subtypes. Further, a correlation was observed between antigenic drift and the number of new influenza cases per year for each flu strain. This suggests that knowing which antigenic phenotypes are present at the start of flu season could help predict which strains of the virus will predominate later on.
The work of Bedford et al. provides a useful framework to study influenza, and could help to pinpoint which changes in viral genes cause the changes in antigens. This information could potentially speed up the development of new flu vaccines for each flu season.
influenza; evolution; antigenic cartography; phylogenetics; Bayesian inference; multidimensional scaling; viruses
Recent implementations of path sampling (PS) and stepping-stone sampling (SS) have been shown to outperform the harmonic mean estimator (HME) and a posterior simulation-based analog of Akaike’s information criterion through Markov chain Monte Carlo (AICM) in Bayesian model selection of demographic and molecular clock models. Almost simultaneously, a Bayesian model averaging approach was developed that avoids conditioning on a single model and instead averages over a set of relaxed clock models. This approach returns estimates of the posterior probability of each clock model, from which one can estimate the Bayes factor in favor of the maximum a posteriori (MAP) clock model; however, this Bayes factor estimate may suffer when the posterior probability of the MAP model approaches 1. Here, we compare these two recent developments with the HME, stabilized/smoothed HME (sHME), and AICM, using both synthetic and empirical data. Our comparison shows reassuringly that MAP identification and its Bayes factor provide similar performance to PS and SS and that these approaches considerably outperform HME, sHME, and AICM in selecting the correct underlying clock model. Using a large set of empirical data sets, we also illustrate the importance of using proper priors.
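The step from posterior model probabilities to a Bayes factor can be sketched as the ratio of posterior odds to prior odds. The example below is a minimal illustration under stated assumptions: a uniform prior over the candidate clock models, the MAP model compared against all other models pooled, and model-indicator samples taken from a hypothetical MCMC run. The function names and model labels are invented for the example.

```python
import math
from collections import Counter

def bayes_factor_from_posterior(post_prob, prior_prob):
    """Bayes factor for one model versus the rest: posterior odds / prior odds."""
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if not 0.0 <= post_prob <= 1.0:
        raise ValueError("post_prob must lie between 0 and 1")
    if post_prob == 1.0:
        # The chain never visited another model, so the estimate is unbounded.
        return math.inf
    posterior_odds = post_prob / (1.0 - post_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds

def map_model_bayes_factor(indicator_samples, n_models):
    """Identify the MAP model from sampled indicators and estimate its Bayes factor.

    Assumes a uniform prior of 1 / n_models on each model.
    """
    counts = Counter(indicator_samples)
    map_model, visits = counts.most_common(1)[0]
    post_prob = visits / len(indicator_samples)
    return map_model, post_prob, bayes_factor_from_posterior(post_prob, 1.0 / n_models)

# Hypothetical indicator samples over three clock models.
samples = ["lognormal"] * 80 + ["exponential"] * 15 + ["strict"] * 5
model, prob, bf = map_model_bayes_factor(samples, n_models=3)
```

The `math.inf` branch shows why the estimate degrades as the MAP posterior probability approaches 1: with a finite number of samples, competing models are visited rarely or never, so the posterior odds become unstable or unbounded.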
model comparison; marginal likelihood; Bayes factors; path sampling; stepping-stone sampling; model averaging; molecular clock; Bayesian inference; phylogeny; BEAST
Multidrug-resistant (MDR) HIV-1 presents a challenge to the efficacy of antiretroviral therapy (ART). To examine mechanisms leading to MDR variants in infected individuals, we studied recombination between single viral genomes from the genital tract and plasma of a woman initiating ART. We determined HIV-1 RNA sequences and drug resistance profiles of 159 unique viral variants obtained before ART and semiannually for 4 years thereafter. Soon after initiating zidovudine, lamivudine, and nevirapine, resistant variants and intrapatient HIV-1 recombinants were detected in both compartments; the recombinants had inherited genetic material from both genital and plasma-derived viruses. Twenty-three unique recombinants were documented during 4 years of therapy, comprising ∼22% of variants. Most recombinant genomes displayed similar breakpoints and clustered phylogenetically, suggesting evolution from common ancestors. Longitudinal analysis demonstrated that MDR recombinants were common and persistent, indicating that recombination, in addition to point mutation, can contribute to the evolution of MDR HIV-1 in viremic individuals.
Motivation: Statistical methods for comparing relative rates of synonymous and non-synonymous substitutions maintain a central role in detecting positive selection. To identify selection, researchers often estimate the ratio of these relative rates (dN/dS) at individual alignment sites. Fitting a codon substitution model that captures heterogeneity in dN/dS across sites provides a reliable way to perform such estimation, but it remains computationally prohibitive for massive datasets. By using crude estimates of the numbers of synonymous and non-synonymous substitutions at each site, counting approaches scale well to large datasets, but they fail to account for ancestral state reconstruction uncertainty and to provide site-specific dN/dS estimates.
Results: We propose a hybrid solution that borrows the computational strength of counting methods, but augments these methods with empirical Bayes modeling to produce a relatively fast and reliable method capable of estimating site-specific dN/dS values in large datasets. Importantly, our hybrid approach, set in a Bayesian framework, integrates over the posterior distribution of phylogenies and ancestral reconstructions to quantify uncertainty about site-specific dN/dS estimates. Simulations demonstrate that this method competes well with more-principled statistical procedures and, in some cases, even outperforms them. We illustrate the utility of our method using human immunodeficiency virus, feline panleukopenia and canine parvovirus evolution examples.
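The idea of augmenting per-site substitution counts with empirical Bayes modeling can be illustrated with a simple Poisson-gamma shrinkage step. This is a simplified sketch, not the BEAST implementation: it assumes per-site counts and their neutral expectations are already available, fits the gamma prior by the method of moments, and ignores uncertainty in phylogenies and ancestral reconstructions. All function and variable names are hypothetical.

```python
from statistics import mean, pvariance

def shrink_counts(counts):
    """Posterior-mean rates under a Poisson likelihood and a gamma prior.

    The gamma shape (alpha) and rate (beta) are fitted by matching the
    mean and variance of the counts across sites.
    """
    m = mean(counts)
    v = pvariance(counts)
    if v <= m:
        # No overdispersion detected: pool every site to the overall mean.
        return [float(m)] * len(counts)
    beta = m / (v - m)
    alpha = m * beta
    # Each estimate is a weighted average of the site count and the mean.
    return [(alpha + c) / (beta + 1.0) for c in counts]

def site_ratios(nonsyn_counts, syn_counts, expected_nonsyn, expected_syn):
    """Site-specific ratio of non-synonymous to synonymous rates.

    Each smoothed count is divided by its expectation under neutrality
    (assumed to be supplied by the caller) before taking the ratio.
    """
    n_hat = shrink_counts(nonsyn_counts)
    s_hat = shrink_counts(syn_counts)
    ratios = []
    for n, s, exp_n, exp_s in zip(n_hat, s_hat, expected_nonsyn, expected_syn):
        if s <= 0 or exp_n <= 0 or exp_s <= 0:
            raise ValueError("smoothed synonymous rate and expectations must be positive")
        ratios.append((n / exp_n) / (s / exp_s))
    return ratios

# Made-up counts for four alignment sites.
nonsyn = [0, 1, 6, 1]
syn = [0, 2, 4, 10]
ratios = site_ratios(nonsyn, syn, [2.0] * 4, [1.0] * 4)
```

Shrinkage keeps the ratio finite at sites with no observed synonymous substitutions, which is one reason raw counting alone struggles to give usable site-specific estimates.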
Availability: Renaissance counting is implemented in the development branch of BEAST, freely available at http://code.google.com/p/beast-mcmc/. The method will be made available in the next public release of the package, including support to set up analyses in BEAUti.
Supplementary data are available at Bioinformatics online.