Probabilistic inference of a phylogenetic tree from molecular sequence data is predicated on a substitution model describing the relative rates of change between character states along the tree for each site in the multiple sequence alignment. Commonly, one assumes that the substitution model is homogeneous across sites within large partitions of the alignment, assigns these partitions a priori, and then fixes their underlying substitution model to the best-fitting model from a hierarchy of named models. Here, we introduce an automatic model selection and model averaging approach within a Bayesian framework that simultaneously estimates the number of partitions, the assignment of sites to partitions, the substitution model for each partition, and the uncertainty in these selections. This new approach is implemented as an add-on to the BEAST 2 software platform. We find that this approach dramatically improves the fit of the nucleotide substitution model compared with existing approaches, and we show, using a number of example data sets, that as many as nine partitions are required to explain the heterogeneity in nucleotide substitution process across sites in a single gene analysis. In some instances, this improved modeling of the substitution process can have a measurable effect on downstream inference, including the estimated phylogeny, relative divergence times, and effective population size histories.
across-site rate variation; Dirichlet process mixture model; Bayesian model selection
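To make the idea of jointly estimating the number of partitions and the site assignments concrete, the following is a minimal sketch of the Chinese restaurant process, the partition prior induced by a Dirichlet process. It illustrates the prior only and is not the BEAST 2 add-on's implementation; the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions made for this illustration.

```python
import random


def sample_crp_partition(n_sites, alpha, rng=None):
    """Draw site-to-category assignments from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name); larger
    values favor more categories. Returns (assignments, counts).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = rng or random.Random()
    assignments = []  # category index for each site
    counts = []       # number of sites currently in each category
    for i in range(n_sites):
        # Site i joins existing category k with probability counts[k] / (i + alpha)
        # and opens a new category with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        chosen = len(counts)  # default: open a new category
        cumulative = 0.0
        for k, c in enumerate(counts):
            cumulative += c
            if u < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts
```

Under this prior the number of categories is not fixed in advance, which is what allows a sampler to explore different numbers of partitions rather than assigning them a priori.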
Phylogeographic approaches help uncover the imprint that spatial epidemiological processes leave in the genomes of fast evolving viruses. Recent Bayesian inference methods that consider phylogenetic diffusion of discretely and continuously distributed traits offer a unique opportunity to explore genotypic and phenotypic evolution in greater detail. To provide a taste of the recent advances in viral diffusion approaches, we highlight key findings arising at the intra-host, local and global epidemiological scales. We also outline future areas of research and discuss how these may contribute to a quantitative understanding of the phylodynamics of RNA viruses.
Multiple origins indicate this serotype was introduced in several episodes.
Dengue virus serotype 4 (DENV-4) reemerged in Roraima State, Brazil, 28 years after it was last detected in the country in 1982. To study the origin and evolution of this reemergence, full-length sequences were obtained for 16 DENV-4 isolates from northern (Roraima, Amazonas, Pará States) and northeastern (Bahia State) Brazil during the 2010 and 2011 dengue virus seasons and for an isolate from the 1982 epidemic in Roraima. Spatiotemporal dynamics of DENV-4 introductions in Brazil were determined by applying Bayesian phylogeographic analyses to envelope genes and full genomes. An introduction of genotype I into Brazil from Southeast Asia was confirmed, and full genome phylogeographic analyses revealed multiple introductions of DENV-4 genotype II in Brazil, providing evidence for at least 3 introductions of this genotype within the last decade: 2 from Venezuela to Roraima and 1 from Colombia to Amazonas. The phylogeographic analysis of full genome data has demonstrated the origins of DENV-4 throughout Brazil.
dengue virus; serotype 4; molecular epidemiology; phylogeography; Brazil; viruses; reemergence; genetic characterization; spatiotemporal patterns
A birth-death process is a continuous-time Markov chain that counts the number of particles in a system over time. In the general process with n current particles, a new particle is born with instantaneous rate λ_n and a particle dies with instantaneous rate μ_n. Currently no robust and efficient method exists to evaluate the finite-time transition probabilities in a general birth-death process with arbitrary birth and death rates. In this paper, we first revisit the theory of continued fractions to obtain expressions for the Laplace transforms of these transition probabilities and make explicit an important derivation connecting transition probabilities and continued fractions. We then develop an efficient algorithm for computing these probabilities that analyzes the error associated with approximations in the method. We demonstrate that this error-controlled method agrees with known solutions and outperforms previous approaches to computing these probabilities. Finally, we apply our novel method to several important problems in ecology, evolution, and genetics.
General birth-death process; Continuous-time Markov chain; Transition probabilities; Population genetics; Ecology; Evolution
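To illustrate what a finite-time transition probability is for a birth-death process with state-dependent rates, here is a minimal sketch that truncates the state space and applies uniformization. This is a deliberately simple, swapped-in baseline and not the continued-fraction method of the abstract above; the function name, the truncation bound `max_state`, and the tolerance `tol` are assumptions made for this illustration.

```python
import math


def birth_death_transition_probs(birth, death, start, t, max_state, tol=1e-12):
    """Approximate P(X_t = j | X_0 = start) for j = 0..max_state.

    birth(n) and death(n) return the instantaneous rates in state n.
    The state space is truncated at max_state (an assumption of this sketch),
    so births out of max_state are switched off.
    """
    n = max_state + 1
    lam = [birth(i) if i < max_state else 0.0 for i in range(n)]
    mu = [death(i) if i > 0 else 0.0 for i in range(n)]
    rate = max(l + m for l, m in zip(lam, mu))  # uniformization rate
    if rate == 0.0 or t == 0.0:
        return [1.0 if j == start else 0.0 for j in range(n)]
    if rate * t > 700.0:
        raise ValueError("rate * t is too large for this simple sketch")

    def step(v):
        # One jump of the uniformized discrete-time chain: v -> v (I + Q / rate).
        out = [0.0] * n
        for i, p in enumerate(v):
            if p == 0.0:
                continue
            out[i] += p * (1.0 - (lam[i] + mu[i]) / rate)
            if i + 1 < n:
                out[i + 1] += p * lam[i] / rate
            if i > 0:
                out[i - 1] += p * mu[i] / rate
        return out

    v = [0.0] * n
    v[start] = 1.0
    weight = math.exp(-rate * t)        # Poisson(rate * t) probability of 0 jumps
    probs = [weight * p for p in v]
    accumulated = weight                # Poisson mass summed so far
    k = 0
    while 1.0 - accumulated > tol and k < 10000:
        k += 1
        v = step(v)
        weight *= rate * t / k          # Poisson probability of k jumps
        accumulated += weight
        for j in range(n):
            probs[j] += weight * v[j]
    return probs
```

For a pure-death process with per-particle rate 1 started at 5 particles, the probability that all 5 survive to time 0.5 should equal exp(-2.5), which gives a quick check of the sketch against a known solution. Truncation makes this approach unsuitable when the process can reach very large states, which is one motivation for methods with explicit error control.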
Host species switches by bacterial pathogens leading to new endemic infections are important evolutionary events that are difficult to reconstruct over the long term. We investigated the host switching of Staphylococcus aureus over a long evolutionary timeframe by developing Bayesian phylogenetic methods to account for uncertainty about past host associations and using estimates of evolutionary rates from serially sampled whole-genome data. Results suggest multiple jumps back and forth between human and bovids with the first switch from humans to bovids taking place around 5500 BP, coinciding with the expansion of cattle domestication throughout the Old World. The first switch to poultry is estimated at around 275 BP, long after domestication but still preceding large-scale commercial farming. These results are consistent with a central role for anthropogenic change in the emergence of new endemic diseases.
Bayesian phylogenetics; molecular clocks; bacterial evolution; host switching
The interplay between C-C chemokine receptor type 5 (CCR5) host genetic background, disease progression, and intrahost HIV-1 evolutionary dynamics remains unclear because differences in viral evolution between hosts limit the ability to draw conclusions across hosts stratified into clinically relevant populations. Similar inference problems are proliferating across many measurably evolving pathogens for which intrahost sequence samples are readily available. To this end, we propose novel hierarchical phylogenetic models (HPMs) that incorporate fixed effects to test for differences in dynamics across host populations in a formal statistical framework employing stochastic search variable selection and model averaging. To clarify the role of CCR5 host genetic background and disease progression on viral evolutionary patterns, we obtain gp120 envelope sequences from clonal HIV-1 variants isolated at multiple time points in the course of infection from populations of HIV-1–infected individuals who only harbored CCR5-using HIV-1 variants at all time points. Presence or absence of a CCR5 wt/Δ32 genotype and progressive or long-term nonprogressive course of infection stratify the clinical populations in a two-way design. As compared with the standard approach of analyzing sequences from each patient independently, the HPM provides more efficient estimation of evolutionary parameters such as nucleotide substitution rates and dN/dS rate ratios, as shown by significant shrinkage of the estimator variance. The fixed effects also correct for nonindependence of data between populations and result in even further shrinkage of individual patient estimates. Model selection suggests an association between nucleotide substitution rate and disease progression, but a role for CCR5 genotype remains elusive.
Given the absence of clear dN/dS differences between patient groups, delayed onset of AIDS symptoms appears to be solely associated with lower viral replication rates rather than with differences in selection on amino acid fixation.
CCR5; envelope; HIV-1; hierarchical phylogenetic models; disease progression; Bayesian inference
Staphylococcus aureus is a common cause of infections that has undergone rapid global spread over recent decades. Formal phylogeographic methods have not yet been applied to the molecular epidemiology of bacterial pathogens because the limited genetic diversity of data sets based on individual genes usually results in poor phylogenetic resolution. Here, we investigated a whole-genome single nucleotide polymorphism (SNP) data set of health care-associated Methicillin-resistant S. aureus sequence type 239 (HA-MRSA ST239) strains, which we analyzed using Markov spatial models that incorporate geographical sampling distributions. The reconstructed timescale indicated a temporal origin of this strain shortly after the introduction of Methicillin, followed by global pandemic spread. The estimate of the temporal origin was robust to the molecular clock, coalescent prior, full/intergenic/synonymous SNP inclusion, and correction for excluded invariant site patterns. Finally, phylogeographic analyses statistically supported the role of human movement in the global dissemination of HA-MRSA ST239, although they were unable to conclusively resolve the location of the root. This study demonstrates that bacterial genomes can indeed contain sufficient evolutionary information to elucidate the temporal and spatial dynamics of transmission. Future applications of this approach to other bacterial strains may provide valuable epidemiological insights that may justify the cost of genome-wide typing.
Bayesian inference; phylogeography; phylogenetics; measurably evolving population
But Tuffley and Steel (1997) introduced a model called No Common Mechanism (NCM), in which characters may—but are not required to—vary their relative rates independently, both within and between branches. Because the independent variation is taken only as a possibility, not as a requirement, NCM would apply to almost any situation, and so may be accepted as realistic. This is useful because Tuffley and Steel also showed that maximum likelihood under NCM selects the same trees as does parsimony. With the realistic NCM in the background, then, most parsimonious trees have greatest power to explain available observations.
Computational evolutionary biology, statistical phylogenetics and coalescent-based population genetics are becoming increasingly central to the analysis and understanding of molecular sequence data. We present the Bayesian Evolutionary Analysis by Sampling Trees (BEAST) software package version 1.7, which implements a family of Markov chain Monte Carlo (MCMC) algorithms for Bayesian phylogenetic inference, divergence time dating, coalescent analysis, phylogeography and related molecular evolutionary analyses. This package includes an enhanced graphical user interface program called Bayesian Evolutionary Analysis Utility (BEAUti) that enables access to advanced models for molecular sequence and phenotypic trait evolution that were previously available to developers only. The package also provides new tools for visualizing and summarizing multispecies coalescent and phylogeographic analyses. BEAUti and BEAST 1.7 are open source under the GNU lesser general public license and available at http://beast-mcmc.googlecode.com and http://beast.bio.ed.ac.uk
Bayesian phylogenetics; evolution; phylogenetics; molecular evolution; coalescent theory
Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.
Bayes factor; Bayesian inference; MCMC; model averaging; model choice
Heterochronous data sets comprise molecular sequences sampled at different points in time. If the temporal range of the sampled sequences is large relative to the rate of mutation, the sampling times can directly calibrate evolutionary rates to calendar time. Here, we extend this calibration process to provide a full probabilistic method that utilizes temporal information in heterochronous data sets to estimate sampling times (leaf-ages) for sequences for which this information is unavailable. Our method is similar to relaxing the constraints of the molecular clock on specific lineages within a phylogenetic tree. Using a combination of synthetic and empirical data sets, we demonstrate that the method estimates leaf-ages reliably and accurately. Potential applications of our approach include incorporating samples of uncertain or radiocarbon-infinite age into ancient DNA analyses, evaluating the temporal signal in a particular sequence or data set, and exploring the reliability of sequence ages that are somehow contentious.
heterochronous sequences; ancient DNA; molecular clock; viral evolution; measurably evolving populations
Phylogeographic methods enable inference of the geographical history of genetic lineages. Recent examples successfully explore the patterns of human migration and the origins and spread of viral pandemics. Nevertheless, longstanding disagreement exists over the use and validity of certain phylogeographic inference methodologies. In this paper, we highlight three distinct frameworks for phylogeographic inference to give a taste of this disagreement. Each of the three approaches presents a different viewpoint on phylogeography, most fundamentally how we view the relationship between the inferred history of the sample and the history of the population the sample is embedded in. Satisfactory resolution of this relationship between history of the tree and history of the population remains a challenge for all but the most trivial models of phylogeographic processes. Intriguingly, we believe that some recent methods that entirely side-step inference about the history of the population will eventually help the field toward this goal.
Characterization of residual plasma virus during antiretroviral therapy (ART) is a high priority to improve understanding of HIV-1 pathogenesis and therapy. To understand the evolution of HIV-1 pol and env genes in viremic patients under selective pressure of ART, we performed longitudinal analyses of plasma-derived pol and env sequences from single HIV-1 genomes. We tested the hypotheses that drug resistance in pol was unrelated to changes in coreceptor usage (tropism), and that recombination played a role in evolution of viral strains. Recombinants were identified by using Bayesian and other computational methods. High-level genotypic resistance was seen in ~70% of X4 and R5 strains during ART. There was no significant association between resistance and tropism. Each patient displayed at least one recombinant encompassing env and representing a change in predicted tropism. These data suggest that, in addition to mutation, recombination can play a significant role in shaping HIV-1 evolution.
HIV-1 drug resistance; HIV-1 recombination; HIV-1 tropism
Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.
Bayesian phylogenetics; GPU; maximum likelihood; parallel computing
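As a reminder of what a phylogenetic likelihood calculation involves for a single alignment site, here is a minimal sketch of Felsenstein's pruning recursion under the Jukes-Cantor (JC69) model. It does not use or mirror the BEAGLE API; the tree encoding (a leaf is a base character, an internal node is a list of `(child, branch_length)` pairs) and all function names are assumptions made for this illustration.

```python
import math

BASES = "ACGT"


def jc69_matrix(t):
    """Transition probabilities under JC69 for branch length t (substitutions/site)."""
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e
    diff = 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def partial_likelihoods(node):
    """Return L[i] = P(observed tips below node | state i at node)."""
    if isinstance(node, str):
        # Leaf: likelihood 1 for the observed base, 0 otherwise.
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, length in node:
        p = jc69_matrix(length)
        child_partials = partial_likelihoods(child)
        for i in range(4):
            # Sum over the child's state, then multiply across children.
            result[i] *= sum(p[i][j] * child_partials[j] for j in range(4))
    return result


def site_likelihood(root):
    """Likelihood of one site, using the uniform JC69 stationary distribution."""
    return sum(0.25 * x for x in partial_likelihoods(root))
```

The inner sums are independent across sites and across states, which is the kind of fine-grained parallelism that GPUs and SIMD instructions can exploit. For two tips that both show `A`, on branches of length 0.1 and 0.2, `site_likelihood([("A", 0.1), ("A", 0.2)])` should equal 0.25 times the JC69 probability of no net change over a branch of length 0.3.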
Summary: SPREAD is a user-friendly, cross-platform application to analyze and visualize Bayesian phylogeographic reconstructions incorporating spatial–temporal diffusion. The software maps phylogenies annotated with both discrete and continuous spatial information and can export high-dimensional posterior summaries to keyhole markup language (KML) for animation of the spatial diffusion through time in virtual globe software. In addition, SPREAD implements Bayes factor calculation to evaluate the support for hypotheses of historical diffusion among pairs of discrete locations based on Bayesian stochastic search variable selection estimates. SPREAD takes advantage of multicore architectures to process large joint posterior distributions of phylogenies and their spatial diffusion and produces visualizations as compelling and interpretable statistical summaries for the different spatial projections.
Availability: SPREAD is licensed under the GNU Lesser GPL and its source code is freely available as a GitHub repository: https://github.com/phylogeography/SPREAD
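The Bayes factor for a diffusion rate between two locations can be sketched as the ratio of posterior odds to prior odds that the rate's indicator variable is switched on. The following is a minimal illustration of that ratio and not SPREAD's own code; the function name is hypothetical, and the prior inclusion probability `prior_prob` is treated as a user-supplied input because the summary above does not state how it is derived.

```python
import math


def bssvs_bayes_factor(indicator_samples, prior_prob):
    """Bayes factor for inclusion of one rate from sampled 0/1 indicators.

    indicator_samples: posterior samples of the rate indicator (0 or 1).
    prior_prob: prior probability that the indicator equals 1 (assumed input).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if not indicator_samples:
        raise ValueError("need at least one posterior sample")
    posterior_prob = sum(indicator_samples) / len(indicator_samples)
    if posterior_prob == 1.0:
        # Every sample includes the rate; the finite-sample estimate is unbounded.
        return math.inf
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds
```

With indicators `[1, 1, 1, 0]` and a prior probability of 0.5, the posterior odds are 3 and the prior odds are 1, so the sketch returns a Bayes factor of 3. An infinite value signals only that the posterior sample is too small to bound the estimate.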
Double-stranded (ds) DNA viruses are often described as evolving through long-term codivergent associations with their hosts, a pattern that is expected to be associated with low rates of nucleotide substitution. However, the hypothesis of codivergence between dsDNA viruses and their hosts has rarely been rigorously tested, even though the vast majority of nucleotide substitution rate estimates for dsDNA viruses are based upon this assumption. It is therefore important to estimate the evolutionary rates of dsDNA viruses independent of the assumption of host-virus codivergence. Here, we explore the use of temporally structured sequence data within a Bayesian framework to estimate the evolutionary rates for seven human dsDNA viruses, including variola virus (VARV) (the causative agent of smallpox) and herpes simplex virus-1. Our analyses reveal that although the VARV genome is likely to evolve at a rate of approximately 1 × 10^−5 substitutions/site/year and hence approaching that of many RNA viruses, the evolutionary rates of many other dsDNA viruses remain problematic to estimate. Synthetic data sets were constructed to inform our interpretation of the substitution rates estimated for these dsDNA viruses and the analysis of these demonstrated that given a sequence data set of appropriate length and sampling depth, it is possible to use time-structured analyses to estimate the substitution rates of many dsDNA viruses independently from the assumption of host-virus codivergence. Finally, the discovery that some dsDNA viruses may evolve at rates approaching those of RNA viruses has important implications for our understanding of the long-term evolutionary history and emergence potential of this major group of viruses.
double-stranded DNA viruses; nucleotide substitution rates; evolution; codivergence; variola virus
We propose a Bayesian multivariate model in which a single linear combination of the covariates predicts multiple outcomes simultaneously. The single linear combination is a data-derived score along the lines of the Apache or Charlson index scores for critically ill patients, the Karnofsky or Eastern Cooperative Oncology Group score for cancer patients or Euro-score for cardiac patients that may be used to predict multiple outcomes. Outcomes may be discrete or continuous and we use a composition of generalized linear models for the marginal distribution for each outcome. We explain how to set the prior distribution and we use Markov chain Monte Carlo methods to calculate the posterior distribution. We propose two types of expanded models to diagnose whether each outcome indeed has predictor effects common with the other outcomes, and whether a particular predictor is commonly predictive for all outcomes. We determine a final model based on the diagnostic models. The method is applied to a study yielding multiple psychometric outcomes of mixed type measured in young people living with human immunodeficiency virus.
Bayesian Wald test; human immunodeficiency virus; index construction; multivariate regression; single index model
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerate many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100-fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board.
Block relaxation; EM and MM algorithms; multidimensional scaling; nonnegative matrix factorization; parallel computing; PET scanning
Research aimed at understanding the geographic context of evolutionary histories is burgeoning across biological disciplines. Recent endeavors attempt to interpret contemporaneous genetic variation in the light of increasingly detailed geographical and environmental observations. Such interest has promoted the development of phylogeographic inference techniques that explicitly aim to integrate such heterogeneous data. One promising development involves reconstructing phylogeographic history on a continuous landscape. Here, we present a Bayesian statistical approach to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data. Moreover, by accommodating branch-specific variation in dispersal rates, we relax the most restrictive assumption of the standard Brownian diffusion process and demonstrate increased statistical efficiency in spatial reconstructions of overdispersed random walks by analyzing both simulated and real viral genetic data. We further illustrate how drawing inference about summary statistics from a fully specified stochastic process over both sequence evolution and spatial movement reveals important characteristics of a rabies epidemic. Together with recent advances in discrete phylogeographic inference, the continuous model developments furnish a flexible statistical framework for biogeographical reconstructions that is easily expanded upon to accommodate various landscape genetic features.
phylogeography; Bayesian inference; random walk; Brownian diffusion; rabies; BEAST; phylodynamics
The emergence and rapid global spread of the swine-origin H1N1/09 pandemic influenza A virus in humans underscores the importance of swine populations as reservoirs for genetically diverse influenza viruses with the potential to infect humans. However, despite their significance for animal and human health, relatively little is known about the phylogeography of swine influenza viruses in the United States. This study utilizes an expansive data set of hemagglutinin (HA1) sequences (n = 1516) from swine influenza viruses collected in North America during the period 2003–2010. With these data we investigate the spatial dissemination of a novel influenza virus of the H1 subtype that was introduced into the North American swine population via two separate human-to-swine transmission events around 2003. Bayesian phylogeographic analysis reveals that the spatial dissemination of this influenza virus in the US swine population follows long-distance swine movements from the Southern US to the Midwest, a corn-rich commercial center that imports millions of swine annually. Hence, multiple genetically diverse influenza viruses are introduced and co-circulate in the Midwest, providing the opportunity for genomic reassortment. Overall, the Midwest serves primarily as an ecological sink for swine influenza in the US, with sources of virus genetic diversity instead located in the Southeast (mainly North Carolina) and South-central (mainly Oklahoma) regions. Understanding the importance of long-distance pig transportation in the evolution and spatial dissemination of the influenza virus in swine may inform future strategies for the surveillance and control of influenza, and perhaps other swine pathogens.
Since 1998, genetically and antigenically diverse influenza A viruses have circulated in North American swine due to continuous cross-species transmission and reassortment with avian and human influenza viruses, presenting a pandemic threat to humans. Millions of swine are transported year-round from the southern United States into the corn-rich Midwest, but the importance of these movements in the spatial dissemination and evolution of the influenza virus in swine is unknown. Using a large data set of influenza virus sequences collected in North American swine during 2003–2010, we investigated the spatial dynamics of two influenza viruses of the H1 subtype that were introduced into swine from humans around 2003. Employing recently developed Bayesian phylogeography methods, we find that the spread of this influenza virus follows the large-scale transport of swine from the South to the Midwest. Based on this pattern of viral migration, we suggest that the genetic diversity of swine influenza viruses in the Midwest is continually augmented by the importation of viruses from source populations located in the South. Understanding the importance of long-distance pig movements in the evolution and spatial dissemination of influenza virus in swine may inform future strategies for the surveillance and control of influenza, and perhaps other swine pathogens.
Trinidad, like many other American regions, experiences repeated epizootics of yellow fever virus (YFV). However, it is unclear whether these result from in situ evolution (enzootic maintenance) or regular reintroduction of YFV from the South American mainland. To discriminate between these hypotheses, we carried out a Bayesian phylogeographic analysis of over 100 prM/E gene sequences sampled from 8 South American countries. These included newly sequenced isolates from the recent 2008-2009 Trinidad epizootic and isolates derived from mainland countries within the last decade. The results indicate that the most recent common ancestor of the 2008-2009 epizootic existed in Trinidad 4.2 years prior to 2009 (95% highest probability density [HPD], 0.5 to 9.0 years). Our data also suggest a Trinidad origin for the progenitor of the 1995 Trinidad epizootic and support in situ evolution of YFV between the 1979 and 1988-1989 Trinidad epizootics. Using the same phylogeographic approach, we also inferred the historical spread of YFV in the Americas. The results suggest a Brazilian origin for YFV in the Americas and an overall dispersal rate of 182 km/year (95% HPD, 52 to 462 km/year), with Brazil as the major source population for surrounding countries. There is also strong statistical support for epidemiological links between four Brazilian regions and other countries. In contrast, while there were well-supported epidemiological links within Peru, the only statistically supported external link was a relatively weak link with neighboring Bolivia. Lastly, we performed a complete analysis of the genome of a newly sequenced Trinidad 2009 isolate, the first complete genome for a genotype I YFV isolate.
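Interval summaries such as the 95% HPD values quoted above are typically computed from posterior samples. A minimal sketch, assuming a unimodal posterior and using the common shortest-window estimator, is shown below; the function name and the `mass` parameter are hypothetical, and this is not necessarily the procedure used in the study.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing a fraction `mass` of the samples.

    Assumes a unimodal posterior; for multimodal posteriors the true
    highest-density region may be a union of disjoint intervals.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the window
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]
```

Unlike an equal-tailed interval, the shortest window can sit asymmetrically around the point estimate, which is consistent with skewed intervals such as 0.5 to 9.0 years around an estimate of 4.2 years.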
Evolutionary biologists have introduced numerous statistical approaches to explore nonvertical evolution, such as horizontal gene transfer, recombination, and genomic reassortment, through collections of Markov-dependent gene trees. These tree collections allow for inference of nonvertical evolution, but only indirectly, making findings difficult to interpret and models difficult to generalize. An alternative approach to explore nonvertical evolution relies on phylogenetic networks. These networks provide a framework to model nonvertical evolution but leave unanswered questions such as the statistical significance of specific nonvertical events. In this paper, we begin to correct the shortcomings of both approaches by introducing the “stochastic model for reassortment and transfer events” (SMARTIE) drawing upon ancestral recombination graphs (ARGs). ARGs are directed graphs that allow for formal probabilistic inference on vertical speciation events and nonvertical evolutionary events. We apply SMARTIE to phylogenetic data. Because of this, we can typically infer a single most probable ARG, avoiding coarse population dynamic summary statistics. In addition, a focus on phylogenetic data suggests novel probability distributions on ARGs. To make inference with our model, we develop a reversible jump Markov chain Monte Carlo sampler to approximate the posterior distribution of SMARTIE. Using the BEAST phylogenetic software as a foundation, the sampler employs a parallel computing approach that allows for inference on large-scale data sets. To demonstrate SMARTIE, we explore 2 separate phylogenetic applications, one involving pathogenic Leptospirochete and the other Saccharomyces.
Ancestral recombination graph; Bayesian; horizontal gene transfer; phylogenetic network; reassortment; species tree
Understanding the role of humans in the dispersal of predominately animal pathogens is essential for their control. We used newly developed Bayesian phylogeographic methods to unravel the dynamics and determinants of the spread of dog rabies virus (RABV) in North Africa. Each of the countries studied exhibited largely disconnected spatial dynamics with major geo-political boundaries acting as barriers to gene flow. Road distances proved to be better predictors of the movement of dog RABV than accessibility or raw geographical distance, with occasional long distance and rapid spread within each of these countries. Using simulations that bridge phylodynamics and spatial epidemiology, we demonstrate that the contemporary viral distribution extends beyond that expected for RABV transmission in African dog populations. These results are strongly supportive of human-mediated dispersal, and demonstrate how an integrated phylogeographic approach will turn viral genetic data into a powerful asset for characterizing, predicting, and potentially controlling the spatial spread of pathogens.
At least 15 million doses of anti-rabies post-exposure prophylaxis are administered annually worldwide, and an estimated 55,000 people die of rabies every year. Over 99% of these deaths occur in developing countries, predominantly in Asia and in Africa where rabies is endemic in domestic dogs. Despite the global health burden due to rabies, little is known about the patterns of the spread of dog rabies in these endemic regions. We used recently developed Bayesian analytical methods to unravel the dynamics and determinants of the spatial diffusion of dog rabies viruses in North Africa based on viral genetic data. Our analysis reveals a combination of restricted spread across administrative borders, the occasional long-distance movement of rabies viruses, and a strong fit between spatial spread of the virus and road distances between localities. Together, these data indicate that by transporting dogs, humans have played a key role in the dispersal of a major animal pathogen. Our studies therefore provide essential new information on the transmission dynamics of rabies in Africa, and in doing so will greatly assist in future intervention strategies.
Motivation: Statistical analyses of phylogenetic data culminate in uncertain estimates of underlying model parameters. Lack of additional data hinders the ability to reduce this uncertainty, as the original phylogenetic dataset is often complete, containing the entire gene or genome information available for the given set of taxa. Informative priors in a Bayesian analysis can reduce posterior uncertainty; however, publicly available phylogenetic software specifies vague priors for model parameters by default. We build objective and informative priors using hierarchical random effect models that combine additional datasets whose parameters are not of direct interest but are similar to the analysis of interest.
Results: We propose principled statistical methods that permit more precise parameter estimates in phylogenetic analyses by creating informative priors for parameters of interest. Using additional sequence datasets from our lab or public databases, we construct a fully Bayesian semiparametric hierarchical model to combine datasets. A dynamic iteratively reweighted Markov chain Monte Carlo algorithm conveniently recycles posterior samples from the individual analyses. We demonstrate the value of our approach by examining the insertion–deletion (indel) process in the enolase gene across the Tree of Life using the phylogenetic software BALI-PHY; we incorporate prior information about indels from 82 curated alignments downloaded from the BAliBASE database.
Supplementary information: Supplementary data are available at Bioinformatics online.