The annals of applied statistics  2015;9(2):969-991.
Understanding which phenotypic traits are consistently correlated throughout evolution is a highly pertinent problem in modern evolutionary biology. Here, we propose a multivariate phylogenetic latent liability model for assessing the correlation between multiple types of data, while simultaneously controlling for their unknown shared evolutionary history informed through molecular sequences. The latent formulation enables us to consider in a single model combinations of continuous traits, discrete binary traits, and discrete traits with multiple ordered and unordered states. Previous approaches have entertained a single data type generally along a fixed history, precluding estimation of correlation between traits and ignoring uncertainty in the history. We implement our model in a Bayesian phylogenetic framework, and discuss inference techniques for hypothesis testing. Finally, we showcase the method through applications to columbine flower morphology, antibiotic resistance in Salmonella, and epitope evolution in influenza.
PMCID: PMC4820077  PMID: 27053974
Bayesian phylogenetics; Threshold model; Evolution; Genotype-phenotype correlation
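As a rough illustration of the latent liability idea in the abstract above, the sketch below maps an unobserved continuous liability to an ordered discrete state through cut points. The function name `liability_to_state` and the threshold values are hypothetical choices made for illustration, not taken from the paper. The sketch leaves out the phylogenetic diffusion of the liabilities and the unordered-trait case.

```python
from bisect import bisect_right

def liability_to_state(liability, thresholds=(0.0,)):
    """Map a latent continuous liability to an ordered discrete state.

    `thresholds` must be sorted in increasing order; k thresholds give
    k + 1 states labelled 0..k. A single threshold at 0 yields a binary trait.
    (Hypothetical helper for illustration only.)
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be sorted in increasing order")
    # Count how many thresholds lie at or below the liability.
    return bisect_right(thresholds, liability)

# Binary trait: one threshold at zero (an assumed convention).
binary_states = [liability_to_state(x) for x in (-1.2, 0.4)]

# Ordered three-state trait: two assumed cut points at 0.0 and 1.0.
ordered_states = [liability_to_state(x, (0.0, 1.0)) for x in (-0.5, 0.5, 1.5)]
```

Under this convention a continuous trait is simply its own liability, which is one way to see how mixed data types can share a single latent scale.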
2.  Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers 
The vision of creating accessible, reliable clinical evidence by accessing the clinical experience of hundreds of millions of patients across the globe is a reality. Observational Health Data Sciences and Informatics (OHDSI) has built on learnings from the Observational Medical Outcomes Partnership to turn methods research and insights into a suite of applications and exploration tools that move the field closer to the ultimate goal of generating evidence about all aspects of healthcare to serve the needs of patients, clinicians and all other decision-makers around the world.
PMCID: PMC4815923  PMID: 26262116
Health Services Research; Databases; Observation
3.  Positive Selection in CD8+ T-Cell Epitopes of Influenza Virus Nucleoprotein Revealed by a Comparative Analysis of Human and Swine Viral Lineages 
Journal of Virology  2015;89(22):11275-11283.
Numerous experimental studies have demonstrated that CD8+ T cells contribute to immunity against influenza by limiting viral replication. It is therefore surprising that rigorous statistical tests have failed to find evidence of positive selection in the epitopes targeted by CD8+ T cells. Here we use a novel computational approach to test for selection in CD8+ T-cell epitopes. We define all epitopes in the nucleoprotein (NP) and matrix protein (M1) with experimentally identified human CD8+ T-cell responses and then compare the evolution of these epitopes in parallel lineages of human and swine influenza viruses that have been diverging since roughly 1918. We find a significant enrichment of substitutions that alter human CD8+ T-cell epitopes in NP of human versus swine influenza virus, consistent with the idea that these epitopes are under positive selection. Furthermore, we show that epitope-altering substitutions in human influenza virus NP are enriched on the trunk versus the branches of the phylogenetic tree, indicating that viruses that acquire these mutations have a selective advantage. However, even in human influenza virus NP, sites in T-cell epitopes evolve more slowly than do nonepitope sites, presumably because these epitopes are under stronger inherent functional constraint. Overall, our work demonstrates that there is clear selection from CD8+ T cells in human influenza virus NP and illustrates how comparative analyses of viral lineages from different hosts can identify positive selection that is otherwise obscured by strong functional constraint.
IMPORTANCE There is a strong interest in correlates of anti-influenza immunity that are protective against diverse virus strains. CD8+ T cells provide such broad immunity, since they target conserved viral proteins. An important question is whether T-cell immunity is sufficiently strong to drive influenza virus evolution. Although many studies have shown that T cells limit viral replication in animal models and are associated with decreased symptoms in humans, no studies have proven with statistical significance that influenza virus evolves under positive selection to escape T cells. Here we use comparisons of human and swine influenza viruses to rigorously demonstrate that human influenza virus evolves under pressure to fix mutations in the nucleoprotein that promote escape from T cells. We further show that viruses with these mutations have a selective advantage since they are preferentially located on the “trunk” of the phylogenetic tree. Overall, our results show that CD8+ T cells targeting nucleoprotein play an important role in shaping influenza virus evolution.
PMCID: PMC4645657  PMID: 26311880
4.  Simultaneously estimating evolutionary history and repeated traits phylogenetic signal: applications to viral and host phenotypic evolution 
Phylogenetic signal quantifies the degree to which resemblance in continuously-valued traits reflects phylogenetic relatedness. Measures of phylogenetic signal are widely used in ecological and evolutionary research, and have recently been gaining traction in viral evolutionary studies. Standard estimators of phylogenetic signal frequently condition on data summary statistics of the repeated trait observations and fixed phylogenetic trees, resulting in information loss and potential bias.
To incorporate the observation process and phylogenetic uncertainty in a model-based approach, we develop a novel Bayesian inference method to simultaneously estimate the evolutionary history and phylogenetic signal from molecular sequence data and repeated multivariate traits. Our approach builds upon a phylogenetic diffusion framework that models continuous trait evolution as a Brownian motion process and incorporates Pagel’s λ transformation parameter to estimate dependence among traits. We provide a computationally efficient inference implementation in the BEAST software package.
We evaluate the performance of the Bayesian estimator of phylogenetic signal against standard estimators on synthetic data, and demonstrate the use of our coherent framework to address several virus-host evolutionary questions, including virulence heritability for HIV, antigenic evolution in influenza and HIV, and Drosophila sensitivity to sigma virus infection. Finally, we discuss model extensions that will make useful contributions to our flexible framework for simultaneously studying sequence and trait evolution.
PMCID: PMC4358766  PMID: 25780554
comparative approach; Bayesian phylogenetics; virus evolution; adaptation; virulence
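The λ transformation mentioned in the abstract above is commonly defined as a rescaling of the off-diagonal entries of the tree-induced trait covariance matrix under Brownian motion. The sketch below shows that standard definition on a hypothetical three-taxon tree. The function name, the example tree, and the restriction of λ to [0, 1] are assumptions for illustration; this is not the BEAST implementation.

```python
def pagel_lambda_transform(cov, lam):
    """Scale the off-diagonal entries of a phylogenetic covariance matrix by lam.

    Under Brownian motion, cov[i][j] is the branch length shared by tips i and j
    (root to their most recent common ancestor) and cov[i][i] is the root-to-tip
    height. lam = 1 leaves the matrix unchanged; lam = 0 removes all covariance
    between tips, i.e. no phylogenetic signal.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("this sketch assumes lam lies in [0, 1]")
    n = len(cov)
    return [
        [cov[i][j] if i == j else lam * cov[i][j] for j in range(n)]
        for i in range(n)
    ]

# Hypothetical ultrametric tree ((A:1, B:1):1, C:2): A and B share one unit
# of branch length, and C shares none with either.
tree_cov = [
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 2.0],
]

half_signal = pagel_lambda_transform(tree_cov, 0.5)
no_signal = pagel_lambda_transform(tree_cov, 0.0)
```

Because only the off-diagonal entries change, the marginal variance of each tip is preserved while the similarity attributed to shared ancestry shrinks as λ decreases.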
5.  Empirical calibrated radiocarbon sampler: a tool for incorporating radiocarbon-date and calibration error into Bayesian phylogenetic analyses of ancient DNA 
Molecular ecology resources  2014;15(1):81-86.
Studies of DNA from ancient samples provide a valuable opportunity to gain insight into past evolutionary and demographic processes. Bayesian phylogenetic methods can estimate evolutionary rates and timescales from ancient DNA sequences, with the ages of the samples acting as calibrations for the molecular clock. Sample ages are often estimated using radiocarbon dating, but the associated measurement error is rarely taken into account. In addition, the total uncertainty quantified by converting radiocarbon dates to calendar dates is typically ignored. Here we present a tool for incorporating both of these sources of uncertainty into Bayesian phylogenetic analyses of ancient DNA. This empirical calibrated radiocarbon sampler (ECRS) integrates the age uncertainty for each ancient sequence over the calibrated probability density function estimated for its radiocarbon date and associated error. We use the ECRS to analyse three ancient DNA data sets. Accounting for radiocarbon-dating and calibration error appeared to have little impact on estimates of evolutionary rates and related parameters for these data sets. However, analyses of other data sets, particularly those with few or only very old radiocarbon dates, might be more sensitive to using artificially precise sample ages and should benefit from use of the ECRS.
PMCID: PMC4270920  PMID: 24964386
age estimation error; ancient DNA; phylogenetic dating; BEAST
6.  Fitting Birth-Death Processes to Panel Data with Applications to Bacterial DNA Fingerprinting 
The annals of applied statistics  2013;7(4):2315-2335.
Continuous-time linear birth–death-immigration (BDI) processes are frequently used in ecology and epidemiology to model stochastic dynamics of the population of interest. In clinical settings, multiple birth–death processes can describe disease trajectories of individual patients, allowing for estimation of the effects of individual covariates on the birth and death rates of the process. Such estimation is usually accomplished by analyzing patient data collected at unevenly spaced time points, referred to as panel data in the biostatistics literature. Fitting linear BDI processes to panel data is a nontrivial optimization problem because birth and death rates can be functions of many parameters related to the covariates of interest. We propose a novel expectation–maximization (EM) algorithm for fitting linear BDI models with covariates to panel data. We derive a closed-form expression for the joint generating function of some of the BDI process statistics and use this generating function to reduce the E-step of the EM algorithm, as well as calculation of the Fisher information, to one-dimensional integration. This analytical technique yields a computationally efficient and robust optimization algorithm that we implemented in an open-source R package. We apply our method to DNA fingerprinting of Mycobacterium tuberculosis, the causative agent of tuberculosis, to study intrapatient time evolution of IS6110 copy number, a genetic marker frequently used during estimation of epidemiological clusters of Mycobacterium tuberculosis infections. Our analysis reveals previously undocumented differences in IS6110 birth–death rates among three major lineages of Mycobacterium tuberculosis, which has important implications for epidemiologists who use IS6110 for DNA fingerprinting of Mycobacterium tuberculosis.
PMCID: PMC4685745  PMID: 26702330
Missing data; EM algorithm; transposable element; IS6110; tuberculosis
7.  Reuse, Recycle, Reweigh: Combating Influenza through Efficient Sequential Bayesian Computation for Massive Data 
The annals of applied statistics  2010;4(4):1722-1748.
Massive datasets in the gigabyte and terabyte range combined with the availability of increasingly sophisticated statistical tools yield analyses at the boundary of what is computationally feasible. Compromising in the face of this computational burden by partitioning the dataset into more tractable sizes results in stratified analyses, removed from the context that justified the initial data collection. In a Bayesian framework, these stratified analyses generate intermediate realizations, often compared using point estimates that fail to account for the variability within and correlation between the distributions these realizations approximate. However, although the initial concession to stratify generally precludes the more sensible analysis using a single joint hierarchical model, we can circumvent this outcome and capitalize on the intermediate realizations by extending the dynamic iterative reweighting MCMC algorithm. In doing so, we reuse the available realizations by reweighting them with importance weights, recycling them into a now tractable joint hierarchical model. We apply this technique to intermediate realizations generated from stratified analyses of 687 influenza A genomes spanning 13 years allowing us to revisit hypotheses regarding the evolutionary history of influenza within a hierarchical statistical framework.
PMCID: PMC4679157  PMID: 26681992
Gibbs variable selection; hierarchical Bayesian model; importance sampling; influenza A; Markov chain Monte Carlo; massive data
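The reweighting step described in the abstract above rests on importance sampling: draws obtained under one distribution are reused for another by attaching weights proportional to the ratio of the two densities. The sketch below shows plain self-normalized importance weights only. The function names and the toy densities are hypothetical, and the paper's dynamic iterative reweighting MCMC algorithm is considerably more elaborate than this.

```python
import math

def importance_weights(log_target, log_proposal, draws):
    """Return self-normalized importance weights for `draws`.

    `log_target` and `log_proposal` are log densities known up to additive
    constants; `draws` are assumed to come from the proposal distribution.
    """
    log_w = [log_target(x) - log_proposal(x) for x in draws]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    return [wi / total for wi in w]

def reweighted_mean(draws, weights):
    """Weighted average of the draws, an estimate of the target mean."""
    return sum(x * w for x, w in zip(draws, weights))

# Toy example: reuse draws made under a flat log density for a target
# whose log density is -x (both unnormalized, both purely illustrative).
draws = [0.0, 1.0, 2.0]
weights = importance_weights(lambda x: -x, lambda x: 0.0, draws)
estimate = reweighted_mean(draws, weights)
```

Working on the log scale and subtracting the maximum log weight avoids overflow when the two densities differ by many orders of magnitude, which is the usual situation with large datasets.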
8.  Wagner and Dollo: A Stochastic Duet by Composing Two Parsimonious Solos 
Systematic biology  2008;57(5):772-784.
New contributions toward generalizing evolutionary models expand greatly our ability to analyze complex evolutionary characters and advance phylogeny reconstruction. In this article, we extend the binary stochastic Dollo model to allow for multi-state characters. In doing so, we align previously incompatible Wagner and Dollo parsimony principles under a common probabilistic framework by embedding arbitrary continuous-time Markov chains into the binary stochastic Dollo model. This approach enables us to analyze character traits that exhibit both Dollo and Wagner characteristics throughout their evolutionary histories. Utilizing Bayesian inference, we apply our novel model to analyze intron conservation patterns and the evolution of alternatively spliced exons. The generalized framework we develop demonstrates potential in distinguishing between phylogenetic hypotheses and providing robust estimates of evolutionary rates. Moreover, for the two applications analyzed here, our framework is the first to provide an adequate stochastic process for the data. We discuss possible extensions to the framework from both theoretical and applied perspectives.
PMCID: PMC4677801  PMID: 18853363
Alternative splicing evolution; Bayesian phylogenetic inference; immigration-mutation-death process; intron conservation; stochastic Dollo
9.  Phylodynamics of the HIV-1 CRF02_AG clade in Cameroon 
Evolutionary analyses have revealed an origin of pandemic HIV-1 group M in the Congo River basin in the first part of the twentieth century, but the patterns of historical viral spread in or around its epicentre remain largely unexplored. Here, we combine epidemiologic and molecular sequence data to investigate the spatiotemporal patterns of the CRF02_AG clade. By explicitly integrating prevalence counts and genetic population size estimates, we date the epidemic emergence of CRF02_AG at 1973.1 (95% CI: 1972.1, 1975.3). To infer its phylogeographic signature at a regional scale, we analyze pol and env time-stamped sequence data from 8 countries using a Bayesian phylogeographic approach based on a discrete asymmetric model. Our data confirm a spatial origin of this clade in the Democratic Republic of Congo (DRC) and suggest that viral dissemination to Cameroon occurred at an early stage of the evolutionary history of CRF02_AG. We find considerable support for epidemiological linkage between neighbouring countries. Compilation of ethnographic data suggests that well-supported viral migration was related to chance exportation events rather than to sustained human migratory flows. Finally, using sequence data from 15 locations in Cameroon, we use relaxed random walk models to explore the spatiotemporal dynamics of CRF02_AG at a finer geographical detail. Phylogeographic dispersal in continuous space reveals that at least two distinct CRF02_AG lineages are circulating in overlapping regions and are evolving at different evolutionary and diffusion rates. Altogether, by combining molecular and epidemiological data, our results provide a time scale for CRF02_AG, place its spatial root within the putative root of group-M diversity and propose a scenario for the spatiotemporal patterns of a successful HIV-1 lineage at both regional and country scales.
PMCID: PMC4677783  PMID: 21565285
10.  Ancient Hybridization and an Irish Origin for the Modern Polar Bear Matriline 
Current biology : CB  2011;21(15):1251-1258.
Polar bears (Ursus maritimus) are among those species most susceptible to the rapidly changing arctic climate, and their survival is of global concern. Despite this, little is known about polar bear species history. Future conservation strategies would significantly benefit from an understanding of basic evolutionary information, such as the timing and conditions of their initial divergence from brown bears (U. arctos) or their response to previous environmental change.
We used a spatially explicit phylogeographic model to estimate the dynamics of 242 brown bear and polar bear matrilines sampled throughout the last 120,000 years and across their present and past geographic ranges. Our results show that the present distribution of these matrilines was shaped by a combination of regional stability and rapid, long-distance dispersal from ice-age refugia. In addition, hybridization between polar bears and brown bears may have occurred multiple times throughout the Late Pleistocene.
The reconstructed matrilineal history of brown and polar bears has two striking features. First, it is punctuated by dramatic and discrete climate-driven dispersal events. Second, opportunistic mating between these two species as their ranges overlapped has left a strong genetic imprint. In particular, a likely genetic exchange with extinct Irish brown bears forms the origin of the modern polar bear matriline. This suggests that interspecific hybridization not only may be more common than previously considered but may be a mechanism by which species deal with marginal habitats during periods of environmental deterioration.
PMCID: PMC4677796  PMID: 21737280
11.  Spatiotemporal dynamics of simian immunodeficiency virus brain infection in CD8+ lymphocyte-depleted rhesus macaques with neuroAIDS 
The Journal of General Virology  2014;95(Pt 12):2784-2795.
Despite the success of combined antiretroviral therapy in controlling viral replication in human immunodeficiency virus (HIV)-infected individuals, HIV-associated neurocognitive disorders, commonly referred to as neuroAIDS, remain a frequent and poorly understood complication. Infection of CD8+ lymphocyte-depleted rhesus macaques with the SIVmac251 viral swarm is a well-established rapid disease model of neuroAIDS that has provided critical insight into HIV-1-associated neurocognitive disorder onset and progression. However, no studies so far have characterized in depth the relationship between intra-host viral evolution and pathogenesis in this model. Simian immunodeficiency virus (SIV) env gp120 sequences were obtained from six infected animals. Sequences were sampled longitudinally from several lymphoid and non-lymphoid tissues, including individual lobes within the brain at necropsy, for four macaques; two animals were sacrificed at 21 days post-infection (p.i.) to evaluate early viral seeding of the brain. Bayesian phylodynamic and phylogeographic analyses of the sequence data were used to ascertain viral population dynamics and gene flow between peripheral and brain tissues, respectively. A steady increase in viral effective population size, with a peak occurring at ~50–80 days p.i., was observed across all longitudinally monitored macaques. Phylogeographic analysis indicated continual viral seeding of the brain from several peripheral tissues throughout infection, with the last migration event before terminal illness occurring in all macaques from cells within the bone marrow. The results strongly supported the role of infected bone marrow cells in HIV/SIV neuropathogenesis. In addition, our work demonstrated the applicability of Bayesian phylogeography to intra-host studies in order to assess the interplay between viral evolution and pathogenesis.
PMCID: PMC4233634  PMID: 25205684
The annals of applied statistics  2015;9(2):572-596.
Surveys often ask respondents to report non-negative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. This phenomenon is called heaping, and the error inherent in heaped self-reported numbers can bias estimation. Heaped data may be collected cross-sectionally or longitudinally and there may be covariates that complicate the inferential task. Heaping is a well-known issue in many survey settings, and inference for heaped data is an important statistical problem. We propose a novel reporting distribution whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth.
PMCID: PMC4617556  PMID: 26500711
Bayesian hierarchical model; Coarse data; Continuous-time Markov chain; Heaping; Mixture model; Rounding
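To make the heaping phenomenon in the abstract above concrete, the sketch below simulates respondents who, with some probability, report the nearest multiple of 5 instead of their true count. This is a deliberately simplified stand-in, not the reporting distribution proposed in the paper; the function name, the rounding probability, and the range of true counts are all assumptions for illustration.

```python
import random

def report_count(true_count, round_prob, grid=5, rng=random):
    """Return a self-reported count for a non-negative integer true count.

    With probability `round_prob` the respondent rounds to the nearest
    multiple of `grid`; otherwise the true count is reported unchanged.
    """
    if rng.random() < round_prob:
        # Integer arithmetic for nearest-multiple rounding of non-negative counts.
        return grid * ((true_count + grid // 2) // grid)
    return true_count

# Simulated survey: 1000 hypothetical true counts between 0 and 30,
# each rounded with an assumed probability of 0.4.
rng = random.Random(2015)
true_counts = [rng.randint(0, 30) for _ in range(1000)]
reported = [report_count(c, 0.4, rng=rng) for c in true_counts]

# Heaping shows up as an excess of reports on the grid relative to the truth.
share_true_on_grid = sum(c % 5 == 0 for c in true_counts) / len(true_counts)
share_reported_on_grid = sum(r % 5 == 0 for r in reported) / len(reported)
```

Comparing the two shares shows how rounding distorts the observed distribution even when every respondent knows the true count, which is why inference on heaped data needs an explicit model of the reporting process.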
13.  The early spread and epidemic ignition of HIV-1 in human populations 
Science (New York, N.Y.)  2014;346(6205):56-61.
Thirty years after the discovery of HIV-1, the early transmission, dissemination, and establishment of the virus in human populations remain unclear. Using statistical approaches applied to HIV-1 sequence data from central Africa, we show that from the 1920s Kinshasa (in what is now the Democratic Republic of Congo) was the focus of early transmission and the source of pre-1960 pandemic viruses elsewhere. Location and dating estimates were validated using the earliest HIV-1 archival sample, also from Kinshasa. The epidemic histories of HIV-1 group M and nonpandemic group O were similar until ~1960, after which group M underwent an epidemiological transition and outpaced regional population growth. Our results reconstruct the early dynamics of HIV-1 and emphasize the role of social changes and transport networks in the establishment of this virus in human populations.
PMCID: PMC4254776  PMID: 25278604
14.  Global migration of influenza A viruses in swine 
Nature communications  2015;6:6696.
The complex and unresolved evolutionary origins of the 2009 H1N1 influenza pandemic exposed major gaps in our knowledge of the global spatial ecology and evolution of influenza A viruses in swine (swIAVs). Here we undertake an expansive phylogenetic analysis of swIAV sequence data and demonstrate that the global live swine trade strongly predicts the spatial dissemination of swIAVs, with Europe and North America acting as sources of viruses in Asian countries. In contrast, China has the world’s largest swine population but is not a major exporter of live swine, and is not an important source of swIAVs in neighboring Asian countries or globally. A meta-population simulation model incorporating trade data predicts that the global ecology of swIAVs is more complex than previously thought, and the US and China’s large swine populations are unlikely to be representative of swIAV diversity in their respective geographic regions, requiring independent surveillance efforts throughout Latin America and Asia.
PMCID: PMC4380236  PMID: 25813399
15.  Modeling Protein Expression and Protein Signaling Pathways 
High-throughput functional proteomic technologies provide a way to quantify the expression of proteins of interest. Statistical inference centers on identifying the activation state of proteins and their patterns of molecular interaction formalized as dependence structure. Inference on dependence structure is particularly important when proteins are selected because they are part of a common molecular pathway. In that case, inference on dependence structure reveals properties of the underlying pathway. We propose a probability model that represents molecular interactions at the level of hidden binary latent variables that can be interpreted as indicators for active versus inactive states of the proteins. The proposed approach exploits available expert knowledge about the target pathway to define an informative prior on the hidden conditional dependence structure. An important feature of this prior is that it provides an instrument to explicitly anchor the model space to a set of interactions of interest, favoring a local search approach to model determination. We apply our model to reverse-phase protein array data from a study on acute myeloid leukemia. Our inference identifies relevant subpathways in relation to the unfolding of the biological process under study.
PMCID: PMC4523312  PMID: 26246646
AML; Graphical models; Mixture models; POE; RJ-MCMC; RPPA
16.  Combining Phylogeography and Spatial Epidemiology to Uncover Predictors of H5N1 Diffusion 
Archives of virology  2014;160(1):215-224.
Emerging and re-emerging infectious diseases of zoonotic origin like highly pathogenic avian influenza pose a significant threat to human and animal health due to their elevated transmissibility. Identifying the drivers of such viruses is challenging and complicates the estimation of spatial diffusion because the variability of viral spread from locations could be caused by a complex array of unknown factors. Several techniques exist to help identify these drivers, including bioinformatics, phylogeography, and spatial epidemiology, but these methods are generally evaluated separately, without considering how they complement one another. Here we studied an approach that integrates these techniques and identifies the most important drivers of viral spread by focusing on H5N1 in Egypt because of its recent emergence as an epicenter for the disease. We used a Bayesian phylogeographic generalized linear model (GLM) to reconstruct spatiotemporal patterns of viral diffusion while simultaneously assessing the impact of factors contributing to transmission. We also calculated the cross-species transmission rates among hosts in order to identify the species driving transmission. Densities of both human and avian species were supported contributors along with latitude, longitude, elevation, and several meteorological variables. Also supported was the presence of a genetic motif found near the hemagglutinin cleavage site. Various genetic, geographic, demographic, and environmental predictors each play a role in H5N1 diffusion. Further development and expansion of phylogeographic GLMs such as this will enable health agencies to identify variables that can curb virus diffusion and reduce morbidity and mortality.
PMCID: PMC4398335  PMID: 25355432
17.  High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis 
Biostatistics (Oxford, England)  2013;15(2):207-221.
Survival analysis endures as an old, yet active research field with applications that spread across many domains. Continuing improvements in data acquisition techniques pose constant challenges in applying existing survival analysis methods to these emerging data sets. In this paper, we present tools for fitting regularized Cox survival analysis models on high-dimensional, massive sample-size (HDMSS) data using a variant of the cyclic coordinate descent optimization technique tailored for the sparsity that HDMSS data often present. Experiments on two real data examples demonstrate that efficient analyses of HDMSS data using these tools result in improved predictive performance and calibration.
PMCID: PMC3944969  PMID: 24096388
Big data; Cox proportional hazards; Regularized regression; Survival analysis
18.  Estimation for general birth-death processes 
Birth-death processes (BDPs) are continuous-time Markov chains that track the number of “particles” in a system over time. While widely used in population biology, genetics and ecology, statistical inference of the instantaneous particle birth and death rates remains largely limited to restrictive linear BDPs in which per-particle birth and death rates are constant. Researchers often observe the number of particles at discrete times, necessitating data augmentation procedures such as expectation-maximization (EM) to find maximum likelihood estimates. For BDPs on finite state-spaces, there are powerful matrix methods for computing the conditional expectations needed for the E-step of the EM algorithm. For BDPs on infinite state-spaces, closed-form solutions for the E-step are available for some linear models, but most previous work has resorted to time-consuming simulation. Remarkably, we show that the E-step conditional expectations can be expressed as convolutions of computable transition probabilities for any general BDP with arbitrary rates. This important observation, along with a convenient continued fraction representation of the Laplace transforms of the transition probabilities, allows for novel and efficient computation of the conditional expectations for all BDPs, eliminating the need for truncation of the state-space or costly simulation. We use this insight to derive EM algorithms that yield maximum likelihood estimation for general BDPs characterized by various rate models, including generalized linear models. We show that our Laplace convolution technique outperforms competing methods when they are available and demonstrate a technique to accelerate EM algorithm convergence. We validate our approach using synthetic data and then apply our methods to cancer cell growth and estimation of mutation parameters in microsatellite evolution.
PMCID: PMC4196218  PMID: 25328261
Birth-death process; EM algorithm; MM algorithm; maximum likelihood estimation; continuous-time Markov chain; microsatellite evolution
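For readers unfamiliar with general birth-death processes, the sketch below simulates one with arbitrary state-dependent rates using the standard Gillespie (exact stochastic simulation) scheme. It only illustrates the process definition from the abstract above; the paper's contribution is precisely to avoid this kind of simulation through Laplace-transform methods, which are not shown here. The function name and the linear rates in the usage example are assumptions.

```python
import random

def simulate_bdp(n0, birth_rate, death_rate, t_end, rng):
    """Simulate a general birth-death process and return its state at t_end.

    `birth_rate(n)` and `death_rate(n)` give the total instantaneous rates
    of moving from n to n + 1 and from n to n - 1, respectively.
    """
    t, n = 0.0, n0
    while True:
        lam = birth_rate(n)
        mu = death_rate(n) if n > 0 else 0.0  # no deaths from an empty system
        total = lam + mu
        if total <= 0.0:
            return n  # absorbing state: nothing can happen any more
        t += rng.expovariate(total)  # exponential waiting time to the next event
        if t > t_end:
            return n
        # The next event is a birth with probability lam / total.
        n += 1 if rng.random() < lam / total else -1

# Usage with assumed linear rates: per-particle birth 0.5 and death 0.3.
rng = random.Random(1)
final_states = [
    simulate_bdp(10, lambda n: 0.5 * n, lambda n: 0.3 * n, 1.0, rng)
    for _ in range(5000)
]
mean_final = sum(final_states) / len(final_states)
```

For the linear case the expected state at time t is n0 * exp((0.5 - 0.3) * t), which gives a simple check on the simulated mean; nonlinear rate functions can be passed in the same way.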
19.  Analysis of Viral Genetics for Estimating Diffusion of Influenza A H6N1 
H6N1 influenza A is an avian virus, but in 2013 it infected a human in Taiwan. We studied the phylogeography of avian origin H6N1 viruses in the Influenza Research Database and the Global Initiative on Sharing Avian Influenza Data EpiFlu Database in order to characterize their recent evolutionary spread. Our results suggest that the H6N1 virus that infected a human in Taiwan is derived from a diversity of avian strains of H6N1 that have circulated for at least seven years in this region. Understanding how geography impacts the evolution of avian influenza could allow disease control efforts to focus on areas that pose the greatest risk to humans. The serious human infection with a known avian influenza virus underscores the zoonotic potential of diverse avian strains of influenza, the need for comprehensive influenza surveillance in animals, and the value of public sequence databases including GISAID and the IRD.
PMCID: PMC4525270  PMID: 26306229
20.  On the Biogeography of Centipeda: A Species-Tree Diffusion Approach 
Systematic Biology  2014;63(2):178-191.
Reconstructing the biogeographic history of groups present in continuous arid landscapes is challenging due to the difficulties in defining discrete areas for analyses, and even more so when species largely overlap both in terms of geography and habitat preference. In this study, we use a novel approach to estimate ancestral areas for the small plant genus Centipeda. We apply continuous diffusion of geography by a relaxed random walk where each species is sampled from its extant distribution on an empirical distribution of time-calibrated species-trees. Using a distribution of previously published substitution rates of the internal transcribed spacer (ITS) for Asteraceae, we show how the evolution of Centipeda correlates with the temporal increase of aridity in the arid zone since the Pliocene. Geographic estimates of ancestral species show a consistent pattern of speciation of early lineages in the Lake Eyre region, with a division in more northerly and southerly groups since ∼840 ka. Summarizing the geographic slices of species-trees at the time of the latest speciation event (∼20 ka) indicates no presence of the genus in Australia west of the combined desert belt of the Nullarbor Plain, the Great Victoria Desert, the Gibson Desert, and the Great Sandy Desert, or beyond the main continental shelf of Australia. The result indicates all western occurrences of the genus to be a result of recent dispersal rather than ancient vicariance. This study contributes to our understanding of the spatiotemporal processes shaping the flora of the arid zone, and offers a significant improvement in inference of ancestral areas for any organismal group distributed where it remains difficult to describe geography in terms of discrete areas.
PMCID: PMC3926304  PMID: 24335493
Australia; BEAST; biogeography; Centipeda; continuous diffusion; Pliocene; species-tree
22.  Massive parallelization of serial inference algorithms for a complex generalized linear model 
Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units (relatively inexpensive, highly parallel computing devices), can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety.
PMCID: PMC4201181  PMID: 25328363
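As a rough illustration of the cyclic coordinate descent idea named in the abstract above, the following sketch fits a small logistic regression with a normal prior on each coefficient, updating one coefficient at a time with a clamped one-dimensional Newton step. It is not the authors' implementation: the plain (unconditioned) logistic model, the function names `fit_ccd` and `gradient`, the prior variance, the step clamp, and the toy data are all assumptions made for illustration, and the code is serial rather than parallel.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradient(X, y, beta, prior_var):
    # Gradient of the negative log posterior (logistic likelihood, normal prior).
    grad = [b / prior_var for b in beta]
    for row, label in zip(X, y):
        eta = sum(x * b for x, b in zip(row, beta))
        resid = sigmoid(eta) - label
        for j, x in enumerate(row):
            grad[j] += x * resid
    return grad


def fit_ccd(X, y, prior_var=10.0, max_sweeps=500, tol=1e-10, max_step=1.0):
    # Hypothetical helper: cycle through the coefficients, updating each in turn.
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    eta = [0.0] * n  # cached linear predictors, kept in sync with beta
    for _ in range(max_sweeps):
        largest = 0.0
        for j in range(p):
            g = beta[j] / prior_var  # one-dimensional gradient
            h = 1.0 / prior_var      # one-dimensional curvature
            for i in range(n):
                mu = sigmoid(eta[i])
                g += X[i][j] * (mu - y[i])
                h += X[i][j] ** 2 * mu * (1.0 - mu)
            # Newton step, clamped to a fixed interval (an assumed safeguard).
            delta = max(-max_step, min(max_step, -g / h))
            beta[j] += delta
            for i in range(n):
                eta[i] += delta * X[i][j]
            largest = max(largest, abs(delta))
        if largest < tol:
            break
    return beta


# Toy data (made up): first column is an intercept, second a single predictor.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 1.5], [1.0, -0.5], [1.0, 2.0], [1.0, -2.0]]
y = [1, 0, 1, 1, 0, 0]
beta = fit_ccd(X, y)
print(beta)
```

Each coordinate update is dominated by sums over observations (the loops over `i`), which is the kind of reduction that lends itself to parallel hardware; the sketch leaves those loops serial for clarity.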
23.  Evaluating the Impact of Database Heterogeneity on Observational Study Results 
American Journal of Epidemiology  2013;178(4):645-651.
Clinical studies that use observational databases to evaluate the effects of medical products have become commonplace. Such studies begin by selecting a particular database, a decision that published papers invariably report but do not discuss. Studies of the same issue in different databases, however, can and do generate different results, sometimes with strikingly different clinical implications. In this paper, we systematically study heterogeneity among databases, holding other study methods constant, by exploring relative risk estimates for 53 drug-outcome pairs and 2 widely used study designs (cohort studies and self-controlled case series) across 10 observational databases. When holding the study design constant, our analysis shows that estimated relative risks range from a statistically significant decreased risk to a statistically significant increased risk in 11 of 53 (21%) drug-outcome pairs that use a cohort design and 19 of 53 (36%) drug-outcome pairs that use a self-controlled case series design. This exceeds the proportion of pairs that were consistent across databases in both direction and statistical significance, which was 9 of 53 (17%) for cohort studies and 5 of 53 (9%) for self-controlled case series. Our findings show that clinical studies that use observational databases can be sensitive to the choice of database. More attention is needed to consider how the choice of data source may be affecting results.
PMCID: PMC3736754  PMID: 23648805
database; heterogeneity; methods; population characteristics; reproducibility of results; surveillance
24.  Mapping the origins and expansion of the Indo-European language family 
Science (New York, N.Y.)  2012;337(6097):957-960.
There are two competing hypotheses for the origin of the Indo-European language family. The conventional view places the homeland in the Pontic steppes approximately 6 kya. An alternative hypothesis claims the languages spread from Anatolia with the expansion of farming 8–9.5 kya. Here we use Bayesian phylogeographic approaches together with basic vocabulary data from 103 ancient and contemporary Indo-European languages to explicitly model the expansion of the family and test between the homeland hypotheses. We find decisive support for an Anatolian over a steppe origin. Both the inferred timing and root location of the Indo-European language trees fit with an agricultural expansion from Anatolia beginning in the 9th millennium BP. These results highlight the critical role phylogeographic inference can play in resolving longstanding debates about human prehistory.
PMCID: PMC4112997  PMID: 22923579
25.  Species-specific responses of Late Quaternary megafauna to climate and humans 
Nature  2011;479(7373):359-364.
Despite decades of research, the roles of climate and humans in driving the dramatic extinctions of large-bodied mammals during the Late Quaternary remain contentious. We use ancient DNA, species distribution models and the human fossil record to elucidate how climate and humans shaped the demographic history of woolly rhinoceros, woolly mammoth, wild horse, reindeer, bison and musk ox. We show that climate has been a major driver of population change over the past 50,000 years. However, each species responds differently to the effects of climatic shifts, habitat redistribution and human encroachment. Although climate change alone can explain the extinction of some species, such as Eurasian musk ox and woolly rhinoceros, a combination of climatic and anthropogenic effects appears to be responsible for the extinction of others, including Eurasian steppe bison and wild horse. We find no genetic signature or any distinctive range dynamics distinguishing extinct from surviving species, underscoring the challenges associated with predicting future responses of extant mammals to climate and human-mediated habitat change.
PMCID: PMC4070744  PMID: 22048313

