|Home | About | Journals | Submit | Contact Us | Français|
The majority of human emerging infectious diseases (EIDs) are zoonotic, with viruses originating in wild mammals of particular concern (e.g. HIV, Ebola, SARS)1–3. Understanding patterns of viral diversity in wildlife and determinants of successful cross-species transmission, or spillover, are therefore key goals for pandemic surveillance programs4. However, few analytical tools exist to identify which host species likely harbor the next human virus, or which viruses can cross species boundaries5–7. Here we conduct the most comprehensive analysis yet of mammalian host-virus relationships and show that both the total number of viruses that infect a given species, and the proportion likely to be zoonotic are predictable. After controlling for research effort, the proportion of zoonotic viruses per species is predicted by phylogenetic relatedness to humans, host taxonomy, and human population within a species range – which may reflect human-wildlife contact. We demonstrate for the first time that bats harbor a significantly higher proportion of zoonotic viruses than all other mammalian orders. We identify the taxa and geographic regions with the largest estimated number of ‘missing viruses’ and ‘missing zoonoses’ and therefore of highest value for future surveillance. We then show that phylogenetic host breadth and other viral traits are significant predictors of zoonotic potential, providing a novel framework to assess if a newly discovered mammalian virus could infect people.
Viral zoonoses are a serious threat to public health and global security, and have caused the majority of recent pandemics in people4, yet our understanding of the factors driving viral diversity in mammals, viral host range, and cross-species transmission to humans remain nascent. Recent studies have described broad patterns of pathogen host range1,3 and various host or microbial factors that facilitate cross-species transmission5,7,8, or have focused on factors promoting pathogen and parasite sharing within specific mammalian taxonomic groups including primates9–11, bats12–14, and rodents12,15— but to date there has been no comprehensive, species-level analysis of viral sharing between humans and all mammals. Here we create then analyze a database of 2,805 mammal-virus associations, including 754 mammal species (14% of global mammal diversity) from 15 orders and 586 unique viral species (every recognized virus found in mammals16) from 28 viral families (Methods). We use these data to test hypotheses on the determinants of viral richness and viral sharing with humans. We fit three inter-related models to elucidate specific components of the process of zoonotic spillover (Extended Data Fig. 1). First, we identify factors that influence total viral richness (i.e. ‘baseline’ number of unique viral species found in a given host, including those which may have the potential to infect humans). Second, we identify and rank the ecological, phylogenetic, and life-history traits that make some species more likely hosts of zoonoses than others. Third, recognizing that not all mammalian viruses will have the biological capacity to infect humans, we identify and rank viral traits that increase their likelihood of being zoonotic.
In examining the raw data, we found that observed viral richness within mammals varies at a host order and viral family level, and is highest for Bunya-, Flavi-, and Arenaviruses in rodents; Flavi-, Bunya-, and Rhabdoviruses in bats; and Herpesviruses in non-human primates (Extended Data Fig. 2). Of 586 mammalian viruses in our dataset, 263 (44.8%) have been detected in humans, 75 of which are exclusively human and 188 (71.5% of human viruses) zoonotic – defined operationally here as viruses detected at least once in humans and at least once in another mammal species (Methods). The proportion of zoonotic viruses is higher for RNA (159/382, 41.6%) than DNA (29/205, 14.1%) viruses. The observed number of viruses per wild host species was comparable when averaged across orders, but bats, primates, and rodents had a higher proportion of observed zoonotic viruses compared to other groups of mammals (Fig. 1). Species in other orders (e.g. Cingulata, Pilosa, Didelphimorphia, Eulipotyphla) also shared a majority of their observed viruses with humans, but data were limited in these less diverse and poorly-studied orders. Several domestic ungulate species (orders Cetartiodactyla and Perissodactyla) are outliers for their number of observed viruses, but these species have a relatively low proportion of zoonotic viruses (Fig. 1; Supplementary Discussion).
Previous analyses show that zoonotic disease emergence events and human pathogen species richness are spatially correlated with mammal and bird diversity2,17. However, these studies weight all species equally. In reality, the risk of zoonotic viral transmission, or spillover, likely varies among host species due to differences in underlying viral richness, opportunity for contact with humans, propensity to exhibit clinical signs that exacerbate viral shedding18, other ecological, behavioral and life-history differences5,12,15, and phylogenetic distance from humans10. We hypothesize that the number of viruses a given mammal species shares with humans decreases with phylogenetic distance from humans and increases with opportunity for human contact. We used generalized additive models (GAMs) to identify and rank host-specific predictors (ecological, life history, taxonomic, and phylogenetic traits, and a control for research effort) of the number of total and zoonotic viruses in mammals (Methods; Supplementary Table 1).
The best-fit model for total viral richness per wild mammal species explained 49.2% of the total deviance, and included per species measures of disease-related research effort, phylogenetically-corrected body mass, geographic range, mammal sympatry, and taxonomy (order) (Fig. 2a–e). Not surprisingly, research effort had the strongest effect on the total number of viruses per host, explaining 31.9% of the total deviance for this model (Extended Data Table 1). The remaining 17.3% can be explained by biological factors, a value greater than or comparable to studies examining much narrower groups of mammal hosts10,12,15 (Supplementary Discussion). Mammal sympatry was the second most important predictor of total viral richness (Fig. 2d). Our model selection consistently identified mammal sympatry calculated at a ≥20% area overlap over other thresholds explored (Methods), providing insight into the minimum geographic overlap needed to facilitate viral sharing between hosts. Host geographic range was also significantly associated with increasing total viral richness, although the strength of this effect was low (Fig. 2c). Several mammalian orders, Chiroptera (bats), Rodentia (rodents), Primates, Cetartiodactyla (even-toed ungulates), and Perissodactyla (odd-toed ungulates) listed here in order of relative deviance explained, had a significantly greater viral richness than predicted (Fig. 2e). This finding highlights these taxa as important targets for global viral discovery in wildlife4, and suggests that traits not captured in our analysis (e.g. immunological function, social structure, and other life-history variables) may underlie their capacity to harbor a greater number of viral species. Our models to predict total viral richness were comparable when excluding virus-host associations detected by serology, i.e. using the ‘stringent data’, and were robust when validated with random cross-validation tests (Extended Data Table 1; Supplementary Table 2). However, we identified several regions that showed significant bias when cross validated by excluding mammals from zoogeographic areas, suggesting that there are location-specific factors that remain unexplained in our models (Methods; Supplementary Table 3).
Our best model to predict the number of zoonotic viruses per wild mammal species explained 82% of the deviance, and included phylogenetic distance from humans, a ratio of urban to rural human population across a species range, host order, whether or not a species is hunted, research effort, and total viral richness (Extended Data Table 1). A large fraction of the deviance explained is driven by the observed total viral richness per host – supporting the biological assumption that the number of viruses that infect humans scales positively with the size of the potential ‘zoonotic pool’19 in each reservoir host. Removing this contribution by including observed total viral richness per host as an offset, the model explains 33% of the total deviance in the proportion of viruses that are zoonotic (Methods), with 30% of total deviance explained by biological factors (Fig. 2f–i). Several mammalian orders had both a significant positive (bats) and negative (two ungulate orders) effect on the proportion of zoonotic viruses (Fig. 2i). A number of previous studies have proposed that bats are special among mammals in being reservoir hosts of a large number of recently emerging high profile zoonoses (e.g. SARS, Ebola virus, MERS)12,13,20. Our study, for the first time, tests this hypothesis in the context of all known mammalian viruses and hosts. While other mammalian orders have relatively high proportions of observed zoonoses and others have been poorly studied (Fig. 1a), our model results show that bats are host to a significantly higher proportion of zoonoses than other mammalian orders after controlling for reporting effort and the other predictor variables.
We found that the proportion of zoonotic viruses per species increases with host phylogenetic proximity to humans, and that this relationship is significant even when we removed ‘reverse zoonoses’ primarily associated with transmission from humans to primates (Methods). This is the first time this relationship has been demonstrated using data for all mammals and specifically as a determinant of zoonotic spillover, and is supported by previous taxon-specific studies that have examined host relatedness and parasite/pathogen sharing in primates9,10, bats14, and plants21. The proportion of zoonotic viruses shows some upward drift for mammals that are very phylogenetically distant from humans (Fig. 2g) that may represent an artifact of preferentially screening marsupials for human viruses. While primate species largely drive the phylogenetic effect, our best-fit model excluded the effect of the order Primates as a discrete variable (Fig. 2i) – suggesting that continuous variation in phylogenetic distance across primate species is more important, and is significant even when all mammals are included. This finding highlights the need to uncover the mechanism by which phylogeny affects spillover risk, e.g. evolutionarily related species sharing host cell receptors and viral binding affinities22,23 and specific viral mutations that may expand host range in related mammal species24.
We tested several measures to estimate human-wildlife contact at a global scale for the 721 wild mammals in our dataset, but only the ratio of urban to rural human population (all data model), the change in human population density, and the change in urban to rural population ratio from 1970–2005 across a species range (stringent data model) were included (Extended Data Table 1). The response curve for urban to rural population suggests that increasing urbanization raises the risk of zoonotic spillover (Fig. 2h), as does increasing human population density and the change in urban to rural population ratio over time. A single global metric of human-wildlife ecological contact did not emerge across models. However, the alternate inclusion of these related variables points to the importance of human-animal contact in defining per species spillover risk globally, and the need for controlled field experiments and human behavioral risk studies to uncover the mechanisms underlying this risk. Overall, the strength of the effect of phylogenetic distance was stronger than our proxies for animal-human contact in predicting proportion of zoonoses (30–44% stronger explanatory factor), but both remained significant after controlling for research effort (Extended Data Table 1).
The predominance of zoonoses of wildlife origin in emerging diseases has led to a series of programs to sample wildlife, discover novel viruses, and assess their zoonotic potential4,23,25–27. To inform their scale and scope we calculate the expected number of as-yet undiscovered viruses and zoonoses per host species using our best-fit GAMs and a scenario of increased research effort (Methods, Supplementary Table 4). We then project these ‘missing viruses’ and ‘missing zoonoses’ geographically (Fig. 3, Extended Data Figs. 3–8) to identify regions of the world where targeted, future surveillance to find new viruses and zoonoses will be most effective. In the process of translating our non-spatial, species-level predictions to geographic space, we identified several regions where our model predictions of the number of total and zoonotic viruses were systematically biased (hatched regions – Fig. 3 and Extended Data Figs. 3–8; Methods). Local factors contributing to this bias may include geographic variation in the detection probability of human and/or wildlife viruses, indicating areas where additional research and capacity strengthening for viral detection are most needed. Our model predictions were not systematically biased or clustered across host phylogeny (Extended Data Fig. 9).
Geographic hotspots of ‘missing zoonoses’ vary by host taxonomic order, with foci for carnivores and even-toed ungulates in Eastern and Southern Africa, bats in South and Central America and parts of Asia, primates in specific tropical regions in Central America, Africa, and Southeast Asia; and rodents in pockets of North and South America and Central Africa. Areas where ‘missing zoonoses’ predictions were systematically biased varied by taxonomic order, but included large parts of Africa for the all mammal dataset (Fig. 3a, Extended Data Figs. 3–8f). In contrast, the distribution of bias in predicting the ‘missing viruses’ for all mammals was limited to patches of northeastern Asia, Greenland, Peninsular Malaysia, and scattered grid cells in Western Asia and Patagonia (Extended Data Fig. 3c). We also identify geographic regions with large numbers of mammal species currently lacking any information regarding their viral diversity (Extended Data Figs. 3i–8i). In combination, these maps can be used for cost effective allocation of future resources for viral discovery programs, such as the recently proposed “Global Virome Project”27.
Finally, a significant challenge to preventing future disease emergence is estimating the zoonotic potential of a newly-discovered viral species or strain based on viral traits4–6,28. The best model for determining whether or not a known virus (n=586 mammalian viruses) has been observed as zoonotic explained 27.2% of total deviance and included maximum phylogenetic host breadth (PHB – a virus-specific trait that measures the phylogenetic range of known hosts, excluding humans), research effort, whether or not a virus replicates in the cytoplasm, is vector-borne, or is enveloped, and average genome length (Fig. 4). Using the ‘stringent’ dataset to define whether a virus is zoonotic resulted in a reduced model that excluded enveloped status and genome length (Extended Data Table 1). Our findings confirm a positive relationship between zoonotic potential and ability to replicate in the cytoplasm7, and that viruses with arthropod vectors may be able to infect a wider range of mammalian hosts5. Our phylogenetically-explicit measure of host breadth, PHB, can be used at various hierarchical taxonomic levels to quantify and rank viruses from specialist to generalist, and was the strongest predictor of zoonotic potential (12.4% of total deviance explained). This highlights the value of field programs to identify the natural host range of newly discovered pathogens in order to develop early proxies for their zoonotic potential4. Significant variation in PHB across viral families is suggestive of intrinsic differences in a virus’ ability to infect diverse hosts, and this relates to the proportion of observed zoonoses in each family (Fig. 4a).
We acknowledge several important caveats. First, our estimates of ‘missing viruses’ and ‘missing zoonoses’ per species are based on the current maximum observed research effort from the literature, and these estimates should be viewed as relative, not absolute. The true size of the undiscovered mammalian virome will likely increase with new genetic tools for unbiased viral discovery and in depth studies that repeatedly sample wildlife populations over time25. Second, our ecological and biological predictor variables only explain a portion of the total variation in viral richness per host and zoonotic potential based on viral traits, although this is greater than that reported in comparable order-specific studies10,12. Third, while we control for research effort we cannot account for viruses or host associations that have completely evaded human detection to date, nor those identified but not published. Additional resources to support better data-sharing and on-the-ground viral surveillance in the species and regions we identify would help validate predictive models to identify zoonotic viral hotspots, and streamline costly efforts to develop measures to prevent their future emergence.
The analyses reported herein have broad potential to assist in expediting viral discovery programs for public health. Our host-specific analyses and estimates of ‘missing zoonoses’ allow us to identify which species and regions should be preferentially targeted to characterize the global mammalian virome. Our viral trait framework then allows prioritization of newly-discovered wildlife viruses for detailed characterization (e.g. by sequencing receptor binding domains, and conducting in vitro and in vivo infection experiments23) to assess their potential to threaten human health.
To construct the mammal–virus association database we initially extracted all viruses listed as occurring in any mammal from the International Committee on Taxonomy of Viruses database (ICTVdb), and further individually went through each virus listed in the ICTV 8th edition master list and searched the literature for mammalian hosts. All viral species names were synonymized to ICTV 8th edition, which was the global authority on viral taxonomy at the start of our data collection in 201016. From 2010–15 the authors and a team of research assistants and interns at EcoHealth Alliance compiled mammal species associations for each of 586 unique viruses published in the literature between 1940–2015 initially by using the virus name and synonyms as the search keywords in the major online reference databases (Web of Science, PubMed, and Google scholar) in addition to searching in books, reviews, and literature cited in sources we had already obtained. To narrow the search for hosts for well-researched viruses, we additionally included the terms “host(s)”, “reservoir”, “wildlife”, “animals”, “surveillance”, and other relevant terms to find publications related to host range. Associations were cross-checked for completeness with the Global Mammal Parasite Database for primate, carnivore and ungulate viruses, version as of Nov 2006 (GMPD, www.mammalparasites.org)30 and other published reviews specific to bats and rodents12,31,32. We excluded all records without species-level host information, and those where we could not track down the primary references. Records of mammal-virus associations from experimental infection studies, zoological parks or captive breeding facilities, or cell culture discoveries were excluded. Host species were defined as domestic vs. wild following the list of domestic animal species from the Food and Agriculture Organization (FAO)33, and we removed the black rat (Rattus rattus) and domestic mouse (Mus musculus) from the domesticated list as these two species make up their own “peri-domestic” category. Host species were categorized as either occurring in human modified habitats or being hunted by humans – both estimates for human contact – according to the IUCN Red List species descriptions34.
To control for the fact that some detection methods are more reliable than others in identifying the pathogen of interest, we recorded the detection method used for each host-virus association and scored these as 0, 1, or 2 according to the reliability of detection method used. Viral isolation and PCR detection with sequence confirmation were scored as a 2 (= stringent data); and serological methods were scored as a 0 or 1, with viral or serum neutralization tests (=1), and enzyme-linked immuno assays (ELISA), antigen detection assays, and other serological assays scored as (=0). ‘Stringent data’ were analyzed separately to remove potential uncertainty due to cross-reactivity with related viruses. We exhaustively searched the literature to identify a ‘stringent’ detection for each mammal-virus pair, and only included the serological finding for that pair if no molecular or viral isolation studies were available. We partitioned data and conducted separate analyses for the entire dataset (0 +1 + 2 detection quality) and the ‘stringent’ data (score of 2) to reduce the noise from potential serological cross-reactivity. Full list of host-virus associations, detection methods, and associated references are provided in our data and code repository at http://doi.org/10.5281/zenodo.569079.
Our operational definition of a zoonotic virus includes any virus that was detected in humans and at least one other mammalian host in at least one primary publication, and does not imply directionality. Our complete dataset of mammalian viral associations demonstrates evidence of past or current viral infection which we believe is a reasonable proxy for measuring spillover, and our ‘stringent’ dataset specifically is more robust to exclude species that may have been exposed to a given virus versus those that show some evidence for replication within the host species. Our bi-directional definition of spillover follows a proposal by the WHO that defines a zoonosis as “any disease or infection that is naturally transmissible from vertebrate animals to humans and vice-versa” (http://www.who.int/zoonoses/en/) and excludes any human pathogens that recently evolved from nonhuman pathogens (e.g. HIV in primates), as per Woolhouse and Gowtage-Sequeria 20051.
In order to address influence of transmission from humans to wildlife in our models, we also ran our GAM model fitting and selection procedure (see below) on a subset of data that excluded any probable ‘reverse zoonotic’ viruses. We first searched our entire dataset and removed any clear instances of transmission from humans to primates, e.g. including records from zoological parks and wildlife rehabilitation centers (as previously noted). We then additionally removed several human viruses most commonly transmitted from humans back to non-human primates to create a subset of data without the most common ‘reverse zoonotic’ viruses (Adeno-associated virus-2; Human adenovirus D; Human herpesvirus 4; Human metapneumovirus; Human respiratory syncytial virus; Measles virus; Mumps virus)35,36. We present these additional analyses excluding ‘reverse zoonoses’ and associated code at http://doi.org/10.5281/zenodo.569079.
Total viral richness was calculated as the number of unique ICTV-recognized viruses found in a given host species, and zoonotic viral richness was defined as the number of unique ICTV-recognized viruses in a given host species that were also detected in humans in our database.
To assess research bias for both host and virus, we searched ISI Web of Knowledge, including Web of Science and Zoological Record, and PubMed for the number of research publications for a given host or pathogen. We recorded two values for the number of research papers for a host. The first was a simple search by scientific binomial in Zoological Abstracts where we recorded the number of papers published between 1940–2013 for each host species. We also recorded the number of disease-related publications for each species using the scientific binomial AND topic keyword: disease* OR virus* OR pathogen* OR parasit*. The * operator was used in our search criteria to capture all words that begin with each term, e.g. “parasit” (= “parasite”, “parasites”, and “parasitic”). These search criteria broadly included papers that examined disease or diseases, virus or viruses, pathogen or pathogens, parasite parasites, or parasitology, for each species. Only one measure of per host research effort was included at a time in model selection. As these metrics are highly correlated and the number of disease related citations per host outperformed the total number of publications per host in all but one model (all-data zoonoses), we decided to use disease-related publications as our per-species research effort measure for all models to improve interoperability. We also recorded the number of publications for each of 586 virus species using a keyword search by virus name in PubMed and Web of Science. Only one measure of per virus research effort was included at a time in model selection.
We used a phylogenetically-corrected measure of body mass (see details below under ‘Phylogenetic Signal’) as our main life history predictor variable, because it was the only one for which a nearly complete data set existed for the species in our dataset. We used the body mass recorded in the PanTHERIA database37 for 709 species. For 3 species, we used the second choice option, body mass recorded in the AnAge database38. For 11 species, we used the third choice option of the extrapolated body mass recorded in PanTHERIA, which is based on body length or forearm length, depending on species. For 36 species, we used the average body mass for members of the genus that had a recorded body mass. We explored other life-history variables related to longevity39, reproductive success, and basal-metabolic rate but these were ultimately excluded due to the high number of missing records.
We address the issue of non-independence of host species traits due to shared ancestry40 in our analyses by first quantifying the phylogenetic signal for each variable in our model using Blomberg’s K41. Blomberg’s K measures phylogenetic signal in a given trait by quantifying trait variance relative to an expectation under a Brownian motion null model of evolution using a phylogenetic tree with varying branch lengths. Blomberg’s K-values are scaled from 0 to infinity, with a value of 0 equal to no phylogenetic signal and values greater than 1 equal to strong phylogenetic signal for closely related species that share more similar trait values. While there is no clearly defined K-value cut-off in which to apply phylogenetic comparative methods, non-significant value of <1, or more conservatively <0.5, are typical for traits that are phylogenetically independent. The only host variables we examined with significant K-values >0.5 were host body mass, and our direct measure of phylogenetic distance to humans. While there are several tools available to control for phylogeny in multivariate analyses, e.g. using phylogenetic generalized least square models (e.g. PGLS)42, there is currently no modeling approach to control for phylogeny using GAMs. More importantly, a wholesale effort to control for phylogeny across all variables in our analysis was not appropriate here, as we are explicitly testing the relative importance of phylogenetic distance to humans vs. other host traits including measures of human-wildlife contact to predict the proportion of zoonotic viruses for a given host species. This left body mass as the only variable in our models, excluding our direct measures of phylogenetic distance, with a significant Blomberg K value that was greater than one. We controlled for the significant effect of shared evolutionary history using a phylogenetic eigenvector regression (PVR)43,44 on body mass. The PVR approach allowed us to remove phylogenetic signal for any phylogenetically non-independent variables and then include the corrected values back in our GAMs, while retaining predictor variables like phylogenetic distance to humans as unmodified. We calculated PVR for body mass using the R package ‘PVR’ and our custom-build maximum likelihood host phylogeny using cytochrome B sequences constrained to the order-level topology of the mammalian supertree29,45. Our new variable for body mass that controls for phylogenetic signal (PVRcytb_resid) removed most of the phylogenetic signal, with K= 3.5 unadjusted, and K<0.5 after PVR correction. Our new metric of body mass scales in the same way, with larger values equal to species with larger body mass. PVR body mass was included in our GAM model selection for the total viral richness and zoonotic virus models.
We used two different mammal phylogenetic trees in our analyses and used a model selection framework to determine which best explained our observed association with zoonotic viral richness. First the mammal supertree was pruned in R (package ape, function ‘drop.tips’) to include only synonymous species for the 753 species in our database29,46. We synonymized all host species names between the mammal supertree and the host associations in our database using the IUCN Red List34. If the species was listed as ‘cattle’ it was assumed to be Bos taurus, all other records were excluded if there was ambiguity as to the scientific name for the host species. Second, a maximum likelihood cytB tree was generated using the constraint of a multifurcating tree with taxa constrained to their respective orders and the order-level topology matching that of the mammal supertree6, as per this Newick tree file: (MONOTREMATA,((DIDELPHIMORPHIA,(DIPROTODONTIA,PERAMELEMORPHIA)),(PROBOSCIDEA,((PILOSA,CINGULATA),((((RODENTIA,LAGOMORPHA),(PRIMATES,SCANDENTIA)),((((CETARTIODACTYLA,PERISSODACTYLA),CARNIVORA),CHIROPTERA),EULIPOTYPHLA))))))). This generated a higher resolution species-level mammal tree using cytb data, with more reliable positioning of the higher-level taxonomic relationships than was obtained in exploratory phylogenetic analyses using cytb data alone. Genbank accession numbers and cytb sequence lengths for each species provided in in our data and code repository. Cytb gene fragments ranged from 143 to 1140bp, with >1000bp available for 558/665 (84%) of the taxa. Data derived from the cytochrome b tree constrained to the topology of the mammal supertree was selected as the best option in all best-fit GAMs.
Sequences were aligned using MUSCLE with default setting in Geneious R6, and checked visually for errors47. The best maximum likelihood tree with and without the constraint tree were generated using RAxML-HPC2 on XSEDE via the CIPRES Science Gateway server v3.148 using a GTR model with parsimony seed, 1000 bootstrap replicates, and the following, specific parameters (raxmlHPC-HYBRID -s infile -n result -x 12345 -g constraint.tre -N 1000 -c 25 -p 12345 -f a -m GTRCAT).
Matrices of pairwise patristic distances between all species, including Homo sapiens, were calculated from the two phylogenies using the ‘cophenetic’ function in the R package ape46. Phylogenetic trees (Newick format for pruned Supertree and cytb tree) and matrices of phylogenetic distance from humans are provided in the data and code repository.
We calculated mean, median, max, min, IQR, and standard deviation (represented as generic function F in equation 1) of phylogenetic host breadth (PHB) from all known mammalian hosts for each virus using the pairwise patristic distances (di,j) for each mammal-mammal association for all hosts of a given virus excluding humans, where i indexes each mammal in the database, as does j, and J represents the total mammals in the database. We aggregated these PHB values using mean, median, or maximum values at a viral species, genus and viral family level to generate higher-level taxonomic variables of host breadth per viral group. Our measure is similar to those developed by previous studies to understand parasite host specificity49–51, but here we create a generalizable variable to measure viral host breadth that can be aggregated at different viral taxonomic levels.
To make Extended Data Figure 9, taxon names and terminal branches of cytb tree constrained to supertree were color-coded using residual from the best-fit zoonotic virus GAM (predicted-observed zoonotic viral richness) for wildlife species, and plotted using the plot.phylo function in the R package ape46. Symbols (circles) at terminal taxa additionally added to better visualize residual value colors were added using willeerd.nodelabels function (dx.doi.org/10.5281/zenodo.10855). All marine mammals, domestic animals, and other taxa with missing data were coded as grey for missing data.
Viral richness heatmap (Extended Data Figure 2) was generated using the R package pheatmap, and the ‘complete’ hierarchical clustering algorithm to sort cells across rows and columns by similar values of viral richness. All boxplots, histograms and all other figures generated in R v.3.3.052. R code for primary figure generation is provided in the code repository.
We fit a set of generalized additive models (GAMs) that included all of our selected potential variables explaining the number of total viruses or number of zoonoses in hosts, as well as whether viruses were zoonotic (for conceptual framework and summary of each GAM see Extended Data Figure 1; for full variable list and data sources see Supplementary Table 1). Our use of GAMs, an incorporation of smooth spline predictor functions into the generalized linear model (GLM) framework, allowed us to examine the functional form of our predictor variables (e.g. Fig 2 and and4).4). Categorical and binary variables (e.g. host order, IUCN status of hunted or not, and certain viral traits) were fit as random effects of each variable level. We used automated term selection by double penalty smoothing53 to eliminate variables from the models. This method removes variables with little to no predictive power and has been shown to be comparable or superior to comparing alternate models with and without variables. We did use the model comparison method in for the case of domestic animals, where the sample size was not sufficient for fitting all variables. In this case dropping variables by double penalty smoothing still allowed pruning the model list to eliminate redundant models. Where there were competing variables measuring the same mechanistic effect, we fit alternate GAMs using only one of each of these variables (as specified in below and in the Extended Data Figure 1). These included phylogenetic variables, citation counts from alternate databases, and different measures of human population/host overlap. For example, to capture host phylogeny we used phylogenetic distance based on either the mammal supertree20 or a purpose-built CytB constrained by the topology of the mammal supertree, but never both in the same model. For human population variables, we looked at either variables measuring overlap of species range with human-occupied areas, or human population in those areas, as area- and population-based measures were highly co-linear. For citation variables, we looked at either all citations or the number of disease-related citations for each host species, not both, and similarly citations in either PubMed or Web of Knowledge. We used a binomial GAM to analyze the 586 mammalian viruses in our database and identify viral traits that may serve as predictors of zoonotic potential. Collinearity was not a major issue among variables included in the same model.
We inspected models within 2 AIC units of the model with the lowest AIC, and present the outputs of the best-fit and all other top models (<2 ΔAIC) in our data and code repository. In general, variable effects retained the same functional form and effect size across models within 2 ΔAIC – differences were limited to the adding or dropping of very weak, insignificant effects, or switching between highly correlated competing variables such as citation counts from different databases.
For our model of number of zoonoses per host, we used the total number of observed viruses per host as an offset, effectively fitting a model of expected number of zoonoses per host virus. We found this variable had a coefficient near to one when it was used as a linear predictor, indicating its appropriateness as an offset.
We repeated the model selection process for all models using the more ‘stringent’ set of data that used only virus identified in mammal hosts using viral isolation, PCR, or other methods of nucleic acid sequence confirmation, i.e. that excluded all associations detected via serology.
All models were fit using the MGCV package for R (version 1.8–12.). We used the model with the lowest AIC to predict the number of expected zoonotic viruses for each host species, using all the data from our database that had complete observations for the best model. Our top models consistently outperform the alternatives by wide margins, as measured by AIC. We used standard methods in the R package MGCV to calculate deviance explained, which is defined as (D_null – D_model)/D_null. In this formula, D_null is the deviance (−2*likelihood) of an intercept-only, (or, in the case of the zoonoses model, offset-only), model, while D_model is the deviance of our best-fit model.
Analyses were limited to terrestrial mammal species as defined by the IUCN Red List (marine mammals were excluded) and we ran separate analyses for wild and domestic animals. As domestic animals made up a much smaller data set (n=32 species) with a unique set of explanatory variables that differed from the wild species analyses, these models were fit separately. Domestic species results are also discussed separately (see Supplementary Discussion) as they are tangential to the primary findings.
We used k-fold cross-validation to evaluate goodness of fit for all models. The data was divided into 10 folds, selected randomly. For each fold, the model was re-fit based on the other 9 folds, and goodness of fit was assessed by conducting a nonparametric permutation test of comparing the predicted values vs. the real values for the kth fold, where a non-significant result indicates that predictions are unbiased. Poisson models goodness-of-fit may be compared via a parametric Chi-squared permutation test on deviance values, but this test is inappropriate in the case of models with low mean values, as is our case for some of our GAMs54. The k-fold cross-validation confirmed the robustness of our model predictions for wild mammals, code and outputs from these tests for each best-fit GAM are provided in Supplementary Table 2.
In addition to randomly-selected k-fold cross-validation, we evaluated the robustness of our models via a non-random geographic cross-validation, code and summary document provided in our code and data repository. In order to meaningfully organize species in our dataset by geographic areas, we used the 34 zoogeographic regions for terrestrial mammals recently redefined by Holt et al.55. Using QGIS56, a mammal-specific zoogeographical shapefile provided by Holt’s group at the University of Copenhagen (http://macroecology.ku.dk/resources/wallace) was intersected (using QGIS Vector > Geoprocessing Tools > Intersect) with a shapefile of IUCN’s host ranges for all mammals in our database. Areas of these intersections were then calculated using an equal-area projection (Mollweide), and each host was assigned to only the region that contained the greatest proportion of its range. We systematically removed all observations (species) from each given zoogeographical region, re-fit the model using all observations from outside the region, then performed a non-parametric permutation test comparing the predicted values to the observed values for that region. Non-significant results indicate that model predictions are unbiased. Significant results for a given zoogeographic region suggest that there are location-specific biases that remain unexplained. This systematic zoogeographic cross-validation supported the overall robustness of our model predictions for several models, i.e. all-data zoonoses, all-data total viral richness, and stringent-data total viral richness models. For these models, even though a majority of zoogeographic regions were unbiased, we still identified several zoogeographic regions that showed significant bias. Our zoogeographic cross-validation was equivocal for the stringent-data zoonoses model, with 8 regions that showed evidence of bias and 7 regions which showed no evidence of bias (Supplementary Table 3).
The presence of biased regions in our zoogeographic cross-validation suggested the possibility that there is a systematic bias associated with geography not captured by the predictor variables in our models. To further investigate this, we added zoogeographical region as a categorical random effect to each of our best-fit models. For three of our best-fit GAMs (all-data total viruses, stringent-data total viruses, and stringent-data zoonoses) the addition of zoogeographical region as a categorical random effect decreased the model AIC and increased the total deviance explained by 3–5%. The all-data zoonoses model, which was used to create the series of maps in the main manuscript, does not improve with the inclusion of zoogeographical region. However, the improved predictive power of models using region-specific terms is offset by the increase in degrees of freedom (i.e. if we included 31 zoogeographic regions as separate terms) and, more importantly, a decreased interpretability of our models – especially when compared to the geographical variables we used, such as host area or species range overlap with human modified habitat. We opted not to include these random effects in our final GAMs in favor of keeping only variables interpretable in the context of our host trait-specific framework. Instead, we indicate areas of geographic bias directly on our spatially mapped outputs. (See “Calculating and visualizing ‘missing viruses’ and ‘missing zoonoses’”, below.) Summaries of these models, along with changes in relative deviance explained for the other explanatory variables when zoogeographic region is added as a random effect, are provided in our code and data repository.
For all the wildlife hosts we used the geographic range information obtained from the IUCN spatial database version 2015.2. Wildlife host species shapefiles needed to replicate analysis are now hosted on our Amazon S3 storage (https://s3.amazonaws.com/hp3-shapefiles/Mammals_Terrestrial.zip)34. IUCN depict species’ range distributions as polygons based on the extent of occurrence (EOO), which is defined as the area contained within a minimum convex hull around species’ observations or records. This convex hull or polygon is further improved by including areas known to be suitable or by removing unsuitable or unoccupied areas based on expert knowledge. To accurately calculate the area in km2 of each host species we projected the polygons to an equal area projection (Mollweide).
We calculated various thresholds of mammal sympatry based on percentage of range overlap for each wild species in our database using IUCN shape files for all mammals globally. We define mammal sympatry as the number of mammalian species that overlap with the target species’ geographic range. We calculated mammal sympatry for each wild species in our database at six different thresholds based on the percentage area overlap with the target species geographic range, i.e. the number of other wild mammal species with any (>0%), ≥20%, ≥40%, ≥50%, ≥80%, or 100% range overlap. The six different thresholds for mammal sympatry were included as competing terms in our model selection for the total viral richness models.
We derived and tested several global measures to estimate the level of human contact with each wild species in our database. To estimate the area of host geographic range covered by crops, pastures, rural and urban areas – as measures of global human contact with a given wildlife species – each species polygon was intersected (overlapped) with spatial data representing those land cover types. Additionally, we calculated the total number of people within each host geographic range using data from HYDE database57, and also separately totaled the number of people in rural and urban populations. We obtained data on the distribution of cropland, pastures, rural and urban areas (i.e., grazing land) also from the HYDE database57 for the years 1970, 1980, 1990, 2000 and 2005 with a spatial resolution of 5' × 5', equivalent to 10km2 by 10km2. These datasets were created by combining information from satellite imagery and sub-national crop and pasture statistics57. In our GAMs, we used several transformations of these variables as competing proxies for human-wildlife contact: the log-transformed area of host range that overlapped each type of human-modified land cover, log-transformed human population in the host range, log-transformed human population density in the host range, and the log-ratio of urban and rural human populations in the host range. For each of these, we also included as a variable the change in value from 1970 to 2005. Human-wildlife contact variables that significantly covaried were excluded (set as competing terms) during the model selection process. The ratio of urban to rural human population was used to disentangle variables of human-wildlife contact that significantly covaried. For example, the total area of a species range that overlapped with urban and rural areas was highly correlated with the total geographic area variables we examined (e.g. total area, and area in crop, pasture, rural, and urban). The ratio of urban to rural population allowed us to separate these signals and best represent this proxy of per species human-wildlife contact. All spatial analyses were performed in R (3.3.2)52, using the following R libraries: raster58, rgdal59, and sp59.
We used each respective best-fit, all data GAM from the total viral richness and proportion zoonoses models to calculate the estimated number of viruses that would be observed if the research effort variable for each species was equal to that of the most-studied wild species in our database (Vulpes vulpes with 4,433 total publications and 1,477 disease-related publications). We used the prediction of the total virus richness GAM as the offset for the zoonoses GAM. We then calculated the ‘missing viruses’ and ‘missing zoonoses’ by subtracting the observed number of viruses and zoonoses from the predictions based on maximum research for each wild mammalian species.
We used geographic range maps from the IUCN spatial database (2015.2) to visualize the spatial distribution of observed host-virus associations, observed host-zoonoses association, these associations as predicted under maximum research, and the maximum predicted – observed viruses, or the ‘missing viruses’ and ‘missing zoonoses’ (e.g. Fig 3; Extended Data Figures 3–8; Supplementary Table 4). We also generated maps comparing species richness of all species in the IUCN database against those with viral associations in our database. For each species, the distribution range was converted to a grid system with cells 1/6 of a geographic degree (approximately 18 km × 18 km at the Equator line). Each grid cell was assigned a value of one to indicate presence. We repeated this process and assigned the observed and predicted-under-max-effort number of zoonotic viruses to their correspondent grid cells. Viral and host species richness maps, and both the ‘missing viruses’ and ‘missing zoonoses’ maps were calculated by overlying individual grids. Each richness map represents the sum of all values for a given grid cell. We repeated the process for all the host species in our database and created viral and species richness maps for the following orders: Carnivora, Cetartiodactyla, Chiroptera, Primates and Rodentia. These taxa were selected because they represent 681/736 (92.5%) of wild mammal species in our database.
In the process of translating our non-spatial, species-level predictions to geographic space (i.e. layered raster maps), we identified several geographic areas where our model predictions of the number of total and zoonotic viruses were systematically biased, i.e. p-value <0.05 (Supplementary Table 3). In order to visualize the geographic biases of our non-spatial model predictions in our maps (see above regarding zoogeographic cross-validation), we demarcate regions with significant bias with hatching. Hatched regions represent areas where model predictions of total or zoonotic viral richness deviate systematically for the collection of species in that grid cell. For each grid cell we calculated whether the bias exceeded that expected from a random sampling of hosts. This was accomplished by summing the residuals from 100,000 random draws of species in our dataset that was equal to the number of species present in that grid cell, then identifying grid cells where the observed bias was outside the middle 95% of the randomly drawn distribution. We calculated this for all mammals, and separately for each order across all grid cells. Areas with observed bias (outside of 95% of the randomly drawn distribution) are shown with hatched regions on each ‘missing virus’ and ‘missing zoonoses’ map.
Animal silhouettes added to Figs 1 and and33 and Extended Data Figs 1 and and22 to visually represent each mammalian order were downloaded from PhyloPic (www.phylopic.org). Chiroptera, Cingulata, Diprotodontia, Lagomorpha, Peramelemorphia, and Primates were available for use under the Public Domain Dedication license. Carnivora and Rodentia (by Rebecca Groom), Didelphimorphia, Pilosa, and Probscidea (by Sarah Werning), Eulipotyphyla (by Claus Rebler), Certartiodactyla and Perissodactyla (by Jan A. Venter, Herbert H. T. Prins, David A. Balfour & Rob Slotow and vectorized by T. Michael Keesey) are provided under the creative commons license (https://creativecommons.org/licenses/by/3.0/). We created the silhouette for Scandentia, and waive all rights.
|Total Viral Richness Model (all data, n=576 species)||49.2%|
|Disease-related publications (log)||1846.57||<0.001||5.55||64.8%|
|Mammal sympatry (>20% range overlap)||301.38||<0.001||5.16||10.1%|
|Phylogenetically-corrected body mass||216.42||0.009||3.82||1.9%|
|Geographic range (log)||18.93||0.025||3.58||1.6%|
|Total Viral Richness Model (stringent data, n=575 species)||35.8%|
|Disease-related publications (log)||923.02||<0.001||4.98||53.6%|
|Mammal sympatry (>20% range overlap)||44.96||<0.001||4.69||3.9%|
|Phylogenetically-corrected body mass||9.65||0.036||3.51||2.8%|
|Geographic range (log)||11.14||0.079||2.66||1.5%|
|Proportion Zoonoses Model (all data, n=584 species)||82.0% (number of zoonoses) 33.0% (proportion, w/offset)|
|Phylog. dist. from humans (log, cytb tree)||12.7||0.002||1.88||17.0%|
|Urban to rural human population ratio in species range (log)||10.01||0.002||1.25||13.0%|
|Disease-related publications (log)||5.81||0.017||1.2||7.7%|
|Hunted species, IUCN||0.75||0.167||0.36||1.3%|
|Proportion Zoonoses Model (stringent data, n=576 species)||23.6%|
|Phylog. dist. from humans (log, cytb tree)||56.13||<0.001||2.36||34.5%|
|Urban to rural human population ratio change, 1970–2005||16.88||0.002||4.05||19.6%|
|Change in human population density in range, 1970–2005||3.16||0.132||1.47||4.3%|
|Disease-related publications (log)||5.03||0.014||1.21||3.8%|
|Phylogenetically-corrected body mass||0.12||0.294||0.12||1.1%|
|Viral Traits Model (all data, n=464 viruses)||27.2%|
|Max phylogenetic host breadth w/out humans, (log, cytb tree)||44.91||<0.001||2.94||45.6%|
|Number of publications (log)||35.83||<0.001||3.28||37.4%|
|Average genome length (log)||0.12||0.266||0.09||0.9%|
|Viral Traits Model (stringent data, n=408 viruses)||21.1%|
|Number of publications (log)||29.51||<0.001||2.64||53.1%|
|Max phylogenetic host breadth w/out humans, (log, cytb tree)||15.75||<0.001||2.53||25.5%|
This research was supported by the United States Agency for International Development (USAID) Emerging Pandemic Threats PREDICT program; and NIH NIAID awards R01AI079231 and R01AI110964. The authors thank C.N. Basaraba, J. Baxter, L. Brierley, E.A. Hagan, J. Levinson, E.H. Loh, L. Mendiola, N. Wale, and A.R. Willoughby for assistance with data collection, and B.M. Bolker, A.R. Ives, K.E. Jones, C.K. Johnson, A.M. Kilpatrick, J.A.K. Mazet, and M.E.J. Woolhouse for comments.
Author contributionsK.J.O., T.L.B., and P.D. designed the study and supervised the collection of data. N.R., P.R.H. and K.J.O. designed the statistical approach, wrote the code, and generated figures. K.J.O. performed phylogenetic analyses. C.Z-T. performed spatial analyses. All authors were involved in writing the manuscript.
The authors declare no competing financial interests.
Data availability statement
All datasets (host traits, viral traits, full list of host-virus associations and associated references, phylogenetic trees, and phylogenetic distance matrices) along with the R code and R package dependencies needed to fully replicate and evaluate these analyses are provided at http://doi.org/10.5281/zenodo.569079. The top-level README.txt file in the directory details the file structure and metadata provided.