Accurate estimation of population size dynamics has important implications for public health and conservation biology (Pybus et al. 2003
; Shapiro et al. 2004
; Biek et al. 2006
). In this paper, we propose a statistically novel model to infer population size dynamics from molecular sequences. We build our modeling framework upon Kingman's coalescent process, a powerful tool in the population genetics arsenal for studying probabilistic properties of genealogies relating individuals randomly sampled from a population of interest (Kingman 1982
). Because genealogical shapes leave their imprints in the genomes of sampled individuals, the coalescent allows for the inference of population genetics parameters, including population size dynamics, directly from the observed genomic sequences (Hein et al. 2005
Many coalescent-based estimation algorithms rely on simple parametric forms to characterize the evolution of the population size dynamics over time. Advantageously, these deterministic functions contain a relatively small number of parameters to be estimated (Kuhner et al. 1998
; Drummond et al. 2002
). However, justifying strong parametric assumptions can be difficult and may require laborious and computationally expensive testing of many candidate functional forms to find an appropriate description of the population size trajectory. An extreme alternative to parametric population size models is the classical skyline plot estimation proposed by Pybus et al. (2000)
. This estimation procedure relies on a piecewise constant population dynamics model. Because the number of free parameters in this model is equal to the number of independently distributed observations, the classical skyline plot approach results in very noisy estimates.
To arrive at a middle ground between overly stringent parametric and noisy classical skyline plot approaches, 3 extensions to the classical skyline plot estimation have been recently proposed. Strimmer and Pybus (2001)
develop generalized classical skyline plot estimation. These authors employ a model selection approach, based on the Akaike Information Criterion correction (AICc
), to reduce the number of free parameters in the classical skyline plot. Drummond et al. (2005)
and Opgen-Rhein et al. (2005)
use multiple change-point (MCP) models to estimate the population size dynamics in a Bayesian framework. These methods approximate the effective population size trajectory with a step function, defined by estimable change-point locations and step heights. One of the main advantages of MCP models is the ease of incorporating them into a joint Bayesian estimation of genealogies and population genetics parameters as demonstrated by Drummond et al. (2005)
. Both proposed MCP models share the same weakness as they require fairly strong prior decisions. Drummond et al. (2005)
a priori fix the total number of change points, a critical parameter in their model that controls the smoothness of the population size trajectory. Opgen-Rhein et al. (2005)
bypass the problem of fixing the number of change points through reversible jump Markov chain Monte Carlo (MCMC) sampling (Green 1995
). However, these authors use an informative and very influential prior for the number of change points in their model. Therefore, in both MCP approaches choosing an appropriate level of smoothness of population size dynamics remains problematic.
We propose to smooth population size trajectories explicitly. We choose piecewise constant demographic model of Pybus et al. (2000)
as our point of departure. Our goal is to construct a smooth skyride through a rough classical skyline profile. Our construction is accomplished by imposing a Gaussian Markov random field (GMRF) smoothing prior on the parameters of the piecewise constant population size trajectory. We make our smoothing prior “time aware” to penalize effective population size changes between “small” consecutive (Ghedin et. al 2005) intercoalescent intervals more than changes between intervals of larger size. To achieve this desirable behavior of our smoothing prior, we equip each consecutive pair of intercoalescent intervals with an appropriate smoothing weight. We show through a simulation study that the time-aware prior is very effective in capturing important characteristics of “true” population size trajectories and is superior to a uniform, time-ignorant GMRF prior. The extension of Kingman's coalescent to serially sampled (heterochronous) data by Rodrigo and Felsenstein (1999)
opens the door for coalescent-based inference for measurably evolving populations (Drummond et al. 2003
). We show that the GMRF smoothing can be easily incorporated into analyses of both isochronous (contemporaneously sampled) and heterochronous data.
Through simulation, we compare performance of the Bayesian skyride with Opgen-Rhein's MCP (ORMCP) model (Opgen-Rhein et al. 2005
) and a Bayesian skyline plot, a MCP model implemented in the software package BEAST (Drummond et al. 2005
). In 3 simulation scenarios that we consider, we find that the Bayesian skyride performs as well or better than both MCP approaches. Although the small number of our simulations prevent us from a detailed comparison of the methods, we nevertheless can conclude that the Bayesian skyride is a competitive alternative to the MCP models and requires substantially weaker prior assumptions. We demonstrate the utility of the proposed method by applying it to 2 real data sets. We analyze isochronous sequences of hepatitis C virus (HCV) and demonstrate that the Bayesian skyride is able to recover all previously inferred characteristics of the Egyptian HCV population dynamics. We proceed with an investigation of intraseason population dynamics of human influenza virus. Our analysis of heterochronous influenza data from 3 seasons demonstrates that estimation of influenza population dynamics holds promise for predicting peak infection time within a flu season.