The coalescent with recombination describes the distribution of genealogies underlying samples of chromosomes from unrelated individuals in idealized natural populations (Hudson 1983
; Griffiths & Marjoram 1996
). Starting from the present and looking back in time, the ancestral lineages of the sampled chromosomes are traced until coalescence (where two ancestral lineages meet in a common ancestor) or recombination (where an ancestral lineage splits in two). The resulting ancestral recombination graph (ARG) has embedded within it the marginal genealogy (the phylogenetic tree describing the ancestry of the chromosomes) at any position along the sequence and, by mapping mutations onto the graph, describes patterns of genetic variation in the sampled chromosomes. Under models with constant population size and random mating, two parameters determine the distribution of variation: the population mutation rate θ = 4N_eμ (where N_e is the effective population size and μ is the per-generation mutation rate); and the population recombination rate ρ = 4N_er (where r is the per-generation recombination rate).
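Under the usual diploid scaling, the two parameters are simple products of the effective population size and the per-generation rates; a trivial sketch (the numerical values below are illustrative examples, not taken from any dataset):

```python
def theta(n_e, mu):
    """Population mutation rate, theta = 4 * N_e * mu (diploid model)."""
    return 4.0 * n_e * mu

def rho(n_e, r):
    """Population recombination rate, rho = 4 * N_e * r (diploid model)."""
    return 4.0 * n_e * r

# Illustrative values only (not from any particular dataset):
print(theta(10_000, 1e-5))  # 0.4
print(rho(10_000, 1e-5))    # 0.4
```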
Stochastic simulation under the model (backwards in time starting from the present) is computationally straightforward because at any point in time the rates of coalescence and recombination are simple functions of the ancestral lineages present (i.e. it has a Markovian structure; Hudson 1983
). In contrast, the alternative approach of simulating genealogies while moving along a sequence (Wiuf & Hein 1999
) has a complex non-Markovian structure, in that the distribution of the next genealogy depends not just on the current genealogy but also on all previous ones. Both approaches, however, can make use of the separation of the genealogical and mutational processes under neutrality (Hudson 1990
). Consequently, the ARG may be generated first, with mutations subsequently added to the marginal genealogies as a Poisson process.
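The Markovian structure of the backward-in-time process can be illustrated with a deliberately stripped-down sketch that tracks only the number of ancestral lineages (it ignores which sequence material each lineage carries, so it is not a full ARG simulator). The rate conventions assumed here are the standard ones: with k lineages, coalescence at total rate k(k−1)/2 and recombination at total rate kρ/2, time measured in units of 2N generations.

```python
import random

def simulate_event_times(n, rho, rng):
    """Stripped-down backward-in-time sketch: with k ancestral lineages,
    coalescence occurs at rate k*(k-1)/2 and recombination at rate
    k*rho/2 (time in units of 2N generations).  Only the lineage count
    is tracked, so this illustrates the Markov structure rather than
    building a full ARG."""
    events, t, k = [], 0.0, n
    while k > 1:
        c_rate = k * (k - 1) / 2.0  # total coalescence rate
        r_rate = k * rho / 2.0      # total recombination rate
        t += rng.expovariate(c_rate + r_rate)  # waiting time to next event
        if rng.random() < c_rate / (c_rate + r_rate):
            k -= 1  # two lineages merge in a common ancestor
            events.append((t, "coalescence", k))
        else:
            k += 1  # an ancestral lineage splits in two
            events.append((t, "recombination", k))
    return events

history = simulate_event_times(5, rho=1.0, rng=random.Random(42))
print(history[-1][2])  # 1: the process ends at the grand common ancestor
```

Because the coalescence rate grows quadratically in k while the recombination rate grows only linearly, the lineage count returns to one with probability one, which is what makes this backward simulation straightforward.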
Efficient inference under the coalescent with recombination is notoriously difficult (Stumpf & McVean 2003
). For example, moment estimation of ρ
can be achieved by comparing the sample variance in pairwise differences to the expectation under neutrality (using a point estimate of θ
; Hudson 1987
; Wakeley 1997
), but the estimator uses only a fraction of the available information about recombination and is both biased and highly variable (Wall 2000). In contrast, likelihood-based inference (which uses all of the available information) is currently restricted because there is no analytic or numerical expression for the likelihood function, and the construction of efficient Monte Carlo methods for estimating it is technically challenging. Naively, the likelihood could be estimated by simulating ARGs from the coalescent distribution given ρ
, adding mutations to the ARGs from the distribution given by θ
and checking whether the simulated data match the observed data. By repeating this many times under different values of θ and ρ, maximum likelihood estimates of the parameters could be obtained. In practice, the naive approach is infeasible because the vast majority of simulated ARGs contribute nothing to the likelihood. Consequently, sophisticated Monte Carlo methods such as importance sampling (IS; Fearnhead & Donnelly 2001
) and Markov Chain Monte Carlo (MCMC; Kuhner et al. 2000
; Nielsen 2000
) must be used (reviewed in Stumpf & McVean 2003
), which bias the simulation towards ARGs that make significant contributions to the likelihood.
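The naive strategy can be made concrete on a toy scale by working with a summary of the data rather than the full data (matching complete datasets exactly is, as noted, essentially impossible). The sketch below, with illustrative names and parameter values of my choosing, estimates the likelihood of the observed number of segregating sites under the standard coalescent without recombination, by simulating repeatedly and counting exact matches:

```python
import random

def segregating_sites(n, theta, rng):
    """Simulate S for n chromosomes under the standard coalescent without
    recombination: draw coalescent waiting times, accumulate the total
    branch length, then drop mutations at rate theta/2 per lineage per
    unit of coalescent time (a Poisson number overall)."""
    total_length, k = 0.0, n
    while k > 1:
        t = rng.expovariate(k * (k - 1) / 2.0)  # time while k lineages remain
        total_length += k * t                   # all k branches grow by t
        k -= 1
    mean = theta * total_length / 2.0
    s, acc = 0, rng.expovariate(1.0)            # Poisson(mean) draw by
    while acc < mean:                           # summing unit-rate gaps
        s += 1
        acc += rng.expovariate(1.0)
    return s

def naive_likelihood(s_obs, n, theta, reps, rng):
    """'Naive' Monte Carlo likelihood: simulate repeatedly and count how
    often the simulated (summarised) data exactly match the observation."""
    hits = sum(segregating_sites(n, theta, rng) == s_obs for _ in range(reps))
    return hits / reps

print(naive_likelihood(s_obs=4, n=5, theta=2.0, reps=2000, rng=random.Random(1)))
```

Even on this one-dimensional summary the hit rate is low; with full sequence data the match probability collapses to effectively zero, which is exactly why IS and MCMC methods are needed.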
To date, while Monte Carlo methods can be used to calculate likelihoods for very simple datasets, they are still impractical for most datasets currently being collected. Instead, three alternative approaches to coalescent-based inference have been explored. First, it is possible to calculate the likelihood of a summary of the data, rather than the data itself. For example, Wall (2000)
suggested estimating ρ
by calculating the likelihood of observing the number of haplotypes (H
) and the minimum number of recombination events (Rm
) as estimated by the method of Hudson & Kaplan (1985)
. Importantly, this likelihood may be calculated by naive simulation, potentially aided by regression techniques (Beaumont et al. 2002
). The second approach is to divide the complete data into smaller subsets (pairs of segregating sites; Hudson 2001
; McVean et al. 2002) or non-overlapping windows (Fearnhead & Donnelly 2002
), the likelihood of which can be calculated using IS or even naive methods. Combining likelihood calculations across subsets can give accurate estimates (Wall 2000
), and can be used to estimate variation in the recombination rate (McVean et al. 2004
), but the resulting likelihoods do not have the standard properties of a full likelihood (e.g. they cannot be used to calculate support intervals).
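The Rm bound of Hudson & Kaplan (1985) mentioned above rests on the four-gamete test: under the infinite-sites model, a pair of sites showing all four haplotype combinations implies at least one recombination event between them. A sketch of one common way to compute the bound (greedy interval scheduling over incompatible site pairs; the 0/1-string encoding is a choice made here for illustration):

```python
def four_gametes(site_i, site_j):
    """Two biallelic sites are incompatible under the infinite-sites model
    if all four combinations (00, 01, 10, 11) occur among the sample --
    the four-gamete test."""
    return len(set(zip(site_i, site_j))) == 4

def hudson_kaplan_rm(haplotypes):
    """Compute a lower bound Rm in the spirit of Hudson & Kaplan (1985):
    collect the intervals spanned by incompatible site pairs, then count
    a maximal set of pairwise non-overlapping intervals (greedy interval
    scheduling), each of which must contain >= 1 recombination event.
    `haplotypes` is a list of equal-length '0'/'1' strings."""
    sites = list(zip(*haplotypes))  # transpose: one tuple per site
    intervals = [(i, j)
                 for i in range(len(sites))
                 for j in range(i + 1, len(sites))
                 if four_gametes(sites[i], sites[j])]
    intervals.sort(key=lambda iv: iv[1])  # greedy: earliest right endpoint
    rm, last_right = 0, -1
    for left, right in intervals:
        if left >= last_right:  # interval lies beyond the last chosen one
            rm += 1
            last_right = right
    return rm

print(hudson_kaplan_rm(["00", "01", "10", "11"]))  # 1
```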
The third approach is to simplify or approximate the coalescent model itself. Building on research into optimal IS proposal distributions, Li & Stephens (2003)
proposed a new statistical model for genetic data with recombination that generates patterns of genetic variation similar to the coalescent, but uses an approximation to the genealogical process. Importantly, the approximation means that likelihoods are easy to compute (referred to as product of approximate conditionals, or PAC, likelihoods); hence the approach generates a true likelihood.
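The PAC construction itself can be sketched schematically: the likelihood is a product of approximate conditionals, one for each haplotype given those already analysed. The conditional used below is a deliberately crude stand-in of my own (nearest-haplotype copying with a per-site mismatch probability eps), not the approximate conditional used in the PAC method itself; it carries just enough structure to show the product form and its dependence on the order of the haplotypes.

```python
import math
from itertools import permutations

def toy_conditional(hap, previous, eps=0.1):
    """Crude stand-in for an approximate conditional pi(h_k | h_1..h_{k-1}):
    the new haplotype 'copies' its closest predecessor, with probability
    eps of a mismatch at each site.  Illustrative only."""
    if not previous:
        return 0.5 ** len(hap)  # uniform over haplotypes for the first draw
    best = min(sum(a != b for a, b in zip(hap, p)) for p in previous)
    return (eps ** best) * ((1.0 - eps) ** (len(hap) - best))

def pac_log_likelihood(haplotypes, eps=0.1):
    """Product of approximate conditionals, taken in the listed order."""
    return sum(math.log(toy_conditional(h, haplotypes[:k], eps))
               for k, h in enumerate(haplotypes))

# Because the conditionals are only approximate, the result depends on
# the order in which the haplotypes are analysed:
haps = ["0000", "0001", "1110"]
vals = {pac_log_likelihood(list(p)) for p in permutations(haps)}
print(len(vals) > 1)  # True: different orders give different likelihoods
```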
The disadvantage of approximating the coalescent model is that the biological validity of the approximation may be poor. In the PAC approach, chromosomes are no longer exchangeable (i.e. the likelihood depends on the order in which chromosomes are analysed), and the estimated recombination parameter can only be related to that of the coalescent through an empirical bias correction. Even more importantly, the coalescent approximation does not correspond to any well defined genealogical process, so that no inferences can be made about the ancestral history of the sample (e.g. the marginal genealogy at a given position).
The potential advantages of developing tractable alternative models to the coalescent, combined with the disadvantages of the PAC model, motivate the search for other possible approximations to the coalescent process. However, such a search should be guided both by an appreciation of what makes the coalescent with recombination so difficult a model under which to perform inference, and by how to assess the merit of alternative models for sequence variation data. Here, we focus on one aspect of the coalescent with recombination that makes inference difficult: the sequentially non-Markovian behaviour of the coalescent model. Our approach is to introduce a simplification of the standard coalescent process (called the sequentially Markov coalescent, or SMC) that removes this source of model complexity, and to compare its properties with those of the full model. We show that the SMC differs only marginally from the standard model in terms of the predicted patterns of genetic variation, and suggest that it may provide both a tractable and useful model for genealogy-based inference.