Our results indicate that the leaf-age model is powerful enough to recover temporal signal from sequence data provided that the calibrating information within other portions of the data set is sufficient. Analysis of simulated data sets demonstrates convincingly that, when all other sources of error are accounted for (i.e., the analysis is performed under the true model), the leaf-age estimator is an unbiased predictor of the true age of a sequence and has correct properties of statistical coverage. shows that bias is greater when the estimated leaf-age is close to its boundary (in this case, 0). Such behavior is expected because the variable being estimated (the sequence age) is a strictly nonnegative quantity and as the value approaches the boundary only ages that are older than the true age can be estimated. Implementing an estimator from a model that allows for negative leaf-ages may circumvent this but would be biologically meaningless when the youngest sample included in the data set was sampled, effectively, today (such a negative leaf-age would become the youngest sample, suggesting that all the other samples are older than ascribed).
The results of the empirical data analyses were mixed. Our leaf-dating method performed very well when applied to the DEN-2 data set, as the known dates of sampling fell within the 95% HPD intervals of our estimates 93% of the time. This is extremely encouraging, as it is almost certainly true that the evolutionary and coalescent models used in our analyses are a gross simplification of natural processes; yet, we were able to estimate leaf-dates reliably and accurately. In addition to demonstrating the effectiveness of our leaf-dating approach, our DEN-2 virus result also serve to validate the molecular clock methods used, for if terminal nodes throughout the tree are estimated with accuracy then it logically follows the internal nodes, including the tree root, are similarly dated correctly.
For the bison data set, the leaf-ages (as determined by radiocarbon dating) were estimated incorrectly approximately 15% of the time (, supplementary table S2
, Supplementary Material
online). There was no noticeable pattern in the age of the tips for which the leaf-dating method fails. However, in general, the same leaf-ages, which differ significantly from their priors in all cases, are incorrectly estimated under both coalescent models. This suggests two things: first, that the incorrect dates likely arise from some heterogeneity in the evolutionary process that is not accounted for in the models and second, that the leaf-date estimates depend more on the data than on the prior distributions assumed. The MCP process assumes that effective population sizes change according to an exponential Markovian process (Drummond et al. 2005
). In rare cases, with limited information about the root age that occurs when the unknown dated leaf attaches near the root, this nonstationary process can introduce excess variability (); the GMRF, on the other hand, is stationary (Minin et al. 2008
) and modestly outperforms the MCP in these situations.
FIG. 3. Examples of two different leave-one-out analyses for which the true age was not recovered within the 95% HPDs of the leaf-age estimates. In (A), the analysis identifies two similarly likely leaf-ages, whereas in B the analysis identifies a single, precise (more ...)
Although the vast majority of bison sequence ages were estimated correctly, it is unsatisfactory that some were not and important to consider the possible causes of such error. First, and most obviously, the “true” ages of the specimens may have been incorrectly reported or estimated. For example, the radiocarbon ages we used in our analysis may be incorrect or subject to error, in particular as the age of the specimen approaches the limits of this technique.
Second, the sequences themselves may be incorrect or be erroneous in some way. This is particularly problematic for ancient DNA data, where the degraded nature of the samples results in predictable patterns of DNA damage and base misincorporation by the polymerases used in polymerase chain reaction (Hofreiter, Jaenicke, et al. 2001
; Gilbert et al. 2003
; Binladen et al. 2006
). Previously, an analysis of the bison data set incorporating the postmortem damage (PMD) model showed that this particular aDNA data set contained very little damage (Rambaut et al. 2009
). The PMD model assumes that the probability that any given nucleotide remained undamaged decays exponentially with age; the oldest sequences in the bison data set demonstrated, on average, only 0.74 damaged sites, a level that produced no qualitative change in the demographic reconstructions. In its current implementation, the PMD model assumes a common decay rate for every site of every sequence (such that the probability of a site being undamaged decays exponentially with age). A useful extension of the PMD model may be to provide a probabilistic expectation that each individual sequence is damaged, thereby making it possible to identify problematic sequences for additional assessment. Although incorporating both PMD models and leaf-dating in a single analysis is possible, the common decay rate PMD model and leaf-dating are expected to be only weakly identifiable in the sampling density of the observed sequence data, serving primarily to increase variance on the leaf-age estimate without extracting much additional information from the data. Furthermore, the individual decay rate PMD model and a random leaf-age are not identifiable. Consequently, prior assumptions will dominate inference as to whether a sequence is damaged or incorrectly dated.
Third, some aspect of the evolutionary models used may be unrealistic and a possible source of error. We consider each model component in turn:
- (i) The phylogenetic model assumes that the sequences do not undergo recombination. Recombination is very unlikely to be present in our ancient mtDNA bison data set but may be a potential cause of error in the analysis of some viruses that do recombine readily, such as HIV-1 (in which case the method introduced here may be used to “detect” putatively recombinant sequences).
- (ii) The coalescent prior distribution may too constraining or unrepresentative of the tree shapes supported by the data. However, similar results were obtained when the bison data set was estimated under two different coalescent models, each of which is highly flexible, this is unlikely to be a significant cause of estimation error.
- (iii) The molecular clock model did not accurately model temporal sequence evolution. The molecular clock determines the estimated evolutionary timescale and is therefore, a priori, the most likely source of error in estimating leaf-dates. The strict molecular clock used in our analysis does not incorporate rate heterogeneity among lineages, which is common in many heterochronous data sets (Korsten et al. 2009; Magiorkinis et al. 2009). If this variation is ignored then leaves attached to terminal branches that evolve unusually rapidly or slowly will have their ages poorly estimated.
Note that the discussion above does not directly address the biological assumptions underlying model components, such as the coalescent and the molecular clock. This is deliberate, as we wish to highlight a common misconception. In population-level analyses, it is often assumed that it is necessary to assume neutral evolution in order to accurately estimate divergence times using a molecular clock. If divergence times are the primary parameters of interest, then the clock model used can be thought of as a statistical or phenomenological description of the relationship between genetic distance and time rather than an explicit model of a biological process. Viewed this way, all that is important is that the model “fits” the data well, statistically speaking. The same argument can be applied to the coalescent model, which, when being treated as a nuisance parameter, need only represent a suitable range of tree shapes and sizes. In such cases, the assumptions of the coalescent model (e.g., random sampling or panmixis) are not necessary conditions for accurate date estimation. As a result, estimation of leaf-dates is likely to be highly robust provided that the clock model used incorporates sufficient rate heterogeneity.
Although it is often not feasible to discriminate among the sources of error outlined above, it may be possible to identify problematic sequences by evaluating the output of the leaf-dating analysis. For example, a common problem among the small number of analyses that fail to recover the true leaf-age is that the sequences appear to be equally or nearly equally likely to fall in two locations in the genealogy, with each of these resulting in very different leaf-age estimates. This pattern is seen clearly by plotting the estimated marginal posterior distribution leaf-ages, which has an unusually high variance (). This result may be due to errors within the sequence itself, which could potentially be resolved by resequencing. Alternately, very precise leaf-age estimates not containing the true age may indicate a problem with the true age (). For ancient DNA data, recovering an additional radiocarbon date or confirming information about the stratigraphic context of the sample can be useful to rule out this potential source of error. For viral sequences, a reexamination of the documentation associated with the isolate may reveal an annotation or transcription error.
Although the estimated ages recovered by the leaf-dating method are often associated with wide credible intervals, our leaf-dating method provides a means to include in molecular clock analyses data for which little or no temporal information is known. Incorporating additional sequence data can improve the resolution of the phylogenetic, demographic, and geographic history of the sampled sequences and can extend significantly the temporal range of the analysis. Additionally, estimating leaf-ages can provide an independent means of assessing both the authenticity of heterochronous sequences and the ages to which the sequences have been ascribed, which is often a significant concern in ancient DNA research.
The possibility of treating leaf-ages as random variables also allows uncertainty to be modeled explicitly. This enables the incorporation of the uncertainty associated with layer dating, for example, in the form of a uniform prior across the age range of the source stratum. In addition, the error in isotopic dating can be reflected by choosing an appropriate prior distribution for the corresponding leaf-age (Ho and Phillips 2009
We hope that further uses for the leaf-dating method may be found. As one potential application, consider the forensic or archaeological examination of biological tissue from which rapidly evolving viral sequences (e.g., influenza) are recoverable—by estimating the date of such sequences using our method, it will be possible to posit a time of death.