Evolutionary divergence and inferred mutation rates are known to vary across the human genome1–3, and it has long been speculated that this is a consequence of covariance with an epigenetic feature1,2. In human cells, the time of DNA replication exhibits marked regional variability during an S-phase lasting approximately 10 hours4,5. To parallel the conventional division of S-phase into four sequential temporal states (S1-S4), we used a hidden Markov model6 to perform unbiased four-state partitioning of continuous, high-resolution replication timing measurements across 1% of the human genome7. We then determined human-chimpanzee nucleotide divergence rates and the density of SNPs8 at putatively neutrally evolving sites within each temporal state, excluding any bases within annotated exons, repetitive elements, CpG islands, 2-kb regions upstream and downstream of genes, intronic splice sites, and conserved non-coding sequences9 (Supplementary Table S1).
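As an illustration of how a continuous timing signal can be partitioned into four ordered states, the following is a minimal sketch of Viterbi decoding for a Gaussian-emission hidden Markov model. It is not the model used in this study: the state means, the shared standard deviation, the self-transition probability, and the function name `viterbi_gaussian` are all assumptions chosen for the example, and the input values are invented.

```python
import math

def viterbi_gaussian(obs, means, sd, stay_prob):
    """Most likely state path under a Gaussian-emission HMM (log-space Viterbi).

    All parameters are hypothetical: one mean per state, a shared standard
    deviation, and a single self-transition probability for every state.
    """
    n_states = len(means)
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n_states)]
        for i in range(n_states)
    ]

    def log_emit(x, k):
        # Log density of a normal distribution with mean means[k] and sd.
        return -0.5 * ((x - means[k]) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

    # Uniform prior over the starting state.
    score = [math.log(1.0 / n_states) + log_emit(obs[0], k) for k in range(n_states)]
    back = []
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + log_emit(x, j))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Invented replication-timing values scaled to [0, 1]; states 0-3 stand in for S1-S4.
signal = [0.10, 0.12, 0.09, 0.40, 0.38, 0.41, 0.62, 0.60, 0.90, 0.88]
states = viterbi_gaussian(signal, means=[0.125, 0.375, 0.625, 0.875], sd=0.1, stay_prob=0.9)
```

The high self-transition probability is what makes the segmentation resist isolated noisy probes: a single outlying value is cheaper to absorb as an unlikely emission than as two state switches.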
We observed a striking trend relating the rate of evolutionary divergence and the density of human SNPs to the progress of DNA replication. Human-chimpanzee substitutions and human SNP density increase by 22% and 53%, respectively, over the temporal course of replication, and both trends are highly statistically significant (p < 8.43 × 10^-26, Cochran-Armitage). To rule out potential confounding by the overall low genome-wide rate of human-chimpanzee divergence, we also analyzed human-macaque divergence, with similar results (p < 2.7 × 10^-54). We confirmed the absence of bias due to a sampling or stratification effect across different genomic regions by testing (Cochran-Mantel-Haenszel) for three-way interactions, treating region assignment as the controlling variable (p < 7.2 × 10^-12 and p < 0.00026 for human-chimpanzee divergence and human SNPs, respectively). Additionally, we repeated all analyses with an independent set of randomly ascertained SNPs10, with nearly identical effect (p < 9.69 × 10^-22).
Replication time-dependence of evolutionary divergence and human SNP density
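For readers unfamiliar with the Cochran-Armitage test for trend named above, the following is a minimal sketch of its standard form for ordered groups, using a normal approximation for the two-sided p-value. The function name, the equally spaced scores, and the counts in the example are assumptions for illustration; they are not the data or the code behind the reported p-values.

```python
import math

def cochran_armitage(events, totals, scores=None):
    """Cochran-Armitage trend test across ordered groups.

    events[i] : number of sites with a substitution (or SNP) in group i
    totals[i] : number of sites examined in group i
    scores[i] : ordinal score of group i (defaults to 0, 1, 2, ...)
    Returns (z, two_sided_p) using the normal approximation.
    """
    if scores is None:
        scores = list(range(len(events)))
    n_all = sum(totals)
    p_bar = sum(events) / n_all  # pooled event proportion

    # Score-weighted deviation of observed from expected event counts.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, events, totals))

    # Variance of the statistic under the null hypothesis of no trend.
    sum_ns = sum(n * s for s, n in zip(scores, totals))
    sum_ns2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1.0 - p_bar) * (sum_ns2 - sum_ns ** 2 / n_all)
    if variance == 0.0:
        return 0.0, 1.0

    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Invented counts for four temporal states (S1-S4), 1,000 sites each.
z, p = cochran_armitage(events=[10, 12, 14, 16], totals=[1000, 1000, 1000, 1000])
```

With counts this small the trend is not significant; the very small p-values reported in the text reflect the far larger number of sites in the actual analysis.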
Next we examined whether the observed correlation between mutation rate and replication time could be explained by variation in another genomic feature for which replication timing might be acting as a surrogate. Regional variation in G+C content2,3 and, independently, recombination rate2,3 have been invoked as potential causes of human mutation rate variation. We therefore obtained the distribution of G+C content, CpGs, recombination hotspots9, and gene, exon, and conserved non-coding sequence9 densities in sliding non-overlapping 50-kb windows (approximating the size of chromosomal domains linked to replicons) across each temporal replication state (Supplementary Fig. S1). We binned each distribution into three classes (low, medium and high content), with an equal number of windows at each level, and performed separate tests for three-way interactions using each factor as a controlling variable (12 tests in total). All were highly significant, with p-values not exceeding 3.0 × 10^-12, as were repeated tests with the additional permutation re-sampling of temporal states (p < 5.0 × 10^-6 for divergence; p < 2.2 × 10^-4 for SNPs).
Significance of replication time-dependence of evolutionary divergence and human
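The equal-count binning of windows into low, medium and high classes, used to define the strata of the controlled tests, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the rank-based tie handling, and the example values are hypothetical, and the stratified test itself is not reproduced here.

```python
from collections import Counter

def tertile_classes(values):
    """Assign each window to 'low', 'medium' or 'high' by rank.

    Classes hold an equal number of windows when len(values) is a multiple
    of three; ties are broken by original order (an assumption made here).
    """
    names = ("low", "medium", "high")
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = names[rank * 3 // n]
    return labels

def stratum_counts(classes, temporal_states):
    """Count windows in each (feature class, temporal state) stratum."""
    return Counter(zip(classes, temporal_states))

# Invented G+C fractions for six 50-kb windows and their temporal states.
gc_content = [0.35, 0.52, 0.41, 0.60, 0.38, 0.47]
window_states = ["S4", "S1", "S3", "S1", "S4", "S2"]
gc_classes = tertile_classes(gc_content)
strata = stratum_counts(gc_classes, window_states)
```

Within each stratum the controlling feature is approximately held fixed, so a trend across temporal states that persists in every stratum cannot be attributed to that feature alone.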
To address potential interplay between more than one variable, we developed multiple regression models of both divergence and diversity, confirming the independent effect of replication timing (Supplementary Table S2 and Supplementary Fig. S2). These models suggest that replication time alone may explain 40–70% of the variability explained by the full model, and ~8% of overall variability in diversity and divergence. The observed correlation between rates of nucleotide change and replication timing is therefore highly unlikely to be caused by variation in G+C content or by a mutagenic effect of recombination. To rule out any hidden dependence on window size, we repeated all analyses conditioned on smaller (30kb) and larger (100kb) windows, with equivalent results (Supplementary Fig. S3).
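The comparison between variability explained by replication timing alone and by a full model can be illustrated by comparing the coefficient of determination of two least-squares fits. The sketch below is not the regression described in the supplementary material: the function `ols_r2`, the choice of predictors, and the data are assumptions made for the example.

```python
def ols_r2(predictors, y):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept.

    predictors is a list of equal-length columns. The normal equations are
    solved by Gaussian elimination with partial pivoting.
    """
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]

    # Forward elimination.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for j in range(col, k):
                a[r][j] -= factor * a[col][j]
            c[r] -= factor * c[col]

    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (c[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]

    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((v - sum(x * b for x, b in zip(r, beta))) ** 2 for r, v in zip(rows, y))
    return 1.0 - ss_resid / ss_total

# Invented per-window data: temporal state (1-4), G+C fraction, and a response.
timing = [1, 2, 3, 4, 1, 2, 3, 4]
gc = [0.4, 0.5, 0.4, 0.5, 0.5, 0.4, 0.5, 0.4]
divergence = [2.0 * t + 10.0 * g for t, g in zip(timing, gc)]

r2_full = ols_r2([timing, gc], divergence)
r2_timing = ols_r2([timing], divergence)
share_of_explained = r2_timing / r2_full
```

The ratio `share_of_explained` corresponds to the "fraction of the variability explained by the full model" phrasing, while `r2_timing` on its own corresponds to the fraction of overall variability.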
The effects of replication timing on evolutionary divergence and SNP density are highly similar when all other genomic features are controlled. These findings are compatible with a process that impacts mutation rate, which should affect both diversity and divergence in a stable fashion over evolutionary time. Furthermore, the findings persist across the spectrum of selected sites, from ancestral repeats and 4-fold degenerate sites to conserved non-coding sequences and non-degenerate coding sites (Supplementary Fig. S4), and across the human and chimpanzee lineages following the split from macaque (Supplementary Fig. S5).
We next considered whether the relationship with mutation rate might be a consequence of transcription, such as transcription-coupled repair11. To rule this out, we examined introns and intergenic regions separately, and found no significant difference in any parameter (data not shown).
Finally, we examined the possibility that the mutational effect might be restricted to the subset of the genome we analyzed. To test this, we examined a lower-resolution genome-wide data set comprising early- and late-replicating regions mapped in lymphoblastoid cells5. These data also show a mutational effect analogous to that reported above (Supplementary Fig. S6), confirming the generality of our observations.
What molecular mechanism might underlie a monotonic increase in mutation rate during S-phase? One possibility is that late stages of DNA replication are associated with the slowing or stalling of replication forks due to exhaustion of the dNTP pool or difficulty in negotiating heterochromatinized templates, with consequent accumulation of single-stranded DNA (ssDNA) regions12. ssDNA is more susceptible to endogenous and environmental damage, and can potentiate mutagenesis directly31 or via triggering of intra-S-phase checkpoints that set in motion low-fidelity polymerases. Another possibility is that the mismatch repair system might erode during S-phase, or that lesions in late replicating regions simply lack adequate time to undergo effective repair.
To differentiate these scenarios, we examined mutations at CpG dinucleotides, which arise overwhelmingly from spontaneous deamination of methylcytosine to thymine, a process that escapes DNA mismatch repair. Surprisingly, we found that both evolutionary divergence and human nucleotide diversity at CpG sites correlate with replication timing, closely paralleling other types of sites. The parallelism between CpG and non-CpG sites cannot be explained by alterations in the dNTP pool, nor by reduced polymerase fidelity, nor by defective mismatch repair. In addition, we found all classes of evolutionary transitions and transversions to display strong replication timing-dependence with a characteristically similar trend (Supplementary Fig. S7). This indicates that the effect is not due to biases in the genesis of specific mutational events nor to their handling by the repair machinery.
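The distinction drawn above between CpG and non-CpG sites, and between transitions and transversions, can be made concrete with a short classification routine over a pair of aligned sequences. This is a minimal sketch assuming an ungapped pairwise alignment; the function names and the rule of calling a site CpG if it lies in a CpG dinucleotide in either sequence are assumptions for illustration, not the procedure used in this study.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

def substitution_class(base_a, base_b):
    """Return 'transition', 'transversion', or None if the bases match or are ambiguous."""
    a, b = base_a.upper(), base_b.upper()
    if a == b or a not in BASES or b not in BASES:
        return None
    same_family = (a in PURINES) == (b in PURINES)
    return "transition" if same_family else "transversion"

def in_cpg(seq, i):
    """True if position i is the C or the G of a CpG dinucleotide in seq."""
    s = seq.upper()
    is_c_of_cpg = s[i] == "C" and i + 1 < len(s) and s[i + 1] == "G"
    is_g_of_cpg = s[i] == "G" and i > 0 and s[i - 1] == "C"
    return is_c_of_cpg or is_g_of_cpg

def tally_substitutions(seq_a, seq_b):
    """Count differences between two aligned, ungapped sequences by class and CpG context."""
    counts = Counter()
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        kind = substitution_class(a, b)
        if kind is not None:
            counts[(kind, in_cpg(seq_a, i) or in_cpg(seq_b, i))] += 1
    return counts

# Invented aligned fragments standing in for two species.
tallies = tally_substitutions("ACGTAC", "ATGTGA")
```

Tallying each class separately within each temporal state is what allows the trend to be compared across CpG and non-CpG sites and across transitions and transversions.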
Our results therefore suggest that a simple consequence of the process of DNA replication – accumulation of single-stranded DNA within later replicating regions – may provide the most parsimonious explanation. Because ssDNA is highly susceptible to endogenous DNA damage, including alkylation, oxidation and deamination13, accumulation of ssDNA in late-replicating regions would be expected to increase mutation rate across all classes of substitutions, consistent with our observations.
In conclusion, we find a clear and striking relationship between the time at which human genomic DNA sequences replicate and their corresponding mutation rates. Our results affirm longstanding speculation concerning the existence of such a relationship, and they explain limited prior observations of increased SNP density near later replicating genes14. In order for mutations to be propagated, they must arise in the germ line. Our results were obtained using replication timing measurements from somatic cells, suggesting that the somatic replication program largely parallels the temporal landscape of replication in germ cells, which have evaded study owing to their scarcity. Because the replication timing of tissue-specific genes is expected to vary between cell types, it is reasonable to expect that there will be discrepancies between our calculations and those that might be made from germ cells were data available. The correlation reported herein should therefore be regarded as a lower-bound estimate of the actual dependence of mutation rate on replication timing.
Interestingly, exons preferentially reside in early replicating regions (Supplementary Fig. S1) and, consequently, in regions with reduced mutation rate. This observation may have either a mechanistic or a selection-based explanation. We found that replication timing is the dominant factor responsible for the reduced nucleotide diversity around exons. It is also notable that a significant number of human genes controlling developmental fate, differentiation, and cell proliferation are exceptions and undergo replication late in S-phase in most adult cell types15, and that late replication timing is associated with repression of cell fate-modifying genes15. This suggests that increased mutation rate affecting late replicating regions of the human genome may reflect a significant evolutionary cost for sequestering specific gene subsets within a repressed nuclear compartment15.