Genomic integrity requires faithful chromosome duplication. Origins of replication, the genomic sites where DNA replication initiates, are scattered throughout the genome. Their mapping at a genomic scale in multicellular organisms has been challenging. Here we have profiled origins in Arabidopsis by high-throughput sequencing of newly synthesized DNA and identified ~1500 putative origins genome-wide. This was supported by ChIP-chip experiments to identify ORC1 and CDC6 binding sites. Origin activity was validated independently by measuring the abundance of nascent DNA strands. The midpoints of most Arabidopsis origin regions are preferentially located within the 5’ half of genes, slightly enriched in G+C, histone H2A.Z, H3K4me2/3 and H4K5ac, and depleted of H3K4me1 and H3K9me2. Our data establish the basis for understanding the epigenetic specification of DNA replication origins in Arabidopsis and have implications for other eukaryotes.
Faithful duplication of the genetic material is crucial to maintain genomic integrity. DNA replication in eukaryotic cells initiates at multiple sites, known as replication origins, which are scattered throughout the genome 1–3. The number of origins ranges from hundreds to thousands depending on the cell type and/or the physiological state 3. One of the key steps for understanding replication origin function is whether and how they are specified in the genome. In S. cerevisiae, a strict sequence-dependent specification occurs whereby the origin recognition complex (ORC) recognizes an 11bp sequence to define the site of each active replication origin 4,5. This mechanism appears to be a rather unique situation because a consensus sequence has not been found in other organisms. For example, in S. pombe, although origins are associated with A+T-rich stretches they are not specified by a known consensus DNA sequence 6,7.
The identification of the molecular nature of replication origins in multicellular organisms has been elusive and only a handful of them have been analyzed 2,8–10. The large genome size of multicellular eukaryotes, their different developmental strategies and the existence of a diversity of proliferating cell populations have led to increased difficulty in determining origin specification, function and spatio-temporal regulation at a genomic scale 3. Local epigenetic modifications can further affect origin selection and usage, e.g. replication timing 3,11–13. Although attempts to obtain genome-wide maps of replication origins in mammalian cells have been reported 14–17, the molecular features defining replication origins in higher eukaryotes and, in particular, their links to epigenetic modifications still remain largely unknown.
In this study, we have identified replication origins, analyzed their organization, and defined their epigenetic signatures at a high-resolution genome-wide scale in the plant Arabidopsis thaliana. Its rather compact genome (~125Mb, ~28,000 protein coding genes), fully sequenced and annotated, and the relatively small amount of repetitive sequences (~17%), largely confined to the pericentromeric areas 18, make Arabidopsis an excellent system to study origins. Furthermore, the comparison of replication origin features in organisms with very different developmental and growth strategies could shed light on the basic principles governing origin specification and function in eukaryotes. In addition, genome-wide maps of epigenetic marks such as DNA methylation and several histone modifications have already been reported 19–21. Massive sequencing of short-pulse BrdU-labeled DNA allowed us to identify ~1500 putative replication origins across the Arabidopsis genome. ORC1 and CDC6 binding regions, which, importantly, are enriched in BrdU-labeled regions, were also identified by chromatin immunoprecipitation and microarray experiments (ChIP-chip). Furthermore, origin activity was validated independently by measurement of nascent DNA strand abundance. Our studies reinforce the idea that some origin features are shared with animal cells whereas others are unique to plants 22,23. The Arabidopsis “originome” reported here provides a basis for identifying the key features of eukaryotic replication origins and delineating their possible regulatory mechanisms.
Functional origins mark the sites where the synthesis of nascent DNA strands occurs. Thus, our strategy was to sequence purified DNA labeled in vivo with a pulse of BrdU and confirm these data with the mapping of pre-RC binding (Supplementary Fig. 1). To obtain sufficient amounts of BrdU-labeled DNA, we used Arabidopsis cultures that contain a substantial amount of proliferating cells. We synchronized cells in G0 using sucrose deprivation and labeled them with BrdU a few hours after release from the block, when cells are just entering S-phase 24,25 (Fig. 2 and Supplementary Fig. 1). DNA was extracted and fractionated by CsCl gradient centrifugation, and the BrdU-labeled material was purified and used to generate genomic libraries for sequencing using the Solexa (Illumina) technology. We obtained a total of ~4 million high quality reads that uniquely mapped to the Arabidopsis genome. Likewise, a sample of unlabeled DNA was processed as a control (see Methods). This BrdU-seq method rendered a comprehensive list of genomic locations with a significant enrichment in BrdU-labeled DNA strands (Fig. 1a). To define origin regions using the BrdU-labeled DNA sequencing data we merged BrdU positive regions separated by <10 kb, as described in Methods (see also Supplementary Fig. 3). An alignment of DNA sequences of ±100bp around the midpoint of BrdU-labeled regions did not render any consensus sequence. To corroborate the analysis of BrdU-labeled regions and account for possible experimental variation, we carried out an independent assay of cell synchronization, BrdU-labeling and CsCl purification followed by massively parallel sequencing. Significantly, 78.2% (p<1.0e-6) of the BrdU-labeled regions overlapped with the regions defined in the previous experiment, supporting the reproducibility of the two independent experiments.
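The merging of BrdU-positive regions separated by less than 10 kb can be illustrated with a minimal sketch. It is not the pipeline described in Methods: the function name `merge_close_regions`, the `(chromosome, start, end)` tuple layout and the toy coordinates are assumptions made here for illustration only.

```python
from typing import List, Tuple

# Hypothetical region layout: (chromosome, start, end) in base pairs.
Region = Tuple[str, int, int]


def merge_close_regions(regions: List[Region], max_gap: int = 10_000) -> List[Region]:
    """Merge regions on the same chromosome separated by less than max_gap bp."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] < max_gap:
            # Gap to the previous region is below the threshold: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            # Different chromosome or gap too large: start a new region.
            merged.append((chrom, start, end))
    return merged


# Toy coordinates, not data from the study.
example = [
    ("Chr1", 1_000, 2_000),
    ("Chr1", 8_000, 9_500),
    ("Chr1", 40_000, 41_000),
    ("Chr2", 500, 900),
]
print(merge_close_regions(example))
```

In this toy input the first two Chr1 regions are 6 kb apart and collapse into one, whereas the third lies more than 10 kb away and stays separate.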
To identify pre-RC binding sites, in the absence of specific antibodies, we used plants expressing constitutively tagged versions of two pre-RC components, ORC1 (ref. 26) and CDC6 (ref. 27). ORC1- and CDC6-bound DNA fragments were purified by chromatin immunoprecipitation (ChIP) (Supplementary Fig. 1) and hybridized to whole-genome Arabidopsis tiling arrays to identify their genome-wide binding sites (Fig. 1a). ORC1 binding was spread over numerous sites (Supplementary Fig. 4) whereas CDC6 binding sites were less abundant (Supplementary Fig. 5). First, we determined the fraction of the BrdU-labeled regions that contained bound pre-RC components. We found that ~76.7% and 17.0% of BrdU-labeled regions overlapped with ORC1 and CDC6 regions, respectively (midpoint of BrdU region ±2.1 kb, p<0.001; see colocalization range in Fig. 1b). More importantly, the midpoints of these regions significantly colocalized with both ORC1 and CDC6 binding sites within ±2kb regions (Fig. 1b). Therefore, the 1543 regions rendered by our approach were considered bona fide replication origins (Supplementary Table 1). They appear uniformly distributed across the genome, although it is possible to identify clusters of more closely spaced origins in some genomic locations (Supplementary Fig. 6). The number of origins varies for different chromosomes but they roughly correlate with chromosome size (Fig. 1c). The distribution of distances between origin region midpoints gave a median of 51.1 kb, with a mean of 77.2 kb (Fig. 1d).
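The two summary statistics used above, the fraction of BrdU-labeled regions whose midpoint lies near a pre-RC binding site and the distances between neighboring origin midpoints, can be sketched as follows. This is an illustrative sketch only: the function names, the single-chromosome `(start, end)` interval layout, the default ±2 kb window and the toy coordinates are assumptions, and the significance testing reported in the text is not reproduced.

```python
from statistics import mean, median
from typing import List, Tuple

# Hypothetical layout: (start, end) in bp, all on a single chromosome.
Interval = Tuple[int, int]


def midpoint(region: Interval) -> int:
    """Integer midpoint of a region."""
    return (region[0] + region[1]) // 2


def fraction_near_sites(regions: List[Interval], sites: List[Interval],
                        window: int = 2_000) -> float:
    """Fraction of regions whose midpoint +/- window overlaps at least one site."""
    if not regions:
        return 0.0
    hits = 0
    for region in regions:
        lo, hi = midpoint(region) - window, midpoint(region) + window
        if any(s_start <= hi and s_end >= lo for s_start, s_end in sites):
            hits += 1
    return hits / len(regions)


def inter_midpoint_distances(regions: List[Interval]) -> List[int]:
    """Distances between consecutive region midpoints along one chromosome."""
    mids = sorted(midpoint(r) for r in regions)
    return [b - a for a, b in zip(mids, mids[1:])]


# Toy coordinates, not data from the study.
brdu_regions = [(10_000, 12_000), (60_000, 64_000), (150_000, 151_000)]
binding_sites = [(11_500, 11_900), (70_000, 70_400)]

fraction = fraction_near_sites(brdu_regions, binding_sites)
distances = inter_midpoint_distances(brdu_regions)
print(fraction, distances, median(distances), mean(distances))
```

A linear scan over all sites is adequate for a toy example; for genome-scale data an interval index or sorted sweep would be the more usual choice.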
The BrdU-labeled regions identified in our study and the marked colocalization with ORC1- and CDC6-binding sites strongly support the notion that they represent active DNA replication origins. To assess origin activity directly we measured the relative abundance of nascent DNA strands of various putative origin regions relative to adjacent regions in a sample of short DNA molecules purified by sucrose gradient centrifugation and containing a RNA primer at their 5’ end 28,29. Thus, origin activity was determined by real-time PCR methods using primer pairs spanning 5–16kb around putative origin regions. In all cases analyzed, we could demonstrate a high enrichment of origin sequences in the short nascent DNA strand sample (Fig. 2a-c). Importantly, one of the BrdU-labeled regions included in this analysis was one showing a relatively low CDC6 signal in the ChIP-chip experiment (Fig. 2a). In spite of this, it showed a high abundance of nascent DNA strands measured by qPCR, demonstrating the activity of this region as a functional origin as well as the robustness of our approach. A control region, lacking BrdU-labeled DNA sequences, did not show any appreciable enrichment (Fig. 2d). These data together led us to conclude that the set of origins identified here provides a solid starting point to define their molecular landscape.
To test whether origins are randomly distributed along the genome or show a preferential location, we estimated origin location relative to various genomic elements. We found that 77.7% and 10.2% of origins colocalized with gene units and transposons, respectively. These percentages are significantly different from the proportion of the Arabidopsis genome represented by these elements (Fig. 3a). Next, we analyzed origin density across genes and their 5’ upstream and 3’ downstream regions. We observed that most origins were identified within gene bodies (Fig. 3b), but preferentially towards their 5’ ends (Supplementary Fig. 7). Origin localization to the bodies of genes did not correlate with gene expression levels (Fig. 3b), according to expression data obtained from cell suspensions at the same synchronization time used for BrdU labeling 30. However, highly expressed genes, compared to lowly expressed genes, tended to have more origins in regions immediately upstream (Wilcoxon rank-sum test, p<0.005) or downstream (Wilcoxon rank-sum test, p<0.01) of genes (Fig. 3b).
The body of highly expressed genes in Arabidopsis is enriched in CG methylation whereas the three types of C methylation (CG, CHG and CHH, where H is A, T or C) are highly enriched in the repeat-rich pericentromeric regions of the Arabidopsis genome 19,31. Interestingly, we found a slight decrease in CG methylation levels around origin midpoints compared to regions flanking them (Fig. 4a). Furthermore, we observed that regions ±0.1kb around the origin midpoints showed higher G+C contents (44.5%), compared to the whole Arabidopsis genome (Fig. 4b). It is known that the histone variant H2A.Z is preferentially deposited near the 5’ end of target genes and anticorrelates with CG methylation 32. We found a strong correlation between the presence of H2A.Z within ±1kb and the origin midpoints (Fig. 4c).
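The G+C content of a window around an origin midpoint can be computed as in the following sketch. The function names and the toy sequence are assumptions made for illustration; ambiguous bases such as N are simply ignored here, which may differ from the treatment used in the actual analysis.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T) in a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc_around_midpoint(chrom_seq: str, mid: int, flank: int = 100) -> float:
    """G+C fraction in the window [mid - flank, mid + flank), clipped to the sequence."""
    start = max(0, mid - flank)
    end = min(len(chrom_seq), mid + flank)
    return gc_fraction(chrom_seq[start:end])


# Toy sequence, not real genomic data: a G+C-rich core flanked by A+T-rich DNA.
toy_chrom = "AT" * 100 + "GC" * 50 + "AT" * 100
print(gc_around_midpoint(toy_chrom, mid=250, flank=100))
print(gc_fraction(toy_chrom))
```

Comparing the windowed value with the value for the whole sequence mirrors the comparison made in the text between origin midpoints and the genome average.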
To further determine features defining Arabidopsis replication origins we next sought to profile the landscape of epigenetic histone marks that appear to associate with replication origins. Arabidopsis epigenomics data are already available for dimethylation of histone H3 at lysine 9 (H3K9me2) and for the three methylated forms of H3K4 20,21. We found that most origins tend to be depleted of H3K4me1 (Fig. 5a) but are highly enriched in H3K4me2 and H3K4me3 (Fig. 5b-c). In fact, we observed that H3K4me3 and/or H3K4me2, with or without H3K4me1, appears to be a signature of ~80% of origins associated with genes (Fig. 5e). This is consistent with the preferential localization of origins in 5’ gene body regions observed here and the anticorrelation of these marks and CG methylation 21. Furthermore, H3K9me2 is highly depleted in most of the origins identified in our study (Fig. 5d).
A correlation exists between histone hyperacetylation and origin activation in Xenopus 11 and Drosophila cells 33–35. Consistent with this, immunofluorescence data obtained in several plant species indicate that increases in histone acetylation occur during S-phase 22,23. Recently, ChIP experiments have revealed that H4K5 and H4K12 (also H4K8 to a lesser extent) but not H4K16 need to be acetylated by the HBO1 histone acetylase at origins in human cells to overcome geminin inhibition and facilitate MCM loading 36. Thus, we profiled H4K5ac over the genome by ChIP-chip and found an enrichment of this mark at the origin midpoint (Fig. 5f).
Initiation of DNA replication in eukaryotes depends on the assembly of pre-replication complexes (pre-RC) in G1 of the cell cycle at certain chromosomal locations and their subsequent activation to initiate DNA replication in S-phase. Both steps must be tightly coordinated to ensure that the genome is duplicated once per cell cycle 2. We have found that ORC1 binding sites tend to form clusters, a situation similar to Drosophila cells 37 but highly different from that of S. cerevisiae 5. The presence of ORC1 binding sites across the genome may represent not only broad initiation zones with several potential initiation sites but also reflect the function of ORC1 in other processes, e.g. heterochromatin silencing 38, transcriptional control 26,39 or chromatid cohesion 40. In any case, detection of CDC6 in BrdU regions is highly valuable taking into account the release of CDC6 from the pre-RC once an origin has fired 41.
The distribution of distances between origin region midpoints rendered values that fall within the range estimated for other eukaryotes 42 and roughly match estimations of replicon size in Arabidopsis 13,43. It is possible that a fraction of the putative origin regions identified here correspond to elongating forks rather than to initiation events. However, our direct measurements of origin activity by abundance of RNA primer-containing nascent strands support the idea that the “originome” reported here is a bona fide list of putative Arabidopsis DNA replication origins. Future analysis should address this point individually. The abundance of origin sequences and the width of the peak of amplified fragments varied for different origins analyzed, suggesting differences in the efficiency of origin usage or in the usage of initiation sequences within an origin region 10,42.
Interestingly, the location of most Arabidopsis origins is different from other systems in which a large proportion of highly efficient origins are associated with gene promoters or transcriptional start sites 16,17,37. We have found that the ±0.1kb region around Arabidopsis DNA replication origins possesses a higher than average G+C content and a slight decrease in CG methylation. Consistent with this observation, early-mid replicons in Arabidopsis chromosome 4 have also been found to be depleted of CG methylation 13. One possibility is that in Arabidopsis the relatively high G+C content at origins favors a particular nucleosome organization in these regions. This is reinforced by the colocalization of origins with histone H2A.Z, which affects nucleosome stability 44, and could facilitate pre-RC assembly and/or origin firing. Together, our data show that whereas CG methylation within gene bodies is relevant for gene expression in Arabidopsis 19, it does not seem to be a requirement for origins. Metazoan origins highly correlate with unmethylated CpG islands located at the promoter of active genes or in proximity to transcriptional start sites 6,42. While CpG islands are not present in the Arabidopsis genome, our results revealed a conserved trend of having relatively lower CG methylation at origins and show a high correlation between origin activity, a local high G+C content and presence of H2A.Z.
Posttranslational histone modifications can also affect origin specification and function. Most Arabidopsis origins tend to be enriched in H3K4me2 and H3K4me3, as well as in H4K5ac, similar to human origins 17,36. Whether all human origins have the same H4ac pattern, as a consequence of HBO1 activity to overcome geminin inhibition 36, and whether all Arabidopsis origins require H4ac for activation remain open questions for the future. However, the H4 acetylation pattern is of particular relevance due to the presence in Arabidopsis of (i) an HBO1-related acetyltransferase 45, (ii) increased tetraH4ac residues around Arabidopsis ORC1-binding sites 26 and (iii) a CDT1-interacting protein, GEM, structurally unrelated to metazoan geminin 46,47. Acetylation of other histone residues may also be relevant for origin function, as suggested by the presence of H3K56ac in early replicons of chromosome 4 (ref. 13).
How replication origins are specified in large eukaryotic genomes has been a long-standing question. The association of early-firing origins with transcribed genomic regions has been reported 48,49. Origins that have been studied in 0.4–1% of mammalian genomes show a preferential association with active promoters that contain CpG islands 15–17. We have found that origins located in the upstream regions of genes are preferentially associated with highly expressed genes. However, the differences in the genomic distribution of the CG methylation pattern in Arabidopsis may contribute to the use of different mechanisms to specify origins. In fact, a higher proportion of origins in Arabidopsis are located in the 5’ half of gene bodies compared to mammalian cells.
Our work has defined a landscape of epigenetic marks associated with a genome-wide set of replication origins in Arabidopsis. The midpoints of most origin regions preferentially colocalize with a significantly higher than average G+C content, but lower CG methylation level, and are enriched in histone H2A.Z, H3K4me2/3 and acetylated H4K5, and depleted in H3K4me1 and H3K9me2. Elucidating how epigenetic mechanisms and gene expression coordinate with DNA replication is of primary importance for understanding these processes in a genomic and developmental context. The Arabidopsis “originome” reported here provides the foundation for future studies to identify the mechanisms of origin specification as well as the regulation and function of DNA replication origins in different eukaryotes.
Methods and any associated references are available in the online version of the paper at http://www.nature.com/nsmb/.
We thank E. Martinez-Salas, J.A. Tercero and E. Caro for comments and discussions, Sara Diaz-Triviño and P. Hernandez for initial efforts in origin mapping, and M. Gomez and J. Sequeira-Mendes for advice with the purification and analysis of nascent DNA strands. The technical help of V. Mora-Gil is gratefully acknowledged. M.P.S. and C.C. are recipients of JAE-Doc contracts from CSIC. S.F. is a Howard Hughes Medical Institute Fellow of the Life Sciences Research Foundation. Research has been supported by grants BFU2006-5662, BFU2009-9783 and CSD2007-00057-B (Ministry of Science and Education) and P2006/GEN0191 (Comunidad de Madrid) to C.G., by an institutional grant from Fundación Ramón Areces to CBM, by grant GM60398 (National Institutes of Health) to S.E.J. and by grants BIO2004-02502, BIO2007-66935, GEN2003-20218-C02-02 and CSD2007-00057-B (Ministry of Science and Innovation) and GR/SAL/0674/2004 (Comunidad de Madrid) to R.S. S.E.J. is an investigator of the Howard Hughes Medical Institute.
Accession codes. The NCBI GEO accession numbers for the datasets generated in this work are GSE21928 (for ORC1 and CDC6 ChIP-chip) and GSE21828 (for BrdU-seq and H4K5ac ChIP-chip).
Supplementary information. Supplementary information is available at the NSMB website.
AUTHOR CONTRIBUTIONS
C.C., M.P.S., Y.Y., S.F., A.B. and I.L.-V. performed experiments. H.S., J.C.O., C.C., M.P.S., X.Z. and R.S. analyzed data. C.G. and S.E.J. prepared the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.