|Home | About | Journals | Submit | Contact Us | Français|
A major yet unresolved quest in decoding the human genome is the identification of the regulatory sequences that control the spatial and temporal expression of genes. Distant-acting transcriptional enhancers are particularly challenging to uncover because they are scattered among the vast non-coding portion of the genome. Evolutionary sequence constraint can facilitate the discovery of enhancers, but fails to predict when and where they are active in vivo. Here we present the results of chromatin immunoprecipitation with the enhancer-associated protein p300 followed by massively parallel sequencing, and map several thousand in vivo binding sites of p300 in mouse embryonic forebrain, midbrain and limb tissue. We tested 86 of these sequences in a transgenic mouse assay, which in nearly all cases demonstrated reproducible enhancer activity in the tissues that were predicted by p300 binding. Our results indicate that in vivo mapping of p300 binding is a highly accurate means for identifying enhancers and their associated activities, and suggest that such data sets will be useful to study the role of tissue-specific enhancers in human biology and disease on a genome-wide scale.
The initial sequencing of the human genome1,2, complemented by effective computational and experimental strategies for mammalian gene discovery3,4, has resulted in a virtually complete list of protein-coding sequences. In contrast, the genomic location and function of regulatory elements that orchestrate gene expression in the developing and adult body remain more obscure, hindering studies of their contribution to developmental processes and human disease. Evolutionary constraint of non-coding sequences can predict the location of enhancers in the genome5-12, but does not reveal when and where these enhancers are active in vivo. Furthermore, it has been suggested that a substantial proportion of regulatory elements is not sufficiently conserved to be detectable by comparative genomic methods13-16.
Chromatin immunoprecipitation coupled to massively parallel sequencing (ChIP-seq) has been shown to enable genome-wide mapping of protein binding and epigenetic marks17-22. The ChIP-seq approach is dependent on the cross-linking of proteins to specific DNA elements, followed by antibody enrichment of the protein-DNA complexes, and high-throughput sequencing of the recovered DNA fragments. In principle, ChIP-seq using an antibody specific for an enhancer-binding protein could provide a conservation-independent approach for the identification of candidate enhancer sequences.
The acetyltransferase and transcriptional coactivator p300 is a near-ubiquitously expressed component of enhancer-associated protein assemblies and is critically required for embryonic development23-27. In homogeneous cell preparations, p300 has been shown to be associated with enhancers28,29, but these in vitro studies provided access only to subsets of enhancers that are active in a given cell type under culture conditions, providing limited insight into their in vivo function. In the present study, we have determined the genome-wide occupancy of p300 in forebrain, midbrain and limb tissue isolated directly from developing mouse embryos. Using a transgenic mouse reporter assay, we show that p300 binding in these embryonic tissues predicts with high accuracy not only where enhancers are located in the genome, but also in what tissues they are active in vivo. Depending on the tissue type, the success rate of predicting forebrain, midbrain and limb enhancers was between 5- and 16-fold increased compared to previous studies in which such enhancers were discovered by comparative genomics10,11.
To generate genome-wide maps of p300 binding in vivo, we micro-dissected forebrain, midbrain and limb tissue from more than 150 embryonic day 11.5 (E11.5) mouse embryos and performed ChIP directly from these tissue samples using a p300 antibody (Fig. 1). Immunoprecipitated DNA fragments were analysed using massively parallel sequencing and the resulting 36 base pair (bp) sequence reads were aligned to the reference mouse genome17,30.
After appropriate quality filtering, between 2.4 and 3.6 million aligned reads obtained from each of the tissue samples were used to identify regions of the genome with significant enrichment in p300-associated DNA sequences, hereafter referred to as ‘peaks’ owing to their appearance in genome-wide density plots17 (Supplementary Table 1). Using an estimated false-discovery rate (FDR) threshold of <0.01, we identified 2,543, 561 and 2,105 peaks from forebrain, midbrain and limb respectively (Supplementary Tables 2-4). Most peaks were located at least 10 kb from transcript start sites (Supplementary Fig. 1). The smaller number of peaks from midbrain is probably due to variability in the efficiency of enrichment by ChIP (Supplementary Fig. 2). Re-sampling of subsets of data suggests that the major p300-binding sites from these three tissues have been discovered, whereas with increased sequencing coverage it is anticipated that further binding sites can be identified that are occupied only in smaller subsets of cells within each tissue (Supplementary Fig. 3). Although most genomic regions with in vivo p300 binding were identified by peaks in a single tissue, there were 386 regions at which peaks were observed in two tissues, and 21 regions at which p300 peaks were observed in all three tissues (Supplementary Fig. 4).
To test directly whether p300 binding in developing mouse tissues is indicative of enhancer activity in vivo, we selected 86 regions with a p300 peak in at least one of the tissues for analysis in transgenic mice, comprising a total of 122 individual predictions of enhancer activity in specific tissues (Supplementary Table 5). These elements were selected blind to the identity of genes near which they are located, showed a wide range of evolutionary conservation with other vertebrate species (see Methods) and approximately reflect the genome-wide distribution properties of p300 peaks among intronic and intergenic regions, as well as their distances relative to known genes (Supplementary Fig. 1).
We cloned the human genomic sequences orthologous to these enhancer candidate regions into an enhancer reporter vector and generated transgenic mice as previously described10,31. For each of the 86 candidate enhancers, several independent transgenic embryos (average of n = 8) were assessed for reproducible reporter gene expression. A pattern was considered reproducible if the same anatomical structure was stained in three or more embryos. In almost all cases, this minimum threshold was exceeded and reproducible reporter staining in forebrain, midbrain or limb was present on average in more than 80% of the embryos obtained per construct (Supplementary Table 5).
First we determined whether p300 binding was predictive of reproducible in vivo enhancer activity regardless of their tissue specificity. Considering peaks from each of the three p300 data sets separately, 55 out of 63 (87%) forebrain predictions, 30 out of 34 (88%) midbrain predictions and 22 out of 25 (88%) limb predictions were active enhancers in vivo at E11.5 as defined by reproducible LacZ staining (Fig. 2, grey and coloured bars). Overall, 87% (75 out of 86) of the tested elements were reproducible enhancers at E11.5. This compares with a success rate for predicting enhancers of 47% (246 out of 528) from our previous studies in which elements were identified on the basis of their extreme evolutionary conservation and tested using the same transgenic mouse assay10,11. Thus, the rate of false-positive predictions using p300 ChIP-seq was more than fourfold lower than that with extreme evolutionary conservation (13% compared to 53% previously; P = 4.2 × 10−10, Fisher's exact test).
We next determined the accuracy with which p300 binding predicts the tissue in which enhancer activity will occur. Of the 63 tested elements that overlapped a forebrain p300 peak, 49 (78%) were found to have reproducible enhancer activity in the developing forebrain (Fig. 2, blue). Similarly, 28 out of 34 (82%) tested elements identified by midbrain p300 enrichment (Fig. 2, red), and 20 out of 25 (80%) tested elements identified by limb p300 enrichment (Fig. 2, green), were confirmed to be active in the predicted tissue. The 86 tested elements included 32 sequences that were identified by p300 binding in more than one tissue. Of these, 27 out of 32 (84%) were active in at least one of the predicted tissues, and 22 sequences (69%) perfectly recapitulated the predicted expression patterns (Supplementary Table 6).
To assess the degree of enrichment of enhancer activities in predicted tissues, we compared the relative frequency of enhancers for each of the three tissues examined here with a background set of 528 previously tested sequences predicted to be developmental enhancers on the basis of extreme sequence constraint that were not associated with a prior tissue specificity prediction10,11. For example, whereas forebrain enhancers account for only 16% (86 out of 528) of the tested elements identified through comparative approaches, 78% (49 out of 63) of elements predicted by forebrain p300 peaks were found to be active enhancers in the forebrain (Fig. 2). Forebrain predictions are therefore fivefold enriched in forebrain enhancers compared with enhancers identified through comparative approaches (P < 1 × 10−22). Similarly, we observed a sixfold enrichment of midbrain enhancers (P < 1 × 10−11) and a 16-fold enrichment of limb enhancers (P < 1 × 10−18) at midbrain and limb p300 peaks, respectively. Representative examples of enhancers identified by ChIP-seq are shown in Fig. 3 and detailed annotations and reproducibility across transgenic mice for all elements tested in this study can be found at http://enhancer.lbl.gov (ref. 32). Taken together, these results indicate that p300 peaks are a highly accurate predictor of in vivo enhancers and their spatial activity patterns.
Previous studies have indicated a positive correlation between enhancer activity during development and non-coding sequence conservation6,8-11,33, but it has also been suggested that not all regulatory elements in vertebrate genomes are under detectable evolutionary constraint13-16. To test whether p300 binding in E11.5 tissues is generally associated with evolutionarily constrained non-coding sequences, we determined if ChIP-seq reads are overall enriched at previously identified extremely conserved non-coding sequences9,11. We observed strong enrichment of p300 ChIP-seq reads at these conserved sequences, but not at random sites or exons (Fig. 4 and Supplementary Table 7). Vice versa, between 86% and 91% of the p300 peaks overlap sequences that are under evolutionary constraint in vertebrates34, compared to less than 30% of size-matched random regions (P < 1 × 10−172, Fisher's exact test; Supplementary Fig. 5). Using a more stringent constraint threshold score, we observed that between 10% and 21% of peaks are highly constrained, compared to 1% of random regions (P < 1 × 10−82). These results indicate that most p300 peaks in the investigated tissues are under evolutionary constraint and support a global enrichment of p300 in highly conserved non-coding regions of the genome previously correlated with developmental enhancers.
To examine the correlation of p300-enriched regions in embryonic tissues with the transcriptional regulation of neighbouring genes, we compared the genomic distribution of p300 peaks in E11.5 forebrain with gene expression data from this tissue. Using high-density microarrays, we identified a set of 885 genes that are overexpressed in fore-brain at E11.5 compared to whole embryos (Supplementary Table 8). When we compared the genomic position of these forebrain genes to the genome-wide distribution of 2,453 forebrain-derived p300 peaks, we observed that the intervals 90 kb up- and downstream of their promoters are 2.4-fold enriched overall in p300-binding sites (P < 0.05, Fig. 5a). In total, 14% of all forebrain p300 peaks are located within 101 kb from a promoter of a forebrain-overexpressed gene. The most pronounced enrichment (4.8-fold, P < 0.01) was observed within 10 kb up- and downstream of promoters of forebrain-specifically expressed genes. In contrast, forebrain peaks are not enriched near genes over-expressed in other parts of the body (Fig. 5b, Supplementary Table 9). Near genes that were overexpressed fivefold or more in the fore-brain, even higher enrichment of forebrain peaks was observed (11-fold enrichment within 10 kb from promoters, data not shown). We found similar enrichment of limb-derived p300 peaks near limb-overexpressed genes (Supplementary Fig. 6 and Supplementary Tables 10 and 11). These observations are consistent with the sequences bound by p300 in the forebrain or limb of day 11.5 embryos being enhancers that drive the expression of adjacent genes in these tissues at this time point.
In the present study, we have determined the genome-wide distribution of the transcriptional coactivator protein p300 (ref. 23) using ChIP-seq17 directly from developing mouse tissues. Notably, enrichment of p300 in different mouse tissues correctly predicted the spatial enhancer activities of human non-coding sequences in 80% of cases tested in a transgenic mouse assay, whereas absence of p300 enrichment correlated in 93% of cases with absence of enhancer activity in the respective tissue (Supplementary Table 5). The few elements that did not drive reporter gene expression in the tissue predicted by p300 ChIP-seq may represent cases in which the function of regulatory elements has diverged between the mouse sequences identified by ChIP-seq and the human orthologous regions tested in the transgenic mouse assay. In support of this hypothesis, we observed several cases in which the non-coding p300-bound region from mouse, but not the orthologous human sequence, had reproducible enhancer activities as predicted by p300 ChIP-seq from mouse tissues (data not shown). Taken together, the present approach provides a markedly improved specificity for locating enhancers in the human genome compared to conservation-based methods10,11 and also predicts their in vivo activity patterns with higher accuracy than motif-based computational methods available at present (for example, refs 35, 36).
Most p300-binding regions identified in developing mouse tissues are under detectable evolutionary constraint. They typically overlap conserved non-coding sequences whose length (median of 113 bp) far exceeds that of an individual transcription factor binding site, suggesting the presence of larger functional modules. In cell culture-based chromatin studies, a sizeable fraction of non-coding regions in the human genome was found to be functional yet not constrained13,14. This apparent discrepancy might be due to differences in evolutionary constraint between enhancers active in developing tissues compared to those in individual cell types, but highlights the intrinsic challenge of inferring in vivo functionality from studies in cell culture.
A generalized picture of the epigenetic marks and proteins associated with different types of functional non-coding elements has started to emerge from genome-wide chromatin studies13,18,28,37-41. We can now begin to use these signatures to unravel gene regulation on a genomic scale in the context of living organisms. The highly specific approach for identification of developmental enhancers and their activity patterns presented here represents a step in this direction. Complementary in-vivo-derived genomic data sets may be produced in the future, covering additional embryonic stages, anatomical regions and subregions, and perhaps considering extra molecular markers28,42-45. Focused experiments informed by such insights will expedite studies of the genome-wide activity dynamics of enhancers in developmental, physiological and pathological processes.
Embryonic forebrain, midbrain and limb tissue was isolated from mouse embryos at E11.5. Cross-linking, chromatin isolation, sonication and immuno-precipitation using an anti-p300 antibody were performed as previously described40,46. ChIP DNA was further sheared by sonication, end-repaired, ligated to sequencing adapters and amplified by emulsion PCR as previously described47. Gel-purified amplified ChIP DNA between 300 and 500 bp was sequenced on the Illumina Genome Analyzer II platform to generate 36-bp reads.
Sequence reads were aligned to the mouse reference genome (mm9) using BLAT48. Uniquely aligned reads were extended to 300 bp in the 3′ direction and used to determine the read coverage at individual nucleotides at 25-bp intervals throughout the mouse genome. p300-enriched regions (peaks) with an estimated FDR of ≤ 0.01 were identified by comparison with a random distribution of the same number of reads. Candidate peaks mapping to repetitive regions were removed as probable artefacts.
Candidate regions for transgenic testing were selected based on ChIP-seq results and cover a wide spectrum of conservation. Enhancer candidate regions were amplified by PCR from human genomic DNA and cloned into an Hsp68-promoter-LacZ reporter vector as previously described6,31. Transgenic mouse embryos were generated and evaluated for reproducible LacZ activity at E11.5 as previously described6.
Total RNA from E11.5 whole embryos and forebrain tissue was hybridized to GeneChip Mouse Genome 430 2.0 arrays (Affymetrix) and analysed according to the manufacturer's recommendations. Forebrain- and whole-embryo-enriched genes were identified as having at least 2.5-fold greater expression in one data set compared with the other, and a minimum signal intensity of 100. Limb-enriched genes were identified by comparison with publicly available wild-type E11.5 proximal hindlimb gene expression data (Gene Expression Omnibus (GEO) series GSE10516, samples GSM264689, GSM264690 and GSM264691)49.
Embryonic forebrain, midbrain and limb tissue was isolated from timed-pregnant CD-1 strain mouse embryos at E11.5 by microdissection in cold PBS along the anatomical boundaries indicated in Fig. 1. Tissue samples were cross-linked (1% formaldehyde, 10 μM NaCl, 100 mM EDTA, 50 μM EGTA, 5 mM HEPES, pH 8.0) for 15 min at room temperature. Cross-linking was terminated by the addition of 125 mM glycine and cells were dissociated in a glass douncer. Chromatin isolation, sonication and immunoprecipitation were performed as previously described40,46.In brief, 1 mg of sonicated chromatin (OD260) was incubated with 10 μg of antibody (rabbit polyclonal anti-p300 (C-20), Santa Cruz Biotechnology) coupled to IgG magnetic beads (Dynal Biotech) overnight at 4 °C. The magnetic beads were washed eight times with RIPA buffer (50 mM HEPES, pH 8.0, 1 mM EDTA, 1% NP-40, 0.7% DOC and 0.5 M LiCl, supplemented with complete protease inhibitors from Roche Applied Science), and washed once with TE buffer (10 mM Tris, pH 8.0, 1 mM EDTA). After washing, bound DNA was eluted at 65 °C in elution buffer (10 mM Tris, pH 8.0, 1 mM EDTA and 1% SDS) for 10 min and incubated at 65 °C overnight to reverse cross-links. After the reversal of cross-linking, immunoprecipitated DNA was treated sequentially with proteinase K and RNase A, and desalted using the QIAquick PCR purification kit (Qiagen).
ChIP DNA was quantified by Qubit assay HS kit. Approximately 0.1 ng of each ChIP DNA sample was sheared using Sonicator XL2020 (Misonix) with a microplate horn for 10 min at 55% power output and 90% amplitude. Sheared ChIP DNA extract was end-repaired using the End-It DNA End-Repair Kit (Epicentre). Illumina adapters (56 bp and 34 bp) were ligated using T4 DNA ligase (5 U μl−1, Fermentas) and recovered using a MinElute Reaction Cleanup Kit (Qiagen). Linker ligated ChIP DNA was amplified by emulsion PCR for 40 cycles as previously described47. Amplified ChIP DNA between 300 and 500 bp was gel purified on 2% agarose and sequenced on the Illumina Genome Analyzer II according to the manufacturer's instructions except that emulsion PCR-amplified DNA containing the GA2 sequencing adaptor was applied directly to the cluster station for bridge amplification. The resulting flow-cell was sequenced for 36 cycles to generate 36-bp reads.
Unfiltered 36-bp Illumina sequence reads were aligned to the mouse reference genome (NCBI build 37, mm9) using BLAT48 with optional parameters (minScore = 20, minIdentity = 80, stepSize = 5). BLAT was performed in parallel on a sge-cluster. For each read, the two highest-scoring alignments were compared and reads were rejected as repetitive unless the score of the best alignment was at least two greater than that of the second best alignment. The remaining reads were further filtered to reject those with a BLAT alignment score <21, with >1 bp insertion or deletion, or with >2 unaligned bases at the start of the read. Finally, reads with identical start sites in the mouse genome were considered likely to be duplicate sequences arising as an artefact of sample amplification or sequencing, and were counted only once. The remaining reads were classed as uniquely aligned to the mouse genome.
Uniquely aligned reads were extended to 300 bp in the 3′ direction to account for the average length of size-selected p300 ChIP fragments used for sequencing. These extended read coordinates were used to determine the read coverage at individual nucleotides at 25 bp intervals throughout the mouse genome. This data was used to produce coverage plots for visualization in the UCSC genome browser.
To identify p300-enriched regions (peaks), we compared the observed frequency of coverage depths with those expected from a random distribution of the same number of reads generated computationally as described previously17. In brief, the probability of observing a peak with a coverage depth of at least H reads is given by a sum of Poisson probabilities as:
in which λ is the average genome-wide coverage of extended reads given by: (read length × number of aligned reads)/alignable genome length. To estimate the alignable genome length, one million randomly selected 36-base-polymers from the mouse genome were realigned to the mouse genome using the same alignment and filtering scheme as for reads. A total of 77.3% of 36-base-polymers were uniquely mapped back to the mouse genome, resulting in an alignable genome length of 2.107 Gb.
For each sample, we determined the read coverage depth at which the observed frequency of sites with that coverage exceeded the expected frequency by a factor of 100 (FDR ≤ 0.01). Candidate peaks were identified as sites in which the coverage exceeded this threshold, and peak boundaries were extended to the nearest flanking positions at which read coverage fell below two reads. All consecutive regions of enrichment separated by regions of continuous coverage greater than two reads were merged into a single peak. Candidate peaks mapping to chr_random contigs, centromeric regions, telomeric regions, segmental duplications, satellite repeats, ribosomal RNA repeats or regions of >70% repeat sequence, and those coinciding with enriched regions in the control sample (input DNA) were removed as probable artefacts due to misalignment of heterochromatic sequences that are not at present represented in the mouse reference genome sequence. The remaining peaks represent high-confidence p300-enriched regions and putative enhancers with activity in specific tissues.
Annotation of p300 ChIP-seq read data sets with respect to nearby genes (UCSC known genes50), internal exons (mouse RefSeq51 exons >30 kb from the ends of transcripts) and conserved non-coding sequences (top 50,000 constrained non-coding human-mouse-rat conserved elements identified using GUMBY with R-ratio parameter R = 50; refs 9, 11) was performed using Galaxy52 and custom Perl scripts. Annotation of p300-enriched regions with respect to UCSC known genes and vertebrate phastCons elements34 was performed using custom Perl scripts.
Candidate regions for transgenic testing were selected based on ChIP-seq results. Peaks for which human orthologous regions could not be unambiguously established and those without detectable conservation in opossum53 were excluded from transgenic testing. Thus, the tested peaks cover a wide spectrum of conservation, but are overall more constrained than all peaks identified genome-wide (median score of 457 for all peaks versus 626 for tested peaks). Enhancer candidate regions (average size of 2.4 kb) were amplified by PCR from human genomic DNA (Clontech) and cloned into an Hsp68-promoter-LacZ reporter vector upstream of an Hsp68-promoter coupled to a LacZ reporter gene as previously described6,31. Candidate sequences were not cloned in any particular orientation, effectively resulting in randomized insert orientation among the test constructs. Genomic coordinates of amplified regions are reported in Supplementary Table 5. Transgenic mouse embryos were generated by pronuclear injection and F0 embryos were collected at E11.5 and stained for LacZ activity as previously described6. Only patterns that were observed in at least three different embryos resulting from independent transgenic integration events of the same construct were considered reproducible (see Supplementary Table 5). To account for minor variation in separating forebrain from midbrain during tissue dissection, forebrain and midbrain p300 peaks were also considered correct predictions if the reproducible in vivo pattern was located in the forebrain/midbrain boundary region, whereas absence of a p300 peak was only considered a false-negative prediction if the reproducible in vivo pattern clearly extended beyond the boundary region.
Tissue was isolated from timed-pregnant CD-1 strain mouse embryos at E11.5. Forebrains were further subdivided into basal telencephalon (subpallium), dorsal telencephalon (pallium), and diencephalon, which were processed separately in subsequent steps. For comparison, whole embryos (littermates) were collected. All samples were collected, processed and hybridized in duplicate. Total RNA was extracted using Trizol reagent (Invitrogen). Synthesis of complementary RNA, hybridization to GeneChip Mouse Genome 430 2.0 arrays (Affymetrix) and analysis of hybridization results was performed according to the manufacturer's recommendations. For each sample, the average expression value from duplicates was used for downstream analyses. Forebrain-enriched genes were defined as those with expression at least 2.5-fold greater expression in at least one of the three forebrain regions compared with the whole embryo, and with a minimum signal intensity of 100. Whole embryo-enriched genes are defined as those with at least 2.5-fold greater expression in the whole embryo than in each of the three forebrain regions, and a minimum signal intensity of 100. Distances between p300 peaks and the 5′ end of Affymetrix consensus complementary DNA sequences from mouse MOE430 (A and B) aligned to the mouse reference genome (mm9) were used to determine the closest forebrain-enriched and whole embryo-enriched genes (Supplementary Tables 8 and 9). The same procedure was used to analyse the correlation of limb p300 peaks with limb gene expression, except that limb expressed genes were identified by comparison of publicly available wild-type E11.5 proximal hind-limb gene expression data (GEO series GSE10516, samples GSM264689, GSM264690 and GSM264691)49, with the whole embryo gene expression data generated in the present study (Supplementary Tables 10 and 11).
All animal work was performed in accordance with protocols reviewed and approved by the Lawrence Berkeley National Laboratory Animal Welfare and Research Committee.
We wish to thank R. Hosseini and S. Phouanenavong for technical support, and J. Rubenstein, J. Long, J. Choi and Y. Zhu for help with microarray experiments. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program and by the University of California, Lawrence Berkeley National Laboratory under contract no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no. DE-AC02-06NA25396. L.A.P. and E.M.R. were supported by the Berkeley-PGA, under the Programs for Genomic Applications, funded by National Heart, Lung, & Blood Institute, and L.A.P. by the National Human Genome Research Institute. A.V. was supported by an American Heart Association postdoctoral fellowship. B.R. was supported by grants from the National Human Genome Research Institute and the Ludwig Institute for Cancer Research.
Full Methods and any associated references are available in the online version of the paper at www.nature.com/nature.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Reprints and permissions information is available at www.nature.com/reprints. All ChIP-seq data sets described in this study have been deposited at the National Center for Biotechnology Information (NCBI) in the Gene Expression Omnibus (GEO) database under accession number GSE13845.