|Home | About | Journals | Submit | Contact Us | Français|
An organism’s genome sequence serves as a blueprint for the proteins and regulatory RNAs essential for cellular function. The genome also harbors cis-acting non-coding sequences that control gene expression and are essential to coordinate regulatory programs during embryonic development. However, the genome sequence is largely identical between cell types within a multi-cellular organism indicating that factors such as DNA accessibility and chromatin structure play a crucial role in governing cell-specific gene expression. Recent studies have identified particular chromatin modifications that define functionally distinct cis regulatory elements. Among these are forms of histone 3 that are mono- or tri-methylated at lysine 4 (H3K4me1 or H3K4me3, respectively), which bind preferentially to promoter and enhancer elements in the mammalian genome. In this work, we investigated whether these modified histones could similarly identify cis regulatory elements within the zebrafish genome. By applying chromatin immunoprecipitation followed by deep sequencing, we find that H3K4me1 and H3K4me3 are enriched at transcriptional start sites in the genome of the developing zebrafish embryo and that this association correlates with gene expression. We further find that these modifications associate with distal non-coding conserved elements, including known active enhancers. Finally, we demonstrate that it is possible to utilize H3K4me1 and H3K4me3 binding profiles in combination with available expression data to computationally identify relevant cis regulatory sequences flanking syn-expressed genes in the developing embryo. Taken together, our results indicate that H3K4me1 and H3K4me3 generally mark cis regulatory elements within the zebrafish genome and indicate that further characterization of the zebrafish using this approach will prove valuable in defining transcriptional networks in this model system.
Embryonic development requires coordination of numerous molecular signals that dictate cell migration and organization, as well as the differentiation and specification of distinct cell lineages. During this process, transcription factors are dynamically expressed in the developing embryo and modulate expression of genes necessary for key developmental processes (Levine and Davidson, 2005; Smith et al., 2007). These transcription factors typically act by binding to cognate sequences within cis regulatory elements that lie adjacent to or within target genes to promote or repress their expression. Through combinatorial and differential binding by multiple transcription factors, cis-regulatory elements for a particular gene help give rise to complex patterns of gene activation and silencing during development (Bonn and Furlong, 2008; Levine and Davidson, 2005). However, a confounding observation in the context of this regulatory mechanism is that every cell in the developing embryo possesses a nearly identical genomic sequence, yet they express different genes. This suggests an important role for chromatin structure in regulating DNA accessibility during the control of gene expression.
Early studies into the control of gene expression revealed that chromatin in the vicinity of promoter or enhancer elements was sensitive to digestion by DNase I (Liu et al., 1988; Wu, 1980). While these initial findings were limited to a few loci, recent application of deep sequencing approaches has allowed genome-wide identification of DNase I hypersensitive sites in the vertebrate genome (Crawford et al., 2006). These studies have demonstrated a consistent correlation between DNase I hypersensitivity and several classes of functionally defined cis regulatory elements, including transcriptional start sites, insulators, and enhancer elements (Boyle et al., 2008; Crawford et al., 2006). Along with DNase hypersensitivity, a number of regulatory proteins are broadly associated with cis elements and play important roles in determining their function. For example, the CCCTC-binding factor (also known as CTCF) is a zinc finger DNA-binding protein that functions as a transcriptional repressor and binds at insulator elements throughout the genomes of multiple organisms (Bell et al., 1999; Burcin et al., 1997). By contrast, the histone acetyltransferase p300, a transcriptional co-activator (Ogryzko et al., 1996), associates with active enhancers in a tissue-specific manner (Heintzman et al., 2009). Interestingly, cis elements that bind to CTCF or p300 are also hypersensitive to DNA treatment.
A number of post-translationally modified forms of histone protein subunits similarly associate with distinct cis regulatory elements. For example, the trimethylated form of histone 3 at lysine 4 (H3K4me3) is associated with genes that are actively transcribed (Santos-Rosa et al., 2002) (Bernstein et al., 2002). Recent large-scale and genome-wide profiling studies of H3K4me3 in the human and mouse genomes further indicate that this modification is preferentially associated with the promoters of active genes (Barski et al., 2007; Bernstein et al., 2005) and, to a lesser degree, at distal cis elements, such as enhancers (Barski et al., 2007). Histone 3 that is monomethylated at lysine 4 (H3K4me1) is similarly associated with active enhancer elements as well as transcriptional start sites, though to a lesser degree than H3K4me3 (Barski et al., 2007; Heintzman et al., 2007). H3K4me1 is also found at insulator elements concordant with binding by CTCF (Akkers et al., 2009; Barski et al., 2007; Heintzman et al., 2007). While H3K4me3, and to some degree H3K4me1, are generally associated with gene activation, other histone marks are indicative of gene repression. Most notably H3K4me27 is consistently associated with silent genes (Barski et al., 2007; Boyer et al., 2006; Lee et al., 2006) and is often found together with H3K4me3 at silent genes that are poised for activation (Azuara et al., 2006; Bernstein et al., 2006).
While genome-wide efforts to characterize histone modifications have been applied to embryonic stem cell lines, most studies have been confined to cell culture models thereby limiting their relevance to embryonic development. In cases when these approaches have been applied to vertebrate embryos, the resulting studies have provided new insights into the transcriptional regulatory networks required for development of a particular cell type or tissue. For example, ChIP-Seq analysis of p300 binding sites in tissues microdissected from mouse embryos at E11.5 revealed hundreds of enhancer elements capable of driving tissue specific gene expression (Visel et al., 2009). These results further demonstrated that p300 binding alone is a reliable mark to predict tissue specificity of cis elements in the context of the developing embryo. More recently, ChIP-on-chip has been applied to early zebrafish embryos to investigate the binding of modified hi-stones within the genome. Initial studies confirmed that H3K4me3 binding was enriched at transcriptional start sites similar to findings in mammalian cells (Wardle et al., 2006). More recently, a similar approach was used to demonstrate that both H3K4me3 and H3K27me3 marks are deposited at developmentally regulated genes independently of transcriptional activation immediately prior to the onset of zygotic transcription in zebrafish (Vastenhouw et al., 2010). Importantly, these findings highlight the benefit of applying genome-wide analysis of chromatin modification and transcriptional regulatory protein binding to model organisms in vivo.
The application of these approaches to zebrafish is aided by the ability to obtain large numbers of embryos, greatly enabling genome-wide analysis at multiple embryonic stages. In parallel, the ability to perform straightforward genetic manipulation in this system would make it amenable to genome-wide profiling to assess the transcriptional inputs of different developmental pathways. In this study, we have applied ChIP-Seq on zebrafish embryos to identify genomic elements that associate with H3K4me1 and H3K4me3. As in mammalian cells, we find that zebrafish promoters are enriched for H3K4me3, as well as H3K4me1, and that this association directly correlates with levels of gene expression. We further find that H3K4me3 binding sites can aid in annotating the transcriptional start site of predicted genes and can identify alternative promoter elements. We also find that active enhancer elements display preferential association with H3K4me1 and this mark can be predictive for active enhancers. Finally, we demonstrate that cis elements flanking H3K4me1 and H3K4me3 binding sites can be analyzed computationally to identify biologically relevant transcription factor binding sequences. Taken together, these studies demonstrate that genome-wide profiling of chromatin modifications in the zebrafish will be a valuable approach to identify and dissect transcriptional regulatory networks important for development. Additionally, the datasets generated in this work provide a valuable resource for the further annotation and identification of cis regulatory elements within the zebrafish genome.
Fish lines were raised and maintained by Animal Medicine at UMass Medical School as previously described (Westerfield, 1993). Wild type (WT) lines were TL and CF. Studies were performed in accordance with UMass Medical School IACUC guidelines.
For each ChIP we dissociated 200 wild type embryos staged at 24 h post fertilization (hpf) into single cell suspensions as previously (Covassin et al., 2006). Cells were cross-linked with a 1% formaldehyde/PBS solution for 10 min at room temperature. After adding 2 M glycine, the samples were centrifuged and pellets resuspended in SDS lysis buffer (50 mM Tris–HCl, pH 8.1, 10 mM EDTA, 1% SDS). The samples were sonicated using a Microsonicator (Cole and Palmer Instruments) with a 6 mm micro tip for a minimum of five minutes at 30% intensity during which they were kept in ice water. Shearing was monitored by gel electrophoresis to ensure a size distribution between 200 and 1000 bp. As necessary, additional sonication was applied in 1 min intervals to achieve the desired size range. Sonicates were centrifuged and supernatants were removed and diluted 1:10 in IP buffer (16.7 mM Tris–HCl PH 8.1, 1.2 mM EDTA, 1.1% Triton-X-100, 0.01% SDS, 167 mM NaCl). One percent of this solution was set aside as an input control for each corresponding immunoprecipitation. Samples were pre-cleared with ChIP-grade protein A agarose beads (Sigma) in IP buffer containing 0.4 mg/mL sheared salmon sperm (Invitrogen) and 1 mg/mL bovine serum albumin (BSA) for 30 min at 4 °C. Samples were then split into two tubes and antibody added to only one, while the remaining half served as a negative control. For immunoprecipitation, we used 8 μg of anti-H3K4me1 (ab8895, Abcam) or anti-H3K4me3 antibody (ab8580, Abcam). All samples were incubated overnight at 4 °C. Both antibody-treated and untreated samples were incubated with protein A agarose beads for one hour at 4 °C and subsequently washed with ChIP wash buffers A (20 mM Tris–HCl pH 8.1, 2 mM EDTA, 1% Triton-X-100, 0.1% SDS, 150 mM NaCl), B (20 mM Tris–HCl pH 8.1, 2 mM EDTA, 1% Triton-X-100, 0.1% SDS, 500 mM NaCl), C (10 mM Tris–HCl pH 8.1, 1 mM EDTA, 1% NP-40, 1% Sodium deoxycholate, 0.25 M LiCl), and twice in TE solution (10 mM Tris–HCl, 1 mM EDTA). All samples, including input, were brought to 200 μL with elution buffer (0.1 M NaHCO3 and 1% SDS), vortexed, and incubated for 15 min at room temperature. Supernatants were then cleared of agarose beads by two rounds of centrifugation. 20 μL of 5 M NaCl was added to all samples (including input) followed by incubation at 65 °C overnight. Immuno-precipitated DNAs were purified using a PCR purification kit (Qiagen), eluted with 50 μL EB, and quantified using a Nanodrop 2000 (Thermo Scientific). All ChIPs were performed in duplicate. PCR amplification was performed on all samples using primers specific for regions upstream of the predicted promoters of gapdh, ef1α, and fii1b (see Supplementary Table 9 for all primer sequences used in this study). Primers were designed using Primer3Plus (http://www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi) and synthesized by Invitrogen. For PCR reactions, we used Platinum Taq polymerase (Invitrogen) and analyzed samples by electrophoresis on TAE gels made with 2.5% low-res ultra agarose (Bio-Rad).
80 ng of ChIP-isolated DNA from both antibody and input samples was polished using an END-IT DNA Repair Kit (Epicentre) and purified using Qiaquick columns (Qiagen). A single adenosine was added to the 3′ ends using 50 units of Klenow Exo-minus (Epicentre) in the presence of dATP followed by purification using Qiaquick columns. At this point, the samples were concentrated using a speed vacuum and 0.3 μM Illumina adapters were ligated to the fragments using a Fast Link ligation kit (Epicentre). Adaptor-ligated fragments were purified using a Gel Extraction Kit (Qiagen) and PCR-amplified using Illumina’s Genome Analyzer primers (primers 1.1 and 2.1, final concentration of 2.5 μM). PCR products were purified by gel electrophoresis and a Qiagen Gel Extraction Kit. Single end 36 or 72 bp reads from adaptor-ligated libraries were obtained by deep sequencing on Illumina Genome Analyzers operated by the UMass Center for AIDS Research Deep Sequencing Core.
We used the Firecrest and Bustard applications for base calling and to generate sequence tag files. We mapped the sequence tags to the zebrafish genome (Zv7) using Bowtie to generate genomic coordinates for all tags. Only tags with unique genomic coordinates and less than 2 bp of mismatch were included in subsequent analyses. We identified approximately 26, 17, and 22 million non-redundant uniquely mapped reads for H3K4me1, H3K4me3, and input samples, respectively, which were subsequently used for peak finding. To generate “signal” tracks, we determined the density of sequence tags for each sample by extending uniquely mapped sequence to 250 bp in length, followed by coverage calculation for each position in a 250 bp moving window. To minimize file size of these annotation tracks, only genomic positions with at least 10× sequence tag coverage were used to generate wiggle files (.wig, Supplementary Files 1–3). Regions of H3K4me1 or H3K4me3 enrichment over input (referred to hereafter as “hotspots”) and “peaks” of binding within these enriched regions were determined using the Model-based Analysis of ChIP-Seq algorithm (MACS, version 1.4; (Zhang et al., 2008) with the default settings and the following modified parameters: -f BED -w –single-wig –gsize 1700000000 –tsize 36 –bw 150). The genomic coordinates of significantly enriched regions (p-value<1×10−5) were formatted into Excel files (Supplementary Tables 1 and 2). Genomic locations of H3K4me1 or H3K4me3 “hotspots” and “peaks” were compiled into bed- and wig-formatted files, respectively (Supplementary Files 4–7). We established a mirror site of the UCSC Genome browser using sequences and annotations from Zv7 (danRer5). Wiggle- and bed-formatted files were uploaded to this mirror site for visualization using standard browsers.
Concordance was determined using the ChIPpeakAnno package from Bioconductor (Zhu et al., 2010). Coordinates for both ENSEMBL and RefSeq gene sets were obtained from tables at the UCSC Genome Browser (Zv7). Only uniquely mapped RefSeq genes were used in our analysis. The transcriptional start site (TSS) was defined as the starting genomic coordinate for a gene on the positive strand, or the end coordinate for a gene on the negative strand. A sequence fragment extending 2000 bp upstream and 1000 bp downstream of the TSS coordinate (or the converse for genes on the negative strand) was then used to determine concordance to H3K4me1 and/or H3K4me3 peaks based on their position overlap calculated by annotatePeakInBatch in the ChIPpeakAnno package. Results from both ENSEMBL and RefSeq TSSs were separately compiled into Excel files (Supplementary Tables 3 and 4). Density plots and histograms were generated in R to depict the distribution of peaks relative to the nearest TSS or conserved non-coding element.
To assess gene expression levels in zebrafish embryos at 24 h post fertilization (hpf), we used a publicly available dataset consisting of 4 microarray replicates performed on Affymetrix Zebrafish Genome arrays (Gene Expression Omnibus accession numbers: GSM466299, GSM466300, GSM466301, and GSM466302). We determined Absent and Present calls for each probe using mas5calls in the affy package from Bioconductor and probesets were divided into 5 categories based on expression level. To match probesets to either RefSeq genes or ENSEMBL genes, we mapped the coordinates of Affymetrix probes to the respective exon annotations for either gene set. We eliminated any probes that mapped to two separate genes, or probes mapping to the same gene that differed in level by more than one expression category. The resulting table was cross-referenced and integrated into Supplementary Tables 3 and 4. To identify genes that were expressed in blood vessels at 24 hpf, we downloaded the expression pattern datasets from the Zebrafish Information Network. From these we extracted a list of genes using the anatomical terms: “dorsal aorta” or “vasculature” and the stage term: “Prim-5”. Only entries corresponding to mapped RefSeq genes in Zv7 were taken for further analysis (Supplementary Table 5).
Conserved elements have not been annotated on Zv7. Therefore, to identify conserved non-coding elements (CNEs) in Zv7, we first downloaded the Most Conserved Element (MCE) table from Zv8 (danRer6), restricting our analysis to chromosomes 3, 6, 13, 14, 19, 20, and 23. We eliminated all MCEs that overlapped with predicted ENSEMBL genes, zebrafish mRNAs, and expressed sequence tags (ESTs). We also eliminated all elements with a score less than 400. To obtain the corresponding coordinates for the resulting CNEs in Zv7 (Supplementary Table 6), we applied the Lift-Over function in Galaxy (http://main.g2.bx.psu.edu/; (Blankenberg et al., 2010; Goecks et al., 2010)) and used these for concordance analysis as above, except that the CNE was not extended and the window of overlap was 1000 nucleotides up or downstream. Zv7 CNE location was compiled into a BED file for visualization on the UCSC Genome Browser (Supplementary File 8). To obtain Zv7 coordinates for CNEs in the zebrafish hox4 paralogs and fgf8a, we extracted genomic sequence from Zv6 based on coordinates for functionally characterized elements that have been previously described (Komisarczuk et al., 2009; Punnamoottil et al., 2010). Zv7 coordinates for these same sequences (Supplementary Table 7) were obtained through BLAT alignment and were used to generate BED files for visualization on the UCSC Genome Browser (Supplementary File 9).
Indicated genomic regions (Supplementary Table 8) were PCR amplified from wild type genomic DNA for dll4, her6, hey2, and notch3 using the primers in Supplementary Table 9. Cis elements from the dll4 and hey2 genes were amplified using primers containing attB4 and attB1 sites followed by BP recombination with pDONR-P4-P1R to generate 5′ entry clones (p5E). Cis elements from her6 and notch3 were amplified using primers containing EcoRV sites. Following PCR amplification, these elements were digested with EcoRV and cloned into the p5Ebasprom plasmid that had been digested with SnaBI. p5Ebasprom was constructed by PCR amplification of an E1A basal promoter element from pUASnBgal (unpublished) with an upstream SnaBI site followed by BP cloning into pDONR-P4-P1R. To generate reporter constructs, dll4 and hey2 p5E clones were used in individual multi-site LR reactions with pME-basEGFP (also known as pENTR-basEGFP) and pTolR4-R2pA. For her6 and notch3 elements, LR reactions were performed with respective p5E plasmids, along with pME-EGFP2 (also known as pENTR-EGFP2) and pTolR4-R2pA. The ENTR-basEGFP, -EGFP2, and pTolR4-R2pA plasmids have been described elsewhere (Villefranc et al., 2007).
To determine activity of putative enhancer elements, we co-injected reporter constructs along with 25 pg of mRNA encoding the Tol2 Transposase into wild type zebrafish embryos at the 1-cell stage. Embryos were checked for green fluorescence at 24 hpf and fixed in 4% paraformaldehyde. Immunostaining to detect GFP expression, as well as subsequent imaging, was performed as previously (Villefranc et al., 2007).
To identify overrepresented transcription factor binding sites we applied the CLOVER algorithm (Frith et al., 2004). We identified cis elements (CNEs, H3K4me1 peaks, and H3K4me3 peaks) within sequences 40 kb up and downstream of TSSs for genes of interest. For H3K4me1 and H3K4me3 peaks, we extracted 750 bp up- and downstream of each peak and compiled these sequences, along with CNE sequences, into a single FASTA file for all genes within the syn-expression group (Supplementary Files 10 and 11). In all cases, we used repeat-masked sequences from Zv7. In cases where sequence fragments overlapped, a single contiguous entry was generated to prevent duplication of sequences that would lead to spurious over-representation during CLOVER analysis. The Notch syn-expressed genes were her4.2, her6, hes5, notch2, notch3, nrarpa, and nrarpb (Supplementary File 10). The Fgf syn-expressed genes were dusp6, erm, fgf3, fgf8, fgfr1, pea3, spry1, and spry4 (Supplementary File 11). These FASTA files were separately used as input files for CLOVER (default parameters, except -t 0.005 -u 7). As a comparative background sequence we used the sequence of zebrafish chromosome 15 and we searched using weighted transcription factor binding matrices from the JASPAR database (core collection, 2005(Sandelin et al., 2004); Supplementary File 12) or for Drosophila Suppressor of hairless ((Bailey and Posakony, 1995); Supplementary File 13).
To broadly identify H3K4me1 and H3K4me3 binding sites throughout the zebrafish genome, we performed deep sequencing of genomic DNA fragments isolated following ChIP (referred to as ChIP-Seq) using antibodies against these modifications. We initially assessed the ChIP-Seq datasets by observing the distribution of H3K4me1 and H3K4me3 binding at several different loci using a mirrored version of the UCSC genome browser. For example, on chromosome 14 a functional vascular endothelial growth factor receptor-2 ortholog, kdrl, spans approximately 67 kb and is located downstream of the ligand of numb-protein X 2b (lnx2b), cysteine-rich hydrophobic domain 1 (chic1), and caudal domain homeobox 4 (cdx4) genes (Fig. 1A). Each of these genes displays distinct spatial and temporal gene expression patterns in the zebrafish embryo. kdrl expression is restricted to endothelial cells at 24 hpf (Liao et al., 1997), while cdx4 is expressed in lateral mesoderm and neural plate at earlier time points and is restricted to the spinal cord and endoderm by 24 hpf (Davidson et al., 2003). Similarly, lnx2b is broadly expressed in multiple tissues early during development and becomes restricted to pronephric cells, the dorsal aorta, and the forebrain by 24 hpf (Ro and Dawid, 2009). By contrast, chic1 is ubiquitously expressed at 24 hpf in zebrafish embryos (www.zfin.org; ZDB-FIG-050630-13617). For all of these genes, we observed enriched regions of H3K4me3 binding with peaks near their putative transcriptional start site (TSS) as predicted by ENSEMBL gene or RefSeq coordinates (Fig. 1A). Previous studies have shown that concordant H3K4me3 and H3K4me1 binding at the TSS correlates with active gene transcription (Barski et al., 2007; Bernstein et al., 2002, 2005; Santos-Rosa et al., 2002). Accordingly, we similarly observed enrichment of H3K4me1 binding with peaks at the TSSs of all four of these expressed genes (Fig. 1A). Additionally, we observed that H3K4me1 binding spanned the length of these genes with several intragenic peaks of enrichment evident (Fig. 1A), consistent with the finding that H3K4me1 is associated with distal cis regulatory elements (Heintzman et al., 2007).
Our preliminary analysis at these selected loci also revealed potential discrepancies between ENSEMBL gene predictions and RefSeq genes annotations for defining the putative TSS. For example, RefSeq annotation of lnx2b indicates an upstream non-coding exon that is experimentally supported by available expressed sequence tags (ESTs) yet is missing from the ENSEMBL prediction at the same locus (Fig. 1B). We observed enrichment of both H3K4me1 and H3K4me3 binding at the RefSeq-annotated TSS, consistent with a transcriptionally active promoter at this position. Interestingly, we also observed H3K4me1 binding at and immediately upstream of the second lnx2b exon and several mapped ESTs that initiate at these coordinates suggest that this gene may exhibit alternative promoter usage. These observations indicated that our datasets might be useful to identify alternative promoters. To determine if this was the case, we investigated H3K4me1 and H3K4me3 binding at the tal1 gene, which is known to utilize two distinct promoters in zebrafish (Qian et al., 2007). Consistent with published findings, we identified ESTs whose 5′ end initiated at two distinct sites (Fig. 1C). Furthermore, both TSSs were enriched for both H3K4me3 and H3K4me1 binding, indicative of active transcription from both sites. Taken together, these observations suggest that integration of H3K4me1 and H3K4me3 concordance data with ENSEMBL gene prediction algorithms may aid in the identification of 5′ exons and TSSs, as well as identify instances of alternative promoter usage in the zebrafish genome.
Our initial observations at chromosome 14 indicated tight association between H3K4me1 or H3K4me3 and the TSS, similar to observations in mammalian cells (Barski et al., 2007; Bernstein et al., 2002, 2005; Santos-Rosa et al., 2002). To investigate if these patterns were consistent in our dataset throughout the zebrafish genome, we plotted the distribution of all H3K4me1 and H3K4me3 peaks in relation to all predicted TSSs. Since our preliminary observations revealed some instances where a TSS was not appropriately annotated at ENSEMBL-predicted gene coordinates when compared to RefSeq sequences (see Fig. 1B), we separately performed this analysis on both ENSEMBL-predicted TSSs and uniquely mapped RefSeq TSSs. In both cases, we found that the distribution of H3K4me3 binding was tightly associated with TSSs (Figs. 2A, D) throughout the genome. Closer inspection of a 10 kb flanking window around the TSS of all ENSEMBL or RefSeq genes indicated that most H3K4me3 sites are within several hundred base pairs of the TSS and display a bimodal distribution flanking the TSS itself (Figs. 2B, E). These findings are similar to those in mammalian genomes where H3K4me3 binding occurs immediately upstream and downstream of the TSS, while being depleted from the actual TSS itself (Barski et al., 2007). We observed a similar association to the TSS for H3K4me1, although this mark is also more widely distributed (Figs. 2C, F) consistent with its known association with distal enhancer and insulator elements in mammalian genomes (Barski et al., 2007; Heintzman et al., 2007, 2009).
The distribution of H3K4me1 and H3K4me3 binding suggested that there would be concordance of these two marks at or near TSSs. To investigate if this was the case, we determined the degree of concordant H3K4me1 and H3K4me3 binding for all TSSs throughout the zebrafish genome. Using the MACS algorithm, we identified approximately 41,000 H3K4me1 and 29,000 H3K4me3 sites within the zebrafish genome that displayed significant enrichment when compared to the input sample at this depth of sequencing (Supplementary Tables 1 and 2). From all ENSEMBL-predicted genes in Zv7, we identified approximately 30% (8028 genes) marked by both H3K4me1 and H3K4me3 at predicted TSSs, while about 10% (2654 genes) were marked with H3K4me3 only, and 830 ENSEMBL genes were marked by H3K4me1 alone (Supplementary Table 3). Parallel analysis revealed that a much higher proportion of RefSeq-annotated TSSs than ENSEMBL TSSs (55% versus 31%, respectively) were marked by both H3K4me1 and H3K4me3 (Supplementary Table 4). A likely reason for the discrepancy between ENSEMBL-predicted genes and RefSeq annotated genes is that RefSeq sequences are manually curated and therefore are largely full-length transcripts based on expressed sequences and other experimental evidence (Pruitt et al., 2009). Therefore, it is likely that a high proportion of RefSeq TSSs are correctly annotated. While ENSEMBL incorporates multiple sources, including expressed sequence tags, in some cases gene predictions are made based solely on homology, leading to a prevalence of incomplete gene annotations. This is consistent with recent comparisons of RefSeq gene collections with ENSEMBL gene annotations demonstrating a significant proportion of full-length clones remain to be fully annotated (Flicek et al., 2010). Nonetheless, for both ENSEMBL and RefSeq genes, our results are consistent with findings in other vertebrate genomes and demonstrate that H3K4me3, and to a lesser degree, H3K4me1 preferentially associate with transcriptional start sites in the zebrafish genome.
In mammalian cells, H3K4me3 binding correlates with the presence of active RNA polymerase II and, along with H3K4me1, is more prevalent at the TSSs of genes known to be transcriptionally active or poised to be expressed (Bernstein et al., 2006; Mikkelsen et al., 2007). To determine if the same pattern held true for our dataset from zebrafish embryos at 24 hpf, we compared H3K4me1 and H3K4me3 binding at TSSs with corresponding transcript levels at this same time point. As above, we separately used both ENSEMBL-predicted and RefSeq-annotated genes for our analyses. Using a publicly available microarray dataset from wild type embryos at 24 hpf, we binned probesets based on their expression levels into 5 categories, 1 (lowest or no expression) to 5 (highest expression) and then cross-referenced these against H3K4me1 and H3K4me3 concordance tables. For ENSEMBL-predicted genes, we identified 4833 respective Affymetrix probesets that unambiguously mapped to an inclusive exon (Supplementary Table 3). Of these, the majority of genes were detectable (approximately 90% with expression level >1) at 24 hpf. Furthermore, we observed a direct correlation between H3K4me3 occupancy at the TSS and expression levels (Fig. 3A). For example, approximately 70% of TSSs at the most highly expressed ENSEMBL genes (both categories 4 and 5) were bound to H3K4me3 and more than half of these were also marked by H3K4me1 (Fig. 3A, Supplementary Table 3), indicative of active gene transcription. By contrast, less than 40% of genes with the lowest or no expression were marked by H3K4me3 and nearly two-thirds of genes expressed at this level failed to display enrichment of either H3K4me1 or H3K4me3 at their TSS (Fig. 3A). We observed similar results when comparing expression levels of RefSeq genes to TSS occupancy. In this case, we identified 3749 probesets that mapped to exons of a unique RefSeq gene and the majority of these were expressed at detectable levels (Supplementary Table 4). Similar to ENSEMBL-annotated genes, we identified a direct correlation between expression level and occupancy of H3K4me3 at the TSS of RefSeq genes (Fig. 3B). In this case, more than 80% of the highest expressed RefSeq genes (both category 4 or 5) were marked by H3K4me3 and nearly two-thirds displayed a bivalent H3K4me1/me3 mark. Less than 10% of the most highly expressed genes were not marked by either modification. By contrast, nearly half of the RefSeq genes with the lowest or no expression were not marked at all and just 45% were marked by H3K4me3. These results suggest a direct correlation between binding of H3K4me3 or concordant binding of both H3K4me1 and H3K4me3 with levels of gene expression in the developing zebrafish embryo and are consistent with recent studies showing a similar correlation in a zebrafish cell line (Lindeman et al., 2010).
While our microarray analysis was helpful to simultaneously compare expression levels of a large number of genes in whole embryos, it did not provide any information on tissue-specific expression. Furthermore, low levels of transcript and the inability to detect H3K4me1 or me3 could be due to gene expression that is highly restricted to only a few cells in the entire embryo. Therefore, we assessed H3K4me1 and H3K4me3 occupancy at a selected category of cell type-specific genes. For this purpose, we utilized a previously established and publicly accessible database of expression patterns that were obtained by whole mount in situ hybridization using antisense riboprobes on zebrafish embryos at selected time points (Kudoh et al., 2001; Tsang et al., 2002); (Thisse et al., 2004). From this dataset, we identified genes that displayed expression in blood vessels at the same time point used for our ChIP-Seq analysis. We chose blood vessels as a target tissue since the endothelial cells comprising this tissue only represents about 5% of all cells in an embryo at 24 hpf (Covassin et al., 2006 and data not shown). We identified 78 transcripts that corresponded to RefSeq sequences in our dataset and were reported to display expression in blood vessels (Supplementary Table 5). Of these, nearly two-thirds (45 out of 78) displayed a bivalent H3K4me1/me3 mark at the TSS of the corresponding RefSeq gene, comparable to the highly expressed genes identified by microarray analysis (Fig. 3B). While a number of these genes are also expressed in other cell types and would be more abundant within the whole embryo, we identified numerous genes that are restricted to endothelial cells (e.g. dusp5, fii1b, and plxnd1) and were marked by H3K4me1 and H3K4me3 at their TSS (Supplementary Table 5).
Our findings indicate that it is possible to identify transcriptionally active promoters in highly restricted genes within the zebrafish embryo at 24 hpf using this ChIP-Seq approach. Therefore, similar genome-wide profiling in the many different mutant lines available in zebrafish may provide valuable insight into transcriptional regulatory pathways required for development. While we did observe a consistent correlation with H3K4me1/me3 binding and gene expression, there are still numerous expressed genes for which we failed to identify either mark at their TSS. There could be several reasons for this finding. For example, both datasets used here to assess gene expression rely on transcript level in a given sample. In some cases, it is possible that active transcription of the gene has ceased by 24 hpf, yet transcript is still present. In this case, the TSS of such a gene may be less likely to be marked by H3K4me1 and H3K4me3. The inability to detect H3K4me1 or H3K4me3 binding may also reflect a high background signal from input, or simply a low density of sequence tags from the ChIP. In either case, this would lead to a low signal-to-noise ratio hindering peak detection. It is likely that deeper sequencing efforts would remedy this problem.
Along with promoter elements located immediately adjacent to the TSS of genes, distal enhancer elements are important for spatial and temporal control of gene expression. In many cases, these active cis regulatory elements can be identified on the basis of distinct signatures, including binding by general transcriptional activator proteins (e.g. p300) and hypersensitivity to DNase I digestion (Pennacchio et al., 2006; Visel et al., 2008). Several studies have also demonstrated enrichment of H3K4me1 and/or H3K4me3 binding at active tissue specific enhancer elements, with H3K4me1 often being the more prevalent mark (Barski et al., 2007; Heintzman et al., 2007; Robertson et al., 2008). Indeed, H3K4me1 occupancy is among the signature marks for cell-type specific enhancer elements in the human genome (Heintzman et al., 2007). Therefore, we investigated if putative distal cis regulatory elements within the zebrafish genome displayed enrichment of H3K4me1 binding. For this purpose, we first relied on previous observations that highly conserved non-coding elements (CNEs) often mark active cis regulatory sequences, such as enhancers (Pennacchio et al., 2006; Visel et al., 2008; Woolfe et al., 2005). We determined the distribution of H3K4me1 or H3K4me3 binding sites in relation to the closest distal CNE for a selected set of 7629 elements within the zebrafish genome. Similar to our findings for TSSs, we observed clustering of both H3K4me1 and H3K4me3 binding within and flanking CNEs (Figs. 4A, B). However, we identified only 910 CNEs that were marked by H3K4me1 only, and 287 marked by H3K4me3 only, while 503 CNEs were found to be associated with both marks (Supplementary Table 6). Thus, only approximately twenty percent of annotated CNEs appear to be marked by H3K4me1 in our dataset. There are several possible reasons for this finding. For example, there may still be a number of coding sequences and non-coding transcripts that have not been annotated but would be represented by conserved elements that would not have been filtered out during our analysis. In addition, it may be that highly restricted enhancer elements are only active and marked by H3K4me1 in a very small population of cells. In this case, the H3K4me1 signal would fall below our level of detection at this depth of sequencing. However, our results below suggest that many restricted enhancer elements are detected as peaks or enriched regions of H3K4me1 in our dataset. Thus, it is possible that only a small subset of these CNEs represent enhancer elements, while the remaining sequences are likely be other types of cis regulatory responsible for controlling gene expression, or may instead be structural elements.
Our results indicate that only a small fraction of CNEs are associated with H3K4me1 binding. To determine if H3K4me1 occupancy at these CNEs may correlate with enhancer activity, we determined H3K4me1 status at functionally verified CNEs within the paralagous hox4 loci and the fgf8a gene in the zebrafish genome (Komisarczuk et al., 2009; Punnamoottil et al., 2010). Previous studies used CNEs at these loci to generate transgenic zebrafish lines providing a reliable dataset of functional information concerning their in vivo activity. We observe good coverage of all of these loci in our ChIP-Seq datasets, except for the hoxc4a locus, which is duplicated in the Zv7 genome assembly thereby excluding sequence tags in this region from our initial analysis pipeline. When considering the remaining hox4 paralogs (hoxa4, hoxb4, and hoxd4), Punnamoottil et al. (2010) tested a total of 31 elements, of which 13 drove tissue-specific gene expression similar to the corresponding endogenous loci, 14 drove non-specific expression and 4 failed to exhibit activity. When considering activity of the CNE, approximately 70% of active elements, regardless of tissue specificity, are associated with H3K4me1, while only one out of the four inactive elements is enriched for H3K4me1 binding (Supplementary Table 7). However, the correlation between H3K4me1 binding and CNE activity on a gene-by-gene basis is not uniform. For example, all of the 14 hoxb4a CNEs are associated with H3K4me1, as well as H3K4me3 regardless of their expression level or pattern (Fig. 4C; Supplementary Table 7). By contrast, only one out of the four active CNEs at the hoxa4a locus is associated with H3K4me1 (Supplementary Table 7). Similarly, only half of the CNEs flanking the fgf8a gene, regardless of whether they drove expression or not, are found within H3K4me1-enriched regions (Fig. 4D; Supplementary Table 7). Despite these gene-by-gene discrepancies, overall the association of CNEs that display enhancer activity in vivo with H3K4me1 binding is much higher (60 to 70%) than the proportion of all CNEs associated with this mark (approximately 18%). This observation is consistent with previous findings in mammalian cell lines (Heintzman et al., 2007) and suggests that H3K4me1 can be modestly predictive for enhancer activity in the zebrafish genome.
To assess the predictive power of our datasets, we tested the ability of H3K4me1 associated elements from several different loci to drive expression in zebrafish embryos. For this purpose, we cloned a total of 6 elements flanking H3K4me1 peaks in the dll4, hey2, her6, and notch3 genes (Supplementary Table 8; Figs. 5A–C; data not shown). To test their ability to drive tissue-specific gene expression, we generated reporter constructs by placing each fragment upstream of a basal promoter and an enhanced green fluorescent protein (EGFP) trans-gene. Following injection into 1-cell stage zebrafish embryos, we observed EGFP expression by immunostaining at 24 h post fertilization. In four cases, we observed highly restricted expression of the EGFP transgene. For example, an H3K4me1 positive fragment upstream of the notch3 gene was capable of driving expression in the floor plate, hypochord and notochord (Fig. 5A). This pattern is consistent with the known expression pattern of notch3 in axial mesoderm and its known role in patterning midline structures (Latimer and Appel, 2006; Latimer et al., 2002; Westin and Lardelli, 1997). Likewise, an element downstream of her6 drove strong expression in mesenchymal cells surrounding the pharygenal arches (Fig. 5A), consistent with endogenous her6 expression in this region (ZDB-FIG-050630-5240, zfin.org Expression Database). Finally, we identified elements in both dll4 (Fig. 5C) and hey2 (Supplementary Table 8; data not shown) that appeared to drive expression in cells within the developing olfactory placode. Consistent with this observation, we find that endogenous dll4 transcript is likewise expressed in a similar pattern (data not shown), as is hey2 (ZDB-IMAGE-060130, zfin.org).
Taken together with our observations above concerning CNEs, these findings suggest that H3K4me1 is generally associated with active enhancer elements in the zebrafish genome. While we do not observe a perfect correlation between H3K4me1 occupancy and enhancer activity, binding is much more prevalent in active CNEs versus all CNEs surveyed in our study. Most importantly, four out of the six elements that we tested were capable of driving tissue specific gene expression in a manner consistent with their endogenous loci. As mentioned above, H3K4me1 is among the several signature marks associated with enhancer elements in the human genome. However, this mark alone is not sufficient to discern enhancers from other cis regulatory elements, such as insulators and repressors. Thus, future work incorporating genome-wide analysis of DNase hypersensitivity, as well as p300 occupancy, together with our datasets, will likely provide a more definitive predictive signature for identifying enhancer elements in vivo in the zebrafish genome.
Recent work has shown that transcription factor binding at enhancer elements often occurs within several hundred base pairs of H3K4me1-or H3K4me3-enriched elements (Robertson et al., 2008). Therefore, the identification of binding sites for H3K4me1 and H3K4me3 across the zebrafish genome could facilitate computational approaches to identify active cis regulatory sequences required for control of gene expression. To determine if this is the case, we used our datasets to identify common regulatory sequences in cis elements flanking syn-expressed genes. Syn-expressed genes are expressed in the same spatial or temporal pattern, are commonly regulated by a particular signaling pathway, and usually function in a common biological process (Furthauer et al., 2002; Niehrs and Meinhardt, 2002; Tsang et al., 2002). Not surprisingly, syn-expressed genes often share conserved cis regulatory elements that are bound by common transcription factors, which activate their expression (for examples see (Flames and Hobert, 2009; Wenick and Hobert, 2004)).
To identify putative regulatory sequences flanking syn-expressed genes, we first assembled a list of genes within the Notch signaling pathway, which facilitates cell–cell signaling and cell fate decisions in multiple cell types (Artavanis-Tsakonas et al., 1999). This list includes genes that encode components of the Notch signaling pathway, are highly dependent on Notch activity for their expression and display largely overlapping expression patterns within the zebrafish embryo (Table 1). These genes included those encoding the Notch receptors themselves (e.g. notch1b, notch2 and notch3; (Westin and Lardelli, 1997)) as well as downstream targets, such as hairy-related transcription factors (e.g. her4.2, her6, and hes5; (Pasini et al., 2004; Takke et al., 1999)) and notch regulated ankyrin repeat protein a and b genes (nrarpa and nrarpb; (Topczewska et al., 2003)). To identify over-represented cis regulatory sequences associated with these genes, we applied the CLOVER (Cis-element over-representation) algorithm (Frith et al., 2004), which identifies known transcription factor binding sites that are over-represented in a target sequence set as compared to a background sequence. As a target sequence set, we compiled all cis elements 20 kb up- or downstream of the TSS for each syn-expressed gene. For this purpose, we defined “cis elements” to include 750 bp up and downstream of any H3K4me1 and/or me3 peaks, as well as CNEs, within the 40 kb interval. As a background sequence, we used the entire sequence of zebrafish chromosome 15 from Zv7 and we utilized the JASPAR core collection as a source of transcription factor binding site matrices. Analysis of cis elements from Notch syn-expressed genes using CLOVER identified several transcription factor binding sites that were significantly over-represented compared to their frequency of occurrence on chromosome 15 (Table 2). Interestingly, the most enriched binding site was for the Suppressor of hairless (also referred to as Rbpsuh) DNA-binding protein. Rbpsuh interacts directly with the intracellular domain of activated Notch receptors (Notch ICD) leading to formation of a transcriptional activator complex that drives expression of Notch target genes (Fortini and Artavanis-Tsakonas, 1994; Kao et al., 1998). The over-abundance of Rbpsuh binding sites in cis elements flanking these syn-expressed genes is consistent with their common dependence on Notch signaling for expression (Pasini et al., 2004; Takke et al., 1999; Topczewska et al., 2003; Westin and Lardelli, 1997). Furthermore, one of the Rbpsuh sites falls within the H3K4me1-positive element from notch3 that we found was able to drive tissue-specific expression in our reporter assays (see Fig. 5A).
To confirm that the enrichment of Rbpsuh and other sites in cis elements flanking Notch syn-expressed genes was particular to this gene set, we generated a separate list of distinct syn-expressed genes for analysis using CLOVER. In this case, we chose genes within the Fibroblast growth factor (Fgf) syn-expression group, which includes components of the Fgf signaling pathway. Fgf syn-expressed genes are dependent on Fgf and are co-expressed in the midbrain hindbrain boundary, apical-ectodermal ridge in the developing fin bud, and within the somites during zebrafish development (Furthauer et al., 2002; Kudoh et al., 2001; Niehrs and Meinhardt, 2002; Tsang et al., 2002). Our Fgf gene set included 8 syn-expressed genes (Table 2) comprising a total of 135 cis elements, based on the same criteria described above for the Notch syn-expression group. As above, we searched these sequences using CLOVER against the Jaspar core transcription factor database with zebrafish chromosome 15 as the background sequence. In contrast to Notch syn-expressed genes, the CLOVER algorithm failed to identify an Rbpsuh binding site as an over-represented sequence motif flanking Fgf-regulated genes. Biased analysis of the same gene set using a weighted matrix for Drosophila Rbpsuh without using a background sequence revealed that Rbpsuh sites were indeed present in elements in Fgf syn-expressed genes, but at a much lower frequency than found in the Notch syn-expression group (Table 1). These results indicate that computational analysis of cis elements marked by H3K4me1 or H3K4me3 flanking syn-expressed genes can reveal biologically relevant regulatory sequences. Similar analyses of other syn-expressed gene groups will likely provide valuable insight into control of gene expression during embryonic development. Furthermore, similar application of unbiased computational approaches (e.g. the Multiple EM for Motif Elicitation [MEME] algorithm; (Bailey et al., 2006)) for the de novo identification of over-represented sites in syn-expressed genes will prove useful to identify novel factors that control gene expression during embryonic development.
In this study we have applied ChIP-Seq to determine the genome-wide binding profile of H3K4me1 and H3K4me3 in zebrafish embryos at 24 hpf. Our results demonstrate that these marks can be found at TSSs and active enhancer elements within the zebrafish genome, similar to their localization in mammalian embryos. We further show that it is possible to apply computational approaches to our ChIP-Seq datasets, combined with available expression data, to identify biologically relevant regulatory sequence flanking syn-expressed genes. Given the enormous wealth of publicly available expression pattern data available for the zebrafish, it is likely that similar approaches can be further applied to reveal regulatory elements required for the development of different cell types, or associated with particular pathways. Similarly, profiling chromatin modifications at different embryonic stages and of mutant zebrafish lines affecting different developmental processes will allow better identification of cell type-specific transcriptional networks. It will also be important to perform similar analyses for other general regulatory factors that mark functional cis elements (e.g. p300) to generate developmental signatures of these regulatory regions. Indeed, this comprehensive approach relying on multiple chromatin-associated proteins has proven useful to identify developmentally regulated enhancers in humans (Rada-Iglesias et al., 2011). Taken together, the increased application of the zebrafish model using these approaches will provide major insights into the transcriptional regulatory processes that control vertebrate development.
We thank John Polli for excellent fish care and maintenance and Ellie Kittler and Maria Zapp at the UMass Deep Sequencing core for their sequencing and troubleshooting efforts. We are grateful to Charles Sagerstrom and Zhiping Weng for reviewing the manuscript. In addition, we appreciate the input of Dan Hart, Claude Gazin, and Seong-Kyu Choe concerning chromatin immunoprecipitation and deep sequencing approaches. A. W. A. was a Research Fellow supported by the Sarnoff Cardiovascular Research Foundation. Studies in this project were supported by a grant from the National Heart, Lung, and Blood Institute (R01HL093467) to N.D.L.
The deep sequencing data generated in this publication including raw sequence, quality files, mapped sequences and identified peaks have been deposited in NCBI’s Gene Expression Omnibus (GEO) and are accessible through GEO.
Series accession number GSE20600.
Supplementary materials related to this article can be found online at doi:10.1016/j.ydbio.2011.03.007.