Overall data analysis
For the genome-wide analysis of FOXA1 and FOXA3 binding sites and regions of H3K4me3 in HepG2 cells, ChIPs and detection by a high throughput sequencer was performed. In order to get a detailed view of the regions of H3K4me3, we decided to treat the chromatin with micrococcal nuclease (MNase). MNase recognizes the naked DNA, which is not tightly wrapped around the nucleosomes, and digests it. This, in combination with the ChIP, will lead to nucleosome-sized DNA (147 bp) that can be sequenced by high throughput sequencers, resulting in a fine mapping of the H3K4me3 pattern in the genome. After alignment of the raw reads and calculation of overlap signals, we compared the results between the different libraries prepared for sequencing and detected a good correlation (Figure S1 in Additional data file 1). Thereafter, the aligned reads were merged, ordered on genomic positions, and extended by the average fragment size (Table S1 in Additional data file 1). We also sequenced a fraction of the input material, generated in the ChIPs, to use as a negative control for detection of regions where repeats may cause false positive overlap signals. Then we identified peaks with significant ChIP-enrichment by considering both the ChIP- and input signals.
We detected 8,175 peaks for FOXA1 and 4,598 peaks for FOXA3 in the human genome in the HepG2 cells (Table ; files with information on peak positions for upload in the UCSC genome browser are available as Additional data files 2, 3, and 4). Out of these, only 465 (5.7%) and 562 (12.2%), respectively, were located within 1 kb of a TSS (Figure S2 in Additional data file 1), emphasizing the importance of true genome-wide studies for these factors. A majority of the putative binding sites were, as expected, located in intragenic and intergenic regions. Genes with a FOXA binding within 1 kb of their TSS demonstrated significantly higher expression than all genes (Figure S3 in Additional data file 1). A search with the de novo
motif finding program BCRANK [21
] resulted in different motifs with variations of TGTTTAC as the top three for FOXA1 and top eight for FOXA3 (Figure ).
Number of regions and overlaps with putative FOXA binding and H3K4me3
Figure 1 Results of de novo motif search. FOXA1 and FOXA3 data were analyzed using BCRANK as described in the Materials and methods. To the right of each motif is the assigned BCRANK score, which gives an indication of the quality of the motif. (a) Top ten predicted (more ...)
As mentioned, members in the FOXA family regulate common pathways. This was supported by our Gene Ontology (GO) analysis, where some categories were recurrent for FOXA1 and FOXA3 (Figure S4A, B in Additional data file 1). Here, we consider a gene to be regulated by FOXA1 or FOXA3 when it contains a binding site within 1 kb of the TSS. Therefore, we analyzed the data for possible co-binding sites for these two factors. As presented in Table , more than 3,000 peaks were found in both data sets.
In a genome-wide study of FOXA1 binding sites in the human breast adenocarcinoma cell line (MCF-7), 12,904 regions have been found at a 1% false discovery rate [14
]. Of these, 2,093 (16%) overlap with the putative binding sites found in HepG2 cells in our study. In a similar way, 2,178 (27%) of our regions were reciprocally found in the MCF-7-data. This indicates that around 2,000 FOXA1 binding sites are common between the HepG2 and MCF-7 cell lines, while 6,000 binding sites are unique to HepG2.
We found 41,780 H3K4me3 regions in the HepG2 genome (Table ). This would approximately correspond to 160,000 nucleosomes with trimethylation of lysine 4 on histone H3. This number is calculated by multiplying the 41,780 regions by 764, which is the average peak length (Table S1 in Additional data file 1), and then dividing the product by 200, the assumed average distance in base-pairs between the start of two nucleosomes in these regions. Of the H3K4me3 regions, 42% are within 1 kb and an additional 15% within 5 kb of the TSS of a known gene, and 4.2% within 1 kb of a 3'-end (Figure S2 in Additional data file 1). Furthermore, 11% of these regions are intragenic, leaving 28% of the H3K4me3 not in the vicinity of a known gene.
Distinct H3K4me3 at bidirectional and other promoter structures
Next, we aimed to discover patterns of H3K4me3 that could be indicative of different types of promoters. Therefore, we extracted the H3K4me3 signals around the TSSs of about 24,000 genes for which the expression measurements in HepG2 are available. We then performed k-means clustering of the H3K4me3 signals to partition the genes into seven clusters, each with its individual H3K4me3 signature (Figure ). Nearly all clusters seem to differ in the level of expression of the downstream genes from the other clusters (Figure ; Table S2 in Additional data file 1). Furthermore, comparison of the expression levels in each cluster to the expression of all 24,000 genes using a two-tailed t-test showed that all but cluster V have significantly higher expression than the average (P < 0.0001). Instead, cluster V has significantly lower expression than the average (P < 0.0001). Genes with the highest expression in HepG2 (cluster I with 596 genes; Table S3) tend to be more enriched for H3K4me3 than any other cluster. Opposed to this is cluster V (12,776 genes), which contains the lowest expressed genes with no or very low enrichment for H3K4me3. The common feature of the six clusters with high H3K4me3 levels is that a nucleosome with this modification was centered at approximately 125 bp downstream of the TSS. Furthermore, these six clusters also contained genes from different GO categories than those of cluster V, which had GO categories overrepresented for genes involved in development (Table S4 in Additional data file 1).
Figure 2 H3K4me3 signals around the transcriptions start sites of 23,849 genes. (a) Enrichment of H3K4me3 in a window surrounding the TSSs. The genes were grouped into seven clusters (I to VII) by their H3K4me3 patterns as described in the Materials and methods (more ...)
Considering clusters I, II, and III with high enrichments for H3K4me3 upstream of the TSS, we suspected the existence of bidirectional transcription in these regions. Therefore, the clusters were compared with the data for CAGE tags [22
] in HepG2. CAGE (cap-analysis of gene expression) is a measurement of the expression of the TSSs of a gene. Consistent with our expectation, over 30% of the genes in each of these three clusters were in the vicinity of CAGE tags on the other strand compared to the TSS, that is, they were part of a bidirectional promoter (Table S3 in Additional data file 1). For cluster II, with a high and broad peak upstream of the TSS, this fraction exceeded 60%. Another significant finding was that 11% of the genes in cluster I, which had the highest expression, also had a FOXA3 binding site within 1 kb of their TSS (Table S3 in Additional data file 1).
Previous studies have suggested that bidirectional promoters occur in CpG-rich sequences [22
]. Thus, we examined the frequency of different sequence elements at the TSSs of the genes in the seven clusters (Table S5 in Additional data file 1). Promoters for all clusters - except cluster V, which had the least number of bidirectional promoters - were highly enriched for CpG-rich sequences. As expected, cluster V contained a higher number of TATA- and CAAT-boxes.
Thus, by unsupervised clustering of enrichment signals around the TSSs, we detected different H3K4me3 signatures depending on the structure of the promoter, sequence elements present in the promoter, and the level of expression of the downstream gene. A similar type of analysis was also performed for H3K4me3 at the 3'-end of genes (Figure S5 in Additional data file 1). Some of the clusters with higher signals at the 3'-ends were associated with high expression of the gene, suggesting a reciprocal H3K4me3 signal at the beginning and the end of some genes. These clusters also have higher frequency of CAGE tags at the 3'-ends (Table S10 in Additional data file 1). For further comments, see the supplementary results in Additional data file 1.
FOXA interactions detected by co-immunoprecipitation and ChIP-reChIP
We have previously examined the genome-wide location of FOXA2 binding in HepG2 cells, where we found 7,253 binding sites for this factor [25
]. Comparison of the FOXA2 data with that for FOXA1 and FOXA3 revealed 2,304 regions in common for all three factors. Here, a common binding is reported when the distance between the peak centers is less than 1 kb. Furthermore, when the genomic localization of different combinations of these factors was examined, we found around 100 regions of common binding for each pair (Figure ). While 12 of 121 (10%) FOXA1-2 regions were within 5 kb of a TSS of a known gene (Figure ), 49 of 96 (51%) FOXA2-3 regions were within the same distance (Figure ). For FOXA1-3, 14 of 102 (14%) regions are within 5 kb, although there are no common binding sites for this pair within the first kilobase of a TSS (Figure ). The corresponding number for all three factors together is 22% (505 of 2,304; Figure ).
Figure 3 Genomic localization of common binding regions for FOXA1, FOXA2, and FOXA3. (a) FOXA1-2, (b) FOXA2-3, (c) FOXA1-3, and (d) FOXA1-2-3. Each region was mapped to all UCSC gene coordinates and sequentially matched to the categories 500 bp from TSS, 500 bp (more ...)
Based on these data, we assumed that FOXA1, FOXA2, and FOXA3 interact with each other in vivo. Therefore, we employed co-immunoprecipitation (Co-IP) to examine the existence of these complexes in HepG2. For this, we immunoprecipitated the three endogenous factors and immunoblotted with the same antibodies, testing all six possible combinations. We found that FOXA2 interacts with FOXA1 and the data suggest an interaction between FOXA2 and FOXA3 as well (Figure ). We could not detect any direct protein-protein interaction between FOXA1 and FOXA3.
Figure 4 Co-immunoprecipitation and ChIP-reChIP of FOXAs reveals interaction and co-binding among FOXAs. (a) Immunoprecipitations were performed with indicated antibodies on nuclear extracts of HepG2 cells and the immunocomplexes were detected with FOXA1, FOXA2, (more ...)
The lack of evidence for a direct FOXA1 and FOXA3 interaction could be for technical reasons with regard to the Co-IP protocol. Therefore, to detect and verify possible co-bindings and to further understand whether these are due to binding of different FOXA molecules at the same site in different cells or due to co-binding in the same cell, we employed the ChIP-reChIP method in combination with semiquantitative PCR. With this method, crosslinked protein-DNA complexes are immunoprecipitated first with the antibody for one protein in the complex, followed by immunoprecipitation with the antibody for the second protein. We immunoprecipitated the chromatin from HepG2 cells with any of the three FOXA antibodies (FOXA1, FOXA2, and FOXA3) and reimmunoprecipitated the material with another of the three antibodies. The sequence of the pairs was then reversed in independent replicates in order to verify the results from the first round. The resulting DNA was then analyzed by PCR with primers amplifying a region containing enriched peaks for both factors in the complex. As a negative control, we used primers for regions containing a binding site for only one of the factors in the pair and primers for a region with no binding site for any of the factors. Theoretically, if two of the factors co-bind in a region, that sequence should be enriched in the ChIPed DNA, while sequences with a single binding should not be enriched as they are selected against by the serial immunoprecipitation. As demonstrated in Figure , we could find that each of the factors FOXA1, FOXA2, and FOXA3 bind in close vicinity of any of the other two FOXAs on the same DNA molecule in the same cell.
With these results, we demonstrate regions of pair-wise binding for FOXA1, FOXA2, and FOXA3, where these factors co-bind in close proximity and, as indicated by the Co-IP data, some of these factors may even interact at the site of binding.
Correlation of FOXA binding and H3K4me3
FOXA TFs are known to be involved in opening of compacted chromatin. Accordingly, we examined the H3K4me3 footprint pattern in the regions with FOXA1-2-3, FOXA1-2, FOXA2-3, and FOXA1-3 binding. Regions with FOXA1-2 and FOXA1-3 binding seem to have a lower enrichment for H3K4me3 than FOXA2-3 regions (Figure ). This was expected, as only 32% of FOXA1 binding sites had a region with H3K4me3 within 1 kb, compared to 44% for FOXA3 (Table ). The more interesting finding is the pattern of histone trimethylation in regions with FOXA1-2-3 binding, where a double peak surrounds the peak of TF binding (Figure ).
Figure 5 Enrichment signals in regions of pair-wise co-binding for FOXA1, FOXA2, and FOXA3. For each FOXA co-binding site, enrichment signals for FOXA1 (red), FOXA2 (orange), FOXA3 (blue), H3K4me3 (black), HNF4α (olive green), and GABP (turquoise) are (more ...)
We looked further into this double peak by k-means clustering of the signal for H3K4me3 in four different clusters (Figure ). Two of these clusters, clusters I and II, revealed patterns that resembled those at the TSS (Figure ), with each of the curves on either side of the FOXA1-2-3 binding. Due to the observed pattern, we decided to look for TSSs within a 5 kb distance from the combined FOXA1-2-3 binding site. Of the 2,304 regions with triple binding, 505 contained a known TSS within this distance (Figure ), with a similar number of TSSs on the two strands (Table S6 in Additional data file 1). A majority of regions in clusters I and II were within 5 kb of a TSS and these clusters showed the highest levels of H3K4me3. The H3K4me3 peaks in these clusters are located at opposite sides of the FOXA1-2-3 binding, and this would suggest that the H3K4me3 signals are biased towards the direction of transcription (Figure ; Table S6 in Additional data file 1).
Figure 6 H3K4me3 signals around 2,303 FOXA1-2-3 regions. (a) Enrichment of H3K4me3 in a window surrounding the center of FOXA1-2-3 regions. The regions were grouped into four clusters (I to IV) by their H3K4me3 patterns. The enrichment scale is from high (yellow) (more ...)
In the next step, we correlated these clusters with CAGE tags from HepG2 within the same distance as above. This comparison revealed a higher percentage of TSSs near the combined FOXA binding sites (Table S7 in Additional data file 1). When considering CAGE-tags within 1 kb, the difference in directionality for clusters I and II became more evident, with more CAGE tags in the plus direction for cluster I and in the minus direction for cluster II. Furthermore, by creating separate footprints of H3K4me3 around the FOXA1-2-3 regions with or without a TSS within 5 kb, we observed that both groups exhibit a double peak, each peak with its centre at a distance of approximately 200 bp from the binding site (Figure ). The H3K4me3 pattern around FOXA1-2-3 binding sites, as presented in our study, correlates well with the hypothesis that FOXAs position nucleosomes at their binding site [7
], which is best supported by FOXA1-2-3 regions with a TSS within 5 kb (Figure ).
Correlation of FOXA binding with other transcription factors
As mentioned previously, FOXAs are involved in auto- and feed-forward regulation of FOXA
genes and other TF genes in the liver. Therefore, we examined the binding pattern at the FOXA
genes and compared this with our data on upstream stimulatory factor (USF)1 and USF2 [26
], and HNF4α and GABP (GA binding protein; NRF2) [25
]. While FOXA1
had binding sites for all three factors, FOXA3
does not seem to be regulated by any of the FOXAs (Figure S6 in Additional data file 1). Moreover, FOXA1 and FOXA3 both seem to co-bind with the other factors at a similar rate (Table ). When we examined the co-binding of GABP with FOXA1-2, FOXA2-3, and FOXA1-3, we found that only the second complex had co-binding with it (Table and Figure ). In our ChIP-seq study of GABP, we found that 85% of its putative binding sites were located at TSSs.
Overlap between putative FOXA binding sites and the binding of other factors
Another interesting observation was that FOXA1-2 and FOXA1-3 regions were more related to HNF4α binding than FOXA2-3 (Table and Figure ). In addition, FOXA1-2-3 binding is highly correlated with HNF4α binding in HepG2 cells (Table and Figure ).
Allele-specific DNA-protein interactions
Monoallelic expression of genes can be due to imprinting, allelic exclusion or sex chromosome dosage compensation. SNPs in combination with the ChIP-seq could prove to be a powerful method for detection of allele-specific binding that could lead to monoallelic or preferential expression from one allele in the studied genome. With a high enough number of sequence reads at a locus with a heterozygous SNP, one can detect whether the majority of reads are from one allele or the other. If TF binding or active histone marks are predominantly found on only one of the alleles, one can suspect preferential binding to that particular allele of the gene. Previously, we have interrogated the genome of HepG2 cells for SNPs by the Infinium assay and Human-1M array (Illumina) in 1,000,000 positions (data not shown). Among these, 220,000 were heterozygous SNPs (Additional data file 5), which we screened in the ChIP-seq data for allele-specific binding. After taking multiple testing into account as described in the Materials and methods section, we found three examples for FOXA1, two for FOXA3, and six for H3K4me3 (Table S8 in Additional data file 1).
A detailed view of the most significant SNP for FOXA1, rs7248104, located in an intronic region of the insulin receptor precursor gene (INSR), revealed some interesting results (Figure ). rs7248104 is a heterozygous (C/T) SNP in HepG2 located in a DNA sequence that exactly matches the top motif found for FOXA1 (Figure ). The motif predicts binding of FOXA1 to the T-allele, but not the C-allele, which was reflected in the ChIP-seq data as all 15 reads that cover the SNP contain the T-allele (Figure ). This could indicate rs7248104 as a functional SNP, due to its effect on FOXA1 binding to the DNA, although experimental data are required to confirm this.
Figure 7 Preferential binding of FOXA1 at a heterozygous SNP. SNP rs7248104 is located in a FOXA1 binding sequence and FOXA1 is preferentially bound to one allele. At the top is the FOXA1 motif, predicted by the BCRANK method, followed by the sequence found in (more ...)
Combining ChIP-seq and SNP association data
Several genome-wide association studies have identified SNPs associated with various traits. Combining such data with our genome-wide DNA-protein interaction maps could offer a possibility to find functional SNPs. Here, we compared our data for FOXA1, FOXA3, and H3K4me3 with previously published genome-wide association studies for plasma levels of liver enzymes and metabolic traits, for example, lipid and fasting glucose levels [27
]. We searched for these reported SNPs in all our positive regions and identified those that were associated with a specific trait and that were included in our significant peaks for TF binding or H3K4me3 (Table ; Table S9 in Additional data file 1). Locating these SNPs in the regulatory regions is an important first step towards identification of functional SNPs and a possible hint on the effect of this nucleotide variation.
Comparison of ChIP-seq data with genome-wide association studies for identification of functional SNPs