|Home | About | Journals | Submit | Contact Us | Français|
CB: Salk Institute, Integrative Genomics and Bioinformatics Core, 10010 North Torrey Pines Road, La Jolla, CA 92037, USA
MUK: University of Eastern Finland, Department of Biotechnology and Molecular Medicine, P.O. Box 1627, 70211 Kuopio, Finland
LDO: University of Californa, Los Angeles, Department of Molecular Cell and Developmental Biology, 3000 Terasaki Life Sciences Building, 610 Charles Young Drive East, Los Angeles, CA 90095, USA
The mechanisms by which genetic variation affects transcription regulation and phenotypes at the nucleotide level are incompletely understood. Here, we use natural genetic variation as an in vivo mutagenesis screen to assess the genome-wide effects of sequence variation on lineage-determining and signal-specific transcription factor binding, epigenomics, and transcriptional outcomes in primary macrophages from different mouse strains. We find substantial genetic evidence supporting the concept that lineage-determining transcription factors (LDTFs) define epigenetic and transcriptomic states by selecting enhancer-like regions in the genome in a collaborative fashion and facilitating binding of signal-dependent factors. This hierarchical model of transcription factor function suggests that limited sets of genomic data for LDTFs and informative histone modifications can be used for prioritization of disease-associated regulatory variants.
Inter-individual genetic variation is a major cause of diversity in phenotypes and disease susceptibility. While sequence variants in gene promoters and protein-coding regions provide obvious prioritization of disease-causing variants, the majority (88%) of GWAS loci are in non-coding DNA, suggesting regulatory functions1. Prioritization of functional intergenic variants remains challenging due in part to an incomplete understanding of how regulation is achieved at the nucleotide level in different cell types and environmental contexts2-11. While recent studies have described important roles for lineage-determining transcription factors (LDTFs), also referred to as pioneer factors or master regulators, in selecting cell type-specific enhancers12-15, the sequence determinants that guide their binding are poorly understood. Previous findings in macrophages and B cells suggest a hierarchical model of regulatory function6, where a relatively small set of LDTFs collaboratively compete with nucleosomes to bind DNA in a cell type-specific manner (Fig 1a, i->ii). The binding of these factors is proposed to ‘prime’ DNA by initiating deposition of histone modifications that are associated with cis-active regulatory regions (Fig. 1a, ii -> iii) and enable concurrent or subsequent binding of signal-dependent transcription factors that direct regulated gene expression (Fig. 1a, iii ->iv)6,13,15,16. In principle, this model provides a straightforward framework that allows non-coding variants to be classified with respect to their ability to directly perturb LDTF binding and their potential to exert indirect effects on binding of other LDTFs and signal-dependent transcription factors. To test the validity of this model and its ability to explain effects of genetic variation on transcription factor binding and function, we exploited the naturally-occurring genetic variation between the inbred C57BL/6J and BALB/cJ mouse strains (~4 million SNPs and ~750k InDels17) as an ‘in vivo mutagenesis screen.’
First, we quantified genome-wide binding patterns of macrophage LDTFs PU.1 and C/EBPα from both mouse strains using ChIP-Seq. These experiments identified a combined 82,154 PU.1 and 54,874 C/EBPα peaks, with less than 1% of sites exhibiting highly significant strain-specific binding (PU.1, n=496; C/EBPα, n=263; 4-fold tag count ratio, FDR < 1e−14, >90% located >3 kb from gene promoters) (Fig. 1b, c, Extended Data Fig. 1a). Strain-specific binding was defined using biological ChIP-Seq replicates, which yielded <0.2% empirical false positives (Extended Data Fig. 1b-g). Differential binding of PU.1 and C/EBPα was significantly correlated with differential expression of the nearest gene as measured by RNA-Seq (Fig. 1d). There were no apparent differences in genomic context for strain-similar and strain-specific binding at inter- or intragenic sites (>3 kb to promoters) as defined by CpG content, distance from nearest gene or repetitive element, or conservation score (Extended Data Fig. 2a). Instead, strain-specific binding was highly correlated with polymorphism frequency. We observed 5-fold enrichment of polymorphisms at strain-specific versus strain-similar PU.1-bound and C/EBPα-bound regions (Fig. 1e, Extended Data Fig. 2b), with the greatest variant density at the peak centers, (Extended Data Fig. 2c,d).
To investigate direct effects of sequence variants on transcription factor binding, we identified the most enriched position weight matrices (PWM) in genomic regions marked by histone H3 lysine 4 di-methylation (H3K4me2) or bound by PU.1 or C/EBPα (Extended Data Fig. 3a, Supplementary Table 1). This analysis consistently identified consensus and degenerate motifs for the LDTFs PU.1, C/EBP and AP-1 as the most highly enriched PWMs. Notably, the frequency of mutations in these motifs increased with strain-specific binding of PU.1 and C/EBPα (Extended Data Fig. 2e,f). Excluding strain-specific loci without cis-variation (~11%), 41% of strain-specific PU.1 binding directly associated with strain-specific mutations in PU.1 motifs in the other strain. For C/EBPα, 44% of strain-specific binding associated with strain-specific C/EBPα motifs (Fig. 1f).
Although strain-specific binding of PU.1 and C/EBPα was highly linked to strain-specific motif mutations, strain-specific motif mutations were also associated with strain-similar binding (Extended Data Fig. 3c,d). This raised the question as to whether specific features of motif mutations could be used to predict strain-specific binding. Comparison of motif mutations in strain-specific to strain-similar peaks revealed three distinct attributes contributing to predictive power. First, mutated motifs within 20 bp of the experimentally defined binding centers were more highly associated with an effect on binding (PU.1, p = 1.6e-4; C/EBPα, p = 0.036; Extended Data Fig. 4a-d). Second, the presence of alternate motifs within 100 bp of the PU.1 peak centers significantly buffered the effect of strain-specific PU.1 motifs (Extended Data Fig. 4e,f). Third, after removing peaks with alternative motifs, analysis of the nucleotides mutated enabled delineation of an empirically defined functional motif that revealed a strong relationship between ‘core’ mutations and altered binding (Fig. 1g, Extended Data Fig. 4g-i, p=3.2e-8 PU.1 and p=5.1e-4 C/EBP). Taken together, core motif mutations <20 bp from the peak center that lacked alternative motifs were 3.5× and 3× more likely to occur in differential versus similar bound peaks for PU.1 and C/EBPα, respectively (Extended Data Fig. 4j,k). Notably, up to 90% of these mutations were located in differentially bound peaks (Extended Data Fig. 4l,m). To investigate the possibility that an algorithm incorporating these characteristics could be used to predict the impact of a specific motif mutation on transcription factor binding, we performed ChIP-Seq analysis for PU.1 in macrophages derived from a third inbred strain of mice, NOD/ShiLtJ (NOD). Of the ~1.4 million identifiable PU.1 motifs in the C57BL/6J reference genome, 18,322 contain SNPs that mutate the PU.1 motif in the NOD genome. 1.6% of these mutations were associated with strain specific binding (Fig. 1h). Of the 244 NOD PU.1 motif mutations located in PU.1-bound regions in C57BL/6J or BALBc, 68% were associated with strain-specific binding. When considering all three variables (motif distance, alternative motif, and motif core, Extended Data Fig. 5), 88% of the predicted functional mutations were consistent with impaired PU.1 binding in NOD (Fig. 1h).
To investigate the potential impact of mutations in LDTF recognition motifs on collaborative binding, we analyzed all strain-specific PU.1 or C/EBPα binding events in regions containing LDTF motif mutations. PU.1 motif mutations resulting in loss of PU.1 binding were frequently associated with a corresponding loss of nearby C/EBPα binding in the absence of C/EBP motif mutations (Fig. 2a, top). Conversely, C/EBP motif mutations resulting in loss of C/EBPα binding were frequently associated with a corresponding loss of nearby PU.1 binding in the absence of PU.1 motif mutations (Fig. 2a, middle). Similar results were observed at locations containing strain-specific mutations in AP-1 binding motifs, but intact PU.1 and C/EBP motifs (Fig. 2a, bottom).
We next considered the global relationships of mutations in PU.1, C/EBP, and AP-1 motifs with strain-specific binding of PU.1 and C/EBPα, taking into account both consensus and ‘weak’ motifs for PU.1 and C/EBP. NF-κB motifs were included as controls that were not expected to affect PU.1 or C/EBPα binding in unstimulated macrophages (Extended Data Fig. 3a,b, Supplementary Table 1). While mutations in PU.1 motifs had the strongest effect on strain-specific PU.1 binding, mutations exclusively in C/EBP and/or AP-1 motifs also significantly correlated with differential PU.1 binding relative to similarly bound loci (Fig. 2b). Similar relationships were observed for C/EBP (Extended Data Fig. 6a). The motif distance distributions for co-bound factors were broad (half width ~100 nt), and only a minor subset of sites exhibited defined distances expected for direct protein-protein interactions (Extended Data Fig. 6b), suggesting transcription factor-nucleosome competition as the driving force behind the collaborative binding behavior6,18. Together, strain-specific mutations in nearby C/EBP and AP-1 motifs were associated with ~15% of strain-specific PU.1 binding at sites with strain-similar PU.1 motifs. Mutations in nearby PU.1 and AP-1 motifs were associated with ~30% of strain-specific C/EBPα binding at sites with strain-similar C/EBP motifs (Fig. 1f). Overall, 48% of strain-specific PU.1 binding and 57% of C/EBPα binding was associated with at least one assignable LDTF motif mutation (Fig. 1f). To genetically test whether these correlations are consistent with a collaborative binding model, we considered all LDTF motif mutations and evaluated their effects on PU.1 binding in macrophages derived from NOD mice. For polymorphic strain-specific PU.1 loci containing strain-specific LDTF motifs (n = 220), PU.1 binding profiles matched the strain with shared alleles for 91% and 92% of cases (Fig. 3a). At 8% (n = 17) of the loci, the NOD genome broke the C57BL/6J/BALB/cJ haplotypes, and in all cases, the NOD genotype at the LDTF motif variant matched the strain with similar binding (Supplementary Table 2), indicating that these variants are likely the cause of binding differences. An example is shown in Fig. 3b, where PU.1 binds in C57BL/6J but not in BALB/cJ or NOD. Only one SNP in this region is associated with PU.1 binding exclusively in C57BL/6J; here, the T allele forms part of a neighboring AP-1 motif in C57BL/6J that is mutated by the C allele in BALB/cJ and NOD. These findings provide genetic evidence that PU.1 binding to this location is dependent on collaborative interactions with AP-1.
To confirm that the allele-specific binding also occurs in heterozygous cells, we performed ChIP-Seq for PU.1 and C/EBPα in macrophages from CB6F1/J hybrid mice, which are F1 offspring of a C57/BL/6J × BALB/cJ cross. In the great majority of cases, alleles bound specifically in a parental strain were also bound preferentially in the F1 generation. (Fig. 3c and Extended Data Fig. 6c).
Given the genetic evidence that LDTFs collaborate to bind DNA, we next tested the extent to which strain-specific LDTF binding explained promoter-distal (>3 kb) strain-specific histone modification events, such as H3K4me2 and H3K27Ac deposition, which respectively mark ‘primed’ and ‘active’ chromatin19,20 (Fig. 1a, ii->iii). Genomic regions exhibiting strain-specific binding of PU.1 and C/EBPα were associated with strain-specific H3K4me2 and H3K27ac (Fig. 2a, right columns). Strain-specific histone modifications correlated with nearby gene expression (Fig. 2c), and H3K27Ac modification tracked with the corresponding parental allele in CB6F1/J hybrid mice (Fig. 3d). Strain-specific binding of PU.1 and C/EBPα were individually correlated with H3Kme2 and H3K27Ac deposition, with the combined binding of both factors exhibiting even greater correlation than the individual factors (Extended Data Fig. 7a-f). Further, LDTF motif mutations segregated with differential LDTF binding and histone modifications (Fig. 2d, Extended Data Fig. 7g). Together, these findings support the concept that LDTFs play quantitatively important roles in establishing these histone modifications, likely through initiating transcription in a combinatorial fashion21.
Expression quantitative trait loci (eQTLs) are polymorphic loci whose alleles are associated with individual RNA expression levels across a population22. Thus, eQTLs define active gene regulatory loci and provide an alternative method for assigning regulatory function to gene expression. To interrogate the relationship between histone modification and eQTLs, we analyzed previously reported eQTL data from 85 inbred mouse strains in the Hybrid Mouse Diversity Panel (HMDP) in primary macrophages23 (see Methods). We found that eQTLs overlapped H3K4me2- or H3K27Ac-marked regions at frequencies greater than expected by chance, supporting the role of histone modifications as landmarks of regulatory activity (hypergeometric test p-values: H3K4me2 = 1e-2147, H3K27Ac = 1e-2290). Next, given the highly cell type-specific nature of gene regulation24, we hypothesized that eQTLs from different cell types would be reflected in the histone modification profiles in the same cell type. We examined liver and macrophage eQTLs for a set of ~130k SNPs from the HMDP25 for overlap with H3K27Ac loci defined in macrophages or in liver, pro-B, or mouse ES cells20. Macrophage eQTL were more significantly enriched for overlap with macrophage H3K27Ac regions than liver H3K27Ac. Similarly, liver eQTL were most significantly enriched with liver H3K27Ac relative to macrophage H3K27Ac (Fig. 2e). Clustering of H3K27Ac profiles revealed that liver and ES H3K27Ac profiles are most similar (Extended Data Fig. 7h), providing an explanation as to why liver eQTLs were highly enriched in mES H3K27Ac regions.
To evaluate the prediction that primed regulatory loci (containing H3K4me2) often require additional binding of signal-dependent TFs to achieve regulatory activity (Fig 1a, iii->v), we treated C57BL/6J and BALB/cJ macrophages with Kdo2-Lipid A (KLA), a potent and specific agonist of TLR426. KLA treatment causes NF-κB to enter the nucleus, bind DNA and regulate several hundred target genes26,27. We performed ChIP-Seq for PU.1, C/EBPα and the RelA/p65 component of NF-κB in untreated and KLA-treated macrophages and observed that 61% of sites that gained p65 were pre-bound by PU.1 and/or C/EBPα without KLA. De novo motif analysis indicated that an AP-1 motif was present in 42% of the remaining sites, suggesting that AP-1 is responsible for priming a large proportion of the p65 cistrome (Extended Data Fig. 8a), in line with previous reports16.
To further interrogate the dependence of p65 on LDTFs we focused on sites that gained p65 only in one strain (n = 932, >90% promoter-distal, Extended Data Fig. 1a, Fig. 4a, 4th column). In the vast majority of cases, PU.1 and/or C/EBPα were bound prior to KLA treatment only in the strain exhibiting p65 binding (Fig. 4a). In addition, strain-specific p65 binding primarily occurred at loci already marked by H3K4me2, and led to an increase of H3K27Ac, consistent with the proposed model. To analyze the effects of genetic variation on transcription factor motifs, we performed strain-specific LDTF and NF-κB motif finding in polymorphic strain-specific p65-bound peaks (n = 750) (Extended Data Fig. 3b). Notably, p65 binding was influenced by mutations in individual LDTF motifs to a similar extent as mutations in the NF-κB motif itself (Fig. 4b, Extended Data Fig. 8b). For strain-specific p65 binding events, 34% could be attributed to assignable mutations in PU.1, C/EBP, or AP-1 motifs, whereas 9% could be explained by mutations in the assignable NF-κB motifs themselves (Fig. 4c). RelA/p65 is known to bind to degenerate and non-canonical motifs28 that might not be captured by de novo motif analysis. To gain motif-independent insight into variant location and strain-specific TF binding, we assessed the variant frequency relative to the centers of strain-specific p65 peaks. Similar to strain-specific PU.1 and C/EBPα peaks, strain-specific p65 peaks are in regions of higher variant density than strain-similar peaks (Extended Data Fig. 8c). In contrast to LDTFs, where strain-specifically bound regions have a high variant density at their peak centers, the distribution of variants at strain-specific p65 peaks is significantly different from those of the LDTFs (Kolmogorov-Smirnov p-value < 0.013) as it contains fewer variants at the peak centers and is broader (Fig. 4d, Extended Data Fig. 8d-f). This is consistent with p65 binding being more affected by sequence variation in motifs of neighboring factors than LDTFs.
Overall, strain-specific p65-bound regulatory sites were significantly correlated with nearby genic transcription and mRNA production (Fig. 4e). We tested strain-specifically bound and epigenetically marked putative enhancer sequences with strain-specific mutations for differential enhancer function in transient and stable reporter assays (Fig. 5a, Extended Data Figs. 9a, b). We observed the predicted strain-specific enhancer activity for 18 of 20 of these genomic sequences. Conversely, enhancer elements with sequence variation in non-core nucleotides that were not predicted to alter PU.1 or C/EBP binding and that exhibited strain-similar binding patterns exhibited similar enhancer activity (Extended Data Fig. 10a).
Lastly, we tested whether the predicted motif-disrupting variants could specifically explain strain-specific enhancer activity by swapping variants at the putative causative alleles in C57BL/6J to BALB/cJ while maintaining the genetic background for the remainder of the enhancer sequences. Representative examples in which reversal of such SNPs in PU.1, C/EBP and p65 motifs reversed strain-specific enhancer activity are illustrated in Fig. 5c and Extended Data Figs. 10b,c. In contrast, reversal of nearby SNPs not predicted to alter LDTF motifs had no effect on strain-specific enhancer activity (Extended Data Fig. 10c).
In concert, we have exploited natural genetic variation to test a collaborative model for enhancer selection and function, and conversely explored the ability of this model to explain strain-specific differences in transcription factor binding and epigenetic features associated with functional enhancers in macrophages. These studies provide genetic evidence that lineage-determining transcription factors are dependent on collaborative binding to variably spaced DNA recognition motifs in order to select enhancers and enable binding of signal-dependent transcription factors. Notably, the variable motif distances observed at co-bound LDTF loci suggests that collaborative binding does not generally require direct protein-protein interactions between the involved transcription factors. The proposed hierarchical LDTF collaborative model provides a conceptual framework for prioritization of non-coding disease-associated regulatory variants. While all cells express hundreds of transcription factors, a large fraction of functional enhancers (~70% in macrophages) are characterized by collaborative interactions involving relatively small sets of lineage determining transcription factors (e.g., PU.1, AP-1 and C/EBPs). The requirement for collaborative binding interactions provides an explanation for why transcription factor binding is lost at sites where mutations do not occur in the cognate recognition motif. In the case of NF-κB, for example, mutations in the motifs for lineage determining factors were approximately three times more likely to result in decreased binding of NF-κB than mutations in the NF-κB binding site itself. An essential step in leveraging the collaborative model to pinpoint potential disease-causing variants is the definition of relevant LDTF binding sites and functionally important variants. At the current level of genome annotation, this cannot be achieved by analysis of DNA sequence alone. For example, there are ~1-2×106 identifiable PU.1 binding sites in the human29 and mouse genomes, but <10% are actually occupied by PU.1 in macrophages. By experimentally defining strain-similar and strain-specific binding patterns for PU.1, the relevant sites at which mutations can result in altered function are identified. Comparison of PU.1 motif mutations associated with strain-specific versus strain-similar binding allowed the genetic definition of a functional binding matrix and additional distinguishing features that enabled accurate prediction of functional mutations in a third strain. Thus, by collecting a relatively limited set of genomic binding data for LDTFs and informative histone modifications, this analytical approach can be exploited to explain a greater extent of variation in enhancer selection and function than previously possible7,10. To further increase the specificity and sensitivity for detecting functional variations, identification of transcription factor motifs that permit binding but diverge from the consensus PWM, i.e. “weak” motifs, needs to be improved, as such sites are more likely to be affected by mutation29,30. In addition, transcription factors less abundant than LDTFs likely play individually small but collectively significant roles. At a larger scale, non-cis-acting, long-range epigenetic mechanisms may also be important for enhancer selection. A major goal for the future will be to extend these approaches to understanding natural genetic variation associated with human disease.
Thioglycolate-elicited peritoneal macrophages were collected 4 days post-injection from male 6-8 wk C57BL/6J, BALB/cJ or CB6F1/J Hybrid mice and plated at 20 × 106 cells/15-cm petri dish in RPMI1640 + 10% FBS + 1X PenStrep. One day after plating, cells were treated with fresh media with or without 100 ng/ml Kdo2-LipidA (KLA) for 1 hour and then directly used for downstream analyses. All animal experiments were performed in compliance with the ethical standards set forth by UC San Diego's Institutional Annual Care and Use Committee (IUCAC).
Media was decanted and cells were fixed at room temperature with either 1% formaldehyde/PBS for 10 minutes (for PU.1, C/EBPa, H3K27Ac ChIPs) or 2 mM disuccinimidylglutarate (DSG, Pierce)/10% DMSO/PBS for 30 minutes followed by 1% formaldehyde/PBS for another 15 minutes (p65). After quenching the reaction by adding glycine to 0.125 M, cells were washed twice with PBS and snap-frozen in dry-ice/methanol. ChIPs for PU.1 (Santa Cruz, sc-352), C/EBPa (Santa Cruz, sc-61) were performed exactly as described previously6. The H3K27Ac (Abcam, ab4729) ChIP was performed in the presence of 1 mM butyric acid. For p65 (Santa Cruz, sc-372), IP conditions were identical to the ones described before6, except that pre-clearing was omitted, and the ChIP was performed with 5 μg antibody (Santa Cruz, sc-372) pre-bound to 50 μl Protein A Dynabeads (Invitrogen) for 30 minutes in 0.5% BSA/TE, and a final wash with TE/50 mM NaCl was performed before elution. ChIP-Seq library preps for the initial p65 ChIPs were performed as described before6, libraries for the replicates were prepared using magnetic beads similar to previously described procedures12. ChIPs for H3K4me2 were carried out on MNase-digested chromatin as described previously31. To control for open chromatin and library biases, input chromatin libraries after sonication were sequenced for each strain, crosslinking condition and ChIP lysis protocol. Sequencing libraries were prepared as previously described6 using barcoded adapters (NextFlex, Bioo Scientific), and sequenced for 50 cycles on a Hi-Seq 2000 (Illumina) using CASAVA1.7 or 1.8.
C57BL/6J and BALB/cJ demultiplexed fastq files were mapped to both the mm9 reference (C57BL/6J) genome and the BALB/cJ contigs17 using Bowtie0.12.733 with the options –m 1 --best –n 3 -e 200. Mapping parameters for C57BL/6J and BALB/cJ data allowed 3 mismatches in the 28 bp seed sequence with up to 5 high quality mismatches in the entire 50 bp read. NOD ChIP-Seq data were mapped to the mm9 genome using the above options. To identify allele-specific reads from CB6F1/J data, ChIP-seq reads were aligned to the C57BL/6J or BALB/cJ sequence using Bowtie2-2.0.0-beta734 allowing 0 mismatches in 32 bp reads with options -N 0 -L 32 --score-min L,0,0 --gbar 17. Tags mapping to both genomes were discarded. Resulting allele-specific reads were counted for regions of interest.
For C57BL/6J and BALB/cJ data, reads mapping to only one genome were discarded (<2% of total) to avoid bias caused by mappability differences, and reads mapping to both were assigned to the mm9 genomic location. Genomic binding peaks for transcription factors PU.1, C/EBPa, and p65, were identified using the findPeaks command in the HOMER (http://biowhat.ucsd.edu/homer/) software suite6 with default settings of –style factor: 200 bp peaks with 4-fold tag enrichment and 0.001 FDR significance over background (ChIP input), 4-fold enrichment over local tags, and normalization to 10 million mapped tags per experiment. H3K4me2 and H3K27Ac regions used for initial de novo motif finding (Extended Data Fig. 3a) were identified using the default parameters of findPeaks –style histone with the addition of –nfr centering for H3K4me2 MNase data. For H3K4me2 and H3K27Ac peaks identified for comparison to LDTF binding and mutation events (e.g., Fig. 2d), findPeaks –style peaks was used to define more focal, non-gene associated loci. In particular, H3K27Ac regions were identified with the findPeaks options –style factor using –size 1000bp –L2 (2-fold enrichment over local tags). H3K27Ac peaks were merged between strains using mergePeaks –size given and peaks were resized to 1 kb. Peaks within 3 kb of gene promoters were excluded from further analysis. H3K4me2 peaks were identified using findPeaks options –style factor –size 500 –L2 –C0 (which allows for unlimited tags considered per genome position as may occur with MNase data). Peaks were then centered on the best nucleosome-free region (nfr) within a 1 kb window using getPeakTags –nfr. Peak files between strains were merged with mergePeaks –size given and H3K4me2 tags were counted in 1 kb regions centered on the merged peak file definitions. Peaks were then re-centered based on the best nfr in 1 kb identified by getPeakTags –nfr according to the strain with more H3K4me2 tags. Peaks were extended to 1 kb and restricted to those >3 kb from gene promoters.
To determine strain-specific binding and epigenetic modification events we counted the number of sequencing tags (normalized to 10 million) at peaks/regions identified in the set of combined peaks/regions from both strains. We normalized the mean peak tag count to be equal in each strain and compared the tag counts in each region and required strain-specific peaks/regions to exhibit ≥4-fold difference in tag counts and an adjusted cumulative Poisson p-value corresponding to FDR<1e-14 35 These criteria were based on empirical data relating replicate ChIP-seq experiments (Extended Data Fig. 1b). Individual genome sequences for C57BL/6J and BALB/cJ were constructed in regions of interest using the reference (C57BL/6J) sequence and replacing BALB/cJ alleles at SNPs and InDels reported in the vcf files from Keane et. al.17.
De novo motif finding in ChIP-Seq-enriched regions from both mouse strains was used to define position weight matrices (PWM) for transcription factors of interest (Extended Data Fig. 3a,b, Supplementary Table 1). These PWM were used to define strain-specific motifs by using the options homer2 –find <individual genome sequence> in HOMER for each genome sequence for the regions of interest. The positions of the identified motifs were compared between strains taking into account shifts cause by InDels relative to peak start coordinates and which DNA strand matched the identified motifs. Motifs with alignments only in one genome were considered strain specific.
For each condition, RNA was isolated from 5 × 106 thioglycolate-elicited macrophages with Trizol LS, and 15 μg RNA were DNase-treated using TURBO DNase (Ambion) according to the manufacturer's instructions and ethanol-precipitated. PolyA-RNA was selected from 7 μg total RNA using the MicroPoly(A)Purist kit (Ambion), according to the manufacturer's instructions. The isolated RNA was hydrolyzed in 20 μl total volume with 2 μl RNA fragmentation buffer (Ambion) for 10 minutes at 70°C, the reaction stopped with stop buffer, and buffer was exchanged to Tris, pH 8.5 using P30 size exclusion columns (Bio-Rad). The fragmented RNA (30 ng) was 5’-decapped in 21 μl total volume containing 0.5 μl tobacco acid pyrophosphatase (TAP, Epicentre), 2 μl 10× TAP buffer, 1 μl SUPERase-IN, incubate for 2 h at 37°C. To dephosphorylate RNA 3’ ends, 0.5 μl 10× TAP buffer, 1.5 μl water, 0.5 μl 0.25 M MgCl2 (4.17 mM final –1 mM EDTA for maximum phosphatase activity), 0.5 μl 10 mM ATP (0.2 μM final to protect PNK) where added, and the reaction incubated with 1 μl PNK (Enzymatics) for 50 minutes at 37°C. RNA fragments were 5’-phosphorylated by adding 10 μl 10× T4 DNA ligase buffer, 63 μl water, 2 μl PNK, and 60 minute incubation at 37°C. RNA fragments were isolated using Trizol LS, precipitated in the presence of 300 mM NaAc and 2 μl Glycoblue (Ambion), washed twice with 80% ethanol and dissolved in 4.5 μl water.
To prepare sequencing libraries, 0.5 μl 9 μM of a 5’-adenylated sRNA3'MPX adapter /5Phos/AG ATC GGA AGA GCA CAC GTC TGA /3AmMO/ (IDT, desalted; adenylated with Mth ligase (NEB) according to the manufacturer's instructions, phenol-chloroform/chloroform-extracted, ethanol-precipitated with glycogen and dissolved in water at 9 μM) were heat-denatured together with the RNA for 2 minutes at 70°C, and ligated with 100 U truncated T4RNA ligase 2 K227Q (NEB) in 10 μl 1× T4 RNA ligase buffer without ATP, containing 10 U SUPERase-In and 15% PEG8000 for 2 hours at 16°C. To reduce adapter dimer formation, 0.5 μl 10 μM MPX_RT primer 5’-GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC T-3’ (IDT, desalted) was added and annealed to the ligation product by incubating at 75°C for 2 minutes, then 37°C for 30 minutes, then 25°C for 15 minutes. Finally, 0.5 μl 5 μM hybrid DNA/RNA sRNA5'h adapter 5’-GTT CAG AGT TCT ACA rGrUrC rCrGrA rCrGrA rUrC-3’ (IDT) were ligated to previously capped RNA 5’ ends by adding 2 μl T4 RNA ligase buffer, 6 μl 50% PEG8000 (15% final), 1 μl 10 mM ATP, 9.5 μl water and 0.5 μl (5 U) T4 RNA ligase 1 for 90 minutes at 20°C. To 15 μl ligation reaction, an additional 0.5 μl 10 μM MPX_RT primer were added, reactions were denatured at to 70°C for 1 minute, then placed on ice. RNA was reverse-transcribed by adding 3 μl 10× first strand buffer, 4.5 μl water, 1.5 μl 10 mM dNTP, 3 μl 0.1 M DTT, 1.5 μl RNaseOUT and 1 μl Superscript III reverse transcriptase (Invitrogen), and incubating for 30 minutes at at 50°C. Complementary DNA was isolated by adding 35 μl AMPure XL beads (Beckman), binding and washing according to manufacturer's instructions and dissolved in 40 μl TET (0.1 % Tween 20/TE). Libraries were PCR-amplified for 9 (polyA RNA-Seq), 11 (5’-GRO-Seq), 12 (rRNA--5'-RNA-Seq) or 13 (polyA-5’-RNA-Seq) cycles with 0.75 μM primers oNTI201 and TruSeq-compatible indexed primers (e.g. 5’-CAA GCA GAA GAC GGC ATA CGA GAT nnn nnn GTG ACT GGA GTT CAG ACG TGT GCT CTT-3’ (desalted, IDT, index in lowercase letters) using Phusion Hot Start II (Thermo Scientific) in HF buffer containing 0.5 M betaine (98°C, 30s/12x(98°C, 10s/57°C, 25s/72°C, 20s)/ 72°C, 1min/4°C, ∞), and 175-225 bp fragments were size-selected on 10% PAGE gels. Libraries were diluted 1:105 with TET and quantified relative to samples of known cluster density by SYBR green Q-PCR with primers Solexa_1G_A 5’-AAT GAT ACG GCG ACC ACC GA-3’, Solexa_1G_B 5’-CAA GCA GAA GAC GGC ATA CGA-3’ (95°C, 15 min/25x(95°C, 10s/60°C, 60 s)) and sequenced for 51 (insert) +7 (index) cycles on a HiSeq 2000 sequencer (Illumina) with small RNA sequencing primer 5’-CGA CAG GTT CAG AGT TCT ACA GTC CGA CGA TC-3’ and TruSeq Index sequencing primer (Illumina).
GRO-Seq was performed as described previously32 using 107 cells per condition. RNA at RefSeq transcripts was quantified for GRO-Seq and PolyA-RNA-Seq by counting the normalized tags (to 10 million tags/experiment) in annotated exons for each RefSeq transcript.
Odds ratios for observing C57BL/6J-specific motif mutations relative to BALB/cJ-specific motif mutations in different classes of bound/modified loci (e.g., Fig. 2b) were calculated using (p1 / (1-p1)) / (p2 / (1-p2)) where p1 is the frequency of C57BL/6J-specific motifs and p2 is the frequency of BALB/cJ-specific motifs. For Extended Data Fig. 4j,k, p1 is the frequency of indicated events occurring in differentially bound loci and p2 is the frequency in similarly bound loci. Unless otherwise indicated, t-tests were two-sided assuming unequal variance.
eQTL analysis was performed as previously described23,25. In brief, thioglycolate-elicited peritoneal macrophages were collected from 85 strains of mice. RNA was processed and hybridized to Affymetrix Genome HT_MG-430A. There were 22,416 probe sets analyzed after removing individual probes overlapping SNPs and probe sets with 8 or more probes overlapping SNPs. Expression data was RMA normalized.
3,918,755 SNPs with a minimum minor allele frequency of 10% originating from mouse Perlegen variation dataset36 were imputed across the strains37 and filtered to 3,695,041 SNPs based on proximity (<2 Mb) to transcription start sites of transcripts detectable by the microarray. Gene expression for each transcript was associated to SNPs within 2 Mb using the efficient mixed-model association (EMMA) mapping that corrects for population structure38. Association p-values less than 1 × 10−5 (<1% FDR) were deemed significant23. The 3,695,041 SNPs used for association mapping were overlapped with H3K4me2 and H3K27Ac regions. Since H3K4me2 and H3K27Ac regions ranged from 500-1500 bp whereas haplotype blocks averaged 300 kb, we considered SNPs outside H3K4me2/H3K27Ac regions that were in linkage disequilibrium (LD) with a SNP in H3K4me2/H3K27Ac regions as overlapping. Haplotype Blocks were estimated in Haploview39 using 143 strains with the following options: blockMAFThresh=0.1, blockCutLowCI=0.8, blockCutHighCI=0.98, blockRecHighCI=0.9, blockInformFrac=0.95. SNPs in LD with Enhancer SNP were considered markers of H3K4me2 regions. To test for enrichment of significant eQTL in H3K4me2 regions, we used the Hypergeometric Distribution Function as follows:
k successes = # significant eQTL in (or in LD with) H3K4me2 regions; m = # SNPs with significant eQTL, N = # total SNPs, n = # total SNPs in H3K4me2 regions.
The short read archive files were downloaded from GEO Accession GSE2416420 for ChIP-sequencing for the H3K27Ac mark in whole liver (SRX027340), pro-B cells (SRX027345), and ES cells (SRX027331, SRX027332) and input chromatin as background (liver: SRX027343, pro-B: SRX027348, ES: SRX027352). Sequencing reads were mapped to the C57BL/6J genome. H3K27Ac regions were identified where tag pileups exceeded 4X the input tags using HOMER6 and interrogated for enrichment of significant macrophage eQTLs as described for macrophage H3K4me2 and H3K27Ac regions above.
One kb enhancers were PCR-amplified from C57BL/6J and BALB/cJ genomic DNA using genomic primers not overlapping variants that introduced terminal BamHI, BglII or BclI sites on one end and SalI or XhoI sites on the other end of the PCR products, depending on the restriction site content of the enhancer. These were digested with the respective restriction enzymes and ligated into a modified, BamHI and SalI-digested pGL4.10 luciferase reporter plasmid (Invitrogen) containing a minimal HSV-TK promoter derived from pTAL-Luc (Clontech) (see Fig. 5a). Alternatively, 1 kb fragments were amplified using primers that introduced overhangs identical to the sequences flanking the BamHI/SalI tandem site in the pGL4.10 plasmid. Fragments were purified from the PCR reaction by SPRI using magnetic beads and cloned into the BamHI/SalI-cut reporter plasmid described above using Gibson Assembly master mix (NEB) according to the manufacturer's instructions. Mutations were introduced by PCR amplification with complementary primers containing the mutation to be introduced in the center, followed by DpnI digest of the template and transformation of bacteria. All constructs were confirmed by sequencing. For each reporter assay, 300 ng plasmid was transfected into RAW264.7 macrophages using SuperFect (Qiagen) together with 300 ng UB6 promoter-driven beta-galactosidase reporter for transfection normalization in 24-well plates seeded with 1 × 105 cells 24 hours prior to transfection. 24 hours post-transfection, media alone (RPMI + 10%FBS) or media also containing 100 ng/ml KLA was added for an additional 16 hours. Luciferase activity was measured 24 hr post transfection using a Veritas microplate luminometer (Turner Biosystems) and normalized to beta-galactosidase activity (Applied Biosystem) for transfection efficiency. Each experiment was performed at least three independent times, with each reaction done in triplicates. Data represented mean ± s.d., and statistical significance was determined by one-sided t-test.
Stable transfected cell lines were made by transient co-transfection of the linearized reporter plasmids together with linearized neomycin resistance-expressing pcDNA3 vector as described above, followed by incubation with 275 μg/ml Geneticin (G418 Sulfate, Invitrogen) for 2-3 weeks. Bulk cells from stably transfected colonies were tested for luciferase activity and normalized to total protein concentration (DC Protein Assay, BioRad).
We thank Aldons J. Lusis, Ph.D. (UCLA) for providing access to eQTL data (http://systems.genetics.ucla.edu/) and for productive conversations. We thank our Referees for highly constructive comments. We thank Daniel Pollard (UCSD) for discussions and suggestions, and Lynn Bautista for assistance with figure preparation. These studies were supported by NIH grants DK091183, CA17390 and DK063491 (C.K.G.). M.U.K. was supported by the Foundation Leducq Career Development award and grants from Academy of Finland, Finnish Foundation for Cardiovascular Research and Finnish Cultural Foundation, North Savo Regional fund. C.E.R. was supported by the American Heart Association Western States Affiliates (12POST11760017) and NIH (5T32DK007494).
SH, CKG and CER designed the study; SH, CER, KAA, MUK and LDO performed experiments; CER performed all genetic variation-related analysis; CB wrote custom code for HOMER2 and analyzed data; KAA and SH analyzed data; CER, SH and CKG wrote the manuscript.
The authors declare no competing financial interests. Data is available in GEO accession GSE46494.
Supplementary Tables 1-3 and Data Source Files 1 and 2 accompany this manuscript.