Purification of human CD4^{+} T-cell, chromatin preparation, H3K4me2 ChIP

1 × 10^{8} CD4^{+} T cells were isolated from human blood as previously described ^{7}. Cells were then crosslinked with 1% formaldehyde for 10 min at room temperature, followed by sonication of the cells to obtain chromatin fragments ranging in size from 150 bp to 300 bp. The ChIP assay was carried out using the antibody against H3K4me2 (Millipore, 07-030) and the immunoprecipitated chromatin was stored in TE buffer.

ChIA-PET library construction and paired-end sequencing

ChIA-PET libraries were prepared as described ^{16} and pair-end sequencing was carried out using Illumina Hi-seq sequencer.

3C-PCR assays

Crosslinked CD4

^{+} T cells and IMR90 cells were resuspended in lysis buffer (10 mM Tris-HCl, pH 8.0, 10 mM NaCl, 0.2% NP40, and proteinase inhibitors) and lysed with a Dounce homogenizer. Following

*EcoR*I digestion overnight, 3C-ligated DNA was prepared as previously described

^{37}. Regular PCR products were loaded on 2% agarose gel. For quantitative 3C analysis, real-time PCR was performed as described

^{31}. The PCR primers used for 3C-PCR analyses are listed in the

Supplementary information, Table S2.

Reporter assays

Enhancer regions were amplified by PCR and cloned into pGL3-HS vector. Jurkat cells were transfected with the reporter construct using the method as described

^{1}. The PCR primers used for reporter assays are listed in the

Supplementary information, Table S3.

Data preprocessing

We performed three sequencing experiments: the biological replicate Exp 1 that consists of two technical replicates Seq 1 and Seq 2, and the biological replicate Exp 2.

The raw sequences were trimmed and the first 36 bp were kept. We modified the linker-filtering code from ^{16} to output base quality values in addition to the sequence, and used the modified code to remove the linker sequences from the 36-bp reads.

The sequences of the two ends of PETs were separately mapped to the hg18 human genome using the bowtie algorithm ^{43} with the option “-n 0 -l 20 -m 1 — best” and taking into account the base-quality values. This bowtie option ensured that only uniquely mapped reads with zero mismatches to the reference genome were retained. The mapped locations of PETs were then represented as ordered quadruples (*chrom*_{1}, *pos*_{1}, *chrom*_{2}, and *pos*_{2}) of chromosome names and positions of the two ends of each PET, with the ordering relation: *chrom*_{1} ≤ *chrom*_{2} and *pos*_{1} ≤ *pos*_{2}. From the resulting set of PETs, we removed redundant ones. In other words, if the multiple copies of a quadruple (*chrom*_{1}, *pos*_{1}, *chrom*_{2}, and *pos*_{2}) were present, we kept only one. There are 53 million, 55 million, 73 million, 20 million, and 92 million uniquely mapped non-redundant PETs in Seq 1, Seq 2, Exp 1, Exp 2, and Exp 1 + Exp 2 libraries, respectively.

Identification of chromatin interactions

Each PET may be represented as a point in the two-dimensional space formed by the Cartesian product

*Genome* ×

*Genome*, where the

*Genome* is a concatenation of chromosomes separated by 1 Mbp spacers. The problem of identification of chromatin interactions is thus reduced to the detection of clusters of points in the two-dimensional space. In order to efficiently cluster tens of millions of data points, we implemented in C++ the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) described in

^{44}. We set the DBSCAN parameters: the size of the

-neighborhood of a point, Eps = 3 000 bp, and the minimum number of points required to form a cluster, MinPts = 3. This choice of parameters is optimal as suggested by

Supplementary information, Figure S1A, where we plotted the number of statistically significant (FDR < 5%) intra-chromosomal interactions detected for each combination of Eps and MinPts parameters. The choice Eps = 5 Kbp, MinPts = 3 gave slightly more interactions, but the interacting regions became broader and hence less specific. We removed intra-chromosomal PETs of length < 20 kbp from the clustering analysis because such PETs may originate from self-ligation of chromatin fragments in the ChIA-PET procedure. After excluding the intra-chromosomal PETs of length < 20 kb, 68 million PETs remained in Exp 1 + Exp 2 library.

We observed that single nucleotide polymorphisms could cause misalignment of tags and lead to false-positive chromatin interactions. A sequence tag that originates from a genomic locus that has a nucleotide mismatch with the reference genome may fortuitously map to a different locus perfectly without a mismatch. Therefore, we realigned the PETs in the identified clusters to the hg18 human genome using the option “-n 1 -l 20 -m 10 -k 10 –best”, allowing up to one mismatch with the reference genome and up to ten reportable alignments. The aligned reads were then examined for the presence of known single nucleotide polymorphisms (dbSNP build 130) at mismatch locations with the reference genome. A read was discarded from further analysis if a known SNP was present in at least one of its mapped locations. The resulting SNP-filtered PETs were then clustered using DBSCAN algorithm as described above and the interaction PET clusters were identified. The statistical significance of the resulting PET clusters was assessed using the hyper-geometric distribution (as described in ^{16}). The FDR of chromatin interaction detection was estimated by “cutting” and randomly “rejoining” PETs *in silico*, simulating random ligation events on magnetic beads. We observed that the FDR for the inter-chromosomal interactions is very high. Therefore, only the intra-chromosomal chromatin interactions with FDR < 5% were used for all downstream analyses. A chromatin interaction as determined by the above method is a pair of regions (R_{1}, R_{2}) from the human genome and was displayed as a pair of rectangular boxes connected by a thin line in UCSC genome browser.

As a global test of our data, we estimated the relative (observed/expected) intra-chromosomal interaction frequency of two genomic loci as a function of the linear distance separating the loci. As expected from the flexibility of the chromatin fiber, we found that the frequency decays with increasing linear distance (

Supplementary information, Figure S1B).

Annotation of chromatin interactions

We retrieved the hg18 genomic coordinates of transcription start sites of Refseq, UCSC, and Ensembl genes from the UCSC genome table browser. If the center of a region R is less than 5 kb upstream or downstream of the nearest TSS, the region is deemed a “promoter”. Otherwise, it is regarded as an “enhancer”. The overlapping “enhancer” regions were merged to obtain a non-redundant set of enhancers. All “promoter” regions located within 5 kb of the same TSS were regarded as one single promoter. The resulting non-redundant sets of “enhancers” and “promoters” were treated as nodes in our chromatin interaction network. The Cytoscape program ^{45} was used to display the network.

Enhancer-promoter (EP) genes are more expressed than genes in the control group

EP-promoters are the promoters that interact with the putative enhancers. The H3K4me2 scores of promoters were computed as the maximums of SICER ^{46} scores of H3K4me2 islands that overlap ± 2 kb region around TSS. In order to have an unbiased comparison, we selected a control group of non-EP promoters that match the EP promoters in terms of H3K4me2 score distribution as follows: we computed the quantiles *q*_{0}, *q*_{10}, ..., *q*_{100} of the H3K4me2 score distribution of EP promoters at 0%, 10%, 20%,...,100% percentiles of the ranked score list. A total of 150 non-EP promoters were then randomly sampled from each score interval (*q*_{i}, *q*_{i}_{+ 10}). The resulting 1 500 non-EP promoters were used as the random controls. The one-sided Kolmogorov-Smirnov test was used to show that EP genes are significantly more expressed than the random control non-EP genes (*P* < 10^{−12}). The results of the above analyses are displayed in .

The global chromatin interaction heatmap for the chromosome 19

The intra-chromosomal PETs of length at least 20 kbp on the chromosome 19 were represented as points in the two-dimensional space

*Chrom19* ×

*Chrom19*, and the density of points was computed using bivariate Gaussian kernel density estimator. We, thus, have a density value ρ(

*x*,

*y*) for each point (

*x*,

*y*)

*Chrom19* ×

*Chrom19*. We then computed the average density over all PETs of length

*L* as ρ

_{ave}(

*L*) = (1/

*N*) ∑

_{abs(}_{x}_{-}_{y}_{)=}_{L} ρ(

*x*,

*y*), where

*N =*∑

_{abs(}_{x}_{-}_{y}_{)=}_{L}1. The average density ρ

_{ave}(

*L*) represents the expected frequency of interaction of two loci separated by the linear distance

*L* bp on the chromosome 19. The negative, zero, and positive values of the difference between observed and expected densities, ξ(

*x*,

*y*) = ρ(

*x*,

*y*) − ρ

_{ave}(abs(

*x*−

*y*)), are color-coded by blue, white, and red colors, respectively ().

Co-expression of the interacting genes

The interacting gene pairs are co-regulated and their expression levels may be correlated. We devised a sensitive statistical test for this purpose. Let {I_{mn}} be the connectivity matrix of the intra-chromosomal promoter-promoter interaction network and let *N*_{E} be the total number of edges in the network. It is assumed that the genomic distance between interacting promoters is >50 kbp. Thus, I_{mn} = 1 if the promoter *m* and the promoter *n* interact, and I_{mn} = 0 otherwise. Let *x*_{m} be the expression (RPKM) value of the gene whose promoter is *m*. For each RPKM value of *x*, let λ(*x*) be the fraction of interacting promoter pairs in the network {I_{mn}} such that the expression levels of both genes of the pair are at least *x*. Namely, λ(*x*) = (1/*N*_{E}) ∑_{m < n} Θ (*x*_{m} ≥ *x* AND *x*_{n} ≥ *x*) I_{mn}, where Θ is the function: Θ(true) = 1 and Θ(false) = 0. The function λ(*x*) is a pairwise gene expression correlation characteristic of the promoter-promoter interaction network. The definite integral ∫λ(*x*)d*x* in the interval (0,∞) yields the area under the curve (AUC). The higher the AUC value is, the more correlated the expression values of the gene pairs in the network are.

To show the statistical significance of pairwise correlations, we generated 1 000 random promoter-promoter interaction networks using the same set of promoters and the expression levels of the associated genes as in the network {I_{mn}}, and computed AUC for each random network. From the resulting distribution of AUC values, we estimated the *P*-value for the statistical significance of AUC of the real network. Since the promoters (nodes) of randomized networks were chosen from the real network, the levels of H3K4me2 at the promoters of real and random networks are similar, thus making the analysis unbiased.

In order to demonstrate the tissue specificity of the correlations, the above analysis was repeated in HEK293 cells using the chromatin interaction network in CD4^{+}T cells, but with the gene expression values in the HEK293 cells obtained from ^{47}.

The results of the above correlation analyses are displayed in .

Higher order correlation of genes controlled by the same enhancer

Our data show that multiple genes can interact with the same enhancer (see ). Such co-regulated genes are expected to have correlated expression levels. To test this, we devised a sensitive statistical test.

Let S be the set of genes that interact with the same enhancer. For each gene g in the set, we computed a normalized gene expression value *y*_{g} as the percentage of g's expression value compared to the maximum expression value in S. Namely, *y*_{g}= 100 (*x*_{g}/*x*_{max})%, where *x*_{g} is RPKM value of g and *x*_{max} is the maximum of RPKM values in S. For each value, 0 ≤ *y* ≤ 100, we computed the fraction Φ(*y*) of genes in the S with normalized expression value of at least *y.*

Namely, Φ(*y*) = (1/*N*_{S}) ∑_{g}_{}_{S}Θ(*y*_{g} ≥ *y*), where *N*_{S} is the number of genes in S and the function Θ was defined above. The function Φ(*y*) is a higher order correlation characteristic of the set S.

Let S_{1}, S_{2}, ..., S_{M} be the sets of genes that interact with enhancers E_{1}, E_{2}, ..., E_{M}, respectively. The number of genes N_{k} in each set S_{k} is assumed to be at least two. From the higher order correlation characteristics Φ_{1}, Φ_{2}, ..., Φ_{M} corresponding to these sets, we computed the average characteristic Φ_{ave} (red curve in ). We then computed the AUC = ∫ Φ(*y*)d*y* for each set of genes controlled by an enhancer and obtained a distribution of AUC values.

The randomized controls were obtained by randomly sampling M sets of genes of the same sizes as in the real data, i.e. of the sizes N_{1}, N_{2}, ..., N_{M}, from the network of chromatin interactions in CD4^{+}T cells. Using these gene sets, we computed the average higher order correlation characteristic for randomized data (blue dashed curve in ).

To obtain the corresponding curves and *P*-value in HEK293 cells (), we used the same chromatin interaction network from CD4^{+} T cells as above, but the gene expression values were from the HEK293 cells ^{47}.

Motifs at enhancers

We detected H3K4me2 peaks in CD4^{+} T cells using model-based analysis of ChIP-Seq algorithm ^{48}, extracted “summit” locations from the peaks, and kept the summits located within enhancers that interact with promoters according to our data. We used HOMER algorithm ^{49} to find enriched motifs of known factors at ± 300 bp regions surrounding the summits.