Control of gene transcription is believed to be important in determining organismal phenotype and fitness. Variations in genomic DNA, such as single-nucleotide polymorphisms (SNPs) and insertions or deletions (indels), may act singly or in combination to influence gene regulation (1). These heritable variations are thought to affect the binding of sequence-specific transcription factors or the physical conformation of packaged DNA, namely chromatin. Humans typically harbor two copies (alleles) of every gene, and recent studies show that for between 10% and 22% of human genes, the two copies are regulated differently; for example, one copy may be transcribed while the other is not (3). Such allele-specific expression can arise in part from underlying biological processes such as imprinting, but little is known about other molecular determinants of allele-specific gene regulation in humans or to what extent these events are genetically determined, given that variation in gene regulation can also be caused by nongenetic phenomena including epigenetic, environmental, or stochastic effects (4). To aid in our understanding of the molecular basis of allele-specific gene regulation and the separate but related topic of phenotypic variation between individuals, we have cataloged allele-specific and individual-specific variation in transcription factor binding and chromatin structure.
To assay individual variation and how it relates to the allele-specific behavior of chromatin, we used deoxyribonuclease I hypersensitive (DNase I HS) site mapping, which broadly identifies regulatory DNA elements such as promoters, enhancers, silencers, and insulators (7). We also performed chromatin immunoprecipitation (ChIP) for elements associated with the CCCTC-binding factor (CTCF), a multifunctional transcriptional and chromatin regulator (9). The combination of these two different methods, DNase I HS mapping and CTCF ChIP, allowed us to independently validate our results. Assays were performed on cell lines from one CEU (CEPH Utah reference family; residents with ancestry from northern and western Europe) family (both parents and their daughter) and one YRI (Yoruba from Ibadan, Nigeria) family (both parents and their daughter) in the 1000 Genomes Project (13). The study design therefore features four unrelated adults (the parents) and two children, each of whom is directly related to one pair of adults but unrelated to the other pair and to the other child. This design allows us to dissect individual- and allele-specific information in the context of these families, and thereby to assess heritability and the contribution from genetic or epigenetic processes. Previous studies have identified very few individual-specific sites and have not explored their heritability (14).
Fig. 1 (A) Cell lines from CEU and YRI parent-child trios. (B) Classification of DNase I HS or CTCF binding sites among individuals. Constant sites are those occurring in all four parents. CEU- and YRI-only sites occurred in both parents within only one population.
We generated DNase-seq and CTCF ChIP-seq (deep sequencing) data from two independent cell growths for each cell line (fig. S1). Sites were classified as “constant” (present in all four unrelated parents), “individual-specific” (present in at least two of the parents and absent in the other two parents), or “singletons” (present in just one individual) (fig. S2 and table S1). Global analysis of the 10,041 (DNase) or 1632 (CTCF) individual-specific sites present in one set of parents but not the other showed that the children’s signals at those sites were closer to those of their own parents than to those of the unrelated family. Given the large number of individual sites tested, this result indicates that these chromatin signals are heritable. However, this analysis alone cannot distinguish among genetic, epigenetic, or other causes for inheritance. The high degree of concordance at the 54,621 sites identified by both assays also supports the heritability of binding-level specificity (fig. S3).
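The population-level categories described above and in Fig. 1B can be illustrated with a short sketch. The sample labels, the presence/absence dictionary, and the function name are hypothetical and are not taken from the study's pipeline; for simplicity, singletons are counted among the four parents only.

```python
from typing import Dict

# Hypothetical sample labels for the four unrelated parents.
PARENTS = ("CEU_father", "CEU_mother", "YRI_father", "YRI_mother")


def classify_site(present: Dict[str, bool]) -> str:
    """Classify one site from presence/absence calls in the four parents.

    Assumption: `present` maps each label in PARENTS to True when the
    site was called in that individual.
    """
    ceu = [present["CEU_father"], present["CEU_mother"]]
    yri = [present["YRI_father"], present["YRI_mother"]]
    n_present = sum(ceu) + sum(yri)

    if n_present == 4:
        return "constant"      # present in all four parents
    if all(ceu) and not any(yri):
        return "CEU-only"      # both CEU parents, neither YRI parent
    if all(yri) and not any(ceu):
        return "YRI-only"      # both YRI parents, neither CEU parent
    if n_present == 1:
        return "singleton"     # present in exactly one parent
    return "other"             # any remaining pattern
```

For example, a site called in both CEU parents and in neither YRI parent is returned as `"CEU-only"`, the category used in the global parent-child comparison.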
Fig. 2 Individual-specific chromatin transmission. (A) Example of CEU-only individual-specific DNase I HS and CTCF sites (shaded areas). (B) Example of YRI-only individual-specific sites. (C and D) Genome-wide individual-specific DNase I HS sites (C) and CTCF ...
We next examined the correlation of individual variation in these chromatin sites with variation in gene expression. The presence of an individual-specific DNase I HS site near the transcription start site of a gene was positively correlated with expression of that gene in that individual, relative to genes that were farther away (fig. S4, A and C). Individual-specific CTCF sites were associated with both activation and repression of nearby genes, suggesting a more complex relationship to gene expression (fig. S4, B and D).
The use of high-throughput sequencing allowed us to assess allele-specific chromatin signals by detecting preferential recovery of sequence reads containing one allele over the other when there was an underlying heterozygous SNP in the individual. When aligning our sequences containing such a mixture of alleles at a given heterozygous SNP to the reference human genome sequence, we found a marked preference for alignment of sequence reads containing the allele that also happened to be represented in the reference sequence (fig. S5). After correcting for this technical bias (13), we assessed the true allele specificity of each heterozygous SNP sequenced at sufficient depth for each assay, and found that 7% of DNase I HS sites and 11% of CTCF sites have significant allele specificity after multiple-testing correction.
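One common way to test for allelic imbalance at a heterozygous SNP is an exact binomial test on the two allele read counts, followed by a false-discovery-rate adjustment. The sketch below is an illustration under that assumption; the text does not state which test or correction was used, and the bias correction itself is described in (13). The `p_ref` parameter is a hypothetical stand-in for the expected reference-allele fraction once mapping bias has been accounted for.

```python
from math import comb
from typing import List


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance.

    Sums the probability of every outcome that is no more likely than the
    observed one. Intended for modest read depths (a few hundred reads);
    very deep sites would need a log-space implementation.
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A site with 10 reads on one allele and none on the other gives a raw p-value of 2/1024 under `p_ref = 0.5`; whether it remains significant depends on how many sites are tested together.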
Although allele-specific sites occurred on all chromosomes, the X chromosome was particularly enriched for such sites. This would be expected if DNase I HS and CTCF binding on the two X chromosomes is unequal in females, provided that one of the two X chromosomes is preferentially inactivated in the cell population (fig. S6, A and B). Indeed, we established that X inactivation patterns were nonrandom in the cell lines studied, and that the paternal X was preferentially inactivated in 90% of cells in each cell line from both daughters (fig. S7A). Most X-chromosome allele-specific CTCF sites showed a bias toward the active maternal X (fig. S7B), indicating that allelic imbalance in CTCF binding is generally associated with epigenetic silencing in X inactivation. We found several sites at which CTCF bound equally to the inactive and active X alleles or preferentially bound the allele on the inactive X. These could represent CTCF binding in regions escaping inactivation, or sites involved in or otherwise reflecting epigenetic changes associated with dosage compensation (9).
To establish that the allele-specific CTCF binding biases were not an artifact, we tested four allele-specific and five non–allele-specific CTCF sites using matrix-assisted laser desorption/ionization–time-of-flight mass spectrometry (MALDI-TOF MS) (fig. S8A and table S4). Each of the four allele-specific sites showed a significantly higher proportion of the enriched allele (fig. S8B), although the absolute levels of enrichment were lower as assayed by MALDI-TOF MS than by ChIP-seq. In contrast, none of the five non–allele-specific ChIP-seq CTCF sites showed significant bias by MALDI-TOF MS (fig. S8B and table S5).
Chromatin signals could be individual-specific or allele-specific as a result of nongenetic factors, such as environmental, epigenetic, or stochastic differences between individuals (4). If allele-specific chromatin structure has a direct genetic basis, the relationship between a specific allele and the chromatin signal should be maintained between individuals. When we considered the 10,364 shared heterozygous sites present in two or more individuals, if two individuals showed significant allele-specific CTCF binding, it was nearly always toward the same allele. We next examined the prevalence of an autosomal imprinting-like process for generating allele specificity. Because the male and female parental alleles are randomly distributed with respect to any genetic haplotype, one would expect that if a site were subjected to a parent-of-origin imprinting-like process, half of such sites would have reversed allele specificity in unrelated individuals with the same heterozygous sites. However, only about 2% of interindividual pairs showed significantly opposite behavior (13). This indicates that an autosomal imprinting-like mechanism is not a major contributor to allelic bias, at least for CTCF binding.
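The strength of this argument can be illustrated with simple arithmetic. Under a parent-of-origin model, each pair of unrelated individuals sharing a heterozygous site would show opposite allele preference with probability 0.5, so the chance of seeing only a small fraction of opposite pairs is a binomial tail probability. The sketch below computes that tail; the pair counts in the usage note are hypothetical, because the number of pairs tested is not given in the text.

```python
from math import comb


def prob_at_most_opposite(n_pairs: int, n_opposite: int, p_opposite: float = 0.5) -> float:
    """Probability of observing at most `n_opposite` opposite-direction pairs
    out of `n_pairs`, if each pair is opposite with probability `p_opposite`
    (0.5 under the imprinting-like model described in the text)."""
    return sum(
        comb(n_pairs, k) * p_opposite**k * (1 - p_opposite) ** (n_pairs - k)
        for k in range(n_opposite + 1)
    )
```

With a hypothetical 100 pairs of which 2 are opposite (2%), `prob_at_most_opposite(100, 2)` is on the order of 10⁻²⁷, which is why a 2% rate argues against imprinting as the dominant mechanism.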
Fig. 3 Comparison of allele-specific sites between individuals. (A) Each subpanel shows a different allele-specific site in two individuals in the indicated category. The overlapping SNP is indicated below. Adjoining pie charts show concordant allelic bias within ...
Using the parent-child structure of our study, we could also examine the relationship between allele-specific information present in the children and individual-specific information in the parents. Unlike the earlier transmission test of individual-specific sites, this comparison specifically assesses a genetic mechanism for generating allele specificity. At the 62 CTCF sites where there was a significant allele-specific signal in the child and where one parent was homozygous for one allele and the other parent homozygous for the other, the allele bound most strongly by CTCF in the child was most often (65%) the allele carried by the parent who showed the greater level of CTCF binding, and the extent of parental differential CTCF binding was correlated with the extent of the child’s allele specificity (P = 6.6 × 10⁻⁵, Spearman’s correlation). These results suggest a heritable genetic rather than an epigenetic basis for a large proportion of the allele-specific binding of CTCF. There was a strong tendency for the same allele to be preferred in both the CTCF and DNase I HS assays when both could be measured (fig. S9). It is thus likely that DNase I HS sites are also correlated between individuals and are transmissible from parent to child.
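The Spearman correlation reported above is a Pearson correlation computed on ranks. As a minimal sketch, it can be written without external libraries as follows; the function names are illustrative, and the inputs would be per-site values such as the parental binding difference and the child's allelic ratio, which are not reproduced here.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A positive `spearman_rho` across sites would correspond to the pattern described in the text, in which larger parental binding differences go with stronger allele specificity in the child. A p-value for the coefficient requires an additional step not shown here.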
SNPs underlying the allele-specific sites could directly affect transcription factor binding and chromatin. Alternatively, these SNPs could merely be markers for other cis polymorphisms such as indels that we did not incorporate into our reconstructed reference genomes. We therefore examined whether SNPs themselves disrupted the CTCF binding motif, and whether the effect of any disruption was consistent with the observed effect on CTCF binding (13). At sites where CTCF showed allele-specific binding, the motif score tended to be higher for the favored allele, whereas at sites lacking differences in CTCF binding, motif scores were similar (fig. S10). Moreover, strongly conserved positions in the motif were more likely to harbor allele-specific SNPs. Thus, SNPs underlying many allele-specific binding sites are likely to directly affect the binding of CTCF, further suggesting that there is a genetic basis for allele-specific binding.
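The motif-score comparison can be illustrated with a log-odds position weight matrix (PWM) scored against each allele. The four-position matrix below is a toy example invented for illustration; it is not the CTCF motif, and the scoring scheme is an assumption rather than the method described in (13).

```python
from math import log2
from typing import Dict, List

# Toy PWM (hypothetical): per-position base probabilities.
# Position 3 is uninformative; the others are strongly conserved.
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.70, "G": 0.10, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def motif_score(seq: str, pwm: List[Dict[str, float]] = TOY_PWM) -> float:
    """Sum of per-position log2 odds of the sequence under the PWM."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match motif length")
    return sum(log2(col[base] / BACKGROUND) for col, base in zip(pwm, seq))


def allele_score_difference(seq_a: str, seq_b: str,
                            pwm: List[Dict[str, float]] = TOY_PWM) -> float:
    """Motif score of allele A minus that of allele B at the same site."""
    return motif_score(seq_a, pwm) - motif_score(seq_b, pwm)
```

In this toy matrix, a SNP at the conserved second position (`"ACAG"` versus `"ATAG"`) changes the score by log2(7), whereas a SNP at the uninformative third position changes it by zero, mirroring the observation that conserved motif positions are more likely to harbor allele-specific SNPs.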
Fig. 4 Representation of allele-specific and non–allele-specific SNPs across the CTCF binding motif (17). The y axis indicates the difference between the two as a percentage of normalized total SNPs. Higher bars indicate an increased representation of ...
Our results suggest a strong genetic component for allele-specific differences at the level of transcription factor binding and chromatin structure. In addition to the genetic effects, we expect that some individual-specific differences may be due to nongenetic or epigenetic differences between individuals, such as DNA methylation, which could vary without regard to the underlying genotype. Our results are not consistent with widespread random allelic inactivation in lymphoblastoid lines (16), and they place limits on the extent of an imprinting-like process affecting transcription factor binding and chromatin structure. Chromatin structure is thought to be an important reservoir of epigenetic information as well as part of the means by which genetic and epigenetic changes affect phenotypes. Because we can now reliably measure individual differences in chromatin structure, our data may have implications for the identification and characterization of common noncoding polymorphisms associated with disease risk.