Allele-specific DNA methylation (ASM) and allele-specific gene expression (ASE) have long been studied in genomic imprinting and X chromosome inactivation. But these types of allelic asymmetries, along with allele-specific transcription factor binding (ASTF), have turned out to be far more pervasive—affecting many non-imprinted autosomal genes in normal human tissues. ASM, ASE and ASTF have now been mapped genome-wide by microarray-based methods and NextGen sequencing. Multiple studies agree that all three types of allelic asymmetries, as well as the related phenomena of expression and methylation quantitative trait loci, are mostly accounted for by cis-acting regulatory polymorphisms. The precise mechanisms by which this occurs are not yet understood, but there are some testable hypotheses and already a few direct clues. Future challenges include achieving higher resolution maps to locate the epicenters of cis-regulated ASM, using this information to test mechanistic models, and applying genome-wide maps of ASE/ASM/ASTF to pinpoint functional regulatory polymorphisms influencing disease susceptibility.
Genome sequencing, expression profiling and now genome-wide mapping of epigenetic markings have been huge advances that have brought us into the so-called post-genomic era. Equally important, we now have saturating genome-wide maps of common DNA polymorphisms in humans, and the concept of haplotypes has been fully developed and widely applied. Going forward, the fields of genetics and epigenetics are starting to capitalize on this basic groundwork to explore allele-specific phenomena at unprecedented levels of detail. Realizing that this research area will continue to expand rapidly, I take this opportunity to survey the current landscape, particularly focusing on the role of cis-acting DNA polymorphisms in setting up allele-specific DNA methylation (ASM) and allele-specific gene expression (ASE).
Genomic or parental imprinting produces strong ASM and ASE in a parent-of-origin-dependent manner. The imprint, which is an extremely potent dose-regulating mechanism, is purely epigenetic and, strikingly, is completely erased and reset each time the allele passes through the germline. While one can find more optimistic projections, the number of known imprinted genes appears to be reaching an asymptote at around 100, or <1% of the mammalian gene repertoire (1). Imprinting is a non-Mendelian phenomenon par excellence, and this relative rarity of imprinted genes is completely consistent with the overall success of Mendel's laws in human and mouse genetics. It is also consistent with classical experiments using mice carrying Robertsonian chromosomal translocations, which showed that only some, not all, whole chromosome uniparental disomies produce abnormal phenotypes. These elegant genetic studies suggested early on that some chromosomes may be devoid of imprinted genes (2). Nonetheless, imprinted genes are crucial for normal mammalian development (1,3), and mechanistic studies of imprinting have laid an important and impressive groundwork for understanding allele-specific gene regulation. Similarly, methods developed to study imprinting are now the workhorse tools for analyzing the types of non-imprinted allele-specific phenomena that are the main focus of this review. Examples include bisulfite conversion of DNA followed by PCR spanning an SNP and cloning of the products to reveal ASM, and comparing allelic representation in PCR products from cDNA versus gDNA to score ASE. Genome-wide scanning methods like methylation analysis on SNP arrays (MSNP) were initially developed with the hope of finding additional imprinted genes, but instead uncovered the novel and more widespread phenomenon of non-imprinted ASM (4). 
Other new approaches that have been published for analyzing imprinted domains, such as chromatin immunoprecipitation (ChIP)–chip or ChIP–Seq, to search for ‘overlapping’ activating (methylated H3K4) and repressive (methylated H3K9) histone modifications that in fact represent the two oppositely poised (active/inactive) parental alleles (5), may also prove useful for finding loci with non-imprinted allelic asymmetries. As I discuss more in a later section, prior work on the mechanisms of genomic imprinting will also likely be relevant for understanding the mechanisms that produce non-imprinted allelic asymmetries.
Allelic asymmetries are now recognized as very common at non-imprinted loci. Here I consider three related classes of asymmetry affecting non-imprinted genes: allele-specific expression of mRNAs and non-coding RNAs (ASE), ASM, and allele-specific chromatin modifications and transcription factor binding (ASTF). A fourth class, random monoallelic expression and DNA methylation, is also important for gene regulation (6), but it is not covered here. ASE refers to asymmetric mRNA or non-coding RNA expression from the two alleles; alternative abbreviations in the literature include AE and AI (allelic imbalance). ASE is scored in heterozygous samples by comparing the representation of the two alleles of a given SNP in genomic DNA (by definition 50:50) to their representation in the corresponding mRNA or non-coding RNA, generally assayed as cDNA. This type of gDNA/cDNA comparison has now been carried out genome-wide by many laboratories, first using SNP arrays and more recently massively parallel NextGen sequencing. A clear conclusion from all studies is that ASE is quite frequent across the human genome, and it usually reflects the presence of a cis-acting regulatory polymorphism or regulatory haplotypes near or encompassing the gene (references in Table 1). In contrast, these studies, and expression quantitative trait locus (eQTL) screens discussed below, have shown that ASE due to trans-acting genetic or epigenetic mechanisms is relatively rare. ASE due to cis-acting regulatory polymorphisms is typically a quantitative phenomenon; it does not produce ‘all or none’ monoallelic expression but instead results in a bias in the ratio of transcripts from the two alleles. ASE due to cis-effects at a given locus is therefore a continuous variable, and thoughtful statistical approaches are essential for setting cutoffs and making meaningful statements about its frequency (7). 
In contrast, when ASE is produced by other mechanisms, such as genomic imprinting or, even more rarely, heterozygous germline mutations causing nonsense-mediated mRNA decay (7,8), the allelic bias can be very strong, with close to monoallelic expression.
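The gDNA/cDNA comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the read counts and the choice of a simple two-sided exact binomial test are hypothetical and are not taken from any of the cited studies, which use more careful statistical models (7). The expected reference-allele fraction defaults to 0.5, the value expected from heterozygous genomic DNA, but it could be replaced by the fraction actually measured in the matched gDNA.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials, computed in log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def score_ase(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Return (observed reference-allele fraction, two-sided exact binomial p-value).

    ref_reads and alt_reads are cDNA read (or clone) counts for the two alleles
    of one heterozygous SNP. Hypothetical helper for illustration only.
    """
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads covering this SNP")
    probs = [binomial_pmf(k, n, expected_ref_fraction) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Two-sided p-value: total probability of outcomes no more likely than
    # the observed one (small tolerance guards against floating-point ties).
    p_value = min(1.0, sum(pr for pr in probs if pr <= observed * (1 + 1e-9)))
    return ref_reads / n, p_value


# Hypothetical example: 80 reference-allele reads versus 20 alternate-allele reads.
fraction, p_value = score_ase(80, 20)
print(f"reference allele fraction = {fraction:.2f}, p = {p_value:.2e}")
```

Because ASE caused by cis-acting polymorphisms is a continuous variable, a p-value like this one would in practice be combined with an effect-size cutoff on the allelic ratio and corrected for the number of SNPs tested.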
The term eQTL is related to ASE, but not synonymous. The typical strategy for mapping eQTLs is to correlate SNP genotypes with separate data from mRNA expression profiling in large numbers of individuals. Standard microarray-based methods are easily adapted for this purpose and lend themselves to high genomic coverage and high sample throughput (Table 2). Homozygotes for the minor and major alleles at each SNP are highly informative for this type of eQTL mapping, while they are not informative for direct measurements of ASE, in which the allelic expression bias can only be examined in heterozygotes. There has been some discussion of the relative sensitivity and accuracy of these two approaches. As a tool for finding and validating regulatory SNPs, measuring ASE has the major advantage of being internally controlled. It directly compares expression of the two alleles within one individual, rather than measuring associations of SNP genotypes with net expression of the gene across subjects, which can suffer from the limited precision of microarray assays and unpredictable effects of environmental influences in each individual. But both approaches are valid, and assessing correlations of SNPs and haplotypes with net transcript levels gets more directly at the biologically relevant outcome, namely net gene expression. Further, for technical reasons, mapping eQTLs has allowed certain types of questions to be answered faster than mapping ASE directly. For example, Stranger and colleagues used transcriptome profiling in lymphoblastoid cell lines (LCLs) from individuals included in HapMap to sort out the relative contributions of SNPs versus DNA copy number variants (CNVs) to inter-individual differences in gene expression. They found that, while SNPs and CNVs both contributed, the majority of genotype-dependent expression variation (84%) in these cells was attributable to SNPs, which were not acting as surrogates for the CNVs (9). 
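To make the contrast with ASE concrete, the eQTL strategy of correlating SNP genotypes with net expression across individuals can be sketched as a simple regression of expression on allele dosage. The function name, the additive coding of genotypes as 0, 1 or 2 copies of the minor allele, and the expression values below are illustrative assumptions, not data or methods from the cited studies.

```python
from math import sqrt


def eqtl_slope_and_r(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on allele dosage.

    dosages: copies of the minor allele per individual (0, 1 or 2).
    expression: net transcript level of the gene in the same individuals.
    Hypothetical helper for illustration only.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched genotype and expression data")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    s_gg = sum((g - mean_g) ** 2 for g in dosages)
    s_ee = sum((e - mean_e) ** 2 for e in expression)
    s_ge = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    if s_gg == 0 or s_ee == 0:
        raise ValueError("genotype or expression has no variance")
    return s_ge / s_gg, s_ge / sqrt(s_gg * s_ee)


# Hypothetical data: two individuals per genotype class. Both homozygous
# classes contribute here, whereas a direct ASE assay could use only the
# two heterozygotes.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, r = eqtl_slope_and_r(dosages, expression)
print(f"expression change per minor allele = {slope:.2f}, r = {r:.3f}")
```

In a real screen this test is repeated for every SNP-gene pair within some distance window, so the resulting associations need correction for multiple comparisons.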
With available methods, the lists of cis-regulated genes obtained by ASE mapping versus eQTL analyses are significantly but not perfectly overlapping (references in Tables 1 and 2). One assumes that the overlap will improve as the methods are further refined, and a recent study using NextGen sequencing provided some support for this notion (10). NextGen sequencing is already increasing the information content of genome-wide studies dealing with ASE. In addition to possibly giving more linear estimates of the abundance of major transcripts, mapping ASE by RNA-Seq has revealed cis-acting SNPs in splice donor and acceptor sequences that affect exon usage in alternatively spliced transcripts (10,11), and it can facilitate the analysis of intronic SNPs in primary RNA transcripts, which substantially increases the number of informative biological samples (12).
All of the allele-specific phenomena discussed here are tissue-specific, so choosing the appropriate tissues and cell types is crucial for getting useful information. Among more than 16 recent large-scale studies of ASE, either by direct measurements or by eQTL analyses, about half utilized exclusively LCLs (Tables 1 and 2). This reliance on a renewable source of DNA and RNA is understandable in the methods development phase, and it allowed several groups to rapidly take advantage of available dense SNP genotyping data for these immortalized cell lines from CEPH and the HapMap (now 1000 Genomes) project. Efficient methodology developed using LCLs as the source of RNA included not just microarray-based and NextGen sequencing protocols, but also the essential statistical methods for dealing with the data, including methods for overlaying eQTL and ASE maps with GWAS data to extract functional conclusions (13,14). However, there are well-documented problems with clonal selection in established LCLs (15). Methods to monitor and correct for this problem have been developed (16), but it is still gratifying to see that all groups working in this area are now analyzing primary human cell types. As listed in Tables 1 and 2, some impressive studies of ASE or eQTLs have now been published using osteoblasts, non-transformed fibroblasts, keratinocytes, human ES cells or induced pluripotent stem cells (iPS), primary peripheral blood mononuclear cells, resting and PHA-stimulated T-lymphocytes, monocytes, adipose tissues and normal liver samples. Some studies are finding substantial overlap (up to 30% of eQTLs) between different cell types, including LCLs, suggesting the existence of a category of ‘universal eQTLs’, but more than half of all eQTLs seem to be private to specific tissues (17). 
Data from these pioneering studies on LCLs and primary cells and tissues are all in public repositories and the resulting maps of eQTLs and ASE will be a valuable adjunct to studies of human gene regulation and genetic variation for years to come.
The necessity of using well-chosen cells and tissues is not just academic; from the studies to date, it is already clear that the overlap of cis-regulated genes with GWAS signals will make sense only in a tissue-specific context. For example, data from analyzing LCLs and T-cells have shown overlap mainly with GWAS signals for autoimmune diseases, while data from liver samples have shown overlap with GWAS signals for lipid profiles, Type II diabetes and coronary artery heart disease (16,18–20).
So, how many genes show ASE and/or eQTLs in specific types of human cells or tissues? This number can be a moving target, as it depends on cell type and the stringency of the cutoffs for defining a significant allelic bias. For example, Chakravarti and colleagues developed an unbiased statistical approach to establish the most lenient cutoff for calling an observation ASE (7). As defined by their approach, ASE was found to be quite widespread in LCLs: 19.6% of heterozygotes at 78% of SNPs at 84% of genes demonstrated ASE in these immortalized cell lines, with a mean allelic bias of 1.6-fold. As listed in Tables 1 and 2, other studies of LCLs have come to similar figures, even up to 30% of genes surveyed (16). Estimates in primary tissues have tended to be somewhat lower, but still of the order of 10–20% of genes are affected by this phenomenon (Tables 1 and 2). In evaluating the frequency of ASE, it is crucial to carry out validations of the microarray or NextGen sequencing data using independent gene-specific molecular assays, as has been done in some, but not all, of the published studies. Without independent validations, it is impossible to know the false-positive rates of the initial screens. Lastly, regarding the frequency and strength of ASE and eQTLs in the human genome, it is clear that the extent of the allelic bias can vary substantially among individuals, some of whom are expected to share the same genotype (7,21). Some of this variation may be influenced by trans-acting loci or by the environment. Most environmental effects on ASE are probably not major when considered singly, but it has been shown that single types of exposures that have very strong biological impacts, e.g. cigarette smoking, can produce quantitative effects on ASE that are detectable when the epidemiological study is sufficiently powered (21). All of these themes are also relevant to ASM, which I consider next.
In 2008, Kerkel et al. (4) used the MSNP method, pre-digestion of genomic DNA by methylation-sensitive restriction enzyme(s) followed by probe synthesis and hybridization of SNP arrays (Fig. 1), to examine ASM in several human tissues, including peripheral blood leucocytes (PBL), hematopoietic stem cells and placenta. Their study was designed to detect new examples of imprinted genes, but they found only old examples of such genes, instead identifying numerous examples of previously unsuspected ASM at loci outside of imprinted regions. Most of these examples of ASM outside of imprinted genes showed a strong correlation with local SNP genotypes, indicating cis-regulation of the phenomenon. The observed ASM, which was validated by pre-digestion PCR/RFLP assays and bisulfite sequencing, was found to be tissue-specific, and for a given positive locus, it was seen in 40–95% of heterozygotes. That paper was quickly followed by other reports (Table 3), including one by Zhang et al. (22), who used bisulfite sequencing of PBL DNA to document SNP-dependent ASM in CpG-rich sequences in or near four genes on human chromosome 21, and larger genome-wide studies by Schalkwyk et al. (23) who used MSNP on high-density Affymetrix 6.0 SNP arrays to profile ASM in blood leukocytes and buccal cells, validating their results by bisulfite conversion of genomic DNA followed by SNaPshot assays. Hellman and Chess (24), who had previously published a method similar to MSNP to study DNA methylation on human X-chromosomes, went on to use 500K SNP arrays to study autosomal loci and found that ~10% of SNP-tagged regions have genotype-dependent ASM in LCLs (25). Extending these types of observations to a well-controlled mouse model system, Schilling et al. (26) did a genome-wide analysis in macrophages from two common laboratory strains (C57BL/6 and BALB/c) and in F1 hybrid offspring. 
They found that ASM was frequent and widely distributed across the genome, and that the allelic asymmetry in DNA methylation was largely attributable to cis-acting polymorphisms. In another study, Lee and coworkers (27) carried out 500K MSNP on human LCLs and esophageal tissues (normal and cancerous) and observed that methylation profiles are individual-specific as well as tissue-specific, suggesting an effect of genetic background on CpG methylation at many loci. ASM per se was not scored in their study but might be extractable from their primary data, which has been deposited at NCBI. From all of these studies, we know that when ASM is found, it can vary from a highly localized asymmetry in methylation affecting only one or several CpGs to examples in which a large number of contiguous CpGs are coordinately affected. Most examples of ASM affect DNA sequences outside of CpG islands, but there are rare examples in which even large CpG-dense islands can show this phenomenon (22). The region shown in Figure 1 is a typical example of cis-regulated ASM affecting a moderate-sized cluster of CpG sites in an intergenic CG-rich region that is not long enough or CG-dense enough to qualify as an island.
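Validation of ASM by bisulfite sequencing, as used in several of the studies above, amounts to asking whether methylated and unmethylated reads (or clones) are distributed unevenly between the two alleles of a heterozygous SNP. The sketch below tests a 2×2 table of counts with a two-sided Fisher's exact test written from the hypergeometric distribution. The function name and the clone counts are hypothetical, and the choice of Fisher's test is an assumption made for illustration rather than the method of any particular cited study.

```python
from math import comb


def fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact p-value for allele-by-methylation counts.

    Rows are the two alleles (A and B) of a heterozygous SNP; columns are
    methylated and unmethylated calls at one CpG (or clones called
    methylated/unmethylated over an amplicon). Illustrative helper only.
    """
    row_a = meth_a + unmeth_a
    row_b = meth_b + unmeth_b
    col_meth = meth_a + meth_b
    total = row_a + row_b
    if row_a == 0 or row_b == 0:
        raise ValueError("both alleles must be represented")

    def table_prob(x):
        # Hypergeometric probability of x methylated calls on allele A.
        return comb(row_a, x) * comb(row_b, col_meth - x) / comb(total, col_meth)

    lowest = max(0, col_meth - row_b)
    highest = min(row_a, col_meth)
    p_observed = table_prob(meth_a)
    # Sum over all tables that are no more probable than the observed one.
    p_value = sum(table_prob(x) for x in range(lowest, highest + 1)
                  if table_prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical bisulfite clone counts: allele A mostly methylated,
# allele B mostly unmethylated.
p = fisher_exact_two_sided(18, 2, 3, 17)
print(f"allele A: 18/20 methylated, allele B: 3/20 methylated, p = {p:.2e}")
```

Applying the same test CpG by CpG across a region would give a crude picture of whether the asymmetry is confined to one or a few CpGs or extends over a contiguous cluster.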
A methylation QTL (mQTL) approach, analogous in design to prior studies of eQTLs, has also been quite successful, with two groups applying this strategy to human brain regions and finding strong evidence for widespread cis-regulation of DNA methylation patterns (28,29). NextGen sequencing of genomic DNA after bisulfite conversion can also be useful for analyzing ASM, as recently shown in a proof-of-principle study by Shoemaker et al. (30), and as per-sample costs go down, this approach can be expected to eventually supersede microarray-based methods.
In all of these studies in human cells and tissues, when sequence-dependent ASM was found at a given locus, its dependence on the genotype at closely adjacent SNP(s) was close to absolute. ASM is linked to ASE at some but not all loci; in the Schalkwyk et al. study, more than 150 ASM-associated SNPs, distributed across each of the human chromosomes, were found to be significantly associated with the expression of nearby genes. The frequency with which ASM is associated with ASE will likely depend on the tissue being examined and the methods and platforms used; the two mQTL studies in human brain tissues gave estimates of 5 and 13% of mQTLs associating strongly with eQTLs (28,29). Whether these estimates might be somewhat on the low side due to cell-type heterogeneity in brain tissues will no doubt be answered by future studies using purified neurons and glial cells. Along these same lines, the theme of individual specificity from studies of ASE and eQTLs also pertains, quite strikingly, to ASM and mQTLs (4,23). Is individual specificity of ASM a type of incomplete penetrance due to unpredictable trans-acting and environmental influences, or does it reflect our incomplete knowledge of the precise haplotypes surrounding each index SNP in each individual? The answers might emerge from genetic epidemiological studies including exposure information, and improvements in methods for ascertaining long-range haplotypes with direct information on phasing of SNPs and other DNA polymorphisms in each individual.
It seems a safe bet a priori that many of these genetic effects on ASE and ASM will in the end prove to be due to allele-specific affinity of DNA-binding proteins (transcription factors in the broad sense) for critical polymorphic cis-acting regulatory elements. As perhaps the most obvious example, starting with early work done by the Cedar and Turker laboratories and now nicely followed up by others, general transcription factors (GTFs) that bind to gene promoters have long been studied as candidates for playing an active role in protecting unmethylated CpG islands from a poorly understood but experimentally measurable process of ‘methylation encroachment’ from their highly methylated flanking sequences (31–35). According to the model in Figure 2A, SNPs or indels in the recognition sequences for GTFs like Sp1 could in principle lead to altered GTF binding, followed by methylation encroachment on one of the two alleles. Importantly, this model does not require that the GTF itself have methylation-dependent binding, simply that it protects the promoter from methylation encroachment. In fact, Boumber et al. (36) recently showed that a 12 bp indel polymorphism in an Sp1 GTF-binding site in the promoter of the RIL gene affects the propensity of this gene to become methylated in human leukemias. This does not necessarily imply that such a mechanism accounts for instances of ASM in normal cells, but it offers a good rationale for now performing systematic genome-wide searches for ASM around polymorphic GTF-binding sites.
Essentially, all DNA methylation in humans occurs in CpG dinucleotides. By definition, ASM is measured at non-polymorphic CpG sites, but the methylation status of such sites can be influenced, at least in theory, by allele-specific differences in the density of CpGs in the surrounding local DNA sequence. Thus, another mechanistic model has SNPs that create or delete CpGs (‘CpG SNPs’) influencing the propensity of neighboring non-polymorphic CpGs to become cytosine-methylated. In regions of the DNA where overall CpG density is high and CpG SNPs are abundant, one can imagine more efficient spreading of methylation through the allele that has a higher preservation of CpGs, perhaps by more efficient cooperative binding (37) of the enzymes and cofactors in the methylation machinery (Fig. 2B). Alternatively, a subset of CpG SNPs could influence the binding of specific transcription factors, either positively or negatively, similar to the model in Figure 2A. Intriguingly, two recent genome-wide surveys of ASM have noted a small but statistically significant excess of CpG SNPs near loci with ASM (25,30).
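The notion of a ‘CpG SNP’ is easy to make operational. A SNP allele takes part in a CpG dinucleotide if it is a C followed by a G, or a G preceded by a C; because CpG is its own reverse complement, checking the forward strand is sufficient. The following sketch classifies a SNP from its two alleles and its immediate flanking bases. The function names and the example sequence context are hypothetical and serve only to illustrate the definition.

```python
def allele_forms_cpg(left_base, allele, right_base):
    """True if the allele is part of a 5'-CG-3' dinucleotide with a neighbor."""
    left_base, allele, right_base = (b.upper() for b in (left_base, allele, right_base))
    return (allele == "C" and right_base == "G") or (allele == "G" and left_base == "C")


def classify_cpg_snp(left_base, ref_allele, alt_allele, right_base):
    """Classify a SNP by which of its alleles carries a CpG.

    Returns one of 'ref_only', 'alt_only', 'both' or 'neither'.
    Hypothetical helper for illustration only.
    """
    ref_cpg = allele_forms_cpg(left_base, ref_allele, right_base)
    alt_cpg = allele_forms_cpg(left_base, alt_allele, right_base)
    if ref_cpg and alt_cpg:
        return "both"
    if ref_cpg:
        return "ref_only"
    if alt_cpg:
        return "alt_only"
    return "neither"


# Hypothetical context A[C/T]G: the C allele preserves a CpG and the
# T allele destroys it.
print(classify_cpg_snp("A", "C", "T", "G"))
```

Counting ‘ref_only’ and ‘alt_only’ SNPs along each phased haplotype in a window around an ASM locus would give the allele-specific CpG density that the model in Figure 2B depends on.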
The roles of insulator elements have been well studied at classical model loci like the globin genes, and also at several different imprinted loci where they show parent-of-origin-dependent ASM (1,38,39). There are already precedents for naturally occurring genetic lesions in such sequences within the human genome, for example, micro-deletions in the insulator located upstream of the imprinted H19 gene, which lead to over-expression of IGF2 and cause some cases of the Beckwith–Wiedemann overgrowth syndrome (40). As shown in Figure 2C, qualitative or quantitative alterations in insulator function due to SNPs or indels could lead to ASE and ASM. Lastly, given that large-scale CNVs are fairly common in human genomes, a gross alteration in chromosome structure secondary to CNVs is another a priori possibility that could account for some instances of ASE and ASM.
All of the above mechanistic models can begin to be addressed by combining ASE/ASM mapping with genome-wide mapping of ASTF. Here the abbreviation ASTF broadly includes allele-specific affinities of insulator-binding proteins and allele-specific chromatin modifications. In fact, an impressive initial group of papers have already appeared on this topic, using ChIP with downstream analysis either by probe synthesis from the IP DNA and hybridization to SNP arrays or, more recently, NextGen sequencing to query SNP representation in the IP DNA (references in Table 4). Taken together, these studies provide proof-of-principle for mapping not only allele-specific histone modifications (a method which has also been extensively used by laboratories working on imprinted chromosomal domains) but also allele-specific DNA occupancy of RNA PolII, allele-specific binding of the transcription factor NF-κB and the insulator-binding protein/transcription factor CTCF and, as a surrogate for open chromatin, mapping of allele-specific DNase I hypersensitive sites. In each of these studies, allele specificity has been found at large numbers of SNP-tagged loci. Encouragingly, in the study by McDaniell et al. (41) at least some of the allele specificity of CTCF binding could be accounted for by SNPs located within CTCF consensus-binding motifs. This observation provides some experimental support for the mechanistic model in Figure 2C.
As introduced by the above discussion, one practical use of ASE/ASM mapping is to help extract maximum information from GWAS. Here I paraphrase from my recent commentary focusing on this application (42). There is some debate now on the relative merits of the ‘common disease–common variant’ versus ‘multiple rare variant’ hypotheses for explaining complex disorders. Nonetheless, as predicted by the common variant model, GWAS have in fact identified many well replicated and biologically credible loci for disease susceptibility. In the process, these types of studies have come up against two technical roadblocks. First, most (~90%) of the supra-threshold disease association signals are at non-coding SNPs (43–45). Among these statistical signals, which ones are due to bona fide functional regulatory SNPs, and how can these SNPs be identified? Second, because of multiple comparisons, the threshold for significance needs to be set high, at P < 10⁻⁷ or P < 5 × 10⁻⁸, so there are numerous sub-threshold peaks that are difficult to interpret. Are some of these signals true-positives that should not be discarded? Independent lines of evidence are needed, and a promising direct approach is to combine statistical evidence from GWAS with functional evidence for the presence of cis-acting regulatory SNPs, indels or CNVs from mapping of eQTLs, ASE and ASM (Fig. 3). Statistical methods for carrying out such overlaps have recently been published (for example, 14,21).
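Both roadblocks can be illustrated with a short sketch. The stringent genome-wide threshold follows from a Bonferroni-style correction: a nominal significance level of 0.05 divided by roughly one million independent tests gives 5 × 10⁻⁸. The second function then flags sub-threshold GWAS signals that coincide with SNPs showing functional evidence from eQTL, ASE or ASM maps. The SNP identifiers, the p-values, the ‘suggestive’ cutoff of 10⁻⁵ and the simple set-overlap rule are all hypothetical assumptions for illustration; the published overlap methods (14,21) are statistically more sophisticated.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def triage_gwas_signals(gwas_pvalues, regulatory_snps, genome_wide, suggestive):
    """Split GWAS SNPs into supra-threshold hits and 'rescued' sub-threshold hits.

    gwas_pvalues: dict mapping SNP identifier to association p-value.
    regulatory_snps: set of SNPs with eQTL/ASE/ASM evidence in a relevant tissue.
    A sub-threshold SNP is rescued only if it also has regulatory evidence.
    Hypothetical helper for illustration only.
    """
    supra, rescued = [], []
    for snp, p in sorted(gwas_pvalues.items(), key=lambda item: item[1]):
        if p < genome_wide:
            supra.append(snp)
        elif p < suggestive and snp in regulatory_snps:
            rescued.append(snp)
    return supra, rescued


genome_wide = bonferroni_threshold(0.05, 1_000_000)

# Placeholder identifiers and p-values, not real data.
gwas = {"rsA": 2e-9, "rsB": 3e-6, "rsC": 8e-6, "rsD": 0.2}
regulatory = {"rsB", "rsD"}
supra, rescued = triage_gwas_signals(gwas, regulatory, genome_wide, suggestive=1e-5)
print("supra-threshold:", supra)
print("sub-threshold with regulatory evidence:", rescued)
```

In this toy example only the sub-threshold SNP that is also a regulatory SNP is retained for follow-up. As the text emphasizes, the regulatory evidence should come from a tissue relevant to the disease.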
Beyond providing evidence for rSNPs being near a gene of interest, mapping ASE can help to close in on the precise positions of functional SNPs. Forton et al. (46) used ASE and haplotype analysis to map cis-regulatory elements in chromosome band 5q31, thereby pinpointing the location of cis-acting DNA sequences that regulate the IL13 gene from a distance of 250 kb upstream. Other examples in this fast-moving area are discussed in my previous review (42) and listed here in Tables 1 and 2. In the future, it will be interesting to see whether the methods developed for overlapping GWAS with ASE and eQTL data in these papers can also work using ASM as the marker for nearby regulatory polymorphisms. To use this strategy, it will first be necessary to develop more efficient methods for complete ASM profiling over megabase regions of DNA to define the epicenters of the allelic asymmetry.
As this field matures over the next decade, there will likely be two lines of important descriptive work, with continuing genome-wide analyses moving forward in parallel with more focused ‘fine-mapping’ studies of genes and chromosomal regions of interest, to more tightly pin down the identities of regulatory polymorphisms and haplotypes. Doing so will be necessary both for testing mechanistic hypotheses and for completing the tasks started by GWAS—namely to fully understand the etiologies of complex genetic diseases. Even the au courant research strategy of searching for rare genetic variants to explain complex diseases could benefit from incorporating ASE, ASM and ASTF mapping. In particular, not all pathogenic variants, even if rare, will be non-synonymous coding changes, so the functional significance of rare non-coding variants will still need to be grappled with. Fine-mapping ASM to find the true epicenters of allelic asymmetry will be essential for testing mechanistic models. None of the available datasets yet provides this type of information. In the near future, microarrays with custom designs will be useful for achieving greater coverage of SNPs in CpG-rich sequences while continuing to achieve high sample throughput at reasonable costs. NextGen bisulfite sequencing with reduced genomic representation and padlock probe methods (30,47), and high-throughput bisulfite PCR on new microfluidic and microdroplet instruments (48,49), will also be essential in the near term for achieving regional genomic coverage at single base-pair resolution. Ultimately the ‘$1000 epigenome’ will become a reality. But this will not be enough—all questions in this field come back to understanding the functions of cis-acting variants in the DNA of a given chromosome homolog over both short and long distances. 
So it will also be important to continue using samples from multi-generation families, and to develop direct methods for establishing the phase of SNPs, indels and CNVs, in other words, their physical linkage over long stretches of the DNA (50).
This work was supported by grants R21 CA125461-02, R01 AG036040-01, and R01 AG035020-01 from the NIH and by grants from the March of Dimes and the Douglas Kroll Foundation of the Leukemia and Lymphoma Society.
I thank Barbara Stranger and Andrew Chess for their helpful comments on the manuscript.
Conflict of Interest statement. None declared.