Epigenetics may be broadly defined as the study of processes that produce a heritable phenotype that is not strictly dependent on DNA sequence. The definition has traditionally been restricted to processes that occur in the cell's nucleus, with the term “heritable” having a loose meaning that can be applied to either the entire organism or single cells. For example, a process that produces a phenotype only in a specific cell type (for instance, chromatin-mediated maintenance of a differentiated state) is usually considered epigenetic even if it is not directly inherited, but instead must be reestablished or actively maintained at each cell division. Given this definition, the field of epigenetics has long focused on proteins that affect DNA packaging, and thereby affect the utilization of the genetic information encoded in the DNA template. This focus extends to the enzymatic modification of those proteins, and to the enzymatic modification of the DNA template itself, primarily DNA methylation.
This review is written in conjunction with the international symposium on Genome-wide Epigenetics 2005, held at the University of Tokyo, Japan on November 8–10, 2005. Over the past decade, the field of epigenetics has undergone an exciting expansion in the number of researchers, techniques available and our understanding of epigenetic phenomena. The purpose of this short review is not to summarize all of these advances, but rather to guide the reader to more detailed sources of information by sketching an outline of the major thrusts in the field, emphasizing mammalian epigenetics in particular.
Epigenetic information manifests itself by regulating gene expression. The mechanistic mediators of heritable gene expression appear to be similar whether one is considering epigenetic inheritance on the organismal or cellular level. For example, a mediator of epigenetic regulation is cytosine methylation. The action of maintenance methyltransferases ensures that the methylation of pairs of cytosines on complementary DNA strands at CpG dinucleotides is propagated to both pairs of daughter DNA molecules following DNA replication (Bestor 2000). The inheritance of gene regulatory information from parental gametes defines the phenomenon of genomic imprinting, for which cytosine methylation is a central mediator (Li et al. 1993).
Cytosine methylation is the best-studied mediator of epigenetic regulation. In mammalian genomes, the vast majority of methylation of cytosines occurs when the cytosine is followed by a guanine, a so-called CpG dinucleotide (Clark et al. 1995). The methyl group protrudes into the major groove of the DNA double helix, preventing the binding of transcription factors that would otherwise bind locally in a sequence-specific manner (Bell and Felsenfeld 2000; Hark et al. 2000), and facilitating the binding of methyl-binding proteins that generally attach to DNA in a sequence non-specific manner (Jorgensen and Bird 2002). The methylation of cytosines may not be the initiating event in a change from activity to silencing of a locus, but it is certainly associated with a number of events including the alteration of covalent modifications of histone tails (El-Osta and Wolffe 2000), the recruitment of methyl-binding proteins (Jorgensen and Bird 2002) and proteins normally associated with distinctive chromatin states (such as SWI/SNF components (Harikrishnan et al. 2005)), and the compaction of the chromatin with silencing of gene expression locally (Nguyen et al. 2001).
It is possible that cytosine methylation is induced by primary events involving the covalent modification of histone tails. The histone octamers that form the nucleosome protrude N-terminal tails, where it was thought that various combinations of methylation, acetylation, phosphorylation and other post-translational modifications could be established to create what has been referred to as the histone code (Jenuwein and Allis 2001). However, this combinatorial concept is being challenged by data suggesting that the pattern of modifications may be simpler than its potential complexity would allow, establishing something more like a binary on-or-off pattern (Dion et al. 2005; Liu et al. 2005a; Schubeler et al. 2004). As described for methylcytosine, these modified histone molecules are capable of recruiting further molecules and complexes that influence chromatin structure and local gene activity. Whether histone tail modifications induce or are induced by cytosine methylation, it seems that the disruption of one component of this system can also induce changes in other components (Ben-Porath and Cedar 2001).
The recent finding in fission yeast that small RNAs are targeted to heterochromatin (Verdel et al. 2004) suggests that epigenetic gene regulation in mammalian cells may also involve the actions of such short RNAs. In addition, it has been observed that a subset of DNA methyltransferases and methyl-binding proteins have RNA-binding properties (Jeffery and Nakielny 2004), supporting the idea that RNA molecules may influence the regulation of expression of their own or other loci. However, a point worth stressing is that the major consequence of all these biochemical processes, which ultimately alter gene expression, is the regulation of transcription factor binding. Transcription factors bind in modules at specific loci to facilitate the loading of the RNA polymerase complex locally, a prerequisite for gene expression. The relationship between epigenetic regulation and transcription-factor binding tends to be overlooked in reviews of either field. However, as we emphasize below, techniques to study chromatin components and transcription factors in vivo are equally advanced, allowing integration of data and major advances in the understanding of the relationship of epigenetic alterations with transcription factor binding. An example of such a study shows a relationship between binding of the CTCF transcription factor and cytosine methylation genome-wide (Mukhopadhyay et al. 2004). The use of high-throughput techniques to study epigenetic regulatory processes in large genomic regions or genome-wide has been described as an ‘epigenomic’ approach (Fazzari and Greally 2004). Our insights into epigenetic regulatory processes have to date been founded mostly on detailed studies of individual loci; now that epigenomic approaches have been developed, we are beginning to accrue knowledge based on less detailed studies of large numbers of loci.
Although many of the details remain unclear, few people today doubt the importance of epigenetic processes (in addition to the genetic code) for biological and medical research. A full understanding of epigenetic inheritance and how epigenetic regulatory information is distributed in the genome therefore ought to be the ultimate aim of any genome-wide epigenetics or Epigenome Project. Of course, there is currently much debate on whether the time and technology are ready for such an enormous undertaking, and those who remember the early discussions about the Human Genome Project will find some of the arguments reminiscent of that time. For the Genome Project, the turning point came in 1995, when the more pragmatic protagonists at the time pointed out that a (draft) genome sequence could be generated with existing technology (Marshall 1995), anticipating, of course, that the effort would be self-catalyzing once started. With the luxury of hindsight, this turned out to be true: the Human Genome was announced finished in 2003, under budget and ahead of time. Exactly when the discussion about the Epigenome Project started is arguable; most likely it started independently in the minds of multiple scientists. In PubMed, the first indexed mention of ‘Epigenome’ appeared in 1987 (Kieser 1987), and ‘Epigenome Project’ was first mentioned only in 2002 (Novik et al. 2002). Therefore, a ‘spoof’ editorial in a leading science magazine (Nature Biotechnology) was quite timely when it suggested in August 2000 that a trilogy would be required to complete the Human Genome Project: “The Draft”, “The Closure” and “Epigenetics Strikes Back”. With the “Draft” and the “Closure” delivered, the time has come to explore the next frontier, the “Epigenome”.
The rationale for studying the epigenome and how this might be accomplished has been reviewed recently (Murrell et al. 2005). Basically, different cell types require different epigenomes to control, in time and space, the correct execution of their individual transcriptional programs, and (reversible) epigenetic modulation has the flexibility to achieve the required fine-tuning of the underlying nuclear genome. A recent study further suggests that the transcriptional program is much larger and more complex than previously thought (Cheng et al. 2005). High-resolution transcript mapping of 10 human chromosomes revealed that the majority of (mostly non-polyadenylated) transcripts map to loci for which there is currently no annotation available. Yet these transcripts of unknown function may have important implications for gene expression and (epi)genotype-phenotype associations. The power to detect alterations anywhere in the epigenome will therefore be of great value in understanding what goes wrong if a program is executed incorrectly, particularly in the context of disease. Obviously, our ability to analyze and interpret entire epigenomes depends heavily on the availability of adequate technologies, which are discussed in the next section.
Much of the recent progress in understanding epigenetic phenomena is directly attributable to technologies that allow researchers to pinpoint the genomic location of proteins that package and regulate access to the DNA. The advent of DNA microarrays and inexpensive DNA sequencing has allowed many of those technologies to be applied to the whole genome. The focus of the Tokyo conference is the study of epigenetics on a genome-wide scale, so here we will focus on technologies that enable a comprehensive investigation of the genome. We note that Bas van Steensel's group has recently published a series of very informative reviews relevant to this subject (Loden and van Steensel 2005; van Leeuwen and van Steensel 2005; van Steensel 2005).
Cytosine methylation has been the best-studied epigenetic mechanism, largely because the experimental techniques for its study have been available for some time. The mainstays of experimental approaches have been the sensitivity of certain restriction enzymes to cytosine methylation and the relative resistance of methylcytosine to bisulfite-induced deamination. Experiments based on these approaches have revealed great detail about individual loci. The expansion of these studies from single-locus to genome-wide approaches is still in development. A platform that has proven to be very robust is that using restriction landmark genome scanning (RLGS), which uses a rare-cutting methylation-sensitive restriction enzyme (such as NotI) in combination with a second enzyme to create a profile of NotI digestion products. The digested NotI overhanging ends are radiolabeled and separated by 2-dimensional gel electrophoresis. Approximately 2,000 sites can be resolved using RLGS. Using virtual image RLGS (vi-RLGS), the actual pattern of NotI fragments can be compared with the computationally predicted pattern for that genome (Matsuyama et al. 2003). This allows differences observed by gel electrophoresis to be linked to specific genomic locations for subsequent validation studies. The technique suffers from being technically challenging, difficult to validate even with vi-RLGS, and limited in terms of the number of loci that are testable (Fazzari and Greally 2004). Nonetheless, it has been the single most informative technique describing differences in cytosine methylation between tissues, stages of differentiation and in disease (Kremenskoy et al. 2003; Shiota et al. 2002; Song et al. 2005).
Other restriction enzyme-based techniques have also been developed, and high-throughput bisulfite sequencing using MALDI-TOF looking at numerous loci in parallel has been the foundation of the European Human Epigenome Project (Bradbury 2003; Rakyan et al. 2004). Most of the techniques used to date have been reviewed comprehensively (Laird 2003; Ushijima 2005), although several recent publications update this list (Ching et al. 2005; Hu et al. 2005; Weber et al. 2005). The techniques differ in terms of how they select a fraction of the genome with distinctive cytosine methylation. Most use the sensitivity of restriction enzymes to cytosine methylation as the means of discrimination, while a new technique uses the affinity of an anti-methylcytosine antibody to enrich the methylated fraction of the genome (methylated DNA immunoprecipitation, MeDIP (Weber et al. 2005)). The techniques also differ in terms of how they detect the genomic source of these distinctive fractions, with gel electrophoresis now largely superseded by microarray-based approaches or high-throughput sequencing.
At present, there is no technique that can test the methylation status of every CpG dinucleotide in the genome. Furthermore, little attention has been paid to the effects of the heterogeneity of CpG dinucleotide clustering in the genome, apart from using this property to define CpG islands (Gardiner-Garden and Frommer 1987; Takai and Jones 2002). Clustering of CpGs increases the number of informative cytosines or restriction sites in some areas compared with others, so that a comparison of cytosine methylation genome-wide will show much stronger signals of methylation or the absence of methylation at sites of clustering than at sites where CpGs are relatively sparse. Possibly because of this, our attention is focused on CpG-enriched regions as particularly informative. Our appreciation of the role of less CpG-rich regions in dynamic methylation patterns remains limited as a consequence.
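To make the notion of CpG clustering concrete, the following minimal Python sketch computes the GC content and the observed/expected CpG ratio of a sequence, the two quantities commonly used to define CpG islands. The function names are hypothetical, and the default thresholds (length of at least 200 bp, GC content above 50%, observed/expected ratio above 0.6) are the values usually quoted from Gardiner-Garden and Frommer (1987); they are assumptions for illustration and are not taken from this review.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply length, GC-content and obs/exp thresholds (assumed defaults)."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("CG" * 100))
    print(looks_like_cpg_island("AT" * 100))
```

Scanning such a function over sliding windows would show how unevenly informative CpGs are distributed, which is the source of the signal bias described above.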
The “ChIP-chip” method is currently the most popular technique for high-resolution mapping of chromatin proteins and enzymes. ChIP-chip, which combines chromatin immunoprecipitation (ChIP) and microarray analysis (chip), was developed in the late 1990s (Nature Genetics editorial 1998) and first published shortly thereafter (Iyer et al. 2001; Ren et al. 2000). Generally, cells are fixed, chromatin is fragmented, and the protein of interest is purified, along with any associated DNA, using an antibody or an affinity tag. The DNA that was co-purified with the protein is then detected using a DNA microarray and mapped back to the genome, allowing the binding position of the protein to be inferred. The details of this technique have been reviewed extensively (Bernstein et al. 2004a; Buck and Lieb 2004; Farnham 2002; Hanlon and Lieb 2004; Im et al. 2004; Johnson and Bresnick 2002; Kuo and Allis 1999; Kurdistani and Grunstein 2003; Lieb 2003; Nal et al. 2001; Oberley and Farnham 2003; Oberley et al. 2004; Orlando 2000; Robyr et al. 2004; Spencer et al. 2003; Weinmann and Farnham 2002). Initial ChIP-chip studies focused on the genome-wide location of well-studied yeast transcription factors. These studies empirically verified the effectiveness of the method by identification of previously known gene targets, and discovery of new and physiologically relevant targets (Horak and Snyder 2002; Iyer et al. 2001; Lieb et al. 2001; Ren et al. 2000; Simon et al. 2001). In addition, DNA-binding motifs corresponding to the known affinities of the proteins were derived from the ChIP-chip data. In the short time since those original studies, ChIP-chip has been applied to hundreds of transcription factors in organisms ranging from bacteria to humans (van Steensel 2005).
Chromatin components and determinants of chromatin dynamics are often the mediators of epigenetic effects. The genome-wide mapping of these chromatin marks by ChIP-chip has led to important insights regarding the mechanism of transcriptional and epigenetic memory, and how different chromatin states are propagated through the genome. In yeast, ChIP-chip has been used to determine the distribution of non-histone chromatin components (Glynn et al. 2004; Lengronne et al. 2004; Smith et al. 2003; Weber et al. 2004), enzymes involved in histone modification (Humphrey et al. 2004; Kurdistani et al. 2002; Ng et al. 2003), and post-translational histone modifications themselves (Bernstein et al. 2002; Kurdistani et al. 2004; Loden and van Steensel 2005; Robyr et al. 2002; van Leeuwen and van Steensel 2005).
More recently, the approach of assessing global chromatin structure has been extended to more complex genomes. In Drosophila, the global acetylation pattern of histones H3 and H4 and the global methylation pattern of histone H3 lysine 4 (H3K4) and lysine 79 (H3K79) have been monitored using ChIP-chip (Schubeler et al. 2004). H3K4 di- and trimethylation, H3K79 dimethylation and H3 and H4 acetylation were all found to be present in the same genes. These results are in agreement with the genome-wide pattern of H3K4 methylation (Bernstein et al. 2002) and Set1p binding (Ng et al. 2003) in yeast and suggest that, in Drosophila, as in yeast, a pattern of nucleosome modifications distinguishes actively transcribed genes from repressed genes throughout the genome. Detailed genome-scale studies have also been carried out in Arabidopsis (Lippman et al. 2004).
Although the antibodies used for ChIP-chip studies of chromatin modifications can often be used across species, as of the printer's deadline there are no published reports of genome-wide, high-resolution chromatin modification ChIP-chip experiments in mammalian cells. This is mainly due to the difficulty of producing inexpensive and easy-to-use full-genome microarrays. The initial mammalian ChIP-chip experiments identified binding sites for the c-Myc, Max, Gata1, E2F and Rb transcription factors in cultured human cells (Horak et al. 2002; Li et al. 2003; Weinmann et al. 2002; Wells et al. 2003). By practical necessity, the DNA microarrays used in these pioneering studies represented only a fraction of the genome. For the c-Myc and Max studies, DNA microarrays were constructed with PCR products spanning the proximal promoters of 4,839 of the approximately 30,000 human genes (Li et al. 2003). Later studies that examined the genomic distribution of the mammalian Set1 homolog MLL used an expanded version of this array that included 19,000 proximal promoters (Guenther et al. 2005). To map the mammalian transcription factors E2F and Rb, DNA microarrays were created with 7,776 CpG island clones (Weinmann et al. 2002; Wells et al. 2003). CpG islands are short stretches of DNA containing a high density of non-methylated CpG dinucleotides, and are associated with the promoters and the first exon of a gene (Antequera and Bird 1993). Studies of transcription factor binding across human chromosomes 21 and 22 have been carried out (Carroll et al. 2005; Cawley et al. 2004; Euskirchen et al. 2004; Martone et al. 2003). These chromosome 21 and 22 arrays were also used for the first genome-scale study of histone modifications (H3K4me2, H3K4me3, and H3K9/14Ac) in human cells (Bernstein et al. 2005).
This year a group of researchers published the first truly genomic human ChIP-chip experiment, using an array of 14.5 million 50mers that covered the unique regions of the human genome at 100 bp resolution to analyze the distribution of TFIID on the human genome (Kim et al. 2005b). Microarrays similar to the one used in that study are in commercial development, and many chromatin modifications are being mapped as part of the NIH's ENCODE project (2004; Boguski 2004; van Steensel 2005).
There are some challenges inherent to any ChIP-chip experiment, and experiments that aim to determine the distribution of chromatin components and histone modifications are especially subject to these concerns. Several factors could create non-biological variation in results, including the effects of fixation, epitope accessibility, antibody specificity, microarray content, and underlying bulk nucleosome occupancy. These challenges, and suggestions for overcoming them, have been discussed at length in recent reviews (Buck and Lieb 2004; Hanlon and Lieb 2004; Loden and van Steensel 2005; van Leeuwen and van Steensel 2005).
Current alternatives to ChIP-chip include the use of DNA adenine methyltransferase-Identification (Dam-ID) (Greil et al. 2003; van Steensel et al. 2001; van Steensel and Henikoff 2003, 2000). In this method, the eukaryotic chromatin protein of interest is fused with the bacterial enzyme DNA adenine methyltransferase. Any DNA with which the protein is associated will then be methylated at adenine, a mark normally restricted to prokaryotes. The methylated DNA can then be purified with antibodies or probed with methyl-sensitive restriction enzymes to deduce the location of the protein's interaction with genomic DNA. Dam-ID is very powerful because it does not require antibodies to be raised against each factor of interest and obviates the need for crosslinking. However, it requires expression of a recombinant protein and careful control of the level and duration of expression, and it cannot be used to map post-translational histone modifications. A second technique requires that the chromatin component be tagged with the biotin ligase recognition peptide, and that this fusion protein and biotin ligase be co-expressed in the same cell type. Cellular biotin is conjugated to the chromatin protein of interest, allowing streptavidin to be used on sheared chromatin from these cells to isolate the DNA at which this chromatin component is located. This approach has been used to track histone H3.3 localization in cultured Drosophila cells (Mito et al. 2005).
Many epigenetic effects are mediated, ultimately, through their effect on transcription factors and transcriptional regulation. To understand how epigenetic processes influence transcriptional programs, accurate binding-site descriptions of hundreds of transcription factors will likely be needed. DNA-binding specificity can be determined by many well-established methods, including binding site selection (SELEX) (Oliphant et al. 1989; Tuerk and Gold 1990) and electrophoretic mobility shift assays (EMSA) (Fried and Crothers 1981). However, these methods are labor-intensive, not amenable to high-throughput analysis, and do not sample the full range of a protein’s natural in vivo DNA substrates. As described below, one can use ChIP-chip experiments to infer binding motifs from computational analysis of the ChIP-enriched sequences (Liu et al. 2002). While ChIP-chip is a powerful method, the ability to infer relevant and accurate binding sites is dependent on a number of factors, including adequate expression of the protein of interest, nuclear localization and DNA binding by that protein, and availability of a specific antibody that is capable of recognizing the protein in the context of a specific DNA-protein complex. Furthermore, the discovery of binding sites from ChIP-chip data is complicated by the effects of protein-protein interactions, and the cooperative and competitive DNA binding of other proteins in vivo. Two high-throughput in vitro methods for the determination of DNA-binding specificity have emerged recently: Protein Binding Microarrays (PBMs) (Mukherjee et al. 2004) and DIP-chip (DNA ImmunoPrecipitation followed by microarray analysis) (Liu et al. 2005b).
In the PBM approach, a purified, epitope-tagged DNA binding protein is incubated with a microarray spotted with double-stranded DNA corresponding either to short, synthetic double-stranded oligonucleotides (Bulyk et al. 2001; Mukherjee et al. 2004) or longer PCR products (Mukherjee et al. 2004). The protein-bound microarray is then washed gently and stained with a fluorophore-conjugated anti-tag antibody (for example, an anti-GST antibody would be used with GST-tagged protein); alternatively, a fluorophore-conjugated anti-transcription factor primary antibody could be used. The stained array is then washed, spun dry, and scanned with a standard microarray scanner. The PBM experimental protocol takes less than a day, and multiple PBM microarray experiments can be performed in parallel. The PBM data, normalized by DNA concentration, are analyzed with a motif finding algorithm in order to identify the protein’s DNA binding sites. Although PBM experiments can be performed using some ‘standard’ binding conditions, one could also examine the effects of altered binding conditions, such as alterations in buffer composition, transcription factor concentration, or even the effects of small molecule or protein cofactors. Importantly, the PBM experiments do not require prior knowledge of the culture conditions in which the transcription factor is expressed, do not require potentially limiting tissue sources, and do not require protein-specific antibodies. Comparison of PBM and ChIP-chip data has revealed that the respective DNA binding site motifs can correspond quite well (Mukherjee et al. 2004). Moreover, comparison of the in vivo and in vitro binding data will permit an analysis of what local sequence context features may contribute to binding site usage. PBM data will be useful for generating binding site motifs for known and predicted transcription factors with poorly characterized or uncharacterized DNA binding specificities.
Such binding site data, coupled with further computational analyses, will permit the prediction of the regulatory roles of transcription factors, as well as the prediction of candidate cis regulatory modules (i.e., transcriptional enhancers) (Bulyk 2003; Philippakis et al.; Philippakis et al. 2005).
In DIP-chip, a purified DNA binding protein is incubated with purified, sheared yeast genomic DNA. Protein-DNA complexes are separated from unbound DNA using immunoprecipitation or affinity purification. Purified DNA fragments are amplified, labeled fluorescently, and identified by hybridization to a DNA microarray. Computational methods are then used to define a binding site based on enriched sequences. DIP-chip, while similar in concept to ChIP-chip, can overcome some of the limitations of ChIP-chip by inferring accurate DNA-binding specificities under well-defined and easily varied in vitro conditions. The experimental protocol for DIP-chip can also be used for a rather different purpose, which is comparing the sites of binding in vitro with the sites of binding in vivo, as defined by ChIP-chip. Comparisons of DIP-chip and ChIP-chip experiments will be useful in determining how much of the specificity of in vivo interactions depends on chromatin and other epigenetic factors, and how much is inherent to the protein and DNA itself. Both PBMs and DIP-chip are powerful adjuncts to ChIP-chip and Dam-ID experiments, and are efficient and accurate methods for determining in vitro DNA binding specificity.
The DNA purified from a ChIP experiment can be cloned and sequenced, providing an alternative to microarray-based detection (Chen and Sadowski 2005; Impey et al. 2004; Kim et al. 2005a; Ng et al. 2005; Roh et al. 2005; Roh et al. 2004). Currently, the cost of performing DNA microarray hybridization on commercial arrays that cover the whole human genome is prohibitive to all but the wealthiest labs. Even with these “whole genome” arrays, the whole genome is not represented, only the non-repetitive portions. Therefore, sequencing-based methods could prove to be very attractive, particularly for larger genomes. Rather than sequencing the entire cloned DNA fragments, most of the referenced methods are similar to SAGE, and create small (~17–21 bp) tags from each enriched fragment. The main question surrounding these methods is how many sequencing reads must be performed before an adequate sampling of all enriched sequences is achieved. Consider the case in which a 20-fold enrichment of targets is achieved by IP, and targets represent 1% of all genomic fragments. If a sequencing approach is chosen, only ~17% of all sequenced tags would be IP targets at all, and for each experiment, a very large number of clones would have to be sequenced to sample the entire IP result with sufficient coverage to confidently identify targets. This method may become feasible by devising high-throughput schemes to increase the practical enrichment and decrease background prior to sequencing. These may include prescreening clones for repetitive elements, including a subtractive hybridization step, modification of the standard ChIP experiment to include a second IP, or size selection to limit nonspecific clones and repetitive elements (Weinmann and Farnham 2002). Another challenge of sequence-based methods is assigning each sequence tag back to a unique position in the genome (Kim et al. 2005a).
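The ~17% figure in the example above follows from simple arithmetic, sketched here in Python. The function name is hypothetical, and the model assumes that every target fragment is enriched by the same factor while background fragments are carried through unchanged.

```python
def target_tag_fraction(target_fraction, fold_enrichment):
    """Expected fraction of sequenced tags that derive from true IP targets.

    target_fraction: share of all genomic fragments that are targets (0..1)
    fold_enrichment: enrichment of targets over background achieved by the IP
    """
    enriched = target_fraction * fold_enrichment   # weight of target fragments
    background = 1.0 - target_fraction             # weight of everything else
    return enriched / (enriched + background)


if __name__ == "__main__":
    # 1% of fragments are targets, enriched 20-fold: 0.20 / (0.20 + 0.99)
    print(f"{target_tag_fraction(0.01, 20):.1%}")  # prints 16.8%
```

Under the same assumptions, raising the enrichment to 100-fold would bring the target share to roughly half of all tags, which illustrates why a second IP or a subtraction step could reduce the required sequencing depth.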
What may ultimately drive these alternatives to the forefront are future reductions in sequencing costs. New commercially available sequencing technologies, such as massively parallel signature sequencing (MPSS) (Brenner et al. 2000) and, more recently, an emulsion-based method coupled to pyrosequencing (Margulies et al. 2005), can sequence tens of thousands of DNA fragments simultaneously, and could be used to sample an entire ChIP. An advantage of a sequencing-based approach is that the results obtained should be more directly correlated with in vivo events when compared with microarray-based approaches.
New technologies and methods for understanding chromatin and epigenetics continue to emerge. Many are based on long-trusted biochemical assays coupled to new detection techniques. DNAse I hypersensitive sites have been a defining characteristic of “active” chromatin for decades, and several new methods have arisen to map hypersensitive sites quantitatively and genome-wide (Dorschner et al. 2004; Sabo et al. 2004). Researchers have also directly measured nucleosome occupancy with genomic methods. A clever alternative to ChIP-chip was performed by digesting yeast chromatin with micrococcal nuclease, purifying mononucleosomes, and using high-resolution tiling arrays to look for missing linker DNA (Yuan et al. 2005). Using this method, the authors were able to obtain a very high-resolution map of nucleosome occupancy on yeast chromosome III, which was consistent with earlier, lower-resolution ChIP-chip studies (Bernstein et al. 2004b; Lee et al. 2004). Another group separated mammalian chromatin with a sucrose gradient and, using microarrays with ~1 Mb resolution, showed that the “open” chromatin is composed of underlying DNA with a high gene density (Gilbert et al. 2004). In a reciprocal approach, another group isolated chromatin enriched for Histone H1 and relatively resistant to DNAse I digestion, and found an inverse correlation between the isolated chromatin and gene expression (Weil et al. 2004).
The least-understood aspects of chromatin, long-range interactions and nuclear domain organization, remain enigmatic largely because few quantitative techniques exist for their study. Fluorescence in situ hybridization studies of epigenetically distinctive regions have revealed differences in the resolution of sister chromatids, believed to be a marker of DNA replication timing (Greally et al. 1998; Kagotani et al. 2002; Takebayashi et al. 2005). Recently, several new techniques have emerged that are providing insights into the higher-order organization of chromatin (Carter et al. 2002; Dekker et al. 2002; Lebrun et al. 2003). Casolari et al. mapped the genome-wide localization of several components of the S. cerevisiae nuclear transport machinery, and their data suggest that the most highly transcribed genes are held near nuclear pores at the periphery of the nucleus to facilitate rapid transport of mRNA to the cytoplasm (Casolari et al. 2004).
An obvious prediction at this stage of development of the field of epigenomics is that our ability to generate experimental results will shortly overwhelm our ability to manage and analyze these complex, highly-dimensional data. Some of these problems will require standardization of vocabularies and the definition of common data elements to allow the kind of syntactic and semantic interoperability that allows us to integrate diverse sources of data. However, for the purpose of this review we will focus on the more immediate need to analyze the results of epigenomic studies.
While genomic DNA is generally packaged into chromatin, local chromatin structure is often disrupted around functional regulatory sequences, making these regions accessible to transcription factors. These regions, typically ~250 bp in length, are hypersensitive (HS) to DNase I and can therefore be detected throughout the genome by comparing DNase I-treated with untreated genomic DNA using high-throughput quantitative PCR (Dorschner et al. 2004). Given a collection of experimentally detected HS sequences and non-HS control sequences, a support vector machine (SVM) can be trained to differentiate HS from non-HS sequences and predict new HS sequences in the genome with 85% accuracy. In the SVM, each sequence is embedded into a vector that contains the counts of all k-mers (k = 1 to 6) in the sequence. The SVM found the CG dinucleotide to be the most important sequence feature in the determination of HS, and predicted 65% of HS sequences to be >5 kb from the nearest transcription start site. The HS sequences were also found to be enriched with CTCF recognition sequences, which are signature insulator elements (Noble et al. 2005).
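The k-mer embedding described above can be illustrated with a short sketch. This is a minimal illustration, not the implementation used by Noble et al.: the function names are hypothetical, the k-mer ordering is arbitrary, and details such as the treatment of reverse complements or normalization of counts are not specified here and are left out.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(max_k=6):
    """Return every k-mer for k = 1..max_k in a fixed (arbitrary) order."""
    return [
        "".join(letters)
        for k in range(1, max_k + 1)
        for letters in product(ALPHABET, repeat=k)
    ]


def kmer_count_vector(seq, max_k=6):
    """Embed a DNA sequence as a vector of overlapping k-mer counts.

    k-mers containing characters outside A/C/G/T (e.g. N) are ignored,
    which is an assumption of this sketch.
    """
    seq = seq.upper()
    counts = Counter()
    for k in range(1, max_k + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Counter returns 0 for k-mers that never occur in the sequence.
    return [counts[kmer] for kmer in kmer_vocabulary(max_k)]


if __name__ == "__main__":
    vocab = kmer_vocabulary(2)
    vector = kmer_count_vector("ACGCG", max_k=2)
    print(dict(zip(vocab, vector))["CG"])
```

For k = 1 to 6 the vector has 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 entries per sequence. Vectors of this kind, computed for HS and non-HS sequences, could then be passed to any standard SVM implementation.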
In another nucleosome occupancy study, differentially labeled micrococcal nuclease (MNase)-treated and untreated yeast genomic DNA were hybridized to microarrays with overlapping 50-mer probes tiled every 20 bp across the yeast chromosome (Yuan et al. 2005). Nucleosome-associated DNA is resistant to MNase digestion, whereas nucleosome-free DNA regions and linkers are hypersensitive. The authors constructed a hidden Markov model to look for well-positioned nucleosomes as a run of 6–8 high-ratio probes (~146 bp) and delocalized nucleosomes as a run of >9 high-ratio probes (Yuan et al. 2005). The remaining regions with runs of low-ratio probes were considered nucleosome-depleted DNA regions. The hidden Markov model identified nucleosome-depleted regions of ~150 bp at ~200 bp upstream of many ORFs. These regions are more conserved than the background genomic sequence, include many transcription-factor binding sites, and are enriched for poly-A and poly-T.
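The run-length definitions above can be made concrete with a deliberately simplified sketch. It is not the hidden Markov model of Yuan et al.: it assumes each probe has already been labeled as high-ratio or low-ratio (the thresholding step is not shown), and it only applies the run lengths quoted in the text. The function name and labels are hypothetical.

```python
from itertools import groupby


def classify_runs(is_high):
    """Label consecutive runs of probes by length.

    is_high: list of booleans, one per probe in genomic order
             (True = high-ratio probe, i.e. MNase-protected).
    Returns (start_index, end_index_exclusive, label) tuples.
    """
    calls = []
    start = 0
    for high, group in groupby(is_high):
        length = len(list(group))
        if not high:
            label = "depleted"          # run of low-ratio probes
        elif 6 <= length <= 8:
            label = "well_positioned"   # run of 6-8 high-ratio probes
        elif length > 9:
            label = "delocalized"       # run of >9 high-ratio probes
        else:
            label = "unclassified"      # run lengths not covered by the text
        calls.append((start, start + length, label))
        start += length
    return calls


if __name__ == "__main__":
    probes = [False] * 3 + [True] * 7 + [False] * 2 + [True] * 12
    for call in classify_runs(probes):
        print(call)
```

A hidden Markov model differs from this sketch in that it tolerates noisy individual probes within a run, whereas a hard run-length rule is broken by a single mislabeled probe.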
Noise is inevitable in ChIP-chip experiments due to the random nature of chromatin shearing, nonspecific binding during the immunoprecipitation step, infidelities in DNA amplification and labeling, and the noise intrinsic to microarray synthesis and hybridization. The means of dealing with these problems to date have been based on performing independent experimental replicates, although advances in molecular techniques and microarray resources are improving our ability to define signals with confidence based on single experiments. To combine experimental replicates, an error model first proposed by Hughes et al. (Hughes et al. 2000) was adapted to ChIP-chip experiments (Harbison et al. 2004; Lee et al. 2002; Ren et al. 2002; Simon et al. 2001). For ChIP-chip conducted on a transcription factor, a (ratio, p-value) pair is reported for each intergenic probe. The ratio is a weighted average of the ChIP over control fold-change in the replicates, where the weight is based on the fold-change and systematic error in each replicate. The product of the ratio and the weight sum follows a normal distribution, so its p-value can be calculated. A p-value of 0.001 has often been used as the threshold to call a binding target, at which level false positive and false negative rates are estimated as 6–10% and 24%, respectively (Lee and Young 2000). Each intergenic sequence represented on the microarray is assigned to the gene(s) downstream of it. An intergenic region is assigned to two genes if it lies between two divergently transcribed genes (←→), and is not assigned if it lies between two convergently transcribed genes (→←).
ChIP-chip experiments identify protein-DNA binding targets at a resolution of hundreds of base pairs, but precise protein-DNA binding sites are often only 5–20 bp in length. Although many chromatin regulators do not themselves recognize specific DNA sequences, they can be recruited to specific sequence locations in the genome by interacting with other proteins (Cosma et al. 1999; Hampsey and Reinberg 2003). After the targets of ChIP-chip against a transcription factor or chromatin regulator are identified, biologists often proceed to use computational tools to find enriched sequence motifs in the ChIP targets to characterize the precise protein-DNA binding sites. Earlier computational methods such as MEME (Bailey and Elkan 1994; Grundy et al. 1996), AlignACE (Kurdistani et al. 2002; Roth et al. 1998), and BioProspector (Liu et al. 2001) look for sequence motifs enriched in all the ChIP-target sequences compared to the genome background. The DNA fragments most highly enriched by ChIP often contain multiple binding sites for the same protein, so MDscan (Liu et al. 2002) was developed to begin motif searches from these sequences. The insight that sequences with more and stronger motif occurrences often have higher ChIP-enrichment values motivated REDUCE (Bussemaker et al. 2001; Wang et al. 2002) and Motif Regressor (BenYehuda et al. 2005; Conlon et al. 2003), which find motifs whose occurrence in sequences correlates with the ChIP-enrichment ratio. Perhaps the most powerful tools are meta-motif finders such as TAMO (Gordon et al. 2005) that combine different motif-finding algorithms to incorporate sequence pattern enrichment, evolutionary conservation, microarray correlation, and known motif databases to derive the most likely binding motif.
While formaldehyde is used with the belief that it stabilizes protein-DNA interactions, evidence from 20 years ago indicates that it works in a ChIP experiment by trapping DNA when protein constituents of chromatin cross-link (Solomon and Varshavsky 1985). However, this protein-protein cross-linking allows the interactions between two transcription factors or chromatin regulators to be studied. An antibody used to immunoprecipitate one member of an interacting protein complex will pull down the DNA targets of both proteins, although the indirect target might have weaker enrichment (Iyer et al. 2001). Therefore, if the ChIP targets of two factors overlap significantly, yet the two factors are not expected to have the same DNA-recognition properties, it is reasonable to hypothesize that the two proteins may be part of the same complex. This has led to the discovery of interactions between Ifh1 and Rap1 (Wade et al. 2004), and between Sum1 and Hst1 (Robert et al. 2004). Computational motif-finding can even predict protein-interaction partners using ChIP-chip data from just a single protein. For example, the canonical binding motifs of both the estrogen receptor (ER) and a forkhead protein were discovered from ChIP targets of the ER (Carroll et al. 2005). Based on the motif discoveries and the fact that the forkhead protein FoxA1 is highly co-expressed with ER, the authors predicted the interaction between ER and FoxA1, and later showed that FoxA1 is required for ER binding (Carroll et al. 2005).
As reviewed above, many microarray platforms have been used to perform ChIP-chip experiments in organisms with large genomes. For tiled arrays, two platforms are currently in use, long oligonucleotide arrays from Nimblegen (Kim et al. 2005b), and Affymetrix 25-mer oligonucleotide microarrays. Due to the significant probe variability and noise associated with the Affymetrix arrays, robust analysis algorithms are required. The first method to analyze the ChIP-chip experiments on Affymetrix tiled arrays was the non-parametric Wilcoxon rank sum test (Bernstein et al. 2005; Cawley et al. 2004).
This method looks at probes in a sliding window of 800–1000 bp, ranks all the ChIP and control probes together in each window by their perfect match-mismatch (PM-MM) values, and checks whether the sum of probe ranks in the ChIPs is significantly higher than that in the controls. To adjust for probe variability, another approach is to estimate the baseline hybridization behavior of a probe across ChIP and control experiments from many different conditions and laboratories (Carroll et al. 2005; Li et al. 2005). A hidden Markov model was then applied to find regions of probes whose (PM-MM) values are much higher than the estimated baseline behavior. Since the full-genome Affymetrix tiled arrays are still a developing platform that is not currently commercially available, we expect many new analysis algorithms to be developed in the near future.
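The sliding-window rank sum test described above can be sketched as follows. This is an illustrative outline rather than the code used in the cited studies: it uses the normal approximation to the rank sum statistic without a tie correction, a one-sided test (ChIP greater than control), and a window centered on each probe. The function names, the default window of 1000 bp and the input layout are assumptions.

```python
import math


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(chip, control):
    """One-sided Wilcoxon rank sum test (normal approximation, no tie correction).

    Returns (rank sum of ChIP values, z-score, p-value for ChIP > control).
    """
    n1, n2 = len(chip), len(control)
    ranks = average_ranks(list(chip) + list(control))
    w = sum(ranks[:n1])
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability
    return w, z, p


def scan_windows(positions, chip_values, control_values, window=1000):
    """Test every window centered on a probe.

    positions:      probe coordinates
    chip_values:    per-probe lists of PM-MM values from ChIP replicates
    control_values: per-probe lists of PM-MM values from control replicates
    Returns a list of (center_position, p_value).
    """
    results = []
    for center in positions:
        idx = [i for i, pos in enumerate(positions)
               if abs(pos - center) <= window / 2]
        chip = [v for i in idx for v in chip_values[i]]
        ctrl = [v for i in idx for v in control_values[i]]
        results.append((center, rank_sum_test(chip, ctrl)[2]))
    return results
```

Because the test uses only ranks within each window, it is insensitive to the absolute scale of individual probes, which is one reason a non-parametric test suits arrays with high probe-to-probe variability.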
Different statistical analysis programs are more appropriate for data derived from the NimbleGen oligonucleotide arrays. These arrays are composed of long oligomers (50-mers), and only perfect matches to the sequence are built on the arrays. One analysis program that has been developed to identify binding sites from ChIP-chip data obtained from NimbleGen arrays is based on percentile-derived thresholds and p-value-determined peak widths, which are calculated according to the Waterman extension of the Erdős-Rényi law. Because of the inherent noise in ChIP-chip data, this analysis program was used to predict peaks from three independent array experiments, and peaks present in at least two of the three arrays were categorized as binding sites. Using very conservative thresholds and p-values, E2F1 binding sites in HeLa cells were identified from ChIP-chip experiments. Interestingly, between 30 and 40% of all promoters within the 30 Mb encompassing the ENCODE regions were categorized as bound by E2F1 (Bieda et al. 2005; manuscript submitted).
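The replicate-consensus step can be illustrated with a simplified sketch. It covers only the percentile threshold and the two-of-three rule: the p-value-determined peak widths based on the Waterman extension of the Erdős-Rényi law are not implemented, probes stand in for peaks, and the function names, the nearest-rank percentile and the default 95th percentile are assumptions rather than the published settings.

```python
import math


def percentile_threshold(values, pct):
    """Nearest-rank percentile of a list of enrichment values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def consensus_calls(replicates, pct=95, min_support=2):
    """Return indices of probes called in at least min_support replicates.

    replicates: list of equal-length lists, one enrichment value per probe.
    A probe is called in a replicate if its value exceeds that replicate's
    own percentile threshold.
    """
    support = [0] * len(replicates[0])
    for values in replicates:
        cutoff = percentile_threshold(values, pct)
        for i, value in enumerate(values):
            if value > cutoff:
                support[i] += 1
    return [i for i, s in enumerate(support) if s >= min_support]


if __name__ == "__main__":
    arrays = [
        [0, 1, 2, 3, 4, 5, 6, 7, 9, 8],
        [0, 1, 2, 3, 4, 5, 6, 9, 8, 7],
        [9, 1, 2, 3, 4, 5, 6, 7, 0, 8],
    ]
    print(consensus_calls(arrays, pct=80))
```

Thresholding each array against its own percentile, rather than a shared absolute cutoff, keeps the calls comparable when overall signal intensity differs between hybridizations.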
The ChIP-chip technique is a very young technology at present, and there are differing opinions on a number of experimental variables, including the appropriate experimental controls, the optimal microarray design and hybridization conditions, whether experimental replicates will continue to be necessary as the technology improves, and whether the same experimental controls and data analytical techniques are appropriate for transcription factor binding and histone modification studies. A lot of progress is expected in this field in the immediate future.
While the focus in this review has been on the basic scientific questions and techniques in whole-genome epigenetic studies, the goal towards which researchers in this field are aiming is the application of these techniques to the study of human disease. The two areas where most insights have been gained are in mammalian development and in cancer biology. We focus on these areas here, while recognizing that many other applications are being developed, including interesting new insights into the role of epigenetics in aging (Fraga et al. 2005) and sex-specific regulation of autosomal genes (Sarter et al. 2005).
Activation of a particular set of genes and inactivation of others underlies the differentiation of cells and mammalian development. Many genes have what are described as tissue-dependent and differentially-methylated regions, or T-DMRs (Shiota 2004). DNA methylation at T-DMRs is involved in tissue-specific and developmentally-regulated gene expression. For example, Pou5f1 (Oct4) is a member of the POU family of transcription factors, essential for normal mammalian development (Nichols et al. 1998; Niwa et al. 2000; Okamoto et al. 1990; Rosner et al. 1990; Scholer et al. 1990). DNA methylation of the Oct4 T-DMR was identified by RLGS and found to play a critical role in silencing of the locus (Hattori et al. 2004). The Oct4 gene has no CpG island at the transcription start site but has a relatively CG-rich and TATA-less promoter (Okazawa et al. 1991; Sylvester and Scholer 1994). In the placenta of DNA methyltransferase 1 (Dnmt1)-deficient mice, most CpGs were unmethylated and Oct4 was expressed ectopically, supporting a causal role for cytosine methylation in the regulation of this gene. Expression of Sry is another example of DNA methylation of a T-DMR mediating gene silencing (Nishino et al. 2004). Sry is a master gene for testis differentiation in mammals (Koopman et al. 1991) and has a promoter region with rare CpGs. Certain prolactin/growth hormone gene family members are also regulated by DNA methylation, although their promoters contain too few CpGs to constitute a CpG island (Cho et al. 2001; Ngo et al. 1996). DNA methylation can therefore be involved in gene regulation even when (or perhaps especially when) CpG dinucleotide density is insufficient to constitute a CpG island.
Rat sphingosine kinase 1 (Sphk1) is an example of a gene for which tissue-specific expression is regulated by DNA methylation at a T-DMR (Imamura et al. 2001). The T-DMR is located approximately 800 bp upstream of the first exon of Sphk1 in a 200 bp region at the 5′ edge of a CpG island. The T-DMR of Sphk1 is conserved in the human and mouse genomes and these T-DMRs are also targets of DNA methylation (Imamura et al. 2004). Other genes with T-DMRs in CpG islands have also been described, including the Ednrb (endothelin receptor type B) (Pao et al. 2001), Pomc1 (proopiomelanocortin) (Newell-Price et al. 2001) and Serpinb5 (Maspin) (Futscher et al. 2002) genes. In addition, genome-wide screening identified CpG islands that are methylated in a manner reflecting their gametic origin, leading to their being termed germline differentially-methylated regions (gDMRs) (Strichman-Almashanu et al. 2002). These loci can be considered a further type of T-DMR occurring within CpG islands.
These loci represent the characterized subset of a much larger group of loci for which tissue-specific methylation has been identified using RLGS. A genome-wide analysis focusing on 1,500 CpG islands and CG-rich regions identified 247 T-DMRs, which were methylated or unmethylated depending on the cell or tissue type (Ohgane et al. 1998; Ohgane et al. 2005; Okazaki et al. 1995; Shiota et al. 2002). The methylation status of these T-DMRs creates a profile that distinguishes one cell or tissue type from others (Figure 1A). Considering that there are 16,100 CpG islands in the mouse haploid genome (March 2005/mm6 mouse genome assembly, UCSC Genome Browser, http://genome.ucsc.edu), of which RLGS can only sample a subset, the number of T-DMRs is likely to expand with further studies, allowing even more complex specific DNA methylation profiles. A further consequence of these studies is to replace the old idea of universal unmethylation of CpG islands with a model in which CpG methylation differences, even at CpG islands, mediate differences in cell states.
If the DNA methylation profile is a unique identification code for a cell (Figure 1) and is involved in the regulation of gene expression, a change of the DNA methylation profile will cause alteration of the properties of the cell. Cloned animals created by somatic cell nuclear transfer have been found to have aberrant DNA methylation profiles in every tested tissue (Ohgane et al. 2001). As placentomegaly (placental overgrowth) is a frequently observed phenotype of cloned animals regardless of the origin or gender of donor cells (Ogura et al. 2000; Tanaka et al. 2001; Wakayama and Yanagimachi 2001), RLGS was performed to determine whether this specific problem reflected abnormal epigenetic regulation. A specific pattern of aberrant DNA methylation at the Sall3 locus was found in the cloned mice exhibiting the placentomegaly (Ohgane et al. 2004).
With advances in genome-wide techniques to study epigenetic regulation, extending beyond cytosine methylation studies and testing more loci than allowed by RLGS, it is probable that our ability to detect variability associated with normal development, cloning, assisted reproductive technology and fetal exposures will be enhanced. A critical resource that will allow multiple investigators to compare their observations will be a common database structure allowing the DNA methylation and other epigenetic regulatory characteristics of genomic loci to be assembled.
Although epigenetic alterations are being recognized to occur in disorders of development, the major current example of a disease state involving disordered epigenetic regulation is carcinogenesis (Jones and Baylin 2002). Aberrant methylation of CpG islands in promoter regions is involved in inactivation of various tumor-suppressor genes, such as RB, p16, VHL, hMLH1, E-cadherin and BRCA1, processes that occur in a number of cancers, including colon, bladder, stomach, liver, breast, uterine, and renal malignancies. In some cancers, such as gastric cancers, tumor-suppressor genes are inactivated more frequently by their promoter methylation than by mutations (Ushijima and Sasako 2004). At the same time, global genomic hypomethylation is occurring in most cancer cells (Feinberg and Tycko 2004). Hypomethylation can lead to genomic instability and contribute to tumor formation (Gaudet et al. 2003). Hypomethylation of specific CpG islands causes aberrant expression of some cancer-testis antigen genes, such as MAGE genes (De Smet et al. 1999) and can lead to loss of imprinting (LOI) (Feinberg and Tycko 2004).
Epigenetic inactivation of MLH1 induces microsatellite instability (Kane et al. 1997), while inactivation of CHFR induces loss of cell cycle checkpoint function (Mizuno et al. 2002). On the other hand, some cancer cells harbor intrinsic abnormalities that induce increased rates of de novo methylation (Ushijima et al. 2005). The increased rates lead to what has been described as the CpG island methylator phenotype (CIMP), which was originally reported in colorectal cancers (Toyota et al. 1999). However, there still remains a controversy over the presence of CIMP (Yamashita et al. 2003), and a genome-wide analysis of well-characterized CpG islands is awaited.
Factors that induce aberrant DNA methylation include aging and chronic inflammation and possibly viral infection (Issa et al. 2001; Issa et al. 1994). Cell division is considered to be a major factor in the induction of abnormal methylation (Velicescu et al. 2002). It has been proposed that decreased gene expression and methylation of scattered CpG sites (“seeds of methylation”) are involved in the induction of dense methylation of CpG islands (De Smet et al. 2004; Song et al. 2002; Ushijima and Okochi-Takada 2005). The development of convenient assay systems is necessary to identify individual factors that induce DNA methylation, and the detailed molecular mechanisms for induction of aberrant methylation need to be understood.
The epigenetic alterations in cancer are now being used in cancer diagnosis and treatment (Miyamoto and Ushijima 2005). For diagnostic purposes, aberrant DNA methylation can be used first to detect cancer cells in biopsy or laboratory specimens and cancer-derived free DNA in serum/plasma (Belinsky 2004; Laird 2003). Aberrant DNA methylation has an advantage over mutations because aberrant methylation can be detected sensitively, while targets for diagnosis can be readily identified by a genome-wide search for differences in DNA methylation (Ushijima 2005). Secondly, aberrant DNA methylation can be used to predict a disease phenotype, such as prognosis, responses to chemotherapies, or occurrence of adverse effects. Methylation of MGMT, a gene involved in repair of O6-methylguanine, is a useful predictor of the responsiveness of tumors to alkylating agents in gliomas and diffuse large B-cell lymphomas (Esteller and Herman 2004). The presence of CIMP in neuroblastomas is a strong predictor of poor clinical outcome (Abe et al. 2005). Thirdly, aberrant methylation in non-cancerous tissue has potential as a cancer risk marker. The presence of the LOI in normal colonic mucosa and peripheral leukocytes is associated with risk of colorectal cancers (Cui et al. 2003). An advantage of working with cytosine methylation as a diagnostic marker is that DNA is a more stable nucleic acid than RNA and is consequently less sensitive to specimen handling than RNA-based assays.
The plasticity of epigenetic information offers a good target for cancer therapeutics (Egger et al. 2004). Administration of the DNA demethylating agent 5-aza-deoxycytidine (5-aza-dC) at low doses for a prolonged period turned out to be more effective than higher doses in hematological malignancies (Issa et al. 2004). To overcome the short half-life of 5-aza-dC, the drug Zebularine, which can be administered orally, has been developed (Cheng et al. 2004). Histone deacetylases (HDACs) are also well studied as therapeutic targets (Marks et al. 2004; Yoshida et al. 2001). Various HDAC inhibitors have been developed for therapeutic purposes, and tumor cells are known to show higher sensitivity to these agents than normal cells for reasons that remain unclear. Phase I/II trials are now under way for solid tumors (Marks et al. 2004).
In this review, we have placed an emphasis on the technical advances in the analysis of the epigenetic organization of the whole genome. We also highlight the need to develop means of organization and analysis of epigenetic data. While we focused on the role of epigenetic dysregulation in developmental abnormalities and cancer, the list of disorders to which epigenetic abnormalities could be contributing is very large. If resources are to be used with maximum efficiency to test how epigenetic regulation contributes to human disease, it has become clear that investigator-initiated research will need to be complemented by the kind of common resources that were used for the human genome project. These resources first and foremost should address the difficulties of data management, integration and analysis. In addition, just as technology innovation drove the human genome project, attention needs to be paid to the development of the technologies that will allow genome-wide epigenetics studies, especially applied to the limited numbers of cells that can be isolated to a high degree of purity by techniques such as laser capture microscopy. The goal is simple – to define the role of epigenetics in human disease, allowing new insights, diagnostic tests and drug targets. Current efforts to create a coherent Human Epigenome Project include a recent workshop sponsored by the American Association for Cancer Research (Jones and Martienssen 2005). While such an undertaking is daunting in its complexity (Fig. 2), the goal is unquestionably worthwhile.
The contributions of the following speakers to the Whole-Genome Epigenetics 2005 conference are gratefully acknowledged: Drs. Kuniya Abe, Hiroyuki Aburatani, Izuho Hatada, Tim Hui-Ming Huang, Fumio Ishino, Peter A. Jones, Tetsuji Kakutani, Hiroshi Kataoka, Takeo Kubota, Hiroki Kurihara, Robert Martienssen, Mitsuyoshi Nakao, Mikio Nakazono, Masaki Okano, Mitsuo Oshimura, Hiroyuki Sasaki, Toru Shimada, Yoichi Shinkai, Masakazu Sugiyama, Kazunari Taira, Shoji Tajima, Satoshi Tanaka, Nobuhiro Tsutsumi, Shintaro Yagi and Minoru Yoshida.