Epigenomics refers to genome-scale analysis of epigenetic marks. The term “methylome,” the genome-wide state of DNAm, was first introduced by the author in 2001 [44]. Despite the availability of an essentially complete genome sequence for several years, understanding of the methylome has progressed more slowly, largely because of technological limitations in sensitivity, specificity, throughput, quantitation, and cost among the previously used detection methods. All of the available methods involve trade-offs among these variables. Furthermore, all of these variables are themselves moving targets, particularly cost. The rule in genomic science generally is that increased demand substantially reduces cost, for three reasons: fierce competition in the biotechnology sector; production efficiencies as methods are automated; and continued technological advances. It is also important to define clearly what is meant by genome scale. The term is commonly applied to any method not limited to specific predefined genes, but no epigenomic method in common use examines the entire epigenome. For the purposes of this article, the discussion will be limited to DNAm analysis because of its particular suitability for pathological and epidemiological studies, owing to its stability in biobanked specimens.
The human genome contains ~3 × 10⁹ bp of DNA, of which ~3 × 10⁷ are CpG dinucleotides, and half of those lie in nonrepetitive single- or low-copy sequence [45]. CpG dinucleotides are the sites that can be methylated, and that methylation is in turn replicated faithfully during cell division by DNA methyltransferase 1 (DNMT1). While non-CpG methylation exists, it is not currently considered epigenetic information, since no mechanism is known for its propagation during DNA replication.
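The maintenance logic above can be sketched in a few lines: because the CpG dinucleotide is symmetric, the parental strand retains its methyl marks after replication, and DNMT1 restores them on the newly synthesized strand. This is a toy model only; the positions and marks are invented for illustration.

```python
def maintain_methylation(cpg_positions, parent_methylated):
    """Toy model of maintenance methylation by DNMT1.
    After replication the daughter strand carries no methyl marks, so each
    methylated CpG is transiently hemimethylated; DNMT1 recognizes these
    hemimethylated sites and remethylates the daughter strand."""
    daughter_methylated = set()          # nascent strand starts unmethylated
    for pos in cpg_positions:
        if pos in parent_methylated:     # hemimethylated CpG after replication
            daughter_methylated.add(pos)  # DNMT1 copies the mark
    return daughter_methylated

# The daughter cell inherits exactly the parental methylation pattern:
parent = {10, 42, 99}
assert maintain_methylation([10, 42, 77, 99], parent) == parent
```

This symmetry is also why non-CpG methylation, which has no complementary-strand partner, lacks a comparable copying mechanism.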
What are the methods in common practice for measuring genome-scale DNAm? While this review is naturally written from the perspective of our own approaches, there are several other excellent reviews of epigenomics [46]. Most investigators are drawn to commercially available methods, particularly those that can be performed as a service, with only DNA needing to be prepared by the investigator. However, these methods are not necessarily the most comprehensive or the most accurate. A method similar to array-based SNP analysis is the Illumina GoldenGate methylation assay [48], or its more recent cousin, the Illumina Infinium methylation platform. Both methods involve bisulfite conversion of unmethylated cytosine to uracil, followed by polymerase chain reaction (PCR), which propagates a thymine residue at the converted base [49]. Methylated cytosine is unconverted and thus read as cytosine. The methylation state (C/ᵐC) is thereby transformed into a pseudopolymorphism (T/C, respectively). The readout is then as for any SNP and is semiquantitative, accurate to within ~17% for the GoldenGate assay [50]. The major limitation of this approach is the relatively poor genome coverage of both methods: only ~1,500 CpG sites for GoldenGate and ~27,000 for Infinium, representing 0.01% and 0.18% of the single-copy methylome, respectively. A second limitation lies in the choices involved in selecting CpG sites for analysis. The chips are designed in part on the premise that functional CpG methylation lies within canonical gene promoters and CpG islands. CpG islands are defined algorithmically, i.e., based on a formula given above. The major rationale for this choice is the literature showing hypermethylation of CpG islands in cancer. Yet those studies are largely self-referential in design, and recent studies described below suggest that most variable DNAm occurs outside of these islands. Nevertheless, great advances have been made possible by these reagents and methods, and they have spurred efforts by many laboratories to improve the resolution of genome-scale technology.
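The C/T pseudopolymorphism, and the coverage fractions quoted above, can be made concrete with a short sketch. The sequence and methylated positions here are invented for illustration; the conversion rule itself (unmethylated C → T after PCR, methylated C preserved) is as described in the text.

```python
def bisulfite_pcr_read(seq, methylated_positions):
    """Unmethylated C is converted to U and amplified as T;
    methylated C resists conversion and still reads as C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

# The C of the first CpG (position 2) is methylated and survives conversion;
# the unmethylated C at position 6 reads as T -- a C/T pseudopolymorphism.
assert bisulfite_pcr_read("GACGTACGT", {2}) == "GACGTATGT"

# Coverage fractions quoted in the text, out of ~1.5e7 single-copy CpGs:
single_copy_cpg = 1.5e7
for assay, sites in [("GoldenGate", 1_500), ("Infinium", 27_000)]:
    print(f"{assay}: {100 * sites / single_copy_cpg:.2f}%")
# GoldenGate: 0.01%, Infinium: 0.18%
```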
Furthermore, there are comparatively few data supporting the choice specifically of promoters and CpG islands for studies of other diseases, or of normal population variation. Indeed, a relatively small-scale but very comprehensive study was performed by Stephan Beck at the Sanger Center on ~1.8 Mb of DNA, including ~40,000 CpG sites across 12 tissues [10]. The study showed that most methylation variation was not at transcriptional start site-associated CpGs or at CpG islands [10]. One encouraging result from that study, for those who wish to use CpG chips as described above, was a high degree of correlation between the methylation of CpG sites lying within a few hundred base pairs of each other. Even so, relying on one or two CpGs per candidate region still seems precarious.
A second approach in common practice is hybridization of antibody-purified methylated DNA to high-density genome arrays [51]. For example, NimbleGen offers hybridization of methylated DNA immunoprecipitation (MeDIP) material to a ~2-Mb array tiled through gene promoters and CpG islands. The coverage of this array is much greater than that of the SNP-based arrays described above. However, region selection is still a significant issue, given that complete tiling of the genome would currently require ten arrays, which is cost-prohibitive for large-scale epidemiological studies. Furthermore, MeDIP shows significant limitations in discriminating methylation differences in regions of medium- to low-density CpG content [52], and our recent study shows that this is exactly where much or most of the significant variation in DNAm occurs [53]. Another method focused on CpG islands is restriction landmark genome scanning [54]. There are emerging alternatives for methylation fractionation, including affinity purification of methylated DNA on methyl-CpG-binding protein [55], or affinity purification of unmethylated DNA [56].
Two promising methods for genome-scale analysis use methylated DNA fractionation based on restriction endonuclease digestion. One of these, developed by John Greally and colleagues at Albert Einstein College of Medicine in New York, is termed HELP, for HpaII-tiny fragment Enrichment by Ligation-mediated PCR [57]. It takes advantage of the difference in size between HpaII fragments, which are generated only from unmethylated DNA, and MspI fragments; MspI recognizes the same cleavage site but is methylation-insensitive. While initial specificity was relatively limited, recent improvements incorporate additional methylcytosine-sensitive endonucleases, allow representation of >98% of CpG islands and >90% of RefSeq promoters, and permit combination with next-generation sequencing for readout [58]. A second method involves fractionation of the unmethylated component with McrBC, which recognizes methylated DNA when two methylcytosines, each preceded by a purine, are separated by ~40–100 bp, an easy condition to meet for methylated DNA except at very low CpG density. This approach was first applied to the analysis of specific chromosomes [59] but was subsequently extended to the study of human cancer [60].
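The HpaII/MspI comparison at the heart of HELP can be illustrated with an in-silico digest. This is a simplified sketch: the sequence and coordinates are invented, and the real protocol's fragment size selection and ligation-mediated PCR are omitted. The enzymatic logic, however, is as the text describes: MspI cuts every CCGG, while HpaII is blocked when the internal CpG is methylated.

```python
def ccgg_cut_sites(seq, methylated_cytosines, methyl_sensitive):
    """Find CCGG sites; a methylation-sensitive enzyme (HpaII) skips sites
    whose internal CpG cytosine is methylated, while MspI cuts regardless."""
    sites = []
    for i in range(len(seq) - 3):
        if seq[i:i + 4] == "CCGG":
            if methyl_sensitive and (i + 1) in methylated_cytosines:
                continue  # HpaII blocked by CpG methylation
            sites.append(i)
    return sites

seq = "AACCGGTTACCGGTT"
methylated = {10}  # the CpG cytosine of the second CCGG site
mspi = ccgg_cut_sites(seq, methylated, methyl_sensitive=False)   # sites [2, 9]
hpaii = ccgg_cut_sites(seq, methylated, methyl_sensitive=True)   # site [2] only
# Sites cut by MspI but not by HpaII are inferred to be methylated:
assert set(mspi) - set(hpaii) == {9}
```

The same comparison logic, run in reverse, underlies McrBC fractionation: there it is the *methylated* component that is cut away, leaving the unmethylated fraction for analysis.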
Rafael Irizarry and I, with our colleagues, developed an array-based readout method that is independent of the methylation fractionation method and can be applied equally to McrBC, HELP, or antibody-based methods. This approach, termed CHARM, for comprehensive high-throughput array-based relative methylation analysis, involves two essential components. First, the array is agnostic to presuppositions about the location of differential methylation and tiles through regions based only on relative CpG content, in decreasing abundance [52]. It therefore includes all CpG islands, but those represent only 38% of the CpG “real estate,” or available oligonucleotide probe positions for analysis on the array. One could use additional arrays, or soon-to-be-released higher-density arrays, to increase coverage, which is now about one fourth of the entire nonrepetitive methylome. The second component is genome-weighted smoothing, that is, correction for the hybridization properties of the target (i.e., sample) genome at each location, calculated in turn from empirical measurements of hybridization efficiency with regard to GC content, CpG density, and fragment length [52]. A statistical suite of postprocessing algorithms, written in R, is termed CharmR and is continually revised. The arrays and CharmR are open access and open source (http://www.biostat.jhsph.edu/~maryee/charmR/). Thus, while not commercially available, this technology is readily transportable to core laboratories that have statistical and programming support.
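The smoothing idea can be sketched as a running average of probe-level methylation log-ratios along genomic position. This is a bare-bones illustration only, not the CharmR implementation: the genome-weighted correction for GC content, CpG density, and fragment length that CHARM actually applies is omitted, and the window size here is arbitrary.

```python
def smooth_m_values(positions, m_values, window=500):
    """Average each probe's methylation log-ratio (M-value) with all probes
    within +/- `window` bp, so a regional signal emerges from noisy probes."""
    smoothed = []
    for p in positions:
        neighborhood = [m for q, m in zip(positions, m_values)
                        if abs(q - p) <= window]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed

# Three nearby probes share a regional estimate; the distant probe stands alone.
assert smooth_m_values([100, 300, 600, 2000],
                       [1.0, 3.0, 2.0, 10.0]) == [2.0, 2.0, 2.0, 10.0]
```

The design rationale is the same one that motivates region-level analysis generally: single-probe measurements are noisy, but biologically meaningful methylation differences tend to extend over hundreds of base pairs.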
Although my colleagues and I have developed one of the current approaches to epigenomic analysis, we gladly welcome the advent of second-generation sequencing technology for DNAm analysis. There are multiple competing commercial platforms for massively parallel sequencing on slides, with per-machine throughput of >300 Gb per run at <1% of the cost of conventional automated sequencing [61]. A particular advantage of sequencing-based methylation analysis is the ability to ascertain allele-specific methylation by virtue of DNA polymorphisms within the same sequencing read. This is particularly true as longer reads become cost-effective.
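The allele-assignment step can be sketched as follows. This is a toy model: each read is reduced to its base at a heterozygous SNP plus a methylation call at one CpG covered by the same read, whereas real pipelines work from aligned bisulfite reads with many CpGs per read.

```python
from collections import defaultdict

def allele_specific_methylation(reads):
    """reads: (snp_base, cpg_methylated) pairs, one per sequencing read.
    Returns the fraction of methylated reads for each allele."""
    counts = defaultdict(lambda: [0, 0])   # allele -> [methylated, total]
    for allele, methylated in reads:
        counts[allele][1] += 1
        if methylated:
            counts[allele][0] += 1
    return {allele: meth / total for allele, (meth, total) in counts.items()}

# Reads carrying the A allele are mostly methylated; G-allele reads are not --
# the signature of allele-specific methylation.
result = allele_specific_methylation(
    [("A", True), ("A", True), ("A", False), ("G", False), ("G", False)]
)
assert abs(result["A"] - 2 / 3) < 1e-9 and result["G"] == 0.0
```

Longer reads make this more powerful simply because more CpGs fall on the same molecule as the discriminating SNP.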
Nevertheless, whole-genome bisulfite sequencing applied to humans is not presently cost-effective for epidemiological studies. Costs run to many tens of thousands of dollars per sample, compared with hundreds of dollars for less comprehensive alternatives. Therefore, sequencing-based methods all involve significant trade-offs. One approach involves “reduced representation,” using restriction enzymes to limit the sequenced target to regions within CpG islands [63], which may miss significant normal variation in patient populations or across tissues [53]. Single-molecule sequencing, such as that in development by Pacific Biosciences, or other methods not yet widely discussed publicly, might reduce costs to the point of making whole-genome shotgun sequencing inexpensive compared with other methods for epigenomic profiling. Until that day comes, however, a great deal can be learned about the methylome of normal and diseased human populations using array- or chip-based approaches.