Next generation sequencing has brought epigenomic studies to the forefront of current research. The power of massively parallel sequencing coupled to innovative molecular and computational techniques has allowed researchers to profile the epigenome at resolutions that were unimaginable only a few years ago. With early proof of concept studies published, the field is now moving into the next phase, in which method standardization and rigorous quality control are becoming paramount. In this review we will describe methodologies that have been developed to profile the epigenome using next generation sequencing platforms. We will discuss these in terms of library preparation, sequencing platforms and analysis techniques.
At its inception massively parallel sequencing was ill-suited to the task of sequencing the human genome. Perhaps then it is not surprising that some of the first publications that utilized next-generation sequencing were directed at chromatin immuno-precipitation enriched fractions of the genome [1–3]. Since their introduction, short read massively parallel sequencing platforms have continued to improve at an exponential rate, generating longer sequences of better quality in ever increasing numbers. The research community has leveraged these improvements to develop a diverse collection of sequence-based methodologies to probe the functional genome [4–6]. These methodologies can be broadly divided into protocols that profile transcribed regions of the genome and those that profile the processes regulating transcription. Transcriptional regulation is maintained through complex interactions between sequence-specific transcription factors, which act over short time frames, generally in response to specific cellular stimuli, and mechanisms that act on longer time scales in response to more generalized environmental and developmental signals. The study of the mechanistic features that control this latter category is called epigenetics, and the study of how these marks are patterned across the genome is called epigenomics.
Epigenetic processes act on DNA and histones, the building blocks of nucleosomes. In the mammalian genome DNA modification occurs exclusively on cytosine residues, at the 5′-position of the pyrimidine ring, in the form of either a methyl or hydroxymethyl group [8, 9]. Until recently, modification of mammalian genomic DNA was thought to be restricted to the context of CG dinucleotides, known as ‘CpGs’. However, recent epigenomic profiles have revealed that methylation is also found in alternate contexts, including CHG and CHH, in pluripotent cell types.
The nucleosome is the fundamental unit of chromatin and is composed of two copies of each of the four core histones (H3, H4, H2A and H2B) around which 146 bp of DNA are wrapped [11, 12]. Histones are evolutionarily conserved proteins characterized by an accessible amino-terminal tail and a histone fold domain that mediates interactions between histones to form the nucleosome scaffold. The N-termini of histone polypeptides are extensively modified by more than 60 different post-translational modifications, including methylation, acetylation, phosphorylation and ubiquitination [14, 15]. Although the vast majority of these modifications remain poorly understood, there has been significant progress in recent years in understanding the roles that methylation and acetylation play in transcriptional regulation.
A prerequisite for understanding the role of epigenetics in development and disease is knowledge of the genome-wide distribution of epigenetic modifications in normal and diseased states. The availability of reference genome assemblies and massively parallel, next generation sequencing platforms has led to methods which provide high-resolution genome-wide epigenomic profiles. In this review, we will describe methodologies that have been developed to profile the epigenome using next generation sequencing platforms. We will discuss these in terms of library preparation techniques, sequencing platforms and analysis. Current next generation sequencing approaches require that the collection of DNA fragments to be sequenced be flanked by standard nucleotide sequences to allow for clonal amplification or, in the case of the Helicos platform, direct sequencing. In this review we will refer to collections of such fragments as ‘libraries’ and to the process of building such collections as library preparation.
Library preparation for next generation sequencing can be broadly divided into two distinct processes: preparation of genomic DNA (gDNA) fragments, generally in the size range of a single nucleosome, followed by preparation of those fragments for sequencing (Figure 1). Preparation of gDNA fragments for next generation sequencing generally involves the addition of nucleotide sequences to the ends of the fragments that will hybridize to complementary sequences present on the matrix used to generate clonal copies of the library fragment for sequencing.
The N-terminal tails of histones are extensively modified in response to developmental and environmental signals [14, 15]. The predominant method for mapping these post-translational modifications genome-wide involves a technique known as chromatin immuno-precipitation (ChIP). In this method histones are either chemically coupled to DNA through the action of a cross-linking reagent such as formaldehyde, or released in their native form by the addition of a nuclease that, at the correct dilution, specifically digests gDNA at unprotected linker sequence. Following gDNA fragmentation the protein/DNA mixture is subjected to immuno-precipitation using antibodies raised against the post-translational modification under study. In the process of immuno-precipitation, DNA fragments in association with histone peptides are co-purified and, following proteolytic digestion and DNA purification, subjected to library construction and direct sequencing (ChIP-seq) [2, 3, 19, 20].
Direct sequencing of ChIP enriched fractions has distinct advantages over competing hybridization based techniques. Among these is the ability to interrogate epigenomic marks in repetitive elements, which comprise ~45% of the human genome. Early methods for ChIP followed by direct sequencing, by either capillary sequencing [22, 23] or the 454 platform, included concatenation of short sequence tags derived from the immuno-precipitated fragments to effectively utilize the relatively longer reads provided by these sequencing methodologies. The development of massively parallel short read platforms such as the Genome Analyzer (Illumina Inc.) and SOLiD (Life Technologies) negated the need for complicated library construction techniques and allowed for direct library construction from the immuno-precipitated products. The dominant platform utilized to date for ChIP-sequencing is the Illumina Genome Analyzer [1–3, 19, 20]. More recently the SOLiD platform [25, 26] has been applied in this area and a single reference is available outlining the application of the Heliscope Genetic Analysis platform (Helicos BioSciences).
ChIP sequencing (ChIP-seq) library construction for the Genome Analyzer or SOLiD next generation sequencing platforms is an implementation of standard methodologies for whole genome shotgun sequencing [28, 29]. In this method the ragged ends of the enriched fragmented DNA, typically in the low nanogram range, are repaired and platform-specific adapters are ligated onto either blunt-ended (SOLiD) or A-tailed (Genome Analyzer) fragments. Adapter ligated product is then PCR amplified using primers which hybridize to the adapter sequences and extend to include sequences which facilitate clonal amplification and sequencing. In addition to A-tailing the gDNA fragments, library preparation for the Genome Analyzer utilizes adapters that are only partially complementary, introducing a ‘fork’ in the adapter that is subsequently resolved during PCR. This structure allows all adapted fragments to be PCR amplified. In contrast, the SOLiD platform involves the addition of two independent adapter sequences during ligation, allowing only 50% of adapted fragments to participate in PCR amplification.
Recently, Goren et al. reported the use of the Heliscope Genetic Analysis platform for a ChIP-seq study directed at limited cell populations. Library construction for the Heliscope platform differs significantly from competing next generation sequencing platforms in that PCR amplification is not required. Library construction involves a single step where immuno-precipitated gDNA fragments are A-tailed using a terminal transferase enzyme and dATP and, after capture to a proprietary substrate, directly sequenced.
In contrast to histone modification profiling, a wide variety of approaches have been developed to profile DNA methylation utilizing next generation sequencing platforms. Approaches to profile DNA methylation genome-wide can be broadly divided into those that rely on methylation dependent enzymatic restriction, methyl-DNA enrichment and direct bisulfite conversion [21, 30]. Individual methods can also be combined to increase the resolution or efficiency of a single method; for example, MeDIP-seq and MRE-seq can be combined to profile both the methylated and unmethylated fractions of the genome.
Methylated DNA Immuno-precipitation sequencing (MeDIP-seq) is an immuno-precipitation based technique where fragmented DNA is enriched based on its methylation content [32, 33]. Antibodies used in this technique are raised against single-stranded methyl-cytosine and thus the immuno-precipitation is performed on denatured DNA. To avoid over-representation of repeat content in the subsequent library through preferential annealing of highly methylated genomic repeats, library construction is performed prior to the immuno-precipitation and the library is PCR amplified following enrichment.
At sufficient sequencing depths, on the order of two Gigabases aligned, MeDIP-seq is capable of identifying 70–80% of the 28 million CpGs in the human haploid genome at a resolution of 100–300 bases. This is close to the expected frequency of methylation in the human genome [8, 9]. At saturating sequencing depths it may also be possible to annotate uncovered CpGs as non-methylated.
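As a rough illustration of why a couple of Gigabases of aligned sequence recovers most CpGs, a Poisson (Lander–Waterman) coverage model can be sketched in a few lines of Python. The ~1 Gb effective target size used below is an illustrative assumption only: MeDIP enrichment concentrates reads on the methylated fraction of the genome, so the effective target is smaller than the full ~3 Gb genome.

```python
import math

def fraction_covered(aligned_bases, target_size):
    # Lander-Waterman / Poisson model: with mean per-base coverage c,
    # the probability that a given position is sequenced at least once
    # is 1 - exp(-c).
    c = aligned_bases / target_size
    return 1.0 - math.exp(-c)

# Against the full ~3 Gb genome, 2 Gb aligned covers only ~49% of positions...
whole_genome = fraction_covered(2e9, 3e9)

# ...but if enrichment concentrates reads on a ~1 Gb methylated fraction
# (an illustrative assumption), coverage rises to ~86%, consistent with
# the 70-80% CpG recovery quoted above.
enriched = fraction_covered(2e9, 1e9)
```

The model ignores non-uniform enrichment and duplicate reads, so real libraries fall somewhat short of these idealized figures.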
Methylated DNA Binding Domain sequencing (MBD-seq) is similar in concept to MeDIP-seq in that genomic fragments are enriched based on their methylation content. In this technique, bead-immobilized recombinant methyl-CpG binding proteins MECP2 or MBD2 are used to enrich for methylated DNA fragments from a pool of genomic DNA fragmented by sonication to 100–300 bp in length. Following enrichment of methylated double stranded DNA fragments, standard library construction techniques are utilized to generate a library representing the methylated fraction of the genome.
MBD-seq differs from MeDIP-seq in the application of multiple salt elutions during release of the methyl-CpG containing DNA fragments bound to the immobilized methyl binding domain. In this technique, weakly methylated DNA fragments are eluted at lower salt concentrations than moderately or densely methylated DNA fragments (e.g. methylated CpG islands). Thus it is possible to selectively enrich for weakly or densely methylated DNA fragments during capture, potentially reducing downstream sequencing costs. In the absence of selective enrichment, MBD-seq requires a similar degree of sequencing as MeDIP-seq and at this depth (2 Gigabases aligned) is capable of identifying 70–80% of the 28 million CpGs in the human genome at a resolution of 100–300 bases. As with MeDIP-seq, at saturating sequencing depths it may also be possible to call any uncovered CpGs as non-methylated.
The ‘gold standard’ for profiling methylated cytosine is bisulfite-mediated deamination of cytosine. This technique, discovered simultaneously by the Shapiro and Hayatsu groups in the early 1970s, relies on the selectivity of the bisulfite reaction to deaminate cytosine, but not 5-methylcytosine, to uracil, which is subsequently read as thymine during sequencing [36, 37]. Bisulfite-based methods detect hydroxymethylation, but cannot distinguish it from methylation. In the original methodology, bisulfite treated genomic regions were amplified by site specific PCR, cloned and subjected to Sanger sequencing. Sequence reads were assessed individually and visualized as a matrix with the CpG content of each clone represented as a row. While this approach has been extremely valuable in elucidating the methylation status of discrete genomic regions, it does not scale well and cannot feasibly be applied to whole genome studies. With the advent of next generation sequencing it is now possible to directly shotgun sequence bisulfite treated genomic DNA. In this method, library construction is performed prior to bisulfite treatment using adapters in which cytosines have been replaced by methyl-cytosines to protect them from deamination during the bisulfite treatment [39, 40]. Following bisulfite treatment, a process performed under denaturing conditions, the library is PCR amplified using primers which extend the adapter sequences to allow for clonal amplification and sequencing. This technique, termed Methyl-C-seq or BS-seq, first performed genome-wide on the genome of the flowering plant Arabidopsis thaliana [39, 40], has recently been applied to the human genome. To generate sufficient read coverage for the latter study, over 200 lanes of Illumina Genome Analyzer sequence data, at a list cost of over $200 000 USD in reagents, were required.
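The logic of bisulfite readout can be sketched as a short simulation (the helper names are invented for illustration, and Python is used only as a sketch language): unmethylated cytosines deaminate and are read as thymine, methylated cytosines are protected, and comparing a converted read back to the reference recovers a per-cytosine methylation call.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """In-silico bisulfite conversion: an unmethylated C reads as T,
    while methylated C (5mC) is protected and still reads as C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

def call_methylation(reference, converted_read):
    """Compare a converted read with the reference: a reference C still
    read as C was methylated; one read as T was not."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, converted_read)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls

ref = "ACGTCGAC"
read = bisulfite_convert(ref, methylated_positions={1})  # only the CpG at index 1 is methylated
calls = call_methylation(ref, read)
```

Note that, as stated above, 5-hydroxymethylcytosine would also resist conversion and be called "methylated" by this comparison.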
However, recent advances in the throughput and efficiency of next generation sequencing platforms have reduced the costs associated with such an experiment dramatically. It is expected that by the fall of 2010 the cost of such an experiment will have dropped 20-fold, to the range of $10 000 USD (reagents only).
The high cost of sequencing a sodium bisulfite converted genome has spurred the development of strategies for enriching genomic regions of interest followed by bisulfite sequencing [3, 41–44]. Two general strategies have emerged. In the first, coined reduced representation bisulfite sequencing (RRBS), the genome is digested by the methylation insensitive restriction enzyme MspI and size selected to generate a fragment library within the range of next generation sequencing platforms (typically 100–300 bp). The size selected material becomes the input for library construction using methylated adapters and is subjected to bisulfite conversion analogous to the procedure used for full genome bisulfite shotgun sequencing. Due to the selective nature of the method, RRBS covers only 12% of CpGs genome-wide; however, these CpGs are highly enriched within CpG islands.
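The reduced-representation step can be mimicked in silico. The sketch below assumes only that MspI recognizes CCGG and cuts C^CGG, so every internal fragment carries a CpG at each end; the function names are illustrative, not from any published pipeline.

```python
import re

def mspi_digest(genome):
    """In-silico MspI digest: MspI recognizes CCGG and cuts C^CGG,
    leaving every internal fragment with a CpG at each end."""
    cut_sites = [m.start() + 1 for m in re.finditer("CCGG", genome)]  # cut after the first C
    boundaries = [0] + cut_sites + [len(genome)]
    return [genome[a:b] for a, b in zip(boundaries, boundaries[1:])]

def size_select(fragments, lo=100, hi=300):
    # Keep only fragments compatible with short-read library construction.
    return [f for f in fragments if lo <= len(f) <= hi]

# A toy sequence with two MspI sites: only the middle fragment survives
# size selection.
genome = "A" * 50 + "CCGG" + "T" * 150 + "CCGG" + "G" * 400
fragments = mspi_digest(genome)
selected = size_select(fragments)
```

Running the digest across a reference genome in this way is how the expected RRBS fraction of CpGs can be estimated before sequencing.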
Alternatively, genome enrichment can be performed using molecular inversion probes and PCR following bisulfite conversion of the genome [43, 44]. Molecular inversion probes can be designed to include all possible combinations of cytosines and uracils, or to avoid CpGs to mitigate the specificity loss associated with the cytosine to uracil conversion. Once amplified, the targeted regions can be directly sequenced on a next generation sequencing platform following standard techniques. Publications utilizing these methods typically target on the order of thousands of CpGs, or 0.2% of genome-wide CpGs [43, 44].
Various strategies have been developed to profile the unmethylated fraction of the genome using restriction enzymes that are sensitive to the CpG methylation state. Protocols involving digestion with a single methyl-sensitive restriction enzyme (HpaII; HELP-seq, Methyl-seq and MSCC) as well as multiple digestions (HpaII, AciI and Hin6I; MRE-seq) have been developed [31, 43, 45, 46]. The protocol involves digestion of the genomic DNA by one or more methyl-sensitive enzymes followed by size selection, pooling where appropriate and library construction. Minor modifications of the standard library construction procedures are used to account for the nature of the overhangs generated by the restriction digests. The use of additional restriction enzymes during the digestions increases the diversity of fragments in the library and thus increases the total number of CpGs that can be queried. In these methods, the methylation status of 1–2 million CpGs is assessed.
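Why additional enzymes increase the number of queryable CpGs can be sketched by counting the CpG positions that fall inside at least one recognition site. The recognition sequences below are the commonly cited ones (HpaII CCGG; AciI CCGC, appearing as GCGG on the opposite strand; Hin6I GCGC) and should be verified against enzyme documentation before any real use.

```python
import re

# Recognition sequences as assumed for this sketch; all are blocked by
# CpG methylation.
ENZYMES = {
    "HpaII": ["CCGG"],
    "AciI": ["CCGC", "GCGG"],
    "Hin6I": ["GCGC"],
}

def queryable_cpgs(genome, enzymes):
    """Return the set of CpG positions falling inside at least one
    methyl-sensitive recognition site; each enzyme adds new sites."""
    positions = set()
    for name in enzymes:
        for site in ENZYMES[name]:
            for m in re.finditer(site, genome):
                cg = genome.find("CG", m.start(), m.end())
                if cg != -1:
                    positions.add(cg)
    return positions

genome = "TTCCGGAAGCGCTTCCGCAA"
one_enzyme = queryable_cpgs(genome, ["HpaII"])
three_enzymes = queryable_cpgs(genome, ["HpaII", "AciI", "Hin6I"])
```

On this toy sequence a single enzyme queries one CpG while the three-enzyme combination queries three, mirroring the increase in library diversity described above.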
Restriction based methodologies pose unique challenges during sequencing on the Illumina Genome Analyzer and SOLiD platforms. Enzymatic restriction skews the nucleotide representation at the terminal ends of the fragments that are subjected to library construction. During the initial stages of sequencing, this nucleotide bias can lead to the generation of poor quality focal maps, a key step in massively parallel next generation sequencing. This can be avoided by ligating balanced nucleotide adapters onto the fragment ends or, in the case of the Genome Analyzer, by starting base calling after the restriction site.
Individual methods may also be combined to increase coverage and/or efficiency. For example, MeDIP-seq and MRE-seq may be combined to profile both the methylated and unmethylated fractions of the genome simultaneously [21, 31]. Bisulfite conversion can be combined with an enrichment strategy (for example MBD-seq or MeDIP-seq) to provide increased resolution of methyl-cytosines in the immuno-precipitated fraction.
Recent advances in sequencing technology have raised the possibility of the direct detection of DNA modifications. At the forefront of these efforts is Pacific Biosciences, which has recently demonstrated an ability to directly detect DNA methylation during single-molecule, real-time (SMRT) DNA sequencing, a technique for studying nucleic acid sequence and structure [47–49]. Similarly, Oxford Nanopore Technologies has published proof of concept data for the direct detection of 5-methylcytosine. At the appropriate scale, these techniques offer the exciting possibility of direct, de novo detection of DNA methylation genome-wide.
The majority of published epigenomic studies utilizing next generation sequencing have been generated on an Illumina Genome Analyzer. This is in part due to its early adoption by the field as well as the flexibility of its library preparation and its massively parallel, base space output. While there were some early examples of epigenomic data sets generated on the comparatively longer read 454 platform, these have largely been replaced by methods on the comparatively shorter read platforms. Conceptually, the SOLiD platform from Life Technologies is equally well suited to sequencing epigenomic libraries and more recently research groups have begun to publish ChIP-seq data sets using this platform [25, 26]. There is a single report of the application of the Heliscope Genetic Analysis platform to ChIP-seq studies, and proof of concept methylation data sets have been published by Pacific Biosciences and Oxford Nanopore Technologies [47, 50].
The Genome Analyzer is a synchronous sequence-by-synthesis platform that leverages reversible dye terminators. Libraries of DNA fragments are clonally amplified on the surface of a flow cell (a closed microscope slide) onto which modified oligos complementary to the sequences of the PCR primers utilized in library construction have been grafted. Sequencing is performed by the stepwise application of reagents, single nucleotide incorporation, flushing of excess reagents and imaging. The images are subsequently analyzed to generate a focal map for each clonally derived cluster and then used to call bases at each cycle. A typical Illumina Genome Analyzer run can currently generate 30 million reads per lane, 210 million per flow cell, at read lengths up to 100 bases. In the spring of 2010, a higher throughput version of the Illumina Genome Analyzer, called the HiSeq 2000, was released. The specifications for this instrument indicate that over 60 million reads per lane, 500 million per flow cell, can be achieved.
To facilitate unambiguous alignment of sequence reads within genomic repeat regions, paired-end sequencing can be performed. In this implementation, a second read is generated on the clonally amplified cluster using a sequencing primer that anneals to the opposing adapter. To achieve this, the clonally derived read cluster, rendered single stranded during the first round of sequencing, is regenerated by PCR on the flow cell surface. Sequencing is performed as above, utilizing the focal map generated from the first read to associate the two sequence reads. A similar strategy can also be employed to read a sequence barcode added to the adapter during library construction. This so-called ‘third read’ enables pooling of multiple libraries in a single flow cell lane.
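On the analysis side, barcode pooling reduces to a demultiplexing step: each read is assigned to a library by its index ('third') read. A minimal sketch follows; the sample names, barcodes and mismatch tolerance are invented for illustration.

```python
from collections import defaultdict

def demultiplex(reads, barcode_map, max_mismatches=1):
    """Assign pooled reads to source libraries by their index read,
    tolerating a small number of sequencing errors in the barcode."""
    def mismatches(a, b):
        return sum(x != y for x, y in zip(a, b))

    libraries = defaultdict(list)
    for insert_read, index_read in reads:
        hits = [name for name, bc in barcode_map.items()
                if mismatches(index_read, bc) <= max_mismatches]
        # Accept only unambiguous assignments.
        key = hits[0] if len(hits) == 1 else "undetermined"
        libraries[key].append(insert_read)
    return dict(libraries)

barcodes = {"libraryA": "ACGTAC", "libraryB": "TGCATG"}
pooled = [("GATTACCA", "ACGTAC"),   # exact match to libraryA
          ("CCGATTAG", "TGCATT"),   # one sequencing error, still libraryB
          ("TTAACGGT", "AAAAAA")]   # matches nothing
binned = demultiplex(pooled, barcodes)
```

In practice barcode sets are designed with enough pairwise distance that a one-mismatch tolerance cannot produce an ambiguous hit.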
The SOLiD platform is a synchronous sequencer utilizing a sequence-by-ligation approach. In this platform, libraries of fragments are clonally amplified on the surface of a 1 micron bead on which an oligo complementary to one of the two adapters used in the library construction is covalently bound. Clonal amplification is achieved by limiting dilution of the fragment library during emulsion PCR (emPCR), which is performed in an emulsion generated by mechanical whipping of an aqueous solution containing PCR reagents, amplification beads, the library and oil. Following emPCR, ‘loaded’ beads are enriched by hybridization of the alternate adapter to complementary oligos covalently attached to a polystyrene bead. Enriched beads are subsequently attached to the surface of a glass slide and sequencing is performed by the stepwise application of reagents, ligation of labeled probes, flushing of excess probes and imaging. The images are subsequently analyzed to generate a focal map for each bead and to call the base transitions generated during the ligations. A SOLiD4 slide can currently generate ~600 million clonal reads at read lengths up to 50 bases. As with the Illumina Genome Analyzer, paired-end sequencing and barcoding methodologies have also been developed for the SOLiD platform.
The 454 Genome Sequencer FLX is a pyrosequencing platform. Similar to the SOLiD platform, the 454 FLX leverages emPCR to clonally amplify library fragments onto the surface of a bead. Following enrichment, sequencing is performed by depositing beads onto the surface of a micro-fabricated slide that contains 1.6 million small reaction chambers. Single beads are sequenced in each micro-chamber by the stepwise addition of nucleotides in a fixed order followed by imaging. Nucleotide incorporation is monitored for each micro-well by a chemiluminescent signal generated as a by-product of nucleotide incorporation. A 454 Genome Sequencer FLX can currently generate 1 million reads at read lengths up to 400 bp. Due to the limited number of reads and high cost per read compared with other next generation sequencing platforms, the 454 FLX platform is generally not used for epigenomic studies.
Third generation sequencing platforms are distinct from their forebears in that they are designed to sequence DNA at the level of a single molecule. The advantages of such an approach include a much simplified library generation process, massively parallel sequencing at long read lengths and, importantly, the absence of repeated PCR amplifications prior to sequencing. A testament to human ingenuity is the diverse number of such platforms under development. Examples include Helicos, the first company to provide a single molecule sequencer, using a sequencing-by-synthesis and imaging approach; Pacific Biosciences, which sequences DNA in real time by imaging fixed DNA polymerases; and Oxford Nanopore Technologies, which is developing a sequencing platform based on the current changes induced by nucleotides as they pass through an alpha hemolysin nanopore. Early versions of some of these platforms have already been used in proof of concept epigenomic studies [47, 50]. However, full realization of their potential is perhaps 2–5 years away.
Analysis of epigenomic data sets generated on a next generation sequencing platform remains a significant challenge. This is due, in part, to the relatively short period of time for which these data sets have been available, compounded by the rapid rate of change of next generation sequencing platforms.
Analysis of epigenomic data sets generated by next generation sequencing platforms can be broken into four steps, the results of which can be considered analysis levels (Figure 2). Data generated from a next generation sequencing platform consist of strings of bases (Illumina Genome Analyzer, 454 FLX) or color space base transitions (SOLiD) along with associated quality scores. The first step in analysis is to align this primary, level 0, data to a reference genome assembly to generate a level 1 data set consisting of the genomic coordinates and strand of each alignment on the reference genome. A number of specialized aligners have been developed to map the tens of millions of reads generated in a single experiment to a mammalian sized reference genome (for review see ). The majority of widely adopted aligners use a ‘seed and extend’ based algorithm where a sub-string contained within the read is rapidly aligned to either a hash table (MAQ, SOAP, SHRiMP, ZOOM and BFAST) or, more recently, a suffix array generated from a Burrows–Wheeler transform of the reference genome (BOWTIE, BWA and SOAP2). Once a match is found the read is ‘extended’ up to the maximum read length in an attempt to uniquely place the read within the genome. Reads that cannot be placed uniquely are either randomly placed on the genome or ignored for downstream analyses. Within the last year the output of such alignments has largely been standardized on the SAM/BAM file format. Bisulfite treated DNA requires specialized alignment to account for the C to T conversion. Several short read alignment algorithms are available that can be configured for bisulfite converted DNA alignment, including BSMAP, Pash, RMAP, ZOOM and BS Seeker. A recent comparison of these aligners concluded that, despite minor differences in speed and accuracy, aligner choice is unlikely to have a significant impact on overall analysis.
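The bisulfite-alignment problem is commonly handled by collapsing the C/T ambiguity before matching. The aligners listed above differ in their details, but the core trick can be sketched as a toy exact-match version (not any specific tool's algorithm): convert both read and reference C to T, find the read, then recover methylation calls from the original sequences.

```python
def ct_convert(seq):
    # Collapse the bisulfite ambiguity: treat every C as T so that a
    # converted read matches the reference regardless of methylation state.
    return seq.replace("C", "T")

def bisulfite_align(read, reference):
    """Toy bisulfite alignment: exact search in C->T space, then recover
    methylation calls by comparing the original read to the reference."""
    conv_read, conv_ref = ct_convert(read), ct_convert(reference)
    pos = conv_ref.find(conv_read)
    if pos == -1:
        return None
    calls = {pos + i: (read[i] == "C")       # True = methylated
             for i in range(len(read))
             if reference[pos + i] == "C"}
    return pos, calls

reference = "TTACGTCGGA"
read = "ACGTTG"   # genomic ACGTCG with the second C unmethylated (reads T)
result = bisulfite_align(read, reference)
```

Real aligners additionally handle the reverse strand, tolerate mismatches and penalize ambiguous placements; this sketch only illustrates why the C→T conversion makes methylation-independent placement possible.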
Following read alignment, level 1 data may be viewed directly by converting the read alignments into read density maps (level 2) and displaying the result on a genome browser, or further processed through segmentation.
Segmentation methods attempt to transform raw sequence alignments into regions of signal and background (level 3, Figure 2). In general, segmentation tools attempt to model the expected behavior of the epigenomic mark (for a recent review of segmentation methods see ). For immuno-precipitation based methodologies two main strategies have emerged. The first, used primarily for epigenomic marks that tend to be punctate in their genomic distribution, such as H3K4me3 or H3K9ac, attempts to build ‘peaks’ of enrichment by modeling individual fragments within the library. Regions of enrichment are defined by oriented read sets that are computationally extended by the insert size of the fragment library. Examples of such tools are FindPeaks, ERANGE, GLITR and PeakSeq. The second attempts to model more broadly distributed (spreading) chromatin modifications, such as H3K9me3 or H3K36me3, by dividing the genome into windows of defined size and enumerating either the raw or normalized number of reads which align within each window. Examples of binning tools are CisGenome and ChromaSig. In addition, attempts have been made to combine the attributes of binning and peak-calling methodologies into a single algorithm.
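The 'extend and pile up' strategy behind the first class of tools can be illustrated with a toy segmentation. This sketch uses a fixed read-count threshold purely for illustration; the published tools named above estimate enrichment statistically rather than by a hard cutoff.

```python
from collections import Counter

def call_peaks(reads, fragment_length=200, bin_size=50, threshold=5):
    """Toy 'extend and pile up' segmentation: each aligned read
    (start, strand) is extended to the library fragment length,
    per-bin coverage is counted, and runs of bins at or above a
    fixed threshold are merged into peak intervals."""
    coverage = Counter()
    for start, strand in reads:
        lo = start if strand == "+" else start - fragment_length + 1
        lo = max(lo, 0)
        for b in range(lo // bin_size, (lo + fragment_length - 1) // bin_size + 1):
            coverage[b] += 1
    peaks, current = [], None
    for b in sorted(k for k, v in coverage.items() if v >= threshold):
        if current and b == current[1] + 1:
            current = (current[0], b)      # extend the open peak
        else:
            if current:
                peaks.append(current)
            current = (b, b)               # open a new peak
    if current:
        peaks.append(current)
    return [(first * bin_size, (last + 1) * bin_size) for first, last in peaks]

# Six overlapping fragments form a peak; a lone read does not.
peaks = call_peaks([(1000, "+")] * 6 + [(5000, "+")])
```

Reads on the reverse strand are extended leftward, which is why oriented read sets from the two strands converge on the same interval around a true binding or modification site.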
An important consideration when segmenting ChIP-seq data sets is the use of a control signal for normalization and background estimation. A control signal is typically derived from sequencing either the sheared input DNA that was used for the immuno-precipitation or a non-specific immuno-precipitate (IgG). The idea is to control for incorrect mappings (e.g. read stacks) driven by genome mis-assembly and/or polymorphisms, and for background signal generated by the shearing process itself (open chromatin, for example, would be expected to be more readily sheared by sonication than closed chromatin). One of the main differences between segmentation tools is in how this is approached, but the general idea is to subtract the signal obtained in the control from the experimental track, thus normalizing the signal to the background.
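A minimal version of this control correction might look like the following: scale the control to the ChIP library depth, then score each bin by the depth-normalized ratio. The per-bin counts are invented, and real tools use more sophisticated statistics than a simple ratio with a pseudocount.

```python
def normalized_enrichment(chip_counts, control_counts, pseudocount=1.0):
    """Scale the control to the ChIP library depth, then score each bin
    by the depth-normalized ratio; bins inflated in the control (read
    stacks, mis-assembly, shearing bias) are suppressed."""
    scale = sum(chip_counts) / sum(control_counts)
    return [
        (c + pseudocount) / (scale * ctrl + pseudocount)
        for c, ctrl in zip(chip_counts, control_counts)
    ]

chip    = [2, 40, 3, 50]
control = [2,  4, 3, 50]   # the last bin is an artifact present in both tracks
scores = normalized_enrichment(chip, control)
```

Here only the second bin scores as enriched; the last bin, despite having the highest raw ChIP count, is suppressed because the control shows the same pile-up.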
The diversity of segmentation tools currently available is a natural consequence of the rapid advances being made in the field. It is outside the scope of this review to provide a detailed breakdown of the various segmentation tools available (for an excellent current review please see ). However, researchers undertaking epigenomic studies utilizing next generation platforms need to be cognizant of these differences to make an informed decision on which tool is most suitable for their data set. Over time, as was the case with microarray analysis, it is expected that standardized tools, accepted by the majority of the community, will be employed in epigenomic research.
An additional consideration for any next generation sequencing based epigenomic method is how deeply to sample each library. As sequencing depth increases, the number of unique reads covering a particular region approaches the total number of reads present in the library for that region. The point at which further sequencing fails to discover additional regions above background is referred to as ‘saturation’. Sequencing up to saturation may be sufficient, for example when the goal is to maximize the number of samples analyzed, while sequencing beyond saturation improves confidence in the observations and increases the coverage of events, though at greater cost per event covered.
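Saturation is usually assessed empirically by subsampling the aligned reads and counting how many regions are discovered at each depth. The sketch below uses a deliberately crude region caller (a bin is 'discovered' at three reads) as a stand-in for a real segmentation tool; a flattening curve indicates the library is approaching saturation.

```python
import random
from collections import Counter

def simple_regions(positions, bin_size=100, min_reads=3):
    # Stand-in region caller: a bin counts as discovered once it holds
    # at least min_reads aligned read starts.
    counts = Counter(pos // bin_size for pos in positions)
    return {b for b, n in counts.items() if n >= min_reads}

def saturation_curve(reads, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Shuffle once, then count regions discovered in growing prefixes
    of the read list (nested subsamples of increasing depth)."""
    rng = random.Random(seed)
    shuffled = reads[:]
    rng.shuffle(shuffled)
    return [(f, len(simple_regions(shuffled[: int(len(shuffled) * f)])))
            for f in fractions]

# Two well-covered regions and one region too sparse to ever be called.
reads = [100] * 10 + [500] * 10 + [900] * 2
curve = saturation_curve(reads)
```

Because the subsamples are nested prefixes of one shuffle, the discovered-region count can only grow with depth, making the flattening of the curve easy to read off.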
There are a number of stand-alone and web-based options available for visualization of aligned or segmented epigenomic data sets (for review see ). The most mature and widely used are the genome browsers maintained by the University of California Santa Cruz and Ensembl. These ‘first generation’ genome browsers enable visualization of genome-wide data sets as linear tracks presented in the context of genome annotations. While extremely powerful for manual genome ‘browsing’ and focused visualization on a gene-by-gene basis, linear browsers become unwieldy when large numbers of individual tracks are visualized at once. In addition, these tools do not provide a capacity for larger scale integrative analysis of epigenomic data sets. While a number of informatic platforms designed for global, genome-wide analysis are currently in development, and a few early versions have been published, the majority of genome-wide analyses of next generation sequencing based epigenomic data sets require custom scripting capabilities.
Next generation sequencing has brought epigenomic studies to the forefront of current research. The past 5 years have seen dramatic increases in the stability, throughput and quality of next generation sequencing. This exponential rate of change is expected to continue as third generation sequencing platforms become available. However, the underlying molecular biology supporting epigenomic experiments is likely to remain largely unchanged. Thus the effective interpretation of data sets generated in diverse laboratories using common epigenomic techniques requires the development and adoption of standards. These standards extend from the molecular biology through sequencing and analysis to the metadata included in public data submissions.
Perhaps in no other area would the epigenomic community benefit more from standardization than in the affinity reagents used for ChIP-seq experiments, on which the bulk of current epigenomic studies rely. Currently, a diverse collection of vendors provide affinity reagents of varying sensitivity and specificity. Moreover, a large fraction of these resources are non-renewable polyclonals and as such cannot be used as ongoing standards in the field. Large scale epigenomic projects such as the NIH Epigenomics Roadmap and ENCODE have recognized this limitation and have programs targeted at the generation of renewable, standardized affinity reagents. However, until these are fully developed and widely available, it is critical that individual researchers undertaking epigenomic studies fully characterize the affinity reagents used in their laboratories. In this regard, arrays of modified peptides representing commonly targeted histone post-translational modifications have recently become available and should be used to assess both false positive (cross reaction) and false negative (antibody recognition blocked by an adjacent modification) profiles.
Equally important is the development and standardization of computational methods to process and display epigenomic data sets. Key to this effort is the development of computationally derived quality metrics, similar to the base quality calls used in genomic studies, for enrichment based epigenomic profiles. Ideally such a metric would provide a researcher with an understanding of the overall level of enrichment in an experiment. If widely adopted, such a common metric would allow for meaningful comparisons between experiments. Finally, as the scale of epigenomic data sets continues to increase, the information associated with data submissions needs to be standardized. Information related to the antibody, including vendor and lot, as well as the experimental conditions, is critical to enable meta-analyses of these rich data sets in the future.
This work was supported by the US National Institutes of Health (NIH) Roadmap Epigenomics Program, NIH grant 5U01ES017154-02 (to M.H. and M.A.M.) and Canadian Institutes of Health Research Grant 92093 (to M.H.).
M.A.M. is a Terry Fox Young Investigator and a Michael Smith Senior Research Scholar.
Martin Hirst is the Functional Genomics Group Leader, Genome Sciences Centre, BC Cancer Agency and Adjunct Professor in the Department of Molecular Biology and Biochemistry at Simon Fraser University. Martin Hirst's research focuses on understanding the genetic and epigenetic causes of cancer.
Marco A. Marra is the Director, Genome Sciences Centre, BC Cancer Agency and Professor, Department of Medical Genetics, University of British Columbia. His research focuses on the evaluation and implementation of novel genomics approaches to research problems fundamentally important in health and disease.