Analysis of epigenomic data sets generated on a next generation sequencing platform remains a significant challenge. This is due, in part, to the relatively short period of time for which these data sets have been available, compounded by the rapid rate of change of next generation sequencing platforms.
Analysis of epigenomic data sets generated by next generation sequencing platforms can be broken into four steps, the results of which can be considered analysis levels (Figure 2). Data generated from a next generation sequencing platform consist of strings of bases (Illumina Genome Analyzer, 454 FLX) or color space base transitions (SOLiD) along with associated quality scores. The first step in analysis is to align this primary, level 0, data to a reference genome assembly to generate a level 1 data set consisting of the genomic coordinates and strand of the alignments on the reference genome. A number of specialized aligners have been developed to map the tens of millions of reads generated in a single experiment to a mammalian-sized reference genome (for review see ref. [53]). The majority of widely adopted aligners use a ‘seed and extend’ based algorithm in which a sub-string contained within the read is rapidly matched against either a hash table (MAQ [54], SOAP [55], SHRiMP [56], ZOOM [57] and BFAST [58]) or, more recently, a suffix array generated from a Burrows–Wheeler transform of the reference genome (BOWTIE [59], BWA [60] and SOAP2 [61]). Once a match is found the read is ‘extended’ up to the maximum read length in an attempt to place the read uniquely within the genome. Reads that cannot be placed uniquely are either randomly placed on the genome or ignored in downstream analyses. Within the last year the output of such alignments has largely been standardized on the SAM/BAM file format [62]. Bisulfite-treated DNA requires specialized alignment to account for the C to T conversion. Several short read alignment algorithms are available that can be configured for bisulfite-converted DNA alignment, including BSMAP [63], Pash [64], RMAP [65], ZOOM [57] and BS Seeker [66]. A recent comparison of these aligners concluded that, despite minor differences in speed and accuracy, aligner choice is unlikely to have a significant impact on overall analysis [21]. Following read alignment, level 1 data may be viewed directly by converting the read alignments into read density maps and displaying the result on a genome browser, or further processed through segmentation.
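The ‘seed and extend’ strategy can be sketched in a few lines of Python. This is an illustrative toy, not a production aligner: real tools use compressed indexes, spaced seeds and quality-aware scoring, and the seed length, mismatch limit and sequences below are arbitrary choices made for the example.

```python
from collections import defaultdict

def build_seed_index(reference, k=4):
    """Hash table mapping every k-mer 'seed' in the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def seed_and_extend(read, reference, index, k=4, max_mismatches=1):
    """Look up the read's first k bases in the seed index, then 'extend' each
    candidate position to the full read length, counting mismatches."""
    hits = []
    for pos in index.get(read[:k], []):
        candidate = reference[pos:pos + len(read)]
        if len(candidate) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, candidate))
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits  # a unique placement corresponds to len(hits) == 1
```

A read returning more than one hit is the multi-mapping case described above, which aligners either place randomly or discard.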
Figure 2: Analysis process flow. Images generated during the sequencing process are converted to base (Illumina Genome Analyzer) or color (SOLiD) space strings and associated qualities. This process is performed on instrument and the output (level 0) consists of …
Segmentation methods attempt to transform raw sequence alignments into regions of signal and background (level 3, Figure 2). In general, segmentation tools attempt to model the expected behavior of the epigenomic mark (for a recent review of segmentation methods see [67]). For immuno-precipitation based methodologies two main strategies have emerged. The first, used primarily for epigenomic marks that tend to be punctate in their genomic distribution, such as H3K4me3 or H3K9Ac, attempts to build ‘peaks’ of enrichment by modeling individual fragments within the library. Regions of enrichment are defined by oriented read sets that are computationally extended by the insert size of the fragment library. Examples of such tools are FindPeaks [68], ERANGE [20], GLITR [69] and PeakSeq [70]. The second attempts to model more broadly distributed (spreading) chromatin modifications, such as H3K9me3 or H3K36me3, by dividing the genome into windows of defined size and enumerating either the raw or normalized number of reads that align within each window. Examples of binning tools are CisGenome [71] and ChromaSig [72]. In addition, attempts have been made to combine the attributes of binning and peak calling methodologies into a single algorithm [73].
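The window-based (‘binning’) strategy reduces to a compact sketch: assign each aligned read start to a fixed-size window and flag windows whose counts clear a background threshold. The window size, threshold and positions here are arbitrary illustrative values; real tools additionally model read orientation, fragment extension and local background.

```python
from collections import Counter

def bin_read_counts(read_starts, window_size=200):
    """Assign each aligned read start to a fixed-size genomic window and
    count reads per window (a minimal 'binning' segmentation)."""
    return dict(Counter(start // window_size for start in read_starts))

def call_enriched_windows(counts, threshold=5):
    """Windows whose read count meets the threshold are reported as signal;
    the rest are treated as background."""
    return sorted(w for w, c in counts.items() if c >= threshold)
```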
An important consideration when segmenting ChIP-seq data sets is the use of a control signal for normalization and background estimation. A control signal is typically derived from sequencing either the sheared input DNA that was used for the immuno-precipitation or a non-specific immuno-precipitate (IgG). The idea is to control both for incorrect mappings (e.g. read stacks) driven by genome misassembly and/or polymorphisms and for background signal generated by the shearing process itself; open chromatin, for example, would be expected to be more readily sheared by sonication than closed chromatin. One of the main differences between segmentation tools is how this is approached, but the general idea is to subtract the signal obtained in the control from the experimental track, thus normalizing the signal to the background.
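The subtraction idea can be sketched as follows, assuming both tracks have already been binned into per-window counts. Scaling the control by the ratio of library sizes before subtracting is one simple normalization; individual tools differ substantially in how they model the background, so treat this only as a sketch of the general idea.

```python
def subtract_control(chip_counts, control_counts, chip_total, control_total):
    """Scale the control track to the ChIP library size, then subtract it
    window by window; negative residuals are floored at zero."""
    scale = chip_total / control_total
    windows = set(chip_counts) | set(control_counts)
    return {
        w: max(0.0, chip_counts.get(w, 0) - scale * control_counts.get(w, 0))
        for w in windows
    }
```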
The diversity of segmentation tools currently available is a natural consequence of the rapid advances being made in the field. It is outside the scope of this review to provide a detailed breakdown of the various segmentation tools available (for an excellent current review please see [67]). However, researchers undertaking epigenomic studies utilizing next generation platforms need to be cognizant of the differences in order to make an informed decision on which tool would be most suitable for their data set. Over time, as was the case with microarray analysis, it is expected that standardized tools, accepted by the majority of the community, will be employed in epigenomic research.
An additional consideration for any next generation sequencing based epigenomic method is how deeply to sample each library. As sequencing depth increases, the number of unique reads covering a particular region approaches the total possible reads present in the library for each enriched region. The point at which further sequencing fails to discover additional regions above background is referred to as ‘saturation’. Sequencing up to saturation may be sufficient when, for example, the goal is to maximize the number of samples analyzed, while sequencing beyond saturation increases the coverage of events and improves confidence in the observations, though at greater cost per event covered.
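A saturation analysis is typically performed by subsampling the library at increasing depths and asking how many enriched regions each depth recovers; the curve flattens once saturation is reached. The sketch below uses the same simple windowed-threshold notion of an enriched region as above, with arbitrary window size, threshold and sampling fractions.

```python
import random

def saturation_curve(read_positions, window_size=200, threshold=5,
                     fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Subsample the library at increasing depths and count how many windows
    reach the enrichment threshold at each depth."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = read_positions[:]
    rng.shuffle(shuffled)
    curve = []
    for f in fractions:
        sample = shuffled[:int(f * len(shuffled))]  # prefix = nested subsamples
        counts = {}
        for pos in sample:
            w = pos // window_size
            counts[w] = counts.get(w, 0) + 1
        curve.append((f, sum(1 for c in counts.values() if c >= threshold)))
    return curve
```

Because each deeper sample is a superset of the shallower ones, the number of regions discovered is non-decreasing along the curve.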
There are a number of stand-alone and web-based options available for visualization of aligned or segmented epigenomic data sets (for review see [74]). The most mature and widely used are the genome browsers maintained by the University of California Santa Cruz [75] and Ensembl [76]. These ‘first generation’ genome browsers enable visualization of genome-wide data sets as linear tracks presented in the context of genome annotations. While extremely powerful for manual genome ‘browsing’ and for focused visualization on a gene-by-gene basis, linear browsers become unwieldy when large numbers of individual tracks are visualized at once. In addition, these tools do not provide the capacity for larger scale integrative analysis of epigenomic data sets. While a number of informatic platforms designed for global, genome-wide analysis are currently in development, and a few early versions have been published [77], the majority of genome-wide analyses of next generation sequencing based epigenomic data sets require custom scripting capabilities.
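Getting a read density map into one of these browsers usually means exporting it in a simple text track format such as bedGraph. Assuming per-window counts like those produced above, a minimal exporter might look as follows; the chromosome name, window size and track name are illustrative.

```python
def write_bedgraph(counts, chrom, window_size, path, track_name="coverage"):
    """Write per-window read counts as a bedGraph track that UCSC-style
    genome browsers can display; coordinates are 0-based, half-open."""
    with open(path, "w") as out:
        out.write(f'track type=bedGraph name="{track_name}"\n')
        for window in sorted(counts):
            start = window * window_size
            out.write(f"{chrom}\t{start}\t{start + window_size}\t{counts[window]}\n")
```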