Proteins that interact directly with DNA can play a key role in determining the structure of chromatin. Many of these factors affect chromatin properties through the recruitment of chromatin-modifying factors such as DNA methyltransferases and histone-modifying proteins, or through the steric displacement of nucleosomes. Other factors, such as polycomb and ATP-dependent chromatin remodeling complexes, can directly alter the chromatin structure when interacting with cis-elements or histone marks. The large number of possible factors and the large number of possible interacting sites emphasize the need for higher-throughput methods of analysis.
Chromatin immunoprecipitation (ChIP) has been an important technique used to identify sites of trans-factor interaction [20-22]. ChIP involves cross-linking proteins to DNA and creating a DNA library enriched for sequences bound by a particular protein of interest using a specific antibody to that protein. This technique can be combined with microarrays (ChIP-chip) or high-throughput sequencing (ChIP-seq) to identify specific protein-DNA interaction sites in a limited number of genomic regions or genome-wide at approximately 50-100-bp resolution. This limitation in the resolution of ChIP is inherent in the procedure of fragmenting DNA and enriching for fragments with the bound factor. These fragments are of varying size, typically a couple of hundred bases in length, and do not accurately represent the typical 6–20-bp interaction site of
trans-acting proteins. Nevertheless, a large number of factors have been assayed to date across many organisms, including RNA polymerase II (PolII) [
23], STAT1 [
24], CCCTC-binding factor (CTCF) [
23,
25], GABP [
26], SRF [
26], neuron-restrictive silencer factor (NRSF) [
26,
27], FoxA2 [
28] and FoxA3 [
29] in human cell lines. Kim
et al. provided comprehensive maps of CTCF binding, a factor known to act as an insulator influencing both chromatin structure and gene regulation, in human primary fibroblasts [
25]. Most sites were found in regions far from transcription start sites of genes, although their distribution correlated well with locations of genes. Preliminary analysis in 1% of the genome for multiple cell types suggested CTCF binding is not highly variable across different cells. Johnson
et al. mapped approximately 2000 binding sites of neuron-restrictive silencer factor (NRSF) genome-wide [27]. As expected, they demonstrated that this factor regulates genes involved in neuronal function and development. Surprisingly, though, they also found enrichment in genes that drive islet cell development in the pancreas.
Comprehensive identification of transcription factor binding sites has been, and still is, an active area in bioinformatics. Extending this type of research to study the interaction of transcription factor regulatory networks plays a major role in the field of systems biology. There are many tools to determine preferred binding sequences, or motifs, for a specific factor [
30-
32]. However, some proteins may not directly interact with DNA or may bind nonspecifically, nullifying the utility of motif identification. In addition, computational techniques that apply these motifs to the entire genome are plagued by numerous false positives owing to such problems as unknown chromatin accessibility of specific regions, unknown requirements for additional factors for binding, and inadequate information content in the motif. The development of tools that combine ChIP information with motifs has the potential to draw on the strengths of both approaches, allowing more accurate prediction of exact binding sites.
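The point about inadequate information content can be made concrete with a rough calculation. The sketch below is only an illustration under simplifying assumptions that are not part of the studies cited above: bases are treated as independent and equiprobable, the motif is reduced to a single exact sequence, and the human genome size is taken as a round 3 billion bp. The function name `expected_chance_matches` is hypothetical.

```python
def expected_chance_matches(genome_size_bp: int, site_length_bp: int) -> float:
    """Expected number of exact matches to one fixed sequence, counting both strands.

    Assumes independent, equiprobable bases (a simplification; real genomes
    are neither), so each position matches with probability 1 / 4**k.
    """
    positions = genome_size_bp - site_length_bp + 1
    return 2 * positions / 4 ** site_length_bp


# Assumed round figure for the human genome; not taken from the text.
GENOME_BP = 3_000_000_000

# Site lengths spanning the 6-20-bp range quoted for trans-acting proteins.
for k in (6, 10, 20):
    print(f"{k}-bp site: ~{expected_chance_matches(GENOME_BP, k):.3g} chance matches")
```

Under these assumptions a 6-bp site is expected to occur by chance more than a million times, whereas a fully specified 20-bp site is not expected to occur by chance at all, which is why short or degenerate motifs alone predict binding sites poorly.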
Histone tail modifications & histone variants
Regulation of biological processes can also be directly affected by modifications to the core histone proteins that comprise the nucleosome. Each nucleosome consists of an octamer composed of two each of histones H2A, H2B, H3 and H4, making histones the most abundant protein component of chromatin. Histone variants, such as H2A.Z and CENPA, can replace one of the normal core histones and are involved in key cellular processes such as transcription, repair and replication [
33]. Post-translational modifications to the histone tails have been shown to alter the structure of chromatin [
34]. Modifications include mono-, di- and tri-methylation, acetylation, ubiquitination and phosphorylation of specific amino acids in histone tails. The list of these modifications is growing, and elucidating these is a major focus of the Epigenomics Roadmap initiative. Different histone modifications have been associated with many aspects of the genome, including transcriptional silencing, transcriptional activation, active transcriptional units, enhancers, DNA repair and other genomic features. For a full review, see [
35].
Antibodies to histone variants and specific histone tail modifications have been developed, enabling the use of ChIP experiments to identify genomic locations of specific histones and histone modifications. As the histone is part of a larger nucleosome, the resolution of the positioning of the histone need not be at the single-base level. Unlike the ChIP experiments described above, fragments targeted by antibodies can be isolated by cleaving DNA in the linker regions between nucleosomes with micrococcal nuclease (MNase) or through sonication. The resulting locations of the modifications should indicate enrichment for a modification at a nucleosome-level resolution. However, these experiments cannot determine whether both copies or just one copy of a particular histone within a nucleosome has been modified. In addition, as mentioned below, some issues limit resolution, such as imprecise nucleosome positioning and the large number of sequence reads required in organisms with larger genomes.
Recently, many groups have performed ChIP-chip [1, 36] and ChIP-seq [23, 37, 38] to identify locations of histone methylations, acetylations and a limited number of variants. These have provided high-resolution maps across entire genomes for many common modifications and variants, allowing for the further study of their relationship to cellular function. The sheer number of known possible modifications and variants under all conditions has limited the comprehensiveness of these studies to modifications whose functions are better understood, but the roles of even these are more complex than previously appreciated.
Mikkelsen
et al. mapped the locations of five modifications genome-wide in mouse pluripotent and lineage-committed cells [
39]. They found that trimethylation of lysines 4 and 27 on histone H3 (H3K4me3 and H3K27me3) could effectively distinguish expressed genes (H3K4me3 present), stably silent genes (H3K27me3 present) and those poised for expression (both marks present). In addition, another mark, H3K36me3, appears throughout actively transcribed coding and noncoding regions, allowing accurate gene annotation in a cell-type-specific manner. They also demonstrated an additional benefit of sequencing-based experiments by using heterozygous polymorphic sites to identify allele-specific transcription. Barski
et al. annotated locations of 20 histone lysine and arginine methylations and the histone variant H2A.Z in human T cells [
23]. They found that monomethylations of H3K27, H3K9, H4K20, H3K79 and H2BK5 are associated with actively transcribed genes, while trimethylations of H3K27, H3K9 and H3K79 are linked to silent genes. They also performed ChIP-seq for CTCF in these cells, and showed that CTCF is found at the edges of various methylation domains.
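The H3K4me3/H3K27me3 rule reported by Mikkelsen et al. amounts to a small decision table, sketched below. This is a simplified illustration, not the authors' analysis: it assumes each mark has already been called as present or absent at a gene, and the function name `classify_gene` and the "unclassified" label for genes carrying neither mark are assumptions, since the text does not describe that case.

```python
def classify_gene(h3k4me3: bool, h3k27me3: bool) -> str:
    """Map presence/absence of two H3 trimethylation marks to a coarse state.

    Follows the three cases summarized in the text; the fourth case
    (neither mark) is not described there and is labeled as such.
    """
    if h3k4me3 and h3k27me3:
        return "poised"          # both marks present
    if h3k4me3:
        return "expressed"       # H3K4me3 only
    if h3k27me3:
        return "stably silent"   # H3K27me3 only
    return "unclassified"        # assumption: no state given in the text


# Hypothetical mark calls for three illustrative genes.
calls = {"geneA": (True, False), "geneB": (True, True), "geneC": (False, True)}
for gene, (k4, k27) in calls.items():
    print(gene, classify_gene(k4, k27))
```

In practice the presence of a mark is a thresholded call on enrichment data, so the boundaries between these states are less sharp than a Boolean table suggests.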
Nucleosome positioning & open chromatin
The combination of the histones within the nucleosome octamer and the genomic DNA wrapped around it allows for the steric regulation of transcriptional activity. Each nucleosome interacts with approximately 146 bp of DNA, rendering these bases essentially inaccessible to many factors. Preventing access by these factors can allow transcriptional modulation through both proximal regulation (nucleosomes blocking the promoter region) and distal regulation (nucleosomes blocking enhancer elements).
Precisely how nucleosomes are positioned in a particular cell type is currently unclear. In general, it is thought that nucleosomes interact with DNA as a default state, and thus displacement of the nucleosomes is required for access by other factors. The act of displacement can be through direct factor interaction with its preferred binding site [
40], mediated by an ATP-dependent complex such as switch/sucrose nonfermentable (SWI/SNF) [41-43], or through apparent acetylation prior to transcription [44]. It is generally found that at the promoter region of actively transcribed genes, nucleosomes are completely removed. The same is not true in the body of these genes. It has been hypothesized that the RNA polymerase complex does not completely displace nucleosomes during transcription, somehow retaining and/or reinserting them once the polymerase has passed. This is supported by a study by Dion
et al. in yeast where it was demonstrated that nucleosomes in the gene body of actively transcribed genes were not actively replaced with specially labeled nucleosomes that had been added [
45]. Nucleosomes are also either removed or displaced to allow the binding of regulatory proteins. Therefore, regulatory elements in general can be identified by mapping the locations of nucleosomes or detecting where they are absent.
Several studies have demonstrated that certain DNA sequence patterns, such as the oscillation frequencies of particular dinucleotides, influence nucleosome positioning. Computational models based on these sequence characteristics can generate predictions in yeast that are highly correlated with
in vivo nucleosome positions [
46,
47]. While these models seem to demonstrate some statistical accuracy, others postulate that these sequence patterns are primarily found when nucleosomes need to be precisely positioned and that other nucleosomes are placed through statistical packing [
48]. For a review of nucleosome positioning, see [
49]. In general, these models have not been able to provide accurate mappings genome-wide and are limited as they cannot show cell-type-specific changes in chromatin.
Positions of nucleosomes can now be identified using MNase, which has been shown to efficiently digest the linker regions between two nucleosomes. High-resolution genome-wide nucleosome maps can be generated by extracting nondigested DNA and employing tiled microarrays or sequencing. This was successfully carried out first in yeast, where nucleosome maps have now been created under various conditions and determined using both sequencing and array methods [
47,
50–
52]. Subsequently, genome-wide maps have been generated for C. elegans [53], Drosophila melanogaster [54] and humans [55]. While the tiling arrays generally provide lower resolution annotations than sequencing, hidden Markov models have been used to generate maps with as high as 10-bp resolution [
56]. These studies have demonstrated that there are both well-positioned nucleosomes and nucleosomes whose exact positions seem to vary across the cell population. The promoter regions of genes tend to have well-positioned nucleosomes that are phased with respect to each other.
While a sequencing approach to determine nucleosome positions in species with relatively small genomes such as yeast is feasible, equivalent sequencing depth in larger mammalian genomes requires significantly more work. In humans, it has been demonstrated that as few as 100 million short reads provide an accurate map of well-positioned nucleosomes such as those found at transcription start sites [
55]. However, generating nucleosome positioning maps in humans at a coverage equivalent to that achieved in yeast would likely require over 10 billion sequence reads, assuming similar genomic nucleosome occupancy.
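The two read counts quoted above can be put on a common scale by expressing them as average reads per nucleosome. The sketch below is a back-of-the-envelope estimate only. It assumes a human genome of about 3.1 billion bp, a nucleosome repeat length of about 200 bp (the roughly 146-bp core plus linker), a fully nucleosome-occupied genome and uniformly mappable reads; none of these figures come from the cited studies, and the function name is hypothetical.

```python
def reads_per_nucleosome(total_reads: float,
                         genome_size_bp: float,
                         repeat_length_bp: float = 200.0) -> float:
    """Average number of reads per nucleosome for a given sequencing depth.

    repeat_length_bp is an assumed core-plus-linker spacing; the whole
    genome is treated as nucleosome-occupied and uniformly mappable.
    """
    nucleosomes = genome_size_bp / repeat_length_bp
    return total_reads / nucleosomes


# Assumed approximate human genome size; not taken from the text.
HUMAN_GENOME_BP = 3_100_000_000

for reads in (100_000_000, 10_000_000_000):
    depth = reads_per_nucleosome(reads, HUMAN_GENOME_BP)
    print(f"{reads:.0e} reads -> ~{depth:.1f} reads per nucleosome")
```

Under these assumptions, 100 million reads correspond to only a handful of reads per nucleosome, which is consistent with resolving well-positioned nucleosomes but not variably positioned ones, while 10 billion reads correspond to several hundred.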
In contrast to identifying the locations of nucleosomes, some researchers are interested in identifying regions that are nucleosome free, also referred to as open chromatin. DNA nuclease I (DNaseI) has been shown to preferentially digest DNA in nucleosome-depleted regions. These DNaseI hypersensitive (HS) sites have been used to annotate promoters, enhancers, silencers, insulators and locus control regions [
57]. Genome-wide assays that comprehensively identify DNaseI HS sites have recently been developed using tiling microarrays and high-throughput sequencing [
58–
64]. An additional method, formaldehyde-assisted isolation of regulatory elements (FAIRE), has also been shown to identify open chromatin regions in a completely different way. FAIRE is a rather straightforward experiment that isolates DNA not cross-linked by formaldehyde to bound proteins, primarily nucleosomes, and then determines the locations of these protein-free regions using tiled microarrays or sequencing [
65,
66]. FAIRE has been shown to be highly associated with DNaseI HS sites and other chromatin marks. In humans, approximately 2% of the genome is nucleosome-free in a given cell type; therefore, identifying nucleosome-depleted regions requires significantly less sequencing than when determining positions of all nucleosomes.
Identifying nucleosome-free regions provides clues as to the location of active regulatory elements, but this does not reveal the function of these elements, nor what factors may be bound. It has been previously shown that DNaseI experiments can identify precise locations of individual transcription factor-binding sites, referred to as DNaseI footprinting. This utilizes the fact that
trans-factors also protect the DNA from digestion similar to nucleosomes, but at a much smaller scale. Recently, it has been demonstrated in yeast that the high-throughput DNaseI sequencing protocol can perform whole-genome DNaseI footprinting with single base pair resolution [
67]. These footprints can be compared with known factor-binding motifs to predict the particular protein interacting within that segment of DNA, potentially providing an idea of the function of the putative regulatory element. As there are many factors whose binding motif is unknown or where binding is nonspecific, combining DNaseI footprints with ChIP data for specific transcription factors can provide the precise positioning of a factor's binding site, revealing more precisely the DNA binding characteristics. Alternatively, Kang
et al. have proposed a protocol in which ChIP is performed prior to footprinting [
68].
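The comparison of footprints with known binding motifs described above is, at its simplest, an interval-containment problem. The sketch below shows one way it might be set up; it is not the protocol of any study cited here. Coordinates are half-open intervals on a single chromosome, the example data are invented, and `motif_hits_in_footprints` is a hypothetical helper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def motif_hits_in_footprints(footprints: List[Interval],
                             motif_hits: List[Interval]) -> List[Interval]:
    """Return the motif hits that lie entirely inside at least one footprint."""
    kept = []
    for m_start, m_end in motif_hits:
        if any(f_start <= m_start and m_end <= f_end
               for f_start, f_end in footprints):
            kept.append((m_start, m_end))
    return kept


# Invented example coordinates, for illustration only.
footprints = [(100, 120), (300, 315)]
motif_hits = [(105, 113), (118, 126), (302, 310), (500, 508)]
print(motif_hits_in_footprints(footprints, motif_hits))
```

The nested scan is quadratic in the number of intervals; for genome-scale data a sorted sweep or an interval tree would be the more practical choice, and ChIP peaks could be added as a third interval set in the same way.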
Nuclear localization of chromosomes
While the aforementioned experimental assays map chromatin along the strands of DNA (essentially 1D data), the true structure of chromatin resides in a 3D world with chromosome loops and folds. Interactions between distal regions of chromatin can explain how enhancers many kilobases away from a gene, and even on a different chromosome, can affect the expression of that gene. Within the nucleus there are compartments of regulation that are associated with expression or repression of genes. In mammals, active gene regulation tends to take place away from the nuclear envelope, while repressed regions tend to be sequestered to the nuclear envelope, although there are numerous exceptions to these tendencies. In
Saccharomyces cerevisiae, these trends seem to be the opposite [
69]. A better understanding of this higher-order structure of chromosomes will greatly enhance our understanding of data generated from experiments described above and gene regulation as a whole.
Fluorescence
in situ hybridization assays have been the standard for mapping chromosomal locations for many years. While these experiments have led to important discoveries, they are restricted in the number of regions that can be examined simultaneously and have limited resolution. More recent techniques, such as chromosome conformation capture (3C [
70]), can reveal characteristics of chromosomal positioning at high-resolution on a much larger scale. The 3C protocol first cross-links interacting segments of chromatin and then identifies the genomic locations of these interactions. This technology has rapidly progressed from initially being limited to assaying only one predetermined pair of sites (3C [
70]), to revealing all interactions for one specific site (4C [
71,
72]), and can now reveal all interactions for all sites within a specific genomic region (5C [
73]). This 5C approach relies on microarrays or high-throughput sequencing to map these interactions. The accuracy and utility of this method were demonstrated by mapping all interactions in a 400-kb region encompassing the human β-globin locus, which clearly showed strong links between the locus control region (LCR) and globin genes that are separated by 10–60 kb [73] (see review at [74] for further details).
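The progression from 3C to 4C to 5C can be summarized by how many pairs of sites each design interrogates. The sketch below is a counting illustration under an idealized assumption: `n_sites` candidate sites are considered (genome-wide for 4C, within the chosen region for 5C), and every pair is assayable, which ignores primer design and mappability constraints. The function name `pairs_assayed` is hypothetical.

```python
def pairs_assayed(method: str, n_sites: int) -> int:
    """Idealized number of site pairs interrogated by each design.

    3C: one predetermined pair.
    4C: one fixed site against every other candidate site.
    5C: every pair among the candidate sites in a region (an upper bound).
    """
    if method == "3C":
        return 1
    if method == "4C":
        return n_sites - 1
    if method == "5C":
        return n_sites * (n_sites - 1) // 2
    raise ValueError(f"unknown method: {method}")


# Hypothetical number of candidate sites, for illustration only.
for method in ("3C", "4C", "5C"):
    print(method, pairs_assayed(method, 100))
```

The quadratic growth for the all-against-all case is the reason such designs depend on microarrays or high-throughput sequencing for readout.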
High-resolution technologies
As shown above, microarray and sequencing technologies are actively being employed for creating high-resolution mappings of chromatin structure. Each of these technologies has benefits and concerns associated with it. The choice of the appropriate technology for a given research study depends on many factors, including number of samples, cost, amount of genome to be assayed, desired resolution and availability of informatics pipelines. Below, we briefly discuss the major strengths and weaknesses of each in their current form.
Microarrays represent a relatively mature technology that has well-developed analysis protocols. Important issues such as normalizing results with an appropriate experimental control have been extensively studied. Custom microarrays can be designed to query a subset of a genome, making them cost-effective for multi-sample analyses. For small genomes, high-density tiling arrays have been or can be designed, providing high-resolution results. However, for larger genomes this is less practical and generally results in lower resolution due to probe spacing. Probe design and cross-hybridization concerns are better understood, but these have not been completely resolved and can still result in experimental artifacts. The repetitive nature of many genomes affects what regions can be assayed and leads to uneven spacing of probes in many regions.
High-throughput sequencing of short sequence tags is a relatively new technology that theoretically allows for whole-genome experimental coverage of any organism with a reference genome sequence assembly. For many of the experiments described above, sequencing technologies can produce results with near single-base resolution. Higher genomic coverage can be achieved compared with microarrays owing to the ability to map tags to short unique sequences within largely repetitive regions and to regions that are polymorphic or duplicated. In addition, single-allele phenomena can be described given informative heterozygous SNPs and sufficient sequencing depth. However, as with any new technology, much more work is needed to understand how best to process these data and to identify experimental artifacts. Analysis tools are still relatively immature. While information is generated from the whole genome, short sequence tags cannot be confidently mapped in many instances, and it is not clear how to use sequences that align to multiple genomic locations. Polymorphic sites and single-base sequencing errors are more problematic owing to the short length of many sequence tags. The appropriate design and use of experimental controls are not well understood.
Despite their limitations, both microarray and sequencing technologies have been demonstrated to produce very high quality and accurate data when used in many different experimental settings. For the basic identification of genomic regions of interest, one technology has not been clearly shown to be superior. The trend towards the use of sequencing technologies seems to be motivated by the ability to produce higher resolution results, and the promise of increased information from having actual sequence data.