Genome-wide association studies (GWAS) are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could either be used as a “black box” to promote changes in lifestyle and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction typically rank single nucleotide polymorphisms (SNPs) by the p-value of their association with the disease, and use the top-associated SNPs as input to a classification algorithm. However, the predictive power of such methods is relatively poor. To improve the predictive power, we devised BootRank, which uses bootstrapping to obtain a robust prioritization of SNPs for use in predictive models. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data and results in a more robust set of SNPs and a larger number of enriched pathways being associated with the different diseases. Finally, we show that combining BootRank with seven different classification algorithms improves performance compared to previous studies that used the WTCCC data. Notably, diseases for which BootRank results in the largest improvements were recently shown to have more heritability than previously thought, likely due to contributions from variants with low minor allele frequency (MAF), suggesting that BootRank can be beneficial in cases where SNPs affecting the disease are poorly tagged or have low MAF. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
Genome-wide association studies are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could either be used as a “black box” to promote changes in lifestyle and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction have relatively poor performance, with one possible explanation being that they rely on a noisy ranking of the genetic variants given to them as input. To improve the predictive power, we devised BootRank, a ranking method less sensitive to noise. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data, and that combining BootRank with different classification algorithms improves performance compared to previous studies that used these data. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
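The bootstrap ranking idea described in the BootRank abstract can be illustrated with a minimal sketch. This is not the published BootRank implementation: the association score (difference in mean allele count between cases and controls), the aggregation by mean rank, and the names `association_score` and `bootstrap_rank` are all assumptions chosen for illustration. The sketch resamples individuals with replacement, ranks SNPs within each resample, and orders SNPs by their average rank.

```python
import random


def association_score(genotypes, labels, snp):
    """Hypothetical score: absolute difference in mean allele count
    between cases (label 1) and controls (label 0) at one SNP."""
    cases = [g[snp] for g, y in zip(genotypes, labels) if y == 1]
    controls = [g[snp] for g, y in zip(genotypes, labels) if y == 0]
    if not cases or not controls:
        return 0.0  # a resample with a single class carries no signal
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def bootstrap_rank(genotypes, labels, n_boot=100, seed=0):
    """Return SNP indices ordered by mean rank over bootstrap resamples.

    genotypes: list of per-individual lists of allele counts (0, 1 or 2)
    labels:    list of 0/1 disease labels, one per individual
    """
    rng = random.Random(seed)
    n_ind, n_snps = len(genotypes), len(genotypes[0])
    rank_sums = [0] * n_snps
    for _ in range(n_boot):
        # Resample individuals with replacement.
        idx = [rng.randrange(n_ind) for _ in range(n_ind)]
        g = [genotypes[i] for i in idx]
        y = [labels[i] for i in idx]
        scores = [association_score(g, y, j) for j in range(n_snps)]
        # Rank 0 is the most strongly associated SNP in this resample.
        order = sorted(range(n_snps), key=lambda j: -scores[j])
        for rank, j in enumerate(order):
            rank_sums[j] += rank
    mean_rank = [s / n_boot for s in rank_sums]
    return sorted(range(n_snps), key=lambda j: mean_rank[j])


# Toy data: SNP 0 tracks the label exactly, SNP 1 is constant, SNP 2 is unrelated.
labels = [i % 2 for i in range(20)]
genotypes = [[2 * labels[i], 1, i % 3] for i in range(20)]
ranking = bootstrap_rank(genotypes, labels, n_boot=50)
```

In a pipeline of the kind the abstract describes, the top entries of `ranking` would then be passed to a classifier in place of SNPs chosen from a single p-value ranking.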
Nucleosome positioning is critical for gene expression and most DNA-related processes. Here, we review the dominant patterns of nucleosome positioning that have been observed, and summarize current understanding of their underlying determinants. The genome-wide pattern of nucleosome positioning is determined by the combination of DNA sequence, ATP-dependent nucleosome remodeling enzymes, and transcription factors including activators, components of the preinitiation complex, and elongating RNA polymerase II. These determinants influence each other such that the resulting nucleosome positioning patterns are likely to differ among genes and among cells within a population, with consequent effects on gene expression.
The core promoter is the region in which RNA polymerase II is recruited to the DNA and acts to initiate transcription, but the extent to which the core promoter sequence determines promoter activity levels is largely unknown. Here, we identified several base content and k-mer sequence features of the yeast core promoter sequence that are highly predictive of maximal promoter activity. These features are mainly located in the region 75 bp upstream and 50 bp downstream of the main transcription start site, and their associations hold for both constitutively active promoters and promoters that are induced or repressed in specific conditions. Our results unravel several architectural features of yeast core promoters and suggest that the yeast core promoter sequence downstream of the TATA box (or of similar sequences involved in recruitment of the pre-initiation complex) is a major determinant of maximal promoter activity. We further show that human core promoters also contain features that are indicative of maximal promoter activity; thus, our results emphasize the important role of the core promoter sequence in transcriptional regulation.
A single transcription factor can activate or repress expression by three different mechanisms: one that increases cell-to-cell variability in target gene expression (noise) and two that decrease noise.
The ability of cells to accurately control gene expression levels in response to extracellular cues is limited by the inherently stochastic nature of transcriptional regulation. A change in transcription factor (TF) activity results in changes in the expression of its targets, but the way in which cell-to-cell variability in expression (noise) changes as a function of TF activity, and whether targets of the same TF behave similarly, is not known. Here, we measure expression and noise as a function of TF activity for 16 native targets of the transcription factor Zap1 that are regulated by it through diverse mechanisms. For most activated and repressed Zap1 targets, noise decreases as expression increases. Kinetic modeling suggests that this is due to two distinct Zap1-mediated mechanisms that both change the frequency of transcriptional bursts. Notably, we found that another mechanism of repression by Zap1, which is encoded in the promoter DNA, likely decreases the size of transcriptional bursts, producing a unique transcriptional state characterized by low expression and low noise. In addition, we find that further reduction in noise is achieved when a single TF both activates and represses a single target gene. Our results suggest a global principle whereby at low TF concentrations, the dominant source of differences in expression between promoters stems from differences in burst frequency, whereas at high TF concentrations differences in burst size dominate. Taken together, we show that the precise amount by which noise changes with expression is specific to the regulatory mechanism of transcription and translation that acts at each gene.
In response to environmental changes, cells regulate the activity of transcription factors (TFs), which in turn change the expression of dozens of downstream target genes by binding to their promoters. The response of each target gene is determined by the interplay between TF concentration and the context in which TF binding sites occur in each target promoter. To examine the relationship between promoter sequence, mechanism of regulation, and response to TF activity, we measured expression of 16 target genes of a single TF in response to changes in TF concentration in single cells. We found that different native promoters that are all targets of the same TF exhibit diverse responses to changing TF levels in terms of both gene expression level and cell-to-cell variability (noise) in expression. Using computational modeling and mutations of specific promoter elements, we show that the molecular mechanisms of regulation can be inferred by measuring how noise changes with expression. These results show that a single TF can regulate transcription through multiple mechanisms, resulting in similar changes in mean expression but vastly different changes in cell-to-cell variability.
Many genetic variants that are significantly correlated with gene expression changes across human individuals have been identified, but the ability of these variants to predict expression of unseen individuals has rarely been evaluated. Here, we devise an algorithm that, given training expression and genotype data for a set of individuals, predicts the expression of genes of unseen test individuals given only their genotype in the local genomic vicinity of the predicted gene. Notably, the resulting predictions are remarkably robust in that they agree well between the training and test sets, even when the training and test sets consist of individuals from distinct populations. Thus, although the overall number of genes that can be predicted is relatively small, as expected from our choice to ignore effects such as environmental factors and trans sequence variation, the robust nature of the predictions means that the identity and quantitative degree to which genes can be predicted are known in advance. We also present an extension that incorporates heterogeneous types of genomic annotations to differentially weight the importance of the various genetic variants, and we show that assigning higher weights to variants with particular annotations such as proximity to genes and high regional G/C content can further improve the predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted from their cis genetic variation.
Variation in gene expression across different individuals has been found to play a role in susceptibility to different diseases. In addition, many genetic variants that are linked to changes in expression have been found to date. However, their joint ability to accurately predict these changes is not well understood and has rarely been evaluated. Here, we devise a method that uses multiple genetic variants to explain the variation in expression of genes across individuals. One important aspect of our method is its robustness, in that our predictions agree well between training and test sets. Thus, although the number of genes that could be explained is relatively small, the identity and quantitative degree to which genes can be predicted are known in advance. We also present an extension to our method that integrates different genomic annotations, such as the location of the genetic variant or its context, to differentially weight the genetic variants in our model and improve predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted by our method.
A full understanding of gene regulation requires an understanding of the contributions that the various regulatory regions make to gene expression. Although it is well established that sequences downstream of the main promoter can affect expression, our understanding of the scale of this effect and how it is encoded in the DNA is limited. Here, to measure the effect of native S. cerevisiae 3′ end sequences on expression, we constructed a library of 85 fluorescent reporter strains that differ only in their 3′ end region. Notably, despite being driven by the same strong promoter, our library spans a continuous twelve-fold range of expression values. These measurements correlate with endogenous mRNA levels, suggesting that the 3′ end contributes to constitutive differences in mRNA levels. We used deep sequencing to map the 3′UTR ends of our strains and show that determination of polyadenylation sites is intrinsic to the local 3′ end sequence. Following polyadenylation mapping with sequence analysis, we found that increased A/T content upstream of the main polyadenylation site correlates with higher expression, both in the library and genome-wide, suggesting that native genes differ by the encoded efficiency of 3′ end processing. Finally, we use single-cell fluorescence measurements at different promoter activation levels to show that 3′ end sequences modulate protein expression dynamics differently than promoters, by predominantly affecting the size of protein production bursts as opposed to the frequency at which these bursts occur. Altogether, our results lead to a more complete understanding of gene regulation by demonstrating that 3′ end regions have a unique and sequence-dependent effect on gene expression.
A basic question in gene expression is the relative contribution of different regulatory layers and genomic regions to the differences in protein levels. In this work we concentrated on the effect of 3′ end sequences. For this, we constructed a library of yeast strains that differ only by a native 3′ end region integrated downstream of a reporter gene driven by a constant inducible promoter. Thus we could attribute all differences in reporter expression between the strains to the different 3′ end sequences. Interestingly, we found that despite being driven by the same strong, inducible promoter, our library spanned a wide and continuous range of expression levels of more than twelve-fold. As these measurements represent the sole effect of the 3′ end region, we quantify the contribution of these sequences to the variance in mRNA levels by comparing our measurements to endogenous mRNA levels. We follow with sequence analysis to find a simple sequence signature that correlates with expression. In addition, single-cell analysis reveals distinct noise dynamics of 3′ end mediated differences in expression compared to different levels of promoter activation, leading to a more complete understanding of gene expression which also incorporates the effect of these regions.
Precise patterns of spatial and temporal gene expression are central to metazoan complexity and act as a driving force for embryonic development. While there has been substantial progress in dissecting and predicting cis-regulatory activity, our understanding of how information from multiple enhancer elements converges to regulate a gene's expression remains limited. This is in large part due to the number of different biological processes involved in mediating regulation as well as limited availability of experimental measurements for many of them. Here, we used a Bayesian approach to model diverse experimental regulatory data, leading to accurate predictions of both spatial and temporal aspects of gene expression. We integrated whole-embryo information on transcription factor recruitment to multiple cis-regulatory modules, insulator binding and histone modification status in the vicinity of individual gene loci, at a genome-wide scale during Drosophila development. The model uses Bayesian networks to represent the relation between transcription factor occupancy and enhancer activity in specific tissues and stages. All parameters are optimized in an Expectation Maximization procedure providing a model capable of predicting tissue- and stage-specific activity of new, previously unassayed genes. Performing the optimization with subsets of input data demonstrated that neither enhancer occupancy nor chromatin state alone can explain all gene expression patterns, but taken together allow for accurate predictions of spatio-temporal activity. Model predictions were validated using the expression patterns of more than 600 genes recently made available by the BDGP consortium, demonstrating an average 15-fold enrichment of genes expressed in the predicted tissue over a naïve model. We further validated the model by experimentally testing the expression of 20 predicted target genes of unknown expression, resulting in an accuracy of 95% for temporal predictions and 50% for spatial. While this is, to our knowledge, the first genome-wide approach to predict tissue-specific gene expression in metazoan development, our results suggest that integrative models of this type will become more prevalent in the future.
Development is a complex process in which a single cell gives rise to a multi-cellular organism comprised of diverse cell types and well-organized tissues. This transformation requires tightly coordinated expression, both spatially and temporally, of hundreds to thousands of genes specific to any given tissue. To orchestrate these patterns, gene expression is regulated at multiple steps, from TF binding to cis-regulatory modules, general transcription factor and RNA polymerase II recruitment to promoters, chromatin remodeling, and three-dimensional looping interactions. Despite this level of complexity, the regulation of gene expression is typically modeled in the context of transcription factor binding and a single enhancer's activity as this is where the majority of experimental data is available. Recent advances in the measurement of chromatin modifications and insulator binding during embryogenesis provide new datasets that can be used for modeling gene expression. Here we use a Bayesian approach to integrate all three levels of information to combine the activity of multiple regulatory elements into a single model of a gene's expression, implementing an expectation maximization strategy to overcome the problem of missing data. Importantly, while the data for histone modifications and insulator binding represents merged signals from all cells in the embryo, the model can extract cell type specific and stage-specific predictions on gene expression for hundreds of genes of unknown expression.
Despite much research, our understanding of the rules by which cis-regulatory sequences are translated into expression levels is still lacking. We devised a method for obtaining parallel and highly accurate expression measurements of thousands of fully designed promoters, and applied it to measure the effect of systematic changes to location, number, orientation, affinity and organization of transcription factor (TF) binding sites and of nucleosome disfavoring sequences. Our analyses reveal a clear relationship between expression and binding site number, and TF-specific dependencies of expression on the distance between sites and gene starts including a striking ~10bp periodic relationship. We also demonstrate the utility of our approach for measuring TF sequence specificities and sensitivity of TF sites to surrounding sequence context, and for profiling the activity of most yeast transcription factors. Our method is readily applicable for studying both the cis and trans effects of genotype on transcriptional, post-transcriptional, and translational control.
Nucleosomes are important for gene regulation because their arrangement on the genome can control which proteins bind to DNA. Currently, few human nucleosomes are thought to be consistently positioned across cells; however, this has been difficult to assess due to the limited resolution of existing data. We performed paired-end sequencing of micrococcal nuclease-digested chromatin (MNase–seq) from seven lymphoblastoid cell lines and mapped over 3.6 billion MNase–seq fragments to the human genome to create the highest-resolution map of nucleosome occupancy to date in a human cell type. In contrast to previous results, we find that most nucleosomes have more consistent positioning than expected by chance and a substantial fraction (8.7%) of nucleosomes have moderate to strong positioning. In aggregate, nucleosome sequences have 10 bp periodic patterns in dinucleotide frequency and DNase I sensitivity; and, across cells, nucleosomes frequently have translational offsets that are multiples of 10 bp. We estimate that almost half of the genome contains regularly spaced arrays of nucleosomes, which are enriched in active chromatin domains. Single nucleotide polymorphisms that reduce DNase I sensitivity can disrupt the phasing of nucleosome arrays, which indicates that they often result from positioning against a barrier formed by other proteins. However, nucleosome arrays can also be created by DNA sequence alone. The most striking example is an array of over 400 nucleosomes on chromosome 12 that is created by tandem repetition of sequences with strong positioning properties. In summary, a large fraction of nucleosomes are consistently positioned—in some regions because they adopt favored sequence positions, and in other regions because they are forced into specific arrangements by chromatin remodeling or DNA binding proteins.
Within the nucleus of the cell, the genome of eukaryotic organisms is tightly packaged into chromatin. Chromatin is composed of a repeating series of bead-like nucleosomes, each of which is encircled 1.7 times by a string of DNA. The organization of nucleosomes on the genome is fundamentally important because they can prevent other proteins from accessing the DNA. Previous studies of human nucleosomes concluded that most nucleosomes have fuzzy positioning and tend to occupy different locations in different cells. This interpretation, however, may be a consequence of the low resolution of existing data. Here we revisit the question of nucleosome positioning by generating the most precise map of nucleosome positions that has ever been created for a human cell line. We find that 8.7% of nucleosomes have very consistent positioning, and most nucleosomes are more consistently positioned than expected by chance. Additionally, we estimate that almost half of the genome contains regularly spaced arrays of nucleosomes. Much of this positioning is due to the intrinsic preference of nucleosomes for some DNA sequences over others; but in some regions of the genome, the sequence preferences of nucleosomes are overridden by proteins that out-compete them for binding or displace them using energy from ATP.
Fundamental aspects of embryonic and post-natal development, including maintenance of the mammalian female germline, are largely unknown. Here we employ a retrospective, phylogenetic-based method for reconstructing cell lineage trees utilizing somatic mutations accumulated in microsatellites, to study female germline dynamics in mice. Reconstructed cell lineage trees can be used to estimate lineage relationships between different cell types, as well as cell depth (number of cell divisions since the zygote). We show that, in the reconstructed mouse cell lineage trees, oocytes form clusters that are separate from hematopoietic and mesenchymal stem cells, both in young and old mice, indicating that these populations belong to distinct lineages. Furthermore, while cumulus cells sampled from different ovarian follicles are distinctly clustered on the reconstructed trees, oocytes from the left and right ovaries are not, suggesting a mixing of their progenitor pools. We also observed an increase in oocyte depth with mouse age, which can be explained either by depth-guided selection of oocytes for ovulation or by post-natal renewal. Overall, our study sheds light on substantial novel aspects of female germline preservation and development.
Many aspects of mammalian female germline development during embryogenesis and throughout adulthood are either unknown or under debate. In this study we applied a novel method for the reconstruction of cell lineage trees utilizing microsatellite mutations, accumulated during mouse life, in oocytes and other cells, sampled from young and old mice. Analysis of the reconstructed cell lineage trees shows that oocytes are clustered separately from bone-marrow derived cells, that oocytes from different ovaries share common progenitors, and that oocyte depth (number of cell divisions since the zygote) increases significantly with mouse age.
We recently reported the identification and characterization of DNA replication origins (Oris) in metazoan cell lines. Here, we describe additional bioinformatic analyses showing that the previously identified GC-rich sequence elements form origin G-rich repeated elements (OGREs) that are present in 67% to 90% of the DNA replication origins from Drosophila to human cells, respectively. Our analyses also show that initiation of DNA synthesis takes place precisely at 160 bp (Drosophila) and 280 bp (mouse) from the OGRE. We also found that in most CpG islands, an OGRE is positioned in opposite orientation on each of the two DNA strands and detected two sites of initiation of DNA synthesis upstream or downstream of each OGRE. Conversely, Oris not associated with CpG islands have a single initiation site. OGRE density along chromosomes correlated with previously published replication timing data. Ori sequences centered on the OGRE are also predicted to have high intrinsic nucleosome occupancy. Finally, OGREs predict G-quadruplex structures at Oris that might be structural elements controlling the choice or activation of replication origins.
DNA replication origins; DNA synthesis; G-quadruplex; nucleosome; CpG islands; transcription
We propose definitions and procedures for comparing nucleosome maps and discuss current agreement and disagreement on the effect of histone sequence preferences on nucleosome organization in vivo.
Sequence changes in the coding and regulatory regions of the gene itself (cis) determine most gene expression divergence between closely related species. However, gene expression divergence between yeast species is not correlated with evolution of the primary nucleotide sequence. This indicates that other factors in cis direct gene expression divergence. Here, we studied the contribution of the evolution of DNA three-dimensional structure, as a cis factor, to gene expression divergence. We found that the evolution of DNA structure in coding regions and gene expression divergence are correlated in yeast. A similar result was also observed between Drosophila species. DNA structure is associated with the binding of chromatin remodelers and histone modifiers to DNA sequences in coding regions, which influence RNA polymerase II occupancy, which in turn controls gene expression level. We also found that genes with similar DNA structures are involved in the same biological process and function. These results reveal previously unappreciated roles of DNA structure as a cis effect on gene expression.
The unique phenotype of each organism is partly determined by gene expression. Changes in gene expression are an important source of phenotypic variation, and can be caused by changes in regulatory and coding sequences of the gene itself (cis) and changes in regulatory factors (trans). The contribution of cis regulation to gene expression divergence between closely related species is much greater than that of trans regulation. However, evolution of primary nucleotide sequences is not correlated with gene expression divergence in yeast, suggesting that other factors in cis drive gene expression divergence. Here, we found that evolution of DNA structure in coding regions is coupled to gene expression divergence in yeast. We also found that DNA structure is associated with specific gene characteristics. Genes with similar DNA structures are involved in the same biological process and function. These results demonstrate the important roles of DNA structure in directing gene expression.
Evolution maintains organismal fitness by preserving genomic information. This is widely assumed to involve conservation of specific genomic loci among species. Many genomic encodings are now recognized to integrate small contributions from multiple genomic positions into quantitative dispersed codes, but the evolutionary dynamics of such codes are still poorly understood. Here we show that in yeast, sequences that quantitatively affect nucleosome occupancy evolve under compensatory dynamics that maintain heterogeneous levels of A+T content through spatially coupled A/T-losing and A/T-gaining substitutions. Evolutionary modeling combined with data on yeast polymorphisms supports the idea that these substitution dynamics are a consequence of weak selection. This shows that compensatory evolution, so far believed to affect specific groups of epistatically linked loci like paired RNA bases, is a widespread phenomenon in the yeast genome, affecting the majority of intergenic sequences in it. The model thus derived suggests that compensation is inevitable when evolution conserves quantitative and dispersed genomic functions.
Purifying selection is a major force in conserving genomic features. It pushes deleterious mutations to extinction while conserving the specific DNA sequence. Here we show that a large proportion of the yeast genome evolves under compensatory dynamics that conserve genomic properties while modifying the genomic sequence. Such compensatory evolution conserves the local G+C content of the genome, which influences nucleosome organization. Since purifying selection is too weak to eliminate every weakly deleterious mutation in nucleosome bound or unbound sequences, the local G+C content is frequently stabilized by compensatory G+C gaining and G+C losing mutations in proximal loci. Theoretical analysis shows that compensatory evolution is inevitable when natural selection is weak and the genomic feature is distributed over many loci. These results imply that sequence conservation may not always be equated with overall selection. They demonstrate that cycles of weakly deleterious substitutions followed by positive selection for corrective mutations, which were so far studied mostly in RNA coding genes, are observed broadly and profoundly affect genome evolution.
Long intergenic noncoding RNAs (lincRNAs) regulate chromatin states and epigenetic inheritance. Here we show that the lincRNA HOTAIR serves as a scaffold for at least two distinct histone modification complexes. A 5′ domain of HOTAIR binds Polycomb Repressive Complex 2 (PRC2) while a 3′ domain of HOTAIR binds the LSD1/CoREST/REST complex. The ability to tether two distinct complexes enables RNA-mediated assembly of PRC2 and LSD1, and coordinates targeting of PRC2 and LSD1 to chromatin for coupled histone H3 lysine 27 methylation and lysine 4 demethylation. Our results suggest that lincRNAs may serve as scaffolds by providing binding surfaces to assemble select histone modification enzymes, and thereby specify the pattern of histone modifications on target genes.
The positions of nucleosomes in eukaryotic genomes determine which parts of the DNA sequence are readily accessible for regulatory proteins and which are not. Genome-wide maps of nucleosome positions have revealed a salient pattern around transcription start sites, involving a nucleosome-free region (NFR) flanked by a pronounced periodic pattern in the average nucleosome density. While the periodic pattern clearly reflects well-positioned nucleosomes, the positioning mechanism is less clear. A recent experimental study by Mavrich et al. argued that the pattern observed in Saccharomyces cerevisiae is qualitatively consistent with a “barrier nucleosome model,” in which the oscillatory pattern is created by the statistical positioning mechanism of Kornberg and Stryer. On the other hand, there is clear evidence for intrinsic sequence preferences of nucleosomes, and it is unclear to what extent these sequence preferences affect the observed pattern. To test the barrier nucleosome model, we quantitatively analyze yeast nucleosome positioning data both up- and downstream from NFRs. Our analysis is based on the Tonks model of statistical physics which quantifies the interplay between the excluded-volume interaction of nucleosomes and their positional entropy. We find that although the typical patterns on the two sides of the NFR are different, they are both quantitatively described by the same physical model with the same parameters, but different boundary conditions. The inferred boundary conditions suggest that the first nucleosome downstream from the NFR (the +1 nucleosome) is typically directly positioned while the first nucleosome upstream is statistically positioned via a nucleosome-repelling DNA region. These boundary conditions, which can be locally encoded into the genome sequence, significantly shape the statistical distribution of nucleosomes over a range of up to ∼1,000 bp to each side.
Within the last five years, knowledge about nucleosome organization on the genome has grown dramatically. To a large extent, this has been achieved by an increasing number of experimental studies determining nucleosome positions at high resolution over entire genomes. Particular attention has been paid to promoter regions, where a canonical pattern has been established: a nucleosome-free region with pronounced adjacent oscillations in the nucleosome density. Here we tested to what extent this pattern may be quantitatively described by a minimal physical model, a one-dimensional gas of impenetrable particles, commonly referred to as the “Tonks gas.” In this model, density oscillations occur close to a boundary at dense packing. Our systematic quantitative analysis reveals that, in an average over many promoters, a Tonks gas model can indeed account for the nucleosome organization to both sides of the nucleosome-free region, if one allows for different boundary conditions at the two edges. On the downstream side, a single nucleosome is typically directly positioned such that it forms an obstacle for the neighboring nucleosomes, while such a barrier nucleosome is typically missing on the upstream side.
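As a rough illustration of the Tonks gas picture described above, the following sketch evaluates the textbook density profile of hard rods next to a fixed barrier particle. It is not the analysis pipeline of the study. The rod length of 147 bp, the mean density of one nucleosome per 165 bp, and the function name `tonks_density` are illustrative assumptions rather than fitted values.

```python
import math


def tonks_density(x, rod=147.0, mean_density=1.0 / 165.0):
    """Density of rod centers at distance x (bp) from a fixed barrier rod.

    Uses the standard hard-rod result: the k-th neighbor sits at k*rod plus
    a sum of k exponential gaps, so the density is a sum of shifted gamma
    densities. Parameter defaults are illustrative assumptions.
    """
    packing = mean_density * rod
    if not 0.0 < packing < 1.0:
        raise ValueError("mean_density * rod must lie strictly between 0 and 1")
    # Inverse of the mean free gap between neighboring rods.
    rate = mean_density / (1.0 - packing)
    total = 0.0
    k = 1
    while k * rod <= x:
        gap = x - k * rod
        if gap == 0.0:
            # Only the nearest neighbor contributes exactly at contact.
            term = rate if k == 1 else 0.0
        else:
            # Gamma(k, rate) density evaluated in log space for stability.
            log_term = (k * math.log(rate) + (k - 1) * math.log(gap)
                        - rate * gap - math.lgamma(k))
            term = math.exp(log_term)
        total += term
        k += 1
    return total


if __name__ == "__main__":
    for x in range(0, 1001, 50):
        print(x, round(tonks_density(float(x)), 5))
```

In this sketch the density is zero closer than one rod length to the barrier, peaks at contact, and oscillates with decaying amplitude toward the bulk value far from the barrier, which is the qualitative pattern the text attributes to a directly positioned barrier nucleosome.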
Active eukaryotic regulatory sites are characterized by open chromatin, and yeast promoters and transcription factor binding sites (TFBSs) typically have low intrinsic nucleosome occupancy. Here, we show that in contrast to yeast, DNA at human promoters, enhancers, and TFBSs generally encodes high intrinsic nucleosome occupancy. In most cases we examined, these elements also have high experimentally measured nucleosome occupancy in vivo. These regions typically have high G+C content, which correlates positively with intrinsic nucleosome occupancy, and are depleted for nucleosome-excluding poly-A sequences. We propose that high nucleosome preference is directly encoded at regulatory sequences in the human genome to restrict access to regulatory information that will ultimately be utilized in only a subset of differentiated cells.
Homopolymeric stretches of deoxyadenosine nucleotides (A’s) on one strand of double-stranded DNA, referred to as poly(dA:dT) tracts or A-tracts, are overabundant in eukaryotic genomes. They have unusual structural, dynamic, and mechanical properties, and may resist sharp bending. Such unusual material properties, together with their overabundance in eukaryotes, raised the possibility that poly(dA:dT) tracts might function in eukaryotes to influence the organization of nucleosomes at many genomic regions. Recent genome-wide studies strongly confirm these ideas and suggest that these tracts play major roles in chromatin organization and genome function. Here we review what is known about poly(dA:dT) tracts and how they work.
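To make the two sequence features discussed above concrete, the sketch below computes G+C content and locates poly(dA:dT) tracts in a single-strand sequence. A tract appears on the given strand either as a run of A's or as a run of T's (A's on the opposite strand). The minimum tract length of 5 and the function names are assumptions chosen for illustration, not thresholds taken from the studies.

```python
import re


def gc_fraction(seq):
    """Fraction of G or C bases in seq (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def find_a_tracts(seq, min_len=5):
    """Return (start, end, base) for each run of A's or T's >= min_len.

    Coordinates are 0-based with an exclusive end. min_len is an
    illustrative assumption.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()[0])
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    example = "GG" + "A" * 6 + "CC" + "T" * 7 + "G"
    print(gc_fraction(example))
    print(find_a_tracts(example))
```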
The DNA of eukaryotic genomes is wrapped in nucleosomes, which strongly distort and occlude the DNA from access to most DNA-binding proteins. An understanding of the mechanisms that control nucleosome positioning along the DNA is thus essential to understanding the binding and action of proteins that carry out essential genetic functions. New genome-wide data on in vivo and in vitro nucleosome positioning greatly advance our understanding of several factors that can influence nucleosome positioning, including DNA sequence preferences, DNA methylation, histone variants and post-translational modifications, higher order chromatin structure, and the actions of transcription factors, chromatin remodelers and other DNA-binding proteins. We discuss how these factors function and ways in which they might be integrated into a unified framework that accounts for both the preservation of nucleosome positioning and the dynamic nucleosome repositioning that occur across biological conditions, cell types, developmental processes and disease.
Genome-wide maps of nucleosome occupancy in yeast have recently been produced through deep sequencing of nuclease-protected DNA. These maps have been obtained from both crosslinked and uncrosslinked chromatin in vivo, and from chromatin assembled from genomic DNA and nucleosomes in vitro. Here, we analyze these maps in combination with existing ChIP-chip data, and with new ChIP-qPCR experiments reported here. We show that the apparent nucleosome density in crosslinked chromatin, when compared to uncrosslinked chromatin, is preferentially increased at transcription factor (TF) binding sites, suggesting a strategy for mapping generic transcription factor binding sites that would not require immunoprecipitation of a particular factor. We also confirm previous conclusions that the intrinsic, sequence-dependent binding of nucleosomes helps determine the localization of TF binding sites. However, we find that the association between low nucleosome occupancy and TF binding is typically greater if occupancy at a site is averaged over a 600 bp window, rather than using the occupancy at the binding site itself. We have also incorporated intrinsic nucleosome binding occupancies as weights in a computational model for TF binding, and by this measure as well we find better prediction if the high-resolution nucleosome occupancy data are averaged over 600 bp. We suggest that the intrinsic DNA binding specificity of nucleosomes plays a role in TF binding site selection not so much through the specification of precise nucleosome positions that permit or occlude binding, but rather through the creation of low-occupancy regions that can accommodate competition from TFs through rearrangement of nucleosomes.
Genomic DNA is largely covered by proteins that compete with one another for binding to regulatory sequences. Most of these proteins are in the form of nucleosomes. How nucleosomes come to occupy particular sites and thereby compete with sequence-specific transcription factors is a central problem in developing a systems-level understanding of gene regulation. Here, we performed a series of computational analyses using high-resolution nucleosome position data that have recently become available in yeast, thanks to advances in DNA sequencing technology. Analysis of these data, combined with data on the location and occupancy of transcription factors genome-wide, shows that the precise location of nucleosomes as determined by nucleosome sequence specificity is often less important to transcription factor binding than the broader, regional occupancy of nucleosomes that is encoded in genomic DNA. This result has implications for the evolution of DNA regulatory elements.
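The contrast between occupancy at the binding site itself and occupancy averaged over a surrounding window can be sketched as follows. This is a minimal illustration assuming a per-base-pair occupancy track stored as a Python list. The function name `window_mean` and the choice to clip the window at the ends of the track are assumptions, not details from the study.

```python
def window_mean(occupancy, center, width=600):
    """Mean occupancy in a window of `width` bp centered on `center`.

    The window covers positions [center - width // 2, center + width // 2)
    and is clipped at the ends of the track (an illustrative choice).
    """
    if not 0 <= center < len(occupancy):
        raise IndexError("center lies outside the occupancy track")
    half = width // 2
    start = max(0, center - half)
    end = min(len(occupancy), center + half)
    values = occupancy[start:end]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy track: a narrow occupied site inside a broadly depleted region.
    track = [1.0] * 700 + [0.2] * 290 + [0.9] * 20 + [0.2] * 290 + [1.0] * 700
    site = 1000
    print("occupancy at site:", track[site])
    print("600 bp average:", window_mean(track, site))
```

In the toy track the site itself looks occupied, while the 600 bp average reveals the surrounding low-occupancy region, which is the kind of regional signal the text argues matters more for TF binding.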
Transcriptional regulation in human cells is a complex process involving a multitude of regulatory elements encoded by the genome. Recent studies have shown that distinct chromatin signatures mark a variety of functional genomic elements and that subtle variations of these signatures mark elements with different functions. To identify novel chromatin signatures in the human genome, we apply a de novo pattern-finding algorithm to genome-wide maps of histone modifications. We recover previously known chromatin signatures associated with promoters and enhancers. We also observe several chromatin signatures with strong enrichment of H3K36me3 marking exons. Closer examination reveals that H3K36me3 is found on well-positioned nucleosomes at exon 5′ ends, and that this modification is a global mark of exon expression that also correlates with alternative splicing. Additionally, we observe strong enrichment of H2BK5me1 and H4K20me1 at highly expressed exons near the 5′ end, in contrast to the opposite distribution of H3K36me3-marked exons. Finally, we also recover frequently occurring chromatin signatures displaying enrichment of repressive histone modifications. These signatures mark distinct repeat sequences and are associated with distinct modes of gene repression. Together, these results highlight the rich information embedded in the human epigenome and underscore its value in studying gene regulation.
Recent studies have observed that histone tails can be modified in a variety of ways. Analyzing a collection of 21 histone modifications, we attempted to determine what common signatures are associated with different classes of regulatory elements and whether they mark places of distinct function. Indeed, at promoters, we identified a number of distinct signatures, each associated with a different class of expressed and functional genes. We also observed several unexpected signatures marking exons that directly correlate with the expression of exons. Finally, we recovered many places marked by two distinct repressive modifications, and showed that they mark distinct populations of repetitive elements associated with distinct modes of gene repression. Together, these results highlight the rich information embedded in the human epigenome and underscore its value in studying gene regulation.
We have generated and made publicly available two very large networks of molecular interactions: 49,493 mouse-specific and 52,518 human-specific interactions. These networks were generated through automated analysis of 368,331 full-text research articles and 8,039,972 article abstracts from the PubMed database, using the GeneWays system. Our networks cover a wide spectrum of molecular interactions, such as bind, phosphorylate, glycosylate, and activate; 207 of these interaction types occur more than 1,000 times in our unfiltered, multi-species data set. Because mouse and human genes are linked through an orthological relationship, human and mouse networks are amenable to straightforward, joint computational analysis. Using our newly generated networks and known associations between mouse genes and cerebellar malformation phenotypes, we predicted a number of new associations between genes and five cerebellar phenotypes (small cerebellum, absent cerebellum, cerebellar degeneration, abnormal foliation, and abnormal vermis). Using a battery of statistical tests, we showed that genes that are associated with cerebellar phenotypes tend to form compact network clusters. Further, we observed that cerebellar malformation phenotypes tend to be associated with highly connected genes. This tendency was stronger for developmental phenotypes and weaker for cerebellar degeneration.
We described and made publicly available the largest existing set of text-mined statements; we also presented its application to an important biological problem. We have extracted and purified two large molecular networks, one for human and one for mouse. We characterized the data sets, described the methods we used to generate them, and presented a novel biological application of the networks to study the etiology of five cerebellum phenotypes. We demonstrated quantitatively that genes implicated in development-related malformations differ in their system-level properties from degeneration-related genes. We showed that there is a high degree of overlap among the genes implicated in the developmental malformations, that these genes have a strong tendency to be highly connected within the molecular network, and that they also tend to be clustered together, forming a compact molecular network neighborhood. In contrast, the genes involved in malformations due to degeneration do not have a high degree of connectivity, are not strongly clustered in the network, and do not overlap significantly with the development-related genes. In addition, taking into account the above-mentioned system-level properties and the gene-specific network interactions, we made highly confident predictions about novel genes that are likely also involved in the etiology of the analyzed phenotypes.
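The two network properties discussed above, high connectivity and compact clustering of phenotype-associated genes, can be illustrated on a toy undirected network. This sketch does not reproduce the statistical tests used in the study. The gene identifiers, the edge list, and the function names are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Degree of each gene in an undirected network given as (a, b) pairs.

    Duplicate edges and self-loops are ignored.
    """
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    counts = Counter()
    for pair in unique:
        for gene in pair:
            counts[gene] += 1
    return counts


def mean_degree(degree, genes):
    """Average degree over a set of genes (0 for genes with no edges)."""
    genes = list(genes)
    return sum(degree.get(g, 0) for g in genes) / len(genes)


def edges_within(edges, genes):
    """Number of distinct edges with both endpoints inside the gene set."""
    genes = set(genes)
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return sum(pair <= genes for pair in unique)


if __name__ == "__main__":
    # Hypothetical toy network and phenotype gene set.
    toy_edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"),
                 ("g2", "g3"), ("g5", "g6")]
    phenotype_genes = {"g1", "g2", "g3"}
    deg = degrees(toy_edges)
    others = set(deg) - phenotype_genes
    print("mean degree, phenotype genes:", mean_degree(deg, phenotype_genes))
    print("mean degree, other genes:", mean_degree(deg, others))
    print("edges within phenotype set:", edges_within(toy_edges, phenotype_genes))
```

A real analysis would compare these quantities against a null distribution, for example from randomly sampled gene sets of the same size, before drawing any conclusion.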
Nucleosome organization is critical for gene regulation [1]. In living cells this organization is determined by multiple factors, including the action of chromatin remodellers [2], competition with site-specific DNA-binding proteins [3], and the DNA sequence preferences of the nucleosomes themselves [4-8]. However, it has been difficult to estimate the relative importance of each of these mechanisms in vivo [7,9-11], because in vivo nucleosome maps reflect the combined action of all influencing factors. Here we determine the importance of nucleosome DNA sequence preferences experimentally by measuring the genome-wide occupancy of nucleosomes assembled on purified yeast genomic DNA. The resulting map, in which nucleosome occupancy is governed only by the intrinsic sequence preferences of nucleosomes, is similar to in vivo nucleosome maps generated in three different growth conditions. In vitro, nucleosome depletion is evident at many transcription factor binding sites and around gene start and end sites, indicating that nucleosome depletion at these sites in vivo is partly encoded in the genome. We confirm these results with a micrococcal nuclease-independent experiment that measures the relative affinity of nucleosomes for ∼40,000 double-stranded 150-base-pair oligonucleotides. Using our in vitro data, we devise a computational model of nucleosome sequence preferences that is significantly correlated with in vivo nucleosome occupancy in Caenorhabditis elegans. Our results indicate that the intrinsic DNA sequence preferences of nucleosomes have a central role in determining the organization of nucleosomes in vivo.
Eukaryotic transcription occurs within a chromatin environment, whose organization plays an important regulatory role and is partly encoded in cis by the DNA sequence itself [1-6]. Here, we examine whether evolutionary changes in gene expression are linked to changes in the DNA-encoded nucleosome organization of promoters. We find that in aerobic yeast species, where cellular respiration genes are active under typical growth conditions, the promoter sequences of these genes encode a relatively open (nucleosome-depleted) chromatin organization. This nucleosome-depleted organization requires only DNA sequence information, is independent of any co-factors and of transcription, and is a general property of growth-related genes. In contrast, in anaerobic yeast species, where cellular respiration genes are inactive under typical growth conditions, respiration gene promoters encode relatively closed (nucleosome-occupied) chromatin organizations. Thus, our results suggest a previously unidentified genetic mechanism underlying phenotypic diversity, consisting of DNA sequence changes that directly alter the DNA-encoded nucleosome organization of promoters.