Metagenomic sequencing has increased our understanding of the role of the microbiome in health and disease, yet it only provides a snapshot of a highly dynamic ecosystem. Here, we show that the pattern of metagenomic sequencing read coverage for different microbial genomes contains a single trough and a single peak, the latter coinciding with the bacterial origin of replication. Furthermore, the ratio of sequencing coverage between the peak and trough provides a quantitative measure of a species' growth rate. We demonstrate this in vitro and in vivo, under different growth conditions, and in complex bacterial communities. For several bacterial species, peak-to-trough coverage ratios, but not relative abundances, correlated with the manifestation of inflammatory bowel disease and type II diabetes.
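As a rough illustration of the peak-to-trough idea, the sketch below computes the ratio from binned read coverage along a circular genome. The binning, the median smoothing window, and the function names `smooth_circular` and `peak_to_trough_ratio` are assumptions made for this example and are not taken from the published pipeline.

```python
from statistics import median


def smooth_circular(coverage, window):
    """Median-smooth per-bin coverage, wrapping around the circular genome."""
    n = len(coverage)
    half = window // 2
    return [
        median(coverage[(i + k) % n] for k in range(-half, half + 1))
        for i in range(n)
    ]


def peak_to_trough_ratio(coverage, window=5):
    """Return (ratio, peak_bin, trough_bin) for a list of per-bin coverage."""
    if len(coverage) < window:
        raise ValueError("need at least `window` coverage bins")
    smoothed = smooth_circular(coverage, window)
    bins = range(len(smoothed))
    peak_bin = max(bins, key=smoothed.__getitem__)
    trough_bin = min(bins, key=smoothed.__getitem__)
    if smoothed[trough_bin] <= 0:
        raise ValueError("trough coverage must be positive")
    return smoothed[peak_bin] / smoothed[trough_bin], peak_bin, trough_bin


# Toy data (hypothetical): coverage falls linearly from the origin (bin 0)
# to the terminus (bin 10) and rises again on the other replication arm.
toy = [200 - 10 * min(i, 20 - i) for i in range(20)]
ratio, peak, trough = peak_to_trough_ratio(toy, window=3)
print(f"ratio={ratio:.2f} peak_bin={peak} trough_bin={trough}")
```

Smoothing before taking the extremes keeps a single noisy bin from setting the ratio; a real analysis would also need to handle mappability and uneven sequencing depth, which this sketch ignores.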
Integrative analysis of multiple data types to address complex biomedical questions requires the use of multiple software tools in concert and remains an enormous challenge for most of the biomedical research community. Here we introduce GenomeSpace (http://www.genomespace.org), a cloud-based, cooperative community resource. Seeded as a collaboration of six of the most popular genomics analysis tools, GenomeSpace now supports the streamlined interaction of 20 bioinformatics tools and data resources. To help non-programming users leverage GenomeSpace in integrative analysis, it offers a growing set of ‘recipes’: short workflows involving a few tools and steps that guide investigators through high-utility analysis tasks.
Summary: Understanding the effect of single nucleotide polymorphisms (SNPs) on the expression level of genes is an important goal. We recently published a study in which we devised a multi-SNP predictive model for gene expression in Lymphoblastoid cell lines (LCL), and showed that it can robustly predict the expression of a small number of genes in test individuals. Here, we validate the generality of our models by predicting expression profiles for genes in LCL in an independent study, and extend the pool of predictable genes for which we are able to explain more than 25% of their expression variability to 232 genes across 14 different cell types. As the number of people who obtained their SNP profiles through companies such as 23andMe is rising rapidly, we developed GenoExp, a web-based tool in which users can upload their individual SNP data and obtain predicted expression levels for the set of predictable genes across the 14 different cell types. Our tool thus allows users with biological knowledge to study the possible effects that their set of SNPs might have on these genes and predict their cell-specific expression levels relative to the population average.
Availability and implementation:
GenoExp is freely available at http://genie.weizmann.ac.il/pubs/GenoExp/.
Supplementary data are available at Bioinformatics online.
Life on Earth is dictated by circadian changes in the environment, caused by the planet's rotation around its own axis. All forms of life have evolved clock systems to adapt their physiology to the daily variations in geophysical parameters. The intestinal microbiome serves as a signaling hub in the communication between the host and its environment. We recently discovered that the microbiota undergoes diurnal oscillations in composition and function, and that these oscillations are required for metabolic homeostasis of the host. Here, we highlight these findings from the perspectives of microbial system stability and meta-organismal metabolic health. We also discuss the contribution of nutrition and biotic interventions to diurnal processes of the microbiota and their potential involvement in diseases commonly associated with circadian disruption.
diurnal oscillations; food; metabolic disease; microbiota
Non-caloric artificial sweeteners (NAS) are common food supplements consumed by millions worldwide as a means of combating weight gain and diabetes, by retaining sweet taste without increasing caloric intake. While they are considered safe, there is increasing controversy regarding their potential ability to promote metabolic derangements in some humans. We recently demonstrated that NAS consumption could induce glucose intolerance in mice and distinct human subsets, by functionally altering the gut microbiome. In this commentary, we discuss these findings in the context of previous and recent works demonstrating the effects of NAS on host health and the microbiome, and the challenges and open questions that need to be addressed in understanding the effects of NAS consumption on human health.
artificial sweeteners; diabetes; glucose intolerance; microbiome
A report on the first EMBO conference entitled “Next Gen Immunology—From Host Genome to the Microbiome: Immunity in the Genomic Era”, held at the Weizmann Institute of Science, Israel, 14–16 February, 2016.
Eukaryotes employ combinatorial strategies to generate a variety of expression patterns from a relatively small set of regulatory DNA elements. As in any other language, deciphering the mapping between DNA and expression requires an understanding of the set of rules that govern basic principles in transcriptional regulation, the functional elements involved, and the ways in which they combine to orchestrate a transcriptional output. Here, we review current understanding of various grammatical rules, including the effect on expression of the number of transcription factor binding-sites, their location, orientation, affinity and activity; co-association with different factors; and intrinsic nucleosome organization. We review different methods that are used to study the grammar of transcription regulation, highlight gaps in current understanding, and discuss how recent technological advances may be utilized to bridge them.
Transcriptional regulation; gene expression; transcription factor; binding site; nucleosome
The 3’ end genomic region encodes a wide range of regulatory processes, including mRNA stability, 3’ end processing and translation. Here, we systematically investigate the sequence determinants of 3’ end mediated expression control by measuring the effect of 13,000 designed 3’ end sequence variants on constitutive expression levels in yeast. By including a high resolution scanning mutagenesis of more than 200 native 3’ end sequences in this designed set, we found that most mutations had only a mild effect on expression, and that the vast majority (~90%) of mutations with a strong effect localized to a single positive TA-rich element, similar to a previously described 3’ end processing efficiency element, and resulted in an up to ten-fold decrease in expression. Measurements of 3’ UTR lengths revealed that these mutations result in mRNAs with aberrantly long 3’ UTRs, confirming the role for this element in 3’ end processing. Interestingly, we found that other sequence elements that were previously described in the literature to be part of the polyadenylation signal had a minor effect on expression. We further characterize the sequence specificities of the TA-rich element using additional synthetic 3’ end sequences and show that its activity is sensitive to single base pair mutations and strongly depends on the A/T content of the surrounding sequences. Finally, using a computational model, we show that the strength of this element in native 3’ end sequences can explain some of their measured expression variability (R = 0.41). Together, our results emphasize the importance of efficient 3’ end processing for endogenous protein levels and contribute to an improved understanding of the sequence elements involved in this process.
We present a large-scale experimental investigation into sequence determinants of 3’ end mediated gene expression regulation, by measuring 13,000 designed 3’ end sequences. While 3’ end sequences contribute to expression differences through a variety of mechanisms including mRNA stability and regulation of translation, we find a predominant effect of mRNA 3’ end processing efficiency. Using extensive designed mutagenesis analysis we find that out of three functional elements described in the literature as comprising the polyadenylation signal, a single element (known as the efficiency element) is responsible for most of the effect on protein expression levels. Our work highlights the importance of 3’ end processing in expression regulation and facilitates the incorporation of the effect of this region into more complete models of DNA encoded gene expression regulation.
In parallel to the genetic code for protein synthesis, a second layer of information is embedded in all RNA transcripts in the form of RNA structure. RNA structure influences practically every step in the gene expression program1. Yet the nature of most RNA structures or effects of sequence variation on structure are not known. Here we report the initial landscape and variation of RNA secondary structures (RSS) in a human family Trio, providing a comprehensive RSS map of human coding and noncoding RNAs. We identify unique RSS signatures that demarcate open reading frames and splicing junctions and that define authentic microRNA binding sites. Comparison of native deproteinized RNA isolated from cells versus refolded purified RNA suggests that the majority of the RSS information is encoded within RNA sequence. Over 1900 transcribed single nucleotide variants (~15% of all transcribed SNVs) alter local RNA structure. We discover simple sequence and spacing rules that determine the ability of point mutations to impact RSS. Selective depletion of RiboSNitches versus structurally synonymous variants at precise locations suggests selection for specific RNA shapes at thousands of sites, including 3’UTRs, binding sites of miRNAs and RNA binding proteins genome-wide. These results highlight the potentially broad contribution of RNA structure and its variation to gene regulation.
RNA structure is critical for gene regulation and function. In the past, transcriptomes have been largely parsed by primary sequences and expression levels, but it is now becoming feasible to annotate and compare transcriptomes based on RNA structure. In addition to computational prediction methods, the recent advent of experimental techniques to probe RNA structure by deep sequencing has enabled genome-wide measurements of RNA structure, and provided the first picture of the structural organization of a eukaryotic transcriptome, the “RNA structurome”. With additional advances in method refinement and interpretation, structural views of the transcriptome should help to identify and validate regulatory RNA motifs that are involved in diverse cellular processes, and thereby increase understanding of RNA function.
The structures of RNA molecules are often important for their function and regulation1-6, yet there are no experimental techniques for genome-scale measurement of RNA structure. Here, we describe a novel strategy termed Parallel Analysis of RNA Structure (PARS), which is based on deep sequencing fragments of RNAs that were treated with structure-specific enzymes, thus providing simultaneous in-vitro profiling of the secondary structure of thousands of RNA species at single nucleotide resolution. We apply PARS to profile the secondary structure of the mRNAs of the budding yeast S. cerevisiae and obtain structural profiles for over 3000 distinct transcripts. Analysis of these profiles reveals several RNA structural properties of yeast transcripts, including the existence of more secondary structure over coding regions compared to untranslated regions, a three-nucleotide periodicity of secondary structure across coding regions, and a relationship between the efficiency with which an mRNA is translated and the lack of structure over its translation start site. PARS is readily applicable to other organisms and to profiling RNA structure in diverse conditions, thus enabling studies of the dynamics of secondary structure at a genomic scale.
A new study exploits the time-dependence of formaldehyde cross-linking in the commonly used chromatin immunoprecipitation (ChIP) assay to infer the on and off rates for site-specific chromatin interactions.
Libraries of S. cerevisiae and E. coli promoter reporters measured under different conditions reveal scaling relationships between expression profiles across conditions and suggest that most changes in activity are due to global effects.
Between any two conditions, the activity of most promoters changes by a constant global scaling factor that depends only on the conditions and not on the promoter's identity. The value of the global scaling factor between any two conditions corresponds to the change in growth rate and magnitude of the condition-specific response. When specific groups of genes are activated, they also tend to change according to scaling factors, changing the degree to which the entire group is activated, while preserving the ratios between genes within the group. Altogether, a handful of scaling factors are sufficient for quantitatively describing genome-wide expression profiles across conditions.
Most genes change expression levels across conditions, but it is unclear which of these changes represents specific regulation and what determines their quantitative degree. Here, we accurately measured activities of ∼900 S. cerevisiae and ∼1800 E. coli promoters using fluorescent reporters. We show that in both organisms 60–90% of promoters change their expression between conditions by a constant global scaling factor that depends only on the conditions and not on the promoter's identity. Quantifying such global effects allows precise characterization of specific regulation, that is, of promoters deviating from the global scale line. These are organized into a few functionally related groups that also adhere to scale lines and preserve their relative activities across conditions. Thus, a handful of scaling factors suffice to accurately describe genome-wide expression profiles across conditions. We present a parameter-free passive resource allocation model that quantitatively accounts for the global scaling factors. It suggests that many changes in expression across conditions result from global effects and not specific regulation, and provides means for quantitative interpretation of expression profiles.
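The notion of a global scaling factor can be made concrete with a small sketch. Here the factor between two conditions is estimated as the median ratio of promoter activities, and promoters whose ratio departs from it by more than a tolerance are flagged as candidates for specific regulation. The median estimator, the `log2_tolerance` threshold, and the function names are assumptions chosen for illustration; the study's own fitting procedure is not described in this abstract.

```python
from math import log2
from statistics import median


def global_scaling_factor(activity_a, activity_b):
    """Median ratio of activities (condition B over A) across promoters."""
    ratios = [b / a for a, b in zip(activity_a, activity_b) if a > 0 and b > 0]
    if not ratios:
        raise ValueError("no promoters with positive activity in both conditions")
    return median(ratios)


def specifically_regulated(activity_a, activity_b, log2_tolerance=0.5):
    """Return (scale, indices of promoters that deviate from the scale line)."""
    scale = global_scaling_factor(activity_a, activity_b)
    flagged = [
        idx
        for idx, (a, b) in enumerate(zip(activity_a, activity_b))
        if a > 0 and b > 0 and abs(log2(b / (scale * a))) > log2_tolerance
    ]
    return scale, flagged


# Toy data (hypothetical): four promoters halve their activity between
# conditions, while the last one is induced on top of the global change.
cond_a = [10.0, 20.0, 40.0, 80.0, 16.0]
cond_b = [5.0, 10.0, 20.0, 40.0, 64.0]
scale, flagged = specifically_regulated(cond_a, cond_b)
print(scale, flagged)
```

Using the median rather than a least-squares slope keeps the estimate from being pulled by the minority of specifically regulated promoters, which is the population the scale line is meant to expose.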
gene expression; growth rate; modeling; promoter activity; transcription regulation
RNA structural transitions are important in the function and regulation of RNAs. Here, we reveal a layer of transcriptome organization in the form of RNA folding energies. By probing yeast RNA structures at different temperatures, we obtained relative melting temperatures (Tm) for RNA structures in over 4000 transcripts. Specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames, and highlighted numerous candidate regulatory RNA motifs in 3′ untranslated regions. RNA Tm distinguished non-coding from coding RNAs and identified mRNAs with distinct cellular functions. We identified thousands of putative RNA thermometers, and their presence is predictive of the pattern of RNA decay in vivo during heat shock. The exosome complex recognizes unpaired bases during heat shock to degrade these RNAs, coupling intrinsic structural stabilities to gene regulation. Thus, genome-wide structural dynamics of RNA can parse functional elements of the transcriptome and reveal diverse biological insights.
Genome-wide association studies (GWAS) are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could either be used as a “black box” in order to promote changes in life-style and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction typically rank single nucleotide polymorphisms (SNPs) by the p-value of their association with the disease, and use the top-associated SNPs as input to a classification algorithm. However, the predictive power of such methods is relatively poor. To improve the predictive power, we devised BootRank, which uses bootstrapping in order to obtain a robust prioritization of SNPs for use in predictive models. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data and results in a more robust set of SNPs and a larger number of enriched pathways being associated with the different diseases. Finally, we show that combining BootRank with seven different classification algorithms improves performance compared to previous studies that used the WTCCC data. Notably, diseases for which BootRank results in the largest improvements were recently shown to have more heritability than previously thought, likely due to contributions from variants with low minor allele frequency (MAF), suggesting that BootRank can be beneficial in cases where SNPs affecting the disease are poorly tagged or have low MAF. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
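The general idea of bootstrap-based SNP prioritization can be sketched as follows. This is not the BootRank algorithm itself, whose details are not given in this abstract: the association score (absolute difference in mean genotype between cases and controls, standing in for a p-value), the aggregation by summed rank, and the names `association_score` and `bootstrap_rank` are all assumptions made for illustration.

```python
import random
from statistics import mean


def association_score(genotypes, labels, snp):
    """Stand-in association statistic: |mean genotype in cases - controls|."""
    cases = [g[snp] for g, y in zip(genotypes, labels) if y == 1]
    controls = [g[snp] for g, y in zip(genotypes, labels) if y == 0]
    if not cases or not controls:
        return 0.0  # resample contains a single class; no signal to score
    return abs(mean(cases) - mean(controls))


def bootstrap_rank(genotypes, labels, n_boot=200, seed=0):
    """Order SNP indices by their summed rank across bootstrap resamples."""
    rng = random.Random(seed)
    n, n_snps = len(genotypes), len(genotypes[0])
    rank_sums = [0] * n_snps
    for _ in range(n_boot):
        # Resample individuals with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [genotypes[i] for i in idx]
        y = [labels[i] for i in idx]
        scores = [association_score(g, y, s) for s in range(n_snps)]
        # Rank 0 is the most strongly associated SNP in this resample.
        order = sorted(range(n_snps), key=lambda s: -scores[s])
        for rank, s in enumerate(order):
            rank_sums[s] += rank
    return sorted(range(n_snps), key=lambda s: rank_sums[s])


# Toy data (hypothetical): SNP 0 tracks case status, SNPs 1 and 2 do not.
labels = [i % 2 for i in range(20)]
genotypes = [[2 * y, i % 3, (i // 2) % 3] for i, y in enumerate(labels)]
ranking = bootstrap_rank(genotypes, labels, n_boot=50)
print(ranking)
```

A SNP that ranks highly only in the full sample, because of a few influential individuals, tends to lose its position across resamples, which is the sense in which a bootstrap ranking is less sensitive to noise than a single p-value ordering.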
Genome-wide association studies are widely used to search for genetic loci that underlie human disease. Another goal is to predict disease risk for different individuals given their genetic sequence. Such predictions could either be used as a “black box” in order to promote changes in life-style and screening for early diagnosis, or as a model that can be studied to better understand the mechanism of the disease. Current methods for risk prediction have relatively poor performance, with one possible explanation being the fact they rely on a noisy ranking of genetic variants given to them as input. To improve the predictive power, we devised BootRank, a ranking method less sensitive to noise. We show that BootRank improves the ability to predict disease risk of unseen individuals in the Wellcome Trust Case Control Consortium (WTCCC) data, and that combining BootRank with different classification algorithms improves performance compared to previous studies that used these data. Overall, our results show that improving disease risk prediction from genotypic information may be a tangible goal, with potential implications for personalized disease screening and treatment.
Nucleosome positioning is critical for gene expression and most DNA-related processes. Here, we review the dominant patterns of nucleosome positioning that have been observed, and summarize current understanding of their underlying determinants. The genome-wide pattern of nucleosome positioning is determined by the combination of DNA sequence, ATP-dependent nucleosome remodeling enzymes, and transcription factors including activators, components of the preinitiation complex, and elongating RNA polymerase II. These determinants influence each other such that the resulting nucleosome positioning patterns are likely to differ among genes and among cells within a population, with consequent effects on gene expression.
The core promoter is the region in which RNA polymerase II is recruited to the DNA and acts to initiate transcription, but the extent to which the core promoter sequence determines promoter activity levels is largely unknown. Here, we identified several base content and k-mer sequence features of the yeast core promoter sequence that are highly predictive of maximal promoter activity. These features are mainly located in the region 75 bp upstream and 50 bp downstream of the main transcription start site, and their associations hold for both constitutively active promoters and promoters that are induced or repressed in specific conditions. Our results unravel several architectural features of yeast core promoters and suggest that the yeast core promoter sequence downstream of the TATA box (or of similar sequences involved in recruitment of the pre-initiation complex) is a major determinant of maximal promoter activity. We further show that human core promoters also contain features that are indicative of maximal promoter activity; thus, our results emphasize the important role of the core promoter sequence in transcriptional regulation.
A single transcription factor can activate or repress expression by three different mechanisms: one that increases cell-to-cell variability in target gene expression (noise) and two that decrease noise.
The ability of cells to accurately control gene expression levels in response to extracellular cues is limited by the inherently stochastic nature of transcriptional regulation. A change in transcription factor (TF) activity results in changes in the expression of its targets, but the way in which cell-to-cell variability in expression (noise) changes as a function of TF activity, and whether targets of the same TF behave similarly, is not known. Here, we measure expression and noise as a function of TF activity for 16 native targets of the transcription factor Zap1 that are regulated by it through diverse mechanisms. For most activated and repressed Zap1 targets, noise decreases as expression increases. Kinetic modeling suggests that this is due to two distinct Zap1-mediated mechanisms that both change the frequency of transcriptional bursts. Notably, we found that another mechanism of repression by Zap1, which is encoded in the promoter DNA, likely decreases the size of transcriptional bursts, producing a unique transcriptional state characterized by low expression and low noise. In addition, we find that further reduction in noise is achieved when a single TF both activates and represses a single target gene. Our results suggest a global principle whereby at low TF concentrations, the dominant source of differences in expression between promoters stems from differences in burst frequency, whereas at high TF concentrations differences in burst size dominate. Taken together, we show that the precise amount by which noise changes with expression is specific to the regulatory mechanism of transcription and translation that acts at each gene.
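To see why changing burst frequency and changing burst size affect noise differently, consider the standard bursty-expression approximation, in which the steady-state distribution is negative binomial with mean equal to burst frequency times burst size and squared coefficient of variation equal to (1 + burst size) / mean. This textbook model is used here as an assumption for illustration; it is not the kinetic model fitted in the study, and the function name and parameter values are hypothetical.

```python
def bursty_mean_and_noise(burst_frequency, burst_size):
    """Mean and noise (CV^2) under a negative binomial bursting model.

    burst_frequency: bursts per molecule lifetime (dimensionless).
    burst_size: mean number of molecules produced per burst.
    """
    if burst_frequency <= 0 or burst_size <= 0:
        raise ValueError("burst frequency and size must be positive")
    mean_expression = burst_frequency * burst_size
    noise = (1.0 + burst_size) / mean_expression
    return mean_expression, noise


# Double the mean expression in two different ways from the same baseline.
baseline = bursty_mean_and_noise(burst_frequency=2.0, burst_size=10.0)
more_frequent = bursty_mean_and_noise(burst_frequency=4.0, burst_size=10.0)
larger_bursts = bursty_mean_and_noise(burst_frequency=2.0, burst_size=20.0)

for name, (m, cv2) in [
    ("baseline", baseline),
    ("2x frequency", more_frequent),
    ("2x burst size", larger_bursts),
]:
    print(f"{name}: mean={m:.1f} noise={cv2:.3f}")
```

In this model the noise can be written as 1/mean + 1/burst_frequency, so raising the frequency lowers both terms, whereas changing the burst size moves only the first term. That asymmetry is what allows the shape of the noise-versus-expression curve to hint at the underlying regulatory mechanism.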
In response to environmental changes, cells regulate the activity of transcription factors (TFs), which in turn change the expression of dozens of downstream target genes by binding to their promoters. The response of each target gene is determined by the interplay between TF concentration and the context in which TF binding sites occur in each target promoter. To examine the relationship between promoter sequence, mechanism of regulation, and response to TF activity, we measured expression of 16 target genes of a single TF in response to changes in TF concentration in single cells. We found that different native promoters that are all targets of the same TF exhibit diverse responses to changing TF levels in terms of both gene expression level and cell-to-cell variability (noise) in expression. Using computational modeling and mutations of specific promoter elements, we show that the molecular mechanisms of regulation can be inferred by measuring how noise changes with expression. These results show that a single TF can regulate transcription through multiple mechanisms, resulting in similar changes in mean expression but vastly different changes in cell-to-cell variability.
Many genetic variants that are significantly correlated to gene expression changes across human individuals have been identified, but the ability of these variants to predict expression of unseen individuals has rarely been evaluated. Here, we devise an algorithm that, given training expression and genotype data for a set of individuals, predicts the expression of genes of unseen test individuals given only their genotype in the local genomic vicinity of the predicted gene. Notably, the resulting predictions are remarkably robust in that they agree well between the training and test sets, even when the training and test sets consist of individuals from distinct populations. Thus, although the overall number of genes that can be predicted is relatively small, as expected from our choice to ignore effects such as environmental factors and trans sequence variation, the robust nature of the predictions means that the identity and quantitative degree to which genes can be predicted is known in advance. We also present an extension that incorporates heterogeneous types of genomic annotations to differentially weigh the importance of the various genetic variants, and we show that assigning higher weights to variants with particular annotations such as proximity to genes and high regional G/C content can further improve the predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted from their cis genetic variation.
Variation in gene expression across different individuals has been found to play a role in susceptibility to different diseases. In addition, many genetic variants that are linked to changes in expression have been found to date. However, their joint ability to accurately predict these changes is not well understood and has rarely been evaluated. Here, we devise a method that uses multiple genetic variants to explain the variation in expression of genes across individuals. One important aspect of our method is its robustness, in that our predictions agree well between training and test sets. Thus, although the number of genes that could be explained is relatively small, the identity and quantitative degree to which genes can be predicted is known in advance. We also present an extension to our method that integrates different genomic annotations such as location of the genetic variant or its context to differentially weigh the genetic variants in our model and improve predictions. Finally, genes that are successfully predicted have, on average, higher expression and more variability across individuals, providing insight into the characteristics of the types of genes that can be predicted by our method.
A full understanding of gene regulation requires an understanding of the contributions that the various regulatory regions make to gene expression. Although it is well established that sequences downstream of the main promoter can affect expression, our understanding of the scale of this effect and how it is encoded in the DNA is limited. Here, to measure the effect of native S. cerevisiae 3′ end sequences on expression, we constructed a library of 85 fluorescent reporter strains that differ only in their 3′ end region. Notably, despite being driven by the same strong promoter, our library spans a continuous twelve-fold range of expression values. These measurements correlate with endogenous mRNA levels, suggesting that the 3′ end contributes to constitutive differences in mRNA levels. We used deep sequencing to map the 3′UTR ends of our strains and show that determination of polyadenylation sites is intrinsic to the local 3′ end sequence. Following polyadenylation mapping with sequence analysis, we found that increased A/T content upstream of the main polyadenylation site correlates with higher expression, both in the library and genome-wide, suggesting that native genes differ by the encoded efficiency of 3′ end processing. Finally, we use single-cell fluorescence measurements at different promoter activation levels to show that 3′ end sequences modulate protein expression dynamics differently than promoters, by predominantly affecting the size of protein production bursts as opposed to the frequency at which these bursts occur. Altogether, our results lead to a more complete understanding of gene regulation by demonstrating that 3′ end regions have a unique and sequence dependent effect on gene expression.
A basic question in gene expression is the relative contribution of different regulatory layers and genomic regions to the differences in protein levels. In this work we concentrated on the effect of 3′ end sequences. For this, we constructed a library of yeast strains that differ only by a native 3′ end region integrated downstream of a reporter gene driven by a constant inducible promoter. Thus, we could attribute all differences in reporter expression between the strains to the different 3′ end sequences. Interestingly, we found that despite being driven by the same strong, inducible promoter, our library spanned a wide and continuous range of expression levels of more than twelve-fold. As these measurements represent the sole effect of the 3′ end region, we quantify the contribution of these sequences to the variance in mRNA levels by comparing our measurements to endogenous mRNA levels. We follow with sequence analysis to find a simple sequence signature that correlates with expression. In addition, single-cell analysis reveals distinct noise dynamics of 3′ end mediated differences in expression compared to different levels of promoter activation, leading to a more complete understanding of gene expression that also incorporates the effect of these regions.
Despite much research, our understanding of the rules by which cis-regulatory sequences are translated into expression levels is still lacking. We devised a method for obtaining parallel and highly accurate expression measurements of thousands of fully designed promoters, and applied it to measure the effect of systematic changes to location, number, orientation, affinity and organization of transcription factor (TF) binding sites and of nucleosome disfavoring sequences. Our analyses reveal a clear relationship between expression and binding site number, and TF-specific dependencies of expression on the distance between sites and gene starts including a striking ~10bp periodic relationship. We also demonstrate the utility of our approach for measuring TF sequence specificities and sensitivity of TF sites to surrounding sequence context, and for profiling the activity of most yeast transcription factors. Our method is readily applicable for studying both the cis and trans effects of genotype on transcriptional, post-transcriptional, and translational control.
Fundamental aspects of embryonic and post-natal development, including maintenance of the mammalian female germline, are largely unknown. Here we employ a retrospective, phylogenetic-based method for reconstructing cell lineage trees utilizing somatic mutations accumulated in microsatellites, to study female germline dynamics in mice. Reconstructed cell lineage trees can be used to estimate lineage relationships between different cell types, as well as cell depth (number of cell divisions since the zygote). We show that, in the reconstructed mouse cell lineage trees, oocytes form clusters that are separate from hematopoietic and mesenchymal stem cells, both in young and old mice, indicating that these populations belong to distinct lineages. Furthermore, while cumulus cells sampled from different ovarian follicles are distinctly clustered on the reconstructed trees, oocytes from the left and right ovaries are not, suggesting a mixing of their progenitor pools. We also observed an increase in oocyte depth with mouse age, which can be explained either by depth-guided selection of oocytes for ovulation or by post-natal renewal. Overall, our study sheds light on substantial novel aspects of female germline preservation and development.
Many aspects of mammalian female germline development during embryogenesis and throughout adulthood are either unknown or under debate. In this study we applied a novel method for the reconstruction of cell lineage trees utilizing microsatellite mutations, accumulated during mouse life, in oocytes and other cells, sampled from young and old mice. Analysis of the reconstructed cell lineage trees shows that oocytes are clustered separately from bone-marrow derived cells, that oocytes from different ovaries share common progenitors, and that oocyte depth (number of cell divisions since the zygote) increases significantly with mouse age.
We recently reported the identification and characterization of DNA replication origins (Oris) in metazoan cell lines. Here, we describe additional bioinformatic analyses showing that the previously identified GC-rich sequence elements form origin G-rich repeated elements (OGREs) that are present in 67% to 90% of the DNA replication origins from Drosophila to human cells, respectively. Our analyses also show that initiation of DNA synthesis takes place precisely at 160 bp (Drosophila) and 280 bp (mouse) from the OGRE. We also found that in most CpG islands, an OGRE is positioned in opposite orientation on each of the two DNA strands and detected two sites of initiation of DNA synthesis upstream or downstream of each OGRE. Conversely, Oris not associated with CpG islands have a single initiation site. OGRE density along chromosomes correlated with previously published replication timing data. Ori sequences centered on the OGRE are also predicted to have high intrinsic nucleosome occupancy. Finally, OGREs predict G-quadruplex structures at Oris that might be structural elements controlling the choice or activation of replication origins.
DNA replication origins; DNA synthesis; G-quadruplex; nucleosome; CpG islands; transcription
We propose definitions and procedures for comparing nucleosome maps and discuss current agreement and disagreement on the effect of histone sequence preferences on nucleosome organization in vivo.