No particular GO group exhibited longer than expected promoters (Table ). This suggests that the GO groups found in Table to be isolated from their neighbouring genes, such as cell wall and plasma membrane proteins, do not require this distance to accommodate larger promoters where more transcription factors can bind (see below). In contrast to the lack of larger-than-average promoters, many GO groups were enriched for long 5'UTRs. These included categories related to signal transduction pathways (amino acid phosphorylation, signal transduction, small GTPase signal transduction), invasive and pseudohyphal growth, and cell wall proteins. Long 5'UTRs have been linked in the past to translation regulation: folding of the 5'UTR may help regulate the accessibility to the ribosome [8]. Indeed, all the processes mentioned require precise levels of expression. Our results suggest that they may be regulated at the level of initiation of translation. Table also shows that genes involved in transcription regulation tend to have long 3'UTRs (probably pointing to regulation through RNA binding proteins, see below), whereas longer than usual terminators can be seen in genes involved in response to stress and amino acid transport (Table ). The length distribution of all functional categories is presented in Additional File 3.
Next, we asked whether there is a correlation between the lengths of the different regions of each gene. Table shows that the highest correlations are seen between the size of each ORF and its 5' UTR (a correlation of 0.19), as well as between the promoter and terminator regions (0.16). These results may suggest that longer genes require longer regulatory regions. Indeed, such genes are regulated on average by more transcription factors (correlation = 0.12, p < 10^-16) and their mRNAs tend to bind more regulatory proteins (correlation = 0.16, p < 10^-16); these features may require longer promoters and UTRs (see the next section). Interestingly, the adjacent 3'UTR and terminator regions exhibit a clear and strong negative correlation (-0.19). The opposing trends between the 3'UTR and its adjacent terminator region suggest that a minimal distance must exist between ORFs to allow proper expression levels. This results in a trade-off between the 3'UTR length and that of the terminator [18].
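The pairwise comparison of region lengths described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the text does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the names `pearson`, `pairwise_correlations` and the toy values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(lengths):
    """Correlate every pair of regions.

    `lengths` maps a region name to a list with one length per gene,
    in the same gene order for every region.
    """
    return {(a, b): pearson(lengths[a], lengths[b])
            for a, b in combinations(lengths, 2)}


# Toy values (hypothetical, not the paper's data): one entry per gene.
toy = {
    "utr3":       [100, 150, 200, 250, 300],
    "terminator": [400, 350, 300, 250, 200],  # shrinks as the 3'UTR grows
    "orf":        [900, 1200, 1000, 1500, 1100],
}
corr = pairwise_correlations(toy)
```

In this toy input the 3'UTR and terminator lengths always sum to the same intergenic distance, so their correlation is perfectly negative. It is an exaggerated version of the trade-off reported in the text, where the observed value is -0.19.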
Factors related to the length of the different regions
In the next stage, we analyzed whether the different gene regions are correlated with different factors that affect gene expression. The following variables were analyzed (Table ): 1) Number of transcription factors known to bind at the promoter region (N° of TFs) [27]. 2) Number of RNA binding proteins known to bind its mRNA product (N° of RBPs) [28]. 3) mRNA levels [29]. 4) mRNA half life [30]. 5) 5'UTR free energy [8]. 6) Protein abundance (PA) [31]. 7) Protein half life [32]. 8) Noise in protein levels [33]. And 9) Evolutionary rate of the gene (ER) [34]. In the case of variables with a small discrete number of values (N° of TFs, N° of RBPs), the correlation is reported as significant only when an empirical p-value corresponding to a permutation test was significant (see Materials and methods; the empirical p-values appear in Additional File 4).
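A permutation test of the kind mentioned above can be sketched as follows. The paper's exact procedure is given in its Materials and methods, which are not reproduced here, so this is a generic illustration under stated assumptions: Pearson's r as the statistic, a two-sided test, and an add-one correction. The function names and toy values are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Empirical two-sided p-value for the correlation between x and y.

    The values of y are shuffled to break any association with x, and
    the p-value is the fraction of shuffles whose absolute correlation
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy values (hypothetical): region length vs. a small discrete count,
# standing in for the number of TFs or RBPs per gene.
region_length = list(range(30))
binding_count = [v // 5 for v in region_length]  # only six distinct values
p_emp = permutation_pvalue(region_length, binding_count)
```

Because the null distribution is built from the data themselves, the many ties produced by a variable with few distinct values are reflected in the p-value, which is presumably why an empirical test was preferred for these two variables.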
Relations between the lengths of Promoters, 5'UTRs, 3'UTRs, and various parameters.
Table shows that the lengths of ORFs and untranslated regions correlate significantly with many central features. For example, as expected, a positive correlation can be seen between promoter length and the number of transcription factors binding it (r = 0.29, p < 10^-16). However, the fact that the number of TFs also correlates with terminator and 5'UTR lengths additionally suggests that genes with more extensive TF regulation require a longer distance from neighboring ORFs.
Genes with higher protein abundance and increased mRNA levels tend to have longer promoters, 3'UTRs, and terminators, and the genes themselves tend to be short (presumably to allow efficient translation; see for example [35]). This result demonstrates that the untranslated regions contribute to the tighter regulation of highly expressed genes. In addition, proteins whose abundance within the cell tends to be variable or "noisy" show longer promoters. The significance of this observation remains unclear.
Interestingly, we found a significant negative correlation between promoter length and evolutionary rate of the corresponding genes. This correlation is still significant after controlling for the number of TFs or for any of the other features that appear in Table . Thus, genes with longer promoters evolve at a slower rate. This seems to occur independently of the fact that they are regulated by more TFs, and tend to have higher mRNA and protein levels. The puzzling inverse correlation between promoter length and evolutionary rate suggests that regulatory mechanisms other than TFs play an important regulatory role, which cannot be easily modified during evolution. This additional regulatory mechanism(s) could be related to chromatin configuration, an aspect of nuclear architecture that has lately been the focus of much attention [36].
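One common way to control a correlation for a third variable is the first-order partial correlation, sketched below. The text does not say which method was used to control for the number of TFs, so this formula is an assumption, and the function names and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z.

    Uses the first-order formula
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy values (hypothetical): x and y are each driven by z plus
# independent terms, so their raw association is due to z alone.
z = [-1, -1, 1, 1, -1, -1, 1, 1]
e1 = [-1, 1, -1, 1, -1, 1, -1, 1]
e2 = [-1, -1, -1, -1, 1, 1, 1, 1]
x = [2 * zi + a for zi, a in zip(z, e1)]
y = [2 * zi + b for zi, b in zip(z, e2)]

raw = pearson(x, y)
controlled = partial_correlation(x, y, z)
```

In this toy case the raw correlation is high but essentially disappears once z is controlled for. The finding reported in the text is the opposite situation: the correlation between promoter length and evolutionary rate remains significant after controlling for the number of TFs, which is why the authors conclude that it is not explained by TF binding alone.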
Throughout the years various roles have been attributed to the 5' and 3' UTR regions, including mRNA stability, folding, interactions with the nuclear export, RNA processing, splicing and translational machineries, as well as intracellular traffic and localization [6]. We show that whereas the 3' UTR length exhibits a negative correlation with mRNA half life, the 5' UTR length is negatively correlated with protein half life and abundance (Table ). These results show that the main effect that these two untranslated regions have on gene expression occurs at two different levels, the 3'UTR acting mainly at the RNA stability level, and the 5'UTR enabling appropriate translation. Lately it has become apparent that RNA-binding proteins (RBPs) play an important role in regulating gene expression [28]. RBPs recognize specific sequences at various locations along the mRNA molecule. Our results suggest that those at the 3'UTR play a major role in regulation, as the correlation of the number of RBPs is significantly positive with the length of the 3'UTRs (0.092, p = 3.6 × 10^-11) and significantly negative with the length of the 5'UTRs (-0.066, p = 1.3 × 10^-5).
The organization of genomes is a subject of intensive research. Not long ago, it was assumed that genes were randomly distributed in eukaryotic genomes, in contrast to prokaryotes, where the organization of genes in regulatory operons requires their physical clustering [37]. However, work carried out in the last few years has challenged this view (reviewed in [38]). It appears that gene distribution is far from random and many eukaryotic genomes include clusters of genes that are related in their function [39]. A clear connection was found between co-expression and proximity, as closely-located genes tend to be co-expressed [41], clusters of co-expressed genes in mammalian genomes are evolutionarily conserved [42], and highly expressed genes and housekeeping genes tend to cluster [44]. In addition, clustered genes tend to exhibit similar functionality [39], tend to be located in domains with low recombination rates [51], encode proteins that tend to interact physically [38], and belong to the same metabolic pathway [54].
A number of previous publications explored the genomic distribution of genes belonging to the same biological function or biochemical pathway [48]. Recently, Tuller et al. compared the genomes of 16 organisms and found a high level of functional organization for eukaryotes, such as Saccharomyces cerevisiae. They also found that the genomic distribution of cellular functions tends to be more similar in organisms that have higher evolutionary proximity. Here we analyze the distribution of genes in the genome of the yeast Saccharomyces cerevisiae from a functional point of view. Measuring distances between genes belonging to various GO categories, we find that certain functions in yeast are encoded by genes that tend to be close to other genes (not necessarily from the same function). We see an enrichment of functions related to mRNA splicing (Table ). Such clustering is explained by the fact that these genes tend to have short promoters (Table ). The biological significance of this finding is not completely clear. One possibility is that, for unknown reasons, genes related to mRNA splicing tend to be regulated by fewer transcription factors than others, and thus require shorter promoter regions. Although these genes have a lower number of transcription factors, the difference from the rest of the genome is not statistically significant (data not shown), suggesting that additional forces may affect the promoter length of these genes. Alternatively, proper regulation of this set of genes may require physical proximity between transcription initiation factors and upstream regulators such as transcription factors and chromatin remodelers. Interestingly, chromatin remodelers by themselves constitute another GO group with short promoters. Additional GO groups with short promoters include those related to genome maintenance (DNA repair, DNA damage response, etc.). In contrast, GO groups involved in responses to environmental changes (signal transduction, cell wall, etc.) tend to have longer untranslated sequences.
Our results suggest that gene distribution in the genome has evolved to allow suitable regulation: highly expressed genes tend to be shorter, and have extensive promoters and terminators. The longer promoters can partially be explained by the need for tighter regulation of these genes by TFs; the longer terminators may be needed in order to reduce transcription noise from neighboring genes. In addition, we have shown that 5' and 3' UTRs may provide additional layers of regulation, with 3'UTRs exerting their effect at the RNA level, and 5'UTRs affecting translation levels. Thus, genome architecture has a significant role in regulating gene expression, and in shaping the characteristics and functionality of proteins.