We have comprehensively mapped long-range associations between chromosomal regions throughout the fission yeast genome using the latest genomics approach that combines next generation sequencing and chromosome conformation capture (3C). Our relatively simple approach, referred to as enrichment of ligation products (ELP), involves digestion of the 3C sample with a 4bp cutter and self-ligation, achieving a resolution of 20kb. It recaptures previously characterized genome organizations and also identifies new and important interactions. We have modeled the 3D structure of the entire fission yeast genome and have explored the functional relationships between the global genome organization and transcriptional regulation. We find significant associations among highly transcribed genes. Moreover, we demonstrate that genes co-regulated during the cell cycle tend to associate with one another when activated. Remarkably, functionally defined genes derived from particular gene ontology groups tend to associate in a statistically significant manner. Those significantly associating genes frequently contain the same DNA motifs at their promoter regions, suggesting that potential transcription factors binding to these motifs are involved in defining the associations among those genes. Our study suggests the presence of a global genome organization in fission yeast that is functionally similar to the recently proposed mammalian transcription factory.
Eukaryotic genomes are non-randomly organized in the nucleus. It is becoming clear that intra-nuclear positions of genomic loci are influenced by various nuclear processes including transcription, replication and repair (1). It is well known that the ribosomal genes (rDNA repeats) are transcribed by RNA polymerase (Pol) I in the nucleolus. Moreover, it has been shown that Pol III genes such as tRNA genes are clustered at or near the nucleolus in yeasts, suggesting that Pol III transcription likely occurs in a subnuclear domain (2,3). It has been proposed that Pol II gene transcription involves higher-order genome organization associated with ‘transcription factories’ which accumulate Pol II transcription machinery for gene transcription (4–7). It has recently been suggested that transcription factors are involved in the association of genes with these transcription factories (8). However, how transcription factories function remains unclear, partly because they have been studied in complex mammalian cells. Studying the factories in a model organism with a much simpler genome can facilitate understanding of the role of transcription factories with regard to transcriptional regulation.
Fluorescent in situ hybridization (FISH) has been used to analyze nuclear localization of genomic loci at a global level, but a relatively new approach, chromosome conformation capture (3C), now allows us to investigate physical associations between specific genomic loci (9). The use of the 3C method has triggered development of several additional genome-wide approaches including 4C and 5C (10–12). It has recently been reported that 3C combined with next-generation DNA sequencing, referred to as Hi-C, can be used to comprehensively map genomic associations (13). Application of the Hi-C method to the human genome has identified genomic associations at a resolution of 1Mb, and has shown that the human genome is segregated into two compartments corresponding to open and closed chromatin. We hypothesized that the latest genomics approach was likely to provide much higher resolution if applied to a model organism carrying a small genome. Indeed, a similar method applied to budding yeast significantly increased the resolution of mapped genomic associations (14,15).
The fission yeast Schizosaccharomyces pombe offers an excellent model system to investigate the organization of a functional genome. Its genome is ~14 Mb, consisting of ~5000 genes located on only three chromosomes, with an organization and composition similar to higher eukaryotes (16). For example, its genome contains large stretches of heterochromatin at centromeres and subtelomeres (17). We have previously shown that the fission yeast genome displays a specific functional architecture within the nucleus (2,18).
In this study, we utilize the latest genomic approach combining the 3C and next-generation DNA sequencing to gain insights into functional relationships between the global genome organization and transcriptional regulation in the model organism fission yeast. Our analyses have revealed significant associations between highly transcribed genes, between co-regulated genes during cell-cycle progression, and between functionally related genes derived from particular gene ontology groups. Our study identifies inter- and intra-chromosomal interactions providing further evidence for a mechanism of functional genome organization that supports gene expression in a structure similar to the transcription factory described in mammals.
3C analysis was performed as described previously (9) with modifications. Briefly, fission yeast cells (~7×10⁸ cells) were digested by Zymolyase 100T at 30°C for 10min, and then cross-linked with 4% paraformaldehyde at 18°C for 30min. The fixed sample was treated with HindIII at 37°C for 2h and then diluted 20 times with T4 DNA ligase buffer, followed by DNA ligation at 16°C for 70min. To prepare the random ligation (RL) control sample, genomic DNA was first purified from the wild-type fission yeast strain used in 3C analysis. The genomic DNA was completely digested by HindIII at 37°C for 2h, followed by DNA ligation. The 3C and RL samples were further subjected to the following sample preparation processes for Illumina paired end sequencing.
Eight micrograms of 3C and RL samples were digested by BfuCI at 37°C for 1h. The resultant samples were diluted 1:10 with T4 DNA ligase buffer and subjected to DNA ligation at 16°C for >8h. The DNA samples were then digested by HindIII at 37°C for 2h. The purified ELP samples were sequenced by an Illumina Genome Analyzer II. The obtained sequences have been deposited at NCBI Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra/) under the accession # SRP002804.
This section contains the following:
1. Alignment of paired reads and filtering processes. The 36bp paired reads were aligned by using Maq (http://maq.sourceforge.net/) with the maximum outer distance set to 900bp. The reference sequence of the fission yeast genome (20090706) was obtained from the Sanger Institute. Paired sequences containing HindIII sites at both ends of DNA molecules were retained for subsequent analyses. In order to extract the data that reflect long-range associations, paired DNA sequences aligned to two genomic regions positioned <20kb apart were discarded. To eliminate paired reads aligned to the repeat sequences, all the reads were aligned to the reference sequence using Bowtie (http://bowtie-bio.sourceforge.net/index.shtml) with the option -m 1, which makes Bowtie discard the sequences alignable to multiple positions. The discarded sequences were used to identify the paired reads derived from repetitive sequences. The paired sequences from repeats were removed from Maq-aligned data.
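The two filters described above (removal of short-range pairs and of repeat-derived pairs) can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the `ReadPair` record, the `filter_pairs` function and the idea of passing in a set of repeat-flagged pair IDs are hypothetical names introduced here, and the HindIII-site check is omitted.

```python
from typing import NamedTuple

MIN_SEPARATION = 20_000  # bp; intra-chromosomal pairs closer than this are dropped


class ReadPair(NamedTuple):
    """Hypothetical record for one aligned read pair."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    read_id: str


def is_long_range(pair: ReadPair) -> bool:
    """True for inter-chromosomal pairs or pairs at least 20 kb apart."""
    if pair.chrom1 != pair.chrom2:
        return True
    return abs(pair.pos1 - pair.pos2) >= MIN_SEPARATION


def filter_pairs(pairs, repeat_read_ids):
    """Keep long-range pairs that were not flagged as repeat-derived.

    `repeat_read_ids` is assumed to hold the IDs of pairs for which at least
    one end could be aligned to multiple positions.
    """
    return [p for p in pairs if is_long_range(p) and p.read_id not in repeat_read_ids]
```

A pair is dropped as soon as either end is flagged, which mirrors the removal of pairs in which only one end maps to a repeat.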
2. Calculation of physical proximity values. The entire fission yeast genome was divided into 20kb sections. Paired sequences were assigned to two distant genomic sections according to positions of the reads. There are a total of 628 genomic sections. The total number of combinations between two sections was 196878. All the paired reads were mapped to the genome. Total numbers of paired reads assigned to respective combinations were counted. Total counts of paired reads from the 3C sample were compared with those from RL control. Physical proximity value I(i, j) was calculated as follows:
N3C(i, j) indicates total count of paired reads from the 3C sample assigned to the combination between genomic loci i and j. NRL(i, j) is the count from RL control. I(i, j) was discarded if NRL(i, j) was less than 4, because low values of NRL(i, j) appear to cause fluctuation in values of I(i, j), resulting in 180562 remaining combinations. The physical proximity values are accessible at the Wistar website (http://www.wistar.org/research_facilities/noma/pubdata.htm).
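The counting and filtering steps can be sketched as below. The exact form of I(i, j) is an assumption here: the sketch takes it to be the fraction of 3C read pairs assigned to a combination divided by the corresponding fraction in the RL control, which would place the average value near 1.0. The function names are hypothetical.

```python
from collections import Counter
from math import comb

SECTION_SIZE = 20_000  # bp
MIN_RL_COUNT = 4       # combinations with fewer RL read pairs are discarded


def section_pair(chrom1, pos1, chrom2, pos2):
    """Map a read pair to an unordered pair of (chromosome, section index)."""
    a = (chrom1, pos1 // SECTION_SIZE)
    b = (chrom2, pos2 // SECTION_SIZE)
    return (a, b) if a <= b else (b, a)


def count_pairs(read_pairs):
    """Count read pairs per combination of genomic sections."""
    return Counter(section_pair(*rp) for rp in read_pairs)


def proximity_values(counts_3c, counts_rl):
    """ASSUMED form: (N3C / total 3C pairs) / (NRL / total RL pairs)."""
    total_3c = sum(counts_3c.values())
    total_rl = sum(counts_rl.values())
    values = {}
    for key, n_rl in counts_rl.items():
        if n_rl < MIN_RL_COUNT:
            continue  # low RL counts make the ratio fluctuate
        values[key] = (counts_3c.get(key, 0) / total_3c) / (n_rl / total_rl)
    return values


# 628 sections give 628 * 627 / 2 = 196878 unordered combinations.
N_COMBINATIONS = comb(628, 2)
```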
3. Distance normalization. Average physical proximity values between genomic loci separated by the same distance decreased gradually as the distance between the two loci increased. The resulting curves for the three chromosomes were each fitted by a double-exponential curve. Physical proximity values were normalized by means of the following formula.
x represents the specific distance between two genomic loci i and j. Q(i, j) is the distance-normalized value of physical proximity value, I(i, j). F(x) is the function indicating the fitting curves for respective chromosomes.
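A sketch of this normalization is given below under two assumptions that go beyond the text: that Q(i, j) is obtained by dividing I(i, j) by the fitted value F(x), and that the double-exponential curve has the form a·exp(−b·x) + c·exp(−d·x). The parameter values are hypothetical and serve only to make the example runnable.

```python
import math


def double_exponential(x, a, b, c, d):
    """F(x) = a*exp(-b*x) + c*exp(-d*x), with x the separation in bp."""
    return a * math.exp(-b * x) + c * math.exp(-d * x)


def normalize(proximity, distance, params):
    """ASSUMED form of the normalization: Q(i, j) = I(i, j) / F(x)."""
    return proximity / double_exponential(distance, *params)


# Hypothetical fit parameters for one chromosome (illustration only).
EXAMPLE_PARAMS = (3.0, 1e-5, 1.0, 1e-7)
```

Under this form, a pair of loci whose proximity value equals the chromosome-wide average at that distance receives a normalized value of 1.0.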
4. Statistical analyses for detecting significant associations. Statistical analyses were performed to test the hypothesis that genes related to some biological features associate together. For instance, the significance of associations among LTRs, highly and poorly expressed genes, cell-cycle regulated genes and genes in gene ontology groups was investigated. If associations among a specified group of genes are significant, the total physical proximity value among genomic sections containing those genes should be higher than that among randomly selected sections. According to this criterion, total physical proximity values among genomic sections containing genes in the target group were compared to those among genomic sections from a null model. For each target group, we calculated the total physical proximity value among the same number of genomic sections, randomly selecting from the entire genome. A null model was built by repeating this process 1000 times (1000 permutations). Distribution of total physical proximity values corresponding to the null model was used for the calculation of P-value.
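The permutation procedure can be sketched as follows. The P-value here is taken as the fraction of random section sets whose total proximity is at least as high as the observed total; this is one common choice and an assumption, since the text only states that the null distribution was used. Section pairs without a proximity value are counted as 0, which is a simplification, and the function names are hypothetical.

```python
import random
from itertools import combinations


def total_proximity(sections, proximity):
    """Sum proximity values over all pairs of the given sections."""
    return sum(proximity.get(frozenset(pair), 0.0) for pair in combinations(sections, 2))


def permutation_p_value(target, all_sections, proximity, n_perm=1000, seed=0):
    """Fraction of random section sets with a total >= the observed total."""
    rng = random.Random(seed)
    observed = total_proximity(target, proximity)
    hits = 0
    for _ in range(n_perm):
        # Draw the same number of sections at random from the whole genome.
        sample = rng.sample(all_sections, len(target))
        if total_proximity(sample, proximity) >= observed:
            hits += 1
    return hits / n_perm
```

Here `proximity` is keyed by `frozenset({i, j})` so that the order of the two sections does not matter.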
The modeling of the 3D genome structure was performed as described previously (14) with modifications. The fission yeast genome was modeled as strings of beads. Each bead represents the center of a 20kb genomic section. There are a total of 622 beads covering the entire genome.
The first step was to calculate the 3D distance from the physical proximity value. Eighteen pairs of distant genomic loci were analyzed by FISH. Due to the distributions of FISH measurements, 30% of data points were truncated from both tails in order to remove possible outliers and the remaining middle 40% of the FISH data were used for the following calculations. The relationship between physical proximity values and FISH data was fitted by a non-linear regression curve. All physical proximity values were converted to 3D distances according to the fitted equation. The top 60% of physical proximity values corresponding to 115878 combinations were used for the following modeling processes.
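The truncation of the FISH measurements can be illustrated with a short sketch. Summarizing the retained middle 40% by its mean is an assumption made here for illustration; the text does not state which summary was passed on to the regression.

```python
def middle_fraction(values, tail_percent=30):
    """Drop `tail_percent` of the sorted points from each end."""
    ordered = sorted(values)
    cut = len(ordered) * tail_percent // 100
    return ordered[cut:len(ordered) - cut]


def trimmed_mean(values, tail_percent=30):
    """Mean of the retained middle portion (ASSUMED summary statistic)."""
    kept = middle_fraction(values, tail_percent)
    return sum(kept) / len(kept)
```

With ten measurements, three are removed from each tail and four remain, so a single extreme value does not shift the result.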
The next step was to calculate coordinates of all beads separated by the distances calculated above. Let p_i = (x_i, y_i, z_i) be the 3D coordinate of the i-th bead, and let dist(p_i, p_j) denote the Euclidean distance between p_i and p_j. Let δ_{i,j} be the 3D distance converted from the physical proximity value between two genomic sections i and j. All bead coordinates were found by minimizing the sum of squared differences between dist(p_i, p_j) and δ_{i,j}, as described by:
$$\min_{p} \sum_{i,j} \left( \mathrm{dist}(p_i, p_j) - \delta_{i,j} \right)^2$$
This minimization was performed under the following five constraints:
Without loss of generality, we set the origin (0,0,0) as the center of the sphere.
where p_c corresponds to the position of the centromere. There are three p_c representing the three centromeres.
where p_t corresponds to the position of the telomere. There are six p_t representing the telomeres.
No constraints were applied for inter-chromosomal associations. Applying the above five constraints, the minimization was solved by AMPL software with IPOPT solver (21). The 3D structure of the entire fission yeast genome was built by smoothly interpolating the obtained 3D coordinates of the 622 beads. The modeled structure was drawn by Pymol (22). The model structure is accessible at the Wistar website (http://www.wistar.org/research_facilities/noma/pubdata.htm).
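The objective minimized above can be illustrated with a small sketch. It does not reproduce the AMPL/IPOPT setup: the five constraints are omitted, and plain gradient descent is substituted for the interior-point solver, so it shows only the unconstrained least-squares core. The function names, step size and number of steps are hypothetical.

```python
import math
import random


def stress(coords, targets):
    """Sum over bead pairs of (dist(p_i, p_j) - delta_ij)^2."""
    return sum((math.dist(coords[i], coords[j]) - d) ** 2 for (i, j), d in targets.items())


def gradient_step(coords, targets, rate=0.01):
    """One unconstrained gradient-descent step on the stress function."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for (i, j), d in targets.items():
        dist = math.dist(coords[i], coords[j])
        if dist == 0.0:
            continue  # gradient is undefined for coincident beads
        scale = 2.0 * (dist - d) / dist
        for k in range(3):
            g = scale * (coords[i][k] - coords[j][k])
            grads[i][k] += g
            grads[j][k] -= g
    return [[c - rate * g for c, g in zip(p, gp)] for p, gp in zip(coords, grads)]


def fit(targets, n_beads, steps=2000, seed=0):
    """Start from random coordinates and descend for a fixed number of steps."""
    rng = random.Random(seed)
    coords = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_beads)]
    for _ in range(steps):
        coords = gradient_step(coords, targets)
    return coords
```

For example, `fit({(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}, 3)` places three beads at the corners of an approximately equilateral triangle. A constrained solver is still needed to keep beads inside the nucleus and to anchor centromeres and telomeres.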
FISH experiments were performed as described (23). To generate FISH probes, cosmid, plasmid or PCR-derived DNA fragments were labeled by incorporating Cy3-dCTP or Cy5-dCTP (GE Healthcare) using a random primer DNA labeling kit (Takara). Cosmid clones were obtained from the Sanger Institute. The cosmid cos212 and the plasmid pRS140 were used for preparing FISH probes specific to telomeres and centromeres, respectively. Stained cells were analyzed by a Zeiss Axioimager Z1 fluorescence microscope with oil immersion objective lens (Plan Apochromat, 100×, NA 1.4, Zeiss). Images were acquired at 0.2µm intervals in the z-axis and deconvolved by Axiovision 4.6.3 software (Zeiss). More than 100 cells were analyzed for each experiment.
Total RNA was extracted from cells as described previously (24). The total RNA sample (~5µg) was treated with 10 U of DNase I (Promega) at 37°C for 40min, to remove contaminating genomic DNA and then purified by phenol/chloroform extraction. The resultant RNA sample was subjected to microarray analysis. Microarray experiments were conducted as described in the Nugen ovation manual and the Affymetrix genechip expression analysis technical manual. Briefly, 100ng of total RNA was reverse transcribed by poly(T) nucleotides and cDNA was amplified by Ovation RNA amplification system v2 (Nugen Technologies). The amplified cDNA was biotinylated by Fl-ovation cDNA biotin module v2, followed by hybridization to Yeast genome 2.0 genechips (Affymetrix) at 45°C for 16h. The array was washed with low (6× SSPE) and high (100mM MES, 0.1M NaCl) stringency buffers, and stained with streptavidin-phycoerythrin. Fluorescence signal was amplified by the addition of biotinylated anti-streptavidin and an additional aliquot of streptavidin–phycoerythrin stain. A confocal scanner was used to scan microarrays at excitation 570nm. For initial data analysis, an Affymetrix command console was used to quantitate expression levels for targeted genes. Microarray data preprocessing, including normalization and background correction, was performed by the Mas5.0 software. The expression data have been deposited at NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) and are accessible through the GEO accession # GSE15108.
Based on the microarray data, expression levels were assigned to the genes. To analyze the relationship between expression levels of the genes and their associations, the entire fission yeast genome was first divided into 20kb sections. There are 628 sections derived from the fission yeast genome. Each 20kb section contains ~10 genes. Expression levels of the 10 genes within a section were compared, and the maximum expression level from a gene was assigned to the section. The genomic sections were ranked by the expression levels. The genomic sections corresponding to highly and poorly expressed genes were derived from the top 100 and bottom 100 sections, respectively.
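The assignment of expression levels to genomic sections can be sketched as follows. Assigning each gene to a section by a single coordinate (for example its midpoint) is an assumption of this sketch, and the function names are hypothetical.

```python
SECTION_SIZE = 20_000  # bp


def section_expression(genes):
    """Return the maximum expression level per 20 kb section.

    `genes` is an iterable of (chromosome, position_bp, expression_level).
    """
    best = {}
    for chrom, pos, level in genes:
        key = (chrom, pos // SECTION_SIZE)
        if key not in best or level > best[key]:
            best[key] = level
    return best


def top_and_bottom(section_levels, n=100):
    """Rank sections by expression and return the n highest and n lowest."""
    ranked = sorted(section_levels, key=section_levels.get, reverse=True)
    return ranked[:n], ranked[-n:]
```

Taking the maximum rather than the mean means that a single highly transcribed gene is enough to place its section in the highly expressed group.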
DNA sequences corresponding to 600bp regions upstream of target genes were obtained from the Sanger Institute FTP server. Existing software such as MDSCAN, MEME, BioInspector, GLAM, Gibbs Motif Sampler, Weeder, Priority and SCOPE were first used to search for motifs with the setting allowing arbitrary length and wild cards. These methods did not produce any consistent motifs. Therefore, a new motif search method was developed. With our method, all possible combinations of 6–9nt were searched exhaustively in the real data set and two background data sets (null models). The two null models were used to evaluate the significance of specific sequences identified in the real data set. One of the null models was DNA sequence derived from intergenic regions, while the other null model was created by randomly shuffling sequence in the real data set so that the order of the nucleotides changed but the letter content was identical to the real set. Appearance frequency of the specific sequence in the real set was compared to those in the two null models. Sequence shuffling and obtaining intergenic sequence were repeated 1000 times. DNA sequences were recognized as motifs if appearance frequencies of the specific sequences were >21% of the total sequences in the real set and the ratios between appearance frequencies in the real set and the two null models were both greater than two. Finally, the motif with the highest score was selected and used to scan the TRANSFAC database to determine whether it had been previously identified (25).
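The exhaustive search can be sketched for a single word length as below. Several simplifications are assumptions of this sketch: "appearance frequency" is read as the fraction of sequences containing the word at least once, wild cards are not handled, the shuffled null model is drawn once rather than 1000 times, and a word absent from a null set is treated as passing that ratio test. The function names are hypothetical.

```python
import random
from collections import Counter


def kmer_presence(sequences, k):
    """For each k-mer, count the sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def shuffled(sequences, rng):
    """Shuffle each sequence: letter content is kept, order is lost."""
    out = []
    for seq in sequences:
        letters = list(seq)
        rng.shuffle(letters)
        out.append("".join(letters))
    return out


def candidate_motifs(real, intergenic, k, min_fraction=0.21, min_ratio=2.0, seed=0):
    """Return k-mers frequent in `real` and enriched over both null sets."""
    rng = random.Random(seed)
    real_counts = kmer_presence(real, k)
    null_a = kmer_presence(intergenic, k)
    null_b = kmer_presence(shuffled(real, rng), k)
    hits = []
    for kmer, n in real_counts.items():
        freq = n / len(real)
        if freq <= min_fraction:
            continue
        freq_a = null_a.get(kmer, 0) / len(intergenic)
        freq_b = null_b.get(kmer, 0) / len(real)
        # A k-mer absent from a null set trivially passes that ratio test.
        if all(f == 0 or freq / f > min_ratio for f in (freq_a, freq_b)):
            hits.append(kmer)
    return sorted(hits)
```

Running the function for k = 6, 7, 8 and 9 covers the range of word lengths described above.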
In order to study the fission yeast genome organization, we have applied a modified Hi-C approach to our studies (Figure 1A). We first established that our 3C approach was suitable for analyzing the fission yeast genome organization by confirming the clustering of centromeres, as previously characterized by FISH analysis (Figure 1B) (26). The 3C sample contains hybrid DNA molecules reflecting physical associations between discrete genomic loci. In carrying out these studies, we developed a method, referred to as enrichment of ligation products (ELP), to prepare the 3C sequencing samples (Figure 1A). The ELP method involves an initial digestion of the 3C sample with a restriction enzyme (BfuCI) that recognizes a specific 4bp sequence, followed by self-ligation and further treatment with another restriction enzyme used for 3C (HindIII in our experiment). As a result, the hybrid DNA fragments ligated together during the 3C experiment can be enriched for the sequencing step due to the linearity of these DNA molecules while non-linear spurious background sequences are reduced. The ELP-processed samples were then sequenced using an Illumina Genome Analyzer II with the Paired End (PE) module (Illumina, San Diego, CA, USA). The Paired End method determines the sequences present at both ends of the single DNA molecule, and can also determine whether or not they are derived from the same contiguous DNA fragment. Paired reads from >10 to 15 million DNA molecules were then mapped to genomic positions. Paired sequences found to be located <20kb apart were first filtered out. The paired reads derived from repetitive DNA sequences such as centromeric repeats were also eliminated, even if only one end of the paired reads was assigned to repeats, because these sequences can be assigned to multiple genomic positions. The remaining sequences were then examined to identify long-range genomic associations. 
Approximately half a million paired end reads derived from a single sequencing lane remained after the above filtering processes (Figure 1C). To obtain a sufficient number of paired reads to cover the entire fission yeast genome, the ELP-prepared sample was sequenced three times. Compared with sequencing the 3C sample directly, the ELP method resulted in an ~9-fold increase in the number of paired reads representing associations between genomic loci.
To check the reproducibility of our analysis, an independent 3C sample was processed using the ELP method and again sequenced three times. A randomly ligated (RL) control sample, which does not reflect in vivo associations between genomic loci, was also processed with the ELP method and sequenced four times. Total numbers of paired reads after the filtering processes were 1.2–1.3 million for the 3C samples and 3.8 million for RL control (Supplementary Figure S1A). We found that paired reads from the RL control were not evenly distributed throughout the genome, indicating that there are obvious sequencing biases which also appeared to affect the distribution of paired reads from the 3C samples. We therefore devised an approach that determines a physical proximity value by normalizing against the distribution of paired reads from the RL control. We calculated a physical proximity value between distant 20kb DNA fragments by comparing the total count of paired reads from the 3C sample with that from the RL control (Supplementary Figure S1B). The same calculation was carried out for every combination of 20kb DNA fragments throughout the fission yeast genome. The physical proximity values from the two independent 3C samples (3C–1 and 3C–2) showed a clear correlation at a 20kb resolution (Pearson’s r=0.744, P<2.2×10⁻¹⁶), indicating that our methodology generates reproducible data (Supplementary Figure S1C). Data at 10 and 40kb resolutions showed lower and higher correlations, respectively, than data at the 20kb resolution (Supplementary Figure S1C). From here on we use the 20kb resolution data, judging that this correlation and section size are suitable for the following genomic analyses.
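The reproducibility check amounts to a Pearson correlation over the section combinations scored in both replicates. A minimal sketch, with hypothetical function names and the assumption that each replicate is stored as a mapping from section combination to proximity value:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(values_a, values_b):
    """Correlate two replicates over the combinations present in both."""
    shared = sorted(values_a.keys() & values_b.keys())
    return pearson_r([values_a[k] for k in shared], [values_b[k] for k in shared])
```

Larger sections pool more read pairs per combination, which reduces counting noise and is consistent with the higher correlation seen at 40kb.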
We plotted the physical proximity values throughout the three fission yeast chromosomes (Figure 2). The comprehensive map represents the physical proximity values between 20kb DNA fragments distributed throughout the genome. We identified specific associations among centromeres and among telomeres. These genome structures are tightly linked to chromosome dynamics, and interactions were also detected by FISH analyses (Figure 3A and B) (26). In fission yeast, heterochromatin is distributed at centromeres, telomeres and a few other loci, and euchromatin is present in the remaining domains (17). It is known that RNAi machinery is involved in associations of these heterochromatic domains (27). We next tested whether other associations indicated in the map could also be detected using FISH analysis to visualize the intra-nuclear positioning of the various genomic loci. We investigated three combinations (1, 2 and 3) indicated in Figure 2, and found that the physical proximity values correlated with FISH data (Figure 3C). We performed extensive FISH analyses on a total of 18 combinations of genomic loci, and found the physical proximity values to be very strongly correlated with the FISH data (R²=0.9065; Figure 3D). These observations support our interpretation that the physical proximity values in the map reflect global genome organization in vivo.
The three-dimensional structure of the budding yeast genome has recently been modeled using Hi-C data (14). We employed a similar approach to model the fission yeast genome structure (See ‘Materials and Methods’ section). Physical proximity values were converted into 3D distances using the conversion formula obtained by comparing physical proximity values and FISH data for 18 pairs of genomic loci (Figure 3D and Supplementary Table S1). The 3D genome structure was modeled based on the calculated distances corresponding to 115878 combinations between distant genomic loci (Figure 4A). Moreover, we validated the modeled genome structure by comparing the distances in the 3D structure to FISH data. Distances in the modeled structure and in FISH data lie near the 45° line (R²=0.8970; Figure 4B), indicating that the modeled genome structure appears to reflect the in vivo structure to some extent. However, it is important to note that the modeled structure might not perfectly match the in vivo genome structure due to technical limitations. Physical proximity values used for the modeling of the genome structure only reflect average association frequencies between genomic loci in the cell population, and do not directly represent stability of respective associations. For example, physical proximity values cannot distinguish between stable associations in a few cells and unstable associations in many cells. However, it is likely that stable associations such as telomere clustering occur in many, if not all, cells, resulting in high physical proximity values, which are major determinants for positioning of genomic loci in the modeled genome structure. This likely accounts for the modeled structure being strongly correlated with FISH data (Figure 4B).
In the modeled genome structure, we first noticed that the telomeres from chromosomes 1 and 2 were in close proximity, which was also indicated by FISH results (Figures 3B and 4A). This again suggests that the modeled genome structure at least partially reflects the in vivo structure. Interestingly, we also observed that the three chromosomes were segregated into respective domains with overlapping junctions. This chromosome segregation partially results from the strong local associations that are represented diagonally in the physical proximity map (Figure 2). Those local associations between genomic loci separated by <1 Mb contribute to self-assembly of the respective chromosomes. Moreover, the average physical proximity value for intrachromosomal associations between genomic loci separated by >1.0 Mb was 0.64, while the average physical proximity value for interchromosomal associations was 0.59. This difference should not be observed when chromosomes are randomly disposed in the nucleus, supporting chromosome segregation in fission yeast. This disposition of chromosomes in the nucleus is similar to chromosome territories observed in mammalian cells (4,28). Our results are also consistent with previous observations, in which FISH analyses indicated that chromosome territories exist in fission yeast (29). Together, our analyses suggest that the intra-nuclear disposition of the fission yeast chromosomes might to some extent be similar to the mammalian organization.
We found strong local associations that are represented diagonally in the physical proximity map (Figure 2), most likely because those DNA fragments are relatively closely positioned in the nucleus. To examine the extent of the distance effect, we plotted the average ligation frequencies between genomic loci separated by the same distances, and found that the average ligation frequencies for the 3C sample decreased gradually with distance, while the frequencies for the RL control samples were not related to the linear distances (Supplementary Figure S2). We also found that the average physical proximity values between genomic loci positioned less than ~1Mb apart decreased gradually with the distance between the two loci (Supplementary Figure S3A). The distance curves also revealed associations between left and right telomeres within the same chromosomes. Since it was possible that some local associations embedded in the map reflected specific local interactions, we tested this possibility by using the distance curves to normalize the physical proximity values (Supplementary Figure S3B). This distance normalization eliminated a major population of local associations that likely resulted from random positioning of spatially linked genomic loci (Supplementary Figure S4A). Physical proximity values above the average level (~1.0) imply that association frequencies between distant genomic sections are greater than the random association level. The distribution of physical proximity values indicated that 14 and 5% of the total combinations (180562) between 20kb genomic sections had values of more than 1.5 and 2.0, respectively (Supplementary Figure S4B). Associations that scored with physical proximity values >1.5–2.0 were likely to be detected by FISH in some populations of cells (Figure 3D).
We examined whether the distance-normalized map captures previously identified genome organizations. A previous study has shown that long-terminal repeat (LTR) retrotransposons cluster in the fission yeast nucleus (30). Our analysis also identified significant associations among DNA fragments containing LTRs (P=0.00529, 1000 permutations; Figure 5A). Although paired reads derived from repetitive DNA sequences were removed by the filtering process as described above, we were able to investigate the associations among DNA fragments containing LTRs, because HindIII sites are not present within LTRs. We found that associations with physical proximity values >1.5 were increased by 4.2%, when associations between genomic sections containing LTRs were compared to the random control considering entire genomic sections. In other words, there were 493 (4.2%) additional associations derived from a total of 11628 combinations between 153 LTR sections, as compared to the average association frequency between randomly picked genomic sections. This result again argues that the physical proximity map reflects the global genome organization in vivo. The physical proximity values are accessible at our website (See ‘Materials and Methods’ section) and can be used to identify novel genome organizations involving long-range associations. In the following sections, we exemplify how the physical proximity values can be used to investigate global genome organizations.
We examined whether gene arrangement influences association between genomic loci. We considered in total 36 gene arrangements involving 6 genes. Interestingly, genomic sections containing the specific gene arrangements tend to associate with one another in a statistically significant manner (Supplementary Figure S5). Genomic sections containing three consecutive convergent genes displayed the most significant association. Associations among genomic sections carrying two consecutive convergent genes were also significant. All the top 7 gene arrangements contained consecutive convergent genes, but the remaining 29 gene arrangements did not have any consecutive convergent genes. The gene arrangement without any convergent genes displayed the lowest average physical proximity value. These results suggest that consecutive arrangement of convergent genes is favored for associations between genomic regions. It has been shown that cohesin is recruited to convergent genes in fission yeast (31,32). In mammals, cohesin is implicated in association between genomic loci (33–37). It is possible that, in fission yeast, cohesin might be involved in the association between genomic regions containing consecutive convergent genes. In any case, our analyses suggest that gene arrangements contribute to global genome organization in fission yeast.
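The figure of 36 arrangements can be reproduced by a simple count, under an assumption that is not stated in the text: that an arrangement is the orientation pattern of six consecutive genes, and that a pattern and its reverse complement (the same locus read from the opposite strand) are treated as identical. The function names are hypothetical.

```python
from itertools import product


def canonical(arrangement):
    """Pick one representative for a pattern and its reverse complement."""
    flipped = tuple("-" if s == "+" else "+" for s in reversed(arrangement))
    return min(arrangement, flipped)


def distinct_arrangements(n_genes=6):
    """All strand-orientation patterns of n genes, up to reverse complement."""
    return {canonical(a) for a in product("+-", repeat=n_genes)}
```

Of the 2^6 = 64 raw patterns, 8 are their own reverse complement and the remaining 56 pair up, giving 8 + 28 = 36 distinct arrangements.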
To explore the influence of transcription on global genome organization, we asked whether genomic sections containing highly expressed genes associate in the nucleus. Our analysis revealed significant associations between genomic regions containing highly expressed genes, as compared to randomly selected genes serving as a control (P=0.0252, 1000 permutations; Figure 5B). Associations that scored with physical proximity values of more than 1.5 were increased by 3.5% (172 combinations) when associations between highly expressed genes were compared to the random control. In clear contrast, associations among the poorly-expressed genes were not different from the control (P=0.418, 1000 permutations; Figure 5B), suggesting that highly transcribed genes tend to associate with one another in a statistically significant manner. It has been shown that active genes are co-localized to the shared nuclear sites referred to as transcription factories in mammalian cells, although the exact functions of transcription factories and their assembly processes are still unclear (6,7,38–40). Our results suggest that highly active genes frequently co-localize at transcription factories or functionally similar entities present in the fission yeast nucleus.
We next examined whether co-regulated genes associate in the fission yeast nucleus. It has been reported that many genes in fission yeast are periodically regulated during the cell cycle (41). Those periodically transcribed genes were previously classified into four groups representing expression peaks during M, G1, S or G2 phases. Interestingly, we found that only G2 phase genes exhibited significant associations (P=0.0285, 1000 permutations), whereas genes in the other groups did not show significant associations (Figure 5C). Association frequencies among G2 genes were similar to those among LTR retrotransposons (Figure 5A and C). Associations scored with physical proximity values of more than 1.5 were increased by 3.8% (259 combinations) when associations among G2 genes were compared to the random control. It is noteworthy that the 3C samples were prepared from asynchronous cultures, which predominantly contain G2 cells (~75%). Therefore, our data suggest that in fission yeast, G2 genes tend to associate with one another when activated. Since a majority of the cells in the culture are in G2, it is possible that other underrepresented cell-cycle-regulated genes associated with M, G1 and S phase might also associate during their respective cell-cycle stages, although this requires further experimental validation.
The regulation of periodically expressed genes involves interaction with specific transcription factors (41). In examining the upstream sequences of the G2 genes, we have identified a new sequence motif, C[T/G]CGTTA, within the 600bp region upstream of 21 G2 genes (Figure 5D). The motif was frequently positioned between the transcription start site and 200bp upstream. Remarkably, G2 genes with this motif showed significantly stronger associations compared to associations among the entire G2 gene group (P=0.0152, 1000 permutations; Figure 5E), suggesting that an unidentified DNA binding protein, likely a transcription factor recognizing this G2 gene-related motif, may facilitate these associations. Consistent with this result, we found that several G2 genes containing the motif were present in proximity in the modeled genome structure (Figure 5F). Moreover, almost all G2 genes (107/118 G2 genes) contain a degenerate motif that differs from the perfect motif by one mutation. It is possible that the degenerate motifs in G2 genes might be less tightly bound by the potential factors, causing significantly enhanced associations among the entire G2 gene population compared to the random control (Figure 5C). It has recently been reported that in mouse, co-regulated genes preferentially cluster at transcription factories, and that this clustering is mediated by binding of the transcription factor Klf1 to the genes (8). Therefore, our data suggest that co-regulated genes in fission yeast associate with one another in a fashion functionally similar to the mammalian transcription factories.
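Scanning upstream regions for the C[T/G]CGTTA motif and its one-mutation variants can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes that only the given strand of an already extracted upstream sequence is scanned; it is not the motif-discovery procedure used in this study.

```python
import re

# The motif C[T/G]CGTTA written as a regular expression.
PERFECT_MOTIF = re.compile(r"C[TG]CGTTA")
# The two concrete sequences covered by the motif.
MOTIF_VARIANTS = ("CTCGTTA", "CGCGTTA")


def perfect_sites(upstream):
    """Return 0-based start positions of perfect motif matches."""
    return [m.start() for m in PERFECT_MOTIF.finditer(upstream.upper())]


def has_degenerate_site(upstream, max_mismatch=1):
    """True if any window is within `max_mismatch` substitutions of the motif.

    Perfect matches (zero mismatches) also count as hits here.
    """
    seq = upstream.upper()
    width = len(MOTIF_VARIANTS[0])
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        for variant in MOTIF_VARIANTS:
            mismatches = sum(a != b for a, b in zip(window, variant))
            if mismatches <= max_mismatch:
                return True
    return False
```

Applied to a 600bp upstream sequence per gene, `perfect_sites` would flag genes carrying the perfect motif, and `has_degenerate_site` would additionally flag genes carrying a copy with a single substitution.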
Our analyses suggested that co-regulated genes significantly associate with one another. We next expanded our study to the entire gene population and asked whether genes involved in other particular biological processes also frequently associate. To this end, we investigated the significance of the associations among groups of annotated genes classified by gene ontology in the fission yeast genome database (42,43). We analyzed 467 gene ontology groups containing 26–121 genes in 20–100 genomic sections. This range of genomic sections was chosen to avoid a high false-negative rate. We observed that genes from 23 gene ontology groups showed significant associations compared to the random controls (Figure 6A). We discarded 6 of the 23 gene ontology groups because they were clearly subgroups of other main groups that also showed significant associations. The remaining 17 gene ontology groups included metabolic process, transmembrane transporter activity, response to stimulus, regulation of Ras-GTPase activity and cell wall biogenesis.
If genes in the respective gene ontology groups associate through binding of transcription factors, then comparative sequence analyses should find conserved DNA motifs at the promoter regions. Indeed, we found new conserved DNA motifs present at the promoter regions of the genes in four of the gene ontology groups (Figure 6B). More importantly, the genes containing these DNA motifs showed significantly enhanced associations compared to associations among all gene members of the respective gene ontology groups (Figure 6B). In agreement with these results, we found that several motif-containing genes in the cellular carbohydrate catabolic process were present in proximity in the modeled genome structure (Figure 6C). We also found a similar positioning of motif-containing genes derived from the three other gene ontology groups in the modeled structure. These results suggest the importance of the DNA motifs and the potential involvement of factors binding to those motifs in facilitating associations between functionally defined genes in particular gene ontology groups.
We next investigated whether genes in many gene ontology groups might weakly associate with one another. To test this possibility, we plotted the distribution of average physical proximity values for 465 gene ontology groups and compared it to the distribution of the values for hypothetical random groups (Figure 6D). The distribution of average physical proximity values for actual gene ontology groups was significantly shifted to the right (Kolmogorov-Smirnov test P = 1.96×10^−54). The average physical proximity values of most of the gene ontology groups (97%) exceeded 0.9, whereas only about half (58%) of the hypothetical groups had values above 0.9, suggesting that genes in many gene ontology groups tend to weakly associate with one another. It has recently been suggested that individual genes in budding yeast are confined to distinct subnuclear compartments, referred to as gene territories (44). It is possible that genes in many gene ontology groups might be present at shared gene territories, although further study is needed to infer any biological functions related to the weak associations observed among genes in many gene ontology groups.
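The distribution comparison above can be illustrated with a minimal sketch of the two-sample Kolmogorov-Smirnov statistic and of the fraction of groups exceeding a threshold. The function names and inputs are hypothetical. The sketch computes only the statistic D, not a P value; a statistics library such as SciPy (`scipy.stats.ks_2samp`) would normally be used to obtain the latter.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D.

    D is the largest absolute difference between the two empirical
    cumulative distribution functions, evaluated at every observed value.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def fraction_above(values, threshold=0.9):
    """Fraction of values strictly greater than `threshold`."""
    return sum(v > threshold for v in values) / len(values)
```

Here `sample_a` would hold the average physical proximity value of each gene ontology group and `sample_b` the values for the hypothetical random groups; `fraction_above` corresponds to the comparison of the proportions of groups exceeding 0.9.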
We have demonstrated that highly transcribed genes, co-regulated genes, and genes from particular gene ontology groups tend to co-localize in the in vivo genome structure. The associations among highly transcribed genes are reminiscent of the transcription factories proposed to exist in mammals, although the functional role of such transcription factories remains unclear (4–7). It has been recently suggested that the transcription factor Klf1 is involved in the association of genes with transcription factories in mouse (8). Our study indicated that genes containing the same DNA motifs at promoter regions associate with one another at significantly enhanced frequencies, suggesting that unknown factors, likely transcription factors, play a role in gene associations. The DNA motif-dependent gene associations were observed for genes co-regulated during the cell cycle as well as for functionally defined genes in particular gene ontology groups. Our current hypothesis is that transcription factors binding to the motifs are involved in the functional organization of the global genome structure, which is suitable for coordinated expression of genes dispersed throughout the genome. Future studies that address the mechanism of DNA motif/transcription factor-mediated gene associations should lead to new insights into complex genome-wide processes in functional genome organization coupled with transcriptional regulation.
Supplementary Data are available at NAR Online.
National Institutes of Health (CA010815); and the National Institutes of Health Director’s New Innovator Award Program (1DP2OD004348-01). Funding for open access charge: National Institutes of Health Director’s New Innovator Award Program (1DP2OD004348-01).
Conflict of interest statement. None declared.
The authors would like to thank the Sanger Institute for cosmid clones, the Penn Microarray facility for microarray experiments and the Wistar Genomics and Bioinformatics facilities for high-throughput sequencing and its analysis. The authors also thank the Wistar faculty, especially Louise Showe, for comments on the article. The authors are grateful to Andrew Kossenkov, Lisa Bain and Marion Sacks for institutional assistance.