De novo microbial genome sequencing reached a turning point with third-generation sequencing (TGS) platforms, and several microbial genomes have been improved by TGS long reads. Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and it is used in the production of the traditional Japanese fermented food “natto.” The B. subtilis natto BEST195 genome was previously sequenced with short reads, but it included some incomplete regions. We resequenced the BEST195 genome using a PacBio RS sequencer and obtained a complete genome sequence as a single scaffold without any gaps; we also applied Illumina MiSeq short reads to improve sequence quality. By comparing the new sequence with the previous BEST195 draft genome and the Marburg 168 genome, we found that the incomplete regions in the previous genome sequence were attributable to GC bias and repetitive sequences, and we also identified some novel genes that are found only in the new genome.
Previous studies have shown that Meis1 plays an important role in blood development and vascular homeostasis, and can induce blood cancers, such as leukemia. However, its role in epithelia remains largely unknown. Here, we uncover two roles for Meis1 in the epidermis: as a critical regulator of epidermal homeostasis in normal tissues and as a proto-oncogenic factor in neoplastic tissues. In normal epidermis, we show that Meis1 is predominantly expressed in the bulge region of the hair follicles where multipotent adult stem cells reside, and that the number of these stem cells is reduced when Meis1 is deleted in the epidermal tissue of mice. Mice with epidermal deletion of Meis1 developed significantly fewer DMBA/TPA-induced benign and malignant tumors compared with wild-type mice, suggesting that Meis1 plays a role in both tumor development and malignant progression. This is consistent with the observation that Meis1 expression increases as tumors progress from benign papillomas to malignant carcinomas. Interestingly, we found that Meis1 localization was altered during neoplasia development. Instead of being localized to the stem cell region, Meis1 is localized to more differentiated cells in tumor tissues. These findings suggest that, during the transformation from normal to neoplastic tissues, a functional switch occurs in Meis1.
Metagenome assembly is challenging because multiple genomes must be assembled from mixed reads of multiple species. An assembler designed for single genomes does not adapt well to this setting. Genovo is a de novo assembler for metagenomes based on a generative probabilistic model. Genovo assembles all reads without discarding any reads in a preprocessing step, and is therefore able to extract more information from metagenomic data and, in principle, generate better assembly results. Paired-end sequencing is now widely used, yet Genovo was designed for 454 single-end reads. In this research, we extended Genovo to incorporate paired-end information; the resulting assembler, named Xgenovo, is intended to generate higher-quality assemblies from paired-end reads.
First, we extended Genovo by adding a bonus parameter to the Chinese Restaurant Process, which serves as the prior that accounts for the unknown number of genomes in the sample. This bonus parameter encourages the two reads of a pair to be placed in the same contig and is intended to address chimeric contigs. Second, we modified the process for sampling the location of a read in a contig. For the number of trials in the symmetric geometric distribution, we used a relative distance instead of the distance between the offset and the center of the contig used in Genovo. With this relative distance, a read sampled at the appropriate location has a higher probability, so reads are more likely to be mapped to the correct location.
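The effect of a pairing bonus on a Chinese Restaurant Process prior can be illustrated with a minimal sketch. The abstract does not state how the bonus enters the prior, so the additive form below, the function name assignment_probabilities, and all numeric values are assumptions made for illustration only; this is not the Xgenovo implementation.

```python
NEW_CONTIG = "new"  # hypothetical label for "open a new contig"


def assignment_probabilities(contig_sizes, alpha, mate_contig=None, bonus=0.0):
    """Prior probability of placing a read in each existing contig or a new one.

    Plain CRP: an existing contig is weighted by its read count and a new
    contig by alpha. Assumption: the bonus is simply added to the weight of
    the contig that already holds the read's mate.
    """
    weights = {contig: float(size) for contig, size in contig_sizes.items()}
    if mate_contig in weights:
        weights[mate_contig] += bonus
    weights[NEW_CONTIG] = alpha
    total = sum(weights.values())
    return {contig: weight / total for contig, weight in weights.items()}


# Made-up example: contig 0 holds 3 reads, contig 1 holds 1 read.
sizes = {0: 3, 1: 1}
print(assignment_probabilities(sizes, alpha=1.0))
print(assignment_probabilities(sizes, alpha=1.0, mate_contig=1, bonus=5.0))
```

Without the bonus the larger contig dominates; with the bonus the contig that holds the mate becomes the most probable choice, which is the behavior the paragraph above describes.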
Results of extensive experiments on simulated metagenomic datasets, ranging from simple to complex and with species coverage following uniform and lognormal distributions, showed that Xgenovo can be superior to the original Genovo and to MAP, a recently proposed metagenome assembler for 454 reads. Xgenovo generated a longer N50 than Genovo and MAP while maintaining assembly quality, even for very complex metagenomic datasets consisting of 115 species. Xgenovo also showed the potential to decrease the computational cost. These results indicate that our strategy worked well. The software and all simulated datasets are publicly available online at http://xgenovo.dna.bio.keio.ac.jp.
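For readers unfamiliar with the N50 statistic used in this comparison, the following sketch computes it under the usual definition: the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly length. The contig lengths are made up for illustration and are not taken from the experiments.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length


contig_lengths = [100, 80, 50, 30, 20, 10]  # hypothetical values
print(n50(contig_lengths))  # prints 80
```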
Genovo; 454 paired end reads; de novo metagenomic assembler
During neocortical development, excitatory neurons are produced from apical progenitors in the ventricular zone (VZ) or from dividing cells in the subventricular zone (SVZ). We previously reported that the direct progenies of VZ cells in mice slowly exit the VZ and accumulate just above the VZ (lower SVZ) as multipolar migrating neurons, whereas subsequently dividing cells in the SVZ exit the VZ earlier than the former and become widely distributed in the SVZ. These two populations are named the slowly exiting population (SEP) and the rapidly exiting population (REP), respectively. In mice, REP cells include basal progenitors as the major population and are characterized by a long ascending process; their morphology resembles that of basal radial glial cells (bRGs), which have been observed in the inner and outer SVZ in primates. The dramatic increase in the number of bRGs in primates, especially in humans, is thought to underlie the acquisition of a huge cortex during evolution. We previously reported that the REP/SEP production rate in the lateral cortical VZ is higher than that in the dorsomedial VZ in mice. To search for molecules responsible for the higher REP production in the lateral cortical VZ, we conducted microarray analyses and identified genes that were differentially expressed between the lateral and medial VZs in mice. These genes were considered to be among the candidates responsible for the regulation of the REP/SEP production rate. To investigate the selection pressures during primate evolution on these candidate genes, we estimated the synonymous vs. non-synonymous base substitution rates. As a result, the negative selection pressures on the Megf11, Dmrt3, and Cntn3 genes were found to be significantly weaker in primates than in non-primates, while those on Jag1, Ntrk2, and Pmp22 were stronger. Candidate molecules responsible for primate cortical expansion through an increase in bRGs may be included among these genes.
neural stem cells; human evolution; notch signaling; selection bias; subventricular zone (SVZ)
Prediction of biochemical (metabolic) pathways has a wide range of applications, including the optimization of drug candidates and the elucidation of toxicity mechanisms. Recently, several methods have been developed for pathway prediction to derive a goal compound from a start compound. However, these methods incur high computational costs and cannot perform comprehensive prediction of novel metabolic pathways. The aim of this study is to develop a prediction method for reconstructing metabolic pathways and predicting unknown biosynthetic pathways that is de novo in the sense that it does not require any initial network, such as the KEGG metabolic network, to be explored.
We formulated pathway prediction between a start compound and a goal compound as a shortest path search problem in terms of the number of enzyme reactions applied. We propose an efficient search method based on the A* algorithm and heuristic techniques that use Linear Programming (LP) solutions to estimate the distance to the goal. First, a chemical compound is represented by a feature vector that counts the frequencies of substructure occurrences in the structural formula. Second, an enzyme reaction is represented as an operator vector by detecting the structural changes between the compounds before and after the reaction. By defining compound vectors as nodes and operator vectors as edges, prediction of the reaction pathway is reduced to a shortest path search problem in the vector space. In experiments on the DDT degradation pathway, we verify that the shortest paths predicted by our method are biologically correct pathways registered in the KEGG database. The results also demonstrate that the LP heuristics achieve a significant reduction in computation time. Furthermore, we apply our method to a secondary metabolite pathway of plant origin and successfully find a novel biochemical pathway that cannot be predicted by the existing method. For the reconstruction of a known biochemical pathway, our method is over 40 times as fast as the existing method.
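The vector-space formulation above can be sketched as a small A* search. This is an illustration under stated assumptions, not the method of the paper: the three features, the operator names and their vectors are hypothetical toy values, and the distance estimate is a simple norm-based lower bound swapped in for the LP-based heuristic, which the abstract does not specify in detail.

```python
import heapq
from math import ceil, inf


def shortest_pathway(start, goal, operators, max_steps=10):
    """A* over compound vectors; returns a list of operator names or None.

    Nodes are tuples of substructure counts, edges add an operator vector,
    and every edge costs one reaction step. Operators must be non-zero.
    """
    # Largest total change (L1 norm) that any single operator can make.
    max_change = max(sum(abs(x) for x in op) for op in operators.values())

    def estimate(vector):
        # Lower bound on remaining steps: no step can reduce the L1
        # distance to the goal by more than max_change.
        remaining = sum(abs(g - v) for g, v in zip(goal, vector))
        return ceil(remaining / max_change)

    frontier = [(estimate(start), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, cost, vector, path = heapq.heappop(frontier)
        if vector == goal:
            return path
        if cost >= max_steps:
            continue
        for name, op in operators.items():
            nxt = tuple(v + d for v, d in zip(vector, op))
            if min(nxt) < 0:  # substructure counts cannot be negative
                continue
            if cost + 1 < best_cost.get(nxt, inf):
                best_cost[nxt] = cost + 1
                priority = cost + 1 + estimate(nxt)
                heapq.heappush(frontier, (priority, cost + 1, nxt, path + [name]))
    return None


# Hypothetical features: (chlorine atoms, hydroxyl groups, C=C double bonds)
toy_operators = {
    "dechlorinate": (-1, 0, 0),
    "substitute_oh": (-1, 1, 0),
    "eliminate_hcl": (-1, 0, 1),
}
print(shortest_pathway((3, 0, 0), (1, 1, 0), toy_operators))
```

Because the estimate never overestimates the number of remaining steps, the first path that reaches the goal is a shortest one; a tighter bound such as the LP-based estimate described above would expand fewer nodes.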
Our method enables fast and accurate de novo pathway predictions and novel pathway detection.
An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
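As a rough illustration of the coverage signal mentioned above, the sketch below builds a de Bruijn graph from reads and counts how often each edge is observed. It uses the simplified textbook formulation, with (k-1)-mers as nodes, k-mers as edges and the forward strand only; it is not the data structure of Velvet or MetaVelvet, and the reads are made up.

```python
from collections import Counter


def de_bruijn_edges(reads, k):
    """Count each k-mer as an edge between its prefix and suffix (k-1)-mers."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


reads = ["ACGT", "ACGT", "CGTA"]  # hypothetical reads
for (source, target), count in sorted(de_bruijn_edges(reads, 3).items()):
    print(source, "->", target, "coverage", count)
```

In a mixed sample, edges that come from an abundant species accumulate higher counts than edges from a rare one, which is the kind of coverage difference that can be combined with graph connectivity to split the graph into sub-graphs.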
Identification of the target proteins of bioactive compounds is critical for elucidating the mode of action; however, target identification has been difficult in general, mostly due to the low sensitivity of detection using affinity chromatography followed by CBB staining and MS/MS analysis.
We applied our protocol of predicting target proteins combining in silico screening and experimental verification for incednine, which inhibits the anti-apoptotic function of Bcl-xL by an unknown mechanism. One hundred eighty-two target protein candidates were computationally predicted to bind to incednine by the statistical prediction method, and the predictions were verified by in vitro binding of incednine to seven proteins, whose expression can be confirmed in our cell system.
As a result, the computational predictions achieved 40% accuracy, and we newly identified 3 incednine-binding proteins.
This study shows that our proposed protocol for predicting target proteins, which combines in silico screening and experimental verification, is useful, and it provides new insight into strategies for identifying target proteins of small molecules.
Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task for not only molecular cell biology but also bioinformatics. Secondary structures of ncRNAs are employed as a key feature of ncRNA analysis since biological functions of ncRNAs are deeply related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measures are not suitable for identifying ncRNAs in some cases, including genome-wide searches, and incur a high false discovery rate. In this study, we propose improved measures based on SCI and BPD, applying generalized centroid estimators to provide robustness against low-quality multiple alignments. Our experiments show that our proposed methods achieve higher accuracy than the original SCI and BPD for not only human-curated structural alignments but also low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires twice as much computation time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structures.
structure conservation index; centroid estimators; non-coding RNAs
Clustering of unannotated transcripts is an important task to identify novel families of noncoding RNAs (ncRNAs). Several hierarchical clustering methods have been developed using similarity measures based on the scores of structural alignment. However, the high computational cost of exact structural alignment requires these methods to employ approximate algorithms. Such heuristics degrade the quality of clustering results, especially when the similarity among family members is not detectable at the primary sequence level.
We describe a new similarity measure for the hierarchical clustering of ncRNAs. The idea is that the reliability of approximate algorithms can be improved by utilizing the information of suboptimal solutions in their dynamic programming frameworks. We approximate structural alignment in a more simplified manner than the existing methods. Instead, our method utilizes all possible sequence alignments and all possible secondary structures, whereas the existing methods only use one optimal sequence alignment and one optimal secondary structure. We demonstrate that this strategy can achieve the best balance between the computational cost and the quality of the clustering. In particular, our method can keep its high performance even when the sequence identity of family members is less than 60%.
Our method enables fast and accurate clustering of ncRNAs. The software is available for download at http://bpla-kernel.dna.bio.keio.ac.jp/clustering/.
Appropriate regulation of respective gene expressions is a bottleneck for the realization of artificial biological systems inside living cells. The modification of several promoter sequences is required to achieve appropriate regulation of the systems. However, a time-consuming process is required for the insertion of an operator, a binding site of a protein for gene expression, to the gene regulatory region of a plasmid. Thus, a standardized method for integrating operator sequences to the regulatory region of a plasmid is required.
We developed a standardized method for integrating operator sequences to the regulatory region of a plasmid and constructed a synthetic promoter that functions as a genetic AND gate. By standardizing the regulatory region of a plasmid and the operator parts, we established a platform for modular assembly of the operator parts. Moreover, by assembling two different operator parts on the regulatory region, we constructed a regulatory device with an AND gate function.
We implemented a new standard to assemble operator parts for construction of functional genetic logic gates. The logic gates at the molecular scale have important implications for reprogramming cellular behavior.
Computational prediction of noncoding RNAs (ncRNAs) is an important task in the post-genomic era. One common approach is to utilize the profile information contained in alignment data rather than single sequences. However, this strategy involves the possibility that the quality of input alignments can influence the performance of prediction methods. Therefore, the evaluation of the robustness against alignment errors is necessary as well as the development of accurate prediction methods.
We describe a new method, called Profile BPLA kernel, which predicts ncRNAs from alignment data in combination with support vector machines (SVMs). Profile BPLA kernel is an extension of base-pairing profile local alignment (BPLA) kernel which we previously developed for the prediction from single sequences. By utilizing the profile information of alignment data, the proposed kernel can achieve better accuracy than the original BPLA kernel. We show that Profile BPLA kernel outperforms the existing prediction methods which also utilize the profile information using the high-quality structural alignment dataset. In addition to these standard benchmark tests, we extensively evaluate the robustness of Profile BPLA kernel against errors in input alignments. We consider two different types of error: first, that all sequences in an alignment are actually ncRNAs but are aligned ignoring their secondary structures; second, that an alignment contains unrelated sequences which are not ncRNAs but still aligned. In both cases, the effects on the performance of Profile BPLA kernel are surprisingly small. Especially for the latter case, we demonstrate that Profile BPLA kernel is more robust compared to the existing prediction methods.
Profile BPLA kernel provides a promising way for identifying ncRNAs from alignment data. It is more accurate than the existing prediction methods, and can keep its performance under the practical situations in which the quality of input alignments is not necessarily high.
With the number of available genome sequences increasing rapidly, the magnitude of sequence data required for multiple-genome analyses is a challenging problem. When large-scale rearrangements break the collinearity of gene orders among genomes, genome comparison algorithms must first identify sets of short well-conserved sequences present in each genome, termed anchors. Previously, anchor identification among multiple genomes has been achieved using pairwise alignment tools like BLASTZ through progressive alignment tools like TBA, but the computational requirements for sequence comparisons of multiple genomes quickly become a limiting factor as the number and scale of genomes grow.
Our algorithm, named Murasaki, makes it possible to identify anchors within multiple large sequences on the scale of several hundred megabases in a few minutes using a single CPU. Two advanced features of Murasaki are (1) adaptive hash function generation, which enables efficient use of arbitrary mismatch patterns (spaced seeds) and therefore the comparison of multiple mammalian genomes in a practical amount of computation time, and (2) parallelizable execution that decreases the required wall-clock and CPU times. Murasaki can perform a sensitive anchoring of eight mammalian genomes (human, chimp, rhesus, orangutan, mouse, rat, dog, and cow) in 21 hours CPU time (42 minutes wall time). This is the first single-pass in-core anchoring of multiple mammalian genomes. We evaluated Murasaki by comparing it with the genome alignment programs BLASTZ and TBA. We show that Murasaki can anchor multiple genomes in near linear time, compared to the quadratic time requirements of BLASTZ and TBA, while improving overall accuracy.
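The idea of anchoring with spaced seeds can be shown with a minimal sketch. It indexes every window of each sequence by the bases at the "1" positions of a mismatch pattern and reports the keys that occur in all sequences. The function names, the pattern and the sequences are hypothetical; Murasaki's adaptive hash function generation, reverse-strand handling and parallel execution are not reproduced here.

```python
from collections import defaultdict


def seed_key(sequence, start, pattern):
    """Keep only the bases at the '1' positions of the spaced-seed pattern."""
    return "".join(sequence[start + i] for i, c in enumerate(pattern) if c == "1")


def candidate_anchors(sequences, pattern):
    """Map each seed key found in every sequence to its start positions."""
    index = defaultdict(lambda: defaultdict(list))
    for name, sequence in sequences.items():
        for start in range(len(sequence) - len(pattern) + 1):
            index[seed_key(sequence, start, pattern)][name].append(start)
    return {
        key: dict(hits)
        for key, hits in index.items()
        if len(hits) == len(sequences)  # present in all input sequences
    }


# Made-up sequences that differ at the third base; "0" marks a position
# where a mismatch is tolerated.
toy = {"a": "ACGTAC", "b": "ACTTAC"}
print(candidate_anchors(toy, "1101"))  # prints {'ACT': {'a': [0], 'b': [0]}}
```

Each sequence is scanned once and matches are found through the shared index rather than through pairwise comparison, which is the general reason hash-based anchoring scales better than all-against-all alignment.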
Murasaki provides an open source platform to take advantage of long patterns, cluster computing, and novel hash algorithms to produce accurate anchors across multiple genomes with computational efficiency significantly greater than existing methods. Murasaki is available under GPL at http://murasaki.sourceforge.net.
Bacillus subtilis natto is closely related to the laboratory standard strain B. subtilis Marburg 168, and functions as a starter for the production of the traditional Japanese food "natto" made from soybeans. Although re-sequencing whole genomes of several laboratory domesticated B. subtilis 168 derivatives has already been attempted using short read sequencing data, the assembly of the whole genome sequence of a closely related strain, B. subtilis natto, from very short read data is more challenging, particularly with our aim to assemble one fully connected scaffold from short reads around 35 bp in length.
We applied a comparative genome assembly method, which combines de novo assembly and reference guided assembly, to one of the B. subtilis natto strains. We successfully assembled 28 scaffolds and managed to avoid substantial fragmentation. Completion of the assembly through long PCR experiments resulted in one connected scaffold for B. subtilis natto. Based on the assembled genome sequence, our orthologous gene analysis between natto BEST195 and Marburg 168 revealed that 82.4% of 4375 predicted genes in BEST195 are one-to-one orthologous to genes in 168, with two genes in-paralog, 3.2% are deleted in 168, 14.3% are inserted in BEST195, and 5.9% of genes present in 168 are deleted in BEST195. The natto genome contains the same alleles in the promoter region of degQ and the coding region of swrAA as the wild strain, RO-FF-1.
These are specific for γ-PGA production ability, which is related to natto production. Further, the B. subtilis natto strain completely lacks a polyketide synthesis operon, has a disrupted plipastatin production operon, and possesses previously unidentified transposases.
The determination of the whole genome sequence of Bacillus subtilis natto provided detailed analyses of a set of genes related to natto production, demonstrating the number and locations of insertion sequences that B. subtilis natto harbors but B. subtilis 168 lacks. Multiple genome-level comparisons among five closely related Bacillus species were also carried out. The determined genome sequence of B. subtilis natto and gene annotations are available from the Natto genome browser http://natto-genome.org/.
Whole-genome sequence analysis of Mycobacterium leprae has revealed a limited number of protein-coding genes, with half of the genome composed of pseudogenes and noncoding regions. We previously showed that some M. leprae pseudogenes are transcribed at high levels and that their expression levels change following infection. In order to clarify the RNA expression profile of the M. leprae genome, a tiling array in which overlapping 60-mer probes cover the entire 3.3-Mbp genome was designed. The array was hybridized with M. leprae RNA from the SHR/NCrj-rnu nude rat, and the results were compared to results from an open reading frame array and confirmed by reverse transcription-PCR. RNA expression was detected from genes, pseudogenes, and noncoding regions. The signal intensities obtained from noncoding regions were higher than those from pseudogenes. Expressed noncoding regions include the M. leprae unique repetitive sequence RLEP and other sequences without any homology to known functional noncoding RNAs. Although the biological functions of RNA transcribed from M. leprae pseudogenes and noncoding regions are not known, RNA expression analysis will provide insights into the bacteriological significance of the species. In addition, our study suggests that M. leprae will be a useful model organism for the study of the molecular mechanism underlying the creation of pseudogenes and the role of microRNAs derived from noncoding regions.
Predictions of interactions between target proteins and potential leads are of great benefit in the drug discovery process. We present a comprehensively applicable statistical prediction method for interactions between any proteins and chemical compounds, which requires only protein sequence data and chemical structure data and utilizes the statistical learning method of support vector machines. In order to realize reasonable comprehensive predictions which can involve many false positives, we propose two approaches for reduction of false positives: (i) efficient use of multiple statistical prediction models in the framework of two-layer SVM and (ii) reasonable design of the negative data to construct statistical prediction models. In two-layer SVM, outputs produced by the first-layer SVM models, which are constructed with different negative samples and reflect different aspects of classifications, are utilized as inputs to the second-layer SVM. In order to design negative data which produce fewer false positive predictions, we iteratively construct SVM models or classification boundaries from positive and tentative negative samples and select additional negative sample candidates according to pre-determined rules. Moreover, in order to fully utilize the advantages of statistical learning methods, we propose a strategy to effectively feed experimental results back into computational predictions with consideration of biological effects of interest. We show the usefulness of our approach in predicting potential ligands binding to human androgen receptors from more than 19 million chemical compounds and verifying these predictions by in vitro binding. Moreover, we utilize this experimental validation as feedback to enhance subsequent computational predictions, and experimentally validate these predictions again.
This efficient procedure of the iteration of the in silico prediction and in vitro or in vivo experimental verifications with the sufficient feedback enabled us to identify novel ligand candidates which were distant from known ligands in the chemical space.
This work describes a statistical method that identifies chemical compounds binding to a target protein given the sequence of the target or distinguishes proteins to which a small molecule binds given the chemical structure of the molecule. As our method can be utilized for virtual screening that seeks lead compounds in drug discovery, we showed the usefulness of our method in its application to the comprehensive prediction of ligands binding to human androgen receptors and in vitro experimental verification of its predictions. In contrast to most previous virtual screening studies which predict chemical compounds of interest mainly with 3D structure-based methods and experimentally verify them, we proposed a strategy to effectively feed experimental results back into subsequent predictions and applied the strategy to the second predictions followed by the second experimental verification. This feedback strategy makes full use of statistical learning methods and, in practical terms, gave a ligand candidate of interest that structurally differs from known drugs. We hope that this paper will encourage reevaluation of statistical learning methods in virtual screening and that the utilization of statistical methods with efficient feedback strategies will contribute to the acceleration of drug discovery.
Chaperone therapy is a newly developed molecular approach to lysosomal diseases, a group of human genetic diseases causing severe brain damage. We found two valienamine derivatives, N-octyl-4-epi-β-valienamine (NOEV) and N-octyl-β-valienamine (NOV), as promising therapeutic agents for human β-galactosidase deficiency disorders (mainly GM1-gangliosidosis) and β-glucosidase deficiency disorders (Gaucher disease), respectively. We briefly reviewed the historical background of research in carbasugar glycosidase inhibitors. Originally NOEV and NOV had been discovered as competitive inhibitors, and then their paradoxical bioactivities as chaperones were confirmed in cultured fibroblasts from patients with these disorders. Subsequently GM1-gangliosidosis model mice were developed and useful for experimental studies. Orally administered NOEV entered the brain through the blood-brain barrier, enhanced β-galactosidase activity, reduced substrate storage, and improved neurological deterioration clinically. Furthermore, we executed computational analysis for prediction of molecular interactions between β-galactosidase and NOEV. Some preliminary results of computational analysis of molecular interaction mechanism are presented in this article. NOV also showed the chaperone effect toward several β-glucosidase gene mutations in Gaucher disease. We hope chaperone therapy will become available for some patients with GM1-gangliosidosis, Gaucher disease, and potentially other lysosomal storage diseases with central nervous system involvement.
Chaperone; Valienamine; Lysosomal disease; Lysosomal enzyme; β-Galactosidase; β-Glucosidase; GM1-gangliosidosis; Gaucher disease
Despite the accumulating research on noncoding RNAs (ncRNAs), it is likely that we are seeing only the tip of the iceberg regarding our understanding of the functions and the regulatory roles served by ncRNAs in cellular metabolism, pathogenesis and host-pathogen interactions. Therefore, more powerful computational and experimental tools for analyzing ncRNAs need to be developed. To this end, we propose novel kernel functions, called base-pairing profile local alignment (BPLA) kernels, for analyzing functional ncRNA sequences using support vector machines (SVMs). We extend the local alignment kernels for amino acid sequences in order to handle RNA sequences by using STRAL's scoring function, which takes into account sequence similarities as well as upstream and downstream base-pairing probabilities, thus enabling us to model secondary structures of RNA sequences. As a test of the performance of BPLA kernels, we applied our kernels to the problem of discriminating members of an RNA family from nonmembers using SVMs. The results indicated that the discrimination ability of our kernels is stronger than that of other existing methods. Furthermore, we demonstrated the applicability of our kernels to the problem of genome-wide search of snoRNA families in the Caenorhabditis elegans genome, and confirmed by qRT-PCR that the expression is valid in 14 out of 48 of our predicted candidates. Finally, six highly expressed candidates were identified as the original target regions by DNA sequencing.
Recent discoveries of a large variety of important roles for non-coding RNAs (ncRNAs) have been reported by numerous researchers. In order to analyze ncRNAs by kernel methods including support vector machines, we propose stem kernels as an extension of string kernels for measuring the similarities between two RNA sequences from the viewpoint of secondary structures. However, applying stem kernels directly to large data sets of ncRNAs is impractical due to their computational complexity.
We have developed a new technique based on directed acyclic graphs (DAGs) derived from base-pairing probability matrices of RNA sequences that significantly increases the computation speed of stem kernels. Furthermore, we propose profile-profile stem kernels for multiple alignments of RNA sequences which utilize base-pairing probability matrices for multiple alignments instead of those for individual sequences. Our kernels outperformed the existing methods with respect to the detection of known ncRNAs and kernel hierarchical clustering.
Stem kernels can be utilized as a reliable similarity measure of structural RNAs, and can be used in various kernel-based applications.
We present web servers for analysis of non-coding RNA sequences on the basis of their secondary structures. Software tools for structural multiple sequence alignments, structural pairwise sequence alignments and structural motif findings are available from the integrated web server and the individual stand-alone web servers. The servers are located at http://software.ncrna.org, along with the information for the evaluation and downloading. This website is freely available to all users and there is no login requirement.
Cooperative transcriptional activation among multiple transcription factors (TFs) is important for understanding the mechanisms of complex transcriptional regulation in eukaryotes. Previous studies have attempted to find cooperative TFs based on gene expression data with gene expression profiles as a measure of similarity of gene regulations. In this paper, we use protein–protein interaction data to infer synergistic binding of cooperative TFs. Our fundamental idea is based on the assumption that genes contributing to a similar biological process are regulated under the same control mechanism. First, the protein–protein interaction networks are used to calculate the similarity of biological processes among genes. Second, we integrate this similarity and the chromatin immuno-precipitation data to identify cooperative TFs. Our computational experiments in yeast show that our method successfully identified eight pairs of cooperative TFs that are supported by evidence in the literature but could not be identified by the previous method. Further, 12 new possible pairs have been inferred, and we have examined their biological relevance. However, since a typical problem with protein–protein interaction data is that they contain many false positives, we propose a method combining various biological data to increase the prediction accuracy.