Multiple types of data are now available for identifying disease genes, including gene-disease associations, disease phenotype similarities, protein-protein interactions, pathways, and gene expression profiles. Integrating these different kinds of biological data is believed to be an effective way to identify disease genes.
In this paper, we propose a multiple data integration method based on the theory of Markov random field (MRF) and the method of Bayesian analysis for identifying human disease genes. The proposed method is not only flexible in easily incorporating different kinds of data, but also reliable in predicting candidate disease genes.
Numerical experiments are carried out by integrating known gene-disease associations, protein complexes, protein-protein interactions, pathways and gene expression profiles. Predictions are evaluated by the leave-one-out method. The proposed method achieves an AUC score of 0.743 when integrating all those biological data in our experiments.
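The AUC used to summarize leave-one-out predictions can be computed directly from the scores assigned to held-out genes. The following is a minimal sketch, not the evaluation code used in the experiments; the function name `auc_score` and the example scores and labels are hypothetical.

```python
def auc_score(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    scores: predicted scores, higher means more likely a disease gene.
    labels: 1 for a known disease gene, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for six held-out genes (illustrative values only).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_score(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.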
disease gene; data integration; Markov random field; Bayesian analysis
Tissue specificity is an important aspect of many genetic diseases, since a given disorder typically affects only a few tissues. Tissue specificity is therefore important in identifying disease-gene associations. This paper discusses the impact of using tissue specificity in predicting new disease-gene associations, and how to use tissue specificity together with phenotype information for a particular disease.
To assess the impact of tissue specificity on predicting new disease-gene associations, this study proposes a novel method, called tissue-specified genes, to construct tissue-specific gene-gene networks for different tissue samples. These networks are then used with phenotype details to predict disease genes using the Katz method. The proposed method was compared with three other tissue-specific network construction methods to check its effectiveness. Furthermore, to check whether a tissue-specific gene-gene network can always be used in place of a generic protein-protein network, the results are compared with three other methods.
Under leave-one-out cross validation, the mean enrichment and ROC curves indicate that the proposed approach outperforms existing network construction methods. Furthermore, tissue-specific gene-gene networks have a more positive impact on predicting disease-gene associations than generic protein-protein interaction networks.
In conclusion, integrating tissue-specific data enabled more effective prediction of known and unknown disease-gene associations for a particular disease. Hence it is better to use a tissue-specific gene-gene network whenever possible. In addition, the proposed method is a better way of constructing tissue-specific gene-gene networks.
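The Katz measure scores a node by summing the walks of every length that connect it to a set of seed nodes, with longer walks damped by a factor beta per step. The sketch below truncates the sum at a fixed walk length and works on a single gene-gene network. It is an illustration under those assumptions, not the paper's pipeline (which also uses phenotype information); the function name, `beta`, `max_len`, and the toy network are hypothetical.

```python
def katz_scores(adj, seeds, beta=0.1, max_len=3):
    """Truncated Katz scores: sum over l = 1..max_len of beta**l * (A**l @ s).

    adj:   symmetric adjacency matrix as a list of lists.
    seeds: set of node indices for genes already linked to the disease.
    """
    n = len(adj)
    walk = [1.0 if i in seeds else 0.0 for i in range(n)]
    scores = [0.0] * n
    for length in range(1, max_len + 1):
        # One more multiplication by A counts walks that are one step longer.
        walk = [sum(adj[i][j] * walk[j] for j in range(n)) for i in range(n)]
        scores = [s + (beta ** length) * w for s, w in zip(scores, walk)]
    return scores


# Toy path network 0 - 1 - 2 - 3, with gene 0 as the only known disease gene.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = katz_scores(adj, seeds={0})
```

Genes closer to the seed receive higher scores, so ranking candidates by score puts network neighbors of known disease genes first.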
As an abstract mapping of the gene regulations in the cell, gene regulatory network is important to both biological research study and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results.
A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets as well as to take the robustness to large error or outliers into account. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with scores. The proposed method is applied to both simulation datasets and real experimental datasets. It shows that Huber group LASSO outperforms the group LASSO in terms of both areas under receiver operating characteristic curves and areas under the precision-recall curves.
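Two standard building blocks sit behind a Huber group LASSO objective: the Huber loss, which grows quadratically for small residuals and linearly for large ones, and the group soft-thresholding operator, which shrinks a whole coefficient group toward zero. The sketch below shows only these two pieces under their textbook definitions. It does not reproduce the paper's algorithm, which combines auxiliary function minimization and block descent; the function names and the parameters `delta` and `lam` are assumptions.

```python
import math


def huber(residual, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond delta (robust to outliers)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def group_soft_threshold(group, lam):
    """Shrink a coefficient group by lam in Euclidean norm, zeroing small groups.

    In a multi-dataset setting, a group would hold the coefficients of one
    candidate edge across all datasets, so the edge is kept or dropped jointly.
    """
    norm = math.sqrt(sum(x * x for x in group))
    if norm <= lam:
        return [0.0] * len(group)
    scale = 1.0 - lam / norm
    return [scale * x for x in group]


small = huber(0.5)      # quadratic regime
large = huber(10.0)     # linear regime, far below 0.5 * 10**2
kept = group_soft_threshold([3.0, 4.0], lam=1.0)
dropped = group_soft_threshold([0.3, 0.4], lam=1.0)
```

Because an outlying residual contributes linearly rather than quadratically, a single large error has much less influence on the fit than under a squared loss.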
The convergence analysis of the algorithm theoretically shows that the sequence generated from the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
Gene Regulatory Network; Reverse Engineering; Group LASSO; Optimization; Gene Expression Data
Predicting disease-related genes is one of the most important tasks in bioinformatics and systems biology. With the advances in high-throughput techniques, a large number of protein-protein interactions are available, which makes it possible to identify disease-related genes at the network level. However, network-based identification of disease-related genes remains a challenge because considerable false positives still exist in the currently available protein interaction networks (PINs).
Considering the fact that the majority of genetic disorders tend to manifest only in a single or a few tissues, we constructed tissue-specific networks (TSN) by integrating PIN and tissue-specific data. We further weighted the constructed tissue-specific network (WTSN) by using DNA methylation, as it plays an irreplaceable role in the development of complex diseases. A PageRank-based method was developed to identify disease-related genes from the constructed networks. To validate the effectiveness of the proposed method, we constructed PIN, weighted PIN (WPIN), TSN, and WTSN for colon cancer and leukemia, respectively. The experimental results on colon cancer and leukemia show that the combination of tissue-specific data and DNA methylation can help to identify disease-related genes more accurately. Moreover, the PageRank-based method was effective in predicting disease-related genes in the case studies of colon cancer and leukemia.
Tissue-specific data and DNA methylation are two important factors in the study of human diseases. The same method applied to the WTSN achieves better results than when applied to the original PIN, WPIN, or TSN. The PageRank-based method outperforms the degree centrality-based method for identifying disease-related genes from the WTSN.
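To illustrate how edge weights change a PageRank-style ranking, the sketch below runs plain power iteration on a small weighted network. It is a generic weighted PageRank with uniform teleportation, not the specific PageRank-based method of the paper; the function name, the damping factor, and the toy weights (standing in for methylation-derived weights) are assumptions.

```python
def pagerank(weights, damping=0.85, iters=100):
    """Weighted PageRank by power iteration.

    weights[i][j] is the non-negative weight of the edge from node i to node j.
    Rank held by nodes without outgoing edges is spread uniformly.
    """
    n = len(weights)
    out = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(rank[i] for i in range(n) if out[i] == 0)
        new = [(1.0 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            if out[i] == 0:
                continue
            for j in range(n):
                if weights[i][j]:
                    # Node i passes rank to j in proportion to the edge weight.
                    new[j] += damping * rank[i] * weights[i][j] / out[i]
        rank = new
    return rank


# Toy star network: hub gene 0 interacts with genes 1, 2 and 3.
# The 0-1 edge carries a hypothetical higher weight than the other two.
weights = [
    [0.0, 2.0, 1.0, 1.0],
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
rank = pagerank(weights)
```

In this toy example the hub ranks first, and gene 1 outranks genes 2 and 3 solely because its edge to the hub is weighted more heavily.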
Many clustering algorithms are unable to solve the clustering problem of protein-protein interaction (PPI) networks effectively. A novel clustering model which combines the optimization mechanism of artificial bee colony (ABC) with the fuzzy membership matrix is proposed in this paper. The proposed ABC-IFC clustering model contains two parts: searching for the optimum cluster centers using ABC mechanism and forming clusters using intuitionistic fuzzy clustering (IFC) method. Firstly, the cluster centers are set randomly and the initial clustering results are obtained by using fuzzy membership matrix. Then the cluster centers are updated through different functions of bees in ABC algorithm; then the clustering result is obtained through IFC method based on the new optimized cluster center. To illustrate its performance, the ABC-IFC method is compared with the traditional fuzzy C-means clustering and IFC method. The experimental results on MIPS dataset show that the proposed ABC-IFC method not only gets improved in terms of several commonly used evaluation criteria such as precision, recall, and P value, but also obtains a better clustering result.
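For reference, the fuzzy membership matrix of traditional fuzzy C-means, the baseline mentioned above, assigns each object a degree of membership in every cluster based on its relative distance to the cluster centers. The sketch below uses one-dimensional points for brevity and the standard membership formula; it does not implement the intuitionistic variant or the ABC search, and the function name and data are hypothetical.

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy C-means membership matrix for 1-D points.

    u[i][k] = 1 / sum_c (d(i, k) / d(i, c)) ** (2 / (m - 1)), with m > 1.
    A point that coincides with a center belongs fully to that center.
    """
    power = 2.0 / (m - 1.0)
    matrix = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            row = [h / total for h in hits]
        else:
            row = [1.0 / sum((d / other) ** power for other in dists)
                   for d in dists]
        matrix.append(row)
    return matrix


memberships = fcm_memberships(points=[0.0, 1.0, 3.0], centers=[0.0, 4.0])
```

Each row sums to one, so a point between two centers is shared between the clusters rather than forced into one of them.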
Genetic regulatory networks are dynamic systems which describe the interactions among gene products (mRNAs and proteins). The internal states of a genetic regulatory network consist of the concentrations of mRNA and proteins involved in it, which are very helpful in understanding its dynamic behaviors. However, because of some limitations such as experiment techniques, not all internal states of genetic regulatory network can be effectively measured. Therefore it becomes an important issue to estimate the unmeasured states via the available measurements. In this study, we design a state observer to estimate the states of genetic regulatory networks with time delays from available measurements. Furthermore, based on linear matrix inequality (LMI) approach, a criterion is established to guarantee that the dynamic of estimation error is globally asymptotically stable. A gene repressillatory network is employed to illustrate the effectiveness of our design approach.
Microarray technology has produced a huge body of time-course gene expression data and will continue to produce more. Such gene expression data has been proved useful in genomic disease diagnosis and drug design. The challenge is how to uncover useful information from such data by proper analysis methods such as significance analysis and clustering analysis. Many statistic-based significance analysis methods and distance/correlation-based clustering analysis methods have been applied to time-course expression data. However, these techniques are unable to account for the dynamics of such data. It is the dynamics that characterizes such data and that should be considered in analysis of such data. In this paper, we employ a nonlinear model to analyse time-course gene expression data. We firstly develop an efficient method for estimating the parameters in the nonlinear model. Then we utilize this model to perform the significance analysis of individually differentially expressed genes and clustering analysis of a set of gene expression profiles. The verification with two synthetic datasets shows that our developed significance analysis method and cluster analysis method outperform some existing methods. The application to one real-life biological dataset illustrates that the analysis results of our developed methods are in agreement with the existing results.
Sequence database searching has been the dominant method for peptide identification, in which a large number of peptide spectra generated from LC/MS/MS experiments are searched using a search engine against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.
A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. Particularly, the scores of PSMs are updated by using a fuzzy SVM classification model and a fuzzy silhouette index iteratively. Trustworthy PSMs will be assigned high scores when the algorithm stops.
Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.
Peptide identification; Peptide spectrum matches (PSMs); Fuzzy support vector machine (SVM); Fuzzy silhouette
β-turns are a type of secondary structure that plays an essential role in molecular recognition, protein folding, and stability. They are found to be the most common type of non-repetitive structure, since 25% of amino acids in protein structures are situated on them. Their prediction is considered to be one of the crucial problems in bioinformatics and molecular biology, and it can provide valuable insights and inputs for fold recognition and drug design.
We propose an approach that combines support vector machines (SVMs) and logistic regression (LR) in a hybrid prediction method, which we call H-SVM-LR, to predict β-turns in proteins. Fractional polynomials are used for LR modeling. We utilize position specific scoring matrices (PSSMs) and predicted secondary structure (PSS) as features. Our simulation studies show that H-SVM-LR achieves Qtotal of 82.87%, 82.84%, and 82.32% on the BT426, BT547, and BT823 datasets respectively. These values are the highest among β-turn prediction methods that are based on PSSMs and secondary structure information. H-SVM-LR also achieves favorable performance in predicting β-turns as measured by the Matthews correlation coefficient (MCC) on these datasets. Furthermore, H-SVM-LR shows good performance when considering shape strings as additional features.
In this paper, we present a comprehensive approach for β-turns prediction. Experiments show that our proposed approach achieves better performance compared to other competing prediction methods.
Protein interaction networks (PINs) are known to be useful for detecting protein complexes. However, most available PINs are static and cannot reflect the dynamic changes in real networks. Some researchers have tried to construct dynamic networks by combining time-course (dynamic) gene expression data with PINs. However, gene expression arrays inevitably contain background noise, which can degrade the quality of dynamic networks. Therefore, contaminated gene expression data need to be filtered out before further data integration and analysis.
Firstly, we adopt a dynamic model-based method to filter noisy data from dynamic expression profiles. Then a new method is proposed for identifying active proteins from dynamic gene expression profiles. A protein is defined as active at a time point if the expression level of its corresponding gene at that time point is higher than a threshold, which is determined by a threshold function involving the standard variance. Furthermore, a noise-filtered active protein interaction network (NF-APIN) is constructed. To demonstrate the efficiency of our method, we detect protein complexes from the NF-APIN and compare them with those detected from other dynamic PINs.
A dynamic model-based method can effectively filter out noise in dynamic gene expression data. Our method of computing a threshold for determining the active time points of noise-filtered genes can make the dynamic network construction more accurate and provide a high-quality framework for network analysis, such as protein complex prediction.
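The idea of a variance-based activity threshold can be sketched as follows. The exact threshold function is not given here, so this example assumes the simple form mean + k * standard deviation of each gene's own profile; the function name, the parameter `k`, and the expression values are hypothetical.

```python
import statistics


def active_time_points(profile, k=1.0):
    """Return the time points at which a gene's product is considered active.

    Assumed threshold (not the paper's exact function): the mean of the
    profile plus k times its population standard deviation.
    """
    mu = statistics.mean(profile)
    sigma = statistics.pstdev(profile)
    threshold = mu + k * sigma
    return [t for t, level in enumerate(profile) if level > threshold]


# Hypothetical expression profile with a single peak at time point 3.
active = active_time_points([1.0, 1.0, 1.0, 5.0])

# A flat profile has no time point above its own mean, so it is never active.
flat = active_time_points([2.0, 2.0, 2.0, 2.0])
```

An interaction would then be kept at time point t only if both of its proteins are active at t, which yields one network snapshot per time point.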
A metabolic system consists of a number of reactions transforming molecules of one kind into another to provide the energy that living cells need. Based on the biochemical reaction principles, dynamic metabolic systems can be modeled by a group of coupled differential equations which consists of parameters, states (concentration of molecules involved), and reaction rates. Reaction rates are typically either polynomials or rational functions in states and constant parameters. As a result, dynamic metabolic systems are a group of differential equations nonlinear and coupled in both parameters and states. Therefore, it is challenging to estimate parameters in complex dynamic metabolic systems. In this paper, we propose a method to analyze the complexity of dynamic metabolic systems for parameter estimation. As a result, the estimation of parameters in dynamic metabolic systems is reduced to the estimation of parameters in a group of decoupled rational functions plus polynomials (which we call improper rational functions) or in polynomials. Furthermore, by taking its special structure of improper rational functions, we develop an efficient algorithm to estimate parameters in improper rational functions. The proposed method is applied to the estimation of parameters in a dynamic metabolic system. The simulation results show the superior performance of the proposed method.
In a single proteomic project, tandem mass spectrometers can produce hundreds of millions of tandem mass spectra. However, the majority of tandem mass spectra are of poor quality, and searching them for peptides wastes time. Therefore, quality assessment before database search is very useful in the pipeline of protein identification via tandem mass spectra, especially for reducing search time and decreasing false identifications. Most existing methods for quality assessment are supervised machine learning methods based on a number of features that describe the quality of tandem mass spectra. These methods need training datasets in which the quality of all spectra is known, and such datasets are usually unavailable for new data.
This study proposes an unsupervised machine learning method for quality assessment of tandem mass spectra without any training dataset. The proposed method estimates the conditional probabilities of spectra being of high quality from the quality assessments based on individual features. The probabilities are estimated through a constrained optimization problem. An efficient algorithm is developed to solve the constrained optimization problem and is proved to be convergent. Experimental results on two datasets illustrate that if we search only the tandem spectra judged to be of high quality by the proposed method, we can save about 56% and 62% of database searching time while losing only a small amount of high-quality spectra.
Results indicate that the proposed method has a good performance for the quality assessment of tandem mass spectra and the way we estimate the conditional probabilities is effective.
For high-resolution tandem mass spectra, the determination of monoisotopic masses of fragment ions plays a key role in the subsequent peptide and protein identification. In this paper, we present a new algorithm for deisotoping bottom-up spectra. Isotopic-cluster graphs are constructed to describe the relationship between all possible isotopic clusters. Based on the relationship in isotopic-cluster graphs, each possible isotopic cluster is assessed with a score function, which is built by combining non-intensity and intensity features of fragment ions. The non-intensity features are used to prevent fragment ions with low intensity from being removed. Dynamic programming is adopted to find the highest-score path with the most reliable isotopic clusters. The experimental results show that the average Mascot scores and F-scores of identified peptides from spectra processed by our deisotoping method are greater than those obtained with the YADA and MS-Deconv software.
Clustering periodically expressed genes from their time-course expression data could help understand the molecular mechanism of those biological processes. In this paper, we propose a nonlinear model-based clustering method for periodically expressed gene profiles. As periodically expressed genes are associated with periodic biological processes, the proposed method naturally assumes that a periodically expressed gene dataset is generated by a number of periodic processes. Each periodic process is modelled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term. A two-stage method is proposed to estimate the model parameters, and a relocation-iteration algorithm is employed to assign each gene to an appropriate cluster. A bootstrapping method and an average adjusted Rand index (AARI) are employed to measure the quality of clustering. One synthetic dataset and two biological datasets were employed to evaluate the performance of the proposed method. The results show that our method yields better-quality clustering than other clustering methods (e.g., k-means) for periodically expressed gene data, and thus it is an effective cluster analysis method for such data.
Gene expression data; nonlinear model; periodically expressed genes; clustering; average adjusted Rand index
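Finding the highest-score path through a graph of candidate clusters is a classic dynamic programming problem on a directed acyclic graph. The sketch below assumes that candidate clusters are indexed in increasing m/z order and that an edge (u, v) with u < v means the two clusters are compatible; how the isotopic-cluster graph and the scores are actually built is not specified here, so the function name, the graph, and the scores are hypothetical.

```python
def highest_score_path(scores, edges):
    """Highest total-score path in a DAG whose edges go from lower to higher index.

    scores: one score per node (candidate isotopic cluster).
    edges:  (u, v) pairs with u < v, meaning v may follow u on a path.
    Returns (best_total, path_as_list_of_nodes).
    """
    n = len(scores)
    best = list(scores)      # best[v]: best total score of a path ending at v
    prev = [None] * n        # back-pointers for recovering the path
    # Sorting by source node guarantees best[u] is final before u is extended.
    for u, v in sorted(edges):
        if not 0 <= u < v < n:
            raise ValueError("edges must go from a lower to a higher index")
        if best[u] + scores[v] > best[v]:
            best[v] = best[u] + scores[v]
            prev[v] = u
    end = max(range(n), key=best.__getitem__)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


# Four hypothetical candidate clusters; clusters 1 and 2 are alternatives.
total, path = highest_score_path(
    scores=[1.0, 2.0, 0.5, 3.0],
    edges=[(0, 1), (0, 2), (1, 3), (2, 3)],
)
```

The clusters on the returned path would be kept as the most reliable interpretation, while competing clusters off the path are discarded.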
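When the frequency of a periodic process is fixed, the sine and cosine coefficients of such a model enter linearly and can be estimated by least squares. The sketch below covers only that linear stage, and only for the special case of equally spaced samples spanning a whole number of periods, where the least-squares solution has a closed form. It is not the two-stage estimator of the paper; the function name and the synthetic profile are assumptions.

```python
import math


def fit_harmonic(profile, m=1):
    """Least-squares fit of x_k = c + a*cos(w*k) + b*sin(w*k), w = 2*pi*m/n.

    Assumes n equally spaced samples covering exactly m periods, with
    0 < m < n/2, so that the constant, cosine and sine terms are orthogonal.
    """
    n = len(profile)
    if not 0 < m < n / 2:
        raise ValueError("need 0 < m < n/2")
    w = 2.0 * math.pi * m / n
    c = sum(profile) / n
    a = 2.0 / n * sum(x * math.cos(w * k) for k, x in enumerate(profile))
    b = 2.0 / n * sum(x * math.sin(w * k) for k, x in enumerate(profile))
    return c, a, b


# Noise-free synthetic profile over one period, sampled at 12 time points.
n = 12
w = 2.0 * math.pi / n
profile = [1.0 + 2.0 * math.cos(w * k) - 0.5 * math.sin(w * k) for k in range(n)]
c, a, b = fit_harmonic(profile, m=1)
```

With noisy data the same formulas return the least-squares estimates, and the residuals can then be used to compare how well a gene fits each candidate periodic process.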
There is a growing interest in the identification of proteins on the proteome-wide scale. Among the different kinds of protein structure identification methods, graph-theoretic methods are particularly effective. Owing to their lower costs, higher effectiveness and many other advantages, they have drawn increasing attention from researchers. Specifically, graph-theoretic methods have been widely used in homology identification, side-chain cluster identification, peptide sequencing and so on. This paper reviews several methods for solving protein structure identification problems using graph theory. We mainly introduce classical methods and mathematical models, including homology modeling based on clique finding, identification of side-chain clusters in protein structures based on the graph spectrum, and de novo peptide sequencing via tandem mass spectrometry using the spectrum graph model. In addition, concluding remarks and future priorities for each method are given.
Charge states of tandem mass spectra from low-resolution collision-induced dissociation cannot be determined by mass spectrometry. As a result, such spectra with multiple charges are usually searched multiple times by assuming each possible charge state. This strategy not only increases the overall database search time but also yields more false positives. Hence, it is advantageous to determine the charge states of such spectra before database search.
We propose a new approach capable of determining the charge states of low-resolution tandem mass spectra. Four novel and discriminant features are introduced to describe tandem mass spectra and used in Gaussian mixture model to distinguish doubly and triply charged peptides. By testing on three independent datasets with known validity, the results have shown that this method can assign charge states to low-resolution tandem mass spectra more accurately than existing methods.
The proposed method can be used to improve the speed and reliability of peptide identification.
Personalized drug design requires the classification of cancer patients to be as accurate as possible. With advances in genome sequencing and microarray technology, a large amount of gene expression data has been and will continue to be produced from various cancer patients. Such cancer-alerted gene expression data allows us to classify tumors at the genome-wide level. However, cancer-alerted gene expression datasets typically have many more genes (features) than samples (patients), which poses a challenge for the classification of tumors. In this paper, a new method is proposed for cancer diagnosis using gene expression data by casting the classification problem as finding sparse representations of test samples with respect to training samples. The sparse representation is computed by the l1-regularized least square method. To investigate its performance, the proposed method is applied to six tumor gene expression datasets and compared with various support vector machine (SVM) methods. The experimental results show that the performance of the proposed method is comparable with or better than that of SVMs. In addition, the proposed method is more efficient than SVMs as it needs no model selection.
Essential proteins are indispensable for living organisms to maintain life activities and play important roles in the studies of pathology, synthetic biology, and drug design. Therefore, besides experimental methods, many computational methods have been proposed to identify essential proteins. Based on the centrality-lethality rule, various centrality methods are employed to predict essential proteins in a Protein-protein Interaction Network (PIN). However, because they neglect the temporal and spatial features of protein-protein interactions, the centrality scores calculated by centrality methods are not effective enough for measuring the essentiality of proteins in a PIN. Moreover, many methods that overfit the features of essential proteins for one species may perform poorly for other species. In this paper, we demonstrate that the centrality-lethality rule also exists in Protein Subcellular Localization Interaction Networks (PSLINs). To do this, a method based on Localization Specificity for Essential protein Detection (LSED) is proposed, which can be combined with any centrality method to calculate improved centrality scores by taking into consideration the PSLINs in which proteins play their roles. In this study, LSED was combined with eight centrality methods separately to calculate Localization-specific Centrality Scores (LCSs) for proteins based on the PSLINs of four species (Saccharomyces cerevisiae, Homo sapiens, Mus musculus and Drosophila melanogaster). Compared to the proteins with high centrality scores measured from the global PINs, more proteins with high LCSs measured from PSLINs are essential. This indicates that proteins with high LCSs measured from PSLINs are more likely to be essential and that the performance of centrality methods can be improved by LSED. Furthermore, LSED provides a widely applicable prediction model to identify essential proteins for different species.
Essential proteins are vitally important for cellular survival and development, and identifying essential proteins is meaningful research work in the post-genome era. The rapid increase of available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on PPI networks. However, the PPI data obtained from large-scale, high-throughput experiments generally contain false positives, so the original PPI data alone are insufficient for identifying essential proteins. Improving the accuracy has therefore become the focus of essential protein identification. In this paper, we propose a framework for identifying essential proteins from active PPI networks constructed with dynamic gene expression. Firstly, we process the dynamic gene expression profiles by using a time-dependent model and a time-independent model. Secondly, we construct an active PPI network based on co-expressed genes. Lastly, we apply six classical centrality measures to the active PPI network. For the purpose of comparison, other prediction methods are also used to identify essential proteins based on the active PPI network. The experimental results on the yeast network show that identifying essential proteins based on the active PPI network can improve the performance of centrality measures considerably in terms of the number of identified essential proteins and identification accuracy. At the same time, the results also indicate that most essential proteins are active.
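Classification by sparse representation can be sketched in two steps: express the test sample as a sparse combination of all training samples by solving an l1-regularized least squares problem, then assign the class whose training samples reconstruct the test sample best. The sketch below solves the l1 problem with plain coordinate descent and uses the smallest class-wise reconstruction error as the decision rule. The solver, the decision rule, the function names, the value of `lam`, and the toy data are assumptions for illustration, not the paper's implementation.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal operator of the l1 penalty)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def l1_least_squares(columns, y, lam=0.01, sweeps=200):
    """Coordinate descent for min_c 0.5*||y - sum_j c_j*columns[j]||^2 + lam*||c||_1."""
    coef = [0.0] * len(columns)
    resid = list(y)  # current residual y - sum_j coef[j] * columns[j]
    sq_norms = [sum(v * v for v in col) for col in columns]
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            if sq_norms[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(a * r for a, r in zip(col, resid)) + sq_norms[j] * coef[j]
            new = soft_threshold(rho, lam) / sq_norms[j]
            step = new - coef[j]
            if step != 0.0:
                resid = [r - step * a for r, a in zip(resid, col)]
                coef[j] = new
    return coef


def classify(columns, labels, y, lam=0.01):
    """Assign y to the class whose training samples reconstruct it best."""
    coef = l1_least_squares(columns, y, lam)
    best_label, best_err = None, float("inf")
    for label in sorted(set(labels)):
        approx = [0.0] * len(y)
        for c, col, lab in zip(coef, columns, labels):
            if lab == label:
                approx = [a + c * v for a, v in zip(approx, col)]
        err = sum((yi - ai) ** 2 for yi, ai in zip(y, approx))
        if err < best_err:
            best_label, best_err = label, err
    return best_label


# Toy data: each training sample is an expression vector over three genes.
train = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
labels = ["tumor_a", "tumor_a", "tumor_b", "tumor_b"]
first = classify(train, labels, [1.0, 0.05, 0.0])
second = classify(train, labels, [0.0, 1.0, 0.05])
```

No classifier is trained in advance: each test sample is handled by solving one optimization problem against the training set, which is why no separate model selection step appears in the sketch.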
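One way to picture a localization-specific centrality score is to compute a centrality inside each subcellular compartment's subnetwork and keep the best value per protein. The sketch below does this for degree centrality. The rule for combining compartments (taking the maximum) is an assumption rather than the LSED definition, and the function name, proteins, and annotations are hypothetical.

```python
def localization_centrality(edges, locations):
    """Degree centrality computed within compartment-specific subnetworks.

    edges:     iterable of (protein_u, protein_v) interactions.
    locations: dict mapping each protein to a set of compartments.
    An interaction belongs to a compartment's subnetwork only if both
    proteins are annotated to that compartment. Each protein receives its
    highest degree over all compartments (assumed combination rule).
    """
    scores = {protein: 0 for protein in locations}
    compartments = set().union(*locations.values())
    for comp in compartments:
        degree = {}
        for u, v in edges:
            if comp in locations.get(u, ()) and comp in locations.get(v, ()):
                degree[u] = degree.get(u, 0) + 1
                degree[v] = degree.get(v, 0) + 1
        for protein, d in degree.items():
            scores[protein] = max(scores[protein], d)
    return scores


edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
locations = {
    "A": {"nucleus", "cytoplasm"},
    "B": {"nucleus"},
    "C": {"nucleus"},
    "D": {"cytoplasm"},
}
lcs = localization_centrality(edges, locations)
```

Protein A has three partners in the global network but scores 2 here, because only two of its interactions can occur within a single compartment.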
Essential proteins; Protein-protein interaction; Dynamic gene expression profiles; Active protein-protein interaction networks; Centrality measures
Clustering is one of the main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless, traditional clustering methods may not be effective for clustering PPI data. In this paper, we propose a novel method for clustering PPI data by combining the firefly algorithm (FA) and the synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC), which transforms the high-dimensional similarity matrix into a low-dimensional matrix. Then the SHC algorithm is used to perform clustering. In the SHC algorithm, hierarchical clustering is achieved by continuously enlarging the neighborhood radius of synchronized objects, but this hierarchical search has difficulty finding the optimal neighborhood radius of synchronization and is not efficient. We therefore adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure.
Protein-Protein Interaction (PPI) data; firefly algorithm (FA); synchronization-based hierarchical clustering (SHC); spectral clustering (SC)
Identification of protein complexes from protein-protein interaction networks has become a key problem for understanding cellular life in the postgenomic era. Many computational methods have been proposed for identifying protein complexes. Up to now, the existing computational methods have mostly been applied to static PPI networks. However, proteins and their interactions are dynamic in reality. Identifying dynamic protein complexes is more meaningful and challenging. In this paper, a novel algorithm, named DPC, is proposed to identify dynamic protein complexes by integrating PPI data and gene expression profiles. According to the Core-Attachment assumption, proteins that are always active in the molecular cycle are regarded as core proteins. The protein-complex cores are identified from these always-active proteins by detecting dense subgraphs. Final protein complexes are extended from the protein-complex cores by adding attachments based on a topological character of “closeness” and dynamic meaning. The protein complexes produced by our algorithm DPC contain two parts: a static core expressed throughout the molecular cycle and short-lived dynamic attachments. The proposed algorithm DPC was applied to the data of Saccharomyces cerevisiae, and the experimental results show that DPC outperforms CMC, MCL, SPICi, HC-PIN, COACH, and Core-Attachment based on the validation of matching with known complexes and hF-measures.
We have often observed unexpected state transitions of complex systems. We are thus interested in how to steer a complex system from an unexpected state to a desired state. Here we introduce the concept of transittability of complex networks, and derive a new sufficient and necessary condition for state transittability which can be efficiently verified. We define the steering kernel as a minimal set of steering nodes to which control signals must directly be applied for transition between two specific states of a network, and propose a graph-theoretic algorithm to identify the steering kernel of a network for transition between two specific states. We applied our algorithm to 27 real complex networks, finding that sizes of steering kernels required for transittability are much less than those for complete controllability. Furthermore, applications to regulatory biomolecular networks not only validated our method but also identified the steering kernel for their phenotype transitions.
A real-time peptide-spectrum matching (RT-PSM) algorithm is a database search method to interpret tandem mass spectra (MS/MS) with strict time constraints. Restricted by the hardware and architecture of individual workstation, previous RT-PSM algorithms either are not fast enough to satisfy all real-time system requirements or need to sacrifice the level of inference accuracy to provide the required processing speed.
We develop two parallelized algorithms for MS/MS data analysis: a multi-core RT-PSM (MC RT-PSM) algorithm which works on individual workstations and a distributed computing RT-PSM (DC RT-PSM) algorithm which works on a computer cluster. Two data sets are employed to evaluate the performance of our proposed algorithms. The simulation results show that our proposed algorithms can reach approximately 216.9-fold speedup on a sub-task process (similarity scoring module) and 84.78-fold speedup on the overall process compared with a single-thread process of the RT-PSM algorithm when 240 logical cores are employed.
The improved RT-PSM algorithms can achieve the processing speed requirement without sacrificing the level of inference accuracy. With some configuration adjustments, the proposed algorithm can support many peptide identification programs, such as X!Tandem, CUDA version RT-PSM, etc.
Characterization of unknown proteins through computational approaches is one of the most challenging problems in in silico biology and has attracted world-wide interest and great effort. Several computational methods have been proposed to address this problem, based either on homology mapping or on the context of protein interaction networks.
In this paper, two algorithms are proposed by integrating the protein-protein interaction (PPI) network, proteins’ domain information and protein complexes. The first is domain combination similarity (DCS), which combines the domain compositions of both proteins and their neighbors. The second is domain combination similarity in context of protein complexes (DSCP), which extends the protein functional similarity definition of DCS by combining the domain compositions of both proteins and the complexes including them. The new algorithms are tested on networks of the model species Saccharomyces cerevisiae to predict functions of unknown proteins using cross validation. Compared with several other existing algorithms, the results demonstrate the effectiveness of our proposed methods in protein function prediction. Furthermore, the algorithm DSCP, using experimentally determined complex data, is robust when a large percentage of the proteins in the network is unknown, and it outperforms DCS and several other existing algorithms.
The accuracy of predicting protein function can be improved by integrating the protein-protein interaction (PPI) network, proteins’ domain information and protein complexes.
In dynamic biological processes, genes, transcription factors (TFs) and microRNAs (miRNAs) play vital regulatory roles. Many researchers have focused on transcription factors or miRNAs at the transcriptional or post-transcriptional stage, respectively. However, transcriptional and post-transcriptional regulation are not isolated from each other in dynamic biological processes, and few researchers have considered the network composed of genes, miRNAs and TFs in such processes, especially in mouse lung development. Moreover, it is widely acknowledged that cancer is a kind of developmental disorder, and some of the pathways involved in tissue development might also be implicated in causing cancer. Although many genes differentially expressed during mouse lung development have been found to be differentially expressed in lung cancer as well, very little work has been reported to elucidate the combinatorial regulatory programs behind such associations.
In order to investigate the association of transcriptional and post-transcriptional regulating activities in mouse lung development, we define the significant triple relations among miRNAs, TFs and mRNAs as circuits. From the lung development time course data GSE21053, we mine 142610 circuit candidates including 96 TFs, 129 miRNAs and 13403 genes. After removing genes with little variation across time points, we finally find 64760 circuit candidates, containing 8299 genes, 50 TFs, and 118 miRNAs in total. Further analysis shows that the circuits vary across stages of lung development and play different roles. By investigating the circuits in the context of lung-specific genes, we identify the regulatory combinations for lung-specific genes, as well as for lung non-specific genes. Moreover, we show that the circuits involving lung non-specific genes are functionally related to lung development. Noting that some tissue developmental systems may be involved in tumourigenesis, we also examine the circuits involving cancer genes to find their regulatory programs, which would be useful for lung cancer research.
The relevant transcriptional and post-transcriptional factors involved in mouse lung development, and their roles, both change greatly across stages. By investigating the circuits involving cancer genes, we can find miRNAs and TFs that play important roles in tumour progression. Therefore, the miRNA-TF-mRNA circuits can be used widely in translational biomedicine studies and can provide potential drug targets for the treatment of lung cancer.