Cells live in uncertain, dynamic environments and have many mechanisms for sensing and responding to changes in their surroundings. However, sudden fluctuations in the environment can be catastrophic to a population if it relies solely on sensory responses, which have a delay associated with them. Cells can reconcile these effects by using a tunable stochastic response, where in the absence of a stressor they create phenotypic diversity within an isogenic population, but use a deterministic response when stressors are sensed. Here, we develop a stochastic model of the multiple antibiotic resistance network of Escherichia coli and show that it can produce tunable stochastic pulses in the activator MarA. In particular, we show that a combination of interlinked positive and negative feedback loops plays an important role in setting the dynamics of the stochastic pulses. Negative feedback produces a pulsatile response that is tunable, while positive feedback serves to amplify the effect. Our simulations show that the uninduced native network is in a parameter regime that is of low cost to the cell (taxing resistance mechanisms are expressed infrequently) and also elevated noise strength (phenotypic variability is high). The stochastic pulsing can be tuned by MarA induction such that variability is decreased once stresses are sensed, avoiding the detrimental effects of noise when an optimal MarA concentration is needed. We further show that variability in the expression of MarA can act as a bet hedging mechanism, allowing for survival in time-varying stress environments, however this effect is tunable to allow for a fully induced, deterministic response in the presence of a stressor.
Cells can sense their environment and respond to changes, however the sudden appearance of a stressor can be catastrophic if the time it takes to sense and initiate a response is slow relative to the action of a stressor. A possible solution is to couple a sensory response with a stochastic, random approach. In the absence of stress, a random subset of cells expresses resistance genes, ensuring that if a stressor appears there will be some cells that are able to survive and regenerate the population; once stress is sensed all cells should respond by expressing resistance genes. Such an approach is particularly advantageous when resistance mechanisms are taxing to the cell because it limits their expression when no stress is present. We studied this phenomenon computationally using a model of the multiple antibiotic resistance activator, MarA. MarA controls over 40 resistance genes and can be induced by many harmful compounds. We show that when uninduced, the gene regulatory network controlling MarA is capable of producing stochastic pulses that can serve to hedge against sudden changes in the environment with minimal cost to the population. When induced, MarA expression is elevated and has low variability to ensure a uniform response.
Vaccination is a preventive measure against influenza that does not require placing restrictions on social activities. However, since the stockpile of vaccine that can be prepared before the arrival of an emerging pandemic strain is generally quite limited, one has to select priority target groups to which the first stockpile is distributed. In this paper, we study a simulation-based priority target selection method with the goal of enhancing the collective immunity of the whole population. To model the region in which the disease spreads, we consider an urban area composed of suburbs and central areas connected by a single commuter train line. Human activity is modelled following an agent-based approach. The degree to which collective immunity is enhanced is judged by the attack rate in unvaccinated people. The simulation results show that if students and office workers are given exclusive priority in the first three months, the attack rate can be reduced from in the baseline case down to 1–2%. In contrast, random vaccination only slightly reduces the attack rate. It should be noted that giving preference to active social groups does not mean sacrificing those at high risk, which corresponds to the elderly in our simulation model. Compared with the random administration of vaccine to all social groups, this design successfully reduces the attack rate across all age groups.
Assigning a protein into one of its folds is a transitional step for discovering three dimensional protein structure, which is a challenging task in bimolecular (biological) science. The present research focuses on: 1) the development of classifiers, and 2) the development of feature extraction techniques based on syntactic and/or physicochemical properties.
Apart from the above two main categories of research, we have shown that the selection of physicochemical attributes of the amino acids is an important step in protein fold recognition and has not been explored adequately. We have presented a multi-dimensional successive feature selection (MD-SFS) approach to systematically select attributes. The proposed method is applied on protein sequence data and an improvement of around 24% in fold recognition has been noted when selecting attributes appropriately.
The MD-SFS has been applied successfully in selecting physicochemical attributes of the amino acids. The selected attributes show improved protein fold recognition performance.
Integrating genetic perturbations with gene expression data not only improves accuracy of regulatory network topology inference, but also enables learning of causal regulatory relations between genes. Although a number of methods have been developed to integrate both types of data, the desiderata of efficient and powerful algorithms still remains. In this paper, sparse structural equation models (SEMs) are employed to integrate both gene expression data and cis-expression quantitative trait loci (cis-eQTL), for modeling gene regulatory networks in accordance with biological evidence about genes regulating or being regulated by a small number of genes. A systematic inference method named sparsity-aware maximum likelihood (SML) is developed for SEM estimation. Using simulated directed acyclic or cyclic networks, the SML performance is compared with that of two state-of-the-art algorithms: the adaptive Lasso (AL) based scheme, and the QTL-directed dependency graph (QDG) method. Computer simulations demonstrate that the novel SML algorithm offers significantly better performance than the AL-based and QDG algorithms across all sample sizes from 100 to 1,000, in terms of detection power and false discovery rate, in all the cases tested that include acyclic or cyclic networks of 10, 30 and 300 genes. The SML method is further applied to infer a network of 39 human genes that are related to the immune function and are chosen to have a reliable eQTL per gene. The resulting network consists of 9 genes and 13 edges. Most of the edges represent interactions reasonably expected from experimental evidence, while the remaining may just indicate the emergence of new interactions. The sparse SEM and efficient SML algorithm provide an effective means of exploiting both gene expression and perturbation data to infer gene regulatory networks. An open-source computer program implementing the SML algorithm is freely available upon request.
Deciphering the structure of gene regulatory networks is crucial for understanding gene functions and cellular dynamics, as well as system-level modeling of individual genes and cellular functions. Computational methods exploiting gene expression and other types of data generated from high-throughput experiments provide an efficient and low-cost means of inferring gene networks. Sparse structural equation models are employed to: i) integrate both gene expression and genetic perturbation data for inference of gene networks; and, ii) develop an efficient sparsity-aware inference algorithm. Computer simulations corroborate that the novel algorithm markedly outperforms state-of-the-art alternatives. The algorithm is further applied to infer a real human gene network unveiling possible interactions between several genes. Since gene networks can be perturbed not only by genetic variations but also by other means such as gene copy number changes, gene knockdown or controlled gene over-expression, this paper's method can be applied to a number of practical scenarios.
For the vast majority of species – including many economically or ecologically important organisms, progress in biological research is hampered due to the lack of a reference genome sequence. Despite recent advances in sequencing technologies, several factors still limit the availability of such a critical resource. At the same time, many research groups and international consortia have already produced BAC libraries and physical maps and now are in a position to proceed with the development of whole-genome sequences organized around a physical map anchored to a genetic map. We propose a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach denovo selective genome sequencing. We show that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, such as in this case gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundred millions of short reads and assign them to the correct BAC clones (deconvolution) so that the assembly can be carried out clone-by-clone. Experimental results on simulated data for the rice genome show that the deconvolution is very accurate, and the resulting BAC assemblies have high quality. Results on real data for a gene-rich subset of the barley genome confirm that the deconvolution is accurate and the BAC assemblies have good quality. While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.
The problem of obtaining the full genomic sequence of an organism has been solved either via a global brute-force approach (called whole-genome shotgun) or by a divide-and-conquer strategy (called clone-by-clone). Both approaches have advantages and disadvantages in terms of cost, manual labor, and the ability to deal with sequencing errors and highly repetitive regions of the genome. With the advent of second-generation sequencing instruments, the whole-genome shotgun approach has been the preferred choice. The clone-by-clone strategy is, however, still very relevant for large complex genomes. In fact, several research groups and international consortia have produced clone libraries and physical maps for many economically or ecologically important organisms and now are in a position to proceed with sequencing. In this manuscript, we demonstrate the feasibility of this approach on the gene-space of a large, very repetitive plant genome. The novelty of our approach is that, in order to take advantage of the throughput of the current generation of sequencing instruments, we pool hundreds of clones using a special type of “smart” pooling design that allows one to establish with high accuracy the source clone from the sequenced reads in a pool. Extensive simulations and experimental results support our claims.
MicroRNAs (miRNAs) are key post-transcriptional regulators of gene expression and commonly deregulated in carcinogenesis. To explore functionally crucial tumor-suppressive (TS)-miRNAs in hepatocellular carcinoma (HCC), we performed integrative function- and expression-based screenings of TS-miRNAs in six HCC cell lines. The screenings identified seven miRNAs, which showed growth-suppressive activities through the overexpression of each miRNA and were endogenously downregulated in HCC cell lines. Further expression analyses using a large panel of HCC cell lines and primary tumors demonstrated four miRNAs, miR-101, -195, -378 and -497, as candidate TS-miRNAs frequently silenced in HCCs. Among them, two clustered miRNAs miR-195 and miR-497 showed significant growth-suppressive activity with induction of G1 arrest. Comprehensive exploration of their targets using Argonute2-immunoprecipitation-deep-sequencing (Ago2-IP-seq) and genome-wide expression profiling after their overexpression followed by pathway analysis, revealed a significant enrichment of cell cycle regulators. Among the candidates, we successfully identified CCNE1, CDC25A, CCND3, CDK4, and BTRC as direct targets for miR-497 and miR-195. Moreover, target genes frequently upregulated in HCC in a tumor-specific manner, such as CDK6, CCNE1, CDC25A and CDK4, showed an inverse correlation in the expression of miR-195 and miR-497, and their targets. These results suggest the molecular pathway regulating cell cycle progression to be integrally altered by downregulation of miR-195 and miR-497 expression, leading to the aberrant cell proliferation in hepatocarcinogenesis.
Cox regression is commonly used to predict the outcome by the time to an event of interest and in addition, identify relevant features for survival analysis in cancer genomics. Due to the high-dimensionality of high-throughput genomic data, existing Cox models trained on any particular dataset usually generalize poorly to other independent datasets. In this paper, we propose a network-based Cox regression model called Net-Cox and applied Net-Cox for a large-scale survival analysis across multiple ovarian cancer datasets. Net-Cox integrates gene network information into the Cox's proportional hazard model to explore the co-expression or functional relation among high-dimensional gene expression features in the gene network. Net-Cox was applied to analyze three independent gene expression datasets including the TCGA ovarian cancer dataset and two other public ovarian cancer datasets. Net-Cox with the network information from gene co-expression or functional relations identified highly consistent signature genes across the three datasets, and because of the better generalization across the datasets, Net-Cox also consistently improved the accuracy of survival prediction over the Cox models regularized by or . This study focused on analyzing the death and recurrence outcomes in the treatment of ovarian carcinoma to identify signature genes that can more reliably predict the events. The signature genes comprise dense protein-protein interaction subnetworks, enriched by extracellular matrix receptors and modulators or by nuclear signaling components downstream of extracellular signal-regulated kinases. In the laboratory validation of the signature genes, a tumor array experiment by protein staining on an independent patient cohort from Mayo Clinic showed that the protein expression of the signature gene FBN1 is a biomarker significantly associated with the early recurrence after 12 months of the treatment in the ovarian cancer patients who are initially sensitive to chemotherapy. Net-Cox toolbox is available at http://compbio.cs.umn.edu/Net-Cox/.
Network-based computational models are attracting increasing attention in studying cancer genomics because molecular networks provide valuable information on the functional organizations of molecules in cells. Survival analysis mostly with the Cox proportional hazard model is widely used to predict or correlate gene expressions with time to an event of interest (outcome) in cancer genomics. Surprisingly, network-based survival analysis has not received enough attention. In this paper, we studied resistance to chemotherapy in ovarian cancer with a network-based Cox model, called Net-Cox. The experiments confirm that networks representing gene co-expression or functional relations can be used to improve the accuracy and the robustness of survival prediction of outcome in ovarian cancer treatment. The study also revealed subnetwork signatures that are enriched by extracellular matrix receptors and modulators and the downstream nuclear signaling components of extracellular signal-regulators, respectively. In particular, FBN1, which was detected as a signature gene of high confidence by Net-Cox with network information, was validated as a biomarker for predicting early recurrence in platinum-sensitive ovarian cancer patients in laboratory.
The quantification of social media impacts on societal and political events is a difficult undertaking. The Japanese Society of Oriental Medicine started a signature-collecting campaign to oppose a medical policy of the Government Revitalization Unit to exclude a traditional Japanese medicine, “Kampo,” from the public insurance system. The signature count showed a series of aberrant bursts from November 26 to 29, 2009. In the same interval, the number of messages on Twitter including the keywords “Signature” and “Kampo,” increased abruptly. Moreover, the number of messages on an Internet forum that discussed the policy and called for signatures showed a train of spikes.
Methods and Findings
In order to estimate the contributions of social media, we developed a statistical model with state-space modeling framework that distinguishes the contributions of multiple social media in time-series of collected public opinions. We applied the model to the time-series of signature counts of the campaign and quantified contributions of two social media, i.e., Twitter and an Internet forum, by the estimation. We found that a considerable portion (78%) of the signatures was affected from either of the social media throughout the campaign and the Twitter effect (26%) was smaller than the Forum effect (52%) in total, although Twitter probably triggered the initial two bursts of signatures. Comparisons of the estimated profiles of the both effects suggested distinctions between the social media in terms of sustainable impact of messages or tweets. Twitter shows messages on various topics on a time-line; newer messages push out older ones. Twitter may diminish the impact of messages that are tweeted intermittently.
The quantification of social media impacts is beneficial to better understand people’s tendency and may promote developing strategies to engage public opinions effectively. Our proposed method is a promising tool to explore information hidden in social phenomena.
Recent advances in high-throughput sequencing technologies have enabled a comprehensive dissection of the cancer genome clarifying a large number of somatic mutations in a wide variety of cancer types. A number of methods have been proposed for mutation calling based on a large amount of sequencing data, which is accomplished in most cases by statistically evaluating the difference in the observed allele frequencies of possible single nucleotide variants between tumours and paired normal samples. However, an accurate detection of mutations remains a challenge under low sequencing depths or tumour contents. To overcome this problem, we propose a novel method, Empirical Bayesian mutation Calling (https://github.com/friend1ws/EBCall), for detecting somatic mutations. Unlike previous methods, the proposed method discriminates somatic mutations from sequencing errors based on an empirical Bayesian framework, where the model parameters are estimated using sequencing data from multiple non-paired normal samples. Using 13 whole-exome sequencing data with 87.5–206.3 mean sequencing depths, we demonstrate that our method not only outperforms several existing methods in the calling of mutations with moderate allele frequencies but also enables accurate calling of mutations with low allele frequencies (≤10%) harboured within a minor tumour subpopulation, thus allowing for the deciphering of fine substructures within a tumour specimen.
Apoptosis is a critical process in endothelial cell (EC) biology and pathology, which has been extensively studied at protein level. Numerous gene expression studies of EC apoptosis have also been performed, however few attempts have been made to use gene expression data to identify the molecular relationships and master regulators that underlie EC apoptosis. Therefore, we sought to understand these relationships by generating a Bayesian gene regulatory network (GRN) model.
ECs were induced to undergo apoptosis using serum withdrawal and followed over a time course in triplicate, using microarrays. When generating the GRN, this EC time course data was supplemented by a library of microarray data from EC treated with siRNAs targeting over 350 signalling molecules.
The GRN model proposed Vasohibin-1 (VASH1) as one of the candidate master-regulators of EC apoptosis with numerous downstream mRNAs. To evaluate the role played by VASH1 in EC, we used siRNA to reduce the expression of VASH1. Of 10 mRNAs downstream of VASH1 in the GRN that were examined, 7 were significantly up- or down-regulated in the direction predicted by the GRN.Further supporting an important biological role of VASH1 in EC, targeted reduction of VASH1 mRNA abundance conferred resistance to serum withdrawal-induced EC death.
We have utilised Bayesian GRN modelling to identify a novel candidate master regulator of EC apoptosis. This study demonstrates how GRN technology can complement traditional methods to hypothesise the regulatory relationships that underlie important biological processes.
Vasohibin; HUVEC; Bayesian; Gene regulatory network
TNF (Tumor Necrosis Factor-α) induces HUVEC (Human Umbilical Vein Endothelial Cells) to proliferate and form new blood vessels. This TNF-induced angiogenesis plays a key role in cancer and rheumatic disease. However, the molecular system that underlies TNF-induced angiogenesis is largely unknown.
We analyzed the gene expression changes stimulated by TNF in HUVEC over a time course using microarrays to reveal the molecular system underlying TNF-induced angiogenesis. Traditional k-means clustering analysis was performed to identify informative temporal gene expression patterns buried in the time course data. Functional enrichment analysis using DAVID was then performed for each cluster. The genes that belonged to informative clusters were then used as the input for gene network analysis using a Bayesian network and nonparametric regression method. Based on this TNF-induced gene network, we searched for sub-networks related to angiogenesis by integrating existing biological knowledge.
k-means clustering of the TNF stimulated time course microarray gene expression data, followed by functional enrichment analysis identified three biologically informative clusters related to apoptosis, cellular proliferation and angiogenesis. These three clusters included 648 genes in total, which were used to estimate dynamic Bayesian networks. Based on the estimated TNF-induced gene networks, we hypothesized that a sub-network including IL6 and IL8 inhibits apoptosis and promotes TNF-induced angiogenesis. More particularly, IL6 promotes TNF-induced angiogenesis by inducing NF-κB and IL8, which are strong cell growth factors.
Computational gene network analysis revealed a novel molecular system that may play an important role in the TNF-induced angiogenesis seen in cancer and rheumatic disease. This analysis suggests that Bayesian network analysis linked to functional annotation may be a powerful tool to provide insight into disease.
A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes.
In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence.
This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
Structural variations (SVs) in genomes are commonly observed even in healthy individuals and play key roles in biological functions. To understand their functional impact or to infer molecular mechanisms of SVs, they have to be characterized with the maximum resolution. However, high-resolution analysis is a difficult task because it requires investigation of the complex structures involved in an enormous number of alignments of next-generation sequencing (NGS) reads and genome sequences that contain errors.
We propose a new method called ChopSticks that improves the resolution of SV detection for homozygous deletions even when the depth of coverage is low. Conventional methods based on read pairs use only discordant pairs to localize the positions of deletions, where a discordant pair is a read pair whose alignment has an aberrant strand or distance. In contrast, our method exploits concordant reads as well. We theoretically proved that when the depth of coverage approaches zero or infinity, the expected resolution of our method is asymptotically equal to that of methods based only on discordant pairs under double coverage. To confirm the effectiveness of ChopSticks, we conducted computational experiments against both simulated NGS reads and real NGS sequences. The resolution of deletion calls by other methods was significantly improved, thus demonstrating the usefulness of ChopSticks.
ChopSticks can generate high-resolution deletion calls of homozygous deletions using information independent of other methods, and it is therefore useful to examine the functional impact of SVs or to infer SV generation mechanisms.
XiP (eXtensible integrative Pipeline) is a flexible, editable and modular environment
with a user-friendly interface that does not require previous advanced programming skills
to run, construct and edit workflows. XiP allows the construction of workflows by linking
components written in both R and Java, the analysis of high-throughput data in grid engine
systems and also the development of customized pipelines that can be encapsulated in a
package and distributed. XiP already comes with several ready-to-use pipeline flows for
the most common genomic and transcriptomic analysis and ∼300 computational
Availability: XiP is open source, freely available under the Lesser General
Public License (LGPL) and can be downloaded from http://xip.hgc.jp.
To identify stage I lung adenocarcinoma patients with a poor prognosis who will benefit from adjuvant therapy.
Patients and Methods
Whole gene expression profiles were obtained at 19 time points over a 48-hour time course from human primary lung epithelial cells that were stimulated with epidermal growth factor (EGF) in the presence or absence of a clinically used EGF receptor tyrosine kinase (RTK)-specific inhibitor, gefitinib. The data were subjected to a mathematical simulation using the State Space Model (SSM). “Gefitinib-sensitive” genes, the expressional dynamics of which were altered by addition of gefitinib, were identified. A risk scoring model was constructed to classify high- or low-risk patients based on expression signatures of 139 gefitinib-sensitive genes in lung cancer using a training data set of 253 lung adenocarcinomas of North American cohort. The predictive ability of the risk scoring model was examined in independent cohorts of surgical specimens of lung cancer.
The risk scoring model enabled the identification of high-risk stage IA and IB cases in another North American cohort for overall survival (OS) with a hazard ratio (HR) of 7.16 (P = 0.029) and 3.26 (P = 0.0072), respectively. It also enabled the identification of high-risk stage I cases without bronchioalveolar carcinoma (BAC) histology in a Japanese cohort for OS and recurrence-free survival (RFS) with HRs of 8.79 (P = 0.001) and 3.72 (P = 0.0049), respectively.
The set of 139 gefitinib-sensitive genes includes many genes known to be involved in biological aspects of cancer phenotypes, but not known to be involved in EGF signaling. The present result strongly re-emphasizes that EGF signaling status in cancer cells underlies an aggressive phenotype of cancer cells, which is useful for the selection of early-stage lung adenocarcinoma patients with a poor prognosis.
The Gene Expression Omnibus (GEO) GSE31210
Epidemiological studies have suggested that the encounter with commensal microorganisms during the neonatal period is essential for normal development of the host immune system. Basic research involving gnotobiotic mice has demonstrated that colonization at the age of 5 weeks is too late to reconstitute normal immune function. In this study, we examined the transcriptome profiles of the large intestine (LI), small intestine (SI), liver (LIV), and spleen (SPL) of 3 bacterial colonization models—specific pathogen-free mice (SPF), ex-germ-free mice with bacterial reconstitution at the time of delivery (0WexGF), and ex-germ-free mice with bacterial reconstitution at 5 weeks of age (5WexGF)—and compared them with those of germ-free (GF) mice.
Hundreds of genes were affected in all tissues in each of the colonized models; however, a gene set enrichment analysis method, MetaGene Profiler (MGP), demonstrated that the specific changes of Gene Ontology (GO) categories occurred predominantly in 0WexGF LI, SPF SI, and 5WexGF SPL, respectively. MGP analysis on signal pathways revealed prominent changes in toll-like receptor (TLR)- and type 1 interferon (IFN)-signaling in LI of 0WexGF and SPF mice, but not 5WexGF mice, while 5WexGF mice showed specific changes in chemokine signaling. RT-PCR analysis of TLR-related genes showed that the expression of interferon regulatory factor 3 (Irf3), a crucial rate-limiting transcription factor in the induction of type 1 IFN, prominently decreased in 0WexGF and SPF mice but not in 5WexGF and GF mice.
The present study provides important new information regarding the molecular mechanisms of the so-called "hygiene hypothesis".
Hygiene hypothesis; Germ-free; Toll-like receptor; Type 1 interferon; MetaGene Profiler
Understanding the complex regulatory networks underlying development and evolution of multi-cellular organisms is a major problem in biology. Computational models can be used as tools to extract the regulatory structure and dynamics of such networks from gene expression data. This approach is called reverse engineering. It has been successfully applied to many gene networks in various biological systems. However, to reconstitute the structure and non-linear dynamics of a developmental gene network in its spatial context remains a considerable challenge. Here, we address this challenge using a case study: the gap gene network involved in segment determination during early development of Drosophila melanogaster. A major problem for reverse-engineering pattern-forming networks is the significant amount of time and effort required to acquire and quantify spatial gene expression data. We have developed a simplified data processing pipeline that considerably increases the throughput of the method, but results in data of reduced accuracy compared to those previously used for gap gene network inference. We demonstrate that we can infer the correct network structure using our reduced data set, and investigate minimal data requirements for successful reverse engineering. Our results show that timing and position of expression domain boundaries are the crucial features for determining regulatory network structure from data, while it is less important to precisely measure expression levels. Based on this, we define minimal data requirements for gap gene network inference. Our results demonstrate the feasibility of reverse-engineering with much reduced experimental effort. This enables more widespread use of the method in different developmental contexts and organisms. Such systematic application of data-driven models to real-world networks has enormous potential. Only the quantitative investigation of a large number of developmental gene regulatory networks will allow us to discover whether there are rules or regularities governing development and evolution of complex multi-cellular organisms.
To better understand multi-cellular organisms we need a better and more systematic understanding of the complex regulatory networks that govern their development and evolution. However, this problem is far from trivial. Regulatory networks involve many factors interacting in a non-linear manner, which makes it difficult to study them without the help of computers. Here, we investigate a computational method, reverse engineering, which allows us to reconstitute real-world regulatory networks in silico. As a case study, we investigate the gap gene network involved in determining the position of body segments during early development of Drosophila. We visualise spatial gap gene expression patterns using in situ hybridisation and microscopy. The resulting embryo images are quantified to measure the position of expression domain boundaries. We then use computational models as tools to extract regulatory information from the data. We investigate what kind, and how much data are required for successful network inference. Our results reveal that much less effort is required for reverse-engineering networks than previously thought. This opens the possibility of investigating a large number of developmental networks using this approach, which in turn will lead to a more general understanding of the rules and principles underlying development in animals and plants.
Motivation: In cancer genomes, chromosomal regions harboring cancer genes are often subjected to genomic aberrations like copy number alteration and loss of heterozygosity. Given this, finding recurrent genomic aberrations is considered an apt approach for screening cancer genes. Although several permutation-based tests have been proposed for this purpose, none of them are designed to find recurrent aberrations from the genomic dataset without paired normal sample controls. Their application to unpaired genomic data may lead to false discoveries, because they retrieve pseudo-aberrations that exist in normal genomes as polymorphisms.
Results: We develop a new parametric method named parametric aberration recurrence test (PART) to test for the recurrence of genomic aberrations. The introduction of Poisson-binomial statistics allow us to compute small P-values more efficiently and precisely than the previously proposed permutation-based approach. Moreover, we extended PART to cover unpaired data (PART-up) so that there is a statistical basis for analyzing unpaired genomic data. PART-up uses information from unpaired normal sample controls to remove pseudo-aberrations in unpaired genomic data. Using PART-up, we successfully predict recurrent genomic aberrations in cancer cell line samples whose paired normal sample controls are unavailable. This article thus proposes a powerful statistical framework for the identification of driver aberrations, which would be applicable to ever-increasing amounts of cancer genomic data seen in the era of next generation sequencing.
Availability: Our implementations of PART and PART-up are available from http://www.hgc.jp/~niiyan/PART/manual.html.
Supplementary data are available at Bioinformatics online.
Summary: Protein–protein interactions (PPIs) are mediated through specific regions on proteins. Some proteins have two or more protein interacting regions (IRs) and some IRs are competitively used for interactions with different proteins. IRView currently contains data for 3417 IRs in human and mouse proteins. The data were obtained from different sources and combined with annotated region data from InterPro. Information on non-synonymous single nucleotide polymorphism sites and variable regions owing to alternative mRNA splicing is also included. The IRView web interface displays all IR data, including user-uploaded data, on reference sequences so that the positional relationship between IRs can be easily understood. IRView should be useful for analyzing underlying relationships between the proteins behind the PPI networks.
Availability: IRView is publicly available on the web at http://ir.hgc.jp/.
Our understanding of the molecular pathways that underlie melanoma remains incomplete. Although several published microarray studies of clinical melanomas have provided valuable information, we found only limited concordance between these studies. Therefore, we took an in vitro functional genomics approach to understand melanoma molecular pathways.
Affymetrix microarray data were generated from A375 melanoma cells treated in vitro with siRNAs against 45 transcription factors and signaling molecules. Analysis of this data using unsupervised hierarchical clustering and Bayesian gene networks identified proliferation-association RNA clusters, which were co-ordinately expressed across the A375 cells and also across melanomas from patients. The abundance in metastatic melanomas of these cellular proliferation clusters and their putative upstream regulators was significantly associated with patient prognosis. An 8-gene classifier derived from gene network hub genes correctly classified the prognosis of 23/26 metastatic melanoma patients in a cross-validation study. Unlike the RNA clusters associated with cellular proliferation described above, co-ordinately expressed RNA clusters associated with immune response were clearly identified across melanoma tumours from patients but not across the siRNA-treated A375 cells, in which immune responses are not active. Three uncharacterised genes, which the gene networks predicted to be upstream of apoptosis- or cellular proliferation-associated RNAs, were found to significantly alter apoptosis and cell number when over-expressed in vitro.
This analysis identified co-expression of RNAs that encode functionally-related proteins, in particular, proliferation-associated RNA clusters that are linked to melanoma patient prognosis. Our analysis suggests that A375 cells in vitro may be valid models in which to study the gene expression modules that underlie some melanoma biological processes (e.g., proliferation) but not others (e.g., immune response). The gene expression modules identified here, and the RNAs predicted by Bayesian network inference to be upstream of these modules, are potential prognostic biomarkers and drug targets.
Determining the functional structure of biological networks is a central goal of systems biology. One approach is to analyze gene expression data to infer a network of gene interactions on the basis of their correlated responses to environmental and genetic perturbations. The inferred network can then be analyzed to identify functional communities. However, commonly used algorithms can yield unreliable results due to experimental noise, algorithmic stochasticity, and the influence of arbitrarily chosen parameter values. Furthermore, the results obtained typically provide only a simplistic view of the network partitioned into disjoint communities and provide no information of the relationship between communities. Here, we present methods to robustly detect co-regulated and functionally enriched gene communities and demonstrate their application and validity for Escherichia coli gene expression data. Applying a recently developed community detection algorithm to the network of interactions identified with the context likelihood of relatedness (CLR) method, we show that a hierarchy of network communities can be identified. These communities significantly enrich for gene ontology (GO) terms, consistent with them representing biologically meaningful groups. Further, analysis of the most significantly enriched communities identified several candidate new regulatory interactions. The robustness of our methods is demonstrated by showing that a core set of functional communities is reliably found when artificial noise, modeling experimental noise, is added to the data. We find that noise mainly acts conservatively, increasing the relatedness required for a network link to be reliably assigned and decreasing the size of the core communities, rather than causing association of genes into new communities.
One of the fundamental themes in biology is the hierarchical organization of its constituents. At higher levels of a hierarchy new properties emerge due to the complex interaction of constituents at lower levels. This same organization is expected to be found in genetic regulatory networks. If so, determining this hierarchal structure would aid in understanding the properties and functional processes of the networks. With the increasing availability of genetic expression data, developing methods to infer the underlying genetic regulatory network and detect functional communities within the network is an important goal of systems biology. Unfortunately, noise in expression data creates variability in the inferred network and the stochastic nature of community detection creates variability in the functional communities detected with existing methods. Here, we present methods for exploring the hierarchical organization of genetic regulatory networks that robustly detect core functional communities. We test the methods and demonstrate their validity, by applying them to Escherichia coli genetic expression data, finding a hierarchy of functionally relevant communities and then comparing those communities to the known E. coli functional groups. We then give examples of how our methods can be used to infer regulatory interactions between genes.
In the analysis of effects by cell treatment such as drug dosing, identifying changes on gene network structures between normal and treated cells is a key task. A possible way for identifying the changes is to compare structures of networks estimated from data on normal and treated cells separately. However, this approach usually fails to estimate accurate gene networks due to the limited length of time series data and measurement noise. Thus, approaches that identify changes on regulations by using time series data on both conditions in an efficient manner are demanded.
We propose a new statistical approach that is based on the state space representation of the vector autoregressive model and estimates gene networks on two different conditions in order to identify changes on regulations between the conditions. In the mathematical model of our approach, hidden binary variables are newly introduced to indicate the presence of regulations on each condition. The use of the hidden binary variables enables an efficient data usage; data on both conditions are used for commonly existing regulations, while for condition specific regulations corresponding data are only applied. Also, the similarity of networks on two conditions is automatically considered from the design of the potential function for the hidden binary variables. For the estimation of the hidden binary variables, we derive a new variational annealing method that searches the configuration of the binary variables maximizing the marginal likelihood.
For the performance evaluation, we use time series data from two topologically similar synthetic networks, and confirm that our proposed approach estimates commonly existing regulations as well as changes on regulations with higher coverage and precision than other existing approaches in almost all the experimental settings. For a real data application, our proposed approach is applied to time series data from normal Human lung cells and Human lung cells treated by stimulating EGF-receptors and dosing an anticancer drug termed Gefitinib. In the treated lung cells, a cancer cell condition is simulated by the stimulation of EGF-receptors, but the effect would be counteracted due to the selective inhibition of EGF-receptors by Gefitinib. However, gene expression profiles are actually different between the conditions, and the genes related to the identified changes are considered as possible off-targets of Gefitinib.
From the synthetically generated time series data, our proposed approach can identify changes on regulations more accurately than existing methods. By applying the proposed approach to the time series data on normal and treated Human lung cells, candidates of off-target genes of Gefitinib are found. According to the published clinical information, one of the genes can be related to a factor of interstitial pneumonia, which is known as a side effect of Gefitinib.
Although protein–RNA interactions (PRIs) are involved in various important cellular processes, compiled data on PRIs are still
limited. This contrasts with protein–protein interactions, which have been intensively recorded in public databases and subjected
to network level analysis. Here, we introduce PRD, an online database of PRIs, dispersed across several sources, including scientific
literature. Currently, over 10,000 interactions have been stored in PRD using PSI-MI 2.5, which is a standard model for describing
detailed molecular interactions, with an emphasis on gene level data. Users can browse all recorded interactions and execute
flexible keyword searches against the database via a web interface. Our database is not only a reference of PRIs, but will also be a
valuable resource for studying characteristics of PRI networks.
PRD can be freely accessed at http://pri.hgc.jp/
protein-RNA interaction; Biomolecular interaction; RNA binding protein; Database
Structural variations (SVs) change the structure of the genome and are therefore the causes of various diseases. Next-generation sequencing allows us to obtain a multitude of sequence data, some of which can be used to infer the position of SVs.
We developed a new method and implementation named ClipCrop for detecting SVs with single-base resolution using soft-clipping information. A soft-clipped sequence is an unmatched fragment in a partially mapped read. To assess the performance of ClipCrop with other SV-detecting tools, we generated various patterns of simulation data – SV lengths, read lengths, and the depth of coverage of short reads – with insertions, deletions, tandem duplications, inversions and single nucleotide alterations in a human chromosome. For comparison, we selected BreakDancer, CNVnator and Pindel, each of which adopts a different approach to detect SVs, e.g. discordant pair approach, depth of coverage approach and split read approach, respectively.
Our method outperformed BreakDancer and CNVnator in both discovering rate and call accuracy in any type of SV. Pindel offered a similar performance as our method, but our method crucially outperformed for detecting small duplications. From our experiments, ClipCrop infer reliable SVs for the data set with more than 50 bases read lengths and 20x depth of coverage, both of which are reasonable values in current NGS data set.
ClipCrop can detect SVs with higher discovering rate and call accuracy than any other tool in our simulation data set.
Gene regulatory networks inferred from RNA abundance data have generated significant interest, but despite this, gene network approaches are used infrequently and often require input from bioinformaticians. We have assembled a suite of tools for analysing regulatory networks, and we illustrate their use with microarray datasets generated in human endothelial cells. We infer a range of regulatory networks, and based on this analysis discuss the strengths and limitations of network inference from RNA abundance data. We welcome contact from researchers interested in using our inference and visualization tools to answer biological questions.