Related Articles

1.  Predicting genome-wide redundancy using machine learning 
Background
Gene duplication can lead to genetic redundancy, which masks the function of mutated genes in genetic analyses. Methods to increase sensitivity in identifying genetic redundancy can improve the efficiency of reverse genetics and lend insights into the evolutionary outcomes of gene duplication. Machine learning techniques are well suited to classifying gene family members into redundant and non-redundant gene pairs in model species where sufficient genetic and genomic data is available, such as Arabidopsis thaliana, the test case used here.
Results
Machine learning techniques that combine multiple attributes led to a dramatic improvement in predicting genetic redundancy over single trait classifiers alone, such as BLAST E-values or expression correlation. In withholding analysis, one of the methods used here, Support Vector Machines, was two-fold more precise than single attribute classifiers, reaching a level where the majority of redundant calls were correctly labeled. Using this higher confidence in identifying redundancy, machine learning predicts that about half of all genes in Arabidopsis show the signature of predicted redundancy with at least one but typically fewer than three other family members. Interestingly, a large proportion of predicted redundant gene pairs were relatively old duplications (e.g., Ks > 1), suggesting that redundancy is stable over long evolutionary periods.
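The precision comparison reported above can be illustrated with a small sketch. It only shows how precision (the fraction of "redundant" calls that are correct) is computed; the labels, the two prediction lists, and the function name `precision` are hypothetical and are not taken from the study.

```python
def precision(predicted, actual):
    """Fraction of pairs called redundant that are truly redundant."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    called = true_pos + false_pos
    # With no positive calls, precision is undefined; return 0.0 by convention here.
    return true_pos / called if called else 0.0

# Hypothetical labels for eight gene pairs (True = redundant).
actual = [True, True, True, False, False, False, False, False]
# Hypothetical calls from a classifier using one attribute only.
single_trait = [True, True, True, True, True, True, True, True]
# Hypothetical calls from a classifier combining several attributes.
multi_attribute = [True, True, True, True, False, False, False, False]

print(precision(single_trait, actual))     # 3 of 8 calls are correct
print(precision(multi_attribute, actual))  # 3 of 4 calls are correct
```

In this toy case the multi-attribute calls are twice as precise as the single-attribute calls, and more than half of them are correct.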
Conclusions
Machine learning predicts that most genes will have a functionally redundant paralog but will exhibit redundancy with relatively few genes within a family. The predictions and gene pair attributes for Arabidopsis provide a new resource for research in genetics and genome evolution. These techniques can now be applied to other organisms.
doi:10.1186/1471-2148-10-357
PMCID: PMC2998534  PMID: 21087504
2.  Quantitative analysis of regulatory flexibility under changing environmental conditions 
Day length changes with the seasons in temperate latitudes, affecting the many biological rhythms that entrain to the day/night cycle: we measure these effects on the expression of Arabidopsis clock genes, using RNA and reporter gene readouts, with a new method of phase analysis. Dusk sensitivity is proposed as a simple, natural and general mathematical measure to analyse and manipulate the changing phase of a clock output relative to the change in the day/night cycle. Dusk sensitivity shows how increasing the numbers of feedback loops in the Arabidopsis clock models allows more flexible regulation, consistent with a previously-proposed, general operating principle of biological networks. The Arabidopsis clock genes show flexibility of regulation that is characteristic of a three-loop clock model, validating aspects of the model and the operating principle, but some clock output genes show greater flexibility arising from direct light regulation.
The analysis of dynamic, non-linear regulation with the aid of mechanistic models is central to Systems Biology. This study compares the predictions of mechanistic, mathematical models of the circadian clock with molecular time-series data on rhythmic gene expression in the higher plant Arabidopsis thaliana. Analysis of the models helps us to understand (explain and predict) how the clock gene circuit balances regulation by external and endogenous factors to achieve particular behaviours. Such multi-factorial regulation is ubiquitous in, and characteristic of, living systems.
The Earth's rotation causes predictable changes in the environment, notably in the availability of sunlight for photosynthesis. Many biological processes are driven by the environmental input via sensory pathways, for example, from photoreceptors. Circadian clocks provide an alternative strategy. These endogenous, 24-h rhythms can drive biological processes that anticipate the regular environmental changes, rather than merely responding. Many rhythmic processes have both light and clock control. Indeed, the clock components themselves must balance internal timing with external inputs, because circadian clocks are reset daily through light regulation of one or more clock components. This process of entrainment is complicated by the change in day length. When the times of dawn and dusk move apart in summer, and closer together in winter, does the clock track dawn, track dusk or interpolate between them?
In plants, the clock controls leaf and petal movements, the opening and closing of stomatal pores, the discharge of floral fragrances, and many metabolic activities, especially those associated with photosynthesis. Centuries of physiological studies have shown that these rhythms can behave differently. Flowering in Ipomoea nil (Pharbitis nil, Japanese morning glory) is controlled by a rhythm that tracks the time of dusk, to give a classic example. We showed that two other rhythms associated with vegetative growth track dawn in this species (Figure 5A), so the clock system allows flexible regulation.
The relatively small number of components involved in the circadian clockwork makes it an ideal candidate for mathematical modelling. Molecular genetic studies in a variety of model eukaryotes have shown that the circadian rhythm is generated by a network of 6–20 genes. These genes form feedback loops generating a rhythm in mRNA production. A single negative feedback loop in which a gene encodes a protein that, after several hours, turns off transcription is capable of generating a circadian rhythm, in principle. A single light input can entrain the clock to ‘local time', synchronised with a light–dark cycle. However, real circadian clocks have proven to be more complicated than this, with multiple light inputs and interlocked feedback loops.
We have previously argued from mathematical analysis that multi-loop networks increase the flexibility of regulation (Rand et al, 2004) and have shown that appropriately deployed flexibility can confer functional robustness (Akman et al, 2010). Here we test whether that flexibility can be demonstrated in vivo, in the model plant, A. thaliana. The Arabidopsis clock mechanism comprises a feedback loop in which two partially redundant, myb transcription factors, LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), repress the expression of their activator, TIMING OF CAB EXPRESSION 1 (TOC1). We previously modelled this single-loop circuit and showed that it was not capable of recreating important data (Locke et al, 2005a). An extended, two-loop model was developed to match observed behaviours, incorporating a hypothetical gene Y, for which the best identified candidate was the GIGANTEA gene (GI) (Locke et al, 2005b). Two further models incorporated the TOC1 homologues PSEUDO-RESPONSE REGULATOR (PRR) 9 and PRR7 (Locke et al, 2006; Zeilinger et al, 2006). In these circuits, a morning oscillator (LHY/CCA1–PRR9/7) is coupled to an evening oscillator (Y/GI–TOC1) via the original LHY/CCA1–TOC1 loop.
These clock models, like those for all other organisms, were developed using data from simple conditions of constant light, darkness or 12-h light–12-h dark cycles. We therefore tested how the clock genes in Arabidopsis responded to light–dark cycles with different photoperiods, from 3 h light to 18 h light per 24-h cycle (Edinburgh, 56° North latitude, has 17.5 h light in midsummer). The time-series assays of mRNA and in vivo reporter gene images showed a range of peak times for different genes, depending on the photoperiod (Figure 5C). A new data analysis method, mFourfit, was introduced to measure the peak times, in the Biological Rhythms Analysis Software Suite (BRASS v3.0). None of the genes showed the dusk-tracking behaviour characteristic of the Ipomoea flowering rhythm. The one-, two- and three-loop models were analysed to understand the observed patterns. A new mathematical measure, dusk sensitivity, was introduced to measure the change in timing of a model component versus a change in the time of dusk. The one- and two-loop models tracked dawn and dusk, respectively, under all conditions. Only the three-loop model (Figure 5B) had the flexibility required to match the photoperiod-dependent changes that we found in vivo, and in particular the unexpected, V-shaped pattern in the peak time of TOC1 expression. This pattern of regulation depends on the structure and light inputs to the model's evening oscillator, so the in vivo data supported this aspect of the model. LHY and CCA1 gene expression under short photoperiods showed greater dusk sensitivity, in the interval 2–6 h before dawn, than the three-loop model predicted, so these data will help to constrain future models.
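As a rough illustration of the idea behind dusk sensitivity, the sketch below estimates how much a peak time shifts per hour of change in the time of dusk, using finite differences between neighbouring photoperiods. This is an assumed simplification for illustration, not the published definition or the mFourfit method; the function name and all peak times are hypothetical. Under this convention a dawn-tracking rhythm gives values near 0 and a dusk-tracking rhythm gives values near 1.

```python
def dusk_sensitivity(photoperiods, peak_times):
    """Finite-difference slope of peak time (h after dawn) against photoperiod (h).

    With dawn fixed at time 0, the photoperiod equals the time of dusk,
    so each slope is the change in peak time per hour of change in dusk.
    """
    pairs = sorted(zip(photoperiods, peak_times))
    slopes = []
    for (p0, t0), (p1, t1) in zip(pairs, pairs[1:]):
        slopes.append((t1 - t0) / (p1 - p0))
    return slopes

# Hypothetical photoperiods in hours of light per 24-h cycle.
photoperiods = [6, 9, 12, 15, 18]
# Hypothetical rhythm that always peaks 2 h after dawn.
dawn_tracker = [2.0, 2.0, 2.0, 2.0, 2.0]
# Hypothetical rhythm that always peaks 1 h after dusk.
dusk_tracker = [7.0, 10.0, 13.0, 16.0, 19.0]

print(dusk_sensitivity(photoperiods, dawn_tracker))
print(dusk_sensitivity(photoperiods, dusk_tracker))
```

A rhythm whose slope changes sign across photoperiods, such as the V-shaped pattern described for TOC1, would show up as a mix of negative and positive values in the returned list.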
The approach described here could act as a template for experimental biologists seeking to understand biological regulation using dynamic, experimental perturbations and time-series data. Simulation of mathematical models (despite known imperfections) can provide contrasting hypotheses that guide understanding. The system's detailed behaviour is complex, so a natural and general measure such as dusk sensitivity is helpful to focus on one property of the system. We used the measure to compare models, and to predict how this property could be manipulated. To enable additional analysis of this system, we provide the time-series data and experimental metadata online.
The circadian clock controls 24-h rhythms in many biological processes, allowing appropriate timing of biological rhythms relative to dawn and dusk. Known clock circuits include multiple, interlocked feedback loops. Theory suggested that multiple loops contribute the flexibility for molecular rhythms to track multiple phases of the external cycle. Clear dawn- and dusk-tracking rhythms illustrate the flexibility of timing in Ipomoea nil. Molecular clock components in Arabidopsis thaliana showed complex, photoperiod-dependent regulation, which was analysed by comparison with three contrasting models. A simple, quantitative measure, Dusk Sensitivity, was introduced to compare the behaviour of clock models with varying loop complexity. Evening-expressed clock genes showed photoperiod-dependent dusk sensitivity, as predicted by the three-loop model, whereas the one- and two-loop models tracked dawn and dusk, respectively. Output genes for starch degradation achieved dusk-tracking expression through light regulation, rather than a dusk-tracking rhythm. Model analysis predicted which biochemical processes could be manipulated to extend dusk tracking. Our results reveal how an operating principle of biological regulators applies specifically to the plant circadian clock.
doi:10.1038/msb.2010.81
PMCID: PMC3010117  PMID: 21045818
Arabidopsis thaliana; biological clocks; dynamical systems; gene regulatory networks; mathematical models; photoperiodism
3.  Systematic identification of functional modules and cis-regulatory elements in Arabidopsis thaliana 
BMC Bioinformatics  2011;12(Suppl 12):S2.
Background
Several large-scale gene co-expression networks have been constructed successfully for predicting gene functional modules and cis-regulatory elements in Arabidopsis (Arabidopsis thaliana). However, these networks are usually constructed and analyzed in an ad hoc manner. In this study, we propose a completely parameter-free and systematic method for constructing gene co-expression networks and predicting functional modules as well as cis-regulatory elements.
Results
Our novel method consists of an automated network construction algorithm, a parameter-free procedure to predict functional modules, and a strategy for finding known cis-regulatory elements that is suitable for consensus scanning without prior knowledge of the allowed extent of degeneracy of the motif. We apply the method to study a large collection of gene expression microarray data in Arabidopsis. We estimate that our co-expression network has ~94% accuracy, and has topological properties similar to other biological networks, such as being scale-free and having a high clustering coefficient. Remarkably, among the ~300 predicted modules whose sizes are at least 20, 88% have at least one significantly enriched function, including a few extremely significant ones (ribosome, p < 1E-300, photosynthetic membrane, p < 1.3E-137, proteasome complex, p < 5.9E-126). In addition, we are able to predict cis-regulatory elements for 66.7% of the modules, and the association between the enriched cis-regulatory elements and the enriched functional terms can often be confirmed by the literature. Overall, our results are much more significant than those reported by several previous studies on similar data sets. Finally, we utilize the co-expression network to dissect the promoters of 19 Arabidopsis genes involved in the metabolism and signaling of the important plant hormone gibberellin, and obtain promising results that offer insight into the biosynthesis and signaling of gibberellin.
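For readers unfamiliar with co-expression networks, the following minimal sketch shows the general idea: genes become nodes, and two genes are linked when their expression profiles are strongly correlated. It is not the parameter-free algorithm described above; it uses a fixed Pearson correlation cutoff, which is exactly the kind of ad hoc parameter that method is designed to avoid. The gene names, expression values, and the `cutoff` argument are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(profiles, cutoff):
    """Link every gene pair whose correlation is at least `cutoff`."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if pearson(profiles[g1], profiles[g2]) >= cutoff:
            edges.append((g1, g2))
    return edges

# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}

edges = coexpression_edges(profiles, cutoff=0.9)
print(edges)
```

Only geneA and geneB rise and fall together in this toy data set, so they form the single edge of the network.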
Conclusions
The results show that our method is highly effective in finding functional modules from real microarray data. Our application on Arabidopsis leads to the discovery of the largest number of annotated Arabidopsis functional modules in the literature. Given the high statistical significance of functional enrichment and the agreement between cis-regulatory and functional annotations, we believe our Arabidopsis gene modules can be used to predict the functions of unknown genes in Arabidopsis, and to understand the regulatory mechanisms of many genes.
doi:10.1186/1471-2105-12-S12-S2
PMCID: PMC3247083  PMID: 22168340
4.  Computational reconstruction of tissue-specific metabolic models: application to human liver metabolism 
The first computational approach for the rapid generation of genome-scale tissue-specific models from a generic species model. A genome scale model of human liver metabolism, which is comprehensively tested and validated using cross-validation and the ability to carry out complex hepatic metabolic functions. The model's flux predictions are shown to correlate with flux measurements across a variety of hormonal and dietary conditions, and are successfully used to predict biomarker changes in genetic metabolic disorders, both with higher accuracy than the generic human model.
The study of normal human metabolism and its alterations is central to the understanding and treatment of a variety of human diseases, including diabetes, metabolic syndrome, neurodegenerative disorders, and cancer. A promising systems biology approach for studying human metabolism is the development and analysis of large-scale stoichiometric network models of human metabolism. The reconstruction of these network models has followed two main paths. The first is the reconstruction of generic (non-tissue specific) models, characterizing the complete metabolic potential of human cells, based mostly on genomic data to trace enzyme-coding genes (Duarte et al, 2007; Ma et al, 2007). The second is the reconstruction of cell type- and tissue-specific models (Wiback and Palsson, 2002; Chatziioannou et al, 2003; Vo et al, 2004), based on a similar methodology, with the extra complexity of manual curation of literature evidence for the cell/system specificity of metabolic enzymes and pathways.
Against this background, we present in this study what is, to the best of our knowledge, the first computational approach for the rapid generation of genome-scale tissue-specific models. The method relies on integrating the previously reconstructed generic human models with a variety of high-throughput molecular 'omics' data, including transcriptomic, proteomic, metabolomic, and phenotypic data, as well as literature-based knowledge, characterizing the tissue at hand (Figure 1). Hence, it can be readily used to build and use a large array of human tissue-specific models. The resulting model satisfies stoichiometric, mass-balance, and thermodynamic constraints. It serves as a functional metabolic network that can then be used to explore the metabolic state of a tissue under various genetic and physiological conditions, simulating enzymatic inhibition or drug applications through standard constraint-based modeling methods, without requiring additional context-specific molecular data.
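The mass-balance constraint mentioned above is commonly written as S·v = 0, where S is the stoichiometric matrix and v the vector of reaction fluxes. The sketch below checks this condition for a toy three-reaction chain. It illustrates the constraint only, not the reconstruction algorithm of the study; the network, flux values, and function names are hypothetical.

```python
def net_production(stoichiometry, fluxes):
    """Return S·v, the net production rate of each metabolite."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in stoichiometry]

def is_steady_state(stoichiometry, fluxes, tol=1e-9):
    """True when every metabolite is produced and consumed at equal rates."""
    return all(abs(rate) <= tol for rate in net_production(stoichiometry, fluxes))

# Hypothetical toy network: uptake -> A, A -> B, B -> secretion.
# Rows are metabolites A and B; columns are reactions R1 (uptake),
# R2 (A -> B) and R3 (secretion).
S = [
    [1, -1, 0],
    [0, 1, -1],
]

balanced = [2.0, 2.0, 2.0]    # every reaction carries the same flux
unbalanced = [2.0, 1.0, 1.0]  # A is taken up faster than it is converted

print(is_steady_state(S, balanced))
print(net_production(S, unbalanced))
```

Constraint-based methods search the space of flux vectors that pass this kind of check, usually together with bounds on individual fluxes.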
We applied this approach to build a genome scale model of liver metabolism, which is then comprehensively tested and validated. The model is shown to be able to simulate complex hepatic metabolic functions, as well as depicting the pathological alterations caused by urea cycle deficiencies. The liver model was applied to predict measured intra-cellular metabolic fluxes given measured metabolite uptake and secretion rates at different hepatic metabolic conditions. The predictions were tested using a comprehensive set of flux measurements performed by Chan et al (2003), showing that the liver model obtained more accurate predictions than the original, generic human model (an overall prediction accuracy of 0.67 versus 0.46). Furthermore, it was applied to identify metabolic biomarkers for liver inborn errors of metabolism, again outperforming the predictions generated by the generic human model (accuracy of 0.67 versus 0.59).
From a biotechnological standpoint, the liver model generated here can serve as a basis for future studies aiming to optimize the functioning of bioartificial liver devices. The application of the method to rapidly construct metabolic models of other human tissues could lead to other important clinical insights, e.g., concerning means for metabolic salvage of ischemic heart and brain tissues. Last but not least, the application of the new method is not limited to the realm of human modeling; it can be used to generate tissue models for any multi-tissue organism for which a generic model exists, such as Mus musculus (Quek and Nielsen, 2008; Sheikh et al, 2005) and the model plant Arabidopsis thaliana (Poolman et al, 2009).
The computational study of human metabolism has been advanced with the advent of the first generic (non-tissue specific) stoichiometric model of human metabolism. In this study, we present a new algorithm for rapid reconstruction of tissue-specific genome-scale models of human metabolism. The algorithm generates a tissue-specific model from the generic human model by integrating a variety of tissue-specific molecular data sources, including literature-based knowledge, transcriptomic, proteomic, metabolomic and phenotypic data. Applying the algorithm, we constructed the first genome-scale stoichiometric model of hepatic metabolism. The model is verified using standard cross-validation procedures, and through its ability to carry out hepatic metabolic functions. The model's flux predictions correlate with flux measurements across a variety of hormonal and dietary conditions, and improve upon the predictive performance obtained using the original, generic human model (prediction accuracy of 0.67 versus 0.46). Finally, the model better predicts biomarker changes in genetic metabolic disorders than the generic human model (accuracy of 0.67 versus 0.59). The approach presented can be used to construct other human tissue-specific models, and be applied to other organisms.
doi:10.1038/msb.2010.56
PMCID: PMC2964116  PMID: 20823844
constraint based; hepatic; liver; metabolism
5.  Prediction of plant pre-microRNAs and their microRNAs in genome-scale sequences using structure-sequence features and support vector machine 
BMC Bioinformatics  2014;15(1):423.
Background
MicroRNAs (miRNAs) are a family of non-coding RNAs approximately 21 nucleotides in length that play pivotal roles at the post-transcriptional level in animals, plants and viruses. These molecules silence their target genes by degrading their transcripts or suppressing translation. Studies have shown that miRNAs are involved in biological responses to a variety of biotic and abiotic stresses. Identification of these molecules and their targets can aid the understanding of regulatory processes. Recently, prediction methods based on machine learning have been widely used for miRNA prediction. However, most of these methods were designed for mammalian miRNA prediction, and few are available for predicting miRNAs in the pre-miRNAs of specific plant species. Although the complete Solanum lycopersicum genome has been published, only 77 Solanum lycopersicum miRNAs have been identified, far fewer than the estimated number. Therefore, it is essential to develop a prediction method based on machine learning to identify new plant miRNAs.
Results
A novel classification model based on a support vector machine (SVM) was trained to identify real and pseudo plant pre-miRNAs together with their miRNAs. An initial set of 152 novel features related to sequential structures was used to train the model. By applying feature selection, we obtained the best subset of 47 features for use with the Back Support Vector Machine-Recursive Feature Elimination (B-SVM-RFE) method for the classification of plant pre-miRNAs. Using this method, 63 features were obtained for plant miRNA classification. We then developed an integrated classification model, miPlantPreMat, which comprises MiPlantPre and MiPlantMat, to identify plant pre-miRNAs and their miRNAs. This model achieved approximately 90% accuracy using plant datasets from nine plant species, including Arabidopsis thaliana, Glycine max, Oryza sativa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Arabidopsis lyrata, Zea mays and Solanum lycopersicum. Using miPlantPreMat, 522 Solanum lycopersicum miRNAs were identified in the Solanum lycopersicum genome sequence.
Conclusions
We developed an integrated classification model, miPlantPreMat, based on structure-sequence features and SVM. MiPlantPreMat was used to identify both plant pre-miRNAs and the corresponding mature miRNAs. An improved feature selection method was proposed, resulting in high classification accuracy, sensitivity and specificity.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0423-x) contains supplementary material, which is available to authorized users.
doi:10.1186/s12859-014-0423-x
PMCID: PMC4310204  PMID: 25547126
MiRNA; Pre-miRNA; Prediction; SVM; Feature selection
6.  Genome-Wide Patterns of Arabidopsis Gene Expression in Nature 
PLoS Genetics  2012;8(4):e1002662.
Organisms in the wild are subject to multiple, fluctuating environmental factors, and it is in complex natural environments that genetic regulatory networks actually function and evolve. We assessed genome-wide gene expression patterns in the wild in two natural accessions of the model plant Arabidopsis thaliana and examined the nature of transcriptional variation throughout its life cycle and gene expression correlations with natural environmental fluctuations. We grew plants in a natural field environment and measured genome-wide time-series gene expression from the plant shoot every three days, spanning the seedling to reproductive stages. We find that 15,352 genes were expressed in the A. thaliana shoot in the field, and accession and flowering status (vegetative versus flowering) were strong components of transcriptional variation in this plant. We identified between ∼110 and 190 time-varying gene expression clusters in the field, many of which were significantly enriched in genes regulated by abiotic and biotic environmental stresses. The two main principal components of vegetative shoot gene expression (PCveg) correlate with temperature and precipitation occurrence in the field. The largest PCveg axis included thermoregulatory genes, while the second major PCveg was associated with precipitation and contained drought-responsive genes. By exposing A. thaliana to natural environments in an open field, we provide a framework for further understanding the genetic networks that are deployed in natural environments, and we connect plant molecular genetics in the laboratory to plant organismal ecology in the wild.
Author Summary
Plants in the real world are continuously exposed to multiple environmental signals and must respond appropriately to the dynamic conditions found in nature. Environmental signals can fluctuate during an individual's life cycle with varying degrees of predictability, and complex natural environments are where gene activity evolves. We grew two natural accessions of the model plant Arabidopsis thaliana in an open field in New York in the spring and examined genome-wide gene expression patterns in the wild. We find nearly 200 gene expression clusters in these field-grown plants, and many of these clusters were enriched in genes that had previously been shown to be associated with expression under various abiotic or biotic environmental stress conditions. Two major principal components of gene expression were associated with environmental fluctuations in temperature and rainfall, and we identified several genes (such as the thermoregulatory nucleosome occupancy gene ARP6 and the drought-sensitive hormone biosynthetic gene AAO3) that could be found in these principal components. By exploring genome-wide gene expression in plants in the wild, we were able to connect mechanistic aspects of plant molecular biology with ecological responses in nature and to begin to understand how organisms behave and adapt in their natural environments.
doi:10.1371/journal.pgen.1002662
PMCID: PMC3330097  PMID: 22532807
7.  Identification of novel motif patterns to decipher the promoter architecture of co-expressed genes in Arabidopsis thaliana 
BMC Systems Biology  2013;7(Suppl 3):S10.
Background
The understanding of the mechanisms of transcriptional regulation remains a challenge for molecular biologists in the post-genome era. It is hypothesized that the regulatory regions of genes expressed in the same tissue or cell type share a similar structure. Though several studies have analyzed the promoters of genes expressed in specific metazoan tissues or cells, little research has been done in plants. Hence finding specific patterns of motifs to explain the promoter architecture of co-expressed genes in plants could shed light on their transcription mechanism.
Results
We identified novel patterns of sets of motifs in promoters of genes co-expressed in four different plant structures (PSs) and in the entire plant in Arabidopsis thaliana. Sets of genes expressed in four PSs (flower, seed, root, shoot) and housekeeping genes expressed in the entire plant were taken from a database of co-expressed genes in A. thaliana. PS-specific motifs were predicted using three motif-discovery algorithms, 8 of which are novel, to the best of our knowledge. A support vector machine was trained using the average upstream distance of the identified motifs from the translation start site on both strands of binding sites. The correctly classified promoters per PS were used to construct specific patterns of sets of motifs to describe the promoter architecture of those co-expressed genes. The discovered PS-specific patterns were tested in the entire A. thaliana genome, correctly identifying 77.8%, 81.2%, 70.8% and 53.7% genes expressed in petal differentiation, synergid cells, root hair and trichome, as well as 88.4% housekeeping genes.
Conclusions
We present five patterns of sets of motifs which describe the promoter architecture of co-expressed genes in five PSs with the ability to predict them from the entire A. thaliana genome. Based on these findings, we conclude that the positioning and orientation of transcription factor binding sites at specific distances from the translation start site is a reliable measure to differentiate promoters of genes expressed in different A. thaliana structures from background genomic promoters. Our method can be used to predict novel motifs and decipher a similar promoter architecture for genes co-expressed in A. thaliana under different conditions.
doi:10.1186/1752-0509-7-S3-S10
PMCID: PMC3852273  PMID: 24555803
8.  Functional Knowledge Transfer for High-accuracy Prediction of Under-studied Biological Processes 
PLoS Computational Biology  2013;9(3):e1002957.
A key challenge in genetics is identifying the functional roles of genes in pathways. Numerous functional genomics techniques (e.g. machine learning) that predict protein function have been developed to address this question. These methods generally build from existing annotations of genes to pathways and thus are often unable to identify additional genes participating in processes that are not already well studied. Many of these processes are well studied in some organism, but not necessarily in an investigator's organism of interest. Sequence-based search methods (e.g. BLAST) have been used to transfer such annotation information between organisms. We demonstrate that functional genomics can complement traditional sequence similarity to improve the transfer of gene annotations between organisms. Our method transfers annotations only when functionally appropriate as determined by genomic data and can be used with any prediction algorithm to combine transferred gene function knowledge with organism-specific high-throughput data to enable accurate function prediction.
We show that diverse state-of-the-art machine learning algorithms leveraging functional knowledge transfer (FKT) dramatically improve their accuracy in predicting gene-pathway membership, particularly for processes with little experimental knowledge in an organism. We also show that our method compares favorably to annotation transfer by sequence similarity. Next, we deploy FKT with a state-of-the-art SVM classifier to predict novel genes for 11,000 biological processes across six diverse organisms and expand the coverage of accurate function predictions to processes that are often ignored because of a dearth of annotated genes in an organism. Finally, we perform in vivo experimental investigation in Danio rerio and confirm the regulatory role of our top predicted novel gene, wnt5b, in leftward cell migration during heart development. FKT is immediately applicable to many bioinformatics techniques and will help biologists systematically integrate prior knowledge from diverse systems to direct targeted experiments in their organism of study.
Author Summary
Due to technical and ethical challenges many human diseases or biological processes are studied in model organisms. Discoveries in these organisms are then transferred back to human or other model organisms. Traditional methods for transferring novel gene function annotations have relied on finding genes with high sequence similarity believed to share evolutionary ancestry. However, sequence similarity does not guarantee a shared functional role in molecular pathways. In this study, we show that functional genomics can complement traditional sequence similarity measures to improve the transfer of gene annotations between organisms. We coupled our knowledge transfer method with current state-of-the-art machine learning algorithms and predicted gene function for 11,000 biological processes across six organisms. We experimentally validated our prediction of wnt5b's involvement in the determination of left-right heart asymmetry in zebrafish. Our results show that functional knowledge transfer can improve the coverage and accuracy of machine learning methods used for gene function prediction in a diverse set of organisms. Such an approach can be applied to additional organisms, and will be especially beneficial in organisms that have high-throughput genomic data with sparse annotations.
doi:10.1371/journal.pcbi.1002957
PMCID: PMC3597527  PMID: 23516347
9.  Targeted interactomics reveals a complex core cell cycle machinery in Arabidopsis thaliana 
A protein interactome focused towards cell proliferation was mapped comprising 857 interactions among 393 proteins, leading to many new insights in plant cell cycle regulation.
A comprehensive view on heterodimeric cyclin-dependent kinase (CDK)/cyclin complexes in plants is obtained, in relation with their regulators.
Over 100 new candidate cell cycle proteins were predicted.
The basic underlying mechanisms that govern the cell cycle are conserved among all eukaryotes. Peculiar for plants, however, is that their genome contains a collection of cell cycle regulatory genes that is intriguingly large (Vandepoele et al, 2002; Menges et al, 2005) compared to other eukaryotes. Arabidopsis thaliana (Arabidopsis) encodes 71 genes in five regulatory classes versus only 15 in yeast and 23 in human.
Despite the discovery of numerous cell cycle genes, little is known about the protein complex machinery that steers plant cell division. Therefore, we applied tandem affinity purification (TAP) approach coupled with mass spectrometry (MS) on Arabidopsis cell suspension cultures to isolate and analyze protein complexes involved in the cell cycle. This approach allowed us to successfully map a first draft of the basic cell cycle complex machinery of Arabidopsis, providing many new insights into plant cell division.
To map the interactome, we relied on a streamlined platform comprising generic Gateway-based vectors with high cloning flexibility, the fast generation of transgenic suspension cultures, TAP adapted for plant cells, and matrix-assisted laser desorption ionization (MALDI) tandem-MS for the identification of purified proteins (Van Leene et al, 2007, 2008). Complexes for 102 cell cycle proteins were analyzed using this approach, leading to a non-redundant data set of 857 interactions among 393 proteins (Figure 1A). Two subspaces were identified in this data set, domain I1, containing interactions confirmed in at least two independent experimental repeats or in the reciprocal purification experiment, and domain I2 consisting of uniquely observed interactions.
Several observations underlined the quality of both domains. All tested reverse purifications found the original interaction, and 150 known or predicted interactions were confirmed, meaning that also a huge stack of new interactions was revealed. An in-depth computational analysis revealed enrichment for many cell cycle-related features among the proteins of the network (Figure 1B), and many protein pairs were coregulated at the transcriptional level (Figure 1C). Through integration of known cell cycle-related features, more than 100 new candidate cell cycle proteins were predicted (Figure 1D). Besides common qualities of both interactome domains, their real significance appeared through mutual differences exposing two subspaces in the cell cycle interactome: a central regulatory network of stable complexes that are repeatedly isolated and represent core regulatory units, and a peripheral network comprising transient interactions identified less frequently, which are involved in other aspects of the process, such as crosstalk between core complexes or connections with other pathways. To evaluate the biological relevance of the cell cycle interactome in plants, we validated interactions from both domains by a transient split-luciferase assay in Arabidopsis plants (Marion et al, 2008), further sustaining the hypothesis-generating power of the data set to understand plant growth.
With respect to insights into the cell cycle physiology, the interactome was subdivided according to the functional classes of the baits and core protein complexes were extracted, covering cyclin-dependent kinase (CDK)/cyclin core complexes together with their positive and negative regulation networks, DNA replication complexes, the anaphase-promoting complex, and spindle checkpoint complexes. The data imply that mitotic A- and B-type cyclins exclusively form heterodimeric complexes with the plant-specific B-type CDKs and not with CDKA;1, whereas D-type cyclins seem to associate with CDKA;1. Besides the extraction of complexes previously shown in other organisms, our data also suggested many new functional links; for example, the link coupling cell division with the regulation of transcript splicing. The association of negative regulators of CDK/cyclin complexes with transcription factors suggests that their role in reallocation is not solely targeted to CDK/cyclin complexes. New members of the Siamese-related inhibitory proteins were identified, and for the first time potential inhibitors of plant-specific mitotic B-type CDKs have been found in plants. New evidence that the E2F–DP–RBR network is not only active at G1-to-S, but also at the G2-to-M transition is provided and many complexes involved in DNA replication or repair were isolated. For the first time, a plant APC has been isolated biochemically, identifying three potential new plant-specific APC interactors, and finally, complexes involved in the spindle checkpoint were isolated mapping many new but specific interactions.
Finally, to get a general view on the complex machinery, modules of interacting cyclins and core cell cycle regulators were ranked along the cell cycle phases according to the transcript expression peak of the cyclins, showing an assorted set of CDK–cyclin complexes with high regulatory differentiation (Figure 4). Even within the same subfamily (e.g. cyclin A3, B1, B2, D3, and D4), cyclins differ not only in their functional time frame but also in the type and number of CDKs, inhibitors, and scaffolding proteins they bind, further indicating their functional diversification. According to our interaction data, at least 92 different variants of CDK–cyclin complexes are found in Arabidopsis.
In conclusion, these results reflect how several rounds of gene duplication (Sterck et al, 2007) led to the evolution of a large set of cyclin paralogs and a myriad of regulators, resulting in a significant jump in the complexity of the cell cycle machinery that could accommodate unique plant-specific features such as an indeterminate mode of postembryonic development. Through their extensive regulation and connection with a myriad of up- and downstream pathways, the core cell cycle complexes might offer the plant a flexible toolkit to fine-tune cell proliferation in response to an ever-changing environment.
Cell proliferation is the main driving force for plant growth. Although genome sequence analysis revealed a high number of cell cycle genes in plants, little is known about the molecular complexes steering cell division. In a targeted proteomics approach, we mapped the core complex machinery at the heart of the Arabidopsis thaliana cell cycle control. Besides a central regulatory network of core complexes, we distinguished a peripheral network that links the core machinery to up- and downstream pathways. Over 100 new candidate cell cycle proteins were predicted and an in-depth biological interpretation demonstrated the hypothesis-generating power of the interaction data. The data set provided a comprehensive view on heterodimeric cyclin-dependent kinase (CDK)–cyclin complexes in plants. For the first time, inhibitory proteins of plant-specific B-type CDKs were discovered and the anaphase-promoting complex was characterized and extended. Important conclusions were that mitotic A- and B-type cyclins form complexes with the plant-specific B-type CDKs and not with CDKA;1, and that D-type cyclins and S-phase-specific A-type cyclins seem to be associated exclusively with CDKA;1. Furthermore, we could show that plants have evolved a combinatorial toolkit consisting of at least 92 different CDK–cyclin complex variants, which strongly underscores the functional diversification among the large family of cyclins and reflects the pivotal role of cell cycle regulation in the developmental plasticity of plants.
doi:10.1038/msb.2010.53
PMCID: PMC2950081  PMID: 20706207
Arabidopsis thaliana; cell cycle; interactome; protein complex; protein interactions
10.  Enriching for correct prediction of biological processes using a combination of diverse classifiers 
BMC Bioinformatics  2011;12:189.
Background
Machine learning models (classifiers) for classifying genes to biological processes each have their own unique characteristics in what genes can be classified and to what biological processes. No single learning model is qualitatively superior to any other model and overall precision for each model tends to be low. The classification results for each classifier can be complementary and synergistic suggesting the benefit of a combination of algorithms, but often the prediction probability outputs of various learning models are neither comparable nor compatible for combining. A means to compare outputs regardless of the model and data used and combine the results into an improved comprehensive model is needed.
Results
Gene expression patterns from NCI's panel of 60 cell lines were used to train a Random Forest, a Support Vector Machine and a Neural Network model, plus two over-sampled models for classifying genes to biological processes. Each model produced unique characteristics in the classification results. We introduce the Precision Index measure (PIN) from the maximum posterior probability that allows assessing, comparing and combining multiple classifiers. The class specific precision measure (PIC) is introduced and used to select a subset of predictions across all classes and all classifiers with high precision. We developed a single classifier that combines the PINs from these five models in prediction and found that the PIN Combined Classifier (PINCom) significantly increased the number of correctly predicted genes over any single classifier. The PINCom applied to test genes that were not used in training also showed substantial improvement over any single model.
Conclusions
This paper introduces novel and effective ways of assessing predictions by their precision and recall plus a method that combines several machine learning models and capitalizes on synergy and complementation in class selection, resulting in higher precision and recall. Different machine learning models yielded incongruent results each of which were successfully combined into one superior model using the PIN measure we developed. Validation of the boosted predictions for gene functions showed the genes to be accurately predicted.
doi:10.1186/1471-2105-12-189
PMCID: PMC3121646  PMID: 21605426
11.  Boolean modeling of transcriptome data reveals novel modes of heterotrimeric G-protein action 
Classical mechanisms of heterotrimeric G-protein signaling are observed to function in regulation of the transcriptome. Conversely, many theoretical regulatory modes of the G-protein are not manifested in the transcriptomes we investigate.
A new mechanism of G-protein signaling is revealed, in which the β subunit regulates gene expression identically in the presence or absence of the α subunit.
We find evidence of cross-talk between G-protein-mediated and hormone-mediated transcriptional regulation.
We find evidence of system specificity in G-protein signaling.
Heterotrimeric G-proteins, composed of α, β, and γ subunits, participate in a wide range of signaling pathways in eukaryotes (Morris and Malbon, 1999). According to the typical, mammalian paradigm, in its inactive state, the G-protein exists as an associated heterotrimer. G-protein signaling begins with ligand binding that results in a conformational change in a G-protein-coupled receptor (GPCR). Once activated by the GPCR, the Gα separates from the associated Gβγ dimer and the freed Gα and Gβγ proteins can then interact with downstream effector molecules, alone or in combination, to transduce the signal. Subsequent to signal propagation, Gα re-associates with the Gβγ dimer to reform the G-protein complex.
There are several classical routes for signal propagation through heterotrimeric G-proteins that have been categorized in mammalian systems (Marrari et al, 2007; Dupre et al, 2009). One route, which we designate classical I, requires the presence of both subunits, and can invoke one of two distinct mechanisms. In one mechanism, on GPCR activation, freed Gα and Gβγ each interact with downstream effectors to elicit the downstream response. In a related mechanism, Gα but not Gβγ interacts with downstream effectors, but the Gβγ dimer is nevertheless required to facilitate coupling of Gα with the relevant GPCR (Marrari et al, 2007). In a second route, which we designate classical II, it is solely the Gβγ dimer that interacts with downstream effectors; in this case, sequestration of Gβγ within the heterotrimer prevents signal propagation. In addition, a few non-classical G-protein regulatory modes have also been implicated in some systems, for example signaling by the intact heterotrimer in yeast (Klein et al, 2000; Frank et al, 2005). Observations such as these lead to a fundamental question, namely, which of all the theoretical regulatory modes of G-protein signaling are realized biologically. Our study answers this question in the context of the model plant Arabidopsis thaliana, and in addition analyzes the manner in which G-protein signaling couples with signaling by the plant hormone abscisic acid. The Arabidopsis genome encodes only one canonical Gα subunit, GPA1, and one canonical Gβ subunit, AGB1, and knockout mutants are available for each of these, allowing clear dissection of Gα- and Gβ-related phenotypes.
Abscisic acid (ABA) is a major plant hormone, which inhibits growth and promotes tolerance of abiotic stresses such as drought, salinity, and cold. ABA signaling is known to interact with heterotrimeric G-protein signaling in both developmental and stress responses in a complex manner, causing, for example, ABA hyposensitivity of guard cell stomatal opening in gpa1 and agb1 single mutants as well as agb1 gpa1 double mutants (Fan et al, 2008), but ABA hypersensitivity of the inhibition of seed germination and post-germination seedling development in the same mutants (Pandey et al, 2006). These experimental observations implicate G-proteins as one of the components of ABA signaling, but to date no systematic study has been conducted in either plant or metazoan systems to define the co-regulatory modes of a G-protein and a hormone.
In this study, we conduct genome-wide gene expression profiling in G-protein subunit mutants of A. thaliana guard cells and leaves, with or without treatment with ABA. By introducing one or more mediators acting downstream of the G-protein and ABA to control transcript levels, we propose nine G-protein/ABA signaling pathways including ABA-independent G-protein signaling pathways, G-protein-independent ABA signaling pathways, and seven distinct ABA–G-protein-coupled signaling pathways (Figure 1). We develop a Boolean modeling framework to systematically enumerate 14 possible theoretical regulatory modes of the G-protein and 142 co-regulatory modes of the G-protein and ABA, and then use a pattern matching approach to associate target genes with theoretical regulatory modes.
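The enumeration and pattern-matching steps described above can be illustrated with a minimal sketch. It assumes that each regulatory mode is a binary (high/low) expression pattern over the four genotypes (wild type, agb1, gpa1, and the double mutant), and that the two constant patterns are excluded because they show no G-protein dependence. That reading gives 16 - 2 = 14 patterns, which agrees with the count quoted above, but the authors' actual enumeration may be defined differently. The names `GENOTYPES`, `enumerate_modes`, and `match_modes` are hypothetical.

```python
from itertools import product

# Genotypes encoded as (G-alpha present, G-beta present). Assumed encoding:
# (1, 1) wild type, (1, 0) agb1 mutant, (0, 1) gpa1 mutant, (0, 0) double mutant.
GENOTYPES = [(1, 1), (1, 0), (0, 1), (0, 0)]


def enumerate_modes():
    """Return every non-constant 0/1 expression pattern over the four genotypes."""
    modes = []
    for pattern in product((0, 1), repeat=len(GENOTYPES)):
        # Skip all-low and all-high patterns: they do not depend on the G-protein.
        if len(set(pattern)) > 1:
            modes.append(dict(zip(GENOTYPES, pattern)))
    return modes


def match_modes(observed, modes):
    """Return the modes whose pattern equals an observed, binarized profile."""
    return [m for m in modes if all(m[g] == observed[g] for g in GENOTYPES)]


# Hypothetical binarized profile: expression is high whenever G-beta is
# present, regardless of G-alpha.
gbeta_only = {(1, 1): 1, (1, 0): 0, (0, 1): 1, (0, 0): 0}
matches = match_modes(gbeta_only, enumerate_modes())
```

Under this encoding, the `gbeta_only` profile corresponds to a pattern in which the β subunit acts identically with or without the α subunit. Real data would first need a thresholding step to turn expression values into high/low calls, which is not shown here.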
Our analysis shows that the G-protein regulatory mode that requires the presence of both Gα and Gβγ subunits (consistent with classical I mechanisms), is well represented in both guard cells and leaves. The G-protein regulatory mode that requires a freed Gβγ subunit (classical II G-protein regulatory mechanism) is well supported in guard cells and somewhat less so in leaves. In addition, a G-protein regulatory mode representing a non-classical regulatory mechanism is prevalent in guard cells but less so in leaves (Figure 5). In this regulatory mode, signaling by Gβ(γ) occurs, and this signaling is not regulated in any way by Gα.
By relating the target genes with the nine proposed G-protein/ABA signaling pathways, we are able to gauge the plausibility of regulatory modes of the G-protein and ABA at the pathway level. We find that G-protein-independent ABA signaling pathways are prevalent in both guard cells and leaves. The existence of an ABA-independent regulatory activity of the G-protein is well supported in guard cells, but not supported in leaves. Additive regulation by G-protein signaling plus G-protein-independent ABA signaling is rare in both guard cells and leaves. In addition, combinatorial cross-talk between G-protein signaling and ABA signaling and additive cross-talk between ABA–G-protein signaling and G-protein-independent ABA signaling are observed in both guard cells and leaves. Our transcriptome analysis indicates that in some cases, ABA definitely does not influence G-protein signaling, though it may do so in some other cases.
To investigate whether previously observed hypersensitivity or hyposensitivity of developmental and dynamic transient responses to ABA in G-protein mutants is recapitulated at the level of transcriptional regulation, we compare gene regulation by ABA in guard cells and leaves of the G-protein mutants versus wild type. We find that in guard cells, equal ABA hyposensitivity of all mutants combined is significant, although hyposensitivity in individual mutants is not. There is also a separate group of genes in guard cells that show ABA hypersensitivity in the gpa1 mutant, suggesting complex interactions between ABA and G-protein signaling in gene regulation in this cell type. In leaves, ABA hyposensitivity of gene expression in the three individual mutants and equal hyposensitivity in all mutants are strongly supported. In addition, several of the functional categories identified by our analysis of G-protein regulatory modes have been implicated in previous physiological analyses of G-protein mutants, providing validation to the biological interpretation of our results.
In summary, by conducting a genome-wide gene expression profiling study in G-protein subunit mutants of A. thaliana guard cells and leaves and developing a Boolean modeling framework, we systematically evaluate the biological utilization of mechanisms of G-protein regulatory action and reveal novel regulatory modes of the G-protein. The results generate empirical evidence and insights regarding molecular events of G-protein signaling and response at the physiological level in both plants and mammals.
Heterotrimeric G-proteins mediate crucial and diverse signaling pathways in eukaryotes. Here, we generate and analyze microarray data from guard cells and leaves of G-protein subunit mutants of the model plant Arabidopsis thaliana, with or without treatment with the stress hormone, abscisic acid. Although G-protein control of the transcriptome has received little attention to date in any system, transcriptome analysis allows us to search for potentially uncommon yet significant signaling mechanisms. We describe the theoretical Boolean mechanisms of G-protein × hormone regulation, and then apply a pattern matching approach to associate gene expression profiles with Boolean models. We find that (1) classical mechanisms of G-protein signaling are well represented. Conversely, some theoretical regulatory modes of the G-protein are not supported; (2) a new mechanism of G-protein signaling is revealed, in which Gβ regulates gene expression identically in the presence or absence of Gα; (3) guard cells and leaves favor different G-protein modes in transcriptome regulation, supporting system specificity of G-protein signaling. Our method holds significant promise for analyzing analogous ‘switch-like' signal transduction events in any organism.
doi:10.1038/msb.2010.28
PMCID: PMC2913393  PMID: 20531402
abscisic acid; Arabidopsis thaliana; Boolean modeling; heterotrimeric G-protein; transcriptome
12.  Discriminative local subspaces in gene expression data for effective gene function prediction 
Bioinformatics  2012;28(17):2256-2264.
Motivation: Massive amounts of genome-wide gene expression data have become available, motivating the development of computational approaches that leverage this information to predict gene function. Among successful approaches, supervised machine learning methods, such as Support Vector Machines (SVMs), have shown superior prediction accuracy. However, these methods lack the simple biological intuition provided by co-expression networks (CNs), limiting their practical usefulness.
Results: In this work, we present Discriminative Local Subspaces (DLS), a novel method that combines supervised machine learning and co-expression techniques with the goal of systematically predicting genes involved in specific biological processes of interest. Unlike traditional CNs, DLS uses the knowledge available in Gene Ontology (GO) to generate informative training sets that guide the discovery of expression signatures: expression patterns that are discriminative for genes involved in the biological process of interest. By linking genes co-expressed with these signatures, DLS is able to construct a discriminative CN that links both known and previously uncharacterized genes for the selected biological process. This article focuses on the algorithm behind DLS and shows its predictive power using an Arabidopsis thaliana dataset and a representative set of 101 GO terms from the Biological Process Ontology. Our results show that DLS has a higher average accuracy than both SVMs and CNs. Thus, DLS is able to provide the prediction accuracy of supervised learning methods while maintaining the intuitive understanding of CNs.
Availability: A MATLAB® implementation of DLS is available at http://virtualplant.bio.puc.cl/cgi-bin/Lab/tools.cgi
Contact: tfpuelma@uc.cl
Supplementary Information: Supplementary data are available at http://bioinformatics.mpimp-golm.mpg.de/.
doi:10.1093/bioinformatics/bts455
PMCID: PMC3426849  PMID: 22820203
13.  Trichoderma-Plant Root Colonization: Escaping Early Plant Defense Responses and Activation of the Antioxidant Machinery for Saline Stress Tolerance 
PLoS Pathogens  2013;9(3):e1003221.
Trichoderma spp. are versatile opportunistic plant symbionts which can colonize the apoplast of plant roots. Microarray analysis of Arabidopsis thaliana roots inoculated with Trichoderma asperelloides T203, coupled with qPCR analysis of 137 stress responsive genes and transcription factors, revealed wide gene transcript reprogramming, proceeded by a transient repression of the plant immune responses supposedly to allow root colonization. Enhancement in the expression of WRKY18 and WRKY40, which stimulate JA-signaling via suppression of JAZ repressors and negatively regulate the expression of the defense genes FMO1, PAD3 and CYP71A13, was detected in Arabidopsis roots upon Trichoderma colonization. Reduced root colonization was observed in the wrky18/wrky40 double mutant line, while partial phenotypic complementation was achieved by over-expressing WRKY40 in the wrky18 wrky40 background. On the other hand, an increased colonization rate was found in roots of the FMO1 knockout mutant. Trichoderma spp. stimulate plant growth and resistance to a wide range of adverse environmental conditions. Arabidopsis and cucumber (Cucumis sativus L.) plants treated with Trichoderma prior to salt stress imposition show significantly improved seed germination. In addition, Trichoderma treatment affects the expression of several genes related to osmo-protection and general oxidative stress in roots of both plants. The MDAR gene coding for monodehydroascorbate reductase is significantly up-regulated and, accordingly, the pool of reduced ascorbic acid was found to be increased in Trichoderma treated plants. 1-Aminocyclopropane-1-carboxylate (ACC)-deaminase silenced Trichoderma mutants were less effective in providing tolerance to salt stress, suggesting that Trichoderma, similarly to ACC deaminase producing bacteria, can ameliorate plant growth under conditions of abiotic stress, by ameliorating increases in ethylene levels as well as promoting an elevated antioxidative capacity.
Author Summary
Trichoderma fungi have been developed as biocontrol agents and are applied to protect and improve crop yields. Colonization of plant roots by Trichoderma can protect plants against diseases and environmental stresses such as salinity and drought, and can improve plant growth and development. To better understand the mechanism underlying the plant-Trichoderma interaction we followed changes in global gene expression in colonized Arabidopsis roots. We associate the known gene biological function to the processes of root colonization and abiotic stress tolerance mediated by Trichoderma. Using Arabidopsis mutant lines we show the function of a subset of those genes in root colonization. We show that wrky18 and wrky40 transcription factors activate and suppress the expression of different genes in order to allow successful root colonization. We also combine the gene expression data together with the measurement of ascorbic acid level to demonstrate that salt stress tolerance offered by Trichoderma is dependent on activation of the plant antioxidant defense machinery. Using Trichoderma lines mutated in the ACC deaminase gene, we demonstrate that reduction of ethylene levels is also essential in achieving salt tolerance. This study represents an important step forward in understanding the nature of the non-pathogenic plant-Trichoderma interaction, and may contribute to the efforts to improve Trichoderma biocontrol abilities.
doi:10.1371/journal.ppat.1003221
PMCID: PMC3597500  PMID: 23516362
14.  Integration of Arabidopsis thaliana stress-related transcript profiles, promoter structures, and cell-specific expression 
Genome Biology  2007;8(4):R49.
The integration of stress-dependent, tissue- and cell-specific expression profiles and 5'-regulatory sequence motif analysis defines a common stress transcriptome, identifies major motifs for stress response, and places stress response in the context of tissue and cell lineages in the Arabidopsis root.
Background
Arabidopsis thaliana transcript profiles indicate effects of abiotic and biotic stresses and tissue-specific and cell-specific gene expression. Organizing these datasets could reveal the structure and mechanisms of responses and crosstalk between pathways, and in which cells the plants perceive, signal, respond to, and integrate environmental inputs.
Results
We clustered Arabidopsis transcript profiles for various treatments, including abiotic, biotic, and chemical stresses. Ubiquitous stress responses in Arabidopsis, similar to those of fungi and animals, employ genes in pathways related to mitogen-activated protein kinases, Snf1-related kinases, vesicle transport, mitochondrial functions, and the transcription machinery. Induced responses to stresses are attributed to genes whose promoters are characterized by a small number of regulatory motifs, although secondary motifs were also apparent. Most genes that are downregulated by stresses exhibited distinct tissue-specific expression patterns and appear to be under developmental regulation. The abscisic acid-dependent transcriptome is delineated in the cluster structure, whereas functions that are dependent on reactive oxygen species are widely distributed, indicating that evolutionary pressures confer distinct responses to different stresses in time and space. Cell lineages in roots express stress-responsive genes at different levels. Intersections of stress-responsive and cell-specific profiles identified cell lineages affected by abiotic stress.
Conclusion
By analyzing the stress-dependent expression profile, we define a common stress transcriptome that apparently represents universal cell-level stress responses. Combining stress-dependent and tissue-specific and cell-specific expression profiles, and Arabidopsis 5'-regulatory DNA sequences, we confirm known stress-related 5' cis-elements on a genome-wide scale, identify secondary motifs, and place the stress response within the context of tissues and cell lineages in the Arabidopsis root.
doi:10.1186/gb-2007-8-4-r49
PMCID: PMC1896000  PMID: 17408486
15.  Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data 
PLoS ONE  2009;4(12):e8250.
Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:
Support Vector Machine Recursive Feature Elimination (SVMRFE)
Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
Gradient based Leave-one-out Gene Selection (GLGS)
To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with learning classifier. Overall, our approach outperforms other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: Testing accuracy, MCC values, and AUC errors.
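To make the idea of wrapper-style gene selection concrete, the following is a minimal sketch of plain sequential forward selection scored by leave-one-out accuracy of a nearest-class-mean classifier. It is a generic illustration only: it is not the RFA method proposed in the paper, nor a reproduction of LOOCSFS or NMSC as implemented there, and all function names and the toy data are hypothetical.

```python
def nearest_mean_predict(train_X, train_y, x, features):
    """Predict the label whose class mean is closest to x on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        dist = 0.0
        for f in features:
            mean_f = sum(r[f] for r in rows) / len(rows)
            dist += (x[f] - mean_f) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of the nearest-mean classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_mean_predict(train_X, train_y, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)


def forward_select(X, y, n_features):
    """Greedily add the feature that most improves leave-one-out accuracy."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_features:
        best = max(remaining, key=lambda f: loo_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): column 0 is noise, column 1 separates the classes.
X = [[0.5, 0.0], [0.1, 0.1], [0.9, 0.2], [0.4, 1.0], [0.6, 0.9], [0.2, 1.1]]
y = [0, 0, 0, 1, 1, 1]
chosen = forward_select(X, y, n_features=1)
```

Because the selection criterion is the classifier's own accuracy, the chosen genes depend on which classifier is plugged in, which is one way to read the observation that gene selection and learning classifier are tightly paired.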
doi:10.1371/journal.pone.0008250
PMCID: PMC2789385  PMID: 20011240
16.  Chromosomal periodicity and positional networks of genes in Escherichia coli 
Escherichia coli periodic gene distribution is identified for a periodic interval of 33 kb.
Two positional networks of genes are discovered by studying gene periodic distribution: one is driven by metabolic genes and the other by genes involved in cellular processing and signaling.
A functional core of Escherichia coli genes drives gene periodic distribution.
A few chromosomal regions that preserve gene transcription profiles across environmental changes are identified.
This single genome analysis approach can be taken as a footprint for a large-scale bacterial and archaeal periodic distribution analysis.
The structure of dynamic folds in microbial chromosomes is largely unknown. On the other hand, genes characterizing a functional core in Escherichia coli K12 are shown to be periodically distributed along the arcs, suggesting an encoded three-dimensional genomic organization helping functional activities among which are translation and, possibly, transcription. Core genes are expected to be either highly expressed or rapidly expressed when needed. Because of the life mode of E. coli K12, they are especially encoded at the genomic level, with a very biased codon composition, and as a consequence, they can, to some extent, be predicted in silico. On the basis of a computational method allowing the definition of a class of genes that are organism specific, we identify a pool of core genes, some of which are conserved across many species, some depend on the environmental living conditions of the organism, some are involved in the stress response, and others have no function yet identified. This set of predicted core genes covers roughly 10% of all genes in E. coli K12 and approximates well the class of experimentally known essential genes. An important property of core genes is that they cover all the spectrum of microbial functions. This means that for any functional class of genes, some representative of the class belongs to the functional core. Consequently, we reasoned, the three-dimensional chromosomal arrangement of these genes may be important to fulfill basic functional responses.
A strong periodic signal of 33 kb is detected, and the approach shows also that a periodic arrangement affects not only core genes, but in fact, all genes along the E. coli K12 chromosome, even if the signal is weaker. An analysis of functional classes of genes shows that they systematically organize into two independent positional gene networks, one driven by metabolic genes and the other by genes involved in cellular processing and signaling (Figure 5A). We conclude that functional reasons justify periodic gene organization.
To explore the functional basis of the distribution, we examined the relationships between the codon bias of E. coli K12 genes and transcriptomic data for a number of different growth conditions. We could identify in a very precise manner a few chromosomal regions that preserve gene transcription profiles across environmental changes. These regions present a profile of the expression levels for their genes, which is periodic with a period of 33 kb.
These findings generate new questions on evolutionary pressures imposed on the chromosome and suggest a number of insights on chromosomal superhelicity that can lead to a precise conception of experiments and to hypotheses to be tested. The theoretical analysis of functional classes of genes involved in the periodic distribution, for instance, makes clear that metabolic genes and genes involved in translation are expected to be the most affected by a disruption of the periodic chromosomal arrangement. The methodological approach is based on single genome analysis. Given either core genes or genes organized in functional classes, we analyze the detailed distribution of distances between pairs of genes through a parameterized model based on signal processing and find that these groups of genes tend to be separated by a regular integral distance characterized by a periodic interval of 33 kb. The methodology can be applied to any set of genes and can be taken as a footprint for large-scale bacterial and archaeal analysis.
The structure of dynamic folds in microbial chromosomes is largely unknown. Here, we find that genes with a highly biased codon composition and characterizing a functional core in Escherichia coli K12 are shown to be periodically distributed along the arcs, suggesting an encoded three-dimensional genomic organization helping functional activities among which are translation and, possibly, transcription. This extends to functional classes of genes that are shown to systematically organize into two independent positional gene networks, one driven by metabolic genes and the other by genes involved in cellular processing and signaling. We conclude that functional reasons justify periodic gene organization. This finding generates new questions on evolutionary pressures imposed on the chromosome. Our methodological approach is based on single genome analysis. Given either core genes or genes organized in functional classes, we analyze the detailed distribution of distances between pairs of genes through a parameterized model based on signal processing and find that these groups of genes tend to be separated by a regular integral distance. The methodology can be applied to any set of genes and can be taken as a footprint for large-scale bacterial and archaeal analysis.
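The idea of testing whether pairwise gene distances cluster around multiples of a candidate period can be illustrated with a minimal sketch. It does not reproduce the parameterized signal-processing model of the study; it swaps in a simple circular statistic (the mean resultant length of the distance phases), and the function name, the toy gene positions, and the toy genome length are all assumptions made for illustration.

```python
import math
from itertools import combinations


def periodicity_score(positions, period, genome_length):
    """Return a value in [0, 1]; values near 1 mean that pairwise
    distances sit close to integer multiples of `period`.

    positions     -- gene start coordinates in bp (hypothetical input)
    period        -- candidate period in bp
    genome_length -- length of the circular chromosome in bp
    """
    cos_sum = 0.0
    sin_sum = 0.0
    n_pairs = 0
    for a, b in combinations(positions, 2):
        d = abs(a - b)
        d = min(d, genome_length - d)  # shortest arc on a circular chromosome
        phase = 2.0 * math.pi * d / period
        cos_sum += math.cos(phase)
        sin_sum += math.sin(phase)
        n_pairs += 1
    if n_pairs == 0:
        raise ValueError("need at least two positions")
    return math.hypot(cos_sum, sin_sum) / n_pairs


# Toy data: positions placed on a 33 kb grid, toy genome length (not real data).
toy_positions = [0, 33_000, 66_000, 132_000, 330_000]
for candidate in (20_000, 33_000):
    print(candidate, round(periodicity_score(toy_positions, candidate, 4_620_000), 3))
```

In practice one would scan a range of candidate periods and compare the scores against those obtained for randomly placed genes, since a high score alone does not establish significance.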
doi:10.1038/msb.2010.21
PMCID: PMC2890325  PMID: 20461073
chromosome structure; COGs classes; core genes; Escherichia coli K12; essential genes
17.  A Global Survey of Gene Regulation during Cold Acclimation in Arabidopsis thaliana 
PLoS Genetics  2005;1(2):e26.
Many temperate plant species such as Arabidopsis thaliana are able to increase their freezing tolerance when exposed to low, nonfreezing temperatures in a process called cold acclimation. This process is accompanied by complex changes in gene expression. Previous studies have investigated these changes but have mainly focused on individual or small groups of genes. We present a comprehensive statistical analysis of the genome-wide changes of gene expression in response to 14 d of cold acclimation in Arabidopsis, and provide a large-scale validation of these data by comparing datasets obtained for the Affymetrix ATH1 Genechip and MWG 50-mer oligonucleotide whole-genome microarrays. We combine these datasets with existing published and publicly available data investigating Arabidopsis gene expression in response to low temperature. All data are integrated into a database detailing the cold responsiveness of 22,043 genes as a function of time of exposure at low temperature. We concentrate our functional analysis on global changes marking relevant pathways or functional groups of genes. These analyses provide a statistical basis for many previously reported changes, identify so far unreported changes, and show which processes predominate during different times of cold acclimation. This approach offers the fullest characterization of global changes in gene expression in response to low temperature available to date.
Synopsis
Freezing tolerance is an important determinant of geographical distribution of plant species, and freezing damage in crop plants leads to severe losses in agriculture. Many temperate plants increase their freezing tolerance during exposure to low, but nonfreezing temperatures, a process known as cold acclimation. Freezing tolerance and cold acclimation are complex, quantitative genetic traits. The number and functional roles of the responsible genes are not known for any plant species. Using the model plant Arabidopsis thaliana, which is moderately freezing tolerant and able to cold acclimate, the global regulation of gene expression during exposure to 4 °C for 14 d was analyzed by microarray hybridization. For validation of gene expression data, triplicate biological samples were hybridized to two different oligonucleotide arrays. Results from the two platforms showed good agreement, indicating the reliability of the measurements. The authors combined their data with all publicly available data on cold-regulated gene expression in A. thaliana to compile a database detailing the cold responsiveness of 22,043 genes as a function of exposure time. In addition, thorough statistical analysis was used to identify metabolic pathways and physiological processes that are predominantly involved in the plant cold-acclimation process.
doi:10.1371/journal.pgen.0010026
PMCID: PMC1189076  PMID: 16121258
18.  Transfer RNA modifications and genes for modifying enzymes in Arabidopsis thaliana 
BMC Plant Biology  2010;10:201.
Background
In all domains of life, transfer RNA (tRNA) molecules contain modified nucleosides. Modifications to tRNAs affect their coding capacity and influence codon-anticodon interactions. Nucleoside modification deficiencies have a diverse range of effects, from decreased virulence in bacteria and neural system disease in humans to gene expression and stress response changes in plants. The purpose of this study was to identify genes involved in tRNA modification in the model plant Arabidopsis thaliana, to understand the function of nucleoside modifications in plant growth and development.
Results
In this study, we established a method for analyzing modified nucleosides in tRNAs from the model plant species, Arabidopsis thaliana and hybrid aspen (Populus tremula × tremuloides). 21 modified nucleosides in tRNAs were identified in both species. To identify the genes responsible for the plant tRNA modifications, we performed global analysis of the Arabidopsis genome for candidate genes. Based on the conserved domains of homologs in Saccharomyces cerevisiae and Escherichia coli, more than 90 genes were predicted to encode tRNA modifying enzymes in the Arabidopsis genome. Transcript accumulation patterns for the genes in Arabidopsis and the phylogenetic distribution of the genes among different plant species were investigated. Transcripts for the majority of the Arabidopsis candidate genes were found to be most abundant in rosette leaves and shoot apices. Whereas most of the tRNA modifying gene families identified in the Arabidopsis genome were found to be present in other plant species, there was a large variation in the number of genes present for each family.
Through a loss of function mutagenesis study, we identified five tRNA modification genes (AtTRM10, AtTRM11, AtTRM82, AtKTI12 and AtELP1) responsible for four specific modified nucleosides (m1G, m2G, m7G and ncm5U), respectively (two genes: AtKTI12 and AtELP1 identified for ncm5U modification). The AtTRM11 mutant exhibited an early-flowering phenotype, and the AtELP1 mutant had narrow leaves, reduced root growth, an aberrant silique shape and defects in the generation of secondary shoots.
Conclusions
Using a reverse genetics approach, we successfully isolated and identified five tRNA modification genes in Arabidopsis thaliana. We conclude that the method established in this study will facilitate the identification of tRNA modification genes in a wide variety of plant species.
doi:10.1186/1471-2229-10-201
PMCID: PMC2956550  PMID: 20836892
19.  A novel computational model of the circadian clock in Arabidopsis that incorporates PRR7 and PRR9 
We developed a mathematical model of the Arabidopsis circadian clock, including PRR7 and PRR9, which is able to predict several single, double and triple mutant phenotypes. Sensitivity Analysis was used to identify the properties and time sensing mechanisms of model structures. PRR7 and CCA1/LHY were identified as weak points of the mathematical model indicating where more experimental data is needed for further model development. Detailed dynamical studies showed that the timing of an evening light sensing element is essential for day length responsiveness.
In recent years, molecular genetic techniques have revealed a complex network of components in the Arabidopsis circadian clock. Mathematical models allow for a detailed study of the dynamics and architecture of such complex gene networks leading to a better understanding of the genetic interactions. It is important to maintain a constant iteration with experimentation, to include novel components as they are discovered and use the updated model to design new experiments. This study develops a framework to introduce new components into the mathematical model of the Arabidopsis circadian clock accelerating the iterative model development process and gaining insight into the system's properties.
We used the interlocked feedback loop model published in Locke et al (2005) as the base model. In Arabidopsis, the first suggested regulatory loop involves the morning expressed transcription factors CIRCADIAN CLOCK-ASSOCIATED 1 (CCA1) and LATE ELONGATED HYPOCOTYL (LHY), and the evening expressed pseudo-response regulator TIMING OF CAB EXPRESSION (TOC1). The hypothetical component X had been introduced to realize a longer delay between gene expression of CCA1/LHY and TOC1. The introduction of Y was motivated by the need for a mechanism to reproduce the dampening short period rhythms of the cca1/lhy double mutant and to include an additional light input at the end of the day.
In this study, the new components pseudo-response regulators PRR7 and PRR9 were added in negative feedback loops based on the biological hypothesis that they are activated by LHY and in turn repress LHY transcription (Farré et al, 2005; Figure 1). We present three iteration steps of model development (Figure 1A–C).
A wide range of tools was used to establish and analyze new model structures. One of the challenges facing mathematical modeling of biological processes is parameter identification, because parameters are notoriously difficult to determine experimentally. We established an optimization procedure based on an evolutionary strategy with a cost function mainly derived from wild-type characteristics. This ensured that the model was not restricted by a specific set of parameters and enabled us to use a large set of biological mutant information to assess the predictive capability of the model structure. Models were evaluated by means of an extended phenotype catalogue, allowing for an easy and fair comparison of the structures. We also carried out detailed simulation analysis of component interactions to identify weak points in the structure and suggest further modifications. Finally, we applied sensitivity analysis in a novel manner, using it to direct the model development. Sensitivity analysis provides quantitative measures of robustness; the two measures in this study were the traces of component concentrations over time (classical state sensitivities) and phase behavior (measured by the phase response curve). Three major results emerged from the model development process.
First, the iteration process helped us to learn about general characteristics of the system. We observed that the timing of Y expression is critical for evening light entrainment, which enables the system to respond to changes in day length. This is important for our understanding of the mechanism of light input to the clock and will aid in the identification of biological candidates for this function. In addition, our results suggest that a detailed description of the mechanisms of genetic interactions is important for the system's behavior. We observed that the introduction of an experimentally based precise light regulation mechanism on PRR9 expression had a significant effect on the system's behavior.
Second, the final model structure (Figure 1C) was capable of predicting a wide range of mutant phenotypes, such as a reduction of TOC1 expression by RNAi (toc1RNAi), mutations in PRR7 and PRR9 and the novel mutant combinations prr9toc1RNAi and prr7prr9toc1RNAi. However, it was unable to predict the mutations in CCA1 and LHY.
Finally, sensitivity analysis identified the weak points of the system. The developed model structure was heavily based on the TOC1/Y feedback loop. This could explain the model's failure to predict the cca1lhy double mutant phenotype. More detailed information on the regulation of CCA1 and LHY expression will be important to achieve the right balance between the different regulatory loops in the mathematical model. This is in accordance with genetic studies that have identified several genes involved in the regulation of LHY and CCA1 expression. The identification of their mechanism of action will be necessary for the next model development.
In plants, as in animals, the core mechanism to retain rhythmic gene expression relies on the interaction of multiple feedback loops. In recent years, molecular genetic techniques have revealed a complex network of clock components in Arabidopsis. To gain insight into the dynamics of these interactions, new components need to be integrated into the mathematical model of the plant clock. Our approach accelerates the iterative process of model identification, to incorporate new components, and to systematically test different proposed structural hypotheses. Recent studies indicate that the pseudo-response regulators PRR7 and PRR9 play a key role in the core clock of Arabidopsis. We incorporate PRR7 and PRR9 into an existing model involving the transcription factors TIMING OF CAB (TOC1), LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED (CCA1). We propose candidate models based on experimental hypotheses and identify the computational models with the application of an optimization routine. Validation is accomplished through systematic analysis of various mutant phenotypes. We introduce and apply sensitivity analysis as a novel tool for analyzing and distinguishing the characteristics of proposed architectures, which also allows for further validation of the hypothesized structures.
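As a rough illustration of what a state sensitivity is, the sketch below computes how the concentration trace of a single component changes when one parameter is perturbed. It is not the clock model of the study: the one-variable synthesis/degradation model, the explicit Euler integrator, the central finite-difference scheme, and all names and parameter values are assumptions chosen only to keep the example self-contained.

```python
def simulate(k_syn, k_deg, x0=0.0, dt=0.01, steps=1000):
    """Integrate dx/dt = k_syn - k_deg * x with explicit Euler steps
    (toy model, hypothetical parameters) and return the full trace."""
    x = x0
    trace = [x]
    for _ in range(steps):
        x += dt * (k_syn - k_deg * x)
        trace.append(x)
    return trace


def state_sensitivity(param_name, params, rel_step=1e-4):
    """Approximate d x(t) / d parameter at every time point by
    central finite differences around the nominal parameter value."""
    base = params[param_name]
    h = base * rel_step
    x_up = simulate(**dict(params, **{param_name: base + h}))
    x_down = simulate(**dict(params, **{param_name: base - h}))
    return [(u - d) / (2.0 * h) for u, d in zip(x_up, x_down)]


nominal = {"k_syn": 2.0, "k_deg": 0.5}
for name in nominal:
    sens = state_sensitivity(name, nominal)
    print(name, "sensitivity at final time point:", round(sens[-1], 3))
```

For larger models, sensitivities are usually obtained by integrating the sensitivity equations alongside the model rather than by finite differences; the finite-difference version is shown here only because it is short.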
doi:10.1038/msb4100101
PMCID: PMC1682023  PMID: 17102803
Arabidopsis; circadian rhythms; mathematical modeling; parameter optimization; sensitivity analysis
20.  In silico selection of Arabidopsis thaliana ecotypes with enhanced stress tolerance 
Plant Signaling & Behavior  2013;8(11):e26364.
Climate models predict increased occurrences of combined abiotic and biotic stress. Unfortunately, most studies on plant stress responses include single or double stress scenarios only. Recently, we established a multi-factorial system in Arabidopsis thaliana (Arabidopsis) to study the influence of simultaneously applied heat, drought, and virus. Our transcriptome analysis revealed that gene expression under multi-factorial stress is not predictable from single stress treatments. Combined heat and drought stress reduced expression of defense genes and genes involved in R-mediated disease responses, which correlated with increased susceptibility of Arabidopsis to virus infection. Eleven genes were found to be differentially regulated under all stress conditions. Assuming that regulated expression of these genes is important for plant fitness, Arabidopsis ecotypes were clustered according to their expression. Interestingly, ecotypes showing a close correlation to stressed Col-0 prior to stress treatment showed improved growth under stress conditions. This result suggests a functional relevance of these genes in stress tolerance.
doi:10.4161/psb.26364
PMCID: PMC4091480  PMID: 24022272
Arabidopsis thaliana; ecotypes; stress-responsive genes; in silico analysis; abiotic stress
21.  Evolution and expression analysis of the grape (Vitis vinifera L.) WRKY gene family 
Journal of Experimental Botany  2014;65(6):1513-1528.
Summary
Fifty-nine VvWRKY genes were identified. Phylogenetic tree and synteny analysis revealed the specific evolutionary relationship of these genes. Meanwhile, differential expression patterns indicated their possible roles in specific tissues and under different stresses.
WRKY proteins comprise a large family of transcription factors that play important roles in plant defence regulatory networks, including responses to various biotic and abiotic stresses. To date, no large-scale study of WRKY genes has been undertaken in grape (Vitis vinifera L.). In this study, a total of 59 putative grape WRKY genes (VvWRKY) were identified and renamed on the basis of their respective chromosome distribution. A multiple sequence alignment analysis using the coding sequences of all predicted grape WRKY genes, together with those from Arabidopsis thaliana and tomato (Solanum lycopersicum), indicated that the 59 VvWRKY genes can be classified into three main groups (I–III). An evaluation of the duplication events suggested that several WRKY genes arose before the divergence of the grape and Arabidopsis lineages. Moreover, expression profiles derived from semiquantitative PCR and real-time quantitative PCR analyses showed distinct expression patterns in various tissues and in response to different treatments. Four VvWRKY genes showed a significantly higher expression in roots or leaves, 55 responded to varying degrees to at least one abiotic stress treatment, and the expression of 38 was altered following powdery mildew (Erysiphe necator) infection. Most VvWRKY genes were downregulated in response to abscisic acid or salicylic acid treatments, while the expression of a subset was upregulated by methyl jasmonate or ethylene treatments.
doi:10.1093/jxb/eru007
PMCID: PMC3967086  PMID: 24510937
Evolution; expression profile analysis; grape (Vitis vinifera L.); phylogenetic analysis; synteny analysis; WRKY genes.
22.  Mechanical Stress Induces Biotic and Abiotic Stress Responses via a Novel cis-Element 
PLoS Genetics  2007;3(10):e172.
Plants are continuously exposed to a myriad of abiotic and biotic stresses. However, the molecular mechanisms by which these stress signals are perceived and transduced are poorly understood. To begin to identify primary stress signal transduction components, we have focused on genes that respond rapidly (within 5 min) to stress signals. Because it has been hypothesized that detection of physical stress is a mechanism common to mounting a response against a broad range of environmental stresses, we have utilized mechanical wounding as the stress stimulus and performed whole genome microarray analysis of Arabidopsis thaliana leaf tissue. This led to the identification of a number of rapid wound responsive (RWR) genes. Comparison of RWR genes with published abiotic and biotic stress microarray datasets demonstrates a large overlap across a wide range of environmental stresses. Interestingly, RWR genes also exhibit a striking level and pattern of circadian regulation, with induced and repressed genes displaying antiphasic rhythms. Using bioinformatic analysis, we identified a novel motif overrepresented in the promoters of RWR genes, herein designated as the Rapid Stress Response Element (RSRE). We demonstrate in transgenic plants that multimerized RSREs are sufficient to confer a rapid response to both biotic and abiotic stresses in vivo, thereby establishing the functional involvement of this motif in primary transcriptional stress responses. Collectively, our data provide evidence for a novel cis-element that is distributed across the promoters of an array of diverse stress-responsive genes, poised to respond immediately and coordinately to stress signals. This structure suggests that plants may have a transcriptional network resembling the general stress signaling pathway in yeast and that the RSRE element may provide the key to this coordinate regulation.
Author Summary
Plants are sessile organisms constantly challenged by a wide spectrum of biotic and abiotic stresses. These stresses cause considerable losses in crop yields worldwide, while the demand for food and energy is on the rise. Understanding the molecular mechanisms driving stress responses is crucial to devising targeted strategies to engineer stress-tolerant plants. To identify primary stress-responsive genes we examined the transcriptional profile of plants after mechanical wounding, which was used as a brief, inductive stimulus. Comparison of the ensemble of rapid wound response transcripts with published transcript profiles revealed a notable overlap with biotic and abiotic stress-responsive genes. Additional quantitative analyses of selected genes over a wounding time-course enabled classification into two groups: transient and stably expressed. Bioinformatic analysis of rapid wound response gene promoter sequences enabled us to identify a novel DNA motif, designated the Rapid Stress Response Element. This motif is sufficient to confer a rapid response to both biotic and abiotic stresses in vivo, thereby confirming the functional involvement of this motif in the primary transcriptional stress response. The genes we identified may represent initial components of the general stress-response network and may be useful in engineering multi-stress tolerant plants.
doi:10.1371/journal.pgen.0030172
PMCID: PMC2039767  PMID: 17953483
23.  Non-canonical peroxisome targeting signals: identification of novel PTS1 tripeptides and characterization of enhancer elements by computational permutation analysis 
BMC Plant Biology  2012;12:142.
Background
High-accuracy prediction tools are essential in the post-genomic era to define organellar proteomes in their full complexity. We recently applied a discriminative machine learning approach to predict plant proteins carrying peroxisome targeting signals (PTS) type 1 from genome sequences. For Arabidopsis thaliana 392 gene models were predicted to be peroxisome-targeted. The predictions were extensively tested in vivo, resulting in a high experimental verification rate of Arabidopsis proteins previously not known to be peroxisomal.
Results
In this study, we experimentally validated the predictions in greater depth by focusing on the most challenging Arabidopsis proteins with unknown non-canonical PTS1 tripeptides and prediction scores close to the threshold. By in vivo subcellular targeting analysis, three novel PTS1 tripeptides (QRL>, SQM>, and SDL>) and two novel tripeptide residues (Q at position −3 and D at position −2) were identified. To understand why, among many Arabidopsis proteins carrying the same C-terminal tripeptides, these proteins were specifically predicted as peroxisomal, the residues upstream of the PTS1 tripeptide were computationally permuted and the changes in prediction scores were analyzed. The newly identified Arabidopsis proteins were found to contain four to five amino acid residues of high predicted targeting enhancing properties at position −4 to −12 in front of the non-canonical PTS1 tripeptide. The identity of the predicted targeting enhancing residues was unexpectedly diverse, comprising besides basic residues also proline, hydroxylated (Ser, Thr), hydrophobic (Ala, Val), and even acidic residues.
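The permutation step described above can be sketched generically: substitute each upstream residue with every other amino acid and record the change in prediction score. The sketch below assumes a user-supplied `score` callable standing in for the actual predictor, which is not reproduced here; the function name, the default window of nine residues (positions −4 to −12), and the toy scorer are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_effects(sequence, score, upstream_len=9, pts1_len=3):
    """Map (position, substituted residue) to the change in score.

    Positions are counted from the C-terminus, so with pts1_len=3 the
    residue directly in front of the tripeptide is position -4.
    `score` is any callable taking a sequence string and returning a number.
    """
    if len(sequence) <= pts1_len:
        raise ValueError("sequence too short to have upstream residues")
    baseline = score(sequence)
    end = len(sequence) - pts1_len          # first index of the tripeptide
    start = max(0, end - upstream_len)      # first index of the upstream window
    effects = {}
    for i in range(start, end):
        position = i - len(sequence)        # negative, C-terminal numbering
        for aa in AMINO_ACIDS:
            if aa == sequence[i]:
                continue
            mutant = sequence[:i] + aa + sequence[i + 1:]
            effects[(position, aa)] = score(mutant) - baseline
    return effects


def toy_score(seq):
    """Hypothetical stand-in scorer: counts basic residues before the tripeptide."""
    return sum(residue in "KR" for residue in seq[:-3])


effects = substitution_effects("M" + "A" * 11 + "SQM", toy_score)
best = max(effects, key=effects.get)
print("largest toy score gain:", best, effects[best])
```

Summing or averaging the score changes per position would give a rough profile of which upstream positions the predictor treats as targeting enhancing.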
Conclusions
Our computational and experimental analyses demonstrate that the plant PTS1 tripeptide motif is more diverse than previously thought, including an increasing number of non-canonical sequences and allowed residues. Specific targeting enhancing elements can be predicted for particular sequences of interest and are far more diverse in amino acid composition and positioning than previously assumed. Machine learning methods become indispensable to predict which specific proteins, among numerous candidate proteins carrying the same non-canonical PTS1 tripeptide, contain sufficient enhancer elements in terms of number, positioning and total strength to cause peroxisome targeting.
doi:10.1186/1471-2229-12-142
PMCID: PMC3487989  PMID: 22882975
24.  De-Novo Discovery of Differentially Abundant Transcription Factor Binding Sites Including Their Positional Preference 
PLoS Computational Biology  2011;7(2):e1001070.
Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet. Here, we present a de-novo motif discovery tool called Dispom for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Evaluating Dispom, we find that its prediction performance is superior to existing tools for de-novo motif discovery for 18 benchmark data sets with planted binding sites, and for a metazoan compendium based on experimental data from micro-array, ChIP-chip, ChIP-DSL, and DamID as well as Gene Ontology data. Finally, we apply Dispom to find binding sites differentially abundant in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as a refined auxin responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find in genome-wide predictions that the refined motif is more specific for auxin-responsive genes than the canonical auxin-responsive element. In general, Dispom can be used to find differentially abundant motifs in sequences of any origin. However, the positional distribution learned by Dispom is especially beneficial if all sequences are aligned to some anchor point like the transcription start site in case of promoter sequences. We demonstrate that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. 
Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application at http://www.jstacs.de/index.php/Dispom.
Author Summary
Binding of transcription factors to promoters of genes, and subsequent enhancement or repression of transcription, is one of the main steps of transcriptional gene regulation. Direct or indirect wet-lab experiments allow the identification of approximate regions potentially bound or regulated by a transcription factor. Subsequently, de-novo motif discovery tools can be used for detecting the precise positions of binding sites. Many traditional tools focus on motifs over-represented in the target regions, which often turn out to be similarly over-represented in the entire genome. In contrast, several recent tools focus on differentially abundant motifs in target regions compared to a control set. As binding sites are often located at some preferred distance to the transcription start site, it is favorable to include this information into de-novo motif discovery. Here, we present Dispom, a novel approach for learning differentially abundant motifs and their positional preferences simultaneously, which predicts binding sites with increased accuracy compared to many popular de-novo motif discovery tools. When applying Dispom to promoters of auxin-responsive genes of Arabidopsis thaliana, we find a binding motif slightly different from the canonical auxin-response element, which exhibits a strong positional preference and which is considerably more specific to auxin-responsive genes.
doi:10.1371/journal.pcbi.1001070
PMCID: PMC3037384  PMID: 21347314
25.  Automated identification of pathways from quantitative genetic interaction data 
We present a novel Bayesian learning method that reconstructs large detailed gene networks from quantitative genetic interaction (GI) data. The method uses global reasoning to handle missing and ambiguous measurements, and provide confidence estimates for each prediction. Applied to a recent data set over genes relevant to protein folding, the learned networks reflect known biological pathways, including details such as pathway ordering and directionality of relationships. The reconstructed networks also suggest novel relationships, including the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated.
Recent developments have enabled large-scale quantitative measurement of genetic interactions (GIs) that report on the extent to which the activity of one gene is dependent on a second. It has long been recognized (Avery and Wasserman, 1992; Hartman et al, 2001; Segre et al, 2004; Tong et al, 2004; Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Costanzo et al, 2010) that functional dependencies revealed by GI data can provide rich information regarding underlying biological pathways. Further, the precise phenotypic measurements provided by quantitative GI data can provide evidence for even more detailed aspects of pathway structure, such as differentiating between full and partial dependence between two genes (Drees et al, 2005; Schuldiner et al, 2005; St Onge et al, 2007; Jonikas et al, 2009) (Figure 1A). As GI data sets become available for a range of quantitative phenotypes and organisms, such patterns will allow researchers to elucidate pathways important to a diverse set of biological processes.
We present a new method that exploits the high-quality, quantitative nature of recent GI assays to automatically reconstruct detailed multi-gene pathway structures, including the organization of a large set of genes into coherent pathways, the connectivity and ordering within each pathway, and the directionality of each relationship. We introduce activity pathway networks (APNs), which represent functional dependencies among a set of genes in the form of a network. We present an automatic method to efficiently reconstruct APNs over large sets of genes based on quantitative GI measurements. This method handles uncertainty in the data arising from noise, missing measurements, and data points with ambiguous interpretations, by performing global reasoning that combines evidence from multiple data points. In addition, because some structure choices remain uncertain even when jointly considering all measurements, our method maintains multiple likely networks, and allows computation of confidence estimates over each structure choice.
We applied our APN reconstruction method to the recent high-quality GI data set of Jonikas et al (2009), which examined the functional interaction between genes that contribute to protein folding in the ER. Specifically, Jonikas et al first used the cell's endogenous sensor (the unfolded protein response) to identify several hundred yeast genes with functions in endoplasmic reticulum folding, and then systematically characterized their functional interdependencies by measuring unfolded protein response levels in double mutants. Our analysis produced an ensemble of 500 likelihood-weighted APNs over 178 genes (Figure 2).
We performed an aggregate evaluation of our results by comparing to known biological relationships between gene pairs, including participation in pathways according to the Kyoto Encyclopedia of Genes and Genomes (KEGG), correlation of chemical genomic profiles in a recent high-throughput assay (Hillenmeyer et al, 2008) and similarity of Gene Ontology (GO) annotations. In each evaluation performed, our reconstructed APNs were significantly more consistent with the known relationships than either the raw GI values or the Pearson correlation between profiles of GI values.
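For context, the profile-correlation baseline mentioned above can be sketched as follows: each gene has a profile of GI values against a common set of partner genes, and two genes are compared by the Pearson correlation of their profiles. This is an illustrative sketch only. The helper name `profile_correlation`, the use of `None` for missing measurements, and the toy profile values are assumptions, not data from the study.

```python
import math


def profile_correlation(profile_a, profile_b):
    """Pearson correlation of two GI profiles over shared partner genes.

    Positions where either measurement is missing (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two shared measurements")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical GI profiles against the same four partner genes.
gene1 = [-2.0, 0.5, None, 1.0]   # one missing measurement
gene2 = [-4.0, 1.0, 0.3, 2.0]    # proportional to gene1 where both are measured
gene3 = [2.0, -0.5, 0.0, -1.0]   # opposite sign pattern to gene1

r_similar = profile_correlation(gene1, gene2)
r_opposite = profile_correlation(gene1, gene3)
print(r_similar, r_opposite)
```

A correlation of this kind yields a single similarity score per gene pair. It carries no information about direction or ordering, which is the extra structure the APNs are meant to capture.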
Importantly, our approach provides not only an improved means for defining pairs or groups of related genes, but also enables the identification of detailed multi-gene network structures. In many cases, our method successfully reconstructed known cellular pathways, including the ER-associated degradation (ERAD) pathway, and the biosynthesis of N-linked glycans, ranking them among the highest confidence structures. In-depth examination of the learned network structures indicates agreement with many known details of these pathways. In addition, quantitative analysis indicates that our learned APNs are indicative of ordering within KEGG-annotated biological pathways.
Our results also suggest several novel relationships, including placement of uncharacterized genes into pathways, and novel relationships between characterized genes. These include the dependence of the J domain chaperone JEM1 on the PDI homolog MPD1, dependence of the Ubiquitin-recycling enzyme DOA4 on N-linked glycosylation, and the dependence of the E3 Ubiquitin ligase DOA10 on the signal peptidase complex subunit SPC2. Our APNs also place the poorly characterized TPR-containing protein SGT2 upstream of the tail-anchored protein biogenesis machinery components GET3, GET4, and MDY2 (also known as GET5), suggesting that SGT2 has a function in the insertion of tail-anchored proteins into membranes. Consistent with this prediction, our experimental analysis shows that in sgt2Δ cells the tail-anchored protein GFP-Sed5 is mislocalized, shifting from punctate Golgi structures to a more diffuse pattern, as seen for other genes involved in this pathway.
Our results show that multi-gene, detailed pathway networks can be reconstructed from quantitative GI data, providing a concrete computational manifestation of the intuitions that have traditionally accompanied the manual interpretation of such data. Ongoing technological developments in both genetics and imaging are enabling the measurement of GI data at a genome-wide scale, using high-accuracy quantitative phenotypes that relate to a range of particular biological functions. Methods based on RNAi will soon allow collection of similar data for human cell lines and other mammalian systems (Moffat et al, 2006). Thus, computational methods for analyzing GI data could have an important function in mapping pathways involved in complex biological systems including human cells.
High-throughput quantitative genetic interaction (GI) measurements provide detailed information regarding the structure of the underlying biological pathways by reporting on functional dependencies between genes. However, the analytical tools for fully exploiting such information lag behind the ability to collect these data. We present a novel Bayesian learning method that uses quantitative phenotypes of double knockout organisms to automatically reconstruct detailed pathway structures. We applied our method to a recent data set that measures GIs for endoplasmic reticulum (ER) genes, using the unfolded protein response as a quantitative phenotype. The results provided reconstructions of known functional pathways including N-linked glycosylation and ER-associated protein degradation. They also included novel relationships, such as the placement of SGT2 in the tail-anchored biogenesis pathway, a finding that we experimentally validated. Our approach should be readily applicable to the next generation of quantitative GI data sets, as assays become available for additional phenotypes and eventually higher-level organisms.
doi:10.1038/msb.2010.27
PMCID: PMC2913392  PMID: 20531408
computational biology; genetic interaction; pathway reconstruction; probabilistic methods
