1.  Finding gene regulatory network candidates using the gene expression knowledge base 
BMC Bioinformatics  2014;15(1):386.
Network-based approaches for the analysis of large-scale genomics data have become well established. Biological networks provide a knowledge scaffold against which the patterns and dynamics of ‘omics’ data can be interpreted. The background information required for the construction of such networks is often dispersed across a multitude of knowledge bases in a variety of formats. The seamless integration of this information is one of the main challenges in bioinformatics. The Semantic Web offers powerful technologies for the assembly of integrated knowledge bases that are computationally comprehensible, thereby providing a potentially powerful resource for constructing biological networks and network-based analysis.
We have developed the Gene eXpression Knowledge Base (GeXKB), a semantic web technology based resource that contains integrated knowledge about gene expression regulation. To affirm the utility of GeXKB we demonstrate how this resource can be exploited for the identification of candidate regulatory network proteins. We present four use cases that were designed from a biological perspective in order to find candidate members relevant for the gastrin hormone signaling network model. We show how a combination of specific query definitions and additional selection criteria derived from gene expression data and prior knowledge concerning candidate proteins can be used to retrieve a set of proteins that constitute valid candidates for regulatory network extensions.
Semantic web technologies provide the means for processing and integrating various heterogeneous information sources. The GeXKB offers biologists such an integrated knowledge resource, allowing them to address complex biological questions pertaining to gene expression. This work illustrates how GeXKB can be used in combination with gene expression results and literature information to identify new potential candidates that may be considered for extending a gene regulatory network.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0386-y) contains supplementary material, which is available to authorized users.
PMCID: PMC4279962  PMID: 25490885
Knowledge management; Knowledge representation; Semantic Systems Biology; Semantic Web; RDF; SPARQL; Network extension; Gene expression; Transcription regulation; Protein-protein interaction; Transcription factor; Target gene interaction; Hypothesis assessment; Gastrin biology
2.  Gene Ontology annotation of sequence-specific DNA binding transcription factors: setting the stage for a large-scale curation effort 
Transcription factors control which information in a genome becomes transcribed to produce RNAs that function in the biological systems of cells and organisms. Reliable and comprehensive information about transcription factors is invaluable for large-scale network-based studies. However, existing transcription factor knowledge bases are still lacking in well-documented functional information.
Here, we provide guidelines for a curation strategy, which constitutes a robust framework for using the controlled vocabularies defined by the Gene Ontology Consortium to annotate specific DNA binding transcription factors (DbTFs) based on experimental evidence reported in the literature. Our standardized protocol and workflow for annotating specific DNA binding RNA polymerase II transcription factors are designed to document high-quality and decisive evidence from valid experimental methods. Within a collaborative biocuration effort involving the user community, we are now in the process of exhaustively annotating the full repertoire of human, mouse and rat proteins that qualify as DbTFs inasmuch as they are experimentally documented in the biomedical literature today. The completion of this task will significantly enrich Gene Ontology-based information resources for the research community.
PMCID: PMC3753819  PMID: 23981286
3.  Integration of biological networks and gene expression data using Cytoscape 
Nature protocols  2007;2(10):2366-2382.
Cytoscape is a free software package for visualizing, modeling and analyzing molecular and genetic interaction networks. This protocol explains how to use Cytoscape to analyze the results of mRNA expression profiling, and other functional genomics and proteomics experiments, in the context of an interaction network obtained for genes of interest. Five major steps are described: (i) obtaining a gene or protein network, (ii) displaying the network using layout algorithms, (iii) integrating with gene expression and other functional attributes, (iv) identifying putative complexes and functional modules and (v) identifying enriched Gene Ontology annotations in the network. These steps provide a broad sample of the types of analyses performed by Cytoscape.
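Step (v), identifying enriched Gene Ontology annotations, can be illustrated with a small calculation. The sketch below assumes an upper-tail hypergeometric test, which is a common choice for enrichment analysis but is not stated in the abstract; the function name and the numbers are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, sample, hits):
    """Probability of drawing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the GO term (upper-tail hypergeometric test)."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical toy numbers: 10 genes in total, 5 carry the term,
# and all 5 genes in the network module carry it.
p = enrichment_p_value(population=10, annotated=5, sample=5, hits=5)
print(p)  # equals 1/252
```

In practice such p-values would also need correction for multiple testing across GO terms, which this sketch leaves out.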
PMCID: PMC3685583  PMID: 17947979
4.  Jointly creating digital abstracts: dealing with synonymy and polysemy 
BMC Research Notes  2012;5:601.
Ideally each Life Science article should get a ‘structured digital abstract’. This is a structured summary of the paper’s findings that is both human-verified and machine-readable. But articles can contain a large variety of information types and contextual details that all need to be reconciled with appropriate names, terms and identifiers, which poses a challenge to any curator. Current approaches mostly use tagging or limited entry-forms for semantic encoding.
We implemented a ‘controlled language’ as a more expressive representation method. We studied how usable this format was for wet-lab biologists who volunteered as curators. We assessed some issues that arise with the usability of ontologies and other controlled vocabularies for the encoding of structured information by ‘untrained’ curators. We take a user-oriented viewpoint, and make recommendations that may prove useful for creating a better curation environment: one that can engage a large community of volunteer curators.
Entering information in a biocuration environment could improve in expressiveness and user-friendliness if curators were enabled to use synonymous and polysemous terms literally, whereby each term stays linked to an identifier.
PMCID: PMC3532140  PMID: 23110757
Structured digital abstract; Biocuration; Community curation; Ontology; Controlled vocabulary
5.  OLSVis: an animated, interactive visual browser for bio-ontologies 
BMC Bioinformatics  2012;13:116.
More than one million terms from biomedical ontologies and controlled vocabularies are available through the Ontology Lookup Service (OLS). Although OLS provides ample possibility for querying and browsing terms, the visualization of parts of the ontology graphs is rather limited and inflexible.
We created the OLSVis web application, a visualiser for browsing all ontologies available in the OLS database. OLSVis shows customisable subgraphs of the OLS ontologies. Subgraphs are animated via a real-time force-based layout algorithm which is fully interactive: each time the user makes a change, e.g. browsing to a new term, hiding, adding, or dragging terms, the algorithm performs smooth and only essential reorganisations of the graph. This assures an optimal viewing experience, because subsequent screen layouts are not grossly altered, and users can easily navigate through the graph.
The OLSVis web application provides a user-friendly tool to visualise ontologies from the OLS repository. It broadens the possibilities to investigate and select ontology subgraphs through a smooth visualisation method.
PMCID: PMC3394205  PMID: 22646023
Bio-ontologies; Visualisation; Browsing; Web application
6.  Gauging triple stores with actual biological data 
BMC Bioinformatics  2012;13(Suppl 1):S3.
Semantic Web technologies have been developed to overcome the limitations of the current Web and conventional data integration solutions. The Semantic Web is expected to link all the data present on the Internet instead of linking just documents. One of the foundations of the Semantic Web technologies is the knowledge representation language Resource Description Framework (RDF). Knowledge expressed in RDF is typically stored in so-called triple stores (also known as RDF stores), from which it can be retrieved with SPARQL, a language designed for querying RDF-based models. The Semantic Web technologies should allow federated queries over multiple triple stores. In this paper we compare the efficiency of a set of biologically relevant queries as applied to a number of different triple store implementations.
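The idea of storing knowledge as RDF triples and retrieving it by pattern matching can be sketched in a few lines. The example below is not a SPARQL engine and does not use any of the triple stores compared in the paper; it is a minimal pure-Python illustration, and all identifiers (the `ex:` names, `match`, `query`) are hypothetical.

```python
# Hypothetical triples (subject, predicate, object); the identifiers are made up.
TRIPLES = [
    ("ex:proteinA", "ex:participates_in", "ex:cell_cycle"),
    ("ex:proteinB", "ex:participates_in", "ex:cell_cycle"),
    ("ex:proteinA", "ex:interacts_with", "ex:proteinB"),
    ("ex:proteinC", "ex:participates_in", "ex:apoptosis"),
]

def match(triples, pattern):
    """Yield variable bindings for one triple pattern.
    Terms starting with '?' are variables; everything else must match exactly."""
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice in one pattern must bind the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings

def query(triples, patterns):
    """Join several triple patterns by substituting earlier bindings
    into each later pattern before matching it."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            bound = tuple(solution.get(term, term) for term in pattern)
            for bindings in match(triples, bound):
                next_solutions.append({**solution, **bindings})
        solutions = next_solutions
    return solutions

# Roughly analogous to the SPARQL pattern:
#   SELECT ?p ?q WHERE { ?p ex:participates_in ex:cell_cycle .
#                        ?p ex:interacts_with ?q . }
result = query(TRIPLES, [
    ("?p", "ex:participates_in", "ex:cell_cycle"),
    ("?p", "ex:interacts_with", "?q"),
])
print(result)
```

A real triple store adds indexing, query planning and inference on top of this basic matching step, which is where the performance differences between implementations arise.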
Previously we developed a library of queries to guide the use of our knowledge base Cell Cycle Ontology implemented as a triple store. We have now compared the performance of these queries on five non-commercial triple stores: OpenLink Virtuoso (Open-Source Edition), Jena SDB, Jena TDB, SwiftOWLIM and 4Store. We examined three performance aspects: the data uploading time, the query execution time and the scalability. The queries we had chosen addressed diverse ontological or biological questions, and we found that individual store performance was quite query-specific. We identified three groups of queries displaying similar behaviour across the different stores: 1) relatively short response time queries, 2) moderate response time queries and 3) relatively long response time queries. SwiftOWLIM proved to be a winner in the first group, 4Store in the second one and Virtuoso in the third one.
Our analysis showed that some queries behaved idiosyncratically, in a triple store specific manner, mainly with SwiftOWLIM and 4Store. Virtuoso, as expected, displayed a very balanced performance - its load time and its response time for all the tested queries were better than average among the selected stores; it showed a very good scalability and a reasonable run-to-run reproducibility. Jena SDB and Jena TDB were consistently slower than the other three implementations. Our analysis demonstrated that most queries developed for Virtuoso could be successfully used for other implementations.
PMCID: PMC3471352  PMID: 22373359
7.  Specific Impact of Tobamovirus Infection on the Arabidopsis Small RNA Profile 
PLoS ONE  2011;6(5):e19549.
Tobamoviruses encode a silencing suppressor that binds small RNA (sRNA) duplexes in vitro and supposedly in vivo to counteract antiviral silencing. Here, we used sRNA deep-sequencing combined with transcriptome profiling to determine the global impact of tobamovirus infection on Arabidopsis sRNAs and their mRNA targets. We found that infection of Arabidopsis plants with Oilseed rape mosaic tobamovirus causes a global size-specific enrichment of miRNAs, ta-siRNAs, and other phased siRNAs. The observed patterns of sRNA enrichment suggest that in addition to a role of the viral silencing suppressor, the stabilization of sRNAs might also occur through association with unknown host effector complexes induced upon infection. Indeed, sRNA enrichment concerns primarily 21-nucleotide RNAs with a 5′-terminal guanine. Interestingly, ORMV infection also leads to accumulation of novel miRNA-like sRNAs from miRNA precursors. Thus, in addition to canonical miRNAs and miRNA*s, miRNA precursors can encode additional sRNAs that may be functional under specific conditions like pathogen infection. Virus-induced sRNA enrichment does not correlate with defects in miRNA-dependent ta-siRNA biogenesis nor with global changes in the levels of mRNA and ta-siRNA targets suggesting that the enriched sRNAs may not be able to significantly contribute to the normal activity of pre-loaded RISC complexes. We conclude that tobamovirus infection induces the stabilization of a specific sRNA pool by yet unknown effector complexes. These complexes may sequester viral and host sRNAs to engage them in yet unknown mechanisms involved in plant:virus interactions.
PMCID: PMC3091872  PMID: 21572953
8.  ONTO-ToolKit: enabling bio-ontology engineering via Galaxy 
BMC Bioinformatics  2010;11(Suppl 12):S8.
The biosciences increasingly face the challenge of integrating a wide variety of available data, information and knowledge in order to gain an understanding of biological systems. Data integration is supported by a diverse series of tools, but the lack of a consistent terminology to label these data still presents significant hurdles. As a consequence, much of the available biological data remains disconnected or worse: becomes misconnected. The need to address this terminology problem has spawned the building of a large number of bio-ontologies. OBOF, RDF and OWL are among the most used ontology formats to capture terms and relationships in the Life Sciences, opening the potential to use the Semantic Web to support data integration and further exploitation of integrated resources via automated retrieval and reasoning procedures.
We extended the Perl suite ONTO-PERL and functionally integrated it into the Galaxy platform. The resulting ONTO-ToolKit supports the analysis and handling of OBO-formatted ontologies via the Galaxy interface, and we demonstrated its functionality in different use cases that illustrate the flexibility to obtain sets of ontology terms that match specific search criteria.
ONTO-ToolKit is available as a tool suite for Galaxy. Galaxy not only provides a user friendly interface allowing the interested biologist to manipulate OBO ontologies, it also opens up the possibility to perform further biological (and ontological) analyses by using other tools available within the Galaxy environment. Moreover, it provides tools to translate OBO-formatted ontologies into Semantic Web formats such as RDF and OWL.
ONTO-ToolKit reaches out to researchers in the biosciences, by providing a user-friendly way to analyse and manipulate ontologies. This type of functionality will become increasingly important given the wealth of information that is becoming available based on ontologies.
PMCID: PMC3040534  PMID: 21210987
9.  Targeted interactomics reveals a complex core cell cycle machinery in Arabidopsis thaliana 
A protein interactome focused towards cell proliferation was mapped comprising 857 interactions among 393 proteins, leading to many new insights in plant cell cycle regulation. A comprehensive view on heterodimeric cyclin-dependent kinase (CDK)/cyclin complexes in plants is obtained, in relation with their regulators. Over 100 new candidate cell cycle proteins were predicted.
The basic underlying mechanisms that govern the cell cycle are conserved among all eukaryotes. Peculiar for plants, however, is that their genome contains a collection of cell cycle regulatory genes that is intriguingly large (Vandepoele et al, 2002; Menges et al, 2005) compared to other eukaryotes. Arabidopsis thaliana (Arabidopsis) encodes 71 genes in five regulatory classes versus only 15 in yeast and 23 in human.
Despite the discovery of numerous cell cycle genes, little is known about the protein complex machinery that steers plant cell division. Therefore, we applied a tandem affinity purification (TAP) approach coupled with mass spectrometry (MS) on Arabidopsis cell suspension cultures to isolate and analyze protein complexes involved in the cell cycle. This approach allowed us to successfully map a first draft of the basic cell cycle complex machinery of Arabidopsis, providing many new insights into plant cell division.
To map the interactome, we relied on a streamlined platform comprising generic Gateway-based vectors with high cloning flexibility, the fast generation of transgenic suspension cultures, TAP adapted for plant cells, and matrix-assisted laser desorption ionization (MALDI) tandem-MS for the identification of purified proteins (Van Leene et al, 2007, 2008). Complexes for 102 cell cycle proteins were analyzed using this approach, leading to a non-redundant data set of 857 interactions among 393 proteins (Figure 1A). Two subspaces were identified in this data set: domain I1, containing interactions confirmed in at least two independent experimental repeats or in the reciprocal purification experiment, and domain I2, consisting of uniquely observed interactions.
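The split into domains I1 and I2 follows a simple rule that can be written down directly. The sketch below is an illustration under assumptions, not the authors' pipeline: the input format (bait, prey, repeat identifier), the function name and the protein labels are all hypothetical.

```python
from collections import defaultdict

def split_domains(observations):
    """Split observed interactions into two domains.

    observations: iterable of (bait, prey, repeat_id) tuples.
    I1: seen in at least two independent repeats, or confirmed by the
        reciprocal purification (the prey used as bait recovers the bait).
    I2: seen only once and without reciprocal confirmation.
    Returns (I1, I2) as sets of unordered protein pairs."""
    repeats = defaultdict(set)  # directed (bait, prey) -> set of repeat ids
    for bait, prey, repeat_id in observations:
        repeats[(bait, prey)].add(repeat_id)

    i1, i2 = set(), set()
    for (bait, prey), ids in repeats.items():
        pair = frozenset((bait, prey))
        reciprocal = (prey, bait) in repeats
        if len(ids) >= 2 or reciprocal:
            i1.add(pair)
        else:
            i2.add(pair)
    return i1, i2

# Hypothetical observations with placeholder protein labels.
observations = [
    ("P1", "P2", 1), ("P1", "P2", 2),   # two independent repeats
    ("P3", "P4", 1), ("P4", "P3", 1),   # reciprocal purification
    ("P1", "P5", 1),                    # observed once only
]
domain_i1, domain_i2 = split_domains(observations)
print(sorted(sorted(pair) for pair in domain_i1))
print(sorted(sorted(pair) for pair in domain_i2))
```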
Several observations underlined the quality of both domains. All tested reverse purifications found the original interaction, and 150 known or predicted interactions were confirmed, meaning that a large number of new interactions were also revealed. An in-depth computational analysis revealed enrichment for many cell cycle-related features among the proteins of the network (Figure 1B), and many protein pairs were coregulated at the transcriptional level (Figure 1C). Through integration of known cell cycle-related features, more than 100 new candidate cell cycle proteins were predicted (Figure 1D). Besides common qualities of both interactome domains, their real significance appeared through mutual differences exposing two subspaces in the cell cycle interactome: a central regulatory network of stable complexes that are repeatedly isolated and represent core regulatory units, and a peripheral network comprising transient interactions identified less frequently, which are involved in other aspects of the process, such as crosstalk between core complexes or connections with other pathways. To evaluate the biological relevance of the cell cycle interactome in plants, we validated interactions from both domains by a transient split-luciferase assay in Arabidopsis plants (Marion et al, 2008), further sustaining the hypothesis-generating power of the data set to understand plant growth.
With respect to insights into the cell cycle physiology, the interactome was subdivided according to the functional classes of the baits and core protein complexes were extracted, covering cyclin-dependent kinase (CDK)/cyclin core complexes together with their positive and negative regulation networks, DNA replication complexes, the anaphase-promoting complex, and spindle checkpoint complexes. The data imply that mitotic A- and B-type cyclins exclusively form heterodimeric complexes with the plant-specific B-type CDKs and not with CDKA;1, whereas D-type cyclins seem to associate with CDKA;1. Besides the extraction of complexes previously shown in other organisms, our data also suggested many new functional links; for example, the link coupling cell division with the regulation of transcript splicing. The association of negative regulators of CDK/cyclin complexes with transcription factors suggests that their role in reallocation is not solely targeted to CDK/cyclin complexes. New members of the Siamese-related inhibitory proteins were identified, and for the first time potential inhibitors of plant-specific mitotic B-type CDKs have been found in plants. New evidence that the E2F–DP–RBR network is not only active at G1-to-S, but also at the G2-to-M transition is provided and many complexes involved in DNA replication or repair were isolated. For the first time, a plant APC has been isolated biochemically, identifying three potential new plant-specific APC interactors, and finally, complexes involved in the spindle checkpoint were isolated mapping many new but specific interactions.
Finally, to get a general view on the complex machinery, modules of interacting cyclins and core cell cycle regulators were ranked along the cell cycle phases according to the transcript expression peak of the cyclins, showing an assorted set of CDK–cyclin complexes with high regulatory differentiation (Figure 4). Even within the same subfamily (e.g. cyclin A3, B1, B2, D3, and D4), cyclins differ not only in their functional time frame but also in the type and number of CDKs, inhibitors, and scaffolding proteins they bind, further indicating their functional diversification. According to our interaction data, at least 92 different variants of CDK–cyclin complexes are found in Arabidopsis.
In conclusion, these results reflect how several rounds of gene duplication (Sterck et al, 2007) led to the evolution of a large set of cyclin paralogs and a myriad of regulators, resulting in a significant jump in the complexity of the cell cycle machinery that could accommodate unique plant-specific features such as an indeterminate mode of postembryonic development. Through their extensive regulation and connection with a myriad of up- and downstream pathways, the core cell cycle complexes might offer the plant a flexible toolkit to fine-tune cell proliferation in response to an ever-changing environment.
Cell proliferation is the main driving force for plant growth. Although genome sequence analysis revealed a high number of cell cycle genes in plants, little is known about the molecular complexes steering cell division. In a targeted proteomics approach, we mapped the core complex machinery at the heart of the Arabidopsis thaliana cell cycle control. Besides a central regulatory network of core complexes, we distinguished a peripheral network that links the core machinery to up- and downstream pathways. Over 100 new candidate cell cycle proteins were predicted and an in-depth biological interpretation demonstrated the hypothesis-generating power of the interaction data. The data set provided a comprehensive view on heterodimeric cyclin-dependent kinase (CDK)–cyclin complexes in plants. For the first time, inhibitory proteins of plant-specific B-type CDKs were discovered and the anaphase-promoting complex was characterized and extended. Important conclusions were that mitotic A- and B-type cyclins form complexes with the plant-specific B-type CDKs and not with CDKA;1, and that D-type cyclins and S-phase-specific A-type cyclins seem to be associated exclusively with CDKA;1. Furthermore, we could show that plants have evolved a combinatorial toolkit consisting of at least 92 different CDK–cyclin complex variants, which strongly underscores the functional diversification among the large family of cyclins and reflects the pivotal role of cell cycle regulation in the developmental plasticity of plants.
PMCID: PMC2950081  PMID: 20706207
Arabidopsis thaliana; cell cycle; interactome; protein complex; protein interactions
10.  Flexible network reconstruction from relational databases with Cytoscape and CytoSQL 
BMC Bioinformatics  2010;11:360.
Molecular interaction networks can be efficiently studied using network visualization software such as Cytoscape. The relevant nodes, edges and their attributes can be imported in Cytoscape in various file formats, or directly from external databases through specialized third party plugins. However, molecular data are often stored in relational databases with their own specific structure, for which dedicated plugins do not exist. Therefore, a more generic solution is presented.
A new Cytoscape plugin 'CytoSQL' was developed to connect Cytoscape to any relational database. It allows users to launch SQL ('Structured Query Language') queries from within Cytoscape, with the option to inject node or edge features of an existing network as SQL arguments, and to convert the retrieved data to Cytoscape network components. Supported by a set of case studies, we demonstrate the flexibility and the power of the CytoSQL plugin in converting specific data subsets into meaningful network representations.
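The general pattern described here, injecting node features of an existing network as SQL arguments and turning the returned rows into nodes and edges, can be sketched outside Cytoscape. The example below uses Python's standard `sqlite3` module with an in-memory database; it does not use the CytoSQL plugin or the Cytoscape API, and the table layout, column names and `expand_network` function are hypothetical.

```python
import sqlite3

# Hypothetical relational table of scored interactions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interaction (source TEXT, target TEXT, score REAL);
    INSERT INTO interaction VALUES
        ('geneA', 'geneB', 0.9),
        ('geneA', 'geneC', 0.4),
        ('geneB', 'geneD', 0.8);
""")

def expand_network(conn, seed_nodes, min_score):
    """Use the names of existing nodes as SQL arguments and convert
    the retrieved rows into a node set and an edge list."""
    placeholders = ",".join("?" for _ in seed_nodes)
    sql = (
        "SELECT source, target, score FROM interaction "
        f"WHERE source IN ({placeholders}) AND score >= ? "
        "ORDER BY source, target"
    )
    # Values are passed as bound parameters, never pasted into the SQL text.
    rows = conn.execute(sql, [*seed_nodes, min_score]).fetchall()
    nodes = set(seed_nodes)
    edges = []
    for source, target, score in rows:
        nodes.update((source, target))
        edges.append({"source": source, "target": target, "score": score})
    return nodes, edges

nodes, edges = expand_network(conn, ["geneA"], 0.5)
print(sorted(nodes))
print(edges)
```

Calling `expand_network` again with the newly found nodes as seeds would grow the network one neighbourhood at a time, which mirrors the iterative enrichment the abstract describes.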
CytoSQL offers a unified approach to let Cytoscape interact with relational databases. Thanks to the power of the SQL syntax, this tool can rapidly generate and enrich networks according to very complex criteria. The plugin is available at
PMCID: PMC2910028  PMID: 20594316
11.  DASS-GUI: a user interface for identification and analysis of significant patterns in non-sequential data 
Bioinformatics  2010;26(7):987-989.
Summary: Many large ‘omics’ datasets have been published and many more are expected in the near future. New analysis methods are needed for best exploitation. We have developed a graphical user interface (GUI) for easy data analysis. Our discovery of all significant substructures (DASS) approach elucidates the underlying modularity, a typical feature of complex biological data. It is related to biclustering and other data mining approaches. Importantly, DASS-GUI also allows handling of multi-sets and calculation of statistical significances. DASS-GUI contains tools for further analysis of the identified patterns: analysis of the pattern hierarchy, enrichment analysis, module validation, analysis of additional numerical data, easy handling of synonymous names, clustering, filtering and merging. Different export options allow easy usage of additional tools such as Cytoscape.
Availability: Source code, pre-compiled binaries for different systems, a comprehensive tutorial, case studies and many additional datasets are freely available. DASS-GUI is implemented in Qt.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2844999  PMID: 20172945
12.  Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project 
Nature biotechnology  2008;26(8):889-896.
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
PMCID: PMC2771753  PMID: 18688244
13.  BioGateway: a semantic systems biology tool for the life sciences 
BMC Bioinformatics  2009;10(Suppl 10):S11.
Life scientists need help in coping with the plethora of fast growing and scattered knowledge resources. Ideally, this knowledge should be integrated in a form that allows them to pose complex questions that address the properties of biological systems, independently from the origin of the knowledge. Semantic Web technologies prove to be well suited for knowledge integration, knowledge production (hypothesis formulation), knowledge querying and knowledge maintenance.
We implemented a semantically integrated resource named BioGateway, comprising the entire set of the OBO foundry candidate ontologies, the GO annotation files, the SWISS-PROT protein set, the NCBI taxonomy and several in-house ontologies. BioGateway provides a single entry point to query these resources through SPARQL. It constitutes a key component for a Semantic Systems Biology approach to generate new hypotheses concerning systems properties. In the course of developing BioGateway, we faced challenges that are common to other projects that involve large datasets in diverse representations. We present a detailed analysis of the obstacles that had to be overcome in creating BioGateway. We demonstrate the potential of a comprehensive application of Semantic Web technologies to global biomedical data.
The time is ripe for launching a community effort aimed at a wider acceptance and application of Semantic Web technologies in the life sciences. We call for the creation of a forum that strives to implement a truly semantic life science foundation for Semantic Systems Biology.
Access to the system and supplementary information (such as a listing of the data sources in RDF, and sample queries) can be found at .
PMCID: PMC2755819  PMID: 19796395
14.  The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell cycle process 
Genome Biology  2009;10(5):R58.
A software resource for the analysis of cell cycle related molecular networks.
The Cell Cycle Ontology is an application ontology that automatically captures and integrates detailed knowledge on the cell cycle process. Cell Cycle Ontology is enabled by semantic web technologies, and is accessible via the web for browsing, visualizing, advanced querying, and computational reasoning. Cell Cycle Ontology facilitates a detailed analysis of cell cycle-related molecular network components. Through querying and automated reasoning, it may provide new hypotheses to help steer a systems biology approach to biological network building.
PMCID: PMC2718524  PMID: 19480664
15.  Ontology Design Patterns for bio-ontologies: a case study on the Cell Cycle Ontology 
BMC Bioinformatics  2008;9(Suppl 5):S1.
Bio-ontologies are key elements of knowledge management in bioinformatics. Rich and rigorous bio-ontologies should represent biological knowledge with high fidelity and robustness. The richness in bio-ontologies is a prior condition for diverse and efficient reasoning, and hence querying and hypothesis validation. Rigour allows a more consistent maintenance. Modelling such bio-ontologies is, however, a difficult task for bio-ontologists, because the necessary richness and rigour is difficult to achieve without extensive training.
Analogous to design patterns in software engineering, Ontology Design Patterns are solutions to typical modelling problems that bio-ontologists can use when building bio-ontologies. They offer a means of creating rich and rigorous bio-ontologies with reduced effort. The concept of Ontology Design Patterns is described and documentation and application methodologies for Ontology Design Patterns are presented. Some real-world use cases of Ontology Design Patterns are provided and tested in the Cell Cycle Ontology. Ontology Design Patterns, including those tested in the Cell Cycle Ontology, can be explored in the Ontology Design Patterns public catalogue that has been created based on the documentation system presented.
Ontology Design Patterns provide a method for rich and rigorous modelling in bio-ontologies. They also offer advantages at different development levels (such as design, implementation and communication) enabling, if used, a more modular, well-founded and richer representation of the biological knowledge. This representation will produce a more efficient knowledge management in the long term.
PMCID: PMC2367624  PMID: 18460183
16.  Extracting expression modules from perturbational gene expression compendia 
BMC Systems Biology  2008;2:33.
Compendia of gene expression profiles under chemical and genetic perturbations constitute an invaluable resource from a systems biology perspective. However, the perturbational nature of such data imposes specific challenges on the computational methods used to analyze them. In particular, traditional clustering algorithms have difficulties in handling one of the prominent features of perturbational compendia, namely partial coexpression relationships between genes. Biclustering methods on the other hand are specifically designed to capture such partial coexpression patterns, but they show a variety of other drawbacks. For instance, some biclustering methods are less suited to identify overlapping biclusters, while others generate highly redundant biclusters. Also, none of the existing biclustering tools takes advantage of the staple of perturbational expression data analysis: the identification of differentially expressed genes.
We introduce a novel method, called ENIGMA, that addresses some of these issues. ENIGMA leverages differential expression analysis results to extract expression modules from perturbational gene expression data. The core parameters of the ENIGMA clustering procedure are automatically optimized to reduce the redundancy between modules. In contrast to the biclusters produced by most other methods, ENIGMA modules may show internal substructure, i.e. subsets of genes with distinct but significantly related expression patterns. The grouping of these (often functionally) related patterns in one module greatly aids in the biological interpretation of the data. We show that ENIGMA outperforms other methods on artificial datasets, using a quality criterion that, unlike other criteria, can be used for algorithms that generate overlapping clusters and that can be modified to take redundancy between clusters into account. Finally, we apply ENIGMA to the Rosetta compendium of expression profiles for Saccharomyces cerevisiae and we analyze one pheromone response-related module in more detail, demonstrating the potential of ENIGMA to generate detailed predictions.
It is increasingly recognized that perturbational expression compendia are essential to identify the gene networks underlying cellular function, and efforts to build these for different organisms are currently underway. We show that ENIGMA constitutes a valuable addition to the repertoire of methods to analyze such data.
PMCID: PMC2386865  PMID: 18402676
17.  CATMA, a comprehensive genome-scale resource for silencing and transcript profiling of Arabidopsis genes 
BMC Bioinformatics  2007;8:400.
The Complete Arabidopsis Transcript MicroArray (CATMA) initiative combines the efforts of laboratories in eight European countries [1] to deliver gene-specific sequence tags (GSTs) for the Arabidopsis research community. The CATMA initiative offers the power and flexibility to regularly update the GST collection according to evolving knowledge about the gene repertoire. These GST amplicons can easily be reamplified and shared, subsets can be picked at will to print dedicated arrays, and the GSTs can be cloned and used for other functional studies. This ongoing initiative has already produced approximately 24,000 GSTs that have been made publicly available for spotted microarray printing and RNA interference.
GSTs from the CATMA version 2 repertoire (CATMAv2, created in 2002) were mapped onto the gene models from two independent Arabidopsis nuclear genome annotation efforts, TIGR5 and PSB-EuGène, to consolidate a list of genes that were targeted by previously designed CATMA tags. A total of 9,027 gene models were not tagged by any amplified CATMAv2 GST, and 2,533 amplified GSTs were no longer predicted to tag an updated gene model. To validate the efficacy of GST mapping criteria and design rules, the predicted and experimentally observed hybridization characteristics associated with GST features were correlated in transcript profiling datasets obtained with the CATMAv2 microarray, confirming the reliability of this platform. To complete the CATMA repertoire, all 9,027 gene models for which no GST had yet been designed were processed with an adjusted version of the Specific Primer and Amplicon Design Software (SPADS). A total of 5,756 novel GSTs were designed and amplified by PCR from genomic DNA. Together with the pre-existing GST collection, this new addition constitutes the CATMAv3 repertoire. It comprises 30,343 unique amplified sequences that tag 24,202 and 23,009 protein-encoding nuclear gene models in the TAIR6 and EuGène genome annotations, respectively. To cover the remaining untagged genes, we identified 543 additional GSTs using less stringent design criteria and designed 990 sequence tags matching multiple members of gene families (Gene Family Tags or GFTs). These latter 1,533 features constitute the CATMAv4 addition.
To update the CATMA GST repertoire, we designed 7,289 additional sequence tags, bringing the total number of tagged TAIR6-annotated Arabidopsis nuclear protein-coding genes to 26,173. This resource is used both for the production of spotted microarrays and the large-scale cloning of hairpin RNA silencing vectors. All information about the resulting updated CATMA repertoire is available through the CATMA database
PMCID: PMC2147040  PMID: 17945016
18.  Validating module network learning algorithms using simulated data 
BMC Bioinformatics  2007;8(Suppl 2):S5.
In recent years, several authors have used probabilistic graphical models to learn expression modules and their regulatory programs from gene expression data. Despite the demonstrated success of such algorithms in uncovering biologically relevant regulatory relations, further developments in the area are hampered by a lack of tools to compare the performance of alternative module network learning strategies. Here, we demonstrate the use of the synthetic data generator SynTReN for the purpose of testing and comparing module network learning algorithms. We introduce a software package for learning module networks, called LeMoNe, which incorporates a novel strategy for learning regulatory programs. Novelties include the use of a bottom-up Bayesian hierarchical clustering to construct the regulatory programs, and the use of a conditional entropy measure to assign regulators to the regulation program nodes. Using SynTReN data, we test the performance of LeMoNe in a completely controlled situation and assess the effect of the methodological changes we made with respect to an existing software package, namely Genomica. Additionally, we assess the effect of various parameters, such as the size of the data set and the amount of noise, on the inference performance.
Overall, application of Genomica and LeMoNe to simulated data sets gave comparable results. However, LeMoNe offers some advantages, one of them being that the learning process is considerably faster for larger data sets. Additionally, we show that the location of the regulators in the LeMoNe regulation programs and their conditional entropy may be used to prioritize regulators for functional validation, and that the combination of the bottom-up clustering strategy with the conditional entropy-based assignment of regulators improves the handling of missing or hidden regulators.
We show that data simulators such as SynTReN are very well suited for the purpose of developing, testing and improving module network algorithms. We used SynTReN data to develop and test an alternative module network learning strategy, which is incorporated in the software package LeMoNe, and we provide evidence that this alternative strategy has several advantages with respect to existing methods.
PMCID: PMC1892074  PMID: 17493254
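The abstract above mentions using a conditional entropy measure to assign regulators to the nodes of a regulation program. The following is a minimal sketch of that general idea, not LeMoNe's actual implementation: it assumes each tree node splits the conditions into two groups, and it scores a candidate regulator by how little uncertainty about that grouping remains once the regulator's expression is thresholded. The function names, the binary thresholding, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(labels, split):
    """Return H(labels | split) in bits for two equal-length sequences."""
    n = len(labels)
    joint = Counter(zip(split, labels))   # counts of (split value, label) pairs
    marginal = Counter(split)             # counts of each split value
    h = 0.0
    for (s, _label), count in joint.items():
        # p(s, label) * log2 p(label | s)
        h -= (count / n) * math.log2(count / marginal[s])
    return h


def best_regulator(node_labels, regulator_expr):
    """Return (entropy, regulator, threshold) with the lowest conditional entropy.

    node_labels: group of each condition at a tree node (hypothetical "L"/"R").
    regulator_expr: dict mapping regulator name -> expression value per condition.
    """
    best = None
    for name, values in regulator_expr.items():
        # Try every observed value except the maximum as a split threshold.
        for threshold in sorted(set(values))[:-1]:
            split = [v > threshold for v in values]
            h = conditional_entropy(node_labels, split)
            if best is None or h < best[0]:
                best = (h, name, threshold)
    return best


# Toy data (invented for illustration): regA separates the two groups perfectly.
labels = ["L", "L", "L", "R", "R", "R"]
regulators = {
    "regA": [0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
    "regB": [0.5, 1.5, 0.4, 1.4, 0.6, 1.6],
}
entropy, name, threshold = best_regulator(labels, regulators)
print(name, threshold, entropy)
```

A lower conditional entropy means the thresholded regulator explains the node's split of conditions better, which is one way such a score could be used to rank candidate regulators.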
20.  Genome-wide screening for cis-regulatory variation using a classical diallel crossing scheme 
Nucleic Acids Research  2006;34(13):3677-3686.
Large-scale screening studies carried out to date for genetic variants that affect gene regulation are generally limited to descriptions of differences in allele-specific expression (ASE) detected in vivo. Allele-specific differences in gene expression provide evidence for a model whereby cis-acting genetic variation results in differential expression between alleles. Such gene surveys for regulatory variation are a first step in identifying the specific nucleotide changes that govern gene expression differences, but they leave the underlying mechanisms unexplored. Here, we propose a quantitative genetics approach to perform a genome-wide analysis of ASE differences (GASED). The GASED approach is based on a diallel design that is often used in plant breeding programs to estimate general combining abilities (GCA) of specific inbred lines and to identify high-yielding hybrid combinations of parents based on their specific combining abilities (SCAs). In a context of gene expression, the values of GCA and SCA parameters allow cis- and trans-regulatory changes to be distinguished and imbalances in gene expression to be ascribed to cis-regulatory variation. With this approach, a total of 715 genes could be identified that are likely to carry allelic polymorphisms responsible for at least a 1.5-fold allelic expression difference in a total of 10 diploid Arabidopsis thaliana hybrids. The major strength of the GASED approach, compared to other ASE detection methods, is that it is not restricted to genes with allelic transcript variants. Although a false-positive rate of 9/41 was observed, the GASED approach is a valuable pre-screening method that can accelerate systematic surveys of naturally occurring cis-regulatory variation among inbred lines for laboratory species, such as Arabidopsis, mouse, rat and fruitfly, and economically important crop species, such as corn.
PMCID: PMC1540733  PMID: 16885241
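The GCA and SCA parameters named in the abstract above come from classical diallel analysis. As a rough illustration only, the sketch below estimates them with the standard least-squares formulas for a half diallel without parents or reciprocals (Griffing's method 4). Whether the GASED study uses exactly this design and estimator is an assumption here, and the function name, parent labels, and toy values are hypothetical.

```python
from itertools import combinations


def combining_abilities(crosses):
    """Estimate the mean, GCA per parent and SCA per cross.

    crosses: dict mapping (parent_a, parent_b) -> mean value of that hybrid.
    Assumes a complete half diallel: every unordered pair appears exactly once.
    """
    parents = sorted({parent for pair in crosses for parent in pair})
    p = len(parents)
    if p < 3 or len(crosses) != p * (p - 1) // 2:
        raise ValueError("need a complete half diallel with at least 3 parents")

    total = sum(crosses.values())
    mu = 2 * total / (p * (p - 1))  # overall mean of all crosses

    # Sum of all crosses in which each parent takes part.
    parent_totals = {
        a: sum(value for pair, value in crosses.items() if a in pair)
        for a in parents
    }
    # General combining ability: average contribution of a parent.
    gca = {a: (p * parent_totals[a] - 2 * total) / (p * (p - 2)) for a in parents}
    # Specific combining ability: what the additive model leaves unexplained.
    sca = {
        pair: value - mu - gca[pair[0]] - gca[pair[1]]
        for pair, value in crosses.items()
    }
    return mu, gca, sca


# Toy, purely additive data (invented): value = 10 + g_a + g_b.
true_g = {"A": 1.0, "B": -1.0, "C": 0.5, "D": -0.5}
toy = {(a, b): 10 + true_g[a] + true_g[b] for a, b in combinations(sorted(true_g), 2)}
mu, gca, sca = combining_abilities(toy)
print(mu, gca)
```

On purely additive toy data like this, the SCA estimates are zero and the GCA estimates recover the simulated parental effects; nonzero SCA values would indicate cross-specific deviations from the additive model.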
