Barnyardgrass (Echinochloa crus-galli) is an important weed that is a menace to rice cultivation and production. Rapid evolution of herbicide resistance in this weed makes it one of the most difficult to manage using herbicides. Since genome-wide sequence data for barnyardgrass is limited, we sequenced the transcriptomes of susceptible and resistant barnyardgrass biotypes using the 454 GS-FLX platform.
454 pyrosequencing generated 371,281 raw reads with an average length of 341.8 bp, which made a total length of 126.89 Mb (SRX160526). De novo assembly produced 10,142 contigs (∼5.92 Mb) with an average length of 583 bp and 68,940 singletons (∼22.13 Mb) with an average length of 321 bp. About 244,653 GO term assignments to the biological process, cellular component and molecular function categories were obtained. A total of 6,092 contigs and singletons with 2,515 enzyme commission numbers were assigned to 151 predicted KEGG metabolic pathways. Digital abundance analysis using Illumina sequencing identified 78,124 transcripts among susceptible, resistant, herbicide-treated susceptible and herbicide-treated resistant barnyardgrass biotypes. From these analyses, eight herbicide target-site gene groups and four non-target-site gene groups were identified in the resistant biotype. These could be potential candidate genes involved in the herbicide resistance of barnyardgrass and could be used for further functional genomics research. C4 photosynthesis genes including RbcS, RbcL, NADP-me and MDH with complete CDS were identified using PCR and RACE technology.
This is the first large-scale transcriptome sequencing of E. crus-galli performed using the 454 GS-FLX platform. Potential candidate genes involved in the evolution of herbicide resistance were identified from the assembled sequences. This transcriptome data may serve as a reference for further gene expression and functional genomics studies, and will facilitate the study of herbicide resistance at the molecular level in this species as well as other weeds.
A new category of microRNA (miRNA) species called mirtrons, which originate from spliced introns of gene transcripts, has recently been discovered in animals. However, only one putative mirtron, osa-MIR1429, has been identified in rice (Oryza sativa). We employed a high-throughput sequencing (HTS) data- and structure-based approach to perform a genome-wide search for mirtron candidates in both Arabidopsis (Arabidopsis thaliana) and rice. Five and eighteen candidates were discovered in Arabidopsis and rice, respectively. To investigate their biological roles, the targets of these mirtrons were predicted and validated based on degradome sequencing data. The results indicate that the mirtrons could guide target cleavage to exert their regulatory roles post-transcriptionally, although this needs further experimental validation.
Rice (Oryza sativa) is an excellent model monocot with a known genome sequence for studying embryogenesis. Here we report the transcriptome profiling analysis of rice developing embryos using RNA-Seq as an attempt to gain insight into the molecular and cellular events associated with rice embryogenesis. RNA-Seq analysis generated 17,755,890 sequence reads aligned with 27,190 genes, which provided abundant data for the analysis of rice embryogenesis. A total of 23,971, 23,732, and 23,592 genes were identified from embryos at three developmental stages (3–5, 7, and 14 DAP), while an analysis between stages allowed the identification of a subset of stage-specific genes. The number of genes expressed stage-specifically was 1,131, 1,443, and 1,223, respectively. In addition, we investigated transcriptomic changes during rice embryogenesis based on our RNA-Seq data. A total of 1,011 differentially expressed genes (DEGs) (log2Ratio ≥1, FDR ≤0.001) were identified; thus, the transcriptome of the developing rice embryos changed considerably. A total of 672 genes with significant changes in expression were detected between 3–5 and 7 DAP; 504 DEGs were identified between 7 and 14 DAP. A large number of genes related to metabolism, transcriptional regulation, nucleic acid replication/processing, and signal transduction were expressed predominantly in the early and middle stages of embryogenesis. Protein biosynthesis-related genes accumulated predominantly in embryos at the middle stage. Genes for starch/sucrose metabolism and protein modification were highly expressed in the middle and late stages of embryogenesis. In addition, we found that many transcription factor families may play important roles at different developmental stages, not only in embryo initiation but also in other developmental processes. 
These results will expand our understanding of the complex molecular and cellular events in rice embryogenesis and provide a foundation for future studies on embryo development in rice and other cereal crops.
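The DEG criterion quoted above (log2Ratio ≥1, FDR ≤0.001) can be illustrated with a minimal sketch. The function names, the pseudocount, and the reading of the threshold as an absolute log2 ratio are assumptions made for illustration; the abstract does not describe how the ratio or the FDR was computed, so the FDR is taken here as a precomputed input.

```python
import math

def log2_ratio(expr_a, expr_b, pseudocount=1e-3):
    """log2 fold change of stage B over stage A.

    The pseudocount (an assumption, not from the source) avoids division by zero.
    """
    return math.log2((expr_b + pseudocount) / (expr_a + pseudocount))

def select_degs(records, min_abs_log2=1.0, max_fdr=0.001):
    """Keep genes passing both thresholds.

    records: iterable of (gene_id, expr_a, expr_b, fdr) tuples, where fdr is
    a precomputed false discovery rate for the comparison.
    Returns a list of (gene_id, log2_ratio) for the selected genes.
    """
    degs = []
    for gene_id, expr_a, expr_b, fdr in records:
        ratio = log2_ratio(expr_a, expr_b)
        # Assumption: the threshold applies to up- and down-regulation alike.
        if abs(ratio) >= min_abs_log2 and fdr <= max_fdr:
            degs.append((gene_id, ratio))
    return degs

# Hypothetical expression values for four genes at two stages.
example = [
    ("gene1", 10.0, 40.0, 0.0001),  # 4-fold up, significant
    ("gene2", 10.0, 12.0, 0.0001),  # change too small
    ("gene3", 40.0, 10.0, 0.0001),  # 4-fold down, significant
    ("gene4", 10.0, 40.0, 0.01),    # FDR too high
]
selected = [gene for gene, _ in select_degs(example)]
```

With these hypothetical values only gene1 and gene3 pass both filters.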
The Distributed Annotation System (DAS) offers a standard protocol for sharing and integrating annotations on biological sequences. There are more than 1000 DAS sources available and the number is steadily increasing. Clients are an essential part of the DAS system and integrate data from several independent sources in order to create a useful representation to the user. While web-based DAS clients exist, most of them do not have direct interaction capabilities such as dragging and zooming with the mouse.
Here we present GenExp, a web based and fully interactive visual DAS client. GenExp is a genome oriented DAS client capable of creating informative representations of genomic data zooming out from base level to complete chromosomes. It proposes a novel approach to genomic data rendering and uses the latest HTML5 web technologies to create the data representation inside the client browser. Thanks to client-side rendering most position changes do not need a network request to the server and so responses to zooming and panning are almost immediate. In GenExp it is possible to explore the genome intuitively moving it with the mouse just like geographical map applications. Additionally, in GenExp it is possible to have more than one data viewer at the same time and to save the current state of the application to revisit it later on.
GenExp is a new interactive web-based client for DAS that addresses some of the shortcomings of existing clients. It uses client-side data rendering techniques, resulting in easier genome browsing and exploration. GenExp is open source under the GPL license and is freely available at http://gralggen.lsi.upc.edu/recerca/genexp.
Salt stress is one of the major abiotic stresses in agriculture worldwide. Analysis of natural genetic variation in Arabidopsis is an effective approach to characterize candidate salt responsive genes. Differences in salt tolerance of three Arabidopsis ecotypes were compared in this study based on their responses to salt treatments at two developmental stages: seed germination and later growth. The Sha ecotype had higher germination rates, longer roots and less accumulation of superoxide radical and hydrogen peroxide than the Ler and Col ecotypes after short term salt treatment. With long term salt treatment, Sha exhibited higher survival rates and lower electrolyte leakage. Transcriptome analysis revealed that many genes involved in cell wall, photosynthesis, and redox were mainly down-regulated by salinity effects, while transposable element genes, microRNA and biotic stress related genes were significantly changed in comparisons of Sha vs. Ler and Sha vs. Col. Several pathways involved in tricarboxylic acid cycle, hormone metabolism and development, and the Gene Ontology terms involved in response to stress and defense response were enriched after salt treatment, and between Sha and the other two ecotypes. Collectively, these results suggest that the Sha ecotype is preconditioned to withstand abiotic stress. Further studies of detailed gene function are needed. These comparative transcriptomic and analytical results also provide insight into the complexity of salt stress tolerance mechanisms.
We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), or the detection of pathogens in sequenced data. In contrast to previous analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found it to either outperform or have similar performance to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state of the art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed in source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions http://github.com/CampagneLaboratory/gobyweb2-plugins.
Protein-protein interaction analysis is one of the crucial ways to decipher the functions of proteins and to understand their role in complex pathways at the cellular level. The protein-protein interaction network in many crop plants remains poorly defined, owing largely to the high costs, the requirement for state-of-the-art laboratories, and the time- and labour-intensive techniques involved. Here, we employed computational docking using the ZDOCK and RDOCK programmes to identify the interaction network between members of Oryza sativa mitogen activated protein kinase kinase (MAPKK) and mitogen activated protein kinase (MAPK). The 3-dimensional (3-D) structures of five MAPKKs and eleven MAPKs were determined by homology modelling and were further used as input for docking studies. Based on the results obtained from the ZDOCK and RDOCK programmes, the top six possible interacting MAPK proteins were predicted for each MAPKK. In order to assess the reliability of the computational prediction, yeast two-hybrid (Y2H) analyses were performed using rice MAPKKs and MAPKs. A direct comparison of Y2H assay and computational prediction of protein interaction was made. With the exception of one, all the MAPKK-MAPK pairs identified by Y2H screens were among the top predictions by computational docking. Although not all the predicted interacting partners showed interaction in Y2H, the agreement between the two approaches suggests that the computational predictions in the present work are reliable. Moreover, the present Y2H analyses per se provide an interaction network among MAPKKs and MAPKs, which should shed more light on the MAPK signalling network in rice.
Intrinsically Disordered Proteins/Regions (IDPs/IDRs) are currently recognized as a widespread phenomenon having key cellular functions. Still, many aspects of the function of these proteins remain to be unveiled. The conformational flexibility of IDPs allows them to recognize and interact with multiple partners, and gives them larger interaction surfaces that may increase interaction speed. For this reason, molecular interactions mediated by IDPs/IDRs are particularly abundant in certain types of protein interactions, such as those of signaling and cell cycle control. We present the first large-scale study of IDPs in Arabidopsis thaliana, the most widely used model organism in plant biology, in order to gain insight into the biological roles of these proteins in plants. The work includes a comparative analysis with the human proteome to highlight the differential use of disorder in the two species. Results show that while human proteins are in general more disordered, certain functional classes, mainly related to environmental response, are significantly more enriched in disorder in Arabidopsis. We propose that because plants cannot escape from environmental conditions as animals do, they use disorder as a simple and fast mechanism, independent of transcriptional control, for introducing versatility in the interaction networks underlying these biological processes so that they can quickly adapt and respond to challenging environmental conditions.
Knowing the map between transcription factors (TFs) and their binding sites is essential for reverse engineering the regulation process. Only about 10%–20% of the transcription factor binding motifs (TFBMs) have been reported. This lack of data hinders the understanding of gene regulation. To address this drawback, we propose a computational method that exploits previously unused TF properties to discover the missing TFBMs and their sites in all human gene promoters. The method starts by predicting a dictionary of regulatory “DNA words.” From this dictionary, it distills 4098 novel predictions. To disclose the crosstalk between motifs, an additional algorithm extracts TF combinatorial binding patterns, creating a collection of TF regulatory syntactic rules. Using these rules, we narrowed down a list of 504 novel motifs that appear frequently in syntax patterns. We tested the predictions against 509 known motifs, confirming that our system can reliably predict ab initio motifs with an accuracy of 81%, far higher than previous approaches. We found that on average, 90% of the discovered combinatorial binding patterns target at least 10 genes, suggesting that supplementary regulatory mechanisms are required to control smaller gene sets independently. Additionally, we discovered that the new TFBMs and their combinatorial patterns convey biological meaning, targeting TFs and genes related to developmental functions. Thus, among all the possible available targets in the genome, the TFs tend to regulate other TFs and genes involved in developmental functions. We provide a comprehensive resource for regulation analysis that includes a dictionary of “DNA words,” newly predicted motifs and their corresponding combinatorial patterns. Combinatorial patterns are a useful filter to discover TFBMs that play a major role in orchestrating other factors and thus are likely to lock/unlock cellular functional clusters.
Olive (Olea europaea L.) cultivation is rapidly expanding and low quality saline water is often used for irrigation. The molecular basis of salt tolerance in olive, though, has not yet been investigated at a system level. In this study a comparative transcriptomics approach was used as a tool to unravel gene regulatory networks underlying salinity response in olive trees by simulating as much as possible olive growing conditions in the field. Specifically, we investigated the genotype-dependent differences in the transcriptome response of two olive cultivars, a salt-tolerant and a salt-sensitive one.
A 135-day long salinity experiment was conducted using one-year old trees exposed to NaCl stress for 90 days followed by 45 days of post-stress period during the summer. A cDNA library made of olive seedling mRNAs was sequenced and an olive microarray was constructed. Total RNA was extracted from root samples after 15, 45 and 90 days of NaCl-treatment as well as after 15 and 45 days of post-treatment period and used for microarray hybridizations. SAM analysis between the NaCl-stress and the post-stress time course resulted in the identification of 209 and 36 differentially expressed transcripts in the salt-tolerant and salt-sensitive cultivar, respectively. Hierarchical clustering revealed two major, distinct clusters for each cultivar. Despite the limited number of probe sets, transcriptional regulatory networks were constructed for both cultivars while several hierarchically-clustered interacting transcription factor regulators such as JERF and bZIP homologues were identified.
A systems biology approach was used and differentially expressed transcripts as well as regulatory interactions were identified. The comparison of the interactions among transcription factors in olive with those reported for Arabidopsis might indicate similarities in the response of a tree species with Arabidopsis at the transcriptional level under salinity stress.
High-throughput analysis of genome-wide random transposon mutant libraries is a powerful tool for (conditional) essential gene discovery. Recently, several next-generation sequencing approaches, e.g. Tn-seq/INseq, HITS and TraDIS, have been developed that accurately map the site of transposon insertions by mutant-specific amplification and sequence readout of DNA flanking the transposon insertions site, assigning a measure of essentiality based on the number of reads per insertion site flanking sequence or per gene. However, analysis of these large and complex datasets is hampered by the lack of an easy to use and automated tool for transposon insertion sequencing data. To fill this gap, we developed ESSENTIALS, an open source, web-based software tool for researchers in the genomics field utilizing transposon insertion sequencing analysis. It accurately predicts (conditionally) essential genes and offers the flexibility of using different sample normalization methods, genomic location bias correction, data preprocessing steps, appropriate statistical tests and various visualizations to examine the results, while requiring only a minimum of input and hands-on work from the researcher. We successfully applied ESSENTIALS to in-house and published Tn-seq, TraDIS and HITS datasets and we show that the various pre- and post-processing steps on the sequence reads and count data with ESSENTIALS considerably improve the sensitivity and specificity of predicted gene essentiality.
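The idea of scoring genes by the number of reads per insertion site or per gene can be sketched as a simple aggregation step. This is only an illustration under assumed inputs (a list of mapped insertion positions with read counts and a hypothetical gene coordinate table); it does not reproduce the normalization, bias correction, or statistical testing that ESSENTIALS performs.

```python
def reads_per_gene(insertions, genes):
    """Sum transposon-insertion read counts falling inside each gene.

    insertions: iterable of (position, read_count) pairs.
    genes: dict mapping gene name -> (start, end), inclusive coordinates.
    Both input layouts are assumptions for this sketch.
    """
    totals = {name: 0 for name in genes}
    for position, read_count in insertions:
        for name, (start, end) in genes.items():
            if start <= position <= end:
                totals[name] += read_count
    return totals

# Hypothetical data: geneB has no insertions, which in a real analysis
# would make it a candidate essential gene pending proper statistics.
genes = {"geneA": (100, 200), "geneB": (300, 400)}
insertions = [(150, 10), (180, 5), (500, 7)]
counts = reads_per_gene(insertions, genes)
```

A gene with few or no reads relative to expectation is only a candidate; the abstract stresses that normalization and appropriate statistical tests are needed before calling it essential.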
The Guilt-by-Association (GBA) principle, according to which genes with similar expression profiles are functionally associated, is widely applied for functional analyses using large heterogeneous collections of transcriptomics data. However, the use of such large collections could hamper GBA functional analysis for genes whose expression is condition specific. In these cases a smaller set of condition related experiments should instead be used, but identifying such functionally relevant experiments from large collections based on literature knowledge alone is an impractical task. We begin this paper by analyzing, both from a mathematical and a biological point of view, why only condition specific experiments should be used in GBA functional analysis. We are able to show that this phenomenon is independent of the functional categorization scheme and of the organisms being analyzed. We then present a semi-supervised algorithm that can select functionally relevant experiments from large collections of transcriptomics experiments. Our algorithm is able to select experiments relevant to a given GO term, MIPS FunCat term or even KEGG pathways. We extensively test our algorithm on large dataset collections for yeast and Arabidopsis. We demonstrate that: using the selected experiments there is a statistically significant improvement in correlation between genes in the functional category of interest; the selected experiments improve GBA-based gene function prediction; the effectiveness of the selected experiments increases with annotation specificity; our algorithm can be successfully applied to GBA-based pathway reconstruction. Importantly, the set of experiments selected by the algorithm reflects the existing literature knowledge about the experiments. [A MATLAB implementation of the algorithm and all the data used in this paper can be downloaded from the paper website: http://www.paccanarolab.org/papers/CorrGene/].
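The claim that correlation within a functional category can improve when only condition-relevant experiments are used can be illustrated with a toy calculation. The sketch below is not the semi-supervised algorithm described above; it only compares the mean pairwise Pearson correlation of a gene set over all experiments with that over a chosen subset. The profiles and the experiment indices are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: correlation undefined, report 0
    return cov / math.sqrt(var_x * var_y)

def mean_pairwise_correlation(profiles, experiments):
    """Average correlation over all gene pairs, restricted to `experiments`.

    profiles: dict gene -> list of expression values (one per experiment).
    experiments: list of experiment indices to include.
    """
    values = []
    for g1, g2 in combinations(profiles, 2):
        x = [profiles[g1][e] for e in experiments]
        y = [profiles[g2][e] for e in experiments]
        values.append(pearson(x, y))
    return mean(values)

# Hypothetical profiles: the two genes co-vary only in experiments 0-2.
profiles = {
    "geneA": [1, 2, 3, 5, 1, 4],
    "geneB": [2, 4, 6, 1, 5, 2],
}
corr_all = mean_pairwise_correlation(profiles, list(range(6)))
corr_subset = mean_pairwise_correlation(profiles, [0, 1, 2])
```

In this toy case the correlation over the subset is higher than over the full collection, which is the effect the selection of experiments is meant to exploit.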
Systems Biology is a field in biological science that focuses on the combination of several or all “omics” approaches in order to find out how genes, transcripts, proteins and metabolites act together in the network of life. Metabolomics, as an analog to genomics, transcriptomics and proteomics, is increasingly integrated into biological studies, and transcriptomic and metabolomic experiments are often combined in one setup. At first glance the two data types seem to be completely different, but both produce information on biological entities, either transcripts or metabolites. Both types can be overlaid on metabolic pathways to obtain biological information on the studied system. The MassTRIX webserver was updated to support the joint analysis of both data types. MassTRIX is freely available at www.masstrix.org.
Switchgrass (Panicum virgatum L.) is a C4 perennial grass that is widely popular as a bioenergy crop. To accelerate the development of high yielding switchgrass cultivars adapted to diverse environmental niches, the generation of genomic resources for this plant is necessary. The large genome size and polyploid nature of switchgrass make whole genome sequencing a daunting task even with current technologies. Exploring the transcriptional landscape using next generation sequencing technologies provides a viable alternative to whole genome sequencing in switchgrass.
Switchgrass cDNA libraries from germinating seedlings, emerging tillers, flowers, and dormant seeds were sequenced using Roche 454 GS-FLX Titanium technology, generating 980,000 reads with an average read length of 367 bp. De novo assembly generated 243,600 contigs with an average length of 535 bp. Using the foxtail millet genome as a reference greatly improved the assembly and annotation of switchgrass ESTs. Comparative analysis of the 454-derived switchgrass EST reads with other sequenced monocots including Brachypodium, sorghum, rice and maize indicated a 70–80% overlap. RPKM analysis demonstrated unique transcriptional signatures of the four tissues analyzed in this study. More than 24,000 ESTs were identified in the dormant seed library. In silico analysis indicated that there are more than 2000 EST-SSRs in this collection. Expression of several orphan ESTs was confirmed by RT-PCR.
We estimate that about 90% of the switchgrass gene space has been covered in this analysis. This study nearly doubles the amount of EST information for switchgrass currently in the public domain. The celerity and economical nature of second-generation sequencing technologies provide an in-depth view of the gene space of complex genomes like switchgrass. Sequence analysis of closely related members of the NAD+-malic enzyme type C4 grasses such as the model system Setaria viridis can serve as a viable proxy for the switchgrass genome.
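The RPKM measure mentioned above (reads per kilobase of transcript per million mapped reads) normalizes a raw read count by transcript length and sequencing depth. A minimal sketch follows; the function name and the example numbers are illustrative and are not taken from the study.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 1e9 * C / (N * L), where C is the number of reads mapped to the
    transcript, N the total number of mapped reads in the library, and L the
    transcript length in base pairs.
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return 1e9 * read_count / (total_mapped_reads * transcript_length_bp)

# Hypothetical example: 500 reads on a 2,000 bp transcript
# in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
```

Because the count is scaled by both length and library size, values can be compared across transcripts and across the four tissue libraries.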
Increasing use of high throughput genomic scale assays requires effective visualization and analysis techniques to facilitate data interpretation. Moreover, existing tools often require programming skills, which discourages bench scientists from examining their own data. We have created iCanPlot, a compelling platform for visual data exploration based on the latest technologies. Using the recently adopted HTML5 Canvas element, we have developed a highly interactive tool to visualize tabular data and identify interesting patterns in an intuitive fashion without the need of any specialized computing skills. A module for geneset overlap analysis has been implemented on the Google App Engine platform: when the user selects a region of interest in the plot, the genes in the region are analyzed on the fly. The visualization and analysis are amalgamated for a seamless experience. Further, users can easily upload their data for analysis—which also makes it simple to share the analysis with collaborators. We illustrate the power of iCanPlot by showing an example of how it can be used to interpret histone modifications in the context of gene expression.
Homeodomain-leucine zipper (HD-ZIP) proteins are plant-specific transcriptional factors known to play crucial roles in plant development. Although sequence phylogeny analysis of Populus HD-ZIPs was carried out in a previous study, no systematic analysis incorporating genome organization, gene structure, and expression compendium has been conducted in model tree species Populus thus far.
In this study, a comprehensive analysis of Populus HD-ZIP gene family was performed. Sixty-three full-length HD-ZIP genes were found in Populus genome. These Populus HD-ZIP genes were phylogenetically clustered into four distinct subfamilies (HD-ZIP I–IV) and predominately distributed across 17 linkage groups (LG). Fifty genes from 25 Populus paralogous pairs were located in the duplicated blocks of Populus genome and then preferentially retained during the sequential evolutionary courses. Genomic organization analyses indicated that purifying selection has played a pivotal role in the retention and maintenance of Populus HD-ZIP gene family. Microarray analysis has shown that 21 Populus paralogous pairs have been differentially expressed across different tissues and under various stresses, with five paralogous pairs showing nearly identical expression patterns, 13 paralogous pairs being partially redundant and three paralogous pairs diversifying significantly. Quantitative real-time RT-PCR (qRT-PCR) analysis performed on 16 selected Populus HD-ZIP genes in different tissues and under both drought and salinity stresses confirms their tissue-specific and stress-inducible expression patterns.
Genomic organization analyses indicated that segmental duplications contributed significantly to the expansion of the Populus HD-ZIP gene family. Exon/intron organization and conserved motif composition of Populus HD-ZIPs are highly conserved within each subfamily, suggesting that members of the same subfamily may also have conserved functions. Microarray and qRT-PCR analyses showed that 89% (56 out of 63) of Populus HD-ZIPs were duplicate genes that might have been retained by substantial subfunctionalization. Taken together, these observations may lay the foundation for future functional analysis of Populus HD-ZIP genes to unravel their biological roles.
Assessing the genetic diversity and population structure of a core collection helps in making use of the germplasm and in applying it to association mapping. The objectives of this study were to (1) examine the population structure of a rice core collection; (2) investigate the genetic diversity within and among subgroups of the rice core collection; and (3) identify the extent of linkage disequilibrium (LD) in the rice core collection. A rice core collection consisting of 150 varieties, established from 2260 varieties of Ting's collection of rice germplasm, was genotyped with 274 SSR markers and used in this study. Two distinct subgroups (i.e. SG 1 and SG 2) were detected within the entire population by different statistical methods, in accordance with the differentiation of indica and japonica rice. MCLUST analysis might be an alternative method to STRUCTURE for population structure analysis. A subset of 26% of the total markers could detect the population structure with precision similar to that of the whole SSR marker set. Gene diversity and MRD between the two subspecies varied considerably across the genome, which might be used to identify candidate genes for the traits under domestication and artificial selection of indica and japonica rice. The percentage of SSR loci pairs in significant (P<0.05) LD is 46.8% in the entire population and the ratio of linked to unlinked loci pairs in LD is 1.06. Across the entire population as well as the subgroups and sub-subgroups, LD decays with genetic distance, indicating that linkage is one main cause of LD. The results of this study provide valuable information for future association mapping using the rice core collection.
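The abstract does not state how gene diversity was computed. A common definition for a marker locus is Nei's gene diversity, H = 1 − Σ p_i², where p_i is the frequency of allele i. The sketch below assumes that definition and uses hypothetical allele calls at a single SSR locus.

```python
from collections import Counter

def gene_diversity(alleles):
    """Nei's gene diversity (expected heterozygosity) at one locus.

    alleles: list of allele calls observed at the locus, one entry per
    sampled gene copy. Returns 1 - sum(p_i ** 2) over allele frequencies.
    Assumption: this is the measure meant by "gene diversity" in the text.
    """
    total = len(alleles)
    if total == 0:
        raise ValueError("no alleles observed")
    counts = Counter(alleles)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Hypothetical SSR allele sizes (bp) at one locus.
balanced = gene_diversity([152, 152, 156, 156])     # two alleles at 50% each
monomorphic = gene_diversity([152, 152, 152, 152])  # a single allele
```

Computing this value locus by locus within each subgroup gives the kind of genome-wide diversity profile that the abstract compares between the two subspecies.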
Combinations of ‘omics’ investigations (i.e., transcriptomic, proteomic, metabolomic and/or fluxomic) are increasingly applied to gain a comprehensive understanding of biological systems. Because the latter are organized as complex networks of molecular and functional interactions, the intuitive interpretation of multi-omics datasets is difficult. Here we describe a simple strategy to visualize and analyze multi-omics data. Graphical representations of complex biological networks can be generated using Cytoscape, where all molecular and functional components can be explicitly represented using a set of dedicated symbols. This representation can be used i) to compile all biologically-relevant information regarding the network through web link association, and ii) to map the network components with multi-omics data. A Cytoscape plugin was developed to increase the possibilities of both multi-omic data representation and interpretation. This plugin allows different adjustable colour scales to be applied to the various omics data and performs the automatic extraction and visualization of the most significant changes in the datasets. For illustration purposes, the approach was applied to the central carbon metabolism of Escherichia coli. The obtained network contained 774 components and 1232 interactions, highlighting the complexity of bacterial multi-level regulations. The structured representation of this network represents a valuable resource for systemic studies of E. coli, as illustrated by the application to multi-omics data. Some current issues in network representation are discussed on the basis of this work.
The pooid subfamily of grasses includes some of the most important crop, forage and turf species, such as wheat, barley and Lolium. Developing genomic resources, such as whole-genome physical maps, for analysing the large and complex genomes of these crops and for facilitating biological research in grasses is an important goal in plant biology. We describe a bacterial artificial chromosome (BAC)-based physical map of the wild pooid grass Brachypodium distachyon and integrate this with whole genome shotgun sequence (WGS) assemblies using BAC end sequences (BES). The resulting physical map contains 26 contigs spanning the 272 Mb genome. BES from the physical map were also used to integrate a genetic map. This provides an independent validation and confirmation of the published WGS assembly. Mapped BACs were used in Fluorescence In Situ Hybridisation (FISH) experiments to align the integrated physical map and sequence assemblies to chromosomes with high resolution. The physical, genetic and cytogenetic maps, integrated with whole genome shotgun sequence assemblies, enhance the accuracy and durability of this important genome sequence and will directly facilitate gene isolation.
Predicting gene functions by integrating large-scale biological data remains a challenge for systems biology. Here we present a resource for Drosophila melanogaster gene function predictions. We trained function-specific classifiers to optimize the influence of different biological datasets for each functional category. Our model predicted GO terms and KEGG pathway memberships for Drosophila melanogaster genes with high accuracy, as affirmed by cross-validation, supporting literature evidence, and large-scale RNAi screens. The resulting resource of prioritized associations between Drosophila genes and their potential functions offers a guide for experimental investigations.
Genomic data release for the grapevine has increased exponentially in the last five years. The Vitis vinifera genome has been sequenced and Vitis EST, transcriptomic, proteomic, and metabolomic tools and data sets continue to be developed. The next critical challenge is to provide biological meaning to this tremendous amount of data by annotating genes and integrating them within their biological context. We have developed and validated a system of Grapevine Molecular Networks (VitisNet).
The sequences from the Vitis vinifera (cv. Pinot Noir PN40024) genome sequencing project and ESTs from the Vitis genus have been paired and the 39,424 resulting unique sequences have been manually annotated. Among these, 13,145 genes have been assigned to 219 networks. The pathway sets include 88 “Metabolic”, 15 “Genetic Information Processing”, 12 “Environmental Information Processing”, 3 “Cellular Processes”, 21 “Transport”, and 80 “Transcription Factors”. The quantitative data is loaded onto molecular networks, allowing the simultaneous visualization of changes in the transcriptome, proteome, and metabolome for a given experiment.
VitisNet uses manually annotated networks in SBML or XML format, enabling the integration of large datasets, streamlining biological functional processing, and improving the understanding of dynamic processes in systems biology experiments. VitisNet is grounded in the Vitis vinifera genome (currently at 8x coverage) and can be readily updated with subsequent updates of the genome or biochemical discoveries. The molecular network files can be dynamically searched by pathway name or individual genes, proteins, or metabolites through the MetNet Pathway database and web-portal at http://metnet3.vrac.iastate.edu/. All VitisNet files including the manual annotation of the grape genome encompassing pathway names, individual genes, their genome identifier, and chromosome location can be accessed and downloaded from the VitisNet tab at http://vitis-dormancy.sdstate.org.
Recognizing regulatory sequences in genomes is a continuing challenge, despite a wealth of available genomic data and a growing number of experimentally validated examples.
We discuss here a simple approach to search for regulatory sequences based on the compositional similarity of genomic regions and known cis-regulatory sequences. This method, which is not limited to searching for predefined motifs, recovers sequences known to be under similar regulatory control. The words shared by the recovered sequences often correspond to known binding sites. Furthermore, we show that although local word profile clustering is predictive for the regulatory sequences involved in blastoderm segmentation, local dissimilarity is a more universal feature of known regulatory sequences in Drosophila.
Our method leverages sequence motifs within a known regulatory sequence to identify co-regulated sequences without explicitly defining binding sites. We also show that regulatory sequences can be distinguished from surrounding sequences by local sequence dissimilarity, a novel feature in identifying regulatory sequences across a genome. Source code for WPH-finder is available for download at http://rana.lbl.gov/downloads/wph.tar.gz.
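The idea of comparing genomic regions to a known regulatory sequence by word composition can be illustrated with a minimal sketch. This is not the WPH-finder implementation; the word length `k`, the cosine similarity measure, and the function names `word_profile`, `cosine_similarity` and `scan` are assumptions chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def word_profile(seq, k=3):
    """Count overlapping words (k-mers) of length k in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(p, q):
    """Cosine similarity between two word-count profiles (0.0 if either is empty)."""
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def scan(genome, reference, window, step, k=3):
    """Score each window of `genome` against the word profile of `reference`.

    Returns a list of (start_position, similarity) pairs.
    """
    ref_profile = word_profile(reference, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        win_profile = word_profile(genome[start:start + window], k)
        scores.append((start, cosine_similarity(ref_profile, win_profile)))
    return scores
```

Windows that score highly against the reference would be candidates for similar regulatory control. The same profiles could in principle be compared between adjacent windows to look for local dissimilarity, although the measure used in the original work may differ.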
Genetic analyses of plant symbiotic mutants have led to the identification of key genes involved in Rhizobium-legume communication as well as in the development and function of nitrogen-fixing root nodules. However, the impact of these genes in coordinating the transcriptional programs of nodule development has been examined only in a limited number of isolated studies. Here, we present an integrated genome-wide analysis of transcriptome landscapes in Lotus japonicus wild-type and symbiotic mutant plants. Encompassing five different organs, five stages of the sequentially developed determinate Lotus root nodules, and eight mutants impaired at different stages of the symbiotic interaction, our data set integrates an unprecedented combination of organ- or tissue-specific profiles with mutant transcript profiles. In total, 38 different conditions sampled under the same well-defined growth regimes were included. This comprehensive analysis unravelled new and unexpected patterns of transcriptional regulation during symbiosis and organ development. Contrary to expectations, none of the previously characterized nodulins were among the 37 genes specifically expressed in nodules. Another surprise was the extensive transcriptional response in whole root compared to the susceptible root zone where the cellular response is most pronounced. A large number of transcripts predicted to encode transcriptional regulators, receptors and proteins involved in signal transduction, as well as many genes with unknown function, were found to be regulated during nodule organogenesis and rhizobial infection. Combining wild type and mutant profiles of these transcripts demonstrates the activation of a complex genetic program that delineates symbiotic nitrogen fixation.
The complete data set was organized into an indexed expression directory that is accessible from a resource database, and here we present selected examples of biological questions that can be addressed with this comprehensive and powerful gene expression data set.
Analysis of gene expression data using genome-wide microarrays is a technique often used in genomic studies to find coexpression patterns and locate groups of co-transcribed genes. However, most studies done at global “omic” scale are not focused on human samples, and those that are very often include heterogeneous datasets that mix normal with disease-altered samples. Moreover, the technical noise present in genome-wide expression microarrays is another well-reported problem that is often not addressed with robust statistical methods, and estimates of the error in the data are not provided.
Human genome-wide expression data from a controlled set of normal-healthy tissues is used to build a confident human gene coexpression network avoiding both pathological and technical noise. To achieve this we describe a new method that combines several statistical and computational strategies: robust normalization and expression signal calculation; correlation coefficients obtained by parametric and non-parametric methods; random cross-validations; and estimation of the statistical accuracy and coverage of the data. All these methods provide a series of coexpression datasets where the level of error is measured and can be tuned. To define the errors, the rates of true positives are calculated by assignment to biological pathways. The results provide a confident human gene coexpression network that includes 3327 gene-nodes and 15841 coexpression-links, and a comparative analysis shows good improvement over previously published datasets. Further functional analysis of a subset core network, validated by two independent methods, shows coherent biological modules that share common transcription factors. The network reveals a map of coexpression clusters organized in well-defined functional constellations. Two major regions in this network correspond to genes involved in nuclear and mitochondrial metabolism, and investigations on their functional assignment indicate that more than 60% are house-keeping and essential genes. The network displays new, previously undescribed gene associations and allows some unknown, unassigned genes to be placed in a functional context based on their interactions with known gene families.
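The combination of parametric and non-parametric correlation coefficients described above can be sketched as follows. This is only an illustrative outline, not the published pipeline: the choice of Pearson and Spearman coefficients, the rule that both must pass a single `threshold`, the default value 0.8, and the function names are all assumptions, and the normalization, cross-validation and error-estimation steps are omitted.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson (parametric) correlation; returns 0.0 for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman (non-parametric) correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_links(expr, threshold=0.8):
    """Return (gene1, gene2, r, rho) for pairs where both coefficients pass.

    `expr` maps a gene name to its expression values across samples.
    Requiring both coefficients to reach `threshold` is an assumed rule.
    """
    links = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        rho = spearman(expr[g1], expr[g2])
        if min(r, rho) >= threshold:
            links.append((g1, g2, r, rho))
    return links
```

Raising or lowering `threshold` trades coverage against accuracy, which mirrors in a simplified way the tunable error level mentioned in the text.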
The identification of stable and reliable human gene-to-gene coexpression networks is essential to unravel the interactions and functional correlations between human genes at an omic scale. This work contributes to this aim, and we make the validated human gene coexpression networks available to the scientific community to allow further analyses of the network or of specific gene associations.
The data are available free online at http://bioinfow.dep.usal.es/coexpression/.
A model-driven discovery process, Computing Life, is used to identify an ensemble of genetic networks that describe the biological clock. A clock mechanism involving the genes white-collar-1 and white-collar-2 (wc-1 and wc-2) that encode a transcriptional activator (as well as a blue-light receptor) and an oscillator frequency (frq) that encodes a cyclin that deactivates the activator is used to guide this discovery process through three cycles of microarray experiments. Central to this discovery process is a new methodology for the rational design of a Maximally Informative Next Experiment (MINE), based on the genetic network ensemble. In each experimentation cycle, the MINE approach is used to select the most informative new experiment in order to mine for clock-controlled genes, the outputs of the clock. As much as 25% of the N. crassa transcriptome appears to be under clock control. Clock outputs include genes with products in DNA metabolism, ribosome biogenesis in RNA metabolism, cell cycle, protein metabolism, transport, carbon metabolism, isoprenoid (including carotenoid) biosynthesis, development, and varied signaling processes. Genes under the control of the transcription factor complex WCC (= WC-1/WC-2) were resolved into four classes: circadian only (612 genes), light-responsive only (396), both circadian and light-responsive (328), and neither circadian nor light-responsive (987). In each of the three cycles of microarray experiments, the data support that wc-1 and wc-2 are auto-regulated by WCC. Among 11,000 N. crassa genes, a total of 295 genes, including a large fraction of phosphatases/kinases, appear to be under the immediate control of the FRQ oscillator, as validated by 4 independent microarray experiments. Ribosomal RNA processing and assembly, rather than its transcription, appears to be under clock control, suggesting a new mechanism for the post-transcriptional control of clock-controlled genes.