Sorghum (Sorghum bicolor) is one of the most important cereal crops globally and a potential energy plant for biofuel production. Exploring genetic gain for important quantitative traits such as drought and heat tolerance, grain yield, stem sugar accumulation and biomass production through molecular breeding and genomic selection requires knowledge of the available genetic variation and the underlying sequence polymorphisms.
Based on the assembled and annotated genome sequences of Sorghum bicolor (v2.1) and the recently published sorghum re-sequencing data, ~62.9 M SNPs were identified among 48 sorghum accessions and included in a newly developed sorghum genome SNP database, SorGSD (http://sorgsd.big.ac.cn). The diverse panel of 48 sorghum lines can be classified into four groups: improved varieties, landraces, wild and weedy sorghums, and a wild relative, Sorghum propinquum. SorGSD has a web-based query interface to search or browse SNPs from individual accessions, or to compare SNPs among several lines. The query results can be displayed as text in tables or rendered as graphics in a genome browser. Query results include useful annotation such as the type of SNP (synonymous or non-synonymous; start, stop or splice variants), chromosome locations, and links to the annotation in the Phytozome (www.phytozome.net) sorghum genome database. In addition, general information related to sorghum research, such as online sorghum resources and literature references, can also be found on the website. All the SNP data and annotations can be freely downloaded from the website.
SorGSD is a comprehensive web-portal providing a database of large-scale genome variation across all racial types of cultivated sorghum and wild relatives. It can serve as a bioinformatics platform for a range of genomics and molecular breeding activities for sorghum and for other C4 grasses.
Sorghum; Bio-energy plant; Genome variation; SNPs; Database curation
Long-established protein-coding genes may lose their coding potential during evolution (“unitary gene loss”). Members of the Poaceae family are a major food source and represent an ideal model clade for plant evolution research. However, the global pattern of unitary gene loss in Poaceae genomes and the evolutionary fate of the lost genes remain poorly investigated and largely elusive.
Using a locally developed pipeline, we identified 129 unitary gene loss events for long-established protein-coding genes from four representative species of Poaceae, i.e. brachypodium, rice, sorghum and maize. Functional annotation suggested that the genes lost in all or most Poaceae species are enriched for genes involved in development and response to endogenous stimulus. We also found that 44 mutated genomic loci of lost genes, which we refer to as relics, were still actively transcribed, of which 84% (37 of 44) showed significantly differential expression across tissues. More interestingly, we found that a total of five expressed relics may function as competitive endogenous RNAs in the brachypodium, rice and sorghum genomes.
Based on comparative genomics and transcriptome data, we compiled the first comprehensive catalogue of unitary gene loss events in Poaceae species, characterized a statistically significant functional preference for these lost genes, and showed the potential of relics to function as competitive endogenous RNAs in Poaceae genomes.
Electronic supplementary material
The online version of this article (doi:10.1186/s12862-015-0345-x) contains supplementary material, which is available to authorized users.
Unitary gene loss; Poaceae; Competitive endogenous RNA
Transcription factors (TFs) play key roles in both development and stress responses. By integrating into and rewiring original systems, novel TFs contribute significantly to the evolution of transcriptional regulatory networks. Here, we report a high-confidence transcriptional regulatory map covering 388 TFs from 47 families in Arabidopsis. Systematic analysis of this map revealed the architectural heterogeneity of developmental and stress response subnetworks and identified three types of novel network motifs that are absent from unicellular organisms and essential for multicellular development. Moreover, TFs of novel families that emerged during plant landing present higher binding specificities and are preferentially wired into developmental processes and these novel network motifs. A further connection uncovered between the binding specificity and the wiring preference of TFs explains the wiring preferences of novel-family TFs. These results reveal distinct functional and evolutionary features of novel TFs, suggesting a plausible mechanism for their contribution to the evolution of multicellular organisms.
transcription factor; transcriptional regulation; network structure; novel family; wiring preference
Summary: Visualizing gene structure and annotated features helps biologists investigate gene function and evolution intuitively. The Gene Structure Display Server (GSDS) has been widely used by more than 60 000 users since its first publication in 2007. Here, we report the upgraded GSDS 2.0 with a newly designed interface, support for more types of annotation features and formats, and an integrated visual editor for editing the generated figure. Moreover, a user-specified phylogenetic tree can be added to facilitate further evolutionary analysis. The full source code is also available for download.
Availability and implementation: Web server and source code are freely available at http://gsds.cbi.pku.edu.cn.
Contact: email@example.com or firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Biocuration involves adding value to biomedical data by the processes of standardization, quality control and information transferring (also known as data annotation). It enhances data interoperability and consistency, and is critical in translating biomedical data into scientific discovery. Although China is becoming a leading scientific data producer, biocuration is still very new to the Chinese biomedical data community. In fact, there is currently no widely acknowledged Chinese equivalent for the word “curation”. Here we propose “审编” (Pinyin: shěn biān) as its Chinese translation, based on the meanings the biomedical data community attaches to the term. The 8th International Biocuration Conference, to be held in China next year (http://biocuration2015.tilsi.org), has the potential to raise general awareness in China of the significant role of biocuration in scientific discovery. However, challenges lie ahead in its implementation.
Biocuration; Integration; Big data; Database
This manuscript describes an update of the leaf senescence database (LSD) previously featured in the 2011 NAR Database Issue. LSD provides comprehensive information concerning senescence-associated genes (SAGs) and their corresponding mutants. We have made extensive annotations for these SAGs through both manual and computational approaches. Recently, we updated LSD to a new version LSD 2.0 (http://www.eplantsenescence.org/), which contains 5356 genes and 322 mutants from 44 species, an extension from the previous version containing 1145 genes and 154 mutants from 21 species. In the current version, we also included several new features: (i) Primer sequences retrieved based on experimental evidence or designed for high-throughput analysis were added; (ii) More than 100 images of Arabidopsis SAG mutants were added; (iii) Arabidopsis seed information obtained from The Arabidopsis Information Resource (TAIR) was integrated; (iv) Subcellular localization information of SAGs in Arabidopsis mined from literature or generated from the SUBA3 program was presented; (v) Quantitative Trait Loci information was added with links to the original database and (vi) New options such as primer and miRNA search for database query were implemented. The updated database will be a valuable and informative resource for basic research of leaf senescence and for the manipulation of traits of agronomically important plants.
To provide a resource for functional and evolutionary studies of plant transcription factors (TFs), we updated the plant TF database PlantTFDB to version 3.0 (http://planttfdb.cbi.pku.edu.cn). After refining the TF classification pipeline, we systematically identified 129 288 TFs from 83 species, of which 67 species have genome sequences, covering the main lineages of green plants. Besides the abundant annotation provided in the previous version, we generated more annotations for the identified TFs, including expression, regulation, interaction, conserved elements, phenotype information, expert-curated descriptions derived from UniProt, TAIR and NCBI GeneRIF, as well as references to provide clues for functional studies of TFs. To help identify evolutionary relationships among the identified TFs, we assigned 69 450 TFs into 3924 orthologous groups, and constructed 9217 phylogenetic trees for TFs within the same families or same orthologous groups, respectively. In addition, we set up a TF prediction server in this version for users to identify TFs from their own sequences.
With the development of the Internet and the growth of online resources, bioinformatics training has become a necessary part of the education of wet-lab biologists. This article describes a one-semester course, the ‘Applied Bioinformatics Course’ (ABC, http://abc.cbi.pku.edu.cn/), that the author has been teaching to biological graduate students at Peking University and the Chinese Academy of Agricultural Sciences for the past 13 years. ABC is a hands-on practical course that teaches students to use online bioinformatics resources to solve biological problems related to their ongoing research projects in molecular biology. After a brief introduction to the background of the course, detailed information about its teaching strategies is outlined in the ‘How to teach’ section. The contents of the course are briefly described in the ‘What to teach’ section with some real examples. The author wishes to share his teaching experiences and the online teaching materials with colleagues working in bioinformatics education at both local and international universities.
bioinformatics education; introductory course; hands-on course; project-based learning; on-site teaching
With the rapid growth of genome sequencing projects, genome browsers are becoming indispensable, not only as visualization systems but also as interactive platforms that support open data access and collaborative work. A customizable genome browser framework with rich functions and flexible configuration is therefore needed to facilitate various genome research projects.
Based on next-generation web technologies, we have developed a general-purpose genome browser framework, ABrowse, which provides an interactive browsing experience, open data access and collaborative work support. By supporting Google-map-like smooth navigation, ABrowse offers end users a highly interactive browsing experience. To facilitate further data analysis, multiple data access approaches are supported for external platforms to retrieve data from ABrowse. To promote collaborative work, an online user-space is provided for end users to create, store and share comments, annotations and landmarks. For data providers, ABrowse is highly customizable and configurable. The framework provides a set of utilities to import annotation data conveniently. To build ABrowse on existing annotation databases, data providers can specify SQL statements according to the database schema, and customized pages that display detailed information for annotation entries can be easily plugged in. For developers, new drawing strategies can be integrated into ABrowse for new types of annotation data. In addition, a standard web service is provided for remote data retrieval, offering an underlying machine-oriented programming interface for open data access.
The ABrowse framework is valuable for end users, data providers and developers because it provides rich user functions and flexible customization approaches. The source code is published under the GNU Lesser General Public License v3.0 and is accessible at http://www.abrowse.org/. To demonstrate all the features of ABrowse, a live demo for the Arabidopsis thaliana genome has been built at http://arabidopsis.cbi.edu.cn/.
Inverted duplicates (IDs) are pervasive in genomes and have been reported to play functional roles in various biological processes. However, the general underlying evolutionary forces that maintain IDs in genomes remain largely elusive. Through a systematic screening of the Drosophila melanogaster genome, 20,223 IDs were detected in nonrepetitive intergenic regions, far more than expected under the neutrality model. Of these IDs, 3,846 were identified as having a stable hairpin structure (i.e., the structural IDs). Based on whole-genome transcriptome profiling data, we found 628 unannotated expressed structural IDs, which had significantly different genomic distributions and structural properties from the unexpressed IDs. Among the expressed structural IDs, 130 exhibited higher expression in males than in females (i.e., male-biased expression). Compared with sex-unbiased ones, these male-biased IDs were significantly underrepresented on the X chromosome, similar to the previously reported pattern for male-biased protein-coding genes. These analyses suggest that a selection-driven process, rather than a purely neutral mutation-driven mechanism, contributes to the maintenance of IDs in the Drosophila genome.
inverted duplicates; noncoding RNA; sex evolution; MSCI; meiotic drive; Drosophila melanogaster
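To make the notion of an inverted duplicate concrete, the following minimal Python sketch reports pairs of sequence arms in which the downstream arm is the exact reverse complement of the upstream arm, so that the two arms could pair into a hairpin stem. It is only an illustration: the function names, the exact-match requirement, and the `arm_len` and `max_spacer` parameters are assumptions made here, and the abstract does not describe the screening procedure, thresholds or hairpin-stability criteria actually used in the study.

```python
# Illustrative only: exact-match inverted-duplicate search on one strand.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_duplicates(seq, arm_len=6, max_spacer=20):
    """Return (left_start, right_start) pairs of exact inverted arms.

    arm_len and max_spacer are hypothetical parameters chosen for the
    example, not values taken from the study.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * arm_len + 1):
        target = reverse_complement(seq[i:i + arm_len])
        first = i + arm_len                      # right arm starts after left arm
        last = min(first + max_spacer, len(seq) - arm_len)
        for j in range(first, last + 1):
            if seq[j:j + arm_len] == target:
                hits.append((i, j))
    return hits
```

For instance, in the sequence `ACGGTCTTTTGACCGT` the arm `ACGGTC` at position 0 is matched by its reverse complement `GACCGT` at position 10, separated by a four-base spacer.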
The concurrent release of rice genome sequences for two subspecies (Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica) facilitates rice studies at the whole-genome level. Since the advent of high-throughput analysis, huge amounts of functional genomics data have been delivered rapidly, making an integrated online genome browser indispensable for scientists to visualize and analyze these data. Based on next-generation web technologies and high-throughput experimental data, we have developed Rice-Map, a novel genome browser for researchers to navigate, analyze and annotate the rice genome interactively.
More than one hundred annotation tracks (81 for japonica and 82 for indica) have been compiled and loaded into Rice-Map. These pre-computed annotations cover gene models, transcript evidences, expression profiling, epigenetic modifications, inter-species and intra-species homologies, genetic markers and other genomic features. In addition to these pre-computed tracks, registered users can interactively add comments and research notes to Rice-Map as User-Defined Annotation entries. By smoothly scrolling, dragging and zooming, users can browse various genomic features simultaneously at multiple scales. On-the-fly analysis for selected entries could be performed through dedicated bioinformatic analysis platforms such as WebLab and Galaxy. Furthermore, a BioMart-powered data warehouse "Rice Mart" is offered for advanced users to fetch bulk datasets based on complex criteria.
Rice-Map delivers abundant up-to-date japonica and indica annotations, providing a valuable resource for both computational and bench biologists. Rice-Map is publicly accessible at http://www.ricemap.org/, with all data available for free downloading.
We updated the plant transcription factor (TF) database to version 2.0 (PlantTFDB 2.0, http://planttfdb.cbi.pku.edu.cn) which contains 53 319 putative TFs predicted from 49 species. We made detailed annotation including general information, domain feature, gene ontology, expression pattern and ortholog groups, as well as cross references to various databases and literature citations for these TFs classified into 58 newly defined families with computational approach and manual inspection. Multiple sequence alignments and phylogenetic trees for each family can be shown as Weblogo pictures or downloaded as text files. We have redesigned the user interface in the new version. Users can search TFs with much more flexibility through the improved advanced search page, and the search results can be exported into various formats for further analysis. In addition, we now provide web service for advanced users to access PlantTFDB 2.0 more efficiently.
Through a broad literature survey, we have developed a leaf senescence database (LSD, http://www.eplantsenescence.org/) that contains a total of 1145 senescence-associated genes (SAGs) from 21 species. These SAGs were retrieved based on genetic, genomic, proteomic, physiological or other experimental evidence, and were classified into different categories according to their functions in leaf senescence or morphological phenotypes when mutated. We made extensive annotations for these SAGs by both manual and computational approaches, and users can either browse or search the database to obtain information including literature, mutants, phenotypes, expression profiles, miRNA interactions, orthologs in other plants and cross links to other databases. We have also integrated a bioinformatics analysis platform, WebLab, into LSD, which allows users to perform extensive sequence analysis of SAGs of interest. The SAG sequences in LSD can also be downloaded readily for bulk analysis. We believe that LSD contains the largest number of SAGs to date and represents the most comprehensive and informative plant senescence-related database, which should facilitate systems biology research and comparative studies on plant aging.
Phytohormone studies have advanced our knowledge of plant responses to various changes. To provide a systematic and comprehensive view of genes participating in plant hormonal regulation, an online database, the Arabidopsis Hormone Database (AHD), has been developed; it is a collection of hormone-related genes of the model organism Arabidopsis thaliana (AHRGs). Recently we updated the database to a new version, AHD2.0, by adding several notable features: (i) updating our collection of AHRGs based on the most recent publications and constructing elaborate schematic diagrams of each hormone biosynthesis and signaling pathway; (ii) adding orthologs from the sequenced plants listed in OrthoMCL-DB to each AHRG in the updated database; (iii) providing predicted miRNA splicing site(s) for each AHRG; (iv) integrating genes that genetically interact with each AHRG according to literature mining; (v) providing links to a powerful online analysis platform, WebLab, for convenient and timely bioinformatics analysis and (vi) providing links to widely used protein databases and integrating more expression profiling information to support more systematic and integrative analysis in phytohormone research.
Genome-wide duplication is ubiquitous during diversification of the angiosperms, and gene duplication is one of the most important mechanisms for evolutionary novelties. As an indicator of functional evolution, the divergence of expression patterns following duplication events has drawn great attention in recent years. Using large-scale whole-genome microarray data, we systematically analyzed expression divergence patterns of rice genes from block, tandem and dispersed duplications.
We found a significant difference in expression divergence patterns for the three types of duplicated gene pairs. Expression correlation is significantly higher for gene pairs from block and tandem duplications than those from dispersed duplications. Furthermore, a significant correlation was observed between the expression divergence and the synonymous substitution rate which is an approximate proxy of divergence time. Thus, both duplication types and divergence time influence the difference in expression divergence. Using a linear model, we investigated the influence of these two variables and found that the difference in expression divergence between block and dispersed duplicates is attributed largely to their different divergence time. In addition, the difference in expression divergence between tandem and the other two types of duplicates is attributed to both divergence time and duplication type.
Consistent with previous studies on Arabidopsis, our results revealed a significant difference in expression divergence between the types of duplicated genes and a significant correlation between expression divergence and synonymous substitution rate. The contribution of duplication mode to expression divergence implies a different evolutionary course for duplicated genes of different origins.
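The expression correlation between the two members of a duplicated gene pair can be illustrated with a short Python sketch. It assumes that each gene is represented by a vector of expression values across the same set of microarray samples, uses the Pearson coefficient, and defines divergence as one minus that coefficient. These choices, and the function names, are assumptions made for illustration; the abstract does not state which correlation measure or divergence definition was used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def expression_divergence(profile_a, profile_b):
    """Assumed divergence measure: 1 - r, so identical profiles give 0."""
    return 1.0 - pearson_r(profile_a, profile_b)
```

Under this definition, a pair of duplicates with proportional expression profiles has a divergence close to 0, whereas a pair with opposite profiles approaches 2.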
Summary: NTAP is designed to analyze ChIP-chip data generated by the NimbleGen tiling array platform and to accomplish various pattern recognition tasks that are useful especially for epigenetic studies. The modular design of NTAP makes the data processing highly customizable. Users can either use NTAP to perform the full process of NimbleGen tiling array data analysis, or choose post-processing modules in NTAP to analyze pre-processed epigenetic data generated by other platforms. The output of NTAP can be saved in standard GFF format files and visualized in GBrowse.
Availability and Implementation: The source code of NTAP is freely available at http://ntap.cbi.pku.edu.cn/. It is implemented in Perl and R and can be used on Linux, Mac and Windows platforms.
Contact: email@example.com; firstname.lastname@example.org; email@example.com
With the rapid progress of biological research, there is great demand for integrative knowledge-sharing systems that efficiently support collaboration among biological researchers from various fields. To fulfill such requirements, we have developed a data-centric knowledge-sharing platform, WebLab, for biologists to fetch, analyze, manipulate and share data through an intuitive web interface. Dedicated space is provided for users to store their input data and analysis results. Users can upload local data or fetch public data from remote databases, and then perform analysis using more than 260 integrated bioinformatic tools. These tools can be further organized into customized analysis workflows to accomplish complex tasks automatically. In addition to conventional biological data, WebLab also provides rich support for scientific literature, such as full-text search of uploaded literature and export of citations to well-known citation managers such as EndNote and BibTex. To facilitate teamwork among colleagues, WebLab provides a powerful and flexible sharing mechanism, which allows users to share input data, analysis results, scientific literature and customized workflows with specified users or groups under sophisticated privilege settings. WebLab is publicly available at http://weblab.cbi.pku.edu.cn, with all source code released as Free Software.
Transcription factors (TFs) play key roles in controlling gene expression. Systematic identification and annotation of TFs, followed by construction of TF databases may serve as useful resources for studying the function and evolution of transcription factors. We developed a comprehensive plant transcription factor database PlantTFDB (http://planttfdb.cbi.pku.edu.cn), which contains 26 402 TFs predicted from 22 species, including five model organisms with available whole genome sequence and 17 plants with available EST sequences. To provide comprehensive information for those putative TFs, we made extensive annotation at both family and gene levels. A brief introduction and key references were presented for each family. Functional domain information and cross-references to various well-known public databases were available for each identified TF. In addition, we predicted putative orthologs of those TFs among the 22 species. PlantTFDB has a simple interface to allow users to search the database by IDs or free texts, to make sequence similarity search against TFs of all or individual species, and to download TF sequences for local analysis.
Developmental-stage-related patterns of gene expression correlate with codon usage and genomic GC content in stem cell hierarchies.
The usage of synonymous codons shows considerable variation among mammalian genes. How and why this usage is non-random are fundamental biological questions and remain controversial. It is also important to explore whether mammalian genes that are selectively expressed at different developmental stages bear different molecular features.
In two models of mouse stem cell differentiation, we established correlations between codon usage and the patterns of gene expression. We found that the optimal codons exhibited variation (AT- or GC-ending codons) in different cell types within the developmental hierarchy. We also found that genes that were enriched (developmental-pivotal genes) or specifically expressed (developmental-specific genes) at different developmental stages had different patterns of codon usage and local genomic GC (GCg) content. Moreover, at the same developmental stage, developmental-specific genes generally used more GC-ending codons and had higher GCg content compared with developmental-pivotal genes. Further analyses suggest that the model of translational selection might be consistent with the developmental stage-related patterns of codon usage, especially for the AT-ending optimal codons. In addition, our data show that after human-mouse divergence, the influence of selective constraints is still detectable.
Our findings suggest that developmental stage-related patterns of gene expression are correlated with codon usage (GC3) and GCg content in stem cell hierarchies. Moreover, this paper provides evidence for the influence of natural selection at synonymous sites in the mouse genome and novel clues for linking the molecular features of genes to their patterns of expression during mammalian ontogenesis.
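The GC3 measure mentioned above can be illustrated with a minimal Python sketch that computes the fraction of codons in a coding sequence whose third position is G or C. The function name `gc3` and the decision to ignore an incomplete trailing codon are assumptions made for this example; the abstract does not describe the authors' exact procedure, and published analyses often also exclude stop codons and the single-codon amino acids.

```python
def gc3(cds):
    """Fraction of complete codons in a coding sequence that end in G or C."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # drop an incomplete trailing codon
    third_bases = cds[2:usable:3]         # every third base, starting at index 2
    if not third_bases:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in third_bases) / len(third_bases)
```

For the sequence `ATGGCCAAA`, the third positions are G, C and A, so two of the three codons are GC-ending.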
The identification of chromosomal homology will shed light on such mysteries of genome evolution as DNA duplication, rearrangement and loss. Several approaches have been developed to detect chromosomal homology based on gene synteny or colinearity. However, the previously reported implementations lack statistical inferences which are essential to reveal actual homologies.
In this study, we present a statistical approach to detect homologous chromosomal segments based on gene colinearity. We implement this approach in a software package ColinearScan to detect putative colinear regions using a dynamic programming algorithm. Statistical models are proposed to estimate proper parameter values and evaluate the significance of putative homologous regions. Statistical inference, high computational efficiency and flexibility of input data type are three key features of our approach.
We apply ColinearScan to the Arabidopsis and rice genomes to detect duplicated regions within each species and homologous fragments between these two species. We find many more homologous chromosomal segments in the rice genome than previously reported. We also find many small colinear segments between rice and Arabidopsis genomes.
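The idea of detecting colinear regions by dynamic programming can be sketched in Python as follows. Homologous gene pairs are treated as anchor points given by their gene-order indices on two chromosomes, and the sketch finds the longest chain in which both indices increase and consecutive anchors lie within a maximum gap. This is a simplified illustration under stated assumptions, not ColinearScan's actual algorithm: the function name, the unit score per anchor and the `max_gap` parameter are hypothetical, only the forward orientation is handled, and the statistical significance evaluation described in the abstract is omitted.

```python
def longest_colinear_chain(pairs, max_gap=5):
    """Return the longest colinear chain of homologous gene pairs.

    pairs: iterable of (pos_a, pos_b) gene-order indices on two chromosomes.
    max_gap: hypothetical limit on the distance between neighbouring anchors.
    """
    anchors = sorted(set(pairs))
    if not anchors:
        return []
    best_len = [1] * len(anchors)   # best chain length ending at each anchor
    prev = [-1] * len(anchors)      # back-pointer used to recover the chain
    for k, (a_k, b_k) in enumerate(anchors):
        for m in range(k):
            a_m, b_m = anchors[m]
            if (a_m < a_k and b_m < b_k
                    and a_k - a_m <= max_gap and b_k - b_m <= max_gap
                    and best_len[m] + 1 > best_len[k]):
                best_len[k] = best_len[m] + 1
                prev[k] = m
    end = max(range(len(anchors)), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]
```

With the anchors `(1, 1), (2, 2), (3, 10), (4, 3), (5, 4), (20, 5)` and the default gap, the chain `(1, 1), (2, 2), (4, 3), (5, 4)` is recovered, while `(3, 10)` and `(20, 5)` are left out because they break the gap constraint.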
There is an increasing need to automatically annotate a set of genes or proteins (from genome sequencing, DNA microarray analysis or protein 2D gel experiments) using controlled vocabularies and to identify the pathways involved, especially the statistically enriched pathways. We have previously demonstrated that the KEGG Orthology (KO) is an effective alternative controlled vocabulary and developed a standalone KO-Based Annotation System (KOBAS). Here we report a KOBAS server with a friendly web-based user interface and enhanced functionalities. The server supports input as nucleotide or amino acid sequences or as sequence identifiers from popular databases, and can annotate the input with KO terms and KEGG pathways either by BLAST sequence similarity or by direct ID mapping to genes with known annotations. The server can then identify both frequent and statistically enriched pathways, offering a choice of four statistical tests and the option of multiple testing correction. The server also has a ‘User Space’ in which frequent users may store and manage their data and results online. We demonstrate the usability of the server by finding statistically enriched pathways in a set of upregulated genes in Alzheimer's Disease (AD) hippocampal cornu ammonis 1 (CA1). KOBAS server can be accessed at .
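One common way to decide whether a pathway is statistically enriched in a gene set is the hypergeometric test, sketched below in Python. The abstract does not name the four tests offered by the server, so this test is shown only as a plausible example, and the function and parameter names are assumptions. The function returns the probability of observing at least `sample_hits` pathway genes when `sample_size` genes are drawn without replacement from a background of `total_genes`, of which `pathway_genes` belong to the pathway.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, pathway_genes, sample_size, sample_hits):
    """Upper-tail hypergeometric probability P(X >= sample_hits)."""
    upper = min(pathway_genes, sample_size)
    numerator = sum(
        comb(pathway_genes, i) * comb(total_genes - pathway_genes, sample_size - i)
        for i in range(sample_hits, upper + 1)
    )
    return numerator / comb(total_genes, sample_size)
```

When many pathways are tested at once, the resulting p-values would normally be adjusted for multiple testing, which is the option the abstract refers to.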
Expressed Sequence Tag-based gene expression profiling can be used to discover functionally associated genes on a large scale. Currently available web servers and tools focus on finding differentially expressed genes in different samples or tissues rather than finding co-expressed genes. To fill this gap, we have developed a web server that implements the GBA (Guilt-by-Association) co-expression algorithm, which has been successfully used in finding disease-related genes. We have also annotated UniGene clusters with links to several important databases such as GO, KEGG, OMIM, Gene, IPI and HomoloGene. The GBA server can be accessed and downloaded at .
Using an improved secreted protein prediction approach and comprehensive data sources, including Swiss-Prot, TrEMBL, RefSeq, Ensembl and CBI-Gene, we have constructed secretomes of human, mouse and rat, with a total of 18 152 secreted proteins. All the entries are ranked according to prediction confidence. They were further annotated via a proteome annotation pipeline that we developed. We also set up a secreted protein classification pipeline and classified our predicted secreted proteins into different functional categories. To make the dataset more convincing and comprehensive, nine reference datasets are also integrated, such as the secreted proteins from the Gene Ontology Annotation (GOA) system at the European Bioinformatics Institute, and the vertebrate secreted proteins from Swiss-Prot. All these entries were grouped via a TribeMCL-based clustering pipeline. We have constructed a web-based secreted protein database (SPD), which is publicly available at http://spd.cbi.pku.edu.cn. Users can browse the database via a GO assignment or chromosomal-location-based interface. Moreover, text query and sequence similarity search are also provided, and the sequence and annotation data can be downloaded freely from the SPD website.
Prediction of RNA secondary structure is important in the functional analysis of RNA molecules. The RDfolder web server described in this paper provides two methods for prediction of RNA secondary structure: random stacking of helical regions and helical regions distribution. The random stacking method predicts secondary structure by Monte Carlo simulations. The method of helical regions distribution predicts secondary structure based on the helices that appear most frequently in the set of structures, which are generated by the random stacking method. The RDfolder web server can be accessed at http://rna.cbi.pku.edu.cn.