GBrowse is a mature web-based genome browser that is suitable for deployment on both public and private web sites. It supports most of genome browser features, including qualitative and quantitative (wiggle) tracks, track uploading, track sharing, interactive track configuration, semantic zooming and limited smooth track panning. As of version 2.0, GBrowse supports next-generation sequencing (NGS) data by providing for the direct display of SAM and BAM sequence alignment files. SAM/BAM tracks provide semantic zooming and support both local and remote data sources. This article provides step-by-step instructions for configuring GBrowse to display NGS data.
bioinformatics; genomics; DNA sequencing; genome browser; data visualization; data sharing
Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exon-intron structure remains mysterious despite rapid advance in genome sequencing technique. In this work, a novel approach is taken based on the assumptions that the evolution of exon-intron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the present-day size distribution of internal translated exons (exon). Through the combination of simulation and modeling the size distribution of exons in different species, we propose a general random fragmentation process (GRFP) to characterize the evolution dynamics of exon-intron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.
As the first observation from this model, we show that the chance for an exon to obtain an intron is proportional to its size to the 3rd power. We also show that such size dependence is nearly constant across gene, with the exception of the exons adjacent to the 5′ UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for non-vertebrate. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multi-intron genes in these organisms.
The GRFP model suggests that intron gain is dynamic with a higher chance for longer exons; introns are inserted into exons randomly with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb coding sequences for vertebrate genomes, agreeing with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of non-vertebrate genomes, with extreme cases of around 57% intron loss in Drosophila melanogaster, 28% in Caenorhabditis elegans, and 24% in Oryza sativa.
Evolution of exon-intron structure; General random fragmentation process; Simulation
In an effort to comprehensively characterize the functional elements within the genomes of the important model organisms Drosophila melanogaster and Caenorhabditis elegans, the NHGRI model organism Encyclopaedia of DNA Elements (modENCODE) consortium has generated an enormous library of genomic data along with detailed, structured information on all aspects of the experiments. The modMine database (http://intermine.modencode.org) described here has been built by the modENCODE Data Coordination Center to allow the broader research community to (i) search for and download data sets of interest among the thousands generated by modENCODE; (ii) access the data in an integrated form together with non-modENCODE data sets; and (iii) facilitate fine-grained analysis of the above data. The sophisticated search features are possible because of the collection of extensive experimental metadata by the consortium. Interfaces are provided to allow both biologists and bioinformaticians to exploit these rich modENCODE data sets now available via modMine.
Reactome is an open source, expert-authored, manually curated and peer-reviewed database of reactions, pathways and biological processes. We provide an intuitive web-based user interface to pathway knowledge and a suite of data analysis tools. The Reactome BioMart provides biologists and bioinformaticians with a single web interface for performing simple or elaborate queries of the Reactome database, aggregating data from different sources and providing an opportunity to integrate experimental and computational results with information relating to biological pathways.
Database URL: http://www.reactome.org
The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.
Database URL: http://www.modencode.org.
With DNA sequencing now getting cheaper more quickly than data storage, the time may have come to use cloud computing for genome informatics.
With DNA sequencing now getting cheaper more quickly than data storage or computation, the time may have come for genome informatics to migrate to the cloud.
The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to fact that coding exons contain a series of three nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied on each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that, (1) Modified Wavelet Transform (MWT) can better define the boundary of TP region than the conventional Short Time Fourier Transform (STFT); (2) The scale parameter (a) of MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal to noise ratio; (3) RNA splicing sites have weaker TP signals than coding region; (4) TP signals in coding region can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic region can generate false positive signals and it can be removed with 6 bp MWT.
MWT can provide more precise TP boundaries than STFT and the boundaries can be further refined by bigger scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally-introduced frame-shift mutations help recover TP signal that have been lost by possible ancient frame-shifts. More importantly, TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
Bioinformatics is alive and well in 2008 concludes Lincoln Stein, despite his earlier prediction of its imminent demise.
Bioinformatics has become too central to biology to be left to specialist bioinformaticians. Biologists are all bioinformaticians now.
While the C. elegans genome is extensively annotated, relatively little information is available for other Caenorhabditis species. The nematode genome annotation assessment project (nGASP) was launched to objectively assess the accuracy of protein-coding gene prediction software in C. elegans, and to apply this knowledge to the annotation of the genomes of four additional Caenorhabditis species and other nematodes. Seventeen groups worldwide participated in nGASP, and submitted 47 prediction sets across 10 Mb of the C. elegans genome. Predictions were compared to reference gene sets consisting of confirmed or manually curated gene models from WormBase.
The most accurate gene-finders were 'combiner' algorithms, which made use of transcript- and protein-alignments and multi-genome alignments, as well as gene predictions from other gene-finders. Gene-finders that used alignments of ESTs, mRNAs and proteins came in second. There was a tie for third place between gene-finders that used multi-genome alignments and ab initio gene-finders. The median gene level sensitivity of combiners was 78% and their specificity was 42%, which is nearly the same accuracy reported for combiners in the human genome. C. elegans genes with exons of unusual hexamer content, as well as those with unusually many exons, short exons, long introns, a weak translation start signal, weak splice sites, or poorly conserved orthologs posed the greatest difficulty for gene-finders.
This experiment establishes a baseline of gene prediction accuracy in Caenorhabditis genomes, and has guided the choice of gene-finders for the annotation of newly sequenced genomes of Caenorhabditis and other nematode species. We have created new gene sets for C. briggsae, C. remanei, C. brenneri, C. japonica, and Brugia malayi using some of the best-performing gene-finders.
GMODWeb is a software framework designed to speed the development of websites for model organism databases.
The Generic Model Organism Database (GMOD) initiative provides species-agnostic data models and software tools for representing curated model organism data. Here we describe GMODWeb, a GMOD project designed to speed the development of model organism database (MOD) websites. Sites created with GMODWeb provide integration with other GMOD tools and allow users to browse and search through a variety of data types. GMODWeb was built using the open source Turnkey web framework and is available from .
Gallus GBrowse (http://birdbase.net/cgi-bin/gbrowse/gallus/) provides online access to genomic and other information about the chicken, Gallus gallus. The information provided by this resource includes predicted genes and Gene Ontology (GO) terms, links to Gallus In Situ Hybridization Analysis (GEISHA), Unigene and Reactome, the genomic positions of chicken genetic markers, SNPs and microarray probes, and mappings from turkey, condor and zebra finch DNA and EST sequences to the chicken genome. We also provide a BLAT server (http://birdbase.net/cgi-bin/webBlat) for matching user-provided sequences to the chicken genome. These tools make the Gallus GBrowse server a valuable resource for researchers seeking genomic information regarding the chicken and other avian species.
WormBook () is an open-access, online collection of original, peer-reviewed chapters on the biology of Caenorhabditis elegans and related nematodes. Since WormBook was launched in June 2005 with 12 chapters, it has grown to over 100 chapters, covering nearly every aspect of C.elegans research, from Cell Biology and Neurobiology to Evolution and Ecology. WormBook also serves as the text companion to WormBase, the C.elegans model organism database. Objects such as genes, proteins and cells are linked to the relevant pages in WormBase, providing easily accessible background information. Additionally, WormBook chapters contain links to other relevant topics in WormBook, and the in-text citations are linked to their abstracts in PubMed and full-text references, if available. Since WormBook is online, its chapters are able to contain movies and complex images that would not be possible in a print version. WormBook is designed to keep up with the rapid pace of discovery in the field of C.elegans research and continues to grow. WormBook represents a generic publishing infrastructure that is easily adaptable to other research communities to facilitate the dissemination of knowledge in the field.
The SNP Consortium website (http://snp.cshl.org) has undergone many changes since its initial conception three years ago. The database back end has been changed from the venerable ACeDB to the more scalable MySQL engine. Users can access the data via gene or single nucleotide polymorphism (SNP) keyword searches and browse or dump SNP data to textfiles. A graphical genome browsing interface shows SNPs mapped onto the genome assembly in the context of externally available gene predictions and other features. SNP allele frequency and genotype data are available via FTP-download and on individual SNP report web pages. SNP linkage maps are available for download and for browsing in a comparative map viewer. All software components of the data coordinating center (DCC) website (http://snp.cshl.org) are open source.
WormBase (http://www.wormbase.org/) is a web-accessible central data repository for information about Caenorhabditis elegans and related nematodes. The past two years have seen a significant expansion in the biological scope of WormBase, including the integration of large-scale, genome-wide data sets, the inclusion of genome sequence and gene predictions from related species and active literature curation. This expansion of data has also driven the development and refinement of user interfaces and operability, including a new Genome Browser, new searches and facilities for data access and the inclusion of extensive documentation. These advances have expanded WormBase beyond the obvious target audience of C. elegans researchers, to include researchers wishing to explore problems in functional and comparative genomics within the context of a powerful genetic system.
The soil nematodes Caenorhabditis briggsae and Caenorhabditis elegans diverged from a common ancestor roughly 100 million years ago and yet are almost indistinguishable by eye. They have the same chromosome number and genome sizes, and they occupy the same ecological niche. To explore the basis for this striking conservation of structure and function, we have sequenced the C. briggsae genome to a high-quality draft stage and compared it to the finished C. elegans sequence. We predict approximately 19,500 protein-coding genes in the C. briggsae genome, roughly the same as in C. elegans. Of these, 12,200 have clear C. elegans orthologs, a further 6,500 have one or more clearly detectable C. elegans homologs, and approximately 800 C. briggsae genes have no detectable matches in C. elegans. Almost all of the noncoding RNAs (ncRNAs) known are shared between the two species. The two genomes exhibit extensive colinearity, and the rate of divergence appears to be higher in the chromosomal arms than in the centers. Operons, a distinctive feature of C. elegans, are highly conserved in C. briggsae, with the arrangement of genes being preserved in 96% of cases. The difference in size between the C. briggsae (estimated at approximately 104 Mbp) and C. elegans (100.3 Mbp) genomes is almost entirely due to repetitive sequence, which accounts for 22.4% of the C. briggsae genome in contrast to 16.5% of the C. elegans genome. Few, if any, repeat families are shared, suggesting that most were acquired after the two species diverged or are undergoing rapid evolution. Coclustering the C. elegans and C. briggsae proteins reveals 2,169 protein families of two or more members. Most of these are shared between the two species, but some appear to be expanding or contracting, and there seem to be as many as several hundred novel C. briggsae gene families. The C. briggsae draft sequence will greatly improve the annotation of the C. elegans genome. Based on similarity to C. briggsae, we found strong evidence for 1,300 new C. elegans genes. In addition, comparisons of the two genomes will help to understand the evolutionary forces that mold nematode genomes.
With the Caenorhabditis briggsae genome now in hand, C. elegans biologists have a powerful new research tool to refine their knowledge of gene function in C. elegans and to study the path of genome evolution
Funded by the National Institutes of Health (NIH), the aim of the Model Organism ENCyclopedia of DNA Elements (modENCODE) project is to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for both model organisms C. elegans (worm) and D. melanogaster (fly). With a total size of just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data has already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or combine modENCODE data with other data sets. Unfortunately this can be a time consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
Using these resources provides a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to use more of their time using modENCODE data, and less time moving it around.
WormBase (http://www.wormbase.org) is a central data repository for nematode biology. Initially created as a service to the Caenorhabditis elegans research field, WormBase has evolved into a powerful research tool in its own right. In the past 2 years, we expanded WormBase to include the complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. This comparative data enrich the C. elegans data with improved gene predictions and a better understanding of gene function. In turn, they bring the wealth of experimental knowledge of C. elegans to other systems of medical and agricultural importance. Here, we describe new species and data types now available at WormBase. In addition, we detail enhancements to our curatorial pipeline and website infrastructure to accommodate new genomes and an extensive user base.
WormBase (www.wormbase.org) is the major publicly available database of information about Caenorhabditis elegans, an important system for basic biological and biomedical research. Derived from the initial ACeDB database of C. elegans genetic and sequence information, WormBase now includes the genomic, anatomical and functional information about C. elegans, other Caenorhabditis species and other nematodes. As such, it is a crucial resource not only for C. elegans biologists but the larger biomedical and bioinformatics communities. Coverage of core areas of C. elegans biology will allow the biomedical community to make full use of the results of intensive molecular genetic analysis and functional genomic studies of this organism. Improved search and display tools, wider cross-species comparisons and extended ontologies are some of the features that will help scientists extend their research and take advantage of other nematode species genome sequences.
Comparative genomic analysis of three nematode species identifies 93 genes that encode putative components of the ciliated neurons in C. elegans and are subject to the same regulatory control.
The recent availability of genome sequences of multiple related Caenorhabditis species has made it possible to identify, using comparative genomics, similarly transcribed genes in Caenorhabditis elegans and its sister species. Taking this approach, we have identified numerous novel ciliary genes in C. elegans, some of which may be orthologs of unidentified human ciliopathy genes.
By screening for genes possessing canonical X-box sequences in promoters of three Caenorhabditis species, namely C. elegans, C. briggsae and C. remanei, we identified 93 genes (including known X-box regulated genes) that encode putative components of ciliated neurons in C. elegans and are subject to the same regulatory control. For many of these genes, restricted anatomical expression in ciliated cells was confirmed, and control of transcription by the ciliogenic DAF-19 RFX transcription factor was demonstrated by comparative transcriptional profiling of different tissue types and of daf-19(+) and daf-19(-) animals. Finally, we demonstrate that the dye-filling defect of dyf-5(mn400) animals, which is indicative of compromised exposure of cilia to the environment, is caused by a nonsense mutation in the serine/threonine protein kinase gene M04C9.5.
Our comparative genomics-based predictions may be useful for identifying genes involved in human ciliopathies, including Bardet-Biedl Syndrome (BBS), since the C. elegans orthologs of known human BBS genes contain X-box motifs and are required for normal dye filling in C. elegans ciliated neurons.
WormBase (), a model organism database for Caenorhabditis elegans and other related nematodes, continues to evolve and expand. Over the past year WormBase has added new data on C.elegans, including data on classical genetics, cell biology and functional genomics; expanded the annotation of closely related nematodes with a new genome browser for Caenorhabditis remanei; and deployed new hardware for stronger performance. Several existing datasets including phenotype descriptions and RNAi experiments have seen a large increase in new content. New datasets such as the C.remanei draft assembly and annotations, the Vancouver Fosmid library and TEC-RED 5′ end sites are now available as well. Access to and searching WormBase has become more dependable and flexible via multiple mirror sites and indexing through Google.
WormBase (), the public database for genomics and biology of Caenorhabditis elegans, has been restructured for stronger performance and expanded for richer biological content. Performance was improved by accelerating the loading of central data pages such as the omnibus Gene page, by rationalizing internal data structures and software for greater portability, and by making the Genome Browser highly customizable in how it views and exports genomic subsequences. Arbitrarily complex, user-specified queries are now possible through Textpresso (for all available literature) and through WormMart (for most genomic data). Biological content was enriched by reconciling all available cDNA and expressed sequence tag data with gene predictions, clarifying single nucleotide polymorphism and RNAi sites, and summarizing known functions for most genes studied in this organism.
WormBase (http://www.wormbase.org/) is the central data repository for information about Caenorhabditis elegans and related nematodes. As a model organism database, WormBase extends beyond the genomic sequence, integrating experimental results with extensively annotated views of the genome. The WormBase Consortium continues to expand the biological scope and utility of WormBase with the inclusion of large-scale genomic analyses, through active data and literature curation, through new analysis and visualization tools, and through refinement of the user interface. Over the past year, the nearly complete genomic sequence and comparative analyses of the closely related species Caenorhabditis briggsae have been integrated into WormBase, including gene predictions, ortholog assignments and a new synteny viewer to display the relationships between the two species. Extensive site-wide refinement of the user interface now provides quick access to the most frequently accessed resources and a consistent browsing experience across the site. Unified single-page views now provide complete summaries of commonly accessed entries like genes. These advances continue to increase the utility of WormBase for C.elegans researchers, as well as for those researchers exploring problems in functional and comparative genomics in the context of a powerful genetic system.