The use of Drosophila melanogaster as a model for studying human disease is well established, reflected by the steady increase in both the number and proportion of fly papers describing human disease models in recent years. In this article, we highlight recent efforts to improve the availability and accessibility of the disease model information in FlyBase (http://flybase.org), the model organism database for Drosophila. FlyBase has recently introduced Human Disease Model Reports, each of which presents background information on a specific disease, a tabulation of related disease subtypes, and summaries of experimental data and results using fruit flies. Integrated presentations of relevant data and reagents described in other sections of FlyBase are incorporated into these reports, which are specifically designed to be accessible to non-fly researchers in order to promote collaboration across model organism communities working in translational science. Another key component of disease model information in FlyBase is that data are collected in a consistent format – using the evolving Disease Ontology (an open-source standardized ontology for human-disease-associated biomedical data) – to allow robust and intuitive searches. To facilitate this, FlyBase has developed a dedicated tool for querying and navigating relevant data, which include mutations that model a disease and any associated interacting modifiers. In this article, we describe how data related to fly models of human disease are presented in individual Gene Reports and in the Human Disease Model Reports. Finally, we discuss search strategies and new query tools that are available to access the disease model data in FlyBase.
Drosophila melanogaster is well established as a model for studying human disease. Here, we highlight recent efforts to enhance the availability and accessibility of disease model data in FlyBase, the model organism database for Drosophila.
Drosophila; Disease model; Online resource; FlyBase
FlyBase (http://flybase.org) is the leading database and web portal for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. Whether you use the fruit fly as an experimental system or want to apply Drosophila biological knowledge to another field of study, FlyBase can help you successfully navigate the wealth of available Drosophila data. Here, we review the FlyBase web site with novice and less-experienced users of FlyBase in mind and point out recent developments stemming from the availability of genome-wide data from the modENCODE project. The first section of this paper explains the organization of the web site and describes the report pages available on FlyBase, focusing on the most popular, the Gene Report. The next section introduces some of the search tools available on FlyBase, in particular, our heavily used and recently redesigned search tool QuickSearch, found on the FlyBase homepage. The final section concerns genomic data, including recent modENCODE (http://www.modencode.org) data, available through our Genome Browser, GBrowse.
FlyBase (http://flybase.org) is the leading website and database of Drosophila genes and genomes. Whether you are using the fruit fly Drosophila melanogaster as an experimental system or wish to understand Drosophila biological knowledge in relation to human disease or to other model systems, FlyBase can help you successfully find the information you are looking for. Here, we demonstrate some of our more advanced searching systems and highlight some of our new tools for searching the wealth of data on FlyBase. The first section explores gene function in FlyBase, using our TermLink tool to search with Controlled Vocabulary terms and our new RNA-Seq Search tool to search gene expression. The second section of this article describes a few ways to search genomic data in FlyBase, using our BLAST server and the new implementation of GBrowse 2, as well as our new FeatureMapper tool. Finally, we move on to discuss our most powerful search tool, QueryBuilder, before describing pre-computed cuts of the data and how to query the database programmatically.
We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3′ UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts.
transcriptome; alternative splice; lncRNA; transcription start site; exon junction
FlyBase is the model organism database for Drosophila genetic and genomic information. Over the last 20 years, FlyBase has had to adapt and change to keep abreast of advances in biology and database design. We are continually looking for ways to improve curation efficiency and efficacy. Genetic literature curation focuses on the extraction of genetic entities (e.g. genes, mutant alleles, transgenic constructs) and their associated phenotypes and Gene Ontology terms from the published literature. Over 2000 Drosophila research articles are now published every year. These articles are becoming ever more data-rich and there is a growing need for text mining to shoulder some of the burden of paper triage and data extraction. In this article, we describe our curation workflow, along with some of the problems and bottlenecks therein, and highlight the opportunities for text mining. We do so in the hope of encouraging the BioCreative community to help us to develop effective methods to mine this torrent of information.
Release 6, the latest reference genome assembly of the fruit fly Drosophila melanogaster, was released by the Berkeley Drosophila Genome Project in 2014; it replaces their previous Release 5 genome assembly, which had been the reference genome assembly for over 7 years. With the enormous amount of information now attached to the D. melanogaster genome in public repositories and individual laboratories, the replacement of the previous assembly by the new one is a major event requiring careful migration of annotations and genome-anchored data to the new, improved assembly. In this report, we describe the attributes of the new Release 6 reference genome assembly, the migration of FlyBase genome annotations to this new assembly, how genome features on this new assembly can be viewed in FlyBase (http://flybase.org) and how users can convert coordinates for their own data to the corresponding Release 6 coordinates.
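As a rough illustration of what converting coordinates between assemblies involves, the sketch below maps a position through a table of aligned blocks. It is not the FlyBase converter: the block table, the arm name and the function `convert` are hypothetical, and the offsets are invented purely for the example.

```python
from bisect import bisect_right

# Hypothetical alignment blocks per chromosome arm (NOT real Release 5/6 data):
# (old_start, old_end, new_start), 1-based inclusive, sorted by old_start.
BLOCKS = {
    "2L": [(1, 1000, 1), (1001, 5000, 1501)],
}


def convert(arm, pos, blocks=BLOCKS):
    """Map a position on the old assembly to the new one, or return None
    if the position lies outside every aligned block."""
    arm_blocks = blocks.get(arm, [])
    starts = [b[0] for b in arm_blocks]
    # Index of the last block whose old_start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = arm_blocks[i]
    if pos > old_end:
        return None
    # Within a block, the offset from the block start is preserved.
    return new_start + (pos - old_start)


if __name__ == "__main__":
    for query in (500, 2000, 6000):
        print("2L", query, "->", convert("2L", query))
```

Positions that fall outside every block return `None` rather than a guessed coordinate, so features that cannot be placed on the new assembly are flagged instead of silently shifted.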
For nearly 25 years, FlyBase (flybase.org) has provided a freely available online database of biological information about Drosophila species, focusing on the model organism D. melanogaster. The need for a centralized, integrated view of Drosophila research has never been greater as advances in genomic, proteomic and high-throughput technologies add to the quantity and diversity of available data and resources.
FlyBase has taken several approaches to respond to these changes in the research landscape. Novel report pages have been generated for new reagent types and physical interaction data; Drosophila models of human disease are now represented and showcased in dedicated Human Disease Model Reports; other integrated reports have been established that bring together related genes, datasets or reagents; Gene Reports have been revised to improve access to new data types and to highlight functional data; links to external sites have been organized and expanded; and new tools have been developed to display and interrogate all these data, including improved batch processing and bulk file availability. In addition, several new community initiatives have served to enhance interactions between researchers and FlyBase, resulting in direct user contributions and improved feedback.
This chapter provides an overview of the data content, organization and available tools within FlyBase, focusing on recent improvements. We hope it serves as a guide for our diverse user base, enabling efficient and effective exploration of the database and thereby accelerating research discoveries.
FlyBase; Drosophila; database; genetics; genomics; translational research
An accurate, comprehensive, non-redundant and up-to-date bibliography is a crucial component of any Model Organism Database (MOD). Principally, the bibliography provides a set of references that are specific to the field served by the MOD. Moreover, it serves as a backbone to which all curated biological data can be attributed. Here, we describe the organization and main features of the bibliography in FlyBase (flybase.org), the MOD for Drosophila melanogaster. We present an overview of the current content of the bibliography, the pipeline for identifying and adding new references, the presentation of data within Reference Reports and effective methods for searching and retrieving bibliographic data. We highlight recent improvements in these areas and describe the advantages of using the FlyBase bibliography over alternative literature resources. Although this article is focused on bibliographic data, many of the features and tools described are applicable to browsing and querying other datasets in FlyBase.
FlyTF (http://www.flytf.org) is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF that concentrated primarily on the DNA-binding characteristics of the proteins has now been extended to a more fine-grained annotation of both DNA binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary, and in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes’ official GO annotation in the future, the entire evidence used for classification including computational predictions and quotes from the literature can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.
The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.
FlyBase (http://flybase.bio.indiana.edu/) provides an integrated view of the fundamental genomic and genetic data on the major genetic model Drosophila melanogaster and related species. FlyBase has primary responsibility for the continual reannotation of the D. melanogaster genome. The ultimate goal of the reannotation effort is to decorate the euchromatic sequence of the genome with as much biological information as is available from the community and from the major genome project centers. A complete revision of the annotations of the now-finished euchromatic genomic sequence has been completed. There are many points of entry to the genome within FlyBase, most notably through maps, gene products and ontologies, structured phenotypic and gene expression data, and anatomy.
Many publications describe sets of genes or gene products that share a common biology. For example, genome-wide studies and phylogenetic analyses identify genes related in sequence; high-throughput genetic and molecular screens reveal functionally related gene products; and advanced proteomic methods can determine the subunit composition of multi-protein complexes. It is useful for such gene collections to be presented as discrete lists within the appropriate Model Organism Database (MOD) so that researchers can readily access these data alongside other relevant information. To this end, FlyBase (flybase.org), the MOD for Drosophila melanogaster, has established a ‘Gene Group’ resource: high-quality sets of genes derived from the published literature and organized into individual report pages. To facilitate further analyses, Gene Group Reports also include convenient download and analysis options, together with links to equivalent gene groups at other databases. This new resource will enable researchers with diverse backgrounds and interests to easily view and analyse acknowledged D. melanogaster gene sets and compare them with those of other species.
FlyBase (http://flybase.bio.indiana.edu/) provides an integrated view of the fundamental genomic and genetic data on the major genetic model Drosophila melanogaster and related species. Following on the success of the Drosophila genome project, FlyBase has primary responsibility for the continual reannotation of the D. melanogaster genome. The ultimate goal of the reannotation effort is to decorate the euchromatic sequence of the genome with as much biological information as is available from the community and from the major genome project centers. The current cycle of reannotation focuses on establishing a comprehensive data set of gene models (i.e. transcription units and CDSs). There are many points of entry to the genome within FlyBase, most notably through maps, gene ontologies, structured phenotypic and gene expression data, and anatomy.
FlyBase is a database of genetic and molecular data concerning Drosophila. FlyBase is maintained as a relational database (in Sybase) and is made available as html documents and flat files. The scope of FlyBase includes: genes, alleles (and phenotypes), aberrations, transposons, pointers to sequence data, clones, stock lists, Drosophila workers and bibliographic references. The Encyclopedia of Drosophila is a joint effort between FlyBase and the Berkeley Drosophila Genome Project which integrates FlyBase data with those from the BDGP.
Since 1992, FlyBase (flybase.org) has been an essential online resource for the Drosophila research community. Concentrating on the most extensively studied species, Drosophila melanogaster, FlyBase includes information on genes (molecular and genetic), transgenic constructs, phenotypes, genetic and physical interactions, and reagents such as stocks and cDNAs. Access to data is provided through a number of tools, reports, and bulk-data downloads. Looking to the future, FlyBase is expanding its focus to serve a broader scientific community. In this update, we describe new features, datasets, reagent collections, and data presentations that address this goal, including enhanced orthology data, Human Disease Model Reports, protein domain search and visualization, concise gene summaries, a portal for external resources, video tutorials and the FlyBase Community Advisory Group.
FlyBase (http://flybase.bio.indiana.edu/) is a comprehensive database of genetic and molecular data concerning Drosophila. FlyBase is maintained as a relational database (in Sybase) and is made available as html documents and flat files. The scope of FlyBase includes: genes, alleles (with phenotypes), aberrations, transposons, pointers to sequence data, gene products, maps, clones, stock lists, Drosophila workers and bibliographic references.
Much of the data within Model Organism Databases (MODs) comes from manual curation of the primary research literature. Given limited funding and an increasing density of published material, a significant challenge facing all MODs is how to efficiently and effectively prioritize the most relevant research papers for detailed curation. Here, we report recent improvements to the triaging process used by FlyBase. We describe an automated method to directly e-mail corresponding authors of new papers, requesting that they list the genes studied and indicate (‘flag’) the types of data described in the paper using an online tool. Based on the author-assigned flags, papers are then prioritized for detailed curation and channelled to appropriate curator teams for full data extraction. The overall response rate has been 44% and the flagging of data types by authors is sufficiently accurate for effective prioritization of papers. In summary, we have established a sustainable community curation program, with the result that FlyBase curators now spend less time triaging and can devote more effort to the specialized task of detailed data extraction.
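The flag-based prioritization described above can be pictured with a minimal sketch. The flag names, weights and curator team names below are hypothetical placeholders, not the categories FlyBase actually uses; the sketch only shows the general idea of scoring papers by author-assigned flags and routing them to team queues.

```python
from dataclasses import dataclass, field

# Hypothetical data-type flags with illustrative weights and owning teams.
FLAG_WEIGHTS = {
    "new_allele": 3,
    "phenotype": 2,
    "physical_interaction": 2,
    "expression": 1,
}
FLAG_TEAMS = {
    "new_allele": "genetic",
    "phenotype": "genetic",
    "physical_interaction": "interactions",
    "expression": "expression",
}


@dataclass
class Paper:
    paper_id: str
    flags: set = field(default_factory=set)


def priority(paper):
    """Score a paper by summing the weights of its recognized flags."""
    return sum(FLAG_WEIGHTS.get(flag, 0) for flag in paper.flags)


def triage(papers):
    """Return {team: [paper_id, ...]} with each queue ordered by
    descending priority (ties broken by paper_id)."""
    queues = {}
    for paper in sorted(papers, key=lambda p: (-priority(p), p.paper_id)):
        teams = {FLAG_TEAMS[f] for f in paper.flags if f in FLAG_TEAMS}
        for team in sorted(teams):
            queues.setdefault(team, []).append(paper.paper_id)
    return queues


if __name__ == "__main__":
    demo = [
        Paper("P1", {"expression"}),
        Paper("P2", {"new_allele", "phenotype"}),
        Paper("P3", {"physical_interaction", "expression"}),
    ]
    print(triage(demo))
```

A paper flagged with several data types appears in more than one queue, which mirrors the idea of channelling the same paper to different curator teams.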
FlyBase (http://flybase.org) is the primary resource for molecular and genetic information on the Drosophilidae. The database serves researchers of diverse backgrounds and interests, and offers several different query tools to provide efficient access to the data available and facilitate the discovery of significant relationships within the database. Recently, FlyBase has developed Interactions Browser and enhanced GBrowse, which are graphical query tools, and made improvements to the search tools QuickSearch and QueryBuilder. Furthermore, these search tools have been integrated with Batch Download and new analysis tools through a more flexible search results list, providing powerful ways of exploring the data in FlyBase.
The recent completion of the Drosophila melanogaster genomic sequence to high quality, and the availability of a greatly expanded set of Drosophila cDNA sequences, afforded FlyBase the opportunity to significantly improve genomic annotations.
The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences.
Although the number of predicted protein-coding genes in Drosophila remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequence provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes.
Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations.
FlyBase (http://flybase.org) is the primary repository of genetic and molecular data of the insect family Drosophilidae. For the most extensively studied species, Drosophila melanogaster, a wide range of data are presented in integrated formats. Data types include mutant phenotypes, molecular characterization of mutant alleles and aberrations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models and molecular classification of gene product functions. There is a growing body of data for other Drosophila species; this is expected to increase dramatically over the next year, with the completion of draft-quality genomic sequences of an additional 11 Drosophila species.
The FlyBase Drosophila genetics database and the public interfaces of the Berkeley Drosophila Genome Project (BDGP) and European Drosophila Genome Project (EDGP) are in the process of integrating. At present, the data of these projects are available from independent, but hyperlinked, WWW sites (FlyBase URL, http://flybase.bio.indiana.edu/; BDGP URL, http://fruitfly.berkeley.edu/; EDGP URL, http://edgp.ebi.ac.uk/). Because of the considerable overlap of data classes between the contributions of the Drosophila genome projects and the Drosophila community, the new and enlarged FlyBase consortium views the implementation of a single integrated Drosophila genomics/genetics server as essential to the scientific community. This integration will occur in a stepwise fashion over the next 1-2 years. In this report, the salient features of the current databases and how to interrogate and navigate the extensive data sets are discussed.
Flytrap is a web-enabled relational database of transposable element insertions in Drosophila melanogaster. A green fluorescent protein (GFP) artificial exon carried by a transposable P-element is mobilized and inserted into a host gene intron, creating a GFP fusion protein. The sequence of the tagged gene is determined by sequencing inverse-PCR products derived from genomic DNA. Flytrap contains two principal data types: micrographs of protein localization and a cellular component ontology, based on rules derived from the Gene Ontology consortium (http://www.geneontology.org), describing protein localization. Flytrap also has links to gene information contained in FlyBase (http://flybase.bio.indiana.edu). The system is designed to accept submissions of micrographs and descriptions from any type of tissue (e.g. wing imaginal disk, ovary) and at any stage of development. Insertion lines can be searched using a number of queries, including Berkeley Drosophila Genome Project (BDGP) numbers and protein localization. In addition, Flytrap provides online order forms linked to each insertion line so that users may request any line generated from this project. Flytrap may be accessed from the homepage at http://flytrap.med.yale.edu.
FlyNets (http://gifts.univ-mrs.fr/FlyNets/FlyNets_home_page.html) is a WWW database describing molecular interactions (protein-DNA, protein-RNA and protein-protein) in the fly Drosophila melanogaster. It is composed of two parts, as follows. (i) FlyNets-base is a specialized database which focuses on molecular interactions involved in Drosophila development. The information content of FlyNets-base is distributed among several specific lines arranged according to a GenBank-like format and grouped into five thematic zones to improve human readability. The FlyNets database achieves a high level of integration with other databases such as FlyBase, EMBL, GenBank and SWISS-PROT through numerous hyperlinks. (ii) FlyNets-list is a very simple and more general databank, the long-term goal of which is to report on any published molecular interaction occurring in the fly, giving direct web access to corresponding abstracts in Medline and in FlyBase. In the context of genome projects, databases describing molecular interactions and genetic networks will provide a link at the functional level between the genome, the proteome and the transcriptome worlds of different organisms. Interaction databases therefore aim at describing the contents, structure, function and behaviour of what we herein define as the interactome world.
The availability of 12 fully sequenced Drosophila species genomes provides an excellent opportunity to explore the evolutionary mechanisms, structure and function of gene families in Drosophila. Currently, several important resources, such as FlyBase, FlyMine and DroSpeGe, are devoted to integrating genetic, genomic, and functional data on Drosophila into a well-organized form. However, all of these resources are gene-centric and lack information on gene families in Drosophila.
FlyPhy is a comprehensive phylogenomic analysis platform devoted to analyzing the genes and gene families in Drosophila. Genes were classified into families using a graph-based Markov Clustering algorithm and extensively annotated using a number of bioinformatic tools, with annotations that include basic sequence features, functional category, gene ontology terms, domain organization and sequence homology to entries in other databases. FlyPhy provides a simple and user-friendly web interface that allows users to browse and retrieve the information at multiple levels. An outstanding feature of FlyPhy is that all retrieved results can be added to a workset for further data manipulation. For the data stored in the workset, multiple sequence alignment, phylogenetic tree construction and visualization can be easily performed to investigate the sequence variation of each given family and to explore its evolutionary mechanism.
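For readers unfamiliar with graph-based Markov Clustering, the sketch below shows the core loop (expansion, inflation, convergence) on a tiny similarity graph. It is a minimal pure-Python illustration, not FlyPhy's pipeline: the function names, the inflation value of 2.0 and the toy adjacency matrix are assumptions, and a real analysis would use weighted sequence-similarity scores and a dedicated MCL implementation.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise every entry to the power r, then re-normalize,
    which strengthens strong flows and weakens weak ones."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of an undirected graph given as an adjacency matrix."""
    n = len(adjacency)
    # Add self-loops so every column has non-zero mass.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that retain mass act as attractors; the columns they reach
    # form one cluster. Duplicate member sets are merged via the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Toy graph: genes 0-2 are mutually similar, genes 3-4 are similar.
    toy = [
        [0, 1, 1, 0, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(mcl(toy))
```

The inflation parameter controls granularity: larger values tend to produce more, smaller families. On graphs that have not fully converged, this simple cluster extraction can return overlapping sets, which a production implementation would resolve.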
With the above functionalities, FlyPhy will be a useful resource and convenient platform for the Drosophila research community. The FlyPhy is available at .
Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.
The task of deciphering the complex transcriptional regulatory networks controlling development is one of the major current challenges for molecular biology. The problem is difficult, if not impossible, to solve without a detailed knowledge of the spatiotemporal dynamics of gene expression. Thus, to understand development, we need to identify and functionally characterize all players in regulatory networks. Data on gene expression dynamics obtained from whole transcriptome microarray experiments, combined with in situ hybridization mRNA localisation patterns for a subset of genes, may provide a route for predicting the localisation of gene expression for those genes for which in situ data has not been generated, as well as suggesting functional information for uncharacterised genes. Here, we report the development of one of the first methods for predicting the localisation of gene expression during Drosophila embryogenesis from microarray data. Pooling the subset of genes in the fly genome with in situ data to form functional units, localised in space and time for relevant developmental processes, facilitates the statement of a classification problem, which we address with machine-learning methods. Our approach promotes a richer annotation of biological function for genes in the absence of costly and time-consuming experimental analysis.