
Results 1-10 (10)

1.  tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles 
The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.
Database URL:
PMCID: PMC3978375  PMID: 24715220
2.  The Drosophila phenotype ontology 
Phenotype ontologies are queryable classifications of phenotypes. They provide a widely used means for annotating phenotypes in a form that is human-readable, programmatically accessible and that can be used to group annotations in biologically meaningful ways. Accurate manual annotation requires clear textual definitions for terms. Accurate grouping and fruitful programmatic usage require high-quality formal definitions that can be used to automate classification. The Drosophila phenotype ontology (DPO) has been used to annotate over 159,000 phenotypes in FlyBase to date, but until recently lacked textual or formal definitions.
We have composed textual definitions for all DPO terms and formal definitions for 77% of them. Formal definitions reference terms from a range of widely-used ontologies including the Phenotype and Trait Ontology (PATO), the Gene Ontology (GO) and the Cell Ontology (CL). We also describe a generally applicable system, devised for the DPO, for recording and reasoning about the timing of death in populations. As a result of the new formalisations, 85% of classifications in the DPO are now inferred rather than asserted, with much of this classification leveraging the structure of the GO. This work has significantly improved the accuracy and completeness of classification and made further development of the DPO more sustainable.
The DPO provides a set of well-defined terms for annotating Drosophila phenotypes and for grouping and querying the resulting annotation sets in biologically meaningful ways. Such queries have already resulted in successful function predictions from phenotype annotation. Moreover, such formalisations make extended queries possible, including cross-species queries via the external ontologies used in formal definitions. The DPO is openly available under an open source license in both OBO and OWL formats. There is good potential for it to be used more broadly by the Drosophila community, which may ultimately result in its extension to cover a broader range of phenotypes.
PMCID: PMC3816596  PMID: 24138933
Drosophila; Phenotype; Ontology; OWL; OBO; Gene ontology; FlyBase
3.  FlyBase: improvements to the bibliography 
Nucleic Acids Research  2012;41(Database issue):D751-D757.
An accurate, comprehensive, non-redundant and up-to-date bibliography is a crucial component of any Model Organism Database (MOD). Principally, the bibliography provides a set of references that are specific to the field served by the MOD. Moreover, it serves as a backbone to which all curated biological data can be attributed. Here, we describe the organization and main features of the bibliography in FlyBase, the MOD for Drosophila melanogaster. We present an overview of the current content of the bibliography, the pipeline for identifying and adding new references, the presentation of data within Reference Reports and effective methods for searching and retrieving bibliographic data. We highlight recent improvements in these areas and describe the advantages of using the FlyBase bibliography over alternative literature resources. Although this article is focused on bibliographic data, many of the features and tools described are applicable to browsing and querying other datasets in FlyBase.
PMCID: PMC3531214  PMID: 23125371
4.  Directly e-mailing authors of newly published papers encourages community curation 
Much of the data within Model Organism Databases (MODs) comes from manual curation of the primary research literature. Given limited funding and an increasing density of published material, a significant challenge facing all MODs is how to efficiently and effectively prioritize the most relevant research papers for detailed curation. Here, we report recent improvements to the triaging process used by FlyBase. We describe an automated method to directly e-mail corresponding authors of new papers, requesting that they list the genes studied and indicate (‘flag’) the types of data described in the paper using an online tool. Based on the author-assigned flags, papers are then prioritized for detailed curation and channelled to appropriate curator teams for full data extraction. The overall response rate has been 44% and the flagging of data types by authors is sufficiently accurate for effective prioritization of papers. In summary, we have established a sustainable community curation program, with the result that FlyBase curators now spend less time triaging and can devote more effort to the specialized task of detailed data extraction.
Database URL:
PMCID: PMC3342516  PMID: 22554788
5.  Automatic categorization of diverse experimental information in the bioscience literature 
BMC Bioinformatics  2012;13:16.
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reduce the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
PMCID: PMC3305665  PMID: 22280404
6.  Drosophila Ribosomal Protein Mutants Control Tissue Growth Non-Autonomously via Effects on the Prothoracic Gland and Ecdysone 
PLoS Genetics  2011;7(12):e1002408.
The ribosome is critical for all aspects of cell growth due to its essential role in protein synthesis. Paradoxically, many Ribosomal proteins (Rps) act as tumour suppressors in Drosophila and vertebrates. To examine how reductions in Rps could lead to tissue overgrowth, we took advantage of the observation that an RpS6 mutant dominantly suppresses the small rough eye phenotype in a cyclin E hypomorphic mutant (cycEJP). We demonstrated that the suppression of cycEJP by the RpS6 mutant is not a consequence of restoring CycE protein levels or activity in the eye imaginal tissue. Rather, the use of UAS-RpS6 RNAi transgenics revealed that the suppression of cycEJP is exerted via a mechanism extrinsic to the eye, whereby reduced Rp levels in the prothoracic gland decrease the activity of ecdysone, the steroid hormone, delaying developmental timing and hence allowing time for tissue and organ overgrowth. These data provide for the first time a rationale to explain the counter-intuitive organ overgrowth phenotypes observed for certain members of the Minute class of Drosophila Rp mutants. They also demonstrate how Rp mutants can affect growth and development cell non-autonomously.
Author Summary
Ribosomes are required for protein synthesis, which is essential for cell growth and division, thus mutations that reduce Rp expression would be expected to limit cell growth. Paradoxically, heterozygous deletion or mutation of certain Rps can actually promote growth and proliferation and in some cases bestow predisposition to cancer. The underlying mechanism(s) behind these unexpected overgrowth phenotypes despite impairment of ribosome biogenesis has remained obscure. We have addressed this question using the power of Drosophila genetics, taking advantage of our observation that four different Rp mutants, or Minutes, are able to suppress a small rough eye phenotype associated with a mutation of the essential controller of cell proliferation cyclin E (cycEJP). Our findings demonstrate that suppression of cycEJP by the RpS6 mutant is exerted via a tissue non-autonomous mechanism whereby reduced Rp in the prothoracic gland decreases activity of the steroid hormone ecdysone, delaying development and hence allowing time for compensatory growth. These data provide for the first time a rationale to explain the counter-intuitive organ overgrowth phenotypes observed for certain Drosophila Minutes. Our findings also have implications for the effect of Rp mutants on endocrine related control of tissue growth in higher organisms.
PMCID: PMC3240600  PMID: 22194697
7.  Genetic characterization of ebi reveals its critical role in Drosophila wing growth 
Fly  2011;5(4):291-303.
The ebi gene of Drosophila melanogaster has been implicated in diverse signaling pathways, cellular functions and developmental processes. However, a thorough genetic analysis of this gene has been lacking and the true extent of its biological roles is unclear. Here, we characterize eleven ebi mutations and find that ebi has a novel role in promoting growth of the wing imaginal disc: viable combinations of mutant alleles give rise to adults with small wings. Wing discs with reduced EBI levels are correspondingly small and exhibit downregulation of Notch target genes. Furthermore, we show that EBI colocalizes on polytene chromosomes with Smrter (SMR), a transcriptional corepressor, and Suppressor of Hairless (SU(H)), the primary transcription factor involved in Notch signaling. Interestingly, the mammalian orthologs of ebi, transducin β-like 1 (TBL1) and TBL-related 1 (TBLR1), function as corepressor/coactivator exchange factors and are required for transcriptional activation of Notch target genes. We hypothesize that EBI acts to activate (de-repress) transcription of Notch target genes important for Drosophila wing growth by functioning as a corepressor/coactivator exchange factor for SU(H).
PMCID: PMC3266070  PMID: 22041576
ebi; wing; growth regulation; Notch pathway; TBL1; TBLR1; corepressor/coactivator exchange factor
8.  Toward an interactive article: integrating journals and biological databases 
BMC Bioinformatics  2011;12:175.
Journal articles and databases are two major modes of communication in the biological sciences, and thus integrating these critical resources is of urgent importance to increase the pace of discovery. Projects focused on bridging the gap between journals and databases have been on the rise over the last five years and have resulted in the development of automated tools that can recognize entities within a document and link those entities to a relevant database. Unfortunately, automated tools cannot resolve ambiguities that arise from one term being used to signify entities that are quite distinct from one another. Instead, resolving these ambiguities requires some manual oversight. Finding the right balance between the speed and portability of automation and the accuracy and flexibility of manual effort is crucial to making text markup a successful venture.
We have established a journal article mark-up pipeline that links GENETICS journal articles and the model organism database (MOD) WormBase. This pipeline uses a lexicon built with entities from the database as a first step. The entity markup pipeline results in links from over nine classes of objects including genes, proteins, alleles, phenotypes and anatomical terms. New entities and ambiguities are discovered and resolved by a database curator through a manual quality control (QC) step, along with help from authors via a web form that is provided to them by the journal. New entities discovered through this pipeline are immediately sent to an appropriate curator at the database. Ambiguous entities that do not automatically resolve to one link are resolved by hand ensuring an accurate link. This pipeline has been extended to other databases, namely Saccharomyces Genome Database (SGD) and FlyBase, and has been implemented in marking up a paper with links to multiple databases.
Our semi-automated pipeline hyperlinks articles published in GENETICS to model organism databases such as WormBase. Our pipeline results in interactive articles that are data rich with high accuracy. The use of a manual quality control step sets this pipeline apart from other hyperlinking tools and results in benefits to authors, journals, readers and databases.
PMCID: PMC3213741  PMID: 21595960
9.  FlyBase: enhancing Drosophila Gene Ontology annotations 
Nucleic Acids Research  2008;37(Database issue):D555-D559.
FlyBase is a database of Drosophila genetic and genomic information. Gene Ontology (GO) terms are used to describe three attributes of wild-type gene products: their molecular function, the biological processes in which they play a role, and their subcellular location. This article describes recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data. Many of these changes stem from our participation in the GO Reference Genome Annotation Project, a multi-database collaboration producing comprehensive GO annotation sets for 12 diverse species.
PMCID: PMC2686450  PMID: 18948289
10.  The ribosomal protein genes and Minute loci of Drosophila melanogaster 
Genome Biology  2007;8(10):R216.
A combined bioinformatic and genetic approach was used to conduct a systematic analysis of the relationship between ribosomal protein genes and Minute loci in Drosophila melanogaster, allowing the identification of 64 Minute loci corresponding to ribosomal genes.
Mutations in genes encoding ribosomal proteins (RPs) have been shown to cause an array of cellular and developmental defects in a variety of organisms. In Drosophila melanogaster, disruption of RP genes can result in the 'Minute' syndrome of dominant, haploinsufficient phenotypes, which include prolonged development, short and thin bristles, and poor fertility and viability. While more than 50 Minute loci have been defined genetically, only 15 have so far been characterized molecularly and shown to correspond to RP genes.
We combined bioinformatic and genetic approaches to conduct a systematic analysis of the relationship between RP genes and Minute loci. First, we identified 88 genes encoding 79 different cytoplasmic RPs (CRPs) and 75 genes encoding distinct mitochondrial RPs (MRPs). Interestingly, nine CRP genes are present as duplicates and, while all appear to be functional, one member of each gene pair has relatively limited expression. Next, we defined 65 discrete Minute loci by genetic criteria. Of these, 64 correspond to, or very likely correspond to, CRP genes; the single non-CRP-encoding Minute gene encodes a translation initiation factor subunit. Significantly, MRP genes and more than 20 CRP genes do not correspond to Minute loci.
This work answers a longstanding question about the molecular nature of Minute loci and suggests that Minute phenotypes arise from suboptimal protein synthesis resulting from reduced levels of cytoribosomes. Furthermore, by identifying the majority of haplolethal and haplosterile loci at the molecular level, our data will directly benefit efforts to attain complete deletion coverage of the D. melanogaster genome.
PMCID: PMC2246290  PMID: 17927810
