Results 1-8 (8)

1.  Overview of the gene ontology task at BioCreative IV 
Gene Ontology (GO) annotation is a common task among model organism databases (MODs) for capturing gene function data from journal articles. It is a time-consuming and labor-intensive task, and is thus often considered one of the bottlenecks in literature curation. There is a growing need for semiautomated or fully automated GO curation techniques that will help database curators to rapidly and accurately identify gene function information in full-length articles. Despite multiple attempts in the past, few studies have proven to be useful with regard to assisting real-world GO curation. The shortage of sentence-level training data and of opportunities for interaction between text-mining developers and GO curators has limited the advances in algorithm development and corresponding use in practical circumstances. To this end, we organized a text-mining challenge task for literature-based GO annotation in BioCreative IV. More specifically, we developed two subtasks: (i) to automatically locate text passages that contain GO-relevant information (a text retrieval task) and (ii) to automatically identify relevant GO terms for the genes in a given article (a concept-recognition task). With support from five MODs, we provided teams with >4000 unique text passages that served as the basis for each GO annotation in our task data. Such evidence text information has long been recognized as critical for text-mining algorithm development but was never made available because of the high cost of curation. In total, seven teams participated in the challenge task. From the team results, we conclude that the state of the art in automatically mining GO terms from literature has improved over the past decade, while much progress is still needed for computer-assisted GO curation.
Future work should focus on addressing remaining technical challenges for improved performance of automatic GO concept recognition and incorporating practical benefits of text-mining tools into real-world GO annotation.
Database URL:
PMCID: PMC4142793  PMID: 25157073
2.  BC4GO: a full-text corpus for the BioCreative IV GO task 
Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable to that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
Database URL:
PMCID: PMC4112614  PMID: 25070993
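The strict and relaxed F1 agreement figures reported in entry 2 can be illustrated with a minimal sketch. The abstract does not give the matching rules, so this assumes that "strict" means two annotators marked identical spans and "relaxed" means their spans merely overlap. The function name `f1_agreement`, the character-offset span representation, and the sample data are all hypothetical.

```python
def f1_agreement(spans_a, spans_b, relaxed=False):
    """F1 agreement between two annotators' evidence spans.

    Each span is a (start, end) character-offset pair, end exclusive.
    Assumed matching rules (not taken from the source):
      strict  - spans must be identical
      relaxed - spans need only overlap
    """
    def matches(span, others):
        if not relaxed:
            return span in others
        return any(span[0] < o[1] and o[0] < span[1] for o in others)

    if not spans_a or not spans_b:
        return 0.0
    # Treat annotator B as the reference: precision over A, recall over B.
    precision = sum(matches(s, spans_b) for s in spans_a) / len(spans_a)
    recall = sum(matches(s, spans_a) for s in spans_b) / len(spans_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same article by two curators.
annotator_a = [(0, 50), (120, 180), (300, 360)]
annotator_b = [(0, 50), (110, 170), (400, 450)]

strict = f1_agreement(annotator_a, annotator_b)
relaxed = f1_agreement(annotator_a, annotator_b, relaxed=True)
print(f"strict F1 = {strict:.3f}, relaxed F1 = {relaxed:.3f}")
```

Under these assumptions the relaxed score can never be lower than the strict one, which is consistent with the direction of the gap reported in the abstract.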
3.  Representing Kidney Development Using the Gene Ontology 
PLoS ONE  2014;9(6):e99864.
Gene Ontology (GO) provides dynamic controlled vocabularies to aid in the description of the functional biological attributes and subcellular locations of gene products from all taxonomic groups. Here we describe a collaboration between the renal biomedical research community and the GO Consortium to improve the quality and quantity of GO terms describing renal development. In the associated annotation activity, the new and revised terms were associated with gene products involved in renal development and function. This project resulted in a total of 522 GO terms being added to the ontology and the creation of approximately 9,600 kidney-related GO term associations to 940 UniProt Knowledgebase (UniProtKB) entries, covering 66 taxonomic groups. We demonstrate the impact of these improvements on the interpretation of GO term analyses performed on genes differentially expressed in kidney glomeruli affected by diabetic nephropathy. In summary, we have produced a resource that can be utilized in the interpretation of data from small- and large-scale experiments investigating molecular mechanisms of kidney function and development and thereby help towards alleviating renal disease.
PMCID: PMC4062467  PMID: 24941002
4.  Systematic Analysis of Experimental Phenotype Data Reveals Gene Functions 
PLoS ONE  2013;8(4):e60847.
High-throughput phenotyping projects in model organisms have the potential to improve our understanding of gene functions and their role in living organisms. We have developed a computational, knowledge-based approach to automatically infer gene functions from phenotypic manifestations and applied this approach to yeast (Saccharomyces cerevisiae), nematode worm (Caenorhabditis elegans), zebrafish (Danio rerio), fruitfly (Drosophila melanogaster) and mouse (Mus musculus) phenotypes. Our approach is based on the assumption that, if a mutation in a gene G leads to a phenotypic abnormality in a process P, then G must have been involved in P, either directly or indirectly. We systematically analyze recorded phenotypes in animal models using the formal definitions created for phenotype ontologies. We evaluate the validity of the inferred functions manually and by demonstrating a significant improvement in predicting genetic interactions and protein-protein interactions based on functional similarity. Our knowledge-based approach is generally applicable to phenotypes recorded in model organism databases, including phenotypes from large-scale, high-throughput community projects whose primary mode of dissemination is direct publication online rather than in the literature.
PMCID: PMC3628905  PMID: 23626672
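The inference rule stated in entry 4 (a mutation in a gene that causes an abnormality in a process implies the gene is involved in that process) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the phenotype-to-process mapping, the process hierarchy, the gene names, and the function `infer_functions` are all hypothetical, and the upward propagation to parent processes is an added assumption.

```python
# Hypothetical toy data: each phenotype term maps to the process it
# describes as abnormal. Real phenotype ontologies encode this link in
# their formal definitions.
PHENOTYPE_TO_PROCESS = {
    "abnormal heart looping": "heart looping",
    "decreased cell migration": "cell migration",
}

# Hypothetical ontology fragment: child process -> parent process.
PROCESS_PARENT = {
    "heart looping": "heart morphogenesis",
    "heart morphogenesis": "heart development",
}


def infer_functions(gene_phenotypes):
    """Map each gene to the processes it is inferred to be involved in."""
    inferred = {}
    for gene, phenotypes in gene_phenotypes.items():
        processes = set()
        for phenotype in phenotypes:
            process = PHENOTYPE_TO_PROCESS.get(phenotype)
            # Assumption: involvement in a process implies involvement
            # in its ancestors, so walk up the hierarchy.
            while process is not None:
                processes.add(process)
                process = PROCESS_PARENT.get(process)
        inferred[gene] = processes
    return inferred


# Hypothetical mutant phenotype records.
result = infer_functions({
    "geneA": ["abnormal heart looping"],
    "geneB": ["decreased cell migration", "unmapped phenotype"],
})
for gene, processes in sorted(result.items()):
    print(gene, sorted(processes))
```

Phenotypes with no known process mapping contribute nothing, so the sketch only ever infers functions that the mapping explicitly supports.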
5.  The Representation of Heart Development in the Gene Ontology 
Developmental Biology  2011;354(1):9-17.
An understanding of heart development is critical in any systems biology approach to cardiovascular disease. The interpretation of data generated from high-throughput technologies (such as microarray and proteomics) is also essential to this approach. However, characterizing the role of genes in the processes underlying heart development and cardiovascular disease involves the non-trivial task of data analysis and integration of previous knowledge. The Gene Ontology (GO) Consortium provides structured controlled biological vocabularies that are used to summarize previous functional knowledge for gene products across all species. One aspect of GO describes biological processes, such as development and signaling.
In order to support high-throughput cardiovascular research, we have initiated an effort to fully describe heart development in GO, expanding the number of GO terms describing heart development from 12 to over 280. This new ontology describes heart morphogenesis, the differentiation of specific cardiac cell types, and the involvement of signaling pathways in heart development, and it aligns GO with the current views of the heart development research community and its representation in the literature. This extension of GO allows gene product annotators to comprehensively capture the genetic program leading to the developmental progression of the heart. This will enable users to integrate heart development data across species, resulting in the comprehensive retrieval of information about this subject.
The revised GO structure, combined with gene product annotations, should improve the interpretation of data from high-throughput methods in a variety of cardiovascular research areas, including heart development, congenital cardiac disease, and cardiac stem cell research. Additionally, we invite the heart development community to contribute to the expansion of this important dataset for the benefit of future research in this area.
PMCID: PMC3302178  PMID: 21419760
annotation; cardiovascular; development; Gene Ontology; heart
6.  FlyTF: improved annotation and enhanced functionality of the Drosophila transcription factor database 
Nucleic Acids Research  2009;38(Database issue):D443-D447.
FlyTF is a database of computationally predicted and/or experimentally verified site-specific transcription factors (TFs) in the fruit fly Drosophila melanogaster. The manual classification of TFs in the initial version of FlyTF, which concentrated primarily on the DNA-binding characteristics of the proteins, has now been extended to a more fine-grained annotation of both DNA-binding and regulatory properties in the new release. Furthermore, experimental evidence from the literature was classified into a defined vocabulary and, in collaboration with FlyBase, translated into Gene Ontology (GO) annotation. While our GO annotations will also be available through FlyBase as they will be incorporated into the genes’ official GO annotation in the future, the entire evidence used for classification, including computational predictions and quotes from the literature, can be accessed through FlyTF. The FlyTF website now builds upon the InterMine framework, which provides experimental and computational biologists with powerful search and filter functionality, list management tools and access to genomic information associated with the TFs.
PMCID: PMC2808907  PMID: 19884132
7.  FlyBase: enhancing Drosophila Gene Ontology annotations 
Nucleic Acids Research  2008;37(Database issue):D555-D559.
FlyBase is a database of Drosophila genetic and genomic information. Gene Ontology (GO) terms are used to describe three attributes of wild-type gene products: their molecular function, the biological processes in which they play a role, and their subcellular location. This article describes recent changes to the FlyBase GO annotation strategy that are improving the quality of the GO annotation data. Many of these changes stem from our participation in the GO Reference Genome Annotation Project, a multi-database collaboration producing comprehensive GO annotation sets for 12 diverse species.
PMCID: PMC2686450  PMID: 18948289
8.  Identification of Jade1, a Gene Encoding a PHD Zinc Finger Protein, in a Gene Trap Mutagenesis Screen for Genes Involved in Anteroposterior Axis Development 
Molecular and Cellular Biology  2003;23(23):8553-8562.
In a gene trap screen for genes expressed in the primitive streak and tail bud during mouse embryogenesis, we isolated a mutation in Jade1, a gene encoding a PHD zinc finger protein previously shown to interact with the tumor suppressor pVHL. Expressed sequence tag analysis indicates that Jade1 is subject to posttranscriptional regulation, resulting in multiple transcripts and at least two protein isoforms. The fusion Jade1-β-galactosidase reporter produced by the gene trap allele exhibits a regulated expression during embryogenesis and localizes to the nucleus and/or cytoplasm of different cell types. In addition to the primitive streak and tail bud, β-galactosidase activity was found in other embryonic regions where pluripotent or tissue-specific progenitors are known to reside, including the early gastrulation epiblast and the ventricular zone of the cerebral cortex. Prominent reporter expression was also seen in the extraembryonic tissues as well as other differentiated cell types in the embryo, in particular the developing musculature. We show that the gene trap mutation produces a null allele. However, homozygotes for the gene trap integration are viable and fertile. Database searches identified a family of Jade proteins conserved through vertebrates. This raises the possibility that the absence of phenotype is due to a functional compensation by other family members.
PMCID: PMC262661  PMID: 14612400
