Model organisms are becoming increasingly important for the study of complex diseases such as type 1 diabetes (T1D). The non-obese diabetic (NOD) mouse is an experimental model for T1D having been bred to develop the disease spontaneously in a process that is similar to humans. Genetic analysis of the NOD mouse has identified around 50 disease loci, which have the nomenclature Idd for insulin-dependent diabetes, distributed across at least 11 different chromosomes. In total, 21 Idd regions across 6 chromosomes, that are major contributors to T1D susceptibility or resistance, were selected for finished sequencing and annotation at the Wellcome Trust Sanger Institute. Here we describe the generation of 40.4 mega base-pairs of finished sequence from 289 bacterial artificial chromosomes for the NOD mouse. Manual annotation has identified 738 genes in the diabetes sensitive NOD mouse and 765 genes in homologous regions of the diabetes resistant C57BL/6J reference mouse across 19 candidate Idd regions. This has allowed us to call variation consequences between homologous exonic sequences for all annotated regions in the two mouse strains. We demonstrate the importance of this resource further by illustrating the technical difficulties that regions of inter-strain structural variation between the NOD mouse and the C57BL/6J reference mouse can cause for current next generation sequencing and assembly techniques. Furthermore, we have established that the variation rate in the Idd regions is 2.3 times higher than the mean found for the whole genome assembly for the NOD/ShiLtJ genome, which we suggest reflects the fact that positive selection for functional variation in immune genes is beneficial in regard to host defence. In summary, we provide an important resource, which aids the analysis of potential causative genes involved in T1D susceptibility.
While alternative splicing (AS) can potentially expand the functional repertoire of vertebrate genomes, relatively few AS transcripts have been experimentally characterized. We describe our detailed manual annotation of vertebrate genomes, which is generating a publicly available geneset rich in AS. In order to achieve this, we have adopted a highly sensitive approach to annotating gene models supported by correctly mapped, canonically spliced transcriptional evidence combined with a highly cautious approach to adding unsupported extensions to models and making decisions on their functional potential. We use information about the predicted functional potential and structural properties of every AS transcript annotated at a protein-coding or non-coding locus to place it into one of eleven subclasses. We describe the incorporation of new sequencing and proteomics technologies into our annotation pipelines, which are used to identify and validate AS. Combining all data sources has led to the production of a rich geneset containing an average of 6.3 AS transcripts for every human multi-exon protein-coding gene. The datasets produced have proved very useful in providing context to studies investigating the functional potential of genes and the effect that variation may have on gene structure and function.
Motivation: Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences.
Results: Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text.
Availability: Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/
Supplementary information: Supplementary data are available at Bioinformatics online.
The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) was first made public in 2004 and has been designed to view manual annotation of human, mouse and zebrafish genomic sequences produced at the Wellcome Trust Sanger Institute. Since its initial release, the number of human annotated loci has more than doubled to close to 33 000 and now contains comprehensive annotation on 20 of the 24 human chromosomes, four whole mouse chromosomes and around 40% of the zebrafish Danio rerio genome. In addition, we offer manual annotation of a number of haplotype regions in mouse and human and regions of comparative interest in pig and dog that are unique to Vega.
The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).
Two large-scale phenotyping efforts, the European Mouse Disease Clinic (EUMODIC) and the Wellcome Trust Sanger Institute Mouse Genetics Project (SANGER-MGP), started during the late 2000s with the aim of delivering a comprehensive assessment of phenotypes or screening for robust indicators of diseases in mouse mutants. Both took advantage of available mouse mutant lines, but predominantly of the embryonic stem (ES) cell resources derived from the European Conditional Mouse Mutagenesis programme (EUCOMM) and the Knockout Mouse Project (KOMP), to produce and study 799 mouse models that were systematically analysed with a comprehensive set of physiological and behavioural paradigms. They captured more than 400 variables and an additional panel of metadata describing the conditions of the tests. All the data are now available through the EuroPhenome database (www.europhenome.org) and the WTSI mouse portal (http://www.sanger.ac.uk/mouseportal/), and the corresponding mouse lines are available through the European Mouse Mutant Archive (EMMA), the International Knockout Mouse Consortium (IKMC), or the Knockout Mouse Project (KOMP) Repository. Overall conclusions from both studies converged, with at least one phenotype scored in at least 80% of the mutant lines. In addition, 57% of the lines were viable, 13% subviable, 30% embryonic lethal, and 7% displayed fertility impairments. These efforts provide an important underpinning for a future global programme that will undertake the complete functional annotation of the mammalian genome in the mouse model.
Motivation: High-throughput sequencing (HTS) technologies have made low-cost sequencing of large numbers of samples commonplace. An explosion in the type, not just number, of sequencing experiments has also taken place including genome re-sequencing, population-scale variation detection, whole transcriptome sequencing and genome-wide analysis of protein-bound nucleic acids.
Results: We present Artemis as a tool for integrated visualization and computational analysis of different types of HTS datasets in the context of a reference genome and its corresponding annotation.
Availability: Artemis is freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute websites: http://www.sanger.ac.uk/resources/software/artemis/.
Detailed and comprehensive genome annotation can be considered a prerequisite for effective analysis and interpretation of omics data. As such, Gene Ontology (GO) annotation has become a well accepted framework for functional annotation. The genus Aspergillus comprises fungal species that are important model organisms, plant and human pathogens as well as industrial workhorses. However, GO annotation based on both computational predictions and extended manual curation has so far only been available for one of its species, namely A. nidulans.
Based on protein homology, we mapped 97% of the 3,498 GO annotated A. nidulans genes to at least one of seven other Aspergillus species: A. niger, A. fumigatus, A. flavus, A. clavatus, A. terreus, A. oryzae and Neosartorya fischeri. GO annotation files compatible with diverse publicly available tools have been generated and deposited online. To further improve their accessibility, we developed a web application for GO enrichment analysis named FetGOat and integrated GO annotations for all Aspergillus species with public genome sequences. Both the annotation files and the web application FetGOat are accessible via the Broad Institute's website (http://www.broadinstitute.org/fetgoat/index.html). To demonstrate the value of those new resources for functional analysis of omics data for the genus Aspergillus, we performed two case studies analyzing microarray data recently published for A. nidulans, A. niger and A. oryzae.
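The homology-based mapping described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: it assumes an ortholog table has already been produced by a separate protein homology search, and the function names, gene identifiers and term identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def transfer_go_annotations(source_go, orthologs):
    """Copy GO terms from annotated source genes to their target-species orthologs.

    source_go: dict mapping a source gene ID to a set of GO term IDs.
    orthologs: dict mapping a source gene ID to a list of target gene IDs
               (assumed to come from a separate protein homology search).
    """
    target_go = defaultdict(set)
    for source_gene, target_genes in orthologs.items():
        terms = source_go.get(source_gene, set())
        for target_gene in target_genes:
            target_go[target_gene].update(terms)
    # Drop target genes that received no terms at all.
    return {gene: terms for gene, terms in target_go.items() if terms}


def mapped_fraction(source_go, orthologs):
    """Fraction of GO-annotated source genes with at least one ortholog."""
    annotated = [gene for gene, terms in source_go.items() if terms]
    if not annotated:
        return 0.0
    mapped = [gene for gene in annotated if orthologs.get(gene)]
    return len(mapped) / len(annotated)


if __name__ == "__main__":
    # Hypothetical toy data; the IDs are placeholders, not real annotations.
    source_go = {
        "nidulans_g1": {"GO:term_a", "GO:term_b"},
        "nidulans_g2": {"GO:term_c"},
        "nidulans_g3": {"GO:term_d"},
    }
    orthologs = {
        "nidulans_g1": ["niger_g1"],
        "nidulans_g2": ["niger_g1", "niger_g2"],
    }
    print(transfer_go_annotations(source_go, orthologs))
    print(mapped_fraction(source_go, orthologs))
```

A statistic such as the 97% reported above corresponds to `mapped_fraction` computed against the union of orthologs across all target species, under the assumptions of this sketch.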
We mapped A. nidulans GO annotation to seven other Aspergilli. By depositing the newly mapped GO annotation online as well as integrating it into the web tool FetGOat, we provide new, valuable and easily accessible resources for omics data analysis and interpretation for the genus Aspergillus. Furthermore, we have given a general example of how a well annotated genome can help improve GO annotation of related species to subsequently facilitate the interpretation of omics data.
The initial outcome of genome sequencing is the creation of long text strings written in a four letter alphabet. The role of in silico sequence analysis is to assist biologists in the act of associating biological knowledge with these sequences, allowing investigators to make inferences and predictions that can be tested experimentally. A wide variety of software is available to the scientific community, and can be used to identify genomic objects, before predicting their biological functions. However, only a limited number of biologically interesting features can be revealed from an isolated sequence. Comparative genomics tools, on the other hand, by bringing together the information contained in numerous genomes simultaneously, allow annotators to make inferences based on the idea that evolution and natural selection are central to the definition of all biological processes. We have developed the MicroScope platform in order to offer a web-based framework for the systematic and efficient revision of microbial genome annotation and comparative analysis (http://www.genoscope.cns.fr/agc/microscope). Starting with the description of the flow chart of the annotation processes implemented in the MicroScope pipeline, and the development of traditional and novel microbial annotation and comparative analysis tools, this article emphasizes the essential role of expert annotation as a complement of automatic annotation. Several examples illustrate the use of implemented tools for the review and curation of annotations of both new and publicly available microbial genomes within MicroScope’s rich integrated genome framework. The platform is used as a viewer in order to browse updated annotation information of available microbial genomes (more than 440 organisms to date), and in the context of new annotation projects (117 bacterial genomes). 
The human expertise gathered in the MicroScope database (about 280,000 independent annotations) contributes to improve the quality of microbial genome annotation, especially for genomes initially analyzed by automatic procedures alone.
Database URLs: http://www.genoscope.cns.fr/agc/mage and http://www.genoscope.cns.fr/agc/microcyc
GeneDB (http://www.genedb.org) is a genome database for prokaryotic and eukaryotic pathogens and closely related organisms. The resource provides a portal to genome sequence and annotation data, which is primarily generated by the Pathogen Genomics group at the Wellcome Trust Sanger Institute. It combines data from completed and ongoing genome projects with curated annotation, which is readily accessible from a web based resource. The development of the database in recent years has focused on providing database-driven annotation tools and pipelines, as well as catering for increasingly frequent assembly updates. The website has been significantly redesigned to take advantage of current web technologies, and improve usability. The current release stores 41 data sets, of which 17 are manually curated and maintained by biologists, who review and incorporate data from the scientific literature, as well as other sources. GeneDB is primarily a production and annotation database for the genomes of predominantly pathogenic organisms.
With the overwhelming amount of biomedical textual information being produced, several manual curation efforts have been set up to extract and store concepts and their relationships into structured resources. As manual annotation is a demanding and expensive task, computerized solutions were developed to perform such tasks automatically. However, high-end information extraction techniques are still not widely used by biomedical research communities, mainly because of the lack of standards and limitations in usability. Interactive annotation tools intend to fill this gap, taking advantage of automatic techniques and existing knowledge bases to assist expert curators in their daily tasks. This article presents Egas, a web-based platform for biomedical text mining and assisted curation with highly usable interfaces for manual and automatic in-line annotation of concepts and relations. A comprehensive set of de facto standard knowledge bases are integrated and indexed to provide straightforward concept normalization features. Real-time collaboration and conversation functionalities allow discussing details of the annotation task as well as providing instant feedback of curator’s interactions. Egas also provides interfaces for on-demand management of the annotation task settings and guidelines, and supports standard formats and literature services to import and export documents. By taking advantage of Egas, we participated in the BioCreative IV interactive annotation task, targeting the assisted identification of protein–protein interactions described in PubMed abstracts related to neuropathological disorders. When evaluated by expert curators, it obtained positive scores in terms of usability, reliability and performance. These results, together with the provided innovative features, place Egas as a state-of-the-art solution for fast and accurate curation of information, facilitating the task of creating and updating knowledge bases and annotated resources.
Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort.
We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CAs) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website, as well as in the Community Annotation track of the Genome Browser.
We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families has been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at .
Leishmania spp. are sandfly-transmitted protozoan parasites that cause a spectrum of diseases in more than 12 million people worldwide. Much research is now focusing on how these parasites adapt to the distinct nutrient environments they encounter in the digestive tract of the sandfly vector and the phagolysosome compartment of mammalian macrophages. While data mining and annotation of the genomes of three Leishmania species have provided an initial inventory of predicted metabolic components and associated pathways, resources for integrating this information into metabolic networks and incorporating data from transcript, protein, and metabolite profiling studies are currently lacking. The development of a reliable, expertly curated, and widely available model of Leishmania metabolic networks is required to facilitate systems analysis, as well as discovery and prioritization of new drug targets for this important human pathogen.
The LeishCyc database was initially built from the genome sequence of Leishmania major (v5.2), based on the annotation published by the Wellcome Trust Sanger Institute. LeishCyc was manually curated to remove errors, correct automated predictions, and add information from the literature. The ongoing curation is based on public sources, literature searches, and our own experimental and bioinformatics studies. In a number of instances we have improved on the original genome annotation, and, in some ambiguous cases, collected relevant information from the literature in order to help clarify gene or protein annotation in the future. All genes in LeishCyc are linked to the corresponding entry in GeneDB (Wellcome Trust Sanger Institute).
The LeishCyc database describes Leishmania major genes, gene products, metabolites, their relationships and biochemical organization into metabolic pathways. LeishCyc provides a systematic approach to organizing the evolving information about Leishmania biochemical networks and is a tool for analysis, interpretation, and visualization of Leishmania Omics data (transcriptomics, proteomics, metabolomics) in the context of metabolic pathways. LeishCyc is the first such database for the Trypanosomatidae family, which includes a number of other important human parasites. Flexible query/visualization capabilities are provided by the Pathway Tools software and its Web interface. The LeishCyc database is made freely available over the Internet.
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all.
The time-consuming nature of manual curation and the rapid growth of biomedical literature severely limit the number of articles that database curators can scrutinize and annotate. Hence, semi-automatic tools can be a valid support to increase annotation throughput. Although a handful of curation assistant tools are already available, to date, little has been done to formally evaluate their benefit to biocuration. Moreover, most curation tools are designed for specific problems, so it is not easy to apply one annotation tool to multiple tasks. BioQRator is a publicly available web-based tool for annotating biomedical literature. It was designed to support general tasks, i.e. any task annotating entities and relationships. In the BioCreative IV edition, BioQRator was tailored for protein–protein interaction (PPI) annotation by migrating information from PIE the search. The results obtained from six curators showed that the precision on the top 10 documents doubled with PIE the search compared with PubMed search results. It was also observed that the annotation time for a full PPI annotation task decreased for a beginner-intermediate level annotator. This finding is encouraging because text-mining techniques were not directly involved in the full annotation task and BioQRator can be easily integrated with any text-mining resources.
Information processing algorithms require significant amounts of annotated data for training and testing. The availability of such data is often hindered by the complexity and high cost of production. In this paper, we investigate the benefits of a state-of-the-art tool to help with the semantic annotation of a large set of biomedical information queries.
Seven annotators were recruited to annotate a set of 10,000 PubMed® queries with 16 biomedical and bibliographic categories. About half of the queries were annotated from scratch, while the other half were automatically pre-annotated and manually corrected. The impact of the automatic pre-annotations was assessed on several aspects of the task: time, number of actions, annotator satisfaction, inter-annotator agreement, quality and number of the resulting annotations.
The analysis of annotation results showed that the number of required hand annotations is 28.9% less when using pre-annotated results from automatic tools. As a result, the overall annotation time was substantially lower when pre-annotations were used, while inter-annotator agreement was significantly higher. In addition, there was no statistically significant difference in the semantic distribution or number of annotations produced when pre-annotations were used. The annotated query corpus is freely available to the research community.
This study shows that most annotators find automatic pre-annotations helpful. Our experience suggests using an automatic tool to assist large-scale manual annotation projects. Doing so reduces annotation time and improves annotation consistency while maintaining high quality of the final annotations.
PubMed queries; biomedical entities; annotation standards; annotation methods
Motivation: Large-scale phenotyping projects such as the Sanger Mouse Genetics project are ongoing efforts to help identify the influences of genes and their modification on phenotypes. Gene–phenotype relations are crucial to the improvement of our understanding of human heritable diseases as well as the development of drugs. However, given that there are ∼20 000 genes in higher vertebrate genomes and the experimental verification of gene–phenotype relations requires a lot of resources, methods are needed that determine good candidates for testing.
Results: In this study, we applied an association rule mining approach to the identification of promising secondary phenotype candidates. The predictions rely on a large gene–phenotype annotation set that is used to find occurrence patterns of phenotypes. Applying an association rule mining approach, we could identify 1967 secondary phenotype hypotheses that cover 244 genes and 136 phenotypes. Using two automated and one manual evaluation strategies, we demonstrate that the secondary phenotype candidates possess biological relevance to the genes they are predicted for. From the results we conclude that the predicted secondary phenotypes constitute good candidates to be experimentally tested and confirmed.
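The association rule mining idea above can be illustrated with a small sketch. It is a simplified, pairwise version (rules of the form "phenotype A implies phenotype B") scored by the standard support and confidence measures; the thresholds, function names and toy data are assumptions made for illustration, and the study's actual algorithm and parameters are not reproduced here.

```python
from itertools import permutations


def mine_pairwise_rules(gene_phenotypes, min_support=0.3, min_confidence=0.6):
    """Find rules A -> B between phenotypes that co-occur across genes.

    gene_phenotypes: dict mapping a gene ID to the set of phenotypes annotated to it.
    support(A -> B)    = genes with both A and B / all genes
    confidence(A -> B) = genes with both A and B / genes with A
    """
    n_genes = len(gene_phenotypes)
    phenotypes = sorted(set().union(*gene_phenotypes.values()))
    rules = []
    for a, b in permutations(phenotypes, 2):
        with_a = {g for g, observed in gene_phenotypes.items() if a in observed}
        if not with_a:
            continue
        with_both = {g for g in with_a if b in gene_phenotypes[g]}
        support = len(with_both) / n_genes
        confidence = len(with_both) / len(with_a)
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


def secondary_candidates(gene_phenotypes, rules):
    """Propose (gene, phenotype) pairs where a rule fires but B is not yet annotated."""
    candidates = set()
    for gene, observed in gene_phenotypes.items():
        for a, b, _support, _confidence in rules:
            if a in observed and b not in observed:
                candidates.add((gene, b))
    return candidates


if __name__ == "__main__":
    # Hypothetical toy annotation set; gene and phenotype labels are placeholders.
    annotations = {
        "g1": {"P1", "P2"},
        "g2": {"P1", "P2"},
        "g3": {"P1", "P2", "P3"},
        "g4": {"P1"},
        "g5": {"P3"},
    }
    rules = mine_pairwise_rules(annotations)
    print(rules)
    print(secondary_candidates(annotations, rules))
```

In this toy set, gene g4 carries P1 but not P2, so the rule P1 -> P2 proposes P2 as a secondary phenotype hypothesis for g4.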
Availability: The secondary phenotype candidates can be browsed through at
email@example.com or firstname.lastname@example.org
Supplementary data are available at Bioinformatics online.
Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered an accuracy that is comparable with humans. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community.
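To make the agreement figures above concrete, the following sketch computes a pairwise F1 between two annotators' evidence spans. The text does not define "strict" and "relaxed" matching, so this example assumes strict means an identical span and relaxed means any overlap within the same document; the span representation and function names are hypothetical.

```python
def overlaps(x, y):
    """True when two (doc_id, start, end) spans share a document and overlap."""
    return x[0] == y[0] and x[1] < y[2] and y[1] < x[2]


def pairwise_f1(spans_a, spans_b, relaxed=False):
    """F1 agreement between two annotators, treating annotator B as the reference.

    spans_a, spans_b: sets of (doc_id, start, end) tuples.
    relaxed=False counts only identical spans (assumed 'strict' matching).
    relaxed=True counts a span as matched if it overlaps any span of the other annotator.
    """
    if not spans_a or not spans_b:
        return 0.0
    if relaxed:
        hits_a = sum(any(overlaps(a, b) for b in spans_b) for a in spans_a)
        hits_b = sum(any(overlaps(a, b) for a in spans_a) for b in spans_b)
    else:
        hits_a = hits_b = len(set(spans_a) & set(spans_b))
    precision = hits_a / len(spans_a)
    recall = hits_b / len(spans_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical character-offset spans for two annotators.
    annotator_a = {("doc1", 0, 50), ("doc1", 100, 150), ("doc2", 10, 40)}
    annotator_b = {("doc1", 0, 50), ("doc1", 120, 180), ("doc2", 60, 90)}
    print(pairwise_f1(annotator_a, annotator_b))                # strict
    print(pairwise_f1(annotator_a, annotator_b, relaxed=True))  # relaxed
```

The gap between the two scores in this toy case mirrors, in miniature, why relaxed agreement can be several times higher than strict agreement when annotators choose slightly different sentence boundaries.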
Recent developments in high-throughput sequencing (HTS) technologies have made it feasible to sequence the complete transcriptomes of non-model organisms or metatranscriptomes from environmental samples. The challenge after generating hundreds of millions of sequences is to annotate these transcripts and classify the transcripts based on their putative functions. Because many biological scientists lack the knowledge to install Linux-based software packages or maintain databases used for transcript annotation, we developed an automatic annotation tool with an easy-to-use interface.
To elucidate the potential functions of gene transcripts, we integrated well-established annotation tools: Blast2GO, PRIAM and RPS BLAST in a web-based service, FastAnnotator, which can assign Gene Ontology (GO) terms, Enzyme Commission numbers (EC numbers) and functional domains to query sequences.
Using six transcriptome sequence datasets as examples, we demonstrated the ability of FastAnnotator to assign functional annotations. FastAnnotator annotated 88.1% and 81.3% of the transcripts from the well-studied organisms Caenorhabditis elegans and Streptococcus parasanguinis, respectively. Furthermore, FastAnnotator annotated 62.9%, 20.4%, 53.1% and 42.0% of the sequences from the transcriptomes of sweet potato, clam, amoeba, and Trichomonas vaginalis, respectively, which lack reference genomes. We demonstrated that FastAnnotator can complete the annotation process in a reasonable amount of time and is suitable for the annotation of transcriptomes from model organisms or organisms for which annotated reference genomes are not available.
The sequencing process no longer represents the bottleneck in the study of genomics, and automatic annotation tools have become invaluable as the annotation procedure has become the limiting step. We present FastAnnotator, an automated annotation web tool designed to efficiently annotate sequences with their gene functions, enzyme functions or domains. FastAnnotator is useful in transcriptome studies, especially those focusing on non-model organisms or metatranscriptomes. FastAnnotator does not require local installation and is freely available at http://fastannotator.cgu.edu.tw.
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
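The core idea of keeping only identically annotated coding regions can be sketched as a set comparison. This is an illustration under simplifying assumptions, not the CCDS production process: each CDS is reduced to a hashable tuple of chromosome, strand and exon coordinates, quality assurance tests and identifier assignment are left out, and all names and coordinates are hypothetical.

```python
def compare_cds_sets(ncbi_cds, ensembl_cds):
    """Partition CDS structures into consensus and pipeline-specific groups.

    Each CDS is a tuple: (chromosome, strand, ((start, end), (start, end), ...)).
    Two annotations count as identical only if every coordinate matches.
    """
    ncbi, ensembl = set(ncbi_cds), set(ensembl_cds)
    return {
        "consensus": ncbi & ensembl,
        "ncbi_only": ncbi - ensembl,
        "ensembl_only": ensembl - ncbi,
    }


if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    shared = ("chr1", "+", ((100, 200), (300, 450)))
    ncbi_variant = ("chr1", "+", ((100, 200), (300, 480)))
    ensembl_variant = ("chr2", "-", ((50, 120),))
    result = compare_cds_sets([shared, ncbi_variant], [shared, ensembl_variant])
    print(result["consensus"])
```

Exact tuple equality is deliberately strict here: a single differing exon boundary places the two annotations in the pipeline-specific groups rather than in the consensus.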
Computational and manual annotations of protein function are among the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process, owing to the natural modularity of proteins, in which domains act as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally, functional ontologies have been applied to individual proteins rather than to families of related domains and supra-domains. We expect, however, that functional signals can to some extent be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.
Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has served as a valuable opportunity to validate our method and to have it assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. The generalized framework has also been applied to other domain classifications such as InterPro and Pfam, and to other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool.
As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
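The idea of inferring domain-level terms from protein-level annotations can be illustrated with a minimal sketch. It assumes a plain hypergeometric enrichment test with a fixed significance cutoff; the abstract above does not specify the statistical procedure, so this should not be read as the dcGO implementation. The identifiers (`domA`, `GO:A`, `P1`, ...), the function names and the `alpha` threshold are all hypothetical.

```python
from math import comb

def hypergeom_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def domain_term_associations(proteins, alpha=0.05):
    """Associate domains with terms that co-occur more often than expected.

    proteins: dict mapping protein id -> (set of domains, set of terms).
    Returns a dict mapping (domain, term) -> p-value for pairs with p <= alpha.
    The test and the cutoff are assumptions made for illustration only.
    """
    N = len(proteins)
    domains = set().union(*(d for d, _ in proteins.values()))
    terms = set().union(*(t for _, t in proteins.values()))
    results = {}
    for dom in domains:
        with_dom = {p for p, (d, _) in proteins.items() if dom in d}
        for term in terms:
            with_term = {p for p, (_, t) in proteins.items() if term in t}
            k = len(with_dom & with_term)
            if k == 0:
                continue  # the domain and the term never co-occur
            p = hypergeom_tail(k, N, len(with_term), len(with_dom))
            if p <= alpha:
                results[(dom, term)] = p
    return results

# Toy data (hypothetical): three proteins carry domA and term GO:A,
# five proteins carry domB and term GO:B.
toy = {f"P{i}": ({"domA"}, {"GO:A"}) for i in range(1, 4)}
toy.update({f"P{i}": ({"domB"}, {"GO:B"}) for i in range(4, 9)})

assoc = domain_term_associations(toy)
```

On this toy input, `assoc` pairs `domA` with `GO:A` and `domB` with `GO:B`. A realistic pipeline would additionally need multiple-testing correction and would have to account for the hierarchical structure of the ontology, both of which are omitted here.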
NEMBASE (available at http://www.nematodes.org) is a publicly available online database providing access to the sequence and associated meta-data currently being generated as part of the Edinburgh–Wellcome Trust Sanger Institute parasitic nematode EST project. NEMBASE currently holds ∼100 000 sequences from 10 different species of nematode. To facilitate ease of use, sequences have been processed to generate a non-redundant set of gene objects (‘partial genome’) for each species. Users may query the database on the basis of BLAST annotation, sequence similarity or expression profiles. NEMBASE also features an interactive Java-based tool (SimiTri) which allows the simultaneous display and analysis of the relative similarity relationships of groups of sequences to three different databases. NEMBASE is currently being expanded to include sequence data from other nematode species. Other developments include access to accurate peptide predictions, improved functional annotation and incorporation of automated processes allowing rapid analysis of nematode-specific gene families.
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method can identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets, in particular those genes that are missing literature-supported annotations, but our survey found no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation.
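The pairing step described above, in which a literature-based term and a computational term are compared by their relationship within the ontology, can be sketched as follows. This is a minimal illustration on a toy ontology, not the CvManGO code: the term names, gene names, category labels and the choice of which categories count as a discrepancy are all assumptions made for this example.

```python
# Toy ontology (hypothetical terms): child -> set of direct parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "kinase": {"catalysis"},
    "protein_kinase": {"kinase"},
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def compare(manual, computational, parents=PARENTS):
    """Classify the relationship between a manual and a computational term."""
    if manual == computational:
        return "identical"
    if manual in ancestors(computational, parents):
        return "computational_more_specific"
    if computational in ancestors(manual, parents):
        return "manual_more_specific"
    return "different_lineage"

# Assumption: these two categories are treated as discrepancies worth review.
REVIEW = {"computational_more_specific", "different_lineage"}

def genes_to_review(pairs):
    """pairs: iterable of (gene, manual_term, computational_term)."""
    return sorted({g for g, m, c in pairs if compare(m, c) in REVIEW})

pairs = [
    ("geneA", "kinase", "protein_kinase"),  # prediction is deeper than curation
    ("geneB", "protein_kinase", "kinase"),  # curation is already more specific
    ("geneC", "binding", "kinase"),         # terms sit on separate branches
    ("geneD", "kinase", "kinase"),          # annotations agree
]
flagged = genes_to_review(pairs)
```

Here `flagged` contains `geneA` and `geneC`. Treating only deeper or unrelated computational terms as discrepancies is one plausible design choice; a production version would operate on the full GO graph and its relationship types rather than on a hand-written dictionary.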
Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned ‘unknown’ annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome.
Database URL: http://www.yeastgenome.org