Analysis of functional genomics (transcriptomics and proteomics) datasets is hindered in agricultural species because agricultural genome sequences have relatively poor structural and functional annotation. To facilitate systems biology in these species we have established the curated, web-accessible, public resource ‘AgBase’ (). We have improved the structural annotation of agriculturally important genomes by experimentally confirming the in vivo expression of electronically predicted proteins and by proteogenomic mapping. Proteogenomic data are available from the AgBase proteogenomics link. We contribute Gene Ontology (GO) annotations and provide a two-tier system of GO annotations for users. The ‘GO Consortium’ gene association file contains the most rigorous GO annotations, based solely on experimental data. The ‘Community’ gene association file contains GO annotations based on expert community knowledge (annotations derived directly from author statements and annotations submitted by the community) and annotations for predicted proteins. We have developed two tools for proteomics analysis, and these are freely available on request. A suite of tools for analyzing functional genomics datasets using the GO is available online at the AgBase site. We encourage and publicly acknowledge GO annotations from researchers and provide an online mechanism for agricultural researchers to submit requests for GO annotations.
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation.
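The release-comparison idea can be illustrated with a toy sketch: treat an electronic annotation as "confirmed" if a later database release carries the same (protein, term) pair with experimental support. This is a deliberate simplification of the paper's methodology; the function name and data shapes are hypothetical.

```python
def electronic_reliability(old_electronic, new_experimental):
    """old_electronic / new_experimental: sets of (protein, go_term) pairs.
    Reliability proxy: the fraction of electronic annotations from an earlier
    release that a later release confirms with experimental evidence."""
    if not old_electronic:
        return 0.0
    confirmed = old_electronic & new_experimental
    return len(confirmed) / len(old_electronic)
```

In practice the assessment must also handle annotations that are removed, refined to more specific terms, or simply never revisited by curators, which this sketch ignores.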
In the UniProt Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these “electronic GO annotations” are generally perceived as unreliable; they are disregarded in many studies. In this article, we introduce novel methodology to systematically evaluate the quality of electronic annotations. We then provide the first comprehensive assessment of the reliability of electronic GO annotations. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference.
Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort.
We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CAs) has been the annotation of gene families. Annotations can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules, and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website, as well as in the Community Annotation track of the Genome Browser.
We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families has been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at .
Magnaporthe oryzae is the causal agent of rice blast, the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced, and its automated annotation has recently been updated to Version 6. However, comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using a standardized vocabulary. We report an overview of the GO annotation for Version 5 of the M. oryzae genome assembly.
A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For the similarity-based GO annotation, a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated against manually reviewed data, conserved domains, or data determined by wet-lab experiments. Additionally, the biological appropriateness of the functional assignments was manually checked.
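A stringent reciprocal best hits comparison of the kind described can be sketched as follows; the tabular hit format and the e-value threshold are illustrative assumptions, not the authors' exact parameters.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, evalue) tuples, e.g. parsed from
    tabular BLAST output. For each query, keep the subject with the lowest
    e-value."""
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(forward_hits, reverse_hits, max_evalue=1e-10):
    """Return sorted (protein, go_protein) pairs that are each other's best
    hit in both search directions, after filtering weak alignments."""
    fwd = best_hits([h for h in forward_hits if h[2] <= max_evalue])
    rev = best_hits([h for h in reverse_hits if h[2] <= max_evalue])
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)
```

Pairs surviving this filter would then go to the manual review and cross-validation steps the abstract describes.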
In total, 6,286 proteins received GO term assignments via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for the Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experimentally determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publicly queryable via the GO site. Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of the Version 6 genome, please visit our website. The preliminary GO annotation of the Version 6 genome is stored in a local MySQL database that is publicly queryable via a user-friendly interface, the Adhoc Query System.
Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will serve as a solid foundation for further functional interrogation of M. oryzae.
When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication). The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics") is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees.
Here we present RIO (Resampled Inference of Orthologs), a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap-resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as a Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes.
The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.
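The core idea of scoring orthology over bootstrap-resampled trees can be illustrated in a few lines: given the orthology calls made on each resampled tree, the support for a pair of proteins is the fraction of trees in which that pair is called orthologous. This sketch assumes the per-tree calls are already computed; RIO itself infers them from the trees.

```python
from collections import Counter

def orthology_support(bootstrap_calls):
    """bootstrap_calls: list with one entry per resampled tree, each entry a
    set of frozenset({a, b}) pairs called orthologous in that tree.
    Returns the fraction of trees supporting each pair."""
    n = len(bootstrap_calls)
    counts = Counter(pair for calls in bootstrap_calls for pair in calls)
    return {tuple(sorted(pair)): c / n for pair, c in counts.items()}
```

Pairs with support near 1.0 are robust to the uncertainty in the tree topology; low-support pairs are exactly the cases where a single-tree orthology call would be misleading.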
CharProtDB (http://www.jcvi.org/charprotdb/) is a curated database of biochemically characterized proteins. It provides a source of direct rather than transitive assignments of function, designed to support automated annotation pipelines. The initial data set in CharProtDB was collected through manual literature curation over the years by analysts at the J. Craig Venter Institute (JCVI) [formerly The Institute for Genomic Research (TIGR)] as part of their prokaryotic genome sequencing projects. CharProtDB has been expanded by the import of selected records from publicly available protein collections whose biocuration indicated direct rather than homology-based assignment of function. Annotations in CharProtDB include gene name, symbol and various controlled vocabulary terms, including Gene Ontology terms, Enzyme Commission number and TransportDB accession. Each annotation is referenced with its source: ideally a journal reference or, if imported and lacking one, the original database source.
Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned ‘unknown’ annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome.
Database URL: http://www.yeastgenome.org
The transcriptome of an organism can be studied through the analysis of expressed sequence tag (EST) datasets, which offers a rapid and cost-effective approach supported by several new and updated bioinformatics tools for assembly and annotation. Such comprehensive analyses complement genome and proteome studies of an organism. With the advent of large-scale sequencing projects and the generation of sequence data at the protein and cDNA levels, an automated analysis pipeline is necessary to store, organize and annotate ESTs.
TranSeqAnnotator is a workflow for large-scale analysis of transcriptomic data that brings together the most appropriate bioinformatics tools for data management and analysis. The pipeline automatically cleans, clusters, assembles and generates consensus sequences, conceptually translates these into possible protein products and assigns putative function based on various DNA and protein similarity searches. Excretory/secretory (ES) proteins inferred from ESTs/short reads are also identified. TranSeqAnnotator accepts raw and quality ESTs in FASTA format, along with protein and short read sequences, which are analysed with user-selected programs. After pre-processing and assembly, the dataset is annotated at the nucleotide, protein and ES protein levels.
TranSeqAnnotator has been developed on a Linux cluster to perform an exhaustive and reliable analysis and provide detailed annotation. TranSeqAnnotator outputs gene ontologies and protein functional identifications in terms of mappings to protein domains and metabolic pathways. The pipeline has been applied to annotate large EST datasets, identifying several novel and known genes, supported by experimental validation, that could serve as potential targets for parasite intervention. TranSeqAnnotator is freely available to the scientific community at http://estexplorer.biolinfo.org/TranSeqAnnotator/.
In support of the international effort to obtain a reference sequence of the bread wheat genome, and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated annotation tool, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712-CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small-scale analyses or through a server for large-scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes, among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene predictions to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not tailored to wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future.
cluster; gene models; pipeline; plant genome; structural and functional annotation; transposable elements; wheat
The Gene Ontology (GO), the most renowned biological ontology, is widely used for annotating genes and gene products of different organisms. However, the Resource Description Framework (RDF) data file provided by the GO consortium has several shortcomings: 1) a lack of sufficient semantic relationships between pairs of terms from the three independent GO sub-ontologies, which limits its ability to support complex semantic queries and inference services; 2) a term-centric view of GO annotation data, with all information stored in a single file, which makes attempts to retrieve GO annotations for large datasets unmanageable; and 3) no support for GOSlim.
We propose a RDF model, GORouter, which encodes heterogeneous original data in a uniform RDF format, creates additional ontology mappings between GO terms, and introduces a set of inference rulebases. Furthermore, we use the Oracle Network Data Model (NDM) as the native RDF data repository and the table function RDF_MATCH to seamlessly combine the result of RDF queries with traditional relational data. As a result, the scale of GORouter is minimized; information not directly involved in semantic inference is put into relational tables.
Our work demonstrates how to use multiple semantic web tools and techniques to provide a mixture of semantic query and inference solutions of GO and its associations. GORouter is licensed under Apache License Version 2.0, and is accessible via the website: .
Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.
PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.
PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA.
We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%).
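A minimal version of mining ontology mappings from co-annotated sample proteins can be written with the standard support and confidence measures of association rule mining. The thresholds and data shapes here are illustrative, and PIPA's hierarchy-aware extension (which exploits the ontologies' topology) is not shown.

```python
from collections import defaultdict

def generate_mappings(annotations, min_support=2, min_confidence=0.8):
    """annotations: list of (cog_id, go_terms) pairs, one per sample protein.
    Emit COG -> GO mappings whose co-annotation support and confidence
    clear the given thresholds."""
    cog_count = defaultdict(int)   # how many proteins carry each COG
    pair_count = defaultdict(int)  # how many carry both the COG and the GO term
    for cog, go_terms in annotations:
        cog_count[cog] += 1
        for go in set(go_terms):
            pair_count[(cog, go)] += 1
    return {(cog, go): n / cog_count[cog]
            for (cog, go), n in pair_count.items()
            if n >= min_support and n / cog_count[cog] >= min_confidence}
```

Confidence here plays the role of an estimated mapping precision; raising `min_support` trades recall for robustness against one-off co-annotations.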
Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.
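The propagation step of such a consensus scheme can be sketched by pushing each term's score to its GO ancestors under the true-path rule (an annotation to a term implies all of its ancestors), keeping the best score per term. How PIPA actually combines likelihoods is not specified in this summary, so taking the maximum is an illustrative choice.

```python
def propagate_scores(scores, parents):
    """scores: dict GO term -> likelihood score from the integrated methods.
    parents: dict GO term -> list of parent terms (the GO DAG edges).
    Propagate each term's score to its ancestors, keeping the maximum."""
    out = dict(scores)

    def ancestors(term):
        for p in parents.get(term, ()):
            yield p
            yield from ancestors(p)

    for term, s in scores.items():
        for anc in ancestors(term):
            out[anc] = max(out.get(anc, 0.0), s)
    return out
```

After propagation, a precision/recall trade-off can be set with a single score threshold, which matches the abstract's observation that consensus yields higher precision at a given recall.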
The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.
The Gene Ontology Annotation (GOA) project at the EBI (http://www.ebi.ac.uk/goa) provides high-quality electronic and manual associations (annotations) of Gene Ontology (GO) terms to UniProt Knowledgebase (UniProtKB) entries. Annotations created by the project are collated with annotations from external databases to provide an extensive, publicly available GO annotation resource. Currently covering over 160 000 taxa, with greater than 32 million annotations, GOA remains the largest and most comprehensive open-source contributor to the GO Consortium (GOC) project. Over the last five years, the group has augmented the number and coverage of their electronic pipelines and a number of new manual annotation projects and collaborations now further enhance this resource. A range of files facilitate the download of annotations for particular species, and GO term information and associated annotations can also be viewed and downloaded from the newly developed GOA QuickGO tool (http://www.ebi.ac.uk/QuickGO), which allows users to precisely tailor their annotation set.
The Gene Ontology (GO) project () develops and uses a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see ). The GO Consortium continues to improve the vocabulary content, reflecting the impact of several novel mechanisms for incorporating community input. A growing number of model organism databases and genome annotation groups contribute annotation sets using GO terms to GO's public repository. Updates to the AmiGO browser have improved access to contributed genome annotations. As the GO project continues to grow, the use of the GO vocabularies is becoming more varied as well as more widespread. The GO project provides an ontological annotation system that enables biologists to infer knowledge from large amounts of data.
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation.
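The pairing logic can be illustrated as follows: a manual term and a computational prediction are considered consistent if they are identical or one is an ancestor of the other in the GO graph, and manual annotations with no consistent computational partner are flagged for curator review. The data structures are hypothetical simplifications of what CvManGO operates on.

```python
def compare_annotations(manual_terms, predicted_terms, ancestors):
    """ancestors: dict GO term -> set of all its ancestor terms in the GO DAG.
    Return the manual terms that disagree with every computational
    prediction, i.e. candidates for annotation review."""
    def related(a, b):
        return (a == b
                or b in ancestors.get(a, set())
                or a in ancestors.get(b, set()))

    return [m for m in manual_terms
            if not any(related(m, p) for p in predicted_terms)]
```

A manual term that is merely more specific than a prediction is not flagged, since ancestor/descendant pairs are expected when curation adds detail beyond what computation can infer.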
Predicting protein function has become increasingly demanding in the era of next-generation sequencing technology. The task of assigning a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools that are easy to use and able to provide automatic and reliable annotations at a genomic scale are urgently needed. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary that can be easily exploited by computational methods.
Argot2 is a web-based function prediction tool able to annotate nucleic acid or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches against the UniProtKB and Pfam databases, respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database, and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relationships described by the Gene Ontology and their associated scores. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes.
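E-value weighting of the kind described is commonly implemented by summing -log10(e-value) contributions over all hits supporting a GO term. This is a hedged sketch of that step only; Argot2's actual scoring also folds in GO semantic similarity, which is omitted here.

```python
import math

def weight_go_terms(hits):
    """hits: list of (go_term, evalue) pairs collected from BLAST/HMMER
    matches. Weight each GO term by summing -log10(e-value) over its
    supporting hits, so strong and repeated evidence both raise the score."""
    weights = {}
    for go, evalue in hits:
        w = -math.log10(max(evalue, 1e-300))  # cap to avoid log(0)
        weights[go] = weights.get(go, 0.0) + w
    return weights
```

On this scale a single hit at 1e-10 outweighs two marginal hits at 1e-3, which is the intended behaviour of e-value-based evidence weighting.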
The revised algorithm has already been employed and successfully tested during in-house genome projects on grape and apple, and has proven to have high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2.
The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.
Biological research is increasingly dependent on the availability of well-structured representations of biological data with detailed, accurate descriptions provided by the curators of the data repositories. The Reference Genome project's goal is to provide comprehensive functional annotation for the genomes of human as well as eleven organisms that are important models in biomedical research. To achieve this, we have developed an approach that superposes experimentally based annotations onto the leaves of phylogenetic trees and then manually annotates the function of the common ancestors, predicated on the assumption that the ancestors possessed the experimentally determined functions held in common at these leaves, and that these functions are likely to be conserved in all other descendants of each family.
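The ancestral-annotation assumption can be sketched as assigning to each internal tree node the intersection of the functions experimentally observed at its annotated descendant leaves. The tree encoding is a hypothetical simplification of the project's curated phylogenetic trees, and the real workflow keeps the ancestral assignments under manual curator control.

```python
def ancestral_functions(tree, leaf_functions):
    """tree: dict mapping each internal node to its list of children;
    leaves have no entry. leaf_functions: dict leaf -> set of
    experimentally determined functions. Assign to each internal node the
    functions shared by all of its annotated descendant leaves."""
    def resolve(node):
        children = tree.get(node)
        if not children:  # leaf: return its annotations, or None if none
            return set(leaf_functions.get(node, set())) or None
        child_sets = [s for s in (resolve(c) for c in children) if s]
        return set.intersection(*child_sets) if child_sets else None

    return {node: resolve(node) for node in tree}
```

Functions inferred at an ancestor can then be propagated back down to unannotated descendants, which is how the project's data sets let other genome efforts infer GO annotations for their homologues.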
Fourmidable is an infrastructure to curate and share the emerging genetic, molecular, and functional genomic data and protocols for ants.
The Fourmidable assembly pipeline groups nucleotide sequences into clusters before independently assembling each cluster. Subsequently, assembled sequences are annotated via InterProScan and BLAST against general and insect-specific databases. Gene-specific information can be retrieved using gene identifiers, by searching for similar sequences or by browsing through inferred Gene Ontology annotations. The database will readily scale as ultra-high-throughput sequence data and sequences from additional species become available.
Fourmidable currently houses EST data from two ant species and microarray gene expression data for one of these. Fourmidable is publicly available at
The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a ‘gold standard’ definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines.
Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
PeerGAD is a web-based database-driven application that allows community-wide peer-reviewed annotation of prokaryotic genome sequences. The application was developed to support the annotation of the Pseudomonas syringae pv. tomato strain DC3000 genome sequence and is easily portable to other genome sequence annotation projects. PeerGAD incorporates several innovative design and operation features and accepts annotations pertaining to gene naming, role classification, gene translation and annotation derivation. The annotator tool in PeerGAD is built around a genome browser that offers users the ability to search and navigate the genome sequence. Because the application encourages annotation of the genome sequence directly by researchers and relies on peer review, it circumvents the need for an annotation curator while providing added value to the annotation data. Support for the Gene Ontology™ vocabulary, a structured and controlled vocabulary used in classification of gene roles, is emphasized throughout the system. Here we present the underlying concepts integral to the functionality of PeerGAD.
The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for protein families to propagate annotation to all members of manually defined protein families, using very strict criteria. The HAMAP system is composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline. The proteome database comprises biological and sequence information for each completely sequenced microbial proteome, and it offers several tools for CDS searches, BLAST options and retrieval of specific sets of proteins. The family database currently comprises more than 1500 manually curated protein families and their annotation templates that are used to annotate proteins that belong to one of the HAMAP families. On the HAMAP website, individual sequences as well as whole genomes can be scanned against all HAMAP families. The system provides warnings for the absence of conserved amino acid residues, unusual sequence length, etc. Thanks to the implementation of HAMAP, more than 200 000 microbial proteins have been fully annotated in UniProtKB/Swiss-Prot (HAMAP website: http://www.expasy.org/sprot/hamap).
Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. Efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches, as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amounts of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact on functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data.
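The core of annotation by homology transfer can be sketched in a few lines. The scoring rule below is loosely inspired by this family of tools but is not Blast2GO's actual annotation rule; the evidence-code weights and threshold are made-up values for illustration.

```python
# Illustrative homology-transfer scoring (the weights and threshold are
# invented; real annotation rules such as Blast2GO's are more elaborate).

def annotate(hits, ec_weight, threshold=0.6):
    """hits: list of (similarity 0-1, evidence_code, go_terms).
    Score each GO term by the best similarity * evidence-code weight
    among the hits that carry it; keep terms above the threshold."""
    scores = {}
    for sim, ec, terms in hits:
        w = sim * ec_weight.get(ec, 0.5)
        for term in terms:
            scores[term] = max(scores.get(term, 0.0), w)
    return {t: s for t, s in scores.items() if s >= threshold}

# Experimental evidence is trusted more than electronic evidence.
weights = {"EXP": 1.0, "IEA": 0.7}
hits = [
    (0.9, "EXP", ["GO:0003824"]),                 # strong experimental hit
    (0.8, "IEA", ["GO:0003824", "GO:0005737"]),   # good electronic hit
    (0.4, "IEA", ["GO:0016020"]),                 # weak hit, filtered out
]
result = annotate(hits, weights)
```

Down-weighting electronically inferred (IEA) source annotations is what the abstract calls "controlling type and intensity of annotation": the same BLAST hits yield more or fewer transferred terms depending on how evidence is weighted.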
One of the challenges of the post-genomic era is to provide accurate function annotations for the large volumes of data resulting from genome sequencing projects. Most function prediction servers utilize methods that transfer existing database annotations between orthologous sequences. In contrast, there are few methods that are independent of homology and can annotate distant and orphan protein sequences. The FFPred server adopts a machine-learning approach to perform function prediction in protein feature space, using feature characteristics predicted from amino acid sequence. The features are scanned against a library of support vector machines representing over 300 Gene Ontology (GO) classes, and probabilistic confidence scores are returned for each annotation term. The GO term library has been modelled on human protein annotations; however, benchmark performance testing showed robust performance across higher eukaryotes. FFPred offers important advantages over traditional function prediction servers in its ability to annotate distant homologues and orphan protein sequences, and achieves greater coverage and classification accuracy than other feature-based prediction servers. A user may upload an amino acid sequence and receive annotation predictions via email. Feature information is provided as easy-to-interpret graphics displayed on the sequence of interest, allowing for back-interpretation of the associations between features and function classes.
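The feature-space idea can be illustrated with a toy scorer. FFPred itself uses SVMs over predicted structural and biophysical features; the hand-picked features, linear models, and GO terms below are stand-ins chosen only to show the shape of the pipeline (sequence → feature vector → per-term probabilistic score).

```python
# Toy sketch of feature-space function prediction: no homology search,
# only features computed from the sequence itself. The feature set and
# model weights are invented for illustration.
import math

def features(seq):
    """Toy feature vector: scaled length, hydrophobic fraction,
    charged fraction."""
    hydrophobic = sum(seq.count(a) for a in "AVLIMFWY")
    charged = sum(seq.count(a) for a in "DEKR")
    n = len(seq)
    return [n / 1000.0, hydrophobic / n, charged / n]

def predict(seq, library):
    """Score the sequence against each per-GO-term linear model and
    return a probabilistic confidence via the logistic function."""
    x = features(seq)
    out = {}
    for go_term, (weights, bias) in library.items():
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        out[go_term] = 1.0 / (1.0 + math.exp(-z))
    return out

# Two hypothetical per-GO-term models (one classifier per GO class,
# mirroring the "library of SVMs" design at toy scale).
library = {
    "GO:0016020": ([0.0, 6.0, -2.0], -2.0),   # membrane-like bias
    "GO:0003677": ([0.0, -2.0, 6.0], -2.0),   # DNA-binding-like bias
}
scores = predict("MKKRDEAVLLIMAVK", library)
```

Because every GO class has its own classifier, an orphan sequence with no detectable homologues still receives a confidence score for each term, which is the advantage the abstract claims over transfer-based servers.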
There is increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to formalize the results of biocuration workflows and ultimately to improve the analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO), for annotating aspects of gene products, and the Molecular Interaction ontology (PSI-MI), used by databases that archive protein–protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text-mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein–protein interaction data and PSI-MI terms referring to interaction detection methods.
Annotation of proteins with Gene Ontology (GO) terms is ongoing work and a complex task. Manual GO annotation is precise and valuable, but it is time-consuming. As a result, most proteins carry only automatically generated, uncurated annotations rather than curated ones. Text-mining systems that use the literature for automatic annotation have been proposed, but they do not yet satisfy the high quality expectations of curators.
In this paper we describe an approach that links uncurated annotations to text extracted from the literature. Text passages are selected based on their similarity to the term in the uncurated annotation. Besides substantiating the uncurated annotations, the extracted texts can also lead to novel annotations. In addition, the approach uses the GO hierarchy to achieve high precision. Our approach is integrated into GOAnnotator, a tool that assists the curation process for GO annotation of UniProt proteins.
GO curators assessed GOAnnotator on a set of 66 distinct UniProt/Swiss-Prot proteins with uncurated annotations. GOAnnotator provided correct evidence text at 93% precision. This high precision results from using the GO hierarchy to select only GO terms similar to the GO terms from the uncurated annotations in GOA. Our approach is the first to achieve such high precision, which is crucial for the efficient support of GO curators. GOAnnotator is implemented as a web tool that is freely available at .
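The two ingredients of the approach, term–text similarity and the GO-hierarchy constraint, can be sketched together. This is a simplified stand-in (token overlap instead of GOAnnotator's actual similarity measure, and a three-term toy hierarchy): a sentence is only proposed as evidence for an uncurated term when the term it best matches is related to that term in the hierarchy.

```python
# Simplified sketch of hierarchy-constrained evidence selection; the
# similarity measure (token overlap) and the toy hierarchy are
# illustrations, not GOAnnotator's actual method.

# Toy is-a hierarchy: child -> parent (real GO is a DAG, not a tree).
PARENT = {"GO:0006355": "GO:0010468", "GO:0010468": "GO:0008150"}

def related(term_a, term_b):
    """True if the two terms lie on one ancestor path in the hierarchy."""
    def ancestors(t):
        seen = {t}
        while t in PARENT:
            t = PARENT[t]
            seen.add(t)
        return seen
    return term_a in ancestors(term_b) or term_b in ancestors(term_a)

def evidence(sentence, uncurated_term, term_names):
    """Return (best_term, overlap) among hierarchy-related terms whose
    name shares the most tokens with the sentence, or None."""
    tokens = set(sentence.lower().split())
    best = None
    for term, name in term_names.items():
        if not related(term, uncurated_term):
            continue  # hierarchy constraint: unrelated terms are ignored
        overlap = len(tokens & set(name.lower().split()))
        if overlap and (best is None or overlap > best[1]):
            best = (term, overlap)
    return best

names = {"GO:0006355": "regulation of DNA-templated transcription",
         "GO:0010468": "regulation of gene expression",
         "GO:0008150": "biological process"}
hit = evidence("the protein mediates regulation of gene expression",
               "GO:0006355", names)
```

Note how the returned term is an ancestor of the uncurated one: the hierarchy constraint both filters spurious matches (the source of the reported precision) and allows related, possibly novel terms to surface.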
Protein Analysis THrough Evolutionary Relationships (PANTHER) is a comprehensive software system for inferring the functions of genes based on their evolutionary relationships. Phylogenetic trees of gene families form the basis for PANTHER and these trees are annotated with ontology terms describing the evolution of gene function from ancestral to modern day genes. One of the main applications of PANTHER is in accurate prediction of the functions of uncharacterized genes, based on their evolutionary relationships to genes with functions known from experiment. The PANTHER website, freely available at http://www.pantherdb.org, also includes software tools for analyzing genomic data relative to known and inferred gene functions. Since 2007, there have been several new developments to PANTHER: (i) improved phylogenetic trees, explicitly representing speciation and gene duplication events, (ii) identification of gene orthologs, including least diverged orthologs (best one-to-one pairs), (iii) coverage of more genomes (48 genomes, up to 87% of genes in each genome; see http://www.pantherdb.org/panther/summaryStats.jsp), (iv) improved support for alternative database identifiers for genes, proteins and microarray probes and (v) adoption of the SBGN standard for display of biological pathways. In addition, PANTHER trees are being annotated with gene function as part of the Gene Ontology Reference Genome project, resulting in an increasing number of curated functional annotations.
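The phylogenetic inference idea, annotating ancestral nodes and letting modern genes inherit those functions, can be shown at toy scale. PANTHER's actual curation model is far richer; the tree, gain/loss events, and function labels below are invented for illustration.

```python
# Toy sketch of tree-based function inference: functions gained on
# ancestral nodes are inherited by descendant genes unless a loss is
# asserted closer to the leaf. All data here are illustrative.

def inherited_functions(tree, gains, losses, gene):
    """tree: child -> parent; gains/losses: node -> set of functions
    gained/lost at that node. Walk from the gene up to the root,
    collecting gains not cancelled by a loss nearer the leaf."""
    acquired, blocked = set(), set()
    node = gene
    while node is not None:
        blocked |= losses.get(node, set())
        acquired |= gains.get(node, set()) - blocked
        node = tree.get(node)   # None at the root ends the walk
    return acquired

tree = {"human_gene": "anc1", "fly_gene": "anc1", "anc1": "root"}
gains = {"root": {"ATP binding"}, "anc1": {"kinase activity"}}
losses = {"fly_gene": {"kinase activity"}}   # lineage-specific loss

human = inherited_functions(tree, gains, losses, "human_gene")
fly = inherited_functions(tree, gains, losses, "fly_gene")
```

Representing speciation and duplication events explicitly, as the abstract describes, matters precisely because it lets curators place gains and losses on the right internal nodes, so uncharacterized genes inherit functions only from the appropriate ancestors.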