Biocuration has become a cornerstone for analyses in biology, and to meet demand, the number of annotations has grown considerably in recent years. However, the reliability of these annotations varies; it has thus become necessary to be able to assess the confidence in annotations. Although several resources already provide confidence information about the annotations that they produce, a standard way of providing such information has yet to be defined. This lack of standardization undermines the propagation of knowledge across resources, as well as the credibility of results from high-throughput analyses. Seeded at a workshop during the Biocuration 2012 conference, a working group has been created to address this problem. We present here the elements that were identified as essential for assessing confidence in annotations, as well as a draft ontology, the Confidence Information Ontology, to illustrate how the problems identified could be addressed. We hope that this effort will provide a home for discussing this major issue among the biocuration community.
In recent years, high-throughput technologies have brought big data to the life sciences. The march of progress has been rapid, leaving in its wake a demand for courses in data analysis, data stewardship, computing fundamentals, etc., a need that universities have not yet been able to satisfy—paradoxically, many are actually closing “niche” bioinformatics courses at a time of critical need. The impact of this is being felt across continents, as many students and early-stage researchers are being left without appropriate skills to manage, analyse, and interpret their data with confidence. This situation has galvanised a group of scientists to address the problems on an international scale. For the first time, bioinformatics educators and trainers across the globe have come together to address common needs, rising above institutional and international boundaries to cooperate in sharing bioinformatics training expertise, experience, and resources, aiming to put ad hoc training practices on a more professional footing for the benefit of all.
One year ago the Human Proteome Project (HPP) leadership designated the baseline metrics for the Human Proteome Project to be based upon neXtProt with a total of 13 664 proteins validated at protein evidence level 1 (PE1) by mass spectrometry, antibody-capture, Edman sequencing, or 3D structures. Corresponding chromosome-specific data were provided from PeptideAtlas, GPMdb, and Human Protein Atlas. This year the neXtProt total is 15 646 and the other resources, which are inputs to neXtProt, have high quality identifications and additional annotations for 14 012 in PeptideAtlas, 14 869 in GPMdb, and 10 976 in HPA. We propose to remove 638 genes from the denominator that are “uncertain” or “dubious” in Ensembl, UniProt/SwissProt, and neXtProt. That leaves 3844 “missing proteins”, currently having no or inadequate documentation, to be found from a new denominator of 19 490 protein-coding genes. We present those tabulations and weblinks and discuss current strategies to find the missing proteins.
Human Proteome Project; neXtProt; PeptideAtlas; GPMdb; Human Protein Atlas; metrics; missing proteins
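The counts quoted in the abstract above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the number of "missing proteins" is simply the new denominator minus the current neXtProt PE1 total (an assumption that reproduces the stated figure of 3844); the variable and function names are illustrative only.

```python
# Figures quoted in the abstract; names are illustrative, not from the HPP.
DENOMINATOR = 19_490    # protein-coding genes after removing uncertain/dubious entries
PE1_THIS_YEAR = 15_646  # neXtProt PE1 proteins reported this year
PE1_LAST_YEAR = 13_664  # neXtProt PE1 baseline reported one year earlier


def missing_proteins(denominator: int, validated: int) -> int:
    """Assumed definition: genes in the denominator without PE1 evidence."""
    return denominator - validated


missing = missing_proteins(DENOMINATOR, PE1_THIS_YEAR)
gain = PE1_THIS_YEAR - PE1_LAST_YEAR      # year-on-year increase in PE1 proteins
coverage = PE1_THIS_YEAR / DENOMINATOR    # fraction of the denominator at PE1

print(f"missing proteins: {missing}")
print(f"PE1 gain over one year: {gain}")
print(f"PE1 coverage: {coverage:.1%}")
```

Under this assumption, the PE1 total corresponds to roughly four-fifths of the revised denominator.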
neXtProt (http://www.nextprot.org) is a human protein-centric knowledgebase developed at the SIB Swiss Institute of Bioinformatics. Focused solely on human proteins, neXtProt aims to provide a state of the art resource for the representation of human biology by capturing a wide range of data, precise annotations, fully traceable data provenance and a web interface which enables researchers to find and view information in a comprehensive manner. Since the introductory neXtProt publication, significant advances have been made on three main aspects: the representation of proteomics data, an extended representation of human variants and the development of an advanced search capability built around semantic technologies. These changes are presented in the current neXtProt update.
We report progress assembling the parts list for chromosome 17 and illustrate the various processes that we have developed to integrate available data from diverse genomic and proteomic knowledge bases. As primary resources we have used GPMDB, neXtProt, PeptideAtlas, Human Protein Atlas (HPA), and GeneCards. All sites share the common resource of Ensembl for the genome modeling information. We have defined the chromosome 17 parts list with the following information: 1169 protein-coding genes, the numbers of proteins confidently identified by various experimental approaches as documented in GPMDB, neXtProt, PeptideAtlas, and HPA, examples of typical data sets obtained by RNASeq and proteomic studies of epithelial derived tumor cell lines (disease proteome) and a normal proteome (peripheral mononuclear cells), reported evidence of post-translational modifications, and examples of alternative splice variants (ASVs). We have constructed a list of the 59 ‘missing’ proteins as well as 201 proteins that have inconclusive mass spectrometric (MS) identifications. In this report we have defined a process to establish a baseline for the incorporation of new evidence on protein identification and characterization as well as related information from transcriptome analyses. This initial list of ‘missing’ proteins will guide the selection of appropriate samples for discovery studies as well as antibody reagents. We have also illustrated the significant diversity of protein variants (including post-translational modifications, PTMs) using regions on chromosome 17 that contain important oncogenes. We emphasize the need for mandated deposition of proteomics data in public databases, the further development of improved PTM, ASV and single nucleotide variant (SNV) databases and the construction of websites that can integrate and regularly update such information. In addition, we describe the distribution of both clustered and scattered sets of protein families on the chromosome.
Since chromosome 17 is rich in cancer-associated genes, we have focused on the clustering of cancer-associated genes in such genomic regions and have used the ERBB2 amplicon as an example of the value of a proteogenomic approach in which one integrates transcriptomic with proteomic information and captures evidence of co-expression through coordinated regulation.
Chromosome-Centric Human Proteome Project; Chromosome 17 Parts List; ERBB2; Oncogene
Chronic exposure of β-cells to metabolic stresses impairs their function and potentially induces apoptosis. Mitochondria play a central role in coupling glucose metabolism to insulin secretion. However, little is known about mitochondrial responses to specific stresses, i.e., low versus high glucose, saturated versus unsaturated fatty acids, or oxidative stress. INS-1E cells were exposed for 3 days to 5.6 mM glucose, 25 mM glucose, 0.4 mM palmitate, and 0.4 mM oleate. Culture at standard 11.1 mM glucose served as the no-stress control, and transient oxidative stress (200 µM H2O2 for 10 min at day 0) served as the positive stress condition. A mito-array was used to analyze transcripts of 60 mitochondrion-associated genes, with special focus on members of the Slc25 family. Transcripts of interest were evaluated at the protein level by immunoblotting. Expression profiles were analyzed bioinformatically to delineate comprehensive networks. Chronic exposure to the different metabolic stresses impaired glucose-stimulated insulin secretion, revealing glucotoxicity and lipo-dysfunction. Both saturated and unsaturated fatty acids increased expression of the carnitine/acylcarnitine carrier CAC, whereas the citrate carrier CIC and energy sensor SIRT1 were specifically upregulated by palmitate and oleate, respectively. High glucose upregulated CIC, the dicarboxylate carrier DIC and glutamate carrier GC1. Conversely, it reduced expression of energy sensors (AMPK, SIRT1, SIRT4), metabolic genes, transcription factor PDX1, and anti-apoptotic Bcl2. This was associated with caspase-3 cleavage and cell death. Expression levels of GC1 and SIRT4 exhibited positive and negative glucose dose-response, respectively. Expression profiles of energy sensors and mitochondrial carriers were selectively modified by the different conditions, exhibiting stress-specific signatures.
The 5th International Biocuration Conference brought together over 300 scientists to exchange ideas about their work and to discuss issues relevant to the mission of the International Society for Biocuration (ISB). Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions with journals. The conference is an essential part of the ISB's goal to support exchanges among members of the biocuration community. Next year's conference will be held in Cambridge, UK, from 7 to 10 April 2013. In the meantime, the ISB website (http://biocurator.org) provides information about the society's activities, as well as related events of interest.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. Its goals include fostering communication between biocurators, promoting and describing their work, and highlighting the added value of biocuration to the world. The ISB recently conducted a survey of biocurators to better understand their educational and scientific backgrounds, their motivations for choosing a curatorial job and their career goals. The results are reported here. From the responses received, it is evident that biocuration is performed by highly trained scientists and perceived to be a stimulating career, offering both intellectual challenges and the satisfaction of performing work essential to the modern scientific community. It is also apparent that the ISB has at least a dual role to play to facilitate biocurators’ work: (i) to promote biocuration as a career within the greater scientific community; (ii) to aid the development of resources for biomedical research through promotion of nomenclature and data-sharing standards that will allow interconnection of biological databases and better exploit the pivotal contributions that biocurators are making.
neXtProt (http://www.nextprot.org/) is a new human protein-centric knowledge platform. Developed at the Swiss Institute of Bioinformatics (SIB), it aims to help researchers answer questions relevant to human proteins. To achieve this goal, neXtProt is built on a corpus containing both curated knowledge originating from the UniProtKB/Swiss-Prot knowledgebase and carefully selected and filtered high-throughput data pertinent to human proteins. This article presents an overview of the database and the data integration process. We also lay out the key future directions of neXtProt that we consider the necessary steps to make neXtProt the one-stop-shop for all research projects focusing on human proteins.
The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested.
A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles. Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task, leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of the correct gene and its identifier, such as contextual information to assist in disambiguation.
The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
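Task performance in the IAT was measured by precision and recall. As a minimal sketch of how these two measures could be computed for gene normalization on a single article, the example below compares a system's proposed gene identifiers with a curator's gold-standard set. The function name and all identifiers are hypothetical and are not taken from the BioCreative evaluation itself.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one article's gene identifiers."""
    true_positives = len(predicted & gold)
    # Precision: share of proposed identifiers that the curator accepted.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of curator-assigned identifiers that the system proposed.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical identifiers: an ambiguous gene name makes the system propose
# a long candidate list, only part of which is relevant.
gold_ids = {"GENE:1", "GENE:2", "GENE:3"}
system_ids = {"GENE:1", "GENE:2", "GENE:7", "GENE:8", "GENE:9", "GENE:10"}

p, r = precision_recall(system_ids, gold_ids)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this invented case the long candidate list lowers precision, which mirrors the problem of extensive identifier lists reported by the users.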
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make inferences as accurate as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer, in a semi-automated manner, annotations for proteins from a broad set of genomes on the basis of experimental annotations. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, the Phylogenetic Annotation and INference Tool (PAINT), helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and to record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other highly used homology-based methods.
gene ontology; genome annotation; reference genome; gene function prediction; phylogenetics
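The idea of asserting where functions were gained and lost on a tree, and letting descendants inherit them, can be illustrated with a small sketch. This is not PAINT's implementation: the tree encoding, the function name `propagate` and the example labels are all assumptions made for illustration, and the labels are not asserted annotations for any real protein family.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def propagate(
    tree: Dict[str, List[str]],
    root: str,
    gains: Dict[str, Set[str]],
    losses: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Infer the functions of each leaf from gain/loss assertions on nodes."""
    inferred: Dict[str, Set[str]] = {}
    stack: List[Tuple[str, FrozenSet[str]]] = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        # Apply losses asserted at this node, then add gains asserted here.
        current = (set(inherited) - losses.get(node, set())) | gains.get(node, set())
        children = tree.get(node, [])
        if not children:
            inferred[node] = current  # leaf: record the inferred functions
        for child in children:
            stack.append((child, frozenset(current)))
    return inferred


# Hypothetical family tree: internal nodes map to their children.
tree = {"root": ["anc1", "leafC"], "anc1": ["leafA", "leafB"]}
gains = {"root": {"kinase activity"}, "anc1": {"ATP binding"}}
losses = {"leafB": {"kinase activity"}}

result = propagate(tree, "root", gains, losses)
for leaf in sorted(result):
    print(leaf, sorted(result[leaf]))
```

A function gained at an ancestral node is inherited by every descendant unless a loss is asserted somewhere along the path, which is the kind of precise gain and loss assertion described above.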
The social amoebae (Dictyostelia) are a diverse group of Amoebozoa that achieve multicellularity by aggregation and undergo morphogenesis into fruiting bodies with terminally differentiated spores and stalk cells. There are four groups of dictyostelids, with the most derived being a group that contains the model species Dictyostelium discoideum.
We have produced a draft genome sequence of another group dictyostelid, Dictyostelium purpureum, and compare it to the D. discoideum genome. The assembly (8.41× coverage) comprises 799 scaffolds totaling 33.0 Mb, comparable to the D. discoideum genome size. Sequence comparisons suggest that these two dictyostelids shared a common ancestor approximately 400 million years ago. In spite of this divergence, most orthologs reside in small clusters of conserved synteny. Comparative analyses revealed a core set of orthologous genes that illuminate dictyostelid physiology, as well as differences in gene family content. Interesting patterns of gene conservation and divergence are also evident, suggesting functional differences; some protein families, such as the histidine kinases, have undergone little functional change, whereas others, such as the polyketide synthases, have undergone extensive diversification. The abundant amino acid homopolymers encoded in both genomes are generally not found in homologous positions within proteins, so they are unlikely to derive from ancestral DNA triplet repeats. Genes involved in the social stage evolved more rapidly than others, consistent with either relaxed selection or accelerated evolution due to social conflict.
The findings from this new genome sequence and comparative analysis shed light on the biology and evolution of the Dictyostelia.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
This report summarizes the proceedings of the second workshop of the ‘Minimum Information for Biological and Biomedical Investigations’ (MIBBI) consortium held on Dec 1-2, 2010 in Rüdesheim, Germany through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
This report summarizes the proceedings of the one-day BioSharing meeting held at the Intelligent Systems for Molecular Biology (ISMB) 2010 conference in Boston, MA, USA. This inaugural BioSharing event was hosted by the Genomic Standards Consortium as part of its M3 & BioSharing special interest group (SIG) workshop. The BioSharing event included invited talks from a range of community leaders and a panel discussion at the end of the day. The panel session led to a formal agreement among community leaders to join together to promote cross-community knowledge exchange and collaborations. A key focus of the newly formed BioSharing community will be linking up resources to promote real-world data sharing (virtuous cycle of data) and supporting compliance with data policies through the creation of a one-stop portal of information. Further information about the newly established BioSharing effort can be found at http://biosharing.org.
dictyBase (http://www.dictybase.org), the model organism database for Dictyostelium, aims to provide the broad biomedical research community with well integrated, high quality data and tools for Dictyostelium discoideum and related species. dictyBase houses the complete genome sequence, ESTs, and the entire body of literature relevant to Dictyostelium. This information is curated to provide accurate gene models and functional annotations, with the goal of fully annotating the genome to provide a ‘reference genome’ in the Amoebozoa clade. We highlight several new features in the present update: (i) new annotations; (ii) improved interface with web 2.0 functionality; (iii) the initial steps towards a genome portal for the Amoebozoa; (iv) ortholog display; and (v) the complete integration of the Dicty Stock Center with dictyBase.
Protein Analysis THrough Evolutionary Relationships (PANTHER) is a comprehensive software system for inferring the functions of genes based on their evolutionary relationships. Phylogenetic trees of gene families form the basis for PANTHER and these trees are annotated with ontology terms describing the evolution of gene function from ancestral to modern day genes. One of the main applications of PANTHER is in accurate prediction of the functions of uncharacterized genes, based on their evolutionary relationships to genes with functions known from experiment. The PANTHER website, freely available at http://www.pantherdb.org, also includes software tools for analyzing genomic data relative to known and inferred gene functions. Since 2007, there have been several new developments to PANTHER: (i) improved phylogenetic trees, explicitly representing speciation and gene duplication events, (ii) identification of gene orthologs, including least diverged orthologs (best one-to-one pairs), (iii) coverage of more genomes (48 genomes, up to 87% of genes in each genome; see http://www.pantherdb.org/panther/summaryStats.jsp), (iv) improved support for alternative database identifiers for genes, proteins and microarray probes and (v) adoption of the SBGN standard for display of biological pathways. In addition, PANTHER trees are being annotated with gene function as part of the Gene Ontology Reference Genome project, resulting in an increasing number of curated functional annotations.
dictyBase (http://dictybase.org) is the model organism database for Dictyostelium discoideum. It houses the complete genome sequence, ESTs and the entire body of literature relevant to Dictyostelium. This information is curated to provide accurate gene models and functional annotations, with the goal of fully annotating the genome. This dictyBase update describes the annotations and features implemented since 2006, including improved strain and phenotype representation, integration of predicted transcriptional regulatory elements, protein domain information, biochemical pathways, improved searching and a wiki tool that allows members of the research community to provide annotations.
Dictyostelium discoideum is a model system for studying many important physiological processes including chemotaxis, phagocytosis, and signal transduction. The recent sequencing of the genome has revealed the presence of over 12,500 protein-coding genes. The model organism database dictyBase hosts the genome sequence as well as a large amount of manually curated information.
We present here an anatomy ontology for Dictyostelium based upon the life cycle of the organism.
Anatomy ontologies are necessary to annotate species-specific events such as phenotypes, and the Dictyostelium anatomy ontology provides an essential tool for curation of the Dictyostelium genome.
Aspergillus niger, a saprophyte commonly found on decaying vegetation, is widely used and studied for industrial purposes. Despite its place as one of the most important organisms for commercial applications, the lack of available information about its genetic makeup limits research with this filamentous fungus.
We present here the analysis of 12,820 expressed sequence tags (ESTs) generated from A. niger cultured under seven different growth conditions. These ESTs identify about 5,108 genes, of which 44.5% code for proteins sharing similarity (E ≤ 1e-5) with GenBank entries of known function, 38% code for proteins that only share similarity with GenBank entries of unknown function and 17.5% encode proteins that do not have a GenBank homolog. Using the Gene Ontology hierarchy, we present a first classification of the A. niger proteins encoded by these genes and compare its protein repertoire with other well-studied fungal species. We have established a searchable web-based database that includes the EST and derived contig sequences and their annotation. Details about this project and access to the annotated A. niger database are available.
This EST collection and its annotation provide a significant resource for fundamental and applied research with A. niger. The gene set identified in this manuscript will be highly useful in the annotation of the genome sequence of A. niger. The genes described in the manuscript, especially those encoding hydrolytic enzymes, will provide a valuable source for researchers interested in enzyme properties and applications.
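The three-way split of genes described above (similar to an entry of known function, similar only to entries of unknown function, or no homolog) can be sketched as a simple rule on the best database hit. The example below is illustrative only: it assumes that the E ≤ 1e-5 cutoff quoted in the text is also what separates "has a homolog" from "no homolog", and the gene names and hit records are invented.

```python
from typing import Dict, Optional, Tuple

E_VALUE_CUTOFF = 1e-5  # similarity threshold quoted in the abstract

# A hit is (e_value, has_known_function); None means no hit was reported.
Hit = Optional[Tuple[float, bool]]


def classify(best_hit: Hit) -> str:
    """Assign a gene to one of the three categories used in the abstract."""
    if best_hit is None or best_hit[0] > E_VALUE_CUTOFF:
        return "no homolog"  # assumed: hits above the cutoff do not count
    return "known function" if best_hit[1] else "unknown function"


# Invented best hits for three hypothetical genes.
hits: Dict[str, Hit] = {
    "geneA": (1e-40, True),
    "geneB": (3e-12, False),
    "geneC": (0.2, True),
}
categories = {gene: classify(hit) for gene, hit in hits.items()}
print(categories)

# The reported shares cover the whole gene set.
reported_shares = [44.5, 38.0, 17.5]
print(sum(reported_shares))
```

The final check simply confirms that the three reported percentages account for all of the genes identified.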