One year ago the Human Proteome Project (HPP) leadership designated the baseline metrics for the Human Proteome Project to be based upon neXtProt with a total of 13 664 proteins validated at protein evidence level 1 (PE1) by mass spectrometry, antibody-capture, Edman sequencing, or 3D structures. Corresponding chromosome-specific data were provided from PeptideAtlas, GPMdb, and Human Protein Atlas. This year the neXtProt total is 15 646 and the other resources, which are inputs to neXtProt, have high quality identifications and additional annotations for 14 012 in PeptideAtlas, 14 869 in GPMdb, and 10 976 in HPA. We propose to remove 638 genes from the denominator that are “uncertain” or “dubious” in Ensembl, UniProt/SwissProt, and neXtProt. That leaves 3844 “missing proteins”, currently having no or inadequate documentation, to be found from a new denominator of 19 490 protein-coding genes. We present those tabulations and weblinks and discuss current strategies to find the missing proteins.
Human Proteome Project; neXtProt; PeptideAtlas; GPMdb; Human Protein Atlas; metrics; missing proteins
We report progress assembling the parts list for chromosome 17 and illustrate the various processes that we have developed to integrate available data from diverse genomic and proteomic knowledge bases. As primary resources we have used GPMDB, neXtProt, PeptideAtlas, Human Protein Atlas (HPA), and GeneCards. All sites share the common resource of Ensembl for the genome modeling information. We have defined the chromosome 17 parts list with the following information: 1169 protein-coding genes, the numbers of proteins confidently identified by various experimental approaches as documented in GPMDB, neXtProt, PeptideAtlas, and HPA, examples of typical data sets obtained by RNASeq and proteomic studies of epithelial derived tumor cell lines (disease proteome) and a normal proteome (peripheral mononuclear cells), reported evidence of post-translational modifications, and examples of alternative splice variants (ASVs). We have constructed a list of the 59 ‘missing’ proteins as well as 201 proteins that have inconclusive mass spectrometric (MS) identifications. In this report we have defined a process to establish a baseline for the incorporation of new evidence on protein identification and characterization as well as related information from transcriptome analyses. This initial list of ‘missing’ proteins that will guide the selection of appropriate samples for discovery studies as well as antibody reagents. Also we have illustrated the significant diversity of protein variants (including post-translational modifications, PTMs) using regions on chromosome 17 that contain important oncogenes. We emphasize the need for mandated deposition of proteomics data in public databases, the further development of improved PTM, ASV and single nucleotide variant (SNV) databases and the construction of websites that can integrate and regularly update such information. In addition, we describe the distribution of both clustered and scattered sets of protein families on the chromosome. Since chromosome 17 is rich in cancer associated genes we have focused the clustering of cancer associated genes in such genomic regions and have used the ERBB2 amplicon as an example of the value of a proteogenomic approach in which one integrates transcriptomic with proteomic information and captures evidence of co-expression through coordinated regulation.
Chromosome-Centric Human Proteome Project; Chromosome 17 Parts List; ERBB2; Oncogene
The SIB Swiss Institute of Bioinformatics (www.isb-sib.ch) was created in 1998 as an institution to foster excellence in bioinformatics. It is renowned worldwide for its databases and software tools, such as UniProtKB/Swiss-Prot, PROSITE, SWISS-MODEL, STRING, etc, that are all accessible on ExPASy.org, SIB's Bioinformatics Resource Portal. This article provides an overview of the scientific and training resources SIB has consistently been offering to the life science community for more than 15 years.
Vertebrate genomes contain around 20,000 protein-encoding genes, of which a large fraction is still not associated with specific functions. A major task in future genomics will thus be to assign physiological roles to all open reading frames revealed by genome sequencing. Here we show that C2orf62, a highly conserved protein with little homology to characterized proteins, is strongly expressed in testis in zebrafish and mammals, and in various types of ciliated cells during zebrafish development. By yeast two hybrid and GST pull-down, C2orf62 was shown to interact with TTC17, another uncharacterized protein. Depletion of either C2orf62 or TTC17 in human ciliated cells interferes with actin polymerization and reduces the number of primary cilia without changing their length. Zebrafish embryos injected with morpholinos against C2orf62 or TTC17, or with mRNA coding for the C2orf62 C-terminal part containing a RII dimerization/docking (R2D2) – like domain show morphological defects consistent with imperfect ciliogenesis. We provide here the first evidence for a C2orf62-TTC17 axis that would regulate actin polymerization and ciliogenesis.
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools.
The methionine salvage pathway is widely distributed among some eubacteria, yeast, plants and animals and recycles the sulfur-containing metabolite 5-methylthioadenosine (MTA) to methionine. In eukaryotic cells, the methionine salvage pathway takes place in the cytosol and usually involves six enzymatic activities: MTA phosphorylase (MTAP, EC 220.127.116.11), 5′-methylthioribose-1-phosphate isomerase (mtnA, EC 18.104.22.168), 5′-methylthioribulose-1-phosphate dehydratase (mtnB, EC: 22.214.171.124), 2,3-dioxomethiopentane-1-phosphate enolase/phosphatase (mtnC, EC 126.96.36.199), aci-reductone dioxygenase (mtnD, EC 188.8.131.52) and 4-methylthio-2-oxo-butanoate (MTOB) transaminase (EC 2.6.1.-). The aim of this study was to complete the available information on the methionine salvage pathway in human by identifying the enzyme responsible for the dehydratase step. Using a bioinformatics approach, we propose that a protein called APIP could perform this role. The involvement of this protein in the methionine salvage pathway was investigated directly in HeLa cells by transient and stable short hairpin RNA interference. We show that APIP depletion specifically impaired the capacity of cells to grow in media where methionine is replaced by MTA. Using a Shigella mutant auxotroph for methionine, we confirm that the knockdown of APIP specifically affects the recycling of methionine. We also show that mutation of three potential phosphorylation sites does not affect APIP activity whereas mutation of the potential zinc binding site completely abrogates it. Finally, we show that the N-terminal region of APIP that is missing in the short isoform is required for activity. Together, these results confirm the involvement of APIP in the methionine salvage pathway, which plays a key role in many biological functions like cancer, apoptosis, microbial proliferation and inflammation.
The human aldehyde dehydrogenase (ALDH) gene superfamily consists of 19 genes encoding enzymes critical for NAD(P)+-dependent oxidation of endogenous and exogenous aldehydes, including drugs and environmental toxicants. Mutations in ALDH genes are the molecular basis of several disease states (e.g. Sjögren-Larsson syndrome, pyridoxine-dependent seizures, and type II hyperprolinemia) and may contribute to the etiology of complex diseases such as cancer and Alzheimer’s disease. The aim of this nomenclature update was to identify splice transcriptional variants principally for the human ALDH genes.
Data-mining methods were used to retrieve all human ALDH sequences. Alternatively-spliced transcriptional variants were determined based upon: a) criteria for sequence integrity and genomic alignment; b) evidence of multiple independent cDNA sequences corresponding to a variant sequence; and c) if available, empirical evidence of variants from the literature.
RESULTS AND CONCLUSION
Alternatively-spliced transcriptional variants and their encoded proteins exist for most of the human ALDH genes; however, their function and significance remain to be established. When compared with the human genome, rat and mouse include an additional gene, Aldh1a7, in the ALDH1A subfamily. In order to avoid confusion when identifying splice variants in various genomes, nomenclature guidelines for the naming of such alternative transcriptional variants and proteins are recommended herein. In addition, a web database (www.aldh.org) has been developed to provide up-to-date information and nomenclature guidelines for the ALDH superfamily.
Aldehyde Dehydrogenase; ALDH; Alternatively-Spliced Variants; Nomenclature; Human
neXtProt (http://www.nextprot.org/) is a new human protein-centric knowledge platform. Developed at the Swiss Institute of Bioinformatics (SIB), it aims to help researchers answer questions relevant to human proteins. To achieve this goal, neXtProt is built on a corpus containing both curated knowledge originating from the UniProtKB/Swiss-Prot knowledgebase and carefully selected and filtered high-throughput data pertinent to human proteins. This article presents an overview of the database and the data integration process. We also lay out the key future directions of neXtProt that we consider the necessary steps to make neXtProt the one-stop-shop for all research projects focusing on human proteins.
UniPathway (http://www.unipathway.org) is a fully manually curated resource for the representation and annotation of metabolic pathways. UniPathway provides explicit representations of enzyme-catalyzed and spontaneous chemical reactions, as well as a hierarchical representation of metabolic pathways. This hierarchy uses linear subpathways as the basic building block for the assembly of larger and more complex pathways, including species-specific pathway variants. All of the pathway data in UniPathway has been extensively cross-linked to existing pathway resources such as KEGG and MetaCyc, as well as sequence resources such as the UniProt KnowledgeBase (UniProtKB), for which UniPathway provides a controlled vocabulary for pathway annotation. We introduce here the basic concepts underlying the UniPathway resource, with the aim of allowing users to fully exploit the information provided by UniPathway.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources; and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The molecular diversity of viruses complicates the interpretation of viral genomic and proteomic data. To make sense of viral gene functions, investigators must be familiar with the virus host range, replication cycle and virion structure. Our aim is to provide a comprehensive resource bridging together textbook knowledge with genomic and proteomic sequences. ViralZone web resource (www.expasy.org/viralzone/) provides fact sheets on all known virus families/genera with easy access to sequence data. A selection of reference strains (RefStrain) provides annotated standards to circumvent the exponential increase of virus sequences. Moreover ViralZone offers a complete set of detailed and accurate virion pictures.
The UniProt knowledgebase, UniProtKB, is the main product of the UniProt consortium. It consists of two sections, UniProtKB/Swiss-Prot, the manually curated section, and UniProtKB/TrEMBL, the computer translation of the EMBL/GenBank/DDBJ nucleotide sequence database. Taken together, these two sections cover all the proteins characterized or inferred from all publicly available nucleotide sequences. The Plant Proteome Annotation Program (PPAP) of UniProtKB/Swiss-Prot focuses on the manual annotation of plant-specific proteins and protein families. Our major effort is currently directed towards the two model plants Arabidopsis thaliana and Oryza sativa. In UniProtKB/Swiss-Prot, redundancy is minimized by merging all data from different sources in a single entry. The proposed protein sequence is frequently modified after comparison with ESTs, full length transcripts or homologous proteins from other species. The information present in manually curated entries allows the reconstruction of all described isoforms. The annotation also includes proteomics data such as PTM and protein identification MS experimental results. UniProtKB and the other products of the UniProt consortium are accessible online at www.uniprot.org.
Database; UniProt; Manual annotation; Plant; Proteomics; PTM
Although research on influenza lasted for more than 100 years, it is still one of the most prominent diseases causing half a million human deaths every year. With the recent observation of new highly pathogenic H5N1 and H7N7 strains, and the appearance of the influenza pandemic caused by the H1N1 swine-like lineage, a collaborative effort to share observations on the evolution of this virus in both animals and humans has been established. The OpenFlu database (OpenFluDB) is a part of this collaborative effort. It contains genomic and protein sequences, as well as epidemiological data from more than 27 000 isolates. The isolate annotations include virus type, host, geographical location and experimentally tested antiviral resistance. Putative enhanced pathogenicity as well as human adaptation propensity are computed from protein sequences. Each virus isolate can be associated with the laboratories that collected, sequenced and submitted it. Several analysis tools including multiple sequence alignment, phylogenetic analysis and sequence similarity maps enable rapid and efficient mining. The contents of OpenFluDB are supplied by direct user submission, as well as by a daily automatic procedure importing data from public repositories. Additionally, a simple mechanism facilitates the export of OpenFluDB records to GenBank. This resource has been successfully used to rapidly and widely distribute the sequences collected during the recent human swine flu outbreak and also as an exchange platform during the vaccine selection procedure. Database URL: http://openflu.vital-it.ch.
Peptide toxins synthesized by venomous animals have been extensively studied in the last decades. To be useful to the scientific community, this knowledge has been stored, annotated and made easy to retrieve by several databases. The aim of this article is to present what type of information users can access from each database. ArachnoServer and ConoServer focus on spider toxins and cone snail toxins, respectively. UniProtKB, a generalist protein knowledgebase, has an animal toxin-dedicated annotation program that includes toxins from all venomous animals. Finally, the ATDB metadatabase compiles data and annotations from other databases and provides toxin ontology.
animal toxin; ArachnoServer; ATDB; ConoServer; database; Tox-Prot; UniProtKB/Swiss-Prot; venom protein
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of these profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is largely used for the annotation of domain features of UniProtKB/Swiss-Prot entries. Among the 983 (DNA-binding) domains, repeats and zinc fingers present in Swiss-Prot (release 57.8 of 22 September 2009), 696 (∼70%) are annotated with PROSITE descriptors using information from ProRule. In order to allow better functional characterization of domains, PROSITE developments focus on subfamily specific profiles and a new profile building method giving more weight to functionally important residues. Here, we describe AMSA, an annotated multiple sequence alignment format used to build a new generation of generalized profiles, the migration of ScanProsite to Vital-IT, a cluster of 633 CPUs, and the adoption of the Distributed Annotation System (DAS) to facilitate PROSITE data integration and interchange with other sources. The latest version of PROSITE (release 20.54, of 22 September 2009) contains 1308 patterns, 863 profiles and 869 ProRules. PROSITE is accessible at: http://www.expasy.org/prosite/.
The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the domain was switched to the newly developed site described in this paper in July 2008.
The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access.
is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to email@example.com.
The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates together predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have started to also display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for protein families to propagate annotation to all members of manually defined protein families, using very strict criteria. The HAMAP system is composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline. The proteome database comprises biological and sequence information for each completely sequenced microbial proteome, and it offers several tools for CDS searches, BLAST options and retrieval of specific sets of proteins. The family database currently comprises more than 1500 manually curated protein families and their annotation templates that are used to annotate proteins that belong to one of the HAMAP families. On the HAMAP website, individual sequences as well as whole genomes can be scanned against all HAMAP families. The system provides warnings for the absence of conserved amino acid residues, unusual sequence length, etc. Thanks to the implementation of HAMAP, more than 200 000 microbial proteins have been fully annotated in UniProtKB/Swiss-Prot (HAMAP website: http://www.expasy.org/sprot/hamap).
The potency of the immune response has still to be harnessed effectively to combat human cancers. However, the discovery of T-cell targets in melanomas and other tumors has raised the possibility that cancer vaccines can be used to induce a therapeutically effective immune response against cancer. The targets, cancer-testis (CT) antigens, are immunogenic proteins preferentially expressed in normal gametogenic tissues and different histological types of tumors. Therapeutic cancer vaccines directed against CT antigens are currently in late-stage clinical trials testing whether they can delay or prevent recurrence of lung cancer and melanoma following surgical removal of primary tumors. CT antigens constitute a large, but ill-defined, family of proteins that exhibit a remarkably restricted expression. Currently, there is a considerable amount of information about these proteins, but the data are scattered through the literature and in several bioinformatic databases. The database presented here, CTdatabase (http://www.cta.lncc.br), unifies this knowledge to facilitate both the mining of the existing deluge of data, and the identification of proteins alleged to be CT antigens, but that do not have their characteristic restricted expression pattern. CTdatabase is more than a repository of CT antigen data, since all the available information was carefully curated and annotated with most data being specifically processed for CT antigens and stored locally. Starting from a compilation of known CT antigens, CTdatabase provides basic information including gene names and aliases, RefSeq accession numbers, genomic location, known splicing variants, gene duplications and additional family members. Gene expression at the mRNA level in normal and tumor tissues has been collated from publicly available data obtained by several different technologies. Manually curated data related to mRNA and protein expression, and antigen-specific immune responses in cancer patients are also available, together with links to PubMed for relevant CT antigen articles.
WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at .
PROSITE consists of documentation entries describing protein domains, families and functional sites, as well as associated patterns and profiles to identify them. It is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. In this article, we describe the implementation of a new method to assign a status to pattern matches, the new PROSITE web page and a new approach to improve the specificity and sensitivity of PROSITE methods. The latest version of PROSITE (release 20.19 of 11 September 2007) contains 1319 patterns, 745 profiles and 764 ProRules. Over the past 2 years, about 200 domains have been added, and now 53% of UniProtKB/Swiss-Prot entries (release 54.2 of 11 September 2007) have a PROSITE match. PROSITE is available on the web at: http://www.expasy.org/prosite/.
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (), and for download by anonymous FTP (). The InterProScan search tool is now also available via a web service at .