With the continued growth in the volume both of experimental G protein-coupled receptor (GPCR) data and of the related peer-reviewed literature, the ability of GPCR researchers to keep up-to-date is becoming increasingly curtailed.
We present work that integrates the biological data and annotations in the GPCR information system (GPCRDB) with next-generation methods for intelligently exploring, visualising and interacting with the scientific articles used to disseminate them. This solution automatically retrieves relevant information from GPCRDB and displays it both within and as an adjunct to an article.
This approach allows researchers to extract more knowledge more swiftly from literature. Importantly, it allows reinterpretation of data in articles published before GPCR structure data became widely available, thereby rescuing these valuable data from long-dormant sources.
Regions of protein sequences with biased amino acid composition (so-called Low-Complexity Regions (LCRs)) are abundant in the protein universe. A number of studies have revealed that i) these regions show significant divergence across protein families; ii) the genetic mechanisms from which they arise lend them remarkable degrees of compositional plasticity. They have therefore proved difficult to compare using conventional sequence analysis techniques, and functions remain to be elucidated for most of them. Here we undertake a systematic investigation of LCRs in order to explore their possible functional significance, placed in the particular context of Protein-Protein Interaction (PPI) networks and Gene Ontology (GO)-term analysis.
In keeping with previous results, we found that LCR-containing proteins tend to have more binding partners across different PPI networks than proteins that have no LCRs. More specifically, our study suggests i) that LCRs are preferentially positioned towards the protein sequence extremities and, in contrast with centrally-located LCRs, such terminal LCRs show a correlation between their lengths and degrees of connectivity, and ii) that centrally-located LCRs are enriched with transcription-related GO terms, while terminal LCRs are enriched with translation and stress response-related terms.
Our results suggest not only that LCRs may be involved in flexible binding associated with specific functions, but also that their positions within a sequence may be important in determining both their binding properties and their biological roles.
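As an illustration of how regions of biased composition might be located and then labelled by position, the following is a minimal sketch rather than the method used in the study. It assumes a sliding-window Shannon entropy criterion; the window size, entropy threshold, terminal fraction and all function names are hypothetical choices made for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_lcrs(seq: str, window: int = 12, threshold: float = 2.2):
    """Return merged half-open (start, end) intervals covered by low-entropy windows.

    ASSUMPTION: window size and threshold are illustrative, not taken from the study.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) <= threshold:
            if regions and i <= regions[-1][1]:
                # Window overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


def classify_position(region, seq_len: int, terminal_fraction: float = 0.2) -> str:
    """Label a region 'terminal' if it reaches into either end of the sequence.

    ASSUMPTION: 'terminal' means within the first or last 20% of the sequence.
    """
    start, end = region
    edge = seq_len * terminal_fraction
    if start < edge or end > seq_len - edge:
        return "terminal"
    return "central"


# Toy sequences (hypothetical): a poly-Q run at the N-terminus and in the centre.
DIVERSE = "ACDEFGHIKLMNPQRSTVWY"
terminal_example = "Q" * 20 + DIVERSE * 2
central_example = DIVERSE * 2 + "Q" * 20 + DIVERSE * 2
```

Calling find_lcrs on each toy sequence and passing the resulting regions to classify_position separates the N-terminal run from the centrally located one.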
The small leucine-rich repeat proteins and proteoglycans (SLRPs) form an important family of regulatory molecules that participate in many essential functions. They typically control the correct assembly of collagen fibrils, regulate mineral deposition in bone, and modulate the activity of potent cellular growth factors through many signalling cascades. SLRPs belong to the group of extracellular leucine-rich repeat proteins that are flanked at both ends by disulphide-bonded caps that protect the hydrophobic core of the terminal repeats. A capping motif specific to SLRPs has been recently described in the crystal structures of the core proteins of decorin and biglycan. This motif, designated as LRRCE, differs in both sequence and structure from other, more widespread leucine-rich capping motifs. To investigate if the LRRCE motif is a common structural feature found in other leucine-rich repeat proteins, we have defined characteristic sequence patterns and used them in genome-wide searches.
The LRRCE motif is a structural element exclusive to the main group of SLRPs. It appears to have evolved during early chordate evolution and is not found in protein sequences from non-chordate genomes. Our search has expanded the family of SLRPs to include new predicted protein sequences, mainly in fishes but with intriguing putative orthologs in mammals. The chromosomal locations of the newly predicted SLRP genes would support the large-scale genome or gene duplications that are thought to have occurred during vertebrate evolution. From this expanded list we describe a new class of SLRP sequences that could be representative of an ancestral SLRP gene.
Given its exclusivity, the LRRCE motif is a useful annotation tool for the identification and classification of new SLRP sequences in genome databases. The expanded list of members of the SLRP family offers interesting insights into early vertebrate evolution and suggests an early chordate evolutionary origin for the LRRCE capping motif.
The PRINTS database, now in its 21st year, houses a collection of diagnostic protein family ‘fingerprints’. Fingerprints are groups of conserved motifs, evident in multiple sequence alignments, whose unique inter-relationships provide distinctive signatures for particular protein families and structural/functional domains. As such, they may be used to assign uncharacterized sequences to known families, and hence to infer tentative functional, structural and/or evolutionary relationships. The February 2012 release (version 42.0) includes 2156 fingerprints, encoding 12 444 individual motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Here, we report the current status of the database, and introduce a number of recent developments that help both to render a variety of our annotation and analysis tools easier to use and to make them more widely available.
Curated databases are an integral part of the tool set that researchers use on a daily basis for their work. For most users, however, how databases are maintained, and by whom, is rather obscure. The International Society for Biocuration (ISB) represents biocurators, software engineers, developers and researchers with an interest in biocuration. Its goals include fostering communication between biocurators, promoting and describing their work, and highlighting the added value of biocuration to the world. The ISB recently conducted a survey of biocurators to better understand their educational and scientific backgrounds, their motivations for choosing a curatorial job and their career goals. The results are reported here. From the responses received, it is evident that biocuration is performed by highly trained scientists and perceived to be a stimulating career, offering both intellectual challenges and the satisfaction of performing work essential to the modern scientific community. It is also apparent that the ISB has at least a dual role to play to facilitate biocurators’ work: (i) to promote biocuration as a career within the greater scientific community; (ii) to aid the development of resources for biomedical research through promotion of nomenclature and data-sharing standards that will allow interconnection of biological databases and better exploit the pivotal contributions that biocurators are making.
Funding bodies are increasingly recognizing the need to provide graduates and researchers with access to short intensive courses in a variety of disciplines, in order both to improve the general skills base and to provide solid foundations on which researchers may build their careers. In response to the development of ‘high-throughput biology’, the need for training in the field of bioinformatics, in particular, is seeing a resurgence: it has been defined as a key priority by many institutions and research programmes and is now an important component of many grant proposals. Nevertheless, when it comes to planning and preparing to meet such training needs, tension arises between the reward structures that predominate in the scientific community, which compel individuals to publish or perish, and the time that must be devoted to the design, delivery and maintenance of high-quality training materials. Conversely, there is much relevant teaching material and training expertise available worldwide that, were it properly organized, could be exploited by anyone who needs to provide training or needs to set up a new course. To do this, however, the materials would have to be centralized in a database and clearly tagged in relation to target audiences, learning objectives, etc. Ideally, they would also be peer reviewed, and easily and efficiently accessible for downloading. Here, we present the Bioinformatics Training Network (BTN), a new enterprise that has been initiated to address these needs, and review it in relation to similar initiatives and collections.
Bioinformatics; training; end users; bioinformatics courses; learning bioinformatics
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
The NucleaRDB is a Molecular Class-Specific Information System that collects, combines, validates and disseminates large amounts of heterogeneous data on nuclear hormone receptors. It contains both experimental and computationally derived data. The data and knowledge present in the NucleaRDB can be accessed using a number of different interactive and programmatic methods and query systems. A nuclear hormone receptor-specific PDF reader interface is available that can integrate the contents of the NucleaRDB with full-text scientific articles. The NucleaRDB is freely available at http://www.receptors.org/nucleardb.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
We live in interesting times. Portents of impending catastrophe pervade the literature, calling us to action in the face of unmanageable volumes of scientific data. But it isn't so much data generation per se, but the systematic burial of the knowledge embodied in those data that poses the problem: there is so much information available that we simply no longer know what we know, and finding what we want is hard – too hard. The knowledge we seek is often fragmentary and disconnected, spread thinly across thousands of databases and millions of articles in thousands of journals. The intellectual energy required to search this array of data-archives, and the time and money this wastes, has led several researchers to challenge the methods by which we traditionally commit newly acquired facts and knowledge to the scientific record. We present some of these initiatives here – a whirlwind tour of recent projects to transform scholarly publishing paradigms, culminating in Utopia and the Semantic Biochemical Journal experiment. With their promises to provide new ways of interacting with the literature, and new and more powerful tools to access and extract the knowledge sequestered within it, we ask what advances they make and what obstacles to progress still exist? We explore these questions, and, as you read on, we invite you to engage in an experiment with us, a real-time test of a new technology to rescue data from the dormant pages of published documents. We ask you, please, to read the instructions carefully. The time has come: you may turn over your papers…
dynamic document content; interactive PDF; linking documents with research data; manuscript mark-up; mark-up standards; semantic publishing; BJ, Biochemical Journal; COHSE, Conceptual Open Hypermedia Services Environment; DOI, Digital Object Identifier; GO, Gene Ontology; GPCR, G protein-coupled receptor; HTML, HyperText Mark-up Language; IUPAC, International Union of Pure and Applied Chemistry; NTD, Neglected Tropical Diseases; OBO, Open Biomedical Ontologies; PDB, Protein Data Bank; PDF, Portable Document Format; PLoS, Public Library of Science; PMC, PubMed Central; PTM, post-translational modification; RSC, Royal Society of Chemistry; SDA, Structured Digital Abstract; STM, Scientific, Technical and Medical; UD, Utopia Documents; XML, eXtensible Mark-up Language; XMP, eXtensible Metadata Platform
Aspergillus Genomes is a public resource for viewing annotated genes predicted by various Aspergillus sequencing projects. It has arisen from the union of two significant resources: the Aspergillus/Aspergillosis website and the Central Aspergillus Data REpository (CADRE). The former has primarily served the medical community, providing information about Aspergillus and associated diseases to medics, patients and scientists; the latter has focused on the fungal genomic community, providing a central repository for sequences and annotation extracted from Aspergillus Genomes. By merging these databases, genomes benefit from extensive cross-linking with medical information to create a unique resource, spanning genomics and clinical aspects of the genus. Aspergillus Genomes is accessible from http://www.aspergillus-genomes.org.uk.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have also started to display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver, and for download by anonymous FTP. The InterProScan search tool is now also available via a web service.
Based on Bayesian Networks, methods were created that address protein sequence-based bacterial subcellular location prediction. Distinct predictive
algorithms for the eight bacterial subcellular locations were created. Several variant methods were explored. These variations included differences in
the number of residues considered within the query sequence - which ranged from the N-terminal 10 residues to the whole sequence - and residue representation -
which took the form of amino acid composition, percentage amino acid composition, or normalised amino acid composition. The accuracies of the best performing
networks were then compared to PSORTB. All individual location methods outperform PSORTB except for the Gram+ cytoplasmic protein predictor, for which accuracies
were essentially equal, and for outer membrane protein prediction, where PSORTB outperforms the binary predictor. The method described here is an important new
approach to method development for subcellular location prediction. It is also a new, potentially valuable tool for candidate subunit vaccine selection.
Bayesian networks; prediction method; subcellular location; membrane protein; periplasmic protein; secreted protein
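The residue representations mentioned above can be sketched as follows. This is an illustrative example only, not the authors' code: the function names are hypothetical, and because the abstract does not define "normalised" composition, the scaling used here (division by the count of the most frequent residue) is an assumption.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed feature order


def composition(seq, n_terminal=None):
    """Raw residue counts over the first n_terminal residues (whole sequence if None)."""
    region = seq if n_terminal is None else seq[:n_terminal]
    counts = Counter(region)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]


def percentage_composition(seq, n_terminal=None):
    """Residue counts expressed as a percentage of the residues counted."""
    counts = composition(seq, n_terminal)
    total = sum(counts)
    return [100.0 * c / total if total else 0.0 for c in counts]


def normalised_composition(seq, n_terminal=None):
    """Counts scaled to the range 0..1.

    ASSUMPTION: scaling by the most frequent residue is one plausible reading
    of 'normalised'; the source does not specify the scheme.
    """
    counts = composition(seq, n_terminal)
    peak = max(counts)
    return [c / peak if peak else 0.0 for c in counts]
```

Each function returns a 20-element feature vector, so varying n_terminal from 10 up to the whole sequence yields the range of inputs described in the abstract.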
We describe a novel and potentially important tool for candidate subunit vaccine selection through in silico reverse-vaccinology. A set of Bayesian networks able to make individual predictions for specific subcellular locations
is implemented in three pipelines with different architectures: a parallel implementation with a confidence level-based decision engine and two serial implementations with a hierarchical decision structure, one initially rooted by
prediction between membrane types and another rooted by soluble versus membrane prediction. The parallel pipeline outperformed the serial pipeline, but took twice as long to execute. The soluble-rooted serial pipeline outperformed
the membrane-rooted predictor. Assessment using genomic test sets was more equivocal, as many more predictions are made by the parallel pipeline, yet the serial pipeline identifies 22 more of the 74 proteins of known location.
beta barrel transmembrane protein; prokaryotic membrane proteins; Bayesian Networks; prediction method; subcellular location
Two algorithms, based on Bayesian Networks (BNs), for bacterial subcellular
location prediction, are explored in this paper: one predicts all locations for
Gram+ bacteria and the other all locations for Gram- bacteria. Methods were
evaluated using different numbers of residues (from the N-terminal 10 residues
to the whole sequence) and residue representation (amino acid-composition,
percentage amino acid-composition or normalised amino acid-composition). The
accuracy of the best resulting BN was compared to PSORTB. The accuracy of this
multi-location BN was roughly comparable to PSORTB; the difference in
predictions is low, often less than 2%. The BN method thus represents
both an important new avenue of methodological development for subcellular
location prediction and a potentially valuable new tool of true utilitarian value
for candidate subunit vaccine selection.
Bayesian networks; prediction method; subcellular location; membrane protein; periplasmic protein; secreted protein
Accurate protein structure prediction remains an active objective of research in bioinformatics. Membrane proteins comprise approximately 20% of most genomes. They are,
however, poorly tractable targets of experimental structure determination. Their analysis using bioinformatics thus makes an important contribution to their on-going
study. Using a method based on Bayesian Networks, which provides a flexible and powerful framework for statistical inference, we have addressed the alignment-free discrimination of membrane from non-membrane
proteins. The method successfully identifies prokaryotic and eukaryotic α-helical membrane proteins at 94.4% accuracy, β-barrel proteins at 72.4% accuracy, and distinguishes assorted non-membranous proteins with 85.9%
accuracy. The method here is an important potential advance in the computational analysis of membrane protein structure. It represents a useful tool for the characterisation of membrane proteins with a wide variety of potential applications.
α-helical membrane proteins; β-barrel membrane proteins; membrane protein discrimination; Bayesian Network; alignment-free prediction
Membrane proteins, which constitute approximately 20% of most genomes, form two main classes: alpha helical and beta barrel transmembrane proteins. Using methods based on Bayesian Networks,
a powerful approach for statistical inference, we have sought to address β-barrel topology prediction. The β-barrel topology predictor reports individual strand accuracies of 88.6%. The method
outlined here represents a potentially important advance in the computational determination of membrane protein topology.
beta barrel transmembrane protein; prokaryotic membrane proteins; Bayesian Networks; prediction method; sub-cellular location
Membrane proteins, which constitute approximately 20% of most genomes, are poorly tractable targets for experimental structure determination, thus analysis by prediction and
modelling makes an important contribution to their on-going study. Membrane proteins form two main classes: alpha helical and beta barrel trans-membrane proteins. By using a method
based on Bayesian Networks, which provides a flexible and powerful framework for statistical inference, we addressed α-helical topology prediction. This method has accuracies of 77.4%
for prokaryotic proteins and 61.4% for eukaryotic proteins. The method described here represents an important advance in the computational determination of membrane protein topology and
offers a useful, and complementary, tool for the analysis of membrane proteins for a range of applications.
trans-membrane protein; alpha helix; static full Bayesian model; prediction; amino acid descriptors
The twin arginine translocation (TAT) system ferries folded proteins across the bacterial membrane. Proteins are directed
into this system by the TAT signal peptide present at the amino terminus of the precursor protein, which contains the twin
arginine residues that give the system its name. There are currently only two computational methods for the prediction of
TAT translocated proteins from sequence. Both methods have limitations that make the creation of a new algorithm for
TAT-translocated protein prediction desirable. We have developed TATPred, a new sequence-model method, based on a
Naïve-Bayesian network, for the prediction of TAT signal peptides. In this approach, a comprehensive range of models was
tested to identify the most reliable and robust predictor. The best model comprised 12 residues: three residues prior to the
twin arginines and the seven residues that follow them. We found a prediction sensitivity of 0.979 and a specificity of 0.942.
twin arginine motif; Bayesian Network; TAT translocation; signal sequence; vaccine
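The 12-residue sequence model described above (three residues before the twin arginines, the two arginines themselves, and the seven residues after them) can be illustrated with a short window-extraction sketch. This is not TATPred itself: the function name is hypothetical and the example sequence is invented for illustration.

```python
def tat_windows(seq, before=3, after=7):
    """Return (position, window) for each 'RR' pair with enough flanking residues.

    The window spans `before` residues, the two arginines and `after` residues,
    which gives 3 + 2 + 7 = 12 residues with the default settings.
    """
    windows = []
    start = seq.find("RR")
    while start != -1:
        lo, hi = start - before, start + 2 + after
        if lo >= 0 and hi <= len(seq):
            windows.append((start, seq[lo:hi]))
        # Continue searching from the next position so overlapping pairs are seen.
        start = seq.find("RR", start + 1)
    return windows


# Hypothetical N-terminal fragment, used only to show the window layout.
example = "MSTSRRDFLKGAAAL"
```

Each extracted window would then be passed to a classifier; pairs too close to either end of the sequence are skipped because a full window cannot be formed.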
Bacterial lipoproteins have many important functions and represent a class of possible vaccine candidates. The
prediction of lipoproteins from sequence is thus an important task for computational vaccinology. Naïve-Bayesian
networks were trained to identify SpaseII cleavage sites and their preceding signal sequences using a set of 199 distinct
lipoprotein sequences. A comprehensive range of sequence models was used to identify the best model for lipoprotein
signal sequences. The best performing sequence model was found to be 10-residues in length, including the conserved
cysteine lipid attachment site and the nine residues prior to it. The sensitivity of prediction for LipPred was 0.979,
while the specificity was 0.742. Here, we describe LipPred, a web server for lipoprotein prediction.
LipPred is the most accurate method available for the detection of SpaseII-cleaved lipoprotein signal sequences and the prediction
of their cleavage sites.
alternative splicing; Naïve-Bayesian networks; reverse vaccinology; prediction; server
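To make the idea of a naïve-Bayesian sequence model concrete, the following is a minimal sketch of a position-specific naive Bayes classifier over fixed-length windows, such as the 10-residue model described above (nine residues followed by the conserved cysteine). It is not the LipPred implementation: the class name, the pseudocount and the toy training windows are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class PositionalNaiveBayes:
    """Naive Bayes over fixed-length windows, one residue distribution per position."""

    def __init__(self, length, pseudocount=1.0):
        self.length = length
        self.pseudocount = pseudocount  # ASSUMPTION: Laplace smoothing
        self.log_prior = {}
        self.log_prob = {}

    def fit(self, windows, labels):
        for cls in sorted(set(labels)):
            members = [w for w, y in zip(windows, labels) if y == cls]
            self.log_prior[cls] = math.log(len(members) / len(windows))
            denom = len(members) + self.pseudocount * len(AMINO_ACIDS)
            table = []
            for pos in range(self.length):
                counts = Counter(w[pos] for w in members)
                table.append({
                    aa: math.log((counts.get(aa, 0) + self.pseudocount) / denom)
                    for aa in AMINO_ACIDS
                })
            self.log_prob[cls] = table
        return self

    def log_score(self, window, cls):
        # Positions are treated as independent given the class (the 'naive' step).
        # Windows must contain only the 20 standard residues.
        return self.log_prior[cls] + sum(
            self.log_prob[cls][pos][aa] for pos, aa in enumerate(window)
        )

    def predict(self, window):
        return max(self.log_prior, key=lambda cls: self.log_score(window, cls))


# Toy 10-residue windows (hypothetical): positives end in the cysteine.
positives = ["LLAALLLAGC", "VLSALLLSGC", "ALLGLLLAAC"]
negatives = ["DEKRDEKRDE", "KDEEKRDDKE", "EKDRKEDKRD"]
model = PositionalNaiveBayes(length=10).fit(
    positives + negatives, [True] * 3 + [False] * 3
)
```

With real data, sensitivity and specificity would be estimated on held-out sequences; the toy windows here are far too few to say anything about accuracy.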
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created to integrate the major protein signature databases. Currently, it includes PROSITE, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF and SUPERFAMILY. Signatures are manually integrated into InterPro entries that are curated to provide biological and functional information. Annotation is provided in an abstract, Gene Ontology mapping and links to specialized databases. New features of InterPro include extended protein match views, taxonomic range information and protein 3D structure data. One of the new match views is the InterPro Domain Architecture view, which shows the domain composition of protein matches. Two new entry types were introduced to better describe InterPro entries: these are active site and binding site. PIRSF and the structure-based SUPERFAMILY are the latest member databases to join InterPro, and CATH and PANTHER are soon to be integrated. InterPro release 8.0 contains 11 007 entries, representing 2573 domains, 8166 families, 201 repeats, 26 active sites, 21 binding sites and 20 post-translational modification sites. InterPro covers over 78% of all proteins in the Swiss-Prot and TrEMBL components of UniProt. The database is available for text- and sequence-based searches via a webserver (http://www.ebi.ac.uk/interpro), and for download by anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).