BioJS is an open source software project that develops visualization tools for different types of biological data. Here we report on the factors that influenced the growth of the BioJS user and developer community, and outline our strategy for building on this growth. The lessons we have learned on BioJS may also be relevant to other open source software projects.
Elucidating the effects of naturally occurring genetic variation is one of the major challenges for personalized health and personalized medicine. Here, we introduce SNAP2, a novel neural network based classifier that improves over the state-of-the-art in distinguishing between effect and neutral variants. Our method's improved performance results from screening many potentially relevant protein features and from refining our development data sets. Cross-validated on >100k experimentally annotated variants, SNAP2 significantly outperformed other methods, attaining a two-state accuracy (effect/neutral) of 83%. SNAP2 also outperformed combinations of other methods. Performance increased for human variants but much more so for other organisms. Our method's carefully calibrated reliability index informs selection of variants for experimental follow up, with the most strongly predicted half of all effect variants predicted at over 96% accuracy. As expected, the evolutionary information from automatically generated multiple sequence alignments gave the strongest signal for the prediction. However, we also optimized our new method to perform surprisingly well even without alignments. This feature reduces prediction runtime by over two orders of magnitude, enables cross-genome comparisons, and renders our new method as the best solution for the 10-20% of sequence orphans. SNAP2 is available at: https://rostlab.org/services/snap2web
Delta, input feature that results from computing the difference between the feature scores for the native amino acid and the feature scores for the variant amino acid; nsSNP, non-synonymous SNP; PMD, Protein Mutant Database; SNAP, Screening for non-acceptable polymorphisms; SNP, single nucleotide polymorphism; variant, any amino acid changing sequence variant.
functional effect prediction; variant effect; neural network; from sequence; SNP effect
TSPO translocator proteins bind steroids and porphyrins, and they are implicated in many human diseases, for which they serve as biomarkers and therapeutic targets. TSPOs have tryptophan-rich sequences that are highly conserved from bacteria to mammals. We report crystal structures for Bacillus cereus TSPO (BcTSPO) down to 1.7Å resolution, including a complex with the benzodiazepine-like inhibitor PK11195. We also describe BcTSPO-mediated protoporphyrin IX (PpIX) reactions, including catalytic degradation to a previously undescribed heme derivative. We used structure-inspired mutations to investigate reaction mechanisms, and we showed that TSPOs from Xenopus and man have similar PpIX-directed activities. Although TSPOs have been regarded as transporters, the catalytic activity in PpIX degradation suggests physiological importance for TSPOs in protection against oxidative stress.
Human bestrophin 1 (hBest1) is a calcium-activated chloride channel from the retinal pigment epithelium, where it can suffer mutations associated with vitelliform macular degeneration, or Best disease. We describe the structure of a bacterial homolog (KpBest) of hBest1 and functional characterizations of both channels. KpBest is a pentamer that forms a five-helix transmembrane pore, closed by three rings of conserved hydrophobic residues, and has a cytoplasmic cavern with a restricted exit. From electrophysiological analysis of structure-inspired mutations in KpBest and hBest1, we find a subtle control of ion selectivity in the bestrophins, including reversal of anion/cation selectivity, and dramatic activation by mutations at the exit restriction. A homology model of hBest1 shows the locations of disease-causing mutations and suggests possible roles in regulation.
Calcium-activated chloride channel; Crystal structure; Macular degeneration; Single-wavelength anomalous diffraction (SAD); Sodium channel
Speed is of the essence in combating Ebola; thus, computational approaches should form a significant component of Ebola research. As for the development of any modern drug, computational biology is uniquely positioned to contribute through comparative analysis of the genome sequences of Ebola strains as well as 3-D protein modeling. Other computational approaches to Ebola may include large-scale docking studies of Ebola proteins with human proteins and with small-molecule libraries, computational modeling of the spread of the virus, computational mining of the Ebola literature, and creation of a curated Ebola database. Taken together, such computational efforts could significantly accelerate traditional scientific approaches. In recognition of the need for important and immediate solutions from the field of computational biology against Ebola, the International Society for Computational Biology (ISCB) announces a prize for an important computational advance in fighting the Ebola virus. ISCB will confer the ISCB Fight against Ebola Award, along with a prize of US$2,000, at its July 2016 annual meeting (ISCB Intelligent Systems for Molecular Biology [ISMB] 2016, Orlando, Florida).
Calcium homeostasis balances passive calcium leak and active calcium uptake. Human Bax inhibitor 1 (hBI-1) is an anti-apoptotic protein that mediates a calcium leak and is representative of a highly conserved and widely distributed family, the transmembrane Bax inhibitor motif (TMBIM) proteins. Here we present crystal structures of a bacterial homolog and characterize its calcium leak activity. The structure has a seven-transmembrane-helix fold that features two triple-helix sandwiches wrapped around a central C-terminal helix. Structures obtained in closed and open conformations are reversibly inter-convertible by change of pH. A hydrogen-bonded, pKa-perturbed pair of conserved aspartate residues explains the pH dependence of this equilibrium, and biochemical studies show that pH regulates calcium influx in proteoliposomes. Homology models for hBI-1 provide insights into TMBIM-mediated calcium leak and cytoprotective activity.
We report on several proteins recently solved by structural genomics consortia, in particular by the Northeast Structural Genomics consortium (NESG). The proteins considered in this study differ substantially in their sequences but they share a similar structural core, characterized by a pseudobarrel five-stranded beta sheet. This core corresponds to the PUA domain-like architecture in the SCOP database. By connecting sequence information with structural knowledge, we characterize a new subgroup of these proteins that we propose to be distinctly different from previously described PUA domain-like domains such as PUA proper or ASCH. We refer to these newly defined domains as EVE. Although EVE may have retained the ability of PUA domains to bind RNA, the available experimental and computational data suggest that both the details of its molecular function and its cellular function differ from those of other PUA domain-like domains. This study of EVE and its relatives illustrates how the combination of structure and genomics creates new insights by connecting a cornucopia of structures that map to the same evolutionary potential. Primary sequence information alone would not have been sufficient to reveal these evolutionary links.
structural genomics; protein function prediction; PUA domain-like domains; X-ray crystallography; NMR
The prediction of protein sub-cellular localization is an important step toward elucidating protein function. For each query protein sequence, LocTree2 applies machine learning (profile kernel SVM) to predict the native sub-cellular localization in 18 classes for eukaryotes, in six for bacteria and in three for archaea. The method outputs a score that reflects the reliability of each prediction. LocTree2 has performed on par with or better than any other state-of-the-art method. Here, we report the availability of LocTree3 as a public web server. The server includes the machine learning-based LocTree2 and improves over it through the addition of homology-based inference. Assessed on sequence-unique data, LocTree3 reached an 18-state accuracy Q18 = 80 ± 3% for eukaryotes and a six-state accuracy Q6 = 89 ± 4% for bacteria. The server accepts submissions ranging from single protein sequences to entire proteomes. Response time of the unloaded server is about 90 s for a 300-residue eukaryotic protein and a few hours for an entire eukaryotic proteome not considering the generation of the alignments. For over 1000 entirely sequenced organisms, the predictions are directly available as downloads. The web server is available at http://www.rostlab.org/services/loctree3.
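The N-state accuracies quoted above (Q18, Q6) can be illustrated with a minimal sketch. It assumes that QN is simply the percentage of proteins whose predicted class matches the experimentally observed class; the function name `q_accuracy` and the toy labels are hypothetical and are not part of the LocTree3 server or its data.

```python
def q_accuracy(observed, predicted):
    """Percentage of proteins whose predicted class equals the observed class."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("need at least one protein")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100.0 * correct / len(observed)


# Hypothetical toy data: five proteins with illustrative class labels only.
observed = ["cytoplasm", "periplasm", "secreted", "cytoplasm", "outer membrane"]
predicted = ["cytoplasm", "periplasm", "cytoplasm", "cytoplasm", "outer membrane"]

print(f"Q = {q_accuracy(observed, predicted):.0f}%")
```

The same function applies unchanged to 3, 6 or 18 classes, because it only counts exact matches between the two label lists.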
PredictProtein is a meta-service for sequence analysis that has been predicting
structural and functional features of proteins since 1992. Queried with a
protein sequence it returns: multiple sequence alignments, predicted aspects of
structure (secondary structure, solvent accessibility, transmembrane helices
(TMSEG) and strands, coiled-coil regions, disulfide bonds and disordered
regions) and function. The service incorporates analysis methods for the
identification of functional regions (ConSurf), homology-based inference of Gene
Ontology terms (metastudent), comprehensive subcellular localization prediction
(LocTree3), protein–protein binding sites (ISIS2),
protein–polynucleotide binding sites (SomeNA) and predictions of the
effect of point mutations (non-synonymous SNPs) on protein function (SNAP2). Our
goal has always been to develop a system optimized to meet the demands of
experimentalists not highly experienced in bioinformatics. To this end, the
PredictProtein results are presented as both text and a series of intuitive,
interactive and visually appealing figures. The web server and sources are
available at http://ppopen.rostlab.org.
The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.
Twenty years of improved technology and a growing number of sequences now render residue-residue contact constraints, inferred from correlated mutations in large protein families, accurate enough to drive de novo predictions of protein three-dimensional structure. The method EVfold broke new ground using mean-field Direct Coupling Analysis (EVfold-mfDCA); the method PSICOV applied a related concept by estimating a sparse inverse covariance matrix. Both methods (EVfold-mfDCA and PSICOV) are publicly available, but both require too much CPU time for interactive applications. In addition, EVfold-mfDCA depends on proprietary software.
Here, we present FreeContact, a fast, open source implementation of EVfold-mfDCA and PSICOV. On a test set of 140 proteins, FreeContact was almost eight times faster than PSICOV without decreasing prediction performance. The EVfold-mfDCA implementation of FreeContact was over 220 times faster than PSICOV with negligible performance decrease. EVfold-mfDCA was unavailable for testing due to its dependency on proprietary software. FreeContact is implemented as the free C++ library “libfreecontact”, complete with command line tool “freecontact”, as well as Perl and Python modules. All components are available as Debian packages. FreeContact supports the BioXSD format for interoperability.
FreeContact provides the opportunity to compute reliable contact predictions in any environment (desktop or cloud).
Protein structure prediction; Protein sequence analysis; Fast protein contact prediction; 2D prediction; Open-source software; EVfold; EVcouplings; PSICOV; mfDCA; BioXSD; Debian package
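The idea of correlated mutations behind contact prediction can be illustrated with a deliberately simple stand-in. The sketch below computes the mutual information between two columns of a multiple sequence alignment; this is a much simpler covariation measure than either mfDCA or the sparse inverse covariance estimation of PSICOV, and it does not use or mimic the FreeContact library. The function names and the four-sequence toy alignment are hypothetical.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in alignment]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(alignment)
    col_i, col_j = column(alignment, i), column(alignment, j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 is fully conserved, column 3 varies independently of column 0.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]

print(mutual_information(alignment, 0, 1))  # co-varying pair
print(mutual_information(alignment, 0, 3))  # independent pair
```

Pairs of positions with high scores are candidates for spatial contacts; methods such as mfDCA and PSICOV go further by separating direct couplings from indirect, transitive correlations, which this sketch does not attempt.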
Summary: The HeatMapViewer is a BioJS component that lays out and renders two-dimensional (2D) plots or heat maps that are ideally suited to visualize matrix-formatted data in biology, such as for the display of microarray experiments or the outcome of mutational studies and the study of SNP-like sequence variants. It can be easily integrated into documents and provides a powerful, interactive way to visualize heat maps in web applications. The software uses a scalable graphics technology that adapts the visualization component to any required resolution, a useful feature for a presentation with many different data points. The component can be applied to present various biological data types. Here, we present two such cases – showing gene expression data and visualizing mutability landscape analysis.
This study uses the Pfam database to show that the sequence redundancy of protein structures deposited in the PDB is increasing. The possible reasons behind this trend are discussed.
High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.
Pfam families; structural coverage; protein-sequence space
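The coverage rates quoted in the abstract above reduce to simple arithmetic, sketched here. The function name is hypothetical, and the 2012 estimate merely combines the two rounded figures from the abstract (more than 8000 structures, about five families per 100 structures), so it should be read as a rough order of magnitude rather than a reported result.

```python
def first_structures_per_100(new_families, new_entries):
    """Pfam families gaining their first structure per 100 new PDB entries."""
    return 100.0 * new_families / new_entries


# Rates as stated in the abstract (families per 100 new PDB entries).
rate_twenty_years_ago = 20.0
rate_2012 = 5.0

# Fold decrease in the pace of novel structural coverage.
fold_decrease = rate_twenty_years_ago / rate_2012
print(f"fold decrease: {fold_decrease:.0f}x")

# Rough estimate: about 8000 structures released in 2012 at about 5 per 100.
entries_2012 = 8000
approx_new_families_2012 = entries_2012 * rate_2012 / 100.0
print(f"approx. families newly covered in 2012: {approx_new_families_2012:.0f}")
```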
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools.
We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.
The ribosome consists of small and large subunits each comprised of dozens of proteins and RNA molecules. However, the functions of many of the individual protomers within the ribosome are still unknown. Here we describe the solution NMR structure of the ribosomal protein RP-L35Ae from the archaeon Pyrococcus furiosus. RP-L35Ae is buried within the large subunit of the ribosome and belongs to Pfam protein domain family PF01247, which is highly conserved in eukaryotes, present in a few archaeal genomes, but absent in bacteria. The protein adopts a six-stranded anti-parallel β-barrel analogous to the ‘tRNA binding motif’ fold. The structure of the P. furiosus RP-L35Ae presented here constitutes the first structural representative from this protein domain family.
ribosomal protein; L35Ae; PF01247; tRNA binding; solution NMR; structural genomics
We show that amino acid co-variation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown, 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane), applies a maximum entropy approach to infer evolutionary co-variation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded, de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modelling by this method.
One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster and render the kernel as possibly the top contender in a low ratio of speed/performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.
Recent years have seen the establishment of structural genomics centers that explicitly target integral membrane proteins. Here, we review the advances in targeting these extremely high-hanging fruits of structural biology in high-throughput mode. We observe that the experimental determination of high-resolution structures of integral membrane proteins is increasingly successful both in terms of getting structures and of covering important protein families, e.g. from Pfam. Structural genomics has begun to contribute significantly toward this progress. An important component of this contribution is the setup of robotic pipelines that generate a wealth of experimental data for membrane proteins. We argue that prediction methods for the identification of membrane regions and for the comparison of membrane proteins largely suffice to meet the challenges of target selection for structural genomics of membrane proteins. In contrast, we need better methods to prioritize the most promising members in a family of closely related proteins and to annotate protein function from sequence and structure in the absence of homology.
alpha-helical integral membrane proteins; structural genomics; protein families; protein structure; target selection; function prediction
Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference.
Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements.
Results and conclusions
During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.
The plant SLAC1 anion channel controls turgor pressure in the aperture-defining guard cells of plant stomata, thereby regulating exchange of water vapor and photosynthetic gases in response to environmental signals such as drought or high levels of carbon dioxide. We determined the crystal structure of a bacterial homolog of SLAC1 at 1.20Å resolution, and we have used structure-inspired mutagenesis to analyze the conductance properties of SLAC1 channels. SLAC1 is a symmetric trimer composed from quasi-symmetric subunits, each having ten transmembrane helices arranged from helical hairpin pairs to form a central five-helix transmembrane pore that is gated by an extremely conserved phenylalanine residue. Conformational features suggest a mechanism for control of gating by kinase activation, and electrostatic features of the pore coupled with electrophysiological characteristics suggest that selectivity among different anions is largely a function of the energetic cost of ion dehydration.
Motivation: Subcellular localization is one aspect of protein function. Despite advances in high-throughput imaging, localization maps remain incomplete. Several methods accurately predict localization, but many challenges remain to be tackled.
Results: In this study, we introduced a framework to predict localization in life's three domains, including globular and membrane proteins (3 classes for archaea; 6 for bacteria and 18 for eukaryota). The resulting method, LocTree2, works well even for protein fragments. It uses a hierarchical system of support vector machines that imitates the cascading mechanism of cellular sorting. The method reaches high levels of sustained performance (eukaryota: Q18=65%, bacteria: Q6=84%). LocTree2 also accurately distinguishes membrane and non-membrane proteins. In our hands, it compared favorably with top methods when tested on new data.
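The hierarchical design described above can be sketched as a small decision tree in which each internal node holds a binary classifier and each leaf is a localization class. This is only an illustration of the cascade idea: the threshold functions below stand in for trained support vector machines, and the feature names, tree layout and class labels are hypothetical rather than taken from LocTree2.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Node:
    """Internal node: a binary decision that routes a protein to one branch."""
    decide: Callable[[dict], bool]  # stand-in for a trained binary SVM
    if_true: Union["Node", str]     # sub-tree or final class label
    if_false: Union["Node", str]


def predict(node, features):
    """Walk the cascade from the root until a class label (leaf) is reached."""
    while isinstance(node, Node):
        node = node.if_true if node.decide(features) else node.if_false
    return node


# Hypothetical three-class tree; scores and cut-offs are illustrative only.
tree = Node(
    decide=lambda f: f["signal_score"] > 0.5,
    if_true="secreted",
    if_false=Node(
        decide=lambda f: f["membrane_score"] > 0.5,
        if_true="plasma membrane",
        if_false="cytoplasm",
    ),
)

print(predict(tree, {"signal_score": 0.1, "membrane_score": 0.8}))
```

Each protein passes through only the decisions on its own path, so an 18-class tree needs far fewer classifier evaluations per protein than a flat one-versus-all scheme with 18 classifiers.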
Availability: Online through PredictProtein (predictprotein.org); as standalone version at http://www.rostlab.org/services/loctree2.
Supplementary data are available at Bioinformatics online.
The intricate molecular details of protein-protein interactions (PPIs) are crucial for function. Therefore, when measuring the same interacting protein pair again, we expect the same result. This work measured the similarity in the molecular details of interaction for the same and for homologous protein pairs between different experiments. All scores analyzed suggested that different experiments often find exceptions in the interfaces of similar PPIs: up to 22% of all comparisons revealed some differences even for sequence-identical pairs of proteins. The corresponding number for pairs of close homologs reached 68%. Conversely, the interfaces differed entirely for 12–29% of all comparisons. All these estimates were calculated after redundancy reduction. The magnitude of interface differences ranged from subtle to extreme, as illustrated by a few examples. An extreme case was a change of the interacting domains between two observations of the same biological interaction. One reason for different interfaces was the number of copies of an interaction in the same complex: the probability of observing alternative binding modes increases with the number of copies. Even after removing the special cases with alternative hetero-interfaces to the same homomer, a substantial variability remained. Our results strongly support the surprising notion that there are many alternative solutions to make the intricate molecular details of PPIs crucial for function.
The number of known protein-protein interactions (PPIs) grows rapidly, yet their molecular details remain largely unknown. Over the last years, structural biologists have addressed this issue with an increased output of structurally resolved hetero complexes. This wealth now enables statistically significant quantitative statements about interface properties. Here, we addressed the question of how interfaces differ when observing the same protein-protein interaction twice. A new dataset derived from the entire PDB was analyzed employing different definitions for the “same interaction” and a range of interface similarity measures. The hypothesis was that the interface between the same pair of proteins stays the same irrespective of how often it is measured. Although the results mostly confirm this hypothesis, the surprising finding was how often it was not true: for many comparisons of interfaces, the molecular details of the interaction differed importantly, often without the slightest change of amino acids. In addition, no matter how many “special cases” were sieved out, the essential message remained: interfaces appear immensely plastic. Hand-selected sample structures largely support this view. In general, we complement a series of recent studies focusing either on family-family interactions or exploring other aspects of protein-protein complexes.
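One simple way to quantify how much two observations of the same interface agree is sketched below. It assumes that an interface is represented as a set of residue-residue contact pairs and uses the Jaccard index as the similarity score; the abstract does not name the measures actually used in the study, so both the representation and the choice of score are assumptions, and the residue numbers are invented for illustration.

```python
def jaccard(contacts_a, contacts_b):
    """Jaccard index of two contact sets: shared contacts / all contacts."""
    a, b = set(contacts_a), set(contacts_b)
    if not a and not b:
        return 1.0  # two empty interfaces are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical contacts (residue in protein 1, residue in protein 2)
# from two experiments on the same protein pair.
interface_1 = {(10, 55), (11, 55), (14, 60), (18, 62)}
interface_2 = {(10, 55), (11, 55), (14, 61), (18, 62)}

similarity = jaccard(interface_1, interface_2)
print(f"interface similarity: {similarity:.2f}")
```

A score of 1 means both experiments observed exactly the same contacts, whereas a score of 0 corresponds to the case in which the interfaces differ entirely.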