The need to retrieve or classify protein molecules using structure or sequence-based similarity measures underlies a wide range of biomedical applications. Traditional protein search methods rely on a pairwise dissimilarity/similarity measure for comparing a pair of proteins. This kind of pairwise measures suffer from the limitation of neglecting the distribution of other proteins and thus cannot satisfy the need for high accuracy of the retrieval systems. Recent work in the machine learning community has shown that exploiting the global structure of the database and learning the contextual dissimilarity/similarity measures can improve the retrieval performance significantly. However, most existing contextual dissimilarity/similarity learning algorithms work in an unsupervised manner, which does not utilize the information of the known class labels of proteins in the database.
In this paper, we propose a novel protein-protein dissimilarity learning algorithm, ProDis-ContSHC. ProDis-ContSHC regularizes an existing dissimilarity measure dij by considering the contextual information of the proteins. The context of a protein is defined by its neighboring proteins. The basic idea is, for a pair of proteins (i, j), if their context N(i) and N(j) is similar to each other, the two proteins should also have a high similarity. We implement this idea by regularizing dij by a factor learned from the context N(i) and N(j).
Moreover, we divide the context to hierarchial sub-context and get the contextual dissimilarity vector for each protein pair. Using the class label information of the proteins, we select the relevant (a pair of proteins that has the same class labels) and irrelevant (with different labels) protein pairs, and train an SVM model to distinguish between their contextual dissimilarity vectors. The SVM model is further used to learn a supervised regularizing factor. Finally, with the new Supervised learned Dissimilarity measure, we update the Protein Hierarchial Context Coherently in an iterative algorithm--ProDis-ContSHC.
We test the performance of ProDis-ContSHC on two benchmark sets, i.e., the ASTRAL 1.73 database and the FSSP/DALI database. Experimental results demonstrate that plugging our supervised contextual dissimilarity measures into the retrieval systems significantly outperforms the context-free dissimilarity/similarity measures and other unsupervised contextual dissimilarity measures that do not use the class label information.
Using the contextual proteins with their class labels in the database, we can improve the accuracy of the pairwise dissimilarity/similarity measures dramatically for the protein retrieval tasks. In this work, for the first time, we propose the idea of supervised contextual dissimilarity learning, resulting in the ProDis-ContSHC algorithm. Among different contextual dissimilarity learning approaches that can be used to compare a pair of proteins, ProDis-ContSHC provides the highest accuracy. Finally, ProDis-ContSHC compares favorably with other methods reported in the recent literature.
A research area that has greatly benefited from the development of new and improved analysis technologies is Proteomics and large amounts of data have been generated by proteomic analysis as a consequence. Previously, the storage, management and analysis of these data have been done manually. This is, however, incompatible with the volume of data generated by modern proteomic analysis. Several attempts have been made to automate the tasks of data analysis and management. In this work we propose PRODIS (Proteomics Database Integrated System), a system for proteomic experimental data management. The proposed system enables an efficient management of the proteomic experimentation workflow, simplifies controlling experiments and associated data and establishes links between similar experiments through the experiment tracking function.
PRODIS is fully web based which simplifies data upload and gives the system the flexibility necessary for use in complex projects. Data from Liquid Chromatography, 2D-PAGE and Mass Spectrometry experiments can be stored in the system. Moreover, it is simple to use, researchers can insert experimental data directly as experiments are performed, without the need to configure the system or change their experiment routine. PRODIS has a number of important features, including a password protected system in which each screen for data upload and retrieval is validated; users have different levels of clearance, which allow the execution of tasks according to the user clearance level. The system allows the upload, parsing of files, storage and display of experiment results and images in the main formats used in proteomics laboratories: for chromatographies the chromatograms and lists of peaks resulting from separation are stored; For 2D-PAGE images of gels and the files resulting from the analysis are stored, containing information on positions of spots as well as its values of intensity, volume, etc; For Mass Spectrometry, PRODIS presents a function for completion of the mapping plate that allows the user to correlate the positions in plates to the samples separated by 2D-PAGE. Furthermore PRODIS allows the tracking of experiments from the first stage until the final step of identification, enabling an efficient management of the complete experimental process.
The construction of data management systems for Proteomics data importing and storing is a relevant subject. PRODIS is a system complementary to other proteomics tools that combines a powerful storage engine (the relational database) and a friendly access interface, aiming to assist Proteomics research directly at data handling and storage.
Motivation: GSATools is a free software package to analyze conformational ensembles and to detect functional motions in proteins by means of a structural alphabet. The software integrates with the widely used GROMACS simulation package and can generate a range of graphical outputs. Three applications can be supported: (i) investigation of the conformational variability of local structures; (ii) detection of allosteric communication; and (iii) identification of local regions that are critical for global functional motions. These analyses provide insights into the dynamics of proteins and allow for targeted design of functional mutants in theoretical and experimental studies.
Availability: The C source code of the GSATools, along with a set of pre-compiled binaries, is freely available under GNU General Public License from http://mathbio.nimr.mrc.ac.uk/wiki/GSATools.
firstname.lastname@example.org or email@example.com
Supplementary data are available at Bioinformatics online.
Wordom is a versatile, user-friendly, and efficient program for manipulation and analysis of molecular structures and dynamics. The following new analysis modules have been added since the publication of the original Wordom paper in 2007: assignment of secondary structure, calculation of solvent accessible surfaces, elastic network model, motion cross correlations, protein structure network, shortest intra-molecular and inter-molecular communication paths, kinetic grouping analysis, and calculation of mincut-based free energy profiles. In addition, an interface with the Python scripting language has been built and the overall performance and user accessibility enhanced. The source code of Wordom (in the C programming language) as well as documentation for usage and further development are available as an open source package under the GNU General Purpose License from http://wordom.sf.net. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2011
structural/dynamics analysis program; free energy landscape; elastic network model; protein structure network; communication paths
Mocapy++ is a toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs). It supports a wide range of DBN architectures and probability distributions, including distributions from directional statistics (the statistics of angles, directions and orientations).
The program package is freely available under the GNU General Public Licence (GPL) from SourceForge http://sourceforge.net/projects/mocapy. The package contains the source for building the Mocapy++ library, several usage examples and the user manual.
Mocapy++ is especially suitable for constructing probabilistic models of biomolecular structure, due to its support for directional statistics. In particular, it supports the Kent distribution on the sphere and the bivariate von Mises distribution on the torus. These distributions have proven useful to formulate probabilistic models of protein and RNA structure in atomic detail.
The huge difference between the number of known sequences and known tertiary structures has justified the use of automated methods for protein analysis. Although a general methodology to solve these problems has not been yet devised, researchers are engaged in developing more accurate techniques and algorithms whose training plays a relevant role in determining their performance. From this perspective, particular importance is given to the training data used in experiments, and researchers are often engaged in the generation of specialized datasets that meet their requirements.
To facilitate the task of generating specialized datasets we devised and implemented ProDaMa, an open source Python library than provides classes for retrieving, organizing, updating, analyzing, and filtering protein data.
ProDaMa has been used to generate specialized datasets useful for secondary structure prediction and to develop a collaborative web application aimed at generating and sharing protein structure datasets. The library, the related database, and the documentation are freely available at the URL .
Summary: The increasing size and complexity of biological databases has led to a growing trend to federate rather than duplicate them. In order to share data between federated databases, protocols for the exchange mechanism must be developed. One such data exchange protocol that is widely used is the Distributed Annotation System (DAS). For example, DAS has enabled small experimental groups to integrate their data into the Ensembl genome browser. We have developed ProServer, a simple, lightweight, Perl-based DAS server that does not depend on a separate HTTP server. The ProServer package is easily extensible, allowing data to be served from almost any underlying data model. Recent additions to the DAS protocol have enabled both structure and alignment (sequence and structural) data to be exchanged. ProServer allows both of these data types to be served.
Availability: ProServer can be downloaded from http://www.sanger.ac.uk/proserver/ or CPAN http://search.cpan.org/~rpettett/. Details on the system requirements and installation of ProServer can be found at http://www.sanger.ac.uk/proserver/.
Supplementary Materials: DasClientExamples.pdf
Existing repositories for experimental datasets typically capture snapshots of data
acquired using a single experimental technique and often require manual population and
continual curation. We present a storage system for heterogeneous research data that
performs dynamic automated indexing to provide powerful search, discovery and
collaboration features without the restrictions of a structured repository. ADAM is able
to index many commonly used file formats generated by laboratory assays and therefore
offers specific advantages to the experimental biology community. However, it is not
domain specific and can promote sharing and re-use of working data across scientific
Availability and implementation: ADAM is implemented using Java and
supported on Linux. It is open source under the GNU General Public License v3.0.
Installation instructions, binary code, a demo system and virtual machine image and are
available at http://www.imperial.ac.uk/bioinfsupport/resources/software/adam.
Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in python to enable construction of datasets from these resources.
We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL.
The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.
The pep2pro database was built to support effective high-throughput proteome data analysis. Its database schema allows the coherent integration of search results from different database-dependent search algorithms and filtering of the data including control for unambiguous assignment of peptides to proteins. The capacity of the pep2pro database has been exploited in data analysis of various Arabidopsis proteome datasets. The diversity of the datasets and the associated scientific questions required thorough querying of the data. This was supported by the relational format structure of the data that links all information on the sample, spectrum, search database, and algorithm to peptide and protein identifications and their post-translational modifications. After publication of datasets they are made available on the pep2pro website at www.pep2pro.ethz.ch. Further, the pep2pro data analysis pipeline also handles data export do the PRIDE database (http://www.ebi.ac.uk/pride) and data retrieval by the MASCP Gator (http://gator.masc-proteomics.org/). The utility of pep2pro will continue to be used for analysis of additional datasets and as a data warehouse. The capacity of the pep2pro database for proteome data analysis has now also been made publicly available through the release of pep2pro4all, which consists of a database schema and a script that will populate the database with mass spectrometry data provided in mzIdentML format.
database; mzIdentML; pep2pro; plant proteomics; standard format
ACPYPE (or AnteChamber PYthon Parser interfacE) is a wrapper script around the ANTECHAMBER software that simplifies the generation of small molecule topologies and parameters for a variety of molecular dynamics programmes like GROMACS, CHARMM and CNS. It is written in the Python programming language and was developed as a tool for interfacing with other Python based applications such as the CCPN software suite (for NMR data analysis) and ARIA (for structure calculations from NMR data). ACPYPE is open source code, under GNU GPL v3, and is available as a stand-alone application at http://www.ccpn.ac.uk/acpype and as a web portal application at http://webapps.ccpn.ac.uk/acpype.
We verified the topologies generated by ACPYPE in three ways: by comparing with default AMBER topologies for standard amino acids; by generating and verifying topologies for a large set of ligands from the PDB; and by recalculating the structures for 5 protein–ligand complexes from the PDB.
ACPYPE is a tool that simplifies the automatic generation of topology and parameters in different formats for different molecular mechanics programmes, including calculation of partial charges, while being object oriented for integration with other applications.
MD; GROMACS; AMBER; CNS; ANTECHAMBER; NMR; Ligand; Topology
Summary: Profile-based similarity search is an essential step in structure-function studies of proteins. However, inclusion of non-homologous sequence segments into a profile causes its corruption and results in false positives. Profile corruption is common in multidomain proteins, and single domains with long insertions are a significant source of errors. We developed a procedure (HangOut) that, for a single domain with specified insertion position, cleans erroneously extended PSI-BLAST alignments to generate better profiles.
Availability: HangOut is implemented in Python 2.3 and runs on all Unix-compatible platforms. The source code is available under the GNU GPL license at http://prodata.swmed.edu/HangOut/
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset, furthermore no standard relational database interface is provided.
interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python ecology of software (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided.
interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset.
Proteins are known to be dynamic in nature, changing from one conformation to another while performing vital cellular tasks. It is important to understand these movements in order to better understand protein function. At the same time, experimental techniques provide us with only single snapshots of the whole ensemble of available conformations. Computational protein morphing provides a visualization of a protein structure transitioning from one conformation to another by producing a series of intermediate conformations.
We present a novel, efficient morphing algorithm, Morph-Pro based on linear interpolation. We also show that apart from visualization, morphing can be used to provide plausible intermediate structures. We test this by using the intermediate structures of a c-Jun N-terminal kinase (JNK1) conformational change in a virtual docking experiment. The structures are shown to dock with higher score to known JNK1-binding ligands than structures solved using X-Ray crystallography. This experiment demonstrates the potential applications of the intermediate structures in modeling or virtual screening efforts.
Visualization of protein conformational changes is important for characterization of protein function. Furthermore, the intermediate structures produced by our algorithm are good approximations to true structures. We believe there is great potential for these computationally predicted structures in protein-ligand docking experiments and virtual screening. The Morph-Pro web server can be accessed at http://morph-pro.bioinf.spbau.ru.
Protein morphing; Molecular docking; Virtual screening
The integration of genomic information with quantitative experimental data is a key component of systems biology. An increasing number of microbial genomes are being sequenced, leading to an increasing amount of data from post-genomics technologies. The genomes of prokaryotes contain many structures of interest, such as operons, pathogenicity islands and prophage sequences, whose behaviour is of interest during infection and disease. There is a need for simple and novel tools to display and analyse data from these integrated datasets, and we have developed ProGenExpress as a tool for visualising arbitrarily complex numerical data in the context of prokaryotic genomes.
Here we describe ProGenExpress, an R package that allows researchers to easily and quickly visualize quantitative measurements, such as those produced by microarray experiments, in the context of the genome organization of sequenced prokaryotes. Data from microarrays, proteomics or other whole-genome technologies can be accurately displayed on the genome. ProGenExpress can also search for novel regions of interest that consist of groups of adjacent genes that show similar patterns across the experimental data set. We demonstrate ProGenExpress with microarray data from a time-course experiment involving Salmonella typhimurium.
ProGenExpress can be used to visualize quantitative data from complex experiments in the context of the genome of sequenced prokaryotes, and to find novel regions of interest.
Summary: Kalign2 is one of the fastest and most accurate methods for multiple alignments. However, in contrast to other methods Kalign2 does not allow externally supplied position specific gap penalties. Here, we present a modification to Kalign2, KalignP, so that it accepts such penalties. Further, we show that KalignP using position specific gap penalties obtained from predicted secondary structures makes steady improvement over Kalign2 when tested on Balibase 3.0 as well as on a dataset derived from Pfam-A seed alignments.
Availability and Implementation: KalignP is freely available at http://kalignp.cbr.su.se. The source code of KalignP is available under the GNU General Public License, Version 2 or later from the same website.
Supplementary information: Supplementary data are available at Bioinformatics online.
A major problem in structural biology is the recognition of errors in experimental and theoretical models of protein structures. The ProSA program (Protein Structure Analysis) is an established tool which has a large user base and is frequently employed in the refinement and validation of experimental protein structures and in structure prediction and modeling. The analysis of protein structures is generally a difficult and cumbersome exercise. The new service presented here is a straightforward and easy to use extension of the classic ProSA program which exploits the advantages of interactive web-based applications for the display of scores and energy plots that highlight potential problems spotted in protein structures. In particular, the quality scores of a protein are displayed in the context of all known protein structures and problematic parts of a structure are shown and highlighted in a 3D molecule viewer. The service specifically addresses the needs encountered in the validation of protein structures obtained from X-ray analysis, NMR spectroscopy and theoretical calculations. ProSA-web is accessible at https://prosa.services.came.sbg.ac.at
Chaos Game Representation (CGR) is a generalized scale-independent Markov transition table, which is useful for the visualization and comparative study of genomic signature, or for the study of characteristic sequence motifs. However, in order to fully utilize the scale-independent properties of CGR, it should be accessible through scale-independent user interface instead of static images. Here we describe a web server and Perl library for generating zoomable CGR images utilizing Google Maps API, which is also easily searchable for specific motifs. The web server is freely accessible at , and the Perl library as well as the source code is distributed with the G-language Genome Analysis Environment under GNU General Public License.
The SPACER server provides an interactive framework for exploring allosteric communication in proteins with different sizes, degrees of oligomerization and function. SPACER uses recently developed theoretical concepts based on the thermodynamic view of allostery. It proposes easily tractable and meaningful measures that allow users to analyze the effect of ligand binding on the intrinsic protein dynamics. The server shows potential allosteric sites and allows users to explore communication between the regulatory and functional sites. It is possible to explore, for instance, potential effector binding sites in a given structure as targets for allosteric drugs. As input, the server only requires a single structure. The server is freely available at http://allostery.bii.a-star.edu.sg/.
Robustness, maintaining a constant phenotype despite perturbations, is a fundamental property of biological systems that is incorporated at various levels of biological complexity. Although robustness has been frequently observed in nature, its evolutionary origin remains unknown. Current hypotheses suggest that robustness originated as a direct consequence of natural selection, as an intrinsic property of adaptations, or as a congruent correlate of environment robustness. To elucidate the evolutionary origins of robustness, a convenient computational package is strongly needed.
In this study, we developed the open-source integrated system EvoRSR (Evolution of RNA Structural Robustness) to explore the evolution of robustness based on biologically important landscapes induced by RNA folding. EvoRSR is object-oriented, modular, and freely available at under the GNU/GPL license. We present an overview of EvoRSR package and illustrate its features with the miRNA gene cel-mir-357.
EvoRSR is a novel and flexible package for exploring the evolution of robustness. Accordingly, EvoRSR can be used for future studies to investigate the evolution and origin of robustness and to address other common questions about robustness. While the current EvoRSR environment is a versatile analysis framework, future versions can include features to enhance evolutionary studies of robustness.
The availability of large-scale curated protein interaction datasets has given rise to the opportunity to investigate higher level organization and modularity within the protein interaction network (PPI) using graph theoretic analysis. Despite the recent progress, systems level analysis of PPIS remains a daunting task as it is challenging to make sense out of the deluge of high-dimensional interaction data. Specifically, techniques that automatically abstract and summarize PPIS at multiple resolutions to provide high level views of its functional landscape are still lacking. We present a novel data-driven and generic algorithm called FUSE (Functional Summary Generator) that generates functional maps of a PPI at different levels of organization, from broad process-process level interactions to in-depth complex-complex level interactions, through a pro t maximization approach that exploits Minimum Description Length (MDL) principle to maximize information gain of the summary graph while satisfying the level of detail constraint.
We evaluate the performance of FUSE on several real-world PPIS. We also compare FUSE to state-of-the-art graph clustering methods with GO term enrichment by constructing the biological process landscape of the PPIS. Using AD network as our case study, we further demonstrate the ability of FUSE to quickly summarize the network and identify many different processes and complexes that regulate it. Finally, we study the higher-order connectivity of the human PPI.
By simultaneously evaluating interaction and annotation data, FUSE abstracts higher-order interaction maps by reducing the details of the underlying PPI to form a functional summary graph of interconnected functional clusters. Our results demonstrate its effectiveness and superiority over state-of-the-art graph clustering methods with GO term enrichment.
Non-synonymous coding SNPs (nsSNPs) that are associated to disease can also be related with alterations in protein stability. Computational methods are available to predict the effect of single amino acid substitutions (SASs) on protein stability based on a single folded structure. However, the native state of a protein is not unique and it is better represented by the ensemble of its conformers in dynamic equilibrium. The maintenance of the ensemble is essential for protein function. In this work we investigated how protein conformational diversity can affect the discrimination of neutral and disease related SASs based on protein stability estimations. For this purpose, we used 119 proteins with 803 associated SASs, 60% of which are disease related. Each protein was associated with its corresponding set of available conformers as found in the Protein Conformational Database (PCDB). Our dataset contains proteins with different extensions of conformational diversity summing up a total number of 1023 conformers.
The existence of different conformers for a given protein introduces great variability in the estimation of the protein stability (ΔΔG) after a single amino acid substitution (SAS) as computed with FoldX. Indeed, in 35% of our protein set at least one SAS can be described as stabilizing, destabilizing or neutral when a cutoff value of ±2 kcal/mol is adopted for discriminating neutral from perturbing SASs. However, when the ΔΔG variability among conformers is taken into account, the correlation among the perturbation of protein stability and the corresponding disease or neutral phenotype increases as compared with the same analysis on single protein structures. At the conformer level, we also found that the different conformers correlate in a different way to the corresponding phenotype.
Our results suggest that the consideration of conformational diversity can improve the discrimination of neutral and disease related protein SASs based on the evaluation of the corresponding Gibbs free energy change.
The development of accurate protein-protein docking programs is making this kind of simulations an effective tool to predict the 3D structure and the surface of interaction between the molecular partners in macromolecular complexes. However, correctly scoring multiple docking solutions is still an open problem. As a consequence, the accurate and tedious screening of many docking models is usually required in the analysis step.
All the programs under CONS-COCOMAPS have been written in python, taking advantage of python libraries such as SciPy and Matplotlib. CONS-COCOMAPS is freely available as a web tool at the URL:
Here we presented CONS-COCOMAPS, a novel tool to easily measure and visualize the consensus in multiple docking solutions. CONS-COCOMAPS uses the conservation of inter-residue contacts as an estimate of the similarity between different docking solutions. To visualize the conservation, CONS-COCOMAPS uses intermolecular contact maps.
The application of CONS-COCOMAPS to test-cases taken from recent CAPRI rounds has shown that it is very efficient in highlighting even a very weak consensus that often is biologically meaningful.
Summary: PyRosetta is a stand-alone Python-based implementation of the Rosetta molecular modeling package that allows users to write custom structure prediction and design algorithms using the major Rosetta sampling and scoring functions. PyRosetta contains Python bindings to libraries that define Rosetta functions including those for accessing and manipulating protein structure, calculating energies and running Monte Carlo-based simulations. PyRosetta can be used in two ways: (i) interactively, using iPython and (ii) script-based, using Python scripting. Interactive mode contains a number of help features and is ideal for beginners while script-mode is best suited for algorithm development. PyRosetta has similar computational performance to Rosetta, can be easily scaled up for cluster applications and has been implemented for algorithms demonstrating protein docking, protein folding, loop modeling and design.
Availability: PyRosetta is a stand-alone package available at http://www.pyrosetta.org under the Rosetta license which is free for academic and non-profit users. A tutorial, user's manual and sample scripts demonstrating usage are also available on the web site.