Search tips
Search criteria

Results 1-25 (767641)

Clipboard (0)

Related Articles

1.  Features and development of Coot  
Coot is a molecular-graphics program designed to assist in the building of protein and other macromolecular models. The current state of development and available features are presented.
Coot is a molecular-graphics application for model building and validation of biological macromolecules. The program displays electron-density maps and atomic models and allows model manipulations such as idealization, real-space refinement, manual rotation/translation, rigid-body fitting, ligand search, solvation, mutations, rotamers and Ramachandran idealization. Furthermore, tools are provided for model validation as well as interfaces to external programs for refinement, validation and graphics. The software is designed to be easy to learn for novice users, which is achieved by ensuring that tools for common tasks are ‘discoverable’ through familiar user-interface elements (menus and toolbars) or by intuitive behaviour (mouse controls). Recent developments have focused on providing tools for expert users, with customisable key bindings, extensions and an extensive scripting interface. The software is under rapid development, but has already achieved very widespread use within the crystallographic community. The current state of the software is presented, with a description of the facilities available and of some of the underlying methods employed.
PMCID: PMC2852313  PMID: 20383002
Coot; model building
2.  Tools for macromolecular model building and refinement into electron cryo-microscopy reconstructions 
A description is given of new tools to facilitate model building and refinement into electron cryo-microscopy reconstructions.
The recent rapid development of single-particle electron cryo-microscopy (cryo-EM) now allows structures to be solved by this method at resolutions close to 3 Å. Here, a number of tools to facilitate the interpretation of EM reconstructions with stereochemically reasonable all-atom models are described. The BALBES database has been repurposed as a tool for identifying protein folds from density maps. Modifications to Coot, including new Jiggle Fit and morphing tools and improved handling of nucleic acids, enhance its functionality for interpreting EM maps. REFMAC has been modified for optimal fitting of atomic models into EM maps. As external structural information can enhance the reliability of the derived atomic models, stabilize refinement and reduce overfitting, ProSMART has been extended to generate interatomic distance restraints from nucleic acid reference structures, and a new tool, LIBG, has been developed to generate nucleic acid base-pair and parallel-plane restraints. Furthermore, restraint generation has been integrated with visualization and editing in Coot, and these restraints have been applied to both real-space refinement in Coot and reciprocal-space refinement in REFMAC.
PMCID: PMC4304694  PMID: 25615868
model building; refinement;  electron cryo-microscopy reconstructions; LIBG
3.  RCrane: semi-automated RNA model building 
RCrane is a new tool for the partially automated building of RNA crystallographic models into electron-density maps of low or intermediate resolution. This tool helps crystallographers to place phosphates and bases into electron density and then automatically predicts and builds the detailed all-atom structure of the traced nucleotides.
RNA crystals typically diffract to much lower resolutions than protein crystals. This low-resolution diffraction results in unclear density maps, which cause considerable difficulties during the model-building process. These difficulties are exacerbated by the lack of computational tools for RNA modeling. Here, RCrane, a tool for the partially automated building of RNA into electron-density maps of low or intermediate resolution, is presented. This tool works within Coot, a common program for macromolecular model building. RCrane helps crystallographers to place phosphates and bases into electron density and then automatically predicts and builds the detailed all-atom structure of the traced nucleotides. RCrane then allows the crystallographer to review the newly built structure and select alternative backbone conformations where desired. This tool can also be used to automatically correct the backbone structure of previously built nucleotides. These automated corrections can fix incorrect sugar puckers, steric clashes and other structural problems.
PMCID: PMC3413212  PMID: 22868764
RCrane; RNA model building
4.  The PDB_REDO server for macromolecular structure model optimization 
IUCrJ  2014;1(Pt 4):213-220.
The PDB_REDO pipeline aims to improve macromolecular structures by optimizing the crystallographic refinement parameters and performing partial model building. Here, algorithms are presented that allowed a web-server implementation of PDB_REDO, and the first user results are discussed.
The refinement and validation of a crystallographic structure model is the last step before the coordinates and the associated data are submitted to the Protein Data Bank (PDB). The success of the refinement procedure is typically assessed by validating the models against geometrical criteria and the diffraction data, and is an important step in ensuring the quality of the PDB public archive [Read et al. (2011 ▶), Structure, 19, 1395–1412]. The PDB_REDO procedure aims for ‘constructive validation’, aspiring to consistent and optimal refinement parameterization and pro-active model rebuilding, not only correcting errors but striving for optimal interpretation of the electron density. A web server for PDB_REDO has been implemented, allowing thorough, consistent and fully automated optimization of the refinement procedure in REFMAC and partial model rebuilding. The goal of the web server is to help practicing crystallo­graphers to improve their model prior to submission to the PDB. For this, additional steps were implemented in the PDB_REDO pipeline, both in the refinement procedure, e.g. testing of resolution limits and k-fold cross-validation for small test sets, and as new validation criteria, e.g. the density-fit metrics implemented in EDSTATS and ligand validation as implemented in YASARA. Innovative ways to present the refinement and validation results to the user are also described, which together with auto-generated Coot scripts can guide users to subsequent model inspection and improvement. It is demonstrated that using the server can lead to substantial improvement of structure models before they are submitted to the PDB.
PMCID: PMC4107921  PMID: 25075342
PDB_REDO; validation; model optimization
5.  Automated nucleic acid chain tracing in real time 
IUCrJ  2014;1(Pt 6):387-392.
A method is presented for the automatic building of nucleotide chains into electron density which is fast enough to be used in interactive model-building software. Likely nucleotides lying in the vicinity of the current view are located and then grown into connected chains in a fraction of a second. When this development is combined with existing tools, assisted manual model building is as simple as or simpler than for proteins.
The crystallographic structure solution of nucleotides and nucleotide complexes is now commonplace. The resulting electron-density maps are often poorer than for proteins, and as a result interpretation in terms of an atomic model can require significant effort, particularly in the case of large structures. While model building can be performed automatically, as with proteins, the process is time-consuming, taking minutes to days depending on the software and the size of the structure. A method is presented for the automatic building of nucleotide chains into electron density which is fast enough to be used in interactive model-building software, with extended chain fragments built around the current view position in a fraction of a second. The speed of the method arises from the determination of the ‘fingerprint’ of the sugar and phosphate groups in terms of conserved high-density and low-density features, coupled with a highly efficient scoring algorithm. Use cases include the rapid evaluation of an initial electron-density map, addition of nucleotide fragments to prebuilt protein structures, and in favourable cases the completion of the structure while automated model-building software is still running. The method has been incorporated into the Coot software package.
PMCID: PMC4224457  PMID: 25485119
nucleic acid chain tracing; Coot
6.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology.This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases.These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies.The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins render this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed as a target to a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
7.  Using a commodity high-definition television for collaborative structural biology 
Journal of Applied Crystallography  2014;47(Pt 3):1153-1157.
A method exploiting off-the-rack television and game controllers for use by crystallographers and their collaborators for enhanced experience of co-located discussions is presented.
Visualization of protein structures using stereoscopic systems is frequently needed by structural biologists working to understand a protein’s structure–function relationships. Often several scientists are working as a team and need simultaneous interaction with each other and the graphics representations. Most existing molecular visualization tools support single-user tasks, which are not suitable for a collaborative group. Expensive caves, domes or geowalls have been developed, but the availability and low cost of high-definition televisions (HDTVs) and game controllers in the commodity entertainment market provide an economically attractive option to achieve a collaborative environment. This paper describes a low-cost environment, using standard consumer game controllers and commercially available stereoscopic HDTV monitors with appropriate signal converters for structural biology collaborations employing existing binary distributions of commonly used software packages like Coot, PyMOL, Chimera, VMD, O, Olex2 and others.
PMCID: PMC4038803  PMID: 24904249
collaborative structural biology; protein structure visualization
8.  Automated macromolecular model building for X-ray crystallography using ARP/wARP version 7 
Nature protocols  2008;3(7):1171-1179.
ARP/wARP is a software suite to build macromolecular models in X-ray crystallography electron density maps. Structural genomics initiatives and the study of complex macromolecular assemblies and membrane proteins all rely on advanced methods for 3D structure determination. ARP/wARP meets these needs by providing the tools to obtain a macromolecular model automatically, with a reproducible computational procedure. ARP/wARP 7.0 tackles several tasks: iterative protein model building including a high-level decision-making control module; fast construction of the secondary structure of a protein; building flexible loops in alternate conformations; fully automated placement of ligands, including a choice of the best fitting ligand from a “cocktail”; and finding ordered water molecules. All protocols are easy to handle by a non-expert user through a graphical user interface or a command line. The time required is typically a few minutes although iterative model building may take a few hours.
PMCID: PMC2582149  PMID: 18600222
Structural genomics; X-ray crystallography; software; model building; ligand placement
9.  A subgraph isomorphism algorithm and its application to biochemical data 
BMC Bioinformatics  2013;14(Suppl 7):S13.
Graphs can represent biological networks at the molecular, protein, or species level. An important query is to find all matches of a pattern graph to a target graph. Accomplishing this is inherently difficult (NP-complete) and the efficiency of heuristic algorithms for the problem may depend upon the input graphs. The common aim of existing algorithms is to eliminate unsuccessful mappings as early as and as inexpensively as possible.
We propose a new subgraph isomorphism algorithm which applies a search strategy to significantly reduce the search space without using any complex pruning rules or domain reduction procedures. We compare our method with the most recent and efficient subgraph isomorphism algorithms (VFlib, LAD, and our C++ implementation of FocusSearch which was originally distributed in Modula2) on synthetic, molecules, and interaction networks data. We show a significant reduction in the running time of our approach compared with these other excellent methods and show that our algorithm scales well as memory demands increase.
Subgraph isomorphism algorithms are intensively used by biochemical tools. Our analysis gives a comprehensive comparison of different software approaches to subgraph isomorphism highlighting their weaknesses and strengths. This will help researchers make a rational choice among methods depending on their application. We also distribute an open-source package including our system and our own C++ implementation of FocusSearch together with all the used datasets ( In future work, our findings may be extended to approximate subgraph isomorphism algorithms.
PMCID: PMC3633016  PMID: 23815292
Subgraph isomorphism algorithms; biochemical graph data; search strategies; algorithms comparisons and distributions
10.  The ALARM Monitor and the Bone-Marrow Transplant Therapy Advisor: A Demonstration of Two Probabilistic Expert Systems in KNET 
ALARM (A Logical Alarm Reduction Mechanism) is a diagnostic application used to explore probabilistic reasoning techniques in belief networks. ALARM implements an alarm message system for patient monitoring; it calculates probabilities for a differential diagnosis based on available evidence [1]. The medical knowledge is encoded in a graphical structure connecting 8 diagnoses, 16 findings and 13 intermediate variables.
The goal of the ALARM monitoring system is to provide specific text messages advising the user of possible problems. This is a diagnostic task, and we have chosen to represent the relevant knowledge in the language of a belief network. This graphical representation [6] facilitates the integration of qualitative and quantitative knowledge, the assessment of multiple faults, as required by our domain, and nonmonotonic and bidirectional reasoning.
We have also created a belief network, the Bone-Marrow Transplant Therapy Advisor, that represents prognostic factors and their effects on possible outcomes of a bone-marrow transplant. For pediatric patients in the advanced stages of acute lymphoblastic leukemia (ALL), bone-marrow transplantation is generally considered the most promising therapy. For the patient and parents, the decision to proceed with transplantation is often difficult. Morbidity after transplantation is usually severe, and a significant percentage of those who receive a bone marrow transplantation die within a year of transplantation [7]. Many factors, however, offer significant insight into the expected outcome of marrow transplantation. A few examples of such prognostic factors include the white blood count at diagnosis, the age at diagnosis, the number of recurrence episodes before transplantation, and the quality of the match with the marrow donor. Some of those factors indicate the progress of the disease, whereas others define sensitivity to the chemotherapeutic conditioning regimee or the likelihood of Graft-versus-Host Disease (GvHD).
Within the discipline of medical informatics, many researchers have studied methodologies for encoding the knowledge of expert clinicians as computational artifacts. KNET, the support software for ALARM and the bone-marrow transplant advisor, is a general-purpose environment for constructing probabilistic, knowledge-intensive systems based on belief networks and decision networks [2]. KNET differs from other tools for expert-system construction in that it combines a direct-manipulation visual interface with a normative, probabilistic scheme for the management of uncertain information and inference. The KNET architecture defines a complete separation between the hypermedia user interface on the one hand, and the representation and management of expert opinion on the other.
In our laboratory, we and others have used KNET to build not only the ALARM and bone-marrow transplant systems, but also consultation programs for lymph-node pathology and clinical epidemiology [2,4]. KNET imposes few restrictions on the interface design. Indeed, we have rapidly prototyped several direct-manipulation interfaces that use graphics, buttons, menus, text, and icons to organize the display of static and inferred knowledge. The underlying normative representation of knowledge remains constant.
We present ALARM and the transplant therapy advisor as part of a suite of probabilistic, knowledge-intensive medical expert systems. Such systems
• Manage large quantities of extensively cross-referenced information
• Emphasize clarity in acquiring, storing, and displaying expert knowledge
• Incorporate tools for building hypertext user interfaces
• Impose a limited number of constraints on the knowledge engineer's design choices
• Share an axiomatic grounding for diagnosis and decision-making in probability theory and utility theory
• Make normatively correct decisions and diagnoses in the face of uncertain, incomplete, and contradictory information
• Draw inferences from knowledge bases large enough to model significant, real-world medical domains, and do so in polynomial time on low-cost hardware
In this demonstration, we show how ALARM and the therapy advisor synthesize physiologic measurements and prognostic indicators into a diagnostic conclusion according to a belief-network model of the domains. We demonstrate KNET's hypertext interface and the transparent integration of probabilistic reasoning into a diagnostic application. KNET runs on any Macintosh II personal computer with at least 4 megabytes of random-access memory. The authors will provide all the necessary software on a SCSI hard disk. KNET fully supports color and monochrome monitors of any size, and requires no special hardware. We prefer, but do not require, a large color monitor, which demonstrates the capabilities of KNET to greatest advantage.
PMCID: PMC2245721
11.  ETE: a python Environment for Tree Exploration 
BMC Bioinformatics  2010;11:24.
Many bioinformatics analyses, ranging from gene clustering to phylogenetics, produce hierarchical trees as their main result. These are used to represent the relationships among different biological entities, thus facilitating their analysis and interpretation. A number of standalone programs are available that focus on tree visualization or that perform specific analyses on them. However, such applications are rarely suitable for large-scale surveys, in which a higher level of automation is required. Currently, many genome-wide analyses rely on tree-like data representation and hence there is a growing need for scalable tools to handle tree structures at large scale.
Here we present the Environment for Tree Exploration (ETE), a python programming toolkit that assists in the automated manipulation, analysis and visualization of hierarchical trees. ETE libraries provide a broad set of tree handling options as well as specific methods to analyze phylogenetic and clustering trees. Among other features, ETE allows for the independent analysis of tree partitions, has support for the extended newick format, provides an integrated node annotation system and permits to link trees to external data such as multiple sequence alignments or numerical arrays. In addition, ETE implements a number of built-in analytical tools, including phylogeny-based orthology prediction and cluster validation techniques. Finally, ETE's programmable tree drawing engine can be used to automate the graphical rendering of trees with customized node-specific visualizations.
ETE provides a complete set of methods to manipulate tree data structures that extends current functionality in other bioinformatic toolkits of a more general purpose. ETE is free software and can be downloaded from
PMCID: PMC2820433  PMID: 20070885
12.  A graph-theory method for pattern identification in geographical epidemiology – a preliminary application to deprivation and mortality 
Graph theoretical methods are extensively used in the field of computational chemistry to search datasets of compounds to see if they contain particular molecular sub-structures or patterns. We describe a preliminary application of a graph theoretical method, developed in computational chemistry, to geographical epidemiology in relation to testing a prior hypothesis. We tested the methodology on the hypothesis that if a socioeconomically deprived neighbourhood is situated in a wider deprived area, then that neighbourhood would experience greater adverse effects on mortality compared with a similarly deprived neighbourhood which is situated in a wider area with generally less deprivation.
We used the Trent Region Health Authority area for this study, which contained 10,665 census enumeration districts (CED). Graphs are mathematical representations of objects and their relationships and within the context of this study, nodes represented CEDs and edges were determined by whether or not CEDs were neighbours (shared a common boundary). The overall area in this study was represented by one large graph comprising all CEDs in the region, along with their adjacency information. We used mortality data from 1988–1998, CED level population estimates and the Townsend Material Deprivation Index as an indicator of neighbourhood level deprivation. We defined deprived CEDs as those in the top 20% most deprived in the Region. We then set out to classify these deprived CEDs into seven groups defined by increasing deprivation levels in the neighbouring CEDs. 506 (24.2%) of the deprived CEDs had five adjacent CEDs and we limited pattern development and searching to these CEDs. We developed seven query patterns and used the RASCAL (Rapid Similarity Calculator) program to carry out the search for each of the query patterns. This program used a maximum common subgraph isomorphism method which was modified to handle geographical data.
Of the 506 deprived CEDs, 10 were not identified as belonging to any of the seven groups because they were adjacent to a CED with a missing deprivation category quintile, and none fell within query Group 1 (a deprived CED for which all five adjacent CEDs were affluent). Only four CEDs fell within Group 2, which was defined as having four affluent adjacent CEDs and one non-affluent adjacent CED. The numbers of CEDs in Groups 3–7 were 17, 214, 95, 81 and 85 respectively. Age and sex adjusted mortality rate ratios showed a non-significant trend towards increasing mortality risk across Groups (Chi-square = 3.26, df = 1, p = 0.07).
Graph theoretical methods developed in computational chemistry may be a useful addition to the current GIS based methods available for geographical epidemiology but further developmental work is required. An important requirement will be the development of methods for specifying multiple complex search patterns. Further work is also required to examine the utility of using distance, as opposed to adjacency, to describe edges in graphs, and to examine methods for pattern specification when the nodes have multiple attributes attached to them.
PMCID: PMC2686691  PMID: 19439082
13.  FReDoWS: a method to automate molecular docking simulations with explicit receptor flexibility and snapshots selection 
BMC Genomics  2011;12(Suppl 4):S6.
In silico molecular docking is an essential step in modern drug discovery when driven by a well defined macromolecular target. Hence, the process is called structure-based or rational drug design (RDD). In the docking step of RDD the macromolecule or receptor is usually considered a rigid body. However, we know from biology that macromolecules such as enzymes and membrane receptors are inherently flexible. Accounting for this flexibility in molecular docking experiments is not trivial. One possibility, which we call a fully-flexible receptor model, is to use a molecular dynamics simulation trajectory of the receptor to simulate its explicit flexibility. To benefit from this concept, which has been known since 2000, it is essential to develop and improve new tools that enable molecular docking simulations of fully-flexible receptor models.
We have developed a Flexible-Receptor Docking Workflow System (FReDoWS) to automate molecular docking simulations using a fully-flexible receptor model. In addition, it includes a snapshot selection feature to facilitate acceleration the virtual screening of ligands for well defined disease targets. FReDoWS usefulness is demonstrated by investigating the docking of four different ligands to flexible models of Mycobacterium tuberculosis’ wild type InhA enzyme and mutants I21V and I16T. We find that all four ligands bind effectively to this receptor as expected from the literature on similar, but wet experiments.
A work that would usually need the manual execution of many computer programs, and the manipulation of thousands of files, was efficiently and automatically performed by FReDoWS. Its friendly interface allows the user to change the docking and execution parameters. Besides, the snapshot selection feature allowed the acceleration of docking simulations. We expect FReDoWS to help us explore more of the role flexibility plays in receptor-ligand interactions. FReDoWS can be made available upon request to the authors.
PMCID: PMC3287589  PMID: 22369186
14.  Application of DEN refinement and automated model building to a difficult case of molecular-replacement phasing: the structure of a putative succinyl-diaminopimelate desuccinylase from Corynebacterium glutamicum  
DEN refinement and automated model building with AutoBuild were used to determine the structure of a putative succinyl-diaminopimelate desuccinylase from C. glutamicum. This difficult case of molecular-replacement phasing shows that the synergism between DEN refinement and AutoBuild outperforms standard refinement protocols.
Phasing by molecular replacement remains difficult for targets that are far from the search model or in situations where the crystal diffracts only weakly or to low resolution. Here, the process of determining and refining the structure of Cgl1109, a putative succinyl-diaminopimelate desuccinylase from Corynebacterium glutamicum, at ∼3 Å resolution is described using a combination of homology modeling with MODELLER, molecular-replacement phasing with Phaser, deformable elastic network (DEN) refinement and automated model building using AutoBuild in a semi-automated fashion, followed by final refinement cycles with phenix.refine and Coot. This difficult molecular-replacement case illustrates the power of including DEN restraints derived from a starting model to guide the movements of the model during refinement. The resulting improved model phases provide better starting points for automated model building and produce more significant difference peaks in anomalous difference Fourier maps to locate anomalous scatterers than does standard refinement. This example also illustrates a current limitation of automated procedures that require manual adjustment of local sequence misalignments between the homology model and the target sequence.
PMCID: PMC3322598  PMID: 22505259
reciprocal-space refinement; DEN refinement; real-space refinement; automated model building; succinyl-diaminopimelate desuccinylase
15.  Small Molecule Subgraph Detector (SMSD) toolkit 
Finding one small molecule (query) in a large target library is a challenging task in computational chemistry. Although several heuristic approaches are available using fragment-based chemical similarity searches, they fail to identify exact atom-bond equivalence between the query and target molecules and thus cannot be applied to complex chemical similarity searches, such as searching a complete or partial metabolic pathway.
In this paper we present a new Maximum Common Subgraph (MCS) tool: SMSD (Small Molecule Subgraph Detector) to overcome the issues with current heuristic approaches to small molecule similarity searches. The MCS search implemented in SMSD incorporates chemical knowledge (atom type match with bond sensitive and insensitive information) while searching molecular similarity. We also propose a novel method by which solutions obtained by each MCS run can be ranked using chemical filters such as stereochemistry, bond energy, etc.
In order to benchmark and test the tool, we performed a 50,000 pair-wise comparison between KEGG ligands and PDB HET Group atoms. In both cases the SMSD was shown to be more efficient than the widely used MCS module implemented in the Chemistry Development Kit (CDK) in generating MCS solutions from our test cases.
Presently this tool can be applied to various areas of bioinformatics and chemo-informatics for finding exhaustive MCS matches. For example, it can be used to analyse metabolic networks by mapping the atoms between reactants and products involved in reactions. It can also be used to detect the MCS/substructure searches in small molecules reported by metabolome experiments, as well as in the screening of drug-like compounds with similar substructures.
Thus, we present a robust tool that can be used for multiple applications, including the discovery of new drug molecules. This tool is freely available on
PMCID: PMC2820491  PMID: 20298518
16.  Conformation-independent structural comparison of macromolecules with ProSMART  
The Procrustes Structural Matching Alignment and Restraints Tool (ProSMART) has been developed to allow local comparative structural analyses independent of the global conformations and sequence homology of the compared macromolecules. This allows quick and intuitive visualization of the conservation of backbone and side-chain conformations, providing complementary information to existing methods.
The identification and exploration of (dis)similarities between macromolecular structures can help to gain biological insight, for instance when visualizing or quantifying the response of a protein to ligand binding. Obtaining a residue alignment between compared structures is often a prerequisite for such comparative analysis. If the conformational change of the protein is dramatic, conventional alignment methods may struggle to provide an intuitive solution for straightforward analysis. To make such analyses more accessible, the Procrustes Structural Matching Alignment and Restraints Tool (ProSMART) has been developed, which achieves a conformation-independent structural alignment, as well as providing such additional functionalities as the generation of restraints for use in the refinement of macromolecular models. Sensible comparison of protein (or DNA/RNA) structures in the presence of conformational changes is achieved by enforcing neither chain nor domain rigidity. The visualization of results is facilitated by popular molecular-graphics software such as CCP4mg and PyMOL, providing intuitive feedback regarding structural conservation and subtle dissimilarities between close homologues that can otherwise be hard to identify. Automatically generated colour schemes corresponding to various residue-based scores are provided, which allow the assessment of the conservation of backbone and side-chain conformations relative to the local coordinate frame. Structural comparison tools such as ProSMART can help to break the complexity that accompanies the constantly growing pool of structural data into a more readily accessible form, potentially offering biological insight or influencing subsequent experiments.
PMCID: PMC4157452  PMID: 25195761
ProSMART; Procrustes; structural comparison; alignment; external restraints; refinement
17.  The Phenix Software for Automated Determination of Macromolecular Structures 
Methods (San Diego, Calif.)  2011;55(1):94-106.
X-ray crystallography is a critical tool in the study of biological systems. It is able to provide information that has been a prerequisite to understanding the fundamentals of life. It is also a method that is central to the development of new therapeutics for human disease. Significant time and effort are required to determine and optimize many macromolecular structures because of the need for manual interpretation of complex numerical data, often using many different software packages, and the repeated use of interactive three-dimensional graphics. The Phenix software package has been developed to provide a comprehensive system for macromolecular crystallographic structure solution with an emphasis on automation. This has required the development of new algorithms that minimize or eliminate subjective input in favour of built-in expert-systems knowledge, the automation of procedures that are traditionally performed by hand, and the development of a computational framework that allows a tight integration between the algorithms. The application of automated methods is particularly appropriate in the field of structural proteomics, where high throughput is desired. Features in Phenix for the automation of experimental phasing with subsequent model building, molecular replacement, structure refinement and validation are described and examples given of running Phenix from both the command line and graphical user interface.
PMCID: PMC3193589  PMID: 21821126
Macromolecular Crystallography; Automation; Phenix; X-ray; Diffraction; Python
18.  An novel frequent probability pattern mining algorithm based on circuit simulation method in uncertain biological networks 
BMC Systems Biology  2014;8(Suppl 3):S6.
Motif mining has always been a hot research topic in bioinformatics. Most of current research on biological networks focuses on exact motif mining. However, due to the inevitable experimental error and noisy data, biological network data represented as the probability model could better reflect the authenticity and biological significance, therefore, it is more biological meaningful to discover probability motif in uncertain biological networks. One of the key steps in probability motif mining is frequent pattern discovery which is usually based on the possible world model having a relatively high computational complexity.
In this paper, we present a novel method for detecting frequent probability patterns based on circuit simulation in the uncertain biological networks. First, the partition based efficient search is applied to the non-tree like subgraph mining where the probability of occurrence in random networks is small. Then, an algorithm of probability isomorphic based on circuit simulation is proposed. The probability isomorphic combines the analysis of circuit topology structure with related physical properties of voltage in order to evaluate the probability isomorphism between probability subgraphs. The circuit simulation based probability isomorphic can avoid using traditional possible world model. Finally, based on the algorithm of probability subgraph isomorphism, two-step hierarchical clustering method is used to cluster subgraphs, and discover frequent probability patterns from the clusters.
The experiment results on data sets of the Protein-Protein Interaction (PPI) networks and the transcriptional regulatory networks of E. coli and S. cerevisiae show that the proposed method can efficiently discover the frequent probability subgraphs. The discovered subgraphs in our study contain all probability motifs reported in the experiments published in other related papers.
The algorithm of probability graph isomorphism evaluation based on circuit simulation method excludes most of subgraphs which are not probability isomorphism and reduces the search space of the probability isomorphism subgraphs using the mismatch values in the node voltage set. It is an innovative way to find the frequent probability patterns, which can be efficiently applied to probability motif discovery problems in the further studies.
PMCID: PMC4243085  PMID: 25350277
19.  Cryo-EM of macromolecular assemblies at near-atomic resolution 
Nature protocols  2010;5(10):1697-1708.
With single-particle electron cryomicroscopy (cryo-eM), it is possible to visualize large, macromolecular assemblies in near-native states. although subnanometer resolutions have been routinely achieved for many specimens, state of the art cryo-eM has pushed to near-atomic (3.3–4.6 Å) resolutions. at these resolutions, it is now possible to construct reliable atomic models directly from the cryo-eM density map. In this study, we describe our recently developed protocols for performing the three-dimensional reconstruction and modeling of Mm-cpn, a group II chaperonin, determined to 4.3 Å resolution. this protocol, utilizing the software tools eMan, Gorgon and coot, can be adapted for use with nearly all specimens imaged with cryo-eM that target beyond 5 Å resolution. additionally, the feature recognition and computational modeling tools can be applied to any near-atomic resolution density maps, including those from X-ray crystallography.
PMCID: PMC3107675  PMID: 20885381
20.  Autofix for backward-fit sidechains: using MolProbity and real-space refinement to put misfits in their place 
Misfit sidechains in protein crystal structures are a stumbling block in using those structures to direct further scientific inference. Problems due to surface disorder and poor electron density are very difficult to address, but a large class of systematic errors are quite common even in well-ordered regions, resulting in sidechains fit backwards into local density in predictable ways. The MolProbity web site is effective at diagnosing such errors, and can perform reliable automated correction of a few special cases such as 180° flips of Asn or Gln sidechain amides, using all-atom contacts and H-bond networks. However, most at-risk residues involve tetrahedral geometry, and their valid correction requires rigorous evaluation of sidechain movement and sometimes backbone shift. The current work extends the benefits of robust automated correction to more sidechain types. The Autofix method identifies candidate systematic, flipped-over errors in Leu, Thr, Val, and Arg using MolProbity quality statistics, proposes a corrected position using real-space refinement with rotamer selection in Coot, and accepts or rejects the correction based on improvement in MolProbity criteria and on χ angle change. Criteria are chosen conservatively, after examining many individual results, to ensure valid correction. To test this method, Autofix was run and analyzed for 945 representative PDB files and on the 50S ribosomal subunit of file 1YHQ. Over 40% of Leu, Val, and Thr outliers and 15% of Arg outliers were successfully corrected, resulting in a total of 3,679 corrected sidechains, or 4 per structure on average. Summary Sentences: A common class of misfit sidechains in protein crystal structures is due to systematic errors that place the sidechain backwards into the local electron density. A fully automated method called “Autofix” identifies such errors for Leu, Val, Thr, and Arg and corrects over one third of them, using MolProbity validation criteria and Coot real-space refinement of rotamers.
Electronic supplementary material
The online version of this article (doi:10.1007/s10969-008-9045-8) contains supplementary material, which is available to authorized users.
PMCID: PMC2704614  PMID: 19002604
Automation; Structure improvement; Crystallography; Sidechain rotamers; Protein/RNA interactions
21.  Maximum common subgraph: some upper bound and lower bound results 
BMC Bioinformatics  2006;7(Suppl 4):S6.
Structure matching plays an important part in understanding the functional role of biological structures. Bioinformatics assists in this effort by reformulating this process into a problem of finding a maximum common subgraph between graphical representations of these structures. Among the many different variants of the maximum common subgraph problem, the maximum common induced subgraph of two graphs is of special interest.
Based on current research in the area of parameterized computation, we derive a new lower bound for the exact algorithms of the maximum common induced subgraph of two graphs which is the best currently known. Then we investigate the upper bound and design techniques for approaching this problem, specifically, reducing it to one of finding a maximum clique in the product graph of the two given graphs. Considering the upper bound result, the derived lower bound result is asymptotically tight.
Parameterized computation is a viable approach with great potential for investigating many applications within bioinformatics, such as the maximum common subgraph problem studied in this paper. With an improved hardness result and the proposed approaches in this paper, future research can be focused on further exploration of efficient approaches for different variants of this problem within the constraints imposed by real applications.
PMCID: PMC1780128  PMID: 17217524
22.  Study of Uremic Toxin Fluxes Across Nanofabricated Hemodialysis Membranes Using Irreversible Thermodynamics 
The flux of uremic toxin middle molecules through currently used hemodialysis membranes is suboptimal, mainly because of the membranes’ pore architecture.
Identifying the modifiable sieving parameters that can be improved by nanotechnology to enhance fluxes of uremic toxins across the walls of dialyzers’ capillaries.
We determined the maximal dimensions of endothelin, cystatin C, and interleukin – 6 using the macromolecular modeling software, COOT. We also applied the expanded Nernst-Plank equation to calculate the changes in the overall flux as a function of increased electro-migration and pH of the respective molecules.
In a high flux hemodialyzer, the effective diffusivities of endothelin, cystatin C, and interleukin – 6 are 15.00 x 10-10 cm2/s, 7.7 x 10-10 cm2/s, and 5.4 x 10-10 cm2/s, respectively, through the capillaries’ walls. In a nanofabricated membrane, the effective diffusivities of endothelin, cystatin C, and interleukin – 6 are 13.87 x 10-7 cm2/s, 5.73 x 10-7 cm2/s, and 3.45 x 10-7 cm2/s, respectively, through a nanofabricated membrane. Theoretical modeling showed that a 96% reduction in the membrane's thickness and the application of an electric potential of 10 mV across the membrane could enhance the flux of endothelin, cystatin C, and interleukin - 6 by a factor of 25. A ΔpH of 0.07 altered the fluxes minimally.
Nanofabricated hemodialysis membranes with a reduced thickness and an applied electric potential can enhance the effective diffusivity and electro-migration flux of the respective uremic toxins by 3 orders of magnitude as compared to those passing through the high flux hemodialyzer.
PMCID: PMC3962091  PMID: 24688713
uremic toxins; nanofabrication; hemodialysis membranes; endothelin; cystatin C; interleukin-6
23.  GenoLink: a graph-based querying and browsing system for investigating the function of genes and proteins 
BMC Bioinformatics  2006;7:21.
A large variety of biological data can be represented by graphs. These graphs can be constructed from heterogeneous data coming from genomic and post-genomic technologies, but there is still need for tools aiming at exploring and analysing such graphs. This paper describes GenoLink, a software platform for the graphical querying and exploration of graphs.
GenoLink provides a generic framework for representing and querying data graphs. This framework provides a graph data structure, a graph query engine, allowing to retrieve sub-graphs from the entire data graph, and several graphical interfaces to express such queries and to further explore their results. A query consists in a graph pattern with constraints attached to the vertices and edges. A query result is the set of all sub-graphs of the entire data graph that are isomorphic to the pattern and satisfy the constraints. The graph data structure does not rely upon any particular data model but can dynamically accommodate for any user-supplied data model. However, for genomic and post-genomic applications, we provide a default data model and several parsers for the most popular data sources. GenoLink does not require any programming skill since all operations on graphs and the analysis of the results can be carried out graphically through several dedicated graphical interfaces.
GenoLink is a generic and interactive tool allowing biologists to graphically explore various sources of information. GenoLink is distributed either as a standalone application or as a component of the Genostar/Iogma platform. Both distributions are free for academic research and teaching purposes and can be requested at A commercial licence form can be obtained for profit company at See also .
PMCID: PMC1382257  PMID: 16417636
24.  A chemogenomics view on protein-ligand spaces 
BMC Bioinformatics  2009;10(Suppl 6):S13.
Chemogenomics is an emerging inter-disciplinary approach to drug discovery that combines traditional ligand-based approaches with biological information on drug targets and lies at the interface of chemistry, biology and informatics. The ultimate goal in chemogenomics is to understand molecular recognition between all possible ligands and all possible drug targets. Protein and ligand space have previously been studied as separate entities, but chemogenomics studies deal with large datasets that cover parts of the joint protein-ligand space. Since drug discovery has traditionally focused on ligand optimization, the chemical space has been studied extensively. The protein space has been studied to some extent, typically for the purpose of classification of proteins into functional and structural classes. Since chemogenomics deals not only with ligands but also with the macromolecules the ligands interact with, it is of interest to find means to explore, compare and visualize protein-ligand subspaces.
Two chemogenomics protein-ligand interaction datasets were prepared for this study. The first dataset covers the known structural protein-ligand space, and includes all non-redundant protein-ligand interactions found in the worldwide Protein Data Bank (PDB). The second dataset contains all approved drugs and drug targets stored in the DrugBank database, and represents the approved drug-drug target space. To capture biological and physicochemical features of the chemogenomics datasets, sequence-based descriptors were computed for the proteins, and 0, 1 and 2 dimensional descriptors for the ligands. Principal component analysis (PCA) was used to analyze the multidimensional data and to create global models of protein-ligand space. The nearest neighbour method, computed using the principal components, was used to obtain a measure of overlap between the datasets.
In this study, we present an approach to visualize protein-ligand spaces from a chemogenomics perspective, where both ligand and protein features are taken into account. The method can be applied to any protein-ligand interaction dataset. Here, the approach is applied to analyze the structural protein-ligand space and the protein-ligand space of all approved drugs and their targets. We show that this approach can be used to visualize and compare chemogenomics datasets, and possibly to identify cross-interaction complexes in protein-ligand space.
PMCID: PMC2697636  PMID: 19534738
25.  STSE: Spatio-Temporal Simulation Environment Dedicated to Biology 
BMC Bioinformatics  2011;12:126.
Recently, the availability of high-resolution microscopy together with the advancements in the development of biomarkers as reporters of biomolecular interactions increased the importance of imaging methods in molecular cell biology. These techniques enable the investigation of cellular characteristics like volume, size and geometry as well as volume and geometry of intracellular compartments, and the amount of existing proteins in a spatially resolved manner. Such detailed investigations opened up many new areas of research in the study of spatial, complex and dynamic cellular systems. One of the crucial challenges for the study of such systems is the design of a well stuctured and optimized workflow to provide a systematic and efficient hypothesis verification. Computer Science can efficiently address this task by providing software that facilitates handling, analysis, and evaluation of biological data to the benefit of experimenters and modelers.
The Spatio-Temporal Simulation Environment (STSE) is a set of open-source tools provided to conduct spatio-temporal simulations in discrete structures based on microscopy images. The framework contains modules to digitize, represent, analyze, and mathematically model spatial distributions of biochemical species. Graphical user interface (GUI) tools provided with the software enable meshing of the simulation space based on the Voronoi concept. In addition, it supports to automatically acquire spatial information to the mesh from the images based on pixel luminosity (e.g. corresponding to molecular levels from microscopy images). STSE is freely available either as a stand-alone version or included in the linux live distribution Systems Biology Operational Software (SB.OS) and can be downloaded from The Python source code as well as a comprehensive user manual and video tutorials are also offered to the research community. We discuss main concepts of the STSE design and workflow. We demonstrate it's usefulness using the example of a signaling cascade leading to formation of a morphological gradient of Fus3 within the cytoplasm of the mating yeast cell Saccharomyces cerevisiae.
STSE is an efficient and powerful novel platform, designed for computational handling and evaluation of microscopic images. It allows for an uninterrupted workflow including digitization, representation, analysis, and mathematical modeling. By providing the means to relate the simulation to the image data it allows for systematic, image driven model validation or rejection. STSE can be scripted and extended using the Python language. STSE should be considered rather as an API together with workflow guidelines and a collection of GUI tools than a stand alone application. The priority of the project is to provide an easy and intuitive way of extending and customizing software using the Python language.
PMCID: PMC3114743  PMID: 21527030

Results 1-25 (767641)