A method for automated macromolecular main-chain model building is described.
An algorithm for the automated macromolecular model building of polypeptide backbones is described. The procedure is hierarchical. In the initial stages, many overlapping polypeptide fragments are built. In subsequent stages, the fragments are extended and then connected. Identification of the locations of helical and β-strand regions is carried out by FFT-based template matching. Fragment libraries of helices and β-strands from refined protein structures are then positioned at the potential locations of helices and strands and the longest segments that fit the electron-density map are chosen. The helices and strands are then extended using fragment libraries consisting of sequences three amino acids long derived from refined protein structures. The resulting segments of polypeptide chain are then connected by choosing those which overlap at two or more Cα positions. The fully automated procedure has been implemented in RESOLVE and is capable of model building at resolutions as low as 3.5 Å. The algorithm is useful for building a preliminary main-chain model that can serve as a basis for refinement and side-chain addition.
model building; template matching; fragment extension
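The final connection step, joining segments that overlap at two or more Cα positions, can be illustrated with a short sketch. This is not the RESOLVE implementation: the function names, the representation of a fragment as a list of Cα coordinates, and the 1.0 Å matching tolerance are all assumptions made for illustration.

```python
import math

def overlap_length(a, b, tol=1.0):
    """Largest k such that the last k CA positions of `a` coincide
    (within `tol` angstroms, a hypothetical cutoff) with the first k of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if all(math.dist(p, q) <= tol for p, q in zip(a[-k:], b[:k])):
            return k
    return 0

def connect(a, b, min_overlap=2, tol=1.0):
    """Join fragment `b` onto the end of `a` if they overlap at
    `min_overlap` or more CA positions; otherwise return None."""
    k = overlap_length(a, b, tol)
    if k < min_overlap:
        return None
    # Keep the coordinates of `a` in the overlap, then append the rest of `b`.
    return a + b[k:]

# Two fragments sharing two CA positions, and one sharing only a single position.
frag_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
frag_b = [(3.9, 0.1, 0.0), (7.5, 0.0, 0.0), (11.4, 0.0, 0.0)]
frag_c = [(7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

joined = connect(frag_a, frag_b)
rejected = connect(frag_a, frag_c)
```

Here `joined` holds four Cα positions, while `rejected` is `None` because a single shared position is below the two-position threshold.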
A number of techniques for the location of small and medium-sized model fragments in experimentally phased electron-density maps are explored. The application of one of these techniques to automated model building is discussed.
Molecular replacement is a powerful tool for the location of large models using structure-factor magnitudes alone. When phase information is available, it becomes possible to locate smaller fragments of the structure ranging in size from a few atoms to a single domain. The calculation is demanding, requiring a six-dimensional rotation and translation search. A number of approaches have been developed to this problem and a selection of these are reviewed in this paper. The application of one of these techniques to the problem of automated model building is explored in more detail, with particular reference to the problem of sequencing a protein main-chain trace.
model fragments; electron-density maps; model building
Conformational sampling is one of the bottlenecks in fragment-based protein structure prediction approaches. They generally start with a coarse-grained optimization where main-chain atoms and centroids of side chains are considered, followed by a fine-grained optimization with an all-atom representation of proteins. It is during this coarse-grained phase that fragment-based methods sample the conformational space most intensely. If the native-like region is sampled more, the accuracy of the final all-atom predictions may be improved accordingly. In this work we present EdaFold, a new method for fragment-based protein structure prediction based on an Estimation of Distribution Algorithm. Fragment-based approaches build protein models by assembling short fragments from known protein structures. Whereas the probability mass functions over the fragment libraries are uniform in the usual case, we propose an algorithm that learns from previously generated decoys and steers the search toward native-like regions. A comparison with the Rosetta AbInitio protocol shows that EdaFold is able to generate models with lower energies and to enhance the percentage of near-native coarse-grained decoys on a benchmark of proteins. The best coarse-grained models produced by both methods were refined into all-atom models and used in molecular replacement. All-atom decoys produced from EdaFold’s decoy set reach high enough accuracy to solve the crystallographic phase problem by molecular replacement for some test proteins. EdaFold showed a higher success rate in molecular replacement when compared to Rosetta. Our study suggests that improving low-resolution coarse-grained decoys allows computational methods to avoid subsequent sampling issues during all-atom refinement and to produce better all-atom models. EdaFold can be downloaded from http://www.riken.jp/zhangiru/software/.
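The idea of replacing uniform probability mass functions with ones learned from earlier decoys can be sketched as follows. This is a generic estimation-of-distribution update, not EdaFold's actual rule: the function names, the `keep_fraction` and `learning_rate` parameters, and the representation of a decoy as one fragment index per sequence position are assumptions.

```python
import random
from collections import Counter

def uniform_pmfs(n_positions, n_fragments):
    """One uniform probability mass function per sequence position."""
    return [[1.0 / n_fragments] * n_fragments for _ in range(n_positions)]

def sample_decoy(pmfs, rng):
    """Draw one fragment index per position according to the current PMFs."""
    return [rng.choices(range(len(p)), weights=p)[0] for p in pmfs]

def update_pmfs(pmfs, decoys, energies, keep_fraction=0.2, learning_rate=0.5):
    """Re-estimate the PMFs from the lowest-energy decoys.

    The new PMF is a blend of the old PMF and the fragment frequencies
    observed in the selected (elite) decoys, so it still sums to one.
    """
    n_keep = max(1, int(len(decoys) * keep_fraction))
    ranked = sorted(zip(energies, decoys), key=lambda pair: pair[0])
    elite = [decoy for _, decoy in ranked[:n_keep]]
    updated = []
    for pos, old in enumerate(pmfs):
        counts = Counter(decoy[pos] for decoy in elite)
        observed = [counts.get(f, 0) / len(elite) for f in range(len(old))]
        updated.append([(1 - learning_rate) * o + learning_rate * e
                        for o, e in zip(old, observed)])
    return updated

# Tiny deterministic example: 2 positions, 3 candidate fragments each.
pmfs = uniform_pmfs(2, 3)
decoys = [[0, 1], [1, 1], [2, 0], [2, 2]]
energies = [1.0, 2.0, 3.0, 4.0]
new_pmfs = update_pmfs(pmfs, decoys, energies, keep_fraction=0.5)
next_decoy = sample_decoy(new_pmfs, random.Random(0))
```

After the update, fragment 1 at the second position carries more probability than before because both low-energy decoys used it, so later sampling rounds are steered toward it.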
The highly automated PHENIX AutoBuild wizard is described. The procedure can be applied equally well to phases derived from isomorphous/anomalous and molecular-replacement methods.
The PHENIX AutoBuild wizard is a highly automated tool for iterative model building, structure refinement and density modification using RESOLVE model building, RESOLVE statistical density modification and phenix.refine structure refinement. Recent advances in the AutoBuild wizard and phenix.refine include automated detection and application of NCS from models as they are built, extensive model-completion algorithms and automated solvent-molecule picking. Model-completion algorithms in the AutoBuild wizard include loop building, crossovers between chains in different models of a structure and side-chain optimization. The AutoBuild wizard has been applied to a set of 48 structures at resolutions ranging from 1.1 to 3.2 Å, resulting in a mean R factor of 0.24 and a mean free R factor of 0.29. The R factor of the final model is dependent on the quality of the starting electron density and is relatively independent of resolution.
model building; model completion; macromolecular models; Protein Data Bank; structure refinement; PHENIX
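The R factor and free R factor reported above follow the standard crystallographic definition, R = Σ||Fobs| − |Fcalc|| / Σ|Fobs|, with the free R factor computed over reflections held out of refinement. A minimal sketch, assuming the amplitudes are already on a common scale (the function name and the sample values are hypothetical):

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R factor for paired structure-factor amplitudes.

    Assumes both lists are on the same scale; real programs also fit a
    scale factor between observed and calculated amplitudes.
    """
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator

# Made-up amplitudes for three reflections.
example_r = r_factor([10.0, 20.0, 30.0], [9.0, 22.0, 27.0])
```

Applying the same function only to the held-out test-set reflections gives the free R factor.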
The exponential growth of research in molecular biology has brought a concomitant proliferation of databases for storing its findings. A variety of protein sequence databases exist. While all of these strive for completeness, the range of user interests is often beyond their scope. Large databases covering a broad range of domains tend to offer less detailed information than smaller, more specialized resources, often creating a need to combine data from many sources in order to obtain a complete picture. Scientific researchers are continually developing new specific databases to enhance their understanding of biological processes.
In this article, we present the implementation of a new tool for protein data analysis. With its easy-to-use user interface, this software provides the opportunity to build more specialized protein databases from a universal protein sequence database such as Swiss-Prot. A family of proteins known as bacteriocins is analyzed as 'proof of concept'.
SciDBMaker is stand-alone software that allows the extraction of protein data from the Swiss-Prot database, sequence analysis comprising physicochemical profile calculations, homologous sequences search, multiple sequence alignments and the building of new and more specialized databases. It compiles information with relative ease, updates and compares various data relevant to a given protein family and could solve the problem of dispersed biological search results.
In the emerging field of RNA-based nanotechnology there is a need for automation of the structure design process. Our goal is to develop computer methods for aiding in this process. Towards that end, we created the RNAJunction database, which is a repository of RNA junctions, i.e. internal, multi-branch and kissing loops with emanating stem stubs, extracted from the larger RNA structures stored in the PDB database. These junctions can be used as building blocks for nanostructures. Two programs developed in our laboratory, NanoTiler and RNA2D3D, can combine such building blocks with idealized fragments of A-form helices to produce desired 3D nanostructures. Initially, the building blocks are treated as rigid objects and the resulting geometry is tested against the design objectives. Experimental data, however, shows that RNA accommodates its shape to the constraints of larger structural contexts. Therefore we are adding analysis of the flexibility of our building blocks to the full design process. Here we present an example of RNA-based nanostructure design, putting emphasis on the need to characterize the structural flexibility of the building blocks to induce ring closure in the automated exploration. We focus on the use of kissing loops (KL) in nanostructure design, since they have been shown to play an important role in RNA self-assembly. By using an experimentally proven system, the RNA tectosquare, we show that considering the flexibility of the KLs as well as distortions of helical regions may be necessary to achieve a realistic design.
RNA; Nanostructure; Design; Modeling; Flexibility; Molecular dynamics
MODBASE is a database of annotated comparative protein structure models for all available protein sequences that can be matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment. MODBASE is updated regularly to reflect the growth in protein sequence and structure databases, and improvements in the software for calculating the models. MODBASE currently contains 3 094 524 reliable models for domains in 1 094 750 out of 1 817 889 unique protein sequences in the UniProt database (July 5, 2005); only models based on statistically significant alignments and models assessed to have the correct fold despite insignificant alignments are included. MODBASE also allows users to generate comparative models for proteins of interest with the automated modeling server MODWEB. Our other resources integrated with MODBASE include comprehensive databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites and structurally defined binary domain interfaces (PIBASE) as well as predictions of ligand binding sites, interactions between yeast proteins, and functional consequences of human nsSNPs (LS-SNP).
Applying high-throughput Top-Down MS to an entire proteome requires a yet-to-be-established model for data processing. Since Top-Down is becoming possible on a large scale, we report our latest software pipeline dedicated to capturing the full value of intact protein data in automated fashion. For intact mass detection, we combine algorithms for processing MS1 data from both isotopically resolved (FT) and charge-state resolved (ion trap) LC-MS data, which are then linked to their fragment ions for database searching using ProSight. Automated determination of human keratin and tubulin isoforms is one result. Optimized for the intricacies of whole proteins, new software modules visualize proteome-scale data based on the LC retention time and intensity of intact masses and enable selective detection of PTMs to automatically screen for acetylation, phosphorylation, and methylation. Software functionality was demonstrated using comparative LC-MS data from yeast strains in addition to human cells undergoing chemical stress. We further these advances as a key aspect of realizing Top-Down MS on a proteomic scale.
Bioinformatics; Data reduction; Deconvolution; Intact protein; Tandem MS; Top down
Here our goal is to carry out nanotube design using naturally occurring protein building blocks. Inspection of the protein structural database reveals the richness of the conformations of proteins, their parts, and their chemistry. Given a target functional protein nanotube geometry, our strategy involves scanning a library of candidate building blocks, combinatorially assembling them into the shape and testing its stability. Since self-assembly takes place on time scales not affordable for computations, here we propose a strategy for the very first step in protein nanotube design: we map the candidate building blocks onto a planar sheet and wrap the sheet around a cylinder with the target dimensions. We provide examples of three nanotubes, two peptide and one protein, in atomistic model detail for which there are experimental data. The nanotube models can be used to verify a nanostructure observed by low-resolution experiments, and to study the mechanism of tube formation.
Nanobiology poses a challenge to computational biology. The aim is to predict candidate nanostructures consisting of biopolymers. The idea the authors have followed for some years is to employ naturally occurring protein building blocks for protein and nanostructure design. Drawing on proteins as material is attractive, since proteins and their building blocks have a large repertoire of shapes and surface chemistry. The authors assume that the shape is given: either a target protein scaffold or a functional nanoparticle shape. Ideally, building blocks should self-assemble spontaneously. In practice, self-assembly involves time scales not affordable for computations. Instead, the authors present the first step in knowledge-based nanotube design: creating a nanotube with a specified protein arrangement and tube geometry. Models are constructed by wrapping a planar sheet onto a cylinder surface. The sheet is shaped by a repeating 2-D lattice. This simplification reduces the complexity to the protein arrangement in the lattice and does not prevent construction of all possible nanotubes of repeated units. It allows optimization with an all-atom force field. This is important since local energy minimization may screen whether a specified nanotube is a feasible nanostructure.
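The geometric core of wrapping a planar sheet onto a cylinder can be sketched in a few lines. This is a simplified illustration rather than the authors' procedure: it assumes a rectangular lattice of point-like building blocks, with one lattice direction running around the tube and the other along its axis, and all names and spacings are hypothetical.

```python
import math

def wrap_point(u, v, radius):
    """Map sheet coordinates (u around the tube, v along the axis)
    onto a cylinder of the given radius, preserving arc length in u."""
    theta = u / radius
    return (radius * math.cos(theta), radius * math.sin(theta), v)

def build_tube(n_around, n_along, spacing_u, spacing_v):
    """Wrap an n_around x n_along rectangular lattice onto a cylinder.

    The radius is chosen so that n_around lattice units close the ring
    exactly: circumference = n_around * spacing_u.
    """
    radius = n_around * spacing_u / (2.0 * math.pi)
    coords = []
    for j in range(n_along):
        for i in range(n_around):
            coords.append(wrap_point(i * spacing_u, j * spacing_v, radius))
    return radius, coords

# 12 units per ring, 3 rings, with made-up spacings in angstroms.
tube_radius, tube_coords = build_tube(12, 3, 10.0, 5.0)
```

Because the radius is derived from the lattice, the ring always closes; a slanted (helical) lattice would need an extra offset in `v` per step around the tube, which this sketch leaves out.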
Modulation of DNA base excision repair (BER) has the potential to enhance response to chemotherapy and improve outcomes in tumours such as melanoma and glioma. APE1, a critical protein in BER that processes potentially cytotoxic abasic sites (AP sites), is a promising new target in cancer. In the current study, we aimed to develop small molecule inhibitors of APE1 for cancer therapy.
An industry-standard high throughput virtual screening strategy was adopted. The Sybyl8.0 (Tripos, St Louis, MO, USA) molecular modelling software suite was used to build inhibitor templates. Similarity searching strategies were then applied using ROCS 2.3 (Open Eye Scientific, Santa Fe, NM, USA) to extract pharmacophorically related subsets of compounds from a chemically diverse database of 2.6 million compounds. The compounds in these subsets were subjected to docking against the active site of the APE1 model, using the genetic algorithm-based programme GOLD2.7 (CCDC, Cambridge, UK). Predicted ligand poses were ranked on the basis of several scoring functions. The top virtual hits with promising pharmaceutical properties underwent detailed in vitro analyses using fluorescence-based APE1 cleavage assays and counter screened using endonuclease IV cleavage assays, fluorescence quenching assays and radiolabelled oligonucleotide assays. Biochemical APE1 inhibitors were then subjected to detailed cytotoxicity analyses.
Several specific APE1 inhibitors were isolated by this approach. The IC50 for APE1 inhibition ranged between 30 nM and 50 μM. We demonstrated that APE1 inhibitors lead to accumulation of AP sites in genomic DNA and potentiated the cytotoxicity of alkylating agents in melanoma and glioma cell lines.
Our study provides evidence that APE1 is an emerging drug target and could have therapeutic application in patients with melanoma and glioma.
melanoma; glioma; DNA repair; human apurinic/apyrimidinic endonuclease 1 (APE1); small molecule inhibitors
Shotgun proteomics is a powerful tool for identifying the protein content of complex mixtures via liquid chromatography and tandem mass spectrometry. The most widely used class of algorithms for analyzing mass spectra of peptides has been database search software such as SEQUEST. A new sequence tag database search algorithm, called GutenTag, makes it possible to identify peptides with unknown posttranslational modifications or sequence variations. This software automates the process of inferring partial sequence “tags” directly from the spectrum and efficiently examines a sequence database for peptides that match these tags. When multiple candidate sequences result from the database search, the software evaluates which is the best match by a rapid examination of spectral fragment ions. We compare GutenTag’s accuracy to that of SEQUEST on a defined protein mixture, showing that both modified and unmodified peptides can be successfully identified by this approach. GutenTag analyzed 33 000 spectra from a human lens sample, identifying peptides that were missed in prior SEQUEST analysis due to sequence polymorphisms and posttranslational modifications. The software is available under license; visit http://fields.scripps.edu for information.
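Inferring partial sequence tags from a spectrum and looking them up in a sequence database can be illustrated with a toy sketch. This is not GutenTag's algorithm: it assumes singly charged peaks from a single ion series, ignores intensities and modifications, and uses hypothetical function names and a hypothetical 0.02 Da tolerance. The residue masses are standard monoisotopic values, with `L` standing for the isobaric pair Leu/Ile.

```python
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residue_for(delta, tol=0.02):
    """Return the residue whose mass matches a peak-to-peak difference."""
    for aa, mass in RESIDUE_MASS.items():
        if abs(delta - mass) <= tol:
            return aa
    return None

def infer_tags(peaks, tag_length=3, tol=0.02):
    """Find all residue strings of `tag_length` spelled by chains of peaks."""
    peaks = sorted(peaks)
    edges = {i: [] for i in range(len(peaks))}
    for i in range(len(peaks)):
        for j in range(i + 1, len(peaks)):
            aa = residue_for(peaks[j] - peaks[i], tol)
            if aa is not None:
                edges[i].append((j, aa))
    tags = set()

    def walk(i, tag):
        if len(tag) == tag_length:
            tags.add(tag)
            return
        for j, aa in edges[i]:
            walk(j, tag + aa)

    for i in range(len(peaks)):
        walk(i, "")
    return sorted(tags)

def search_database(tags, sequences):
    """Report sequences containing a tag in either reading direction."""
    hits = {}
    for name, seq in sequences.items():
        seq = seq.replace("I", "L")  # Leu and Ile cannot be told apart by mass
        matched = [t for t in tags if t in seq or t[::-1] in seq]
        if matched:
            hits[name] = matched
    return hits

# Synthetic ladder of peaks spelling T-E-F-Y from an arbitrary start mass.
peaks = [300.0]
for aa in "TEFY":
    peaks.append(peaks[-1] + RESIDUE_MASS[aa])
tags = infer_tags(peaks)
hits = search_database(tags, {"protA": "MKTEFYGG", "protB": "AAAAGGG"})
```

Because a tag only has to match part of the peptide, a candidate can still be found when the rest of the sequence carries an unanticipated modification or substitution; a real tool would then score the full candidates against the spectrum.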
PeptideProphet is a post-processing algorithm designed to evaluate the confidence in identifications of MS/MS spectra returned by a database search. In this manuscript we describe the "what and how" of PeptideProphet in a manner aimed at statisticians and life scientists who would like to gain a more in-depth understanding of the underlying statistical modeling. The theory and rationale behind the mixture-modeling approach taken by PeptideProphet is discussed from a statistical model-building perspective followed by a description of how a model can be used to express confidence in the identification of individual peptides or sets of peptides. We also demonstrate how to evaluate the quality of model fit and select an appropriate model from several available alternatives. We illustrate the use of PeptideProphet in association with the Trans-Proteomic Pipeline, a free suite of software used for protein identification.
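The mixture-modeling idea, treating search scores as draws from a "correct" and an "incorrect" population and reporting the posterior probability that an identification is correct, can be sketched with expectation-maximization. This sketch swaps in a simple two-Gaussian mixture for illustration and is not PeptideProphet's actual model; all function and parameter names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior_correct(x, params):
    """Posterior probability that score x comes from the 'correct' component."""
    p = params["prior_pos"] * gaussian_pdf(x, params["mean_pos"], params["sd_pos"])
    n = (1 - params["prior_pos"]) * gaussian_pdf(x, params["mean_neg"], params["sd_neg"])
    return p / (p + n) if p + n > 0 else 0.5

def fit_mixture(scores, n_iter=50, min_sd=1e-3):
    """Fit a two-component Gaussian mixture to scores with EM."""
    lo, hi = min(scores), max(scores)
    start_sd = max((hi - lo) / 4, min_sd)
    params = {"prior_pos": 0.5, "mean_pos": hi, "sd_pos": start_sd,
              "mean_neg": lo, "sd_neg": start_sd}
    for _ in range(n_iter):
        # E-step: responsibility of the 'correct' component for each score.
        post = [posterior_correct(x, params) for x in scores]
        w_pos = sum(post)
        w_neg = len(scores) - w_pos
        if w_pos < 1e-9 or w_neg < 1e-9:
            break  # one component has vanished; keep the last estimate
        # M-step: weighted means, standard deviations and mixing proportion.
        mean_pos = sum(r * x for r, x in zip(post, scores)) / w_pos
        mean_neg = sum((1 - r) * x for r, x in zip(post, scores)) / w_neg
        var_pos = sum(r * (x - mean_pos) ** 2 for r, x in zip(post, scores)) / w_pos
        var_neg = sum((1 - r) * (x - mean_neg) ** 2 for r, x in zip(post, scores)) / w_neg
        params = {"prior_pos": w_pos / len(scores),
                  "mean_pos": mean_pos, "sd_pos": max(math.sqrt(var_pos), min_sd),
                  "mean_neg": mean_neg, "sd_neg": max(math.sqrt(var_neg), min_sd)}
    return params

# Two well-separated synthetic score clusters.
scores = [-2.0 + 0.1 * i for i in range(-5, 6)] + [3.0 + 0.1 * i for i in range(-5, 6)]
fitted = fit_mixture(scores)
```

With the fitted parameters, `posterior_correct(score, fitted)` gives a per-identification probability. Summing `1 - posterior` over the accepted identifications estimates the expected number of false positives, which is one way such a model expresses confidence in a set of peptides.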
Fragment assembly is a powerful method of protein structure prediction that builds protein models from a pool of candidate fragments taken from known structures. Stochastic sampling is subsequently used to refine the models. The structures are first represented as coarse-grained models and then as all-atom models for computational efficiency. Many models have to be generated independently due to the stochastic nature of the sampling methods used to search for the global minimum in a complex energy landscape. In this paper we present a fragment-based approach which shares information between the generated models and steers the search towards native-like regions. A distribution over fragments is estimated from a pool of low-energy all-atom models. This iteratively refined distribution is used to guide the selection of fragments during the building of models for subsequent rounds of structure prediction. The use of an estimation of distribution algorithm made it possible to reach lower energy levels and to generate a higher percentage of near-native models. The method uses an all-atom energy function and produces models with atomic resolution. We observed an improvement in energy-driven blind selection of models on a benchmark in comparison with the AbInitioRelax protocol.
In vivo recombination of overlapping DNA fragments for assembly of large DNA constructs in the yeast Saccharomyces cerevisiae holds great potential for pathway engineering on a small laboratory scale as well as for automated high-throughput strain construction. However, the current in vivo assembly methods are not consistent with respect to yields of correctly assembled constructs and standardization of parts required for routine laboratory implementation has not been explored. Here, we present and evaluate an optimized and robust method for in vivo assembly of plasmids from overlapping DNA fragments in S. cerevisiae.
To minimize the occurrence of misassembled plasmids and increase the versatility of the assembly platform, two main improvements were introduced: i) the essential elements of the vector backbone (yeast episome and selection marker) were disconnected and ii) standardized 60 bp synthetic recombination sequences non-homologous with the yeast genome were introduced at each flank of the assembly fragments. These modifications led to a 100-fold decrease in false positive transformants originating from the backbone as compared to previous methods. Implementation of the 60 bp synthetic recombination sequences enabled high flexibility in the design of complex expression constructs and allowed for fast and easy construction of all assembly fragments by PCR. The functionality of the method was demonstrated by the assembly of a 21 kb plasmid out of nine overlapping fragments carrying six glycolytic genes with a correct assembly yield of 95%. The assembled plasmid was shown to be a high fidelity replica of the in silico design and all glycolytic genes carried by the plasmid were proven to be functional.
The presented method delivers a substantial improvement for assembly of multi-fragment expression vectors in S. cerevisiae. Not only does it improve the efficiency of in vivo assembly, but it also offers a versatile platform for easy and rapid design and assembly of synthetic constructs. The presented method is therefore ideally suited for the construction of complex pathways and for high-throughput strain construction programs for metabolic engineering purposes. In addition, its robustness and ease of use facilitate the construction of any plasmid carrying two or more genes.
In vivo assembly; Saccharomyces cerevisiae; Synthetic biology; Pathway engineering; Homologous recombination
MODBASE (http://salilab.org/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on the MODELLER package for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE uses the MySQL relational database management system for flexible querying and CHIMERA for viewing the sequences and structures (http://www.cgl.ucsf.edu/chimera/). MODBASE is updated regularly to reflect the growth in protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different data sets. The largest data set contains 1 262 629 models for domains in 659 495 out of 1 182 126 unique protein sequences in the complete Swiss-Prot/TrEMBL database (August 25, 2003); only models based on alignments with significant similarity scores and models assessed to have the correct fold despite insignificant alignments are included. Another model data set supports target selection and structure-based annotation by the New York Structural Genomics Research Consortium; e.g. the 53 new structures produced by the consortium allowed us to characterize structurally 24 113 sequences. MODBASE also contains binding site predictions for small ligands and a set of predicted interactions between pairs of modeled sequences from the same genome.
Our other resources associated with MODBASE include a comprehensive database of multiple protein structure alignments (DBALI, http://salilab.org/dbali) as well as web servers for automated comparative modeling with MODPIPE (MODWEB, http://salilab.org/modweb), modeling of loops in protein structures (MODLOOP, http://salilab.org/modloop) and predicting functional consequences of single nucleotide polymorphisms (SNPWEB, http://salilab.org/snpweb).
The discovery that RNA molecules can fold into complex structures and carry out diverse cellular roles has led to interest in developing tools for modeling RNA tertiary structure. While significant progress has been made in establishing that the RNA backbone is rotameric, few libraries of discrete conformations specifically for use in RNA modeling have been validated. Here, we present six libraries of discrete RNA conformations based on a simplified pseudo-torsional notation of the RNA backbone, comparable to phi and psi in the protein backbone. We evaluate the ability of each library to represent single nucleotide backbone conformations and we show how individual library fragments can be assembled into dinucleotides that are consistent with established RNA backbone descriptors spanning from sugar to sugar. We then use each library to build all-atom models of 20 test folds and we show how the composition of a fragment library can limit model quality. Despite the limitations inherent in using discretized libraries, we find that several hundred discrete fragments can rebuild RNA folds up to 174 nucleotides in length with atomic-level accuracy (<1.5 Å RMSD). We anticipate the libraries presented here could easily be incorporated into RNA structural modeling, analysis, or refinement tools.
RNA structure; RNA backbone conformation; RNA fragment library; RNA modeling
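A pseudo-torsional description reduces each nucleotide to two dihedral angles computed from a few backbone atoms. The sketch below assumes the commonly used η/θ convention (η from C4'(i-1), P(i), C4'(i), P(i+1); θ from P(i), C4'(i), P(i+1), C4'(i+1)), which the abstract does not spell out. The function names and the toy coordinates are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def pseudo_torsions(p_atoms, c4_atoms, i):
    """Return (eta, theta) for nucleotide i from per-residue P and C4' positions.

    Residues i-1 and i+1 must exist; p_atoms[i - 1] is not used.
    """
    eta = dihedral(c4_atoms[i - 1], p_atoms[i], c4_atoms[i], p_atoms[i + 1])
    theta = dihedral(p_atoms[i], c4_atoms[i], p_atoms[i + 1], c4_atoms[i + 1])
    return eta, theta

# Toy coordinates chosen so the angles are easy to verify by hand.
p_atoms = [None, (0.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
c4_atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
eta, theta = pseudo_torsions(p_atoms, c4_atoms, 1)
```

For these coordinates η is +90° and θ is a trans angle (±180°). A discrete library would then bin such (η, θ) pairs into a finite set of representative conformations.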
ARP/wARP is a software suite to build macromolecular models in X-ray crystallography electron density maps. Structural genomics initiatives and the study of complex macromolecular assemblies and membrane proteins all rely on advanced methods for 3D structure determination. ARP/wARP meets these needs by providing the tools to obtain a macromolecular model automatically, with a reproducible computational procedure. ARP/wARP 7.0 tackles several tasks: iterative protein model building including a high-level decision-making control module; fast construction of the secondary structure of a protein; building flexible loops in alternate conformations; fully automated placement of ligands, including a choice of the best fitting ligand from a “cocktail”; and finding ordered water molecules. All protocols are easy to handle by a non-expert user through a graphical user interface or a command line. The time required is typically a few minutes although iterative model building may take a few hours.
Structural genomics; X-ray crystallography; software; model building; ligand placement
BLAST is a commonly-used software package for comparing a query sequence to a database of known sequences; in this study, we focus on protein sequences. Position-specific-iterated BLAST (PSI-BLAST) iteratively searches a protein sequence database, using the matches in round i to construct a position-specific score matrix (PSSM) for searching the database in round i + 1. Biegert and Söding developed Context-sensitive BLAST (CS-BLAST), which combines information from searching the sequence database with information derived from a library of short protein profiles to achieve better homology detection than PSI-BLAST, which builds its PSSMs from scratch.
We describe a new method, called domain enhanced lookup time accelerated BLAST (DELTA-BLAST), which searches a database of pre-constructed PSSMs before searching a protein-sequence database, to yield better homology detection. For its PSSMs, DELTA-BLAST employs a subset of NCBI’s Conserved Domain Database (CDD). On a test set derived from ASTRAL, with one round of searching, DELTA-BLAST achieves a ROC5000 of 0.270 vs. 0.116 for CS-BLAST. The performance advantage diminishes in iterated searches, but DELTA-BLAST continues to achieve better ROC scores than CS-BLAST.
DELTA-BLAST is a useful program for the detection of remote protein homologs. It is available under the “Protein BLAST” link at http://blast.ncbi.nlm.nih.gov.
This article was reviewed by Arcady Mushegian, Nick V. Grishin, and Frank Eisenhaber.
The combination of molecular replacement and single-wavelength anomalous diffraction improves the performance of automated structure determination with Auto-Rickshaw.
A combination of molecular replacement and single-wavelength anomalous diffraction phasing has been incorporated into the automated structure-determination platform Auto-Rickshaw. The complete MRSAD procedure includes molecular replacement, model refinement, experimental phasing, phase improvement and automated model building. The improvement over the standard SAD or MR approaches is illustrated by ten test cases taken from the JCSG diffraction data-set database. Poor MR or SAD phases with phase errors larger than 70° can be improved using the described procedure and a large fraction of the model can be determined in a purely automatic manner from X-ray data extending to better than 2.6 Å resolution.
automated structure determination; molecular replacement; single-wavelength anomalous diffraction
RNA molecules fold into characteristic secondary and tertiary structures that account for their diverse functional activities. Many of these RNA structures are assembled from a collection of RNA structural motifs. These basic building blocks are used repeatedly, and in various combinations, to form different RNA types and define their unique structural and functional properties. Identification of recurring RNA structural motifs will therefore enhance our understanding of RNA structure and help associate elements of RNA structure with functional and regulatory elements. Our goal was to develop a computer program that can describe an RNA structural element of any complexity and then search any nucleotide sequence database, including the complete prokaryotic and eukaryotic genomes, for these structural elements. Here we describe in detail a new computational motif search algorithm, RNAMotif, and demonstrate its utility with some motif search examples. RNAMotif differs from other motif search tools in two important aspects: first, the structure definition language is more flexible and can specify any type of base–base interaction; second, RNAMotif provides a user controlled scoring section that can be used to add capabilities that patterns alone cannot provide.
Structural studies are increasingly providing huge amounts of information on multi-protein assemblies. Although a complete understanding of cellular processes will be dependent on an explicit characterization of the intermolecular interactions that underlie these assemblies and mediate molecular recognition, these are not well described by standard representations.
Here we present PICCOLO, a comprehensive relational database capturing the details of structurally characterized protein-protein interactions. Interactions are described at the level of interacting pairs of atoms, residues and polypeptide chains, with the physico-chemical nature of the interactions being characterized. Distance and angle terms are used to distinguish 12 different interaction types, including van der Waals contacts, hydrogen bonds and hydrophobic contacts. The explicit aim of PICCOLO is to underpin large-scale analyses of the properties of protein-protein interfaces. This is exemplified by an analysis of residue propensity and interface contact preferences derived from a much larger data set than previously reported. However, PICCOLO also supports detailed inspection of particular systems of interest.
The current PICCOLO database comprises more than 260 million interacting atom pairs from 38,202 protein complexes. A web interface for the database is available at http://www-cryst.bioc.cam.ac.uk/piccolo.
Building on our recently introduced library-based Monte Carlo (LBMC) approach, we describe a flexible protocol for mixed coarse-grained (CG)/all-atom (AA) simulation of proteins and ligands. In the present implementation of LBMC, protein side chain configurations are pre-calculated and stored in libraries, while bonded interactions along the backbone are treated explicitly. Because the AA side chain coordinates are maintained at minimal run-time cost, arbitrary sites and interaction terms can be turned on to create mixed-resolution models. For example, an AA region of interest such as a binding site can be coupled to a CG model for the rest of the protein. We have additionally developed a hybrid implementation of the generalized Born/surface area (GBSA) implicit solvent model suitable for mixed-resolution models, which in turn was ported to a graphics processing unit (GPU) for faster calculation. The new software was applied to study two systems: (i) the behavior of spin labels on the B1 domain of protein G (GB1) and (ii) docking of randomly initialized estradiol configurations to the ligand binding domain of the estrogen receptor (ERα). The performance of the GPU version of the code was also benchmarked in a number of additional systems.
The molecular viewer ArpNavigator allows easy execution of ARP/wARP model-building routines while model-update steps are shown in real time, rendering the whole process transparent to the user.
Automated model-building software aims at the objective interpretation of crystallographic diffraction data by means of the construction or completion of macromolecular models. Automated methods have rapidly gained in popularity as they are easy to use and generate reproducible and consistent results. However, the process of model building has become increasingly hidden and the user is often left to decide on how to proceed further with little feedback on what has preceded the output of the built model. Here, ArpNavigator, a molecular viewer tightly integrated into the ARP/wARP automated model-building package, is presented that directly controls model building and displays the evolving output in real time in order to make the procedure transparent to the user.
model building; ARP/wARP; molecular graphics
MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10^-4) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server.
The identification and modelling of ligands into macromolecular models is important for understanding a molecule's function and for designing inhibitors to modulate its activities. We describe new algorithms for the automated building of ligands into electron density maps in crystal structure determination. Location of the ligand-binding site is achieved by matching numerical shape features describing the ligand to those of density clusters using a “fragmentation-tree” density representation. The ligand molecule is built using two distinct algorithms exploiting free atoms with inter-atomic connectivity and Metropolis-based optimisation of the conformational state of the ligand, producing an ensemble of structures from which the final model is derived. The method was validated on several thousand entries from the Protein Data Bank. In the majority of cases, the ligand-binding site could be correctly located and the ligand model built with a coordinate accuracy of better than 1 Å. We anticipate that the method will be of routine use to anyone modelling ligands, lead compounds or even compound fragments as part of protein functional analyses or drug design efforts.
electron density map; small-molecule binders; shape; hybrid approach; drug design