ModBase (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by ModPipe, an automated modeling pipeline that relies primarily on Modeller for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller/). ModBase currently contains 10355444 reliable models for domains in 2421920 unique protein sequences. ModBase allows users to update comparative models on demand, and request modeling of additional sequences through an interface to the ModWeb modeling server (http://salilab.org/modweb). ModBase models are available through the ModBase interface as well as the Protein Model Portal (http://www.proteinmodelportal.org/). Recently developed associated resources include the SALIGN server for multiple sequence and structure alignment (http://salilab.org/salign), the ModEval server for predicting the accuracy of protein structure models (http://salilab.org/modeval), the PCSS server for predicting which peptides bind to a given protein (http://salilab.org/pcss) and the FoXS server for calculating and fitting Small Angle X-ray Scattering profiles (http://salilab.org/foxs).
Genome sequencing efforts are providing us with complete genetic blueprints for hundreds of organisms. We are faced with assigning and understanding the functions of proteins encoded by these genomes. This task is generally facilitated by knowing the proteins’ 3D structures, which are best determined by experimental methods such as X-ray crystallography and NMR spectroscopy. In the last two years, the number of experimentally determined protein structures in the Protein Data Bank (PDB) has increased by 30% to 67794 (September 2010) (1). However, in the same timeframe, the number of protein sequences in the comprehensive public sequence databases such as GenBank (2) and UniProtKB (3) has grown even more rapidly; for example, the number of sequences in UniProtKB has nearly doubled to >12 million. Protein structure prediction methods are attempting to bridge this gap. The need for accurate models can sometimes be met by homology or comparative modeling (4–8). Comparative modeling is carried out in four sequential steps: identifying known structures (templates) related to the sequence to be modeled (target), aligning the target sequence with the templates, building models and assessing the models. Because the first step depends on finding a template, comparative modeling is only applicable when the target sequence is detectably related to a known protein structure.
As more experimental structures become available, and more reliable models become accessible to biologists, web-accessible resources that assist in analyzing protein structures and structural models and evaluating their reliability become increasingly important.
Here, we describe the current state of the ModBase database of comparative protein structure models, the ModWeb comparative modeling web-server and several new associated resources: the SALIGN server for multiple sequence and structure alignment (http://salilab.org/salign) (9), the ModEval server for predicting the accuracy of protein structure models (http://salilab.org/modeval), the PCSS server for predicting which peptides bind to a given protein (http://salilab.org/pcss) (10) and the FoXS server for calculating and fitting Small Angle X-ray Scattering profiles (http://salilab.org/foxs) (11). We also present new modules of the UCSF Chimera molecular graphics package that retrieve models from ModBase and act as a graphical interface to Modeller. Finally, we illustrate the use of comparative models by calculating modeling leverage for structural genomics, superfamily member identification and functional annotation, prediction of protein–protein interactions and genome-wide functional annotation.
Models in ModBase are calculated using our automated software pipeline for comparative protein structure modeling, ModPipe (12). The software relies mostly on modules of Modeller (13), and is designed to process data sets of protein sequences on a Linux cluster.
ModPipe uses sequence–sequence (14), sequence–profile (7,15) and profile–profile (7,16) methods for fold assignment and target–template alignment, using a promiscuous E-value threshold of 1.0 to increase the likelihood of identifying the best available template structure. These alignments can cover only a segment or the whole target sequence. By default, for each target–template alignment, 10 models are calculated (13) and the model with the best value of the Discrete Optimized Protein Energy (DOPE) statistical potential (17) is selected and then evaluated by several additional quality criteria: (i) target–template sequence identity, (ii) GA341 score (18), (iii) Z-DOPE score (17), (iv) ModPipe Quality Score (MPQS) and (v) TSVMod score (19). The MPQS score is a composite model quality criterion that includes the coverage of the modeled sequence, sequence identity, the fraction of gaps in the alignment, the compactness of the model and various statistical potential Z-scores. A short description of the other scores can be found below in the section ‘ModEval: server for predicting errors in structural models’. The models that score best with at least one of these quality criteria are selected for further filtering. If more than 30 residues of a target sequence are not covered by a selected model, additional models are selected even if they don’t score best with at least one of the quality criteria. Finally, only the models with quality criteria values above specified thresholds or with an E-value <10⁻⁴ are included in the final model set.
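The selection logic described above can be illustrated with a minimal Python sketch. It is not ModPipe code: the `Model` class, the function names and the `quality_ok` flag (a stand-in for the unspecified quality thresholds) are hypothetical, and the sketch assumes that a lower DOPE value is the better one, as is usual for an energy-like statistical potential.

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Hypothetical record for one model built from one alignment."""
    alignment_id: str
    dope: float      # assumption: lower (more negative) DOPE is better
    e_value: float   # E-value of the target-template alignment


def best_by_dope(models):
    """Keep, for each alignment, the model with the lowest DOPE value."""
    best = {}
    for model in models:
        current = best.get(model.alignment_id)
        if current is None or model.dope < current.dope:
            best[model.alignment_id] = model
    return best


def passes_final_filter(model, quality_ok, e_value_cutoff=1e-4):
    """Final inclusion rule: quality criteria satisfied OR E-value < 1e-4.

    `quality_ok` stands in for the quality thresholds, which the text
    does not enumerate.
    """
    return quality_ok or model.e_value < e_value_cutoff


if __name__ == "__main__":
    models = [
        Model("aln1", -100.0, 1e-6),
        Model("aln1", -150.0, 1e-6),
        Model("aln2", -90.0, 0.5),
    ]
    for alignment_id, model in best_by_dope(models).items():
        print(alignment_id, model.dope, passes_final_filter(model, False))
```

The sketch omits the coverage rule (selecting additional models when more than 30 residues remain uncovered), which would require residue ranges for each model.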
A key feature of the pipeline is not prejudging the validity of sequence–structure relationships at the fold-assignment stage; instead, sequence–structure matches are assessed after the construction of the models and their evaluation. This approach enables a thorough exploration of fold assignments, sequence–structure alignments and conformations, with the aim of finding the model with the best evaluation score, at the expense of increasing the computational time significantly, since for some sequences, a few thousand models can be calculated.
The source code for ModPipe is freely accessible under the GPL terms (http://salilab.org/modpipe). The binary code for Modeller is also available freely to academics for a number of different machine types (http://salilab.org/modeller).
Models in ModBase are organized in data sets. Because of the rapid growth of the public sequence databases, we concentrate our efforts on adding data sets that are useful for specific projects, rather than attempt to model all known protein sequences with detectable template structures. Currently, ModBase includes a model data set for each of 43 complete genomes, as well as a data set for the complete SwissProt/TrEMBL database (2005) (http://salilab.org/modbase/statistics). We identified the genomes with the highest access statistics (Homo sapiens, Saccharomyces cerevisiae, Escherichia coli, Mycobacterium tuberculosis, Mus musculus, Arabidopsis thaliana, Drosophila melanogaster, Rattus norvegicus and Caenorhabditis elegans), and are updating the corresponding models more frequently (approximately once a year). Together with other project-oriented data sets, ModBase currently contains 10355444 reliable models for domains in 2421920 unique sequences.
The ModWeb comparative modeling web-server is an integral module of ModBase (http://salilab.org/modweb) (12). In the default mode, ModWeb accepts one or more sequences in the FASTA format, followed by calculating and evaluating their models using ModPipe based on the best available templates from the PDB. Alternatively, ModWeb also accepts a protein structure as input, calculates a multiple sequence profile and identifies all homologous sequences in the UniProtKB database, followed by modeling these homologs based on the user-provided structure. This alternative protocol is a useful tool for measuring the impact of new structures, such as those generated by structural genomics efforts (20). Additionally, new members of sequence superfamilies with at least one known structure can be identified (21).
In addition to the existing anonymous access, we recently added a user registration option. Registered users get unified access to all their ModWeb data sets and can submit template-based calculations.
A number of web-services are associated with ModBase. Some of these are tightly integrated with ModBase, while others contain data that are derived through ModBase—e.g. single nucleotide polymorphism (SNP) annotations created by LS-SNP (22). We have already described the interactions of ModBase with the ModLoop server for loop modeling in protein structures (http://salilab.org/modloop) (23), the PIBASE database of protein–protein interaction (http://salilab.org/pibase) (24), the DBAli database of structural alignments (http://salilab.org/dbali) (25,26) and the LS-SNP database of structural annotations of human non-synonymous single-nucleotide polymorphisms (http://salilab.org/LS-SNP) elsewhere (22,27,28). Here, we describe several additional servers that are now interacting with ModBase.
Accurate alignment of protein sequences and structures is crucial for comparative modeling; for example, sequence–structure alignment is needed for template identification (16) and target–template alignment (29); structure–structure alignments are useful for comparing multiple templates with each other (9), in preparation for comparative modeling based on multiple template structures (13). The SALIGN web-server (http://salilab.org/salign) performs sequence–sequence, sequence–structure and structure–structure alignments of two or more proteins (H. Braberg et al., manuscript in preparation). Depending on the provided input and desired output, a number of different algorithms and options implemented in Modeller can be applied, including global and local dynamic programming; linear and non-linear gap penalty functions; sequence- and structure-based similarity matrices and progressive/tree-based multiple alignments (9,16,29,30).
Given an input of sequences and/or structures, the server proposes the optimal alignment protocol. For instance, given more than two input structures and sequences, the structures and sequences are separately aligned to each other. The two multiple alignments are then aligned with one another, making use of the variable gap penalty function (29). Two sets of multiple sequence alignments can also be aligned using a profile–profile method (16). The user can override the default choice of algorithms and parameters. We have previously demonstrated the effectiveness of the algorithms used in the server in the context of comparative modeling (28,31) and identification of interacting protein partners (32).
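As an illustration of one of the options mentioned above, global dynamic programming with a linear gap penalty, the following minimal Python sketch computes the optimal global alignment score of two sequences. It is a textbook formulation and not the Modeller or SALIGN implementation; the function name and the match, mismatch and gap values are arbitrary assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty.

    Textbook dynamic programming that keeps only two rows of the score
    matrix. The scoring values are illustrative assumptions.
    """
    # Row 0: aligning an empty prefix of `a` with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of `b`
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

A variable or non-linear gap penalty would replace the constant `gap` term with a position-dependent or length-dependent cost.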
Model evaluation is an essential step in protein structure modeling, as its results allow the user to judge the level of accuracy of the model and whether or not a model is suitable for the intended application. Two model evaluation methods are available within Modeller. First, GA341 (18) is a statistical potential-based score, which discriminates between models of correct and incorrect fold. It is derived from a nonlinear combination (evolved by a genetic algorithm) of three model features (33): model length, ZPAIR (a distance statistical potential Z-score) and ZSURF (a surface-accessibility statistical potential Z-score). The two Z-scores are combined in the ZCOMB score. Second, the DOPE score is an atomic-distance-dependent statistical potential derived from known protein structures (17). To facilitate comparison between models of different sequences, a normalized DOPE score (Z-DOPE) for the whole model is also reported, as is a profile of the residue Z-DOPE scores that allows identification of problematic regions of a model.
Recently, we developed TSVMod (19,34), a method to estimate the Cα RMSD error and the native overlap (the fraction of Cα atoms within 3.5Å of their native positions) of a model. The error prediction relies on a model-specific scoring function constructed by a support vector machine that optimizes the weights of up to nine features, including various sequence similarity measures and statistical potentials, extracted from a tailored training set of models unique to the model being assessed. If possible, the training relies on similarly sized models with the same fold; otherwise, similarly sized models with the same secondary structure composition are used.
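The two quantities that TSVMod estimates can be made concrete with a minimal Python sketch that computes them directly when the native structure is known. The function names are hypothetical. The sketch assumes that the model and native Cα coordinates are given in the same residue order and have already been superposed, and it treats "within 3.5 Å" as a distance of at most 3.5 Å.

```python
import math


def _ca_distances(model_ca, native_ca):
    """Per-residue distances between equivalent Calpha atoms (in Angstrom)."""
    if not model_ca or len(model_ca) != len(native_ca):
        raise ValueError("need two non-empty, equally long coordinate lists")
    return [math.dist(m, n) for m, n in zip(model_ca, native_ca)]


def ca_rmsd(model_ca, native_ca):
    """Root-mean-square deviation over equivalent Calpha atoms."""
    dists = _ca_distances(model_ca, native_ca)
    return math.sqrt(sum(d * d for d in dists) / len(dists))


def native_overlap(model_ca, native_ca, cutoff=3.5):
    """Fraction of Calpha atoms within `cutoff` Angstrom of their native positions."""
    dists = _ca_distances(model_ca, native_ca)
    return sum(1 for d in dists if d <= cutoff) / len(dists)


if __name__ == "__main__":
    native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    print(ca_rmsd(model, native), native_overlap(model, native))
```

TSVMod itself predicts these values without access to the native structure; the sketch only shows what is being predicted.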
The ModEval server (http://salilab.org/modeval) accepts a protein structure, an alignment in the PIR format (optional) and the sequence–template sequence identity (optional). It then computes the TSVMod scores, the Z-DOPE score and profile and all components of the GA341 score. Upon completion of the job, the user receives an email notification.
Protein–protein recognition is frequently mediated by small peptide regions of one protein binding to a pocket or groove of another protein. Examples include scaffolding domains such as PDZ and SH3 (35), which recognize peptides 6–10 residues in length; and protease–substrate specificity, in which the substrate peptide associates with the protease active site cleft before catalysis (36). This recognition is mediated by the sequence of the peptide and its structural environment in the binding protein. It is often helpful to be able to identify these peptides; for example, detecting a peptide that is cleaved by a protease can lead to hypotheses of the effect of this cleavage on protein substrate function. To aid in this prediction effort, we created the PCSS web-server (http://salilab.org/pcss), which allows the user to provide positive and negative examples of peptide binding to a given protein. From these training data, a statistical model is generated that can then be used by the server to search for similar peptides in other protein sequences.
The PCSS web-server has two modes, ‘Training’ and ‘Application’. In the training mode, the user uploads a set of proteins containing the peptides of interest, specified by their UniProtKB accession numbers. The user indicates for each peptide whether it is a positive or negative example of the peptide motif. The server then validates the input and uses the sequence and structure features of the peptides to create a support vector machine model. The structure features of the peptides are derived from experimental structures or high-quality comparative models in ModBase, when available. In the application mode, the user provides a set of target proteins and uses the model created in the training mode to search for further examples of positive peptides. While training support vector machines generally requires expert knowledge, the PCSS server automates the process of feature selection and encoding, parameter sampling and benchmarking, thereby increasing the efficiency of its construction.
The algorithm implemented in PCSS was recently used to predict two substrates of the pro-apoptotic serine protease Granzyme B (GrB) (10): apoptosis-inducing factor 1 and survival motor neuron protein 1. Both were experimentally validated as GrB substrates in vitro, and are implicated in apoptosis. Their cleavage potentially represents a mechanism that natural killer cells and cytotoxic lymphocytes use to induce programmed cell death in virally-infected and neoplastic cells.
Small Angle X-ray Scattering (SAXS) is a common technique for low-resolution structural characterization of molecules in solution (37–39). SAXS experiments determine the scattering intensity of a molecule as a function of spatial frequency, resulting in a SAXS profile that can be easily converted into the approximate distribution of atomic distances in the measured system. SAXS experiments can be performed with the protein sample in solution, and usually take only a few minutes on a well-equipped synchrotron beamline (39).
FoXS (http://salilab.org/foxs) is a rapid and accurate method for calculating a SAXS profile of a given molecular structure based on the Debye formula (11). The method explicitly computes all inter-atomic distances, and models the first solvation layer based on solvent accessibility. FoXS was tested with all eight structures in the PDB that have an experimental SAXS profile in the open access SAXS database (http://bioisis.net/) as well as 16 additional structures with SAXS profiles from our collaborations. The FoXS resource can contribute to many applications, such as comparing a conformation in solution with the corresponding X-ray structure, modeling a flexible or multi-modular protein and assembling a macromolecular complex from its subunits.
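The Debye formula sums, over all atom pairs i and j, the term f_i f_j sin(q d_ij)/(q d_ij), where d_ij is the inter-atomic distance and q is the spatial frequency. The following minimal Python sketch evaluates it by explicit enumeration of all pairs. It is not the FoXS implementation: the function name is hypothetical, the form factors are taken as q-independent constants, and the solvation layer is ignored.

```python
import math


def debye_intensity(coords, form_factors, q):
    """Scattering intensity I(q) from the Debye formula.

    I(q) = sum_i sum_j f_i * f_j * sin(q * d_ij) / (q * d_ij)

    Simplifying assumptions: constant form factors and no solvation layer.
    """
    if len(coords) != len(form_factors):
        raise ValueError("one form factor is needed per atom")
    total = 0.0
    for i in range(len(coords)):
        # Self term (i == j): d_ii = 0 and sin(x)/x -> 1.
        total += form_factors[i] ** 2
        for j in range(i + 1, len(coords)):
            x = q * math.dist(coords[i], coords[j])
            sinc = 1.0 if x == 0.0 else math.sin(x) / x
            # Each unordered pair appears twice in the double sum.
            total += 2.0 * form_factors[i] * form_factors[j] * sinc
    return total


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    profile = [(q, debye_intensity(atoms, [1.0, 1.0], q)) for q in (0.0, 0.1, 0.2)]
    print(profile)
```

Explicit pair enumeration scales quadratically with the number of atoms, which is the cost of computing all inter-atomic distances exactly.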
UCSF Chimera is a graphics program for analysis and interactive visualization of molecular structures and related data (40). New modules have been added to Chimera for interaction with ModBase and Modeller. From within Chimera, all models for a given sequence in ModBase can be retrieved over the web by entering a sequence identifier (such as the UniProtKB accession number) into the Chimera ‘Fetch by ID’ dialog or command line. The fetched models are displayed in the main Chimera window, and their scores, residue range, template identifier and other information are listed in a table (similar to Figure 1, bottom left). Any of the general analysis features in Chimera can be applied to the models, such as calculation of hydrogen bonds, steric clashes and structure superpositions. The PDB files returned by ModBase contain content to allow for coloring the model by the degree to which the restraints have been satisfied, which can be used to predict model errors (Figure 1, right).
Additional new functionality in UCSF Chimera includes a graphical interface to build a model from scratch using Modeller, using as input only the amino acid sequence of the target protein. Chimera uses BLAST to search the PDB for potential templates, which are displayed in the Multalign Viewer tool (Figure 1, top) (41). The Viewer allows for alignment editing, for example, to remove gaps that fall within an element of regular secondary structure in the template, which frequently contribute to model error. Additional sequences can be added to the alignment, either as text or from other structures in Chimera. When the alignment is satisfactory, the user builds models using Modeller within Chimera. This process is run in the background and can be monitored via Chimera’s task manager. When the results become available, the models are displayed in Chimera and their scores shown in a table (Figure 1, bottom left). This functionality is also available for models already stored in ModBase, to allow for refinement of those models through editing the alignment and incorporating additional templates. Chimera can run a locally installed copy of Modeller or use a Modeller web service provided by the UCSF Resource for Biocomputing, Visualization, and Informatics (http://www.rbvi.ucsf.edu).
Model assessment by interactive visualization of structures and template–target sequence alignments is an important complement to the statistical scores available in ModBase. While model evaluation scores allow efficient filtering of the models most likely to be correct (17,19), interactive visualization may better reveal specific problematic regions, and more importantly, may allow for adjusting such regions in an iterative alignment/modeling process.
One of the metrics guiding target selection in structural genomics is modeling leverage. Modeling leverage of a structure is defined as the number of protein sequences that can be modeled based on the structure at >30% sequence identity. The New York Structural GenomiX Research Center (NYSGXRC) recently determined the structure of a putative BenF-like porin from P. fluorescens (PflBenF), which has the same fold as structurally defined members of the OprD superfamily (20). Members of this superfamily are thought to mediate transport of most small molecules across the cell membrane in Pseudomonads (42). To determine the modeling leverage of PflBenF, template-based modeling as implemented in ModWeb was performed, using the sequences and structures of PflBenF as well as two previously determined similar structures, OpdK (43) and OprD (44), both from P. aeruginosa. A total of 221 unique protein sequences were identified in the UniProtKB database, with sequence identities >30% to at least one of these three protein structures. The first structure of a member of this fold family, PaOprD, enabled modeling of 165 related proteins. Subsequent determination of the structure of PaOpdK resulted in models for an additional three protein sequences. In contrast, determination of the PflBenF structure enabled homology modeling of 53 additional protein sequences. Thus, the structure of PflBenF significantly expands the number of useful homology models of the porins in the OprD and OpdK families. Experimental structures of additional OprD/OpdK subfamily members should provide useful guides for planning experiments aimed at defining the mechanisms governing pore selectivity. The modeling leverage statistics for this project can be accessed at http://modbase.compbio.ucsf.edu/modbase-cgi/model_leverage.cgi?type=master_partha.
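The incremental counts reported above (165, 3 and 53 sequences, summing to 221) follow from crediting each sequence to the first structure, in order of determination, that can serve as its template. A minimal Python sketch of this bookkeeping follows. The function names and the toy identifiers are hypothetical and the data are invented for illustration; they are not the sequences from this study.

```python
def modelable_sequences(identities, threshold=30.0):
    """Sequences whose identity (%) to a structure exceeds the threshold."""
    return {seq for seq, pct in identities.items() if pct > threshold}


def incremental_leverage(structures_in_order, modelable):
    """Count, per structure, the sequences not covered by earlier structures.

    `modelable` maps a structure name to the set of sequences it can model.
    Returns the per-structure gains and the total number of unique sequences.
    """
    covered = set()
    gains = []
    for name in structures_in_order:
        new = modelable[name] - covered
        gains.append((name, len(new)))
        covered |= new
    return gains, len(covered)


if __name__ == "__main__":
    # Toy data: structure -> {sequence id: percent identity}
    identities = {
        "S1": {"a": 45.0, "b": 33.0, "c": 31.0, "d": 20.0},
        "S2": {"c": 60.0, "d": 35.0},
        "S3": {"d": 40.0, "e": 52.0, "f": 30.5},
    }
    modelable = {name: modelable_sequences(t) for name, t in identities.items()}
    print(incremental_leverage(["S1", "S2", "S3"], modelable))
```

Because the gains depend on the order in which structures are considered, the same three structures credited in a different order would yield different per-structure numbers but the same total.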
Solute carriers are a group of approximately 400 biomedically important membrane proteins that control the uptake and efflux of solutes, including essential cellular compounds and therapeutic drugs (45). Numerous variants that are important for clinical drug response have been identified in solute carriers by the Pharmacogenomics of Membrane Transporters project (PMT) at UCSF (46). Solute carriers can share similar structural features despite weak sequence similarities.
We defined solute carrier families by comparing their sequences using structure and profile–profile alignments as well as similarity networks. The families were analyzed in the context of substrate type, transport mode, organism conservation and tissue specificity (47). The classification is useful for inferring similarities and differences in various structural and functional features such as fold, ligand-binding site and molecular mechanism of uncharacterized solute carriers based on their characterized aligned homologs. We used these family definitions to show which solute carriers have known structures or have good quality comparative models—i.e. models based on >30% sequence identity to a known template structure over at least 70% of their sequences, or are assessed to have the correct fold by various scores (47). In addition to ModBase and the Protein Model Portal (48), the solute carrier alignments and models are freely accessible via PharmGKB (49). A phylogenetic tree for each modeled solute carrier is also provided through a link from the ModBase model pages (http://salilab.org/modbase/search?dataset=slc).
S. mansoni is a parasitic flatworm and the major causative agent of schistosomiasis, a disease affecting >200 million people in developing countries. The pathogen employs many strategies to infect the human host and evade the immune response through different life-cycle stages (50). To understand these mechanisms of pathogenesis, we applied a host–pathogen protein–protein interaction prediction pipeline to the human and S. mansoni proteomes. This pipeline, previously applied to 10 pathogens (51), relies on comparative modeling of human and pathogen proteins based on template domain–domain interactions and subsequent evaluation of the complex model interface using the MODTIE statistical potential (32). Application of the pipeline resulted in over 500 predicted complexes involving both human and S. mansoni proteins. Some of these predictions include parasite proteins expressed in the invasive cercarial life-cycle stage as well as human proteins known to play a role in immunomodulatory processes. Several of these predictions are currently being tested by experiment.
The Gram-negative bacterium H. pylori inhabits the human stomach. The presence of pathogenic strains has been shown to lead to gastric ulcers, gastritis and gastric cancer (52). As part of our effort to provide functional annotations for genes in the H. pylori genome (http://phylogenomics.berkeley.edu/phylofacts/), we created a ModBase data set of models for all sequences in the proteome of the H. pylori strain 26695 that are detectably related to an experimental structure. For 61 of the 1575 proteins in this strain, crystal structures of domains or whole proteins already exist. For 1467 of the remaining 1514 proteins in this strain, at least one reliable model was built. The number of proteins with models based on 0–20, 20–30, 30–40, 40–50, 50–60 and 60–100% sequence identity is 40, 368, 603, 275, 96 and 85, respectively. Of these, 584 had at least one model for which TSVMod (19) predicted a Cα RMSD ≤ 3.5Å. The available templates lie at varying evolutionary distances from the target proteins, and different regions of a single target protein may be homologous to different templates.
We illustrate the use of these models with the enzyme biotin carboxylase (locus HP_0370, UniProt accession O25134, gi 2313468). Biotin carboxylase catalyzes an early step in fatty acid biosynthesis. Thus, bacterial biotin carboxylases are investigated as potential drug targets using virtual screening (53). Because these enzymes occur across the Tree of Life (including human), detailed knowledge of the catalytic site geometry may help in designing drugs that are specific to the pathogen and don’t bind to the host proteins. Prediction of functional sites by similarity to experimentally characterized functional sites is facilitated by the use of comparative models to visualize and probe protein function (25,54,55).
The ModPipe pipeline produced several models for this protein based on templates at different evolutionary distances. Analysis of the H. pylori biotin carboxylase with the Berkeley PHOG algorithm (56), a phylogenomic method of orthology prediction, supports the annotation of this protein as a biotin carboxylase based on super-orthology—the most stringent definition of orthology (57)—with two experimentally characterized proteins in the BRENDA database (58): Q54755 (Synechococcus elongatus strain PCC 7942) and Q10YA8 (Trichodesmium erythraeum strain IMS101). A human mitochondrial ortholog, Q96RQ3 (PDB ID 2ejm), includes annotation of site-specific features from SwissProt (59).
To predict functional residues using the ModBase models for this enzyme, we submitted the H. pylori biotin carboxylase to the INTREPID webserver (60) that uses a phylogenomic algorithm to predict evolutionarily conserved sites (61). Of the top 10 residues predicted by INTREPID, 7 are supported by experimental studies based on homology to the biotin carboxylase subunit of acetyl-CoA carboxylase (PDB ID 1bnc): C243 [equivalent to C230 in 1bnc, whose catalytic function is supported (62)], H222 [H209 in 1bnc (63)], H312 [H297 in 1bnc, adjacent to active site (63)], M304 [M289 in 1bnc (63)], Q246 [Q233 in 1bnc (64)], Q250 [Q237 in 1bnc (63)] and Q309 [Q294 in 1bnc (63)]. Three residues (F93, Y74 and Q226) may represent novel predictions of functional sites. INTREPID predictions and known active site residues are displayed in Figure 2, illustrating the use of comparative models to predict functional sites. The complete genome modeling data set for H. pylori can be downloaded from ftp://salilab.org/databases/modbase/projects/genomes/.
The main access to ModBase is through its web interface at http://salilab.org/modbase, by querying with UniProtKB (3) and GI (2) identifiers, gene names, annotation keywords, PDB (65) codes, data set names, organism names, sequence similarity to the modeled sequences (BLAST (15)) and model-specific criteria such as model reliability, model size and target–template sequence identity. Additionally, it is possible to retrieve coordinate files and alignment files as text files. Select genome data sets are also available from our ftp server (ftp://salilab.org/databases/modbase/projects).
The output of a search is displayed on pages with varying amounts of information about the modeled sequences, template structures, alignments and functional annotations. An example of the output from a search resulting in one model is shown in Figure 3. A ribbon diagram of the model with the highest target–template sequence identity is displayed by default, together with some details of the modeling calculation. Ribbon thumbprints of additional models for this sequence link to corresponding pages with more information. Ribbon diagrams are generated on the fly using Molscript (66) and Raster3D (67). A pull-down menu provides links to additional functionality: the SNP module, retrieval of coordinate and alignment files as well as molecular visualization by Chimera that allows the user to display template and model coordinates together with their alignment. If mutation information is available for a protein sequence, links to the details are provided in the cross-references section. Additionally, cross-references to various other databases, including PDB (65), UniProtKB (68), the UCSC Genome Browser (69), EBI’s InterPro (70), PharmGKB (71) and SFLD (72) are given. Other ModBase pages provide overviews of more than one sequence or structure. All ModBase pages are interconnected to facilitate easy navigation between different views.
The Protein Model Portal (PMP) has become a valuable option for accessing ModBase models (http://proteinmodelportal.org) (49,73). The PMP is a single point of entry for accessing protein structure models from a number of different databases, by querying all participating source model databases, and serving the model coordinates, alignments and quality criteria from a central location.
ModBase models in academic and public data sets are also directly accessible from several other databases, including UniProtKB (3), PIR’s iProClass (68), EBI’s InterPro (70), the UCSC Genome Browser (69), PubMed (LinkOut) (74), PharmGKB (71) and SFLD (72).
ModBase will grow by adding models calculated on demand by external users (using ModWeb) as well as our own calculations of model data sets that are needed for our research projects (using ModPipe, ModWeb or Modeller). These updates will reflect improvements in the methods and software used for calculating the models as well as new template structures in the PDB and new sequences in UniProtKB. In the future, we expect that most of the users will access ModBase models through the PMP.
Users of ModBase are requested to cite this article in their publications.
National Institutes of Health (R01 GM54762, U54 GM074945, U54 GM074929, U01 GM61390, P01 GM71790 to A.S., F32 GM088991 to A.Sch., P41 RR001081 to T.E.F.); the National Science Foundation (0732065 to A.S. and K.S.); the Department of Energy (DE-SC0004916 to K.S.); Sandler Family Supporting Foundation (to A.S.). Funding for open access charge: NIH (U54 GM074945).
Conflict of interest statement. None declared.
For linking to ModBase from their databases, the authors thank Torsten Schwede (PMP), David Haussler and Jim Kent (UCSC Genome Browser), Amos Bairoch (SwissProt/TrEMBL), Rolf Apweiler (InterPro), Patsy Babbitt (SFLD), Russ Altman (PharmGKB) and Kathy Wu (PIR/iProClass). The authors are also grateful for computing hardware gifts from Mike Homer, Ron Conway, NetApp, IBM, Hewlett Packard and Intel.