We describe the proceedings and conclusions from a “Workshop on Applications of Protein Models in Biomedical Research” that was held at the University of California at San Francisco on 11 and 12 July, 2008. At the workshop, international scientists involved with structure modeling explored (i) how models are currently used in biomedical research, (ii) what the requirements and challenges for different applications are, and (iii) how the interaction between the computational and experimental research communities could be strengthened to advance the field.
Three-dimensional modeling of biological molecules and their interactions has a long history and is now established as a cornerstone of modern structural biology. Classic examples include the molecular model of the DNA double helix that was built by James Watson and Francis Crick in 1953 (Watson and Crick, 1953), models for the polypeptide α-helix and β-sheet proposed by Linus Pauling some two years earlier (Pauling et al., 1951), and the first homology model of a protein, built by David Phillips and coworkers for α-lactalbumin based on hen egg white lysozyme (Browne et al., 1969). While not every model can have the same impact as these early landmark examples, the potential of molecular modeling to produce new biological insights has never been greater than it is today, thanks to the recent explosion of sequence and structural data, advances in modeling methods and vastly more powerful computers.
Protein structure prediction methods differ in terms of the needed input information and the aspects of protein structure that can be computed. The secondary structure, trans-membrane segments, and disordered regions can be predicted from a protein sequence (Bryson et al., 2005; Rost, 2003); an atomic model of a domain can be obtained from the sequence alone by ab initio or de novo prediction methods (Das and Baker, 2008); fold assignment and sequence-structure alignment can be achieved by threading against a library of known folds (Godzik, 2003); atomic models of a protein can be calculated on the basis of known template structures by homology modeling (Marti-Renom et al., 2000; Petrey and Honig, 2005; Schwede et al., 2003); and atomic and reduced representation models of protein complexes with small ligands and other macromolecules, such as nucleic acids, can be derived with various docking methods (Lensink et al., 2007). Increasingly, integrative or hybrid methods rely on more than one type of information, especially for the structural characterization of protein assemblies (Alber et al., 2008).
A stimulating catalyst for molecular modeling is the Protein Structure Initiative (PSI) that aims to determine representative atomic structures of most major protein families by X-ray crystallography and NMR spectroscopy, so that the remainder of protein sequences can be characterized by homology modeling (http://www.nigms.nih.gov/Initiatives/PSI/) (Chandonia and Brenner, 2006; Liu et al., 2007). In the PSI, experimental structure determination and molecular modeling are mutually reinforcing. On the one hand, the experiments provide essential template structures for homology modeling of specific sequences, and the expanded dataset of protein structures provides opportunities for developing better modeling methods. On the other hand, modeling greatly leverages experimentally determined structures. By judicious selection of the target proteins whose structures are determined by experiment, each experimental structure enables the modeling of many protein sequences that could not be modeled well before (Liu et al., 2007). Molecular modeling can also add value to both experimentally determined structures and models; for example, docking of small molecules to proteins can be used for functional annotation (Hermann et al., 2007a) and docking of proteins can be used for characterization of large macromolecular machines (Lensink et al., 2007). Finally, integrative methods have begun to improve the process of experimental structure determination itself (Alber et al., 2007a; Qian et al., 2007).
To make the fruits of PSI available as widely as possible, the PSI Structural Genomics Knowledgebase was launched in February 2008 (http://kb.psi-structuralgenomics.org) (Berman, 2008). The Knowledgebase is designed to provide a “marketplace of ideas” that connects protein sequence information to experimentally determined structures and computationally predicted models, enhances functional annotation, and facilitates access to new experimental protocols and materials. The initial version of the Knowledgebase is a web portal to a series of modules, including the Experimental Tracking, Material Repository, Models, Annotation, and Technology portals. The Protein Model Portal in particular provides access to models calculated by SWISS-MODEL (Kopp and Schwede, 2004) and ModBase (Pieper et al., 2006), as well as models produced by the four PSI large-scale production centers (http://www.proteinmodelportal.org/). Its design and implementation are based on the recommendations proposed at the Workshop on Biological Macromolecular Structure Models in 2005 (Berman et al., 2006). The Model Portal aims to foster effective usage of molecular models in biomedical research by providing convenient and comprehensive access to the models and their annotations. An associated annual workshop will be a forum for developers and users of modeling methods on best practices, including methods for estimating model accuracy, guidelines for publishing theoretical models, and educational resources on using models for different biological applications. Thus, the Model Portal is a major opportunity to increase the impact of molecular modeling on biology and medicine.
Sixty-four participants from 30 academic, industry, and government institutions worldwide, including 9 from non-US locations, attended a workshop at the University of California at San Francisco (http://www.proteinmodelportal.org/workshop/). The participants discussed state-of-the-art applications of molecular modeling to biomedical problems, the requirements and challenges for various applications, as well as ways to strengthen the collaboration between the modeling and experimental communities. While the workshop was concerned primarily with applications of homology modeling as a cornerstone of the PSI, other relevant molecular modeling areas were also covered, including application of modeling to improving experimental structure determination (e.g., molecular replacement in X-ray crystallography) and the use of homology models in conjunction with other methods (e.g., docking of small molecules and proteins). The participants’ consensus was formulated as specific recommendations, aimed at increasing the impact of molecular modeling in biology and medicine.
On the first day, 16 presentations were given on topics that ranged from coverage of protein sequence-structure space (Section 2) to the uses of modeling in biology and medicine (Section 3). On the second day, four independent discussion groups were asked to address the same set of specific questions covering the topics of the workshop, report on their findings, and make recommendations for the future (Section 5). Thus, each set of participants approached the issues in their own way; the resulting redundancy provided a rich source of ideas revealing both a commonality and a diversity of opinions that are incorporated in this document.
The utility of molecular modeling hinges on its coverage and accuracy. In other words, modeling needs to be applicable to many proteins and the models need to be sufficiently accurate for biological applications. The coverage issue was addressed in a recent comprehensive analysis of the current sampling of the protein universe (Levitt, 2008). The protein universe is the set of protein sequences and structures in all organisms. It was explored in terms of sequence families that have single or multi-domain architectures, with or without known structures. The domains were defined based on the CDART resource at NCBI (Geer et al., 2002), which contains almost 30,000 domain families. Growth of single domain families has now saturated: almost all current growth comes from multi-domain architectures that are combinations of single domains. Structures are known for a quarter of the single-domain families and half of all known sequences can be partially modeled due to their membership in these families; 20% of the structures for such modeling come from the structural genomics effort, in particular, from the PSI. Multi-domain architecture families continue to grow rapidly and at the same rate as deposited sequences; almost all novelty, therefore, arises from the arrangement of known single domains within a chain, particularly for eukaryotes. A quarter of the sequences do not appear to match any domain pattern and constitute the dark matter of the protein universe.
These empirical observations demonstrate the relatively high degree of applicability of homology modeling and the important role that structural genomics plays in increasing this coverage. Moreover, the generation of novel proteins through combining individual domains increases the importance of molecular docking as a means to characterize the structures of the multi-domain proteins.
Modeling is not only widely applicable (Section 2), but often sufficiently accurate to make an impact on biology and medicine. To demonstrate this point, we do not discuss here the purely technical measures of geometrical accuracy of a model; instead, we focus on the bottom line corresponding to the numerous published studies where models have helped provide important biological insights. In most examples presented at the Workshop, the models have been combined with experimental efforts to produce results of significant biomedical impact. Therefore, despite its remaining limitations, modeling can certainly add substantial value to experimentally determined protein structures.
Homology modeling is widely applied in the pharmaceutical industry and is integrated into most stages of pharmaceutical research (Tramontano, 2006). For example, it is used to design protein constructs and to enhance protein production, solubility, and crystallization. Once a protein is established as a viable pharmaceutical target, homology modeling is used in assay development, compound screening, identification of biologically active small molecules, and further optimization of the potency of those compounds.
Homology models are used in “structure-based ligand discovery”, facilitating investigation of ligand-protein interactions in an effort to find ligands and improve their potency (Rester, 2008). One technique, “virtual screening”, computationally screens large libraries of organic molecules for those that complement the structure of a protein binding site (Huey et al., 2007). Success rates for identifying compounds with biological activity typically range from 1% to 15% of those molecules that are predicted to bind (Babaoglu et al., 2008; Doman et al., 2002). This relatively high false positive rate reflects the remaining challenges with accurate prediction of affinity. Nevertheless, virtual screening was found to be as useful as experimental “high-throughput screening” in side-by-side prospective studies (Babaoglu et al., 2008; Doman et al., 2002). Homology models accelerate the virtual screening process and can provide useful suggestions before crystal structures are available or experimental high-throughput screening begins (Oshiro et al., 2004).
Other applications of structural models involve “optimization” of hits from virtual screening or high throughput screening by detailed examination of the ligand-protein interactions and the exploitation of new contacts with the protein via ligand modification (Noble et al., 2004). The discovery and development of neuraminidase inhibitors is an important case where structure-based methods were used to guide the design of the first anti-influenza drug Relenza (zanamivir), brought to market by GlaxoSmithKline (von Itzstein et al., 1993). Coupled with informed molecular biology efforts, even crude homology models based on remotely related structures have been successful in facilitating drug discovery (de Paulis, 2007). Modeling is especially robust and informative when used in a target class mode; for example, homology modeling of kinases has been applied to ligand discovery, as well as optimization of binding potency and selectivity (Buckley et al., 2008; Diller and Li, 2003; Rockey and Elcock, 2006). Long before experimental structures of GPCRs were determined, models helped the selection and introduction of GPCR ligands into the clinic (Engel et al., 2008; Webb and Krystek; Webb et al., 1996). Clearly, the recent GPCR structures (Cherezov et al., 2007; Rasmussen et al., 2007; Warne et al., 2008) will further aid modeling of this important class of biological targets.
Several biotherapeutics have been developed with the aid of homology modeling. Antibody construct design and humanization is a mature field (Lippow et al., 2007). Of the 21 antibodies on the market as of 2007, it is estimated that 11 were the result of computational design of humanized constructs via homology modeling. Three examples are Zenapax (humanized anti-Tac or daclizumab), Herceptin (humanized anti-HER2 or trastuzumab), and Avastin (humanized anti-VEGF or bevacizumab) (Carter et al., 1992; Presta et al., 1997; Queen et al., 1989). Many more have reached clinical trials. Similar techniques have been used to engineer smaller antibody fragments with improved specificity, affinity, and half-life (Hinton et al., 2004; Lazar et al., 2006; Lippow et al., 2007).
Enzymes and other biologicals are widely used in biotechnology and industrial processes; they are key components of detergents and animal feed, and are used in the production of bread, wine and fruit juice, as well as in the treatment of textiles, paper, and leather. Enzymes frequently replace traditional chemicals or additives and help to save water and energy in a variety of production processes. Molecular modeling often provides the basis for understanding and engineering their biophysical properties, such as stability at high temperature and oxidation, activity at low temperatures, and substrate specificity (Alquati et al., 2002; Hult and Berglund, 2003).
Most proteins act in the cell through interactions with other proteins. Therefore, the impact of individual models, as well as experimentally determined atomic structures, can be increased by computational docking methods that produce models of protein complexes. The need for computational docking is emphasized by the difficulty of experimental structure determination for complexes, especially the more transient ones. Despite remaining challenges, the results of the CAPRI effort (Critical Assessment of Predicted Interactions) (Janin et al., 2003) demonstrate that substantial progress in docking methods has been made during the last few years (Lensink et al., 2007). The ClusPro docking server, which returns best-scoring models of a complex between two input atomic structures or models, is a case in point (Comeau et al., 2004). The main applications of the server have included modeling multi-domain proteins and oligomers, frequently in combination with additional data from experimental or other computational techniques.
For example, the configuration of the histone domain relative to the Dbl-homology, pleckstrin-homology and catalytic domains in the Ras-specific nucleotide exchange factor son of sevenless (SOS) was determined by filtering top scoring docking models by small-angle X-ray scattering, mutagenesis, and calorimetry data (Sondermann et al., 2005); the orientation and position of the histone domain implicated it as a potential mediator of membrane-dependent activation signals. Similarly, the high-resolution solution structure of the 15.4 kDa homodimer CylR2, the regulator of cytolysin production from Enterococcus faecalis, was solved by combining paramagnetic relaxation enhancement data with docking (Rumpel et al., 2008). Further, the binding of cofilin to monomeric actin (Kamal et al., 2007) was characterized by a combination of docking with mass spectrometry data (Kamal and Chance, 2008). Additional examples of docking include a model of the human p53-controlled ribonucleotide reductase (p53R2) homodimer, which was used to explain mutations that cause mitochondrial DNA depletion (Bourdon et al., 2007); and an L-type Ca2+ channel, which was used for the characterization of binding interactions with 1,4-dihydropyridines (Cosconati et al., 2007).
The recognition by peripheral membrane proteins of different biological membranes and distinct phospholipids underlies a variety of signaling processes. What is the molecular basis of these recognition mechanisms? In close collaboration with experimental groups, modelers studied this problem by first building homology models of proteins, both within functional families and across genomes, and then predicting the sub-cellular localization of proteins based on the calculated electrostatic properties of those models. For example, a computational study of structures and models for all retroviral matrix domains, such as those from HIV-1, revealed that matrix domains contain a characteristic basic surface patch and, thus, exploit electrostatic interactions to bind membrane surfaces (Dalton et al., 2005; Murray et al., 2005). This discovery provides insight into the mechanism used by matrix domains to localize to the plasma membrane of infected cells.
The construction of models of the membrane binding domains from different families (Ananthanarayanan et al., 2002; Blatner et al., 2004; Stahelin et al., 2004; Yu et al., 2004) also illustrates how homology modeling allows the identification of functional properties of proteins that differ from those of a family member whose structure has been determined by experiment. Specifically, calculations with a homology model for the PX domain from phospholipase D-1 showed that this domain binds membranes containing the cellular growth-inducing PI, PI(3,4,5)P3, primarily through electrostatic interactions, although the model was built using the structure of a PX domain that binds to PI(3,4)P2-containing membranes with significant hydrophobic penetration (Stahelin et al., 2004).
Members of the NSS transporter family are responsible for uptake of neurotransmitters (such as glycine, γ-amino butyric acid, serotonin, dopamine, and norepinephrine) from the synaptic cleft; mutations in NSS transporters have been implicated in psychological and digestive disorders including schizophrenia. Furthermore, several NSS transporters have been shown to be targets for psychoactive compounds such as cocaine. Thus, an understanding of the molecular mechanisms underlying transport by these proteins is of considerable interest. It has been extremely difficult to crystallize mammalian members of this family, but bacterial substitutes have been more tractable. These structures can then be used as templates to construct homology models of mammalian homologs, which in turn can be used to deduce function. In a specific example, the chloride binding site of the serotonin transporter, SerT, was identified from a homology model built from the previously published structure of a bacterial amino acid transporter, LeuT, which does not bind chloride (Forrest et al., 2007). The prediction was confirmed experimentally. The work was highlighted in an Editor’s Choice in Science, emphasizing the importance of homology modeling to this class of problems (Chin and Yeston, 2007).
Many enzymes encoded by sequenced genomes and metagenomes have unknown functions. One promising approach to leverage structures for functional annotation is to dock libraries of possible substrates or chemical intermediates against the enzyme active site (Hermann et al., 2006; Hermann et al., 2007b; Kalyanaraman et al., 2005). Homology models can extend the utility of this approach to the many uncharacterized enzymes lacking experimental structures, and enable prediction of substrate specificity among related enzymes in protein families.
In a joint computational and experimental effort, homology models were created for approximately 100 homologs of an Ala-Glu epimerase enzyme for which a crystallographic structure was available (Kalyanaraman et al., 2008). Docking possible substrates against the models suggested that many had different substrate specificities and, hence, biological functions. Subsequent experimental screening confirmed several novel functions, including N-succinyl-Arg racemase (Song et al., 2007) and Ala-Phe epimerase (Kalyanaraman et al., 2008), and crystal structures confirmed the predicted binding modes. Because enzyme specificity is related to fine details of the binding site, such as precise orientations of side chains, one promising approach is to treat the binding site of homology models as flexible during docking, reducing the sensitivity of the results to small errors in the model (Hamblin et al., 2008; Song et al., 2007).
The onrush of personal genetic data adds new urgency for more effective computational analysis of the structural and functional impact of mutations, such as non-synonymous, single DNA base variants (i.e., those that change the encoded amino acid residue type) (Karchin et al., 2007). Exon sequencing is already providing single base somatic mutation information in individual cancer cell lines. Many more data of this type are expected shortly (Di Bernardo et al., 2008; Sjoblom et al., 2006; Stacey et al., 2008). It is impossible to characterize functional consequences of all mutations by experiment, because there are too many of them. Therefore, computational approaches are required that are based on general principles of protein evolution, structure, and function. Full utilization of the mass of mutation data will require knowledge of the structure of human proteins, and that knowledge will come primarily from models.
With a particular machine learning method, homology models based on experimental templates down to 40% sequence identity provide as accurate a prediction of functional impact of a DNA base variant as do experimental structures (Yue et al., 2005). Use of these models doubles the number of human common base variants that can be fully analyzed for likely impact, compared with using experimentally determined structures alone. Further improvements in modeling methods enabling the use of models based on sequence identity down to 20% would add a further 50% to the number of analyzable single point mutations. Recent progress measured in the CASP experiments (Kopp et al., 2007; Kryshtafovych et al., 2007) suggests this coverage is not an unreasonable expectation. A particularly successful example is provided in the next section.
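The sequence identity thresholds quoted above can be made concrete with a minimal sketch. The Python function below computes percent identity over the columns of a target-template alignment in which both sequences have a residue. The function name, the toy sequences, and the choice of denominator (aligned, non-gap columns) are illustrative assumptions; other definitions of sequence identity exist, and this is not the procedure used by Yue et al. (2005).

```python
def percent_identity(target: str, template: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(target) != len(template):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap character in either sequence.
    aligned = [(a, b) for a, b in zip(target, template) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    matches = sum(1 for a, b in aligned if a == b)
    return 100.0 * matches / len(aligned)


# Hypothetical toy alignment (not real protein sequences).
target = "MKT-AYIAKQR"
template = "MRTLAYLA-QR"

identity = percent_identity(target, template)
# Compare against the 40% threshold mentioned in the text.
above_40 = identity >= 40.0
```

In this toy case, 7 of the 9 gap-free columns match, so the template would fall above the 40% threshold under the assumed definition.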
Homology modeling and other computational tools have also been used to study structure-function relationships of proteins involved in DNA repair, cell cycle progression, chromatin formation, apoptosis, and other cellular processes associated with cancer development. Recent examples include explaining mutant phenotypes in a complex of yeast cyclin C and its cyclin-dependent kinase, cdk8p (Krasley et al., 2006), analysis of patient-derived mutants of c-kit in gastrointestinal stromal tumors (Tarn et al., 2005), and a prediction of the docking structure of BAK with p53 in apoptosis that relied on structure-based design of mutants (Pietsch et al., 2008).
One of the most useful applications of molecular modeling in cancer biology is to dissect the roles of multiple interacting proteins in various pathways associated with cancer (Huang et al., 2008). As an example, collaboration between experimental biologists and molecular modelers at the Fox Chase Cancer Center was aimed at understanding different phenotypes of overexpression of the chromatin remodeling protein ASF1a in humans (Tang et al., 2006; Zhang et al., 2005). Overexpression of this protein causes two different phenotypes: an increase in the formation of senescence-associated heterochromatin foci (SAHF) and G2-cell-cycle arrest. A homology model of the human ASF1a protein was constructed based on an experimentally determined yeast protein structure. It was found that mutations affecting SAHF formation were clustered together at one end, whereas mutations that did not affect SAHF formation were scattered in other regions of the structure (Tang et al., 2006). To investigate the cell-cycle arrest phenotype, modelers searched for a cluster of surface residues elsewhere in the model that were conserved within ASF1a, but different from ASF1b (which does not exhibit the cell-cycle arrest phenotype). Mutations of residues that were predicted to affect cell-cycle arrest, but not the SAHF phenotype, were subsequently verified experimentally.
Molecular modeling plays an increasing role in experimental structure determination. In point of fact, the experimentally or theoretically derived information about a structure being determined must always be converted to an explicit structural model through computation. The “integrative” or “hybrid” approaches explicitly combine diverse experimental and theoretical information, with the aim to increase the accuracy, precision, coverage, and efficiency of structure determination (Alber et al., 2008; Robinson et al., 2007). Input information may vary greatly in terms of resolution (i.e., precision), accuracy, and quantity. To be precise, all structure determination methods are integrative, but there is a difference in degree. At one end of the spectrum, even atomic structure determinations by X-ray crystallography and NMR spectroscopy rely on a molecular mechanics force field as well as on the “raw” X-ray and NMR data, respectively. An archetypal hybrid method is flexible docking of comparative models for component proteins into an electron density map of their assembly determined by cryo-electron microscopy (Rossmann et al., 2005; Topf et al., 2008). Such hybrid methods begin to blur the distinction between models based primarily on theoretical considerations and those based primarily on experimental data about the characterized system.
Modelers have begun to contribute directly to atomic structure determination of proteins. In crystallography, de novo protein structure prediction can sometimes solve the phase problem, via molecular replacement models for proteins of distant homology or even no detectable homology to previously solved structures (Qian et al., 2007). In structure determination by satisfaction of NMR-derived restraints, high-resolution physics-based refinement can now consistently improve the accuracy of NMR model ensembles (Bhattacharya et al., 2008; Qian et al., 2007). Perhaps most promising are methods that can dramatically accelerate NMR-based structural inference, by bringing together limited chemical shift data with modeling techniques to achieve structures with near-atomic resolution (Cavalli et al., 2007; Shen et al., 2008).
Even low-resolution biophysical and biochemical data can provide a rich source of structural information that can be integrated into realistic representations of macromolecular assemblies, as shown by determining the positions of the 456 constituent proteins in the yeast nuclear pore complex (NPC) (Alber et al., 2007a; Alber et al., 2007b). The structure was determined at approximately 5 nm resolution by satisfying spatial restraints that encoded protein and nuclear envelope excluded volumes (from the protein sequences and ultracentrifugation), protein positions (from immunoelectron microscopy), protein contacts (from affinity purification), and the eight-fold and two-fold symmetries of the NPC (from electron microscopy). Although each individual restraint may contain little structural information, the concurrent satisfaction of all restraints derived from independent experiments drastically reduced the degeneracy of the structural solutions. The resulting low-resolution map was combined with atomic structures and homology models of constituent proteins, resulting in insights about the evolution and function of the NPC. This study illustrates how structural genomics and the PSI can make a major impact even on the most challenging structural biology problems, through providing atomic structures and homology models of the individual proteins that are then assembled into models of large macromolecular machines and processes.
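The idea of concurrently satisfying many weak restraints can be illustrated with a minimal sketch. The Python code below scores a set of points (standing in for proteins) by the squared violations of two restraint types, excluded volume and pairwise contacts, and reduces the score with a simple greedy random search. All names, parameter values, and the toy system are hypothetical, and the greedy search is swapped in here for brevity; it is not the representation, scoring function, or optimization protocol used in the NPC study.

```python
import math
import random


def score(coords, contacts, radius, contact_dist):
    """Sum of squared violations of excluded-volume and contact restraints."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            # Excluded volume: spheres of the given radius should not overlap.
            if d < 2 * radius:
                total += (2 * radius - d) ** 2
            # Contact restraint: listed pairs (i < j) should lie within contact_dist.
            if (i, j) in contacts and d > contact_dist:
                total += (d - contact_dist) ** 2
    return total


def optimize(coords, contacts, radius=1.0, contact_dist=2.5,
             steps=2000, step_size=0.5, seed=0):
    """Greedy random search: accept a random move only if the score does not rise."""
    rng = random.Random(seed)
    coords = [list(c) for c in coords]
    best = score(coords, contacts, radius, contact_dist)
    for _ in range(steps):
        i = rng.randrange(len(coords))
        old = coords[i][:]
        coords[i] = [x + rng.uniform(-step_size, step_size) for x in old]
        new = score(coords, contacts, radius, contact_dist)
        if new <= best:
            best = new
        else:
            coords[i] = old  # reject the move
    return coords, best


# Hypothetical toy system: four components and three contact restraints.
start = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (10.0, 10.0, 0.0)]
contacts = {(0, 1), (1, 2), (2, 3)}

initial_score = score(start, contacts, 1.0, 2.5)
final_coords, final_score = optimize(start, contacts)
```

Because moves that increase the score are always rejected, the final score can never exceed the initial one; a realistic application would need a richer scoring function and a sampling scheme able to escape local minima.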
We now summarize the recommendations reached by consensus among the four independent workshop discussion groups. The recommendations are concerned with (i) coverage of the sequence space by homology modeling; (ii) publication and archiving of models; (iii) standards for data formats; (iv) estimating model accuracy; (v) communication between modelers and experimentalists; and (vi) development and role of the Protein Model Portal.
As discussed above, modeling can significantly expand the structural coverage of the protein universe. It remains unclear how best to integrate the experimental structure determination and computational modeling to maximize the impact of structural genomics on biology. The present focus of the PSI on large families that have no structural representatives and on very large families with limited structural coverage is a promising approach to achieve this goal.
We recommend that the modeling and structural genomics communities interact closely to formulate how maximizing the structural coverage can be most efficiently achieved. Suitable metrics for measuring structural coverage must be developed by the modeling community. Once these metrics are adopted, the PSI Knowledgebase will continually update and report them.
At the present time, models are published with different amounts of information about how these models were derived. A set of guidelines for what should be included in a modeling paper needs to be established. For homology modeling, these guidelines may include decisions leading to choice of the template structure(s), details of sequence alignment, methods used to derive the model, indication of the expected accuracy of the model, and how the model may be accessed publicly. These guidelines should be shared with journal editors and reviewers.
Models that have been peer reviewed and referred to in published literature should be publicly available. Without access to the model coordinates and sufficient annotation of the model, it is impossible for the reader to interpret the results and to assess the validity of the interpretations. In the past, some of the models were archived in the Protein Data Bank (PDB). Since 2006, only structures that have been determined experimentally are allowed to be deposited in the PDB (Berman et al., 2006).
We recommend that a Model Working Group be established to set standards for journal publication, to define minimum annotation standards, and to establish the scope and requirements of a public archive of in silico models. Membership of this group should include representatives of the wwPDB (Berman et al., 2003) and the Protein Model Portal, as well as members of the modeling and user communities.
While the experimental structural biology community has essentially reached a consensus on the definition of common data formats that enable the seamless exchange of data and algorithms (Westbrook and Fitzgerald, 2003; Winn, 2003), most software tools for protein structure modeling use proprietary data formats for input data, parameters, and results. Although data formats from experimental structures can be applied to the protein model coordinates, data types specific to computational modeling, such as target-template alignments, error estimates, force field parameters, and specific details of the individual modeling algorithms, frequently vary between different applications. This incompatibility is a serious impediment for the exchange of tools and algorithms; it hinders both method development and the widespread use of tools outside of the developer groups themselves.
We recommend that the Model Working Group initiate a community-wide mechanism for reaching an agreement on a common open data format for information related to molecular modeling, with the aim of facilitating the exchange of algorithms and data. Once these standards are established, the services offered by the Protein Model Portal should be based exclusively on these common formats.
As with structures determined by X-ray crystallography and other methods, accuracy can be estimated globally, akin to the crystallographic R-value, or locally, akin to residue-specific, real-space correlation coefficients and R-values. Applications of models strongly depend on their accuracy, with different applications having varied requirements on accuracy and precision. Even if the overall accuracy of the model is high, the accuracy of specific regions (binding sites, loops, pockets, surface features, and overall fold) may vary. Criteria based on the global correctness of Cα coordinates are often insufficient to decide whether a model is suitable for a specific application, such as modeling ligand binding (Kopp et al., 2007). Accuracy measures that convey the suitability of models for specific applications need to be established.
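The difference between global and local accuracy can be illustrated with a minimal sketch. It assumes a retrospective comparison in which a reference structure is available and the model has already been superposed onto it. The function names, the toy coordinates, and the choice of "binding-site" residues are hypothetical and serve only as an illustration; they are not taken from any of the cited methods.

```python
import math

def ca_distances(model, reference):
    """Per-residue distance between matched C-alpha atoms.

    Assumes both coordinate lists are already superposed and
    list the same residues in the same order.
    """
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same number of residues")
    return [math.dist(m, r) for m, r in zip(model, reference)]

def rmsd(distances, residue_indices=None):
    """Root-mean-square deviation over all residues or over a selected subset."""
    if residue_indices is not None:
        distances = [distances[i] for i in residue_indices]
    return math.sqrt(sum(d * d for d in distances) / len(distances))

# Toy reference: ten C-alpha atoms spaced 3.8 Angstrom apart along the x axis.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]

# Toy model: identical to the reference except for residues 4 and 5,
# which are displaced by 3.0 Angstrom (a hypothetical binding-site error).
model = [(x, 3.0 if i in (4, 5) else 0.0, z)
         for i, (x, _, z) in enumerate(reference)]

distances = ca_distances(model, reference)
global_value = rmsd(distances)        # averaged over the whole chain
site_value = rmsd(distances, [4, 5])  # restricted to the residues of interest

print(f"global C-alpha RMSD:  {global_value:.2f}")
print(f"binding-site RMSD:    {site_value:.2f}")
```

In this toy case the global value is less than half of the value for the two displaced residues, which is one way to see why a global Cα criterion alone may not indicate whether a model is suitable for an application such as ligand binding.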
Methods for estimating model accuracy are being actively studied, but no single method has yet emerged as clearly accurate or dominant. In one type of approach, global and local model properties are compared against expected values from statistical analyses of experimentally determined structures, such as main-chain dihedral angle distributions, rotamer probabilities, and solvation properties (Benkert et al., 2008; Bhattacharya et al., 2008; Pettitt et al., 2005; Shen and Sali, 2006; Sippl, 1993; Wallner and Elofsson, 2003). However, it is still possible for an inaccurate model to pass these checks. In cases where a number of independent models are available for a given target, consensus-based approaches can be applied (Ginalski et al., 2003; Wallner et al., 2003).
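The idea behind a consensus-based approach can be sketched as follows. This is a simplified illustration, not the algorithm of the cited methods: each model is scored by its mean distance RMSD (dRMSD) to all other models of the same target, and models that agree best with the rest are ranked first. The function names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def drmsd(a, b):
    """Distance RMSD between two models with the same residues in the same order.

    Compares internal C-alpha to C-alpha distances, so no superposition is needed.
    """
    if len(a) != len(b):
        raise ValueError("models must have the same number of residues")
    pairs = list(combinations(range(len(a)), 2))
    squared = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
                  for i, j in pairs)
    return math.sqrt(squared / len(pairs))

def consensus_ranking(models):
    """Rank models by mean dRMSD to all other models (lowest, i.e. best agreement, first).

    models: dict mapping a model name to a list of (x, y, z) C-alpha coordinates.
    Returns (ranking, scores).
    """
    if len(models) < 2:
        raise ValueError("a consensus needs at least two models")
    scores = {}
    for name, coords in models.items():
        others = [drmsd(coords, c) for n, c in models.items() if n != name]
        scores[name] = sum(others) / len(others)
    return sorted(scores, key=scores.get), scores

# Toy example: two nearly identical extended models and one compact outlier.
models = {
    "model_A": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)],
    "model_B": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.5, 0.0)],
    "model_C": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)],
}

ranking, scores = consensus_ranking(models)
for name in ranking:
    print(f"{name}: mean dRMSD to other models = {scores[name]:.2f}")
```

A score of this kind measures agreement among the models, not agreement with the true structure, so it is only informative when the models were produced independently.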
We recommend that the Model Working Group establish guidelines for estimating model accuracy, with special emphasis on identifying criteria reflecting the suitability of models for specific biological applications. For this purpose, the Group should work closely with members of the experimental research community representing specific model application requirements. The Protein Model Portal should provide a technical platform to make validated tools for estimating model accuracy available to the users of the models; it should also establish a mechanism for a continuous evaluation and improvement of these tools.
At present, many members of the scientific community are unaware of the advances in molecular modeling, its limitations, and its applications. It is primarily the responsibility of modelers to educate the community about their area of research (e.g., in the form of scientific publications, presentations, collaborative projects, and web resources). However, the Workshop participants felt that molecular modeling is often not used to its full potential in biomedical research, and that the impact of structural biology in general could be increased by better education about the optimal use of existing modeling methods.
We recommend that the PSI Knowledgebase and its Protein Model Portal proactively solicit educational contributions from the modeling community in the form of reviews, tutorials or even open workshops, aimed at demonstrating applications and limitations of computational modeling methods.
The discussion at the Workshop explored how to maximize the impact of the Protein Model Portal (http://www.proteinmodelportal.org/) on the application of molecular models in biomedical research.
We recommend that the Portal provide unified access to molecular models and their annotations, and support the development of data standards to facilitate exchange of information and algorithms. The Portal should play an active role in facilitating discussions between developers of computational methods and their users, provide access to tools for estimating model accuracy, and promote their further development. Its user interface should allow a broad range of queries to the participating model databases as well as links to experimental data. Tools for estimating model errors and selecting the likely best model among the available models should be included. An interface to interactive services for modeling should be established. Mechanisms to notify users when a particular sequence is modeled (or experimental data becomes available) should be implemented. The Portal should work closely with the Knowledgebase to establish a series of online documents with community feedback to explain the value and limitations of protein structure models. Finally, the Portal should be as inclusive of all method developers and prediction methods as technically feasible.
The workshop on Applications of Protein Models in Biomedical Research and the PSI Knowledgebase Protein Model Portal were supported by the National Institutes of Health (P20 GM076222-02S1, Roland Dunbrack, PI; and U54 GM074958-04S2, Helen Berman, PI).