The Protein Structure Initiative (PSI) was established in 2000 by the National Institute of General Medical Sciences with the long-term goal of providing three-dimensional (3D) structural information for most proteins in nature. As advances in genomic sequencing, bioinformatics, homology modelling, and methods for rapid determination of 3D structures of proteins by X-ray crystallography and nuclear magnetic resonance (NMR) converged, it was proposed that our understanding of the biology of protein structure and evolution could be greatly advanced by ‘genomic-scale’ protein structure determination. Over the past 12 years, the PSI has evolved from a testing bed for new methods of sample and structure production to a core component of a wide range of biology programs.
The vision of the PSI is to make 3D protein structure information an integral part of biology research. Structural Genomics has the potential to transform biomedical research, creating a powerful new infrastructure capable of addressing some of the most challenging molecular problems of modern biology. Large-scale genome sequencing efforts have provided new insights into the richness and diversity of life, and the genomic bases of evolution and function. However, natural selection is largely driven by the physical properties of the 3D protein structure. A complete understanding of protein function and evolution thus requires information about both protein sequence and 3D structure [1,2]. The PSI has developed over three phases: the first phase (2000-2005) was a pilot phase to test feasibility and develop the methodology; the second phase aimed to solve large numbers of structures using insights from the first phase; and the third phase, PSI:Biology, aims to expand the role of 3D structure in biological research using advances from the first two phases. In this commentary, I summarize some of the achievements of the PSI and the vision for expanding these in the new National Institute of General Medical Sciences PSI:Biology program.
Several recent reviews have outlined the progress and achievements of the PSI program [3-9]. Ultimately, the success of the initiative will be determined by the scientific impact of the new technologies, reagents, and 3D structures released into the public domain, and the knowledge that is gained from these data. One operational metric of the program is a count of 3D structures of ‘distinct’ proteins (or domains), referred to as ‘Distinct Structures’, deposited into the Protein Data Bank (PDB). Two protein sequences are ‘distinct’ if they share < 98% sequence identity over the full length of the shorter sequence of the pair. For example, two crystal structures of the same protein bound to different ligands count as a single Distinct Structure, even though each provides uniquely valuable information. “Novel” structures are defined as those which have < 30% sequence identity with any structure in the PDB at the time of deposition. In the second phase of the PSI program (called PSI2, 2005-2010), investigators achieved their goal, set in 2005, of depositing more than 3,000 Distinct Structures into the PDB. Most of these were also Novel Structures, greatly expanding our knowledge of the relationship between protein sequences and 3D structure. Over the full ten years of the PSI program, investigators completed and deposited into the public domain more than 5,000 3D protein structures, including protein-ligand complexes and pairs of X-ray and NMR structures which, though not counted as “Distinct”, have important scientific value.
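The two thresholds defined above can be illustrated with a minimal sketch. It is not the PSI's actual counting pipeline. The function names are hypothetical, and the convention that sequences arrive already aligned (equal length, with '-' marking gaps) is an assumption made to keep the example self-contained. Only the 98% and 30% cutoffs, and the normalization by the shorter sequence, come from the text.

```python
from typing import Iterable

DISTINCT_CUTOFF = 98.0  # percent identity, as stated in the text
NOVEL_CUTOFF = 30.0     # percent identity, as stated in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the shorter sequence of an aligned pair.

    Assumption: both strings come from one pairwise alignment, so they
    have equal length and use '-' for gaps. The alignment step itself
    is outside the scope of this sketch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Count aligned positions where both residues are identical (gaps never match).
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    # Normalize by the ungapped length of the shorter sequence.
    shorter = min(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * matches / shorter


def is_distinct(identities_to_counted: Iterable[float]) -> bool:
    """True if identity to every already-counted structure is below 98%."""
    return all(pid < DISTINCT_CUTOFF for pid in identities_to_counted)


def is_novel(identities_to_pdb: Iterable[float]) -> bool:
    """True if identity to every PDB structure at deposition is below 30%."""
    return all(pid < NOVEL_CUTOFF for pid in identities_to_pdb)


# The same protein solved with a second ligand has 100% identity to the
# first deposit, so it does not add to the Distinct Structure count.
same_protein = percent_identity("MKTAYIAK", "MKTAYIAK")
second_ligand_counts = is_distinct([same_protein])
```

Under these assumptions, a structure can be Distinct without being Novel: a protein with 87.5% identity to an existing entry passes the 98% test but fails the 30% test.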
Many of the structures determined in the first and second phases of the PSI program were, at the time of deposition into the PDB, the first representatives from extensive protein domain families. A ‘protein domain family’ is a set of homologous protein domains likely to have similar structures and possibly similar biochemical functions. These included both domains with known biochemical functions and domains of unknown function, known as DUFs. These structures are being used as templates for modelling tens of thousands of homologous proteins [12-17] and provide a database of protein sequences, structures, and biophysical properties (e.g. chemical shifts) that also inform the fields of protein structure prediction, design, and engineering. By focusing the choice of targets on proteins that have minimal sequence similarity with known structures, PSI structures have greatly increased the size of the non-redundant protein structure knowledge base that is being used to develop improved structure prediction algorithms, including fragment-based search algorithms and knowledge-based atomic potentials. In some cases, these structural data are accompanied by extensive chemical shift, nuclear Overhauser effect (NOE), and other NMR data that are being used in hybrid structure determination methods.
The PSI has also become the primary contributor of structural data that can be used for testing new methods for protein structure prediction and automated data analysis, including data used in such projects as the Critical Assessment of Structure Prediction [19,20] and the Critical Assessment of Automated Structure Determination of Proteins from NMR data. Structural Genomics projects are unique in that they can provide their data for community-wide tests of computational methods without concern for how this affects their priorities for publishing a particular structure. PSI Centers are also involved in collaborative projects aimed at accelerating the field of protein NMR structure analysis [22-28] and computational protein design [29-31].
As part of the community-outreach goals of the PSI program, the National Institute of General Medical Sciences has created the PSI Structural Biology Knowledge Base (PSI-SBKB), for organizing and disseminating the entire repertoire of scientific information generated by the PSI program, and the PSI Materials Repository (PSI-MR), designed to provide easy, rapid, and broad access to the biochemical reagents produced by PSI Centers, particularly the protein expression systems. These resources serve as a platform for PSI-funded investigators to provide information on protein samples and 3D structures to the broad biological community in an “open source” fashion, in which intermediate results, protein expression systems and protocols, protein structures, and new technologies are made available to the community as soon as the data and/or methods are deemed to be reliable. The PSI-SBKB also provides access to 3D protein models generated using various comparative modeling methods, together with coordinates of structures solved by the PSI program.
The PSI program has also instituted a Community Nomination Target (CNT) program, through which scientists can nominate targets for study by PSI Centers and collaborate on functional follow-up studies (http://sbkb.org/cnt/). This program provides a unique method of connecting PSI investigators with important biological problems and top-tier biological investigators, and gives a wide range of collaborators who are not directly funded by the PSI access to PSI Centers. Several of these CNT projects have yielded important and challenging structures, enabling the research programs of individual investigators across the globe (see for example [34-42]).
The successful demonstration of the feasibility of ‘high-throughput structure production’ opens doors to a wide range of new opportunities for biological research that could not be considered without such infrastructure. The third phase of the PSI program, PSI:Biology, aims to expand the role of 3D structure in biological research by supporting several “high-throughput-enabled biology partnerships”, designed to leverage the protein sample and structure production horsepower of the PSI High-Throughput Production Centers in applications involving broad and/or challenging biological questions. Examples of project areas that are emphasized in the PSI:Biology program include the following:
These exciting applications provide a vision of the broad impact Structural Genomics platforms and technologies will have on biological and biomedical research. Examples of key areas that are being explored in the PSI:Biology program are outlined in the following three sections.
Structural Genomics provides 3D atomic-resolution structural information of large numbers of gene products (so far, primarily proteins) and, thus, lays the foundation required for systems biology. For example, having protein samples, affinity capture reagents (e.g. phage display antibodies), and complete 3D structural descriptions of the enzymes and protein-protein complexes associated with a specific biological process, such as epigenetically-regulated gene expression or protein translation, will open new avenues to model and understand such complex biological systems. Such a comprehensive view would also allow improved diagnosis and treatment of diseases.
Most proteins function by forming complexes. Indeed, some proteins are simply not folded in the absence of their macromolecular and/or small molecule partners. Genomes encode large numbers of natively disordered proteins or protein regions that are functionally important for protein-protein interactions, modulating binding affinities, and regulating signaling pathways [51-55]. This has been borne out by biophysical studies on thousands of proteins expressed and purified in the PSI program, demonstrating that a large portion of the eukaryotic proteome consists of intrinsically disordered proteins and/or protein regions. In recent years, research groups of the PSI have begun to address such disordered regions of proteins (or entire intrinsically disordered protein families), particularly those that become ordered upon complex formation. Important technological goals include development of high-throughput methods for protein co-expression, crystallization-enhancing chaperones generated by phage display methods [57,58], and various technologies for identifying, co-expressing, and forming complexes between proteins, including those that involve disorder-order transitions.
Membrane proteins remain a major challenge to structural biology. However, our understanding of biology will not be complete without extensive structural information on integral membrane proteins. Structural genomics pipelines, involving coordinated teams of scientists working together with shared resources and infrastructure (e.g. [59-61]), have the potential to make major breakthroughs in creating new technologies and protocols for determining 3D structures and dynamics of integral membrane proteins. Some important integral membrane protein structures, including several G protein-coupled receptors, known as GPCRs, and their complexes with ligands, have recently been determined in PSI-funded projects using both X-ray crystallography [62-71] and NMR spectroscopy [72,73]. In the PSI:Biology program, expanded support is provided for the technology development needed for membrane protein sample production, coordinated, project-wide structural analysis of human integral membrane proteins, and community-nominated studies of membrane protein structure and function.
One of the unique features of the PSI program is the extensive cooperation and synergy between potentially competing National Institutes of Health (NIH)-funded centers. For example, in the second phase of the PSI program, target selection of proteins for structure determination was coordinated by a BioInformatics Group committee that included members from each of the computational biology teams associated with the four Large Scale Centers. The successful collaboration of the different teams not only resulted in more rational and comprehensive target selection but also minimized duplication of effort, providing coordination among hundreds of scientists and synergies that would not be possible in smaller individual laboratory research projects.
The third phase of the PSI program, PSI:Biology, is an experiment in large-scale biological research. Preliminary progress using large-scale 3D protein structure production to enable biological research partnerships is very exciting. The program currently consists of four Centers for High-Throughput Structure Determination, nine Centers for Membrane Protein Structure Determination, 12 High-Throughput Enabled Biology Partners, and two Resource Centers (the PSI-SBKB and PSI-MR). In addition, several of these centers host extensive CNT projects, involving many research groups that are not directly funded by the PSI:Biology program. These CNT projects have the potential to create new and unexpected uses of protein structure, and to enable, with 3D protein structures, a wide range of biological and biochemical studies.
A challenge faced by the PSI:Biology program is to create strong cooperation and synergies among the 27 PSI:Biology centers, each of which is itself a multi-investigator team, as well as between the PSI:Biology centers and individual investigators associated with the CNT programs. Such an integrated program will be a unique engine for biological discovery. Issues may well arise within such a research network that require innovative thinking and creative management. A model for this integration is the cooperation and synergy achieved between the Large Scale Centers. Despite challenges that may present themselves in the early phases of the program, such a unique, integrated infrastructure for biological research has the potential to rapidly advance biology and biomedical science.
The PSI program provides a novel paradigm for biological science discovery. Rather than determining 3D structures as a means for testing specific hypotheses, the Structural Genomics approach aims to discover new science by analyzing the information provided by 3D structures, sometimes even before the biological significance of the protein is recognized. It has been a powerful and successful driving force for a wide range of method developments that could be realized only by collecting homogeneous, fully documented data across large numbers of protein samples and structures (see for example [10,21,26,74-81]). The PSI has also provided a test bed for “network biological science”, enabling discovery through cooperative interactions across a network of collaborating scientists. The unique multi-laboratory structure of the PSI “centers” also provides a model of how the internet can be used to integrate research activities across a network of real-time collaborations. This paradigm is making significant contributions to biology, utilizing the high-throughput platforms developed in the PSI program to enable biology with 3D structural information. Indeed, the concerted effort of the more than 500 scientists participating in the PSI:Biology program has the potential to revolutionize the utility and impact of protein 3D structure information for the broad biological community.
I thank Prof. T. Szyperski for insightful discussions and comments on the manuscript. This work was supported by the National Institute of General Medical Sciences Protein Structure Initiative (PSI:Biology) program, grant U54 GM094597.
The electronic version of this article is the complete one and can be found at: http://f1000.com/reports/b/4/7
The author is Principal Investigator of the Northeast Structural Genomics Consortium, one of several PSI projects funded by the NIH National Institute of General Medical Sciences.