3D cryo-electron microscopy reconstruction methods are uniquely able to reveal structures of many important macromolecules and macromolecular complexes. EMDataBank.org, a joint effort of the Protein Data Bank in Europe (PDBe), the Research Collaboratory for Structural Bioinformatics (RCSB), and the National Center for Macromolecular Imaging (NCMI), is a “one-stop shop” resource for global deposition and retrieval of cryoEM map, model and associated metadata. The resource unifies public access to the two major EM Structural Data archives: EM Data Bank (EMDB) and Protein Data Bank (PDB), and facilitates use of EM structural data of macromolecules and macromolecular complexes by the wider scientific community.
Structural biology of macromolecules has become an indispensable branch of molecular biology. Researchers use the results from structural studies to explain the functions and mechanisms of biological processes at the molecular level, leading to more targeted experiments to explore structure and function. Many key biological processes are carried out by large macromolecular complexes, including signal transduction, genome replication, transcription, translation, chaperonin-assisted protein folding, viral infection, and motility. It is becoming increasingly feasible to determine three-dimensional structures of these complexes in different functional or chemical states using cryo electron microscopy (cryoEM).
Specimens for cryoEM studies come in many forms and shapes, e.g. two- or three-dimensional crystals (Gonen et al., 2005; Henderson et al., 1990; Schmid et al., 2004), one-dimensional filaments or tubular crystals possessing helical symmetry (Unwin, 2005; Wang et al., 2006), and individual particles with or without symmetry (Gabashvili et al., 2000; Olson et al., 1990; Zhou et al., 2001). CryoEM is also being applied to large samples consisting of irregular ensembles of complexes and cells using tomographic reconstruction methods (Baumeister, 2004; Murphy and Jensen, 2007). Other chapters in this volume describe preparation and image reconstruction methods for the different specimen types including 2D crystals, helical arrays, single particles, and unique structures.
At present, cryoEM researchers are rapidly producing a large body of knowledge regarding the 3D structural arrangements of components within large macromolecular complexes, within subcellular assemblies, and even within whole cells, based on map volumes with resolution limits ranging from 80 Å to 2 Å. Interpretation varies according to the map resolution, available tools, and additional knowledge of the system and/or its components, and may involve segmentation, rigid body fitting of atomic coordinates determined using X-ray crystallography or NMR, or ab initio model building.
Public access to cryoEM map volumes and their associated fitted model interpretations permits independent assessment and interpretation of structural results and stimulates development of new tools for visualization, fitting, and validation. The EM Data Bank (EMDB) is the major repository for 3D map volumes solved using electron microscopy (Tagari et al., 2002), while the Protein Data Bank (PDB) collects atomic coordinates fitted into EM map volumes (Dutta et al., 2009). The Unified Data Resource for CryoEM (EMDataBank.org) was created in order to unify data deposition, processing and retrieval of maps and fitted models. This chapter provides an overview of the EM structural data archives and the unified resource, including historical context, current content and use, and future prospects.
The EMDB was established at the European Bioinformatics Institute (EBI) in Hinxton, UK, and began operations in 2002. It was initially supported by two European Union-funded projects, the Integration of Information about Macromolecular Structure project (IIMS) and the 3DEM Network of Excellence (3DEM NoE). An IIMS-sponsored workshop held in November 2002 focused on data exchange, harvesting, deposition issues, and presentation of EM data to non-specialists. Guidelines and release policies were set for the newly founded EMDB, and the workshop established the database as a resource for the international community, with an announcement published in Structure (Fuller, 2003), followed by an editorial in Nature Structural Biology (2003). The workshop concluded with a strong endorsement of EM map volume deposition and linkage of EMDB with other archival databases in biomedical research.
In close collaboration with IIMS project partners, leading European electron microscopy laboratories, and PDB partners, an initial data model was produced for electron microscopy derived maps. A web-based deposition system, EMDEP, was developed to handle data capture (Henrick et al., 2003). EMDEP validates data via an interactive depositor-driven operation, and it relies on the knowledge and expertise of the experimenters for the complete and accurate description of the structural experiment and its results. The captured metadata (e.g., sample description, specimen preparation, imaging, reconstruction, and fitting details) are stored in a “header” file, and the deposited map is converted to a common format for redistribution. A database query tool, EMSEARCH, was also designed and implemented to enable web-based searches.
By the time the IIMS project was completed in December 2003, the EMDB had become an operational public database with 65 map volumes deposited by major EM laboratories in Europe and the USA. At this time the PDB began to see a significant increase in EM-related coordinate depositions, in many cases models that were fitted into maps deposited to EMDB.
The PDB archive was established in 1971 as a public repository for X-ray crystal structures of biological macromolecules (Bernstein et al., 1977), and is presently maintained by the Worldwide PDB (wwPDB; Berman et al., 2003), a global consortium consisting of the RCSB PDB (www.pdb.org), the Protein Data Bank in Europe (PDBe) at EBI (www.ebi.ac.uk/pdbe), the Protein Data Bank Japan (PDBj, www.pdbj.org), and the Biological Magnetic Resonance Bank (bmrb.wisc.edu). The number of structures in PDB has grown from the initial seven to over 65,000 entries. Over time, the PDB began to collect coordinates of structures determined by methods other than X-ray crystallography. In the 1980s, coordinates and restraint data determined from NMR methods began to be included in the PDB, and these now represent about 15% of the archive. In the 1990s, model coordinates for structures determined using EM began to be archived, beginning with models for bacteriorhodopsin (Henderson et al., 1990) and the RecA hexamer (Yu and Egelman, 1997). EM structures currently account for less than 0.5% of all PDB entries, but the rate of deposition is increasing more rapidly than for any other experimental method.
Two workshops held in 2004 invited the EM community to participate in development of an improved data model for describing cryoEM experiments, and also set in motion efforts to unify deposition and access to EM maps and models.
The 2004 3DEM NoE workshop at EBI reviewed tools and software practices used in the field of cryoEM, examined data items and data models required to fully describe EM experiments, and allocated tasks to different groups to develop the required standards. The workshop also defined goals for further development of the database including providing archiving capabilities for cryo-electron tomography, providing cross-referencing between EMDB maps and PDB coordinate models and converting common map formats in a lossless manner.
The second 2004 workshop, held at the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers, USA, and co-sponsored by the National Center for Macromolecular Imaging (NCMI), aimed to develop a global community consensus on data items needed for deposition of 3D map volumes and fitted atomic models derived from cryoEM studies. In addition to discussing desired improvements in the areas of visualization, data mining and data integration, workshop attendees unanimously recommended developing a “one-stop shop” for deposition of EM map and model data in order to eliminate the duplication of effort involved in creating separate depositions to EMDB and PDB.
Based on recommendations gathered at the 2004 workshops, a revised and expanded EM dictionary was created in a three-way collaboration between EBI, RCSB, and NCMI with broad community input and was presented in 2005 at the 3DEM Gordon Research Conference in New Hampshire as well as at a 3D-EM Developers workshop held in the UK. Recommended conventions for exchange of cryoEM data were published (Heymann et al., 2005), and a standards task force was created to gather information on the different cryoEM map and image conventions and formats to facilitate conversion. In June 2005, a notice posted to the 3DEM community email bulletin board (3dem.ucsd.edu) announced the intention of EBI and RCSB to collaborate on further development of 3DEM database services.
Following the recommendations of the EM community to create a one-stop shop for EM, the Unified Data Resource for CryoEM (EMDataBank.org) was established in 2007 with funding from the National Institutes of Health/NIGMS as a joint effort of EBI, RCSB, and NCMI. The resource is creating a global deposition and retrieval network for cryoEM map, model and associated metadata, as well as a portal for software tools for standardized map format conversion; map, segmentation, and model assessment; visualization; and data integration. The first goal of this three-way collaboration was completed in early 2008 when RCSB joined EBI as a second EMDB deposition and retrieval site. Joint EMDB (map) and PDB (model) deposition systems were developed and put into operation at both EBI and RCSB in early 2009, and web-based 3D visualization tools have been integrated into EMDB atlas pages. Efforts to improve uniformity and usability of the EM structural data in both EMDB and PDB databases are ongoing, and additional services for data harvesting and evaluation are planned.
The EMDB currently holds more than 800 map volume entries while PDB holds more than 300 entries of coordinates fitted into EM map volumes (Figure 1). Map volume and fitted model deposition rates are on the rise, currently ~150 and ~40 per year, respectively, with roughly 40% of all published EM structures being captured in the databases. As the importance of EM-derived structural information continues to increase, it is anticipated that more journals and funding agencies will require deposition.
Each EMDB entry holds a single map volume plus associated experimental metadata; each associated PDB entry holds the fitted coordinate models and associated experimental metadata plus primary sequence information for each polymer. The metadata information that is shared by both databases is automatically transferred during the joint deposition process and includes:
The correspondences between maps and associated fitted coordinate models are maintained in both archives. EMDB entries can optionally hold associated masks, structure factors, and/or layerline data; PDB entries can also hold structure factors. The underlying dictionaries for the two databases have direct translations and are regularly being updated to reflect changes in experimental apparatus and methods. For readers interested in depositing EM structural data, some guidelines are provided in the last section of this chapter.
The map archive includes several different types of maps generated by electron microscopy imaging (Figure 2). The majority of entries are single particle reconstructions (81%), which represent ensemble averages of thousands of individual imaged particles, often with additional symmetry averaging. The largest class of single particle specimens represented are the viruses (20% of all holdings), the majority of which are icosahedrally averaged (Chiu and Rixon, 2002; Huiskonen and Butcher, 2007; Lee and Johnson, 2003). Virus entries typically represent distinct states of maturation, or complexes with antibodies or receptors.
The second largest class of single particle specimens represented are the ribosomes (15%). Ribosome entries define key structural conformations encountered in translating messenger RNA into protein, or elucidate structural variations across diverse species (Frank, 2009; Mitra and Frank, 2006). Other single particle specimens represented include macromolecular machines involved in protein folding, protein degradation, energy metabolism, cell cycle processes, DNA replication, DNA repair, RNA transcription, and RNA splicing.
The map archive also holds densities for 2D crystals and for helical arrays, including intracellular filaments and microtubules, flagella, and helical crystals. There are also several tomographic maps of unique structures as well as maps that represent 3D averages of aligned tomograms. Diverse specimens currently held of this type include flagellar motors, insect flight muscle tissue, and desmosomes. A gallery of representative map volumes is presented in Figure 3.
Coordinates of EM entries are obtained using a variety of modelling methods including manual docking, rigid body fitting, homology modelling, and computational refinement algorithms. EM entries in the PDB are classified either under electron microscopy or electron crystallography as the experimental method. For structures with regular point or helical symmetry, coordinates are given for the asymmetric unit along with a set of transformation matrices to build the biological assembly (Lawson et al., 2008).
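The way a biological assembly is generated from an asymmetric unit and a set of transformation matrices can be illustrated with a minimal sketch. The coordinates and the twofold operator below are made-up placeholders rather than values from any PDB entry, and the function names are hypothetical; each operator is assumed to be a 3x3 rotation matrix plus a translation vector.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Matrix3 = Sequence[Sequence[float]]
Operator = Tuple[Matrix3, Vec3]


def apply_operator(rotation: Matrix3, translation: Vec3, atom: Vec3) -> Vec3:
    """Return rotation * atom + translation for a single atom position."""
    return tuple(
        sum(rotation[i][j] * atom[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def build_assembly(
    asymmetric_unit: Sequence[Vec3], operators: Sequence[Operator]
) -> List[List[Vec3]]:
    """Apply every operator to every atom; one copy of the unit per operator."""
    return [
        [apply_operator(rot, trans, atom) for atom in asymmetric_unit]
        for rot, trans in operators
    ]


# Hypothetical two-atom asymmetric unit (coordinates in angstroms).
asymmetric_unit = [(10.0, 0.0, 0.0), (12.0, 1.0, 3.0)]

# Made-up point symmetry: identity plus a 180-degree rotation about z.
identity = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
twofold_z = ((-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, 1.0))
no_shift = (0.0, 0.0, 0.0)
operators = [(identity, no_shift), (twofold_z, no_shift)]

assembly = build_assembly(asymmetric_unit, operators)
for copy_index, copy in enumerate(assembly):
    print(copy_index, copy)
```

An icosahedral virus would use the same loop with 60 operators, and a helical assembly would use operators that combine a rotation about the helix axis with a translation along it.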
Access to EM structural data and related services is through the EMDataBank.org web site. The EMDB and PDB archives are updated weekly on Wednesdays at 00:00 GMT. EMDB is distributed on two ftp mirrors supported by the two EMDataBank.org distribution partners in the UK and the USA, while PDB is distributed on ftp mirrors supported by each of the wwPDB partners.
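For scripts that mirror the archives, it can be useful to know when the next weekly release is due. The following is a small sketch under two assumptions: GMT is treated as UTC, and the schedule is exactly the one stated above. The function name is hypothetical and is not part of any EMDataBank.org service.

```python
from datetime import datetime, timedelta, timezone

WEDNESDAY = 2  # datetime.weekday() counts Monday as 0


def next_weekly_update(now: datetime) -> datetime:
    """Return the first Wednesday 00:00 UTC strictly after `now`.

    `now` should be timezone-aware; a naive value is interpreted as local time.
    """
    now_utc = now.astimezone(timezone.utc)
    midnight = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    days_ahead = (WEDNESDAY - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now_utc:
        # Today is Wednesday and this week's update time has already passed.
        candidate += timedelta(days=7)
    return candidate


print(next_weekly_update(datetime.now(timezone.utc)).isoformat())
```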
The EMSEARCH web service is also maintained and updated weekly at both distribution sites. EMSEARCH enables browsing and searching of EMDB metadata uploaded into a relational database. Simple searches can be performed based on author name, title, sample name, citation abstract word, aggregation type, resolution, and release date range. Search summaries link to atlas pages for each entry, which include summary, visualization, sample, experiment, processing, map information, and download pages (Figure 4).
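The kind of simple search described here amounts to combining optional filters over entry metadata. The sketch below shows that logic on an in-memory list; it does not reproduce the EMSEARCH implementation or its database schema. The `MapEntry` fields, the `search` function, and the three records (including their accession strings) are hypothetical placeholders, not real EMDB entries.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class MapEntry:
    accession: str
    title: str
    authors: List[str]
    aggregation_type: str
    resolution_angstrom: float
    release_date: date


def search(
    entries: Iterable[MapEntry],
    *,
    author: Optional[str] = None,
    title_word: Optional[str] = None,
    aggregation_type: Optional[str] = None,
    max_resolution: Optional[float] = None,
    released_from: Optional[date] = None,
    released_to: Optional[date] = None,
) -> List[MapEntry]:
    """Return entries matching every criterion that was supplied."""
    results = []
    for entry in entries:
        if author and not any(author.lower() in a.lower() for a in entry.authors):
            continue
        if title_word and title_word.lower() not in entry.title.lower():
            continue
        if aggregation_type and entry.aggregation_type != aggregation_type:
            continue
        # A smaller resolution value in angstroms means a more detailed map.
        if max_resolution is not None and entry.resolution_angstrom > max_resolution:
            continue
        if released_from and entry.release_date < released_from:
            continue
        if released_to and entry.release_date > released_to:
            continue
        results.append(entry)
    return results


# Placeholder records for illustration only.
entries = [
    MapEntry("EXAMPLE-0001", "Placeholder icosahedral virus capsid",
             ["Doe J", "Roe R"], "icosahedral", 9.5, date(2008, 3, 12)),
    MapEntry("EXAMPLE-0002", "Placeholder ribosome complex",
             ["Roe R"], "single particle", 12.0, date(2009, 7, 1)),
    MapEntry("EXAMPLE-0003", "Placeholder helical filament",
             ["Doe J"], "helical", 7.0, date(2009, 11, 18)),
]

for hit in search(entries, author="doe", max_resolution=10.0):
    print(hit.accession, hit.title)
```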
The atlas visualization page provides several ways to view EM maps. In addition to a static 2D image provided by the deposition author, it is also possible to launch two different Java-based 3D viewers (Figure 5, left and right top panels). EMViewer, developed by Powei Feng and Joe Warren at Rice University in collaboration with NCMI, provides a simple, single isosurface representation of a map at a predetermined contour level. Clicking on the “Launch EMViewer” button will bring up a simple 3D representation of the map that can be manipulated by mouse drag and click actions. AstexViewer, a molecular graphics program originally developed to display crystallographic data (Hartshorn, 2002) was recently re-released under an open source license and has been adapted by EMDataBank.org for display of EM maps and associated PDB coordinate models. To improve web download speed and minimize memory requirements, a compact map format is used (BRIX), and larger maps are also down-sampled by a factor of 2-5. Current capabilities include ability to control map contour level, opacity, color, solid vs. mesh surface rendering, and concurrent display of a PDB coordinate entry. Viewing large maps may require increasing Java Applet Runtime memory allocation.
Map volumes are distributed in standard CCP4/MRC format and can be viewed with locally installed software such as UCSF Chimera (Pettersen et al., 2004), PyMOL (Delano, 2002), VMD (Hsin et al., 2008), Coot (Emsley and Cowtan, 2004), or other graphics programs, enabling investigation with a more extensive set of tools. Links for map download are available on atlas download and map information pages. Recent distributions of UCSF Chimera enable direct downloads of EMDB maps plus associated fitted PDB models via simple queries (e.g. author name, title) to the EMSEARCH relational database through the EMDataBank.org beta-web service, which is a self-contained programming interface based on the SOAP protocol (Figure 5, bottom panel).
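For readers who want to inspect a downloaded map without a graphics program, the sketch below decodes a few fields from the 1024-byte main header of a CCP4/MRC file using only the Python standard library. It assumes the MRC2000-style layout (grid dimensions at byte 0, sampling at byte 28, cell lengths at byte 40, density statistics at byte 76, the `MAP ` identifier at byte 208, and the machine stamp at byte 212) and rejects older files that lack the identifier. The function names are hypothetical, and the demonstration header is synthetic rather than taken from an EMDB entry.

```python
import struct
from typing import Dict


def parse_mrc_header(header: bytes) -> Dict[str, object]:
    """Decode selected fields from a CCP4/MRC main header."""
    if len(header) < 1024:
        raise ValueError("CCP4/MRC main header is 1024 bytes long")
    if header[208:212] != b"MAP ":
        raise ValueError("missing 'MAP ' identifier at byte offset 208")

    # Machine stamp: first byte 0x11 marks big-endian, 0x44 little-endian.
    endian = ">" if header[212] == 0x11 else "<"

    nx, ny, nz, mode = struct.unpack_from(endian + "4i", header, 0)
    mx, my, mz = struct.unpack_from(endian + "3i", header, 28)
    cell_a, cell_b, cell_c = struct.unpack_from(endian + "3f", header, 40)
    dmin, dmax, dmean = struct.unpack_from(endian + "3f", header, 76)

    return {
        "grid": (nx, ny, nz),
        "mode": mode,  # 2 means 32-bit floating point densities
        "sampling": (mx, my, mz),
        "cell_angstrom": (cell_a, cell_b, cell_c),
        # Voxel size along x is the cell length divided by the sampling.
        "voxel_size_x": cell_a / mx if mx > 0 else None,
        "density_min_max_mean": (dmin, dmax, dmean),
    }


def make_demo_header() -> bytes:
    """Build a synthetic little-endian header so the parser can be tried."""
    header = bytearray(1024)
    struct.pack_into("<4i", header, 0, 64, 64, 64, 2)
    struct.pack_into("<3i", header, 28, 64, 64, 64)
    struct.pack_into("<6f", header, 40, 128.0, 128.0, 128.0, 90.0, 90.0, 90.0)
    header[208:212] = b"MAP "
    header[212:216] = bytes([0x44, 0x41, 0x00, 0x00])
    return bytes(header)


info = parse_mrc_header(make_demo_header())
print(info["grid"], info["voxel_size_x"])
```

To apply this to a real file, read the first 1024 bytes in binary mode, for example `open(path, "rb").read(1024)`, and pass them to `parse_mrc_header`.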
EM Navigator at PDBj (emnavi.protein.osaka-u.ac.jp) is an additional resource for browsing, viewing, and downloading maps and fitted coordinate models from the EMDB and PDB databases. Each of the wwPDB partners (wwpdb.org) has web interfaces to database representations of the PDB archive with advanced searching and browsing capabilities. ViperDB (Natarajan et al., 2005), a database specifically for icosahedral viruses, also holds some EM maps and related coordinates.
A major goal of the EMDataBank.org unified data resource is to archive EM-derived structural information in a way that will enable further research. The impact of availability of structural data for smaller proteins has already been amply demonstrated by the success of the PDB, which is accessed globally by thousands of individuals every day. Listed here are a few examples of how archived EM data facilitates subsequent scientific exploration by other investigators.
The availability of a coordinate model representation of a macromolecular complex leads directly to a dramatic increase in our fundamental understanding of biological function. Atomic coordinates permit 3D mapping of a wide variety of data including sites of mutations leading to disease, modification sites, antibody recognition sites, amino-acid sequence variability, and electrostatic properties. But in many cases, high-resolution structural models are not available for every component within a cryoEM map at the time it is first interpreted. By preserving map and model information together in a freely accessible database, new structural information can be incorporated as it becomes available. Examples include (1) reinterpretation of ribosome stalk regions in multiple EMDB-archived maps after the crystal structure of the L7/L12 complex was determined (Diaconu et al., 2005), and (2) progression towards a complete fitted coordinate model for bacteriophage T4 EM reconstructions based on fitting of components as they have become available (Aksyuk et al., 2009a; Aksyuk et al., 2009b).
Structural knowledge can be crucial for interpreting biochemical and biophysical data. For instance, dengue and West Nile virus cryoEM structures (Kuhn et al., 2002; Mukhopadhyay et al., 2003) have led the way to mapping sites of glycosylation (Hanna et al., 2005), amino-acid sequence variability (Modis et al., 2005), neutralizing antibody binding (Nybakken et al., 2005), design of vaccines (Ledizet et al., 2005), and a structural basis for understanding membrane fusion (Modis et al., 2004). Recently, multiple EMDB map entries were examined in light of biochemical data designed to distinguish between two distinct models for the hexameric subunit arrangement of two closely related disaggregating proteins, ClpB and Hsp104 (Wendler and Saibil, 2010).
CryoEM maps can be used to initiate crystallographic phasing. Isomorphous replacement phasing methods that are routinely used in X-ray crystallography are technically difficult to apply to large macromolecular complexes. A low-resolution cryoEM map of the complex under study thus provides a valuable complementary source of phase information. For high-symmetry structures, such as icosahedral viruses, application of robust computational averaging and extension algorithms to initial cryoEM-derived phases is often sufficient to complete the structure determination without additional phase information. Examples of crystal structures phased with cryoEM envelopes include proteases (Bosch et al., 2001; Wang et al., 1998), ribosomes (Ban et al., 1998; Cate et al., 1999; Thygesen et al., 1996), and icosahedral viruses (Dokland et al., 1997; Dokland et al., 1998; Grimes et al., 1998; Helgstrand et al., 2003; Prasad et al., 1999; Reinisch et al., 2000; Wynne et al., 1999). Additional case studies are described by Xiong (2008), and use of a negative stain EM reconstruction for phasing is described by Trapani et al. (2010). Routine archiving of EM-derived maps facilitates the use of this method by the crystallography community.
By capturing the coordinates, maps, and metadata and making them available through searchable databases, it is possible to easily compare structures, select structures for further analysis using a variety of criteria, perform experimental design for new analyses, and design new algorithms to improve the state of the art methodology. For example, observation of publicly available maps and coordinates led to the conclusion that the double-stranded DNA tailed phages and herpesvirus have a similar fold in their major capsid proteins though there is little sequence similarity among them (Baker et al., 2005). In a second example, a flexible fitting approach applied to 43 EMDB map volumes of bacterial 70S ribosome in various functional states revealed global conformational differences between the EM structures involving large-scale ratchet-like deformations (Matsumoto and Ishida, 2009).
Availability of EM map volumes in the EMDB facilitates development and validation of software for map viewing, analysis, manipulation, coordinate model fitting and validation. Development of algorithms for fitting of EM maps with atomic coordinates is a particularly active area. For low to medium-resolution studies in which the source coordinates are likely to be derived from crystallographic or NMR studies, algorithms being explored include rigid-body fitting (Wriggers, 2010), normal mode analysis (Tama et al., 2004), spatial interpolation (Rusu et al., 2008), conformational sampling under low-resolution restraints (Schroder et al., 2007), molecular dynamics flexible fitting (Trabuco et al., 2009), and simulated annealing approaches (Tan et al., 2008; Topf et al., 2008). For medium to high-resolution studies where de novo model building becomes possible, methods are being developed for skeletonization and secondary structure element detection (Baker et al., 2007) and incorporation of structure prediction from primary sequence (DiMaio et al., 2009). Additional examples of algorithm development facilitated by public availability of EMDB maps include investigations of map denoising (Jiang et al., 2003), B-factor sharpening (Fernandez et al., 2008), map resolution determination (Sousa and Grigorieff, 2007), and automated segmentation (Baker et al., 2006; Pintilie et al., 2010). The UCSF Chimera team has made particularly effective use of the EMDB resource for development and testing of a large, versatile set of tools for manipulating volume data (Goddard et al., 2007; Pettersen et al., 2004).
3D cryo-electron microscopy reconstruction methods are uniquely able to reveal structural aspects of many important macromolecules and macromolecular complexes, and for this reason the field is in a period of rapid expansion and development. Based on current growth of EM entries and publication of EM structures, the total number of structures of large biological assemblies contributed by EM is anticipated to approach 10,000 by the year 2020.
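A quick calculation shows what such a projection implies about growth. The sketch below assumes a starting point of roughly 800 map entries (the EMDB holdings quoted earlier), a ten-year horizon, and the current deposition rate of about 150 maps per year; the choice of starting count and horizon is an assumption made for illustration, not a figure stated by the projection itself.

```python
def implied_annual_growth(start_count: float, target_count: float, years: int) -> float:
    """Compound annual growth rate needed to go from start to target."""
    return (target_count / start_count) ** (1.0 / years) - 1.0


def linear_projection(start_count: float, per_year: float, years: int) -> float:
    """Holdings after `years` if the deposition rate stayed constant."""
    return start_count + per_year * years


START_MAPS = 800      # assumed starting holdings
TARGET = 10_000       # projected total
YEARS = 10            # assumed horizon
CURRENT_RATE = 150    # maps per year, as quoted earlier in the chapter

rate = implied_annual_growth(START_MAPS, TARGET, YEARS)
flat = linear_projection(START_MAPS, CURRENT_RATE, YEARS)

print(f"implied compound growth: {rate:.1%} per year")
print(f"holdings after {YEARS} years at a constant rate: {flat:.0f}")
```

Under these assumptions the target corresponds to compound growth of roughly 29% per year, whereas a constant deposition rate would yield only about 2,300 entries, so the projection presumes that deposition rates keep accelerating.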
In the near future, deposition and archiving of EM structural data will be integrated in a common tool that is being developed by the wwPDB partners to handle depositions from all structural biology methods. In addition, as the field matures, validation tools and criteria for assessment of map and fitted coordinate models will play an important role in providing guidance to users of cryoEM derived structural data. To this end, a validation task force is being assembled by the unified data resource partners (NCMI, RCSB and EBI) to develop recommendations as to how best to assess the quality of both maps and models that have been obtained from cryoEM data. The recommendations will form the basis for a validation suite that will be used by EMDB and PDB.
This section provides an overview of joint map + fitted coordinate model deposition to the EM Data Bank (EMDB) and Protein Data Bank (PDB). To prepare for deposition, the following items should be available:
EM structural data can be submitted to deposition sites at PDBe (UK) or RCSB-PDB (USA). Go to emdatabank.org/deposit.html to select the deposition site. To initiate an EMDEP session, click on “Start Session.” Select “new deposition”, or select “based on previous submission” if the experiment is related to a prior deposition.
The map deposition is created page by page. When the relevant information on the page is entered and saved, the symbol for the page on the left hand menu changes from a red arrow to a green circle. Before submission, it is possible to go back and change answers on completed pages. Help text is available for every data item by clicking on the item.
Many pages contain sections for entering metadata information about the experiment that can be duplicated. For instance, one “sample component” section should be completed for each unique component in an assembly; a Fab:virus complex therefore requires two sections, one for the Fab and a second for the virus. Multiple imaging sessions/microscopes can also be defined.
Atomic coordinates from existing PDB entries (e.g., X-ray or NMR structures) used in fitting of the map are specified by the depositor on the “fitting” page, which can also be duplicated as needed. In contrast, fitted atomic coordinates representing the depositor's molecular interpretation of the EM map are deposited to the PDB following the map deposition (see below), and the resulting PDB ID is associated with the map entry by the database curator.
After completing map deposition using EMDEP at either site, a link will appear on the EMDEP left hand menu to initiate deposition of one or more models to PDB with automatic transfer of experimental metadata (sample description, microscope type, etc.). At the PDBe site, the link opens an AutoDep session; at the RCSB-PDB site, the link opens an EM-Adit session. Sequence information must be provided for each protein or nucleic acid entity and should include the entire sequence of the imaged material, including any mutations or expression tags.
After completion of map and model depositions, accession IDs are assigned by the EMDB and PDB, respectively. The accession IDs are associated with each other by the two databases. All assigned IDs should be included in the primary publication describing the EM structure.
Many current and past members of the EMDataBank.org team have made significant contributions to the development of the Unified Data Resource for CryoEM including Kim Henrick, Wah Chiu, Helen Berman, Gerard Kleywegt, Richard Newman, John Westbrook, Glen van Ginkel, Batsal Devkota, Matt Baker, Tom Oldfield, Christoph Best, Gaurav Sahni, Raul Sala, Chunxiao Bi, Powei Feng, Joe Warren, Matt Dougherty, Steve Ludtke, and Ian Rees. The Resource is funded by National Institutes of Health GM079429 to Baylor College of Medicine, Rutgers University, and the European Bioinformatics Institute.
Book Chapter for Methods in Enzymology, Volume on CryoEM, Grant Jensen, editor