Cryo-electron microscopy reconstruction methods are uniquely able to reveal structures of many important macromolecules and macromolecular complexes. EMDataBank.org, a joint effort of the Protein Data Bank in Europe (PDBe), the Research Collaboratory for Structural Bioinformatics (RCSB) and the National Center for Macromolecular Imaging (NCMI), is a global ‘one-stop shop’ resource for deposition and retrieval of cryoEM maps, models and associated metadata. The resource unifies public access to the two major archives containing EM-based structural data: EM Data Bank (EMDB) and Protein Data Bank (PDB), and facilitates use of EM structural data of macromolecules and macromolecular complexes by the wider scientific community.
Cryo-electron microscopy (cryoEM) has become an essential technique in structural biology, bridging the gap between cell biology, X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy (1,2). CryoEM reconstruction methods are being used to determine structures of large macromolecules, macromolecular complexes and cell components involved in many key biological processes including signal transduction, genome replication, transcription, translation, chaperonin-assisted protein folding, viral infection and motility. Three-dimensional (3D) density maps derived from cryoEM experiments reveal overall molecular shape and may be further interpreted through segmentation algorithms, rigid-body fitting of atomic coordinates determined using X-ray crystallography or NMR, and/or ab initio model building, depending on map resolution (3–5).
Public access to cryoEM map volumes and their fitted model interpretations permits independent assessment and analysis of structural results and stimulates development of new tools for visualization, fitting and validation. The EM Data Bank (EMDB) is the major repository for 3D map volumes obtained using electron microscopy (6), while the Protein Data Bank (PDB) collects atomic coordinates fitted into EM map volumes (7). The Unified Data Resource for CryoEM (http://EMDataBank.org) was created in order to unify data deposition, processing and retrieval of maps and fitted models. Here, we provide an overview of the EM structural data archives and the unified resource, including their historical context, current content and use, and future prospects.
The EMDB was established in 2002 by the Macromolecular Structure Database group (now PDBe) at the European Bioinformatics Institute (EBI) in Hinxton, UK (6,8,9), and was initially supported by two European Union-funded projects: Integration of Information about Macromolecular Structure (IIMS) and the 3D Electron Microscopy Network of Excellence (3DEM NoE). A web-based deposition system, EMDEP, was developed to handle data capture (10). EMDEP validates data via an interactive depositor-driven operation, relying on the knowledge and expertise of the experimenters for the complete and accurate description of the structural experiment and its results. The captured metadata (e.g. sample description, specimen preparation, imaging, reconstruction and fitting details) are stored in an XML-style ‘header’ file, and the deposited map is converted to a common format for redistribution. A database query tool, EMSEARCH, was also designed and implemented to enable web-based searches. By December 2003, the EMDB was an operational public database with 65 maps deposited by major EM laboratories from Europe and the USA. At this time the PDB began to see an increase in EM-related coordinate depositions, in many cases models fitted into maps deposited to EMDB.
The PDB archive was established in 1971 as a public repository for X-ray crystal structures of biological macromolecules (11), and is presently maintained by the Worldwide Protein Data Bank [wwPDB (12)]. The number of structures in the PDB has grown from the initial 7 to over 67000 entries. Over time, the PDB began to collect coordinates of structures determined by methods other than X-ray crystallography, including NMR spectroscopy, neutron diffraction, fiber diffraction, electron crystallography, electron microscopy and solution scattering. Coordinates for structures determined using EM began to be archived in the 1990s, beginning with models for bacteriorhodopsin (13) and the RecA hexamer (14). Currently, the deposition rate for EM entries in the PDB is increasing more rapidly than for any other experimental method.
Two workshops held in 2004 (3DEM NoE workshop at EBI and CryoEM Structure Deposition Workshop at RCSB-PDB, co-sponsored by NCMI) invited the EM community to participate in development of an improved data model for describing cryoEM experiments, and also set in motion efforts to unify deposition and access to EM-derived maps and models. Following the workshops, a revised and expanded EM dictionary handling both map and model metadata was created in a three-way collaboration between PDBe, RCSB and NCMI with broad community input and was presented at the 2005 3DEM Gordon Research Conference in New Hampshire as well as a 3DEM Developers workshop held in the UK. During this period, Heymann et al. (15) also published recommended conventions for exchange of cryoEM data, and a standards task force was created to gather information on the different cryoEM map and image conventions and formats to facilitate conversion.
Following the recommendations of the EM community to create a ‘one-stop shop for EM,’ the Unified Data Resource for CryoEM (EMDataBank.org) was established in 2007 with funding from the National Institutes of Health/NIGMS as a joint effort of PDBe, RCSB and NCMI. The mission of the resource is to build up a global deposition and retrieval network for cryoEM maps, models and associated metadata, as well as a portal for software tools for standardized map format conversion; map, segmentation and model assessment; visualization; and data integration. The first goal of this collaboration was achieved in early 2008 when RCSB joined PDBe as a second map deposition and retrieval site. Joint EMDB (map) and PDB (model) deposition systems were developed and put into operation at both PDBe and RCSB in early 2009, and web-based 3D visualization tools have been integrated into EMDB atlas pages. Efforts to improve uniformity and usability of the EM structural data in both EMDB and PDB databases are ongoing; we recently completed remediation of voxel sizes and density statistics stored in all EMDB map files to ensure display at the correct physical scale and with a reasonable density contour level. Additional services for data harvesting and evaluation are planned.
The EMDB currently holds more than 800 map entries with resolution limits ranging from 80 to 2Å, while PDB holds more than 300 entries of coordinates fitted into EM map volumes (Figure 1). Map volume and fitted model deposition rates in 2008–09 were ~150 and ~40 per year, respectively, with roughly 40% of all published EM structures being captured in the databases. As the importance of EM-derived structural information continues to increase, it is anticipated that more journals and funding agencies will require deposition to structural databases.
Each EMDB entry holds a single map plus associated experimental metadata; each PDB entry holds a fitted coordinate model and associated experimental metadata plus primary sequence information for each polymer. The metadata information that is shared by both databases is automatically transferred during the joint deposition process and includes:
Correspondences between maps and associated fitted coordinate models are maintained within both archives. EMDB entries can optionally hold associated masks, structure factors and/or layerline data; PDB entries can also hold structure factors. The underlying dictionaries for the two databases have direct translations and are regularly updated to reflect changes in experimental apparatus and methods.
The map archive includes maps generated by a number of different electron microscopy reconstruction methods (Figure 2). The majority of entries (83%) are single particle reconstructions, which represent ensemble averages of thousands of individual imaged particles, often with additional symmetry averaging. The largest class of single particle specimens represented is the viruses (20% of all holdings), the majority of which are icosahedrally averaged. Virus entries typically represent distinct states of maturation, or complexes with antibodies or receptors. The second largest class of single particle specimens represented is the ribosomes (15%). Ribosome entries define key structural conformations encountered in translating messenger RNA into protein, or elucidate structural variations across diverse species. Other single particle specimens represented include macromolecular machines involved in protein folding, protein degradation, energy metabolism, cell cycle processes, DNA replication, DNA repair, RNA transcription, RNA splicing and ion channels.
The map archive also holds densities for 2D crystals and for helical arrays, including intracellular filaments and microtubules, flagella and helical crystals. There are also several tomographic maps of unique structures as well as maps that represent 3D averages of aligned tomograms. Diverse specimens currently held of this type include flagellar motors, insect flight muscle tissue, desmosomes and viruses.
EM-derived coordinates are obtained using a variety of modeling methods including manual docking, rigid-body fitting, homology modeling, de novo modeling and computational refinement algorithms. EM entries in the PDB are classified either under electron microscopy or electron crystallography as the experimental method. For structures with regular point or helical symmetry, coordinates are given for the asymmetric unit along with a set of transformation matrices to build the biological assembly (16).
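As an illustration of how a biological assembly can be generated from the asymmetric unit and its transformation matrices, the following Python sketch applies a set of 3×4 operators (a rotation followed by a translation) to a list of coordinates. The function names and the four-fold cyclic example are hypothetical and chosen only for illustration; deposited entries supply their own matrices.

```python
from math import cos, sin, pi


def apply_operator(matrix, coords):
    """Apply one 3x4 operator [R | t] to a list of (x, y, z) points."""
    transformed = []
    for x, y, z in coords:
        transformed.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + row[3]
            for row in matrix
        ))
    return transformed


def build_assembly(operators, asymmetric_unit):
    """Return one transformed copy of the asymmetric unit per operator."""
    return [apply_operator(op, asymmetric_unit) for op in operators]


def cyclic_operators(n):
    """Hypothetical example set: n-fold rotation about the z axis, no translation."""
    operators = []
    for k in range(n):
        angle = 2.0 * pi * k / n
        operators.append([
            [cos(angle), -sin(angle), 0.0, 0.0],
            [sin(angle), cos(angle), 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
        ])
    return operators


# A single made-up atom position stands in for the asymmetric unit.
asymmetric_unit = [(10.0, 0.0, 5.0)]
assembly = build_assembly(cyclic_operators(4), asymmetric_unit)
```

For an icosahedral virus the same loop would run over 60 operators, and for a helical assembly each operator would combine a rotation about the helix axis with a translation along it.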
A gallery of representative maps including some maps with fitted models is presented in Figure 3.
Access to EM structural data and related services is through the EMDataBank.org web site (Table 1). The EMDB and PDB archives are updated weekly on Wednesdays at 00:00 GMT. EMDB is distributed on ftp mirrors supported by the two EMDataBank.org distribution partners in the UK and the USA, while PDB is distributed on ftp mirrors supported by each of the wwPDB partners. Upon depositor request, EM entries may be held for up to 2 years from the deposition date for map entries in EMDB, and up to 1 year for coordinate models in PDB. However, we strongly encourage depositors to make their data publicly available as soon as possible.
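The hold limits and the weekly update schedule described above can be combined to estimate the latest date on which a held entry would become public. The Python sketch below is only an illustration: it assumes that an entry is released at the first weekly update (Wednesday, 00:00 GMT) on or after the end of its maximum hold period, which is our simplification and not a statement of archive policy. All function names are hypothetical.

```python
from datetime import date, datetime, time, timedelta, timezone

# Maximum hold periods stated in the text, in years.
MAX_HOLD_YEARS = {"map": 2, "model": 1}


def add_years(day, years):
    """Shift a date by whole years, mapping 29 February to 28 February if needed."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def next_weekly_update(moment):
    """First Wednesday 00:00 GMT at or after a timezone-aware datetime."""
    days_ahead = (2 - moment.weekday()) % 7  # Monday is 0, Wednesday is 2
    candidate = datetime.combine(
        moment.date() + timedelta(days=days_ahead), time(0, 0), tzinfo=timezone.utc
    )
    if candidate < moment:
        candidate += timedelta(days=7)
    return candidate


def latest_release(deposited, kind):
    """Assumed latest public release for an entry held for the maximum period."""
    hold_end = add_years(deposited, MAX_HOLD_YEARS[kind])
    return next_weekly_update(
        datetime.combine(hold_end, time(0, 0), tzinfo=timezone.utc)
    )
```

Under these assumptions, a map and its fitted model deposited on the same day could be released up to a year apart if both were held for the maximum period.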
The EMSEARCH web service is maintained and updated weekly at both distribution sites. EMSEARCH enables browsing and searching of EMDB metadata uploaded into a relational database. Simple searches can be performed based on author name, title, entry id, sample name, citation abstract word, aggregation type, resolution and release date range. Search summaries link to a set of atlas pages for each entry, which include ‘Summary’, ‘Visualization’, ‘Sample’, ‘Experiment’, ‘Processing’, ‘Map Information’ and ‘Download’ pages (Figure 4). In addition, a full-text search system based on the Lucene indexer is nearing completion. All search options can be accessed from the EMDataBank.org search tab.
The ‘Visualization’ page for each EM map entry provides several viewing options. In addition to a static 2D image provided by the deposition author, it is also possible to launch two different Java-based 3D viewers (Figure 5, left and right top panels). EMViewer provides a simple, single isosurface representation of a map at a predetermined contour level that can be manipulated by mouse drag and click actions. OpenAstexViewer, a molecular graphics program originally developed to display crystallographic data (17), has been adapted for display of EM maps and associated coordinate models. To improve web download speed and minimize memory requirements, a compact map format is used (BRIX), and larger maps are also down-sampled by a factor of 2–5. Current capabilities include the ability to control map contour level, opacity, color, solid versus mesh surface rendering and concurrent display of a PDB coordinate entry. Viewing large maps may require increasing Java Applet Runtime memory allocation.
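To make the effect of down-sampling concrete, the following Python sketch reduces a 3D density grid by an integer factor. The resampling method used by the viewers is not described here, so averaging over non-overlapping blocks is an assumption made purely for illustration, and the function name is hypothetical.

```python
def downsample(volume, factor):
    """Average non-overlapping factor x factor x factor blocks of volume[z][y][x].

    Trailing voxels that do not fill a complete block are dropped.
    """
    nz = len(volume) // factor
    ny = len(volume[0]) // factor
    nx = len(volume[0][0]) // factor
    voxels_per_block = factor ** 3
    reduced = []
    for k in range(nz):
        plane = []
        for j in range(ny):
            row = []
            for i in range(nx):
                total = 0.0
                for dz in range(factor):
                    for dy in range(factor):
                        for dx in range(factor):
                            total += volume[k * factor + dz][j * factor + dy][i * factor + dx]
                row.append(total / voxels_per_block)
            plane.append(row)
        reduced.append(plane)
    return reduced


# Toy 4 x 4 x 4 grid whose density equals the x index.
volume = [[[float(x) for x in range(4)] for _ in range(4)] for _ in range(4)]
small = downsample(volume, 2)
```

Down-sampling by a factor f reduces the number of voxels by f cubed and multiplies the voxel edge length by f, so the physical extent of the map is unchanged while fine detail is lost.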
Maps are distributed in CCP4 format and can be viewed with locally installed software such as UCSF Chimera (18), PyMOL (19), VMD (20) and Coot (21), enabling further analysis and manipulation with an extensive set of tools. Links for map download are available on the ‘Download’ and ‘Map Information’ pages of each entry. Recent distributions of UCSF Chimera enable direct downloads of EMDB maps plus associated PDB models via simple queries (e.g. author name, title) to the EMSEARCH relational database through the EMDataBank.org beta-web service, a self-contained programming interface based on the SOAP protocol (Figure 5, bottom panel).
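For readers who wish to inspect a downloaded map without a graphics program, the Python sketch below reads the grid dimensions, cell lengths and density statistics from the 1024-byte header of a CCP4 map and derives the voxel size. It is a minimal illustration, not a full reader: it assumes a little-endian file unless told otherwise, ignores symmetry records and the density data themselves, and uses a hypothetical function name.

```python
import struct

HEADER_BYTES = 1024  # fixed-size CCP4 map header


def read_ccp4_header(raw, byte_order="<"):
    """Extract a few fields from the first 1024 bytes of a CCP4 map.

    byte_order is a struct prefix ("<" little-endian, ">" big-endian);
    little-endian is assumed here rather than read from the machine stamp.
    """
    if len(raw) < HEADER_BYTES:
        raise ValueError("CCP4 header must be at least 1024 bytes")
    words = struct.unpack_from(byte_order + "10i", raw, 0)
    cell = struct.unpack_from(byte_order + "6f", raw, 40)
    axis_order = struct.unpack_from(byte_order + "3i", raw, 64)
    dmin, dmax, dmean = struct.unpack_from(byte_order + "3f", raw, 76)
    nx, ny, nz = words[7:10]  # sampling intervals along the cell edges
    return {
        "grid": words[0:3],            # columns, rows, sections
        "mode": words[3],              # data type code (2 is 32-bit float)
        "cell_lengths": cell[0:3],     # in Angstrom
        "axis_order": axis_order,      # which cell axis maps to col/row/section
        "voxel_size": (cell[0] / nx, cell[1] / ny, cell[2] / nz),
        "density_min": dmin,
        "density_max": dmax,
        "density_mean": dmean,
        "has_map_tag": raw[208:212] == b"MAP ",
    }
```

A typical call would be `read_ccp4_header(open(path, "rb").read(1024))`, where `path` points to a locally downloaded map file. The voxel size and density statistics returned here are the header quantities that determine the physical scale of the display and guide the choice of contour level.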
A major goal of the EMDataBank.org unified data resource is to archive EM-derived structural information in a way that will enable further research. The impact of availability of structural data for smaller proteins has already been amply demonstrated by the success of the PDB, which is accessed by thousands of individuals around the globe every day. Listed here are a few examples of how archived EM data facilitates subsequent scientific exploration by other investigators.
Atomic coordinates permit 3D mapping of a wide variety of data including sites of mutations leading to disease, modification sites, antibody recognition sites, amino acid sequence variability and electrostatic properties. But in many cases, high-resolution structural models are not available for every component within a cryoEM map at the time it is first interpreted. By preserving map and model information together in a freely accessible database, new structural information can be incorporated as it becomes available. Examples include (i) reinterpretation of ribosome stalk regions in multiple archived maps [EMD-1005 through 1008, 1055, 1056; (22)] and (ii) recent progress by the Rossmann group towards complete fitted coordinate models for bacteriophage T4 in multiple functional states [EMD-1048 and 3H3W; EMD-1086 and 3H3Y (23)].
Structural knowledge can be crucial for interpreting biochemical and biophysical data. For instance, dengue and West Nile virus cryoEM structures (24,25) have led the way to mapping sites of glycosylation (26), amino acid sequence variability (27), neutralizing antibody binding (28), design of vaccines (29) and a structural basis for understanding membrane fusion (30). Fitting of X-ray crystal structures of cadherin into tomographic maps of epidermal desmosomes has yielded insights into cell adhesion [EMD-1051, 1052, 1053, 1374, 1449; (31,32)].
For low-symmetry complexes such as ribosomes, a low-resolution cryoEM map can provide a valuable complementary source of information for crystallographic phasing. For high-symmetry structures such as icosahedral viruses, application of robust computational averaging and extension algorithms to initial cryoEM-derived phases is often sufficient to complete the structure determination without additional phase information. Recent case studies have employed ribosome maps [EMD-1008 and 1019; (33)] and a dodecameric enzyme map [EMD-1680; (34)]. Routine archiving of EM-derived maps facilitates the use of this method by the crystallography community.
By capturing coordinates, maps and metadata and making them available through searchable databases, it is possible to compare and select structures for further analysis using a variety of criteria, perform experimental design for new analyses, and design new algorithms to improve state-of-the-art methodology. For example, comparison of publicly available maps and coordinates led to the conclusion that the double-stranded DNA tailed phages and herpesvirus have a similar fold in their major capsid proteins though there is little sequence similarity among them [EMD-1101 and PDB entries 1OHG, 1YUE; (35)]. In a second example, a flexible fitting approach applied to 43 EMDB maps of the bacterial 70S ribosome in various functional states revealed global conformational differences between the EM structures involving large-scale ratchet-like deformations (36).
Many large molecular machines are not amenable to crystallization for high-resolution studies. CryoEM methods have begun to yield structures of membrane proteins, viruses and chaperonins determined to resolutions better than 4.5Å, where de novo methods can trace a Cα-backbone. In many cases, side chain densities are also visible. In one recent example, over 80% of side-chain densities in the archaeal chaperonin hexadecamer Mm-cpn could be resolved [EMD-5137 and PDB 3LOS; (37)].
Availability of EM map volumes in the EMDB facilitates development and validation of software for map viewing, analysis, manipulation, coordinate model fitting and validation. Development of algorithms for fitting of EM maps with atomic coordinates is a particularly active area. For low to medium-resolution studies in which the source coordinates are likely to be derived from crystallographic or NMR studies, algorithms being explored include rigid-body fitting (38), normal mode analysis (39), spatial interpolation (40), conformational sampling under low-resolution restraints (37,41), molecular dynamics flexible fitting (42) and simulated annealing approaches (43,44). For medium- to high-resolution studies where de novo model building becomes possible, methods are being developed for skeletonization and secondary-structure element detection (45) and incorporation of structure prediction from primary sequence (46). Additional examples of algorithm development facilitated by public availability of EMDB maps include investigations of map denoising (47), B-factor sharpening (48), map resolution determination (49) and automated segmentation (50,51). The UCSF Chimera team has made particularly effective use of the EMDB resource for development and testing of a large, versatile set of tools for manipulating volume data (18,52).
3D cryoEM reconstruction methods are uniquely able to reveal structural aspects of many important macromolecules and macromolecular complexes, and for this reason the field is in a period of rapid expansion and development. Based on current growth of EM entries and publication of EM structures, the total number of structures of large biological assemblies contributed by EM is anticipated to approach 10000 by the year 2020.
In the near future, deposition and archiving of EM structural data will be integrated in a common tool that is being developed by the wwPDB partners to handle depositions from all structural biology methods. In addition, as the field matures, validation tools and criteria for assessment of map and fitted coordinate models will play an important role in providing guidance to users of cryoEM-derived structural data. To this end, a validation task force is being assembled by the unified data resource partners (NCMI, RCSB and PDBe) to develop recommendations as to how best to assess the quality of both maps and models that have been obtained from cryoEM data. The recommendations will form the basis for a validation suite to be used by EMDB and PDB. The ‘CryoEM Modelling Challenge 2010’ (http://ncmi.bcm.edu/challenge), in which the modeling community is being asked to apply their modeling tools to a selected set of cryoEM densities at different resolutions, will permit careful comparison of the many modeling methods under development and will yield useful benchmark data for validation suite development.
EMDataBank.org is supported by a National Institutes of Health grant [R01GM079429] to Baylor College of Medicine, Rutgers University and the European Bioinformatics Institute. PDBe receives additional support for EMDB development from the Biotechnology and Biological Sciences Research Council [BBG022577] and from EMBL. Funding for open access charge: NIH grant R01GM079429.
Conflict of interest statement. None declared.
We are grateful to Tom Goddard for integrating the EMDataBank.org web service into UCSF Chimera.