The Protein Data Bank (PDB) is the world-wide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds one entry, if possible, for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on the structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. The PDBFINDER holds easy to parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics, to cancer biology and protein design.
The Protein Data Bank (PDB) (1–4) is the world-wide repository of macromolecular structures solved by X-ray diffraction, NMR, or (cryo-)electron microscopy. The more than 67000 entries in the PDB (summer 2010) are a true treasure trove for scientists in fields such as protein engineering, human genetics, drug design, molecular biology and biochemistry. In protein engineering, for example, one often needs to know whether a residue is conserved and, if not, which residue types are observed at the equivalent positions in related proteins. In human genetics, one often wonders where an observed disease-causing mutation is located in the structure relative to the active site, the DNA binding site, a multimer interface or another functionally important site. Such questions, and many more, can normally be answered if the structure of the protein at hand is available, and if a lot of additional data and tools are available. In protein structure bioinformatics it is common practice to use molecular visualization software, the UniProtKB/Swiss-Prot (5) file, a multiple sequence alignment, a report about the quality of the structure used, articles and many other types of information.
We maintain a series of databases, Web servers and Web services to aid scientists with their macromolecular structure-based research. About 75 Web servers that take PDB files as input are available at http://swift.cmbi.ru.nl/ (unpublished results), and we recently described the first series of 50 Web services (6) that act on PDB files. These facilities are used hundreds to thousands of times per day, but in some cases it makes better sense to pre-store the results rather than to generate them on-the-fly. One reason is that a result may be used frequently. In that case, generating the same result over and over is a waste of CPU time on the side of the server and of waiting time on the side of the user. Another reason is that some PDB-derived results take too much time to generate to be offered as a service. Creating a typical entry for the PDB_REDO database (7), for instance, takes several hours. The most important reason to store results in databases instead of generating them when they are needed is that it allows for quick data mining. Suppose we want a list of all PDB entries that contain threonine residues with inverted chirality at their Cβ atoms. Checking all threonines in the PDB will take hours or even days, checking all the PDBREPORT (8) records for this specific problem will only take minutes, and with a pre-indexed version of PDBREPORT this list can be retrieved in seconds.
We describe several databases that can be used to obtain insight into the many aspects of a specific protein, but that can also help to select data sets for (structural) analysis, to find properties of proteins in general, or to find suitable test sets to create, test and optimize new methods in structural biology research.
At http://swift.cmbi.ru.nl/gv/facilities/ an overview of all systems mentioned (and a few more) is given, and pointers are provided to extensive documentation that includes help for downloading whole databases.
A short summary of our databases, their purpose and their locations is given in Table 1. The first four databases listed in Table 1 provide PDB file annotation in terms of structure, sequence and quality (and the improvability thereof). The next three are aimed at data set selection and are partly derived from the first four databases. The final database provides information about the entries of the other databases, or rather about the entries missing from these databases.
The secondary structure of proteins is an important aspect in many fields of bioinformatics. A simple Google search for the exact string ‘secondary structure prediction server’ gives more than 70000 hits, and new methods to predict protein secondary structure are still published regularly (9–13). This might seem a bit surprising because there are not that many biological questions that require knowledge of a protein’s secondary structure to be answered, but in practice the secondary structure of a protein is an important tool for classification and comparison purposes (see, for example, the CATH (14) and SCOP (15) protein classification databases).
The DSSP software (16) describes the secondary structure of a protein based on its three dimensional structure. Over the years, several alternatives for DSSP have been produced. Looking at DSSP's thousands of citations, and at the fact that today, nearly 30 years after DSSP was written, this software is still distributed on average at least once per week and cited 4–5 times per week, it is safe to state that DSSP is the de facto standard in the field of secondary structure determination, and thus also in the field of secondary structure prediction. The DSSP database contains DSSP descriptions for every PDB entry. Figure 1 shows a very small part of a DSSP file with a short explanation.
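As an illustration of how a DSSP entry might be consumed, the following minimal sketch reads the per-residue records of a DSSP file and summarizes the secondary structure content. It assumes the classic fixed-width DSSP layout (residue number in columns 6 to 10, chain identifier in column 12, amino acid in column 14 and secondary structure class in column 17, after a header line starting with `  #  RESIDUE`); the function names are hypothetical and not part of the DSSP distribution.

```python
from collections import Counter


def parse_dssp_residues(lines):
    """Return (chain, residue_number, amino_acid, ss) tuples from DSSP text lines.

    Assumes the classic fixed-width DSSP layout, in which the per-residue
    block follows a header line that starts with '  #  RESIDUE'.
    """
    records = []
    in_residue_block = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_residue_block = True
            continue
        if not in_residue_block or len(line) < 17:
            continue
        if line[13] == "!":  # chain-break marker: no residue on this line
            continue
        residue_number = int(line[5:10])
        chain = line[11]
        amino_acid = line[13]
        # A blank class means no secondary structure was assigned.
        ss = line[16] if line[16] != " " else "-"
        records.append((chain, residue_number, amino_acid, ss))
    return records


def secondary_structure_content(records):
    """Fraction of residues in each secondary structure class."""
    counts = Counter(ss for _, _, _, ss in records)
    total = sum(counts.values())
    return {ss: n / total for ss, n in counts.items()} if total else {}
```

In practice one would pass an open file handle for a downloaded DSSP entry as `lines`; insertion codes and multi-model files are ignored here for brevity.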
The concept of residue conservation is highly conserved in many protein structure related research fields and has been mentioned many tens of thousands of times in the literature. A literature search reveals that sequence conservation is used to improve alignments, to score docking solutions, to find functional regions, to cluster residues involved in similar aspects of the protein’s function, in drug design, in optimizing HIV drug administration regimes, in evolutionary studies, in the prediction of protein interaction surfaces, in structure-function relation predictions, in secondary structure prediction, in the analysis of crystal contacts and in protein engineering, to mention just a few of the applications. The HSSP [Homology-derived Secondary Structure of Proteins; (17–21)] database holds for each PDB entry a multiple sequence alignment against all UniProt entries that can be aligned against the PDB file’s sequence with 5% more confidence than required to infer structural similarity (Figure 2). The sequence variance and the sequence entropy at each position in the protein sequence are given. Together with the alignment, this illustrates the structural and functional importance of each residue in the PDB file.
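To make the notion of per-position sequence entropy concrete, the sketch below computes the Shannon entropy of each column of a multiple sequence alignment. This is a generic textbook definition given under the assumption that gaps are ignored and entropy is measured in bits; the exact definition, weighting and scaling used in HSSP files may differ.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-."):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropies(aligned_sequences):
    """Per-position entropy for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned_sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*aligned_sequences)]
```

A fully conserved position yields an entropy of zero, whereas a position at which two residue types occur equally often yields one bit.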
PDB files are the result of experimental work, and thus are prone to experimental errors. These errors range from administrative mistakes such as violation of nomenclature, through small inaccuracies in bond geometry and small mistakes like wrong side chain rotamers, badly modelled flexible loops, or strange solvent models, all the way to gross errors, a few of which have led to retractions (e.g. (22)). We have designed the WHAT_CHECK (23–31) software to search for these errors, to list them, to quantify them, to try to find their origin and to suggest how to fix them when possible. We ran this software on all PDB files and the resulting reports list about 8.5 million errors, 33.6 million warnings and 17.2 million notes. These reports are available from the PDBREPORT database (8). The WHAT_CHECK reports present the users with 10 sections. The first two sections deal with problems that are detrimental to the quality of the validation in the sections that follow. These sections deal with space group related topics, topology determinations, missing atoms, etc. The third section provides a description of the molecule that is informative for the quality; this includes the Ramachandran plot and the secondary structure as described by DSSP. Further sections deal with occupancies, B-factors, terminal groups, nomenclature issues, elementary geometric features, torsion angles, proline rings, atomic clashes, threading issues, water molecules and ions, and hydrogen bond related topics such as His, Asn, or Gln residues that need their side chain flipped by 180° to make (better) hydrogen bonds. Each report ends with a summary of the most essential statistics. When used interactively, the WHAT_CHECK software finishes with a set of recommendations for further refinement, but this section is not included in the PDBREPORT database as it is only relevant for the crystallographer solving the structure. Table 2 lists for a few error types their frequency in the PDB.
The vast majority of structures in the PDB are solved by means of X-ray crystallography. The computational methods to produce a structure model based on the experimental X-ray data have improved dramatically since the beginning of the PDB and still improve today. Additionally, computers can now do in a day what was not even possible in a year in the early 90's, and we understand the biophysical and structural characteristics much better than in the years past. As a result of all this, crystallographers can now build better structure models than ever before. These advances come with a side effect: as new PDB entries improve, older PDB entries, which were solved with older computational methods, start to lag behind in terms of structure quality. To solve this issue, we started applying these new methods to existing, older PDB entries. Using the crystallographic program Refmac (32,33), we re-refined all X-ray structure models in the PDB for which the experimental X-ray data were deposited (34). In this process the fit of the atomic coordinates to the experimental X-ray data is optimized, which improved not only the fit to the experimental data for 67% of the PDB entries, but also the overall quality of structure models as judged by WHAT_CHECK. These updated PDB entries are stored in the PDB_REDO database (7) and can be used for structural biology research exactly as regular PDB files.
The PDB_REDO pipeline is still a topic of intense research in two collaborating groups, so further improvements are expected in the years to come. A recent improvement is the implementation of new algorithms that optimize the orientation of the peptide planes in the protein backbone, rebuild existing amino acid side chains, build missing ones and optimize hydrogen bonding (unpublished results). This allows PDB_REDO to actively improve structure models instead of relying on the radius of convergence of X-ray refinement software. Over a test set of 4100 PDB entries (deposited from 1996 to 2004) we saw an improvement of the fit of the atomic coordinates with the experimental data for 85% of the test set structures. We are currently updating all PDB_REDO entries to include these new developments. This will be completed by the end of 2010.
PDB entry 1bvs (35) is an example of the extent to which a structure is changed by optimizing it in PDB_REDO. The protein (the RuvA-Holliday junction complex) is an octamer consisting of two tetramers that dimerize by strong ionic interactions between anti-parallel helices (Figure 3a). 1bvs is a relatively low-resolution (3 Å) structure. The methods to refine such structures have improved substantially since the time this structure was solved (in 1998). By employing improved refinement methods, such as TLS refinement (33,36) and local non-crystallographic symmetry restraints, we could improve the fit of the structure model with the data: R-free went down from 31.9 to 28.4%. More importantly, this refinement led to new electron density maps that allowed us to rebuild the side chains at the dimerisation interface (Figure 3b). The rebuilding led to another improvement of R-free (down to 26.0%) and the rebuilt interface has much better ionic interactions, which is reflected by the residue packing score from WHAT_CHECK (27): the structure moves from a packing score 2σ below the average of a high-resolution test set to a score slightly higher than the average of the same set. This means that the case made by the depositor about the nature of the dimerisation interface is now better supported by the updated 3D structure than it was by the original structure.
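For readers less familiar with the R-factor quoted above, the sketch below shows the conventional definition, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|, summed over reflections. R-free is the same statistic evaluated only over a set of reflections that was held out from refinement. The sketch assumes that the calculated amplitudes have already been put on the same scale as the observed ones; it is not taken from Refmac or from the PDB_REDO pipeline.

```python
def r_factor(f_obs, f_calc):
    """Conventional crystallographic R-factor for paired structure-factor amplitudes.

    Assumes f_calc is already scaled to f_obs. For R-free, pass only the
    reflections that were excluded from refinement.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(o) for o in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return numerator / denominator
```

A lower value indicates better agreement between the model and the experimental data.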
Considering the substantial size of the aforementioned databases PDB, DSSP, HSSP and PDBREPORT (summing up to 160 GB in the summer of 2010), there is a strong need for a compact summary that can be parsed, searched [e.g. by SRS (37), EBeye (38) or MRS (39)], and analyzed quickly. For this purpose, the PDBFINDER (40) and more recently the PDBFINDER2 databases have been created. Both are actually single flat text files, optimized for minimum size and maximum parsing speed (when compared e.g. to the XML format). While the PDBFINDER (current size 0.16 GB) summarizes the PDB, the PDBFINDER2 includes information from DSSP, HSSP and PDBREPORT, simply added as extra lines (1.1 GB).
As can be seen from the example in Figure 4, the PDBFINDER contains information about the compound (including EC numbers for enzymes), the source, the authors, the experimental method and refinement software (partly manually curated), small molecules (HET-groups), the overall secondary structure content, and a list of all chains (proteins and nucleic acids), including secondary structure content, cysteine bridges and most importantly the sequence. The latter is actually the sequence extracted from the ATOM section of the PDB file, and thus contains only residues for which 3D coordinates are available. This is especially useful for all molecular modelling applications, since other PDB sequence summaries (e.g. the FASTA file generated weekly by the NCBI and commonly used for BLAST searches) are based on the SEQRES section, which includes all residues, even those whose structure could not be determined (and which are thus useless for modelling). The PDBFINDER2 provides many more per-residue data aligned with the sequence, which are described in the caption of Figure 4.
In our daily experience, there are two main applications for the PDBFINDERs. The first is complex structure selection queries that cannot be expressed easily in a database language like SQL. For instance, PDBFINDER allows us to quickly select all PDB entries that contain a specific enzyme (by employing the EC number) or all PDB entries that have more than 10 incomplete side chains. The required parsing of the PDBFINDER format takes just a few lines of code, but we also provide a Python module at www.yasara.org/biotools/. The second main application is visualization of the data by mapping it onto the corresponding 3D structure. For this purpose we developed a Python plug-in for the free molecular modelling program ‘YASARA View’ (42), available from www.yasara.org/viewdl/. Both Python scripts are licensed under the GNU GPL. Figure 5 shows examples of how information from PDBFINDER2 can be visualized.
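The kind of selection query described above can be sketched in a few lines of Python. The sketch assumes a simplified flat-file layout of `Field : value` lines with `//` terminating each entry; the field names `EC-Number` and `Incomplete-Sidechains` and the entry identifiers in the usage example are hypothetical placeholders, so the actual PDBFINDER documentation should be consulted for the real field names and nesting. This is not the Python module distributed at www.yasara.org/biotools/.

```python
def parse_flat_records(lines):
    """Yield one dict per entry from 'Field : value' lines; '//' ends an entry.

    Repeated fields are collected in a list, in order of appearance.
    """
    entry = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = {}
            continue
        key, sep, value = line.partition(":")
        if not sep:  # not a 'Field : value' line
            continue
        entry.setdefault(key.strip(), []).append(value.strip())
    if entry:  # tolerate a final entry without a terminator
        yield entry


def select_entries(entries, predicate):
    """Return the IDs of all entries for which predicate(entry) is true."""
    return [e["ID"][0] for e in entries if "ID" in e and predicate(e)]
```

For example, with made-up data:

```python
text = """ID : 1ABC
EC-Number : 3.4.21.4
Incomplete-Sidechains : 12
//
ID : 2XYZ
Incomplete-Sidechains : 3
//
""".splitlines()

entries = list(parse_flat_records(text))
by_enzyme = select_entries(entries, lambda e: "3.4.21.4" in e.get("EC-Number", []))
incomplete = select_entries(
    entries, lambda e: int(e.get("Incomplete-Sidechains", ["0"])[0]) > 10
)
```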
To study specific properties of proteins, structural biologists can study the entire PDB or a representative subset. Such subsets are lists of PDB entries created by filtering the PDB based on criteria of structural uniqueness, structure model quality and experimental parameters. Structural uniqueness is asserted by looking at the pairwise sequence alignment of all entries in the list and setting a cut-off for the maximum allowed sequence identity. From the Sander–Schneider plot (Figure 2) we see that 25% identity is a safe cut-off. Structure model quality can coarsely be determined by looking at the crystallographic (free) R-factor, but a more detailed filter for structure quality uses the results from structure validation software like PROCHECK (44) or WHAT_CHECK. Experimental parameters are usually the type of experiment used to ‘solve’ the structure (e.g. X-ray crystallography or NMR spectroscopy) and the X-ray resolution. PDBselect by Hobohm and Sander (45–47) provides a good example of methods to select a representative subset of the PDB, and so do the PISCES system (48) and the PDB_REPRDB (49–51). We also have precompiled representative lists of PDB entries in the PDB_SELECT database at http://swift.cmbi.ru.nl/gv/select/ (52). In PDB_SELECT, we have sorted the entries by their quality so that users who take the first N entries from one of the lists will automatically get the best N PDB files, where ‘best’ is defined as a function of resolution, R-factor, and a few WHAT_CHECK quality parameters as described above. Historically, we used a sequence identity cut-off of 30% to balance the requirement of structural uniqueness against that of getting a large enough data set. With the large increase in size of the PDB, a lower cut-off can be used in future PDB_SELECT sets.
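The combination of quality sorting and a sequence identity cut-off can be illustrated with a simple greedy procedure: walk through the entries from best to worst and keep an entry only if it is not too similar to any entry already kept. This is a generic sketch, not the procedure used to build PDB_SELECT or PDBselect. The quality ordering (resolution first, then R-factor) is a stand-in for the unspecified quality function, and the `identity` callback is assumed to return a pairwise sequence identity in percent.

```python
def rank_by_quality(entries):
    """Sort entries best-first; lower resolution value and lower R-factor are better.

    Placeholder ordering: the real quality function also uses WHAT_CHECK scores.
    """
    return sorted(entries, key=lambda e: (e["resolution"], e["r_factor"]))


def select_representatives(entries_best_first, identity, max_identity=30.0):
    """Greedily keep entries whose identity to every kept entry is <= max_identity."""
    selected = []
    for candidate in entries_best_first:
        if all(identity(candidate, kept) <= max_identity for kept in selected):
            selected.append(candidate)
    return selected
```

Because the input is sorted best-first, any prefix of the resulting list contains the highest-ranked non-redundant entries, which mirrors the 'take the first N' usage described above. The pairwise comparison makes this quadratic in the number of entries in the worst case.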
The databases discussed above are kept up-to-date automatically so new entries are continuously added. Sometimes PDB entries are made obsolete rendering their corresponding database entries also obsolete. We developed the WHY_NOT database to keep track of these changes in our other databases. WHY_NOT uses a crawler that runs through a local copy of the PDB and lists which database entries could (in principle) exist and then checks all the databases to see which entries actually do exist, which entries are missing and which entries are obsolete. As the name WHY_NOT implies, the most important function is storing the reasons why certain entries are missing. This serves both the users and maintainers of our databases. For users it is helpful to know that an entry cannot be made and an alternative should be sought, for maintainers it is good to know which entries we should stop trying to make over and over again.
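The bookkeeping performed by such a crawler amounts to a few set operations. The sketch below classifies the entries of one derived database as present, obsolete, missing with an annotated reason, or missing without explanation. Function and key names are hypothetical and the identifiers in the usage example are made up; the actual WHY_NOT implementation is not described in this article.

```python
def classify_entries(current_pdb_ids, database_ids, annotated_reasons):
    """Compare a derived database against the current PDB.

    current_pdb_ids   -- IDs for which a database entry could in principle exist
    database_ids      -- IDs for which a database entry actually exists
    annotated_reasons -- dict mapping an ID to the reason its entry is missing
    """
    could_exist = set(current_pdb_ids)
    exists = set(database_ids)
    missing = could_exist - exists
    return {
        "present": sorted(could_exist & exists),
        # Entries whose PDB parent has disappeared are obsolete.
        "obsolete": sorted(exists - could_exist),
        "missing_annotated": {
            pdb_id: annotated_reasons[pdb_id]
            for pdb_id in sorted(missing)
            if pdb_id in annotated_reasons
        },
        # Worth retrying in the next update cycle.
        "missing_unexplained": sorted(missing - set(annotated_reasons)),
    }
```

For example, `classify_entries(["1abc", "2xyz", "3def"], ["1abc", "9old"], {"2xyz": "only C-alpha atoms"})` marks `9old` as obsolete, `2xyz` as missing with a known reason and `3def` as missing without explanation.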
The most trivial reason for a missing database entry is that the PDB entry is so new that corresponding database entries were not created yet. Another simple reason for missing entries is the lack of input data. For instance, PDB_REDO needs the experimental X-ray data; if such data was not deposited, or the structure was solved by other means than X-ray crystallography (such as NMR spectroscopy) a PDB_REDO entry cannot be made. Similarly, a HSSP entry can only be made if a DSSP entry exists. These are obvious reasons for missing entries, but many problems are not straightforward and are annotated in WHY_NOT as ‘comments’. For instance DSSP cannot use protein structures that consist only of Cα-atoms, neither can it use PDB entries that contain only nucleic acids or ‘other things’ such as vancomycin (PDB entry 1sho; (53)). No PDBREPORTs will be made for PDB entries that contain no macromolecules such as PDB entry 1tn1 (54). A PDB_REDO entry cannot be made for X-ray structures in which not all atoms are explicitly listed, but need to be created through matrix operations, which is common practice with viral capsids [e.g. PDB entry 4rhv; (55)]. The most common problems listed in WHY_NOT are given in Table 3.
Most database update procedures add WHY_NOT comments automatically. The update procedure for PDB_REDO is an exception; all WHY_NOT comments are checked by hand. There are two reasons for this: some errors can be traced back to annotation problems in the PDB file or the X-ray data file (e.g. missing R-factors, corrupt TLS group selections, X-ray data stored in the wrong format) and others to limitations in the PDB_REDO software. The PDB_REDO software is a topic of ongoing research and is routinely updated to deal better with existing PDB problems. Solvable problems in PDB files are always reported to the PDB to ensure that they are fixed at the source rather than by making elaborate workarounds. So far, we have reported some 500 errors in PDB files. Simple administrative problems were fixed swiftly by PDB annotators (typically within two weeks), after which the PDB file was re-released; scientific problems require information from the depositor and may take longer to be solved.
All but one of our PDB-derived databases are updated with every new PDB release (PDB_SELECT is updated annually or upon request). When a new entry is added to the PDB or an existing entry is altered, its corresponding database entries are also (re)created. Our databases are interdependent via ‘hard’ dependencies (e.g. no HSSP entry can be made without a DSSP entry) and ‘soft’ dependencies (PDB_REDO uses PDBREPORT if an entry is available). The dependencies between databases are depicted in Figure 6. The process of building our databases resembles building software from source code, where one creates object files out of source files, which are then linked into executables. Because of this similarity we have chosen the ubiquitous make to do the actual work, with the rules written in Makefiles; the result is a very flexible and robust system. Once a week, the make process is started by a ‘cron’ job and begins by fetching the latest updates for the PDB. After the PDB has been updated, the dependent databases are built, guided by the Makefiles and the dependencies embodied therein. We have tweaked the Makefiles to allow for an exception for replacing existing HSSP files: HSSP uses the UniProtKB database, but because UniProtKB and PDB entries do not map 1-to-1, ‘all’ HSSP entries should be updated with every new release of UniProtKB. This makes the maintenance of HSSP files a quadratic problem because each PDB entry is aligned against all UniProtKB entries, and both databases grow continuously. We do not have the CPU power available to update all HSSP files at every UniProtKB release; instead we update as many HSSP files older than 6 months as we can (typically a few thousand) with the remaining CPU time of our 1-week update cycle.
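The make-like ordering of the weekly update can be sketched with a topological sort. The dependency tables below are an illustrative reading of the dependencies named in the text (HSSP needs DSSP; PDB_REDO optionally uses PDBREPORT) and are not a reproduction of Figure 6 or of the actual Makefiles. The sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Assumed dependency tables; target -> set of prerequisites.
HARD = {
    "DSSP": {"PDB"},
    "HSSP": {"DSSP"},
    "PDBREPORT": {"PDB"},
    "PDB_REDO": {"PDB"},
}
SOFT = {"PDB_REDO": {"PDBREPORT"}}


def build_order(hard, soft):
    """Return an order in which every database follows all its prerequisites.

    Soft dependencies influence the ordering only; they never block a build.
    """
    graph = {}
    for dependencies in (hard, soft):
        for target, prerequisites in dependencies.items():
            graph.setdefault(target, set()).update(prerequisites)
    return list(TopologicalSorter(graph).static_order())


def can_build(target, hard, available):
    """True if all hard prerequisites of target exist for the entry at hand."""
    return hard.get(target, set()) <= set(available)
```

Calling `build_order(HARD, SOFT)` places PDB first, DSSP before HSSP and PDBREPORT before PDB_REDO, while `can_build("HSSP", HARD, {"PDB"})` is false because the DSSP entry is missing. A real make run additionally compares file time stamps, which is omitted here.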
Multiple forms of access to the systems exist. The MRS system (mrs.cmbi.ru.nl) is a generic, freely available database query system that has been described elsewhere (39). MRS provides access to about 60 international databases that we use often enough to warrant in-house shadowing. MRS can also be used to query all databases mentioned in this article, except WHY_NOT. MRS also handles Web service requests, either using SOAP or the REST protocol. Five of the systems can be shadowed in-house using the rsync protocols listed in Table 4.
WHY_NOT is accessible via the WHY_NOT query system. DSSP can additionally be accessed through the WHAT IF Web servers (swift.cmbi.ru.nl) or through the WIWS Web services (WSDL address: http://wiws.cmbi.ru.nl/wsdl); these two systems also allow the user to upload his/her own PDB file for secondary structure determination.
PDB_REDO and PDBREPORT are also directly linked at every entry page of the EBI interface of the PDB.
We continue to work on our databases in order to improve their quality and usability. Improvements in quality come mostly from adding new options to the WHAT_CHECK software and the PDB_REDO pipeline. Both are the subject of ongoing research and new features are added frequently. The PDBREPORT database will be completely rebuilt when a new WHAT_CHECK is released by the end of 2010. We are also working on improving our software to reduce the number of missing entries or, if all else fails, to provide clear explanations of why certain entries cannot be made. Our WHY_NOT database will be an important resource to achieve this.
In terms of usability, we are working on making our databases easier to access. For instance, PDBREPORT can be indexed by our MRS database searching software. PDB_REDO structures will be accessible directly from molecular viewers such as YASARA. We are also working on new dissemination tools to guide the user in using our databases. We focus strongly on visualization: the WHAT_CHECK user course currently under development has numerous visual examples of the warnings and errors that can be found in PDBREPORT. The latest version of the PDB_REDO pipeline creates YASARA scenes that show exactly which atoms moved the most when a PDB entry was optimized.
Funding for open access charge: National Institutes of Health (Grant R01 GM62612); the Stichting Nationale Computerfaciliteiten (National Computing Facilities Foundation, NCF) for the use of supercomputer facilities; Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organization for Scientific Research, NWO).
Conflict of interest statement. None declared.
The authors are especially grateful to those users of these databases who cite the related articles and who report problems. Each of the databases holds between fifty thousand and sixty-seven thousand entries, and it is impossible to always have everything correct and up-to-date. User support is highly valued. The EU contributed in the 1990s to the initial design of several of the systems mentioned. More recent financial support also came from EMBRACE, BioSapiens, Elixir and NBIC.