A key component of creating the public archive of information is the efficient capture and curation of the data, that is, data processing. Data processing consists of data deposition, annotation and validation. These steps are part of the fully documented and integrated data processing system shown in Figure .
In the present system (Fig. ), data (atomic coordinates, structure factors and NMR restraints) may be submitted via email or via the AutoDep Input Tool (ADIT; http://pdb.rutgers.edu/adit/) developed by the RCSB. ADIT, which is also used to process the entries, is built on top of the mmCIF dictionary, which is an ontology of 1700 terms that define the macromolecular structure and the crystallographic experiment (2, 3), and a data processing program called MAXIT (MAcromolecular EXchange Input Tool). This integrated system helps to ensure that the data submitted are consistent with the mmCIF dictionary, which defines data types, enumerates ranges of allowable values where possible and describes allowable relationships between data values.
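As an illustration of dictionary-driven checking, the sketch below validates submitted values against data types, allowable ranges and enumerated values. It is a minimal sketch and not the ADIT or MAXIT implementation: the item names, limits and allowed values in DICTIONARY are hypothetical and greatly simplified, and relationships between data values are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class ItemDefinition:
    """One dictionary term: a data type plus optional constraints."""
    parse: Callable[[str], object]            # converts the raw string to its data type
    minimum: Optional[float] = None           # inclusive lower bound, if any
    maximum: Optional[float] = None           # inclusive upper bound, if any
    allowed: Optional[FrozenSet[str]] = None  # enumerated values, if any


# Hypothetical, simplified dictionary: names and limits are illustrative only.
DICTIONARY: Dict[str, ItemDefinition] = {
    "cell.length_a": ItemDefinition(float, minimum=0.0),
    "cell.angle_alpha": ItemDefinition(float, minimum=0.0, maximum=180.0),
    "exptl.method": ItemDefinition(
        str, allowed=frozenset({"X-RAY DIFFRACTION", "NMR"})
    ),
}


def validate(record: Dict[str, str]) -> List[str]:
    """Return one message per value that is inconsistent with the dictionary."""
    problems: List[str] = []
    for name, raw in record.items():
        definition = DICTIONARY.get(name)
        if definition is None:
            problems.append(f"{name}: not defined in the dictionary")
            continue
        try:
            value = definition.parse(raw)
        except ValueError:
            problems.append(f"{name}: {raw!r} has the wrong data type")
            continue
        if definition.minimum is not None and value < definition.minimum:
            problems.append(f"{name}: {value} is below {definition.minimum}")
        if definition.maximum is not None and value > definition.maximum:
            problems.append(f"{name}: {value} is above {definition.maximum}")
        if definition.allowed is not None and value not in definition.allowed:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems
```

Calling validate({"cell.length_a": "-5.0"}) returns a single range message, whereas a record whose values all satisfy the dictionary returns an empty list.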
After a structure has been deposited using ADIT, a PDB identifier is sent to the author automatically and immediately (Fig. , Step 1). This is the first stage in which information about the structure is loaded into the internal core database (see section on the PDB Database Resource). The entry is then annotated as described in the validation section below. This process involves using ADIT to help diagnose errors or inconsistencies in the files. The completely annotated entry as it will appear in the PDB resource, together with the validation information, is sent back to the depositor (Step 2). After reviewing the processed file, the author sends any revisions (Step 3). Depending on the nature of these revisions, Steps 2 and 3 may be repeated. Once approval is received from the author (Step 4), the entry and the tables in the internal core database are ready for distribution. The schema of this core database is a subset of the conceptual schema specified by the mmCIF dictionary.
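The four steps can be sketched as a small state machine. This is one interpretation of the steps described above, not the PDB's internal software: the stage names and the transition table are assumptions made for illustration, including the choice to let revisions lead either to another round of review or directly to approval.

```python
from enum import Enum, auto
from typing import Dict, Iterable, Set


class Stage(Enum):
    DEPOSITED = auto()       # Step 1: identifier issued, entry loaded into core database
    SENT_TO_AUTHOR = auto()  # Step 2: annotated entry and validation report returned
    REVISED = auto()         # Step 3: author sends revisions
    APPROVED = auto()        # Step 4: author approval, entry ready for distribution


# Assumed transitions; Steps 2 and 3 may repeat until the author approves.
TRANSITIONS: Dict[Stage, Set[Stage]] = {
    Stage.DEPOSITED: {Stage.SENT_TO_AUTHOR},
    Stage.SENT_TO_AUTHOR: {Stage.REVISED, Stage.APPROVED},
    Stage.REVISED: {Stage.SENT_TO_AUTHOR, Stage.APPROVED},
    Stage.APPROVED: set(),
}


def advance(current: Stage, following: Stage) -> Stage:
    """Move to the next stage, rejecting transitions the workflow does not allow."""
    if following not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.name} to {following.name}")
    return following


def replay(history: Iterable[Stage]) -> Stage:
    """Apply a sequence of stages to a new deposition and return the final stage."""
    current = Stage.DEPOSITED
    for following in history:
        current = advance(current, following)
    return current
```

A history with one round of revisions, such as SENT_TO_AUTHOR, REVISED, SENT_TO_AUTHOR, APPROVED, replays without error, while any attempt to move an approved entry to another stage raises ValueError.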
All aspects of data processing, including communications with the author, are recorded and stored in the correspondence archive. This makes it possible for the PDB staff to retrieve information about any aspect of the deposition process and to closely monitor the efficiency of PDB operations.
Current status information, comprising a list of authors, title and release category, is stored for each entry in the core database and is made accessible for query via the WWW interface (http://www.rcsb.org/pdb/status.html). Entries before release are categorized as ‘in processing’ (PROC), ‘in depositor review’ (WAIT), ‘to be held until publication’ (HPUB) or ‘on hold until a depositor-specified date’ (HOLD).
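The four release categories lend themselves to a simple enumeration. The sketch below shows how status records might be grouped by category; the record layout (a list of identifier and status-code pairs) and the identifiers in the usage note are hypothetical and do not reflect the actual core database schema.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class ReleaseStatus(Enum):
    """Pre-release categories and their meanings, as listed in the text."""
    PROC = "in processing"
    WAIT = "in depositor review"
    HPUB = "to be held until publication"
    HOLD = "on hold until a depositor-specified date"


def group_by_status(
    entries: Iterable[Tuple[str, str]]
) -> Dict[ReleaseStatus, List[str]]:
    """Group (identifier, status code) pairs by release category."""
    grouped: Dict[ReleaseStatus, List[str]] = {status: [] for status in ReleaseStatus}
    for identifier, code in entries:
        # Lookup by name raises KeyError for an unknown status code.
        grouped[ReleaseStatus[code.strip().upper()]].append(identifier)
    return grouped
```

For example, group_by_status([("ent1", "HPUB"), ("ent2", "hold")]) places the first made-up identifier under HPUB and the second under HOLD.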
Content of the data collected by the PDB
All the data collected from depositors by the PDB are considered primary data. Primary data contain, in addition to the coordinates, general information required for all deposited structures and information specific to the method of structure determination. Table contains the general information that the PDB collects for all structures as well as the additional information collected for those structures determined by X-ray methods. The additional items listed for the NMR structures are derived from the International Union of Pure and Applied Chemistry (IUPAC) recommendations (4) and will be implemented in the near future.
The information content of data submitted by the depositor is likely to change as new methods for data collection, structure determination and refinement evolve and advance. In addition, the ways in which these data are captured are likely to change as the software for structure determination and refinement produces the necessary data items as part of its output. ADIT, the data input system for the PDB, has been designed to incorporate these likely changes easily.
Validation
Validation refers to the procedure for assessing the quality of deposited atomic models (structure validation) and for assessing how well these models fit the experimental data (experimental validation). The PDB validates structures using accepted community standards as part of ADIT’s integrated data processing system. The following checks are run and are summarized in a letter that is communicated directly to the depositor:
Covalent bond distances and angles. Proteins are compared against standard values from Engh and Huber (5); nucleic acid bases are compared against standard values from Clowney et al. (6); sugars and phosphates are compared against standard values from Gelbin et al. (7).
Stereochemical validation. All chiral centers of proteins and nucleic acids are checked for correct stereochemistry.
Atom nomenclature. The nomenclature of all atoms is checked for compliance with IUPAC standards (8) and is adjusted if necessary.
Close contacts. The distances between all atoms within the asymmetric unit of crystal structures and the unique molecule of NMR structures are calculated. For crystal structures, contacts between symmetry-related molecules are checked as well.
Ligand and atom nomenclature. Residue and atom nomenclature is compared against the PDB dictionary (ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt) for all ligands as well as standard residues and bases. Unrecognized ligand groups are flagged and any discrepancies in known ligands are listed as extra or missing atoms.
Sequence comparison. The sequence given in the PDB SEQRES records is compared against the sequence derived from the coordinate records. This information is displayed in a table where any differences or missing residues are marked. During structure processing, the sequence database references given by DBREF and SEQADV are checked for accuracy. If no reference is given, a BLAST (9) search is used to find the best match. Any conflict between the PDB SEQRES records and the sequence derived from the coordinate records is resolved by comparison with various sequence databases.
Distant waters. The distances between all water oxygen atoms and all polar atoms (oxygen and nitrogen) of the macromolecules, ligands and solvent in the asymmetric unit are calculated. Distant solvent atoms are repositioned using crystallographic symmetry such that they fall within the solvation sphere of the macromolecule.
In almost all cases, serious errors detected by these checks are corrected through annotation and correspondence with the authors.
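To make one of these checks concrete, the sketch below flags close contacts among a set of atoms by computing all pairwise distances. It is a minimal sketch and not the PDB's validation code: the 2.2 Å cutoff, the atom labels and the way bonded pairs are supplied are all assumptions chosen for illustration, and symmetry-related molecules are not considered.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Coordinate = Tuple[float, float, float]


def close_contacts(
    atoms: Dict[str, Coordinate],
    cutoff: float = 2.2,  # assumed threshold in angstroms, not a PDB standard
    bonded: FrozenSet[FrozenSet[str]] = frozenset(),
) -> List[Tuple[str, str, float]]:
    """Return non-bonded atom pairs closer than `cutoff`, with their distance."""
    flagged: List[Tuple[str, str, float]] = []
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(atoms.items(), 2):
        if frozenset((name_a, name_b)) in bonded:
            continue  # covalently bonded pairs are expected to be close
        distance = math.dist(xyz_a, xyz_b)
        if distance < cutoff:
            flagged.append((name_a, name_b, round(distance, 2)))
    return flagged
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a sketch; a production check on large structures would more plausibly use a spatial grid or neighbor list.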
It is also possible to run these validation checks against structures before they are deposited. A validation server (http://pdb.rutgers.edu/validate/) has been made available for this purpose. In addition to the summary report letter, the server also provides output from PROCHECK (10), NUCheck (Rutgers University, 1998) and SFCHECK (11). A summary atlas page and molecular graphics are also produced.
The PDB will continually review the checking methods used and will integrate new procedures as they are developed by the PDB and members of the scientific community.
Other data deposition centers
The PDB is working with other groups to set up deposition centers. This enables people at other sites to deposit their data more easily via the Internet. Because it is critical that the final archive is kept uniform, the content and format of the final files as well as the methods used to check them must be the same. At present, the European Bioinformatics Institute (EBI) processes data that are submitted to it via AutoDep (http://autodep.ebi.ac.uk/). Once these data are processed they are sent to the RCSB in PDB format for inclusion in the central archive. Before this system was put in place it was tested to ensure consistency among entries in the PDB archive. In the future, the data will be exchanged in mmCIF format using a common exchange dictionary, which along with standardized annotation procedures will ensure a high degree of uniformity in the archival data. Structures deposited and processed at the EBI represent ~20% of all data deposited.
Data deposition will also soon be available from an ADIT Web site at The Institute for Protein Research at Osaka University in Japan. At first, structures deposited at this site will be processed by the PDB staff. In time, the staff at Osaka will complete the data processing for these entries and send the files to the PDB for release.
NMR data
The PDB staff recognizes that NMR data need a special development effort. Historically these data have been retrofitted into a PDB format defined around crystallographic information. As a first step towards improving this situation, the PDB did an extensive assessment of the current NMR holdings and presented its findings to a Task Force consisting of a cross section of NMR researchers. The PDB is working with this group, the BioMagResBank (BMRB) (12), as well as other members of the NMR community, to develop an NMR data dictionary along with deposition and validation tools specific for NMR structures. This dictionary contains, among other items, descriptions of the solution components, the experimental conditions, enumerated lists of the instruments used, as well as information about structure refinement.
Data processing statistics
Production processing of PDB entries by the RCSB began on January 27, 1999. The median time from deposition to the completion of data processing including author interactions is less than 10 days. The number of structures with a HOLD release status remains at ~22% of all submissions; 28% are held until publication; and 50% are released immediately after processing.
When the RCSB became fully responsible there were about 900 structures that had not been completely processed. These included so-called Layer 1 structures that had been processed by computer software but had not been fully annotated. All of these structures have now been processed and are being released after author review.
The breakdown of the types of structures in the PDB is shown in Table . As of September 14, 1999, the PDB contained 10 714 publicly accessible structures with another 1169 entries on hold. Of these, 8789 (82%) were determined by X-ray methods, 1692 (16%) were determined by NMR and 233 (2%) were theoretical models. Overall, 35% of the entries have deposited experimental data.
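The quoted percentages follow directly from the counts given above, as the short calculation below shows. The variable names are illustrative; the numbers are those stated in the text.

```python
# Publicly accessible structures by method, as of September 14, 1999.
counts = {"X-ray": 8789, "NMR": 1692, "theoretical model": 233}

# The three categories account for every publicly accessible structure.
total = sum(counts.values())

# Share of each method, rounded to the nearest whole percent.
shares = {method: round(100 * n / total) for method, n in counts.items()}

for method, share in shares.items():
    print(f"{method}: {counts[method]} of {total} ({share}%)")
```

The total comes to 10 714, matching the number of publicly accessible structures, and the rounded shares are 82%, 16% and 2%.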
Data uniformity
A key goal of the PDB is to make the archive as consistent and error-free as possible. All current depositions are reviewed carefully by the staff before release. Tables of features are generated from the internal data processing database and checked. Errors found subsequent to release by authors and PDB users are addressed as rapidly as possible. Corrections and updates to entries should be sent to deposit@rcsb.rutgers.edu for the changes to be implemented and re-released into the PDB archive.
One of the most difficult problems that the PDB now faces is that the legacy files are not uniform. Historically, existing data (‘legacy data’) comply with several different PDB formats and variation exists in how the same features are described for different structures within each format. The introduction of the advanced querying capabilities of the PDB makes it critical to accelerate the data uniformity process for these data. We are now at a stage where the query capabilities surpass the quality of the underlying data. The data uniformity project is being approached in two ways. First, families of individual structures are being reprocessed using ADIT. The strategy of processing data files as groups of similar structures facilitates the application of biological knowledge by the annotators. Second, we are examining particular records across all entries in the archive. As an example, we have recently completed examining and correcting the chemical descriptions of all of the ligands in the PDB. These corrections are being entered in the database. The practical consequence of this is that soon it will be possible to accurately find all the structures in the PDB bound to a particular ligand or ligand type. In addition to the efforts of the PDB to remediate the older entries, the EBI has also corrected many of the records in the PDB as part of its ‘clean-up’ project. The task of integrating all of these corrections done at both sites is very large and it is essential that there is a well-defined exchange format to do this; mmCIF will be used for this purpose.