The wwPDB (1) maintains the Protein Data Bank (PDB) archives of biological macromolecular structure data, currently comprising over 32500 structures. Since the year 2000, the worldwide structural genomics initiatives have provided more than 2400 structures, which have also added a large number of new folds. To represent the progress of this collective effort, the RCSB PDB (2) has developed and maintains the Structural Genomics Information Portal at http://sg.pdb.org, which consists of three main sections, outlined below.
Structural genomics initiatives
The first section of the information portal provides summary information about each structural genomics center, including target lists, target status, targets in the PDB and sequence redundancy analyses. Summary statistics describing the overall progress of all contributing projects, including sequence similarity and number of structures determined, are regularly tabulated. As an example, an analysis of the sequence similarity of structures solved by structural genomics projects relative to structures in the PDB archive is shown in Figure 1.
Figure 1 August 2005 report from the structural genomics information portal showing structural genomics structures with sequence similarity <30% relative to solved structures in the PDB by year. Sequence comparisons are performed using the blastclust application ...
The Targets section offers databases that track target registration data. Currently, 20 structural genomics centers contribute data to the TargetDB (3) resource (http://targetdb.pdb.org). These data include contributing project and target identifier; protein name, source organism and sequence; current production status (e.g. cloned, expressed and crystallized); related database references; and links to related project information. TargetDB assembles data from all contributing centers and makes these data available in a single validated XML data file which is updated weekly.
Targets can also be selected by searching TargetDB by target identifier, similar sequence, program or project, current production status, protein name or source organism. Search results can be captured in FASTA, TargetDB XML or HTML formats. The HTML report presents all of the contributed details about each target including links to related project information and archival databases [e.g. sequence, PDB and BMRB (4)], and links out to protein domain databases. An additional online form constructs cumulative reports summarizing the status of a particular program or project.
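Search results captured in FASTA format lend themselves to simple post-processing. The following is a minimal sketch, assuming only the generic FASTA layout (a `>` header line followed by one or more sequence lines). The header layout actually emitted by TargetDB is not described here, so taking the first whitespace-delimited token as the target identifier is an assumption, and the identifiers and sequences in the example are invented.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier to its concatenated sequence."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are appended to the most recent record.
            records[current_id] += line
    return records


# Hypothetical report; identifiers and sequences are invented for illustration.
example_report = """\
>TARGET_A hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>TARGET_B hypothetical kinase
MSDNELKQ
"""

targets = parse_fasta(example_report.splitlines())
```

Here `targets` maps each identifier to a single unbroken sequence string, which can then be filtered by length or passed to a similarity search.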
Created as an extension to TargetDB, the Protein Expression Cloning and Purification Database, PepcDB (http://pepcdb.pdb.org), was established to collect more detailed status information and the experimental details of each step in the protein production pipeline. PepcDB captures a complete history of the experimental steps in each production trial, in addition to describing the current target production status. The status history in PepcDB also records the time interval required to complete each experimental step, with an explanation if work on a particular target or experiment was terminated. Standard protocol descriptions are collected in text form for each step of protein production. Multiple experimental trials can be described for each target. Each trial may reference a set of standard protocols and optionally include the special details of an experimental step and the experimentally observed sequence.
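The status history described above can be pictured as an ordered list of dated events per trial. The sketch below is illustrative only: the class names, field names and protocol label are hypothetical and do not reflect the actual PepcDB schema. It shows how the time interval for each step could be derived from consecutive status dates.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class StatusEvent:
    status: str                         # e.g. "cloned", "expressed"
    reached_on: date                    # date the status was reached
    protocol_ref: Optional[str] = None  # hypothetical protocol label


@dataclass
class Trial:
    target_id: str
    events: List[StatusEvent] = field(default_factory=list)
    stop_reason: Optional[str] = None   # explanation if work was terminated

    def step_intervals(self) -> List[Tuple[str, int]]:
        """Days elapsed between each status and the one preceding it."""
        ordered = sorted(self.events, key=lambda e: e.reached_on)
        return [
            (later.status, (later.reached_on - earlier.reached_on).days)
            for earlier, later in zip(ordered, ordered[1:])
        ]


# Invented example trial for a made-up target identifier.
trial = Trial(
    target_id="TARGET_A",
    events=[
        StatusEvent("cloned", date(2005, 1, 10), protocol_ref="cloning-1"),
        StatusEvent("expressed", date(2005, 2, 1)),
        StatusEvent("purified", date(2005, 2, 15)),
    ],
)
intervals = trial.step_intervals()
```

A target with several trials would simply hold several `Trial` objects, each with its own history and optional termination reason.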
A validation server has been provided for PepcDB contributors (http://pepcdb.pdb.org/validation.html). Data files validated through this form are automatically loaded into the PepcDB database. PepcDB currently includes protocol information from the NIH Protein Structure Initiative (PSI) centers. TargetDB status data from all other structural genomics centers are merged into PepcDB. As a result, PepcDB always provides the most complete view of target status and experimental information for structural genomics projects.
The search features of PepcDB build upon those of TargetDB by offering additional tools to mine experimental protocols. Protocol searches are integrated with queries for target sequence and other target attributes. The resulting report includes the essential target description provided by TargetDB plus additional links to a chronological status history and links to related experimental protocols.
The Structures section of the RCSB PDB Structural Genomics Information Portal (http://function.rcsb.org:8080/pdb/function_distribution/index.html) provides information about the functional distribution of solved structures, structures being determined by structural genomics and homology models determined from solved structures (5). Function is measured relative to Ensembl-assigned functions from the human genome (6) and disease relative to OMIM assignments for human diseases (7). This section answers the question ‘With respect to the function of proteins identified in humans and human disease, what does the present complement of structures in the PDB, the structural genomics targets (if all were solved) and homology models that can be built from the current set of templates add to our understanding of living systems?’ The answer to this question changes over time, and the functional distribution resource provides a current answer because the constituent components needed to address the question [PDB structures, structural genomics targets, homology models from SUPERFAMILY (8), functional assignments from Ensembl and disease classifications from OMIM] are all updated as they change, at intervals ranging from weekly for PDB structures and targets to approximately annually for SUPERFAMILY. The answer also depends on the definition of a homology model. Here the structural templates used in homology modeling were a set of hidden Markov models taken from SUPERFAMILY 1.65. The sequences were aligned to the structural template with HMMER (9). Only those assigned domains with sequence identity >30% in the alignment were considered as homology models.
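The >30% identity cutoff can be illustrated with a short calculation on a pairwise alignment. This is a sketch under stated assumptions, not the procedure used by the resource: the text does not say how identity is normalized, so the code below divides identical positions by the number of columns in which neither sequence has a gap, and the function names are hypothetical.

```python
GAP_CHARS = "-."


def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over alignment columns that contain no gap.

    Assumption: identity = identical columns / ungapped columns * 100.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query.upper(), aligned_template.upper()):
        if q in GAP_CHARS or t in GAP_CHARS:
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def is_homology_model(aligned_query: str, aligned_template: str,
                      threshold: float = 30.0) -> bool:
    """Keep a domain assignment only if identity is strictly above threshold."""
    return percent_identity(aligned_query, aligned_template) > threshold


# Invented alignment: 4 identical residues over 5 ungapped columns.
example_identity = percent_identity("ACDE-F", "ACDQGF")
```

Because the comparison is strict, an assignment at exactly 30% identity is rejected, matching the ‘>30%’ wording.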
Through the functional distribution site, this question can be addressed by examining molecular function, biological process and cellular component [as assigned by the Gene Ontology, GO (10)], enzymes via their EC numbers (www.chem.qmw.ac.uk/iubmb/enzyme), and diseases assigned through OMIM (7). Several steps are used to define the search parameters; here molecular function is used as an example. In Step 1 (Molecular Function) the breadth of the search is defined, which in turn defines the details presented in the results. For example, the top level of the GO hierarchy for molecular function is displayed and used by default. All structures could be selected, or a subgroup could be selected (e.g. all structures with the molecular function ‘vitamin binding’) by browsing through the hierarchical tree. Similarly in Step 2 (Structure Type), all structures are chosen by default, but it is possible to drill down and explore just groups of structures based on the SCOP classification of class (all alpha, all beta, etc.) (11). Step 3 selects the genome. At present only the human genome is available, but other model organisms will be added. Step 4 selects the sequence identity to use, with 40% identity the default. Sequence identity defines how the human genome sequences are clustered and a single function assigned for each cluster; at lower sequence identity there are fewer clusters, i.e. the results are effectively at lower resolution. Step 5 specifies the domain combinations needed for a match. Since PDB structures frequently represent a single domain in a larger complex, statistics can be produced requiring overlap for one or more domains up to the whole structure, accounting for domain rearrangements [see (5) for a full description].
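The five steps above amount to a small set of query parameters with defaults. The sketch below captures them in a single record; it is a hypothetical representation, not the portal's actual interface. The field names are invented, and the default for the domain-overlap setting is an assumption because the text does not state one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DistributionQuery:
    # Step 1: None means the top level of the GO molecular function hierarchy.
    function_term: Optional[str] = None
    # Step 2: None means all structures; otherwise a SCOP class name.
    scop_class: Optional[str] = None
    # Step 3: only the human genome is available at present.
    genome: str = "human"
    # Step 4: clustering threshold in percent; 40% is the stated default.
    sequence_identity: int = 40
    # Step 5: minimum number of domains that must overlap.
    # Assumption: the text gives no default, so 1 is used for illustration.
    min_domains_matched: int = 1

    def __post_init__(self) -> None:
        if self.genome != "human":
            raise ValueError("only the human genome is available at present")
        if not 0 < self.sequence_identity <= 100:
            raise ValueError("sequence identity must be in the range 1-100")
        if self.min_domains_matched < 1:
            raise ValueError("at least one domain must be matched")


default_query = DistributionQuery()
vitamin_query = DistributionQuery(function_term="vitamin binding",
                                  scop_class="all alpha")
```

Validating in `__post_init__` keeps unsupported combinations, such as a genome other than human, from reaching whatever code builds the distribution.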
Based on these input parameters, one of three distributions can be generated: a comparison of the distribution of PDB structures, structural genomics targets or homology models against the human genome; a ‘most wanted list’ of structures, i.e. those not in the PDB and which (by default) are not identified through homology modeling or in the structural genomics targets yet have significant presence in the human genome; and simple charts showing the distribution of the genome sequences, PDB structures, structural genomics targets or homology models. Most distributions are accompanied by two tables illustrating, first, the functional coverage by each data type, and second, the correlation between input data types (data not shown). The actual overlap between these groups will be added as part of on-going development. For example, PDB structures cover 37.2% of the identified molecular functions in the human genome; if solved, structural genomics targets would cover 32.4% of functions; and 56.3% of the molecular functions can be modeled from existing structures. Figure 2 illustrates the resulting normalized distributions for the top level of the GO molecular function hierarchy. At this level most distributions are not skewed, with the exception of ‘molecular function unknown’, where PDB structures are underrepresented and structural genomics targets are overrepresented. This is not surprising, since until structural genomics began, structural biology was dominated by determining structures of proteins of known function. In the era of structural genomics, that trend has reversed. Drilling down to more detailed descriptions of molecular function (data not shown) reveals a more uneven distribution and suggests changes in structure determination strategies.
Figure 2 Normalized functional coverage of the human genome by sequence (from Ensembl; red), by structures from the PDB (blue), by structural genomics targets (green) and homology models from SUPERFAMILY (yellow). When viewing the figure from the online structural ...
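The two quantities discussed above, functional coverage and a normalized distribution, can be sketched in a few lines. How the portal computes them is not described in detail, so the definitions below are assumptions: coverage is taken as the percentage of distinct functional terms in the genome that also occur in a given data set, and normalization divides each term's count by the data set's total. The term lists are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def coverage_percent(genome_terms: Iterable[str],
                     dataset_terms: Iterable[str]) -> float:
    """Percentage of distinct genome terms also present in the data set."""
    genome = set(genome_terms)
    if not genome:
        return 0.0
    return 100.0 * len(genome & set(dataset_terms)) / len(genome)


def normalized_distribution(assignments: Iterable[str]) -> Dict[str, float]:
    """Fraction of assignments falling under each functional term."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Invented example: four genome terms, three structure annotations.
genome_terms = ["binding", "catalytic activity",
                "transporter activity", "molecular function unknown"]
pdb_terms = ["binding", "binding", "catalytic activity"]

pdb_coverage = coverage_percent(genome_terms, pdb_terms)
pdb_distribution = normalized_distribution(pdb_terms)
```

Normalizing each data set separately is what allows data sets of very different sizes to be compared within the same chart.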
An important feature of this resource is the ‘most wanted list’ of structures based on the following criteria: (i) functional categories where proteins are underrepresented by structures; (ii) from (i), proteins which cannot be modeled, i.e. proteins from the human genome without SUPERFAMILY assignments; (iii) whether the protein can be associated with a human disease; and (iv) proteins identified as likely to be intractable, i.e. those with a transmembrane segment, are filtered out.
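The four criteria can be read as successive filters over the proteins of the human genome. The sketch below is one possible reading, not the resource's implementation: it treats every criterion, including disease association, as a strict requirement, which is an assumption since disease association could equally be used for ranking. The record type, field names and example entries are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class GenomeProtein:
    protein_id: str
    function_underrepresented: bool   # criterion (i)
    has_superfamily_assignment: bool  # criterion (ii): False = cannot be modeled
    disease_associated: bool          # criterion (iii)
    has_transmembrane_segment: bool   # criterion (iv): True = likely intractable


def most_wanted(proteins: Iterable[GenomeProtein]) -> List[GenomeProtein]:
    """Apply criteria (i)-(iv) as strict filters (an assumed interpretation)."""
    return [
        p for p in proteins
        if p.function_underrepresented
        and not p.has_superfamily_assignment
        and p.disease_associated
        and not p.has_transmembrane_segment
    ]


# Invented candidates; only P1 satisfies all four criteria.
candidates = [
    GenomeProtein("P1", True, False, True, False),
    GenomeProtein("P2", True, True, True, False),    # can already be modeled
    GenomeProtein("P3", True, False, True, True),    # transmembrane, dropped
    GenomeProtein("P4", False, False, True, False),  # function well covered
]
wanted_ids = [p.protein_id for p in most_wanted(candidates)]
```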