The original design of the SCOP database (7) proved flexible enough to accommodate, with relative ease, not only the growth in the number of experimentally determined protein structures over the last 7 years, but also deeper modifications of the database itself, such as the recent introduction of unique identifiers.
In computational terms, although SCOP is essentially a hierarchy, a mechanism for cross-linking between nodes of the tree makes it a graph, which allows the representation of complicated biological relationships (8) as well as the more restrictive parent–child relationships of a tree.
Since the original implementation of SCOP is based on a description of the underlying data structure, rather than on the data structure itself, it is easy to introduce new classification levels, the need for which has been clear from the very beginning (2), but which have not actually been included so far. To introduce a new level, all that is required is to modify the description accordingly, and the rest will fall into place automatically.
New SCOP identifiers
It is with these future extensions in mind that we designed a new set of identifiers, unique integers (sunid) associated with each node of the hierarchy, and a new set of concise classification strings (sccs). Both will be kept stable across SCOP releases in the sense defined below.
A sunid is simply a number which uniquely identifies each entry in the SCOP hierarchy, from root to leaves, including entries corresponding to the protein level for which there was no explicit reference before. An sccs is a compact representation of a SCOP domain classification, including only the most relevant levels—for class, fold, superfamily and family. For example, the sccs for the ribosome anti-association factor domain (PDB entry 1g61, chain A) is d.126.1.1, where ‘d’ represents the class, ‘126’ the fold, ‘1’ the superfamily, and the last ‘1’ the family. Also, the associated sunids are 53931 for class, 55908 for fold, 55909 for superfamily, 55910 for family, 55911 for protein, 55912 for species and 41126 for domain. The old SCOP domain identifier, sid d1g61a_, is still valid.
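As an illustration of the four-level sccs layout described above, the following minimal sketch splits an sccs into its class, fold, superfamily and family components. The names `Sccs` and `parse_sccs` are hypothetical and not part of any SCOP distribution, and the sketch assumes that an sccs always has exactly four dot-separated fields, as in the example given in the text.

```python
from typing import NamedTuple


class Sccs(NamedTuple):
    """Components of a SCOP concise classification string (hypothetical helper type)."""
    scop_class: str
    fold: int
    superfamily: int
    family: int


def parse_sccs(text: str) -> Sccs:
    """Split an sccs such as 'd.126.1.1' into its four levels.

    Assumption: exactly four dot-separated fields, the first a class
    letter and the remaining three integers.
    """
    parts = text.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 dot-separated fields, got {len(parts)}: {text!r}")
    scop_class, fold, superfamily, family = parts
    return Sccs(scop_class, int(fold), int(superfamily), int(family))


# The ribosome anti-association factor domain from the text (PDB entry 1g61, chain A).
example = parse_sccs("d.126.1.1")
print(example.scop_class, example.fold, example.superfamily, example.family)
```

Keeping the class as a string and the other levels as integers means that sccs values can be sorted or compared level by level without string-ordering surprises (for instance, fold 9 sorting after fold 126).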
Together, sunid and sccs replace the old classification page numbers (like 1.002.044.001.002.021). These classification page numbers changed with every release, because they reflected the order in which new entries appeared in SCOP, as well as any internal rearrangement of old entries. To avoid links that would randomly point to a completely different fold as soon as SCOP is updated, all pages in the SCOP 1.55 release have been renamed.
The new identifiers provide an unambiguous way to link to a SCOP entry (see http://scop.mrc-lmb.cam.ac.uk/scop/release-notes.html for details) and to refer to SCOP in related research work and in the literature. The old (correct) way of linking to SCOP, using an sid which identifies a domain, together with the desired classification level, remains valid for backward compatibility, but it is not recommended for new releases.
New parseable files
All the information in SCOP, with the exclusion of comments, is available in three easy-to-parse files. Together, they replace and extend the now obsolete dir.dom.scop.txt and dir.lin.scop.txt. Each of these files has a header which includes release, version and copyright information. They fully describe all domains in SCOP and the hierarchy itself, and have been designed in such a way that the likely inclusion of new levels in the current SCOP hierarchy will not break code, provided they are properly parsed. These files are ideal for computer-based large scale analysis, comparison across releases and historical summaries.
One of the files, dir.hie.scop.txt, has no precursor in releases before 1.55. It represents the SCOP hierarchy in terms of sunid. Each entry corresponds to a node in the tree and has two additional fields: the sunid of the parent of that node (i.e. the node one step up in the tree), and the list of sunids for the children of that node (i.e. the nodes one step down in the tree). A second file, dir.cla.scop.txt, contains a description of all domains, their definition and their classification, both in terms of sunid and sccs. The third file, dir.des.scop.txt, contains a description of each node in the hierarchy, including English names for proteins, families, superfamilies, folds and classes.
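The parent and children fields of dir.hie.scop.txt are enough to rebuild the tree and to walk from any node back to the root. The sketch below shows one way this could be done. It assumes a line layout that is not specified in the text: three tab-separated fields (sunid, parent sunid, comma-separated child sunids), with '-' standing for "none" and header lines starting with '#'. The sample lines are a toy excerpt built from the sunids quoted earlier for domain d1g61a_, with a root sunid of 0 assumed for illustration; real nodes have many more children. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Node = Tuple[Optional[int], List[int]]  # (parent sunid, child sunids)


def parse_hierarchy(lines: Iterable[str]) -> Dict[int, Node]:
    """Build a sunid -> (parent, children) map from hierarchy lines.

    Assumed layout: 'sunid<TAB>parent<TAB>child1,child2,...',
    '-' for a missing parent or an empty child list, '#' for header lines.
    """
    nodes: Dict[int, Node] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the release/copyright header
        sunid, parent, children = line.split("\t")
        parent_id = None if parent == "-" else int(parent)
        child_ids = [] if children == "-" else [int(c) for c in children.split(",")]
        nodes[int(sunid)] = (parent_id, child_ids)
    return nodes


def path_to_root(nodes: Dict[int, Node], sunid: int) -> List[int]:
    """Return the sunids from the given node up to the root, inclusive."""
    path = [sunid]
    while nodes[path[-1]][0] is not None:
        path.append(nodes[path[-1]][0])
    return path


# Toy excerpt (assumed format, truncated child lists).
sample = [
    "# hypothetical header line",
    "0\t-\t53931",
    "53931\t0\t55908",
    "55908\t53931\t55909",
    "55909\t55908\t55910",
    "55910\t55909\t55911",
    "55911\t55910\t55912",
    "55912\t55911\t41126",
    "41126\t55912\t-",
]
nodes = parse_hierarchy(sample)
print(path_to_root(nodes, 41126))
```

Because the walk follows parent links until none remains, rather than counting a fixed number of levels, code written this way should keep working if new classification levels are later inserted into the hierarchy.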
Since the order in which entries appear within the same level is meaningful in SCOP (it is not uncommon for comments to group together some of the superfamilies, as in the TIM β/α-barrel fold, for example), this order is preserved in both dir.hie.scop.txt and dir.cla.scop.txt.
If a SCOP domain includes portions from different PDB chains which come from a single chain precursor, these are listed in the order in which they appear in the original single protein sequence. A new set of SCOP sequences corresponding to these ‘genetic’ domains is now available as part of the ASTRAL compendium (5), together with a manually curated mapping between the SEQRES and ATOM fields in PDB, which uniquely defines a SCOP domain in terms of PDB coordinates.
Stable identifiers and standard reference data sets should make comparison, linking and integration of SCOP-based or related results a trivial task. The purpose is to help develop a common language that can be used without ambiguity when referring to a domain or its classification, and to avoid duplication of effort, so that energy can be applied to further progress that builds upon solid, well-tested blocks.
New links and an improved search engine
Information in SCOP is interactively accessible as a set of HTML pages and through a search engine at http://scop.mrc-lmb.cam.ac.uk/scop or one of several mirrors scattered around the world. Previous SCOP releases, starting with SCOP 1.48, are also available online at the home SCOP site at MRC (http://scop.mrc-lmb.cam.ac.uk/scop-x.xx, where ‘x.xx’ is the release number).
The HTML pages reflect the underlying SCOP hierarchy in an easy-to-navigate way, and include pictures of SCOP domains as well as other useful links and information. The new identifiers are visible by positioning the mouse on links. All SCOP pages have been renamed in release 1.55.
A new set of links to external resources has been added at the level of SCOP domains. For each domain in the first seven classes, there are links to supplementary information related to that domain in Pfam (9), SUPERFAMILY (6), PartsList (10) and, where one or more sequences are predicted to have that fold, PRESAGE (11). Links to Pfam provide alignments to homologs from sequence databases for most of the SCOP domains. SUPERFAMILY is a collection of Hidden Markov Models (HMMs) (12) for superfamilies in SCOP, and of HMM-based genome assignments to SCOP superfamilies. PartsList adds genomic, functional and structural information to most of the SCOP entries. PRESAGE is a collaborative resource for structural genomics with a collection of protein annotations reflecting current experimental status, structural assignment models and predicted folds.
Linking to SCOP from external sources is now straightforward. The same mechanism used to link can also be used to search SCOP (see http://scop.mrc-lmb.cam.ac.uk/scop/release-notes.html for details). In addition, the standard keyword search now accepts sunids, (possibly right-truncated) sccss and EC numbers, as well as words that appear in any of the SCOP pages, PDB identifiers and SCOP sids. It also accepts ASTRAL identifiers, including those for the new genetic domain sequences (5).
The keyword search allows for right truncation (‘+’ at the right end of a word) and multiple keys, which can be combined with ‘+’ (and) and ‘–’ (and not) word-prefix operators. The simplest search form, ‘casp4’, returns all the pages with ‘CASP4’ appearing in the text; ‘yeast’ returns the list of pages containing the word ‘yeast’; ‘yeast –saccharom+’, the list of pages in which the word ‘yeast’ appears, but not any completion of ‘saccharom’; ‘yeast +saccharom+’, the list of yeast proteins restricted to Saccharomyces cerevisiae; and ‘yeast +saccharom+ +elongation’, a list further restricted to elongation factor domains from Saccharomyces cerevisiae. Similarly, ‘hypoth+’ returns a list of hypothetical proteins, and ‘fivefold’ the list of pages in SCOP corresponding to the five-fold symmetry pentein fold.
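The query semantics described above (case-insensitive words, right truncation with a trailing ‘+’, and the ‘+’ and ‘–’ prefix operators) can be modelled in a few lines. The sketch below is not the SCOP search engine, whose implementation is not described here; it is a toy matcher over plain page text, written under the assumptions that an unprefixed key is required just like a ‘+’ key, and that an ASCII hyphen is accepted as an alternative to ‘–’. The function names and the sample pages are hypothetical.

```python
import re
from typing import List, Tuple

Term = Tuple[str, bool, bool]  # (word, right-truncated?, excluded?)


def parse_query(query: str) -> List[Term]:
    """Split a query into (word, truncated, excluded) terms."""
    terms: List[Term] = []
    for token in query.split():
        excluded = token[0] in "-–"       # 'and not' prefix (hyphen or en dash)
        if token[0] in "+-–":
            token = token[1:]             # drop the prefix operator
        truncated = token.endswith("+")   # trailing '+' means right truncation
        if truncated:
            token = token[:-1]
        if token:
            terms.append((token.lower(), truncated, excluded))
    return terms


def matches(page_text: str, query: str) -> bool:
    """Return True if the page satisfies every term of the query."""
    words = set(re.findall(r"\w+", page_text.lower()))
    for term, truncated, excluded in parse_query(query):
        if truncated:
            found = any(word.startswith(term) for word in words)
        else:
            found = term in words
        if found == excluded:  # required term missing, or excluded term present
            return False
    return True


# Hypothetical page texts, for illustration only.
pages = {
    "a": "Elongation factor from baker's yeast (Saccharomyces cerevisiae)",
    "b": "Fission yeast (Schizosaccharomyces pombe) kinase",
    "c": "Hypothetical protein from Escherichia coli",
}
for query in ["yeast", "yeast –saccharom+", "yeast +saccharom+ +elongation", "hypoth+"]:
    print(query, "->", sorted(name for name, text in pages.items() if matches(text, query)))
```

Note that right truncation only completes a word to the right: in this sketch ‘saccharom+’ matches ‘Saccharomyces’ but not ‘Schizosaccharomyces’, which is why page "b" survives the ‘yeast –saccharom+’ query.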