We investigated the impact of structural genomics on the rates of discovery and growth of protein families and superfamilies. Using the new update procedure, we collected and classified the structures determined by worldwide Structural Genomics initiatives and related structures from structural biology projects. In our analysis, we focused on the two central levels of the SCOP hierarchy, Family and Superfamily, which specify the probable near and far evolutionary relationships, respectively. Because the SCOP hierarchy is obligatory, each protein structure is assigned to a family and a superfamily, regardless of whether a related structure exists at the corresponding level. A family that contains at least one other protein is a ‘true’ family; otherwise it is a ‘singleton’ family. Similarly, a ‘true’ superfamily must contain two or more families, whereas a ‘singleton’ superfamily consists of a single family.
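The ‘true’/‘singleton’ distinction defined above can be illustrated with a minimal sketch. The function name `classify`, the tuple layout and the toy identifiers are hypothetical and serve only as an illustration; they are not part of SCOP or of its update procedure.

```python
from collections import defaultdict


def classify(assignments):
    """Label families and superfamilies as 'true' or 'singleton'.

    assignments: iterable of (protein, family, superfamily) tuples.
    This layout is an assumption made for illustration only.
    """
    proteins_in_family = defaultdict(set)
    families_in_superfamily = defaultdict(set)
    for protein, family, superfamily in assignments:
        proteins_in_family[family].add(protein)
        families_in_superfamily[superfamily].add(family)

    # A 'true' family holds two or more proteins.
    family_status = {
        fam: "true" if len(prots) >= 2 else "singleton"
        for fam, prots in proteins_in_family.items()
    }
    # A 'true' superfamily holds two or more families.
    superfamily_status = {
        sf: "true" if len(fams) >= 2 else "singleton"
        for sf, fams in families_in_superfamily.items()
    }
    return family_status, superfamily_status


# Toy data with made-up identifiers.
toy = [
    ("protA", "fam1", "sf1"),
    ("protB", "fam1", "sf1"),
    ("protC", "fam2", "sf1"),
    ("protD", "fam3", "sf2"),
]
family_status, superfamily_status = classify(toy)
```

In this toy example, `fam2` is a singleton family that nevertheless belongs to a ‘true’ superfamily (`sf1`), whereas `fam3` is the only member of the singleton superfamily `sf2`.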
There are classification caveats for functionally uncharacterized proteins. Those with similar sequences are usually grouped at the Family level but, once their functions are established, they may be reclassified at the lower, Protein level. In addition, the lack of functional evidence hinders the discovery of Superfamily relationships for such protein families, so that these relationships cannot always be established when new families are created in SCOP. If no such relationship is discovered, a new ‘singleton’ superfamily is created. This classification can be revised during a subsequent database update, when new evidence appears. For example, the hypothetical protein PA1492 (17), initially classified in its own superfamily in release 1.69, was reclassified in the next release, 1.71, in the N-(deoxy)ribosyltransferase-like superfamily. This superfamily also unified the functionally and structurally characterized N-deoxyribosyltransferase and ADP-ribosyl cyclase-like families, whose structural and mechanistic similarities had remained unrecognized for a decade after their structures were determined. PA1492 has a very similar putative active site and is predicted by SCOP to have a deoxyribosyltransferase activity.
Hereafter, we refer to a SCOP family (superfamily) that contains a (domain of a) structural genomics (SG) target as an SG-family (SG-superfamily, respectively). There were 1693 SG-families and 957 SG-superfamilies populated with domains from 4198 SG targets in the PDB (released up to June 2007). For most of these domains, we found relationships to other proteins classified in SCOP at one or both of these two levels. For a small fraction (~5%) of SG targets, we found no close or distant homolog. Only distant homologs were identified for <7% of SG targets. There was a nominal contribution to a small fraction (30%) of SG-families, in which SG domains were neither the first nor the second member of the family and made up <30% of its content. About 30% of the SG-families were singletons. Nearly three-fifths of these singleton families belong to ‘true’ SG-superfamilies, and only 42% were the sole member of their superfamily (Figure 2).
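The ‘nominal contribution’ criterion stated above can be written as a simple check. This is a sketch under stated assumptions: the function name `is_nominal_sg_contribution` is hypothetical, and ordering the family members by release date is our reading of ‘first’ and ‘second’ member rather than a documented rule.

```python
def is_nominal_sg_contribution(members, threshold=0.30):
    """Return True if SG domains contribute only nominally to a family.

    members: list of booleans ordered by release date (an assumption),
             where True marks a domain from an SG target.
    threshold: maximal SG fraction; 0.30 follows the '<30%' in the text.
    """
    if not any(members):
        raise ValueError("not an SG-family: no SG domain present")
    # SG domains are neither the first nor the second member ...
    first_two_are_non_sg = not any(members[:2])
    # ... and account for less than the threshold fraction of the content.
    sg_fraction = sum(members) / len(members)
    return first_two_are_non_sg and sg_fraction < threshold


# One SG domain among four members, released third or later.
late_minor = is_nominal_sg_contribution([False, False, False, True])
# The SG domain was the second member of the family.
early_member = is_nominal_sg_contribution([False, True, False, False, False])
# SG domains arrive late but make up a third of the family.
late_major = is_nominal_sg_contribution([False, False, True])
```

Only the first toy family satisfies both conditions; the other two fail on the membership order and on the 30% threshold, respectively.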
Figure 2. Statistics of SCOP classification of SG targets. (A) Numbers of SG-families and SG-superfamilies by fraction of SG domains in them. (B) Division of SG-families into ‘true’ and ‘singleton’ families, their SG target contents ...
Nearly half of the SG domains, at the time of their release, represented a new SCOP family. We observed a general trend, wherein the release of the first representative structure of a family was followed shortly by the release of a related structure determined by a different group of authors. In a few cases, more than one structure had been determined independently for the same protein.
A significant increase in the number of structurally characterized families facilitated the discovery of new relationships at the Superfamily level. More than half of the families in ‘true’ SG-superfamilies contain SG domains. About half of the SG domains defined a new family and hence contributed to the discovery of a new relationship. Analysis of the distribution of protein families characterized by structural genomics has confirmed the dominant role of the largest known superfamilies, which have grown further in the number of their constituent families. Other superfamilies have grown large rather unexpectedly. The evolutionary success of these ‘new rich’ superfamilies is probably due to the presence of unusual conserved and, presumably, functionally important features in their folds. Interestingly, the first several structures of ‘metagenomic’ targets from environmental samples, selected as representatives of novel sequence families (18), all belong to these ‘new rich’ superfamilies, mostly to the dimeric α + β barrel superfamily in SCOP.
Selected SG-superfamilies largely populated with SG-families