This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Superfamily and family analyses provide an effective tool for the functional classification of proteins, but must be automated for use on large datasets. We describe a 'gold standard' set of enzyme superfamilies, clustered according to specific sequence, structure, and functional criteria, for use in the validation of family and superfamily clustering methods. The gold standard set represents four fold classes and differing clustering difficulties, and includes five superfamilies, 91 families, 4,887 sequences and 282 structures.
With large volumes of sequence and structural data now available, functional characterization of proteins has become the rate-limiting step in putting biological information to practical use. Large-scale functional annotation efforts have focused on automated strategies, as more traditional methods, such as experimental characterization of gene function and manually curated analysis of gene sequence and structure, can only be used efficiently on small subsets of the available data.
While this scale-up of the analysis process is required to handle the sheer volume of new information, automated analysis strategies possess inherent and serious limitations. For example, simple pairwise comparisons have been shown to be inadequate for functional classification of proteins with less than 30% to 40% identity [1-3]. Utilizing information from multiple related sequences, especially via probabilistic methods such as sequence profiles or hidden Markov models [4-6], the number of true evolutionary relationships found between proteins with less than 30% identity can be tripled [1,3]. Unfortunately, even when true homologous relationships are detected, direct transfer of functional annotation is not often possible at low levels of sequence identity [2,7-9].
Even when direct transfer of the full functional annotation is not possible, evolutionarily related proteins usually share some functional relationship. To determine what this relationship is, we must start by examining the type of evolutionary linkage between the proteins. Here we have concentrated on enzymes because they have a well-defined biochemical function - the catalysis of a particular reaction.
Horowitz suggested that ligand binding is the dominant constraint guiding enzyme evolution [10,11]. According to his theory, biochemical pathways evolved backwards. When the substrate for the final enzyme in the pathway was depleted, a new enzyme evolved from this enzyme, via gene duplication and divergence, to produce the needed substrate from an available precursor. While the reaction mechanism of the new enzyme was allowed to drift away from that of the original enzyme, the ability to bind the common substrate/product was retained. Although this theory appears to apply to some groups of enzymes, for example HisA/HisF in the histidine biosynthesis pathway and TrpF/TrpC in the tryptophan biosynthesis pathway, it does not appear to be the dominant mechanism governing enzyme evolution. Furthermore, the model typically applies only to pairs of divergent enzymes.
Chemistry-driven evolution [14-16], an alternative theory that appears to apply to a substantial proportion of enzymes, identifies a chemical step or capability as the dominant constraint guiding enzyme evolution. According to this model, a newly evolved enzyme retains a fundamental chemical capability of its progenitor. The newly evolved enzyme may catalyze a reaction similar to that of its progenitor with only an altered substrate specificity, or it may catalyze a quite different overall reaction while still retaining some chemical capability common to its progenitor.
A group of related enzymes that share a common chemical capability mediated by conserved catalytic elements but catalyze different overall reactions has been termed a mechanistically diverse superfamily. A mechanistically diverse superfamily can be subdivided into families, where a family is defined as a group of related enzymes whose members catalyze the same overall reaction via conserved catalytic elements. Each of these mechanistically diverse superfamilies may contain hundreds or even thousands of proteins, representing many different overall functions and utilizing a wide range of substrates.
Mechanistically diverse superfamilies pose an especially difficult problem for automated functional classification methods due to the complexity of their underlying biology. For example, a newly sequenced superfamily member may not catalyze the same overall reaction as its closest relative in the superfamily, but may instead be related to other superfamily members by a more subtle conserved chemical capability. If the superfamily itself has not been characterized, the conserved chemical capability may not be immediately obvious. It is thus useful to subdivide a superfamily into families containing enzymes that catalyze the same overall reaction.
Sequence and structural similarity alone cannot be used to cluster sequences into families because different families evolve at different rates (M.E. Glasner, R.A. Chiang, N. Fayazmanesh, M.P. Jacobson, J.A.G., P.C.B., unpublished data; J.L.S., L.P. Wackett, P.C.B., unpublished data). Consequently, the boundaries between different families within a superfamily are uneven in sequence and structure space; in some cases, even very highly similar sequences may perform different reactions. In the mechanistically diverse amidohydrolase superfamily, for example, melamine deaminase and atrazine chlorohydrolase share 98% sequence identity, but catalyze different reactions.
Likewise, functional information alone cannot be used to cluster proteins into superfamilies and families, due to convergent evolution, in which nature has evolved more than one structural strategy to perform a given chemical reaction [19-21]. For example, George et al. found that 69% of the functions described by three digit EC numbers are found in multiple Structural Classification of Proteins database (SCOP) superfamilies, suggesting, at least for some of these, independent evolutionary origins. Further, some functions are found in multiple SCOP fold classes, providing further evidence that they have evolved via convergent evolution [20,21]. Thus, although enzymes in these groups catalyze the same overall reaction, they likely utilize different mechanisms.
Even within a single superfamily, the same function may have evolved more than once. For example, the ability to hydrolyze an organophosphate appears to have evolved on at least two separate occasions within the common lineage of the amidohydrolase superfamily (J.L.S., L.P. Wackett, P.C.B., unpublished data). The distinct evolutionary origins of the aryldialkylphosphatase family and the phosphotriesterase family are reflected in an extremely low overall sequence identity between the two families and by subtle differences in the constellation of active site residues used to catalyze the common reaction.
To address these issues and provide a useful test set for benchmarking and development of tools for functional inference, we have constructed a new gold standard set of mechanistically diverse enzyme superfamilies. Most importantly, these proteins are clustered according to rigorous and systematic definitions of family and superfamily. Because these definitions map specific elements of protein sequence and structure to specific elements of function, gold standard families and superfamilies are especially useful for developing tools for elucidation of function of uncharacterized members. Moreover, because they represent related proteins whose functions have diverged, sometimes substantially, they may serve as a challenging test set for automated superfamily clustering methods based on either sequence or structure. To further enhance the utility of the gold standard set as a test set for evaluation of automated superfamily clustering methodologies, evidence codes, based on those developed by the Gene Ontology consortium, are provided for all functional assignments.
As of August 2005, our five gold standard superfamilies include four distinct fold classes and contain a total of 91 families, 4,887 sequences and 282 structures (Table 1). For the purposes of this paper, we have defined two different types of families. Gold standard families contain only sequences with either experimentally determined functions or sequences that are highly similar to them, that is, show highly significant BLAST e-values (≤ 1 × 10^-175) to experimentally characterized sequences. In addition, each of the sequences in a gold standard family is required to conserve all family-specific catalytic residues identified from the literature. Silver standard families contain all the sequences from the corresponding gold standard family, but may also contain additional sequences that have not been experimentally characterized, show an e-value between 1 × 10^-20 and 1 × 10^-175 to a characterized family member, and meet other relevant criteria (see Materials and methods).
Table 2 gives a detailed view of the gold and silver standard families that make up each superfamily. As shown in this table, these families catalyze a wide variety of reactions, spanning five of the six EC classes. The superfamily sequence sets represent different diversity levels, as described further in the Discussion. All of the gold standard superfamilies have been rigorously studied, and their structure-function relationships extensively interpreted, providing detailed information, including reaction mechanisms, superfamily-specific catalytic residues, and family-specific catalytic residues (see J.L.S., L.P. Wackett, P.C.B., unpublished data, and [25-36] and references therein, for reviews and general descriptions of these superfamilies). We have compiled this information (as well as information on additional superfamilies) into a publicly available database that explicitly links enzyme sequence, structure and function in the manner described above [37-39]. (Structure-Function Linkage Database (SFLD) superfamilies correspond to gold standard superfamilies in this paper. SFLD families correspond to the silver standard families in this paper.)
We compared the family and superfamily classifications of the sequences in all five of our superfamilies to those of the Protein Families database (Pfam) (families only), SCOP (families and superfamilies) and SUPERFAMILY (a set of hidden Markov models based on SCOP superfamilies) databases. Additional data file 1 shows the difference between our family and superfamily classifications and those of Pfam, SCOP and SUPERFAMILY, for each individual sequence in our five superfamilies.
The main difference between our family classifications and those of Pfam and SCOP is their coverage of function space. As shown in Table 3, our gold and silver standard families include only sequences that catalyze a single overall reaction. Although some SCOP and Pfam families (for example, the enolase family) correspond to this level of functional similarity, Table 3 shows that most are broader, principally because these classification systems rely mainly on overall sequence and structural similarities rather than on the finer granularity analysis focused on the subsets of catalytic residues that distinguish enzymes that perform a specific catalytic reaction. For example, the Pfam MR_MLE_N and MR_MLE families include enzymes that catalyze at least seven different overall reactions. This difference is illustrated graphically in Figure 1.
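The breadth of an external family can be made concrete by counting how many distinct overall reactions its member sequences catalyze. The following is a minimal sketch of that tally, not the procedure used to build Table 3; the function name `reactions_per_group` and the toy records are hypothetical and carry no data from this paper.

```python
from collections import defaultdict


def reactions_per_group(assignments):
    """Map each external family name to the set of overall reactions it covers.

    `assignments` is an iterable of (external_family, overall_reaction) pairs,
    one pair per sequence.
    """
    groups = defaultdict(set)
    for family, reaction in assignments:
        groups[family].add(reaction)
    return dict(groups)


# Hypothetical toy records, invented for illustration only.
toy = [
    ("family_A", "reaction_1"),
    ("family_A", "reaction_2"),
    ("family_A", "reaction_1"),
    ("family_B", "reaction_3"),
]

# Number of distinct overall reactions found in each external family.
counts = {fam: len(rxns) for fam, rxns in reactions_per_group(toy).items()}
print(counts)
```

A family with a count of one matches the single-reaction granularity of the gold and silver standard families; a larger count indicates a broader grouping.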
Figure 1 also shows that some of the enzymes in our gold standard enolase superfamily are classified into the Pfam IMPDH family, which contains inosine monophosphate dehydrogenases, among other enzymes. Although the members of the IMPDH family share the (β/α)₈ (TIM) barrel fold common to enolase superfamily members, they do not have the amino-terminal domain found in all enolase superfamily members, nor do they use a similar set of catalytic residues to perform their functions. Thus, we believe that classification of any enolase superfamily members into the Pfam IMPDH family is incorrect.
Superfamily classifications for four of our five gold standard superfamilies (amidohydrolase, enolase, haloacid dehalogenase, and vicinal oxygen chelate) correspond to the analogous SCOP and SUPERFAMILY superfamily designations. In contrast, the gold standard crotonase superfamily is only a subset of the corresponding Clp/crotonase superfamily in SCOP and SUPERFAMILY. The SCOP Crotonase-like family contains enzymes corresponding to the gold standard crotonase superfamily, while the remaining families listed in the SCOP Clp/crotonase superfamily contain enzymes that may be evolutionarily related to gold standard crotonase superfamily members, but do not have an established mechanistic linkage [42,43]. Again, because there is no explicit indication of the functional similarity contained within a SCOP or SUPERFAMILY superfamily, it is difficult to use these classifications to make functional inferences regarding uncharacterized proteins.
The five gold standard superfamilies contain enzymes exhibiting varying levels of sequence diversity. On one end of the spectrum, the enolase and crotonase superfamilies contain a rather discrete set of sequences, such that most of their constituent families exhibit statistically significant levels of sequence similarity to other superfamily members. On the other end of the spectrum are the haloacid dehalogenase superfamily and some branches of the amidohydrolase superfamily, which contain the most diverse sets of sequences, including a high proportion of outlier sequences that have only low levels of sequence identity to their closest superfamily relative(s). Because it provides a set of superfamilies with a range of sequence diversity, the gold standard set is a useful (and challenging) test set for automated methods designed to collect and cluster sequences by function.
The superfamilies in the gold standard set are not the only mechanistically diverse superfamilies found in nature. Additional mechanistically diverse superfamilies are described in the SFLD and in other work, and many more uncharacterized superfamilies are likely to exist. Although no current research provides an adequate count of mechanistically diverse superfamilies, some rough estimates can be made. For example, of the 339 superfamilies listed in the SCOPEC database, 49% contain two or more families with differences in EC number at all four positions. This suggests, for the enzyme superfamilies that have been catalogued in SCOPEC, a rough upper bound on the possible number of mechanistically diverse superfamilies that include at least two different overall reactions. But because the identification of a mechanistically diverse superfamily requires an understanding of the underlying mechanism of the member enzymes, it is difficult to estimate the total number of such superfamilies found in nature. The gold standard superfamilies described in this work represent the best characterized subset of mechanistically diverse superfamilies for which we have a large amount of functional and mechanistic information and that have thus far been added to our SFLD.
Pfam, SCOP, and other similar databases have become the standards by which new tools for functional and evolutionary classification of protein sequences are validated [44-47]. (Additional test sets, such as BAliBASE and SABmark, are designed to evaluate new sequence alignment methods rather than superfamily or family clustering algorithms.) We compare our family and superfamily classifications to those found in Pfam, SCOP, and SUPERFAMILY (a set of hidden Markov models based on SCOP superfamilies) to demonstrate the unique properties of our classifications compared to these standards.
The SCOP database classifies all proteins into structural domains. Pfam also uses structural information, where available, to ensure that families correspond to a single structural domain. In contrast, we have used both structure and function-based definitions to divide proteins into their component domains. For example, SCOP and Pfam divide the enzymes in the enolase superfamily into amino-terminal and carboxy-terminal structural domains. However, because the amino- and carboxy-terminal structural domains are both required for functionality, we have kept these sequences as a single functional domain.
In keeping with our function-based domain definition, when a protein contains two or more distinct active sites, we subdivide the protein into separate functional domains, each containing a single active site, if they occur as separate proteins in other species. These functional domains are then classified by family and superfamily.
Specific molecular function - defined here as the overall reaction catalyzed by an enzyme - is often not conserved across a group of related enzymes, particularly in mechanistically diverse enzyme superfamilies. Although early studies suggested that above 40% identity all four digits of an EC number (which specifies a single overall reaction) are conserved between enzyme-enzyme pairs, later studies that correct for database bias have challenged these conclusions. Burkhard Rost, for example, reports that less than 30% of enzyme-enzyme pairs above 50% identity have entirely identical EC numbers, and Tian and Skolnick report that pairwise sequence identity of at least 60% is required to transfer all four digits of an EC number with 90% accuracy. Thus, it is not surprising that most of the SCOP and Pfam families corresponding to our gold standard superfamilies contain enzymes that catalyze more than one overall reaction (Table 3 and Figure 1).
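Conservation of "all four digits" of an EC number can be stated operationally as the number of leading EC fields two enzymes share. The sketch below is an illustration only; the helper name `shared_ec_levels` is hypothetical, and the EC strings are used purely as example inputs.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count the leading EC fields (0 to 4) that two EC numbers have in common.

    Comparison stops at the first differing field, and also at an
    unassigned field written as '-', which cannot be said to match.
    """
    shared = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b or field_a == "-":
            break
        shared += 1
    return shared


# Example inputs: identical numbers, same class only, and different classes.
print(shared_ec_levels("4.2.1.11", "4.2.1.11"))
print(shared_ec_levels("5.1.2.2", "5.5.1.1"))
print(shared_ec_levels("4.2.1.11", "5.1.2.2"))
```

Under this measure, two members of the same gold standard family would be expected to share all four fields, whereas members of different families within one superfamily may share none.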
But while specific molecular function may not be conserved across a group of related enzymes, some aspect of molecular function is often conserved. For example, Tian and Skolnick report that pairwise sequence identity of only 40% is sufficient to transfer the first three digits of an EC number with 90% accuracy. Furthermore, because the EC system was not designed to capture mechanistic information about the reaction in question, enzyme-enzyme pairs with completely different EC numbers may still share some aspect of function.
Our gold standard superfamilies have been designed with exactly this type of functional similarity in mind. Not only are enzymes in a gold standard superfamily thought to be evolutionarily related based on sequence and structural criteria, they also share a set of catalytic residues thought to be responsible for a common chemical capability. This common capability may be a mechanistic step (for example, abstraction of a proton alpha to a carboxylic acid to form an enolate anion intermediate in the enolase superfamily), or a structural strategy for stabilizing a common intermediate (for example, use of an oxyanion hole to stabilize an enolate anion intermediate derived from the acyl-CoA ester derivatized compounds that are substrates in the crotonase superfamily). In each superfamily, the cognate chemical capability is mapped to specific amino acids, thus allowing uncharacterized proteins identified as candidate superfamily members to be evaluated for their ability to perform the superfamily-specific chemistry based on the presence or absence of this amino acid signature.
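Checking a candidate for a superfamily-specific amino acid signature amounts to looking up fixed alignment columns and testing the residues found there. The following is a minimal sketch under stated assumptions: the candidate has already been aligned to the superfamily, columns are 0-based, and the signature shown is invented for illustration rather than taken from any superfamily in this paper.

```python
def has_signature(aligned_seq, signature):
    """Return True if every signature column holds an allowed residue.

    `aligned_seq` is one row of a multiple sequence alignment (gaps as '-').
    `signature` maps 0-based alignment columns to the set of allowed residues.
    """
    return all(
        col < len(aligned_seq) and aligned_seq[col] in allowed
        for col, allowed in signature.items()
    )


# Hypothetical signature: columns and residues are invented for illustration.
signature = {2: {"K"}, 5: {"D", "E"}, 9: {"H"}}

candidate = "MAKLVDTGSHQ"   # carries K, D and H at the signature columns
gapped = "MAKLV-TGSHQ"      # has a gap where the acidic residue is expected

print(has_signature(candidate, signature))
print(has_signature(gapped, signature))
```

A candidate that passes such a check is only a plausible superfamily member; the check says nothing about which overall reaction it catalyzes.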
The division of gold standard superfamilies into families again utilizes sequence, structure and functional information. Not only do the enzymes in a family form a more closely related subset, based on their sequences and structures, compared to the rest of the superfamily, they are also thought to catalyze a single overall reaction. Because the overall reaction has been mapped to a common set of catalytic amino acids shared by all family members, uncharacterized proteins can be evaluated for their ability to perform the family-specific reaction based both on overall sequence or structural similarity to family members and on the presence of the active site motif. These family-specific motifs can be used as part of a system to differentiate families within a given superfamily, as many of the family-specific motifs contain family-specific residues in addition to the superfamily-specific catalytic residues. (In fact, a recent study has demonstrated the importance of using catalytic residue information to identify proteins that are functionally related, showing that the inclusion of such information improves the accuracy of annotation transfer, especially between distantly related proteins.)
In contrast, the level of functional similarity required to classify a sequence according to SCOP, SUPERFAMILY, or Pfam is not uniform. While some SCOP and Pfam families consist of enzymes that catalyze the same overall reaction, many encompass enzymes catalyzing several reactions (Table 3 and Figure 1). Likewise, the level of functional similarity shared between enzymes in a SCOP or SUPERFAMILY superfamily is not uniform (see Results). Because there is no specific indication of the level of functional similarity shared by sequences in a SCOP, SUPERFAMILY, or Pfam grouping and no mapping of conserved functional elements to conserved sequence or structural elements, there is no simple and systematic way to use these classifications to infer the specific molecular function of an uncharacterized enzyme. Additional family and superfamily classifications [51-54], as well as automated methods designed to cluster proteins into superfamilies and families [41,45,47], suffer from similar problems. These databases and methods are valuable resources, but they may not be the right tools to use for all purposes. In particular, when functional classification of divergent enzymes is a goal, our gold standard families and superfamilies may serve as a more appropriate test set.
In the development of the gold standard set, we encountered several difficulties in attempting to classify sequences that belong to mechanistically diverse superfamilies into their constituent families. These difficulties largely arise from the complexity of the underlying biology, where the boundaries between different families within a superfamily may be uneven due to different evolutionary rates within each family, and, due to a number of reasons, some enzymes may not fit into the simple family classification at all.
For example, although the gold and silver families provided here represent a large number of different reactions evolved along each superfamily lineage, these proteins by no means represent all sequences that can be included in the associated superfamilies. Because annotation transfer for distantly related sequences in mechanistically diverse superfamilies is not trivial, we have not included sequences in either the gold or silver standard family sets unless they meet the stringent criteria defined in the Materials and methods section. Thus, Figure 1 shows that some of the enzymes in our gold standard superfamilies have not been assigned to a family (gray areas on the inner rings), even though we can confidently assign them to a superfamily based on their overall sequence or structural similarities and the conservation of active site residues associated with the canonical superfamily partial reaction or chemical capability. In some cases, this incomplete classification is due to the fact that the family-specific overall reaction has not yet been identified. In other cases, while there may be some evidence to suggest that the enzyme in question belongs to one of the existing families, it is so distantly related in sequence that it cannot be confidently assigned to the family without additional data such as further mechanistic characterization or tertiary structural information. As a result, sequences that fall into the gray areas of the inner rings in Figure 1 are not included in the gold or silver family sets. It is not uncommon for half the enzymes in a gold standard superfamily to lack a family assignment.
Even when our stringent criteria for family classification are used, we cannot be absolutely certain that enzymes that have not been experimentally characterized are correctly classified. For example, the enzymes melamine deaminase and atrazine chlorohydrolase from Pseudomonas are 98% identical, but catalyze different overall reactions within the amidohydrolase superfamily. The two enzymes are classified into separate families within our gold standard set; however, if experimental data had not been available to distinguish the two functions of these highly similar enzymes, we would likely have classified both enzymes into the same family due to their high sequence identity and conservation of known catalytic residues. Although such a high degree of sequence similarity coupled with functional divergence is not common [2,7,8], it is certainly possible that other such examples could exist in our gold standard set. Family boundaries are thus expected to change slightly as additional experimental information becomes available. Updated versions of our gold and silver standard families will, therefore, be made available on the SFLD website as new information warrants.
An additional difficulty for the subclassification of superfamily enzymes into families is the somewhat arbitrary assumption we make that all enzymes in a given family catalyze a single biologically relevant overall reaction. In reality, some enzymes may have evolved to be nonspecific, for example, the cytochrome P450s, which are involved in the metabolism of a wide variety of endogenous and exogenous toxins. In addition to this rather extreme example, many enzymes can turn over multiple related substrates at varying levels of proficiency. In some cases, such promiscuity is biologically relevant, while in other cases, it may only be seen in vitro. In either case, this complicates the family classification process. For example, the extradiol dioxygenase enzymes within the vicinal oxygen chelate superfamily are difficult to subclassify into families because they are similar in sequence and utilize a common set of active site residues due to their similar chemistry. Further complicating this is the fact that many of these enzymes have been shown to catalyze the extradiol cleavage of several related substrates, and it is not always clear which substrate is biologically relevant. We have noted those families that are especially difficult to classify in the footnotes to Additional data files 1 and 2.
Despite such complications, in many cases we can find clear boundaries between functionally distinct families. In these cases, subclassification of a superfamily into families facilitates the process of making functional inferences about uncharacterized proteins.
We have described a gold standard set of proteins, clustered according to systematic and consistent definitions of family and superfamily. Because these definitions map specific elements of protein sequence and structure to specific elements of function, gold standard families and superfamilies are optimized for use in elucidation of the function of uncharacterized members, and serve as a new type of test set for automated superfamily clustering methods. The opportunities this test set provides to aid in detailed validation of such clustering methods will contribute to advances in automated annotation of newly sequenced genomes.
We define a mechanistically diverse enzyme superfamily as a group of homologous enzymes that catalyze different overall reactions via a common mechanistic attribute that requires conserved catalytic elements. We define a family as a subset of a superfamily where all enzymes catalyze the same overall reaction via the same mechanism.
Prior to addition of a superfamily to our gold standard set, we ensure that the following conditions are met. Firstly, crystal structures for proteins from at least two different families within the superfamily are available. Secondly, sufficient mechanistic information for proteins from at least two different families within the superfamily is available, thus allowing the common partial reaction or chemical capability to be identified. Thirdly, experimental evidence regarding the identity of catalytic residues involved in the conserved partial reaction or chemical capability is available for sequences in at least two different families.
We roughly based our sequence collection protocol on that outlined by Todd et al., but used our own superfamily definitions, rather than those contained in the CATH database, to guide superfamily creation. For each family within a superfamily, we chose a sequence that had been shown experimentally to catalyze the family-specific reaction to serve as a query for PSI-BLAST. Each PSI-BLAST analysis was performed against the National Center for Biotechnology Information nonredundant protein database at an expectation value cutoff of 5 × 10^-4 for 20 rounds or until convergence. All PSI-BLAST hits that aligned over at least 80% of the length of the query sequence were added to the superfamily of the query sequence.
Sequences collected via the automated protocol were inspected to verify superfamily membership by examining multiple sequence alignments for the presence of known catalytic residues and other superfamily-specific sequence motifs.
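The 80% length criterion applied to PSI-BLAST hits can be sketched as a simple coverage filter. This is an illustration under stated assumptions, not the original pipeline: each hit is treated as a single aligned span on the query, coverage is measured on the query coordinates, and the `Hit` record, function names and toy coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def passes_coverage(hit, query_length, min_fraction=0.80):
    """True if the aligned span covers at least `min_fraction` of the query."""
    aligned = hit.query_end - hit.query_start + 1
    return aligned / query_length >= min_fraction


def collect_members(hits, query_length):
    """Return the identifiers of hits that satisfy the coverage criterion."""
    return [h.subject_id for h in hits if passes_coverage(h, query_length)]


# Hypothetical hits against a 400-residue query.
hits = [
    Hit("seq1", 1, 400),    # full-length alignment
    Hit("seq2", 50, 369),   # 320 of 400 residues, exactly 80%
    Hit("seq3", 100, 300),  # 201 of 400 residues, about 50%
]
members = collect_members(hits, query_length=400)
print(members)
```

A coverage filter of this kind helps exclude hits that match only one domain of a multidomain query, which is consistent with the single functional domain definition used in this work.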
Superfamily sequences were classified into families according to a two-step procedure. First, sequences were roughly clustered based on sequence similarity. Functional information from the literature was then used to refine family clusters.
Two types of family clusters were constructed, at different levels of stringency. Gold standard families contain sequences with experimentally determined functions (see below) and sequences that show highly significant BLAST e-values (≤ 1 × 10^-175) to experimentally characterized sequences. In addition, each of the sequences in a gold standard family is required to conserve all family-specific catalytic residues identified from the literature. Silver standard families contain all the sequences from the corresponding gold standard family, but may also contain additional sequences that have not been experimentally characterized and show an e-value between 1 × 10^-20 and 1 × 10^-175 to a characterized family member. (In most cases, the e-value is much more significant than 1 × 10^-20.) These additional sequences do, however, conserve all family-specific catalytic residues identified in the literature, and curators have used other information, such as examination of the sequences in the context of a family alignment and examination of operon context, to increase the confidence of these assignments.
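The two stringency levels can be summarized as a small decision rule. The sketch below is a simplification under stated assumptions: the manual curation step is reduced to a single hypothetical boolean flag, `curator_approved`, and the function name and return labels are invented for illustration. Because every gold standard sequence also belongs to the silver standard family, the label "silver" here marks sequences that are in the silver set only.

```python
GOLD_EVALUE = 1e-175   # BLAST e-value threshold for gold standard membership
SILVER_EVALUE = 1e-20  # least significant e-value allowed for silver membership


def assign_tier(characterized, best_evalue, conserves_residues,
                curator_approved=False):
    """Return 'gold', 'silver' (silver set only) or None for one sequence.

    `best_evalue` is the e-value to the closest experimentally
    characterized family member (use 0.0 for a characterized sequence).
    """
    if not conserves_residues:
        return None  # all family-specific catalytic residues are required
    if characterized or best_evalue <= GOLD_EVALUE:
        return "gold"
    if best_evalue <= SILVER_EVALUE and curator_approved:
        return "silver"
    return None


print(assign_tier(True, 0.0, True))
print(assign_tier(False, 1e-180, True))
print(assign_tier(False, 1e-60, True, curator_approved=True))
print(assign_tier(False, 1e-5, True, curator_approved=True))
```

Sequences for which the rule returns None may still belong to the superfamily; they simply remain without a family assignment.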
For the purposes of family classification, enzymes with experimentally characterized function include enzymes that have been shown through a direct assay to catalyze a specific reaction or enzymes whose function has been inferred based on complementation or mutagenesis data. The literature references upon which each family classification was based can be found in Additional data file 5.
We define catalytic residues similarly to Porter et al. We do not include residues that are described in the literature only as being involved in substrate binding, because these residues may not be as well conserved across a family as residues that play a more direct role in the catalytic mechanism of the enzyme (M.E. Glasner, R.A. Chiang, N. Fayazmanesh, M.P. Jacobson, J.A.G., P.C.B., unpublished data).
Following the criteria described above, family-specific catalytic residues were identified based on experimental data from the literature, including mutagenesis and X-ray crystallography data. When the literature contained catalytic residue information for multiple enzymes within a single family, the information was pooled and applied to the entire family. In some cases, experimental information regarding catalytic residues was not available for a given family, but catalytic residues could be inferred based on sequence similarity to related families, at least for the subset of catalytic residues involved in the partial reaction or chemical capability conserved across the superfamily. Superfamily-specific catalytic residues were identified by taking the subset of family-specific catalytic residues conserved across all enzymes in a superfamily that are involved in the partial reaction or chemical capability common to the superfamily. Generally, this approach has been validated for all of the superfamilies represented in this work, including homologous sequences in families for which no structures were yet available when these relationships were initially predicted. In several of these latter cases, experimentally determined structures have validated those inferences (see [15,56,57] for examples).
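The conservation part of this definition is a set intersection over families. The following is a minimal sketch that captures only that step; deciding whether a conserved residue actually participates in the common partial reaction still requires mechanistic evidence. The function name and the residue labels, which stand for positions in a shared superfamily alignment, are hypothetical.

```python
def superfamily_residues(family_residues):
    """Intersect family-specific catalytic positions across all families.

    `family_residues` maps a family name to the set of its catalytic
    positions, expressed in a shared superfamily alignment numbering.
    """
    sets = list(family_residues.values())
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical families and residue labels, invented for illustration.
families = {
    "family_1": {"K164", "D195", "E221", "H297"},
    "family_2": {"K164", "D195", "E221"},
    "family_3": {"K164", "D195", "E221", "D270"},
}
common = superfamily_residues(families)
print(sorted(common))
```

Residues that appear in only some families, such as the hypothetical H297 and D270 above, remain family-specific and can help distinguish one family from another.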
Although we made every effort to use our knowledge of family- and superfamily-specific chemistry to support homology-based catalytic residue prediction, this process is to some extent subjective, and our family- and superfamily-specific catalytic residue assignments may change as further experimental information becomes available. The type of evidence used to identify a given family- or superfamily-specific catalytic residue can be determined by examining the associated evidence code in the SFLD, which is updated as new information about these superfamilies becomes available.
To illustrate the differences between our family and superfamily classifications and existing classifications, we have compared our data to Pfam, SCOP and SUPERFAMILY (see Additional data files 1 and 2).
Each of the sequences in our superfamilies was compared to the global-alignment-based hidden Markov models contained in version 17.0 of the Pfam-A database, using HMMPFAM with the gathering cutoff established by Pfam curators. Sequences were classified into the Pfam-A family to which they showed the most significant match. When a sequence corresponded to multiple Pfam domains, the most significant match for each region of the sequence was noted.
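One way to pick the most significant match for each region of a sequence is sketched below. It is only an illustration: the `Hit` record, the per-model cutoff dictionary, and the rule that two hits belong to the same region when their coordinates overlap are assumptions introduced here, and the text does not say how regions were delimited. The sketch works on already-parsed hits and does not call HMMPFAM itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    model: str     # name of the matched model (hypothetical labels below)
    start: int     # 1-based start of the match on the query sequence
    end: int       # inclusive end of the match
    score: float   # bit score
    evalue: float  # E-value; smaller means more significant


def best_hit_per_region(hits: List[Hit], cutoffs: Dict[str, float]) -> List[Hit]:
    """Keep the most significant non-overlapping hits that pass their cutoff."""
    # Hits to models without a known cutoff are discarded.
    passing = [h for h in hits if h.score >= cutoffs.get(h.model, float("inf"))]
    kept: List[Hit] = []
    # Visit hits from most to least significant; keep one only if its region is free.
    for hit in sorted(passing, key=lambda h: h.evalue):
        if all(hit.end < k.start or hit.start > k.end for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit("model_A", 5, 120, 150.0, 1e-40),
    Hit("model_B", 10, 110, 30.0, 1e-5),   # overlaps model_A, less significant
    Hit("model_C", 200, 320, 80.0, 1e-20),
    Hit("model_D", 210, 300, 10.0, 0.5),   # below its cutoff
]
cutoffs = {"model_A": 25.0, "model_B": 25.0, "model_C": 25.0, "model_D": 25.0}
selected = best_hit_per_region(hits, cutoffs)
```

In this example the sequence is assigned `model_A` for its first region and `model_C` for its second. A stricter or looser overlap rule would change which hits count as competing for the same region.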
The SCOP family and superfamily classifications were obtained for each sequence in our superfamilies that had a crystal structure listed in SCOP version 1.67. Each of the sequences in our superfamilies was also compared to the SUPERFAMILY set of hidden Markov models, which were built based on SCOP release 1.67. Comparisons were performed using HMMPFAM, with an E-value cutoff of 1. Sequences were classified into the SUPERFAMILY superfamily to which they showed the most significant match. When a sequence corresponded to multiple SUPERFAMILY domains, the most significant match for each region of the sequence was noted.
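Once each sequence carries both a curated family label and an external assignment, the two classifications can be compared by cross-tabulation. The sketch below is a hypothetical illustration of that bookkeeping, not the analysis behind Additional data files 1 and 2. The function names and the labels `family_1`, `ext_X`, and so on are invented for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def cross_tabulate(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count, for each curated family, how many sequences fall in each external group."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for curated_family, external_group in pairs:
        table[curated_family][external_group] += 1
    return {family: dict(counts) for family, counts in table.items()}


def lumped_groups(table: Dict[str, Dict[str, int]]) -> Dict[str, List[str]]:
    """List external groups that contain sequences from more than one curated family."""
    members: Dict[str, set] = defaultdict(set)
    for family, counts in table.items():
        for group in counts:
            members[group].add(family)
    return {g: sorted(f) for g, f in members.items() if len(f) > 1}


# One (curated family, external group) pair per sequence; all labels are hypothetical.
pairs = [
    ("family_1", "ext_X"),
    ("family_1", "ext_X"),
    ("family_2", "ext_X"),
    ("family_3", "ext_Y"),
]
table = cross_tabulate(pairs)
lumped = lumped_groups(table)
```

Here `ext_X` groups together two curated families, the kind of difference in granularity that a comparison between classification schemes is meant to expose.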
The following additional data are available with the online version of this paper. Additional data file 1 lists the family and superfamily mappings for the sequences and structures in the gold standard superfamily set, with Pfam, SCOP, and SUPERFAMILY assignments listed as names. Additional data file 2 lists the family and superfamily mappings for the sequences and structures in the gold standard superfamily set, with Pfam and SCOP assignments listed as accession numbers. Additional data file 3 provides FASTA-format sequences for gold standard superfamily proteins. Additional data file 4 contains references for the gold and silver standard family assignments. Additional data file 5 lists gold and silver standard family assignments and the corresponding references.
Superfamily and family assignments for each of the sequences and structures from this work, as well as the corresponding Pfam, SCOP, and SUPERFAMILY assignments. Pfam, SCOP, and SUPERFAMILY assignments are listed as names. (1) National Center for Biotechnology Information GI number. Additional data file 3 contains the FASTA-format sequences corresponding to each GI number. (2) Protein Data Bank identifier. (3) The gold and silver standard o-succinylbenzoate synthase (OSBS) families contain a more diverse set of enzymes than many other families listed in the table. All of the OSBS enzymes are believed to catalyze the same overall reaction via the same catalytic residues and there is no convincing evidence to suggest convergent evolution from within the superfamily, so we believe that these enzymes meet our definition of family. They appear, however, to utilize a different constellation of substrate binding residues, and certain subclusters within the family catalyze the promiscuous N-acyl amino acid racemase reaction in addition to the OSBS reaction. Because the sequences that comprise this family are highly divergent, the family may pose special difficulties for automated clustering methods. Additional families that may be especially challenging include the extradiol dioxygenase families within the VOC superfamily, where a relatively high degree of sequence similarity and catalytic promiscuity make accurate clustering difficult. (4) Evidence code for gold and silver standard family assignment. (5) ID number for the literature reference upon which gold/silver family assignment was based. When a sequence has been assigned to both a gold and silver standard family, this reference applies to both family classifications. When it has only been assigned to a silver standard family, this reference applies to the silver standard family classification. The full reference may be obtained by cross-referencing the ID number with Additional data file 4.
Superfamily and family assignments for each of the sequences and structures from this work, as well as the corresponding Pfam and SCOP assignments. Pfam and SCOP assignments are listed as accession numbers. This file is essentially identical to Additional data file 1, except that Pfam and SCOP assignments are listed as database accession numbers rather than names, and SUPERFAMILY assignments are not listed. (1) National Center for Biotechnology Information GI number. (2) Protein Data Bank identifier. (3) The gold and silver standard o-succinylbenzoate synthase (OSBS) families contain a more diverse set of enzymes than many other families listed in the table. All of the OSBS enzymes are believed to catalyze the same overall reaction via the same catalytic residues and there is no convincing evidence to suggest convergent evolution from within the superfamily, so we believe that these enzymes meet our definition of family. They appear, however, to utilize a different constellation of substrate binding residues, and certain subclusters within the family catalyze the promiscuous N-acyl amino acid racemase reaction in addition to the OSBS reaction. Because the sequences that comprise this family are highly divergent, the family may pose special difficulties for automated clustering methods. Additional families that may be especially challenging include the extradiol dioxygenase families within the VOC superfamily, where a relatively high degree of sequence similarity and catalytic promiscuity make accurate clustering difficult. (4) Evidence code for gold and silver standard family assignment. (5) ID number for the literature reference upon which gold/silver family assignment was based. When a sequence has been assigned to both a gold and silver standard family, this reference applies to both family classifications. When it has only been assigned to a silver standard family, this reference applies to the silver standard family classification. The full reference may be obtained by cross-referencing the ID number with Additional data file 4.
FASTA-format sequences for gold standard superfamily proteins. Some protein sequences differ from the sequence listed for the equivalent GI number at the National Center for Biotechnology Information because they have been trimmed to remove portions of the sequence that are not part of the superfamily.
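Because the sequences in this file may be trimmed, users who need coordinates on the full-length protein have to map each trimmed sequence back onto its parent. The sketch below shows one way to do that with a small FASTA reader and an exact substring search. The helper names, the assumption that the first token of each header line is the identifier, and the toy sequences are all hypothetical. Exact matching will also fail if a trimmed sequence is not a contiguous piece of the full-length record.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {identifier: sequence}, keyed by the first header token."""
    records: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def locate_trimmed(full: str, trimmed: str) -> Optional[Tuple[int, int]]:
    """Return 1-based inclusive coordinates of `trimmed` within `full`, or None."""
    if not trimmed:
        return None
    start = full.find(trimmed)
    if start < 0:
        return None
    return start + 1, start + len(trimmed)


# Toy records with made-up identifiers and sequences.
trimmed_records = read_fasta([">example_id trimmed domain", "TAYIA"])
full_records = read_fasta([">example_id full length", "MKTAY", "IAKQR"])
coords = locate_trimmed(full_records["example_id"], trimmed_records["example_id"])
```

If the same trimmed segment occurs more than once in the parent sequence, `str.find` reports only the first occurrence, so ambiguous cases would need manual inspection.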
Gold and silver standard family assignments for each of the sequences in this work, including the corresponding evidence codes and literature references. (1) National Center for Biotechnology Information GI number. Additional data file 3 contains the FASTA-format sequences corresponding to each GI number. (2) The gold and silver standard o-succinylbenzoate synthase (OSBS) families contain a more diverse set of enzymes than many other families listed in the table. All of the OSBS enzymes are believed to catalyze the same overall reaction via the same catalytic residues and there is no convincing evidence to suggest convergent evolution from within the superfamily, so we believe that these enzymes meet our definition of family. They appear, however, to utilize a different constellation of substrate binding residues, and certain subclusters within the family catalyze the promiscuous N-acyl amino acid racemase reaction in addition to the OSBS reaction. Because the sequences that comprise this family are highly divergent, the family may pose special difficulties for automated clustering methods. Additional families that may be especially challenging include the extradiol dioxygenase families within the VOC superfamily, where a relatively high degree of sequence similarity and catalytic promiscuity make accurate clustering difficult. (3) Evidence code for gold and silver standard family assignment. (4) ID number for the literature reference upon which gold/silver family assignment was based. This number corresponds to the reference ID number given in Additional data files 1, 2, and 4. When a sequence has been assigned to both a gold and silver standard family, this reference applies to both family classifications. When it has only been assigned to a silver standard family, this reference applies to the silver standard family classification.
We thank Ranyee Chiang for analysis of SCOPEC to provide an estimate of the number of mechanistically diverse superfamilies that may exist. This work was supported by NIH R01-GM60595 and NSF DBI-0234768 to P.C.B., NIH GM52594 to J.A.G., and NIH GM071790 to P.C.B. and J.A.G.