This report summarizes the Critical Assessment of Protein Structure Prediction (CASP5) target proteins, which included 67 experimental models submitted from various structural genomics efforts and independent research groups. Throughout this special issue, CASP5 targets are referred to with the identification numbers T0129–T0195. Several of these targets were excluded from the assessment for various reasons: T0164 and T0166 were cancelled by the organizers; T0131, T0144, T0158, T0163, T0171, T0175, and T0180 were not available in time; T0145 was “natively unfolded”; the T0139 structure became available before the target expired; and T0194 was solved for a different sequence than the one submitted. Table I outlines the sequence and structural information available for CASP5 proteins in the context of existing folds and evolutionary relationships. This information provided the basis for a domain-based classification of the target structures into three assessment categories: comparative modeling (CM), fold recognition (FR), and new fold (NF). The FR category was further subdivided into homologues [FR(H)] and analogs [FR(A)] based on evolutionary considerations, and the overlap between assessment categories was classified as CM/FR(H) and FR(A)/NF. CASP5 domains are illustrated in Figure 1. Examples of nontrivial links between CASP5 target domains and existing structures that support our classifications are provided.
Although assessment categories are named historically on the basis of the techniques used to generate structure predictions, targets are now classified on the basis of the degree of sequence and structural similarity to known folds. The nature of such a classification scheme requires targets to be split into domains, because domains represent the basic units of folding and evolution. In CASP classification, multidomain protein targets often crossed assessment categories. For example, the two domains of target T0149 (E. coli hypothetical protein yjiA) exhibited very different relationships to proteins with known folds (Fig. 2). The sequence of the N-terminal domain [Fig. 2(A), white] placed it within the Nitrogenase iron protein-like family of P-loop NTPase structures [1cp2,1 Fig. 2(B)], whereas the sequence of the C-terminal domain did not resemble that of any existing fold sequence [Fig. 2(A), gray]. However, the C-terminal domain did bear some topological similarity to an Hpr-like fold [1pch,2 Fig. 2(C)]. Thus, proper classification of this target required defining appropriate domain boundaries and assigning the resulting domains to different categories.
In addition to allowing a more discrete classification of CASP5 targets, domain parsing permitted us to provide a more accurate assessment of the structural quality of model predictions. When a domain rotation exists between an experimental structure and a model or template (e.g., T0159 with periplasmic binding protein 1gl2, known for 30° domain rotation on ligand binding3), a single superposition is not adequate to represent the similarities or differences between the two folds. Similarly, many automatic protein structure comparison methods provide lower scores for multiple-domain proteins than they do for the isolated domains. Accordingly, splitting targets into domains increased the scores assigned to CASP5 target predictions by automated evaluation approaches and ultimately provided a better estimation of group performance.
We defined CASP5 target domains manually, basing our judgment on the presence of a potentially independent hydrophobic core. We separated β-sheets when necessary to achieve such an arrangement. Precise domain boundaries were sometimes difficult to delineate. In such cases, we considered various aspects of the template domain structure: residue side-chain or backbone contacts within proposed domains, sequence or structural similarities to existing domains, and domains represented in predictor models as criteria for boundary selection. In total, 55 CASP5 target structures were divided into 80 independent domains, which corresponds to an increase from CASP4 (40 targets and 58 domains).
Mutual domain arrangements are more challenging to predict than individual domain structures, especially when difficult domain organizations such as swaps or discontinuous boundaries exist. Domain swaps, defined as either an exchange of domains between protein chains or an exchange of secondary structural elements between domains, were found in several CASP5 target structures. Target T0140 represented a synthetic hybrid protein with a dimeric OB-fold formed by a β-hairpin swap. Although such swaps are becoming well-documented phenomena in protein structure, this arrangement remains virtually impossible to predict without a precedent in the existing pool of OB-folds. As such, we used portions of two chains in our definition of the domain boundaries for this target (see Table I).
Three different CASP5 targets (T0152, T0169, and T0192) belong to the Acyl-CoA N-acyltransferase family. This family contains members that exist as either an independent fold (i.e., 1cjw4) or a β-strand-swapped dimer (i.e., 1qsm5). It is of interest that one of the target structures (T0192) formed a swapped-dimer, whereas the other two (T0152 and T0169) did not. Although prediction of such a swap was conceivable in this case, we used portions of two chains in our definition of the domain boundaries. Predictions could then be compared with either the original protein chain (β-strand extended into space) or the defined domain (β-strand completing the fold).
The second type of difficult domain organization includes structures that contain domain boundaries that are discontinuous with respect to the primary sequence. Such an arrangement may result from a domain insertion into the middle of an existing fold or from a swap of secondary structural elements between domains. Target T0148 provides an excellent example of a CASP5 target with discontinuous domain boundaries. This target contained a tandem repeat of ferredoxin-like fold domains with swapped N-terminal β-strands.
The main goal of CASP5 classification was to place target domains among existing structures so that predictions could be assessed according to three main categories: comparative modeling, fold recognition, and new fold. To accomplish this task, we used sequence and structure similarity measures to find the closest neighbors (templates) to target domains and a classification scheme similar to that defined by the Structural Classification of Proteins (SCOP) database.6 This procedure allowed us to hypothesize about the evolutionary relationships between CASP5 targets and existing protein structures.
To evaluate the similarities of CASP5 targets to proteins of known folds, we used a combination of sequence/profile and structure database-searching approaches. All domains identified with sequence-based methods were assigned to the comparative modeling assessment category (CM). In general, simple BLAST7 searches (E-value cutoff 0.005) of the nonredundant database (NR, September 18, 2002) identified close homologues of target sequences (26 domains). Sequence profile searches using multiple iterations (up to 5) of PSI-BLAST8 (E-value cutoff 0.005) identified more distant homologs (17 domains), and transitive PSI-BLAST searches (E-value cutoff 0.02 with manual filtering) initiated from a number of sequences found to be homologous to the initial target sequence identified additional remote homologues (8 domains). For all of these cases, structural similarity to identified folds in the Protein Data Bank9 was confirmed with inspection of Dali structure superpositions.10,11 Unusual structural differences revealed in this inspection between the targets and templates with detectable sequence similarities (T0141 with 1lba12 and T0152 with 1cjw4) are noted in Table I.
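The tiered search described above can be sketched as a simple cascade. The following is only an illustration under stated assumptions: the function name `detection_tier`, the use of `None` for "no hit", and the inclusive comparison against each cutoff are hypothetical choices, and the manual filtering applied to the transitive searches is not modeled. Only the E-value cutoffs are taken from the text.

```python
from typing import Optional

# E-value cutoffs quoted in the text for each search tier.
BLAST_CUTOFF = 0.005
PSIBLAST_CUTOFF = 0.005
TRANSITIVE_CUTOFF = 0.02  # the text adds manual filtering, not modeled here


def detection_tier(
    blast_e: Optional[float],
    psiblast_e: Optional[float],
    transitive_e: Optional[float],
) -> Optional[str]:
    """Return the first (least sensitive) tier that detects a PDB template.

    Each argument is the best template E-value for one search tier, or None
    if that search returned no template hit. Treating a value equal to the
    cutoff as a hit is an assumption of this sketch.
    """
    if blast_e is not None and blast_e <= BLAST_CUTOFF:
        return "BLAST"
    if psiblast_e is not None and psiblast_e <= PSIBLAST_CUTOFF:
        return "PSI-BLAST"
    if transitive_e is not None and transitive_e <= TRANSITIVE_CUTOFF:
        return "transitive PSI-BLAST"
    # No sequence-based link: the domain goes on to the structure searches.
    return None


if __name__ == "__main__":
    # Hypothetical E-values, for illustration only.
    print(detection_tier(1e-30, None, None))
    print(detection_tier(0.5, 1e-4, None))
    print(detection_tier(None, 0.3, 0.01))
    print(detection_tier(None, None, 0.5))
```

Any domain for which the cascade returns a tier would fall in the sequence-detectable group; a `None` result corresponds to the domains that required structure-based searches.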
The CM domains identified through these sequence/profile-based methods represent a significant portion of the CASP5 targets (51 of 80 domains) and cover a broad range of sequence similarities to PDB templates (some identified with simple BLAST, and some identified with transitive PSI-BLAST searches). When scored with a measure of sequence similarity to identified PDB templates described in the CASP5 fold recognition assessment (Sseq),13 the target domains assigned to this single category tended to fall into two groups (see bimodal distribution, Fig. 3). These two groups correspond to close homologues (higher Sseq group, Fig. 3) and remote homologues (lower Sseq group, Fig. 3). Generally, targets grouped within the close homologues identified template sequences with simple BLAST or with PSI-BLAST and represented either identical proteins from different species (orthologs) or similar proteins from the same SCOP families as the template proteins. Likewise, targets grouped within the remote homologues identified template sequences with PSI-BLAST or with transitive PSI-BLAST and belonged to the same SCOP superfamilies as the template proteins.
The bimodal distribution of target domains illustrated in Figure 3 suggests a natural boundary between the comparative modeling assessment category and the fold recognition category. However, grouping the target domains into two discrete clusters based on this sequence similarity measure (Sseq) alone was difficult. To accomplish this task, we chose to include an additional measure of structural similarity to identified PDB templates (see fold recognition assessment article for a complete description13). The resulting two-dimensional scale defined a precise boundary between close homologues (29 CM domains) and remote homologues (22 CM/FR(H) domains), although it resulted in a significant overlap between the two assessment categories.
To classify the remaining targets (29 domains), we used Dali10,11 to search the PDB for protein structures with similar folds. We also used a secondary structure-based vector search program developed in our laboratory (unpublished) to identify more distant protein structures in the PDB that displayed similar topologies to the target folds. We combined these automated search programs with manual inspection and a general knowledge of protein folds to produce the final classification. For cases with identified structural similarities (24 domains), analogy between the target and template was assumed unless there was enough compelling evidence to hypothesize descent from a common ancestor (see examples below). For those cases without clear similarities to known structures, a classification of new fold was assigned (5 domains).
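The domain counts quoted in the preceding paragraphs can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the dictionary and function names are hypothetical.

```python
# Domain counts quoted in the text.
SEQUENCE_TIERS = {"BLAST": 26, "PSI-BLAST": 17, "transitive PSI-BLAST": 8}
CM_SPLIT = {"CM": 29, "CM/FR(H)": 22}
STRUCTURE_ONLY = {"structural similarity found": 24, "new fold": 5}
TOTAL_DOMAINS = 80


def tally() -> tuple:
    """Return (sequence-detected, structure-only, grand total) domain counts."""
    sequence_detected = sum(SEQUENCE_TIERS.values())
    structure_only = sum(STRUCTURE_ONLY.values())
    return sequence_detected, structure_only, sequence_detected + structure_only


if __name__ == "__main__":
    sequence_detected, structure_only, total = tally()
    # The two-way split of the sequence-detected domains must cover all of them.
    assert sum(CM_SPLIT.values()) == sequence_detected
    assert total == TOTAL_DOMAINS
    print(sequence_detected, structure_only, total)  # prints: 51 29 80
```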
The overlap [FR(A)/NF] between the fold recognition category and the new fold category was defined on the basis of various criteria including general overall size and fold topology of the complete structure, length, and arrangement of individual secondary structural elements within the structure, and degree of partial similarities to existing folds. We asked the question: How well does an existing fold approximate this target domain? One of the more difficult targets to classify in this respect was the C-terminal domain of F-actin capping protein α-1 subunit (T0162_3). We classified this domain as NF, although the overall topology of the core was similar to that of ubiquitin-conjugating enzyme Ubc9 (1u9a14). Each structure includes a five-strand meander flanked on one side by two α-helices. However, the flanking helices of the target domain (T0162_3) form a parallel interaction with a flat β-sheet, whereas those of Ubc9 interact in a perpendicular orientation due to a significant twist of the β-sheet. In addition, the secondary structural elements of Ubc9 are generally shorter in length than those of the target and include two additional C-terminal helices. Finally, Ubc9 represents an independent folding unit, whereas the extended target domain likely requires another subunit to form a compact structure.
In classifying individual CASP5 target domains, we sought to establish evolutionary relationships to existing folds wherever possible. First, we defined as a homologue any target whose sequence detected its corresponding template sequence using the various forms of PSI-BLAST. Target T0168, a glutaminase from B. subtilis, represents one of the most challenging sequence links established with use of these methods. By using transitive PSI-BLAST searches with two intermediate sequences, a hypothetical protein from P. aeruginosa (gi|15596835) and a 6-aminohexanoate-dimer hydrolase (gi|488342), this target glutaminase sequence was linked to sequences from the β-lactamase/D-ala carboxypeptidase superfamily (i.e., 3pte, 2blt, 1ci9). Members of this superfamily contain an α/β sandwich domain interrupted by a cluster of helices. The active site lies between these two domains and contains a conserved catalytic Ser-x-x-Lys motif.15-17 The presence of this motif in the glutaminase structure, along with the positioning of conserved residues within the defined active site cleft, further supported the presumption of homology for this target.
For those targets without detectable sequence similarity, we considered the degree of structural similarity to known folds (Dali z score above 9)18 and combined this information with additional structural and functional considerations as evidence for homology. Examples of such additional considerations included similarities in the organization of domain structure, the sharing of unusual structural features, the sharing of local structural motifs, or the placement of active sites. Six additional CASP5 target domains (T0134, T0138, T0156, T0157, T0174, and T0193_1) were classified as homologues to proteins with known folds based on these criteria.
Two of these CASP5 targets (T0134 and T0174) displayed considerable structural similarity (Dali z scores 19.1 and 18.5) to their respective templates (1qts19 and 1kvk20) while retaining an identical domain organization. Both the delta-adaptin appendage domain (target T0134) and its closest template, the clathrin adaptor appendage domain (1qts19), have an N-terminal immunoglobulin-like β-sandwich followed by a C-terminal clathrin adaptor appendage domain-like fold. Similarly, target T0174 (Protein XOl-1 from C. elegans) displays a two-domain structure analogous to that of its closest template, mevalonate kinase (1kvk20). Both structures include an N-terminal ribosomal protein S5 domain 2-like domain and a C-terminal ferredoxin-like domain.
Additional examples of notable structural similarity in the absence of detected sequence similarity included target T0138 and target T0156. The target T0138 KaiA N-terminal domain from S. elongatus superimposed with a CheY-like superfamily member, the PhoB receiver domain from T. maritima (1kgs21), with a reasonable Dali z score (13.9). The structure of target T0156 had a less impressive Dali z score (7.5) when aligned with the “swiveling” domain of pyruvate phosphate dikinase (1dik22). However, both structures included an identical topological arrangement of secondary structural elements comprising the three layers (β-β-α) of the “swiveling” domain fold. In addition, the two β-sheet layers of both the target domain (T0156) and the template domain (1dik central domain) form a distinctive closed barrel (n = 7, S = 10). This unusual structural similarity compelled us to regard these two proteins as homologues.
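In the barrel descriptor quoted above, n is the number of strands and S is the shear number. As a rough illustration of what these two integers imply, the sketch below applies the standard idealized model of a regular β-barrel, in which the strand tilt α satisfies tan α = Sa/(nb) and the radius is R = b / (2 sin(π/n) cos α). The spacing constants (a = 3.3 Å along a strand, b = 4.4 Å between strands) and the function name are assumptions that do not come from the text, and the result describes an idealized barrel, not a measurement of T0156 or 1dik.

```python
import math

# Assumed idealized spacings (not from the text):
A_ALONG_STRAND = 3.3     # Å, residue-to-residue rise along a strand
B_BETWEEN_STRANDS = 4.4  # Å, distance between adjacent strands


def barrel_geometry(n: int, shear: int,
                    a: float = A_ALONG_STRAND,
                    b: float = B_BETWEEN_STRANDS) -> tuple:
    """Return (strand tilt in degrees, barrel radius in Å) for an ideal barrel.

    n     -- number of strands in the closed barrel
    shear -- shear number S of the barrel
    """
    tilt = math.atan((shear * a) / (n * b))
    radius = b / (2.0 * math.sin(math.pi / n) * math.cos(tilt))
    return math.degrees(tilt), radius


if __name__ == "__main__":
    tilt_deg, radius = barrel_geometry(n=7, shear=10)
    print(f"tilt = {tilt_deg:.1f} degrees, radius = {radius:.1f} Å")
```

Because n and S together fix the idealized tilt and radius, two barrels sharing the same (n, S) pair have the same overall architecture, which is why the shared descriptor is treated as a distinctive feature.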
Two CASP5 targets (T0193_1 and T0157) retained conserved motifs that distinguished important structural or functional aspects of their closest templates. The N-terminal domain of target T0193, an AT-rich DNA-binding protein from T. aquaticus, formed a three-helical bundle with a winged helix-turn-helix motif similar to that of the putative transcriptional regulator TM1602 N-terminal domain (1j5y9). PSI-BLAST detected the sequence of this structure (1j5y) with an E-value (0.85 over 47 residues) outside a reasonable threshold but with an alignment that matched the structural alignment. The target structure superimposed with the template structure (Dali z score 4.8), although residues corresponding to the “wing” were disordered. Target T0157, E. coli yqgF, retained identifiable structural motifs present in the ribonuclease H-like fold (i.e., RuvC resolvase 1hjr23). It is of interest that these motifs were already identified in an article describing the structural and evolutionary relationships of Holliday junction resolvases and related nucleases.24 Based on the presence of these motifs, this article predicted a preservation of the core RuvC secondary structure elements in the structure corresponding to the CASP5 target sequence T0157.
Finally, target T0147, E. coli YcdX, presented an unresolved case of potential homology to the TIM β/α barrel superfamily of metallo-dependent hydrolases. This target was previously defined as a PHP domain based on a detailed analysis of the domain conservation of two distinct classes of polymerases.25 Although no definitive structural prediction for the PHP domain superfamily was presented in the report, the authors suggested a link to the metal-dependent hydrolase superfamily based on the presence of a conserved metal binding site motif (HXH) and the results of multiple alignment-based database threading.25
It is interesting that metal-dependent hydrolases such as cytosine deaminase (1k6w26) form a distorted TIM β/α barrel fold capped by a C-terminal helix similar to that of the target structure. Although the target TIM β/α barrel is composed of only seven strands and helices, both structures bind transition metals within the barrel using the conserved metal-binding motif. However, in the absence of stronger evidence, we classified this target as an analog [FR(A)].
The remaining CASP5 target domains belonged to one of two categories: structural analogs and new folds. To make sure we did not miss any potential template folds of the target domains and to help make the distinction between structure analogs and new folds, we used a secondary structure vector search program under development in our laboratory. To perform this vector search, the secondary structural elements belonging to each target domain were defined, and a matrix of contacts between these elements that included the types of interactions (i.e., parallel, antiparallel) and the handedness of connections was constructed. We used this target matrix to search for exact matches in a database of similar matrices defined for all available PDB structures. This program finds topological and architectural similarities including circular permutations but is not sensitive to structural details such as packing, length of secondary structure elements, or large insertions.
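The matrix comparison described above can be illustrated with a toy model. This is a minimal sketch and not the laboratory program, which is unpublished: the data layout, the names `rotate` and `matches_with_permutation`, and the contact labels are hypothetical, and the handedness of connections is omitted for brevity. A domain is represented as an ordered list of secondary structural elements ("E" for strand, "H" for helix) plus a mapping from element pairs to an interaction type; a circular permutation is modeled as a rotation of the element order.

```python
from typing import Dict, FrozenSet, List, Tuple

Contacts = Dict[FrozenSet[int], str]
Domain = Tuple[List[str], Contacts]


def rotate(sse: List[str], contacts: Contacts, k: int) -> Domain:
    """Circularly permute a domain so that element k becomes element 0."""
    n = len(sse)
    new_sse = sse[k:] + sse[:k]
    # Old index i maps to new index (i - k) mod n; contact labels are kept.
    new_contacts = {
        frozenset((i - k) % n for i in pair): kind
        for pair, kind in contacts.items()
    }
    return new_sse, new_contacts


def matches_with_permutation(query: Domain, target: Domain) -> bool:
    """True if some circular permutation of the query equals the target exactly."""
    q_sse, q_contacts = query
    t_sse, t_contacts = target
    if len(q_sse) != len(t_sse):
        return False
    for k in range(len(q_sse)):
        r_sse, r_contacts = rotate(q_sse, q_contacts, k)
        if r_sse == t_sse and r_contacts == t_contacts:
            return True
    return False


if __name__ == "__main__":
    # Hypothetical four-element domain: strand, helix, strand, strand.
    query = (
        ["E", "H", "E", "E"],
        {frozenset({0, 2}): "parallel", frozenset({2, 3}): "antiparallel"},
    )
    # The same topology with the first strand moved to the end.
    permuted = (
        ["H", "E", "E", "E"],
        {frozenset({1, 3}): "parallel", frozenset({1, 2}): "antiparallel"},
    )
    print(matches_with_permutation(query, permuted))
```

Because only element types and contact labels are compared, such a match ignores packing details, element lengths, and insertions outside the compared core, consistent with the insensitivity noted in the text.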
An example of the utility of the vector search is illustrated with the first domain of target T0187, a putative glycerate kinase from T. maritima [Fig. 4(A)]. A Dali search using this domain did not identify any reasonable structure templates in the PDB. However, the vector search program found a hit to cobalt precorrin-4-methyltransferase CbiF domain 2 [1cbf,27 Fig. 4(B)]. Both target and template structures displayed a central mixed sheet of 5 β-strands (order 12534) surrounded by α-helices. However, the target fold included an extra α-helix at the N-terminus and an extra α/β insertion between the last two β-strands that formed the edge of the β-sheet (Fig. 4). It is of interest that one group (Brooks, Group 373) identified this structure (1cbf) as a parent template for the target [Fig. 4(C)]. Although the topological arrangement of the core folds of each of these structures was similar, the packing of the connecting helices around the sheet differed significantly, providing an explanation for the lack of detection by Dali.
By including circular permutations and allowing for different insertions to the core fold of the target domain (T0187_1), we searched for templates containing even greater structural variability. Using this strategy, we linked the putative glycerate kinase domain (T0187_1) to another CASP5 target domain [T0149_2, Fig. 4(D)]. This similarity assumed including the edge strand insertion of target T0187_1 as a core secondary structural element [purple, Fig. 4(A)] and treating the first β-strand as an insertion. The resulting common antiparallel sheet resembled those of two existing structures: homoserine dehydrogenase domain 2 [1ebf_B149-337,28 Fig. 4(E)] and heat shock protein HSP90 [1a4h_A1-214,29 Fig. 4(F)].
We give special thanks to Alexey Murzin for sharing with us his expertise in structural classification. We also recognize the X-ray crystallographers and nuclear magnetic resonance (NMR) spectroscopists who submitted the experimental structures used as targets for CASP5, including the following structural genomics efforts: M. tuberculosis structural genomics consortium, Northeast Structural Genomics Consortium, Midwest Center for Structural Genomics, Joint Center for Structural Genomics, Structure-to-Function, and New York Structural Genomics Research Consortium.