Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in the non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C termini of an insert is significantly smaller than the N-to-C distance in a parent context or a single domain context. Insertions are always found to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.
It is now widely accepted that domains constitute the basic structural, functional or evolutionary unit of proteins.1-3 Proteins can comprise a single domain or they can be made from several domains resulting in a multi-domain protein. The exponential growth of protein sequence and structure data and the development of sensitive sequence comparison methods have contributed significantly towards understanding the mechanisms of protein evolution. Sequence and structure-based comparison of protein database sequences suggested that evolution made use of a limited repertoire of domain families to create multi-domain proteins with a wide variety of architecture to cater to the functional requirements of an organism at the molecular level.4,5 Structural assignments to gene sequences from complete genomes revealed that about two-thirds of prokaryotic proteins and 80% of eukaryotic proteins are multi-domain proteins.6 The preponderance of multi-domain proteins in the three kingdoms of life underscores their role in the evolution of diverse molecular functions. Thus, it becomes important to understand the evolution of multi-domain proteins.
In 1973, Donald Wetlaufer introduced the concept of continuous and discontinuous domains.7 A continuous domain is formed by one part of a polypeptide chain, while a discontinuous domain is formed by two or more parts of a single polypeptide chain. A majority of multi-domain proteins are formed by continuous domains, where the individual domains are secured by end-to-end linkages. However, there are exceptions to this common pattern where proteins exhibit discontinuity in their domain arrangement. Here, we focus on insertions, where a domain is inserted into another domain (Figure 1). Essentially, insertions represent one example of non-contiguous domain arrangement. While domain insertions were described anecdotally in a few protein structures by Russell,8 the availability of an accurate and well-curated domain classification resource such as the Structural Classification of Proteins (SCOP) database and an ever increasing size of the Protein Data Bank (PDB) gave us an opportunity to investigate the phenomenon comprehensively. Here, we provide a quantitative description of domain insertions in 3D structures.
We followed the definition of protein domains in the SCOP database (version 1.61).1 Although there are several available schemes of protein structure classification, we chose SCOP because it is an expert curated classification of protein structures based on their structural and evolutionary relatedness. In the SCOP database, a protein domain is considered as a unit of evolution if it occurs independently by itself or in combination with other domains.
SCOP represents a hierarchical classification scheme with four principal levels: family, superfamily, fold and class. Domains clustered into families are related evolutionarily and can be detected at the sequence level. Domains grouped within superfamilies can have low sequence identity, but their structural and functional features suggest a common evolutionary origin. Superfamilies with similar topology are grouped under a fold. Folds are assigned to classes based on their secondary structure. For our analysis, we considered the fold and superfamily levels of the SCOP hierarchy, and the five major classes (all-α, all-β, α/β, α + β and “small proteins”). All-α and all-β classes include proteins with abundant α-helices or β-sheets, respectively. The α/β class is distinguished mainly by parallel β-sheets (β-α-β units), whereas the α + β class contains proteins with predominantly anti-parallel β-sheets (segregated α and β regions). “Small proteins” are distinguished by their size rather than other features.
We obtained data for our analysis from the PDB.9 To overcome the redundancy inherent in the PDB, we chose a pre-computed list of non-redundant protein chains provided by PDB_Select†.10 We used the set of proteins that had pairwise sequence identities less than 90%. We designated this set as PDB_90. Out of the 6182 chains in PDB_90, only 5883 chains were assigned SCOP domain definitions. We used the SCOP parseable file “dir.cla.scop.txt_1.61”‡ to extract domain definitions.
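The extraction of domain definitions can be illustrated with a minimal Python sketch. It assumes the tab-separated layout of the SCOP parseable classification file (domain identifier, PDB code, region, sccs string, numeric identifier, hierarchy string). The names `Segment`, `ScopDomain`, `parse_region` and `parse_cla_line` are hypothetical, and the example line is made up for illustration rather than taken from SCOP 1.61.

```python
import re
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    chain: str
    start: Optional[int]  # None means "whole chain"
    end: Optional[int]


class ScopDomain(NamedTuple):
    sid: str
    pdb_id: str
    segments: List[Segment]
    sccs: str  # class.fold.superfamily.family, e.g. "c.1.1.1"

    @property
    def superfamily(self) -> str:
        # The first three sccs components identify the superfamily.
        return ".".join(self.sccs.split(".")[:3])


# Residue range such as "1-120"; insertion-code letters are tolerated and dropped.
_RANGE = re.compile(r"(-?\d+)[A-Za-z]?-(-?\d+)[A-Za-z]?")


def parse_region(region: str) -> List[Segment]:
    """Split a region string such as 'A:1-120,A:250-340' into segments."""
    segments = []
    for part in region.split(","):
        chain, rng = part.split(":", 1) if ":" in part else ("", part)
        match = _RANGE.fullmatch(rng)
        if match:
            segments.append(Segment(chain, int(match.group(1)), int(match.group(2))))
        else:
            segments.append(Segment(chain, None, None))  # e.g. 'A:' or '-'
    return segments


def parse_cla_line(line: str) -> Optional[ScopDomain]:
    """Parse one classification line; comments and blank lines give None."""
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 4:
        return None
    sid, pdb_id, region, sccs = fields[:4]
    return ScopDomain(sid, pdb_id, parse_region(region), sccs)


# Made-up example line (assumption): a domain built from two chain segments.
example = "d9xyza1\t9xyz\tA:1-120,A:250-340\tc.1.1.1\t99999\tcl=0,cf=0,sf=0,fa=0,dm=0,sp=0,px=0"
domain = parse_cla_line(example)
print(domain.sid, domain.superfamily, domain.segments)
```

A domain whose region lists more than one segment is discontinuous, which is the property that the search for parent domains relies on.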
It is self-evident that insertions can only be found in multi-domain proteins, where one domain (insert) is contained within another domain (parent). Parent and insert domains can belong to the same or different SCOP superfamilies. Likewise, a combination of two domains can be viewed as a combination of superfamily participations. We obtained a total of 140 protein chains that conformed to our definition. When we considered 140 protein chains as parent-insert superfamily participations, we observed several identical parent-insert superfamily participations. Whenever there was also the same topological relationship between the parent and insert domains, we retained only one example of a parent-insert superfamily participation. This procedure left us with 40 unique parent-insert superfamily participations. Variations on the simple scheme “one insert within one parent” are present; they are shown in Figure 2.
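The parent-insert definition can be sketched in Python as a test on residue ranges: a domain counts as an insert if all of its residues fall inside a gap between two consecutive segments of another domain on the same chain. This is a simplified illustration under that assumption, not the procedure actually used for the analysis; the function names and the residue ranges are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive residue range on a single chain


def gaps(segments: List[Range]) -> List[Range]:
    """Return the residue ranges lying between consecutive segments."""
    ordered = sorted(segments)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])
            if b_start - a_end > 1]


def is_inserted(insert: List[Range], parent: List[Range]) -> bool:
    """True if the whole insert lies inside a single gap of the parent."""
    lo = min(start for start, _ in insert)
    hi = max(end for _, end in insert)
    return any(g_start <= lo and hi <= g_end for g_start, g_end in gaps(parent))


def parent_insert_pairs(domains: Dict[str, List[Range]]) -> List[Tuple[str, str]]:
    """List every (parent, insert) pair among the domains of one chain."""
    return [(p, i) for p in domains for i in domains
            if p != i and is_inserted(domains[i], domains[p])]


# Hypothetical nested arrangement: inner within middle within outer.
chain = {
    "outer": [(1, 100), (400, 500)],
    "middle": [(101, 200), (300, 399)],
    "inner": [(201, 299)],
}
print(parent_insert_pairs(chain))
```

For a nested arrangement this sketch also reports the outermost domain as a parent of the innermost one; whether such indirect pairs should be counted is a choice left to the reader.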
For all cases of identified domain insertions, we checked for artefacts arising from missing coordinates. This was necessary because SCOP domain definitions are based on atomic coordinates provided in the PDB. To ascertain consistency, we compared atomic coordinates (ATOM records) versus sequences (SEQRES records) obtained from the ASTRAL compendium.11 In the majority of cases, the sequences are completely covered by coordinates, but in other cases, there are parts of sequences with missing coordinates. However, the coordinates that are absent do not obscure the position of insertion in the latter cases.
We then calculated unique superfamily participations for all multi-domain proteins. We identified 450 unique superfamily participations for 5883 single or multi-domain proteins in SCOP. Thus, domain insertions constitute 9% (40/450) of all unique superfamily combinations.
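The arithmetic behind the 9% figure, together with the reduction of per-chain observations to unique superfamily pairs, can be sketched as follows. The superfamily labels are hypothetical, and the sketch ignores the topological relationship between parent and insert that was also taken into account when removing duplicates.

```python
def unique_pairs(observations):
    """Collapse repeated (parent superfamily, insert superfamily) observations."""
    return sorted(set(observations))


def percent(part, whole):
    return 100.0 * part / whole


# Hypothetical per-chain observations; the first two chains share one pair.
observed = [("c.1.1", "d.2.2"), ("c.1.1", "d.2.2"), ("c.3.3", "a.4.4")]
print(unique_pairs(observed))

# Figures quoted in the text: 40 insertion pairs out of 450 unique combinations.
insertion_share = percent(40, 450)
print(f"{insertion_share:.1f}%")  # 8.9%, reported as 9%
```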
Domain insertions can be categorised as either single or multiple depending on the number of inserts (Figure 2). In single insertions, one domain is inserted into another domain, and both domains can belong to the same or different superfamilies. For example, in Figure 1, the Escherichia coli enzyme RNA 3′-terminal phosphate cyclase (PDB 1qmhA)12 has two domains, a small insert and a larger parent that belong to different superfamilies. Single-domain insertions account for 90% (36/40) of the observed insertions. In multiple insertions, more than one domain, either of the same or different superfamily, is inserted into the parent domain. We observed three types of multiple insertions: (i) Nested insertions: In Thermoplasma acidophilum thermosome (PDB 1a6dA),13 the apical domain of the archaeal chaperonin is inserted into the intermediate domain, which is in turn inserted into an ATPase domain. (ii) Two-domain insertions: The type II inosine monophosphate dehydrogenase from Streptococcus pyogenes (PDB 1zfjA)14 contains two tandem cystathionine-β-synthase domains inserted into the catalytic TIM-barrel domain. The second example of this is the Saccharomyces cerevisiae PI-Sce I intein (PDB 1ef0A),15 a homing endonuclease with protein splicing activity, which has the duplicated endonuclease domain inserted into the Hint domain. (iii) Three-domain insertions: In PI-Pfu I, an intein-encoded homing endonuclease from the archaeon Pyrococcus furiosus (PDB 1dq3A),16 the Hint domain has three tandem inserts, two intein endonuclease domains with αββαββαα structural motifs, and one Stirrup domain.
Previous work on intron-encoded homing endonucleases from the dodecapeptide family showed that for their folding, dimerisation and catalysis, they should form a dimer that has two copies of the LAGLIDADG motif (one copy per subunit of a dimer), or alternatively they could be monomeric with the monomer having both copies of the motif.17 We found that in PI-Sce I (case (ii)) and PI-Pfu I (case (iii)), two monomeric domains are tandemly inserted into one parent domain. This observation suggests to us that during the course of evolution, there was a simultaneous insertion of two monomeric domains into the parent domain, rather than an insertion of one monomeric domain followed by its duplication.
In our analysis, we treated multiple insertions as several separate parent-insert combinations, resulting in the total of 45 such combinations within 40 protein chains. There are 41 unique parent-insert superfamily combinations. Upon examination of relationships among proteins containing insertions, levels of SCOP hierarchy, and superfamily participation of parent and inserted domains, we identified several biologically meaningful patterns. These findings are discussed below.
As mentioned before, we considered five SCOP classes. There is a maximum of 25 (5 × 5) different class pairwise combinations. In our data, we observed only 15 combinations when investigating class participation of parent-insert pairs. The combination of α/β-parent-α + β-insert is predominant, where 50% of all parents belonged to α/β class and 40% of all inserts belonged to α + β class. Domains from α/β class occur as parent domains twice and four times more often than domains from all-β and all-α class, respectively. Domains from the class of “small proteins” are seen only as inserts. This bias could be explained, at least to a certain extent, by taking into consideration the size and function of parents and inserts, which is articulated in the next section.
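The tally of class combinations can be sketched in Python by reading the class letter from the sccs string of each parent and insert. The mapping of letters to the five classes follows the SCOP sccs convention as we understand it; the parent-insert pairs below are hypothetical and serve only to show the counting.

```python
from collections import Counter
from itertools import product

# SCOP class letters for the five classes considered (assumed sccs convention).
SCOP_CLASSES = {
    "a": "all-α",
    "b": "all-β",
    "c": "α/β",
    "d": "α + β",
    "g": "small proteins",
}

# Every ordered (parent class, insert class) combination: 5 x 5 = 25.
all_pairs = list(product(SCOP_CLASSES, repeat=2))


def class_pair_counts(pairs):
    """Count (parent class, insert class) combinations from sccs strings."""
    return Counter((parent[0], insert[0]) for parent, insert in pairs)


# Hypothetical (parent sccs, insert sccs) pairs.
example_pairs = [("c.1.1", "d.2.2"), ("c.3.3", "d.5.5"), ("b.1.1", "g.9.9")]
counts = class_pair_counts(example_pairs)
for (parent_cls, insert_cls), n in counts.most_common():
    print(SCOP_CLASSES[parent_cls], "parent /", SCOP_CLASSES[insert_cls], "insert:", n)
```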
Figure 3(a) shows the domain length distribution for proteins from PDB_90 across the five SCOP classes. The average domain length is longest for α/β class followed by the all-β, α + β, and all-α class. When we calculated distribution of average domain lengths for 41 parent domains, we observed the same trend (Figure 3(b)). However, the average length of parent domains is noticeably larger than the average length of domains from PDB_90 set; this is true for each SCOP class (compare Figure 3(a) and (b)). Thus, combining the fact that α/β parent domains are the most abundant, with the fact that α/β domains are the longest on average, we arrived at an explanation that longer domains more readily accept insertions during evolution. As for the inserted domains, α + β and all-α class are equal and major contributors to the number of domains. Therefore, the trend observed for parents is not applicable for inserts.
In most cases, inserted domains are shorter than parent domains (Figure 4(a)). Parents comprised 50–80% of protein length, while inserts comprised 20–50%. Close to 80% of inserts are shorter than 175 residues, which is the average length of a protein domain calculated from crystal structures.18 More than 60% of inserts are shorter than 130 residues. This observation is consistent with the heuristic thinking that smaller domains are less likely to disturb the structure and folding of parent domains; the observation could explain shorter lengths of inserted domains. Our explanation does not contradict an important experiment by Doi and colleagues.19 They were able to show that when random sequences of 120–130 amino acid residues were inserted into a surface loop region of E. coli RNase HI, about 10% of the clones retained >1% of the wild-type RNase HI activity.19
The large proportion of α/β class domains as parents can be correlated with their biochemical function. Previous work showed that more than half of the proteins in the PDB are enzymes, and close to one half of all enzyme families contain multi-domain proteins. Multi-domain enzymes often consist of a catalytic domain and a nucleotide-binding domain.20 It is therefore possible to predict that domain insertions are likely to occur in enzymes. Indeed, in our dataset, 39 out of 40 parent-insert pairs conform to this prediction. The remaining non-enzymatic protein is the bluetongue virus capsid protein vp-7, which has the central domain from all-β class inserted into the multi-helical parent domain. A genome-scale analysis of the structural features of proteins revealed that proteins with α/β-fold are frequently involved in fusion events.21 α/β-folds are also known to be associated disproportionately with enzymatic function,20 which lends further credence to the prominent role of α/β-folds in accepting insertions.
Out of 57 folds in the class of “small proteins”, we found two domains with a similar fold (Rubredoxin) as inserts; both the inserted domains belong to the same superfamily. Within the α + β class, the 18 inserted domains (from 15 superfamilies) spanned 11 folds; there are 204 different folds in the α + β class (data not shown). The trend is similar for the other SCOP classes, where folds of inserted domains constitute minor fractions of known folds. In contrast to the inserts, all parent domains have different folds. Thus, we observed another distinction between parents and inserts at the fold level.
Similarly, parent superfamilies are found to be more versatile than insert superfamilies. Most insert superfamilies combine with only one parent superfamily. There are merely three out of 45 insert superfamilies that combine with two different parent superfamilies. These insert superfamilies are NAD(P)-binding Rossmann superfamily, FAD/NAD(P)-binding superfamily and C-terminal domain of FAD-linked reductases superfamily.
While most parent superfamilies combine with just one insert superfamily, there are five conspicuous exceptions. There are three parent superfamilies each combining with two different insert superfamilies. The three parent superfamilies are Zn-dependent exopeptidases superfamily, nucleotidyl transferase superfamily, and nucleotide-binding domain superfamily. Moreover, there are two parent superfamilies each combining with three different insert superfamilies. The two parent superfamilies are P-loop containing NTP hydrolases superfamily, and FAD/NAD(P)-binding domain superfamily.
Two further observations at the superfamily level are worth mentioning. Firstly, with one exception, all parents and inserts belong to different superfamilies: in the E. coli enzyme glutathione reductase (PDB 1gesB),22 both the parent and insert belong to the superfamily of FAD/NAD(P)-binding domains. Secondly, superfamilies that are popular in the parent or insert context also appear to be popular in sequential domain combinations.23 They are found combining with more than one superfamily in sequential domain order. One exception to this correlation is the superfamily of C-terminal domains of FAD-linked reductases; this superfamily is popular in the insert context, but does not tandemly combine with other superfamilies.
We did not find any bias in the distribution of insertion points within 41 unique parent-insert combinations. However, we observed a significant bias in the location of the insertion point when we considered a subset of 28 parent-insert combinations, where either the parent or insert superfamily also participated in sequential combination with other superfamilies. As shown in Figure 4(b), for the 28 cases in question the insertion point occurred in the last third of the parent domain sequence (confidence level 98%). Spatially, all 41 insertions are observed in loop regions of the 3D structure of parent domains.
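One way to express the location of an insertion point is as the fraction of parent residues that precede it, which can then be binned into thirds of the parent sequence. The Python sketch below assumes this measure, which counts parent residues only; the exact measure used for Figure 4(b) is not spelled out here, and the residue ranges are hypothetical.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive residue range of one parent segment


def relative_insertion_point(parent: List[Range], insert_start: int) -> float:
    """Fraction of parent residues located before the insert begins."""
    total = sum(end - start + 1 for start, end in parent)
    before = sum(end - start + 1 for start, end in parent if end < insert_start)
    return before / total


def third(fraction: float) -> str:
    """Name the third of the parent sequence in which the fraction falls."""
    return ("first", "middle", "last")[min(int(fraction * 3), 2)]


# Hypothetical parent with a 31-residue C-terminal segment after the insert.
parent_segments = [(1, 200), (300, 330)]
position = relative_insertion_point(parent_segments, insert_start=201)
print(round(position, 3), third(position))
```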
We wanted to determine how the insertion context affects the distance between N and C termini of an inserted domain. Distance between termini was defined as the distance between Cα atoms of the first and the last residue of the domain. We first calculated distances for domains that do not participate in insertions. In order to do this, we considered 1000 domains, each representative of a SCOP superfamily. We obtained sequences and coordinates for these 1000 domains from the ASTRAL compendium.11 Only 687 domain sequences are covered completely by coordinates. Using AEROSPACI scores,11 we were able to find 60 substitutes for the 313 representative domains that are not entirely covered by coordinates. Altogether, we obtained complete coordinate information for 747 domains (687 + 60). Because we confined our analysis to five major SCOP classes, we calculated distances between termini for 711 domains, as the rest do not belong to the five classes being investigated. The average distance for representative domains is 25 Å.
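The distance between termini, as defined above, can be computed from PDB-format ATOM records with a short Python sketch. It assumes the standard fixed-column layout of ATOM records and ignores insertion codes and alternate locations beyond the first; the helper `make_atom_line` exists only to build a small synthetic example and the coordinates are invented.

```python
import math
from typing import Dict, Iterable, Tuple

Coord = Tuple[float, float, float]


def ca_coordinates(pdb_lines: Iterable[str], chain: str) -> Dict[int, Coord]:
    """Map residue number to Cα coordinates for one chain."""
    coords: Dict[int, Coord] = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA" and line[21] == chain:
            resseq = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords.setdefault(resseq, xyz)  # keep the first record per residue
    return coords


def termini_distance(pdb_lines: Iterable[str], chain: str, first: int, last: int) -> float:
    """Distance in Å between the Cα atoms of the first and last domain residues."""
    coords = ca_coordinates(pdb_lines, chain)
    return math.dist(coords[first], coords[last])


def make_atom_line(serial, name, resname, chain, resseq, x, y, z):
    """Build a synthetic ATOM record (example helper only)."""
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, resname, chain, resseq, x, y, z)


# Synthetic three-residue "domain" with invented coordinates.
lines = [
    make_atom_line(1, " CA ", "GLY", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA ", "ALA", "A", 2, 3.8, 0.0, 0.0),
    make_atom_line(3, " CA ", "SER", "A", 3, 3.0, 4.0, 0.0),
]
distance = termini_distance(lines, "A", first=1, last=3)
print(f"{distance:.1f} Å")
```

For a discontinuous domain, `first` and `last` would be the first residue of the first segment and the last residue of the last segment.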
Calculation of distances between the termini of inserted domains was less straightforward. Domain boundaries reported in SCOP are defined manually. Therefore, we compared SCOP domain boundaries for 41 inserted domains against the domain boundaries reported in the CATH database.24 In contrast to SCOP, the CATH structural classification of proteins is produced automatically. However, only 28 out of 41 inserted domains were available in CATH. For the other 13, there were differences in domain classification or the corresponding proteins were absent from the CATH classification. For 28 inserted domains, boundaries are identical between SCOP and CATH. The average distance between domain termini of inserted domains is 8 Å (confidence level 99%), which is about two-thirds shorter than the 25 Å average distance between termini in domains that do not participate in insertions.
There are two superfamilies that occur in both parent and insert context. These examples allowed us to compare distances between termini for a parent and an insert from the same superfamily. In the case of the FAD/NAD(P)-binding domain superfamily, the distances are 30 Å and 5 Å for parent and insert, respectively. These figures are 11 Å and 8 Å for the NAD-binding Rossmann domain superfamily. Thus, our analysis unambiguously shows that the ends of inserted domains are significantly closer than the ends of parent domains, or domains not participating in insertions.
It is interesting to speculate how the distance between domain termini can affect stability and conformational flexibility of a protein domain. While insertion context might generally reduce conformational freedom of the domain, it can simultaneously contribute to the stability of the domain, which would in turn affect its function. One can also imagine how the close proximity of domain termini can restore protein conformational flexibility by mimicking an inter-domain link observed in sequentially ordered domains.
Utilising an evolutionary basis of domain classification, we described the nature and characteristics of domain insertions in known protein structures. Domain insertions represent an unusual but abundant case of multi-domain proteins. Our analysis provides several novel insights into the nature and characteristics of domain insertions: (1) Insertions account for 9% of the unique domain combinations observed in the non-redundant PDB. (2) The majority of insertions are single domain insertions. We also found two-domain, three-domain, and nested insertions. (3) α/β class has a higher propensity to accept insertions. This can be correlated with the size and function of proteins within this class. (4) In most cases, parent domains are found to be longer than the inserted domains. (5) When fold and superfamily participations were considered for parents and inserts, the former are found to be more versatile than the latter, in that the parent domains combined with different partners. (6) The point of insertion is biased towards the C terminus of parents, whenever the parent domain belongs to the superfamily that sequentially combines with other superfamilies. (7) Inserted domains tend to have juxtaposed termini compared to parent domains or domains that do not participate in insertions. Perhaps, these domains are more viable in the insert context when their termini are close in space; small size can further contribute to their viability.
Our results clearly indicate that, despite the structural and functional constraints inherent in the insertion of a domain into another, domain insertion is an effective way to create multi-domain proteins. Functional hybrid proteins have been created through domain insertion in the laboratory by several groups. We cite three examples to support our observations. Betton and co-workers created hybrid proteins by inserting a penicillin-hydrolysing enzyme TEM β-lactamase (Bla) into the maltodextrin-binding protein (MalE);25 they used the permissive insertion sites identified before.26 Of the two insertions that resulted in functional hybrids, one insertion occurred in the first quarter of the MalE protein, while the other occurred in the last quarter. The parent protein (MalE) belongs to the α/β class; the distance between the termini of the inserted domain (Bla) is 5 Å, as shown by the authors. The proteins 1,3-1,4-β-glucanase from Bacillus macerans (wtGLU) and 1,4-β-xylanase from Bacillus subtilis (wtXYN) are single-domain jellyroll proteins catalysing similar enzymatic reactions; cpMAC-57 is a circularly permuted variant of wtGLU. Ay et al. created a fusion protein by inserting wtXYN into cpMAC-57. The authors showed that both domains fold spontaneously and have enzyme activities at wild-type level. The crystal structure of the chimeric protein showed a nearly ideal, native-like fold for both domains.27 In the third example, Collinet et al. successfully produced a chimeric protein with a two-domain insertion. They inserted the monomeric proteins dihydrofolate reductase (159 residues, belongs to α/β class) and β-lactamase (263 residues, belongs to the class “multi-domain proteins” in SCOP) in four different positions of the host protein phosphoglycerate kinase (415 residues, belongs to α/β class) and showed that both the host and the inserted partners are functional.28 They also observed functional coupling between the two fused partners in some of the chimeras. Thus, we believe that our description of the many features of domain insertions could be used for creating novel multi-functional fusion proteins by employing protein engineering methods. We have developed a web resource for domain insertions in protein structures that are classified in the SCOP database†.29
We thank Cyrus Chothia and Sarah Teichmann for discussions, and Siarhei Maslau and Emma Hill for valuable comments on the manuscript. R.A.-S. and R.S. are grateful to the Cambridge Commonwealth Trust and the Medical Research Council, UK for financial support.
†April 2002 release obtained from ftp://ftp.embl-heidelberg.de/pub/databases/protein_extras/pdb_select
Supplementary data associated with this article can be found at doi: 10.1016/j.jmb.2004.03.039