Identification of structural domains of splicing proteins
Our main priorities in identifying structural domains of splicing proteins were to check and correct previously reported domain boundaries and to identify and characterize domains that were not available in UniProt and other databases. We focused on 252 proteins of the human spliceosome, including 244 proteins found in the results of proteomics analyses of the major human spliceosome and 8 proteins specific to the U11/U12 subunits of the minor spliceosome (see ‘Materials and Methods’ section for references to protein sources and Supplementary Table S1
for protein GIs). We did not find any references to U4atac/U6atac-specific proteins either in the literature or in the Gene Ontology (GO) database (http://geneontology.org). A total of 118 proteins were classified as ‘abundant’ as in (2); other proteins were classified as ‘non-abundant’. ‘Abundant’ proteins are suggested to be the most important for the correct action of the spliceosome (2).
Using a combination of protein fold-recognition and sequence conservation-based domain identification methods, we identified 465 ordered structural domains in the 252 proteins, including 80 domains in the snRNP proteins of the major human spliceosome (Supplementary Table S2). Ordered structural domains cover >80% of the ordered regions of the proteins, and ~50% of all residues in the splicing proteins. Correspondingly, close to half of the human spliceosomal proteome is predicted to be intrinsically disordered. The analysis of the various structural and functional types of intrinsic disorder in the spliceosome produced a volume of data whose presentation is beyond the scope of this article and which is therefore the subject of a separate article (I.K. and J.M.B., submitted for publication).
Statistics of structural domains detected in the human spliceosomal proteome
Based on the predicted order/disorder boundaries and the presence/absence of predicted secondary structure elements, we also detected 25 regions that we termed ‘suspected domains’. This category included two groups of regions. The first group were domain-length (>40 residues) regions without a recognized fold that were the only ordered regions of otherwise highly intrinsically disordered proteins (≥70% residues predicted to be disordered). The second group were present in proteins with low-to-middle intrinsic disorder content (<70% residues predicted to be disordered) that contained other ordered structural domains. The ‘suspected domains’ in these proteins were ordered regions that had clear order/disorder boundaries and contained predicted secondary structure elements, but lacked a PFAM domain assignment (30
) and showed no clear relationship to any known folds according to protein fold-recognition analyses.
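The two groups of ‘suspected domains’ described above can be expressed as a simple decision rule. The following Python sketch is only an illustration of that rule: the class, field and function names are hypothetical, and the assumption that group 2 has no explicit length cut-off is ours, as the text gives a length threshold only for group 1.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the text.
MIN_DOMAIN_LENGTH = 40         # "domain-length (>40 residues)"
HIGH_DISORDER_FRACTION = 0.70  # ">=70% residues predicted to be disordered"


@dataclass
class OrderedRegion:
    """Hypothetical summary of one predicted ordered region."""
    length: int
    has_recognized_fold: bool
    has_pfam_assignment: bool
    has_secondary_structure: bool
    has_clear_boundaries: bool


def suspected_domain_group(region: OrderedRegion,
                           protein_disorder_fraction: float,
                           other_ordered_regions: int,
                           other_ordered_domains: int) -> Optional[int]:
    """Return 1 or 2 for the two groups of 'suspected domains', else None."""
    if region.has_recognized_fold:
        return None
    highly_disordered = protein_disorder_fraction >= HIGH_DISORDER_FRACTION
    # Group 1: the only ordered region of a highly disordered protein.
    if (highly_disordered and other_ordered_regions == 0
            and region.length > MIN_DOMAIN_LENGTH):
        return 1
    # Group 2: an unassigned ordered region in a protein that also
    # contains other ordered structural domains.
    if (not highly_disordered and other_ordered_domains > 0
            and region.has_clear_boundaries
            and region.has_secondary_structure
            and not region.has_pfam_assignment):
        return 2
    return None


# Made-up example: an 85-residue foldless region in an 82%-disordered protein.
lone_region = OrderedRegion(85, False, False, True, True)
print(suspected_domain_group(lone_region, 0.82, 0, 0))
```

The example region and its disorder fraction are invented for illustration and do not correspond to any protein in the data set.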
Ordered domains of splicing proteins classified in the SCOP (49) catalogue belong to classes a–e and g, with an over-representation of class d, which contains superfamily d.58.7 (RNA-binding domain, RRM (RBD), which usually corresponds to PFAM domain PF00076, RRM_1). RRM is present in the 252 proteins in as many as 117 copies. This means that roughly every fourth to fifth domain in the spliceosomal proteome is an RRM. As RRM is a small domain that usually binds single-stranded RNA (63), this reflects the key role of protein–RNA interactions in the splicing process.
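The estimate of the RRM share can be checked directly from the counts given above. The short Python sketch below is only an illustration; the variable names are ours, and the choice of denominator (ordered domains alone, or ordered plus ‘suspected’ domains) is an assumption, since the text does not state which total was used.

```python
# Counts quoted in the text; variable names are illustrative only.
rrm_copies = 117
ordered_domains = 465
suspected_domains = 25

# Share of RRM copies among ordered domains only.
share_ordered = rrm_copies / ordered_domains
# Share among ordered plus 'suspected' domains (assumed alternative total).
share_with_suspected = rrm_copies / (ordered_domains + suspected_domains)

print(f"RRM share of ordered domains: {share_ordered:.1%}")
print(f"RRM share of ordered + suspected domains: {share_with_suspected:.1%}")
```

Both values fall between one in five and roughly one in four, in line with the statement in the text.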
Statistics of ordered structural domains of the human spliceosome according to the SCOP classification
Other common types of ordered protein regions found in the human spliceosomal proteome include other small RNA-binding domains, large α- and β-repeat-based protein-binding domains, small protein disorder-binding domains, ubiquitin-related domains and stable multidomain RNA helicase architectures. Repeat-based domains are often found as building blocks of protein complexes, while some of the ubiquitin-related domains have been shown to be part of a putative ubiquitin-based system of controlling spliceosome assembly and dynamics (22).
Common types of ordered structural domains in the human spliceosomal proteome
In addition to ordered domains, we found nine regions with an expected independent function that were predicted to be disordered, but that were either found in experimental structures or could be confidently modeled due to strong sequence matches to known domains. We considered these nine regions to be putative disordered domains that undergo a transition to order upon entering a complex. We discuss the features of these domains in an independent article that focuses specifically on intrinsic disorder in the spliceosomal proteome (I.K. and J.M.B., submitted for publication). Here, we will only note that, in general, the identification of disordered structural domains is currently a non-trivial task in comparison with the identification of ordered structural domains, as fewer experimentally validated examples of disorder exist in databases and the properties of disorder make automated identification and propagation more difficult.
Non-redundant set of experimental and theoretical structural models
Following the identification of domains, we constructed a non-redundant set of experimental and theoretical structural models of regions in splicing proteins. As the utility and credibility of models, both experimental and theoretical, depend on their accuracy, we set some simple heuristic rules of preference to increase the chance that we chose the models of the best quality. We preferred experimental models over theoretical models, X-ray experimental models over NMR experimental models and comparative theoretical models over de novo theoretical models. The lowest tier in the hierarchy was pro forma constructs, in which only the primary and secondary structure were represented explicitly, while the tertiary arrangement was arbitrary. As a result, we mapped 104 non-redundant experimental models to the sequences of the spliceosomal proteins, and created 255 comparative and 43 de novo models (Supplementary Table S3), as well as over 500 constructs. The 104 non-redundant experimental models include 23 models of (nucleo)protein complexes, of which 13 complexes have residues from more than one spliceosome-associated protein. While models of complexes tend to have lower accuracy than models of isolated chains, we considered them to be more informative about protein function than models of isolated chains. This was the only instance where we favored the availability of additional information over plain accuracy of the structure.
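The preference hierarchy described above amounts to an ordered ranking of model types. A minimal Python sketch of such a ranking is given below; the enum, the function name and the example identifiers are hypothetical, and the sketch deliberately omits the exception made for models of complexes, which were favored for their additional functional information.

```python
from enum import IntEnum


class ModelType(IntEnum):
    """Model types ordered by preference (lower value = preferred)."""
    XRAY = 0
    NMR = 1
    COMPARATIVE = 2
    DE_NOVO = 3
    PRO_FORMA = 4


def pick_representative(candidates):
    """Return the most preferred (model_id, ModelType) pair for one region."""
    return min(candidates, key=lambda candidate: candidate[1])


# Made-up candidate models for a single region.
candidates = [
    ("model_a", ModelType.COMPARATIVE),
    ("model_b", ModelType.NMR),
    ("model_c", ModelType.DE_NOVO),
]
print(pick_representative(candidates))
```

Ties between models of the same type are not resolved here; in practice further criteria (for example, resolution or predicted quality) would be needed.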
Structural representations of regions of proteins of the human spliceosomal proteome
Over 90% of ordered regions of splicing proteins can be associated with experimental structural information or with comparative and de novo models. This value is similar for the proteins of the snRNP subunits of the major spliceosome and other proteins associated with the human spliceosome. Among the different types of structural representations, experimentally determined structural models cover 20.6% of all ordered residues, the comparative models we generated cover 67.4% of all ordered residues, and the de novo models cover 4.8% of all ordered residues. Hence, our theoretical models cover roughly 3.5 times the length of ordered protein sequence covered by experimental models.
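The coverage figures quoted above can be combined with simple arithmetic. The Python sketch below only reproduces the percentages given in the text; the dictionary keys and variable names are ours.

```python
# Percentages of all ordered residues covered, as quoted in the text.
coverage = {"experimental": 20.6, "comparative": 67.4, "de_novo": 4.8}

theoretical = coverage["comparative"] + coverage["de_novo"]
total = theoretical + coverage["experimental"]
ratio = theoretical / coverage["experimental"]

print(f"Theoretical models: {theoretical:.1f}% of ordered residues")
print(f"All model types:    {total:.1f}% of ordered residues")
print(f"Theoretical / experimental coverage: {ratio:.2f}")
```

The total is consistent with the statement that over 90% of ordered regions are covered, and the ratio of theoretical to experimental coverage is between three and four.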
Coverage of structural order and disorder with different types of structural models. The values displayed on the graph are the number of residues covered by a given type of structural model, followed by percentage value.
X-ray crystallography is useful for the structure determination of large proteins (>30 kDa) and protein complexes, while NMR is well-suited for the structure determination of relatively small proteins. Not surprisingly, the ratio of the number of ordered residues in proteins from snRNP subunit structures solved by X-ray crystallography versus NMR is ~3:1 (15.7%:4.7%), while this ratio for all splicing proteins is ~1.77:1 (13.4%:7.2%). The main reason for this is that small domains are statistically more populous in the general set of splicing proteins compared to the snRNP subunits. Conversely, most structures of protein–protein complexes available for splicing proteins include regions from snRNP proteins. Since the resolution (and hence accuracy) of experimentally determined structures is typically inversely correlated with the molecule or complex size, X-ray models of snRNP proteins have on average a slightly worse resolution (mean 2.20 Å) than X-ray models of all spliceosomal proteins (mean 2.08 Å).
For predicted disordered regions, confident structural coverage is very low in comparison to ordered regions. Less than 2% of residues predicted to be disordered are covered by experimental models, and even together with our theoretical models, we could only cover 8.9% of all disordered residues. Moreover, most of the residues covered belong to linkers between ordered structural domains or short regions in protein termini. This low coverage of intrinsically disordered regions by structural models may in the future pose a considerable challenge to producing a comprehensive structural model of the spliceosome.
Assessment of model quality
For all models except pro forma constructs, we also independently evaluated their accuracy to determine how credible they were. To do this, we used two methods: MetaMQAPII (58) and QMEAN (59). Both of them provide a global score for the entire model (predicted RMSD for MetaMQAPII, QMEAN Z-score for QMEAN) as well as a local score for individual residues (in this analysis, only the MetaMQAPII local score was used). Functionally relevant and evolutionarily conserved regions (e.g. binding interfaces) are typically predicted with a higher than average accuracy, in particular when comparative modeling is used. Consequently, even a model with a poor global score can be useful for functional considerations, if its functionally important parts are scored well and are likely to be accurate. Some readers may also be interested in scores that describe only the model’s quality with respect to a particular feature (e.g. secondary structure). To help describe different features of models, we recorded the mean values and standard deviations of QMEAN Z-scores for six QMEAN contributing factors. These values for all models are provided with the manuscript (Supplementary Table S4).
For comparison with theoretical models, we ‘predicted’ the global quality of experimentally determined structures (Supplementary Figure S1). As expected, both the X-ray and NMR models we selected for our data set are scored highly by both MetaMQAPII and QMEAN, which is an indicator of the high accuracy of these structures (for RMSD, the lower the score, the better the model; for the QMEAN Z-score, good models are scored higher). Mean QMEAN Z-scores for models of both types (0.42 for X-ray and 0.08 for NMR) compare favorably to mean QMEAN Z-scores of models across the entire PDB (−0.58 and −1.19, respectively) (67). As X-ray models in our database were scored slightly better than NMR models, we used scores for X-ray models as a benchmark with which to classify theoretical models into those ‘likely to be globally accurate’ or ‘unlikely to be globally accurate’. The worst-scored X-ray models in our data set have a predicted RMSD of 4.5 Å (PDB ID 2ok3, resolution 2.0 Å) and a QMEAN Z-score of −1.99 (PDB ID 2qfj, resolution 2.10 Å). Consequently, we divided all non-X-ray models into four classes depending on whether they pass both, one or neither of two thresholds: predicted RMSD ≤4.5 Å and QMEAN Z-score ≥−2.0.
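The four-class division follows directly from the two thresholds. The Python sketch below illustrates it; the function name and class labels are ours and not those used in the article, and the example scores are the values quoted for individual models elsewhere in the text and figure captions.

```python
RMSD_MAX = 4.5      # Å; threshold on the MetaMQAPII predicted RMSD
QMEAN_Z_MIN = -2.0  # threshold on the QMEAN Z-score


def quality_class(predicted_rmsd: float, qmean_z: float) -> str:
    """Assign a model to one of four classes (labels are illustrative)."""
    passes_rmsd = predicted_rmsd <= RMSD_MAX
    passes_qmean = qmean_z >= QMEAN_Z_MIN
    if passes_rmsd and passes_qmean:
        return "passes both"
    if passes_rmsd:
        return "passes MetaMQAPII only"
    if passes_qmean:
        return "passes QMEAN only"
    return "passes neither"


# (predicted RMSD in Å, QMEAN Z-score) as quoted in the text.
examples = {
    "FLJ35382 ubiquitin-fold region": (3.5, -1.33),
    "hPrp3 DUF1115 region": (3.7, -3.06),
    "original SF3a60 zinc-finger model": (8.8, -1.93),
    "published Prp8 bromo-domain model": (8.7, -4.25),
}
for name, (rmsd, z_score) in examples.items():
    print(f"{name}: {quality_class(rmsd, z_score)}")
```

As noted in the text, a model that passes only one threshold is not necessarily wrong; model length should be taken into account when interpreting such cases.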
Predicted quality of models of regions of human spliceosomal proteins
Figure 3. Models of regions of human splicing proteins divided by quality. This bubble graph displays the numbers of models of different types that belong to different classes of quality. Mean lengthcomp is the mean length of a comparative model of a given quality ...
The majority of both NMR and theoretical models belong to the most reliable class (i.e. ‘scored not worse than the worst crystal structures in the data set’). These models are expected to be generally correct, although their local accuracy may vary. Models scored well only by one method should be treated with more caution than models scored well by both methods. However, poor scoring by one method may also be due to the model being either very short or very long. Models that are scored poorly by MetaMQAPII but well by QMEAN are usually short, while models that are scored well by MetaMQAPII but poorly by QMEAN are usually long. The mean length of a model scored well by both methods is 220 residues, but the mean length of a model scored well only by QMEAN is 70 residues and the mean length of a model scored well only by MetaMQAPII is 362 residues. Therefore, we urge the reader to consider the length of the model before using models scored poorly by only one method.
Over 40 models are scored poorly by both MetaMQAPII and QMEAN. These models may have been built on remotely related templates or did not fold well when modeled de novo, and are to be expected to have various errors. Based on our previous experience, we believe that some of these cases may represent new protein folds or interesting variations of known folds that present considerable challenge for protein modeling methods. Hence, while we regard these models as unreliable, we propose the corresponding proteins or domains as attractive targets both for experimental protein structure determination, and for protein modeling with other advanced techniques.
The entire non-redundant set of representations (including selected representative models determined by experimental methods, and all theoretical models built with computational methods) is available as an online database SpliProt3D at http://iimcb.genesilico.pl/SpliProt3D. The web server allows for browsing, selecting and downloading the models. Proteins are also associated with sequence alignments annotated with predictions of intrinsic order versus disorder, predictions of secondary structure, protein-binding disorder, solvent accessibility and coiled-coils, as well as the positions of post-translational modifications. The database will be curated and new entries will be added and obsolete ones archived following the progress in structure determination of new spliceosomal proteins and/or publication of new theoretical models with better predicted accuracy. We would like to encourage structural biologists working on structure determination or prediction for spliceosomal proteins to contact us to have their models included and referenced in our database.
Comparison of predictions with the experimentally determined SF3A structure
After submission of this article for review, a crystal structure of the yeast U2 snRNP SF3A sub-complex was published (68
), giving us an opportunity to compare some of our predictions with the independently determined experimental structure.
The structure of the yeast SF3A complex includes, in addition to several regions composed of individual secondary structure elements, three ordered domains for which an experimental structure had not been published before. One domain in the yeast protein Prp9 is >200 residues long (its counterpart in the human protein SF3a60 is situated roughly between residues 1–77, 129–244 and 310–372); it features a novel helical architecture. Originally, we made no tertiary structural predictions for this domain (i.e. our database contained only constructs), and it is highly unlikely that the structure of this domain could have been predicted accurately by a standard bioinformatics approach. Another domain in the yeast Prp9 is a zf-C2H2 zinc finger inserted into the long helical domain, whose counterpart in the human protein SF3a60 lacks the Zn-binding residues and is closely neighbored by another insertion, of a SAP domain. Despite these differences, in our original model of this domain (with a predicted RMSD of 8.8 Å and a QMEAN Z-score of −1.93), we correctly predicted the fold and the position of nearly all residues in this zinc finger. We also correctly predicted the boundaries and the fold of an all-β domain in the human protein SF3a66, a counterpart of the yeast protein Prp11. The original comparative model of this domain had a predicted RMSD of 4.7 Å and a QMEAN Z-score of −0.92, with a medium reliability of the fold prediction. In practice, upon comparison, this translated to predicting the position of approximately half of the residues in the domain correctly. This analysis demonstrates the utility of the predictions, and that even models with a relatively low predicted accuracy can, in fact, exhibit correct folds, spatial shapes and locations of some of the functionally important residues.
Given the availability of the new template, we generated new models for the human counterparts of the SF3A crystal structure, using the comparative approach. We also generated a new comparative model for a domain in the C-complex-related protein cactin (NY-REN-24/C19orf29, gi: 126723149) as this protein is predicted to have a domain with the same all-β fold as the SF3a66 domain. The new models have been deposited in the database, while the old models have been moved to the archive of the ‘obsolete’ entries and are still available for analysis.
Ubiquitin-related domains are most common in the proteins of the late stages of splicing
Given the known role of ubiquitin in controlling spliceosome assembly and dynamics (21), and the fact that ubiquitin-related domains are one of the largest groups of domains in splicing proteins, we were interested in learning how these domains were distributed across the different groups of splicing proteins. We found 19 potential or known ubiquitin-related domains in 15 splicing-related proteins, including 12 abundant proteins of the major spliceosome and one protein of the U11/U12 di-snRNP subunit of the minor spliceosome. These domains cover most of the main classes of ubiquitin-related domains, including ubiquitin fold domains, RING zinc finger/U-box domains that may act as ubiquitin ligases, a ubiquitin conjugating enzyme-like domain, a ubiquitin carboxyl-terminal hydrolase domain and the JAB1/MPN domain of protein U5-220K (hPrp8) described in (23). In several cases, such as that of the abundant C-complex-specific protein FLJ35382 (C1orf55) and the TREX complex protein THOC5, only similarity of a protein region to a known ubiquitin-related fold could be detected.
Ubiquitin-related regions in the spliceosomal proteome
Figure 4. Ubiquitin-related structural regions of human splicing proteins. (A) Ubiquitin-fold region of protein FLJ35382 (C1orf55; residues 1–80). Predicted RMSD 3.5 Å, QMEAN Z-score −1.33. (B) RWD-like region of protein THOC5 (residues ...
Ubiquitin-related domains are more abundant in proteins active in the late stages of splicing (B, B-act and C complexes). The ubiquitin-fold domain of protein SF3a120 is the only ubiquitin-related domain found in the U2 snRNP (its counterpart is found in the U11/U12 di-snRNP). On the other hand, as many as three proteins of the B/B-act complex (UBL5, Cyp-60 and RNF113A) and four proteins of the C complex (FLJ35382/C1orf55, XAP-5/FAM50A, NOSIP and CCDC130) contain ubiquitin-related domains, in addition to a domain in the U5 snRNP (the JAB1/MPN of U5-220K) and a protein in the U4/U6.U5 tri-snRNP (U4/U6.U5-65K). In summary, this distribution suggests that the late stages of splicing are probably under a stricter ubiquitin-based control than the early stages. This may be due to the fact that the earlier stages of splicing, such as intron/exon definition, are more dependent on weak, disorder-based interactions, while the later catalytic stages require precise subunit rearrangements.
Zinc finger-like domains flanked by conserved intrinsically disordered regions in U2 snRNP SF3a120 and other splicing proteins
Our fold-recognition (FR) analysis detected that the human SF3A sub-complex contains, in addition to the zinc finger in protein SF3a60, another degenerate C2H2 (g.37.1)-type zinc finger in the middle conserved region of protein SF3a120 (conserved region: residues 217–530, PFAM domain PRP21_like_P; zinc finger: residues 407–435). In Saccharomyces cerevisiae, this zinc finger is absent entirely. However, in the majority of non-animal species, especially other fungi, amoeba and Apicomplexa, this zinc finger retains some of the cysteine and histidine zinc-binding residues (A). The zinc finger remnant is surrounded on both sides by intrinsically unstructured regions that are in part predicted to form helical (potentially coiled-coil) structures. The short motifs lying on the distal ends of the disordered linkers are conserved. An additional coiled-coil region connects the N-terminal conserved motif with the previously described (69) second Surp module of SF3a120. Thus, the PRP21_like_P module consists of three motifs, the second of which is a zinc-finger remnant, connected by flexible linkers, with an N-terminal coiled coil that connects the N-terminal motif to the Surp region (B). Structural modules of this type usually serve to simultaneously contact a binding partner of the protein in several locations. In the particular case of SF3a120, it has been suggested that both the U2 snRNA and an as yet unidentified splicing protein are potential partners (69).
Figure 5. Architecture of the conserved middle region of protein SF3a120 (residues 217–530). (A) Alignment of the residues of a zinc-finger domain in the middle part of SF3a120 (residues 407–435). The ‘g.37.1’ annotation row displays ...
Through a systematic search, we found several other examples of zinc finger and zinc finger-like domains embedded in conserved disordered regions in the spliceosomal proteome. Alternatively, tandem zinc fingers can be separated, e.g. by predicted coiled-coil regions. The new zinc-finger domains we found usually belong to the zf-C2H2 (g.37.1) type, which can bind RNA and/or mediate protein–protein interactions. The pre-mRNA/mRNA-binding protein ARS2 contains a ZZ RING zinc finger, while the C complex protein NOSIP contains two RING zinc finger/U-box-like regions.
Zinc-finger domains flanked by or embedded in predicted disordered regions
BLUF-like domain (DUF1115) of the U4/U6 di-snRNP protein 90K (hPrp3)
The C-terminal ordered domain of protein U4/U6-90K (hPrp3), which corresponds to PFAM domain DUF1115 (PFAM ID: PF06544; residues 540–683), was predicted in our analysis to have a ferredoxin-like fold. It is predicted to be related to the acylphosphatase/BLUF domain-like superfamily (SCOP ID: d.58.10). BLUF family domains have two additional helices in the C-terminus compared to acylphosphatase family domains. These helices are present in the DUF1115 domain, and so this domain is predicted to be a BLUF-like domain. This is an unusual assignment, because the BLUF domain is a FAD/FMN-binding blue light photoreceptor domain found primarily in bacteria. In Eukaryota, it is found almost exclusively in euglenids and Heterolobosea. On the other hand, DUF1115 is found exclusively in eukaryotes. However, the very high scores of BLUF domain templates yielded by FR methods for the hPrp3 DUF1115 sequence strongly suggest that this domain is homologous to the BLUF family.
Figure 6. BLUF-like region of protein U4/U6-90K (hPrp3) (domain DUF1115, residues 540–683). The position of the conserved residue W604 is displayed. Predicted RMSD 3.7Å, QMEAN Z-score −3.06.
Nevertheless, DUF1115 differs from BLUF domains in some key features. The FAD/FMN-binding residues are not conserved in DUF1115, nor is a tryptophan residue whose position is altered depending on the excitation state of the photoreceptor (70) (Supplementary Figure S2). On the other hand, DUF1115 contains a disordered loop between the second α-helix and the fifth β-strand. The presence of this loop, though not its length, is conserved in DUF1115 domains. Moreover, a conserved tryptophan residue, W604 in hPrp3, is located next to the disordered loop.
Based on biochemical data, the DUF1115 domain may be a region of interaction of hPrp3 with the U5 snRNP protein hPrp6 and/or the U4/U6.U5 tri-snRNP protein U4/U6.U5-110K (SART-1) (71
). However, it is also possible that this interaction proceeds through the disordered PRP3 domain of this protein (71
). A possible alternative role for DUF1115 is suggested by the fact that, apart from proteins from the hPrp3 family, it is found only in a family of proteins containing the RWD domain. The RWD domain belongs to the ubiquitin conjugating enzyme superfamily (72
). Hence, the hPrp3 DUF1115 may be a part of the spliceosomal ubiquitin-based system.
N-terminal PWI-like domains of the helicases hPrp22 (DHX8), hPrp2 (DHX16) and hBrr2 (U5-200K)
hPrp22 (DHX8) and hPrp2 (DHX16) are RNA helicases that function in the remodeling of the spliceosome (6). According to our predictions, these two helicases contain N-terminal ordered helical bundles with a PWI superfamily fold (SCOP superfamily a.188.1) and similarity to the PFAM PWI domain. PWI is a nucleic acid-binding domain first described in the splicing protein SRm160 (73). PWI is also found in the animal protein U4/U6-90K (hPrp3). The hPrp22 and hPrp2 PWI-like bundles (hPrp22: residues 1–92 or 1–120; hPrp2: residues 1–95) are not found in a search with the profile of the PFAM PWI domain, possibly because their eponymous PWI tripeptide motifs are degenerate. In hPrp22 and its homologs, only the third position of this motif is conserved: [x][x][IV], while in hPrp2 and its homologs, the second and third positions are usually conserved: [x][WFY][IV]. However, PFAM displays several putative hPrp2/hPrp22 homologs when queried for proteins that contain PWI domains. Furthermore, stable binding to nucleic acids by PWI requires an adjacent basic-rich region (74). We found potential candidates for such ancillary regions both in hPrp22 and in hPrp2 (hPrp22: residues 93–116; hPrp2: residues 120–132).
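The degenerate motif patterns quoted above map naturally onto regular expressions. The Python sketch below shows how a tripeptide could be tested against the canonical PWI motif and the two degenerate patterns; the dictionary, the function name and the test tripeptides are hypothetical and are not taken from the actual sequences.

```python
import re

# Patterns quoted in the text; [x] (any residue) is written as [A-Z].
MOTIF_PATTERNS = {
    "canonical PWI": re.compile(r"PWI"),
    "hPrp2-like [x][WFY][IV]": re.compile(r"[A-Z][WFY][IV]"),
    "hPrp22-like [x][x][IV]": re.compile(r"[A-Z][A-Z][IV]"),
}


def matching_patterns(tripeptide: str) -> list:
    """Return the names of all motif patterns that a tripeptide satisfies."""
    if len(tripeptide) != 3:
        raise ValueError("expected exactly three residues")
    return [name for name, pattern in MOTIF_PATTERNS.items()
            if pattern.fullmatch(tripeptide.upper())]


# Made-up tripeptides, from fully conserved to fully degenerate.
for tripeptide in ("PWI", "AFV", "AAV", "AAA"):
    print(tripeptide, matching_patterns(tripeptide))
```

Because the patterns are nested, a canonical PWI tripeptide also satisfies both degenerate patterns; a real search would additionally need the position of the motif within the aligned helical bundle, which this sketch does not model.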
Figure 7. PWI-like regions of splicing helicases. (A) hPrp22 (DHX8; residues 1–120 shown, but domain may end at residue 92). Predicted RMSD 2.4 Å, QMEAN Z-score −2.76. (B) hPrp2 (DHX16; residues 1–95). Predicted RMSD 5.8 Å, ...
Figure 8. The PWI domain and PWI-like regions in splicing helicases. In all alignments, the ‘PWI’ annotation row displays the residues of the PWI motif conserved in a given protein. The ‘jnetpred (…)’ annotation row displays ...
We also found a PWI-like helical bundle in the N-terminus of the human protein U5-200K (hBrr2; residues 258–338). This helical bundle is conserved across the majority of eukaryotes, and is found, for instance, in the S. cerevisiae Brr2. The PWI-like domain of U5-200K retains relatively well-conserved second and third positions of the tripeptide PWI motif: [x][WFY][ILV]. Notably, if correct, this prediction represents the first case in which a PWI-like domain is located in the middle of a protein. Usually, as is the case for SRm160, hPrp3, hPrp22 and hPrp2, a PWI domain is located either in the immediate N-terminus or in the immediate C-terminus of a protein. There are at least three candidate basic-rich regions in the vicinity of the U5-200K PWI-like domain (residues 254–259; 343–349; 373–386).
Sequences of proteins from the hPrp22 (DHX8) and hPrp2 (DHX16) families are very similar, to the effect that we could not easily separate them in a clustering analysis (Supplementary Figure S3
). The most important discriminant between the two families appears to be the presence of an S1 RNA-binding domain (PDB ID: 2eqs; DOI:10.2210/pdb2eqs/pdb, manuscript to be published) between the N-terminal PWI-like bundle and the C-terminal helicase domains. This domain is present in hPrp22 and its homologs, but not in hPrp2 and its homologs. This led us to the hypothesis that Prp2, with the PWI-like domain, was the ancestral protein, which then underwent the insertion of the S1 domain. Nevertheless, the PWI-like domains of hPrp22 and hPrp2 differ in several aspects.
The first difference lies in the above-mentioned degree of degeneration of the tripeptide PWI motif, which is larger in hPrp22 and its homologs than in hPrp2 and its homologs. In an extreme case, the N-terminus of the Prp22 protein of S. cerevisiae
and the related organism Eremothecium (Ashbya) gossypii
is located inside the motif, which is therefore incomplete. The degeneration of the PWI motif may be offset by the heavy conservation of a [DE][FY] motif in the second helix of the bundle. The main reason for the conservation of the PWI motif in canonical PWI domains is that it stabilizes the structure of the PWI domain (74
). It is possible that the conservation of the [DE][FY] motif is sufficient to guarantee the stabilization of the bundle in conjunction with the conservation of the third position of the PWI motif.
Second, there is also a possible difference in either the number or the arrangement of helices comprising the PWI domain. SCOP describes superfamily a.188.1 as a ‘four-helix bundle’. However, in the structure of the PWI domain from protein SRm160, the bundle is followed by an additional short α-helix orthogonal to the bundle (PDB ID: 1mp1) (74
). The presence of this α-helix is also predicted for the hPrp3 PWI domain, although it is missing from the available experimental structure (PDB ID: 1x4q; DOI:10.2210/pdb1x4q/pdb, manuscript to be published). Similarly, secondary structure predictions for hPrp2 also indicated that this protein is likely to contain an additional α-helix. However, for hPrp22, predictions of domain boundaries are less decisive. The hPrp22 PWI-like domain is either predicted to be a four-helix bundle (in which case it is confined to residues 1–92), or to contain an additional α-helix, but separated from the bundle by an intrinsically disordered region (in which case the domain spans residues 1–120). In either case, the helix arrangement is predicted to be different than in hPrp2. To note, the U5-200K PWI-like domain is predicted to be a five-helix domain.
Third, the pattern of evolutionary conservation of the PWI-like domains is different in hPrp22 and hPrp2. Fewer putative and confirmed hPrp2 homologs from different species have the PWI-like domain than do hPrp22 homologs. For instance, the functional analog of hPrp2 in S. cerevisiae, Prp2, is considered to be its homolog, but lacks the PWI-like domain. The Prp22 combination of PWI and S1 appears to be retained, while the Prp2 PWI is missing, also in putative homologs in organisms such as kinetoplastids (Trypanosoma brucei, Leishmania major), some Apicomplexa (Plasmodium falciparum, Babesia bovis, but not Tetrahymena thermophila, which has both), Trichomonas vaginalis and Entamoeba histolytica.
Altogether, the PWI-like domain of hPrp22 is more diverged from the canon, but more often retained, while the PWI-like domain of hPrp2 is less diverged from the canon, but more often completely lost. This result does not contradict the hypothesis that the Prp22 protein was formed by the insertion of the S1 domain into the ancestral Prp2. It rather suggests the possibility that some property of the ‘degenerated’ PWI-like domain ensured its retention in evolution. An in-depth structural study of this region may elucidate the reason why.
As hinted above, the U5-200K PWI-like domain is in many respects a ‘canonical’ PWI-like domain similar to that of hPrp2: it retains two out of three of the positions of the tripeptide PWI motif, and is predicted to be a five-helix domain. However, U5-200K is in general highly conserved, and unlike in hPrp2, this conservation also applies to its PWI-like domain.
The N-termini of S. cerevisiae Prp2 and Prp22 are dispensable for splicing (75), while the N-terminus of S. cerevisiae Brr2 was shown not to contact any of the proteins of the U4/U6.U5 tri-snRNP (71). Hence, the N-terminal PWI-like domains of hPrp2, hPrp22 and U5-200K are likely to have only a supporting role in splicing, one that is not revealed in the activity of the yeast proteins. We suggest that they may help in the correct positioning of the C-terminal helicase domains on the relevant snRNAs. Nevertheless, we could not find any data on the activity of the N-termini of hPrp2, hPrp22 and U5-200K. Furthermore, no experimental model of a PWI domain bound to RNA exists, to which we could compare the mode of binding of the hPrp2, hPrp22 and U5-200K PWI-like domains. Hence, as far as this publication is concerned, the question of what is bound to the PWI-like domains of the splicing helicases remains open.
An N-terminal domain of the hPrp8 protein (U5-220K)
We could not confirm a published prediction of a bromo-domain encompassing hPrp8 residues 127–242 (a part of the N-terminal PFAM domain PRO8NT), originally made for yeast Prp8 residues 200–315 (77). In our view, the bromo-domain assignment does not match a consistent evolutionary conservation pattern: it encompasses 20 residues universally conserved in Prp8 homologs from all known species and nearly 100 residues conserved only in some eukaryotic Prp8 homologs. On the other hand, we were able to construct a de novo model for the most conserved part (residues 86–150) of the PRO8NT domain (Supplementary Figure S4). Quality evaluation indicates that the model of the putative Prp8 bromo-domain described in (77) has low predicted accuracy (predicted RMSD 8.7 Å, QMEAN Z-score −4.25) compared to our de novo model of residues 86–150 (predicted RMSD 2.4 Å, QMEAN Z-score −1.93). Altogether, although we cannot exclude the possibility that PRO8NT encases a bromo-domain, we suggest that further studies (ideally, experimental structure determination) will be required to provide a confident structural model of this region.
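The comparison of the two models can be restated as a small Python sketch. It is an illustration only, not the evaluation procedure used in this work: the class `ModelQuality` and the function `preferred` are hypothetical names, and the rule applied (lower predicted RMSD and higher QMEAN Z-score are both taken as better, with no verdict when the two measures disagree) is an assumption made for the example. The numerical values are those quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelQuality:
    """Quality estimates for one structural model (hypothetical container)."""
    name: str
    predicted_rmsd: float  # predicted RMSD in Å; lower is taken as better
    qmean_z: float         # QMEAN Z-score; higher is taken as better


def preferred(a: ModelQuality, b: ModelQuality) -> Optional[ModelQuality]:
    """Return the model that is better on both measures, or None if they disagree."""
    if a.predicted_rmsd < b.predicted_rmsd and a.qmean_z > b.qmean_z:
        return a
    if b.predicted_rmsd < a.predicted_rmsd and b.qmean_z > a.qmean_z:
        return b
    return None


# Values quoted in the text for the two models of the Prp8 N-terminal region.
bromo = ModelQuality("bromo-domain model from (77)", 8.7, -4.25)
de_novo = ModelQuality("de novo model, residues 86-150", 2.4, -1.93)

best = preferred(bromo, de_novo)
print(best.name if best else "no clear preference")
```

Requiring agreement between the two measures keeps the sketch conservative: a model is preferred only when neither score argues against it.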
Other previously uncharacterized structural regions of abundant splicing proteins
We found several other new types of structured regions in abundant splicing proteins that we were able to assign to known folds and/or that are similar to existing structures, with varying degrees of confidence (). For instance, a region in the C-terminus of the hPrp19/CDC5L-related protein KIAA0560 (IBP160/Aquarius homolog; residues 453–1485) has a helicase architecture similar to that of the nonsense-mediated decay protein Upf1p (). KIAA0560 is a 1485-residue-long protein whose binding to pre-mRNA introns is necessary for the successful deposition of the exon junction complex on the pre-mRNA (78) and for the successful release of box C/D snoRNAs (small nucleolar RNAs) from introns (14). Upf1p contains two RNA helicase domains (c.37.1), the first of which is interrupted by two insertions: an all-β domain and an all-α domain (79). In KIAA0560, this first c.37.1 domain is interrupted three times: both of the original insertions are kept, but a third, largely disordered insertion has appeared between them.
Figure 9. Other previously uncharacterized structural regions of the spliceosomal proteome. (A) The C-terminus of protein KIAA0560 (AQR), structurally similar to protein Upf1p (residues 453–1485). RMSD 3.3 Å, QMEAN Z-score −4.97.
Another previously undescribed region lies in the C-terminus of the B complex protein TFIP11 (homolog of the yeast protein Spp382). The results of our FR analysis suggest that this region is a potential double-stranded RNA-binding domain (dsRBD) (). In other splicing proteins, such as the non-abundant A complex protein DHX9, dsRBDs often occur in tandem, but the TFIP11 region does not have a partner. However, TFIP11 also contains another previously structurally uncharacterized region with a putative RNA-binding function, a G-patch domain. While the G-patch domain does not show sequence similarity to any other known domain, a highly scoring de novo model of this domain shows structural similarity to a dsRBD (). In fact, in the non-abundant splicing-related protein SON, the G-patch domain occurs in tandem with a dsRBD partner. If the G-patch domain has a dsRBD-like fold, the TFIP11 G-patch domain could act as the second, tandem dsRBD-like partner of the newly described suspected domain of TFIP11.
We were also able to construct highly scored de novo models with a clear structural similarity to known folds for ordered helical regions located at the N-termini of proteins hnRNP R and Q. No known structural domain is assigned to these regions, but our de novo models exhibit fairly high scores (predicted RMSD 1.3 Å, QMEAN Z-score 0.12 for the region in protein hnRNP R). Based on structural similarity scores yielded by the DALI server (51), these may be helix-turn-helix domains ().
Other new putative structural domains are described in .
New types of predicted structural regions in the human spliceosomal proteome that can be classified into known superfamilies
Comparison of the human and Giardia lamblia spliceosomal proteome: setting priorities for spliceosome structure modeling
The human spliceosome, with its 118 abundant proteins, represents a fairly challenging target for both experimental and theoretical structural analyses. To round off our analysis, we wanted to put forth a candidate minimum set of structural regions in a functional spliceosome that, in our opinion, should be prioritized during the modeling of the structure of the complex.
In general, eukaryotic species with fewer introns have fewer splicing proteins. The yeast Saccharomyces cerevisiae has homologs of only 61 of the human abundant splicing-related proteins (2). On the other hand, S. cerevisiae also has some Saccharomycetes-specific splicing proteins, such as Prp24 (41), which do not appear in other fungi. In search of a ‘minimum’ set of regions to include in the model of a functional spliceosome, we turned to the extremely intron-scarce (80) parasitic organism G. lamblia, which is also known for its genome minimalism (82). This organism apparently underwent the reverse of the process that diversified and specialized the human spliceosomal proteome, namely the loss of many genes encoding spliceosomal proteins.
The genome of G. lamblia ATCC50803 encodes homologs of only 30 human abundant splicing proteins (). Two more proteins can be found in G. lamblia P15. However, not all of these homologs may be involved in splicing. For instance, G. lamblia ATCC50803 possesses orthologs of U4/U6-15.5K and EIF4A3. In humans, U4/U6-15.5K is a component of the U4/U6 di-snRNP, where it binds to U4/U6-61K (hPrp31) (83), while EIF4A3 is a protein of the EJC (33). U4/U6-61K and all EJC proteins save EIF4A3 are missing in G. lamblia. However, the human U4/U6-15.5K protein also participates in box C/D snoRNP formation (83), where it binds a different protein, which does have a G. lamblia homolog, and the human EIF4A3 is an isoform of the eukaryotic translation initiation factor 4A. It is therefore possible that their orthologs in G. lamblia perform only these splicing-unrelated functions.
Human spliceosomal proteins with potential G. lamblia homologs, and these potential homologs
There is a pattern to the presence and absence of abundant splicing-related proteins and/or their domains and disordered regions in the G. lamblia proteome. Almost all the proteins of the U2 snRNP are present in G. lamblia, as is a homolog of U2AF35K, but only some core proteins of the U5 snRNP, such as Prp8 and Brr2, are present. Snu114, which, according to the current understanding, is in other organisms the third member of the troika of U5 proteins essential to splicing (21), is an important absentee. Many proteins of the U1 snRNP and U4/U6 di-snRNP proteome are missing, as are all proteins specific to the human U4/U6.U5 tri-snRNP. The set of Step 2 factors is reduced to three RNA helicases, and these helicases are reduced to the C-terminal regions of their human counterparts, with a common architecture. The G. lamblia helicases are also impossible to assign unambiguously to their human or yeast counterparts. Clustering analysis of helicase sequences from different organisms places the G. lamblia helicases away from any major cluster (Supplementary Figure S3). Finally, G. lamblia has very few homologs of human proteins of the auxiliary complexes, and only two non-snRNP stage-specific proteins (PRP38 and RNF113A) are present in this organism.
The snRNP protein homologs present in the G. lamblia proteome are shorter than their human counterparts. Three main types of structural features that are common in human spliceosomal proteins are largely absent from the G. lamblia proteome:
- intrinsically disordered proteins or disordered regions with possibly autonomous function (long protein disorder that does not form inter-domain linkers, including compositionally biased disorder and some regions of disorder with preformed structural elements) and, consequently, highly disordered proteins, such as the U4/U6.U5-specific proteins U4/U6.U5-110K and U4/U6.U5-27K;
- short peptide regions that act as ligand partners for other splicing proteins (PRP4, SF3a60_bindingd, SF3b1 and the ULM-containing region of protein SF3b155) and their partners (PRP4 partner: U4/U6-20K; SF3a60_bindingd partner: the second Surp domain of protein SF3a120, a protein that is missing entirely (see below); SF3b1 partner: p14; SF3b155 ULM partner: U2AF65K);
- ubiquitin-related domains. These include the entire protein SF3a120 (which contains a ubiquitin domain in addition to the Surp domains); the U4/U6.U5-specific protein U4/U6.U5-65K, which contains the ubiquitin hydrolase domains zf-UBP and UCH; and the zf-C3HC4 RING zinc finger of protein RNF113A. In contrast, the zf-CCCH zinc finger of RNF113A, which is a putative RNA-binding domain, is present.
In our analysis of intrinsic disorder in the human spliceosomal proteome (I.K. and J.M.B., submitted for publication), we discuss how disordered regions of splicing proteins are tied to the dynamics, assembly and regulation of the spliceosome. This is also the function of known ubiquitin-related regions. Hence, it appears that G. lamblia is missing most proteins and/or protein regions primarily responsible for splicing regulation and dynamics. On the other hand, G. lamblia has retained pre-mRNA- and snRNA-binding proteins and/or regions, as well as proteins that directly assist in splicing, such as the catalytic factor helicases. It also appears that this parasitic organism’s ubiquitin-based system of splicing control is reduced, rather than entirely missing. The C-terminal Mov34/MPN/JAB1 domain present in Prp8 from human or yeast (SCOP superfamily c.97.3), which may be implicated in a ubiquitin-based system (65), is absent from the G. lamblia Prp8 homolog, but the corresponding region in the latter protein is predicted by FR analysis to be a domain with a ubiquitin-like fold (SCOP superfamily d.15.1).
It is possible that, like yeast, G. lamblia evolved its own specialized splicing proteins, which would not be detected in sequence similarity searches done with proteins from other organisms. Since G. lamblia is a parasite, it is also possible that it supplements some of its missing proteins (such as Snu114) from the host. Finally, it is also possible that some information was missed by our bioinformatics analysis but may be uncovered by an in-depth experimental analysis. With the caveat that the data may contain gaps (such as, possibly, Snu114), it is not single proteins that are missing, reduced or degenerated, but entire systems. The cropped set of proteins remaining in our G. lamblia spliceosomal proteome data set corresponds to a system much less dynamic than the human spliceosome, less precisely regulated and less able to adapt to variable conditions. However, such a spliceosome may still be functional. Hence, we propose that, from a practical standpoint, the set of structural regions with homologs in G. lamblia is a good starting point for the higher-order structural modeling of the spliceosome and constitutes an attractive list of targets for experimental structure determination.
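The proposed prioritization amounts to filtering a table of structural regions by whether a G. lamblia counterpart was found. The Python sketch below illustrates this under stated assumptions: the `Region` record and the `prioritized` function are hypothetical names introduced for the example, and the table contains only the handful of regions whose presence or absence is stated explicitly in the text, not the full data set.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    """One structural region of a human splicing protein (hypothetical record)."""
    protein: str
    region: str
    in_giardia: bool  # True if a G. lamblia counterpart is reported in the text


# Illustrative subset only; presence/absence values are those stated in the text.
REGIONS = [
    Region("RNF113A", "zf-CCCH", True),
    Region("RNF113A", "zf-C3HC4", False),
    Region("Prp8", "Mov34/MPN/JAB1", False),
    Region("SF3a120", "Surp", False),        # the entire protein is missing
    Region("U4/U6.U5-65K", "zf-UBP", False),
    Region("U4/U6.U5-65K", "UCH", False),
]


def prioritized(regions: List[Region]) -> List[Tuple[str, str]]:
    """Keep only the regions that have a reported G. lamblia counterpart."""
    return [(r.protein, r.region) for r in regions if r.in_giardia]


print(prioritized(REGIONS))
```

Working at the level of regions rather than whole proteins reflects the observation that a protein such as RNF113A can retain one domain while losing another.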