In this work, we describe the results of a comprehensive structural bioinformatics analysis of the spliceosomal proteome. We used fold recognition analysis to complement prior data on the ordered domains of 252 human splicing proteins. Examples of newly identified domains include a PWI domain in the U5 snRNP protein 200K (hBrr2, residues 258–338), while examples of previously known domains with a newly determined fold include the DUF1115 domain of the U4/U6 di-snRNP protein 90K (hPrp3, residues 540–683). We also established a non-redundant set of experimental models of spliceosomal proteins, as well as constructed in silico models for regions without an experimental structure. The combined set of structural models is available for download. Altogether, over 90% of the ordered regions of the spliceosomal proteome can be represented structurally with a high degree of confidence. We analyzed the reduced spliceosomal proteome of the intron-poor organism Giardia lamblia, and as a result, we proposed a candidate set of ordered structural regions necessary for a functional spliceosome. The results of this work will aid experimental and structural analyses of the spliceosomal proteins and complexes, and can serve as a starting point for multiscale modeling of the structure of the entire spliceosome.
The spliceosome is a eukaryotic macromolecular ribonucleoprotein (RNP) complex that performs the excision of introns (non-coding sequences) from pre-mRNAs following transcription. In humans, two forms of the spliceosome exist. The major spliceosome, which excises >99% of human introns, is composed primarily of four stable small nuclear ribonucleoprotein (snRNP) particles (subunits), named after their small nuclear RNA (snRNA) components: U1, U2, U4/U6 and U5. The minor spliceosome, which is absent in many species and which in humans excises the remaining <1% of introns, contains a U5 snRNP identical to the one from the major spliceosome, as well as two other snRNPs: U11/U12 and U4atac/U6atac. The U11/U12 and U4atac/U6atac di-snRNPs are distinct from, but structurally and functionally analogous to, the U1 and U2 snRNPs and the U4/U6 di-snRNP, respectively (1). The major human spliceosome contains 45 distinct proteins in its snRNP subunits in addition to around 80 abundant non-snRNP proteins (2). These proteins, together with the snRNAs, may be considered to be an experimental approximation of the ‘core’ of the spliceosome, that is, the set of structural elements necessary for the progression of the splicing reaction. Proteomics analyses of spliceosomal proteomes from various species also yield over 100 non-abundant splicing proteins (2–8), which may be active, for example, in certain instances of splicing. Out of the 45 distinct snRNP proteins, only seven, the so-called Sm proteins, are present in more than one copy. The Sm proteins form heteroheptamers with a toroidal shape, one for each of the U1, U2, U4 and U5 snRNPs. In each snRNP, the Sm heteroheptamer forms a platform that supports the respective snRNA. A similar platform associated with the U6 snRNA is composed of a set of seven related ‘like-Sm’ proteins (9).
Splicing-related proteins may also participate in other cellular events, including mRNA transcription (10,11), 5′ capping, 3′ cleavage and polyadenylation, as well as mRNA export, localization and decay (12,13) and box C/D snoRNP formation (14). While the majority of non-snRNP proteins are independent factors, some associate into non-snRNP protein complexes, which include the hPrp19/CDC5L (NTC) complex (15), the exon-junction complex (EJC) (16), the cap-binding complex (CBP) (17), the retention-and-splicing complex (RES) (18), and the transport-and-exchange complex (TREX) (19). These complexes may also have non-splicing functions (16,20).
A characteristic feature of the spliceosome is its extraordinary dynamism, as the snRNP composition of a spliceosome entity bound to the substrate pre-mRNA changes depending on the stage of the splicing reaction. For the major spliceosome, an E (entry) complex spliceosome contains U1 snRNP, an A complex contains U1 and U2 snRNP, a B complex contains U1 and U2 snRNP in addition to a tri-snRNP entity composed of the U4/U6 and U5 snRNPs, called U4/U6.U5, while the activated B (B-act) and catalytic (C) complexes contain U2, U5 and U6 snRNPs. After the splicing catalysis occurs and the mRNA is released, the initial configuration of the snRNPs (U1, U2 and U4/U6 and U5 separately) is recycled (21). Each stage-specific configuration of the snRNP subunits is also associated with a different non-snRNP protein complement. As a result, just like the snRNP composition, the non-snRNP composition of a given instance of the spliceosome also varies (2). In recent years, evidence has surfaced that ubiquitin-based (22–24) and intrinsic disorder-based (25) systems may contribute to the regulation of splicing assembly and dynamics.
To further the studies of the spliceosome and the association between splicing and other cellular processes, it is useful to determine the domain architecture and the three-dimensional structures of spliceosomal proteins. Detailed knowledge of protein structure can help determine how molecules perform their biological functions. Structure can also aid in understanding the effects of variations, resulting, e.g. from SNPs or from alternative splicing, which may have implications for disease. Besides, identification of structural similarities can reveal distant evolutionary relationships between proteins that cannot be detected from a comparison of their sequences alone (26). Of particular importance is the structural analysis of components of larger systems and complexes that have eluded high-resolution structural characterization. For instance, it has been suggested that high-resolution models of individual snRNP components may be fit into molecular envelopes created by low-resolution cryo-electron microscopy (cryo-EM) maps (27) to construct structures of the spliceosome at different stages of its action (28). Thereby, structural characterization of individual components of the spliceosome can bring us closer to modeling the structure and function of the entire system.
There are two main potential gaps in our understanding of the structure of the protein components of the spliceosome. The first one lies in recognizing the protein architecture at the primary level, e.g. the detection of conserved/structured domains and disordered regions. Most structural domains of splicing proteins are annotated by automated inferences in protein sequence databases such as UniProt (29). Many domains, especially those of the ‘core’ splicing proteins, have also been characterized in literature. However, automated annotations are limited in that they can only either spread information that is already available in the system (such as through homology inferences) or information that conforms to tight preset standards (such as in the detection of domains that conform to PFAM domain profiles) (30). Hence, at times, elements of protein architecture remain undetected throughout automated annotation, and can only be determined through additional analyses and human interpretation of other data.
The second gap lies in the lack of structural representation. Partial or complete structures have been determined for many splicing-related proteins and their complexes. These include a nearly complete U1 snRNP (31), U4 snRNP core with the Sm ring (32), several complexes associated with the spliceosome such as the human EJC (33) or the human CBP (34) and various protein–protein and protein–RNA complexes, such as the human U2 snRNP protein p14 (SF3b14a) bound to a region of SF3b155 (35). In total, as of December 2011, data from the Protein Data Bank (PDB) (36) show that at least 340 structures have been determined by X-ray crystallography and NMR for human spliceosomal proteins or their domains, either alone or in various complexes. Many of these structural models are redundant because they represent the same regions of the same proteins. However, for many regions, no three-dimensional models are available.
As an essential step towards enhancing our current understanding of the spliceosome, we have carried out a systematic structural bioinformatics analysis of the proteins of the human spliceosomal proteome, with a dual focus on characterizing their ordered parts and modeling their structures. In an effort to help set the priorities for future modeling of the entire spliceosome, we also compared the human spliceosomal proteome with the proteome of the parasitic diplomonad Giardia lamblia, known for its genomic minimalism. We put forward the set of structural regions common for human and G. lamblia as an attractive target for future studies. This analysis complements a parallel study of the unstructured part of the proteins of the spliceosome (I.K. and J.M.B., submitted for publication), and runs alongside efforts of many research groups to characterize the structure of spliceosomal RNAs and map out the interactions between the spliceosomal components.
A total of 244 proteins found in the proteomics analyses of the major human spliceosome [sourced from one or more of the following references (2,4,8,37–41)], and 8 proteins specific to the U11/U12 di-snRNP subunit of the minor spliceosome (Supplementary Table S1) (42), were downloaded from the NCBI Protein (nr) database. Proteins were classified as ‘abundant’ and ‘non-abundant’ according to (2), and they were assigned into groups based mainly on (2), followed by references (4,38–40). Proteins classified here as ‘miscellaneous’ were classified in primary sources, variably, as ‘miscellaneous proteins’, ‘miscellaneous splicing factors’, ‘additional proteins’, ‘proteins not reproducibly detected’ and ‘proteins not previously detected’. We disclaim any responsibility for the factual accuracy of the association of proteins with the relevant groups beyond the point of following the primary sources.
Searches of protein homologs in the NCBI Protein (nr) database were carried out at the NCBI using BLASTP/PSI-BLAST (43) with default parameter settings. Putative homology was validated by reciprocal BLASTP searches against the Protein database with ‘human’ (NCBI taxon id: 9606) as a taxon search delimiter. Sequence alignments were calculated using the MAFFT server using the Auto strategy (http://mafft.cbrc.jp/alignment/server/) (44). Clustering analysis of helicase sequences was performed with CLANS (45).
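The reciprocal validation step can be illustrated with a minimal sketch. It assumes a reciprocal-best-hit criterion, which the text does not specify, and it operates on ranked hit lists that would have been parsed from BLASTP reports beforehand. The function name and all identifiers in the example are hypothetical.

```python
def is_reciprocal_best_hit(human_id, forward_hits, reverse_hits):
    """Return True if the top forward hit of `human_id` maps back to it.

    forward_hits: dict, human protein ID -> ranked hit IDs in the other proteome
    reverse_hits: dict, candidate ID -> ranked human hit IDs (taxon 9606 search)
    Both dicts are assumed inputs parsed from BLASTP reports.
    """
    ranked = forward_hits.get(human_id, [])
    if not ranked:
        return False  # no candidate homolog was found at all
    candidate = ranked[0]
    back = reverse_hits.get(candidate, [])
    # Homology is accepted only if the reverse search returns the query first.
    return bool(back) and back[0] == human_id


# Hypothetical identifiers, for illustration only.
forward = {"human_A": ["giardia_1", "giardia_7"], "human_B": ["giardia_2"]}
reverse = {"giardia_1": ["human_A"], "giardia_2": ["human_C", "human_B"]}
print(is_reciprocal_best_hit("human_A", forward, reverse))  # True
print(is_reciprocal_best_hit("human_B", forward, reverse))  # False
```

A stricter or looser rule (for example, accepting the query anywhere among the top few reverse hits) would only change the final comparison.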
Identification of intrinsically ordered and disordered regions of proteins, prediction of protein secondary structure and domain boundaries, as well as fold-recognition (FR) analyses, were carried out via the GeneSilico MetaServer gateway (for references to the original methods, see https://genesilico.pl/meta2) (46). In non-trivial cases (usually when putative modeling templates returned by FR scored low and/or various methods disagreed on the best template), FR alignments to the top-scoring templates from the PDB were compared, evaluated and ranked by the PCONS server (47), and the PCONS result was used to identify region boundaries. Additional searches were performed on the HHPRED server (48).
SCOP database (49) IDs used for the purpose of structural domain identification were extracted either from the Protein Data Bank or from the SCOP parseable files on the SCOP website (http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html), or were assigned using the fastSCOP server (http://fastscop.life.nctu.edu.tw/) (50). PFAM domain names were assigned on the PFAM website (http://pfam.sanger.ac.uk/). SCOP v. 1.75 and PFAM v. 25.0 were used. Structural similarity was compared using the DALI server (51).
In assigning structural models to regions, we followed a four-step procedure (Figure 1). Whenever a high-resolution experimental structural model (either X-ray or NMR structure) was available, we assigned it to the corresponding sequence region. If a structural similarity to a protein of known structure was predicted for a given region by fold-recognition algorithms (see below for details), we constructed a model for this region by a comparative (template-based) modeling technique, using the detected experimental structures as templates. In the absence of confidently predicted templates, we used de novo folding methods for relatively small fragments likely to form globular domains. For the remaining regions (those without experimentally solved structures and for which the current modeling methodology cannot provide confident predictions of the 3D structure), we generated pro forma models, in which only the primary and (predicted) secondary structure was represented explicitly, while the tertiary arrangement was arbitrary. Pro forma models are not supposed to be reliable at the tertiary level and were constructed for the sake of further analyses (e.g. to initialize protein folding analyses that require some kind of a structural representation as an input).
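The four-step procedure amounts to a fixed-priority cascade, which can be sketched as follows. The `Region` class, its three boolean fields and the returned labels are hypothetical names chosen for this illustration; in practice each flag would be the outcome of the database searches and predictions described in this section.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical summary of what is known about one sequence region."""
    has_experimental_structure: bool  # X-ray or NMR model available
    has_confident_template: bool      # fold recognition found a template
    de_novo_eligible: bool            # small fragment likely to be globular


def choose_representation(region: Region) -> str:
    """Pick the highest-priority representation available for a region."""
    if region.has_experimental_structure:
        return "experimental"
    if region.has_confident_template:
        return "comparative"
    if region.de_novo_eligible:
        return "de novo"
    # Fallback: only primary and predicted secondary structure are explicit.
    return "pro forma"


print(choose_representation(Region(False, True, True)))    # comparative
print(choose_representation(Region(False, False, False)))  # pro forma
```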
For regions with multiple solved structures in the Protein Data Bank, the following criteria of preference were used: (i) structures of the region in complex with other proteins and/or nucleic acids (i.e. in a potentially ‘active’ or ‘functionally relevant’ state) were given priority over structures of the region in isolation, (ii) crystallographic structures were given priority over NMR structures, (iii) higher-resolution crystallographic structures were given priority over lower-resolution structures and (iv) more complete structures were given priority over less complete structures. The following experimental artifacts were removed from experimental structure files or corrected by standard modeling procedures: non-native sequences added to aid in the protein expression and structure determination process (e.g. affinity tags), non-standard amino acids (e.g. selenomethionine was replaced by methionine), and gaps in sequences (e.g. short disordered loop fragments were added). Only a single chain was retained if the original PDB file contained multiple chains of the same protein.
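Criteria (i) to (iv) can be expressed as a sort key, as in the sketch below. It assumes that the criteria are applied strictly in the listed order, with each later criterion only breaking ties left by the earlier ones; the text does not state this explicitly. The `Candidate` class and the labels in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """Hypothetical record describing one PDB entry covering a region."""
    label: str
    in_complex: bool
    method: str                  # "X-ray" or "NMR"
    resolution: Optional[float]  # in angstroms; None for NMR entries
    residues_covered: int


def preference_key(c: Candidate):
    """Smaller tuples sort first, so each field encodes 'better = smaller'."""
    is_xray = c.method == "X-ray"
    has_resolution = is_xray and c.resolution is not None
    return (
        0 if c.in_complex else 1,                        # (i) complexes first
        0 if is_xray else 1,                             # (ii) X-ray over NMR
        c.resolution if has_resolution else float("inf"),  # (iii) resolution
        -c.residues_covered,                             # (iv) completeness
    )


candidates = [
    Candidate("A", False, "X-ray", 1.8, 100),
    Candidate("B", True, "NMR", None, 90),
    Candidate("C", True, "X-ray", 2.5, 95),
    Candidate("D", True, "X-ray", 2.1, 80),
]
ranked = sorted(candidates, key=preference_key)
print([c.label for c in ranked])  # ['D', 'C', 'B', 'A']
```

Note that the isolated 1.8 Å entry ranks last in this example, because criterion (i) takes precedence over resolution.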
Comparative models were constructed by default with MODELLER (52) based on templates identified in the fold-recognition process. Selected challenging models were constructed using the I-TASSER server (53). Selected models were also adjusted with ROSETTA 3.0/3.1 using the loop modeling mode (54). De novo models were produced with the ROSETTA 3.0/3.1 AbInitioRelax application and clustered with the ROSETTA 3.0/3.1 Cluster application, following the protocols set out in the ROSETTA User Guide for version 3.1 (http://www.rosettacommons.org/manual_guide) (54). De novo folding was attempted if the following conditions were fulfilled: the region was ≤125 residues in length, predicted to be completely ordered and predicted to contain secondary structure elements. These conditions correspond to the current practical limit of utility of this type of method (55). Artificial pro forma spatial representations of protein chains of unknown/uncertain structure or predicted to lack a stable structure were built with UCSF Chimera (v.1.4/1.5) using the Tools>Structure Editing>Build Structure command (56). Pro forma constructs reflect only the known primary and predicted secondary structure of the corresponding regions, while their tertiary structure should be regarded as unassigned (and remains to be modeled in the future). Miscellaneous manipulations of structures and models of molecules during this stage were performed in UCSF Chimera (56) and Swiss-PdbViewer v. 4.0.1 (57).
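The three conditions for attempting de novo folding can be written as a simple predicate. The sketch below assumes per-residue predictions encoded as strings, with 'O'/'D' for ordered/disordered and 'H'/'E'/'C' for helix/strand/coil; this encoding and the function name are illustrative assumptions, not the format used by the prediction servers.

```python
MAX_DE_NOVO_LENGTH = 125  # residue limit stated in the text


def de_novo_eligible(disorder: str, sec_struct: str) -> bool:
    """Check the three stated conditions for one region.

    disorder:   per-residue order/disorder string, 'O' or 'D' (assumed encoding)
    sec_struct: per-residue secondary structure, 'H', 'E' or 'C' (assumed encoding)
    """
    if len(disorder) != len(sec_struct):
        raise ValueError("prediction strings must have the same length")
    length_ok = 0 < len(disorder) <= MAX_DE_NOVO_LENGTH
    fully_ordered = "D" not in disorder
    has_elements = any(state in "HE" for state in sec_struct)
    return length_ok and fully_ordered and has_elements


helix_region = "C" * 10 + "H" * 20 + "C" * 50
print(de_novo_eligible("O" * 80, helix_region))        # True
print(de_novo_eligible("O" * 79 + "D", helix_region))  # False
```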
Assessment of model quality was performed with MetaMQAPII [https://genesilico.pl/toolkit/unimod?method=MetaMQAPII, an updated version of a method described in (58)] and QMEAN [http://swissmodel.expasy.org/qmean/ (59)].
MetaMQAP predicts the deviation of the query model from the (unknown) native structure and expresses it as the predicted global root mean square deviation (RMSD) and the predicted global distance test total score (GDT_TS) (60). The lower the predicted RMSD and the higher the predicted GDT_TS score, the better the model.
QMEAN first calculates an internal score, and then the QMEAN Z-score indicates by how many standard deviations the QMEAN score of the model differs from expected values for experimental structures that have a similar length to the model. High quality models are expected to have positive QMEAN Z-scores, and good models are expected to have a QMEAN Z-score above −2.0. Indicators of accuracy of individual residues were generated by MetaMQAPII and are supplied as B-factor values inside the model files available from the SpliProt3D database website (see below). They can be visualized with the UCSF Chimera command Render By Attribute > (attributes of residues: average B-factor) or with equivalent commands in other molecular visualization programs. Mean values and standard deviations of the QMEAN Z-scores for the six QMEAN contributing factors are provided with this publication (Supplementary Table S4) and the values for all models are provided with the model files. Models of low quality are expected to have a strongly negative QMEAN Z-score, but also strongly negative Z-scores for most of the contributing terms.
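The Z-score idea described above can be sketched generically: the raw score of a model is compared with the mean and standard deviation of scores from a reference set. The reference values below are invented for illustration, and the use of the sample standard deviation is an assumption; the QMEAN server applies its own reference set of experimental structures of similar length.

```python
from statistics import mean, stdev


def z_score(model_score: float, reference_scores: list) -> float:
    """Number of standard deviations separating a score from the reference mean."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation (assumption)
    return (model_score - mu) / sigma


# Hypothetical raw scores of experimental structures of similar length.
reference = [0.6, 0.7, 0.8]
z = z_score(0.65, reference)
print(round(z, 2))  # -0.5
print(z > -2.0)     # True: above the -2.0 level quoted for good models
```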
As MetaMQAPII is not capable of evaluating multimeric models, for models of protein complexes (11 X-ray models and 2 NMR models) only the quality of the longest chain was evaluated by MetaMQAPII.
Models and additional data, including alignments of representative sequences annotated with predictions of order/disorder, secondary structure, binding disorder, solvent accessibility and coiled coils, as well as annotations of sites of post-translational modification from UniProt (29), are available via the SpliProt3D web server at http://iimcb.genesilico.pl/spliprot3D. The entire archive of files available for download is approximately 250 MB in size.
Our main priorities in identifying structural domains of splicing proteins were to check and correct previously reported domain boundaries and to identify and characterize domains that were not available in UniProt and other databases. We focused on 252 proteins of the human spliceosome, including 244 proteins found in the results of proteomics analyses of the major human spliceosome and 8 proteins specific to the U11/U12 subunits of the minor spliceosome (see ‘Materials and Methods’ section for references to protein sources and Supplementary Table S1 for protein GIs). We did not find any references to U4atac/U6atac-specific proteins either in literature or in the Gene Ontology (GO) database [http://geneontology.org (62)]. A total of 118 proteins were classified as ‘abundant’ as in (2); other proteins were classified as ‘non-abundant’. ‘Abundant’ proteins are suggested to be the most important for the correct action of the spliceosome (2).
Using a combination of protein fold-recognition and sequence conservation-based domain identification methods, we identified 465 ordered structural domains in the 252 proteins, including 80 domains in the snRNP proteins of the major human spliceosome (Table 1 and Supplementary Table S2). Ordered structural domains cover >80% of the ordered regions of the proteins, and ~50% of all residues in the splicing proteins. Correspondingly, close to a half of the human spliceosomal proteome is predicted to be intrinsically disordered. The analysis of various structural and functional types of intrinsic disorder in the spliceosome brought about a quantity of data whose presentation is beyond the scope of this article and that has been consequently made the subject of an independent article (I.K. and J.M.B., submitted for publication).
Based on the predicted order/disorder boundaries and the presence/absence of predicted secondary structure elements, we also detected 25 regions that we termed ‘suspected domains’. This category included two groups of regions. The first group were domain-length (>40 residues) regions without a recognized fold that were the only ordered regions of otherwise highly intrinsically disordered proteins (≥70% residues predicted to be disordered). The second group were present in proteins with low-to-middle intrinsic disorder content (<70% residues predicted to be disordered) that contained other ordered structural domains. The ‘suspected domains’ in these proteins were ordered regions that had clear order/disorder boundaries and contained predicted secondary structure elements, but lacked a PFAM domain assignment (30) and showed no clear relationship to any known folds according to protein fold-recognition analyses.
Ordered domains of splicing proteins classified in the SCOP (49) catalogue belong to classes a–e and g, with an over-representation of class d, which contains superfamily d.58.7 (RNA-binding domain, RRM (RBD), which usually corresponds to PFAM domain PF00076, RRM_1; Table 2). RRM is present in the 252 proteins in as many as 117 copies. This means that roughly each fourth to fifth domain in the spliceosomal proteome is an RRM. As RRM is a small domain that usually binds single-stranded RNA (63,64), this reflects the key character of protein–RNA interactions in the splicing process.
Other common types of ordered protein regions found in the human spliceosomal proteome include other small RNA-binding domains, large α- and β-repeat-based protein-binding domains, small protein disorder-binding domains, ubiquitin-related domains and stable multidomain RNA helicase architectures (Table 3). Repeat-based domains are often found as building blocks of protein complexes, while some of the ubiquitin-related domains have been shown to be part of a putative ubiquitin-based system of controlling spliceosome assembly and dynamics (22,65).
In addition to ordered domains, we found nine regions with an expected independent function that were predicted to be disordered, but that were either found in experimental structures or could be confidently modeled due to strong sequence matches to known domains. We considered these nine regions to be putative disordered domains that undergo a transition to order upon entering a complex. We discuss the features of these domains in an independent article that focuses specifically on intrinsic disorder in the spliceosomal proteome (I.K. and J.M.B., submitted for publication). Here, we will only note that, in general, the identification of disordered structural domains is currently a non-trivial task in comparison with the identification of ordered structural domains, as fewer experimentally validated examples of disorder exist in databases and the properties of disorder make automated identification and propagation more difficult.
Following the identification of domains, we constructed a non-redundant set of experimental and theoretical structural models of regions in splicing proteins. As the utility and credibility of models, both experimental and theoretical, depend on their accuracy, we set some simple heuristic rules of preference to increase the chance that we chose the models with the best quality. We preferred experimental models over theoretical models, X-ray experimental models over NMR experimental models and comparative theoretical models over de novo theoretical models (Figure 1). The lowest tier in the hierarchy was pro forma constructs, in which only the primary and secondary structure were represented explicitly, while the tertiary arrangement was arbitrary. As a result, we mapped 104 non-redundant experimental models to the sequences of the spliceosomal proteins, and created 255 comparative and 43 de novo models (Table 4 and Supplementary Table S3), as well as over 500 pro forma constructs. The 104 non-redundant experimental models include 23 models of (nucleo)protein complexes, of which 13 complexes have residues from more than one spliceosome-associated protein. While models of complexes tend to have lower accuracy than models of isolated chains, we considered them to be more informative about protein function than models of isolated chains. This was the only instance where we favored the availability of additional information over plain accuracy of the structure.
Over 90% of ordered regions of splicing proteins can be associated with experimental structural information or with comparative and de novo models (Figure 2). This value is similar for the proteins of the snRNP subunits of the major spliceosome and other proteins associated with the human spliceosome. Between different types of structural representations, experimentally determined structural models cover 20.6% of all ordered residues, the comparative models we generated cover 67.4% of all ordered residues, and the de novo models cover 4.8% of all ordered residues. Hence, our theoretical models (72.2% in total) cover more than three times the length of ordered protein sequence covered by experimental models.
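The coverage figures quoted above can be checked with a few lines of arithmetic. The snippet uses only the three percentages reported in the text; the variable names are arbitrary.

```python
# Percentages of ordered residues covered, as reported in the text.
coverage = {"experimental": 20.6, "comparative": 67.4, "de novo": 4.8}

theoretical = coverage["comparative"] + coverage["de novo"]
total = theoretical + coverage["experimental"]
ratio = theoretical / coverage["experimental"]

print(f"theoretical models: {theoretical:.1f}%")     # 72.2%
print(f"all confident models: {total:.1f}%")         # 92.8%
print(f"theoretical/experimental: {ratio:.2f}")      # 3.50
```

The total of 92.8% is consistent with the statement that over 90% of ordered regions are represented, and the ratio of about 3.5 with the comparison between theoretical and experimental coverage.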
X-ray crystallography is useful for the structure determination of large proteins (>30kDa) and protein complexes, while NMR is well-suited for the structure determination of relatively small proteins. Not surprisingly, the ratio of the number of ordered residues in proteins from snRNP subunit structures solved by X-ray crystallography versus NMR is ~3:1 (15.7%:4.7%), while this ratio for all splicing proteins is ~1.77:1 (13.4%:7.2%). The main reason for this is that small domains are statistically more populous in the general set of splicing proteins compared to the snRNP subunits. Contrariwise, most structures of protein–protein complexes available for splicing proteins include regions from snRNP proteins. Since the resolution (and hence accuracy) of experimentally determined structures is typically inversely correlated with the molecule or complex size, X-ray models of snRNP proteins have on average a slightly worse resolution (mean 2.20Å) than X-ray models of all spliceosomal proteins (mean 2.08Å).
For predicted disordered regions, confident structural coverage is very low in comparison to ordered regions. Less than 2% of residues predicted to be disordered are covered by experimental models, and even together with our theoretical models, we could only cover 8.9% of all disordered residues. Moreover, most of the residues covered belong to linkers between ordered structural domains or short regions in protein termini. This low coverage of intrinsically disordered regions by structural models may in the future pose a considerable challenge to producing a comprehensive structural model of the spliceosome.
For all models except pro forma constructs, we also independently evaluated their accuracy to determine how credible they were. To do this, we used two methods: MetaMQAPII (58) and QMEAN (59). Both of them provide a global score for the entire model (predicted RMSD for MetaMQAPII, QMEAN Z-score for QMEAN) as well as a local score for individual residues (in this analysis, only the MetaMQAPII score was used). Functionally relevant and evolutionarily conserved regions (e.g. binding interfaces) are typically predicted with a higher than average accuracy, in particular when comparative modeling is used. Consequently, even a model with a poor global score can be useful for functional considerations, if its functionally important parts are scored well and are likely to be accurate. Some readers may also be interested in scores that describe only the model’s quality with respect to a particular feature (e.g. secondary structure). To help describe different features of models, we recorded the mean values and standard deviations of QMEAN Z-scores for six QMEAN contributing factors. These values for all models are provided with the manuscript (Supplementary Table S4).
For comparison with theoretical models, we ‘predicted’ the global quality of experimentally determined structures (Supplementary Figure S1). Expectedly, both X-ray and NMR models we selected for our data set are highly scored by both MetaMQAPII and QMEAN, which is an indicator of the high accuracy of these structures (Table 5; for RMSD, the lower the score, the better the model; for the QMEAN Z-score good models are scored higher). Mean QMEAN Z-scores for models of both types (0.42 for X-ray and 0.08 for NMR) compare favorably to mean QMEAN Z-scores of models across the entire PDB (−0.58 and −1.19, respectively) (67). As X-ray models in our database were scored slightly better than NMR models, we used scores for X-ray models as a benchmark with which to classify theoretical models into those ‘likely to be globally accurate’ or ‘unlikely to be globally accurate’. The worst-scored X-ray models in our data set have a predicted RMSD of 4.5Å (PDB ID 2ok3, resolution 2.0Å) and a QMEAN Z-score of −1.99 (PDB ID 2qfj, resolution 2.10Å). Consequently, we divided all non-X-ray models into four classes depending on passing one or both thresholds: predicted RMSD ≤4.5Å and QMEAN Z-score ≥−2.0 (Figure 3).
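The two thresholds define four classes, which can be sketched as a small classifier. The thresholds are those given in the text; the class labels and the function name are our own wording for this illustration.

```python
RMSD_MAX = 4.5      # predicted RMSD threshold in angstroms (MetaMQAPII)
QMEAN_Z_MIN = -2.0  # QMEAN Z-score threshold


def classify_model(predicted_rmsd: float, qmean_z: float) -> str:
    """Assign a non-X-ray model to one of four classes by the two thresholds."""
    rmsd_ok = predicted_rmsd <= RMSD_MAX
    qmean_ok = qmean_z >= QMEAN_Z_MIN
    if rmsd_ok and qmean_ok:
        return "passes both"
    if qmean_ok:
        return "passes QMEAN only"
    if rmsd_ok:
        return "passes MetaMQAPII only"
    return "passes neither"


# Boundary case built from the worst X-ray scores quoted in the text.
print(classify_model(4.5, -1.99))  # passes both
# A model with predicted RMSD 8.8 and QMEAN Z-score -1.93.
print(classify_model(8.8, -1.93))  # passes QMEAN only
```

Because both comparisons are inclusive, a model scored exactly at the level of the worst crystal structures falls into the most reliable class.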
The majority of both NMR and theoretical models belong to the most reliable class (i.e. ‘scored not worse than the worst crystal structures in the data set’). These models are expected to be generally correct, although their local accuracy may vary. Models scored well only by one method should be treated with more caution than models scored well by both methods. However, poor scoring by one method may also be due to the model being either very short or very long. Models that are scored poorly by MetaMQAPII, but are scored well according to the QMEAN Z-score are usually short, while models that are scored high by MetaMQAPII and low by QMEAN are usually long. The mean length of a model scored well by both methods is 220 residues, but the mean length of a model scored well only by QMEAN is 70 residues and the mean length of a model scored well only by MetaMQAPII is 362 residues. Therefore, we urge the reader to consider the length of the model before using models scored poorly by only one method.
Over 40 models are scored poorly by both MetaMQAPII and QMEAN. These models may have been built on remotely related templates or did not fold well when modeled de novo, and are expected to contain various errors. Based on our previous experience, we believe that some of these cases may represent new protein folds or interesting variations of known folds that present a considerable challenge for protein modeling methods. Hence, while we regard these models as unreliable, we propose the corresponding proteins or domains as attractive targets both for experimental protein structure determination, and for protein modeling with other advanced techniques.
The entire non-redundant set of representations (including selected representative models determined by experimental methods, and all theoretical models built with computational methods) is available as an online database SpliProt3D at http://iimcb.genesilico.pl/SpliProt3D. The web server allows for browsing, selecting and downloading the models. Proteins are also associated with sequence alignments annotated with predictions of intrinsic order versus disorder, predictions of secondary structure, protein-binding disorder, solvent accessibility and coiled-coils, as well as the positions of post-translational modifications. The database will be curated and new entries will be added and obsolete ones archived following the progress in structure determination of new spliceosomal proteins and/or publication of new theoretical models with better predicted accuracy. We would like to encourage structural biologists working on structure determination or prediction for spliceosomal proteins to contact us to have their models included and referenced in our database.
After submission of this article for review, a crystal structure of the yeast U2 snRNP SF3A sub-complex was published (68), giving us an opportunity to compare some of our predictions with the independently determined experimental structure.
The structure of the yeast SF3A complex includes, in addition to several regions composed of individual secondary structure elements, three ordered domains for which an experimental structure had not been published before. One domain in the yeast protein Prp9 is >200 residues long (its counterpart in the human protein SF3a60 is situated roughly between residues 1–77, 129–244 and 310–372); it features a novel helical architecture. Originally, we made no tertiary structural predictions for this domain (i.e. our database contained only constructs), and it is highly unlikely that the structure of this domain could have been predicted accurately by a standard bioinformatics approach. Another domain in the yeast Prp9 is a zf-C2H2 zinc finger inserted into the long helical domain, whose counterpart in the human protein SF3a60 lacks the Zn-binding residues and is closely neighbored by another insertion, of a SAP domain. Despite these differences, in our original model of this domain (with a predicted RMSD of 8.8Å and QMEAN Z-score of −1.93), we correctly predicted the fold and the position of nearly all residues in this zinc finger. We also correctly predicted the boundaries and the fold of an all-β domain in the human protein SF3a66, a counterpart of the yeast protein Prp11. The original comparative model of this domain had a predicted RMSD of 4.7Å and a QMEAN Z-score of −0.92, with a medium reliability of the fold prediction. In practice, upon comparison, this translated to predicting the position of approximately half of the residues in the domain correctly. This analysis demonstrates the utility of the predictions and shows that even models with relatively low predicted accuracy can, in fact, exhibit correct folds, spatial shapes and locations of some of the functionally important residues.
Given the availability of the new template, we generated new models for the human counterparts of the SF3A crystal structure, using the comparative approach. We also generated a new comparative model for a domain in the C-complex-related protein cactin (NY-REN-24/C19orf29, gi: 126723149) as this protein is predicted to have a domain with the same all-β fold as the SF3a66 domain. The new models have been deposited in the database, while the old models have been moved to the archive of the ‘obsolete’ entries and are still available for analysis.
Given the known role of ubiquitin in controlling spliceosome assembly and dynamics (21,22), and the fact that ubiquitin-related domains are one of the largest groups of domains in splicing proteins, we were interested in learning how these domains were distributed across the different groups of splicing proteins. We found 19 potential or known ubiquitin-related domains in 15 splicing-related proteins, including 12 abundant proteins of the major spliceosome and one protein of the U11/U12 di-snRNP subunit of the minor spliceosome (Table 6 and Figure 4). These domains cover most of the main classes of ubiquitin-related domains, including ubiquitin fold domains, RING zinc finger/U-box domains that may act as ubiquitin ligases, a ubiquitin conjugating enzyme-like domain, a ubiquitin carboxyl-terminal hydrolase domain and the JAB1/MPN domain of protein U5-220K (hPrp8) described in (23). In several cases, such as that of the abundant C-complex-specific protein FLJ35382 (C1orf55) and the TREX complex protein THOC5, only similarity of a protein region to a known ubiquitin-related fold could be detected.
Ubiquitin-related domains are more abundant in proteins active in the late stages of splicing (B, B-act and C complexes). The ubiquitin-fold domain of protein SF3a120 is the only ubiquitin-related domain found in the U2 snRNP (its counterpart is found in the U11/U12 di-snRNP). On the other hand, as many as three proteins of the B/B-act complex (UBL5, Cyp-60 and RNF113A) and four proteins of the C complex (FLJ35382/C1orf55, XAP-5/FAM50A, NOSIP and CCDC130) contain ubiquitin-related domains, in addition to a domain in the U5 snRNP (the JAB1/MPN of U5-220K) and a protein in the U4/U6.U5 tri-snRNP (U4/U6.U5-65K). In summary, this distribution suggests that the late stages of splicing are probably under a stricter ubiquitin-based control than the early stages. This may be due to the fact that the earlier stages of splicing, such as intron/exon definition, are more dependent on weak, disorder-based interactions, while the later catalytic stages require precise subunit rearrangements.
Our FR analysis detected that the human SF3A sub-complex contains, in addition to the zinc finger in protein SF3a60, another degenerate C2H2 (g.37.1)-type zinc finger in the middle conserved region of protein SF3a120 (conserved region: residues 217–530, PFAM domain PRP21_like_P; zinc finger: residues 407–435). In Saccharomyces cerevisiae, this zinc finger is absent entirely. However, in the majority of non-animal species, especially other fungi, amoeba and Apicomplexa, this zinc finger retains some of the cysteine and histidine zinc-binding residues (Figure 5A). The zinc finger remnant is surrounded on both sides by intrinsically unstructured regions that are in part predicted to form helical (potentially coiled-coil) structures. The short motifs lying on the distal ends of the disordered linkers are conserved. An additional coiled-coil region connects the N-terminal conserved motif with the previously described (69) second Surp module of SF3a120. Thus, the PRP21_like_P module consists of three motifs, the second of which is a zinc-finger remnant, connected by flexible linkers, with an N-terminal coiled coil that connects the N-terminal motif to the Surp region (Figure 5B). Structural modules of this type usually serve to simultaneously contact a binding partner of the protein in several locations. In the particular case of SF3a120, it has been suggested that both the U2 snRNA and an as yet unidentified splicing protein are potential partners (69).
Through a systematic search, we found several other examples of zinc finger and zinc finger-like domains embedded in conserved disordered regions in the spliceosomal proteome (Table 7). Alternatively, tandem zinc fingers can be separated, e.g. by predicted coiled-coil regions. The new zinc-finger domains we found usually belong to the zf-C2H2 (g.37.1) type, which can bind RNA and/or mediate protein–protein interactions. The pre-mRNA/mRNA-binding protein ARS2 contains a ZZ RING zinc finger, while the C complex protein NOSIP contains two RING zinc finger/U-box-like regions.
The C-terminal ordered domain of protein U4/U6-90K (hPrp3), which corresponds to PFAM domain DUF1115 (PFAM ID: PF06544; residues 540–683), was predicted in our analysis to have a ferredoxin-like fold. It is predicted to be related to the acylphosphatase/BLUF domain-like superfamily (SCOP ID: d.58.10). BLUF family domains have two additional helices in the C-terminus compared to acylphosphatase family domains. These helices are present in the DUF1115 domain, and so this domain is predicted to be a BLUF-like domain (Figure 6). This is an unusual assignment, because the BLUF domain is a FAD/FMN-binding blue light photoreceptor domain found primarily in bacteria. In Eukaryota, it is found almost exclusively in euglenids and Heterolobosea. On the other hand, DUF1115 is found exclusively in eukaryotes. However, very high scores of BLUF domain templates yielded by FR methods for the hPrp3 DUF1115 sequence suggest that this protein is definitely homologous to the BLUF family.
Nevertheless, DUF1115 differs from BLUF domains in some key features. The FAD/FMN-binding residues are not conserved in DUF1115, nor is a tryptophan residue whose position is altered depending on the excitation state of the photoreceptor (70) (Supplementary Figure S2). On the other hand, DUF1115 contains a disordered loop between the second α-helix and the fifth β-strand. The presence of this loop, though not its length, is conserved in DUF1115 domains. Moreover, a conserved tryptophan residue, W604 in hPrp3, is located next to the disordered loop.
Based on biochemical data, the DUF1115 domain may be a region of interaction of hPrp3 with the U5 snRNP protein hPrp6 and/or the U4/U6.U5 tri-snRNP protein U4/U6.U5-110K (SART-1) (71). However, it is also possible that this interaction proceeds through the disordered PRP3 domain of this protein (71). A possible alternative role for DUF1115 is suggested by the fact that, apart from proteins from the hPrp3 family, it is found only in a family of proteins containing the RWD domain. The RWD domain belongs to the ubiquitin conjugating enzyme superfamily (72). Hence, the hPrp3 DUF1115 may be a part of the spliceosomal ubiquitin-based system.
hPrp22 (DHX8) and hPrp2 (DHX16) are RNA helicases that function in the remodeling of the spliceosome (6). According to our predictions, these two helicases contain N-terminal ordered helical bundles with a PWI superfamily fold (SCOP superfamily a.188.1) and similarity to the PFAM PWI domain (Figures 7 and 8). PWI is a nucleic acid-binding domain first described in the splicing protein SRm160 (73,74). PWI is also found in the animal protein U4/U6-90K (hPrp3). The hPrp22 and hPrp2 PWI-like bundles (hPrp22: residues 1–92 or 1–120; hPrp2: 1–95) are not found in a search with the profile of the PFAM PWI domain, possibly because their eponymous PWI tripeptide motifs are degenerated. In hPrp22 and its homologs, only the third position of this motif is conserved: [x][x][IV], while in hPrp2 and its homologs, the second and third positions are usually conserved: [x][WFY][IV]. However, PFAM displays several putative hPrp2/hPrp22 homologs when queried for proteins that contain PWI domains. Furthermore, stable binding to nucleic acids by PWI requires an adjacent basic-rich region (74). We found potential candidates for such ancillary regions both in hPrp22 and in hPrp2 (hPrp22: residues 93–116; hPrp2: residues 120–132).
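The text does not state how the candidate basic-rich regions were identified. Purely as an illustration of how such regions could be flagged, the following Python sketch scans a sequence with a sliding window and reports stretches enriched in lysine and arginine. The window size, the threshold, the restriction to K and R, and the toy sequence are all assumptions made for this example and are not taken from the original analysis.

```python
# Assumption: only lysine (K) and arginine (R) count as basic residues here.
BASIC = set("KR")


def basic_fraction(segment: str) -> float:
    """Fraction of residues in the segment that are K or R."""
    if not segment:
        return 0.0
    return sum(res in BASIC for res in segment.upper()) / len(segment)


def basic_rich_windows(sequence: str, window: int = 10, min_fraction: float = 0.4):
    """Return 1-based inclusive (start, end) windows meeting the threshold."""
    hits = []
    for i in range(len(sequence) - window + 1):
        if basic_fraction(sequence[i:i + window]) >= min_fraction:
            hits.append((i + 1, i + window))
    return hits


def merge_windows(windows):
    """Merge overlapping or adjacent windows into contiguous regions."""
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical toy sequence, not a real protein fragment.
toy = "A" * 10 + "KRKRKRKRKR" + "A" * 10
print(merge_windows(basic_rich_windows(toy)))
```

Because every window that overlaps the basic stretch by at least four residues passes the threshold, the merged region extends beyond the stretch itself; a stricter threshold or a smaller window would tighten the boundaries.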
We also found a PWI-like helical bundle in the N-terminus of the human protein U5-200K (hBrr2; residues 258–338; Figure 7). This helical bundle is conserved across the majority of eukaryotes, and is found, for instance, in the S. cerevisiae Brr2. The PWI-like domain of U5-200K retains a relatively well conserved second and third position of the tripeptide PWI motif: [x][WFY][ILV]. Notably, if correct, this prediction represents the first case when a PWI-like domain is located in the middle of a protein. Usually, as is the case of SRm160, hPrp3, hPrp22 and hPrp2, a PWI domain is located either in the immediate N-terminus or in the immediate C-terminus of a protein. There are at least three candidate basic-rich regions in the vicinity of the U5-200K PWI-like domain (residues 254–259; 343–349; 373–386).
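The degenerate tripeptide patterns quoted for hPrp22, hPrp2 and U5-200K can be written as regular expressions. The following minimal Python sketch classifies a tripeptide by the strictest pattern it satisfies; the class labels, the ordering from strictest to loosest and the example tripeptides are choices made for illustration, and `x` (any residue) is written as `.`.

```python
import re

# Patterns from the motif descriptions in the text, ordered strictest first.
MOTIF_CLASSES = [
    ("canonical", re.compile(r"PWI")),             # intact PWI tripeptide
    ("hPrp2-like", re.compile(r".[WFY][IV]")),     # [x][WFY][IV]
    ("U5-200K-like", re.compile(r".[WFY][ILV]")),  # [x][WFY][ILV]
    ("hPrp22-like", re.compile(r"..[IV]")),        # [x][x][IV]
]


def classify_tripeptide(tripeptide: str) -> str:
    """Return the label of the strictest pattern matched, or 'none'."""
    tri = tripeptide.upper()
    if len(tri) != 3:
        raise ValueError("expected exactly three residues")
    for label, pattern in MOTIF_CLASSES:
        if pattern.fullmatch(tri):
            return label
    return "none"


# Hypothetical example tripeptides, not taken from real sequences.
for example in ("PWI", "AFV", "SWL", "AAV", "AAA"):
    print(example, classify_tripeptide(example))
```

Because the patterns are nested (anything matching the hPrp2 pattern also matches the U5-200K and hPrp22 patterns), the label indicates only the most conserved class a tripeptide is compatible with, not the protein family it comes from.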
Sequences of proteins from the hPrp22 (DHX8) and hPrp2 (DHX16) families are very similar, to the extent that we could not easily separate them in a clustering analysis (Supplementary Figure S3). The most important discriminant between the two families appears to be the presence of an S1 RNA-binding domain (PDB ID: 2eqs; DOI:10.2210/pdb2eqs/pdb, manuscript to be published) between the N-terminal PWI-like bundle and the C-terminal helicase domains. This domain is present in hPrp22 and its homologs, but not in hPrp2 and its homologs. This led us to the hypothesis that Prp2, with the PWI-like domain, was the ancestral protein, which then underwent the insertion of the S1 domain. Nevertheless, the PWI-like domains of hPrp22 and hPrp2 differ in several aspects.
The first difference lies in the above-mentioned degree of degeneration of the tripeptide PWI motif, which is larger in hPrp22 and its homologs than in hPrp2 and its homologs. In an extreme case, the N-terminus of the Prp22 protein of S. cerevisiae and the related organism Eremothecium (Ashbya) gossypii is located inside the motif, which is therefore incomplete. The degeneration of the PWI motif may be offset by the heavy conservation of a [DE][FY] motif in the second helix of the bundle. The main reason for the conservation of the PWI motif in canonical PWI domains is that it stabilizes the structure of the PWI domain (74). It is possible that the conservation of the [DE][FY] motif is sufficient to guarantee the stabilization of the bundle in conjunction with the conservation of the third position of the PWI motif.
Second, there is also a possible difference in either the number or the arrangement of helices comprising the PWI domain. SCOP describes superfamily a.188.1 as a ‘four-helix bundle’. However, in the structure of the PWI domain from protein SRm160, the bundle is followed by an additional short α-helix orthogonal to the bundle (PDB ID: 1mp1) (74). The presence of this α-helix is also predicted for the hPrp3 PWI domain, although it is missing from the available experimental structure (PDB ID: 1x4q; DOI:10.2210/pdb1x4q/pdb, manuscript to be published). Similarly, secondary structure predictions for hPrp2 also indicated that this protein is likely to contain an additional α-helix. However, for hPrp22, predictions of domain boundaries are less decisive. The hPrp22 PWI-like domain is either predicted to be a four-helix bundle (in which case it is confined to residues 1–92), or to contain an additional α-helix, but separated from the bundle by an intrinsically disordered region (in which case the domain spans residues 1–120). In either case, the helix arrangement is predicted to be different from that in hPrp2. Of note, the U5-200K PWI-like domain is predicted to be a five-helix domain.
Third, the pattern of evolutionary conservation of the PWI-like domains is different in hPrp22 and hPrp2. Fewer putative and confirmed hPrp2 homologs from different species have the PWI-like domain than do hPrp22 homologs. For instance, the functional analog of hPrp2 in S. cerevisiae, Prp2, is considered to be its homolog, but lacks the PWI-like domain. The Prp22 combination of PWI+S1 appears to be retained, while the Prp2 PWI is missing, also in putative homologs in organisms, such as kinetoplastids (Trypanosoma brucei, Leishmania major), some Apicomplexa (Plasmodium falciparum, Babesia bovis, but not Tetrahymena thermophila, which has both), Trichomonas vaginalis and Entamoeba histolytica.
Altogether, the PWI-like domain of hPrp22 is more diverged from the canon, but more often retained, while the PWI-like domain of hPrp2 is less diverged from the canon, but more often completely lost. This result does not contradict the hypothesis that the Prp22 protein was formed by the insertion of the S1 domain into the ancestral Prp2. It rather suggests the possibility that some property of the ‘degenerated’ PWI-like domain ensured its retention in evolution. An in-depth structural study of this region may elucidate the reason why.
As hinted above, the U5-200K PWI-like domain is in many respects a ‘canonical’ PWI-like domain similar to that of hPrp2: it retains two out of three of the positions of the tripeptide PWI motif, and is predicted to be a five-helix domain. However, U5-200K is in general highly conserved, and unlike in hPrp2, this conservation also applies to its PWI-like domain.
The N-termini of S. cerevisiae Prp2 and Prp22 are dispensable for splicing (75,76), while the N-terminus of S. cerevisiae Brr2 was shown not to contact any of the proteins of the U4/U6.U5 tri-snRNP (71). Hence, the N-terminal PWI-like domains of hPrp2, hPrp22 and U5-200K are likely to have only a supporting role in splicing, one that is not revealed in the activity of the yeast proteins. We suggest that they may help in the correct positioning of the C-terminal helicase domains on the relevant snRNAs. Nevertheless, we could not find any data on the activity of the N-termini of hPrp2, hPrp22 and U5-200K. Furthermore, no experimental model of a PWI domain bound to RNA exists, to which we could compare the mode of binding of the hPrp2, hPrp22 and U5-200K PWI-like domains. Hence, as far as this publication is concerned, the question of what is bound to the PWI-like domains of the splicing helicases remains open.
We could not confirm a published prediction of a bromo-domain encompassing hPrp8 residues 127–242 (a part of the N-terminal PFAM domain PRO8NT), originally made for yeast Prp8 residues 200–315 (77). In our view, the bromo-domain assignment does not command a consistent evolutionary conservation pattern. It encompasses 20 residues universally conserved in Prp8 homologs from all known species and nearly 100 residues conserved only in some eukaryotic Prp8 homologs. On the other hand, we were able to construct a de novo model for the most conserved part (residues 86–150) of the PRO8NT domain (Supplementary Figure S4). Quality evaluation indicates that the model of the putative Prp8 bromo-domain described in (77) has low predicted accuracy (predicted RMSD 8.7Å, QMEAN Z-score −4.25) compared to our de novo model of residues 86–150 (predicted RMSD of 2.4Å, QMEAN Z-score −1.93). Altogether, although we cannot exclude the possibility that PRO8NT encases a bromo-domain, we suggest that further studies (ideally: experimental structure determination) will be required to provide a confident structural model of this region.
We found several other new types of structured regions in abundant splicing proteins that we were able to assign to known folds and/or are similar to existing structures, with varying degree of confidence (Table 7). For instance, a region in the C-terminus of the hPrp19/CDC5L-related protein KIAA0560 (IBP160/Aquarius homolog; residues 453–1485) has a helicase architecture similar to the nonsense-mediated decay protein Upf1p (Figure 9). KIAA0560 is a 1485-residue-long protein, whose binding to pre-mRNA introns is necessary for the successful deposition of the exon junction complex on the pre-mRNA (78) and for successful release of box C/D snoRNAs (small nucleolar RNAs) from introns (14). Upf1p contains two RNA helicase domains (c.37.1), the first of which is interrupted twice by two insertions: an all-β and an all-α domain insertion (79). In KIAA0560, this first c.37.1 domain is interrupted three times: both of the original insertions are kept, but a third insertion, largely disordered, has appeared between them.
Another previously undescribed region lies in the C-terminus of the B complex protein TFIP11 (homolog of the yeast protein Spp382). The results of our FR analysis suggest that this region is a potential double-stranded RNA binding domain (dsRBD) (Figure 9). In other splicing proteins, such as the non-abundant A complex protein DHX9, dsRBD domains often occur in tandem, but the TFIP11 region does not have a partner. However, TFIP11 also contains another previously structurally uncharacterized region with a putative RNA-binding function, a G-patch domain. While the G-patch domain does not show sequence similarity to any other known domains, a highly scoring de novo model of this domain shows structural similarity to a dsRBD domain (Figure 9). In fact, in the non-abundant splicing-related protein SON, the G-patch domain occurs in tandem with a dsRBD domain partner. If the G-patch domain has a dsRBD-like fold, the TFIP11 G-patch domain could provide the functionality of a second tandem dsRBD-like domain for the newly described putative dsRBD of TFIP11.
We were also able to construct highly scored de novo models with a clear structural similarity to known folds for ordered helical regions located on the N-termini of proteins hnRNP R and Q. No known structural domain is assigned to these regions, but our de novo models of these regions exhibit fairly high scores (predicted RMSD 1.3Å, QMEAN Z-score 0.12) for the region in protein hnRNP R. Based on structural similarity scores yielded by the DALI server (51), these may be helix-turn-helix domains (Figure 9).
Other new putative structural domains are described in Table 8.
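The model quality figures quoted in this section (predicted RMSD and QMEAN Z-score) can be collected and compared side by side. The following is a minimal Python sketch using only the values stated in the text; the class and function names are illustrative, and the pairwise comparison rule (lower predicted RMSD and higher Z-score are both treated as better) is a simplification adopted for this example rather than the evaluation procedure used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelQuality:
    name: str
    predicted_rmsd: float  # in angstroms; lower is better
    qmean_z: float         # QMEAN Z-score; higher (closer to 0) is better


# Values as quoted in the text.
MODELS = [
    ModelQuality("SF3a60 zinc finger (original comparative model)", 8.8, -1.93),
    ModelQuality("SF3a66 all-beta domain (original comparative model)", 4.7, -0.92),
    ModelQuality("Prp8 putative bromo-domain (model from ref. 77)", 8.7, -4.25),
    ModelQuality("Prp8 PRO8NT residues 86-150 (de novo)", 2.4, -1.93),
    ModelQuality("hnRNP R N-terminal helical region (de novo)", 1.3, 0.12),
]


def rank_by_rmsd(models):
    """Sort models from lowest to highest predicted RMSD."""
    return sorted(models, key=lambda m: m.predicted_rmsd)


def dominates(a: ModelQuality, b: ModelQuality) -> bool:
    """True if a is at least as good as b on both scores and better on one."""
    no_worse = a.predicted_rmsd <= b.predicted_rmsd and a.qmean_z >= b.qmean_z
    better = a.predicted_rmsd < b.predicted_rmsd or a.qmean_z > b.qmean_z
    return no_worse and better


for model in rank_by_rmsd(MODELS):
    print(f"{model.predicted_rmsd:4.1f}  {model.qmean_z:6.2f}  {model.name}")
```

Under this rule, the de novo model of Prp8 residues 86–150 is better than the putative bromo-domain model on both scores, in line with the comparison made in the text. The SF3a60 zinc-finger case is a reminder that a model with a high predicted RMSD can still have the correct fold.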
The human spliceosome, with its 119 abundant proteins, represents a fairly challenging target for both experimental and theoretical structural analyses. To round off our analysis, we wanted to put forth a candidate minimum set of structural regions in a functional spliceosome that, in our opinion, should be prioritized during the modeling of the structure of the complex.
In general, eukaryotic species with fewer introns have fewer splicing proteins. The yeast Saccharomyces cerevisiae has homologs of only 61 of the human abundant splicing-related proteins (2). On the other hand, S. cerevisiae also has some Saccharomycetes-specific splicing proteins, such as Prp24 (41), which do not appear in other fungi. In search of a ‘minimum’ set of regions to include in the model of a functional spliceosome, we turned to the extremely intron-scarce (80,81) parasitic organism G. lamblia, which is also known for its genome minimalism (82). This organism apparently underwent a reversed process with respect to the diversified and specialized human spliceosomal proteome, namely the loss of many genes encoding spliceosomal proteins.
The genome of G. lamblia ATCC50803 encodes homologs of only 30 human abundant splicing proteins (Table 9). Two more proteins can be found in G. lamblia P15. However, not all of these homologs may be involved in splicing. For instance, G. lamblia ATCC50803 possesses orthologs of U4/U6-15.5K and EIF4A3. In humans, U4/U6-15.5K is a component of the U4/U6 di-snRNP, where it binds to U4/U6-61K (hPrp31) (83), while EIF4A3 is a protein of the EJC (33). U4/U6-61K and all EJC proteins save EIF4A3 are missing in G. lamblia. However, the human U4/U6-15.5K protein also participates in box C/D snoRNP formation (83), where it binds a different protein, which does have a G. lamblia homolog, and the human EIF4A3 is an isoform of the eukaryotic translation initiation factor 4A. It is therefore possible that their orthologs in G. lamblia perform only these splicing-unrelated functions.
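The scale of the reduction can be expressed as the fraction of human abundant splicing proteins with a detectable homolog in each organism. The following Python sketch is a rough illustration only. It assumes that the counts quoted for S. cerevisiae and G. lamblia refer to the same reference set of 119 human abundant proteins, and that the two additional proteins found in G. lamblia P15 can simply be added to the ATCC50803 count.

```python
# Reference set size quoted in the text for the human spliceosome.
HUMAN_ABUNDANT = 119

# Homolog counts quoted in the text; the combined G. lamblia entry is an
# assumption (30 in ATCC50803 plus the two additional proteins in P15).
HOMOLOG_COUNTS = {
    "S. cerevisiae": 61,
    "G. lamblia ATCC50803": 30,
    "G. lamblia ATCC50803 + P15 additions": 32,
}


def retained_fraction(count: int, total: int = HUMAN_ABUNDANT) -> float:
    """Fraction of the human reference set with a homolog in the organism."""
    return count / total


for organism, count in HOMOLOG_COUNTS.items():
    print(f"{organism}: {count}/{HUMAN_ABUNDANT} = {retained_fraction(count):.1%}")
```

As the paragraph above notes, a detectable homolog is not necessarily involved in splicing, so these fractions are upper bounds on the retained splicing machinery.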
There is a pattern to the presence and absence of abundant splicing-related proteins and/or their domains and disordered regions in the G. lamblia proteome. Almost all the proteins of the U2 snRNP are present in G. lamblia, as well as a homolog of U2AF35K, but only some core proteins of the U5 snRNP, such as Prp8 and Brr2. Snu114, which, according to the current understanding, is in other organisms the third part of the troika of U5 proteins essential to splicing (21), is an important absentee. Many proteins of the U1 snRNP and U4/U6 di-snRNP proteome are missing, as are all proteins specific to the human U4/U6.U5 tri-snRNP. The set of Step 2 factors is reduced to three RNA helicases, and these helicases are reduced to C-terminal regions of their human counterparts, with a common architecture. The G. lamblia helicases are also impossible to assign unambiguously to their human or yeast counterparts. Clustering analysis of helicase sequences from different organisms places the G. lamblia helicases away from any major cluster (Supplementary Figure S3). Finally, G. lamblia has very few homologs of human proteins of the auxiliary complexes, and only two non-snRNP stage-specific proteins (PRP38 and RNF113A) are present in this organism.
The snRNP protein homologs present in the G. lamblia proteome are shorter than their human counterparts. Three main types of structural features that are common for human spliceosomal proteins are largely absent from the G. lamblia spliceosomal proteome:
In our analysis of intrinsic disorder in the human spliceosomal proteome (I.K. and J.M.B., submitted for publication), we discuss how disordered regions of splicing proteins are tied to functions of dynamics, assembly and regulation of the spliceosome. This is also the function of known ubiquitin-related regions. Hence, it appears that G. lamblia is missing most proteins and/or protein regions primarily responsible for splicing regulation and dynamics. On the other hand, G. lamblia retained pre-mRNA and snRNA-binding proteins and/or regions, as well as proteins that directly assist in splicing, such as the catalytic factor helicases. It also appears that this parasitic organism’s ubiquitin-based system of splicing control is reduced, rather than entirely missing. The C-terminal Mov34/MPN/JAB1 domain present in Prp8 from human or yeast (SCOP superfamily c.97.3), which may be implicated in a ubiquitin-based system (65), is absent from the G. lamblia Prp8 (84), but the corresponding region in the latter protein is predicted by FR analysis to be a domain with a ubiquitin-like fold (SCOP superfamily d.15.1).
It is possible that, like yeast, G. lamblia evolved its own specialized splicing proteins, which would not be detected in sequence similarity searches done with proteins from other organisms. Since G. lamblia is a parasite, it is also possible that it supplements some of its missing proteins (such as Snu114) from the host. Finally, it is also possible that some information was missed by our bioinformatics analysis but may be uncovered by an in-depth experimental analysis. With the caveat that the data may contain gaps (such as, possibly, Snu114), it is not single proteins that are missing, reduced or degenerated, but entire systems. The reduced set of proteins remaining in our G. lamblia spliceosomal proteome data set corresponds to a system much less dynamic than the human spliceosome, less precisely regulated and less able to adapt to variable conditions. However, such a spliceosome may still be functional. Hence, we propose that from a practical standpoint, the set of structural regions with homologs in G. lamblia is a good starting point for the higher order structural modeling of the spliceosome and constitutes an attractive list of targets for experimental structure determination.
This work has been intended to review the existing structural information about human spliceosomal proteins and to fill in gaps, providing a framework of reference for future structural analyses of the spliceosome. We used protein structure prediction methods to identify ordered spliceosomal protein structural elements either not characterized at all on the structural level or characterized insufficiently, and thus underreported in databases and literature. Examples of such un-/under-characterized elements include the zinc-finger domain in protein SF3a120 of the U2 snRNP, PWI-like domains in the essential splicing helicases hPrp22 (DHX8), hPrp2 (DHX16) and the U5 snRNP protein hBrr2 (U5-200K), and several ubiquitin-related regions in abundant splicing proteins. In the latter case, by combining database data with our results, we determined that ubiquitin processing-related domains are common especially in non-snRNP splicing factors active in the later stages of the splicing reaction. Having completed the characterization of ordered domains of splicing proteins, we constructed a minimum non-redundant set of experimental structural representations of the proteins of the human spliceosome and modeled most of the (potentially) ordered structural elements without experimental structural models. Confident high-resolution structural models can be assigned to over 90% of structural order in the spliceosome proteins, which corresponds to about 50% of all amino acid residues.
We analyzed the spliceosomal proteome of the intron-poor organism G. lamblia to determine a candidate minimum set of structural elements present in a functional spliceosome. We found that the G. lamblia spliceosome does not contain the majority of disordered regions found in the human splicing proteome, and has retained only a vestigial ubiquitin-based system of control. Overall, the G. lamblia spliceosome appears to be much simpler than the human or the yeast one, in accordance with this organism’s overall genomic minimalism and its genome’s intron-poorness.
The results of our analysis of the structural domains in proteins of the human spliceosome may be used to guide experimental characterization of these regions. The characterization of the reduced G. lamblia spliceosome may help set priorities in selecting the structural regions for experimental structural determination, and those to be included in a first draft of a model of a functional spliceosome. We suggest that in the event of modeling the structure of a functional spliceosome, the ordered protein regions found in G. lamblia proteins should take priority. Finally, as long as the corresponding structural information is absent, the models we constructed may be used in further structural studies, for instance in modeling the structure of the entire spliceosome. Models of non-‘core’ proteins can be used to broaden our understanding of alternative splicing. Our models, domain characterizations and suggested priorities thus form a framework of reference for future structural studies of the spliceosome, and in particular, for the modeling of the structure of the functional spliceosome.
Following the (near) completion of the parts list of the spliceosome, we are also advancing our understanding of the structure of these parts. This work provides working structural models for a majority of the parts that appear to be ordered regardless of their functional state. While experimental determination of high-resolution structures for all of these elements would be desirable, theoretical models can be used to design experiments or perform calculations/simulations that require protein structure as a basis. The next step in the structural analysis of the spliceosome would be to use integrative modeling techniques to generate three-dimensional pictures of the splicing machinery, in analogy to the previous work on the nuclear pore complex (85,86). The even greater challenge ahead will be to model the dynamics of the splicing cycle, for which an even closer union of experimental and theoretical techniques will be required.
Supplementary Data are available at NAR Online: Supplementary Tables 1–4 and Supplementary Figures 1–4.
EU 6th Framework Programme Network of Excellence EURASNET [EU FP6 contract no LSHG-CT-2005-518238]. J.M.B. has been additionally supported by the 7th Framework Programme of the European Commission [EC FP7, grant HEALTHPROT, contract number 229676], by the European Research Council [ERC, StG grant RNA + P=123D] and by the ‘Ideas for Poland’ fellowship from the Foundation for Polish Science. Computing power has been provided in part by the Interdisciplinary Centre for Mathematical and Computational Modeling of the University of Warsaw [grant number G27-4]. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the article. Funding for open access charge: EC FP7 contract number 229676 (HEALTHPROT) and by ERC (RNA + P=123D).
Conflict of interest statement. None declared.
We thank Łukasz Kozłowski, Albert Bogdanowicz, Marcin Pawłowski, Geoff Barton, Jim Procter and Pascal Benkert for help with their software. We also thank Reinhard Lührmann, Elżbieta Purta, Łukasz Kozłowski, Joanna Kasprzak, and Anna Czerwoniec for critical reading of the article, useful comments and suggestions.