The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
Serine/arginine-rich (SR) splicing factors play an important role in constitutive and alternative splicing as well as during several steps of RNA metabolism. Despite the wealth of functional information about SR proteins accumulated to date, structural knowledge about the members of this family is very limited. To gain a better insight into structure-function relationships of SR proteins, we performed extensive sequence analysis of SR protein family members and combined it with ordered/disordered structure predictions. We found that SR proteins have properties characteristic of intrinsically disordered (ID) proteins. The amino acid composition and sequence complexity of SR proteins were very similar to those of disordered protein regions. More detailed analysis showed that the SR proteins, and their RS domains in particular, are enriched in disorder-promoting residues and depleted in order-promoting residues as compared to the entire human proteome. Moreover, disorder predictions indicated that RS domains of SR proteins were completely unstructured. Two different classification methods, the charge-hydropathy measure and the cumulative distribution function (CDF) of the disorder scores, were in agreement with each other, and they both strongly predicted members of the SR protein family to be disordered. This study emphasizes the importance of the disordered structure for several functions of SR proteins, such as spliceosome assembly and interaction with multiple partners. In addition, it demonstrates the usefulness of order/disorder predictions for inferring protein structure from sequence.
Serine/arginine-rich (SR) proteins constitute a family of metazoan splicing factors that are essential for both constitutive and alternative splicing of pre-mRNAs (1). In constitutive splicing, they are known to promote cross-intron and cross-exon interactions, and to influence the recruitment of the U1 snRNP and U2AF splicing factor into the spliceosome (2). In alternative splicing, SR proteins are known to interact with exonic splicing enhancers (ESEs) and to stimulate the splicing of adjacent introns (3). Recent studies suggested several additional functions for SR proteins in mRNA metabolism [reviewed in (4)].
It is generally accepted that there are 10 canonical SR proteins in mammals, with sizes ranging from 20 to 75 kDa (5). These proteins were initially identified and grouped into a family based on common biochemical and immunological properties (1). SR proteins belong to a larger superfamily of SR-like proteins that are characterized by the presence of RS or RS-like domains (6). A bioinformatic approach identified about 50 proteins with RS domains in Homo sapiens, 80 in Caenorhabditis elegans and 110 in Drosophila melanogaster (7).
All SR proteins have a modular organization and consist of one or two RNA recognition motifs (RRMs), located at the N-terminus, and one arginine-serine-rich (RS) domain, located at the C-terminus. The RRM domains generally recognize specific RNA sequences through a wide range of interactions (8), and they can also participate in protein–protein interactions (9). Likewise, RS domains can engage in homotypical protein–protein interactions (2), and it was shown recently that they could also contact the pre-mRNA branchpoint (10). Thus, both RRM and RS domains have a broad binding specificity.
RS domains are required for all essential functions of SR proteins. It has been shown that RS domains function as splicing activation domains (11), and that they harbor signals for nuclear localization and nucleocytoplasmic shuttling (12,13). Besides these important functions, recent studies demonstrated that the RS domain of the ASF/SF2 splicing factor is also required for nonsense-mediated mRNA decay (14). The RS domains are heavily phosphorylated on the serines by two families of kinases (15). Phosphorylation and dephosphorylation of RS domains modulate their interactions with other proteins and RNA (16).
Although SR proteins have been a topic of intense investigation for the last fifteen years, only limited structural knowledge about this protein family is available to date. Structural and functional studies of SR proteins are largely impeded by the difficulty of their purification. These proteins are prone to inclusion body formation and aggregation during the purification procedure (A. Krainer, personal communication). Possibly for this reason, structural knowledge about SR proteins is currently limited to the RRM domains. The structures of the RRM domains from several RNA-binding proteins (but not from the canonical SR proteins) have been determined [reviewed in (17)]. In addition, there is only one circular dichroism (CD) study that investigates the structures of the full-length ASF/SF2 protein as well as the structures of its deletion mutants, delta-RS and the RS domain itself (18). The CD spectrum of the RS domain is characteristic of the random coil conformation, whereas full-length ASF/SF2 and the delta-RS construct have some α-helical content (18).
The aim of this study is to expand structural knowledge about the SR protein family using sequence analysis combined with the prediction of ordered and disordered protein regions. Intrinsically disordered (ID) proteins represent a new class of proteins that lack a folded structure under physiological conditions and that exist as an ensemble of conformations (19–21). The growing list of ID proteins currently consists of over 200 proteins [(22), see also http://www.disprot.org/]. It has been shown that ID proteins and regions are involved in numerous important biological functions (23–25), including signaling (26), protein–protein interactions with multiple partners (27) and post-translational modifications (28).
Here, we show that the RS domains of SR proteins are predicted to be completely disordered, and that SR proteins belong to the growing class of intrinsically unstructured proteins. These findings emphasize the importance of disorder for determining broad binding specificity of SR proteins and for spliceosome assembly. In addition, they add splicing to the growing list of biological functions in which disordered proteins and protein regions are involved.
Sequences of 10 human SR proteins (Table 1) were extracted from the SWISS-PROT database release 46.3. The dataset of the disordered protein regions was extracted from the DisProt Database (22). The dataset of ordered protein regions (O_PDB_S25) was constructed as described (29) and represents a non-redundant subset of well-ordered globular proteins extracted from the PDB Select 25 database (30). The disorder predictions on this dataset served as a control for estimating the false-positive prediction error rate in Figure 3. The Globular-3D dataset consisted of ordered protein regions extracted from PDB (31); fibrous sequences such as coiled coils, collagen and silk fibroins were removed from this dataset (29). The sequences of human proteins for the human proteome dataset were extracted from the NCBI ftp site (ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/protein). The datasets of completely ordered and completely disordered proteins used in Figure 6 were constructed as described (32).
Predictions of intrinsic disorder in SR proteins were carried out using a well-characterized disorder predictor PONDR® VL-XT (29,33). This predictor was trained on the experimentally (X-ray and NMR) confirmed disordered protein regions of a length of at least 30 residues, while the ordered training set included completely ordered proteins extracted from the non-redundant set of proteins from PDB Select 25 (30). The accuracy of this predictor, benchmarked on the 42 CASP5 targets, reached 72.8% (34). PONDR® VL-XT is currently being used successfully to guide the removal of disordered regions that interfere with crystallization of ‘problematic’ proteins for high-throughput structure determination (35). Access to PONDR® VL-XT was provided by Molecular Kinetics (Indianapolis, IN). VL-XT is copyright© 1999 by the WSU Research Foundation, all rights reserved. PONDR® is copyright© 2004 by Molecular Kinetics, all rights reserved.
The CDF represents a cumulative histogram of the PONDR® VL-XT prediction scores for each residue in a given protein. This histogram allows the separation of ordered and disordered proteins based on the distribution of the disorder scores (38). The boundary points on the CDF plot were calculated as previously described (32).
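The cumulative histogram described above can be sketched as follows. This is a minimal illustration, not the published procedure: the function name `disorder_cdf`, the ten-bin grid and the example scores are assumptions made here, and the per-residue scores would in practice come from a disorder predictor rather than being typed in by hand.

```python
def disorder_cdf(scores, n_bins=10):
    """Return (threshold, fraction of residues with score <= threshold) pairs.

    `scores` are per-residue disorder scores in [0, 1]. The bin grid
    (0.0, 0.1, ..., 1.0) is an assumption chosen for illustration.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    points = []
    for i in range(n_bins + 1):
        threshold = i / n_bins
        fraction = sum(1 for s in scores if s <= threshold) / n
        points.append((threshold, fraction))
    return points


# Hypothetical per-residue scores, not real predictor output.
mostly_ordered = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.85]
mostly_disordered = [0.15, 0.65, 0.75, 0.80, 0.85, 0.90, 0.95, 0.95]

cdf_ordered = dict(disorder_cdf(mostly_ordered))
cdf_disordered = dict(disorder_cdf(mostly_disordered))

# A protein with mostly low scores accumulates its residues early, so its
# curve rises quickly; a protein with mostly high scores rises late.
print(cdf_ordered[0.5], cdf_disordered[0.5])
```

A curve that rises early lies high on the plot and one that rises late lies low; the boundary that separates the two classes is not reproduced here because its coordinates are not given in the text.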
The charge-hydropathy method developed by Uversky et al. (20) has been used to classify SR proteins as ordered or disordered. The mean net charge and the mean normalized Kyte–Doolittle hydropathy (39) were calculated for each protein and their values were plotted against each other. The boundary between ordered and disordered proteins was determined using a linear discriminant function as previously described (32).
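The two coordinates of the charge-hydropathy plot can be computed along these lines. The sketch assumes that the net charge counts only K and R as positive and D and E as negative (histidine and termini ignored), and that hydropathy is rescaled linearly from the Kyte–Doolittle range of -4.5 to 4.5 onto 0 to 1; the function names are hypothetical, and the discriminant boundary from (32) is deliberately left out because its coefficients are not stated here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_normalized_hydropathy(seq):
    """Mean hydropathy after rescaling each residue value to the 0..1 range."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute net charge per residue (assumption: K, R = +1; D, E = -1)."""
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return abs(positive - negative) / len(seq)


# Hypothetical RS-dipeptide repeat, used only to illustrate the calculation.
rs_repeat = "RSRSRSRS"
print(mean_net_charge(rs_repeat), mean_normalized_hydropathy(rs_repeat))
```

For the RS repeat, half of the residues carry a positive charge and both residue types are hydrophilic, which places such a sequence in the high-charge, low-hydropathy region of the plot.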
It was previously established that ordered and disordered protein regions are characterized by significantly different amino acid compositions, with the prevalence of hydrophilic and charged amino acids and the depletion of hydrophobic and aromatic amino acids amongst the disordered regions (21,40). To determine whether sequence attributes of SR proteins are similar to those of the disordered protein regions, we calculated the amino acid frequencies for each of these two datasets (Figure 1). The plot represents the difference in the frequencies between the two studied datasets and a completely ordered set of proteins, Globular-3D (Materials and Methods).
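A composition comparison of this kind can be sketched as below. The text does not state how the difference in Figure 1 was normalized, so the sketch assumes a plain per-residue difference in frequency between a dataset and the ordered reference; the function names and toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Pooled frequency of each standard amino acid across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(aa for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def composition_difference(dataset, reference):
    """Frequency in `dataset` minus frequency in `reference`, per residue.

    Assumption: an unnormalized difference; the published plot may scale it.
    """
    comp, ref = composition(dataset), composition(reference)
    return {aa: comp[aa] - ref[aa] for aa in AMINO_ACIDS}


# Toy sequences standing in for the real datasets.
diff = composition_difference(["RSRS"], ["RSLL"])
print(diff["R"], diff["L"])
```

Positive values mark residues that are enriched relative to the ordered reference and negative values mark depleted ones.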
Due to the presence of RS domains, the frequencies of arginine and serine in the SR protein dataset are significantly higher than in the dataset of disordered proteins. The frequencies of aromatic residues, aliphatic residues and cysteine (the left side of the graph, Figure 1) are very similar for both the SR and disordered protein datasets, with the exception of one residue, tyrosine. Whereas disordered proteins are depleted in tyrosine, the SR proteins are slightly enriched in this residue. Since tyrosine has several distinct properties (such as partial hydrophobicity, aromatic side chain and a reactive hydroxyl group), it is tempting to speculate that it could participate in stacking interactions with the RNA bases. Another interesting difference between the two datasets is the frequency of the negatively charged glutamic acid: SR proteins are depleted in E while the disordered proteins are significantly enriched in E (Figure 1). The depletion of SR proteins in the negatively charged D and E and their enrichment in the positively charged K and R may be essential for interaction with the negatively charged RNA. In spite of a few compositional differences between SR and disordered proteins, the overall trend for the two datasets is very similar, with their overall depletion in the hydrophobic residues and enrichment in some hydrophilic, charged and flexible residues (such as proline, lysine and arginine).
The analysis of ID proteins that were characterized by various experimental techniques (such as NMR, X-ray and CD) has indicated that, independent of the characterization method, all disordered proteins have similar amino acid frequencies (21). Based on this analysis, it was proposed that the residues found to be enriched in all disordered proteins be called disorder-promoting residues (A,R,S,Q,E,G,K,P), and that the residues found to be depleted in all disordered proteins be called order-promoting residues (N,C,I,L,F,W,Y,V) (21). We calculated the frequencies of disorder- and order-promoting residues for each of the SR proteins as well as for each of the RS domains (Table 1). This analysis shows that the SR proteins, and their RS domains in particular, are enriched in disorder-promoting residues and are depleted in order-promoting residues as compared to the entire human proteome. The average percentages of disorder promoters reach 68.7% for SR proteins and 87.9% for RS domains, while the average for the human proteome is only 51%. At the same time, the averages for order promoters are 20.4%, 7.2% and 32.9% for SR proteins, RS domains and the human proteome, respectively. Given the high proportion of disorder-promoting residues within RS domains, it is very likely that these domains would be unable to adopt a stable 3D structure in solution without binding partners.
Shannon's entropy (36), when applied to a protein sequence, can be used as a measure of its sequence complexity (29). Previously, it has been shown that disordered sequences have an overall lower sequence complexity than ordered sequences (29). Furthermore, an independent analysis of 126 intrinsically unstructured sequences indicated that they are characterized by a higher frequency of short repetitive regions (43), thereby confirming the prevalence of low complexity segments among this protein class. SR proteins represent a perfect example of ID proteins that carry low complexity regions corresponding to the RS domains.
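The percentages of disorder- and order-promoting residues follow directly from the two residue sets listed above. The sketch below uses those sets as given; the function name and the example sequences are hypothetical and are not taken from Table 1.

```python
DISORDER_PROMOTING = set("ARSQEGKP")
ORDER_PROMOTING = set("NCILFWYV")


def promoter_percentages(seq):
    """Return (% disorder-promoting, % order-promoting) residues in `seq`.

    Residues in neither set (D, H, M, T) count toward the length only.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    n_disorder = sum(1 for aa in seq if aa in DISORDER_PROMOTING)
    n_order = sum(1 for aa in seq if aa in ORDER_PROMOTING)
    return 100.0 * n_disorder / len(seq), 100.0 * n_order / len(seq)


# Hypothetical sequences used only to illustrate the calculation.
print(promoter_percentages("RSRSRSRS"))  # pure RS repeat
print(promoter_percentages("RSILDT"))    # mixed toy sequence
```

A pure RS repeat consists entirely of disorder-promoting residues, which is the extreme toward which the RS domain averages reported above tend.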
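As a rough illustration of the complexity measure, Shannon's entropy of a sequence segment can be computed from its residue frequencies, $K = -\sum_i p_i \log_2 p_i$. The sliding-window helper and its window length are assumptions introduced here, since the window used for the published analysis is not stated in this text.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits) of the residue distribution in `segment`."""
    if not segment:
        raise ValueError("segment must not be empty")
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())


def windowed_entropy(seq, window):
    """Entropy of every full-length window along `seq` (window is an assumption)."""
    return [shannon_entropy(seq[i:i + window])
            for i in range(len(seq) - window + 1)]


# Hypothetical sequences: a dipeptide repeat versus a segment in which
# all 20 standard residues occur once.
print(shannon_entropy("RSRSRSRS"))              # two equally frequent residues
print(shannon_entropy("ACDEFGHIKLMNPQRSTVWY"))  # maximal diversity
```

The upper limit for the 20 standard residues is log2(20), about 4.32 bits, so a two-residue repeat such as an RS stretch sits far toward the low-complexity end of the scale.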
Here, we determined the overall sequence complexity of the SR protein family and compared it to the complexity of ordered and disordered protein regions (Figure 2). As expected, SR proteins and disordered protein regions have similar complexity distributions that differ from the complexity distribution of ordered regions. The analysis shows that SR proteins have an even higher proportion of extremely low complexity segments than the disordered proteins (compare the complexity values from ~1.5 to 3.0). In addition, the peaks for SR and disordered regions overlap and are shifted towards lower complexity values as compared to the ordered regions (Figure 2). Thus, a sequence complexity analysis of SR proteins suggests that, similar to the disordered proteins, they are enriched in low complexity segments.
We then applied a well-characterized disorder predictor PONDR® VL-XT (Materials and Methods) to SR proteins to predict the location of ordered and disordered regions. Results of the prediction agreed with the sequence analysis and further confirmed the high disorder content within this protein family. The disorder predictions, when analyzed based on the percentages of predicted disordered residues, clearly indicate an extremely high disorder content for SR proteins as compared to human proteome (Figure 3). For example, for regions of ≥40 consecutive disorder predictions (where the false-positive disorder prediction error rate is <1%) SR proteins have ~4.9-fold more predicted disordered residues than human proteins (~44% for the SR proteins versus ~9% for the human proteins), whereas for regions of ≥100 consecutive disorder predictions SR proteins have ~11-fold more predicted disordered residues (~30% for the SR proteins versus ~2.7% for the human proteins).
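The run-length filtering used in this comparison can be sketched as follows. The code assumes that a residue counts as predicted disordered when its score is at or above 0.5 (whether the boundary value itself is included is an assumption), and the function names and toy scores are hypothetical. The last two lines simply recompute the fold differences from the percentages quoted above.

```python
from itertools import groupby


def disordered_runs(scores, cutoff=0.5):
    """Lengths of maximal runs of consecutive residues scoring >= cutoff."""
    return [len(list(group))
            for is_disordered, group in groupby(scores, key=lambda s: s >= cutoff)
            if is_disordered]


def fraction_in_long_runs(scores, min_len, cutoff=0.5):
    """Fraction of all residues that lie in disordered runs of >= min_len."""
    long_total = sum(r for r in disordered_runs(scores, cutoff) if r >= min_len)
    return long_total / len(scores)


# Toy scores: a run of 5, then 3 ordered residues, then a run of 2.
toy = [0.9] * 5 + [0.1] * 3 + [0.8] * 2
print(disordered_runs(toy), fraction_in_long_runs(toy, min_len=3))

# Fold differences implied by the percentages quoted in the text.
print(round(44 / 9, 1))    # regions of >= 40 residues
print(round(30 / 2.7, 1))  # regions of >= 100 residues
```

Raising `min_len` discards short predicted regions, which is how the analysis trades sensitivity for a lower false-positive rate.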
Other disorder attributes (such as overall percentage of predicted disordered residues, average disorder score and longest disordered region), calculated on a per protein basis, also indicate that SR proteins belong to a class of intrinsically unstructured proteins (Table 2). With the exception of two proteins, ASF/SF2 and SRp30c, overall percentages of predicted disordered residues exceed 50% for all remaining SR proteins. Furthermore, the average disorder score for all but one SR protein (SRp30c) is greater than 0.5, where 0.5 is a boundary score between order and disorder.
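The three per-protein attributes named here can be derived from a single pass over the per-residue scores. This is an illustrative sketch: the function name, the dictionary keys and the inclusive 0.5 cutoff are assumptions, and the example scores are invented rather than taken from Table 2.

```python
def disorder_summary(scores, cutoff=0.5):
    """Percent disordered residues, mean score and longest disordered region."""
    if not scores:
        raise ValueError("scores must not be empty")
    n_disordered = 0
    longest = current = 0
    for s in scores:
        if s >= cutoff:          # assumption: the boundary value counts as disordered
            n_disordered += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0          # an ordered residue ends the current region
    return {
        "percent_disordered": 100.0 * n_disordered / len(scores),
        "mean_score": sum(scores) / len(scores),
        "longest_region": longest,
    }


# Invented scores for a short hypothetical protein.
print(disorder_summary([0.9, 0.8, 0.2, 0.7, 0.6, 0.9, 0.1, 0.1]))
```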
The analysis of individual predictions shows that the disorder predictions for SR proteins highly correlate with their domain organization (Figure 4). In general, RRM domains are predicted to be ordered, while Gly-rich regions and RS domains are predicted to be disordered. Although the boundaries of these predictions do not always correspond exactly to the domain boundaries, for the most part, they agree with each other fairly well (Figure 4).
Remarkably, our predictions correlate well with the limited structural information that is available for the individual domains of SR proteins. For example, we predict that the RRM domains of SR proteins are ordered (Figure 4). This prediction is supported by experimental data: the structures of RRM domains from several RNA-binding proteins have been solved, and RRMs are indeed ordered. They consist of four antiparallel β-strands packed against two α-helices, forming a β-sheet that makes multiple contacts with RNA (44). In addition, disorder predictions for the RS domains also agree with experimental observations. The CD spectrum of the isolated recombinant RS domain of the SF2/ASF splicing factor is typical of a completely unstructured protein, with maximum negative ellipticity at ~200–202 nm and the isodichroic point around 212 nm, suggestive of random coil conformation (18). The flexibility and disorder of the glycine-rich protein regions in solution have also been previously observed (45).
The binary classification of proteins as either ordered or disordered is an oversimplification of a real biological situation since most proteins consist of a mixture of ordered and disordered regions. At the same time, it has been proven useful for estimating the disorder content of genomes (32). Such binary classification can also be used to estimate the disorder content of protein families or protein functional categories.
One of the methods for classifying a protein as ordered or disordered is the CDF of disorder scores (38). This method separates ordered and disordered sequences based on the per-residue disorder scores; the optimal boundary between these two protein classes was determined using the univariate normal probability density function (32). The CDF curves for ordered proteins are located above the boundary, while the CDF curves for disordered proteins are located below the boundary.
Here, we applied the CDF method to classify SR proteins (Figure 5). According to the CDF classification, 9 out of 10 SR proteins belong to the class of ID proteins. The CDF curve for only one SR protein, SRp30c, is located slightly above the order-disorder boundary. This protein could be considered marginally ordered (or marginally disordered) using this classification method. Indeed, SRp30c carries the shortest RS domain (only 13 residues), and all other disorder attributes (Table 2) for this protein are suggestive of its marginally ordered structure (Figure 4). Thus, CDF analysis classifies the majority of SR proteins as disordered.
Another method that has previously been developed for binary classification of proteins (20) is based on the calculated mean net charge and mean normalized Kyte–Doolittle hydropathy (39). When plotted against each other, these measures are known to separate ordered and disordered proteins by a boundary that could be determined using a linear discriminant function (32). Disordered proteins are generally clustered above the boundary, and therefore are characterized by a combination of high net charge and low hydropathy. In contrast, ordered proteins are generally clustered below the boundary and are characterized by lower net charge and higher hydropathy than disordered proteins.
When we applied the charge-hydropathy classification to the SR protein family, all 10 members of this family fell into the disordered protein category (Figure 6). As expected from the previous analysis, the SRp30c protein is located closer to the order-disorder boundary than the remaining SR proteins. The SRp75 protein has the lowest hydrophobicity and one of the highest values for the net charge (the leftmost red diamond in Figure 6), in agreement with the highest content of predicted disorder and the longest RS domain in comparison to other SR proteins (Figure 4).
In summary, two classification methods applied to the SR protein family are in agreement with each other, and they both predict that SR proteins belong to the class of intrinsically unstructured proteins. To a large degree, this classification is attributable to the disordered nature of the RS domains that comprise a significant portion (up to 64% of protein length in the extreme case of SRp75) of the SR proteins. The potential importance of disorder for RS domain functions is discussed in the section below.
It is widely accepted that the RS domains of SR proteins participate in homotypical protein–protein interactions with the RS domains of numerous other SR and non-SR proteins (2,46). Moreover, it was recently suggested that RS domains could also specifically contact the pre-mRNA branchpoint (10). Thus, the RS domains seem to have dual specificity, i.e. they participate in interactions with both proteins and RNA.
It is difficult to understand how such a broad binding specificity is achieved assuming that the RS domain has a folded globular structure. In contrast, a broad specificity is in perfect agreement with the disordered structure because intrinsic structural disorder could facilitate accommodation of structurally diverse partners. Indeed, numerous examples from the literature indicate that ID proteins and protein regions are involved in protein–protein and protein–nucleic acids interactions (23,24), and that such interactions may also include folding upon binding (47). Structural plasticity is especially important for interactions with multiple partners (27,48). Thus, we suggest that disorder of the RS domain of SR proteins is crucial in determining the broad binding specificity of these factors.
Another important function of disorder within RS domains arises from their dispensability in the splicing reaction. The fact that the RS domains from non-SR proteins, as well as synthetic RS domains consisting of only RS dipeptide repeats, are sufficient to activate splicing (49) argues against the requirement for a particular 3D structure for this function. Rather, it supports the requirement for a disordered structure, because the RS domains of other proteins, as well as the sequences of RS dipeptides, are also predicted to be unfolded (data not shown).
The RS domains of SR proteins are extensively phosphorylated by two families of kinases, the SR protein-specific kinases (SRPKs) and Clk/Sty protein kinases (15). Phosphorylation is required for translocation of SR proteins from the cytoplasm to the nucleus (50), and it is also known to regulate the activity of SR proteins during early development (51). Interestingly, we have previously shown that the phosphorylation sites of numerous other proteins are preferentially located in the disordered regions (28). Furthermore, other modifications such as methylation (52) and ubiquitination (P. Radivojac and L. M. Iakoucheva, manuscript in preparation) are also predicted to occur in disordered protein regions. Consistent with these observations, extensive phosphorylation of the RS domain supports the prediction of its disordered structure. Moreover, methylation of three arginines (R93, R97 and R109) has recently been observed in the ASF/SF2 splicing factor, as well as in several other hnRNPs and SR-like proteins (53). Thus, disorder could potentially facilitate numerous post-translational modifications of the RS domains.
ID proteins are often involved in the assembly of macromolecular complexes (54). The building of such complexes usually proceeds in a step-by-step manner and requires conformational flexibility and adaptability of the constituent components during the assembly process. One example of a macromolecular complex that depends on the flexibility of its components is the ribosome. CD studies of individual ribosomal proteins from Escherichia coli showed that they are substantially disordered when separated from the ribosome (55). Moreover, some of the ribosomal proteins remain in a largely extended conformation even within the ribosome, where they fill ‘gaps and cracks’ between rRNA loops (56). A recent investigation of the biophysical properties of proteins comprising another macromolecular assembly, the nuclear pore complex, showed that they also exhibit structural characteristics typical of natively unfolded proteins (57).
The spliceosome represents another example of a large macromolecular complex, for which only limited structural knowledge is currently available (58). The spliceosome resembles a ribosomal subunit with respect to composition (RNA and proteins), complexity (large number of proteins) and size. SR proteins play key roles in the spliceosome assembly by facilitating recruitment of components of the spliceosome via protein–protein interactions that are potentially mediated by the RS domains (59). It is logical to suggest that the disordered structure of the RS domains would play an important role in facilitating interactions of spliceosome components during the assembly process.
As shown above, numerous functions performed by SR proteins in general, and their RS domains in particular, seem to rely on the disordered structure. Our predictions of disorder for RS domains, together with the classification of SR proteins as a disordered family, strongly suggest that unstructured conformation may be essential for the activity of SR splicing factors. Furthermore, our findings add splicing to the growing list of biological functions (23) performed by the disordered proteins and protein regions.
We would like to thank Adrian Krainer and Keith Dunker for helpful discussions and valuable comments and suggestions. We thank Katherine Montague for proofreading the manuscript. This study was supported by NSF grant MCB 0444818 to L.M.I. Funding to pay the Open Access publication charges for this article was provided by the National Science Foundation.
Conflict of interest statement. None declared.