Little is known about pre-mRNA splicing in Dictyostelium discoideum although its genome has been completely sequenced. Our analysis suggests that pre-mRNA splicing plays an important role in D. discoideum gene expression as two thirds of its genes contain at least one intron. Ongoing curation of the genome to date has revealed 40 genes in D. discoideum with clear evidence of alternative splicing, supporting the existence of alternative splicing in this unicellular organism. We identified 160 candidate U2-type spliceosomal proteins and related factors in D. discoideum based on 264 known human genes involved in splicing. Spliceosomal small ribonucleoproteins (snRNPs), PRP19 complex proteins and late-acting proteins are highly conserved in D. discoideum and throughout the metazoa. In non-snRNP and hnRNP families, D. discoideum orthologs are closer to those in A. thaliana, D. melanogaster and H. sapiens than to their counterparts in S. cerevisiae. Several splicing regulators, including SR proteins and CUG-binding proteins, were found in D. discoideum, but not in yeast. Our comprehensive catalog of spliceosomal proteins provides useful information for future studies of splicing in D. discoideum where the efficient genetic and biochemical manipulation will also further our general understanding of pre-mRNA splicing.
The amoeboid protozoan Dictyostelium discoideum is a eukaryotic model organism that has been extensively used in studying signal transduction, cell motility and cell differentiation. It occupies a unique phylogenetic position and belongs to the group of mycetozoans that branches out after plants but before metazoans and fungi (Baldauf et al., 2000). Little is known about the RNA processing machinery in D. discoideum.
Pre-mRNA splicing is the process that removes intervening sequences (introns) from nascent pre-mRNA transcripts to form functional mRNAs. This process is a critical step in eukaryotic gene expression and occurs in a multi-component macromolecular machine named the spliceosome (e.g., Calarco et al., 2011; Hoskins et al., 2011; Ramani et al., 2011 and references within). This large RNA-protein complex contains, in addition to the pre-mRNA substrate, several uridine-rich small nuclear ribonucleoprotein (snRNP) particles as well as a number of associated proteins. To process the majority of introns (the major class, also called the U2-type introns), the spliceosome contains U1, U2, U4/U6 and U5 snRNPs. The splicing of the minor class of introns (also called the U12-type) occurs in a spliceosome containing U11 and U12 in addition to U4atac, U6atac and U5 snRNPs (for review, see Patel and Steitz, 2003; Will and Lührmann, 2005). Biochemical and molecular studies have revealed major components of the splicing machinery, especially the U2-type spliceosome.
The completion of the D. discoideum genome (Eichinger et al., 2005) provides an opportunity for us to systematically examine pre-mRNA splicing and the splicing machinery in this model organism. We queried the D. discoideum genome available at dictyBase (http://dictybase.org; (Chisholm et al., 2006)) to determine the presence of introns in the coding sequences of the primary protein sequence set at dictyBase. The analysis revealed that among 13,527 predicted and known protein-coding genes in D. discoideum, 9232 (68%) contain at least one intron. This indicates that pre-mRNA splicing plays an important role in the expression of a majority of D. discoideum genes. Furthermore, in our comparison of genomic and expressed sequence tag (EST) sequences, we found that a number of D. discoideum genes undergo alternative pre-mRNA splicing, suggesting that alternative splicing regulation may play a role in the biology of this unicellular organism.
To identify genes encoding D. discoideum spliceosomal components, we searched dictyBase using sequences of spliceosomal proteins present in Homo sapiens (human). Our search criteria for D. discoideum orthologs included sequence similarity, reciprocal matches, the presence of the relevant domain(s), manual review and independent phylogenetic analysis. In general, we found that spliceosomal proteins and related factors in D. discoideum have higher similarity to those in the plant (Arabidopsis thaliana), fly (Drosophila melanogaster) and human (Homo sapiens) genomes than to their yeast (Saccharomyces cerevisiae) orthologs.
D. discoideum has a genome size of 34 Mb, which is smaller than the human (2851 Mb), fly (180 Mb) and plant (157 Mb) genomes, but about 2.6 times the size of the yeast genome (13 Mb). The D. discoideum genome contains 13,527 predicted protein-coding genes, a number similar to those in fly (13,676) and plant (13,029), but substantially higher than that in yeast (5538) (Eichinger et al., 2005). We queried the D. discoideum genome sequence at dictyBase and found that 9210 genes (68%) contain at least one intron. By comparison, 77% of fly genes (Crosby et al., 2007) and only 5% of yeast genes contain intron(s). The mean numbers of introns in spliced genes are 1.0, 1.9, 4.0 and 8.1 in yeast, D. discoideum, fly and human, respectively (Eichinger et al., 2005). We examined gene models and genomic sequences in comparison with ESTs and cDNA sequences. To date, this has led to the identification of 40 genes that have clear evidence of alternative splicing (Table 1). This is contrary to the previous belief that no regulated alternative splicing exists in any unicellular organism (Barbosa-Morais et al., 2006).
Based on published studies, we inspected protein sequences that have been reported as spliceosomal proteins or proteins with experimental evidence for their roles in pre-mRNA splicing. A collection of 264 human sequences for spliceosome-associated proteins (Hartmuth et al., 2002; Zhou et al., 2002; Wu et al., 2004; Collins and Penny, 2005; Barbosa-Morais et al., 2006; Matlin and Moore, 2007; Bessonov et al., 2008) was retrieved from the RefSeq database and used to query the dictyBase database. Figure 1 shows a flow chart for our general search procedure. This search identified the vast majority (154) of the non-redundant D. discoideum homologs of human spliceosomal proteins and related factors. Furthermore, we identified several putative homologs by second-pass individual analyses (see METHODS). This increased the total number of putative spliceosomal proteins to 160, indicating that 61% (160/264) of the human spliceosomal proteins have predicted orthologs in D. discoideum. The D. discoideum spliceosomal proteins and related factors are described below in several groups: the snRNP proteins, non-snRNP proteins, hnRNP and associated proteins, and alternative splicing regulators (Tables 2–5). “No hit” in Tables 2–5 indicates that the identified D. discoideum spliceosomal proteins did not hit their corresponding orthologs in fly, plant and yeast in the RefSeq database using our search criteria. In some cases, the fly, plant or yeast orthologs do exist, but are not identified using D. discoideum proteins because of the sequence divergence between D. discoideum and fly, plant or yeast.
The snRNP proteins are further classified into Sm/Lsm core proteins, U1, U2, U5, U4/U6-specific proteins and tri-snRNP specific proteins (Table 2). Among snRNP proteins, all orthologs to 49 human proteins were identified in D. discoideum, plant, and fly genomes, but only 43 yeast orthologs were available (Table 2). Spliceosomal snRNP proteins in D. discoideum are highly conserved, and similar to those in higher eukaryotes. The Sm/LSm core proteins in the D. discoideum genome have almost one-to-one correspondences to their human counterparts. Such close relationship is illustrated by LSm6 (LSM6) and LSm7 (LSM7) in the phylogenetic tree (Fig. 2A).
When there are two or more closely related proteins in human, D. discoideum often has fewer, or just one ortholog. For example, searches with SmB/B’ (SNRPB) or SmN (SNRPN) led to the same hit, DDB0233178 in D. discoideum. Similarly, only one gene with sequence similarity to both SNRPB and SNRPN has been identified in the D. melanogaster, A. thaliana and S. cerevisiae genomes. This relationship has been demonstrated in the phylogenetic analysis (Fig. 2B). These orthologs in the fly, plant, D. discoideum and yeast genomes are arranged in the expected evolutionary position.
There is only one gene (DDB0233135) identified in D. discoideum with significant sequence similarity to both human U1-specific protein A (U1A, SNRPA) and U2-specific protein B2 (U2B”, SNRPB2). This finding is similar to what is found in the fly and plant. Interestingly, this D. discoideum ortholog only identifies U2B” (Msl1p), but not U1A (Mud1p) in yeast. It is possible that other U1A-like genes exist in D. discoideum, but with more divergent sequences. Three proteins, SmE (SNRPE), the U5-specific 52 kDa (CD2BP2) and tri-snRNP 27 kDa (SNRNP27), were identified by second-pass blast and domain analyses (see METHODS). The small (92 amino acids) SNRPE homolog (DDB0302415) was not present in D. discoideum gene predictions but was identified using the human SNRPE protein sequence and the tBlastn program. The putative Dictyostelium CD2BP2 homolog (DDB0233538) contains a polyasparagine stretch of more than 50 residues. Homopolymers, especially polyglutamine and polyasparagine stretches, are abundant in D. discoideum (Eichinger et al., 2005). Such proteins can be classified by customized blast searches (see METHODS). The possible SNRNP27 homolog (DDB0238807) in D. discoideum, which is present in fly but absent in yeast, was also identified using the tBlastn program. This revealed a gene whose original gene structure was incorrect. After curation at dictyBase, the predicted homolog aligns well with the human SNRNP27 at the C-terminus, where both proteins contain a DUF1777 domain whose function remains unclear. The D. discoideum protein is almost twice as long as its human counterpart (296 versus 155 amino acids). However, this difference in length occurs in the repetitive arginine-rich N-terminal sequences that both proteins share to a different degree.
Non-snRNP proteins associated with spliceosomal assembly and pre-mRNA splicing are classified into several groups: SR and SR-related proteins, PRP19 complex proteins, catalytic step II and late-acting proteins, exon junction complex (EJC) proteins and other splicing factors. We searched the D. discoideum proteome using corresponding human proteins with the same criteria as described above. The D. discoideum non-snRNP proteins are more similar to their A. thaliana, D. melanogaster and H. sapiens orthologs than to those of S. cerevisiae. Among non-snRNP spliceosomal proteins, the majority of the human proteins have orthologs in D. discoideum, A. thaliana and D. melanogaster but not in S. cerevisiae. For convenience, we list them in groups as shown in Table 3.
SR and SR-related proteins are characterized by two structural motifs, RNA recognition motif (RRM) of RNP type and RS domain containing arginine-serine rich sequences. Sixteen members of SR and SR-related proteins have been identified in human. These proteins play important roles in both constitutive splicing and alternative splicing regulation (reviewed in (Blencowe, 2000; Black, 2003; Wu et al., 2004; Sanford et al., 2005; Lin and Fu, 2007; Matlin and Moore, 2007)). Interestingly, three distinct SR protein orthologs with RRMs and an RS domain were identified in the D. discoideum genome. These proteins are DDB0233327, DDB0233352 and DDB0233351, corresponding to human 9G8 (SFRS7), Tra2-beta (SFRS10) and Tra2-alpha (TRA2A), respectively (Table 3A). Several classical SR proteins in mammals do not have orthologs in the D. discoideum genome, including SC35 and ASF/SF2 (Table S1). On the other hand, some SR protein genes in D. discoideum seem to have expanded in numbers. For example, two genes were identified as possible homologs of human SRp75 (SFRS4): DDB0233308 and DDB0233309. In such cases, only those with the highest level of sequence homology were included in Table 3A. It is also interesting to note that the RS domains in D. discoideum SR proteins appear to be more enriched in the RDR/RDRS motif rather than in the typical RS/SR sequences found in mammalian SR proteins. For example, in DDB0233308 and DDB0233327, there are long stretches of RDR/RDRS peptides, whose functional significance remains to be investigated.
All seven well-documented PRP19 complex associated proteins have orthologs in the Dictyostelium, human, fly, plant and yeast genomes (Table 3B). Several proteins known to act during the late stage of spliceosomal assembly and splicing were also found to be highly conserved in Dictyostelium, human, fly, plant and yeast, including Prp22 (DHX8), Prp43 (DHX15), Prp16 (DHX38), Slu7 (SLU7), Prp17 (CDC40) and Prp18 (PRPF18) (Table 3C).
The EJC assembly is a splicing-dependent process and serves to mark the RNA for downstream processing steps such as export, translation and nonsense-mediated decay (Tange et al., 2004; Lejeune and Maquat, 2005). The conservation of EJC proteins is high in the D. discoideum genome (Table 3D). Five EJC proteins corresponding to human SRRM1, BAT1, RNPS1, RBM8A and MAGOH are found in D. discoideum, whereas yeast has only one ortholog to human BAT1. This suggests that RNA processing could be more complex in D. discoideum than in yeast, although further experimental data are required for this generalization.
We identified a number of other spliceosomal proteins in D. discoideum that contain various motifs present in known splicing factors, including DExD/H motifs, cyclophilins, WD40s, cap binding proteins, polyadenylation machinery proteins, zinc finger motifs and other uncharacterized motifs (Tables 3E and 3F). DExD/H containing proteins play important roles in pre-mRNA splicing (Staley and Guthrie, 1998; Cordin et al., 2006). It is interesting to note that almost all of the human spliceosomal proteins with the DExD/H motif have D. discoideum orthologs. Cyclophilins catalyze cis-trans prolyl bond isomerization and facilitate protein conformational changes. All five orthologs of human splicing-related cyclophilins are present in D. discoideum. Searching S. cerevisiae with D. discoideum proteins identified two positive hits (NP_013633 and NP_013317; Table 3E).
Forty-six heterogeneous nuclear ribonucleoproteins (hnRNPs) and other H complex associated proteins in the human genome were used to query dictyBase. Of these 46 sequences, we identified 9 non-redundant orthologs in D. discoideum (Table 4). The human hnRNP L (HNRNPL) hits one D. discoideum gene (DDB0233648), and both human hnRNP R (HNRNPR) and hnRNP Q (SYNCRIP) hit a single D. discoideum protein (DDB0214833). None of these three hnRNPs has a yeast ortholog. This relationship was confirmed in the phylogenetic analysis (Fig. 2C). Among the heat shock proteins, the human query sequences (HSPA1A and HSPA8) identified two groups of orthologs in the fly genome, but only one cluster of orthologs from A. thaliana, D. discoideum and S. cerevisiae (Fig. 2D). These clusters are not specific to either HSPA1A or HSPA8 and differ from the one-to-one relationship found in snRNPs (Fig. 2A and 2B). When these 9 non-redundant proteins were used to search the RefSeq database, all of them corresponded to the initial 16 human query sequences. The search with these putative D. discoideum hnRNP and related proteins also led to the identification of 15, 16 and 7 non-redundant proteins in the fly, plant and yeast genomes, respectively. In comparison with their yeast counterparts, Dictyostelium hnRNP protein orthologs are again more similar to those in the fly, plant and human genomes.
Several groups of alternative splicing regulators have been reported in mammalian and fly genomes. These include hnRNP proteins, the SR protein super-family (SR proteins and SR-related proteins), CUGBP and ETR-like factors (CELF), DExD/H box containing proteins, RNA-binding proteins containing the heterogeneous nuclear ribonucleoprotein K-type homology (KH) or RRM domains, and other RNA binding proteins. A number of proteins are involved in both spliceosomal assembly and alternative splicing regulation.
Alternative splicing regulators of the hnRNP protein family often bind to exonic or intronic splicing regulatory sequences and influence splice site selection (reviewed in (Blencowe, 2000; Black, 2003; Wu et al., 2004; Sanford et al., 2005; Lin and Fu, 2007; Matlin and Moore, 2007)). HnRNP protein orthologs have been described in the previous section. SR proteins play important roles in both constitutive and alternative splicing (Blencowe, 2000; Cartegni et al., 2002; Wu et al., 2004; Sanford et al., 2005; Lin and Fu, 2007). Both hnRNP and SR protein orthologs have also been described in previous sections (Table 3A).
DExD/H box-containing proteins and other RNA-binding proteins also play a role in alternative splicing regulation (e.g., Wu et al., 2006; Fushimi et al., 2008; Kar et al., 2011 and references within). Two orthologs of DExD/H box containing regulators, p68 (DDX5) and p72 (DDX17), were found in D. discoideum (Table 3E).
The CELF family of splicing regulators interacts with CUG-containing splicing regulatory elements and controls alternative splicing of a number of genes (Ladd et al., 2001). RNA transcripts containing expanded CUG/CCUG repeats can bind and sequester CUG-binding proteins and cause aberrant splicing (Ebralidze et al., 2004). Altered expression of CUG-binding proteins has been associated with myotonic dystrophy (Kanadia et al., 2003; reviewed in Wang and Cooper, 2007). Two putative CELF family members were identified in D. discoideum (DDB0233674 and DDB0233675), which correspond to six human CELF family members, CELF1–6 (Table 5). Three fly proteins (NP_788039, NP_609559 and NP_723739) and three plant proteins (NP_171845, NP_567249 and NP_973752) are similar to the two D. discoideum proteins, which are related to the above six human CELF family members (Table 5). Like the heat shock proteins, these CELF orthologs do not have one-to-one relationships to the human CELF proteins. No CELF proteins were found in the yeast genome.
Our sequence analyses of genomic and EST databases strongly support earlier findings (Grant and Tsang, 1990; Bain et al., 1991; Greenwood and Tsang, 1991; Escalante et al., 2003) that D. discoideum has bona fide alternative splicing. To date, we have examined nearly all 13,527 genes individually and compared them with the available EST and cDNAs. This led to the identification of 40 genes that clearly show alternative splicing isoforms (Table 1). With only 50% of the 13,527 estimated genes in D. discoideum having at least some EST coverage, the actual number of alternatively spliced genes may be much higher than the 40 genes in this study. These results strongly suggest that alternative splicing could be important in the biology of this unicellular model organism. Consistent with this notion, a number of alternative splicing regulators have been identified by our sequence searches. Interestingly, all of the major families of alternative splicing regulators reported in mammals and D. melanogaster have been identified in D. discoideum. These include the SR protein super-family, CELF family, hnRNP protein family, DExD box containing proteins and other RNA binding proteins (see individual descriptions in the sections above). SR proteins are among the earliest acting proteins in spliceosome assembly. These proteins can interact with the exonic splicing regulatory elements and are related to the increased protein complexity. The CUG-binding proteins play a role in RNA processing and can regulate alternative splicing of different transcripts (Ladd et al., 2001). The expanded CUG/CCUG-containing transcripts can bind and sequester CUG-binding proteins and cause aberrant splicing (Wang and Cooper, 2007). Altered expression of CUG-binding proteins has been associated with myotonic dystrophy (Kanadia et al., 2003; Wang and Cooper, 2007). The presence of alternatively spliced genes and splicing regulators in the D. discoideum genome provides opportunities for studying alternative splicing in this simple model organism.
D. discoideum snRNA genes were identified using a motif-search algorithm written in Perl (see METHODS section). There are five genes coding for U1 snRNAs, seven for U2 snRNAs, three for U4 snRNAs, two for U5 snRNAs, and one for U6 snRNA. Searches for U11, U12, U4atac and U6atac did not reveal convincing homologs with significant sequence similarity (data not shown), suggesting that D. discoideum may not have the U12-type minor class of spliceosomes. Our results for D. discoideum spliceosomal snRNAs are similar to the findings published by Aspegren and colleagues (Aspegren et al., 2004; Hinas et al., 2006). Taken together, these results suggest that our approach can be applied to other genomes for the identification of snRNA genes.
In this study we identified 160 candidate spliceosomal proteins in the model organism D. discoideum. 68% of the predicted and known protein-coding genes in D. discoideum contain one or more introns and these genes have to undergo pre-mRNA splicing to generate functional mRNA transcripts. Therefore, pre-mRNA splicing is critical for gene expression in D. discoideum. In addition to all spliceosomal snRNAs (U1, U2, U4, U5 and U6), we identified 100 non-redundant sequences in the D. discoideum genome that are likely functional homologs of human non-snRNP spliceosomal proteins. D. discoideum can be used as a model system for studying the spliceosome and its components. The identification of this comprehensive set of spliceosomal proteins in D. discoideum should facilitate studies of pre-mRNA splicing in this model system.
The entire set of spliceosomal snRNP core proteins, the PRP19 complex proteins and late-acting splicing proteins are very highly conserved in yeast, Dictyostelium, plant, fly and human. Such widespread conservation suggests that these proteins play critical roles in the fundamental process of pre-mRNA splicing. D. discoideum branches from the metazoan lineage before yeast. Our analyses show that many metazoan splicing factors that are missing in yeast are present in D. discoideum, indicating that these splicing-associated proteins are more ancient than previously thought. Further study will shed light on the early evolution of the metazoan splicing machinery.
Mutations in several spliceosomal protein genes, PRPF3, PRPF8 and PRPF31, cause human retinal degeneration (reviewed in Pacione et al., 2003; Mordes et al., 2006). It is interesting to note that all these disease-associated spliceosomal proteins are conserved in D. discoideum. Our comprehensive catalog of Dictyostelium discoideum spliceosomal proteins and related factors presented here will be useful for future experiments to elucidate splicing mechanisms and the underlying molecular pathways leading to human disease.
Human spliceosomal proteins were collected from published studies (Hartmuth et al., 2002; Zhou et al., 2002; Wu et al., 2004; Barbosa-Morais et al., 2006; Matlin and Moore, 2007; Bessonov et al., 2008) and were used as the primary source to query dictyBase (http://www.dictybase.org/; Chisholm et al., 2006) using BLASTp with the BLOSUM62 matrix and the SEG filter (for filtering low-complexity subsequences) (Altschul et al., 1990). The D. discoideum protein primary features database and an E-value cutoff of < 10⁻⁶ were used. The local alignments between the human and D. discoideum genes were manually reviewed to identify the structural motif regions present in the human spliceosomal proteins. Peptide sequences with only regional sequence homology but without the known motif(s) characteristic of the corresponding splicing proteins were excluded. Finally, a reciprocal BLAST search was performed using the identified D. discoideum hits as queries to search the RefSeq database (http://www.ncbi.nlm.nih.gov/RefSeq/). If a putative D. discoideum sequence matched the corresponding spliceosome-related gene in the human, fly, plant and yeast proteomes with an E-value < 10⁻⁵, it was accepted as the D. discoideum ortholog.
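The two-threshold reciprocal search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in this study: the `Hit` record, the function names and the E-values in the example are hypothetical, the sketch keeps only the single best hit per query (a simplification of "matched the corresponding gene"), and the manual motif review step is not modeled.

```python
from dataclasses import dataclass

FORWARD_EVALUE = 1e-6     # human query against D. discoideum proteins
RECIPROCAL_EVALUE = 1e-5  # D. discoideum hit against RefSeq


@dataclass(frozen=True)
class Hit:
    """One BLAST hit (hypothetical record; fields mirror tabular output)."""
    query: str
    subject: str
    evalue: float


def best_hits(hits, max_evalue):
    """Return {query: best Hit}, keeping only hits with E-value < max_evalue."""
    best = {}
    for h in hits:
        if h.evalue >= max_evalue:
            continue
        if h.query not in best or h.evalue < best[h.query].evalue:
            best[h.query] = h
    return best


def reciprocal_candidates(forward_hits, reverse_hits):
    """Return (human, dicty) pairs whose dicty hit points back to the same human protein."""
    fwd = best_hits(forward_hits, FORWARD_EVALUE)
    rev = best_hits(reverse_hits, RECIPROCAL_EVALUE)
    pairs = []
    for human_id, h in fwd.items():
        back = rev.get(h.subject)
        if back is not None and back.subject == human_id:
            pairs.append((human_id, h.subject))
    return sorted(pairs)


# Illustrative E-values only; they are not taken from the study.
forward = [Hit("SNRPB", "DDB0233178", 1e-30), Hit("SNRPB", "DDB_other", 1e-3)]
reverse = [Hit("DDB0233178", "SNRPB", 1e-28)]
print(reciprocal_candidates(forward, reverse))
```

Candidates that pass such a filter would still need the manual domain review and phylogenetic confirmation described in this section.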
When we did not identify orthologs with the method described above, dictyBase curators performed individual tblastn and blastp searches at dictyBase combined with domain analyses. As a general rule, blastp hits with at least 25% identity over 70% of the overall length were considered orthologs. This second-pass approach often identified orthologs whose automatic gene prediction had been either incorrect or absent, so that the correctly annotated genes were not present in the dictyBase primary sequence dataset. These genes were then added manually and are now publicly available. In some cases, similarity was masked by highly repetitive sequences in the D. discoideum gene, which are common in D. discoideum. In these cases, blast searches and domain analyses were performed with partial deletions of repetitive strings, and the results were compared with those obtained with the full-length protein. Phylogenetic analysis was performed to identify and confirm the orthologous proteins in different species. The protein sequences in each group were aligned using Clustal W version 2.0 (Larkin et al., 2007) to generate a character matrix in NEXUS file format. Phylogenetic analysis was then performed on each of the protein alignments with MrBayes (Ronquist and Huelsenbeck, 2003), using Markov chain Monte Carlo to approximate the posterior probabilities of each tree. The .con file generated by MrBayes includes two consensus trees, which were used to generate a graphical representation in the program TreeView (Page, 2002).
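As a rough illustration of the second-pass rule of thumb and of the removal of repetitive strings, consider the following Python sketch. It rests on several assumptions that are not stated in the text: coverage is measured against the query length, and the `min_run` and `keep` parameters for shortening homopolymer runs are hypothetical choices rather than the values used by the curators.

```python
import re

MIN_IDENTITY = 0.25  # 25% identity
MIN_COVERAGE = 0.70  # 70% of overall length (assumed here: query length)


def passes_second_pass(identical, aligned_len, query_len):
    """Apply the identity/coverage rule of thumb to one blastp alignment."""
    if aligned_len <= 0 or query_len <= 0:
        return False
    identity = identical / aligned_len
    coverage = aligned_len / query_len
    return identity >= MIN_IDENTITY and coverage >= MIN_COVERAGE


def trim_homopolymers(protein, min_run=10, keep=3):
    """Shorten runs of a single residue (e.g. poly-N or poly-Q).

    Runs of at least `min_run` identical residues are reduced to `keep`
    residues; both parameters are illustrative assumptions.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(1) * keep, protein)


# A toy protein with a 55-residue asparagine run.
toy = "MK" + "N" * 55 + "LQ"
print(trim_homopolymers(toy))
print(passes_second_pass(identical=30, aligned_len=100, query_len=120))
```

Searching with both the trimmed and the full-length sequence, as described above, makes it possible to see whether a repeat is hiding or inflating the similarity.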
To identify spliceosomal snRNAs in the D. discoideum genome, we applied a motif-search algorithm that was designed specifically for this task and written in Perl. To obtain highly conserved short sequence segments (motifs) within snRNAs, known snRNA genes were compared among human, fly and plant (A. thaliana) using the ClustalW program. For every spliceosomal snRNA, we identified sequence motifs whose nucleotide sequences are identical among the three species studied. These motifs and the observed distances between them were used as input for our Perl program, which was used to scan the D. discoideum genome. Sequence variations of 10%–20% were permitted within motifs, and length variations of 10%–20% were permitted in the distances between motifs, because some of the D. discoideum snRNA genes are very divergent from their animal and plant orthologs. The secondary structures of the predicted D. discoideum genes were examined using the M-fold program.
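The general idea of scanning for ordered motifs with tolerated mismatches and flexible spacing can be sketched as follows. This Python sketch is not the authors' Perl program: the function names, the single `tolerance` parameter applied to both mismatches and gap lengths, and the toy sequences are assumptions made purely for illustration.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_motif(genome, motif, start, end, max_mismatch):
    """Yield positions in [start, end] where motif matches with <= max_mismatch substitutions."""
    last = min(end, len(genome) - len(motif))
    for i in range(max(start, 0), last + 1):
        if mismatches(genome[i:i + len(motif)], motif) <= max_mismatch:
            yield i


def scan(genome, motifs, gaps, tolerance=0.2):
    """Return start positions of candidate loci.

    motifs: ordered conserved segments
    gaps: expected distance from the end of motif k to the start of motif k+1
    tolerance: fraction of mismatches and of gap-length variation allowed
    """
    def extend(pos_end, k):
        # Try to place motif k (and all later motifs) after position pos_end.
        if k == len(motifs):
            return True
        gap = gaps[k - 1]
        slack = int(round(gap * tolerance))
        lo, hi = pos_end + gap - slack, pos_end + gap + slack
        allowed = int(len(motifs[k]) * tolerance)
        return any(extend(i + len(motifs[k]), k + 1)
                   for i in find_motif(genome, motifs[k], lo, hi, allowed))

    first = motifs[0]
    allowed = int(len(first) * tolerance)
    return [i for i in find_motif(genome, first, 0, len(genome), allowed)
            if extend(i + len(first), 1)]


# Toy sequence: second motif carries one substitution and the gap is 6 instead of 5.
toy_genome = "TTTT" + "ACGTACGTAC" + "GGGGGG" + "TTGACCTAGA" + "CCCC"
print(scan(toy_genome, ["ACGTACGTAC", "TTGACCTTGA"], gaps=[5]))
```

Candidates returned by such a scan would still need to be checked, for example by secondary-structure prediction as described above.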
Alternatively spliced genes are identified through an ongoing effort at dictyBase, in which each gene model is individually inspected and compared with all available EST and cDNA data.
We thank members of the Wu lab for helpful suggestions and critical reading of the manuscript. This work was supported by grants to J.Y.W. from NIH (EY014576 and GM070967), to A.F. from NSF Career award MCB-0643542 and to R.L.C. from NIH (GM64426 and HG02273).