|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: DADLM PH. Analyzed the data: DADLM PH. Contributed reagents/materials/analysis tools: DADLM PH. Wrote the paper: DADLM PH.
Alternatively-spliced (AS) forms can vary protein function, intracellular localization and post-translational modifications. AS coupled with mRNA nonsense-mediated decay (NMD) can also control the transcript abundance. Here, we have investigated the genome-scale conservation of alternatively-spliced NMD candidates (AS-NMD candidates), in mammals.
We mapped >12 million cDNA/EST library transcripts, comprising pooled data from both older and next-generation sequencing techniques, against genomic sequences to annotate AS-NMD candidates generated by in-frame premature termination codons (PTCs), in the human, mouse, rat and cow genomes. In these genomes, we found populations of genes that harbour AS-NMD candidates, varying in number from ~149 to 2,051 genes. We discovered that a highly-significant proportion (27%–35%) of AS-NMD candidate genes in mouse, rat and cow, also have human orthologs targeted for NMD. Intron retention was the most abundant type of AS-NMD, ranging from 43% to 67% of genes harbouring an AS-NMD candidate. Groupings of AS-NMD candidate genes either with or without intron retentions also have highly significant AS-NMD conservation, indicating that the trend is not due primarily to conservation of intron retentions. As a subset, the AS-NMD intron retentions are distinguished from non-retained introns by higher GC content, and codon usage similar to the usage in protein-coding sequences. This indicates that most of these alternatively spliced sequences have coded for proteins in the recent evolutionary past. In general, the AS-NMD candidate genes showed a similar pattern of Gene Ontology functional category enrichments in all four species. Genes linked to nucleic-acid interaction and apoptosis, and involved in pathways linked with cancer, were the most common. Finally, we mapped the AS-NMD candidates to mass spectrometry-derived proteomics data, and gathered evidence of truncated polypeptides for at least 10% of all human AS-NMD candidate transcripts.
In summary, our analysis provides strong statistical evidence for conservation of functional AS-NMD candidature across Mammalia for a large subset of genes. However, because codon usage of AS-NMD intron retentions is similar to the usage in exons, it is difficult to de-couple conservation of AS-NMD-based regulation from conservation for protein-coding ability, for intron retentions.
Alternative splicing is the generation of multiple transcripts from a single protein-coding gene. Several studies have shown that at least half of all human genes undergo alternative splicing , , although RNA array and sequencing data shows that this number is likely as high as 94% of genes . Alternatively-spliced (AS) forms can vary protein function, intracellular localization and post-translational modifications. Although, in most cases, protein AS isoforms show only small differences, some isoforms can lead to large functional variances and even to genetic disorders . Apparently, alternative splicing can also control transcript abundance, due to its coupling to mRNA nonsense-mediated decay (NMD). It seems that at least 10–15% of human transcripts can be switched off by NMD coupled to alternative splicing .
Nonsense-mediated decay is a eukaryotic surveillance mechanism that detects mRNAs that harbour premature termination codons (PTCs), and commits these transcripts to rapid decay. One of the expected physiological consequences of NMD is the prevention of synthesis of truncated polypeptides, which presumably leads to protection of the cell from resultant deleterious dominant-negative or gain-of-function effects , . The biological importance of NMD is highlighted by the fact that 30% of inherited genetic diseases are due to PTC, most of which are in NMD targets . Therefore, the majority of the nonsense-associated diseases are caused either by an insufficient level of the functional protein as a result of degradation of a PTC-containing mRNA, or by generation of a defective truncated protein from a PTC-containing mRNA that escaped the NMD surveillance , .
In mammalian cells, translation termination codons and exon–exon junctions are cis-acting elements that allow recognition of PTCs. The mRNA is subject to rapid decay when there is a PTC more than ~50–55 nucleotides upstream of the last exon–exon junction . The positional information that marks exon–exon junctions is provided by the components of the splicing-dependent exon junction complex (EJC) that persist during export, and until the mRNA is translated . NMD is believed to require three core UPF factors which are conserved from yeast to humans, namely, UPF1 (also known as regulator of nonsense transcripts 1 [RENT1]), UPF2 (RENT2), and UPF3 . UPF1 is not only essential to NMD but is also required for rapid degradation of histone mRNAs .
Here, we have focused on the analysis of orthologous alternatively-spliced NMD candidates (AS-NMD candidates). We have annotated and analyzed the occurrence and conservation of AS-NMD candidates in genes, generated by in-frame premature stop codons, in four mammalian genomes: human, mouse, rat and cow. Strikingly, we find highly significant conservation of NMD candidature, and also demonstrate that genes containing an AS-NMD candidate are significantly associated with specific functional categories of proteins, and of disease-associated genes.
To assess the existence in public databases of alternatively-spliced forms that are putative candidates for nonsense-mediated mRNA decay (hereafter termed ‘AS-NMD candidates’), we developed and applied a pipeline to four mammalian genomes (human, mouse, rat and cow). This pipeline is described in detail, in Methods. We annotated AS-NMD candidates as transcripts containing in-frame premature termination codons (PTCs) >50 nucleotides 5′ of the last exon splice junction. In brief, we mapped the complete corpus of >12 million cDNA/EST transcript library sequences from several databases (RefSeq, Unigene, dbEST and H-invitational) against genomic sequences for the four mammals (human, mouse, rat and cow). This data set contains pooled transcript data both from older sequencing techniques and from next-generation sequencing methods. We compared these complete transcript library mappings to mappings to reference mRNA mappings for each gene, to calculate whether the mapped expressed sequence is part of an AS-NMD candidate.
The pipeline identified between ~149 and ~2051 genes with AS-NMD candidates, per genome (Table 1). The number of supporting cDNAs/ESTs varied between just 190 in rat, to 14,343 in human. The proportions of genes with AS-NMD candidates are between 1% and 10% of genes in each genome. However, it is important to be aware that, as for any analysis of transcript data, the major difference between several studies lies in the number of transcript sequences considered. For example, Zhang et al. (2009 ), reported that, the number of predicted AS-NMD candidates increased when these authors included predicted RefSeq ESTs in their analysis.
We classified the AS-NMD candidates according to the type of events that they present: IR, intron retention; CSE, cassette exon; ASD, alternative splicing donor; ASA, alternative splicing acceptor (Tables 2–3). Surprisingly, IR was the most abundant event in all genomes. It varied from 43% to 67% of all genes that harbour AS-NMD candidates. ASA was the second most abundant event (19%–32%). Due to the large number of IRs, we present below a set of analyses to verify whether or not—as a population—the IRs are artifacts derived from unprocessed or partially processed pre-mRNAs.
In a previous work, we studied duplicated pseudogenic exons (i.e., ΨEs, exons disabled by frameshifts and premature termination codons) on the same four genomes of our present study . In order to evaluate the contribution of exon duplication and pseudogenization to the production of AS-NMD candidates, we compared the list of genes bearing a ΨE and the list of AS-NMD candidates. We found that only 7 genes are common to both lists in human, and 3 genes in mouse. Thus the vast majority of AS-NMD candidates are not made through exon duplication, followed by disablement. We also cross-referenced our gene list with the Ensembl pseudogenes annotation (Build 52) to verify whether any of our genes was annotated exclusively as a pseudogene. No pseudogene was found in our list, indicating that they are all alternatively-spliced forms of protein-coding genes. We additionally compared our list of AS-NMD candidates with the general annotation of the human genome in the Ensembl database (www.ensembl.org). We found that a small fraction (321 genes, ~15%) in our list were also annotated as NMD-targeted transcripts by Ensembl.
Also, we wished to investigate whether AS-NMD events are associated with lower or higher gene expression levels. To do this, we used the number of ESTs/cDNAs as a broad indicator of expression level. In general, AS-NMD is associated with lower numbers of mapped ESTs/cDNAs (Supplementary Table S1A). For example, in human there are ~75 transcripts mapped per gene, compared with ~7 per gene for genes with AS-NMD events. Also, the substantial majority (>88%) of AS-NMD candidates have low numbers of ESTs/cDNAs (≤3 mapped sequences, Supplementary Table 1B).
We extracted orthology information about genes from the Ensembl database. Each AS-NMD candidate found in one of the three other species (mouse, rat and cow) was used as a query to find the corresponding ortholog in the human genome. Only matches annotated as ‘one2one’ (i.e., the transcript is bi-directionally the best match between the two genomes) were counted as true orthologs.
We examined the human conservation of genes that harbour AS-NMD candidates in the three other genomes (mouse, rat and cow) (Tables 1–2). A large proportion of these genes have clear single orthologs in the human genome (87–91%). Of these orthologous human genes, 27–35% also harbour orthologous AS-NMD candidates. (See the Supplementary Table S2 for the full list of the AS-NMD candidate orthologs in each species, and Supplementary Table S3 for a complete breakdown of the different types of AS-NMD and their conservation patterns.) This degree of conservation is highly significant according to hypergeometric probability (P-values <10−10, treating the conserved cases as a sample of all human genes that have AS-NMD candidature).
Additional calculations for the number of conserved cases that would arise from random placement of the NMD features in human introns, also indicate that these results are highly significant. For example, for the smallest sample (human intron retentions whose NMD candidature is conserved in rat, 92 cases), we generated 1,000 random samples of 92 transcripts, with appropriate weighting for the number of introns in each transcript. These weightings were calculated simply through random sampling from the total list of introns in all human genes. Through this sampling procedure, given the frequency of NMD events in human, we found that only 1.3(±2.3) AS-NMD candidates would have conserved AS-NMD candidature by chance between human and rat (32 conserved cases are observed, Table 2).
This significant conservation of AS-NMD candidature arises both for the whole data set of AS-NMD candidates (Table 1), and for the subset that comprise intron retentions (IRs) only (Table 2). In addition, AS-NMD candidate genes without AS-NMD intron retentions, also exhibit highly significant conservation (not tabulated explicitly in Tables 1–2; P<<0.000001 for rat, mouse and cow using hypergeometric probability, as described above). Interestingly, however, only ~13–22% of the IRs are conserved in the same position in the gene, indicating that there may just be selection pressure to maintain the NMD status, and not a specific retained intron.
Two models for AS-NMD candidates have recently been discussed, the ‘spurious transcript’ model, and the ‘regulatory model’ . In the ‘spurious transcript’ model, the NMD function is to degrade costly-to-make and potentially toxic unwanted transcripts. Another prediction of this model is that AS-NMD should arise for rare transcripts, and be more common in recently-evolved exons; hence, it should not be conserved in other species. The alternative model is the ‘regulatory model’. In this model, the function of the NMD is to modulate gene expression. Although this model does not preclude multiple AS forms, it does not predict them. The regulatory model predicts that AS-NMD candidature should be conserved among distant species .
Our findings on AS-NMD candidate conservation are supportive of the regulatory model of NMD for a large subset of genes. Several previously well-documented examples have leant support to this theory. For instance, the NMD control of splicing in the splicing regulator (SR) genes, one of the most extensive studied cases of NMD controlling gene expression, is conserved across mammals and plants .
Our results seem to disagree with a previous study that found only ~8% AS-NMD candidates in a pair of orthologs between human and mouse . However, a closer look shows that our data is comparable. The authors in this previous study used only RefSeq mRNAs in their analysis. When we considered only RefSeq mRNAs, in our analysis, the number of pairs of orthologous AS-NMD candidates between human and mouse was 7%, in agreement with the previous study .
Intron retention is one of the least studied AS forms. One of the major reasons for this is the difficulty of differentiating real IRs from artifacts derived from unspliced or partially spliced pre-mRNA , . However, previously it has been shown that retained introns seem to present sequence characteristics that distinguish them from non-retained introns, and also from protein-coding exons. These features include their average size, GC content and codon usage. In some cases the retained intron can also encode a protein domain or part of a domain –.
Here, we have assessed the AS-NMD retained introns in our data set for any evidence of selection pressures. One such source of evidence is deviation in GC content because of selection for transcriptional efficiency , , . We analyzed deviation in GC content in the retained introns, as a function of intron length (Figure 1), to avoid GC content bias due to the size of the sequences. Firstly, we examined the following three data sets: retained introns, exons bordering retained introns, and non-retained introns. We divided the three data sets into categories according to their sequence length (Figure 1). Non-retained introns showed a statistically significant lower GC content in all classes of size compared with retained introns and exons (χ2=18.9, p<0.0001; χ2=25.6, p<0.0001, respectively). In contrast, retained introns showed a similar GC content compared with exons (χ2=0.67, p=0.41). For comparison, we also included retained introns in genes that are orthologous AS-NMD candidates between human and mouse (Figure 1). Although, there is no retained intron shorter than 100 kb, all retained introns classes behaved more similarly to exons (χ2=0.21, p=0.9) than to non-retained introns (χ2=34, p<0.0001).
To evaluate the codon usage of the retained introns we used a test described by Galante, et al. (2004, ) (for more details see the Methods section). We created 100 random sets of exons bordering retained introns and non-retained introns, each set containing 16970 codons (the number of codons in the retained intron set). We then put each sequence in frame and performed a pairwise comparison of the total codon table (61 amino acids) of the three data sets (Table 4). We find that retained introns and exons share more similarities with each other than with non-retained introns (Table 4). This indicates that the retained introns may have had phases of protein-coding ability before acquiring premature stop codons, or may still be protein-coding. The fact that codon usage of AS-NMD intron retentions is similar to the usage in exons indicates that it is difficult to de-couple AS-NMD-based regulation from the evolutionary dynamics of protein-coding intron retentions.
We further analyzed the occurrence of stop codons in the retained introns (RIs), to analyse whether they may be able to become protein-coding through changing readong frames. On average, 95% of the RIs have a PTC in more than one frame and 90% of the RIs have a PTC in all three frames. Therefore, they are generally unlikely to be present in a functional, non-truncated protein.
We also assessed the ability of the AS-NMD candidate IRs to encode protein domains. To do this, we extracted three set of sequences: intron retentions, exons and sequences comprising retained introns and their 5′ and 3′ exons (‘E-IR-E’ sequences). These three sequence sets were compared against the Pfam database (as described in Methods). In total, we found 27 retained introns (in 27 genes) coding for a protein domain (17% of a total of 155 genes). Ten out of 27 retained introns encoded part of a domain also encoded by one of the flanking exons. Retained introns encoded for a different domain compared with their flanking exons in 14 cases and in three cases, only the retained intron encoded a Pfam domain. There was no case where the whole E-IR-E sequence encoded a domain.
The majority of the AS-NMD candidate ortholog pairs between human and mouse (153 out of 262) presented an IR in both species. As discussed above, the large number of AS-NMD candidates that are orthologously conserved between human and mouse is consistent with a widespread conserved mechanism to control gene expression through NMD.
We also annotated retained introns with PTCs that do not target the mRNA for NMD (termed ‘non-NMD’ retained introns) (Table 5). The number of genes bearing a non-NMD retained intron was between 2 and 4-fold higher than the number of genes with an AS-NMD candidate (the difference is statistically significant for all four genomes by χ2 test with P≤10−7). This indicates an apparent selection against PTC in NMD areas, and was also observed by us in a previous work , and by other authors as well . The average size of non-NMD intron retentions is greater than AS-NMD candidates, although only in humans, this difference is statistically significant (χ2=19, p<0.0001) (Table 5).
We wished to determine the functional linkages of genes with AS-NMD as a subpopulation of all genes in an organism. Using the webtool FatiGO , we annotated the most common Gene Ontology terms , Biocarta pathways (http://www.biocarta.com/genes/index.asp), TRANSFAC  transcription factors, and KEGG  terms in each genome, and calculated which terms are overrepresented in genes with AS-NMD candidates. Significant over-representation was calculated using a Fisher's exact test with P' value threshold of <0.05, and a correction to P' for multiple hypothesis testing .
The term ‘Nucleotide binding’ (GO:0000166) seems to be ubiquitous for AS-NMD candidates in the four genomes. Other abundant terms include those related to DNA interaction, DNA damage, cell cycle control and apoptosis (Table 6). Transferase and hydrolase activities were also common. The pattern of over- and under-representation of GO terms was similar for all four species. (The GO terms were simultaneously statistically significant in the four genomes only for non-corrected P values; see Table 6 for details.) This result shows that NMD may target different genes in different species, but these genes often belong to the same functional categories.
We also performed a census of the most common protein domains in the genes with AS-NMD candidates (with all domains counted only once per gene). The most common Pfam protein domain was the WD-40 domain (except in rat) (Supplementary table S4). This domain is known to be involved in several cellular functions, such as signal transduction, cell-cycle control and apoptosis . (Supplementary table S4).
In cow and rat, cancer-related pathways are significantly enriched (for P-values corrected for multiple hypothesis testing) according to KEGG classification. In mouse, besides cancer, AS-NMD candidates related with apoptosis are the most abundant. Human AS-NMD candidates are enriched for ‘glycine, serine and threonine metabolism’, ‘epithelial cell signaling in Helicobacter pylori infection’ and ‘androgen and estrogen metabolism’ KEGG pathways (Supplementary table S5). One role of NMD seems to be to protect individuals that are heterozygous for a PTC-containing allele; such an NMD-based mechanism has been demonstrated for genes associated with anemia, breast cancer, and diabetes type II , .
In the BioCarta pathway classification, we found five pathways in humans, enriched for AS-NMD candidates: antigen processing and presentation, inhibition of cellular proliferation by Gleevec, cell-to-cell adhesion signaling, integrin signaling pathway and role of Ran in mitotic spindle regulation (supplementary table S6).
Supplementary Figure S1 shows an example of a Biocarta pathway enriched for AS-NMD candidates. The drug Gleevec (also known as imatinib mesylate or STI-571) is considered one of the major breakthroughs in the treatment of cancer, especially CML, chronic myeloid leukemia. Gleevec acts on a molecular target by a mechanism that is more specific to cancer cells than traditional cytotoxic treatments . We found four key genes in the ‘Inhibition of cellular proliferation by Gleevec’ pathway that are targets for NMD: breakpoint cluster region (BRC), mitogen-activated protein kinase 1 (MAP3K1), v-akt murine thymoma viral oncogene homolog 1 (Akt-1), and c-fos FBJ murine osteosarcoma viral oncogene homolog (FOS). The loss of expression of these genes is linked with several diseases and in some cases can be lethal. For example, it was found that loss of Akt1 resulted in defective ischemia and Vegf-induced angiogenesis and severe peripheral vascular disease in mice . Loss of Mekk1 expression resulted in a greater apoptotic response of cells to hyperosmolarity and microtubule disruption .
We also calculated enrichments of transcription factor binding sites in the promoters of genes that target AS-NMD candidates (Supplementary Table S7). In human, the most significantly-enriched transcription factor binding site was for E2F1. Merdzhanova et. al. (2008, ), studied the association of E2F1 and the splicing regulator SC35 (a gene with an AS-NMD candidate). According to these authors, E2F1 upregulates the expression of SC35 in response to DNA-damaging agents; they also show that SC35 is required for apoptosis in response to these agents. The apoptotic mechanism involves the pro-apoptotic alternative splice form of the gene Bcl-x (also a AS-NMD candidate) that is in turn controlled by SC35 . Expression of the NMD-targeted form of SC35 could switch on the overexpression of anti-apoptotic splice variants of genes like Bcl-x and caspase. Such alteration is believed to contribute to the resistance of tumor cells to chemotherapy , .
We compared the representation of AS-NMD candidate genes in the OMIM database with the overall representation of the human genome in the same database. The difference of representation of the two sets in the OMIM database is highly statistically significant (χ2=129, p<0.0001). While 35% of the human gene complement is present in OMIM, the proportion of human AS-NMD candidate genes in OMIM is 50%. NMD-associated diseases often result from insufficient levels of full-length protein. Therefore, these diseases are generally due to two defects in gene expression: degradation of the newly synthesized PTC-containing mRNA by NMD, and failure of the abnormally low level of PTC-containing mRNA that escaped NMD to generate full-length, and thus, functional protein.
In summary, these analyses indicate that similar functional categories and protein domains are enriched in genes with AS-NMD in all four genomes. Also, in general, KEGG pathways related to cancer are enriched, and AS-NMD genes are more likely to be associated with diseases in the OMIM database. We discussed a specific example of a pathway with several AS-NMD candidates, that is important for cancer treatment, that we found in the Biocarta classification.
Since 5 to 25% of PTC-containing mRNAs fail to elicit NMD and are consequently translated into truncated proteins , several therapies have been developed specially to suppress in-frame PTC , . Some criticism of nonsense-suppressor therapies is that new diseases can be created even if only a single C-terminally extended protein from one of the many cellular transcripts were to accumulate to a toxic level. Furthermore, another problem is the unknown consequences of failure to eliminate naturally-occurring NMD targets .
To quantify the frequency with which the expressed AS-NMD candidates escape NMD and are translated into proteins we mapped each human translated AS-NMD candidate to ~550,000 peptides extracted from the repository of mass spectrometry-derived proteomics data (PRIDE) .
After filtering out the matches for uniqueness and size (peptides less than 7 amino acids long and peptides matching another gene were discarded) we found 205 AS events in 194 genes with peptide matches in the alternatively-spliced area (hereafter referred as AS-NMD-PRIDE genes) (Table 7). This is ~10% of the total number of genes targeted to NMD (Tables 1–2).
The majority (71%–81%) of the human AS-NMD-PRIDE genes possess an ortholog in one of the other species studied. We also compared the ‘AS-NMD-PRIDE’ genes with the remaining AS-NMD candidates (AS-NMD candidates with no match in PRIDE database). Surprisingly, the proportion with conserved AS-NMD orthologs was nearly the same for both sets of genes. For example, in the AS-NMD-PRIDE set 18%, 3% and 8% of the human orthologs in mouse, rat and cow respectively were also targeted for NMD. For the remaining AS-NMD candidates, the proportion was 14%, 3% and 6% (in mouse, rat and cow). More than half of the AS-NMD-PRIDEs come from AS forms containing retained introns (56%), while in AS-NMD candidates the frequency is 47%. Interestingly, the retained introns with PRIDE matches do not show any significant deviation with respect to GC% (Supplementary Table S8).
The AS-NMD-PRIDE set of genes were significantly enriched for ‘protein metabolic process’ (GO:0019538), ‘cellular macromolecule metabolic process’ (GO:0044260) GO biological process and ‘structural constituent of ribosome’ (GO:0003735) GO molecular function, compared with the whole human genome. The last GO term was also enriched when AS-NMD-PRIDE set of genes was compared with the general AS-NMD candidates set.
Among the genes present in the ‘structural constituent of ribosome’ GO term we found two cases in which the ortholog gene was also targeted for NMD: the gene RPL5 in mouse and the bovine RPS13. The human and cow RPS13 presented a retention of intron 1 that targets its mRNA for NMD. Using the ECR browser  to generate an alignment of the RPS13, we could observe a high GC content in the intron 1 (>65%), as well as small size (141 nt), and a high conservation among human, mouse and cow (Figure 2). Taken together, these are the typical symptoms of a non-spurious retained intron. Recent study showed that the presence of intron 1 in the human RPS13 reduced its expression by a factor of four . Apparently, the gene RPS13 can regulate its protein level via a feedback mechanism. The ribosomal protein S3 was found to inhibit the excision of the intron 1 from the rpS13 pre-mRNA. This protein binds specifically the intron fragment near the 5′ and 3′ splice site, therefore conferring protection against cleavage by ribonucleases . Although the extraribosomal function of rpS13 is still unclear, a study suggested that this gene (along with rpL23) promotes multidrug resistance in gastric cancer cells by suppressing apoptosis . A similar pattern of proteins regulating their own splicing and generating aberrant mRNA was also found in Saccharomyces cerevisiae  and in C. elegans  which suggests that NMD plays a important role in gene regulation. In Figure 3, we have illustrated five other examples of retained introns, that are conserved in mouse; note that each of these cases is in a gene structure with a large number of introns.
In this work, we have annotated and analyzed alternative splicing coupled with nonsense-mediated decay in four mammalian species: human, mouse, rat and cow. We have shown that alternative splicing coupled with NMD occurs in 2 to 10% of the known annotated genes in each genome, and that there is a highly significant conservation of AS-NMD candidature in genes across the four mammals. Human AS-NMD candidates seem to have a large number of orthologs in mouse (~82%), and at least 15% of those orthologs are also AS-NMD candidates in another species. We think that, in the future, increased sampling of transcripts will lead to increased discovery of conservation of specific AS-NMD events between different organisms. It is possible that there are many such AS-NMD features in genes, but that we have not detected all of them because of insufficient sampling. Also, the AS-NMD events may be designed to be easily turned ‘on’ or ‘off’ (i.e., the ability to include the AS-NMD feature may be easily modified over evolutionary time).
Although we only analyzed AS-NMD candidates with in-frame PTCs in this study, we have demonstrated that intron retention plays an important role in producing AS-NMD candidates in all four species. A large fraction (58 to 81%, depending on the species) of conserved AS-NMD candidates are intron retentions. The retained introns also possess features that distinguish them from non-retained introns, such as smaller size, higher GC content and codon usage similar to exons. The latter indicates that retained introns tend to have phases of protein-coding ability before acquiring stop codons; it is thus difficult to de-couple this phenomenon from the additional role of such transcripts as NMD candidates.
The AS-NMD candidate genes also showed a similar pattern of Gene Ontology enrichment in all four species. Genes linked to nucleotide interaction and cell fate (apoptosis) were the most abundant to produce AS-NMD candidates. KEGG and Biocarta pathways also showed an over-representation of pathways linked with cancer and genetic disorders for such AS-NMD-making genes. Thus, NMD appears to have a significant role in controlling the expression of aberrant proteins in these pathways, and prevents their deleterious effects.
We gathered evidence of translation of at least 10% of all AS-NMD candidates in the human genome. Such cases, are not, however, especially conserved in comparison to other AS-NMD candidates. We cannot determine here whether such truncated proteins have a function in human cells; they may just be ‘junk’ polypeptides that are degraded at a later stage. However, because retained introns with premature stop codons have similar codon usage to exons, this strongly indicates that some NMD substrates have at least been protein-coding in the recent evolutionary past.
Furthermore, in one of the AS-NMD candidates, the gene RPS13, we found evidences of a conserved mechanism of autoregulation, between human and cow, in which the protein inhibits the excision of one of its introns, creating a splice form that is targeted for NMD.
Alternative splicing coupled with NMD seems to act on a wide range of genes, but, however, only in a few significantly over-represented functions across mammalian genomes. The data herein provide statistical evidence for conserved regulation via AS-NMD candidature across Mammalia, for a large subset of genes.
The genome sequences and annotations of four mammals (human, cow, mouse and rat) were downloaded from the Ensembl Web site (http://www.ensembl.org), in March 2009. The genome assemblies are: human=Homo_sapiens.NCBI36.52; cow=Bos_taurus.Btau_4.0.52; mouse=Mus_musculus.NCBIM37.52 and rat=Rattus_norvegicus.RGSC3.4.52. These genomes were chosen based on two criteria: (i) high coverage (>7X), and (ii) large number of cDNAs and ESTs sequences available (covering >85% of annotated genes).
The cDNA/EST/mRNA data were downloaded from the Refseq database (ftp://ftp.ncbi.nih.gov/refseq), the H-invitational database (http://h-invitational.jp/hinv/ahg-db), the Unigene compendium (ftp://ftp.ncbi.nih.gov/repository/Unigene; Build #217) and the dbEST data set (ftp://ftp.ncbi.nih.gov/repository/dbEST/), in March 2009. At the time of downloading, the dbEST database contained deposited data from next-generation sequencing in addition to EST sequencing data from older techniques.
We identified the reference mRNA (an mRNA that aligns with 100% coverage and identity to an annotated protein) of each human, mouse, rat and cow transcripts. A bl2seq alignment  was performed between mRNAs from Refseq, Unigene and H-invitational (human genome only) and each protein in the Ensembl database. Transcripts without a reference mRNA were discarded from our analysis.
The genomic loci were identified and extracted from the Ensembl annotations and expanded by 500 nucleotides 5′ and 3′ of the locus, to allow for variation in the transcriptional start and polyadenylation sites . Each reference mRNA was aligned with its genomic locus using GMAP .
To identify AS forms we aligned each genomic locus that has a reference mRNA, against all mRNAs and ESTs/cDNAs from the Refseq, Unigene, dbEST and H-invitational databases (this last database is for human only). Each alignment was parsed into a set of splice junction coordinates. Alignments where the coverage was <0.96 of the length of the mRNA or EST and <98% sequence identity were discarded. We also discarded ESTs that aligned to the opposite strand. In cases were a mRNA/cDNA/EST mapped to more than one gene, the best mapping was considered the correct one. If a mapping presented the same percentage of identity and coverage to more than one gene, the mapping was discarded.
We performed a new round of alignments to identify the coding region (CDS) of each mapped transcript. Using the program Exonerate , we aligned each transcript retrieved from the previous mapping, against the encoded protein extracted from the Ensembl annotation.
We compared each mapped splice junction with the reference mRNA splice junctions. If a splice form mapped >10 nt from the reference junction, it was annotated as an alternative splice form. We verify the presence of PTCs targeting the mRNA/EST for NMD, using the 55-nucleotide rule . Using the CDS annotation, we checked the mRNA/EST sequence for an in-frame stop codon 55 nucleotides or more 5′ to an exon-exon junction.
To verify whether genes targeted for NMD due to IR were caused by an unspliced or partially-spliced pre-mRNA, we analyzed several features of our sequences and compared them with their flanking exons, and with non-retained introns. Some of our analysis followed the analysis described by Galante, et al. (2004) .
We assess the codon usage and codon bias of retained introns, exons bordering retained introns and non-retained introns. We only used AS-NMD candidates with at least one full length cDNA containing the IR. The frame of non-retained introns and retained introns was defined by the frame of the exon 5′ to it.
We analyzed the general codon bias of all 61 codons in the three sets. We performed 100 pairwise comparisons of the distribution of 16,970 codons (number of codons in the retained intron set), along with 61 possible codons to calculate the average χ2 and standard deviations thereof.
DNA sequences from retained introns, exons bordering retained introns and exon–intron retention–exon (‘E-IR-E’) sequences were compared to the Pfam database using BLASTx program (e-value 10−2). Only genes with a full-length cDNA/EST and with a retained intron entirely in the CDS were used in this analysis (155 in total).
To assess the GC content of the retained introns, exons bordering retained introns and non-retained introns, we created three datasets with the same number of sequences divided into five classes according to their length. The non-retained introns were randomly picked from the set of genes containing the retained introns in order to avoid any kind of bias toward a specific type of gene. A χ2 test was used to evaluate the differences between each dataset.
The information about pairs of orthologs between human and the other species was extracted from the Biomart query system in the Ensembl database. We only considered two genes as being orthologs when they mapped in a one-to-one way.
Gene Ontology (GO ) functional categories, transcription factors TRANSFAC  and KEGG  and BioCarta (http://www.biocarta.com/genes/index.asp) terms, were retrieved from the FatiGo database . We performed a functional enrichment analysis of our list of AS-NMD candidates against the whole genome annotation of each species. Significant overrepresentation was calculated using a Fisher's exact test with P' <0.05, and a correction to P' for multiple hypothesis testing . Human AS-NMD candidates were cross-referenced with the NCBI OMIM database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim), to identify linkage to genetic disorders.
We mapped all human AS-NMD candidates to >600,000 peptides from mass spectrometry, extracted from the PRIDE database . These peptides are derived from a wide variety of human samples, as follows: blood plasma, peripheral blood mononuclear cells, erythrocytes, platelets, brain, HeLa cells, K-562 cells, HEK-293 cells, melanocytes, placenta, breast cancer cells, bone marrow cancer cells, JURKAT cells, liver, cerebrospinal fluid, heart and saliva. Each transcript was translated in the 3 frames and checked against the PRIDE peptides. We derived a Perl program to check the occurrence of peptides upstream of the PTC. We filtered out any matching peptide smaller than 7 amino acids, and also any peptides matching to another human gene or protein.
Gleevec pathway with genes with NMD candidate transcripts labelled.
(0.63 MB EPS)
Summary of numbers of ESTs/cDNAs mapping to genes.
(0.04 MB DOC)
Orthologs of human NMD candidates.
(0.47 MB DOC)
Summary of the different types of NMD candidate.
(0.03 MB DOC)
Top five domain types in NMD candidates in the four mammals studied.
(0.05 MB DOC)
Overrepresented KEGG pathways
(0.04 MB DOC)
Overrepresented Biocarta pathways
(0.03 MB DOC)
Overrepresented transcription factors
(0.07 MB DOC)
Classification of introns by size
(0.03 MB DOC)
This work was performed in part using the RQCHP Compute Canada computing cluster at the Université de Sherbrooke, Québec.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was funded by the National Science and Engineering Research Council of Canada, and by Les Fonds du Quebec de la Recherche and des Technologies. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.