In this genome-wide analysis, we showed that alternative polyadenylation at intronic sites can generate a large number of novel transcript variants. We preferentially selected intronic single-block ESTs for analysis because these ESTs were not well considered in previous studies [33], including Lee's research [32]. Our work therefore complements a previous study [17]. Single-block ESTs within intergenic regions were not included in our analysis, although some of them represent gene extensions [61]. Single-block ESTs are often suspected of being genomic DNA contamination. However, we showed that about 84% of the EST clusters were supported by at least one line of evidence: a hit from a full-length cDNA, multiple-block 5'-end ESTs, overlap with transcribed sites from Affymetrix tiling arrays, or multiple supporting ESTs. With careful screening, single-block ESTs can thus serve as a valuable resource for discovering novel transcripts. Besides focusing on single-block ESTs, the pipeline in our analysis was designed to improve poly(A) site detection; both contribute to the discovery of novel intronic 3'-end exons. During our analysis, we found that more than 90% of the EST entries in our results were created before polyA_DB2 was released. This implies that most of the novel transcript variants were obtained through the improvement of our detection methods and the consideration of single-block ESTs, and not merely through the growth of the transcript databases.
Although different methods have been used for poly(A) site prediction [10], current methods achieve only moderate sensitivity and specificity. For example, about 47% of known poly(A) sequences in the polyA_DB database were not predicted by the Support Vector Machine method (polya_svm) [10]. Among our predicted 3'-end exon sites, fewer than thirty could be predicted by polya_svm (threshold = 0.5, using the genomic region containing the poly(A) cluster region ± 300 nucleotides for prediction). However, 68% of the 17,201 ESTs, which correspond to about 63% of the 10,844 3'-end exons (Additional file 1), have at least one of thirteen known PAS hexamers. This low detection rate by polya_svm likely results from the heterogeneity of intronic poly(A) sites compared with the conventional 3'-most poly(A) sites.
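
The PAS hexamer check mentioned above can be illustrated with a minimal sketch. The function name `find_pas`, the 40 nt upstream search window, and the default hexamer set are assumptions made for illustration; only the two most common signals (AATAAA and ATTAAA) are listed by default, and the full set of thirteen hexamers used in the analysis would have to be supplied by the caller.

```python
# Hypothetical helper, not the pipeline used in this study.
# Default set holds only the two most common PAS hexamers (DNA alphabet);
# pass the complete hexamer list via `hexamers` to reproduce a fuller scan.
CANONICAL_PAS = ("AATAAA", "ATTAAA")


def find_pas(seq, cleavage_pos, hexamers=CANONICAL_PAS, window=40):
    """Return (hexamer, start) pairs found upstream of a cleavage site.

    seq          -- sequence on the transcript strand (DNA or RNA letters)
    cleavage_pos -- 0-based index of the first nucleotide after the cleavage site
    window       -- assumed upstream search distance in nucleotides
    """
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    start = max(0, cleavage_pos - window)
    region = seq[start:cleavage_pos]
    hits = []
    # Slide a 6 nt frame across the upstream region.
    for i in range(len(region) - 5):
        hexamer = region[i:i + 6]
        if hexamer in hexamers:
            hits.append((hexamer, start + i))
    return hits
```

An EST would count as PAS-containing in this sketch if `find_pas` returns a non-empty list for the region upstream of its poly(A) tail.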
It is worth noting that Muro et al. identified the 3'-ends of genes with a method different from ours, based on EST frequency histograms along the genome. They showed that 22-52% of sequences in commonly used human and murine "full-length" transcript databases may not currently end at bona fide polyadenylation sites. Since the average length of the 3'-end exons of all current human RefSeqs is about 820 nucleotides, these exons would become longer according to Muro et al.'s results. As the comparison in the text has shown, Muro et al.'s method and ours each have their own advantages and complement each other. Both methods will contribute to the identification of full-length transcripts.
The novel 3'-end exons we detected can be classified as the "hidden exons" and "composite exons" described previously [19]. However, some apparent "hidden exons" could actually be "composite", because ESTs represent only partial cDNA sequences and may extend to overlap with known exons.
Not all intronic poly(A) sites correspond to actual novel transcript variants. For example, internal priming at a consecutive string of 'A's within an mRNA produces false positives. In cDNA library construction, oligo-dT is often used as the primer for first-strand cDNA synthesis. This primer can anneal to an internal A-rich site, producing truncated sequences. Internal priming accounts for about 12% of the total 3' ESTs in the database [63]. In previous studies such as Tian's [11], the genomic DNA sequence around the predicted poly(A) site was checked: if there were more than 6 consecutive 'A's, or at least 7 'A's in a 10 nt window, the site was suspected to be an internal priming site. However, when this criterion was applied to the genomic sequence adjacent to the 3'-ends of human RefSeq mRNAs, 19.4% (6,147/31,642) of the mRNAs were found to have such an A-rich tract at their 3'-ends. Using this criterion would therefore cause many true positive sites to be missed. In our analysis, we tried to reduce internal priming sites by eliminating all ESTs that could be aligned well with known RefSeq mRNAs (see Methods).
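
For reference, the A-richness criterion from the previous studies (more than 6 consecutive 'A's, or at least 7 'A's in a 10 nt window) can be written as a short sketch. The function name `looks_internally_primed` and its parameter names are hypothetical, and the choice of which genomic flank to pass in is left to the caller. This is the criterion we chose not to rely on, shown only to make the rule concrete.

```python
def looks_internally_primed(genomic_flank, min_run=7, window=10, min_in_window=7):
    """Flag a genomic flank as a possible internal priming site.

    genomic_flank -- genomic sequence around the predicted poly(A) site
                     (which region to extract is the caller's assumption)
    """
    s = genomic_flank.upper()
    # Rule 1: more than 6 consecutive A's, i.e. a run of at least 7.
    if "A" * min_run in s:
        return True
    # Rule 2: at least 7 A's within any 10 nt window.
    # For sequences shorter than the window, the whole sequence is one window.
    for i in range(max(1, len(s) - window + 1)):
        if s[i:i + window].count("A") >= min_in_window:
            return True
    return False
```

Under the default parameters the first rule is implied by the second, but both are kept so that the code mirrors the wording of the criterion.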
To find as many novel transcript variants as possible, we did not require an accurate signature of the exon junction and cleavage site. This differs from previous reports [17]. The 3'-end exon site provides the approximate locus of the "composite exons" or "hidden exons" of novel isoforms. The ESTs supporting a 3'-end exon site further provide suitable sites for designing downstream primers to amplify the full coding region of the corresponding novel isoforms. We performed RT-PCR to validate some candidates of interest, with a success rate of about 38% (10/26, see Results). Sequence analysis revealed that they were derived from processed mature mRNA and not from unspliced precursors.
In our analysis, although most of the sites are supported by at least two types of evidence, there are still 1,468 sites that contain only one EST sequence and have no other support. Some of these sites may truly represent novel transcript variants with low expression levels. For example, the sites DB550185 (ExonSiteNo: 8501), DB347581 (ExonSiteNo: 8549), DB536313 (ExonSiteNo: 8628), DB517750 (ExonSiteNo: 9840), and DB512524 (ExonSiteNo: 10422) each contain only one EST sequence, but that EST is from a full-length cDNA clone (Additional file 1).
One type of RNA polyadenylation controls RNA degradation in the nucleus [64]. The exosome plays a key role in the surveillance of nuclear mRNA synthesis and maturation. Poly(A) tails that guide RNA to degradation by the exosome are usually shorter than those that increase mRNA stability, and they are not made strictly of 'A's. Such sites were not actively eliminated in our analysis, but they are unlikely to greatly affect the results because they would not be detected under our stringent criteria. In addition, sequence analysis of the poly(A/T)-tailed ESTs revealed that a PAS exists in most of our ESTs. This result, combined with the other evidence, suggests that our predicted poly(A) sites represent bona fide mRNAs, and neither unspliced precursor mRNAs nor degradation products.
Another type of RNA quality control is nonsense-mediated mRNA decay (NMD), which selectively degrades mRNAs that contain a premature translation termination codon (PTC, also called a "nonsense codon") [67]. Although NMD mainly acts as quality control to eliminate faulty transcripts, it is also involved in physiological and pathological functions [68]. Usually, NMD occurs when translation terminates more than 50-55 nucleotides upstream of an exon-exon junction, in which case components of the termination complex are thought to interact with the exon-junction complex (EJC) to elicit NMD [67]. Although 45% of alternatively spliced mRNAs are predicted to be NMD targets [68], an mRNA is immune to NMD if translation terminates less than 50-55 nucleotides upstream of the 3'-most exon-exon junction, or downstream of that junction. This means that an mRNA whose natural stop codon lies in the 3'-end exon is not subject to NMD. The transcripts predicted in our study use alternative 3'-UTRs, assuming that the upstream exons do not change. Because we have not obtained the full-length form of each transcript, we cannot estimate the proportion of our results that would be affected by NMD. However, it has been reported that alternative polyadenylation may be an NMD-rescue regulatory mechanism for PTC-containing mRNAs [70]. Our data seem to be consistent with this view. Indeed, all the novel transcripts confirmed by RT-PCR in our study carry the natural stop codon in the last exon. Further analysis revealed that in nearly all the 3'-end ESTs, except some very short ones, stop codons exist in all three reading frames (data not shown). So if there were no in-frame stop codon in the 5' exons, a stop codon in the 3'-end exon would be used. This differs from middle exons, which may not contain in-frame stop codons and therefore cannot easily be used to clone transcripts with complete coding regions.
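
The 50-55 nucleotide rule described above can be made concrete with a small sketch, assuming that a full-length spliced transcript and its stop codon position are available (which is not the case for most of our predicted transcripts). The function name `is_predicted_nmd_target`, the coordinate convention, and the default threshold of 50 nt are illustrative assumptions; the text only gives the range 50-55.

```python
def is_predicted_nmd_target(stop_codon_end, exon_lengths, threshold=50):
    """Apply the 50-55 nt rule on spliced-mRNA coordinates.

    stop_codon_end -- 0-based index just past the stop codon in the spliced mRNA
    exon_lengths   -- exon lengths in 5'->3' order
    threshold      -- assumed cutoff in nucleotides (the text gives 50-55)
    """
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule cannot apply
    # Position of the 3'-most exon-exon junction in the spliced mRNA.
    last_junction = sum(exon_lengths[:-1])
    # NMD is predicted only if translation ends far enough upstream of it;
    # a stop codon inside the last exon gives a negative distance.
    return last_junction - stop_codon_end > threshold
```

For example, with exons of 200, 100, and 300 nt, a stop codon ending at position 240 lies 60 nt upstream of the last junction and is flagged, whereas one ending at position 450 lies in the 3'-end exon and is not.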
It should be noted that a large number of non-coding RNAs (ncRNAs) are expressed from the mammalian genome [71]. These ncRNAs include miRNAs, snoRNAs, snRNAs, piRNAs, and others, which are involved in controlling various levels of gene expression in physiology and development. Non-coding RNAs can be derived from antisense or sense transcripts with overlapping or interlacing exons, or from retained introns. To investigate whether the intronic transcripts in our data actually represent known ncRNAs, we compared the chromosomal alignment positions of the 3'-end exon sites in our study with those of human ncRNAs from NONCODE v2.0 [72]. Among the 35,2434 human ncRNA entries collected in NONCODE v2.0, fewer than one hundred of our 3'-end exon sites overlapped an entry (data not shown). It therefore seems that most of our 3'-end exons do not represent known ncRNAs. However, we found that many poly(A) sites were located in introns upstream of the coding exons. If these are real, the potential novel transcripts would be composed of the 5'-UTR of the original mRNA. Whether these transcripts encode small ORFs or regulatory small RNAs remains to be studied.
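
A positional comparison of this kind can be sketched as a simple interval-overlap test. The function name `overlapping_sites`, the tuple layout, and the 0-based half-open coordinate convention are assumptions for illustration; strand is ignored here, and the actual comparison may have used different tools or criteria.

```python
from collections import defaultdict


def overlapping_sites(sites, ncrnas):
    """Return names of sites that overlap at least one ncRNA interval.

    Both inputs are iterables of (name, chrom, start, end) tuples with
    0-based, half-open coordinates (an assumed convention).
    """
    # Group ncRNA intervals by chromosome to avoid cross-chromosome checks.
    by_chrom = defaultdict(list)
    for _, chrom, start, end in ncrnas:
        by_chrom[chrom].append((start, end))

    hits = []
    for name, chrom, start, end in sites:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < n_end and n_start < end
               for n_start, n_end in by_chrom.get(chrom, ())):
            hits.append(name)
    return hits
```

This linear scan per site is adequate for a sketch; with hundreds of thousands of ncRNA entries, a sorted sweep or an interval tree would be the more practical choice.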