The 5' splice junction prediction methods described in this work were designed to estimate trans-splicing sites for all input sequences using a simple and effective metric. Since pyrimidines play an important role in trans-splicing, including such a parameter in the inter-AG splice prediction model was straightforward and is justified by the resulting increase in sensitivity. Although rather effective, the inter-AG metric's principal drawback lies in its synthetic nature, as the underlying biological process is difficult to conceive. Polypyrimidine tract length was not assessed in this work because the inter-AG metric has been shown to be more powerful [23]. Although our splice junction prediction results are encouraging, some uncertainty remains when testing on unconfirmed sequences. This may be a consequence of the parasitic nature of trypanosomatids, which forces these protozoa to alternate between different life stages depending on their insect and mammalian hosts. An additional level of complexity may be required to improve in silico predictions, given that trans-splicing of certain transcripts is developmentally regulated in trypanosomes [36, 37].
When compared with previously published trans-splicing prediction rates [23], the models we propose here appear to be just as effective at predicting known trans-splicing sites when tested on the same search space (Table ). Their accuracy remains substantial even when the query sequence size is increased (a 1.75× larger search space yields 0.9× the accuracy). The search space was enlarged to ensure that the full inter-AG fragments upstream of putative splice sites are considered. Extending into the downstream coding sequence is justified by erroneous genome annotations; it is not uncommon for the furthest in-frame ATG to be selected as a start codon. In addition, our scoring function rates all inter-AG fragments, unlike the previously published method, which selects high-scoring fragments based on their dinucleotide composition [23]. As shown in Table , a scoring threshold can be applied to ensure that few false positives are identified as splice junctions, at the cost of slightly lower specificity. However, a threshold will necessarily exclude certain sequences, which may be undesirable when dealing with few or essential queries. Since our method is more dependent on correct annotations, it is conceivable that coupling it to linear discriminant analysis would generate even better predictions at the cost of higher complexity.
PSSMs have previously been shown to successfully predict poly(A) sites in humans [38]. Capturing the global nucleotide composition surrounding known poly(A) sites and using it as a comparative predictor has also proven to be a successful prediction procedure in Leishmania. Although the public EST data appear to be of questionable quality, stringent screening revealed specific traits of polyadenylated sequences. Given the nature of the sequence data, smaller mRNA transcripts may be favoured, and this should be considered when analyzing results. Nonetheless, PSSM scanning is more than 10 times more effective at identifying poly(A) sites than the distance-only approach when precision is fundamental (Figures and ). This result can be interpreted as evidence that distance is not as powerful for targeting poly(A) sites in Leishmania as in trypanosomes.
For Leishmania, precision may not be essential when predicting 3'UTR extremities given that several mappings display heterogeneous poly(A) positions [15]. This observation motivates the use of an error margin, which in this work amounts to lowering the resolution of sensitivity testing. Allowing correct predictions to fall within a certain range of the mapped position emulates the identification of a polyadenylation region. We also tested a window scanning approach, in which the cumulative bit-scores for a given range were averaged over the size of the window instead of considering each position independently. This approach yielded weaker overall predictions than the position-specific approach (data not shown), perhaps because the extent of polyadenylation regions varies among transcripts.
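The difference between position-specific scanning, window-averaged scanning, and evaluation with an error margin can be sketched as follows. This is an illustrative outline rather than the code used in this work: the function names, the pseudocount of 1, and the uniform background frequency of 0.25 are all assumptions.

```python
import math

BASES = "ACGT"

def build_pssm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PSSM from equal-length aligned sequences
    (assumed pseudocount and uniform background)."""
    pssm = []
    for i in range(len(aligned_sites[0])):
        column = [site[i] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pssm

def scan(seq, pssm):
    """Bit-score of every window of the sequence (position-specific approach)."""
    width = len(pssm)
    return [
        sum(pssm[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]

def window_average(scores, size):
    """Average the bit-scores over a sliding window of positions."""
    return [sum(scores[i:i + size]) / size for i in range(len(scores) - size + 1)]

def best_position(scores):
    """Index of the highest score (first one in case of ties)."""
    return max(range(len(scores)), key=scores.__getitem__)

def within_margin(predicted, mapped, margin):
    """A prediction counts as correct if it lies within the error margin."""
    return abs(predicted - mapped) <= margin
```

For example, `within_margin(best_position(scan(seq, pssm)), mapped, 10)` would count a prediction as correct when it falls within 10 positions of the mapped site, whereas a margin of 0 demands exact agreement.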
The best 3'UTR predictions come from combining distance limitation with scanning using dual PSSMs. Combining both metrics proved more effective than either one individually (Figures , , and ), a result that hints at the importance of each factor when predicting poly(A) sites in Leishmania. To restrict PSSM scanning, we tested various distances instead of using a specific confidence interval, since spacer sequences display a bias towards longer fragments. Although the data are partially derived from estimations, such a shift in the distribution supports the notion that polyadenylation does not occur randomly in Leishmania. Poly(A) sites further away from the splice junction may result from distant polypyrimidine tracts, a situation that has already been observed in trypanosomes [20]. One must also consider that the longer non-coding regions in Leishmania may contain non-annotated genes or provide alternative stage-specific polyadenylation sites, which could explain the longer spacer sequences. These considerations motivated the exclusion of intergenic sequences longer than 5000 nucleotides from sensitivity testing.
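A rough sketch of how a distance limit might be combined with two PSSM score tracks is given below. It rests on several assumptions that are not taken from this work: both tracks are taken to be precomputed and indexed by candidate poly(A) position, the two scores are simply summed, and the function and parameter names are hypothetical.

```python
def combined_prediction(scores_a, scores_b, splice_pos, max_distance):
    """Return the candidate position with the highest summed score among
    positions lying at most max_distance upstream of the splice junction,
    or None if no position qualifies.

    scores_a, scores_b: per-position scores from two PSSMs (assumed aligned).
    splice_pos: index of the downstream splice junction.
    """
    lower = max(0, splice_pos - max_distance)
    upper = min(splice_pos, len(scores_a), len(scores_b))
    candidates = range(lower, upper)
    if not candidates:
        return None
    return max(candidates, key=lambda i: scores_a[i] + scores_b[i])
```

Testing several values of `max_distance`, rather than fixing a single interval, corresponds to the strategy described above for spacer distributions that are skewed towards longer fragments.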
To our knowledge, no other method can predict poly(A) sites in Leishmania spp. as effectively as the one described in this work. Even enforcing a highly selective threshold only slightly affects this method's specificity (Table ). The rather unusual and non-specific nature of kinetoplastid polyadenylation may explain the low computational prediction rates. Although over-represented A-rich hexamer motifs are found (Additional File 2), they are not present at all genomic poly(A) sites, which suggests that they may not play a central role in driving polyadenylation in Leishmania. In addition, the genomic alignment of polyadenylated EST mappings cannot be used to define a precise consensus sequence, as it is impossible to distinguish the exact cleavage site among multiple consecutive adenosines on the unprocessed transcript. The heterogeneity of poly(A) sites in Leishmania mRNA transcripts is a further incentive for using PSSMs that embody a global trend in nucleotide composition. Furthermore, the neglect of secondary structure and stage-specificity makes it difficult to envisage higher prediction accuracies at this point.
Notwithstanding the possibility that no consensus motif drives polyadenylation in kinetoplastids, there is evidence for a biological model based on sequence context. The low sensitivity obtained from a poly(A) prediction algorithm based on spacing metrics alone is evidence for a more dynamic biological model. Also, the correlation between certain regions of the genomic alignment and their respective prediction rates is most interesting, as best illustrated by the sensitivity surface plots (Figure ). The data are presented in order to assess the innate characteristics that have an impact on poly(A) targeting.
Two main common sequence features appear to directly influence the prediction sensitivities: first, the adenosine-rich region close to the mapped poly(A) site; second, the pyrimidine-rich region 300 to 600 positions downstream. The latter, which represents the polypyrimidine tracts known to be crucial for trans-splicing, generates the best predictions when the accuracy requirement is loosened and the search space is bounded. In turn, the A-rich region is responsible for the best predictions when precision is fundamental. Interestingly, the abundance of pyrimidines (most notably thymines) in the -50 to -25 region (Figure ) may play a role in 3'UTR cleavage, since its exclusion from scanning matrices reduces the sensitivity at close range (Figure ). The matrix encoding the sequence information of zero upstream bases and 25 downstream (0A25) performs poorly at predicting poly(A) sites, a rather surprising observation given that the adenosine concentration is comparable. Upon closer inspection, it is apparent that adenosine-rich regions are not a fundamental marker, because many sequences do not contain abundant adenosine residues at their poly(A) site.
PSSMs can be regarded as a simplistic representation of the interaction between an enzymatic complex and a strand of nucleic acid. The highest-scoring position corresponds to the region most similar to the consensus of all poly(A) sites, which can be interpreted as a high-affinity region for the polyadenylation complex. From this perspective and based on our results, it is tempting to contemplate a generic biological model in which adenosine richness (possibly contrasted by a pyrimidine-rich upstream region) helps to direct the polyadenylation of specific positions downstream of polypyrimidine tracts in unprocessed mRNA transcripts. Deletion studies directed at these features, followed by mapping of the modified transcript's poly(A) site, could shed additional light on the biological process. Moreover, in vitro UV cross-linking could help identify novel ribonucleoproteins (RNPs) that might interact with the trans-splicing/polyadenylation complexes.
The computational tools we describe in this work have been implemented in a small Java program named PRED-A-TERM (PREDicting poly(A) sites and TERMinal splice junctions) that can be downloaded from Additional File 4. It outputs poly(A) and trans-splicing predictions from intergenic sequence input with partial coding sequence overlap and allows end-users to select various prediction parameters. The program is tuned for L. infantum but is suitable for other Leishmania species. Although trypanosomes have shorter average intergenic regions than Leishmania, both share similar trans-splicing machinery [39, 40]. Scanning Trypanosoma intergenic regions (IRs) will, however, require additional sequence analysis and subsequent tuning of the model.