|Home | About | Journals | Submit | Contact Us | Français|
Circular RNAs (circRNAs) are broadly identified from precursor mRNA (pre-mRNA) back-splicing across various species. Recent studies have suggested a cell-/tissue- specific manner of circRNA expression. However, the distinct expression pattern of circRNAs among species and its underlying mechanism still remain to be explored. Here, we systematically compared circRNA expression from human and mouse, and found that only a small portion of human circRNAs could be determined in parallel mouse samples. The conserved circRNA expression between human and mouse is correlated with the existence of orientation-opposite complementary sequences in introns that flank back-spliced exons in both species, but not the circRNA sequences themselves. Quantification of RNA pairing capacity of orientation-opposite complementary sequences across circRNA-flanking introns by Complementary Sequence Index (CSI) identifies that among all types of complementary sequences, SINEs, especially Alu elements in human, contribute the most for circRNA formation and that their diverse distribution across species leads to the increased complexity of circRNA expression during species evolution. Together, our integrated and comparative reference catalog of circRNAs in different species reveals a species-specific pattern of circRNA expression and suggests a previously under-appreciated impact of fast-evolved SINEs on the regulation of (circRNA) gene expression.
Circular RNAs (circRNAs) formed by back-spliced (circularized) exons were sparsely identified over 20 y ago,1,2 and have been recently re-discovered genomewide in thousands of gene loci by taking advantage of deep sequencing of non-polyadenylated transcriptomes and specific bioinformatic pipelines that identify reads anchoring back-splicing junctions in a reversed order (for reviews, see refs3-5). Although inefficiently catalyzed by the spliceosome and generally expressed at low levels,6-8 circRNA formation can be facilitated by both RNA pairing of orientation-opposite complementary sequences across flanking introns and protein factors that are capable of binding flanking introns to bridge unfavorable back-splice sites to a close proximity presumably.6,9-12
Expression profiling showed diverse expression patterns of circRNAs among cell types/tissues13-17 with a significant enrichment of circRNAs in neurons/brains.8,18-20 Interestingly, a single gene locus can produce multiple circRNAs, referred to as alternative circularization.10 Our recent study further suggests that both alternative back-splice site selection and alternative splicing site selection within circRNAs are involved in alternative circularization and contribute to circRNA complexity.17 The competition of putative RNA pairs across introns that bracket different sets of alternative back-splice sites leads to diverse back-splice site selection17; whereas the regulation of alternative splicing within circRNAs is largely unknown.
CircRNAs were also found to be expressed across different species.12-16,19 Comparison of circRNAs in several human and mouse data sets revealed that only a small portion of mouse circRNAs were orthologous to those in human,13 suggesting a species-specific manner of circRNA expression. However, a detailed view of this species-specificity of circRNA expression and its underlying mechanism remained largely uncharacterized. Since circRNA expression is associated with orientation-opposite complementary sequences across their flanking introns and that most of these complementary sequences are abundant primate-specific Alu elements in human, we speculate that the evolvement of complementary sequences and their variable distribution in different species may largely contribute to the species-specific expression of circRNAs.
We hereby systematically compared circRNA expression patterns between human and mouse, and found that only a small portion of human circRNAs could be determined in parallel mouse samples and that the majority of circRNAs are species-specifically expressed. Further analysis revealed that the conserved circRNA expression between human and mouse is correlated with the existence of orientation-opposite complementary sequences in introns that flank back-spliced exons in both species, but not the circRNA sequences themselves. Among these complementary sequences, SINEs (short interspersed nuclear repetitive DNA elements), especially Alu elements in human, contribute the most for boosting circRNA formation. The diverse distribution of SINEs across species may lead to the increased complexity of circRNA expression during species evolution.
Multiple human and mouse RNA-seq data sets (Table S1) of ribosomal-depleted (Ribo–), non-polyadenylated (poly(A)–) or RNase R-treated poly(A)– (poly(A)–/RNase R) RNAs from a spectrum of tissues were analyzed using CIRCexplorer2 pipeline17 with the aligner of TopHat2/TopHat-Fusion (version 2.0.9, Materials and Methods). Approximately 100,000 and 36,000 circRNAs were identified in examined human or mouse samples, respectively (Tables S2 and S3). In addition, the majority (> 90%) of these identified circRNAs could be accordingly detected in human or mouse brain/neuron tissues (Fig. 1A and S1), consistent with the previous reports that circRNAs were highly enriched in brains/neurons8,18-20.
In addition to the tissue-specificity, comparison of circRNA expression in human and mouse samples suggested that the total number of expressed circRNAs is much higher in human than that in mouse samples (Fig. 1A and S1). Strikingly, much more circRNAs could be identified in human than in parallel mouse samples after normalized by sequencing depth (Fig. 1B), further indicating the enrichment of expressed circRNAs in human. In addition, the expression levels of the top 100, 500 and 1000 circRNAs in human are also much higher than those in mouse samples (Fig. 1C and S2). Of note, the expression levels of their cognate mRNAs are comparable between human and mouse (Fig. 1D and S2). Together, these results indicate that circRNAs are more prevalently and highly expressed in human than in mouse (Table S2 and S3).
Genomic feature analysis suggested that most circRNAs contain multiple exons, commonly two to three exons, in both human and mouse (Fig. S3A), and their flanking introns are much longer than randomly selected ones (Fig. S3B), as previously reported.10
We next applied LiftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) to identify conserved circRNAs between human and mouse (Materials and Methods). For each human circRNA, on the one hand, if there were an expressed circRNA identified in the mouse orthologous locus, this circRNA could be suggested as a conserved circRNA between human and mouse; on the other hand, if no mouse circRNA ortholog were found, this human circRNA could be suggested as a human-specific one. The same LiftOver analysis was also performed with all identified mouse circRNAs.
Using this method, about 15,000 of circRNAs could be identified in both human and mouse, representing about 15% or 40% of total circRNAs in human or mouse, respectively (Fig. 2A), while the majority (about 85% in human and 60% in mouse) of circRNAs could be only found in one of the two species, indicative of their species-specific expression. Further analysis revealed that the expression level of conserved circRNAs is higher than that of species-specific circRNAs in all examined human and mouse circRNAs (Fig. 2B and S4A). Similar results were obtained by using highly-expressed circRNAs for analysis (Fig. 2C and S4B).
We then asked what factor(s) determines a circRNA to be conserved or species-specific. Genomic feature analysis suggested that the sequence conservation of conserved circRNAs is only slightly higher (PhastCons score: 0.95 vs 0.93) than that of species-specific circRNAs (Fig. S4C). Interestingly, however, flanking introns of conserved circRNAs are much longer than those of species-specific ones (Fig. S4D). It is well known that intronic base pairing across circRNA-flanking introns facilitates circRNA formation,9,10 and therefore it was appealing to speculate that the species-specific expression of circRNAs might be correlated with species-specific base pairing across circRNA-flanking introns.
Our analyses showed this speculation is largely true. When compared in human genome context, intronic complementary sequences bracketing either conserve or human-specific circRNAs exhibited no difference (Fig. 3A, left panel); however, a striking difference was observed after LiftOver human circRNAs to mouse genome: about 75% of conserved circRNA-flanking introns contain orientation-opposite complementary sequences in the mouse genome, while only 34% of human-specific circRNA-flanking introns contain orientation-opposite complementary sequences in the mouse orthologous loci (Fig. 3A, right panel). Similar observation was also found with the analysis of mouse circRNAs (Fig. 3B). This global analysis thus supports the view that conserved circRNAs preferentially contain orientation-opposite complementary sequences across their flanking introns in both human and mouse orthologous loci.
To compare the effect of complementary sequences on circRNA formation, we developed Complementary Sequence Index (CSI) to quantitate RNA pairing capacity of orientation-opposite complementary sequences across circRNA-flanking introns (Fig. 4A, and Materials and Methods). In this evaluation, we considered many factors that may affect the RNA pairing formation, including sequence pairing strengths (Blast Score), distances (Symmetry length) and competition with other complementary sequences. A maximum CSI was selected to represent the strongest RNA pairing potential for each given flanking intron set (Materials and Methods). By plotting CSIs of introns flanking randomly-selected 500 highly-expressed circRNAs (with RPM ≥ 0.1 in Ribo– and poly(A)– samples or RPM ≥ 0.2 in poly(A)–/RNase R samples) and those of 500 control flanking introns with (lengths ≥ 8,000 nt, which is the median length of flanking introns of circRNA-producing genes in human) or without intron length limitation, the method of CSI achieved an AUC (area under the curve) as 71.4% or 86.4% (Fig. 4B and S5A), respectively. Here, the AUC of the receiver operating characteristic (ROC) curve of CSI was used to evaluate the specificity and sensitivity of this index. Similar results were obtained by plotting CSIs of 1,000 or 5,000 highly-expressed circRNAs and those of 1,000 or 5,000 control flanking introns with lengths ≥ 8,000 nt (Fig. S5B)
A large number of different types of repetitive elements could potentially form complementary sequences when reversely embedded across flanking introns to facilitate circRNA formation (Fig. S5C). Remarkably, calculation of CSI revealed that about 93.3% of circRNA-flanking introns in human exhibited the strongest RNA pairing capacity from inverted repeated Alu sequences (IRAlus) and only a very small portion from other non-Alu repetitive sequences, such as LINEs (long interspersed nuclear repetitive DNA elements), and other non-repetitive but complementary sequences (Fig. 4C). Furthermore, these IRAlus showed much higher CSIs than those by other non-Alu repetitive sequences (Fig. 4D). In the mouse context, about 77.4% of circRNA-flanking introns showed the strongest RNA pairing capacity from SINE elements, mostly B1, B2 and B4 (Fig. 4E). Similarly, inverted repeated SINE sequences (IRSINEs) showed higher CSIs than those by other non-SINE repetitive sequences in mouse (Fig. 4F). Importantly, we found that CSIs of IRAlus in human (the median CSI is 3.6, Fig. 4D) are dramatically higher than those of IRSINEs in mouse (the median CSI is 0.8, Fig. 4F). This observation suggests that the higher pairing capacity of IRAlus in human than that of IRSINEs in mouse may lead to the elevated circRNA expression in human. It was notable that the non-repetitive but complementary sequences showed even higher CSIs in both human and mouse, but they are only sparsely present (37 in human and 25 in mouse, Fig. 4D and 4F), indicating that they play at best limited role in circRNA formation.
To further examine circRNA expression along species evolution, we extended the comparison of expressed circRNAs in model organisms from human and mouse to fruitfly and worm. As shown in Fig. 5A, the total number of circRNAs was significantly increased from worm and fruitfly to mouse and human with currently available samples (Table S1). In addition, much more circRNAs could be identified in human than in mouse, fruitfly and worm samples after normalized by sequencing depths (Fig. 5B). Similar result was observed from 3,263 orthologous genes (Fig. 5C, left panel) and 1,345 orthologous genes with their linear mRNA expression at RPKM ≥ 1 in ESCs from all four species (Fig. 5C, right panel).
Since SINE elements, especially Alus in human, contribute the most to circRNA formation among all examined complementary sequences in human and mouse (Fig. 4), we suspect that the increasing SINEs (Alus in human) during species evolution (Fig. 6A) could result in the increased complexity of circRNAs in metazoan. In addition to the dramatic increase of absolute numbers of SINEs in primates, the RNA pairing capacity, which is mainly reflected by SINEs (Alus in human) and represented by CSI, is also significantly increased along with species evolution (Fig. 6B). Taken together, these results suggest that the species-specific distribution of SINE elements and their distinct pairing capacity may play an important role in increasing circRNA complexity during species evolution.
Interestingly, we found that alternative circularization10 is generally more prevalent in examined human orthologs than in other examined species in our analysis (Fig. S6A). For example, 16 circRNAs, mainly produced by alternative back-splicing selection,17 can be found in the human RERE (Arginine-Glutamic Acid Dipeptide Repeats) locus in the human embryonic stem cell H9 line (Fig. S6B); whereas only 9 in mouse, 2 in fruitfly, and none in worm stem cells could be determined in the relevant RERE locus with examined stem cell data sets (Fig. S6B).
In addition to canonical splicing that sequentially joins exons to produce linear RNAs with the 5′ to 3′ polarity, split exons in eukaryotes can also be linked by back-splicing, by which downstream 5′ splice (donor) sites are connected with upstream 3′ splice (acceptor) sites in a reversed orientation for circRNA formation (for reviews, see refs.3,5). Recently, a large number of circRNAs have been predicted in a variety of cell lines/tissues and across various species.12-20 In addition, several studies have shown that circRNAs are tissue-specifically expressed, with an enrichment in neuron/brain tissues.18-20 The neuron-enriched circRNA expression is possibly resulted from multiple layers of regulation, including transcription rate, RNA turnover and rate of cell division.8 An early study has suggested a species-specific expression of circRNAs by showing that about 20% of mouse circRNAs are orthologous to those in human.13 In this study, we further confirmed the species-specific expression of circRNAs (Figs. 1, ,22 and and5)5) and identified that SINEs, especially Alu elements in human genome, significantly contribute circRNA formation in mammals (Figs. 3, ,44 and and5).5). Importantly, we also suggest that the diverse distribution of SINEs across species may lead to the increased complexity of circRNA expression during species evolution (Fig. 5).
Back-splicing can be thought as one specific type of alternative splicing. In this scenario, it is not a surprise to identify the species-specific expression of circRNAs, since alternative splicing is frequently species-specific.21,22 The species-specific expression of circRNAs is correlated with species-specific base pairing across circRNA-flanking introns. With a quantitative score by CSI, we showed that the primate-specific Alu elements generally contribute to circRNA formation (Fig. 4C and and4D).4D). Correspondingly, IRSINEs play the most important role in mouse circRNA formation (Fig. 4E and 4F), although much weaker than IRAlus. Extended analysis suggested that the complexity of circRNA expression is significantly increased along species evolution, which is highly correlated with the accumulated SINE elements (Alus in human) and their enhanced pairing capacity from worm and fruitfly to mouse and human (Figs. 5 and and6).6). Thus, the lack of strong complementary sequences, such as IRAlus and IRSINEs, in lower species may lead to the observation that flanking intronic complementarity may not be a critical feature for circRNA formation in fruitfly.19
Apparently, circRNA expression among species is also regulated by other factors. For example, binding of RBPs to circRNA-flanking introns could affect back-splicing;6,11 so it is possible that differentially expression of RBPs among species may lead to species-specific circRNA expression. Furthermore, extended flanking intron lengths, but not complementary sequences, may function as a major mechanistic determinant for circRNA formation in fruitfly19; thus different intron lengths across species may affect circRNA formation in distinct species. Moreover, circRNA formation is influenced by transcription speed of circRNA-producing genes and fast-transcribed genes tend to produce more circRNAs.8 Finally, although inefficiently back-spliced,6-8 circRNAs are resistant to exonucleolytic degradation due to their covalently closed circle structure,8 which allows their accumulation to relatively high levels over time.8,19,23 In this case, the elevated circRNA expression in human could be also in part explained by the fact that human tissues are significantly older than samples from other species. Taken together, RNA complementarity determined by CSI may play a limited role in the evaluation of circRNA expression among different species. Although failed to draw a positive correlation between CSIs and circRNA expression, we observed that CSIs of highly-expressed circRNAs are much higher than those of lowly-expressed ones in both human and mouse (Fig. S6C).
Collectively, the complexity of circRNA expression is remarkably increased along with accumulated SINE elements (Alus in human) during species evolution, suggesting a regulatory role of fast-evolved SINEs (especially Alus in human) in (circRNA) gene expression. However, it is also possible that some of bioinfomatically-identified circRNAs could be possibly generated as splicing artifacts in gene loci with complementary SINEs without any evolutionary advantage.
A number of RNA-seq data sets, including samples from multiple tissues and cell lines across different species (Table S1), were applied to retrieve back-splicing junction reads for circRNA prediction using CIRCexplorer2 pipeline.17 Briefly, RNA-seq reads from each sample were mapped by TopHat2 (2.0.9; parameters: -a 6 -g 1 –microexon-search -m 2) against GRCh37/hg19 human reference genome, GRCm38/mm10 mouse reference genome, BDGP R5/dm3 fruitfly reference genome and WS220/ce10 worm reference genome with known gene annotations (human: knownGene.txt updated at 2013/06/30, mouse: knownGene.txt updated at 2015/06/01, fruitfly: refFlat.txt updated at 2015/11/22, worm: refFlat.txt updated at 2013/03/18), respectively. The unmapped reads were then mapped to the relevant reference genome using TopHat-Fusion, and back-splicing junction reads were further retrieved to determine the back-splicing sites according to known gene annotations (human: knownGene.txt updated at 2013/06/30, refFlat.txt updated at 2013/10/13 and ensGene.txt updated at 2014/04/06; mouse: knownGene.txt updated at 2015/06/01, refFlat.txt updated at 2015/07/29 and ensGene.txt updated at 2014/04/06; fruitfly: refFlat.txt updated at 2015/11/22, ensGene.txt updated at 2014/04/06; worm: refFlat.txt updated at 2013/03/18, ensGene.txt updated at 2012/01/05). Finally, RPM (Reads Per Million mapped reads) was calculated to quantitate circRNA expression as previously reported.10 Highly-expressed circRNAs were defined by RPM ≥ 0.1 in poly(A)– or Ribo– samples or RPM ≥ 0.2 in poly(A)–/RNase R samples from at least one sample for each species.
Sequences of circRNA-flanking introns were extracted from known gene annotations, described as above. The length distribution of all or highly-expressed circRNA-flanking introns was individually analyzed. Control introns were randomly selected from 5,000 intron pairs with known gene annotations in human or mouse.
Sequences of back-spliced exons were extracted from known gene annotations, described as above. To decipher conserved circRNAs between human and mouse, LiftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) was used to identify orthologous coordinates between human and mouse (parameters: -bedPlus = 3 -tab -minMatch = 0.1 -minBlocks = 1). For each human circRNA, on the one hand, if there were an expressed circRNA identified in mouse orthologous locus within 5nt difference, this circRNA was suggested as a conserved circRNA between human and mouse; on the other hand, if no mouse circRNA ortholog were found, this human circRNA was suggested as a human-specific one. Similar strategy was also applied to mouse circRNAs. To examine the sequence conservation, PhastCons scores of circRNA-producing exons and their flanking introns were retrieved from UCSC and plotted for comparison.
Complementary sequences of circRNA-flanking introns were detected by BLASTn (parameters: -word_size 11 -gapopen 5 -gapextend 2 -penalty -3 -reward 2 -outfmt 6 -strand minus). Orientation-opposite paired sequences with at least 50 nts on each side of circRNA flanking introns in human and mouse or 20 nts on each side of circRNA flanking introns in fruitfly and worm were defined as complementary sequences.
CSI was used to quantitate RNA pairing capacity of each orientation-opposite complementary sequence pair across circRNA-flanking introns. Three basic elements including symmetry length between the complementary sequences and their proximal back-splicing sites, pairing ability detected by Blast and competition of this paired complementary sequences with other complementary sequences, were all considered in the CSI estimation.
Symmetry length (SL) is the distance from one side of complementary sequence pairs to its proximal back-splicing site. There are two SLs for each complementary sequence pair, a short one (sSL) and a longer one(lSL).
The pairing ability of each given orientation-opposite complementary sequence pair across circRNA-flanking introns is determined by Pair Score (PSacross) that is quantitated by BlastScore/L2. Here, “L” is the distance between the pair of orientation-opposite complementary sequences, excluding the distances between back-splicing sites, which equals to “sSL+lSL.”
For a given orientation-opposite complementary sequence pair across circRNA-flanking introns, it could also be paired with suppressing complementary sequences within the same individual introns to inhibit (PSi) its pairing across circRNA-flanking introns. At the same time, the suppressing complementary sequences could also be competed by other sequences to enhance (PSe) the RNA pairing potential of this given orientation-opposite complementary sequence pair across circRNA-flanking introns. Competition of the upstream circRNA-flanking intron of this given orientation-opposite complementary sequence pair is calculated as PSup = ∑((PSi/(PSi+∑(PSe))xPSi), and competition in the downstream circRNA-flanking intron is accordingly determined as PSdown. Taken together, the competition of this given orientation-opposite complementary sequence pair from their hosting introns is summed up as PSup+down = PSup+PSdown. Thus, the potential complementarity of each given orientation-opposite complementary sequence pair across circRNA-flanking introns was competed by other pairing and could be calculated as PSacross/(PSacross+PSup+down).
Finally, the CSI of each given orientation-opposite complementary sequence pair is calculated by aggregation of all these factors and shown in Fig. 4A. A maximum CSI was obtained from a strongest pair of orientation-opposite complementary sequences in each circRNA-flanking intron and was used to represent the strongest RNA pairing potential to enhance the relevant circRNA formation. The source codes of CSI will be accessed from https://github.com/YangLab/CSI.
To evaluate CSI performance, R software and the pROC R library24 were used for ROC curve construction and AUC estimation. Flanking introns from 500 randomly-selected highly-expressed circRNAs and control intron pairs from 500 randomly-selected non-circRNA producing genes (Fig. 4B: intron length ≥ 8,000 nts; Fig. S5: intron length ≥ 0 nt) were chosen for ROC curve construction. AUC of the ROC curve was calculated by pROC R library. The AUC of the ROC curve of CSI was used to evaluate the specificity and sensitivity of this index.
Orientation-opposite complementary sequence pairs with maximum CSIs were overlapped with known repetitive element annotation (human: rmsk.txt updated at 2009/02/27; mouse: rmsk.txt updated at 2012/05/07; fruitfly: rmsk.txt updated at 2007/07/11; worm: rmsk.txt updated at 2012/01/05) to further categorize the types of complementary sequences that contribute to RNA pairing (Fig. 4C and 4E). Basically, IRAlus in human and IRSINEs in mouse play the most important role in circRNA formation. The distribution of different types of repetitive elements was calculated in both genomic and intronic regions (Fig. S5B and 5D).
In total, 3,263 orthologous genes from worm, fruitfly, mouse and human were retrieved from OrthoDB.25 Among them, 1,345 of orthologous genes have their linear mRNAs expressed at RPKM ≥ 1 in ESCs from all four species (H9 in human, R1 in mouse, S2 in fruitfly and N2 in worm). Accordingly, the total numbers of circRNAs generated from 3,263 or 1,345 orthologous genes were calculated from all samples or the ESCs in these four organisms. The numbers of circRNAs from 3,263 or 1,345 orthologous genes were individually plotted by Cluster (version 3.0) and shown by Java Treeview (version 1.1.6r4).
No potential conflicts of interest were disclosed.
We are grateful to laboratory members for discussion.
This work is supported by grants 2014CB910601 from MOST and 91540115, 91440202, 31271390 and 31471241 from NSFC.
L.Y. and L.-L.C. conceived and designed the project. R.D. and X.-K.M. performed the bioinformatic analysis. L.Y. and L.-L.C. wrote the paper with contributions from coauthors. L.Y. supervised the project.