|Home | About | Journals | Submit | Contact Us | Français|
Lens development involves a complex and highly orchestrated regulatory program. Here, we investigate the transcriptomic alterations and splicing events during mouse lens formation using RNA-seq data from multiple developmental stages, and construct a molecular portrait of known and novel transcripts. We show that the extent of novelty of expressed transcripts decreases significantly in post-natal lens compared to embryonic stages. Characterization of novel transcripts into partially novel transcripts (PNTs) and completely novel transcripts (CNTs) (novelty score≥70%) revealed that the PNTs are both highly conserved across vertebrates and highly expressed across multiple stages. Functional analysis of PNTs revealed their widespread role in lens developmental processes while hundreds of CNTs were found to be widely expressed and predicted to encode for proteins. We verified the expression of four CNTs across stages. Examination of splice isoforms revealed skipped exon and retained intron to be the most abundant alternative splicing events during lens development. We validated by RT-PCR and Sanger sequencing, the predicted splice isoforms of several genes Banf1, Cdk4, Cryaa, Eif4g2, Pax6, and Rbm5. Finally, we present a splicing browser Eye Splicer (http://www.iupui.edu/~sysbio/eye-splicer/), to facilitate exploration of developmentally altered splicing events and to improve understanding of post-transcriptional regulatory networks during mouse lens development.
The past decade has seen a surge in transcriptome-level studies for specific developmental stages of the eye and its tissue sub-types1, 2. The development of the eye involves a complex and highly orchestrated regulatory program with several specification and differentiation processes3, 4. The lens is a transparent tissue that focuses light on the retina5. It originates from the surface ectoderm early in embryogenesis and is composed of two cell types, namely the anteriorly located epithelial cells and the posteriorly located fiber cells6, 7. During development and throughout the life of the animal, epithelial cells differentiate into fiber cells that elongate and migrate towards the center of the lens, while degrading their organelles, including nucleus.
High-throughput sequencing techniques, collectively known as Next Generation Sequencing (NGS) approaches, have significantly advanced our understanding of the molecular portrait of various cell types and disease states8–10. One of the primary advantages of high-throughput RNA sequencing (RNA-Seq) is that it enables accurate assembly of the transcriptome, and its alterations across experimental conditions, so as to allow prioritization of the transcripts and splice forms that are potentially most relevant to the observed phenotype. However, employing RNA-Seq datasets for genome-scale elucidation of the splicing alterations across developmental11 and disease states12 or to study inter-individual differences in humans is still in its early stages13.
Greater than 94% of multi-exonic genes in the human genome are alternatively spliced14. Further, alternative splicing is an essential and highly controlled post-transcriptional regulatory mechanism which provides transcriptomic and proteomic diversity in eukaryotic organisms15. Due to the extensive prevalence of splicing events in higher eukaryotes, various transcriptomic datasets across developmental stages have been previously explored in multiple model organisms to study the structure and composition of protein-coding and non-coding genes16–19. These RNA-Seq based studies revealed more accurate and comprehensive set of known and novel genes for downstream functional and comparative analysis.
Previous studies report that ocular tissues such as the retina can exhibit highly diverse transcript profiles with hundreds of novel transcripts, likely contributed by the ensemble of multiple cell types abundant in retina1, 20. However, few RNA-Seq based studies have been conducted so far for investigating the lens transcriptome21, 22 especially over different developmental stages23, 24. Further, these studies have used only known or annotated genes in their analysis. Thus, to date the complete lens transcriptome and the various isoforms expressed in the developing lens has not been fully characterized. In this study, we investigated the transcriptomic alterations and splicing events from publicly available lens RNA-Seq data, and have constructed a comprehensive molecular portrait of known as well as novel transcript isoforms in the mouse lens across developmental stages.
Although mouse lens transcriptome profiling has been the focus of few studies in recent years21–24, our understanding of the complete repertoire of expressed transcripts and their splicing alterations during lens development is far from complete. In this study, we investigated the transcriptomic alterations and alternative splicing events in mouse lens across developmental stages. Overview of the analysis pipeline is illustrated in Fig. 1. In brief, we collected the available RNA-Seq data for mouse lens across varying developmental stages and processed the raw sequence reads using HISAT25 and StringTie26. The processed and quantified data were formatted into expression matrices and were utilized for investigation of complete transcriptomic architecture, extent of transcript novelty, and their evolutionary conservation (see Materials and Methods). Additionally, we investigated the alternative splicing events using rMATS27 followed by an extensive functional analysis of the genes associated with enriched splicing event types. The most prominent splicing event types namely skipped exon and retained intron events were made available through Eye splicer (http://www.iupui.edu/~sysbio/eye-splicer/), a web based splicing browser showing developmentally altered splicing events in mouse lens.
We collected the RNA sequencing data from two studies23, 28 and preprocessed them using a NGS pipeline for data processing to facilitate their downstream analysis (See Materials and Methods, Fig. 1 and Table 1). The RNA-Seq data of different developmental stages were processed separately using the proposed pipeline. Raw RNA-Seq reads were aligned onto mouse reference genome mm10 using HISAT. The overall percentage of alignment for each sample is shown in Table 1. All datasets exhibited a good read quality (Phred score>20) and a high fraction of read alignment to the reference genome (alignment score≥93%) using HISAT.
Since previous reports studying the eye transcriptomes indicated diverse transcriptomic architecture4, our goal was to investigate whether such diversity exists in different developmental stages of lens. For this purpose we first quantified the expression of transcripts and corresponding exons using StringTie. This allowed us to obtain expression levels for 90689 transcripts (68166 annotated and 22523 novel transcripts) in the mouse genome. The analysis indicated the existence of ~25% novel transcripts in the developmental mouse lens transcriptome. In order to further investigate the extent of the novel transcripts in each developmental stage, we analyzed the proportion of known and novel transcripts (with TPM>1.0) across different developmental stages (Fig. 2a). We observed that in each of the developmental stages of mouse lens there are about ~35–50% of novel transcripts. Such variations in the distribution of known versus novel transcripts with respect to different developmental stages was found to be consistent despite filtering for different TPM thresholds (i.e. >0.5, >2.0, and >5.0). In particular, despite the expression threshold employed for defining the expression of a transcript, several thousands of novel transcripts were still identified (Figure S1a–c and Table S1). These observations support the presence of a diverse transcriptome with thousands of novel transcripts being expressed in various lens developmental stages as well as the predominance of complex transcriptional and post-transcriptional regulatory mechanisms in embryonic and post-natal stages during mouse lens formation.
To further investigate whether the expression of these novel transcripts differs between stages, we calculated novelty score of a transcript to measure the differences in the extent of novelty across stages using KS (Kolmogorov–Smirnov) test. Novelty score of a transcript is defined as the percentage of non-overlapping novel transcript length to the reference annotated transcriptome (Fig. 2b). We observed that in embryonic stages, each pair of neighboring developmental stages were found to be significantly different in their distribution of novelty scores for the novel transcripts (p-value≤0.005) and this pattern was observed until birth (P0). In general, the novelty score distributions of the novel transcripts for embryonic stages were observed to be significantly higher compared to those seen in post-natal stages (median novelty score: 10.89 vs 9.04, p=1.06e-12, KS-test, Fig. 2b).
In our study, we identified a total of 22523 novel transcripts (~25% of total transcripts) in mouse lens as documented in Table S2 along with their novelty score and expression levels. As discussed above, we observed differences in the distribution of novelty scores of transcripts between embryonic and postnatal developmental stages. Hence, we further classified the novel transcripts based on their novelty score (See Materials and Methods and Table S2). We categorized the novel transcripts into two groups; Partially Novel Transcripts (PNTs, novelty score<70%, 13207 transcripts) and Completely Novel Transcripts (CNTs, novelty score≥70%, 9316 transcripts).
To investigate and compare the extent of conservation of known and novel transcripts, we used phastCons scores from UCSC Genome Browser, which provide a nucleotide level conservation score across 46 vertebrate genomes, facilitating a measure to quantify conservation for mouse genomic loci (see Materials and Methods). We calculated the phastCons score distributions for each group of transcripts; known transcripts, PNTs and CNTs (Materials and Methods section, Fig. 2c). We observed a significant difference in phastCons score distributions among these groups (median for known transcripts=0.67, median for PNTs=0.76, and median for CNTs=0.13; Wilcoxon rank sum test, p-value<2.2e-16). The score distribution indicates that PNTs exhibit higher conservation patterns than already known transcripts while their patterns were less comparable to CNTs. These observations suggest that since lens tissue and corresponding cell line transcriptomes have been poorly or rarely studied by genome annotation consortiums like ENCODE29 or FANTOM30, it is possible that hundreds of transcripts specific to lens may have been rarely documented in genomic/transcriptomic resources. However, integrative analyses and databases based on next generation RNA-sequencing datasets specific to such overlooked tissues, would be able to capture such missing transcript isoforms or poorly annotated genes, suggesting the need for such focused studies. In contrast, most of the CNTs were found to be poorly conserved based on phastCons score profiles. Interestingly, we found a few of the CNTs as outliers in the box plot exhibiting extremely high conservation (Fig. 2c, CNTs, above third quartile), which met the median phastCons threshold of both known and CNTs, and hence are likely to be active but functionally uncharacterized for biological processes.
To understand whether particular functions and processes are over-represented as gene ontology (GO) categories for these novel transcripts, we performed functional enrichment analysis of the PNTs by using the annotations of the corresponding mouse genes with which they overlap partially. To generate a high confident set of evolutionary conserved novel transcripts with annotated information, we filtered the PNTs with phastCons score>0.8 and obtained a set of 3982 genes satisfying these criteria. We performed functional enrichment analysis of these genes with corrected p-value (Bonferroni correction) threshold<10−10 using ClueGO31. ClueGO is a Cytoscape plugin which enables the functional grouping of GO terms or gene sets to represent the enriched functional themes as networks. Figure S2 shows the resulting network for GO biological processes. We found significant clustering of genes into 26 thematic groups based on enriched GO terms using ClueGO (Table S3). Specific biological processes and associated modules are highlighted in Fig. 2d. We observed that ‘alternative mRNA splicing via spliceosome’, ‘mRNA metabolism process’, ‘ubiquitin mediated proteolysis’, ‘nervous system development’, ‘neurological system process’, ‘organelle organization’, ‘cell cycle’, ‘protein localization’ etc were over-represented in PNTs (Fig. 2d and Figure S2). For instance, we found group 19 (i.e. nervous system development) to be significantly enriched (adjusted p-value=9.93e-32) with 841 genes i.e. ~30% of the genes (Table S3) annotated with neurogenesis, neuron differentiation and nervous system developmental processes. These observations clearly reveal the role of several poorly characterized transcripts associated with nervous system development, RNA metabolism, cell cycle, organelle and chromatin organization, regulation of anatomical structure morphogenesis and cell differentiation, during lens development.
We further investigated and compared the transcriptomes of the three groups of transcripts across each developmental stage. We averaged the expression level of a transcript across biological replicates in each developmental stage in order to compare the distribution of expression levels for known transcripts, PNTs and CNTs. We included the subset of transcripts in each class which were found to be expressed in all seven stages which resulted in 23121 known transcripts, 4531 PNTs and 4027 CNTs. The expression values were log-transformed and represented as box plot for each class across individual stages separately (Figure S3). We observed that, all three transcript classes exhibited significantly different expression profiles for each developmental stage (Wilcoxon rank sum test, p-value<0.001, See Table S4), with known and PNTs exhibiting significantly higher expression compared to CNTs. In particular, our analysis also revealed that PNTs are highly expressed than known transcripts (Wilcoxon rank sum test, p-value<0.001, See Table S4). These observations are similar to the conservation pattern of PNTs being higher than other transcript groups. These results indicate that PNTs are significantly more expressed than CNTs across all developmental stages and are often more expressed than even annotated transcripts suggesting that these PNTs are likely functional in lens development.
Although we observed that transcripts belonging to the CNT class were generally poorly conserved compared to the other two groups (Fig. 2c), nevertheless a small fraction (~8.6%) of CNTs exhibited high conservation with phastCons scores greater than 0.76. We considered these CNTs as highly conserved completely novel transcripts because the median conservation score of known and partially novel transcripts was found to be 0.67 and 0.76 respectively. In order to further interrogate the activity of these ~8.6% completely novel transcripts, we analyzed their expression profile across developmental stages. We further filtered them to obtain a set of CNTs with a phastCons score>0.8 and expressed in at least one developmental stage, after excluding RNA-seq samples from E15.5 which originate from a different study in order to avoid any potential batch effect. We found a total of 647 CNTs (see Table S5) that exhibited varying levels of expression across developmental stages (Figure S4). Figure 3a shows a clustering snapshot of the distribution of these expression profiles across stages with expression levels of a transcript normalized by its maximum level across developmental stages (Materials and Methods, see Figure S4 for an extended heatmap). We analyzed the expression profiles of CNTs based on hierarchical clustering to identify representative panels of transcripts expressed in only one specific developmental stage (Fig. 3b) and in all developmental stages analyzed (Fig. 3c). These heatmap panels show the genomic co-ordinates as well as the novelty and phastCons scores associated with each CNT. We observed that ~10% of the CNTs (phastCons score>0.8) were expressed in specific developmental stages as shown in Fig. 3b. In contrast, ~47% of the transcripts were found to be expressed across all developmental stages, with a selected set of hierarchically clustered CNTs following this trend shown in Fig. 3c. This suggests that a small fraction of CNTs with uncharacterized function could be potentially regulating stage specific developmental processes while majority of the CNTs could have broader functional roles across stages albeit uncharacterized.
We also investigated the transcript structure of different transcript classes by comparing the number of exons and length distributions. We observed significant difference in the distribution of exonic composition for PNTs and known transcripts (p<2.2e-16, Kolmogorov–Smirnov test), with majority of the PNTs being multiexonic (>3 exons). In particular, about 20% of the PNTs were found to have more than 20 exons and were enriched in genes associated with several processes including ‘microtubule cytoskeleton organization’, ‘cell cycle’, ‘nervous system development’, ‘cell projection morphogenesis’, ‘embryo development’, ‘focal adhesion’ and ‘chromatin remodeling’. In contrast, we observed that ~90% of CNTs were single or bi-exonic with a small fraction of them exhibiting multiexonic structure as shown in Figure S5a. We also investigated the length for the three groups of transcripts and found significantly (p-value<2.2e-16) varying distribution of lengths as shown in Figure S5b. We observed that the known transcripts exhibited an expected distribution of transcript length as previously described32 with an abundance of transcripts having length between ~102bp and ~104bp. However, among the novel transcript groups; PNTs exhibited a distribution more similar to that of known transcripts when compared to CNTs. In particular, we observed that most PNTs had length ranging from 102–107bp with abundance of transcripts having length in the range of 105–106bp. In contrast, we found majority of the CNTs ranging in length from 102 to 105bp dominated by relatively shorter length (100–1000bp) transcripts. Indeed, studies from GENCODE consortium33 observed that human long noncoding RNAs (lncRNAs) are typically encoded as single or biexonic transcripts with significantly lower exome lengths compared to annotated protein coding transcripts, suggesting that several of the CNTs detected in our study are likely to be noncoding RNAs.
We hypothesized that the varying exonic composition and transcript length distribution of CNTs could help further filter them in order to build a high confidence compendium of active CNTs for downstream analysis. We thus included transcript length as an additional parameter along with high conservation score to delineate probably functionally active 100% novel transcripts. We applied two sets of filters; a) 300bp≤transcript length≤10000bp; phastCons score>0.95; average expression>5.0 TPM and expressed in at least four developmental stages and b) 300bp≤transcript length≤10000bp; phastCons score>0.95; no expression threshold; but expressed in at most two developmental stages to generate two sets of CNTs. Table 2 show a list of 25 and 19 CNTs satisfying these two sets of filters respectively. To investigate if any of the highly conserved 25 CNTs can encode for potential functional protein coding sequences, we performed ORF prediction using an ad hoc Python script to detect both canonical and non-canonical start codons in six open reading frames. This analysis allowed us to confirm the presence of ORFs in 13 CNTs. Among these 13 ORFs, 8 of them were found to contain Pfam domains according to HMMER34 (Table S6). We performed ORF prediction on 654 CNTs that are 100% novel and exhibited a phastCons score>0.8. We observed that 202 of them encode for ORFs with 121 of them exhibiting at least one hit using HMMSCAN34 against Pfam, suggesting that at least 18% of the CNTs are likely to encode for functional domains (Table S7). Further, we validated four of the CNTs shown in Table 2 by RT-PCR in three different developmental stages of lens, among which two transcripts were predicted to encode for ORFs (Fig. 3d, Figure S6). We found that all the four completely novel transcripts were expressed in P0 and P10 stages. As predicted from our transcriptomic analysis, the MSTRG.17446.1 transcript was not detected at E15.5. These results further validate the stage-specific expression of CNTs shown in Fig. 3 and Table 2 .
Alternative splicing is an important molecular mechanism which contributes to the transcriptomic diversity in higher eukaryotes35. Increasing evidence supports the role of splicing and post-transcriptional regulatory alterations in development11 and disease12, 36–38, in addition to their prominent role in generating multiple transcripts and protein isoforms in normal cells.
Since we observed significant differences in the distribution of novelty scores for novel transcripts between the embryonic and post-natal stages in mouse lens, we argued that alternative splicing could contribute to these differences. In addition to contributing to transcript isoforms, splicing events can also contribute to differential regulation of the gene products across developmental stages by controlling the abundance of the required isoform. Hence, we employed rMATS27, a framework for detecting splicing alterations from next generation RNA-sequencing datasets, to investigate such key events for molecular diversity across developmental stages (see Materials and Methods). Table 3 shows the number of high confident Alternative Splicing (AS) events detected using rMATS pipeline (FDR<0.01) across every pair of developmental stages with replicates. Table includes the number of detected AS events reported to be significant by rMATS, for the five types of events namely skipped exon (SE), alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), mutually exclusive exons (MXE) and retained intron (RI). These results clearly indicate an abundance of SE and RI events compared to the other types during lens development. Tables S8 and S9 summarizes the highly significant (1% FDR) SE and RI events discovered across developmental stages.
Skipped exons are one of the most prevalent alternative splicing events in higher eukaryotes39. In these events, the splicing machinery can ‘skip over’ an exon by splicing it, thereby masking its contribution in the final RNA or protein product. We obtained 418 significant (FDR<1%) exon skipping events corresponding to 266 exons observed in 399 transcripts from 213 genes across various developmental stages (Table S8).
We performed functional enrichment analysis of the genes associated with skipped exonic events identified at 1% FDR using ClueGO31 (Materials and Methods). Enrichment results from this analysis are summarized in Table S10. We found several significant (adjusted p-value<2.05e-04) groups of functional processes to be enriched including ‘mRNA processing’, ‘microtubule-based process’, ‘splicing factor NOVA regulated synpatic proteins’, ‘regulation of intrinsic apoptotic signaling pathway’, ‘lens development in camera-type eye’, ‘protein polymerization’, ‘tight junction’, ‘positive regulation of developmental growth’ and ‘striated muscle cell differentiation’ (Table S10, Fig. 4a). These observations indicate the prevalence of skipped exonic events in several differentiation and developmental processes via post-transcriptional regulation. For instance, we found 6 genes significantly (adjusted p-value=8.30e-05) associated with the term ‘lens development in camera-type eye’. The genes that belong to this functional theme include Cdk4 (Cyclin-Dependent Kinase 4), Cryba1 (Crystallin, Beta A1), Lim2 (Lens Intrinsic Membrane Protein 2), Meis1 (Meis Homeobox 1), Pax6 (Paired Box 6), and Smarca4 (SWI/SNF Related, Matrix Associated, Actin Dependent Regulator of Chromatin, Subfamily A, Member 4), which contributes to ~8% of genes annotated with lens developmental processes.
Paired Box 6 (Pax6) is a transcription factor encoded by 14 exonic gene Pax6. This gene has previously been documented as a key regulator for sensory developmental processes40, 41 and lens regeneration42. We found that a particular exon, ENSMUSE00001311933 (Pax6) was included in all developmental stages except P9 with high Percent Splicing Index (PSI) values ranging between 0.93 and 0.99. Similarly, we found that ENSMUSE00000736151 (Cyclin-Dependent Kinase 4, Cdk4) is differentially included in E18 (PSI value=0.964) versus P0 (PSI value=0.8025) (FDR<1%) and ENSMUSE00000691476 (Crystallin, Beta A4, Cryba4) is included all developmental stages except P0 with high PSI values ranging between 0.97 and 0.99, suggesting its importance in lens development (FDR<1%) (Table 3).
We validated the expression of alternate isoforms of Pax6 and Cdk4 by RT-PCR and Sanger sequencing across developmental stages. Both Pax6 and Cdk4 follow the predicted trend (Table 3, Fig. 4b, Figures S7 and S8). For example, the ENSMUSE00001311933 exon of Pax6 is expressed at stages E15.5 and P0, while its expression is undetected at P10. Cdk4 exon ENSMUSE00000736151 is expressed at all three stages, E15.5, P0 and P10. Further, we validated skipped exonic events detected in four other genes (Banf1, Cryaa, Eif4g2, Rbm5) that have been detected at an FDR<5% (Fig. 4b, Figure S7 and Table S8). Additional validation of these splicing events in P0 lens using Sanger sequencing independently confirmed our findings (Figure S8, Materials and Methods). Mutations in Cryaa have been previously shown to cause cataracts in humans and mice43, 44. Eif4g2 and Rbm5 encode for RNA binding proteins and Banf1 encodes a DNA binding protein. While the function of these genes has not been characterized in the lens, they exhibit high expression in the lens tissue. Interestingly, another Rbm family protein, Rbm24, is expressed highly in vertebrate lens development45 and its deficiency in Zebrafish causes microphthalmia46. All five genes have alternatively spliced isoforms that are differentially expressed across lens developmental stages, as predicted by splicing analysis (Table S10). For example, the ENSMUSE00000145472 exon of Banf1 is skipped at P10. Further, the ENSMUSE00000352893 exon of Cryaa is not highly expressed at any of the lens developmental stages tested, suggesting no potential function of the ENSMUST00000019192 transcript during late embryonic and early postnatal stages of mouse lens development. The isoform of Eif4g2 containing exon ENSMUSE00000203223 is expressed in all developmental stages, while the alternate isoform without the exon is not as highly expressed. Rbm5 has a distinct expression pattern during lens development. While the Rbm5 isoform including the ENSMUSE00001225318 exon is expressed at all stages, the isoform with skipped ENSMUSE00001225318 exon is expressed highly only at P0. This suggests a potential function for the ENSMUSE00001225318 exon at early perinatal stages. Together, the RT-PCR validation analysis suggests that alternatively spliced isoforms of genes expressed in the lens are also differentially expressed at different developmental stages. This indicates that certain isoforms of genes function specifically during embryonic or postnatal development, indicating the significant contribution of post-transcriptional regulation to the functional diversity of the isoforms.
Retained intron (RI) is an important but less characterized AS mechanism. It causes retention of intronic region that may or may not also include some exonic regions during splicing (Fig. 5a). It is commonly suggested that, most of the transcripts exhibiting RI, could open a new targeting motif for small interfering RNA (siRNA) at RI loci, thus are degraded by nonsense-mediated decay47. However, recent studies indicate that intron-retaining mRNAs are likely to have a more conserved role in development and numerous diseases48. Our splicing analysis indicated that retained intron events are the second most abundant alternative splicing events after skipped exon events (Table 3). We obtained 193 significant (FDR<1%) intron retention events corresponding to 178 exons observed in 192 transcripts from 168 genes across various developmental stages (Table S9, Table 3).
Functional enrichment analysis of the genes which exhibited retained intronic events at 1% FDR threshold clearly revealed an enrichment for genes annotated with significant groups (p-value<2.25e-04) such as ‘RNA splicing’, ‘M Phase’, ‘cellular responses to stress’, ‘autodegradation of Cdh1 by Cdh1:APC/C’, ‘regulation of RNA splicing’, ‘snRNP assembly’, ‘response to epidermal growth factor’ and ‘mitophagy’ suggesting that the genes whose regulation is controlled by intron retention appear to be associated with developmental check points or stress related (Fig. 5b and Table S11). For instance, we found several genes (Anapc2, Anapc5, Cdk4, Ehmt2, Ensa, H3f3b, Id1, Mcm7, Ncapg, Nup35, Pole, Ppp1cc, Psmc4, Psmd11, Psmd4, Rps27a, Tpr and Trp53) associated with cell cycle [M-Phase], which were found to be exhibiting retained introns in various developmental stages (Table S11). Similarly, we found genes associated with ‘autodegradation of Cdh1 by Cdh1: APC/C’ to be significantly enriched (p-value=1.87e-07) with 15 genes (Anapc2, Anapc5, Atg4b, Becn1, Cdk4, Ehmt2, H3f3b, Id1, Map1lc3b, Psmc4, Psmd11, Psmd4, Rps27a, Trp53, Wipi2) contributing to 6% of the genes associated with Cdh1 mediated proteolysis/ degradation of mitotic proteins. Cdh1 (epithelial cadherin) is an important protein which controls the mitotic arrest with G1-phase elongation in neurogenesis49.
Celf1 (also known as CUG triplet repeat, RNA binding protein 1) is a well characterized RNA binding protein belonging to the CUG-BP family. CUG-BP family is known for protein members, which control the embryonically lethal abnormal vision via potential involvement in developmentally regulated alternative splicing50. Celf1 is a highly conserved RNA binding protein, involved in alternative splicing, polyadenylation, mRNA stability, and translation processes51. Additionally, Celf1 was documented to potentially regulate the genes involved in embryonic heart muscle development51, 52 with some support for the role of these family members in lens development53. In our analysis, we found that the last exon - ENSMUSE00000643467 (Celf1) exhibited retained intronic event with high PSI values in post-natal developmental stages compared to embryonic stages, with a very low abundance in E15 as shown in Fig. 5c. In addition to Celf1, several other RBPs were found to exhibit intron retention events supporting the notion that RBPs can be regulated post-transcriptionally by non-sense mediated mRNA decay of the unproductive splicing isoforms which might harbor stop codons54. These observations suggest a stage dependent regulation of RBP’s transcript levels by auto-regulation at post transcriptional level, to fine tune the downstream post-transcriptional regulatory networks in lens development.
In this study, we investigated the transcriptomic alterations and splicing events during lens formation (i.e. across different developmental stages; E15, E15.5, E18, P0, P3, P6 and P9), and constructed a molecular portrait of known and novel transcript isoforms in the mouse lens. Although samples from the developmental time point E15.5 originated from a different study and read length distribution compared to the rest of the datasets, comparison of the annotated gene expression profiles in the mouse genome between E15.5 and E15 stages revealed a significant correlation (Pearson R=0.86, p<2.2e-16), suggestive of significant similarity in the expression profiles of developmentally close time points, irrespective of the source and sequencing platform. In contrast, expression profiles from E18 and post-natal stages exhibited lower correlation with respect to E15.5 dataset, further confirming the quality and robustness of expression profiles to delineate the developmental stages. Although, increasing number of studies using RNA-sequencing protocols are able to generate a wild type control as part of their research projects leading to stage-specific developmental transcriptomes56–58, several issues need to be considered before employing them in large-scale meta-analysis studies, which can significantly improve the quality and number of high confidence predictions. For instance, several of these publicly available datasets are generated with single replicates, provide separate transcriptomes of epithelial and fiber cells as opposed to whole lens, are generated with differing read lengths and arise from different labs. Hence, future efforts to integrate the datasets should account for sample heterogeneity by normalizing the samples before expression quantification or modifications should be adopted in splicing prediction software to account for variable sequencing fragment lengths across datasets.
Classification of ~25% of the total transcripts defined as novel transcripts, into partially and completely novel transcript types (PNTs and CNTs) based on their extent of overlap with current annotations, allowed us to uncover the properties of these transcript sub-types. We found that the extent of novelty of the transcripts decreased significantly in post-natal lens stages compared to embryonic stages, suggesting the presence of several uncharacterized novel transcript forms expressed during early lens development. PNTs were found to exhibit significantly higher conservation as well as expression levels compared to both completely novel and known transcripts, across the developmental stages studied here. Functional analysis of PNTs suggested the prominent role of several processes such as neural system development, structural morphogenesis, protein localization, cell division and differentiation, important for lens development. Notably, majority of the CNTs were widely expressed across developmental stages albeit exhibiting significantly lower expression, conservation and length compared to partially novel transcripts. Nevertheless, ORF prediction on a subset of ~600 CNTs which are conserved across all the studied species indicated protein coding ability for at least 30% of these novel transcripts. We confirm the expression of several of these CNTs across lens developmental stages. Functional analysis of the genes exhibiting the most abundant alternative splicing events, namely skipped exon and retained intron events, revealed the enrichment of mRNA processing, apoptotic signaling pathways, protein polymerization, cell development and differentiation for the former and the enrichment of cell cycle processes, stress and splicing regulation for the later type of events. We found several genes such as Banf1, Cdk4, Cryaa, Eif4g2, Pax6 and Rbm5 that are associated with lens development, to exhibit skipped exonic events. We have validated the expression of different isoforms as well as novel genes in developing mouse lens by qRT-PCR. Further, we have developed a splicing browser ‘Eye Splicer’ to access and view developmentally altered splicing events in mouse lens. Together, this in-depth analysis provides a high-resolution architecture of the mouse lens transcriptome and provides a one-stop portal for furthering the understanding of splicing alterations during lens development.
To obtain a comprehensive understanding of the transcriptome and splicing alterations across various stages of lens development, we collected multiple publicly available RNA-seq datasets corresponding to the raw RNA sequence reads of mouse lens from different developmental stages (Table 1). These datasets were aligned to the mouse reference genome, quantified for expression levels of known and novel transcripts as well as to investigate splicing alterations as illustrated in the workflow (Fig. 1). In the following sections, each of the major steps employed in processing and analysis are described in further detail.
We collected the raw RNA sequence reads of different developmental stages (E15, E15.5, E18, P0, P3, P6 and P9 each with its biological replicate) of mouse lens from Gene Expression Omnibus (GEO)59 and European Nucleotide Archive (ENA)60. Table 1 shows the relevant source of the RNA-seq dataset along with several metrics resulting from the alignment of the reads to the reference genome. Briefly, we downloaded the single end datasets in FASTQ format using the SRA Toolkit (fastq-dump command), and the paired end datasets were directly downloaded from ENA. We ensured the quality of the aligned sequence reads to a minimum quality score of 20 for each sample using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html).
HISAT25 (Hierarchical Indexing for Spliced Alignment of Transcripts) is a highly efficient alignment tool for aligning short reads from RNA sequencing experiments onto reference genome. We used HISAT with default parameters and setting the number of processors to 32, for rapidly aligning the quality filtered RNA sequence reads collected from different sources (See Table 1) against mouse reference genome mm10 annotation files. SAM (Sequence Alignment/Map) files obtained as outputs from HISAT were post processed using SAMtools (version 0.1.19)61, 62 for converting SAM to BAM (Binary Alignment/Map) followed by sorting the output BAM files. The sorted binary alignment files (sorted-BAM) obtained after post-processing were employed for further data processing i.e. quantification of expression levels of transcripts and splicing analysis.
We used StringTie (version 1.2.1)26 for identification and quantification of transcripts from the aligned RNA-Seq reads. StringTie is novel network flow algorithm based on fast and highly efficient assembler, to quantitate the transcripts of each genomic locus considering all possible multiple splice events. In addition to annotated transcripts, it can also provide the information of possible novel transcripts in each sample. Transcript level expression data quantified using StringTie were stored in GTF (Gene Transfer Format) providing expression levels for both known as well as novel transcripts against mouse reference genome (mm10-Mus_musculus.GRCm38.84.gtf). All the GTFs previously obtained for each sample were grouped and provided as an input for stringtie “merge” mode along with mouse reference genome (mm10-Mus_musculus.GRCm38.84.gtf). The merged GTF thus obtained was then utilized as reference annotation file in re-running StringTie with the sorted-BAM for the corresponding samples. As a result, we obtained a matrix of expression levels for 90689 transcripts (68166 annotated and 22523 novel transcripts) in the mouse genome. Known transcripts are defined as the transcripts whose genomic co-ordinates and annotations completely overlapped with those reported in Ensembl database63 for the mouse genome. In contrast, novel transcripts were defined as the transcripts that were exclusively predicted by StringTie and hence could overlap partially with already annotated exonic regions in the mouse genome. A quantification matrix was generated for lens transcriptome with respect to different developmental stages extracting the TPM (transcripts per million) values from StringTie outputs. This matrix was utilized for downstream analysis.
We calculated the proportion of known and novel transcripts for each RNA-seq sample with an expression threshold of TPM>1.0 and averaged the values for corresponding replicates from each developmental stage. The obtained proportions were represented as a bar graph for each developmental stage. Similarly, we calculated the proportion of known and novel transcripts with varying expression thresholds (TPM>0.5, >2 and >5) and represented as bar graphs to study the reproducibility of our observed trends.
To investigate the discovered novel transcripts for their extent of novelty with respect to the known transcript architectures documented in the mouse reference genome mm10, we mapped the length of the discovered transcript to annotated reference transcript coordinates and calculated a novelty score for each novel transcript by using the below formula,
We examined the distribution of novelty score of novel transcripts in each developmental stage and represented it as a density plot. We performed K–S (Kolmogorov–Smirnov) test to investigate for statistically significant differences in the novelty score distributions between any pair of developmental stages. Based on prior calculations and distribution of novelty scores, we categorized the novel transcripts into two groups; partially novel transcripts (PNTs, novelty score<70%) and completely novel transcripts (CNTs, novelty score≥70%). We analyzed the expression levels of transcripts across all stages for each transcript group - known, partially annotated novel and completely novel transcripts and performed Wilcoxon rank sum test to study the distribution of expression levels between transcript groups for each developmental stage separately. These results were represented as box plots in supplementary material.
To validate the expression levels of novel transcripts discovered from RNA-Seq analysis, total RNA was extracted using a RNeasy Mini kit (Qiagen Inc, Valencia, CA) from microdissected C57Bl/6 mouse lenses at three stages, namely, embryonic day (E) 15.5, and post-natal day (P)0 and P10. Each of the three biological replicates at E15.5 comprised of six lenses, and at P0 and P10 comprised of two lenses. RNA was treated with RNase free DNase (Qiagen Inc #79254, Valencia, CA). cDNA was synthesized from 200 ng of total RNA, representing three biological replicates at each developmental stage using Bio-Rad iScriptTM cDNA Synthesis Kit (Bio-Rad Laboratories, Hercules, CA), and was used as a template in PCR analysis. Primers were designed for the exonic regions of four CNTs (Table S12). The PCR products were run on 1% agarose gel. Presence of specific bands at the expected size were indicative of transcript expression in the lens.
Although some reports indicate that mouse lens is likely to have a diverse transcriptome, the evolutionary significance of the transcriptome is poorly understood. Hence to address this, we investigated the evolutionary conservation of the identified transcripts. Multiple sequence alignment of genomic loci across several genomes provides a comprehensive snapshot of the evolutionary conservation, which can act as a proxy for functional preservation of a selected region64. For instance, protein coding genomic loci were documented to be highly conserved across the genome than non-functional genomic loci65. We applied this technique to conjecture and identify novel transcripts which could be functionality active across large phylogenetic distances. We downloaded the phastCons scores66 from the UCSC Genome Browser for the complete mouse genome. PhastCons score employed in this study provides an estimate of the individual nucleotide level conservation, calculated based on multiple sequence alignment of 46 vertebrate genomes with respect to mouse reference genome mm10. It ranges from 0–1 with higher the score higher is the conservation of the individual nucleotide across the genomes. For this study, we utilized the available nucleotide resolution conservation score data for mm10 and calculated the phastCons score for each exon of the novel transcripts by averaging the per-base scores and then computed a representative conservation score for each transcript as the mean phastCons score of the exons representing the novel transcript. Final scores were analyzed for known (annotated) transcripts, PNTs and CNTs to compare their relative extents of conservation.
Since Gene Ontology (GO) based functional enrichment analysis can provide important clues about the functions and molecular processes predominantly associated with novel transcripts, we analyzed the Partially Novel Transcripts (PNTs) that shared majority (>70%) of their genomic region with known/annotated transcript containing genes to understand the likely functions associated with them. This involved filtering the PNTs with phastCons score (>0.8) to first identify highly conserved transcripts and using the resulting set of genes associated with these PNTs for downstream functional analysis. Functional enrichment analysis was performed with p-value threshold<10−10 for collected genes using Cytoscape67-ClueGO31 plugin and was represented as a clustered GO network. Significant clustering of genes, color coded by annotation group, based on enriched GO biological processes were highlighted in these representations.
Transcripts belonging to the completely novel class share less than 30% of their genomic region with known transcripts. We hypothesized that completely novel transcripts with high conservation and expressed in at least one developmental stage could be active with uncharacterized function. Hence, we filtered the transcripts based on phastCons score (>0.8) and analyzed their expression pattern. Expression profiles normalized by their maximum expression level across stages for these highly conserved completely novel transcripts were hierarchically clustered using Cluster 3.068 and visualized as a heatmap using Java Treeview69. Representative hierarchically clustered panels of transcripts expressed in only one specific developmental stage and in all developmental stages were shown separately. Novelty Score (NS) and phastCons Score (PS) indices for transcripts were shown as an additional scale bar in each heatmap.
We also investigated the distribution of the number of exons and length of the transcripts for known, partially novel and completely novel transcripts. We performed K–S (Kolmogorov-Smirnov) test to evaluate whether length distributions of transcripts significantly differ. Likewise, exon counts were also compared for these three categories of transcripts.
In order to identify high confident completely novel transcripts for potential experimental validation, we applied three simultaneous filters namely phastCons score, expression and transcript length. Briefly, these robust filters comprised of novelty score set to 100% and (a) 300≤transcript length≤10000; phastCons score>0.95; and average transcript expression (across all developmental stage)>5.0 TPM in at least four developmental stages to generate novel transcript predictions broadly expressed across developmental stages and (b) 300≤transcript length≤10000; phastCons score>0.95; exhibits expression in at most two developmental stages to generate novel transcript predictions specifically expressed in particular developmental stages. Resulting sets of broadly expressed, highly conserved and 100% novel transcripts were selected for experimental validation and discussed in the results section.
RNA-Seq data provides an opportunity to detect differential alternative splicing events across conditions. Since we have two replicates of RNA-seq for each developmental stage of mouse lens tissue resulting from the same sequencing platform, we applied rMATS (replicate Multivariate Analysis of Transcript Splicing)27 to identify differential alternative splicing (AS) events. rMATS provides a computational framework to identify all possible splicing events which are altered between two samples, by inspecting the status of exons/introns as they are included or excluded resulting from alternative splicing. We used sorted BAM (Binary Alignment/Map) files, obtained from aligning the raw RNA-seq datasets against the mouse reference genome using HISAT as discussed above, as input to rMATS by pairing with their corresponding replicates from each developmental stage. This allowed us to compare each pair of developmental stages for alterations in various splicing events. Since rMATS requires all input datasets to have the same read length, we excluded the dataset from E15.5 which had a different read length compared to others. Also, we have provided the GFF (General Feature Format) file downloaded from Ensembl (version 82, September 2015)70 as input to rMATS and have used the default thresholds for remaining options. Briefly, rMATS enabled us to analyze the inclusion/exclusion of target exons/introns contributing to different types of alternative splicing events, namely skipped exon (SE), alternative 5′ splice site (A5SS), alternative 3′ splice site (A3SS), mutually exclusive exons (MXE) and retained intron (RI), across any pair of developmental stages with replicates. An AS event is quantified based on the difference in the level of inclusion of an exon which is defined as the splice index or Percentage Splicing Index (ψ score) between two samples or conditions and ranges between 0 and 1. PSI represents the inclusion/exclusion of an exon for a transcript isoform considering all alternate possible isoforms. Reads aligning to the alternative exon or to its junctions with adjacent constitutive exons provide support for the inclusion isoform, whereas reads aligning to the junction between the adjacent constitutive exons support the exclusion isoform; the relative read density of these two sets forms the standard estimate of ψ. Significant differences in the values of ψ for an exon, between a pair of conditions compared to a null distribution indicate its differential abundance. We ran rMATS for all pairs of six developmental stages (E15, E18, P0, P3, P6 and P9) and generated a summary table with the number of different alternative splicing events that were detected below 1% FDR threshold (Table 3). Since skipped exon and retained intron events were the most abundant, we collected these events from raw rMATS outputs specifically those which are supported by reads that span splicing junctions and reads on target below 1% FDR. Functional enrichment analysis of genes belonging to these splicing events was performed using ClueGO31.
To confirm splicing events during lens development, we selected genes based on their potential relevance to lens biology and which were predicted with less than 5% FDR in our splicing analysis. For alternative splicing analysis, primers (listed in Table S12) were designed on exons flanking the alternatively spliced exon (skipped exon) on either side. Total RNA from E15.5, P0 and P10 C57Bl/6 mouse lens was collected as described above. RNA was treated with RNase free DNase (Qiagen Inc #79254, Valencia, CA). 200ng of lens total RNA was used as template for cDNA synthesis using in vitro reverse transcription kit as described earlier and cDNA was used as a template for PCR reactions. The different splice isoforms were identified based on size differences of PCR products separated by 1% agarose gel electrophoresis. We further analyzed the PCR products obtained using RNA from P0 lens by Sanger sequencing. The different splice isoform DNA bands from the P0 lens samples were excised from the gel and subjected to DNA purification using Wizard® SV Gel and PCR Clean-Up System (Promega #A9281, Madison, WI). DNA isolated from specific splice isoforms was sequenced by Sanger sequencing method.
This work was supported by the startup funds from School of Informatics and Computing, Indiana University Purdue University Indianapolis (SCJ), National Institute of General Medical Sciences under Award Number R01GM123314 (SCJ) and the National Eye Institute of the National Institutes of Health under Award Number R01EY021505 (SAL). SAL is a Pew Scholar in Biomedical Sciences.
R.S., G.B. and S.C.J. designed the study, implemented the computational approaches and performed the data analysis. R.S., S.C.J. and S.L. interpreted the data. S.D. and S.L. designed and validated the predicted targets experimentally. R.S., G.B., S.D. wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Rajneesh Srivastava and Gungor Budak contributed equally to this work.
Electronic supplementary material
Supplementary information accompanies this paper at doi:10.1038/s41598-017-10615-4
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.