Search tips
Search criteria

Results 1-25 (1173871)

Clipboard (0)

Related Articles

1.  Next-generation sequencing facilitates quantitative analysis of wild-type and Nrl−/− retinal transcriptomes 
Molecular Vision  2011;17:3034-3054.
Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT–PCR) methods and to evaluate protocols for optimal high-throughput data analysis.
Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl−/−) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows–Wheeler Aligner (BWA) followed by ANOVA (ANOVA) and TopHat followed by Cufflinks. qRT–PCR validation was performed using TaqMan and SYBR Green assays.
Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl−/− mice with BWA workflow and 34,115 transcripts with TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT–PCR. RNA-seq data had a linear relationship with qRT–PCR for more than four orders of magnitude and a goodness of fit (R2) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl−/− retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT–PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling.
Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions.
PMCID: PMC3233386  PMID: 22162623
2.  Inference of Isoforms from Short Sequence Reads 
Journal of Computational Biology  2011;18(3):305-321.
Due to alternative splicing events in eukaryotic species, the identification of mRNA isoforms (or splicing variants) is a difficult problem. Traditional experimental methods for this purpose are time consuming and cost ineffective. The emerging RNA-Seq technology provides a possible effective method to address this problem. Although the advantages of RNA-Seq over traditional methods in transcriptome analysis have been confirmed by many studies, the inference of isoforms from millions of short sequence reads (e.g., Illumina/Solexa reads) has remained computationally challenging. In this work, we propose a method to calculate the expression levels of isoforms and infer isoforms from short RNA-Seq reads using exon-intron boundary, transcription start site (TSS) and poly-A site (PAS) information. We first formulate the relationship among exons, isoforms, and single-end reads as a convex quadratic program, and then use an efficient algorithm (called IsoInfer) to search for isoforms. IsoInfer can calculate the expression levels of isoforms accurately if all the isoforms are known and infer novel isoforms from scratch. Our experimental tests on known mouse isoforms with both simulated expression levels and reads demonstrate that IsoInfer is able to calculate the expression levels of isoforms with an accuracy comparable to the state-of-the-art statistical method and a 60 times faster speed. Moreover, our tests on both simulated and real reads show that it achieves a good precision and sensitivity in inferring isoforms when given accurate exon-intron boundary, TSS, and PAS information, especially for isoforms whose expression levels are significantly high. The software is publicly available for free at∼jianxing/IsoInfer.html.
PMCID: PMC3123862  PMID: 21385036
alternative splicing; convex quadratic programming; deep sequencing; isoform inference; RNA-Seq
3.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome 
BMC Bioinformatics  2011;12:323.
RNA-Seq is revolutionizing the way transcript abundances are measured. A key challenge in transcript quantification from RNA-Seq data is the handling of reads that map to multiple genes or isoforms. This issue is particularly important for quantification with de novo transcriptome assemblies in the absence of sequenced genomes, as it is difficult to determine which transcripts are isoforms of the same gene. A second significant issue is the design of RNA-Seq experiments, in terms of the number of reads, read length, and whether reads come from one or both ends of cDNA fragments.
We present RSEM, an user-friendly software package for quantifying gene and isoform abundances from single-end or paired-end RNA-Seq data. RSEM outputs abundance estimates, 95% credibility intervals, and visualization files and can also simulate RNA-Seq data. In contrast to other existing tools, the software does not require a reference genome. Thus, in combination with a de novo transcriptome assembler, RSEM enables accurate transcript quantification for species without sequenced genomes. On simulated and real data sets, RSEM has superior or comparable performance to quantification methods that rely on a reference genome. Taking advantage of RSEM's ability to effectively use ambiguously-mapping reads, we show that accurate gene-level abundance estimates are best obtained with large numbers of short single-end reads. On the other hand, estimates of the relative frequencies of isoforms within single genes may be improved through the use of paired-end reads, depending on the number of possible splice forms for each gene.
RSEM is an accurate and user-friendly software tool for quantifying transcript abundances from RNA-Seq data. As it does not rely on the existence of a reference genome, it is particularly useful for quantification with de novo transcriptome assemblies. In addition, RSEM has enabled valuable guidance for cost-efficient design of quantification experiments with RNA-Seq, which is currently relatively expensive.
PMCID: PMC3163565  PMID: 21816040
4.  NSMAP: A method for spliced isoforms identification and quantification from RNA-Seq 
BMC Bioinformatics  2011;12:162.
The development of techniques for sequencing the messenger RNA (RNA-Seq) enables it to study the biological mechanisms such as alternative splicing and gene expression regulation more deeply and accurately. Most existing methods employ RNA-Seq to quantify the expression levels of already annotated isoforms from the reference genome. However, the current reference genome is very incomplete due to the complexity of the transcriptome which hiders the comprehensive investigation of transcriptome using RNA-Seq. Novel study on isoform inference and estimation purely from RNA-Seq without annotation information is desirable.
A Nonnegativity and Sparsity constrained Maximum APosteriori (NSMAP) model has been proposed to estimate the expression levels of isoforms from RNA-Seq data without the annotation information. In contrast to previous methods, NSMAP performs identification of the structures of expressed isoforms and estimation of the expression levels of those expressed isoforms simultaneously, which enables better identification of isoforms. In the simulations parameterized by two real RNA-Seq data sets, more than 77% expressed isoforms are correctly identified and quantified. Then, we apply NSMAP on two RNA-Seq data sets of myelodysplastic syndromes (MDS) samples and one normal sample in order to identify differentially expressed known and novel isoforms in MDS disease.
NSMAP provides a good strategy to identify and quantify novel isoforms without the knowledge of annotated reference genome which can further realize the potential of RNA-Seq technique in transcriptome analysis. NSMAP package is freely available at
PMCID: PMC3113944  PMID: 21575225
5.  Assessing the impact of human genome annotation choice on RNA-seq expression estimates 
BMC Bioinformatics  2013;14(Suppl 11):S8.
Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus.
We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.
Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
PMCID: PMC3816316  PMID: 24564364
6.  NPEBseq: nonparametric empirical bayesian-based procedure for differential expression analysis of RNA-seq data 
BMC Bioinformatics  2013;14:262.
RNA-seq, a massive parallel-sequencing-based transcriptome profiling method, provides digital data in the form of aligned sequence read counts. The comparative analyses of the data require appropriate statistical methods to estimate the differential expression of transcript variants across different cell/tissue types and disease conditions.
We developed a novel nonparametric empirical Bayesian-based approach (NPEBseq) to model the RNA-seq data. The prior distribution of the Bayesian model is empirically estimated from the data without any parametric assumption, and hence the method is “nonparametric” in nature. Based on this model, we proposed a method for detecting differentially expressed genes across different conditions. We also extended this method to detect differential usage of exons from RNA-seq data. The evaluation of NPEBseq on both simulated and publicly available RNA-seq datasets and comparison with three popular methods showed improved results for experiments with or without biological replicates.
NPEBseq can successfully detect differential expression between different conditions not only at gene level but also at exon level from RNA-seq datasets. In addition, NPEBSeq performs significantly better than current methods and can be applied to genome-wide RNA-seq datasets. Sample datasets and R package are available at
PMCID: PMC3765716  PMID: 23981227
7.  FDM: a graph-based statistical method to detect differential transcription using RNA-seq data 
Bioinformatics  2011;27(19):2633-2640.
Motivation: In eukaryotic cells, alternative splicing expands the diversity of RNA transcripts and plays an important role in tissue-specific differentiation, and can be misregulated in disease. To understand these processes, there is a great need for methods to detect differential transcription between samples. Our focus is on samples observed using short-read RNA sequencing (RNA-seq).
Methods: We characterize differential transcription between two samples as the difference in the relative abundance of the transcript isoforms present in the samples. The magnitude of differential transcription of a gene between two samples can be measured by the square root of the Jensen Shannon Divergence (JSD*) between the gene's transcript abundance vectors in each sample. We define a weighted splice-graph representation of RNA-seq data, summarizing in compact form the alignment of RNA-seq reads to a reference genome. The flow difference metric (FDM) identifies regions of differential RNA transcript expression between pairs of splice graphs, without need for an underlying gene model or catalog of transcripts. We present a novel non-parametric statistical test between splice graphs to assess the significance of differential transcription, and extend it to group-wise comparison incorporating sample replicates.
Results: Using simulated RNA-seq data consisting of four technical replicates of two samples with varying transcription between genes, we show that (i) the FDM is highly correlated with JSD* (r=0.82) when average RNA-seq coverage of the transcripts is sufficiently deep; and (ii) the FDM is able to identify 90% of genes with differential transcription when JSD* >0.28 and coverage >7. This represents higher sensitivity than Cufflinks (without annotations) and rDiff (MMD), which respectively identified 69 and 49% of the genes in this region as differential transcribed. Using annotations identifying the transcripts, Cufflinks was able to identify 86% of the genes in this region as differentially transcribed. Using experimental data consisting of four replicates each for two cancer cell lines (MCF7 and SUM102), FDM identified 1425 genes as significantly different in transcription. Subsequent study of the samples using quantitative real time polymerase chain reaction (qRT-PCR) of several differential transcription sites identified by FDM, confirmed significant differences at these sites.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3179659  PMID: 21824971
8.  Blind spots of quantitative RNA-seq: the limits for assessing abundance, differential expression, and isoform switching 
BMC Bioinformatics  2013;14:370.
RNA-seq is now widely used to quantitatively assess gene expression, expression differences and isoform switching, and promises to deliver results for the entire transcriptome. However, whether the transcriptional state of a gene can be captured accurately depends critically on library preparation, read alignment, expression estimation and the tests for differential expression and isoform switching. There are comparisons available for the individual steps but there is not yet a systematic investigation which specific genes are impacted by biases throughout the entire analysis workflow. It is especially unclear whether for a given gene, with current methods and protocols, expression changes and isoform switches can be detected.
For the human genes, we report their detectability under various conditions using different approaches. Overall, we find that the input material has the biggest influence and may, depending on the protocol and RNA degradation, exhibit already strong length-dependent over- and underrepresentation of transcripts. The alignment step aligns for 50% of the isoforms up to 99% of the reads correctly; only in the presence of transcript modifications mainly short isoforms will have a low alignment rate. In our dataset, we found that, depending on the aligner and the input material used, the expression estimation of up to 93% of the genes being accurate within a factor of two; with the deviations being due to ambiguous alignments. Detection of differential expression using a negative-binomial count model works reliably for our simulated data but is dependent on the count accuracy. Interestingly, using the fold-change instead of the p-value as a score for differential expression yields the same performance in the situation of three replicates and the true change being two-fold. Isoform switching is harder to detect and for at least 109 genes the isoform differences evade detection independent of the method used.
RNA-seq is a reliable tool but the repetitive nature of the human genome makes the origin of the reads ambiguous and limits the detectability for certain genes. RNA-seq does not equally well represent isoforms independent of their size which may range from ~200nt to ~100′000nt. Researchers are advised to verify that their target genes do not have extreme properties with respect to repeated regions, GC content, and isoform length and complexity.
PMCID: PMC3879183  PMID: 24365034
9.  SpliceTrap: a method to quantify alternative splicing under single cellular conditions 
Bioinformatics  2011;27(21):3010-3016.
Motivation: Alternative splicing (AS) is a pre-mRNA maturation process leading to the expression of multiple mRNA variants from the same primary transcript. More than 90% of human genes are expressed via AS. Therefore, quantifying the inclusion level of every exon is crucial for generating accurate transcriptomic maps and studying the regulation of AS.
Results: Here we introduce SpliceTrap, a method to quantify exon inclusion levels using paired-end RNA-seq data. Unlike other tools, which focus on full-length transcript isoforms, SpliceTrap approaches the expression-level estimation of each exon as an independent Bayesian inference problem. In addition, SpliceTrap can identify major classes of alternative splicing events under a single cellular condition, without requiring a background set of reads to estimate relative splicing changes. We tested SpliceTrap both by simulation and real data analysis, and compared it to state-of-the-art tools for transcript quantification. SpliceTrap demonstrated improved accuracy, robustness and reliability in quantifying exon-inclusion ratios.
Conclusions: SpliceTrap is a useful tool to study alternative splicing regulation, especially for accurate quantification of local exon-inclusion ratios from RNA-seq data.
Availability and Implementation: SpliceTrap can be implemented online through the CSH Galaxy server and is also available for download and installation at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3198574  PMID: 21896509
10.  Piecing the puzzle together: a revisit to transcript reconstruction problem in RNA-seq 
BMC Bioinformatics  2014;15(Suppl 9):S3.
The advancement of RNA sequencing (RNA-seq) has provided an unprecedented opportunity to assess both the diversity and quantity of transcript isoforms in an mRNA transcriptome. In this paper, we revisit the computational problem of transcript reconstruction and quantification. Unlike existing methods which focus on how to explain the exons and splice variants detected by the reads with a set of isoforms, we aim at reconstructing transcripts by piecing the reads into individual effective transcript copies. Simultaneously, the quantity of each isoform is explicitly measured by the number of assembled effective copies, instead of estimated solely based on the collective read count. We have developed a novel method named Astroid that solves the problem of effective copy reconstruction on the basis of a flow network. The RNA-seq reads are represented as vertices in the flow network and are connected by weighted edges that evaluate the likelihood of two reads originating from the same effective copy. A maximum likelihood set of transcript copies is then reconstructed by solving a minimum-cost flow problem on the flow network. Simulation studies on the human transcriptome have demonstrated the superior sensitivity and specificity of Astroid in transcript reconstruction as well as improved accuracy in transcript quantification over several existing approaches. The application of Astroid on two real RNA-seq datasets has further demonstrated its accuracy through high correlation between the estimated isoform abundance and the qRT-PCR validations.
PMCID: PMC4168703  PMID: 25252653
Transcript reconstruction; Transcript quantification; Transcriptome; RNA-seq
11.  Accurate quantification of transcriptome from RNA-Seq data by effective length normalization 
Nucleic Acids Research  2010;39(2):e9.
We propose a novel, efficient and intuitive approach of estimating mRNA abundances from the whole transcriptome shotgun sequencing (RNA-Seq) data. Our method, NEUMA (Normalization by Expected Uniquely Mappable Area), is based on effective length normalization using uniquely mappable areas of gene and mRNA isoform models. Using the known transcriptome sequence model such as RefSeq, NEUMA pre-computes the numbers of all possible gene-wise and isoform-wise informative reads: the former being sequences mapped to all mRNA isoforms of a single gene exclusively and the latter uniquely mapped to a single mRNA isoform. The results are used to estimate the effective length of genes and transcripts, taking experimental distributions of fragment size into consideration. Quantitative RT–PCR based on 27 randomly selected genes in two human cell lines and computer simulation experiments demonstrated superior accuracy of NEUMA over other recently developed methods. NEUMA covers a large proportion of genes and mRNA isoforms and offers a measure of consistency (‘consistency coefficient’) for each gene between an independently measured gene-wise level and the sum of the isoform levels. NEUMA is applicable to both paired-end and single-end RNA-Seq data. We propose that NEUMA could make a standard method in quantifying gene transcript levels from RNA-Seq data.
PMCID: PMC3025570  PMID: 21059678
12.  Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads 
Bioinformatics  2012;28(22):2914-2921.
Motivation: RNA-Seq uses the high-throughput sequencing technology to identify and quantify transcriptome at an unprecedented high resolution and low cost. However, RNA-Seq reads are usually not uniformly distributed and biases in RNA-Seq data post great challenges in many applications including transcriptome assembly and the expression level estimation of genes or isoforms. Much effort has been made in the literature to calibrate the expression level estimation from biased RNA-Seq data, but the effect of biases on transcriptome assembly remains largely unexplored.
Results: Here, we propose a statistical framework for both transcriptome assembly and isoform expression level estimation from biased RNA-Seq data. Using a quasi-multinomial distribution model, our method is able to capture various types of RNA-Seq biases, including positional, sequencing and mappability biases. Our experimental results on simulated and real RNA-Seq datasets exhibit interesting effects of RNA-Seq biases on both transcriptome assembly and isoform expression level estimation. The advantage of our method is clearly shown in the experimental analysis by its high sensitivity and precision in transcriptome assembly and the high concordance of its estimated expression levels with quantitative reverse transcription–polymerase chain reaction data.
Availability: CEM is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3496342  PMID: 23060617
13.  Estimation of Gene Expression at Isoform Level from mRNA-Seq Data by Bayesian Hierarchical Modeling 
Frontiers in Genetics  2012;3:239.
mRNA-Seq is a precise and highly reproducible technique for measurement of transcripts levels and yields sequence information of a transcriptome at a single nucleotide base-level thus enabling us to determine splice junctions and alternative splicing events with high confidence. Often analysis of mRNA-Seq data does not attempt to quantify the expressions at isoform level. In this paper our objective would be use the mRNA-Seq data to infer expression at isoform level, where splicing patterns of a gene is assumed to be known. A Bayesian latent variable based modeling framework is proposed here, where the parameterization enables us to infer at various levels. For example, expression variability of an isoform across different conditions; the model parameterization also allows us to carry out two-sample comparisons, e.g., using a Bayesian t-test, in addition simple presence or absence of an isoform can also be estimated by the use of the latent variables present in the model. In this paper we would carry out inference on isoform expression under different normalization techniques, since it has been recently shown that one of the most prominent sources of variation in differential call using mRNA-Seq data is the normalization method used. The statistical framework is developed for multiple isoforms and easily extends to reads mapping to multiple genes. This could be achieved by slight conceptual modifications in definitions of what we consider as a gene and what as an exon. Additionally proposed framework can be extended by appropriate modeling of the design matrix to infer about yet unknown novel transcripts. However such attempts should be made judiciously since the input date used in the proposed model does not use reads from splice junctions.
PMCID: PMC3536024  PMID: 23293650
mRNA-Seq; isoform expression; Bayesian latent variable modeling; multi-sample comparison; Bayesian t-test; spike-n-slab method
14.  Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM) 
Bioinformatics  2011;27(18):2518-2528.
Motivation: A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not been performed previously.
Results: We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription–polymerase chain reaction (RT–PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
Availability: The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (
Supplementary Information:The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
PMCID: PMC3167048  PMID: 21775302
15.  Unproductive alternative splicing and nonsense mRNAs: A widespread phenomenon among plant circadian clock genes 
Biology Direct  2012;7:20.
Recent mapping of eukaryotic transcriptomes and spliceomes using massively parallel RNA sequencing (RNA-seq) has revealed that the extent of alternative splicing has been considerably underestimated. Evidence also suggests that many pre-mRNAs undergo unproductive alternative splicing resulting in incorporation of in-frame premature termination codons (PTCs). The destinies and potential functions of the PTC-harboring mRNAs remain poorly understood. Unproductive alternative splicing in circadian clock genes presents a special case study because the daily oscillations of protein expression levels require rapid and steep adjustments in mRNA levels.
We conducted a systematic survey of alternative splicing of plant circadian clock genes using RNA-seq and found that many Arabidopsis thaliana circadian clock-associated genes are alternatively spliced. Results were confirmed using reverse transcription polymerase chain reaction (RT-PCR), quantitative RT-PCR (qRT-PCR), and/or Sanger sequencing. Intron retention events were frequently observed in mRNAs of the CCA1/LHY-like subfamily of MYB transcription factors. In contrast, the REVEILLE2 (RVE2) transcript was alternatively spliced via inclusion of a "poison cassette exon" (PCE). The PCE type events introducing in-frame PTCs are conserved in some mammalian and plant serine/arginine-rich splicing factors. For some circadian genes such as CCA1 the ratio of the productive isoform (i.e., a representative splice variant encoding the full-length protein) to its PTC counterpart shifted sharply under specific environmental stress conditions.
Our results demonstrate that unproductive alternative splicing is a widespread phenomenon among plant circadian clock genes that frequently generates mRNA isoforms harboring in-frame PTCs. Because LHY and CCA1 are core components of the plant central circadian oscillator, the conservation of alternatively spliced variants between CCA1 and LHY and for CCA1 across phyla [2] indicates a potential role of nonsense transcripts in regulation of circadian rhythms. Most of the alternatively spliced isoforms harbor in-frame PTCs that arise from full or partial intron retention events. However, a PTC in the RVE2 transcript is introduced through a PCE event. The conservation of AS events and modulation of the relative abundance of nonsense isoforms by environmental and diurnal conditions suggests possible regulatory roles for these alternatively spliced transcripts in circadian clock function. The temperature-dependent expression of the PTC transcripts among members of CCA1/LHY subfamily indicates that alternative splicing may be involved in regulation of the clock temperature compensation mechanism.
This article was reviewed by Dr. Eugene Koonin, Dr. Chungoo Park (nominated by Dr. Kateryna Makova), and Dr. Marcelo Yanovsky (nominated by Dr. Valerian Dolja).
PMCID: PMC3403997  PMID: 22747664
Arabidopsis thaliana; Alternative splicing; Circadian clock; RNA-seq; Intron retention; Cassette exon; Nonsense mRNAs; Premature termination codon; CIRCADIAN CLOCK ASSOCIATED 1 (CCA1); LATE ELONGATED HYPOCOTYL (LHY); REVEILLE 2 (RVE2).
16.  Systematically Differentiating Functions for Alternatively Spliced Isoforms through Integrating RNA-seq Data 
PLoS Computational Biology  2013;9(11):e1003314.
Integrating large-scale functional genomic data has significantly accelerated our understanding of gene functions. However, no algorithm has been developed to differentiate functions for isoforms of the same gene using high-throughput genomic data. This is because standard supervised learning requires ‘ground-truth’ functional annotations, which are lacking at the isoform level. To address this challenge, we developed a generic framework that interrogates public RNA-seq data at the transcript level to differentiate functions for alternatively spliced isoforms. For a specific function, our algorithm identifies the ‘responsible’ isoform(s) of a gene and generates classifying models at the isoform level instead of at the gene level. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. We identified genes in the mouse whose isoforms are predicted to have disparate functionalities and experimentally validated the ‘responsible’ isoforms using data from mammary tissue. With protein structure modeling and experimental evidence, we further validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions.
Author Summary
In mammalian genomes, a single gene can be alternatively spliced into multiple isoforms which greatly increase the functional diversity of the genome. In the human, more than 95% of multi-exon genes undergo alternative splicing. It is hard to computationally differentiate the functions for the splice isoforms of the same gene, because they are almost always annotated with the same functions and share similar sequences. In this paper, we developed a generic framework to identify the ‘responsible’ isoform(s) for each function that the gene carries out, and therefore predict functional assignment on the isoform level instead of on the gene level. Within this generic framework, we implemented and evaluated several related algorithms for isoform function prediction. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm represents the first effort to predict and differentiate isoforms through large-scale genomic data integration.
PMCID: PMC3820534  PMID: 24244129
17.  RNA-Seq gene expression estimation with read mapping uncertainty 
Bioinformatics  2009;26(4):493-500.
Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.
Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at
Supplementary information: Supplementary data are available at Bioinformatics on
PMCID: PMC2820677  PMID: 20022975
18.  Comparison of RNA-Seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling 
BMC Genomics  2014;15(1):419.
RNA sequencing (RNA-Seq) is often used for transcriptome profiling as well as the identification of novel transcripts and alternative splicing events. Typically, RNA-Seq libraries are prepared from total RNA using poly(A) enrichment of the mRNA (mRNA-Seq) to remove ribosomal RNA (rRNA), however, this method fails to capture non-poly(A) transcripts or partially degraded mRNAs. Hence, a mRNA-Seq protocol will not be compatible for use with RNAs coming from Formalin-Fixed and Paraffin-Embedded (FFPE) samples.
To address the desire to perform RNA-Seq on FFPE materials, we evaluated two different library preparation protocols that could be compatible for use with small RNA fragments. We obtained paired Fresh Frozen (FF) and FFPE RNAs from multiple tumors and subjected these to different gene expression profiling methods. We tested 11 human breast tumor samples using: (a) FF RNAs by microarray, mRNA-Seq, Ribo-Zero-Seq and DSN-Seq (Duplex-Specific Nuclease) and (b) FFPE RNAs by Ribo-Zero-Seq and DSN-Seq. We also performed these different RNA-Seq protocols using 10 TCGA tumors as a validation set.
The data from paired RNA samples showed high concordance in transcript quantification across all protocols and between FF and FFPE RNAs. In both FF and FFPE, Ribo-Zero-Seq removed rRNA with comparable efficiency as mRNA-Seq, and it provided an equivalent or less biased coverage on gene 3′ ends. Compared to mRNA-Seq where 69% of bases were mapped to the transcriptome, DSN-Seq and Ribo-Zero-Seq contained significantly fewer reads mapping to the transcriptome (20-30%); in these RNA-Seq protocols, many if not most reads mapped to intronic regions. Approximately 14 million reads in mRNA-Seq and 45–65 million reads in Ribo-Zero-Seq or DSN-Seq were required to achieve the same gene detection levels as a standard Agilent DNA microarray.
Our results demonstrate that compared to mRNA-Seq and microarrays, Ribo-Zero-Seq provides equivalent rRNA removal efficiency, coverage uniformity, genome-based mapped reads, and consistently high quality quantification of transcripts. Moreover, Ribo-Zero-Seq and DSN-Seq have consistent transcript quantification using FFPE RNAs, suggesting that RNA-Seq can be used with FFPE-derived RNAs for gene expression profiling.
Electronic supplementary material
The online version of this article (doi: 10.1186/1471-2164-15-419) contains supplementary material, which is available to authorized users.
PMCID: PMC4070569  PMID: 24888378
RNA sequencing; FFPE; RNA depletion; Ribo-zero; Gene expression; Microarray
19.  RNASEQR—a streamlined and accurate RNA-seq sequence analysis program 
Nucleic Acids Research  2011;40(6):e42.
Next-generation sequencing (NGS) technologies-based transcriptomic profiling method often called RNA-seq has been widely used to study global gene expression, alternative exon usage, new exon discovery, novel transcriptional isoforms and genomic sequence variations. However, this technique also poses many biological and informatics challenges to extracting meaningful biological information. The RNA-seq data analysis is built on the foundation of high quality initial genome localization and alignment information for RNA-seq sequences. Toward this goal, we have developed RNASEQR to accurately and effectively map millions of RNA-seq sequences. We have systematically compared RNASEQR with four of the most widely used tools using a simulated data set created from the Consensus CDS project and two experimental RNA-seq data sets generated from a human glioblastoma patient. Our results showed that RNASEQR yields more accurate estimates for gene expression, complete gene structures and new transcript isoforms, as well as more accurate detection of single nucleotide variants (SNVs). RNASEQR analyzes raw data from RNA-seq experiments effectively and outputs results in a manner that is compatible with a wide variety of specialized downstream analyses on desktop computers.
PMCID: PMC3315322  PMID: 22199257
20.  A Robust Method for Transcript Quantification with RNA-Seq Data 
Journal of Computational Biology  2013;20(3):167-187.
The advent of high throughput RNA-seq technology allows deep sampling of the transcriptome, making it possible to characterize both the diversity and the abundance of transcript isoforms. Accurate abundance estimation or transcript quantification of isoforms is critical for downstream differential analysis (e.g., healthy vs. diseased cells) but remains a challenging problem for several reasons. First, while various types of algorithms have been developed for abundance estimation, short reads often do not uniquely identify the transcript isoforms from which they were sampled. As a result, the quantification problem may not be identifiable, i.e., lacks a unique transcript solution even if the read maps uniquely to the reference genome. In this article, we develop a general linear model for transcript quantification that leverages reads spanning multiple splice junctions to ameliorate identifiability. Second, RNA-seq reads sampled from the transcriptome exhibit unknown position-specific and sequence-specific biases. We extend our method to simultaneously learn bias parameters during transcript quantification to improve accuracy. Third, transcript quantification is often provided with a candidate set of isoforms, not all of which are likely to be significantly expressed in a given tissue type or condition. By resolving the linear system with LASSO, our approach can infer an accurate set of dominantly expressed transcripts while existing methods tend to assign positive expression to every candidate isoform. Using simulated RNA-seq datasets, our method demonstrated better quantification accuracy and the inference of dominant set of transcripts than existing methods. The application of our method on real data experimentally demonstrated that transcript quantification is effective for differential analysis of transcriptomes.
PMCID: PMC3590898  PMID: 23461570
transcriptome; transcript quantification; RNA-seq
21.  Bias detection and correction in RNA-Sequencing data 
BMC Bioinformatics  2011;12:290.
High throughput sequencing technology provides us unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and ability to identify novel transcripts. Moreover, for genes with multiple isoforms, expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work revealed that base level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear though how the base level read count bias may affect gene level expression estimates.
In this paper, by using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased in terms of gene length, GC content and dinucleotide frequencies. We directly examined the biases at the gene-level, and proposed a simple generalized-additive-model based approach to correct different sources of biases simultaneously. Compared to previously proposed base level correction methods, our method reduces bias in gene-level expression estimates more effectively.
Our method identifies and corrects different sources of biases in gene-level expression measures from RNA-Seq data, and provides more accurate estimates of gene expression levels from RNA-Seq. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.
PMCID: PMC3149584  PMID: 21771300
22.  DiffSplice: the genome-wide detection of differential splicing events with RNA-seq 
Nucleic Acids Research  2012;41(2):e39.
The RNA transcriptome varies in response to cellular differentiation as well as environmental factors, and can be characterized by the diversity and abundance of transcript isoforms. Differential transcription analysis, the detection of differences between the transcriptomes of different cells, may improve understanding of cell differentiation and development and enable the identification of biomarkers that classify disease types. The availability of high-throughput short-read RNA sequencing technologies provides in-depth sampling of the transcriptome, making it possible to accurately detect the differences between transcriptomes. In this article, we present a new method for the detection and visualization of differential transcription. Our approach does not depend on transcript or gene annotations. It also circumvents the need for full transcript inference and quantification, which is a challenging problem because of short read lengths, as well as various sampling biases. Instead, our method takes a divide-and-conquer approach to localize the difference between transcriptomes in the form of alternative splicing modules (ASMs), where transcript isoforms diverge. Our approach starts with the identification of ASMs from the splice graph, constructed directly from the exons and introns predicted from RNA-seq read alignments. The abundance of alternative splicing isoforms residing in each ASM is estimated for each sample and is compared across sample groups. A non-parametric statistical test is applied to each ASM to detect significant differential transcription with a controlled false discovery rate. The sensitivity and specificity of the method have been assessed using simulated data sets and compared with other state-of-the-art approaches. Experimental validation using qRT-PCR confirmed a selected set of genes that are differentially expressed in a lung differentiation study and a breast cancer data set, demonstrating the utility of the approach applied on experimental biological data sets. The software of DiffSplice is available at
PMCID: PMC3553996  PMID: 23155066
23.  Increased IR-A/IR-B ratio in non-small cell lung cancers associates with lower epithelial-mesenchymal transition signature and longer survival in squamous cell lung carcinoma 
BMC Cancer  2014;14:131.
To evaluate the insulin receptor isoform mRNA expression status in non-small cell lung cancer (NSCLC) patients.
RNA-seq data from 614 NSCLC [355 adenocarcinomas (LUAD) and 259 squamous cell carcinomas (LUSC)] and 92 normal lung specimens were obtained from The Cancer Genome Atlas (TCGA) to evaluate the mRNA expression of insulin receptor isoform A (IR-A) and insulin receptor isoform B (IR-B). The differential expression status of the insulin receptor isoforms in NSCLC patients was confirmed using qRT-PCR assays with lung cancer cDNA arrays and primary tumor samples.
The mRNA expression levels of IR-B were significantly lower in some NSCLC samples compared to normal lung specimens, including both LUAD and LUSC. Notably, no IR-B transcripts were detected - only the IR-A isoform was expressed in 11% of NSCLC patients. This decrease in IR-B expression contributed to an elevated IR-A/IR-B ratio, which was also associated with lower epithelial-mesenchymal transition gene signatures in NSCLC and longer patient survival under standard of care in LUSC. In addition to NSCLC, RNA-seq data from TCGA revealed a similar increase in IR-A/IR-B ratio in many other cancer types, with high prevalence in acute myeloid leukemia, glioblastoma multiforme, and brain lower grade glioma.
Our results indicate a common reduction of the mRNA expression level of IR-B and an increased IR-A/IR-B mRNA ratio in NSCLC and other tumor types. The relationship of altered IR-A/IR-B ratios with cancer progression and patient survival should be prospectively explored in future studies.
PMCID: PMC3974050  PMID: 24571613
24.  Cell-type specific analysis of translating RNAs in developing flowers reveals new levels of control 
Combining translating ribosome affinity purification with RNA-seq for cell-specific profiling of translating RNAs in developing flowers.Cell type comparisons of cell type-specific hormone responses, promoter motifs, coexpressed cognate binding factor candidates, and splicing isoforms.Widespread post-transcriptional regulation at both the intron splicing and translational stages.A new class of noncoding RNAs associated with polysomes.
What constitutes a differentiated cell type? How much do cell types differ in their transcription of genes? The development and functions of tissues rely on constant interactions among distinct and nonequivalent cell types. Answering these questions will require quantitative information on transcriptomes, proteomes, protein–protein interactions, protein–nucleic acid interactions, and metabolomes at cellular resolution. The systems approaches emerging in biology promise to explain properties of biological systems based on genome-wide measurements of expression, interaction, regulation, and metabolism. To facilitate a systems approach, it is essential first to capture such components in a global manner, ideally at cellular resolution.
Recently, microarray analysis of transcriptomes has been extended to a cellular level of resolution by using laser microdissection or fluorescence-activated sorting (for review, see Nelson et al, 2008). These methods have been limited by stresses associated with cellular separation and isolation procedures, and biases associated with mandatory RNA amplification steps. A newly developed method, translating ribosome affinity purification (TRAP; Zanetti et al, 2005; Heiman et al, 2008; Mustroph et al, 2009), circumvents these problems by epitopetagging a ribosomal protein in specific cellular domains to selectively purify polysomes. We combined TRAP with deep sequencing, which we term TRAP-seq, to provide cell-level spatiotemporal maps for Arabidopsis early floral development at single-base resolution.
Flower development in Arabidopsis has been studied extensively and is one of the best understood aspects of plant development (for review, see Krizek and Fletcher, 2005). Genetic analysis of homeotic mutants established the ABC model, in which three classes of regulatory genes, A, B and C, work in a combinatorial manner to confer organ identities of four whorls (Coen and Meyerowitz, 1991). Each class of regulatory gene is expressed in a specific and evolutionarily conserved domain, and the action of the class A, B and C genes is necessary for specification of organ identity (Figure 1A).
Using TRAP-seq, we purified cell-specific translating mRNA populations, which we and others call the translatome, from the A, B and C domains of early developing flowers, in which floral patterning and the specification of floral organs is established. To achieve temporal specificity, we used a floral induction system to facilitate collection of early stage flowers (Wellmer et al, 2006). The combination of TRAP-seq with domain-specific promoters and this floral induction system enabled fine spatiotemporal isolation of translating mRNA in specific cellular domains, and at specific developmental stages.
Multiple lines of evidence confirmed the specificity of this approach, including detecting the expression in expected domains but not in other domains for well-studied flower marker genes and known physiological functions (Figures 1B–D and 2A–C). Furthermore, we provide numerous examples from flower development in which a spatiotemporal map of rigorously comparable cell-specific translatomes makes possible new views of the properties of cell domains not evident in data obtained from whole organs or tissues, including patterns of transcription and cis-regulation, new physiological differences among cell domains and between flower stages, putative hormone-active centers, and splicing events specific for flower domains (Figure 2A–D). Such findings may provide new targets for reverse genetics studies and may aid in the formulation and validation of interaction and pathway networks.
Beside cellular heterogeneity, the transcriptome is regulated at several steps through the life of mRNA molecules, which are not directly available through traditional transcriptome profiling of total mRNA abundance. By comparing the translatome and transcriptome, we integratively profiled two key posttranscriptional control points, intron splicing and translation state. From our translatome-wide profiling, we (i) confirmed that both posttranscriptional regulation control points were used by a large portion of the transcriptome; (ii) identified a number of cis-acting features within the coding or noncoding sequences that correlate with splicing or translation state; and (iii) revealed correlation between each regulation mechanism and gene function. Our transcriptome-wide surveys have highlighted target genes transcripts of which are probably under extensive posttranscriptional regulation during flower development.
Finally, we reported the finding of a large number of polysome-associated ncRNAs. About one-third of all annotated ncRNA in the Arabidopsis genome were observed co-purified with polysomes. Coding capacity analysis confirmed that most of them are real ncRNA without conserved ORFs. The group of polysome-associated ncRNA reported in this study is a potential new addition to the expanding riboregulator catalog; they could have roles in translational regulation during early flower development.
Determining both the expression levels of mRNA and the regulation of its translation is important in understanding specialized cell functions. In this study, we describe both the expression profiles of cells within spatiotemporal domains of the Arabidopsis thaliana flower and the post-transcriptional regulation of these mRNAs, at nucleotide resolution. We express a tagged ribosomal protein under the promoters of three master regulators of flower development. By precipitating tagged polysomes, we isolated cell type-specific mRNAs that are probably translating, and quantified those mRNAs through deep sequencing. Cell type comparisons identified known cell-specific transcripts and uncovered many new ones, from which we inferred cell type-specific hormone responses, promoter motifs and coexpressed cognate binding factor candidates, and splicing isoforms. By comparing translating mRNAs with steady-state overall transcripts, we found evidence for widespread post-transcriptional regulation at both the intron splicing and translational stages. Sequence analyses identified structural features associated with each step. Finally, we identified a new class of noncoding RNAs associated with polysomes. Findings from our profiling lead to new hypotheses in the understanding of flower development.
PMCID: PMC2990639  PMID: 20924354
Arabidopsis; flower; intron; transcriptome; translation
25.  Computational approaches for isoform detection and estimation: good and bad news 
BMC Bioinformatics  2014;15:135.
The main goal of the whole transcriptome analysis is to correctly identify all expressed transcripts within a specific cell/tissue - at a particular stage and condition - to determine their structures and to measure their abundances. RNA-seq data promise to allow identification and quantification of transcriptome at unprecedented level of resolution, accuracy and low cost. Several computational methods have been proposed to achieve such purposes. However, it is still not clear which promises are already met and which challenges are still open and require further methodological developments.
We carried out a simulation study to assess the performance of 5 widely used tools, such as: CEM, Cufflinks, iReckon, RSEM, and SLIDE. All of them have been used with default parameters. In particular, we considered the effect of the following three different scenarios: the availability of complete annotation, incomplete annotation, and no annotation at all. Moreover, comparisons were carried out using the methods in three different modes of action. In the first mode, the methods were forced to only deal with those isoforms that are present in the annotation; in the second mode, they were allowed to detect novel isoforms using the annotation as guide; in the third mode, they were operating in fully data driven way (although with the support of the alignment on the reference genome). In the latter modality, precision and recall are quite poor. On the contrary, results are better with the support of the annotation, even though it is not complete. Finally, abundance estimation error often shows a very skewed distribution. The performance strongly depends on the true real abundance of the isoforms. Lowly (and sometimes also moderately) expressed isoforms are poorly detected and estimated. In particular, lowly expressed isoforms are identified mainly if they are provided in the original annotation as potential isoforms.
Both detection and quantification of all isoforms from RNA-seq data are still hard problems and they are affected by many factors. Overall, the performance significantly changes since it depends on the modes of action and on the type of available annotation. Results obtained using complete or partial annotation are able to detect most of the expressed isoforms, even though the number of false positives is often high. Fully data driven approaches require more attention, at least for complex eucaryotic genomes. Improvements are desirable especially for isoform quantification and for isoform detection with low abundance.
PMCID: PMC4098781  PMID: 24885830

Results 1-25 (1173871)