Related Articles

1.  Transcribed dark matter: meaning or myth? 
Human Molecular Genetics  2010;19(R2):R162-R168.
Genomic tiling arrays, cDNA sequencing and, more recently, RNA-Seq have provided initial insights into the extent and depth of transcribed sequence across human and other genomes. These methods have led to greatly improved annotations of protein-coding genes, but have also identified transcription outside of annotated exons. One resultant issue that has aroused dispute is the balance of transcription of known exons against transcription outside of known exons. While non-genic ‘dark matter’ transcription was found by tiling arrays to be pervasive, it was seen to contribute only a small percentage of the polyadenylated transcriptome in some RNA-Seq experiments. This apparent contradiction has been compounded by a lack of clarity about what exactly constitutes a protein-coding gene. It remains unclear, for example, whether or not all transcripts that overlap on either strand within a genomic locus should be assigned to a single gene locus, including those that fail to share promoters, exons and splice junctions. The inability of tiling arrays and RNA-Seq to count transcripts, rather than exons or exon pairs, adds to these difficulties. While there is agreement that thousands of apparently non-coding loci are present outside of protein-coding genes in the human genome, there is vigorous debate of what constitutes evidence for their functionality. These issues will only be resolved upon the demonstration, or otherwise, that organismal or cellular phenotypes frequently result when non-coding RNA loci are disrupted.
doi:10.1093/hmg/ddq362
PMCID: PMC2953743  PMID: 20798109
2.  An approach to comparing tiling array and high throughput sequencing technologies for genomic transcript mapping 
BMC Research Notes  2009;2:150.
Background
There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and their promise. Consequently, the question of the moment is how these two technologies compare. Here we attempt to develop an approach for a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data.
Findings
This comparison is a challenging task because the sequencing data are discrete while the tiling array data are continuous. We use the published rice and Arabidopsis datasets, which provide the currently best-matched sets of arrays and sequencing experiments using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% in rice and Arabidopsis, respectively. However, when we do the analysis in detail, we find that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and then look at their intensity distribution, we see that it is very similar to that of exons. Furthermore, restricting our comparison to only protein-coding gene loci revealed a very good overlap between the two technologies.
Conclusion
Our approach to compare genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. This overlap is distorted by the scoring and thresholding in the tiling array scoring procedure.
doi:10.1186/1756-0500-2-150
PMCID: PMC2764720  PMID: 19630981
3.  Most “Dark Matter” Transcripts Are Associated With Known Genes 
PLoS Biology  2010;8(5):e1000371.
Short-read RNA sequencing in mouse and human tissues shows that most transcripts are encoded within or nearby known genes and that most of the genome is not transcribed.
A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance is generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.
Author Summary
The human genome was sequenced a decade ago, but its exact gene composition remains a subject of debate. The number of protein-coding genes is much lower than initially expected, and the number of distinct transcripts is much larger than the number of protein-coding genes. Moreover, the proportion of the genome that is transcribed in any given cell type remains an open question: results from “tiling” microarray analyses suggest that transcription is pervasive and that most of the genome is transcribed, whereas new deep sequencing-based methods suggest that most transcripts originate from known genes. We have addressed this discrepancy by comparing samples from the same tissues using both technologies. Our analyses indicate that RNA sequencing appears more reliable for transcripts with low expression levels, that most transcripts correspond to known genes or are near known genes, and that many transcripts may represent new exons or aberrant products of the transcription process. We also identify several thousand small transcripts that map outside known genes; their sequences are often conserved and are often encoded in regions of open chromatin. We propose that most of these transcripts may be by-products of the activity of enhancers, which associate with promoters as part of their role as long-range gene regulatory sites. Overall, however, we find that most of the genome is not appreciably transcribed.
doi:10.1371/journal.pbio.1000371
PMCID: PMC2872640  PMID: 20502517
4.  Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome 
Nucleic Acids Research  2013;42(5):2820-2832.
Recent sequencing technologies that allow massive parallel production of short reads are the method of choice for transcriptome analysis. Particularly, digital gene expression (DGE) technologies produce a large dynamic range of expression data by generating short tag signatures for each cell transcript. These tags can be mapped back to a reference genome to identify new transcribed regions that can be further covered by RNA-sequencing (RNA-Seq) reads. Here, we applied an integrated bioinformatics approach that combines DGE tags, RNA-Seq, tiling array expression data and species-comparison to explore new transcriptional regions and their specific biological features, particularly tissue expression or conservation. We analysed tags from a large DGE data set (designated as ‘TranscriRef’). We then annotated 750 000 tags that were uniquely mapped to the human genome according to Ensembl. We retained transcripts originating from both DNA strands and categorized tags corresponding to protein-coding genes, antisense, intronic- or intergenic-transcribed regions and computed their overlap with annotated non-coding transcripts. Using this bioinformatics approach, we identified ∼34 000 novel transcribed regions located outside the boundaries of known protein-coding genes. As demonstrated using sequencing data from human pluripotent stem cells for biological validation, the method could be easily applied for the selection of tissue-specific candidate transcripts. DigitagCT is available at http://cractools.gforge.inria.fr/softwares/digitagct.
doi:10.1093/nar/gkt1300
PMCID: PMC3950697  PMID: 24357408
5.  Support vector machines-based identification of alternative splicing in Arabidopsis thaliana from whole-genome tiling arrays 
BMC Bioinformatics  2011;12:55.
Background
Alternative splicing (AS) is a process which generates several distinct mRNA isoforms from the same gene by splicing different portions out of the precursor transcript. Due to the (patho-)physiological importance of AS, a complete inventory of AS is of great interest. While this is in reach for human and mammalian model organisms, our knowledge of AS in plants has remained more incomplete. Experimental approaches for monitoring AS are either based on transcript sequencing or rely on hybridization to DNA microarrays. Among the microarray platforms facilitating the discovery of AS events, tiling arrays are well-suited for identifying intron retention, the most prevalent type of AS in plants. However, analyzing tiling array data is challenging, because of high noise levels and limited probe coverage.
Results
In this work, we present a novel method to detect intron retentions (IR) and exon skips (ES) from tiling arrays. While statistical tests have typically been proposed for this purpose, our method instead utilizes support vector machines (SVMs) which are appreciated for their accuracy and robustness to noise. Existing EST and cDNA sequences served for supervised training and evaluation. Analyzing a large collection of publicly available microarray and sequence data for the model plant A. thaliana, we demonstrated that our method is more accurate than existing approaches. The method was applied in a genome-wide screen which resulted in the discovery of 1,355 IR events. A comparison of these IR events to the TAIR annotation and a large set of short-read RNA-seq data showed that 830 of the predicted IR events are novel and that 525 events (39%) overlap with either the TAIR annotation or the IR events inferred from the RNA-seq data.
Conclusions
The method developed in this work expands the scarce repertoire of analysis tools for the identification of alternative mRNA splicing from whole-genome tiling arrays. Our predictions are highly enriched with known AS events and complement the A. thaliana genome annotation with respect to AS. Since all predicted AS events can be precisely attributed to experimental conditions, our work provides a basis for follow-up studies focused on the elucidation of the regulatory mechanisms underlying tissue-specific and stress-dependent AS in plants.
doi:10.1186/1471-2105-12-55
PMCID: PMC3051901  PMID: 21324185
6.  High resolution transcriptome maps for wild-type and nonsense-mediated decay-defective Caenorhabditis elegans 
Genome Biology  2009;10(9):R101.
The high-resolution transcriptome of wild-type and nonsense-mediated decay (NMD)-defective C. elegans during development reveals insights into the NMD pathway and its role in development.
Background
While many genome sequences are complete, transcriptomes are less well characterized. We used both genome-scale tiling arrays and massively parallel sequencing to map the Caenorhabditis elegans transcriptome across development. We utilized this framework to identify transcriptome changes in animals lacking the nonsense-mediated decay (NMD) pathway.
Results
We find that while the majority of detectable transcripts map to known gene structures, >5% of transcribed regions fall outside current gene annotations. We show that >40% of these are novel exons. Using both technologies to assess isoform complexity, we estimate that >17% of genes change isoform across development. Next we examined how the transcriptome is perturbed in animals lacking NMD. NMD prevents expression of truncated proteins by degrading transcripts containing premature termination codons. We find that approximately 20% of genes produce transcripts that appear to be NMD targets. While most of these arise from splicing errors, NMD targets are enriched for transcripts containing open reading frames upstream of the predicted translational start (uORFs). We identify a relationship between the Kozak consensus surrounding the true start codon and the degree to which uORF-containing transcripts are targeted by NMD and speculate that translational efficiency may be coupled to transcript turnover via the NMD pathway for some transcripts.
Conclusions
We generated a high-resolution transcriptome map for C. elegans and used it to identify endogenous targets of NMD. We find that these transcripts arise principally through splicing errors, strengthening the prevailing view that splicing and NMD are highly interlinked processes.
doi:10.1186/gb-2009-10-9-r101
PMCID: PMC2768976  PMID: 19778439
7.  Use of cDNA Tiling Arrays for Identifying Protein Interactions Selected by In Vitro Display Technologies 
PLoS ONE  2008;3(2):e1646.
In vitro display technologies such as mRNA display are powerful screening tools for protein interaction analysis, but the final cloning and sequencing processes represent a bottleneck, resulting in many false negatives. Here we describe an application of tiling array technology to identify specifically binding proteins selected with the in vitro virus (IVV) mRNA display technology. We constructed transcription-factor tiling (TFT) arrays containing ∼1,600 open reading frame sequences of known and predicted mouse transcription-regulatory factors (334,372 oligonucleotides, 50-mer in length) to analyze cDNA fragments from mRNA-display screening for Jun-associated proteins. The use of the TFT arrays greatly increased the coverage of known Jun-interactors to 28% (from 14% with the cloning and sequencing approach), without reducing the accuracy (∼75%). This method could detect even targets with extremely low expression levels (less than a single mRNA copy per cell in whole brain tissue). This highly sensitive and reliable method should be useful for high-throughput protein interaction analysis on a genome-wide scale.
doi:10.1371/journal.pone.0001646
PMCID: PMC2241667  PMID: 18286201
8.  Comments on sequence normalization of tiling array expression 
Bioinformatics  2009;25(17):2171-2173.
Motivation: Methods to improve tiling array expression signals are needed to accurately detect genome features. Royce et al. provide statistical normalizations of tile signal based on probe sequence content that promise improved accuracy and should be independently verified.
Results: Assessment of the sequence content normalization methods identified a problem: confounding of probe sequence content with gene structure (intron/exon) sequence content. Normalization obscured tile signal changes at gene structure boundaries. This and other evidence suggests that simple sequence normalization does not improve detection of genes from tile expression data.
Availability: http://wfleabase.org/genome-summaries/tile-expression/tileseqnorms/
Contact: gilbertd@indiana.edu
doi:10.1093/bioinformatics/btp389
PMCID: PMC2800354  PMID: 19578171
9.  Custom Design and Analysis of High-Density Oligonucleotide Bacterial Tiling Microarrays 
PLoS ONE  2009;4(6):e5943.
Background
High-density tiling microarrays are a powerful tool for the characterization of complete genomes. The two major computational challenges associated with custom-made arrays are design and analysis. Firstly, several genome dependent variables, such as the genome's complexity and sequence composition, need to be considered in the design to ensure a high quality microarray. Secondly, since tiling projects today very often exceed the limits of conventional array-experiments, researchers cannot use established computer tools designed for commercial arrays, and instead have to redesign previous methods or create novel tools.
Principal Findings
Here we describe the multiple aspects involved in the design of tiling arrays for transcriptome analysis and detail the normalisation and analysis procedures for such microarrays. We introduce a novel design method to make two 280,000 feature microarrays covering the entire genome of the bacterial species Escherichia coli and Neisseria meningitidis, respectively, as well as the use of multiple copies of control probe-sets on tiling microarrays. Furthermore, a novel normalisation and background estimation procedure for tiling arrays is presented along with a method for array analysis focused on detection of short transcripts. The design, normalisation and analysis methods have been applied in various experiments and several of the detected novel short transcripts have been biologically confirmed by Northern blot tests.
Conclusions
Tiling-arrays are becoming increasingly applicable in genomic research, but researchers still lack both the tools for custom design of arrays, as well as the systems and procedures for analysis of the vast amount of data resulting from such experiments. We believe that the methods described herein will be a useful contribution and resource for researchers designing and analysing custom tiling arrays for both bacteria and higher organisms.
doi:10.1371/journal.pone.0005943
PMCID: PMC2691959  PMID: 19536279
10.  A RecA-mediated exon profiling method 
Nucleic Acids Research  2006;34(13):e97.
We have developed a simple, rapid and scalable RecA-mediated method for identifying novel alternatively spliced full-length cDNA candidates. This method is based on the principle that RecA proteins carry radioisotope-labeled probe DNAs to their homologous sequences, forming triplexes. The resulting complex is easily detected by its mobility difference on electrophoresis. We applied this exon profiling method to four selected mouse genes as a feasibility study. To design probes for detection, information on known exonic regions was extracted from the public database RefSeq. For potentially transcribed novel exonic regions, an RNA mapping experiment using an Affymetrix tiling array was performed. As a result, we were able to identify alternative splice variants of Thioredoxin domain containing 5, Interleukin 1β, Interleukin 1 family 6 and glutamine-rich hypothetical protein. In addition, full-length sequencing demonstrated that our method could profile exon structures with >90% accuracy. This reliable method allows effective screening for novel splice variants in a very large set of cDNA clones.
doi:10.1093/nar/gkl497
PMCID: PMC1540731  PMID: 16896013
11.  A comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species 
Nucleic Acids Research  2010;39(2):578-588.
RNA-Seq has emerged as a revolutionary technology for transcriptome analysis. In this article, we report a systematic comparison of RNA-Seq and high-density exon array for detecting differential gene expression between closely related species. On a panel of human/chimpanzee/rhesus cerebellum RNA samples previously examined by the high-density human exon junction array (HJAY) and real-time qPCR, we generated 48.68 million RNA-Seq reads. Our results indicate that RNA-Seq has significantly improved gene coverage and increased sensitivity for differentially expressed genes compared with the high-density HJAY array. Meanwhile, we observed a systematic increase in the RNA-Seq error rate for lowly expressed genes. Specifically, between-species DEGs detected by array/qPCR but missed by RNA-Seq were characterized by relatively low expression levels, as indicated by lower RNA-Seq read counts, lower HJAY array expression indices and higher qPCR raw cycle threshold values. Furthermore, this issue was not unique to between-species comparisons of gene expression. In the RNA-Seq analysis of MicroArray Quality Control human reference RNA samples with extensive qPCR data, we also observed an increase in both the false-negative rate and the false-positive rate for lowly expressed genes. These findings have important implications for the design and data interpretation of RNA-Seq studies on gene expression differences between and within species.
doi:10.1093/nar/gkq817
PMCID: PMC3025565  PMID: 20864445
12.  Microarray-Based Capture of Novel Expressed Cell Type–Specific Transfrags (CoNECT) to Annotate Tissue-Specific Transcription in Drosophila melanogaster 
G3: Genes|Genomes|Genetics  2012;2(8):873-882.
Faithful annotation of tissue-specific transcript isoforms is important not only to understand how genes are organized and regulated but also to identify potential novel, unannotated exons of genes, which may be additional targets of mutation in disease states or while performing mutagenic screens. We have developed a microarray enrichment methodology followed by long-read, next-generation sequencing for identification of unannotated transcript isoforms expressed in two Drosophila tissues, the ovary and the testis. Even with limited sequencing, these studies have identified a large number of novel transcription units, including 5′ exons and extensions, 3′ exons and extensions, internal exons and exon extensions, gene fusions, and both germline-specific splicing events and promoters. Additionally, comparing our capture dataset with tiling array and traditional RNA-seq analysis, we demonstrate that our enrichment strategy is able to capture low-abundance transcripts that cannot readily be identified by the other strategies. Finally, we show that our methodology can help identify transcriptional signatures of minority cell types within the ovary that would otherwise be difficult to reveal without the CoNECT enrichment strategy. These studies introduce an efficient methodology for cataloging tissue-specific transcriptomes in which specific classes of genes or transcripts can be targeted for capture and sequence, thus reducing the significant sequencing depth normally required for accurate annotation.
doi:10.1534/g3.112.003194
PMCID: PMC3411243  PMID: 22908036
transcriptome; array capture; enrichment; transcript isoforms
13.  The Developmental Transcriptome of Drosophila melanogaster 
Nature  2010;471(7339):473-479.
Drosophila melanogaster is one of the most well-studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons, and RNA editing sites. Full discovery and annotation are prerequisites for understanding how the regulation of transcription, splicing, and RNA editing directs development of this complex organism. We used RNA-Seq, tiling microarrays, and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. Together, these data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.
doi:10.1038/nature09715
PMCID: PMC3075879  PMID: 21179090
14.  RNA-Seq and find: entering the RNA deep field 
Genome Medicine  2011;3(11):74.
Initial high-throughput RNA sequencing (RNA-Seq) experiments have revealed a complex and dynamic transcriptome, but because it samples transcripts in proportion to their abundances, assessing the extent and nature of low-level transcription using this technique has been difficult. A new assay, RNA CaptureSeq, addresses this limitation of RNA-Seq by enriching for low-level transcripts with cDNA tiling arrays prior to high-throughput sequencing. This approach reveals a plethora of transcripts that have been previously dismissed as 'noise', and hints at single-cell transcription fingerprints that may be crucial in defining cellular function in normal and disease states.
doi:10.1186/gm290
PMCID: PMC3308029  PMID: 22113004
15.  Targeted RNA sequencing reveals the deep complexity of the human transcriptome 
Nature biotechnology  2011;30(1):99-104.
Transcriptomic analyses have revealed an unexpected complexity to the human transcriptome, whose breadth and depth exceeds current RNA sequencing capability (refs. 1–4). Using tiling arrays to target and sequence select portions of the transcriptome, we identify and characterize unannotated transcripts whose rare or transient expression is below the detection limits of conventional sequencing approaches. We use the unprecedented depth of coverage afforded by this technique to reach the deepest limits of the human transcriptome, exposing widespread, regulated and remarkably complex noncoding transcription in intergenic regions, as well as unannotated exons and splicing patterns in even intensively studied protein-coding loci such as p53 and HOX. The data also show that intermittent sequenced reads observed in conventional RNA sequencing data sets, previously dismissed as noise, are in fact indicative of unassembled rare transcripts. Collectively, these results reveal the range, depth and complexity of a human transcriptome that is far from fully characterized.
doi:10.1038/nbt.2024
PMCID: PMC3710462  PMID: 22081020
16.  The Saccharomyces cerevisiae W303-K6001 cross-platform genome sequence: insights into ancestry and physiology of a laboratory mutt 
Open Biology  2012;2(8):120093.
Saccharomyces cerevisiae strain W303 is a widely used model organism. However, little is known about its genetic origins, as it was created in the 1970s from crossing yeast strains of uncertain genealogy. To obtain insights into its ancestry and physiology, we sequenced the genome of its variant W303-K6001, a yeast model of ageing research. The combination of two next-generation sequencing (NGS) technologies (Illumina and Roche/454 sequencing) yielded an 11.8 Mb genome assembly at an N50 contig length of 262 kb. Although sequencing was substantially more precise and sensitive than whole-genome tiling arrays, both NGS platforms produced a number of false positives. At a 378× average coverage, only 74 per cent of called differences to the S288c reference genome were confirmed by both techniques. The consensus W303-K6001 genome differs in 8133 positions from S288c, predicting altered amino acid sequence in 799 proteins, including factors of ageing and stress resistance. Most (85.4%) of the W303-K6001 genome is virtually identical (≤0.5 variations per kb) to S288c, and thus originates from the same ancestor. Non-S288c regions distribute unequally over the genome, with chromosome XVI the most (99.6%) and chromosome XI the least (54.5%) S288c-like. Several of these clusters are shared with Σ1278B, another widely used S288c-related model, indicating that these strains share a second ancestor. Thus, the W303-K6001 genome pictures details of complex genetic relationships between the model strains that date back to the early days of experimental yeast genetics. Moreover, this study underlines the necessity of combining multiple NGS and genome-assembling techniques for achieving accurate variant calling in genomic studies.
doi:10.1098/rsob.120093
PMCID: PMC3438534  PMID: 22977733
next-generation sequencing; yeast models; phylogeny reconstruction; mapping
17.  A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling 
BMC Genomics  2010;11:282.
Background
RNA-Seq exploits the rapid generation of gigabases of sequence data by Massively Parallel Nucleotide Sequencing, allowing for the mapping and digital quantification of whole transcriptomes. Whilst previous comparisons between RNA-Seq and microarrays have been performed at the level of gene expression, in this study we adopt a more fine-grained approach. Using RNA samples from a normal human breast epithelial cell line (MCF-10a) and a breast cancer cell line (MCF-7), we present a comprehensive comparison between RNA-Seq data generated on the Applied Biosystems SOLiD platform and data from Affymetrix Exon 1.0ST arrays. The use of Exon arrays makes it possible to assess the performance of RNA-Seq in two key areas: detection of expression at the granularity of individual exons, and discovery of transcription outside annotated loci.
Results
We found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. For example, over 80% of exons detected as expressed in RNA-Seq were also detected on the Exon array, and 91% of exons flagged as changing from Absent to Present on at least one platform had fold-changes in the same direction. The greatest detection correspondence was seen when the read count threshold at which to flag exons Absent in the SOLiD data was set to t<1 suggesting that the background error rate is extremely low in RNA-Seq. We also found RNA-Seq more sensitive to detecting differentially expressed exons than the Exon array, reflecting the wider dynamic range achievable on the SOLiD platform. In addition, we find significant evidence of novel protein coding regions outside known exons, 93% of which map to Exon array probesets, and are able to infer the presence of thousands of novel transcripts through the detection of previously unreported exon-exon junctions.
Conclusions
By focusing on exon-level expression, we present the most fine-grained comparison between RNA-Seq and microarrays to date. Overall, our study demonstrates that data from a SOLiD RNA-Seq experiment are sufficient to generate results comparable to those produced from Affymetrix Exon arrays, even using only a single replicate from each platform, and when presented with a large genome.
doi:10.1186/1471-2164-11-282
PMCID: PMC2877694  PMID: 20444259
18.  A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level 
Nucleic Acids Research  2009;37(10):e75.
The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler. BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT–PCR experiments.
doi:10.1093/nar/gkp282
PMCID: PMC2691848  PMID: 19417075
19.  EcoBrowser: a web-based tool for visualizing transcriptome data of Escherichia coli 
BMC Research Notes  2011;4:405.
Background
Escherichia coli has been extensively studied as a prokaryotic model organism whose whole genome was determined in 1997. However, it is difficult to identify all the gene products involved in diverse functions by using whole genome sequences alone. High-resolution transcriptome mapping using tiling arrays has proved effective for improving the annotation of transcript units and discovering new ncRNA transcripts. While abundant tiling array data have been generated, appropriate visualization tools to accommodate and integrate multiple sources of data are lacking.
Findings
EcoBrowser is a web-based tool for visualizing genome annotations and transcriptome data of E. coli. Important tiling array data of E. coli from different experimental platforms are collected and processed for query. An AJAX based genome browser is embedded for visualization. Thus, genome annotations can be compared with transcript profiling and genome occupancy profiling from independent experiments, which will be helpful in discovering new transcripts including novel mRNAs and ncRNAs, generating a detailed description of the transcription unit architecture, further providing clues for investigation of prokaryotic transcriptional regulation that has proved to be far more complex than previously thought.
Conclusions
With the help of EcoBrowser, users can get a systemic view both from the vertical and parallel sides, as well as inspirations for the design of new experiments which will expand our understanding of the regulation mechanism.
doi:10.1186/1756-0500-4-405
PMCID: PMC3203075  PMID: 21992408
20.  Unconstrained mining of transcript data reveals increased alternative splicing complexity in the human transcriptome 
Nucleic Acids Research  2010;38(14):4740-4754.
Mining massive amounts of transcript data for alternative splicing information is paramount to help understand how the maturation of RNA regulates gene expression. We developed an algorithm to cluster transcript data to annotated genes to detect unannotated splice variants. A higher number of alternatively spliced genes and isoforms were found compared to other alternative splicing databases. Comparison of human and mouse data revealed a marked increase, in human, of splice variants incorporating novel exons and retained introns. Previously unannotated exons were validated by tiling array expression data and shown to correspond preferentially to novel first exons. Retained introns were validated by tiling array and deep sequencing data. The majority of retained introns were shorter than 500 nt and had weak polypyrimidine tracts. A subset of retained introns matching small RNAs and displaying a high GC content suggests a possible coordination between splicing regulation and production of noncoding RNAs. Conservation of unannotated exons and retained introns was higher in horse, dog and cow than in rodents, and 64% of exon sequences were only found in primates. This analysis highlights previously bypassed alternative splice variants, which may be crucial to deciphering more complex pathways of gene regulation in human.
doi:10.1093/nar/gkq197
PMCID: PMC2919708  PMID: 20385588
21.  The incredible shrinking world of DNA microarrays 
Molecular bioSystems  2008;4(7):726-732.
The efficacy of microarrays in examining gene expression, gene and genome structure, protein-DNA interactions, whole-genome similarities and differences, microRNA expression, methylation and more is no longer in question. It is a fast-developing, cutting-edge technology that has grown up along with massive sequence databases and is likely to become part of everyday patient care. Many advances have recently expanded the power and utility of microarrays; among them is our development of a new array tiling technique that dramatically increases the scope of coverage of an oligonucleotide tiling array without substantially increasing its cost.
doi:10.1039/b706237k
PMCID: PMC2535915  PMID: 18563246
22.  Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome 
Genome Biology  2008;9(1):R3.
RACE sequencing of ENCODE regions shows that much of the human genome is represented in poly(A)+ RNA.
Background
Recent studies of the mammalian transcriptome have revealed a large number of additional transcribed regions and extraordinary complexity in transcript diversity. However, there is still much uncertainty regarding precisely what portion of the genome is transcribed, the exact structures of these novel transcripts, and the levels of the transcripts produced.
Results
We have interrogated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing. We analyzed annotated known gene regions, but primarily we focused on novel transcriptionally active regions (TARs), which were previously identified by high-density oligonucleotide tiling arrays and on random regions that were not believed to be transcribed. We found RACE sequencing to be very sensitive and were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. We also observed many instances of sense-antisense transcripts; further analysis suggests that many of the antisense transcripts (but not all) may be artifacts generated from the reverse transcription reaction. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Of previously unannotated random regions, 17% were shown to produce overlapping transcripts. Furthermore, it is estimated that 9% of the novel transcripts encode proteins.
Conclusion
We conclude that RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional.
doi:10.1186/gb-2008-9-1-r3
PMCID: PMC2395237  PMID: 18173853
23.  Transcriptome Analysis Using Next-Generation Sequencing Technology 
High throughput RNA sequencing (RNA-Seq) is becoming increasingly utilized as the technology of choice to detect and quantify known and novel transcripts. Multiple next-generation sequencing (NGS) platforms are available that enable transcriptome profiling through RNA-Seq workflows. Demonstrations of the power of RNA-Seq to profile the well annotated transcriptome and also identify novel transcribed regions, gene fusions, and even identify novel classes of RNA are rapidly increasing in the field of RNA research. Our aim has been to develop library preparation methods and tools that aid in the reliable generation of libraries for next generation sequencing from total RNA. Reported here are results from the development of the Ambion® RNA-Seq Library Construction kit optimized for sequencing on the Illumina® next generation sequencing instruments. We show results from two protocols utilizing the same reagents that allow generation of RNA-Seq libraries targeting either the small RNA fraction of total RNA, or the whole transcriptome which includes transcripts larger than 100 base pairs. Results are reported from Illumina® Genome Analyzer II sequencing of both small RNA and transcriptome libraries with a focus on mapping to the miRBase and RefSeq references respectively. We also demonstrate the use of External RNA Control Consortium (ERCC) transcripts as spike-in controls for transcriptome libraries that aid in quality control of the library generation procedure and aid in downstream data analysis. The library construction technology embedded in the Ambion® RNA-Seq Library Construction kit enables researchers to analyze the transcriptome of their research samples in a precise, sensitive and robust manner while maintaining information regarding the genomic DNA strand to which the RNA transcript maps utilizing the Illumina® Genome Analyzer II sequencing platform. 
The workflow and results reported here demonstrate new commercially available options for library construction enabling small RNA and transcriptome profiling and novel discovery using next-generation sequencing technology.
PMCID: PMC3186484
24.  A systematic comparison and evaluation of high density exon arrays and RNA-seq technology used to unravel the peripheral blood transcriptome of sickle cell disease 
BMC Medical Genomics  2012;5:28.
Background
Transcriptomic studies in clinical research are essential tools for deciphering the functional elements of the genome and unraveling underlying disease mechanisms. Various technologies have been developed to deduce and quantify the transcriptome including hybridization and sequencing-based approaches. Recently, high density exon microarrays have been successfully employed for detecting differentially expressed genes and alternative splicing events for biomarker discovery and disease diagnostics. The field of transcriptomics is currently being revolutionized by high throughput DNA sequencing methodologies to map, characterize, and quantify the transcriptome.
Methods
In an effort to understand the merits and limitations of each of these tools, we undertook a study of the transcriptome in sickle cell disease, a monogenic disease comparing the Affymetrix Human Exon 1.0 ST microarray (Exon array) and Illumina’s deep sequencing technology (RNA-seq) on whole blood clinical specimens.
Results
Analysis indicated a strong concordance (R = 0.64) between Exon array and RNA-seq data at both gene level and exon level transcript expression. The magnitude of differential expression was found to be generally higher in RNA-seq than in the Exon microarrays. We also demonstrate for the first time the ability of RNA-seq technology to discover novel transcript variants and differential expression in previously unannotated genomic regions in sickle cell disease. In addition to detecting expression level changes, RNA-seq technology was also able to identify sequence variation in the expressed transcripts.
Conclusions
Our findings suggest that microarrays remain useful and accurate for transcriptomic analysis of clinical samples with low input requirements, while RNA-seq technology complements and extends microarray measurements for novel discoveries.
doi:10.1186/1755-8794-5-28
PMCID: PMC3428653  PMID: 22747986
Sickle cell disease; RNA-Seq; Exon arrays; Transcriptome; Clinical genomics
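The cross-platform concordance reported in entry 24 is a correlation coefficient between matched expression values. As a minimal sketch, assuming it is a Pearson correlation over per-gene (or per-exon) log-scale values measured on both platforms, it could be computed as below. The function name `pearson_r` and the input lists are hypothetical and are not taken from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Hypothetical helper: xs and ys stand for matched expression values
    (for example, log2 array intensity and log2 read count for one gene).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that the two platforms rank and scale features similarly; the coefficient says nothing about which platform is closer to the true abundance.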
25.  PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data 
Bioinformatics  2012;28(4):479-486.
Motivation: RNA-seq is a powerful technology for the study of transcriptome profiles that uses deep-sequencing technologies. Moreover, it may be used for cellular phenotyping and help establish the etiology of diseases characterized by abnormal splicing patterns. In RNA-Seq, the exact nature of splicing events is buried in the reads that span exon–exon boundaries. The accurate and efficient mapping of these reads to the reference genome is a major challenge.
Results: We developed PASSion, a pattern growth algorithm-based pipeline for splice site detection in paired-end RNA-Seq reads. Comparing the performance of PASSion to three existing RNA-Seq analysis pipelines, TopHat, MapSplice and HMMSplicer, revealed that PASSion is competitive with these packages. Moreover, the performance of PASSion is not affected by read length and coverage. It performs better than the other three approaches when detecting junctions in highly abundant transcripts. PASSion has the ability to detect junctions that do not have known splicing motifs, which cannot be found by the other tools. Of the two public RNA-Seq datasets, PASSion predicted ∼ 137 000 and 173 000 splicing events, of which on average 82% are known junctions annotated in the Ensembl transcript database and 18% are novel. In addition, our package can discover differential and shared splicing patterns among multiple samples.
Availability: The code and utilities can be freely downloaded from https://trac.nbic.nl/passion and ftp://ftp.sanger.ac.uk/pub/zn1/passion
Contact: y.zhang@lumc.nl; k.ye@lumc.nl
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btr712
PMCID: PMC3278765  PMID: 22219203
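The known/novel split quoted in entry 25 amounts to comparing predicted junctions against an annotation set. The following is a minimal sketch of that bookkeeping, not PASSion's own code: it assumes each junction is represented as a (chromosome, donor position, acceptor position, strand) tuple, and the names `classify_junctions` and `percent` are hypothetical.

```python
def classify_junctions(predicted, annotated):
    """Split predicted junctions into known and novel sets.

    Assumption: a junction is a hashable tuple such as
    (chrom, donor_pos, acceptor_pos, strand), and a junction is "known"
    only when it matches an annotated junction exactly.
    """
    predicted_set = set(predicted)
    annotated_set = set(annotated)
    known = predicted_set & annotated_set
    novel = predicted_set - annotated_set
    return known, novel


def percent(part, whole):
    """Return part as a percentage of whole (0.0 when whole is zero)."""
    return 100.0 * part / whole if whole else 0.0
```

Exact coordinate matching is the strictest possible definition of "known"; a real comparison might tolerate small offsets at splice sites, which would shift predictions from the novel set to the known set.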
