Recent deep sequencing of transcriptomes from worm to human reveals that individual transcripts can be composed of sequence segments that are not collinear, with some mapping great distances apart and others to other chromosomes. Some of these chimeric transcripts are formed by genetic rearrangements, but others appear to arise during post-transcriptional events. While in lower eukaryotes this is accomplished by a well-characterized trans-splicing process, in higher eukaryotes the processes leading to their formation remain unclear. Although the biological importance of most chimeric RNAs is as yet unclear, the implications of their existence for the potential information content and functional organization of genomes are profound.
Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases.
Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80–90% success rate, corroborating the high precision of the STAR mapping strategy.
Availability and implementation: STAR is implemented as standalone C++ code. STAR is free open source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
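The sequential maximum mappable seed search described in the Results can be illustrated with a toy example. The following is a minimal Python sketch, not STAR's C++ implementation: it assumes exact matching only, builds the suffix array naively, and omits the clustering, stitching and scoring steps. The function names (build_suffix_array, maximal_mappable_prefix, seed_search) and the toy genome are hypothetical and chosen for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    # Naive construction by sorting full suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def _common_prefix_length(a, b):
    """Number of leading characters shared by strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def maximal_mappable_prefix(read, genome, sa):
    """Return (length, positions) of the longest read prefix found exactly in genome."""
    # Binary search for the first suffix that is >= read (compared over len(read) chars).
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if genome[sa[mid]:sa[mid] + len(read)] < read:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is shared with a suffix adjacent to the insertion point.
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            best = max(best, _common_prefix_length(read, genome[sa[idx]:]))
    if best == 0:
        return 0, []
    prefix = read[:best]
    # Suffixes sharing this prefix are contiguous in the array; collect them all.
    positions = []
    i = lo - 1
    while i >= 0 and genome.startswith(prefix, sa[i]):
        positions.append(sa[i])
        i -= 1
    i = lo
    while i < len(sa) and genome.startswith(prefix, sa[i]):
        positions.append(sa[i])
        i += 1
    return best, sorted(positions)


def seed_search(read, genome, sa):
    """Repeat the prefix search from the first unmapped base until the read is used up."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = maximal_mappable_prefix(read[offset:], genome, sa)
        if length == 0:
            offset += 1  # this base does not occur in the genome; skip it
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


# Toy genome: two exons separated by a GT...AG intron (illustrative sequences).
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCAG", "TTGACCGA"
genome = exon1 + intron + exon2
sa = build_suffix_array(genome)
read = exon1[-5:] + exon2[:6]  # a read spanning the splice junction
print(seed_search(read, genome, sa))  # [(0, 5, [3]), (5, 6, [19])]
```

In this example the first seed ends at the last base of the first exon and the second seed begins at the first base of the second exon, so the gap between the two genomic positions corresponds to the intron. A stitching step would then join such seeds into a spliced alignment.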
The study of transcription using genomic tiling arrays has led to the identification of numerous additional exons. One example is the MECP2 gene on the X chromosome; using 5′ RACE and RT-PCR in human tissues and cell lines, we have found more than 70 novel exons (RACEfrags) connecting to at least one annotated exon. We sequenced all MECP2-connected exons and flanking sequences in 3 groups: 46 patients with Rett syndrome and without mutations in the currently annotated exons of the MECP2 and CDKL5 genes; 32 patients with Rett syndrome and identified mutations in the MECP2 gene; and 100 control individuals from the same geoethnic group. Approximately 13 kb were sequenced per sample (2.4 Mb of DNA resequencing). A total of 75 individuals had novel rare variants (mostly private variants), but no statistically significant difference was found among the 3 groups. These results suggest that variants in the newly discovered exons may not contribute to Rett syndrome. Interestingly, however, there are about twice as many variants in the novel exons as in the flanking sequences (44 vs. 21 for approximately 1.3 Mb sequenced for each class of sequences, p = 0.0025). Thus the evolutionary forces that shape these novel exons may be different from those of neighboring sequences.
MECP2; Rett syndrome; RACEfrags; SNP; rare variants; positive selection
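The comparison of 44 variants in novel exons against 21 in flanking sequences can be checked with a simple count-based test. The sketch below is an assumption-laden illustration: the abstract does not name the test that produced p = 0.0025, so an exact two-sided binomial test is used here as a stand-in, and the result need not match the published value. It assumes that equal amounts of sequence (approximately 1.3 Mb per class) make a 50/50 split the null expectation. The function name binomial_two_sided_p is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k, n):
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    # The null distribution is symmetric, so the smaller tail is doubled.
    tail = min(k, n - k)
    lower = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, 2 * lower)


novel, flanking = 44, 21  # variant counts quoted in the abstract
p = binomial_two_sided_p(novel, novel + flanking)
print(f"novel/flanking ratio = {novel / flanking:.2f}, two-sided p = {p:.4f}")
```

Under these assumptions the excess of variants in the novel exons remains significant at the conventional 0.05 level, which is consistent with the direction of the reported result.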
The transcriptional landscape in embryonic stem cells (ESCs) and during ESC differentiation has received considerable attention, albeit mostly confined to the polyadenylated fraction of RNA, whereas the non-polyadenylated (NPA) fraction remained largely unexplored. Notwithstanding, the NPA RNA super-family has every potential to participate in the regulation of pluripotency and stem cell fate. We conducted a comprehensive analysis of NPA RNA in ESCs using a combination of whole-genome tiling arrays and deep sequencing technologies. In addition to identifying previously characterized and new non-coding RNA members, we describe a group of novel conserved RNAs (snacRNAs: small NPA conserved), some of which are differentially expressed between ESC and neuronal progenitor cells, providing the first evidence of a novel group of potentially functional NPA RNA involved in the regulation of pluripotency and stem cell fate. We further show that minor spliceosomal small nuclear RNAs, which are NPA, are almost completely absent in ESCs and are upregulated in differentiation. Finally, we show differential processing of the minor intron of the polycomb group gene Eed. Our data suggest that NPA RNA, both known and novel, play important roles in ESCs.
Many animal species use a chromosome-based mechanism of sex determination, which has led to the coordinate evolution of dosage-compensation systems. Dosage compensation not only corrects the imbalance in the number of X chromosomes between the sexes but also is hypothesized to correct dosage imbalance within cells that is due to monoallelic X-linked expression and biallelic autosomal expression, by upregulating X-linked genes twofold (termed ‘Ohno’s hypothesis’). Although this hypothesis is well supported by expression analyses of individual X-linked genes and by microarray-based transcriptome analyses, it was challenged by a recent study using RNA sequencing and proteomics. We obtained new, independent RNA-seq data, measured RNA polymerase distribution and reanalyzed published expression data in mammals, C. elegans and Drosophila. Our analyses, which take into account the skewed gene content of the X chromosome, support the hypothesis of upregulation of expressed X-linked genes to balance expression of the genome.
Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines.
We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA.
Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.
Analyses of bacterial transcriptomes have shown the existence of a genome-wide process of overlapping transcription due to the presence of antisense RNAs, as well as mRNAs that overlap along their entire length or in some portion of the 5′- and 3′-UTR regions. The biological advantages of such overlapping transcription are unclear, but it may play important regulatory roles at the level of transcription, RNA stability and translation. In a recent report, the human pathogen Staphylococcus aureus was observed to generate genome-wide overlapping transcription in the same bacterial cells, leading to a collection of short RNA fragments generated by the endoribonuclease RNase III. This processing appears most prominently in Gram-positive bacteria. The implications of both the use of pervasive overlapping transcription and the processing of these double-stranded templates into short RNAs are explored and the consequences discussed.
overlapping transcription; RNase III; RNA processing; bacteria; transcriptome
Large-scale sequencing projects have revealed an unexpected complexity in the origins, structures and functions of mammalian transcripts. Many loci are known to produce overlapping coding and non-coding RNAs with capped 5′ ends that vary in size. Methods that identify the 5′ ends of transcripts will facilitate the discovery of novel promoters and 5′ ends derived from secondary capping events. Such methods often require high input amounts of RNA not obtainable from highly refined samples such as tissue microdissections and subcellular fractions. Therefore, we have developed nanoCAGE (Cap Analysis of Gene Expression), a method that captures the 5′ ends of transcripts from as little as 10 nanograms of total RNA, and CAGEscan, a mate-pair adaptation of nanoCAGE that captures the transcript 5′ ends linked to a downstream region. Both of these methods allow further annotation-agnostic studies of the complex human transcriptome.
The p53 homologs, p63 and p73, share ∼85% amino acid identity in their DNA-binding domains, but they have distinct biological functions.
Using chromatin immunoprecipitation and high-resolution tiling arrays covering the human genome, we identify p73 DNA binding sites on a genome-wide level in ME180 human cervical carcinoma cells. Strikingly, the p73 binding profile is indistinguishable from the previously described binding profile for p63 in the same cells. Moreover, the p73∶p63 binding ratio is similar at all genomic loci tested, suggesting that there are few, if any, targets that are specific for one of these factors. As assayed by sequential chromatin immunoprecipitation, p63 and p73 co-occupy DNA target sites in vivo, suggesting that p63 and p73 bind primarily as heterotetrameric complexes in ME180 cells.
The observation that p63 and p73 associate with the same genomic targets suggests that their distinct biological functions are due to cell-type specific expression and/or protein domains that involve functions other than DNA binding.
The high-resolution transcriptome of wild-type and nonsense-mediated decay (NMD) defective C. elegans during development reveals insights into the NMD pathway and its role in development.
While many genome sequences are complete, transcriptomes are less well characterized. We used both genome-scale tiling arrays and massively parallel sequencing to map the Caenorhabditis elegans transcriptome across development. We utilized this framework to identify transcriptome changes in animals lacking the nonsense-mediated decay (NMD) pathway.
We find that while the majority of detectable transcripts map to known gene structures, >5% of transcribed regions fall outside current gene annotations. We show that >40% of these are novel exons. Using both technologies to assess isoform complexity, we estimate that >17% of genes change isoform across development. Next we examined how the transcriptome is perturbed in animals lacking NMD. NMD prevents expression of truncated proteins by degrading transcripts containing premature termination codons. We find that approximately 20% of genes produce transcripts that appear to be NMD targets. While most of these arise from splicing errors, NMD targets are enriched for transcripts containing open reading frames upstream of the predicted translational start (uORFs). We identify a relationship between the Kozak consensus surrounding the true start codon and the degree to which uORF-containing transcripts are targeted by NMD and speculate that translational efficiency may be coupled to transcript turnover via the NMD pathway for some transcripts.
We generated a high-resolution transcriptome map for C. elegans and used it to identify endogenous targets of NMD. We find that these transcripts arise principally through splicing errors, strengthening the prevailing view that splicing and NMD are highly interlinked processes.
The transcriptomes of eukaryotic cells are incredibly complex. Individual non-coding RNAs dwarf the number of protein-coding genes, and include classes that are well understood as well as classes for which the nature, extent and functional roles are obscure [1]. Deep sequencing of small RNAs (<200 nucleotides) from human HeLa and HepG2 cells revealed a remarkable breadth of species. These arose both from within annotated genes and from unannotated intergenic regions. Overall, small RNAs tended to align with CAGE (cap-analysis of gene expression) tags [2], which mark the 5′ ends of capped, long RNA transcripts. Many small RNAs, including the previously described promoter-associated small RNAs [3], appeared to possess cap structures. Members of an extensive class of both small RNAs and CAGE tags were distributed across internal exons of annotated protein coding and non-coding genes, sometimes crossing exon–exon junctions. Here we show that processing of mature mRNAs through an as yet unknown mechanism may generate complex populations of both long and short RNAs whose apparently capped 5′ ends coincide. Supplying synthetic promoter-associated small RNAs corresponding to the c-MYC transcriptional start site reduced MYC messenger RNA abundance. The studies presented here expand the catalogue of cellular small RNAs and demonstrate a biological impact for at least one class of non-canonical small RNAs.
The molecular mechanisms underlying pluripotency and lineage specification from embryonic stem (ES) cells are largely unclear. Differentiation pathways may be determined by the targeted activation of lineage specific genes or by selective silencing of genome regions during differentiation. Here we show that the ES cell genome is transcriptionally globally hyperactive and undergoes global silencing as cells differentiate. Normally silent repeat regions are active in ES cells and tissue-specific genes are sporadically expressed at low levels. Whole genome tiling arrays demonstrate widespread transcription in both coding and non-coding regions in pluripotent ES cells whereas the transcriptional landscape becomes more discrete as differentiation proceeds. The transcriptional hyperactivity in ES cells is accompanied by disproportionate expression of chromatin-remodeling genes and the general transcription machinery, but not histone modifying activities. Interference with several chromatin remodeling activities in ES cells affects their proliferation and differentiation behavior. We propose that global transcriptional activity is a hallmark of pluripotent ES cells that contributes to their plasticity and that lineage specification is strongly driven by reduction of the actively transcribed portion of the genome.
High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. These arrays are being increasingly used to study the associated processes of transcription, transcription factor binding, chromatin structure and their association. Studies of differential expression and/or regulation provide critical insight into the mechanics of transcription and regulation that occur during the developmental program of a cell. The time-course experiment, which comprises an in-vivo system and the proposed analyses, is used to determine whether annotated and un-annotated portions of the genome manifest a coordinated differential response to the induced developmental program.
We have proposed a novel approach, based on a piece-wise function, to analyze genome-wide differential response. This enables segmentation of the response based on protein-coding and non-coding regions; for genes the methodology also partitions differential response with a 5' versus 3' versus intra-genic bias.
The algorithm, built upon the framework of Significance Analysis of Microarrays, uses a generalized logic to define regions/patterns of coordinated differential change. By not adhering to the gene-centric paradigm, discordant differential expression patterns between exons and introns have been identified at an FDR of less than 12 percent. A co-localization of differential binding between RNA Polymerase II and tetra-acetylated histone has been quantified at a p-value < 0.003; it is most significant at the 5' end of genes, at a p-value < 10^-13. The prototype R code has been made available as supplementary material [see Additional file 1].
Regulatory T (T reg) cells are critical regulators of immune tolerance. Most T reg cells are defined based on expression of CD4, CD25, and the transcription factor FoxP3. However, these markers have proven problematic for uniquely defining this specialized T cell subset in humans. We found that the IL-7 receptor (CD127) is down-regulated on a subset of CD4+ T cells in peripheral blood. We demonstrate that the majority of these cells are FoxP3+, including those that express low levels or no CD25. A combination of CD4, CD25, and CD127 resulted in a highly purified population of T reg cells accounting for significantly more cells than previously identified based on other cell surface markers. These cells were highly suppressive in functional suppressor assays. In fact, cells separated based solely on CD4 and CD127 expression were anergic and, although representing at least three times the number of cells (including both CD25+CD4+ and CD25−CD4+ T cell subsets), were as suppressive as the “classic” CD4+CD25hi T reg cell subset. Finally, we show that CD127 can be used to quantitate T reg cell subsets in individuals with type 1 diabetes, supporting the use of CD127 as a biomarker for human T reg cells.
High density oligonucleotide tiling arrays are an effective and powerful platform for conducting unbiased genome-wide studies. The ab initio probe selection method employed in tiling arrays is unbiased, and thus ensures consistent sampling across coding and non-coding regions of the genome. Tiling arrays are increasingly used in chromatin immunoprecipitation (IP) experiments (ChIP on chip). ChIP on chip facilitates the generation of genome-wide maps of in-vivo interactions between DNA-associated proteins including transcription factors and DNA. Analysis of the hybridization of an immunoprecipitated sample to a tiling array facilitates the identification of ChIP-enriched segments of the genome. These enriched segments are putative targets of antibody assayable regulatory elements. The enrichment response is not ubiquitous across the genome. Typically 5 to 10% of tiled probes manifest some significant enrichment. Depending upon the factor being studied, this response can drop to less than 1%. The detection and assessment of significance for interactions that emanate from non-canonical and/or un-annotated regions of the genome is especially challenging. This is the motivation behind the proposed algorithm.
We have proposed a novel rank and replicate statistics-based methodology for identifying and ascribing statistical confidence to regions of ChIP-enrichment. The algorithm is optimized for identification of sites that manifest low levels of enrichment but are true positives, as validated by alternative biochemical experiments. Although the method is described here in the context of ChIP on chip experiments, it can be generalized to any treatment-control experimental design. The results of the algorithm show a high degree of concordance with independent biochemical validation methods. The sensitivity and specificity of the algorithm have been characterized via quantitative PCR and independent computational approaches.
The algorithm ranks all enrichment sites based on their intra-replicate ranks and inter-replicate rank consistency. Following the ranking, the method allows segmentation of sites based on a meta p-value, a composite array signal enrichment criterion, or a composite of these two measures. The sensitivities obtained subsequent to the segmentation of data using a meta p-value of 10^-5, an array signal enrichment of 0.2 and a composite of these two values are 88%, 87% and 95%, respectively.
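The ranking and meta p-value steps can be sketched in a few lines of Python. This is an illustrative stand-in rather than the published algorithm: the abstract does not state how the meta p-value is computed, so Fisher's method for combining independent p-values is assumed here, and rank consistency is summarized simply as the spread between the best and worst replicate rank. All function names and the example numbers are hypothetical.

```python
from math import exp, log


def ranks(values):
    """Rank values from 1 (largest) to n; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def fisher_meta_p(pvalues):
    """Combine independent p-values with Fisher's method (an assumed choice)."""
    k = len(pvalues)
    half_x = -sum(log(p) for p in pvalues)  # X/2, where X is chi-square with 2k df
    # For even degrees of freedom the chi-square survival function is a finite sum.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return exp(-half_x) * total


def rank_sites(enrichment, pvalues):
    """Order sites by mean intra-replicate rank, then by rank spread across replicates.

    enrichment[r][s] and pvalues[r][s] refer to replicate r and site s.
    Returns tuples of (site, mean_rank, rank_spread, meta_p), best site first.
    """
    per_replicate = [ranks(rep) for rep in enrichment]
    n_rep = len(enrichment)
    rows = []
    for s in range(len(enrichment[0])):
        site_ranks = [per_replicate[r][s] for r in range(n_rep)]
        mean_rank = sum(site_ranks) / n_rep
        spread = max(site_ranks) - min(site_ranks)
        meta_p = fisher_meta_p([pvalues[r][s] for r in range(n_rep)])
        rows.append((s, mean_rank, spread, meta_p))
    return sorted(rows, key=lambda row: (row[1], row[2]))


# Three replicates, four candidate sites (made-up numbers for illustration).
enrichment = [[0.9, 0.1, 0.4, 0.25], [0.8, 0.2, 0.5, 0.1], [1.0, 0.05, 0.3, 0.35]]
pvalues = [[1e-3, 0.6, 0.05, 0.2], [2e-3, 0.4, 0.03, 0.5], [1e-4, 0.7, 0.08, 0.1]]
for site, mean_rank, spread, meta_p in rank_sites(enrichment, pvalues):
    print(site, round(mean_rank, 2), spread, meta_p <= 1e-5)
```

The final column applies the meta p-value cutoff of 10^-5 quoted above. A composite criterion could combine that flag with a signal enrichment threshold in the same loop.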
We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
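The nucleotide-level accuracy figures quoted in the results can be made concrete with a short calculation. The sketch below assumes the convention common in gene-prediction evaluations, in which sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated. The abstract does not spell out these definitions, and the coordinates are invented for illustration.

```python
def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the coding nucleotide level."""
    ann, pred = covered(annotated), covered(predicted)
    true_positives = len(ann & pred)
    sensitivity = true_positives / len(ann)    # TP / (TP + FN)
    specificity = true_positives / len(pred)   # TP / (TP + FP), as assumed above
    return sensitivity, specificity


# Hypothetical coding exons on one strand of a small region.
annotated = [(100, 200), (300, 400)]
predicted = [(110, 200), (300, 420)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity = {sn:.3f}, specificity = {sp:.3f}")
```

Here 190 of the 200 annotated nucleotides are predicted and 190 of the 210 predicted nucleotides are annotated. Using sets of positions keeps the sketch short; interval arithmetic would be preferable for regions of realistic size.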
Macrophages are activated from a resting state by a combination of cytokines and microbial products. Microbes are often sensed through Toll-like receptors signaling through MyD88. We used large-scale microarrays in multiple replicate experiments followed by stringent statistical analysis to compare gene expression in wild-type (WT) and MyD88−/− macrophages. We confirmed key results by quantitative reverse transcription polymerase chain reaction, Western blot, and enzyme-linked immunosorbent assay. Surprisingly, many genes, such as inducible nitric oxide synthase, IRG-1, IP-10, MIG, RANTES, and interleukin 6 were induced by interferon (IFN)-γ from 5- to 100-fold less extensively in MyD88−/− macrophages than in WT macrophages. Thus, widespread, full-scale activation of macrophages by IFN-γ requires MyD88. Analysis of the mechanism revealed that MyD88 mediates a process of self-priming by which resting macrophages produce a low level of tumor necrosis factor. This and other factors lead to basal activation of nuclear factor κB, which synergizes with IFN-γ for gene induction. In contrast, infection by live, virulent Mycobacterium tuberculosis (Mtb) activated macrophages largely through MyD88-independent pathways, and macrophages did not need MyD88 to kill Mtb in vitro. Thus, MyD88 plays a dynamic role in resting macrophages that supports IFN-γ–dependent activation, whereas macrophages can respond to a complex microbial stimulus, the tubercle bacillus, chiefly by other routes.
macrophage activation; Toll-like receptors; innate immunity; NF-κB; microarray gene expression analysis
Macrophage activation determines the outcome of infection by Mycobacterium tuberculosis (Mtb). Interferon-γ (IFN-γ) activates macrophages by driving Janus tyrosine kinase (JAK)/signal transducer and activator of transcription–dependent induction of transcription and PKR-dependent suppression of translation. Microarray-based experiments reported here enlarge this picture. Exposure to IFN-γ and/or Mtb led to altered expression of 25% of the monitored genome in macrophages. The number of genes suppressed by IFN-γ exceeded the number of genes induced, and much of the suppression was transcriptional. Five times as many genes related to immunity and inflammation were induced as were suppressed. Mtb mimicked or synergized with IFN-γ more often than it antagonized its actions. Phagocytosis of nonviable Mtb or polystyrene beads affected many genes, but the transcriptional signature of macrophages infected with viable Mtb was distinct. Studies involving macrophages deficient in inducible nitric oxide synthase and/or phagocyte oxidase revealed that these two antimicrobial enzymes help orchestrate the profound transcriptional remodeling that underlies macrophage activation.
gene expression; microarray analysis; macrophage activation; innate immunity; phagocytosis
High density oligonucleotide arrays have been used extensively for expression studies of eukaryotic organisms. We have designed a prokaryotic high density oligonucleotide array using the complete Escherichia coli genome sequence to monitor expression levels of all genes and intergenic regions in the genome. Because previously described methods for preparing labeled target nucleic acids are not useful for prokaryotic cell analysis using such arrays, an mRNA enrichment and direct labeling protocol was developed together with a cDNA synthesis protocol. The reproducibility of each labeling method was determined using high density oligonucleotide probe arrays as a read-out methodology, and the expression results from direct labeling were compared to the expression results from the cDNA synthesis. About 50% of all annotated E. coli open reading frames were observed to be transcribed, as measured by both protocols, when the cells were grown in rich LB medium. Each labeling method individually showed a high degree of concordance in replicate experiments (95 and 99%, respectively), but when each sample preparation method was compared to the other, ∼32% of the genes observed to be expressed were discordant. However, both labeling methods can detect the same relative gene expression changes when RNA from IPTG-induced cells is labeled and compared to RNA from uninduced E. coli cells.
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in organisms from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches, we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Despite recent controversies, the evidence that the majority of the human genome is transcribed into RNA remains strong.
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
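The inefficiency of random clone selection under a large dynamic range can be illustrated with a simple probability calculation. The sketch below is not the theoretical treatment referred to in the abstract; it assumes that clones are picked independently and that a transcript's chance of being picked equals its fractional abundance in the mixture. The function names and abundance values are hypothetical.

```python
from math import ceil, log


def detection_probability(fraction, n_clones):
    """Probability that a transcript at the given fraction appears in n_clones picks."""
    return 1.0 - (1.0 - fraction) ** n_clones


def clones_needed(fraction, confidence=0.95):
    """Smallest number of random clones giving at least the requested detection probability."""
    return ceil(log(1.0 - confidence) / log(1.0 - fraction))


# A transcript making up 10% of the mixture versus one making up 0.1%.
for fraction in (0.1, 0.001):
    print(fraction, clones_needed(fraction))
```

Under these assumptions a transcript at 10% of the mixture needs 29 clones for a 95% chance of detection, while one at 0.1% needs roughly a hundred times more. Splitting the mixture into simpler reactions raises the fractional abundance of rare products within each reaction, which is why fewer clones are needed per novel transcript.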