|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: JL. Performed the experiments: DK. Analyzed the data: DK MKB OE. Wrote the paper: DK MKB OE JL.
Several recent studies have indicated that transcription is pervasive in regions outside of protein coding genes and that short antisense transcripts can originate from the promoter and terminator regions of genes. Here we investigate transcription of fragments longer than 200 nucleotides, focusing on antisense transcription for known protein coding genes and intergenic transcription. We find that roughly 12% to 16% of all reads that originate from promoter and terminator regions, respectively, map antisense to the gene in question. Furthermore, we detect a high number of novel transcriptionally active regions (TARs) that are generally expressed at a lower level than protein coding genes. We find that the correlation between RNA-seq data and microarray data is dependent on the gene length, with longer genes showing a better correlation. We detect high antisense transcriptional activity from promoter, terminator and intron regions of protein-coding genes and identify a vast number of previously unidentified TARs, including putative novel EGFR transcripts. This shows that in-depth analysis of the transcriptome using RNA-seq is a valuable tool for understanding complex transcriptional events. Furthermore, the development of new algorithms for estimation of gene expression from RNA-seq data is necessary to minimize length bias.
Less than 2% of the human genome encodes for proteins, yet a large fraction, recently estimated to 60% to 90% of the genome can be transcribed . The functions of the majority of these novel uncharacterized transcriptionally active regions (TARs) are currently unknown, but they are believed to be of regulatory importance. For example, Ebisuya and colleagues showed that “transcriptional ripples” can propagate along the genome and mediate regulation of genes several tens of kilobases away .
Several studies  have shown that antisense transcription is prevalent and likely to have a regulatory function. Studies indicate that 20% to 90% of all human protein-coding genes can generate transcripts with potential to form sense-antisense pairs – and that these generally are arranged in a tail-to-tail pattern. Recently, short fragments of RNA have been detected in the antisense direction in regions just upstream protein-coding genes –.
In parallel to experimental discovery of regulatory RNAs, computational methods are being developed to identify conserved structural RNA elements likely to be involved in transcriptional and translational control . These approaches aim to make in silico predictions of regulatory sites in the human genome that can be validated by the on-going massive transcriptome sequencing (RNA-Seq) efforts on cells, tissues and organs , however, more development is needed to make these algorithms more accurate and efficient.
In this study, we use massive DNA sequencing to investigate RNA longer than 200 nucleotides from three human cancer cell lines. We show that approximately 20% of all protein-coding genes have antisense transcription coupled to them and that antisense transcription is prevalent in introns.
In this study we investigate the transcriptome of three cell lines, A431, U-2 OS and U251, by applying the massive SOLiD DNA sequencing technology facilitating sense/antisense identification of reads. The cell lines were chosen to represent three different lineages; epithelial, mesenchymal and glia cells. A total of 10 to 15 million high quality 50-basepair reads were obtained for each cell line. The reads were mapped onto the human reference genome (hg18), after which reads were aggregated for each gene. An expression value was calculated based on the number of reads per kilobase gene and million reads in each sample (RPKM) . Analysis of the gene expression pattern demonstrated that 66% to 69% of all genes are expressed in each cell line of which 85% to 88% were shared for all three cell lines (figure S1).
To validate the results obtained from RNA-seq, we compared the data to gene expression data from the A431 and U251 cell lines obtained using microarrays (no data was available for U-2 OS). Since the microarray platform only generates relative expression values, the correlation between the RNA-seq data and the microarray data was calculated using the log2 value of the ratio between A431 and U251, which in the RNA-seq case yields one value per Ensembl-gene. Since one gene can be represented by several microarray probes, we used three different methods to convert these to a single value that could be compared to the RNA-seq data (mean, median and best probe, see Materials and Methods for details). The Spearman correlation was determined to 0.55, 0.55 and 0.64 for the three methods respectively, values in the same range as those described earlier . Oshlack and Wakefield recently showed that the variance estimation of the RPKM measure is dependent on the gene length . Thus, we hypothesized that the correlation between microarray data and RNA-seq data would share this dependence, since the log2-fold change in RNA-seq will have lower variance for longer genes that for shorter genes. This assumption turned out to be correct; for genes shorter than 2000 bases, the correlation was 0.48 to 0.52 depending on method, while for genes longer than 10 kb, this range was 0.59 to 0.71 (figure 1B–C and figure S2).
The vast majority of all reads originate from the sense strand of protein-coding genes (figure 2A and figure S3). A large fraction of the reads also originate from the introns of protein-coding genes, but when normalized to the length of the introns, the relative expression levels of introns are very low (figure 2B and figure S3, S4, S5). We also note that many reads map to regions distant from protein-coding genes (here denoted as “Other”), which to some extent can be expected since this includes many long non-protein-coding genes. Previous studies have described a class of short transcripts (20–90 nucleotides) that originate from the antisense strand in the promoter regions of genes –, . We investigated tag densities in promoter and terminator regions (defined as 1000 base pairs upstream and downstream of genes, respectively) and are unable to detect an increased density upstream of genes. This is expected since our extraction method does not capture fragments shorter than roughly 200 nucleotides. In the terminator regions, however, the relative antisense tag density is higher than that in exons and promoter regions (figure 2B and figure S3). This indicates that transcription of long RNAs in terminator regions could represent a regulatory mechanism for termination of transcription. We investigated the sense-to-antisense ratio for different regions of the genome. In protein-coding exons, 98% to 99.5% of the reads originate from the sense strand, indicating that antisense transcripts are present at very low levels (figure 2C and figure S3). Interestingly, the sense-to-antisense ratio is markedly increased for promoter and terminator regions. In promoter regions, about 12% of the reads originate from the antisense strand, and in terminator regions, the fraction is 16%. In introns, the corresponding number increases to approximately 50% (figure 2C and figure S3).
To identify novel TARs, we merged reads from all three cell lines and created clusters from overlapping and nearby reads. After subtraction of known genes and non-coding RNA genes, we identify approximately 40,000 novel TARs, of which most are short (figure 3A). In fact, only 1360 TARs are longer than 500 base pairs and only 508 are longer than 1000 base pairs. Expression values for all TARs were calculated using the same approach as for protein-coding genes. This showed that most TARs are lowly expressed and covered by few reads. In A431, only approximately 10% (4144 TARs) are detected by 10 reads or more (figure 3B). The corresponding number in protein-coding genes is 33%, but this number is likely biased by the fact that protein-coding genes are generally longer than the putative novel TARs (data not shown). The method for RNA extraction used in this study excludes fragments shorter than approximately 200 nucleotides. This suggests that the majority of TARs identified in this study are in fact 200 nucleotides or longer and that deeper sequencing is needed to cover the entire TARs in order to define their boundaries.
Even though most novel TARs seem to be lowly expressed, we find a few interesting instances among these. Several clusters are detected downstream of a tRNA pseudogene on chromosome 3. We believe that this is the result of transcription which has been initiated upstream of the pseudogene and continues downstream (Figure 4A). The pseudogene itself has 100% sequence identity to another region in the genome (chr5:79,982,623–79,982,691), and since only reads that map uniquely to the genome were used in this analysis, this gene appears not to be expressed. Very high expression of EGFR is one of the hallmarks for the A431 cell line . In figure 4B, a prolonged exon of the epidermal growth factor receptor (EGFR) is shown, along with two small clusters several tens of kilobases away. Whether or not the two small clusters are in fact novels exons remains to be investigated. In figure 4C, transcription is detected from both strands of a 1.3 Mb-region surrounding Peroxisome Proliferator-activated Receptor Coactivator-1 (PPARGC1A) on chromosome 4. Expression from this regions is detected at high levels in A-431 cells, but is almost completely shut off in U-2 OS and U-251 cells. This provides an intriguing example of complex transcription, and could indicate gene regulation through antisense transcript expression. This gene has been implicated in diabetes where lower expression has been linked to insulin resistance and DNA damage . What functional role the antisense transcript plays remains to be elucidated.
In the current study we have investigated the transcriptional levels of three human cancer cell lines using RNA-seq. We show that the correlation between DNA microarray data and RNA-seq data depends on gene length, and that the reason for this is the increased precision in expression level measurements for longer genes due to the fact that a higher number of reads will map to longer genes than short, as described earlier . In this study, this means that the correlation between DNA microarrays and RNA-seq ranges from approximately 0.48 to 0.71 depending on which gene length and microarray probe selection method is used. However, this points to a larger statistical issue when using RNA-seq data to assess differential expression, since long genes will bias for example lists of differentially expressed genes between samples, and thus influence the power of gene set enrichment analysis negatively. Future research in this area will certainly have to address this issue, for example by improved statistical methods or by limiting the analysis to reads mapping to the 3 part of the gene with length equal to the shortest gene included in the analysis.
We show that for approximately 20% of all human protein-coding genes, there is at least weak antisense transcription to exonic sequences. We also show that many of the antisense signatures overlap between the investigated cell lines (figure S3b). During recent years, several studies have indicated that 20–90% of all human genes can generate antisense transcripts that can mediate regulation of the sense transcript –. Our study falls into the lower end of that interval, possibly indicating that deeper sequencing is required to investigate this phenomenon further. We also investigate antisense transcription in different regions of the genome. He and colleagues demonstrated that antisense transcription was prevalent upstream of transcription start sites, and Preker and colleagues showed that these transcripts are polyadenylated and short (20–90 nucleotides) , . We do not identify such a pattern, which is likely explained by the fact that our study targets transcripts longer than 200 nucleotides. After clustering reads, we identify many novel TARs, most of which are shorter than 200 base pairs. This is likely due to the fact that they are generally lowly expressed, and a deeper sequencing of these samples would likely reveal the remaining parts of these TARs. Interestingly, we see approximately equal levels of transcription from both strands of introns of protein-coding genes. Non-protein coding intronic transcripts have been shown to be enriched in genes related to transcription regulation and interact with promoters to mediate regulation , .
The ENCODE project showed that transcription was frequent even outside of protein-coding genes , and with the recent emerge of new sequencing technologies, vast numbers of new transcriptionally active units have been detected. These TARs are situated in a non-random pattern along the chromosomes, indicating that they are not general background transcription. Some also show patterns of differential expression (figure 4). As more in-depth transcriptome studies deposit their data into publically available warehouses, such as Gene Expression Omnibus (http:www.ncbi.nlm.nih.gov/geo), more regions like these will likely be detected and characterized. It will be of great importance to functionally characterize these novel non-protein-coding transcripts and their potential role in gene regulation.
A431, U-2 OS and U251 cells were grown as described earlier . Cells were harvested and RNA was extracted using the RNeasy mini kit (Qiagen, Valencia, CA) following the manufacturer's instructions, and 15 g of total RNA was used as input material for the SOLiD Whole Transcriptome kit (Applied Biosystems Inc., Foster City, CA) and 14 372 246, 10 547 681 and 11 449 673 reads (each 50 nucleotides) passed quality filters including filtering against adaptors. The reads were mapped to chromosomes 1–22, X and Y of the human genome (hg18) using Corona lite with default parameters (Life Technologies/Applied Biosystems).
Two-color DNA microarray data for the cell lines A431 and U251 was provided by Gry et al., and was pre-processed as described elsewhere . The quality of the arrays has previously been addressed by comparison with MAQC data . To allow for comparison to with RNA-seq data, RPKM expression levels were calculated for every Ensembl gene (http:www.ensembl.org) as described elsewhere , and a log2-fold change was calculated for the ratio A431 versus U251. Since one gene can be interrogated with several microarray probes, three different methods were used transform the microarray expression data to one value per gene; the mean of all probes, the median of all probes or the probe with the value closest to the RNA-seq data. We used Spearman's rho to quantify the correlation between the two platforms.
To investigate the sense and antisense expression in different genomic regions, we calculated the number of reads that map to each region of interest. Some regions (coding regions, introns, 5 UTRs and 3 UTRs of protein-coding genes) were downloaded as BED-files from the UCSC table browser. Promoter and terminator regions were defined as 1 kb upstream or downstream of a protein-coding gene, respectively, similar to what has been used earlier . If a neighbouring gene resides within the promoter or terminator region, the overlap with this gene was removed from the promoter or terminator region. For each region type, we calculated the expression density by counting number of reads that map entirely within the region type and normalized to the total length of the regions and the total amount of sense or antisense reads. This procedure yields one relative tag density value for each region type, sense and sample. We also calculated the sense-to-antisense ratio for each regions type and sample.
To identify putative novel transcriptionally active regions, we clustered reads (using the online-version of Galaxy,  allowing for reads to be 15 bases apart and require at least three reads to be present to form a cluster. These first clusters were then merged across cell lines. We then subtracted clusters that overlap with known genes (as defined by Ensembl genes) as well as non-coding RNA genes (RNA genes, UCSC Genome Browser).
Overlap of sense and antisense expression between the cell lines.
(0.23 MB PDF)
Correlation to microarray data, binned per gene length in intervals of 2000 bps. See main text for discussion.
(11.70 MB TIF)
Information on read mappings for additional cell lines. (D, G) Fraction of reads mapping to different regions. (E, H) Relative tag density in different regions. (F, I) Fraction reads mapping to the sense and antisense strand for different regions. See main text for discussion.
(0.20 MB PDF)
Smooth scatterplots of log10(rpkm) between samples along with Spearman's rho correlation. The correlation is .87 to .88 between all samples. This indicates that most genes have similar levels across all samples.
(1.41 MB PDF)
Smooth scatterplots of log10(antisense-rpkm) between samples. Spearman's rho correlation coefficient is here slightly lower than that in the sense-case (supplementary figure S4). A reason for this could be that the majority of antisense transcripts are lowly expressed. It is also possible that these antisense transcripts have regulatory function and differ more than the bulk of mRNAs expressed in a cell.
(1.41 MB PDF)
We would like to thank Dr. Emma Lundberg for providing cells and Uppsala Genome Center for technical assistance and helpful discussions. Also, Dr. Rickard Sandberg and Daniel Ramskld are acknowledged for valuable discussion on data analysis and provision of analysis scripts.
Competing Interests: The authors have declared that no competing interests exist.
Funding: This work was supported by the Swedish Scientific Council, and the Knut and Alice Wallenberg Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.