We show for the first time that whole exome capture can be applied to cDNA for efficient detection of transcripts and alternative splice variants. Our data support and extend the results of more targeted studies (18), showing significantly increased power to detect transcripts present at very low levels. There are several potential research and diagnostic applications for whole exome capture. One advantage of the approach is the ability to discover new exons. We find evidence for multiple previously unannotated exons, with the largest number identified in fetal frontal cortex. The majority of these map within existing gene annotations (n = 4515), making them by far the most over-represented group compared with novel 5′ or 3′ exons (n = 495 and 477, respectively). Previous studies have shown that a large fraction of sequence reads from poly(A)-RNAseq map to intronic regions (15). The fact that so many of the novel exons we find map to introns may explain a significant portion of those intronic reads. A consequence of novel exons that extend existing gene models is that a larger fraction of the genome will be covered by genes than previously thought. This, in turn, may explain a significant portion of sequence reads mapping to intergenic regions.
We find a large number of previously unannotated alternative splice variants, based on splice junctions between known exons (on average 13 032 per tissue). We also note that many genes have multiple alternative 3′-UTRs, although many of these occur at different places within existing 3′-UTR annotations. This seems to be a common feature of many genes that is currently not well annotated. Our validation data show that the majority of new alternative splice isoforms that we detect are expressed at low levels, explaining why they have not been reported previously. Overall, recent data from our group and others (15) indicate that there are many low-level transcripts that are not captured by current array or sequence-based methods.
There are some limitations to the ExomeRNAseq approach. Not all transcripts are represented in the existing commercial exome enrichment kits, and not all probes are efficient at capturing their targets. In addition, a drawback of RNA capture is that the ability to accurately quantify gene expression is significantly lowered compared with conventional RNA-seq strategies. This is mainly due to the differential efficiency of capture probes to hybridize to their targets. We attempted to correct for this by using results from DNA exome capture based on the same probes, where we know that the number of targets is two for most probes. Normalizing against the DNA coverage of the same probes seems to work well for some genes, but has only a minor effect on the global correlation between TotalRNAseq and ExomeRNAseq (data not shown). One problem with this approach is that it also normalizes for other factors that affect coverage, such as mappability, and because the normalization is applied only to ExomeRNAseq and not TotalRNAseq, it may in some cases skew the results further. It is possible that an additional correction that includes mappability scores could improve the situation, but the main conclusion from our results is that quantification accuracy is inevitably lowered after target enrichment.
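The idea of normalizing RNA capture coverage against DNA capture coverage for the same probes can be illustrated with a minimal sketch. This is not the procedure used in this study, whose details are not given here. It assumes that a probe's relative capture efficiency can be approximated by its DNA coverage divided by the median DNA coverage across probes, on the grounds that most targets are present in two copies in DNA. The function name, the `min_dna` threshold, and the dictionary-based input format are all hypothetical.

```python
from statistics import median

def normalize_by_dna_coverage(rna_cov, dna_cov, min_dna=1.0):
    """Scale per-probe RNA capture coverage by relative DNA capture efficiency.

    rna_cov, dna_cov: dicts mapping a probe ID to mean read coverage.
    min_dna: hypothetical threshold; probes with lower DNA coverage are
             dropped because their efficiency estimate would be unstable.
    """
    # Keep only probes observed in both experiments with usable DNA coverage.
    shared = [p for p in rna_cov if p in dna_cov and dna_cov[p] >= min_dna]
    if not shared:
        return {}

    # Assumption: the median DNA coverage represents a probe of typical
    # efficiency, since copy number is two for most targets.
    reference = median(dna_cov[p] for p in shared)

    # Divide RNA coverage by the probe's relative efficiency.
    return {p: rna_cov[p] * reference / dna_cov[p] for p in shared}


rna = {"probe_a": 100, "probe_b": 100, "probe_c": 50, "probe_e": 10}
dna = {"probe_a": 20, "probe_b": 40, "probe_c": 80, "probe_e": 0.5}
normalized = normalize_by_dna_coverage(rna, dna)
```

In this toy input, `probe_a` and `probe_b` have equal raw RNA coverage, but `probe_a` captures DNA only half as efficiently, so its normalized value is doubled relative to `probe_b`. As noted above, DNA coverage also reflects factors such as mappability, so a correction of this kind can remove signal that affects ExomeRNAseq and TotalRNAseq alike.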
There are several applications for which ExomeRNAseq may be beneficial compared with current approaches. The ability to detect junctions present at low levels opens up the possibility of using ExomeRNAseq for genome-wide discovery of translocations causing novel fusion transcripts in solid tumors. Translocations are challenging to detect in DNA tumor samples, and detection is further complicated by the fact that samples often represent a mix of tumor and normal cells so that only a fraction of the cells carry the rearrangement. We propose that RNA capture may be a way to approach these challenges.
Based on the fact that we identify a large number of new exons, we propose that ExomeRNAseq may be an excellent approach for cross-species comparisons. It was recently shown that exome capture on DNA can efficiently be used to map variation across primates (24), and it should work equally well for RNA-based capture. Since we show that we can find a large number of coding variants in the data, exome enrichment at the level of RNA can be used both for annotation of gene models and identification of variation.
Our data support previous findings that our understanding of transcription and post-transcriptional regulation is limited, and that current approaches find only a fraction of the transcript diversity. Our results suggest that very deep sequencing of captured or enriched portions of the transcriptome may be the best way to uncover the complete spectrum of transcript diversity in any given cell type or tissue.