Using an array-based sequence capture and enrichment strategy followed by long-read sequencing, we have identified large numbers of novel exons, splice sites, and transcription start sites in both testes and ovaries of the fruit fly. We also provide a list of available fly lines containing P-element mutations that target the novel first exons we have identified (Table S8), which should be useful to fly researchers studying these genes. This study provides proof of principle for more directed future studies that seek to enrich more specific types of transcripts. For example, the capture and sequencing of transcriptionally complex splice isoforms of a gene, each expressed in small subsets of cells, would be an obvious extension of our study. Drosophila Dscam can potentially encode over 38,000 isoforms, and these isoforms are expressed predominantly in one tissue type (namely, the nervous system). Without ultra-deep sequencing of nervous system tissue, the true diversity of these transcript isoforms cannot be elucidated. By using a sequence capture array that targets only Dscam, a more comprehensive cataloging of the expressed isoforms can be undertaken. Interestingly, we identified two partial transcripts in testes that aligned to Dscam and represent novel isoforms that had not been annotated before, suggesting that these are potential testes-specific Dscam transcripts (Table S15) that may be associated with the neurons that innervate the musculature of the testes.
We chose the 454 platform for our high-throughput sequencing over other available platforms for two main reasons. First, the 454 system has been shown to work well with longer input sequence fragments. Because the cDNA nebulization step yielded fragments with an average length of 700 bp, we were confident that the platform would generate high-quality sequences. Second, and most importantly, we wanted a system capable of long reads so that we could easily and unambiguously annotate any novel exon structures, including their connectivities with annotated (or other unannotated) exons. We recognize that the Illumina paired-end technology has been substantially developed over the last couple of years, and this type of platform could be used in conjunction with array capture in further studies.
This study demonstrates that the CoNECT methodology can identify genes exclusively expressed in a small subset of the overall cell population of a given tissue. For example, genes expressed in two pairs of nerves in the Drosophila
) were identified with CoNECT, even though the neurons make up less than 0.01% of the total cell number in the ovary [over 100,000 nonneuronal cells, including stem, follicle, and nurse cells, are present in a developing ovary (Spradling 1993)]. Along the same lines, both C(2)M
(two other genes identified by CoNECT) encode important meiosis-specific synaptonemal complex proteins and are only expressed in about 20 cells per ovariole. It is important to note that these genes, and other classes of genes represented by lower overall sequencing read numbers, are not necessarily expressed at low levels; their transcripts, even if expressed highly in a few cells, may simply be underrepresented in the overall transcript pool due to the largely nonneuronal cell population. Therefore, many of the genes identified by CoNECT that are expressed in smaller cellular populations of a tissue may in fact be important primary regulators of cellular function. By applying enrichment strategies followed by gene ontology analysis, as outlined above, specific gene classes can be identified even without prior knowledge of which cell types are present in the tissue of interest. Importantly, without enrichment strategies such as CoNECT, transcriptomics of minority cell types would be difficult. This methodology, alongside traditional RNA-seq (Gan et al. 2010; Graveley et al. 2011) and locus-directed capture strategies, such as those detailed in Mercer et al. (2012), provides a means to more faithfully annotate biologically important transcripts expressed in all layers of a tissue of interest across the entire genome. Several exome capture platforms are already commercially available for a variety of organisms, and these can immediately be employed to probe the transcriptome of interest.
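The point that a gene can be highly expressed in a few cells yet rare in the whole-tissue transcript pool can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the cell fraction uses the upper bound quoted above for ovarian neurons (0.01% of cells), while the within-cell share of 1% and the assumption that every cell contributes the same total amount of mRNA are hypothetical simplifications, not measurements from this study.

```python
def pool_fraction(cell_fraction, within_cell_share):
    """Share of the whole-tissue transcript pool contributed by a gene that is
    expressed only in a minority cell type.

    Simplifying assumption (hypothetical): every cell contributes the same
    total amount of mRNA to the pool.
    """
    return cell_fraction * within_cell_share


# Upper bound from the text: neurons are < 0.01% of ovarian cells.
CELL_FRACTION = 1e-4
# Hypothetical: the gene accounts for 1% of the mRNA in each expressing cell,
# i.e. it is highly expressed within those cells.
WITHIN_CELL_SHARE = 0.01

fraction = pool_fraction(CELL_FRACTION, WITHIN_CELL_SHARE)
reads_per_million = fraction * 1_000_000

print(f"pool fraction: {fraction:.2e}")
print(f"expected reads per million (unenriched): {reads_per_million:.2f}")
```

Under these assumed numbers, a transcript that is abundant within its own cells would still account for only about one read per million in an unenriched library, which is the sense in which such genes are underrepresented rather than lowly expressed.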
It should be emphasized that we chose a whole-genome exome array for these initial studies to provide proof of principle that our methodology works and to maximize our chances of identifying the largest number of novel exons and splicing events of the Drosophila germline, a tissue that has been noted to have a large diversity of transcript isoforms. Nonetheless, future studies will target specific gene transcripts, thus allowing for greater sequencing depth of the relevant transcripts while eliminating unwanted transcripts from the sequencing step. Indeed, we have shown that the exome array can specifically enrich for genes interrogated on the array at the expense of genes not included in the array design (e.g.
Why have we been able to identify so many additional genes expressed in the ovary? Several possibilities exist. First, we have shown that CoNECT can specifically enrich for lower-expressed transcripts at the expense of higher-expressed transcripts, which is most obvious for the CoNECT/tiling array expression comparison (Figure S3A) but is also seen with the CoNECT/traditional RNA-seq comparison (Figure S3B). This is likely due in part to cDNAs from some highly expressed transcripts saturating the available probes on the capture array such that no additional copies of those transcripts can be captured, thereby “enriching” for lower-expressed transcripts to a greater extent than higher-expressed transcripts. Second, the nature of the hybridization strategy of the capture protocol may facilitate enrichment of lower-level transcripts even in relation to transcripts that do not saturate their target exonic probes. For example, most transcripts of a gene expressed at moderate-to-high levels will initially hybridize efficiently to their capture probes, but the hybridization becomes less efficient as time goes on because fewer target probes remain available. This is not the case for lower-expressed transcripts, for which there is always an abundance of capture probes available for hybridization. Thus, over time, hybridization of lower-level transcripts would be favored over transcripts from more highly expressed genes. Third, given that highly repeated rRNA genes were omitted from the capture array design, more of the sequencing capacity was devoted to the transcripts of interest, which allows for better interrogation of low-level transcripts. Whatever the case, CoNECT was able to identify a great many additional genes as being expressed compared with either the traditional RNA-seq dataset (with 55-fold fewer reads) or the tiling array gene expression dataset.
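The first explanation above, probe saturation, can be illustrated with a toy model. The sketch below is not the analysis used in this study: it assumes a hard-saturation rule in which the number of molecules captured per gene is the smaller of the input abundance and a fixed probe capacity, and all abundances and the capacity value are hypothetical. It does not model the time-dependent hybridization kinetics described in the second explanation.

```python
def capture(abundance, probe_capacity):
    """Hard-saturation toy model: a gene's probes cannot capture more
    molecules than their capacity, however abundant the transcript is."""
    return min(abundance, probe_capacity)


def relative_fractions(counts):
    """Convert per-gene molecule counts to fractions of the pool."""
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


# Hypothetical input pool (molecules per gene) and per-gene probe capacity.
pool = {"high": 100_000, "mid": 1_000, "low": 10}
PROBE_CAPACITY = 500

captured = {gene: capture(n, PROBE_CAPACITY) for gene, n in pool.items()}

before = relative_fractions(pool)
after = relative_fractions(captured)

# Fold change in each gene's share of the pool after capture.
enrichment = {gene: after[gene] / before[gene] for gene in pool}

for gene in pool:
    print(f"{gene}: {before[gene]:.5f} -> {after[gene]:.5f} "
          f"(fold change {enrichment[gene]:.2f})")
```

In this toy setting, the genes whose input exceeds the probe capacity are clipped, so the share of the captured pool held by the low-abundance gene rises the most while the share held by the most abundant gene falls, mirroring the qualitative trend described for Figure S3.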