To introduce and discuss microarrays and deep sequencing for measuring the transcriptome, we will use a fly example from our own laboratory: experiments designed to profile gene expression in female and male heads of Drosophila pseudoobscura. This is one of several fly species that we are profiling to validate evolutionarily novel D. melanogaster transcripts in the model organism Encyclopedia of DNA Elements (modENCODE) project [43]. We performed microarray and RNA-Seq experiments on the same samples and then compared the expression measurements from the two platforms.
Figure 1 describes an expression experiment designed to identify genes that are differentially expressed between D. pseudoobscura female and male heads. Heads were manually dissected from flies over dry ice, total RNA was extracted, and poly A+ mRNA was selected. The poly A+ selected mRNA was converted to cDNA by reverse transcription using end-labeled random nonamers, so that a fluorophore is added to the 5' end of each short cDNA. The cDNA of one sex was labeled with one fluorescent dye (cyanine 3, or Cy3) and the cDNA of the other sex with a second dye (cyanine 5, or Cy5) that fluoresces at a different wavelength. We generated replicate samples (N = 4) and samples with the dyes swapped between females and males, both to control for technical artifacts due to labeling and dye biases and to measure the inherent variability in gene expression irrespective of the sex of the sample.
Figure 1. Data production workflow for microarrays. Microarrays require labeling of target material, hybridization to arrays, washing, and scanning to obtain measures of gene expression. RNA converted to cDNA from the sample will hybridize to the corresponding ...
As with any assay, replicate samples are critical for statistical analysis. The female and male labeled cDNA samples were mixed and applied to the microarray for hybridization. cDNAs that are complementary to probes on the microarray hybridize according to a simple principle: more highly expressed genes have more transcripts converted to labeled cDNA, and these more abundant cDNAs bind their target probes in greater quantity than cDNAs from less expressed genes. Because we co-hybridized samples labeled with different fluorescent dyes, we can compute a ratiometric expression score between female and male heads: a gene that is more highly expressed in one sex contributes more labeled cDNA from that sex to its target probe, so the signal from that sex's dye is stronger. Genes that are expressed at the same level in both sexes have equivalent amounts of transcript from each sex bound to their probes, so the Cy3 and Cy5 signals combine to give a signal intermediate between the two (yellow fluorescence). The analysis and normalization methods for microarrays are highly developed [12], so this experiment should allow differences in steady-state mRNA levels between female and male head tissue to be measured reliably.
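The ratiometric score and the role of the dye swap can be illustrated with a minimal sketch. It assumes, purely for illustration, that females are labeled with Cy3 on the forward array and with Cy5 on the swapped array (the text does not say which sex received which dye), and that intensities are already background-corrected. The function names are hypothetical, and real analyses apply further normalization that is not shown here.

```python
import math


def log2_ratio(female_signal, male_signal):
    """Return log2(female / male) for one probe on one array."""
    return math.log2(female_signal / male_signal)


def dye_swap_average(forward, swapped):
    """Average the female/male log2 ratio over a dye-swap pair.

    forward: (cy3, cy5) intensities with female = Cy3, male = Cy5 (assumed).
    swapped: (cy3, cy5) intensities with female = Cy5, male = Cy3 (assumed).
    A multiplicative dye bias enters the two arrays with opposite sign,
    so it cancels in the average.
    """
    forward_ratio = log2_ratio(forward[0], forward[1])
    swapped_ratio = log2_ratio(swapped[1], swapped[0])
    return (forward_ratio + swapped_ratio) / 2.0


# Hypothetical probe: true female/male ratio of 4, with Cy5 reading 2x too bright.
forward_array = (400.0, 200.0)   # female Cy3 = 400, male Cy5 = 100 * 2
swapped_array = (100.0, 800.0)   # male Cy3 = 100, female Cy5 = 400 * 2
print(dye_swap_average(forward_array, swapped_array))
```

In this made-up case the forward array alone suggests a 2-fold difference and the swapped array alone an 8-fold difference, while the dye-swap average recovers the 4-fold difference (a log2 ratio of 2). A positive score indicates female-biased expression and a negative score male-biased expression.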
Figure 2 shows the same analysis performed by RNA-Seq, using an Illumina Genome Analyzer and a commonly deployed protocol for preparing libraries [44]. First, the female and male transcriptomes are fragmented by alkaline hydrolysis and then reverse-transcribed with random hexamer primers to make double-stranded cDNAs. Next, the ends of the fragments are prepared so that oligonucleotide adaptors can be ligated onto them. Fragments are then size-selected, amplified by PCR, and injected into a flow cell. The flow cell is a glass slide with a series of separate lanes in which the sequencing reactions take place; it carries a lawn of oligonucleotides complementary to the adaptors ligated to the fragments.
Figure 2. Data production workflow for RNA-Seq. RNA-Seq requires building libraries of fragmented RNA that are then converted to cDNA by reverse transcription, followed by adaptor ligation and size selection. Sequencing libraries are prepared for clustering on ...
Once the adaptors on the DNA fragments have hybridized to the complementary oligonucleotides in the flow cell, the fragments are amplified by isothermal bridge amplification. (In isothermal bridge amplification, the templates arch over and bind to adjacent oligonucleotides, and DNA polymerase then copies the templates.) The double-stranded DNAs are denatured and the process is repeated to generate clusters of DNA clones. Next, the free 3' OH ends of the linearized clusters are blocked to prevent nonspecific sequencing reactions. Finally, the clusters are denatured and a sequencing primer is hybridized to the linearized and blocked clusters.
Sequencing consists of a series of reaction cycles, each of which images one base within each cluster. Bases are imaged using fluorescently labeled reversible terminator nucleotides. The first base in each cluster is identified by adding the four labeled reversible terminators, primers, and polymerase; a laser then excites the fluorophores, which allows the first base to be identified. The next cycle repeats the incorporation of the four reversible terminator nucleotides, primers, and polymerase, and the laser again excites the terminators so that the next base can be identified. These cycles of reagent addition, laser excitation, and data capture are repeated to produce a read; typical reads range from 25 to over 75 base pairs in length. At the end of a run (3-7 days or more, depending on read length) there are 30-40 million (possibly more) high-quality sequences.
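The cycle-by-cycle logic of base identification can be sketched as follows. This is a deliberately simplified illustration, not the instrument's base-calling software: it assumes each cycle yields one intensity per nucleotide channel for a cluster and simply picks the brightest channel. The channel order and the function name are hypothetical, and production base callers also correct for effects such as channel crosstalk and assign quality scores.

```python
BASES = "ACGT"  # assumed channel order for this illustration


def call_read(cycles):
    """Call one base per sequencing cycle for a single cluster.

    cycles: list of 4-tuples of fluorescence intensities, ordered A, C, G, T.
    Returns the read as a string, one base per cycle.
    """
    read = []
    for intensities in cycles:
        # The incorporated terminator is taken to be the brightest channel.
        brightest = max(range(4), key=lambda i: intensities[i])
        read.append(BASES[brightest])
    return "".join(read)


# Three hypothetical cycles for one cluster.
example_cycles = [(9.0, 1.0, 0.5, 2.0), (1.0, 8.0, 1.0, 0.2), (0.3, 2.0, 1.0, 7.0)]
print(call_read(example_cycles))
```

The read length equals the number of cycles, which is why longer reads require longer runs.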
The RNA-Seq measure of gene expression is the density of reads mapping to a particular transcript. For species with sequenced genomes, a common method is to map reads to a reference genome. Illumina provides a mapper called ELAND, but many free open source tools are available. The tools that we have used most extensively for RNA-Seq are those of the Tuxedo suite: Bowtie [45], a short read mapper; TopHat [46], a splice junction identifier; and Cufflinks [33], a transcript assembler. Two conceptually similar expression metrics that provide a value normalized by overall sequencing depth are commonly used: FPKM (expected fragments per kilobase of transcript per million fragments mapped) and RPKM (reads per kilobase per million mapped reads) [23]. In the example given in Figure 2, we estimate expression in units of RPKM by quantifying reads that map to genes predicted from genomic sequence. Genes with higher RPKM in females are therefore examples of female-biased expression, genes with higher RPKM in males are examples of male-biased expression, and genes with equivalent RPKM in both sexes are examples of non-sex-biased expression.
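As a worked illustration of the RPKM definition above, the following minimal sketch computes RPKM from a read count, a gene length, and the total number of mapped reads, and then labels a gene by comparing female and male values. The fold-change threshold, the pseudocount, and the function names are assumptions made for this example only; an actual analysis would rely on replicates and a statistical test rather than a fixed cutoff.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def classify_sex_bias(female_rpkm, male_rpkm, fold=2.0, pseudocount=1.0):
    """Label a gene from its female and male RPKM values.

    fold and pseudocount are illustrative choices, not values from the text.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    ratio = (female_rpkm + pseudocount) / (male_rpkm + pseudocount)
    if ratio >= fold:
        return "female-biased"
    if ratio <= 1.0 / fold:
        return "male-biased"
    return "non-sex-biased"


# Hypothetical gene: 2,000 bp long, 10 million mapped reads per library.
female = rpkm(500, 2000, 10_000_000)   # 25 RPKM
male = rpkm(100, 2000, 10_000_000)     # 5 RPKM
print(female, male, classify_sex_bias(female, male))
```

Because RPKM divides by both gene length and library size, the 500 reads on this hypothetical 2 kb gene in a library of 10 million mapped reads correspond to 25 RPKM, which can be compared across genes and across libraries of different depth.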