Deep sequencing of small RNA populations
RNA silencing represents a pathway that controls expression of specific genes transcriptionally and post-transcriptionally [
43]. Small RNAs (smRNAs) are the sequence-specific effectors of RNA silencing pathways, directing the negative regulation or control of genes, repetitive sequences, viruses, and mobile elements [
44,
45].
To gain insight into the total smRNA population and a better understanding of smRNA function in plants, a number of groups have turned to sequencing the smRNA component of the plant transcriptome (smRNAome). Numerous groups have recently employed the Genome Sequencer FLX from 454 Life Sciences and the Illumina Genome Analyzer to examine the smRNAome of various plant species [
42•,
46–
60]. With these two technologies, sequencing of smRNAomes from plants carrying various genetic lesions has led to the identification and categorization of millions of smRNAs, as well as the discovery of biogenesis factors and regulators of specific smRNA populations [
42•,
48–
51,
53,
55,
57]. For instance, sequencing the smRNAomes of
Arabidopsis thaliana plants harbouring lesions in genes encoding DNA methyltransferases in conjunction with single-base resolution DNA methylation analysis (see above) revealed a strong correlation between the location of smRNAs and DNA methylation, a disruption in biogenesis of specific smRNA size classes upon loss of CpG DNA methylation, and the potential of smRNAs for directing strand-specific DNA methylation in regions of RNA-DNA homology [
42•]. In another study, sequencing experiments using
Arabidopsis thaliana rdr2 and maize
mop1-1 mutant plants, which lack a homologous RNA-dependent RNA polymerase, revealed that loss of this protein results in a significant decrease in the 24 nt smRNA population of the smRNAome. This loss of 24 nt smRNAs was accompanied by an increase in the sequencing of 21 nt smRNAs, which upon subsequent analysis led to the identification of numerous previously unidentified miRNAs throughout the
Arabidopsis thaliana (
rdr2) and maize (
mop1-1) genomes. Furthermore, these studies revealed that 24 nt smRNAs, which are mostly associated with repetitive elements and heterochromatic regions of the genome, comprise the bulk of the
Arabidopsis thaliana and maize smRNAome complexity [
53,
55].
With these technologies becoming increasingly accessible, the number of plant species with sequenced smRNAomes is steadily increasing [
46,
47,
52,
54–
60]. So far, this collection of sequence data has shown that smRNAome composition is not uniform across species. More specifically, the distribution of smRNAs among size classes differs between plants. This differential distribution of smRNA lengths is hypothesized to reflect differences in the maintenance of genomic organization between plant species whose genomes vary dramatically in size [
54,
61].
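The comparison of size classes described above amounts to tabulating read lengths. The following is a minimal sketch of that tabulation, not the method used in the cited studies; the function name, the 18 to 30 nt window, and the toy reads are assumptions made for illustration only.

```python
from collections import Counter


def size_class_fractions(reads, min_len=18, max_len=30):
    """Return the fraction of reads at each length within [min_len, max_len].

    The length window is an assumed filter, not a value taken from the text.
    """
    lengths = Counter(len(r) for r in reads if min_len <= len(r) <= max_len)
    total = sum(lengths.values())
    if total == 0:
        return {}
    return {n: lengths[n] / total for n in sorted(lengths)}


# Toy reads invented for illustration (one 21 nt read, three 24 nt reads).
toy_reads = ["A" * 21, "C" * 24, "G" * 24, "T" * 24]
fractions = size_class_fractions(toy_reads)
```

Computing such fractions for each species or mutant gives directly comparable profiles, for example the relative share of the 21 nt and 24 nt classes.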
Ultimately, with millions of sequence reads generated in each run and the ability to determine the exact nucleotide length of every identified smRNA, machines such as the 454 sequencer, Illumina Genome Analyzer, and Applied Biosystems SOLiD provide ideal platforms for complete indexing of the plant smRNAome. Additionally, the increased use of barcoding of numerous smRNA samples [
51], and subsequent multiplexing, will result in the sequencing of smRNAomes from an even greater variety of plant species. With the ensuing flood of smRNA sequencing data from an immense collection of plant species, a clearer view of the dynamic nature of plant smRNAomes will emerge. These datasets will also help to elucidate how these small regulatory RNA molecules have evolved between plant species to regulate genomes with such disparity in size.
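Barcoded samples that are sequenced together must be separated again computationally. The sketch below shows one simple way this could be done, assuming that each read begins with a fixed-length barcode that must match exactly; the barcode sequences, sample names, and function name are hypothetical and do not describe the protocol of the cited study.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact match of its leading barcode.

    barcodes maps barcode sequence -> sample name; all barcodes are assumed
    to have the same length and to sit at the start of the read.
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    k = lengths.pop()
    by_sample = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:k], "unassigned")
        by_sample[sample].append(read[k:])  # strip the barcode from the insert
    return dict(by_sample)


# Hypothetical barcodes and reads, for illustration only.
toy_barcodes = {"ACGT": "wild_type", "TGCA": "mutant"}
toy_reads = ["ACGTTTTT", "TGCAGGGG", "NNNNCCCC"]
assigned = demultiplex(toy_reads, toy_barcodes)
```

Exact matching is the strictest choice; tolerating sequencing errors in the barcode would need a mismatch-aware lookup, which is omitted here.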
mRNA sequencing for transcript discovery and profiling
As the astounding and unexpected complexity of eukaryotic transcriptomes has become apparent over the last few years [
24,
62–
68], so the requirement has grown for techniques that allow broad but accurate characterization of the dynamic cellular complement of transcripts. Ideally such approaches will incorporate highly specific, sensitive and quantitative measurements over a large dynamic range with a flexibility to identify unanticipated novelties in transcript structures and sequences.
A number of studies have recently used deep sequencing to perform surveys of the mRNA component of the transcriptome in various organisms, enabling parallel quantification and annotation of cellular transcripts. While sequencing of cDNA pools is a well established technique, for example the sequencing of EST libraries [
69], the ability to rapidly and cheaply generate diverse cDNA sequence datasets will allow the transcriptional activity of a vast array of different cell types, mutants and environmental conditions to be analyzed. Deep sequencing of cDNA, referred to as RNA-seq, overcomes several shortcomings of microarray-based detection of transcripts, including probe cross-hybridization [
70], restricted signal dynamic range, and low sensitivity and specificity, which often lead to difficulties in detecting low abundance transcripts and discriminating between similar sequences. Sequence-level transcript information has much greater power to distinguish between paralogous genes, improves detection of low abundance transcripts, and allows replicable digital quantification based upon counting of sequence reads [
71–
75]. Furthermore, RNA-seq can identify transcript sequence polymorphisms, novel trans-splicing events and splice isoforms, and does not strictly require a reference genome sequence. Whilst approaches such as SAGE, CAGE and MPSS have enabled parallel sequencing of short reads from many transcripts, they suffer from poor coverage of each transcript and potentially ambiguous mapping due to the short read length [
76–
78]. In contrast, RNA-seq can produce complete coverage of transcripts, providing information about the sequence, structure and genomic origins of the entire transcript.
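Digital quantification by read counting can be illustrated with a short calculation. The text does not name a normalization, so the sketch below uses reads per kilobase of transcript per million mapped reads (RPKM), one widely used choice, as an assumption; the gene names, counts, and lengths are invented.

```python
def rpkm(read_counts, transcript_lengths):
    """Normalize raw read counts to reads per kilobase per million mapped reads.

    read_counts: transcript -> number of mapped reads
    transcript_lengths: transcript -> length in nucleotides
    """
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {
        t: count * 1e9 / (total * transcript_lengths[t])
        for t, count in read_counts.items()
    }


# Invented example: equal read counts, but geneB is twice as long as geneA.
toy_counts = {"geneA": 500, "geneB": 500}
toy_lengths = {"geneA": 1000, "geneB": 2000}
expression = rpkm(toy_counts, toy_lengths)
```

In this toy case geneA receives twice the normalized value of geneB, because the same number of reads is spread over half the transcript length.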
Several strategies have been employed to perform shotgun sequencing of cellular mRNAs, but they can be broadly categorized as either “stranded” RNA-seq, yielding strand-specific data that informs about transcript directionality, or “strandless” RNA-seq, where sequencing of double-stranded cDNA fragments loses the strand of origin information [
79•]. The first papers reporting RNA-seq of plant transcripts with one of the new deep sequencing technologies utilized the 454 sequencer, generating strandless RNA-seq data from double stranded cDNA of
Medicago truncatula,
Arabidopsis thaliana and maize [
80–
82]. Cheung and colleagues [
80] sequenced adapter-ligated fragments of a normalized
Medicago truncatula cDNA library, assembling the reads into contigs representing thousands of previously unobserved and rare transcripts. In
Arabidopsis thaliana seedlings, Weber
et al. [
81] generated reads mapping to 17,449 genes, accounting for ~90% of the transcripts estimated to be expressed in the sample, identifying reads from previously unannotated transcripts and predicted genes with no prior EST support. Finally, Emrich and colleagues [
82] sequenced cDNA from maize shoot apical meristem cells isolated by laser-capture microdissection, identifying over 25,000 genomic sequences, including nearly 400 orphan transcripts that had no homology to sequences from any other species and appeared to be expressed in a cell-type specific manner. Clearly, shotgun sequencing is sensitive enough to characterize the transcript complement of individual cell types.
Several recent publications have utilized the Illumina Genome Analyzer and Applied Biosystems SOLiD instruments to generate vast datasets of short expressed tags in
Arabidopsis thaliana, human, mouse and yeast [
42•,
71–
75,
83]. Essentially, these instruments yield vastly more transcriptome sequence per run than the 454 Life Sciences instrument, typically over one hundred million individual reads; however, these reads are significantly shorter than those from the 454 instrument. Thus, while many more unique sequence tags are generated, the shorter read length of the Illumina and Applied Biosystems machines makes transcript assembly, identification of multiple splicing events within the same mRNA molecule, and unambiguous read alignment to transcripts with highly similar sequences more challenging. However, the vast quantity of short read sequence is extremely powerful for transcript quantification, gene discovery, correction of transcriptional unit structure annotation, and detection of alternative splicing [
72••,
74].
In a recent study, Lister
et al. [
42•] utilized a strand-specific RNA-seq technique to sequence the transcriptome from flower buds of wild-type and DNA methyltransferase or DNA demethylase deficient mutant
Arabidopsis thaliana plants. By overlaying the RNA-seq data with the single-base resolution detection of DNA methylation in the same tissues, Lister and colleagues identified hundreds of genes that displayed altered transcript abundance upon perturbation of proximal DNA methylation patterns. Importantly, the stranded RNA-seq data were essential for identifying the strand from which intergenic transcripts originated and for unambiguously identifying repetitive transposon sequences reactivated upon loss of the repressive methylation modifications and alteration of proximal smRNA abundance.
While RNA-seq offers previously unparalleled means to characterize cellular transcriptional activity, numerous methodological advances now being pursued promise to greatly enhance its effectiveness. Paired-read sequencing can be used to assess the splicing patterns of multiple distal exons within a single transcript, whereas with single short reads it is generally only possible to assess one splice event. With increases in read length constantly being pursued, it will eventually be feasible to sequence and assemble an entire transcript, thus revealing its precise splicing pattern. Such a development would also greatly facilitate an understanding of the transcriptome of plant species that do not yet possess high quality reference sequences, allowing identification of novel transcripts where shorter reads may at this point preclude effective contig assembly. It will also be essential to refine RNA-seq techniques so that they require significantly less starting material, enabling the sequencing of single cells to characterize their transcriptional complement and identify cell-type specific transcripts. Together, such developments will greatly improve the value of RNA-seq, providing researchers with a more comprehensive understanding of the composition and dynamics of plant cell transcriptomes.
Recently, more specialized RNA-seq approaches have been developed to sample the 3′ cleavage fragments produced by endonucleolytic cuts, and in so doing capture a global snapshot of degraded RNAs [
49•,
84•,
85•]. These “degradome” sequencing approaches exploit the 5′-RACE principle but ignore the 5′ mRNA cap and selectively clone mRNA molecules with a 5′ monophosphate [
49•,
84•,
79•]. Analysis of the degradome sequencing data revealed that sequencing reads mapped to the vast majority of expressed genes, most of them specifically to the 3′ ends of mRNA molecules, suggesting that some level of endonucleolytic cleavage, mostly targeted to the 3′ end of mRNAs, followed by turnover is the norm for most expressed transcripts [
49•,
84•,
85•]. Additionally, this type of sequence information, which is rich in sequenced miRNA-directed cleavage sites, has been used to identify known and previously unidentified miRNA target mRNAs [
84•,
85•]. Overall, these recent studies illustrate how high-throughput sequencing technologies can be utilized to gain insights into global RNA dynamics within plants.
Future prospects and concluding remarks
The advent of widely available new or now-generation sequencing technologies has spawned a remarkable array of applications to study genomic and cellular dynamics and features with unprecedented precision and breadth. Many of these new sequence-enabled techniques have been applied to plant systems, producing intriguing insights into cellular function, and genome and population dynamics that could not previously have been obtained. Widespread adoption of these new sequencing technologies will allow researchers to characterize a vast assortment of plant processes in both model and non-model species. The many varied techniques will inevitably be applied to generate detailed temporal and spatial maps of cellular states and activities, profiling not only different cell types within an organism but, with suitable advances in sample preparation and amplification methods, perhaps also single cells. A tantalizing goal is the effective integration of the many complex and rich sequencing datasets to yield cohesive views of cellular activities and dynamics, yet clearly there are substantial bioinformatic challenges that lie ahead on the path to this objective.
Theoretically, any cellular process or experimental assay for which the output is in nucleic acid form can be comprehensively interrogated, providing an opportunity for the development of a wide assortment of novel applications. For example, it should be possible to combine the yeast two-hybrid screening method [
86] with deep sequencing to perform a massively parallel protein-protein interaction experiment, interrogating every pairwise permutation of the full protein-coding complement of an organism’s genome to generate a complete direct-interaction network. In this proposed technique, interaction of bait and prey constructs results in the activation of the CRE recombination system and expression of a selective marker gene.
loxP sites situated at the end of each gene in the bait and prey constructs will be recombined to form a chimeric DNA molecule containing the two gene ORFs that encode the interacting proteins. Restriction digestion to release the chimeric molecule followed by paired-end sequencing of its two ends will yield a pair of sequences, one from each of the genes, thus identifying the two proteins that directly interacted. Two complex pools of yeast cells, each one containing the full complement of an organism’s gene ORFs fused to either the bait or the prey domain, would be mixed and allowed to mate. Deep sequencing performed on the complex pool of resulting chimeric DNA molecules would reveal every pairwise interaction that took place, interrogating the hundreds of millions of possible interactions between every protein encoded in a eukaryotic genome. Such a parallelized approach will be the only possible avenue through which to test the 784 million possible interactions of the 28,000 proteins encoded in the
Arabidopsis thaliana genome.
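The figure of 784 million follows from counting every bait-prey combination of 28,000 proteins as an ordered pair, self-pairings included. The short calculation below reproduces it and, as an added assumption for comparison, also counts the pairs that remain if bait-prey orientation is ignored; the variable names are illustrative.

```python
from math import comb

n_proteins = 28_000  # protein count quoted in the text

# Every bait x prey combination, orientation counted, self-pairings included.
ordered_pairs = n_proteins ** 2

# Assumed alternative: orientation ignored, self-pairings still included.
unordered_pairs = comb(n_proteins, 2) + n_proteins

print(f"{ordered_pairs:,} ordered pairs")
print(f"{unordered_pairs:,} unordered pairs")
```

Either way the search space runs to hundreds of millions of combinations, which is the scale that motivates a pooled, sequencing-based readout.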
As enabling as this leap in technology has been, several companies already claim that they will soon deliver momentous increases in sequence read length and output (e.g. Pacific Biosciences,
http://www.pacificbiosciences.com; Complete Genomics,
http://www.completegenomics.com; Visigen Biotechnologies,
http://visigenbio.com). With such advances it may soon be possible to apply these new technologies to the study of plants with much larger genomes, and to survey a wide range of plant species, thus dramatically increasing the understanding of the diversity of plant life.