The sudden availability of DNA sequencing technologies that rapidly produce vast amounts of sequence information has triggered a paradigm shift in genomics, enabling massively parallel surveying of complex nucleic acid populations. The diversity of applications to which these technologies have already been put demonstrates the immense range of cellular processes and properties that can now be studied at single-base resolution. These include genome resequencing and polymorphism discovery, mutation mapping, DNA methylation, histone modifications, transcriptome sequencing, gene discovery, alternative splicing identification, small RNA profiling, DNA-protein and possibly even protein-protein interactions. Thus, these deep sequencing technologies offer plant biologists unprecedented opportunities to increase the understanding of the functions and dynamics of plant cells and populations.
The application of genomic techniques to plant research has yielded a multitude of discoveries concerning plant cellular biology, development and evolution. Now, the sudden rise of relatively low cost and rapid “next-generation” DNA sequencing technologies is dramatically advancing our ability to comprehensively interrogate the nucleic acid-based information in a cell at unparalleled resolution and depth. Already this technology has been employed to study genome sequence variation, ancient DNA, cytosine DNA methylation, protein-DNA interactions, transcriptomes, alternative splicing, small RNA populations and mRNA regulation (Figure 1), with a number of these applications being effectively applied to plant systems. Current deep sequencing technologies produce many gigabases of single-base resolution information and can perform multiple genome-scale experiments in a single experimental run, thus being effective in the analysis of many plant genome equivalents. However, some significant challenges remain in the employment of this new technology; the most evident are the informatics and data processing issues that arise from the generation of such large (terabytes per run) volumes of data. Here we discuss several applications of these “now-generation” DNA sequencing technologies and the insights they have yielded into the diversity of plant genome regulation.
Currently, three deep sequencing platforms are widely deployed in hundreds of research labs and in some core facilities worldwide: the Genome Sequencer FLX from 454 Life Sciences/Roche, the Illumina Genome Analyzer, and the Applied Biosystems SOLiD. Each instrument essentially massively parallelizes individual reactions, sequencing hundreds of thousands to hundreds of millions of distinct, relatively short (50 to 400 bases) DNA sequences in a single run. The technical details of the operation and chemistries of each sequencer have been reviewed in detail recently [1, 2]. Here, we briefly outline the quantity and constitution of sequence data produced by each platform. It should be noted that each of these platforms has seen dramatic and rapid increases in total yield, sequence quality and read length, such that the figures quoted will likely be surpassed by the time of publication of this review. The Genome Sequencer FLX from 454 Life Sciences is capable of producing over a million reads of up to 400 bases per 10-hour run, for a total yield of 400–600 megabases. The Illumina Genome Analyzer will yield over one hundred million high-quality short reads (up to 76 bases) per 3–5 day run, totaling several gigabases of aligned sequence. Finally, the Applied Biosystems SOLiD system will also produce hundreds of millions of short reads (up to 50 bases) per flow cell in a similar time frame, yielding a quantity of sequence equivalent to that of the Illumina instrument. Furthermore, all three platforms offer the paired-read sequencing technique, where sequence is produced from both ends of a long DNA molecule, increasing the unambiguous mapping of sequence reads by spanning repetitive regions and anchoring one repetitive read to a distinct genomic location by its unique partner sequence.
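The per-run yields quoted above follow from simple arithmetic: number of reads multiplied by read length. The short Python sketch below illustrates this relationship and the resulting average fold coverage. The read counts and lengths are the round figures from the text; the genome size of 120 Mb is an illustrative assumption and is not a figure given in this review, and the function names are hypothetical.

```python
def run_yield_bases(n_reads: int, read_length: int) -> int:
    """Raw yield if every read reached the stated maximum length."""
    return n_reads * read_length


def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Round figures quoted in the text: (number of reads, maximum read length).
platforms = {
    "454 GS FLX": (1_000_000, 400),
    "Illumina GA": (100_000_000, 76),
}

# ASSUMPTION: 120 Mb is a round, illustrative genome size chosen for this
# example; it is not a value stated in the text.
ASSUMED_GENOME_SIZE = 120_000_000

for name, (n_reads, length) in platforms.items():
    bases = run_yield_bases(n_reads, length)
    coverage = fold_coverage(bases, ASSUMED_GENOME_SIZE)
    print(f"{name}: {bases / 1e6:.0f} Mb raw, {coverage:.1f}x of assumed genome")
```

Raw yield computed this way is an upper bound; the amount of uniquely aligned sequence will be lower, since not every read reaches the maximum length or maps unambiguously.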
The base-calling error rates observed with the new sequencing technologies are on average ten times greater than those of capillary-based Sanger sequencing, and the type of error varies between the different platforms. However, the massive increase in sequence output affords the possibility of generating multiple passes of the same sequence, thereby greatly reducing error rates.
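To illustrate why multiple passes reduce the effective error rate, the following Python sketch computes the probability that a simple majority of reads covering one position is wrong. It is a deliberately simplified model, not a description of any platform's base caller: it assumes errors are independent between reads and, pessimistically, that all erroneous reads agree on the same wrong base. The 1% per-read error rate is an illustrative assumption, not a value from the text.

```python
from math import comb


def majority_error(per_read_error: float, depth: int) -> float:
    """Probability that more than half of `depth` independent reads are wrong
    at one position (pessimistically assuming all errors agree)."""
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )


# ASSUMPTION: 1% per-read error is illustrative only.
for depth in (1, 5, 11, 21):
    print(f"depth {depth:>2}: consensus error ~ {majority_error(0.01, depth):.3g}")
```

Under these assumptions the consensus error falls by orders of magnitude with each few additional passes, although systematic, platform-specific errors would not be removed this way.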
Identification of sequence polymorphisms in related but phenotypically distinct individuals or groups within a species is an essential step in elucidation of the causative genetic differences that give rise to observed phenotypic variation. Furthermore, the distribution of genetic polymorphism is informative of population structure and evolutionary history. Hybridization of genomic DNA to high-density oligonucleotide arrays has successfully been used to identify genetic polymorphisms in several organisms including human, mouse and Arabidopsis thaliana [3–5]. However, utilization of tiling microarrays to identify genetic polymorphisms is limited to genomic regions that are highly similar to the reference strain sequence upon which the tiling array is designed, as efficient probe hybridization is necessary for deconvolution of the sequence in the other strains. Consequently, the analysis of genomic sequence variation is confined to these highly similar sequences, while regions containing small to large insertions or deletions, or a high density of polymorphisms cannot easily be interrogated.
The recent development of deep sequencing technologies is a major boon for the aforementioned areas of investigation, in which interrogating the genomic sequence of a wide range of individuals, strains or species is essential to generating highly informative datasets. The ability to generate vast amounts of sequence data from any organism enables the rapid discovery of much greater sequence variation than has been identified previously. In a recent study, Ossowski et al. (2008) reported the resequencing of two naturally occurring and geographically distinct strains of Arabidopsis thaliana (Bur-0 and Tsu-1) with short reads generated by the Illumina sequencing technology [6•]. Furthermore, the study details the development of a new computational mapping tool, ShoRe, which enables identification of both SNPs and 1–3 bp indels at high sensitivity and specificity. Within these two studied strains, over 800,000 non-redundant single-nucleotide polymorphisms (SNPs) were identified relative to the reference strain Col-0, constituting a dramatic increase in SNP discovery relative to previous array-based experiments. Furthermore, over 79,000 1–3 bp indels were identified in the genomes of these two strains, resulting in 1,839 potential frame shifts, and regions that showed significantly higher coverage than expected were identified, likely indicating duplicated regions. Finally, 3.4 megabases of the Bur-0 and Tsu-1 genomes that were identified as highly dissimilar, duplicated or deleted relative to the reference Arabidopsis thaliana genome were targeted for de novo sequence assembly, resulting in the generation of 10,921 high-confidence contigs of up to 408 bp. Clearly, a wide assortment of polymorphism information can be gleaned from limited short read sequencing of divergent Arabidopsis thaliana accessions, and the sequence of Bur-0 and Tsu-1 will be highly informative for ongoing research into extant Bur-0/Tsu-0 recombinant inbred lines.
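The general idea of calling SNPs from reads aligned to a reference can be sketched in a few lines of Python. This is a toy illustration only and is not the algorithm used by ShoRe or by Ossowski et al.; the pileup representation, the function name and both thresholds are hypothetical choices made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def call_snps(
    reference: str,
    pileup: Dict[int, str],
    min_depth: int = 3,          # hypothetical threshold
    min_fraction: float = 0.8,   # hypothetical threshold
) -> List[Tuple[int, str, str]]:
    """Return (position, reference_base, alternate_base) wherever a
    non-reference base dominates the reads covering that position.

    `pileup` maps a 0-based reference position to the string of bases
    observed at that position across all aligned reads.
    """
    snps = []
    for pos in sorted(pileup):
        bases = pileup[pos]
        if len(bases) < min_depth:
            continue  # too few reads to make a call
        alt, count = Counter(bases).most_common(1)[0]
        if alt != reference[pos] and count / len(bases) >= min_fraction:
            snps.append((pos, reference[pos], alt))
    return snps


# Hypothetical toy data.
reference = "ACGTAC"
pileup = {1: "TTTT", 2: "GGGA", 4: "CC", 5: "CCCC"}
print(call_snps(reference, pileup))
```

In this toy input only position 1 is reported: position 2 matches the reference in most reads, position 4 has too little coverage, and position 5 agrees with the reference.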
The study by Ossowski et al. marks the first data release of the international cooperative endeavor to sequence the genomes of 1,001 distinct strains of Arabidopsis thaliana (http://1001genomes.org), which will provide a vast resource for the comprehensive study of global polymorphism, population structure, and analysis of the genetic basis of natural phenotypic variation.
New developments in sequencing technology, such as significantly longer reads and paired reads separated by multiple kilobases, must be applied to enable true de novo assembly of complete plant genomes. Application of these technological advances will enable significantly more comprehensive detection of genetic diversity, such as large structural variation within related genomes, and consequently aid elucidation of the polymorphisms that dictate phenotypic variation.
Screening of populations subjected to mutagenesis and identification of the causative genetic lesions of mutant phenotypes is a fundamental approach in the discovery of gene function. Forward genetic screens have proven extremely powerful in Arabidopsis thaliana for assigning genes to specific biological pathways. The success of this approach is, in part, due to the highly accurate sequence of its compact genome, facile genetics, and extensive collection of mapping markers. However, identifying the causative mutation commonly takes several months to years after generating a mapping population, so approaches to expedite this step will be highly valuable. In a modification of an approach termed bulked segregant analysis [10, 11], deep sequencing of a pool of F2 individuals containing only mutant plants from a mapping population enables rapid mapping of the mutation. Every sequenced SNP between the two parental strains of the mapping population acts as a marker (Figure 2), and hundreds of thousands of SNPs can now be routinely detected with relatively low genome coverage [6•]. Tracts homozygous for the genotype of the mutagenized strain are indicative of no recombination events occurring within that region, and thus are within physical proximity of the mutation. Furthermore, the sequence within this region can be scoured for potential mutations to rapidly identify the exact location of the genetic lesion, although sequencing errors and accumulated non-causative polymorphisms in the mutant population compared to the reference sequence may contribute to false-positive identification. Recently, using a “sequencing with prior mapping” approach, Sarin et al. (2008) reported the use of the Illumina platform to sequence the genome of the C. elegans mutant lsy-12 to identify the causative mutation [12•].
Notably, for organisms with genomes of moderate size such as Arabidopsis thaliana, a 76 base read/paired-end sequencing run that yields ~40x coverage currently takes only two weeks and costs a few thousand dollars, both factors that will see continual, rapid and dramatic improvement based upon the progression in the last two years. While this mutation-mapping approach offers great potential and is already being applied in a number of plant laboratories, both statistical predictions and empirical testing of the size of the mutant pool and the required coverage are necessary to determine the most effective experimental strategies. Looking ahead, mutation mapping will likely soon undergo even more dramatic advances. Recent studies have already demonstrated the identification of specific mutations by deep sequencing without inter-strain crosses to generate a mapping population [13•, 14•]. Thus, with the rapid increases in sequence output it is now conceivable to directly identify mutations in plant genomes, effectively taking the Mendel (genetic crosses) out of mutation mapping.
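The pooled-mutant mapping idea described above can be sketched as a windowed allele-frequency scan: at each parental SNP, count the reads supporting the allele of the mutagenized strain, pool the counts in windows, and look for the window closest to fixation. The Python sketch below is a minimal illustration under that assumption; the window size, data layout and function names are hypothetical, and a real analysis would need the statistical treatment of pool size and coverage noted in the text.

```python
from typing import Dict, List, Tuple

# One marker = (position in bp,
#               reads supporting the mutagenized-parent allele,
#               reads supporting the other parent's allele)
Marker = Tuple[int, int, int]


def window_frequencies(markers: List[Marker], window: int) -> List[Tuple[int, float]]:
    """Pool read counts in fixed-size windows and return
    (window_start, mutagenized-parent allele frequency) per window."""
    totals: Dict[int, Tuple[int, int]] = {}
    for pos, mut_reads, other_reads in markers:
        start = (pos // window) * window
        m, o = totals.get(start, (0, 0))
        totals[start] = (m + mut_reads, o + other_reads)
    return [
        (start, m / (m + o))
        for start, (m, o) in sorted(totals.items())
        if m + o > 0
    ]


def candidate_window(markers: List[Marker], window: int) -> Tuple[int, float]:
    """Window whose pooled frequency is closest to fixation for the
    mutagenized parent, i.e. the likeliest location of the mutation."""
    return max(window_frequencies(markers, window), key=lambda w: w[1])


# Hypothetical toy data for a single chromosome.
toy = [
    (10_000, 5, 5), (60_000, 6, 4),
    (110_000, 9, 1), (160_000, 10, 0),
    (210_000, 10, 0), (260_000, 7, 3),
]
print(candidate_window(toy, 100_000))
```

For the toy data, the window starting at 100,000 bp has the highest pooled frequency of the mutagenized-parent allele (19 of 20 reads) and would be the region to inspect for candidate lesions.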
DNA-protein interactions mediate innumerable critical nuclear processes that govern genome organization, replication and interpretation of the inherent underlying information. Chromatin structures such as nucleosome composition and position, and post-translational modifications of histones influence chromatin compaction and interactions with transcription machinery, thus affecting proximal transcriptional activity [15–18]. Therefore, comprehensive genome-wide maps of such chromatin composition and state, and more broadly the full range of DNA-protein interactions, are essential to generate a more complete understanding of genome and transcriptional regulation. While these interactions were historically revealed gradually, by analysis of a small number of genomic loci at a time, more recent studies have utilized genomic tools such as high-density oligonucleotide arrays to interrogate the sites of interaction throughout entire genomes. The ChIP-chip method involves immunoprecipitation of specific chromatin through its interaction with a protein of interest that is crosslinked to proximal genomic DNA in the context of its in vivo interactions [19, 20]. Purification, labeling and hybridization of the immunoprecipitated genomic DNA to arrays enables identification of the genomic sites at which interaction of the protein with the genomic DNA occurred [21, 22]. ChIP-chip has been used extensively to produce comprehensive maps of DNA-protein interactions in plants and animals [23–27]. With the availability of new sequencing technologies, the chromatin immunoprecipitation technique has rapidly been coupled to shotgun sequencing to generate even higher resolution maps of protein-DNA interactions, an approach dubbed “ChIP-seq”, revealing distinct patterns of transcription factor binding, RNA polymerase II, and histone modifications in human and mouse lineage-committed, differentiating, as well as pluripotent and induced-pluripotent stem cells [28••–32].
With several gigabases of sequence generated in each sequencing run, ChIP samples are perfectly suited for analysis with deep sequencing technology, generally requiring only a fraction of the total output of a single run to saturate detection of sites of protein-DNA interaction. In fact, the rapidly increasing output of the DNA sequencers such as the Illumina Genome Analyzer and Applied Biosystems SOLiD likely already provides a cost-benefit over array hybridization for analysis of ChIP samples, particularly in organisms that possess large genomes that are distributed over several arrays. Sample barcoding, by addition of a short unique sequence tag to all sequenced molecules within one library, and subsequent multiplexing will further decrease the cost [33•]. Further advantages over ChIP-chip are evident in the higher resolution of the interactions that can be observed through the distribution of ChIP-seq short read tags. While there are no publications of the utilization of ChIP-seq in plant systems, numerous laboratories are currently employing this technique to gain new insights into DNA-protein interactions in plant cells and a flurry of papers utilizing this new method is expected soon.
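Sample barcoding, as described above, implies a demultiplexing step after sequencing in which each read is assigned back to its library by its tag. The following Python sketch shows one simple way this could be done. It assumes the barcode sits at the start of each read, that all barcodes have equal length, and that only exact matches are accepted; the barcode sequences and sample names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def demultiplex(reads: Iterable[str], barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a sample by exact match of its leading barcode.

    `barcodes` maps barcode sequence -> sample name. The barcode is trimmed
    from assigned reads; unmatched reads are kept whole under "unassigned".
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of one equal length")
    n = lengths.pop()
    bins: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:n])
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(read[n:])
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGGTTT", "TGCAAAACCC", "NNNNGGGG"]
print(demultiplex(reads, barcodes))
```

Exact matching is the simplest policy; tolerating a mismatch in the tag would rescue more reads at the cost of requiring barcodes that differ from each other at several positions.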
Methylation of cytosines in the nuclear genomes of diverse eukaryotic lineages is an epigenetic modification that is required for numerous cellular processes, including transposon silencing, genomic imprinting, embryogenesis and gene regulation [34–39]. Several distinct molecular pathways control the deposition of DNA methylation in plants, so clearly the comprehensive detection of these sites at single-base resolution is necessary to gain an understanding of the pathways involved in its patterning and how it affects the underlying genetic information.
Single-base resolution analysis of sites of DNA methylation can be achieved by sodium bisulfite (BS) treatment of genomic DNA, which converts cytosines, but not methylcytosines, to uracil. Subsequent sequencing of PCR-amplified bisulfite-converted DNA allows determination of the methylation state of the cytosines in the sequenced region of the genome, as methylcytosine will be sequenced as cytosine, and unmethylated cytosine as thymine. While historically this approach was limited to analysis of a small number of loci, deep sequencing technologies have recently enabled two groups to conduct shotgun bisulfite sequencing of the entire Arabidopsis thaliana genome with a technique dubbed BS-seq or methylC-seq, offering an unprecedented view of the DNA methylome [41•, 42•]. Using the Illumina Genome Analyzer, Cokus et al. and Lister et al. generated 2–3 gigabases of uniquely aligned bisulfite sequence to comprehensively identify sites of DNA methylation throughout the Arabidopsis thaliana genome at single-base resolution, including previously unidentified sites of cytosine methylation, and local sequence motifs associated with DNA methylation. The relationship with small RNA abundance, downstream effects upon transcription of modifying methylation patterns, and dynamics of DNA demethylation were also uncovered [42•].
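The logic of reading methylation state from bisulfite-converted sequence can be made concrete with a small Python sketch. The first function simulates the conversion described above (unmethylated C read as T, methylated C read as C); the second classifies each reference cytosine from an aligned read. This is an illustration only, not the pipeline used in the cited studies: it assumes a gap-free alignment at offset 0, considers a single strand, and the function names are hypothetical.

```python
from typing import Dict, Set


def bisulfite_convert(seq: str, methylated_positions: Set[int]) -> str:
    """Simulate the sequence read after bisulfite treatment and PCR:
    unmethylated C is read as T, methylated C remains C."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference: str, read: str) -> Dict[int, str]:
    """Classify each reference cytosine covered by a read that is aligned
    without gaps to the same strand of the reference, starting at offset 0."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"
        elif read_base == "T":
            calls[i] = "unmethylated"
        else:
            calls[i] = "ambiguous"  # e.g. sequencing error or polymorphism
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

Here only the cytosine at position 1 is protected, so the simulated read is ACGTTTGA and positions 4 and 5 are called unmethylated. Reads derived from the opposite strand would instead show G-to-A changes relative to the reference and need to be handled separately.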
Application of methylC-seq to study distinct cell types, related but genetically distinct natural populations, and organisms exposed to various biotic and abiotic stresses will provide an unparalleled assessment of the extent to which cytosine methylation patterns vary within and between organisms.
RNA silencing represents a pathway that controls expression of specific genes transcriptionally and post-transcriptionally. In RNA silencing, small RNAs (smRNAs) comprise the sequence-specific effectors of RNA silencing pathways that direct the negative regulation or control of genes, repetitive sequences, viruses, and mobile elements [44, 45].
To gain insights into the total smRNA population and a better understanding of smRNA function in plants, a number of groups turned to sequencing the smRNA component of the plant transcriptome (smRNAome). Numerous groups have recently employed the Genome Sequencer FLX from 454 Life Sciences and the Illumina Genome Analyzer to examine the smRNAome of various plant species [42•, 46–60]. Putting these two technologies to work, the sequencing of smRNAomes from plants containing various genetic lesions has resulted in the elucidation and categorization of millions of smRNAs, as well as the identification of biogenesis factors and regulators of specific smRNA populations [42•, 48–51, 53, 55, 57]. For instance, sequencing the smRNAomes of Arabidopsis thaliana plants harbouring lesions in genes encoding DNA methyltransferases in conjunction with single-base resolution DNA methylation analysis (see above) revealed a strong correlation between the location of smRNAs and DNA methylation, a disruption in biogenesis of specific smRNA size classes upon loss of CpG DNA methylation, and the potential of smRNAs for directing strand-specific DNA methylation in regions of RNA-DNA homology [42•]. In another study, sequencing experiments using Arabidopsis thaliana rdr2 and maize mop1-1 mutant plants, which lack a homologous RNA-dependent RNA polymerase, revealed that loss of this protein results in a significant decrease in the 24 nt smRNA population of the smRNAome. This loss of 24 nt smRNAs was accompanied in the sequencing experiments by an increase in sequencing of those that were 21 nt in length, which through subsequent analysis resulted in the identification of numerous previously unidentified miRNAs throughout the Arabidopsis thaliana (rdr2) and maize (mop1-1) genomes.
Furthermore, these studies revealed that 24 nt smRNAs, which are mostly associated with repetitive elements and heterochromatic regions of the genome, comprise the bulk of the Arabidopsis thaliana and maize smRNAome complexity [53, 55].
With these technologies becoming increasingly accessible, the number of plant species with sequenced smRNAomes is ever increasing [46, 47, 52, 54–60]. So far, this collection of sequence data has shown that smRNAome composition is not uniform across species. More specifically, the distribution of smRNAs amongst various size classes has been found to differ between plants. This differential distribution of smRNA lengths is hypothesized to reflect a disparity in the maintenance of genomic organization between plant species that have dramatic variations in the quantity of their genetic material [54, 61].
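Comparing size-class distributions such as those discussed above comes down to tabulating read lengths, either by total abundance (every read counted) or by complexity (each distinct sequence counted once). The Python sketch below shows both views. The 18–30 nt length window is a hypothetical filter chosen for illustration, as are the function name and toy reads.

```python
from collections import Counter
from typing import Dict, Iterable


def size_class_fractions(
    reads: Iterable[str],
    distinct: bool = False,
    min_len: int = 18,   # hypothetical lower bound for smRNA reads
    max_len: int = 30,   # hypothetical upper bound for smRNA reads
) -> Dict[int, float]:
    """Fraction of smRNA reads in each length class.

    With distinct=False every read counts (abundance); with distinct=True
    each unique sequence counts once (sequence complexity).
    """
    pool = set(reads) if distinct else list(reads)
    lengths = Counter(len(r) for r in pool if min_len <= len(r) <= max_len)
    total = sum(lengths.values())
    if total == 0:
        return {}
    return {n: count / total for n, count in sorted(lengths.items())}


# Hypothetical toy library: one abundant 21 nt sequence, two distinct 24 nt ones.
toy_reads = ["A" * 21] * 3 + ["C" * 24, "G" * 24]
print(size_class_fractions(toy_reads))                 # by abundance
print(size_class_fractions(toy_reads, distinct=True))  # by complexity
```

In the toy library the 21 nt class dominates by abundance while the 24 nt class dominates by complexity, which is the kind of distinction that matters when one size class consists of a few highly expressed sequences and another of many rare ones.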
Ultimately, with millions of sequence reads generated in each run, and the ability to determine the specific nucleotide length of all identified smRNAs, machines such as the 454 sequencer, Illumina Genome Analyzer, and Applied Biosystems SOLiD provide ideal platforms for complete indexing of the plant smRNAome. Additionally, the increased use of barcoding of numerous smRNA samples, and subsequent multiplexing, will result in the sequencing of smRNAomes from an even greater variety of plant species. With the ensuing flood of smRNA sequencing data from an immense collection of plant species, a clearer view of the dynamic nature of plant smRNAomes will emerge. Additionally, these datasets will aid in elucidating how these small regulatory RNA molecules have evolved between plant species to regulate genomes with such disparity in size.
As the astounding and unexpected complexity of eukaryotic transcriptomes has become apparent over the last few years [24, 62–68], so the requirement has grown for techniques that allow broad but accurate characterization of the dynamic cellular complement of transcripts. Ideally such approaches will incorporate highly specific, sensitive and quantitative measurements over a large dynamic range with a flexibility to identify unanticipated novelties in transcript structures and sequences.
A number of studies have recently used deep sequencing to perform surveys of the mRNA component of the transcriptome in various organisms, enabling parallel quantification and annotation of cellular transcripts. While sequencing of cDNA pools is a well established technique, for example the sequencing of EST libraries, the ability to rapidly and cheaply generate diverse cDNA sequence datasets will allow the transcriptional activity of a vast array of different cell types, mutants and environmental conditions to be analyzed. Deep sequencing of cDNA, referred to as RNA-seq, overcomes several shortcomings of microarray-based detection of transcripts, including probe cross-hybridization, restricted signal dynamic range, and low sensitivity and specificity, which often lead to difficulties in detection of low abundance transcripts and discrimination between similar sequences. Sequence-level transcript information has much greater power to distinguish between paralogous genes, provides better detection of low abundance transcripts, and allows replicable digital quantification based upon counting of sequence reads [71–75]. Furthermore, RNA-seq can identify transcript sequence polymorphisms, novel trans-splicing and splice isoforms, and there is no strict requirement for a reference genome sequence. Whilst approaches such as SAGE, CAGE and MPSS have enabled parallel sequencing of short reads from many transcripts, they suffer from a poor coverage of each transcript and potentially ambiguous mapping due to the short read length [76–78]. In contrast, RNA-seq can produce complete coverage of transcripts, providing information about the sequence, structure and genomic origins of the entire transcript.
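Digital quantification by read counting, as mentioned above, usually requires normalizing raw counts so that long transcripts and deeply sequenced libraries are not favoured. One widely used convention is reads per kilobase of transcript per million mapped reads (RPKM). The text does not specify a normalization, so its use here is an assumption made for illustration; the gene names and counts below are hypothetical.

```python
def rpkm(read_count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to (read_count / (length / 1e3)) / (total_mapped_reads / 1e6).
    """
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return 1e9 * read_count / (transcript_length_bp * total_mapped_reads)


# Hypothetical counts: (reads mapped to gene, transcript length in bp).
library_size = 10_000_000
genes = {"gene_short": (500, 1_000), "gene_long": (500, 2_000)}

for name, (count, length) in genes.items():
    print(name, rpkm(count, length, library_size))
```

Both hypothetical genes attract 500 reads, but the shorter one receives twice the normalized value (50 versus 25), reflecting that the same number of reads from a shorter transcript implies a higher abundance.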
Several strategies have been employed to perform shotgun sequencing of cellular mRNAs, but they can be broadly categorized as either “stranded” RNA-seq, yielding strand-specific data that informs about transcript directionality, or “strandless” RNA-seq, where sequencing of double-stranded cDNA fragments loses the strand of origin information [79•]. The first papers reporting RNA-seq of plant transcripts with one of the new deep sequencing technologies utilized the 454 sequencer, generating strandless RNA-seq data from double-stranded cDNA of Medicago truncatula, Arabidopsis thaliana and maize [80–82]. Cheung and colleagues sequenced adapter-ligated fragments of a normalized Medicago truncatula cDNA library, assembling the reads into contigs representing thousands of previously unobserved and rare transcripts. In Arabidopsis thaliana seedlings, Weber et al. generated reads mapping to 17,449 genes, accounting for ~90% of the transcripts estimated to be expressed in the sample, identifying reads from previously unannotated transcripts and predicted genes with no prior EST support. Finally, Emrich and colleagues sequenced cDNA from maize shoot apical meristem cells isolated by laser-capture microdissection, identifying over 25,000 genomic sequences, including nearly 400 orphan transcripts with no homology to sequences from any other species and which appeared to be expressed in a cell-type specific manner. Clearly, the sensitivity of shotgun sequencing is applicable to characterization of the transcript complement of individual cell types.
Several recent publications have utilized the Illumina Genome Analyzer and Applied Biosystems SOLiD instruments to generate vast datasets of short expressed tags in Arabidopsis thaliana, human, mouse and yeast [42•, 71–75, 83]. Essentially, these instruments yield vastly more transcriptome sequence per run than the 454 Life Sciences instrument, typically over one hundred million individual reads; however, these reads are significantly shorter than those from the 454 instrument. Thus, while many more unique sequence tags are generated, the shorter read length of the Illumina and Applied Biosystems machines makes it challenging to perform transcript assembly, to identify multiple splicing events within the same mRNA molecule, and to align reads unambiguously to some transcripts with highly similar sequences. However, the vast quantity of short read sequence is extremely powerful for transcript quantification, gene discovery, correction of transcriptional unit structure annotation, and detection of alternative splicing [72••, 74].
In a recent study, Lister et al. [42•] utilized a strand-specific RNA-seq technique to sequence the transcriptome from flower buds of wild-type and DNA methyltransferase or DNA demethylase deficient mutant Arabidopsis thaliana plants. By overlaying the RNA-seq data with the single-base resolution detection of DNA methylation in the same tissues, Lister and colleagues identified hundreds of genes that displayed altered transcript abundance upon perturbation of proximal DNA methylation patterns. Importantly, the stranded RNA-seq data was essential for identification of the strand from which the intergenic transcripts originated and unambiguous identification of repetitive transposon sequences reactivated upon loss of the repressive methylation modifications and alteration of proximal smRNA abundance (Figure 3).
While RNA-seq offers previously unparalleled means to characterize cellular transcriptional activity, numerous methodological advances that are now being pursued promise to greatly enhance its effectiveness. Paired-read sequencing can be used to assess the splicing patterns of multiple distal exons within a single transcript, whereas with single short reads it is generally only possible to assess one splice event. With increases in read length constantly being pursued, it will eventually be feasible to sequence and assemble an entire transcript, thus revealing the precise splicing pattern. Such a development would also greatly facilitate an understanding of the transcriptome of plant species that do not yet possess high quality reference sequences, allowing identification of novel transcripts where shorter reads at this point may preclude effective contig assembly. It will be essential for RNA-seq techniques to be refined to require significantly less starting material, so as to enable the sequencing of single cells to characterize their transcriptional complement and identify cell-type specific transcripts. Together, such developments will greatly improve the value of RNA-seq, providing researchers with a more comprehensive understanding of the composition and dynamics of plant cell transcriptomes.
Recently, more specialized RNA-seq approaches have been developed to sample the 3′ cleavage fragments produced by endonucleolytic cuts, and in so doing captured a global snapshot of degraded RNAs [49•, 84•, 85•]. These “degradome” sequencing approaches exploit the 5′-RACE principle but ignore the 5′ mRNA cap and selectively clone mRNA molecules with a 5′ monophosphate [49•, 84•, 85•]. Analysis of the degradome sequencing data revealed that the vast majority of expressed genes had sequencing reads that mapped to them, the majority mapping specifically to the 3′ ends of mRNA molecules, suggesting that some level of endonucleolytic cleavage mostly targeted to the 3′ end of mRNAs and subsequent turnover is the norm for most expressed transcripts [49•, 84•, 85•]. Additionally, this type of sequence information, which is riddled with sequenced miRNA-directed cleavage sites, has been used to identify known and previously unidentified miRNA target mRNAs [84•, 85•]. Overall, these recent studies illustrate how high-throughput sequencing technologies can be utilized to gain insights into global RNA dynamics within plants.
The advent of widely available new or now-generation sequencing technologies has spawned a remarkable array of applications to study genomic and cellular dynamics and features with unprecedented precision and breadth. Many of these new sequence-enabled techniques have been applied to plant systems, producing intriguing insights into cellular function, and genome and population dynamics that could not previously have been obtained. Widespread adoption of these new sequencing technologies will allow researchers to characterize a vast assortment of plant processes in both model and non-model species. The many varied techniques will inevitably be applied to generate detailed temporal and spatial maps of cellular states and activities, profiling not only different cell types within an organism but, with suitable advances in sample preparation and amplification methods, perhaps also single cells. A tantalizing goal is the effective integration of the many complex and rich sequencing datasets to yield cohesive views of cellular activities and dynamics, yet clearly there are substantial bioinformatic challenges that lie ahead on the path to this objective.
Theoretically, any cellular process or experimental assay for which the output is in nucleic acid form can be comprehensively interrogated, providing an opportunity for the development of a wide assortment of novel applications. For example, it should be possible to combine the yeast two-hybrid screening method with deep sequencing to perform a massively parallel protein-protein interaction experiment, interrogating every pairwise permutation of the full protein-coding complement of an organism’s genome to generate a complete direct-interaction network. In this proposed technique (Figure 4), interaction of bait and prey constructs results in the activation of the CRE recombination system and expression of a selective marker gene. loxP sites situated at the end of each gene in the bait and prey constructs will be recombined to form a chimeric DNA molecule containing the two gene ORFs that encode the interacting proteins. Restriction digestion to release the chimeric molecule followed by paired-end sequencing of its two ends will yield a pair of sequences, one from each of the genes, thus identifying the two proteins that directly interacted. Two complex pools of yeast cells, each one containing the full complement of an organism’s gene ORFs fused to either the bait or the prey domain, would be mixed and allowed to mate. Deep sequencing performed on the complex pool of resulting chimeric DNA molecules would reveal every pairwise interaction that took place, interrogating the hundreds of millions of possible interactions between every protein encoded in a eukaryotic genome. Such a parallelized approach will be the only possible avenue through which to test the 784 million possible interactions of the 28,000 proteins encoded in the Arabidopsis thaliana genome.
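The figure of 784 million corresponds to counting every ordered bait-prey combination, including each protein paired with itself. As a worked check, with $N$ denoting the number of encoded proteins (a symbol introduced here for illustration), the count of ordered combinations is shown first; the second line gives, for comparison only, the number of unordered pairs of distinct proteins, which is not a figure quoted in the text.

```latex
\begin{align*}
N^2 &= 28{,}000^2 = 784{,}000{,}000 \approx 7.84 \times 10^{8}
  && \text{(ordered bait--prey combinations)} \\
\binom{N}{2} &= \frac{N(N-1)}{2} = \frac{28{,}000 \times 27{,}999}{2} = 391{,}986{,}000 \approx 3.92 \times 10^{8}
  && \text{(unordered pairs of distinct proteins)}
\end{align*}
```

Testing each protein as both bait and prey, as the ordered count implies, would also provide a reciprocal check on each detected interaction.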
As enabling as this leap in technology has been, several companies already claim to soon deliver momentous increases in sequence read length and output (e.g. Pacific Biosciences, http://www.pacificbiosciences.com; Complete Genomics, http://www.completegenomics.com; Visigen Biotechnologies, http://visigenbio.com). With such advances it may soon be possible to apply these new technologies to the study of plants with much larger genomes, and to survey a wide range of plant species, thus dramatically increasing the understanding of the diversity of plant life.
We thank Dr. Robert Schmitz for valuable input in the manuscript preparation. R.L. is supported by a Human Frontier Science Program Long-term Fellowship. B.D.G. is a Damon Runyon Fellow supported by the Damon Runyon Cancer Research Foundation (DRG-1909-06). This work was supported by grants from the National Science Foundation, the Department of Energy, the National Institutes of Health, and the Mary K. Chapman Foundation to J.R.E.
Conflicts of interest
The authors declare that there are no conflicts of interest related to this publication.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.