Large numbers of molecular markers and sequence data from across the genome are playing an increasingly important role in population genomic studies of fine-scale genetic variation and the genetic basis of traits [
1]. Nevertheless, genomic resources are lacking for most non-model organisms, and whole genome sequencing remains impractical for most eukaryotes. Transcriptome, or Expressed Sequence Tag (EST), sequencing is an efficient means of generating functional genomic-level data for non-model organisms or those with genome characteristics prohibitive to whole genome sequencing. EST sequencing is an attractive alternative because most of a typical eukaryotic genome is non-coding DNA, and EST sequences lack the introns and intergenic regions that complicate analysis and interpretation of data [
2]. ESTs thus have a high functional information content, and often correspond to genes with known or predicted functions [
2,
3]. Large collections of EST sequences have proven invaluable for gene annotation and discovery [
2,
4], comparative genomics [
5], development of molecular markers [
6,
7], and for population genomic studies of genetic variation associated with adaptive traits [
8]. Until recently, however, developing EST resources required costly and time-consuming cloning, cDNA library construction, and many labor-intensive Sanger sequencing runs [
2].
Massively parallel sequencing technologies, such as 454 pyrosequencing, remove many of the time-consuming steps involved in Sanger sequencing of ESTs and have made transcriptome sequencing possible at a fraction of the time and cost previously required [
5,
9-
11]. At present, a single run on a 454 GS XLR70 Titanium pyrosequencer can produce more than 10^6 sequences averaging more than 300 base pairs (bp) in length. The
de novo assembly of the large numbers of short reads produced from this and similar technologies is a significant challenge for whole genome sequencing of large and complex genomes. In contrast, for transcriptome sequencing,
de novo assembly is facilitated by the greater coverage depth (number of reads per nucleotide in the template) attainable because the transcriptome contains far fewer nucleotides than the whole genome [
4]. In addition, the reduced amount of repetitive DNA found in genes compared to non-coding regions ameliorates one of the principal obstacles to
de novo assembly of short reads [
12]. Whereas most applications of parallel sequencing of ESTs have involved model organisms with draft genomes available to aid in assembly [
4,
13,
14], recent studies have demonstrated highly successful
de novo assemblies of 454 EST data for organisms with no prior genomic resources [
5,
7,
15,
16]. Generating such large-scale sequence data will allow functional analyses previously limited to model organisms to be applied rapidly to ecologically important taxa [
17]. Here, we utilize pyrosequencing of cDNA to characterize the transcriptome of lodgepole pine (
Pinus contorta) and to develop genomic resources to support further research in this and other pines.
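As a rough illustration of the coverage argument above, expected mean coverage depth can be approximated as the total number of sequenced bases divided by the size of the target. The sketch below uses the run output quoted in the text (about 10^6 reads of about 300 bp) and the lower bound of the pine genome size range given later in this section (10,000 Mbp). The transcriptome size of 30 Mbp is a hypothetical value chosen only for illustration, as is the function name `coverage_depth`; the calculation also assumes reads are spread uniformly, which is not true of real transcriptomes, where expression levels vary widely among genes.

```python
def coverage_depth(n_reads: int, mean_read_len_bp: float, target_size_bp: float) -> float:
    """Expected mean coverage: total sequenced bases divided by target size."""
    if target_size_bp <= 0:
        raise ValueError("target_size_bp must be positive")
    return n_reads * mean_read_len_bp / target_size_bp


# Run output quoted in the text: > 10^6 reads averaging > 300 bp.
N_READS = 1_000_000
READ_LEN_BP = 300

# Lower bound of the pine genome size range quoted in the text (10,000 Mbp).
PINE_GENOME_BP = 10_000e6

# ASSUMPTION: hypothetical transcriptome size, for illustration only.
TRANSCRIPTOME_BP = 30e6

genome_cov = coverage_depth(N_READS, READ_LEN_BP, PINE_GENOME_BP)
transcriptome_cov = coverage_depth(N_READS, READ_LEN_BP, TRANSCRIPTOME_BP)

print(f"genome: {genome_cov:.3f}x, transcriptome: {transcriptome_cov:.1f}x")
```

Under these assumptions, the same run that covers only a few percent of the genome would cover the transcriptome several times over, which is the reason de novo assembly is more tractable for ESTs.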
P. contorta is an ecologically and economically important tree that is widespread in the mountainous regions of western North America [
18]. It is a fire-adapted species that mediates regeneration after disturbance, has a major impact on forest structure and ecology, and is a foundation species of many montane forest ecosystems. It is one of the most variable pines, and grows in a variety of conditions ranging from low elevations to timberline [
19], where it has evolved in response to diverse selection pressures, including those arising from variation in seed predator communities [
20-
23] and fire regime [
24,
25]. The current mountain pine bark beetle (
Dendroctonus ponderosae) epidemic is causing unprecedented mortality of
P. contorta throughout the Rocky Mountains [
26], which is likely to cause rapid and massive changes in community structure and ecosystem processes. Consequently, a greater understanding of fine-scale population genetic variation and the genetic control of traits important to these forests would be beneficial and timely.
Although a large number of EST sequences for loblolly pine (
P. taeda) exist in public databases (e.g., NCBI), far fewer resources exist for
P. contorta (a single EST before 2010; ca. 40,000 ESTs as of January 2010) and other pines, despite the importance of the genus. This paucity exists in part because pines have enormous genomes (10,000-40,000 megabase pairs [Mbp] vs. 115 Mbp in
Arabidopsis thaliana) with large amounts of repetitive DNA [
27,
28], making whole genome sequencing projects difficult or impractical. The construction of large EST collections is thus the most promising approach for providing functional genomic-level information in pines [
29]. Whereas other labs are currently generating
P. contorta ESTs using Sanger sequencing (K. Ritland and J. Boehlman, pers. comm.), additional sequencing effort is needed to expand genomic-level resources. The development of genomic resources for
P. contorta should facilitate basic and applied research on the genetics and evolutionary ecology of this species and its role in maintaining forest health and ecosystem function [
29,
30]. In addition, EST collections for
P. contorta will contribute to the development of molecular markers for other pines and facilitate comparative genomics and the study of adaptive variation across the genus.
Here we describe 454 pyrosequencing of
P. contorta cDNA and assess the utility of this approach for transcriptome characterization and marker discovery in a species with a large and complex genome. Normalized cDNA collections from multiple tissues and individuals were used to sample large numbers of expressed genes and to detect simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). First, we describe the assembly and functional annotation of EST sequences and the level of transcriptome coverage provided by our sequence data. Second, we discuss the detection and characterization of a surprisingly large number of sequences representing retrotransposons. Finally, we use our assembled sequence data to develop a variety of gene-based markers for population genomic studies, including SSRs within regions that are conserved with another pine species and SNPs in regions with deep read coverage. We designed high-quality PCR primers for a large number of the SSRs we characterized, providing an immediately available resource of genetic markers for pines. Along with other recent studies [
5,
7,
15,
16], our results demonstrate the utility, and highlight some of the challenges, of next-generation transcriptome sequencing applied to non-model organisms.
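To make the notion of SSR detection concrete, the following is a minimal sketch of how perfect di-, tri-, and tetranucleotide repeats might be located in an assembled sequence with a regular expression. It is not the procedure used in this study: the function name `find_ssrs` and the minimum repeat thresholds (6, 5, and 4 repeats, respectively) are assumptions chosen for illustration, and imperfect or compound repeats are ignored.

```python
import re


def find_ssrs(seq, min_repeats_by_motif_len=None):
    """Return (start, motif, n_repeats) for perfect SSRs in a DNA sequence.

    ASSUMPTION: the default thresholds below are illustrative only.
    """
    if min_repeats_by_motif_len is None:
        min_repeats_by_motif_len = {2: 6, 3: 5, 4: 4}
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats_by_motif_len.items()):
        # One captured motif followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. AA or ATAT), so each repeat is reported only once.
            if motif in (motif + motif)[1:-1]:
                continue
            n_repeats = len(match.group(0)) // motif_len
            hits.append((match.start(), motif, n_repeats))
    return hits


# Toy sequence containing (AT)7 and (CAG)5, flanked by non-repetitive bases.
example = "GGC" + "AT" * 7 + "CCGTT" + "CAG" * 5 + "TTGACC"
ssrs = find_ssrs(example)
```

Each reported position is a 0-based offset into the sequence, which could then be used to extract flanking regions for primer design.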