Switchgrass (Panicum virgatum L.) is a perennial C4 warm-season grass native to North America, where it occurs naturally from 55° N latitude to deep into Mexico, mostly as a dominant species of the tall grass prairies. In North America, it has been used for more than 50 years for soil conservation, as a forage crop, and as an ornamental grass
[1]. In 1992, switchgrass was designated by the United States Department of Energy (DOE) as a model herbaceous energy crop for ethanol and electricity production, selected out of a wide array of candidate species
[2]. Switchgrass possesses many desirable qualities of a biomass crop for energy and fiber production, including high net biomass production per hectare, low production costs, low nutrient requirements, relatively low ash content, high water use efficiency, an extended range of geographic adaptation, ease of establishment by seed, adaptation to marginal soils, and potential for carbon storage in soil
[3]–
[5].
Two genetically and phenotypically distinct switchgrass ecotypes, lowland and upland, were identified in early genetic screening studies. They are distinguished by a number of morphological traits and by their natural habitat. The lowland ecotype has a taller, coarser, upright phenotype, with a more rapid growth habit compared to the upland ecotype, and is generally found in wetter habitats, such as floodplains. The upland ecotype is found in drier sites and is recognizable by its finer-stemmed and often semi-decumbent phenotype
[1],
[6]. With respect to genetic distinguishing features such as ploidy levels, lowland switchgrass ecotypes are mostly tetraploid (2n = 4x = 36), whereas upland switchgrass ecotypes are much more complex in their ploidy levels, and generally display higher orders of ploidy. Upland ecotypes, despite the high frequency of octoploidy (2n = 8x = 72), also show a high incidence of aneuploidy and both tetraploid and hexaploid chromosome numbers; diploid and even duodecaploid individual plants have also been reported, although rarely
[7]–
[11]. Due to the differences in ploidy levels, these two ecotypes are mostly reproductively isolated with only occasional gene flow. However, the level of natural gene flow between the ecotypes is unknown. Although most of the recent research and breeding has focused on the lowland ecotype, with its stable, simpler genome and its high yield potential, particularly in warmer parts of the US, our project targeted northern-adapted, upland germplasm.
The molecular genetic characterization of switchgrass began with the use of both restriction fragment-length polymorphisms (RFLP) and randomly amplified polymorphic DNA (RAPD) markers to develop genetic fingerprints for the existing cultivars
[9],
[12]. These works established that the upland and lowland ecotypes were genetically distinct from one another, based on chloroplast and nuclear DNA markers. The natural distribution and history of switchgrass suggest that the species most likely possesses high levels of genetic variation. At higher ploidy levels that prevail in upland-adapted germplasm, polyploidy and polysomic inheritance patterns may contribute to this diversity. Observed frequencies of multivalents
[13],
[14] and increased levels of within-cultivar diversity observed in octoploids relative to tetraploids tend to favor this view
[15]. Although prior studies highlighted a need for molecular maps to assist and hasten breeding efforts on primary biofuel traits, preliminary marker studies in the 2000s determined that mapping would not be straightforward
[16]. Starting in 2006, with the new wave of interest in genetic improvement for biofuel production through marker-assisted breeding and genomic selection, the DOE funded several new projects under the Biomass Genomics Research Program
[17]. To date, three of these projects targeted several switchgrass cultivars for EST and short-read genomic sequencing
[18] for the purpose of marker development to promote future efforts for quantitative genetic practices such as linkage mapping, association mapping, and genomic selection.
The massive natural distribution and outcrossing life history of switchgrass suggest that the species most likely possesses high levels of genetic variation. Domestication efforts targeting improvement of the feedstock characteristics of switchgrass have only been in progress for the last two decades. Therefore, in many regards, switchgrass is an undomesticated forage grass that has held a dominant ecological role in large parts of the US prairie. This suggests that even registered cultivars are likely to have retained considerable allelic variation that could be utilized to improve the biofuel production potential and efficiency of this species. Conventional breeding efforts of switchgrass are time consuming and challenging. Marker-assisted breeding could reduce the cycle time of this perennial by severalfold. However, assuming it has characteristics similar to other highly outcrossing grasses, such as maize, the traits targeted for domestication/breeding are very likely to be controlled by hundreds of quantitative trait loci (QTL) with small effects
[19]. In a breeding context, these numerous small-effect QTL are best utilized using a marker-assisted breeding approach known as “Genomic Selection” (GS) or “Genome-wide selection” (GWS). GS relies on a simple principle of marker-trait association: when thousands of markers spanning the whole genome are tested together for their association with a trait, at least one marker will be in linkage disequilibrium (LD) with each and every QTL, regardless of its effect size or location, so that all QTL effects can be captured
[20]. In low diversity species like cattle, this can be achieved by as few as 50,000 SNPs
[21]. However, in a high diversity species like maize, more than two million SNPs may be necessary
[22] to properly carry out GS across the species. Regardless of the species, however, all GS studies require preliminary studies with a large-scale, marker-discovery component. Such efforts are already underway for several other organisms such as
Arabidopsis thaliana
[23], rice
[24], and maize
[22],
[25]. For switchgrass, assuming it would be similar to maize, it is likely that GS efforts will also require a large number of markers, estimated between 50,000 and two million SNPs. The actual number will be determined by the effective population size, effective recombination rate, trait heritability, and number of QTL, which may be highly variable between individual breeding programs.
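The marker-QTL linkage disequilibrium on which GS depends can be illustrated with a minimal sketch. It assumes genotypes coded as diploid-style allele dosages (0, 1, or 2), which is a simplification for a polyploid such as switchgrass, and it measures LD as the squared correlation between dosage vectors. The function names, marker names, and genotype values are hypothetical and chosen only for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return cov * cov / (var_x * var_y)


def best_tagging_marker(qtl, markers):
    """Return (marker_name, r2) for the marker in strongest LD with the QTL."""
    return max(((name, r_squared(qtl, g)) for name, g in markers.items()),
               key=lambda item: item[1])


# Hypothetical dosages (copies of the alternate allele) for 8 plants.
qtl = [0, 1, 2, 2, 0, 1, 1, 2]
markers = {
    "snp_a": [0, 1, 2, 2, 0, 1, 1, 2],  # identical pattern to the QTL
    "snp_b": [0, 1, 2, 1, 0, 1, 1, 2],  # one plant differs from the QTL
    "snp_c": [2, 0, 1, 1, 0, 2, 1, 0],  # weakly related to the QTL
}
name, r2 = best_tagging_marker(qtl, markers)
print(name, round(r2, 3))
```

With denser marker panels, the chance that some marker tags each unobserved QTL at high r² increases, which is why the required marker number grows with the diversity of the species.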
Next generation sequencing platforms are transforming the way genomes are analyzed. Although SNP discovery using short-read sequence data is still in its early stages, several studies have already demonstrated that large numbers of high quality SNPs can be identified in a cost-effective manner using these data
[26]. In these studies, deep-sequence coverage across many samples was necessary to identify high-quality SNPs. One way to achieve high levels of overlapping coverage between the libraries is to reduce the number of genomic sites surveyed in each library, which would allow for deep sequencing over selected fractions of the genome. This can be achieved by digesting each sample with a common restriction enzyme, often with a DNA-methylation state bias to enrich for transcribed regions and generate reduced representation genomic libraries (RRGLs)
[27]–
[29].
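The trade-off between the fraction of the genome surveyed and the sequencing depth can be sketched with an in silico digest. This is a minimal illustration under stated assumptions, not the protocol used in this study: the genome is a random sequence, the recognition site CTGCAG (that of PstI) is an arbitrary example, the cut is placed immediately before each site for simplicity, and the size window and read numbers are hypothetical.

```python
import random


def digest(sequence, site):
    """Cut `sequence` immediately before every occurrence of `site`."""
    fragments, start = [], 0
    pos = sequence.find(site)
    while pos != -1:
        fragments.append(sequence[start:pos])
        start = pos
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return [f for f in fragments if f]  # drop empty fragments


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the size window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def mean_depth(n_reads, read_len, target_bases):
    """Average coverage when all reads fall on `target_bases` bases."""
    return n_reads * read_len / target_bases if target_bases else 0.0


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200_000))
fragments = digest(genome, "CTGCAG")
selected = size_select(fragments, 200, 2000)  # hypothetical size window
target = sum(len(f) for f in selected)
fraction = target / len(genome)

# Same hypothetical read budget spread over the whole genome or the subset.
whole = mean_depth(10_000, 100, len(genome))
reduced = mean_depth(10_000, 100, target)
print(f"surveyed fraction: {fraction:.3f}")
print(f"depth, whole genome: {whole:.1f}x; reduced library: {reduced:.1f}x")
```

Because every sample is cut with the same enzyme, the same subset of sites is recovered in each library, so a fixed read budget yields deeper, overlapping coverage across samples.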
Here we describe how we coupled EST library and short-read sequencing approaches to discover over 149,000 SNPs in switchgrass. In the process, we developed a consensus reference sequence of the switchgrass transcriptome of about 87.5 Mb, spanning the gene space of the switchgrass genome, to anchor RRGL reads. Furthermore, we investigated population structure within our samples through principal component analysis (PCA) of ~4,400 SNPs that had complete genotype information across all samples.
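The idea of restricting a PCA to SNPs with complete genotype information can be sketched as follows. This is an illustrative sketch only, not the analysis pipeline used here: it assumes genotypes coded as dosages with `None` for a missing call, uses a simple power iteration in place of a full eigendecomposition, and all names and values are hypothetical.

```python
import math


def complete_snps(genotypes):
    """Keep SNP rows with no missing calls (None) in any sample."""
    return [row for row in genotypes if all(g is not None for g in row)]


def first_principal_component(genotypes, n_iter=200):
    """Return unit-length PC1 loadings per sample via power iteration."""
    # Centre each SNP (row) so the analysis reflects between-sample differences.
    centred = []
    for row in genotypes:
        mean = sum(row) / len(row)
        centred.append([g - mean for g in row])
    n = len(centred[0])
    # Sample-by-sample cross-product (Gram) matrix of the centred data.
    gram = [[sum(row[i] * row[j] for row in centred) for j in range(n)]
            for i in range(n)]
    vec = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(n_iter):
        new = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0:
            break  # no variation left to describe
        vec = [v / norm for v in new]
    return vec


# Hypothetical data: rows are SNPs, columns are six samples from two groups.
genotypes = [
    [0, 0, 0, 2, 2, 2],
    [0, 1, 0, 2, 2, 1],
    [2, 2, 2, 0, 0, 0],
    [1, None, 1, 1, 2, 1],  # missing call: excluded from the PCA
    [0, 0, 1, 2, 1, 2],
]
kept = complete_snps(genotypes)
pc1 = first_principal_component(kept)
print(len(kept), [round(v, 2) for v in pc1])
```

In this toy example the first three samples and the last three receive PC1 values of opposite sign, which is the kind of separation that indicates population structure.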