Assignment of Illumina reads to reference wheat unigenes
Two near-isogenic lines segregating for the
GPC-B1 grain protein content locus were selected to identify SNPs across this chromosomal region. The first line was tetraploid durum wheat Langdon (LDN), which carries a non-functional allele at
GPC-B1. The second line was tetraploid recombinant substitution line 65 (RSL65), which is derived from a cross between LDN and a wild emmer (
Triticum turgidum ssp.
dicoccoides) chromosome 6B substitution line [LDN (DIC6B)][
4]. RSL65 carries a chromosome 6B segment of wild emmer of at least 30 cM that includes a functional
GPC-B1 allele, but should be isogenic to LDN outside this interval. To reduce variation from growing conditions, parental lines LDN and RSL65 were grown together in the same pot. Total RNA was extracted from leaves at 5th leaf stage and the non-normalized samples were prepared for mRNA-seq on the Illumina GAIIx (120 cycles, paired end (PE), one lane per parent). A flow-chart of the complete process is available in Additional File
1: Figure S1.
A total of 25.6 and 31.7 million paired end 120 base reads were obtained for LDN and RSL65, respectively, after excluding low quality reads and trimming for adaptor contamination and low quality regions. There is currently no reference genome sequence available for wheat, therefore individual reads from each parent were aligned to the NCBI wheat transcriptome [
25], which comprises 40,349 unigene sequences, totaling 31,671,110 bases. Almost 50% of LDN raw reads (25.5 million) mapped to the reference unigenes with 78.6% of these mapping in pairs. For RSL65, 47% of raw reads (29.9 million) mapped to the NCBI Unigene reference, with 67.2% of them mapping in pairs. Taken together, of the total 114.6 million Illumina reads produced, 48.3% mapped to the NCBI Unigene reference.
RNA was not normalized in this study as the objectives were both SNP discovery and subsequent identification of allele frequencies between bulked samples. To examine the consequences of this approach on the sequencing of highly expressed transcripts, the relative expression level of unigenes was estimated by calculating the transcript abundance expressed as reads per kilobase per million mapped reads (RPKM) [
26]. Unigenes were ranked according to their RPKM (from highest to lowest) and the accumulated frequency of mapped reads calculated (Figure ). For the LDN dataset, 14.8% of all mapped reads correspond to the 10 most expressed unigenes (including genes related to photosynthesis and a repetitive element) and half of all mapped reads correspond to the 248 most highly expressed unigenes. In RSL65, these values are slightly greater with 50% of mapped reads aligning to the 647 most highly expressed unigenes. The slope of the curve in Figure decreases after the 50% value in both parental lines. This suggests that the lack of normalization results in the extreme over-representation of a few transcripts (50% of reads map to 0.98% or 2.47% of all unigenes with mapped reads in LDN and RSL65, respectively).
Despite this and the incomplete reference used, there was sufficient sequencing depth in a single Illumina lane to align reads (with at least 8-fold coverage) to 24,083 and 25,080 unigenes in LDN and RSL65, respectively. This represents approximately 60% of the unigene reference set that was present in the parental data. Comparison of unigene transcript levels (expressed as RPKM) between parental lines showed a high correlation (R2 = 0.85, Figure ) suggesting that the approach followed should be efficient for both SNP discovery and the bulked segregant analysis later described.
SNP discovery in near-isogenic lines of tetraploid wheat
In polyploids, SNP discovery is confounded by the presence of two types of SNPs. The first corresponds to polymorphisms between homoeologous genomes that occur within homozygous individuals. These SNPs are commonly found in tetraploid (A and B genomes) and hexaploid wheat (A, B and D genomes) as the homoeologous genomes share sequence identities of ~96-98% [
27]. These SNPs are referred to as inter-homoeologue polymorphisms (IHP) [
21]. The second type of polymorphism corresponds to varietal SNPs between individuals, representing what is traditionally referred to as allelic variation. This type of SNP is much less frequent with modern wheat varieties being ~99.9% identical across corresponding orthologous loci. Since the transcript assemblies within the reference wheat unigenes represent a consensus sequence derived from the co-assembled A, B and D genomes (i.e. without ambiguity codes, just like the singletons), the two types of SNPs cannot be distinguished by simple SNP discovery pipelines. Therefore, SNPs were detected and scored based on methods previously developed for the polyploid oilseed rape
Brassica napus [
21,
28].
In this approach, it is expected that reads originating from homoeologous genes (A and B genomes in tetraploid wheat) will be mapped to the same unigene reference. Maq (default parameters)[
29] was used to call SNPs with respect to the reference for each parental line separately, thus generating two SNP sets. This was followed by a second step in which a custom Perl script was used to derive the symmetric difference of the two parental SNP sets. Polymorphisms between homoeologous genomes should generate the same ambiguity code in parental lines and should be common to both SNP sets (Figure , C/T generates a Y code). On the other hand, varietal SNPs between LDN and RSL65 should generate an ambiguity code for only one parent and therefore be unique to that corresponding SNP set (Figure , full-line boxed SNP). Such varietal SNPs at the same locus were termed "hemi-SNP" by Trick et al. [
21]. In cases when only one genome is expressed or for single copy genes, SNPs could be identified as traditional "simple-SNPs" with each parent represented by a single base.
The implementation of this two-step pipeline identified 3,963 putative varietal SNPs across 2,520 unigenes between LDN and RSL65. The vast majority of the SNPs discovered (80.8%) were of the hemi-SNP type, manifested as ambiguity codes and signifying co-expression of homoeologous genes, which have a particular nucleotide polymorphism in one parent only. In these circumstances, the hemi-SNP then becomes allelic and serves as a genetically tractable marker.
A concern when working with non-sequenced and complex genomes such as those of wheat is the presence of closely related paralogues or pseudogenes that could confound SNP discovery and downstream marker design and mapping. Evidence of paralogues being mapped to the same reference sequence would be revealed as unigenes with high SNP density. Therefore, SNP densities were calculated for the 2,520 unigenes with at least one varietal SNP and their frequency distribution examined. The average SNP density was 1.80 SNPs/kb (± 1.46), and decayed rapidly after 3 SNPs/kb. This value is similar to that calculated empirically (2.2 SNPs/kb) by sequencing 6.8 kb from exons and UTRs of four genes across the
GPC-B1 region [
14]. Based on these results, unigenes with SNP densities higher than 5 SNPs/kb were eliminated to avoid possible paralogues yet still allowing space for some variation. Roughly 13% of putative SNPs were discarded, reducing the number to 3,427 putative SNPs across 2,427 unigenes (1.80 SNPs/kb).
SNP discovery using modified alignment and SNP calling criteria
Visual examination of several SNPs revealed that Maq was not able to properly call SNPs occurring close to homoeologous SNPs or within a 10-bp sliding window. This meant that some IHPs were being called as varietal hemi-SNPs because of their close proximity. Therefore, the default Maq parameters were relaxed by setting the maximum summed quality score of mismatched bases to 120 (default 70) at each step in the workflow (referred to as Maq-120 hereafter). This allowed SNP haplotypes occurring within the 120 base reads to align with higher mapping quality and reduce these errors.
The Maq-120 analysis produced some improvements in terms of total reads mapped to the unigene reference (55.2% across both parental lines compared to 48.3% in Maq-default), although the percentage of reads that mapped in pairs remained constant (Table ). These reads aligned with at least 8-fold coverage to 25,262 and 26,180 unigenes in LDN and RSL65, respectively (approximately 63% of the NCBI Unigene set). A total of 6,035 putative SNPs were identified across 3,195 unigenes (2.41 SNPs/kb), which was a considerable increase compared to the Maq default pipeline. Again, hemi-SNPs were most predominant with over 89% of SNPs being assigned to this type. After filtering for unigenes with > 5 SNPs/kb, roughly one quarter of putative SNPs were discarded (1,605 SNPs across 226 unigenes). The final outputs of the Maq-120 analysis were 4,430 putative SNPs across 2,969 unigenes (1.90 SNPs/kb).
| Table 1Sequencing statistics and mapping results for parental lines using Maq-default and Maq-120 analysis |
Identification of putative linked SNPs via bulked segregant analysis
Twenty eight homozygous lines with recombination events across a ~12-cM interval (
Xwms508 -
Xwms193) [
5] were used to assemble bulks for contrasting phenotype (grain protein concentration). The lines were first characterized for three markers spanning the 250-kb region adjacent to
GPC-B1 to confirm their genotype. All lines produced the expected results, except RSL135, which was heterozygous for all markers and was therefore excluded from downstream work. Equal amounts of total RNA from the 14 individuals previously classified as having high protein content were mixed to produce the high protein RNA bulk, whereas 15 recombinant lines were used for the low protein bulk (for detailed description of these lines see 'Materials and Methods'). For each bulk, two libraries with slightly different average insert size (250-bp and 400-bp) were constructed and sequenced in individual Illumina lanes (80 cycles, paired end).
The quality control assessment of reads was similar to the parental lines. For both the Maq-default and the Maq-120 analysis, a higher percentage of reads mapping to the reference set was obtained in the 400-bp libraries compared to the 250-bp libraries (~25% more reads mapping in the former), although a smaller number of reads aligned in pairs (Table ). Combining both libraries, a total of 53.4 and 61.2 million paired end reads were produced for the High and Low bulk, respectively. Again, the Maq-120 analysis was more successful in aligning a higher percentage of reads to the reference (47.6%) compared to the Maq-default analysis (41.8%). These values are similar to those obtained in the parental lines when comparing only the equivalent 400-bp libraries.
| Table 2Sequencing statistics and mapping results of bulked samples using Maq-default and Maq-120 analysis |
The objective of sequencing RNA from bulked samples was to compare the allelic frequencies for the parental SNPs between the two bulks. In a diploid organism, a SNP coinciding with the gene underlying the trait of interest should be revealed by informative base frequencies tending to either 1.0 or to 0.0 in the two bulks, depending on the parental origin. However, in polyploid species this upper limit of 1.0 is lower. For example, for a hemi-SNP identified in LDN (C/G) with respect to RLS65 (C), the signal from the informative base (G) is partially masked by the homoeologous (non-informative) base (C) (Figure ). Therefore, the upper limit frequency of the informative base would tend to 0.5 in tetraploid species, assuming quantitative co-expression of the two homoeologous transcripts. This value would change when relative transcript levels depart from parity. For each bulk, the frequency of the informative base was calculated at each SNP position and then the ratio between the bulks was determined for each SNP. This ratio between bulks was termed bulk frequency ratio (BFR). For hemi-SNPs with an informative base derived from LDN, the BFRs were determined by dividing the frequency in the low bulk by the frequency in the high bulk. For hemi-SNPs derived from RSL65, these values were reversed. Thus, the BFR provides a relative measure of the enrichment of the corresponding parental allele in the appropriate bulk (LDN for the low bulk and RSL65 for the high bulk).
The approach focused exclusively on the putative SNPs previously identified between LDN and RSL65 in the initial experiment and did not seek to identify SNPs in the complex bulk RNA mixtures. Of the 3,427 SNPs (2,427 unigenes) identified in the Maq-default analysis, 1,619 SNPs were recovered and had sequence coverage of at least 8-fold in both bulks (Additional File
2: Table S1). In this analysis, only one SNP per unigene was used to estimate the BFR. Therefore, the percentage of SNPs recovered was only 47.2%, even though this represents 66.7% of the unigenes. The highest BFR identified was the simple SNP T115G in unigene Ta#S16259088, which had a 28.5-fold enrichment. The LDN allele was present in only 3.4% of the high bulk sequences, whereas 96.5% of the low bulk reads carried this allele. These values are consistent with the expected frequencies of a tightly linked simple SNP. This unigene was homologous to
Brachypodium Bradi3g03340, rice LOC_Os02g04500, and sorghum Sb04g003030, which are all in the corresponding collinear regions to wheat chromosome arm 6BS, where
GPC-B1 maps.
Allele frequencies of the Maq-120 putative SNPs (4,430 SNPs across 2,969 unigenes) were also examined in the bulks. In this new pipeline, the frequencies and ratios of multiple SNPs within a single unigene were estimated independently and thus all SNPs were considered. Over 71% of SNPs (3,172) had at least 8-fold coverage and could be detected in both bulks (Figure , Additional File
3). Again, the putative SNP with the highest BFR was T115G in unigene Ta#S16259088, although the exact values changed slightly with the Maq-120 pipeline (29.5-fold enrichment). The putative SNP with the second highest BFR was a RSL65 hemi-SNP G582R in Ta#S32700697. The informative base (A as R = G/A) was present in only 1.7% of the low bulk reads, whereas it was found in 39.7% of the reads from the high bulk (23-fold enrichment). These values are consistent with the expectations of a tightly linked hemi-SNP. Again, this unigene has homology to cereal genes that map within the syntenic regions to wheat chromosome arm 6BS (Bradi3g03530, LOC_Os02g04660, Sb04g003160).
SNP validation
The distribution of the BFRs across both analyses was used to determine which putative SNPs to validate and map (Table ). Approximately 85% of SNPs had a BFR lower than 2.0, suggesting that both LDN and RSL65 alleles were expressed at relatively similar levels in both bulks. As the BFR threshold increased, the number of SNPs decreased in comparative terms in both analyses. The objective was to use a threshold that was sufficiently low to examine the sensitivity of this approach, while maintaining the number of SNPs to be validated within a manageable number. Therefore, a BFR of ≥ 3.0 was empirically determined as the threshold resulting in a total of 270 SNPs (41 SNPs were common between both analyses) across 253 unigenes.
| Table 3Number and percentage of SNPs identified at different BFRs using Maq-default and Maq-120 analyses |
To validate these putative polymorphisms, a single SNP for each unigene was selected for marker development. The 99 putative SNPs from the Maq-default analysis (BFR above 3.0) were first examined. Fifteen SNPs were either located too close to the ends of the available sequence or their assigned unigenes were repetitive as determined by high number of hits to the 5× Chinese Spring wheat genomic sequence (> 100 hits, E-value 1E-50) [
30]. Therefore, these 15 SNPs were removed from further analysis. A total of 84 assays were developed of which 82 gave successful amplification in LDN and RSL65. PCR amplicons were visualized through single strand conformation polymorphism (SSCP) and the target polymorphism was detected in 48 unigenes (58.5%), whereas 34 assays (41.5%) were monomorphic. The absence of the putative SNPs in the monomorphic amplicons was confirmed by direct sequencing of PCR products (Additional File
1: Figure S2).
The subsequent Maq-120 analysis yielded a total of 212 putative SNPs with BFRs above 3.0, and 41 SNPs (19.3%) were in common with the Maq-default analysis. Examination of these common putative SNPs showed that of the 39 successful assays previously developed, 30 were polymorphic between LDN and RSL65 (76.9%), whereas only nine were monomorphic. This was a large increase with respect to the average polymorphic rate in the initial Maq-default analysis (58.5% polymorphic) (Figure ).
To assess if the increase in validation rate was consistent across all the Maq-120 putative SNPs, SNP assays were developed for an additional 43 putative SNPs that were only identified in the Maq-120 analysis. The Maq-120 unique SNPs had a lower validation rate (16 polymorphic, 37.2%) to those of the common SNPs, comparable to the validation rate of the Maq-default unique SNPs (41.5%). In total, 82 putative SNPs were successfully screened from the Maq-120 analysis (39 common and 43 unique) with 46 SNPs being validated as polymorphic (56.1%) and 36 confirmed monomorphic. These results suggest that the Maq-120 analysis is able to identify putative SNPs with a similar conversion rate (56.1%) to that of the Maq-default analysis (58.5%), but that the SNPs common between both analyses produces the putative SNP set with the highest validation percentage (76.9%).
In summary, 64 SNPs (18 unique to Maq-default, 16 unique to Maq-120, 30 common between both methods) from a total of 125 putative SNPs were confirmed (51.2%) as polymorphic in the parental lines.
Mapping of validated SNPs
The 64 validated SNPs were screened in the individual RSLs that comprised the high and low protein bulks. The SNPs from the Maq-default analysis were screened by SSCP (42 SNPs) and direct sequencing of PCR products (6 SNPs), whereas the confirmed SNPs from the Maq-120 analysis (excluding the common ones with the Maq-default analysis) were screened using KASPar assays (16 SNPs). All confirmed SNPs could be scored in the 29 RSLs and parental lines, except for one SNP (Ta#S32606580), which was very difficult to score unequivocally by SSCP and was therefore removed from further analysis.
A total of 40 SNPs mapped to the 12.2 cM region encompassing
GPC-B1, defined by
Xwms508-
Xwms193 (Figure , Additional File
1: Figure S3). This translates to roughly 63% of validated SNPs mapping to the target region as determined by the RSLs. Twenty-four SNPs could not be mapped immediately linked to this region, although 11 SNPs were linked amongst them. Of the 40 linked SNPs, one mapped just distal to
Xwms508, whereas 39 SNPs mapped within the 12.2 cM target interval (Figure ). This equates to an average marker density of one SNP marker per 0.31 cM across this region, with values ranging from one SNP marker per 0.59 cM in the distal end of the map (
Xwms508-GPC-B1), to one SNP marker every 0.19 cM in the proximal end (
GPC-B1-
Xwms193). Almost all recombination events across the region were identified, including additional ones that were previously not resolved. The recombination events not identified correspond to those just flanking the
GPC-B1 gene on either side, between
Xucw79 and
Xucw71. The composition of the 40 SNPs was similar to that of the overall Maq-default and Maq-120 analysis, with a higher number of hemi-SNPs (33) than simple SNPs (7), and a similar proportion coming from each parent (16 LDN and 17 RSL65 SNPs). The SNP marker details are provided in Additional File
4.
Unique orthologues could be determined for 32 of the 40 wheat unigenes in the sequenced Brachypodium and rice genomes, whereas 31 unigenes had unique orthologues in sorghum. Based on the established syntenic relationships, 22, 18 and 20 of these orthologues were located on the corresponding syntenic regions in Brachypodium, rice and sorghum, respectively. This implies that between 56% (rice) and 69% (Brachypodium) of the mapped unigenes are syntenic. These collinear regions encompass between 5.5 and 6.8 Mb and include 665, 989 and 607 genes in Brachypodium, rice and sorghum, respectively. The distribution of the orthologous syntenic cereal genes was examined to estimate the number of genes between each SNP in the sequenced genomes. On average, the orthologous syntenic genes were separated by 32, 59 and 32 genes in Brachypodium, rice and sorghum, respectively. This average includes markers on adjacent genes to markers separated by at most 103 genes in Brachypodium and sorghum and 248 genes in rice. The remaining unigenes have orthologues that map elsewhere in these sequenced grass genomes and there is no apparent pattern except for two unigenes, which are linked in wheat and map closely in Brachypodium (Bradi1g67540 and Bradi1g67570), rice (LOC_Os03g15390 and LOC_Os03g15350) and sorghum (Sb01g040520 and Sb01g04530).
The
GPC-B1 gene was mapped using the phenotypic information previously published for the RSLs [
5,
14] within a 0.4 cM interval defined by flanking loci Ta#S37941845 and Ta#S17984935 (Table , Figure ). These markers were either completely linked or just one recombination away from the SNPs with the highest BFRs from the Maq-default and Maq-120 analyses (Ta#S16259088 and Ta#S32700697, Figure ). The homology of these markers to
Brachypodium, rice and sorghum provides an immediate location for
GPC-B1 within a region including approximately 18, 16 and 13 genes, respectively. Table shows the characteristics of these markers with their syntenic relationship. These results, together with the distribution of SNP markers across the wheat
GPC-B1 genetic map, suggest that the enrichment via BSA was very effective at identifying closely spaced markers, enabling the mapping of the gene to a narrow genetic interval.
| Table 4Characteristics of the eight SNP markers surrounding the GPC-B1 locus |
To further refine this position, we examined the parental SNP data to identify cases where a collinear cereal gene did not have a corresponding wheat unigene, or to assess if any parental SNPs in candidate genes were not identified in the bulks. We analyzed 23 wheat unigenes with homology to genes within the intervals defined in Brachypodium, rice and sorghum, but only one had a SNP between LDN and RSL65 (BFR 1.9) and was confirmed to be monomorphic. We next developed an additional nine gene models in wheat for genes that had no corresponding unigene in the reference set used to map the Illumina reads. Again, this search was unsuccessful, as no additional parental SNPs were found based on our mapping criteria. Regardless of these specific results, the ability to examine the data in an iterative manner should be useful for other gene targets.
Effects of coverage on SNP calling and enrichment in bulks
Increasing the depth threshold applied to the whole SNP calling process serves to increase the probability that reads from each homoeologue are sampled. We therefore examined the effect of different minimum depth thresholds (8-fold, 12-fold and 16-fold) on the false positive rate of SNPs identified under the Maq-120 analysis. First, the total number of SNPs with a BFR above 3.0 decreased from 212 using 8-fold coverage, to 121 (12-fold coverage) and 84 putative SNPs using 16-fold coverage. The subset of 82 SNPs from the Maq-120 analysis, which was experimentally tested, was further examined (Figure ). With 8-fold coverage, 56.8% of predicted SNPs (46 of 81 functional assays) were validated in the parental lines and could be mapped in the recombinants. This value increased considerably with 12-fold and 16-fold coverage, where 66% (31 of 47) and 83% (25 of 30) of the putative SNPs were confirmed to be polymorphic. A similar number of the total putative SNPs was assayed (between 37-40%) under the three coverage scenarios, suggesting that the comparisons are meaningful. In all three cases, 67-72% of the polymorphic SNPs mapped to the GPC-B1 target interval. In summary, increasing coverage from 8-fold to 16-fold reduces the total number of putative SNPs identified in the bulks by 60%, increases the validation rate from 57%-83%, but does not affect the percentage of validated SNPs mapping to the target interval (approximately 67-72%).