Array-based SFP discovery proved to be a high-throughput approach to develop new molecular markers for genetics and breeding in tomato. We used genomic DNA as a hybridization target to detect SFPs on an Affymetrix (Santa Clara, CA) array. Validation rate leveled off between 71% (α ≤ 10⁻⁴) and 75% (α ≤ 10⁻⁷) for predicted SFPs between three cultivated tomato varieties. However, using α values between 10⁻⁶ and 10⁻⁷ resulted in a high percentage of known SFPs within probe features being excluded.
Our empirically determined estimate of the efficiency of random sequencing in cultivated tomato is 3.5% SNP discovery on a per gene (EST) basis, an estimate that is influenced by the low frequency of polymorphism (1 per 8,500 bp) and the unequal distribution of polymorphism within genes [15]. A random sequencing approach is therefore highly inefficient for SNP discovery within cultivated tomato. The efficiency of random sequencing would increase to 18%-19% on a per gene basis if non-coding sequences were targeted [16]. The use of array hybridization, which raised the rate of SFP validation to over 70%, represents a dramatic improvement in efficiency relative to random sequencing.
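The comparison above can be made concrete with a small arithmetic sketch. It uses only the percentages quoted in the text; the function names are hypothetical, and the calculation assumes that a per-gene discovery rate and a per-prediction validation rate can both be read as "fraction of candidates that yield a confirmed marker", which is a simplification.

```python
# Rates quoted in the text (fractions of candidates that yield a marker).
RANDOM_CODING = 0.035      # random sequencing, per gene (EST)
RANDOM_NONCODING = 0.18    # lower end of the 18%-19% non-coding estimate
ARRAY_VALIDATION = 0.70    # lower bound on validated array-predicted SFPs


def fold_improvement(new_rate, baseline_rate):
    """How many times more productive new_rate is than baseline_rate."""
    return new_rate / baseline_rate


def candidates_per_marker(rate):
    """Expected number of candidates examined per confirmed marker."""
    return 1.0 / rate


if __name__ == "__main__":
    for label, rate in [
        ("random sequencing, coding", RANDOM_CODING),
        ("random sequencing, non-coding", RANDOM_NONCODING),
        ("array-predicted SFPs", ARRAY_VALIDATION),
    ]:
        print(f"{label}: {candidates_per_marker(rate):.1f} candidates per marker")
    print(f"array vs. coding: {fold_improvement(ARRAY_VALIDATION, RANDOM_CODING):.0f}-fold")
```

Under these assumptions, roughly 29 genes would have to be sequenced at random per SNP-containing gene, against fewer than two array predictions per validated SFP.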
The complexity of the target, the detection method (algorithm), and the stringency of the probability threshold all affect the validation rate. Target complexity depends on whether cDNA (mRNA) or genomic DNA is used and, when genomic DNA is the target, on genome size. mRNA has been used as a hybridization target to reduce the complexity of large plant genomes. This approach, however, introduces other issues due to the presence of multi-gene families, variation in the level of expression, and post-transcriptional sequence polymorphism [21]. The robustified projection pursuit (RPP) method was developed as a way to improve SFP detection with fewer biological replicates [19]. Projection pursuit analyses also perform well under a range of distributions [29], a feature of particular importance for hybridizations using an mRNA target, where the range of expression must be considered. Using RPP and selecting probes with overall outlying scores (u) from the 5% distribution tail, the validation rate in barley was 80% [19]. Using the same method for cowpea, but selecting u from the 15% distribution tail, resulted in a validation rate of 67% [23]. Thus, the stringency of selection is a key factor in reducing the false discovery rate. SFP detection accuracy, assessed by predicting known SNP genotypes from mRNA hybridizations, is reported to be as high as 95% when multiple methods are used [20]. This value drops to ~80% when a single method is used [20]. Our detection of SFPs using DNA as the target was less sensitive than similar studies in A. thaliana. We attribute this difference to the complexity of the target, which is approximately 950 Mb for tomato and 125 Mb for A. thaliana. Our validation rate was comparable to the 75% found for rice, which has a genome size of 400 Mb [22]. Increasing the stringency (lowering the α value) decreased the false discovery rate while increasing Type II error, as expected. Values of α between 10⁻⁴
provided the best balance between false discovery and eliminating true polymorphisms.
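The trade-off between false discovery and Type II error can be illustrated with a minimal sketch. The function name, the data layout, and the toy p-values below are all hypothetical and invented for illustration; they are not the study's data. The sketch assumes each probe has a p-value and a known true/false polymorphism status from sequencing.

```python
def evaluate_threshold(calls, alpha):
    """Summarize SFP calls at a given alpha.

    calls: list of (p_value, is_real_polymorphism) pairs for probes whose
           true status is known (hypothetical layout).
    Returns (validation_rate, fraction_of_real_polymorphisms_missed).
    """
    called = [real for p, real in calls if p <= alpha]
    total_real = sum(1 for _, real in calls if real)
    true_called = sum(called)
    validation_rate = true_called / len(called) if called else float("nan")
    missed = (total_real - true_called) / total_real if total_real else float("nan")
    return validation_rate, missed


# Invented toy data: (p-value, confirmed polymorphism?)
toy_calls = [
    (1e-8, True), (5e-8, True), (1e-7, True), (5e-7, False),
    (1e-6, True), (5e-6, True), (1e-5, True), (5e-5, False),
    (1e-4, True), (5e-4, False), (1e-3, False), (1e-3, True),
]

if __name__ == "__main__":
    for alpha in (1e-4, 1e-5, 1e-6, 1e-7):
        rate, missed = evaluate_threshold(toy_calls, alpha)
        print(f"alpha <= {alpha:g}: validation {rate:.2f}, missed {missed:.2f}")
```

In the toy data, tightening α raises the validation rate but discards a growing share of real polymorphisms, which is the pattern described above.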
The SNPs discovered from array hybridization provided a tool both to estimate θ (Fst) and to inspect allele distribution within and between groups in order to assess the effects of selection during the breeding history of cultivated tomato. Selection of individuals with favorable mutations during domestication and through breeding practices has led to a reduction of genetic diversity in crop species [30]. A narrow genetic base has previously been reported in cultivated tomato [31]. It is postulated that genetic bottlenecks occurred during domestication and during the introduction of tomato to Europe from Latin America by Spanish explorers [31]. The patterns of lower SNP variation we observed in vintage and landrace groups relative to wild tomatoes document a genetic bottleneck. However, breeding practices have stressed the introgression of new genetic variation from wild species, especially for disease resistance [34]. Tomato breeding for fresh-market and processing varieties diverged, with a strong emphasis on distinct ideotypes reinforced by the initiation of mechanical harvest. Efforts to develop tomatoes specifically for mechanical harvest were initiated in 1943, but did not produce acceptable varieties until the mid-1960s [36]. Given the historical practices of tomato breeding that include introgression and market differentiation, we hypothesize that genetic differentiation may have occurred between varietal classifications and that elite germplasm may contain more variation relative to landrace and vintage varieties. Our pairwise estimates of θ between the five subpopulations representing fresh-market, processing, vintage, landrace, and S. pimpinellifolium strongly suggest that genetic differentiation has occurred due to breeding. Furthermore, within-subpopulation estimates of genetic diversity provide evidence that modern breeding practices have broadened the genetic diversity of tomato relative to landrace and vintage varieties. These results are consistent with previous findings [31].
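To show how a pairwise differentiation statistic is built from SNP allele frequencies, the sketch below computes a simple heterozygosity-based Fst for two subpopulations. This is a stand-in, not the estimator used in the study: θ usually denotes the Weir and Cockerham estimator, which adds sample-size corrections that are omitted here. The function name and the input layout (one allele frequency per biallelic SNP per subpopulation, equal weighting of the two groups) are assumptions made for illustration.

```python
def pairwise_fst(freqs_a, freqs_b):
    """Heterozygosity-based Fst between two subpopulations.

    freqs_a, freqs_b: frequencies of the same reference allele at each
    biallelic SNP in subpopulation A and B (hypothetical layout).
    Fst = sum(Ht - Hs) / sum(Ht) over loci, where Hs is the mean
    within-group expected heterozygosity and Ht is the expected
    heterozygosity of the pooled (equally weighted) groups.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_bar = (p_a + p_b) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        numerator += h_t - h_s
        denominator += h_t
    # No polymorphic locus in either group: Fst is undefined.
    return numerator / denominator if denominator > 0 else float("nan")


if __name__ == "__main__":
    # Invented frequencies for two groups at three SNPs.
    group_a = [0.9, 0.5, 1.0]
    group_b = [0.2, 0.5, 0.0]
    print(f"Fst = {pairwise_fst(group_a, group_b):.3f}")
```

Identical allele frequencies give a value of 0 and fixed differences give 1; applying the function to every pair of the five subpopulations would yield a pairwise matrix analogous to the one discussed above.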
We also investigated whether a subset of highly polymorphic (≥ 4 SNPs) genes might contain functional changes. We identified six loci with high ratios of non-synonymous substitution relative to our control gene. Proteins encoded by these genes include a phosphoinositide-specific phospholipase C (Le011957), a cytochrome P450 (Le006895), a fertility restorer-like protein (Le013946), a ubiquinol-cytochrome C chaperone protein (Le004790), and two proteins of unknown function (Le013904 and Le007111). These genes may be candidates for functional analysis in order to identify genes that contribute to existing phenotypic variation in crop plants.
Plant genomes have evolved under human selection. Perhaps the best-documented consequence of this selection is a reduction of variation caused by genetic bottlenecks during the domestication process and through selective sweeps due to linkage to genes that are desirable in agriculture [38]. Much of what we currently know about the genes that were selected during domestication and breeding derives from the map-based cloning of individual genes. Selection has often been toward loss-of-function mutations. Examples include the loss of seed dispersal in grains through shattering [39] and loss of branching [38]. At the same time, some desirable phenotypes are due to gain-of-function mutations. Examples include disease resistance [2], high beta-carotene content in tomato, which is conferred by a promoter mutation leading to increased expression of the fruit-specific beta-cyclase [40], and elongated fruit shape due to the duplication, translocation, and subsequent over-expression of an IQ67 domain-containing gene in tomato fruit [12].
The idea that algorithms might be applied for the high-throughput identification of the genes selected during crop improvement has been proposed for a number of plant species. The application of such approaches is somewhat crop-specific, and is influenced by mating system and rates of polymorphism. In highly diverse species, such as maize, the focus has been to identify selective sweeps [38]. However, in crops that have experienced severe genetic bottlenecks, it will be difficult to distinguish selective sweeps from the effects of genetic drift due to the bottlenecks themselves. On the other hand, species with reduced genetic variation might offer a model to detect genes with increased levels of polymorphism. Arguably, these are the genes that are of most interest to plant breeders, as they likely contribute to existing phenotypic variation. In the case of tomato, an obstacle will be to distinguish genes that are associated with introgressions due to linkage disequilibrium from the selected genes themselves.