|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: AJL MH JLJ. Performed the experiments: AJL MH. Analyzed the data: AJL MH. Wrote the paper: AJL MH. Interpreted the results: AL MH JJ.
Genome-wide association studies (GWAS) may benefit from utilizing haplotype information for making marker-phenotype associations. Several rationales for grouping single nucleotide polymorphisms (SNPs) into haplotype blocks exist, but any advantage may depend on such factors as genetic architecture of traits, patterns of linkage disequilibrium in the study population, and marker density. The objective of this study was to explore the utility of haplotypes for GWAS in barley (Hordeum vulgare) to offer a first detailed look at this approach for identifying agronomically important genes in crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project and constructed haplotypes using three different methods. Marker-trait associations were tested by the efficient mixed-model association algorithm (EMMA). When QTL were simulated using single SNPs dropped from the marker dataset, a simple sliding window performed as well or better than single SNPs or the more sophisticated methods of blocking SNPs into haplotypes. Moreover, the haplotype analyses performed better 1) when QTL were simulated as polymorphisms that arose subsequent to marker variants, and 2) in analysis of empirical heading date data. These results demonstrate that the information content of haplotypes is dependent on the particular mutational and recombinational history of the QTL and nearby markers. Analysis of the empirical data also confirmed our intuition that the distribution of QTL alleles in nature is often unlike the distribution of marker variants, and hence utilizing haplotype information could capture associations that would elude single SNPs. We recommend routine use of both single SNP and haplotype markers for GWAS to take advantage of the full information content of the genotype data.
Recent advances in sequencing and genotyping technology have allowed the collection of large amounts of genome-wide single nucleotide polymorphism (SNP) data for many species, primarily with the goal of finding associations between alleles and phenotypes of interest. Numerous statistical methods for such association studies have been proposed, many focused on the mapping of variation underlying common diseases in humans (e.g., Welcome Trust Case Control Consortium ), while others have focused on organisms as diverse as Arabidopsis , dogs  and cattle . It has become apparent that the choice of association mapping methodologies depends on the characteristics of the study population. In contrast to biparental mapping populations, in which the pattern of LD and the allele frequency distribution are known and are independent of population genetic parameters, every association mapping population has a unique population history (both recent and ancient) that shapes its patterns of genetic variation and may determine which mapping method works best.
Patterns of variation can be described by the extent to which variation at linked SNPs is “block-like”, i.e., most haplotypes fall into a few classes with little evidence of recombination. This quality was first observed in dense SNP data from The SNP Consortium Allele Frequency Project , and has led to a great interest in determining whether the power and accuracy of association mapping can be improved by grouping SNPs into haplotype blocks (see Zhao et al.  for a review). Various rationales for testing for associations between phenotypes and haplotypes, rather than single SNPs, have been proposed, including that haplotypes: capture epistastic interactions between SNPs at a locus , ; provide more information to estimate whether two alleles are IBD ; reduce the number of tests and hence the type I error rate ; allow informed testing between clades of haplotype alleles by capturing information from evolutionary history ; provide more power than single SNPs when an allelic series exists at a locus .
These rationales may be more or less relevant depending on the marker density and LD structure of the data. For example, the possibility of epistasis among SNPs is much greater when SNP density is very high, and is particularly relevant to studies where many common SNPs have been typed across a candidate gene . Using genotype and gene expression data from HapMap populations, Dimas et al.  have proposed that interactions between protein-coding and regulatory SNPs may be common, and there is some evidence for SNP-SNP interactions within genes and gene clusters (e.g., Hamon et al. ). But there is no evidence to suggest that epistasis occurs frequently between randomly chosen SNPs hundreds of kilobases apart.
Long and Langley , in a simulation study using parameters based on human data, concluded that “[o]ver the entire parameter space examined in this work and under the simple population genetic model considered, single-marker-based, permutation-based tests are either of similar or greater power than haplotype-based tests.” Since then, however, simulations based on the LD and population history of livestock have shown that haplotypes can provide greater QTL detection power and mapping accuracy than single markers can –. Zhao et al. , conducting simulations designed to resemble the demography and population history of livestock, found no apparent advantage to using haplotypes over single SNPs. These conflicting results suggest, as Long and Langley  noted, that “under different models relating genotype to phenotype or under different demographic scenarios, [their] conclusion may not be valid,” and that the power of QTL mapping with haplotypes must be evaluated on a case-by-case basis.
A GWAS report  as well as LD fine mapping studies ,  using empirical data have found significant associations between haplotypes and phenotypes that were not detectable by a single SNP analysis. These results underscore a key difficulty in simulation studies: unless simulations accurately model the genetic architecture and population history of QTL alleles, they will have limited relevance to empirical datasets. Furthermore, it is likely that the nature of the QTL-marker associations is sufficiently variable that no one simulation approach can capture them.
Genome-wide association studies (GWAS) are now being used to identify genes underlying agronomically important traits in crops, many of which have self-pollinating mating systems. The objective of this study was to explore the utility of haplotypes for GWAS in barley, as a representative of such crops. To accomplish this, we used genotype and phenotype data from the Barley Coordinated Agricultural Project (Barley CAP; www.barleyCAP.org) for both simulations and analysis of empirical data. As for many plant studies, the current Barley CAP marker dataset is fairly sparse, with an average SNP spacing of about 2.4 Mb (2198 SNPs/5350 Mb genome), in a population of 1807 individuals. While this number of markers may seem too small to be of any value, linkage disequilibrium in this population is quite extensive in comparison to humans  or cattle , and there is substantial clustering of SNPs , so many adjacent SNPs are correlated. There are various criteria for defining haplotype blocks, with the most appropriate for a given case dependant on how the haplotypes will be used . We used three different approaches to group sets of SNPs into haplotype blocks: the four gamete method as implemented in Haploview , based on recombination; the HapBlock method , based on diversity; and a simple sliding window. For each one of these approaches, we tested the power to detect associations between haplotype alleles and simulated QTL. In addition, we used the blocks defined in the four gamete method to construct parsimony trees, and tested the power to detect associations between the simulated QTL and edges in the trees, as proposed by Templeton et al.  and implemented in the TreeScan software . Finally, association analysis using the haplotypes, parsimony tree edges, and single SNPs was conducted on heading date data collected on a large set of barley germplasm from the Barley CAP.
The mean genetic map distance between adjacent markers was ~0.5 cM, though 60% of adjacent SNP pairs had the same map position. When markers mapped to the same position, it was most likely because of insufficient resolution of bi-parental maps rather than because of actual identical positions . The distributions of minor allele frequency (MAF), mean r2 between adjacent SNPs, and the highest r2 within 10 cM, are shown in Table 1. Because the complete sequence of the barley genome is not available, the physical distance between markers is not known.
We identified haplotype blocks using three methods based on different properties of the data. Properties of the blocks are shown in Table 2. Both the 4gamete and HapBlock methods produce blocks that vary greatly in size, depending on regional properties of the data (i.e., linkage disequilibrium and marker density). About 26% of the block boundaries are shared between the 4gamete and HapBlock methods; across the genome, 38 of the blocks are identical. To contrast with these methods, we also grouped SNPs by a simple sliding window approach. Because the median block size of the other two methods was three SNPs, we used a block size of three SNPs (SlideWin3).
The 4gamete method resulted in the fewest SNPs being included in blocks (85%), and less than half of the genetic map was covered by blocks. In contrast, the SlideWin3 method covered the entire genetic map, because the blocks were overlapping. The HapBlock method produced the largest average number of alleles per locus (4.7), consistent with its higher average block size in cM. The lower average alleles per locus for the 4gamete method was mainly due to the large number of singleton (i.e., unblocked) SNPs.
Power to detect single causal SNPs (QTL) removed from the marker dataset for each QTL size (p) and h2 is presented in Table 3. Substantial power to detect QTL at reasonable false discovery rates (FDRs) was observed only when p was set to 0.12 (Table 3; Figure 1). At an FDR of 0.1, power to detect p=0.03 QTL was 0.01 or less in all cases, while power to detect p=0.06 QTL was 0.09 and 0.18 at h2 levels 0.25 and 0.75, respectively. Increasing the proportion of phenotypic variation caused by the polygenic effect (i.e., h2) increased power at constant QTL size because the K matrix, which describes the genome-wide genetic covariance between individuals, can account for more of the phenotypic variation. This produces a clearer QTL signal. The relationship between power and FDR for each QTL size is presented in Figure 1.
Power was largely equivalent between the different analysis methods when p=0.06 and 0.03 and h2=0.25. When h2=0.75 and p=0.06, the single SNP and SlideWin3 analyses provided more power at both FDRs. SlideWin3 displayed a substantial advantage over all other methods when p=0.12. The advantage ranged from 20% to 33% over the next best method.
Simulating QTL by assigning a phenotypic value to a genotyped SNP marker assumes that the frequency distribution of QTL is similar to that of markers. To simulate a scenario in which the causal variant is younger than the marker variants, pairs of SNPs at the center of haplotype blocks were chosen and a phenotypic effect was assigned to lines carrying a specific combination of alleles at those loci (see Methods). Because we found that p=0.06 and h2=0.75 best separated the different analyses when QTL were simulated as removed, single causal SNPs, these parameter levels were used for this second round of simulations. With few exceptions, each analysis provided the highest power only when the corresponding block structure was used for simulating the QTL (Table 4). For example, when QTL simulations were based on pairs of SNPs at the center of HapBlock haplotype blocks, QTL were detected most effectively by performing the association analysis with HapBlock haplotypes. An important observation to note is that SlideWin3 almost always resulted in the second best power, and sometimes even numerically better than the haplotype analysis method matching the QTL simulation method, e.g. 4gamete QTL at FDR=0.20 in Table 4.
We conducted genome-wide association mapping on heading date in the Barley CAP germplasm using the same single SNP and haplotype analyses performed on the simulated phenotypes. Across all analyses, three associations were found between markers and heading date (Figure 2), two of which were detected by haplotype analyses only. The strongest association was for markers on the long arm of chromosome 2H near the centromere (position 63.53 cM on the consensus map of Close et al. ). This region has been referred to as Qrgz-2H-8  and is typically associated with QTL for heading date and Fusarium Head Blight resistance . Microsatellite marker GBM1023, used by Nduulu et al.  for genetically dissecting this region, is linked to the SNP highly associated with heading date in this study, POPA2_1399 .
Working across Figure 2, the next marker-heading date association occurred on chromosome 3H, position 126.3. Interestingly, this association was found by the TreeScan analysis only. Figure 3 shows the configuration of the parsimony tree constructed with the alleles of this 4gamete haplotype block. There was a significant difference between allele 011 and the average of alleles 000, 010, and 110. The SNP at position 3 in this block separates 011 from the other alleles in this tree, but this SNP also groups 011 with 001 and 111, which are recombinant alleles that do not fit in the parsimony tree displayed in Figure 3. The allelic effects of these two recombinant alleles are quite different than 011. Despite a small number of individuals carrying the recombinant alleles, they influenced the effect of the SNP at position three by enough to increase the p-value to above the significance threshold. The p-value for the SNP at position 3 was 3.7×10−4, while that of the TreeScan edge was 8.5×10−6. The p-value for the 4gamete analysis was 5.5×10−4, illustrating that grouping lines with alleles 000, 010, and 110 together and performing a single degree of freedom test provided more power than the 4gamete multi-degree of freedom test. The SNP at position three in this block (POPA2_0650) maps very near to markers found to be associated with heading date in barley in at least three prior mapping studies. These include RFLP ABG377, reported to be associated with heading date by Hayes et al.  and Thomas et al. , and microsatellite Hvm33 by Pillen et al. .
An association was found on chromosome 7H solely by the 4gamete and SlideWin3 analyses. The three alleles at this locus and best linear unbiased estimates of their allelic effects (in parentheses) were: 101 (0.735), 010 (0.728), and 011 (0). Grouping lines by any of the individuals SNPs in this haplotype block groups lines with different phenotypes, nearly eliminating any power to detect this association. Accordingly, p-values for the single SNP analysis in this region were much higher than the haplotype analyses (Figure 2). The number of individuals carrying allele 011 was 145. These lines were distributed across four breeding programs as follows: Aberdeen, 30 lines; NDSU 2-row, 73 lines; Washington State, 42 lines. The three SNPs that composed this haplotype block (POPA3_0893, POPA3_0894, POPA3_0895) were included on the Barley CAP genotyping platform because they are located within VRN-H3, a cloned gene controlling vernalization response that is orthologous to Arabidopsis Flowering Locus T . Yan et al.  showed that variation exists in the promoter region of this gene independent of growth habit and could not rule out the possibility that this locus contributes to variation in flowering time within barley growth habit.
As mentioned in the introduction, one advantage of grouping SNPs into haplotype blocks is reducing the number of tests and making it easier to reject the null hypothesis. Because tests involving markers within linkage groups are not independent of one another, and the distribution of markers across the genetic map is not uniform, the effective number of independent tests is not necessarily linearly related to the number of markers (i.e., the number of loci as reported in Table 2). Using permutations, we determined the significance threshold required for an experiment-wise error rate (EER) of 0.05. The p-value thresholds were related to marker number, with the single SNP and TreeScan thresholds being the most stringent and the HapBlock threshold, with the fewest markers (Table 2), being the least stringent (Figure 2). The Bonferroni correction for multiple tests calculates the p-value that should be used for individual tests in order to maintain a desired EER assuming independent tests. Since our tests are not independent, it would be useful to calculate the effective number of independent tests to illuminate the degree to which grouping linked SNPs into haplotypes reduces the problem of multiple testing. We solved the Bonferroni function for test number and found the following effective independent test number for the various analyses: single SNP, 1164 tests; 4gamete, 744 tests; SlideWin3, 791 tests; HapBlock, 521 tests; TreeScan, 1306 tests. The concept discussed above is illustrated by the difference in effective test number between SlideWin3 and single SNP. These two methods have nearly the same number of loci because each SNP, with the exception of those on chromosome ends, gives rise to a new haplotype block in SlideWin3. Adjacent SlideWin3 blocks, however, share two SNPs and are therefore correlated more strongly with one another compared to adjacent single SNPs. This shows that although overlapping sliding windows generate nearly as many loci as single SNPs, it is still useful for reducing the problem of multiple testing.
The dramatically increasing availability of DNA markers will produce a landslide of genome-wide association mapping studies in crop species in the coming years. Our goal was to identify which analytical methods perform best in discovering genes controlling complex traits in crop germplasm collections so that allelic diversity can be mined most efficiently. Studies in other organisms have shown that, under certain conditions, multimarker SNP haplotypes may provide increased power to detect QTL. However, those conditions are not well understood, and the population genetic characteristics of plant species may differ in important ways from human and animal species, e.g., many plants are self-pollinating.
The first issue we faced was the choice of blocking methods, of which a large number have been proposed in the literature. The vast majority of those methods were designed for application to human data sets with high densities of markers; the density of SNP markers in the BarleyCAP data set is much lower. However, in spite of this low density, there is significant LD between adjacent markers, due both to the substantial clustering of SNPs on the genetic map and to LD that extends over at least 10 cM . We explored the use of two methods, based on LD (4gamete) or diversity (HapBlock), to assign SNPs to blocks; using the settings we implemented, 85% or 98% of SNPs were incorporated into blocks, respectively. Although these two methods grouped a large fraction of SNPs in blocks, when the genetic map distance between blocks was tallied, 37% and 53% of the genetic map fell between block boundaries (Table 2). A similar proportion of our simulated QTL also fell between block boundaries: 63% of the QTL fell between 4gamete blocks and fell 39% between HapBlock blocks. On the other hand, the sliding window method grouped all SNPs into blocks. Because the sliding windows were overlapping, there was no map distance left between blocks as well as no simulated QTL between blocks.
To compare the power of haplotypes and single SNPs in finding QTL, we simulated QTL at known genetic locations using 100 evenly distributed SNPs removed from the marker data set. We found that haplotypes did not provide an advantage at smaller QTL sizes, but at the largest QTL size investigated, SlideWin3 provided substantially more power than the single SNP analysis (Table 3). TreeScan analysis performed significantly worse than the single SNP analysis when heritability and QTL effect size were high. We considered several technical factors that could contribute to the small differences in power observed in some situations:
The differences in power that we observed between the methods are the net, combined effect of these technical factors as well as any differences in information content (i.e., marker-QTL LD) of the various marker types. As evaluated by the power/FDR relationships summarized in Table 3, this net effect is apparently close to zero for some of the methods and situations, but comes out in favor of SlideWin3 for large QTL size. To better understand the relative importance of technical factors versus information content for detection power, we compared the information content of the various marker types. We calculated r2 (or its multiallelic equivalent, χ2′ ) between each of the 100 QTL and the markers within 10 cM, for each method. This maximal r2 should strongly affect detection power of a method. Comparisons among the methods are shown in Table 5. While the LD between single SNPs and QTL is highly correlated with the LD between haplotype blocks and QTL (the correlations range from 0.84 to 0.90), for 63 of the QTL, at least one haplotype method generates a block with χ2′ at least 0.1 units higher than r2 between that QTL and the best single SNP. At 15 QTL, all three haplotype methods produce markers with χ2′ at least 0.1 units higher than r2 for any single SNP. For only 13 QTL is the r2 for the best single SNP 0.1 higher than the maximum χ2′ of any of the blocking methods. In no case was a QTL more strongly associated to a single SNP than to a block from more than one haplotype method. That haplotype alleles were in higher LD with QTL than were the single SNP but often did not provide greater detection power indicated that, overall, the technical factors worked against the haplotype methods.
Examination of Table 5 also provided a hypothesis for the greater increase in power of the SlideWin3 method for QTL of large effect. First we note that, with the exception of the TreeScan method, all haplotype methods increased in power more than the single SNPs as QTL size went from 0.06 to 0.12. Likewise, the variance across all QTL in their maximal LD with a marker was lower for haplotype methods (with the exception of TreeScan) than for the single SNPs (Table 5). For QTL of size 0.06, the median likelihood ratio was below the detection threshold (i.e., power was below 50%). In that case, having a higher variance in LD was beneficial because it caused a higher variance in the likelihood ratio such that in more cases the ratio exceeded the threshold. In contrast, for QTL of size 0.12, the median likelihood ratio was above the threshold. Higher variance was then detrimental because it caused more cases to fall below the threshold.
Our analyses using phenotypes simulated on the basis of single SNP genotypes suggested that there can be an advantage to using haplotypes instead of single SNP markers in GWAS. Moreover, it is likely that the method of simulating the QTL biased the results in favor of the single SNP analysis. There is no reason to assume that the QTL alleles found in nature – the only relevant QTL – are distributed in populations and across the genome the same as SNPs placed on an Illumina GoldenGate SNP chip. To determine the effect of QTL simulation on our results, we performed three additional sets of simulations in which no SNPs were removed, and QTLs were assigned to a pair of adjacent SNP alleles (see Methods). This approach confirmed that, when the causal SNP is one of the genotyped markers, the power of single SNP analysis is superior to that of haplotypes (Table 4). However, when the QTL effect was simulated by a combination of SNPs, the blocking method that combined that pair had the greatest power to detect it in most cases. A QTL such as this may occur in nature when the age of causal polymorphism is younger than the surrounding marker polymorphisms . If the haplotype containing the causal mutation subsequently increases in frequency through drift or selection, the QTL allele would be in higher LD with the surrounding haplotype as a whole than with any of the single SNPs within the haplotype. In this scenario, the increased information content of the haplotype alleles clearly offset any loss of power associated with technical issues.
The results of the heading date analysis are consistent with those from the QTL simulations: different marker-trait associations were found across different analyses, reflecting the heterogeneous history of mutation, recombination, and drift across the barley genome. The power to detect a QTL is highest when the associated marker allele has a similar frequency to that of the QTL allele; when a QTL allele is in lower frequency than the nearby markers, a combination of alleles at two or more markers may generate a haplotype that is closer in frequency. This appears to be the case for the QTL on chromosome 7H, which was in higher in LD with a 4gamete haplotype at VRN-H3 than with any single SNP within this block. In an equilibrium population, this situation is most likely when the mutation giving rise to the QTL is younger than the mutations giving rise to the markers. In a domesticated species that has experienced at least one bottleneck and strong selection, allele frequency is not necessarily a reflection of allele age, and such a configuration can also be due to genetic drift.
The basis of the association between heading date and a TreeScan edge on chromosome 3H is somewhat more complicated. The increased power of the TreeScan method at this locus was due to 1) the higher power of a test with one degree of freedom and 2) the elimination of the phenotypically discordant recombinant alleles from the set of individuals that carry allele 1 at the most strongly associated SNP (Figure 3). Had we included these recombinant alleles in the parsimony tree, this would not have changed the result, namely, the significance of the edge between haplotype I and haplotypes II, III, and IV. However, the results in Table 5 suggest that this type of scenario favoring TreeScan does not happen frequently. Given the considerable effort involved in implementing the TreeScan method, which was originally developed for candidate gene analysis and cannot easily be automated, we do not recommend its routine use in GWAS.
It is worth mentioning that all the QTL detected for heading date were found by at least one of the haplotype analyses; even the peak on chromosome 2 (Figure 2) was topped off by an association with a 4gamete haplotype. However, the reasons for potential greater power of the single SNP analysis listed above, combined with the fact that no extra effort is required for performing the single SNP analysis, warrant its continued importance in GWAS.
Because both genetic architecture and population history are likely to differ across genes and traits, it is not reasonable to expect one method to be superior at detection of all QTL. The examination of LD around individual simulated QTL shows that, in some cases, multi-SNP haplotypes can be in much stronger LD with a QTL-SNP than are any of the constituent SNPs. Both our simulation and barley heading date results provide good support for the use of simple, overlapping sliding windows for GWAS to complement a single SNP analysis. Fortunately, the sliding window haplotype method is also the easiest to implement, as it is implemented in the available software programs PLINK  and TASSEL . Whether haplotypes would provide any advantage when marker density is much higher remains to be determined.
SNP data consisted of 3072 SNPs scored on 1824 barley lines using two Illumina GoldenGate oligonucleotide pool assays (BOPA1 and BOPA2 in Close et al. ). Barley lines were from CAP years 2006 and 2007 only. Unmapped SNPs and those with MAFs<0.028 were removed from the data set (a minimum of 50 individuals carried the minor allele). Since barley is highly inbred, the genotypic data were treated as effectively haploid. Heterozygous loci were rare and were scored as missing data. After removing duplicate lines and lines with large amounts of missing data, the final data set consisted of 2198 mapped SNPs scored in1807 lines.
There were many sets of SNPs, ranging in size from two to 34 SNPs, with identical genetic map position; these sets were ordered so as to generate maximal LD among adjacent markers. For example, if markers a, b, c, d, e, f were at 8, 10, 12, 12, 12, 15 cM, the best order would be chosen from a-b–[c, d, or e]- [c, d, or e]- [c, d, or e]-f by maximizing LD between b - [c, d, or e] and f - [c, d, or e] as well as LD among c, d, and e.
Three methods were used to identify haplotype blocks among the SNPs:
Haplotype block identification was conducted on the original genotype data, which included 0.7% missing data. Because some of our analysis methods would not tolerate missing data, we used FastPHASE  to impute all missing SNP alleles. This program uses haplotype clustering that changes with genetic map position and allele frequencies to calculate a probability that an individual carries the reference allele at a locus. For each blocking method, we extracted the SNP boundaries and common alleles (frequency ≥0.02) identified using the original data and applied this information to the dataset with the imputed data, creating a haplotype allele incidence matrix that contained the probabilities each individual carried allele i at haplotype block j. The haplotype allele incidence matrix used for the association analysis was created as follows:
k columns were created for each haplotype block, where k is the number of alleles with frequency ≥0.02, and allele 1 is the most frequent. In addition, each unblocked SNP was represented by one column, as in the single SNP analysis. For each block, columns 1 to (k−1) store the probabilities that an individual carries alleles 2 through k, respectively. Column k stores the probability that an individual carries a minor allele. In the case that an individual carries allele 1, the row values for that block will be all 0s. If an individual is missing data for at least one SNP within a haplotype block, at least one column among the set of columns representing that block will have a value between 0 and 1, since the probabilities of various haplotype alleles are the products of the probabilities of the component SNPs as determined by the imputation procedure.
Parsimony trees for the 4gamete haplotypes were estimated using the pars function in Phylip . Blocks produced by the HapBlock and SlideWin3 methods were unsuitable because they could include significant amounts of recombination that would result in failure to find a single best tree. The parsimony tree file for each 4gamete block was analyzed by the program TreeScan , which identifies the edges in the tree and, for each edge, the alleles that belong to the two clades defined by the edge. For example, if a block has four alleles (101,111,001,110), the tree topology is (31.00,(41.00,20.00)1.00,10.00) and there are three edges in the tree: 3 / 4 2 1; 4 2 / 3 1; 4 / 3 2 1. This is illustrated in Figure 4.
Using the TreeScan output, a clade membership probability file was created as follows: 2 columns were created for each of the k edges at each 4gamete block. The first column for each edge stores the probability that the individual belongs to the first clade at that edge. The second column stores the probability that the individual belongs to neither clade; this occurs when an individual carries a rare haplotype that wasn't included in the tree. As for the other haplotype methods, unblocked SNPs were represented by one column each.
Simulated phenotypes were composed of one QTL, a polygenic effect, and error. Different QTL sizes (p=0.03, 0.06, and 0.12 of phenotypic variation) and heritability (h2=0.25 and 0.75) were simulated. One hundred SNPs, chosen to minimize pairwise LD and to maximize genetic map coverage, were designated as quantitative trait nucleotides (QTLs) and removed from the marker dataset, leaving 2098 markers for the association mapping analysis. Minor allele frequency among the 100 QTL was at least 0.10; median MAF was 0.34. Simulated QTL were assigned an additive effect inversely proportional to the standard deviation of the allelic state in order to standardize the amount of genetic variance attributed to QTL with different MAF. SNPs composing the polygenic effect were selected by forming 400 k-means  clusters of SNP markers from Barley CAP genotype data and selecting the SNP nearest the centroid of each cluster. This step ensured that more than one SNP per haplotype block was not used for the polygenic effect. Marker scores at these loci were simply summed for each line to create the polygenic component of the phenotype.
Variation due to the QTL and polygenic component together compose the total genetic variance. Error effects were randomly sampled from a standard normal distribution and added to the genetic values to obtain phenotypes with h2=0.75 or 0.25. Each of the 100 SNPs designated as QTL was used to simulate 10 phenotypes, producing 1000 phenotypes. To summarize, a factorial of QTL size (p=3, 6, and 12% of phenotypic variation) and h2 (0.75 and 0.25) was used to form six sets of 1000 phenotypes.
Because we were concerned that simulating QTL using SNPs from the Barley CAP genotyping platform favored the single SNP analysis, we also simulated QTL alleles that deviated from the single SNP model in terms of frequency and LD with surrounding SNPs. This was accomplished by defining a QTL effect as a particular combination of adjacent SNPs. The pairs of adjacent SNPs (hereafter referred to as QTL pair) were chosen on the basis of each haplotype block structure so that each haplotype block method would be favored by one set of simulated QTL. Designating QTL pairs based on the SlideWin3 block structure, an effect was given to individuals with the allelic combination (0-0) at the first two positions within each 3-SNP haplotype block. In Figure 4, this would correspond to Edge 1, which separates alleles 001 from all other alleles. To simulate QTL based on the 4gamete block structure, an effect was given to individuals with a specific pair of alleles at the center of the blocks as defined by the 4gamete method. Because there are only 791 4gamete haplotype blocks and therefore only 791 QTL pairs, another allelic combination (0-1) was used to designate additional QTL pair effects. QTL pair designation on the basis of HapBlock haplotype blocks was carried out like that for 4gamete. In all three cases, allele combinations with frequency less than 2.8% were not considered. As before, single SNPs were used to simulate QTL, but in this case the causal SNPs were left in the marker dataset so a marker in complete LD with the QTL always existed. The remainder of the phenotype was simulated as described above.
A population of 400 lines was selected from the 1807 original lines by forming 400 k-means clusters based on the SNP data and selecting the line nearest the centroid. If two or more lines were equidistant from the centroid, one of these lines was randomly selected. Population size was reduced from 1807 to 400 to reduce computing time. Additionally, a population size of 400 is within the range of population sizes typically used in association genetics studies . Creating 400 k-means clusters and sampling the line nearest the centroid maximized allelic diversity and historical recombination within the selected lines. Maximizing genetic diversity is often a primary objective in designing association mapping panels , .
For QTL detection, marker-trait associations were tested by the mixed linear model
where y is a vector of phenotypes, is a vector of fixed marker effects (i.e., single SNP, haplotype alleles, or parsimony tree edges), h is a vector of polygenic effects caused by relatedness, e is vector of residual effects, and X and Z are incidence matrices relating y to and y to h, respectively. It is assumed h~N(0, ) and e~N(0, ), where K is an allele-sharing matrix calculated from the SNP data, is the genetic variance, I is an identity matrix, and is the residual variance. Zhao et al.  found that modeling population structure with an allele-sharing matrix controlled false positives as well as using a mixed model including both a kinship matrix and population substructure effect (Q matrix) used by Yu et al. .
The above model was implemented using the efficient mixed-model association algorithm (EMMA) of Kang et al. . The most important advantage of EMMA for our purposes is its speed, being orders of magnitude faster than other mixed model algorithms. An allele-sharing matrix was calculated using the kinship.emma function. As it is currently set up, the EMMA R package has strict formatting requirements and handles only bi-allelic marker data. In order to use EMMA to analyze the 4gamete, HapBlock, and SlideWin3 data, we modified the EMMA R package to perform likelihood-ratio tests of more than one degree of freedom (df). The haplotype allele incidence matrix, as described in Haplotype Block Identification section, is used in place of the SNP matrix. A likelihood-ratio test was performed for each haplotype block, and the test statistic was compared to a chi-squared distribution with df equal to the number of haplotypes at the locus in question minus one. To validate our modified EMMA R script, one phenotype was analyzed using the above model and 4gamete haplotype data in PROC MIXED of SAS by performing a likelihood-ratio test for each locus. The resulting p-value of each 4gamete haplotype block was the same across all markers as those obtained from the modified EMMA script.
The format of the TreeScan edge score data is exactly that of single SNP data, allowing use of the unmodified EMMA package. While the format is the same, it is important to note that each TreeScan test is for the effect of a tree edge on the phenotype. That is, lines are grouped by their clade membership within the parsimony tree created for each 4gamete block and clades connected by an edge are tested for phenotype differences. This obviously can result in several tests per haplotype block. The between-clade test resulting in the lowest p-value is retained as the representative test of any given haplotype block. This results in matrix of p-values with the same dimensions as the 4gamete p-value matrix.
Comparing performance between the five different marker types – single SNP, 4gamete, HapBlock, SlideWin3, TreeScan edges – was done on the basis of power at empirical FDR of 0.10 and 0.20. A 20 cM window surrounding the QTL was used for declaring a true discovery. For the haplotype markers, distance to the QTL was calculated from the borders of the block. When the QTL was simulated as a combination of two QTLs, the QTL position was set to the midposition of the QTL pair. At any given p-value, power is calculated as the number of times at least one association is found within 10 cM of the true QTL position divided by 1000 (because there were 1000 single-QTL simulations). False discovery rate was calculated as 1−(true associations)/(total associations), where true associations is the total number of marker-phenotype associations declared significant at a nominal p-value where the marker in question was located within 10 cM of the QTL and total associations is the total number of marker-phenotype associations declared significant regardless of position relative to the QTL. A superior marker type for association mapping maximizes the power at the given FDR.
To estimate power and its standard error at an empirical FDR of 0.10 or 0.20, the following bootstrapping algorithm was performed: 1) p-values from the association analysis were arranged in a 1000 by m matrix, where 1000 is equal to the number of phenotypes analyzed and m is equal to the number of markers of the method used (e.g., m=2098 for single SNPs and m=791 for 4gamete; Table 2). 2) 1000 rows (i.e., simulated phenotypes) of this matrix were randomly sampled with replacement. 3) Power and FDR were calculated as described above. 4) Power at a specific FDR was interpolated using a local simple linear regression model relating power to FDR. 5) Steps 2–4 were repeated 1000 times to produce a bootstrap sampling distribution of power at a designated FDR. 6) A point estimate of power was taken as the mean of the bootstrap sample distribution and the standard error was calculated as the standard deviation of the bootstrap sampling distribution. Bootstrap sample power estimates were normally distributed in all cases.
Each year of the Barley CAP, every participating breeding program contributed 96 lines for genotyping and phenotyping. Phenotypic evaluations were conducted at one or more locations within the breeding program's geographical region. Evaluations were conducted during one year only, i.e. lines contributed in a specific year are evaluated during that year's field season weather permitting. Hence, the phenotypic data of the Barley CAP is highly unbalanced.
Heading date (days after planting when 50% of panicles emerged from flag leaf) data sets of spring barley were obtained from The Hordeum Toolbox (THT; www.hordeumtoolbox.org). If a set of CAP lines (96) was evaluated at more than one location, data was combined across locations and repeatability on an entry-mean basis was calculated as where l is the number of locations. Data sets from individual breeding programs were included if to ensure high data quality. If a breeding program only used one location, data from that program was included if the reported coefficient of variation on THT was less than 10%. After selecting datasets according to the above criteria, the remaining heading date data was from five programs: Aberdeen, ID; University of Minnesota; North Dakota State University (NDSU) 6-row; NDSU 2-row; Washington State University. Data from both 2006 and 2007 (i.e., 192 lines from each program) was used from each program.
Best linear unbiased predictions (BLUPs) of line performance were calculated using mixed model involving trial as a fixed effect and line as a random effect. The trial variable corresponds to the individuals locations used by different breeding programs and was modeled because trials within breeding programs were often unbalanced. Check varieties common between locations within breeding programs and across breeding programs provided the overlap needed so that line BLUPs could be adjusted for environmental effects. Outliers were identified as those with a standardized value greater than three standard deviations and removed from the dataset. The number of lines remaining was 944.
The statistical model used for association mapping of heading date was the same as that used for simulated phenotypes except for the inclusion of breeding program as a fixed effect. This model effectively associated within-breeding program genetic variation with markers, removing any unbalanced environmental effects that could cause false phenotype-marker associations. A statistical threshold corresponding to an experiment-wise type I error rate of 0.05 was established for each marker method by randomly permuting the phenotypes 1000 times.
The authors would like to acknowledge Peter Bradbury for his valuable discussion and programming support. Also, thanks are due to Elliot Heffner, Jesse Poland, and Patrick Hayes. We thank Hiro Iwata for writing the program to order the loci that had identical map positions.
Competing Interests: The authors have declared that no competing interests exist.
Funding: United States Department of Agriculture, National Institute of Food and Agriculture Grant No. 2009-85606-05701, “Barley Coordinated Agricultural Project: Leveraging Genomics, Genetics, and Breeding for Gene Discovery and Barley Improvement.” The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.