|Home | About | Journals | Submit | Contact Us | Français|
Mixed linear model (MLM) methods have proven useful in controlling for population structure and relatedness within genome-wide association studies. However, MLM-based methods can be computationally challenging for large datasets. We report a compression approach, called ‘compressed MLM’, that decreases the effective sample size of such datasets by clustering individuals into groups. We also present a complementary approach, ‘population parameters previously determined’ (P3D), that eliminates the need to re-compute variance components. We applied these two methods both independently and combined in selected genetic association datasets from human, dog and maize. The joint implementation of these two methods markedly reduced computing time and either maintained or improved statistical power. We used simulations to demonstrate the usefulness in controlling for substructure in genetic association datasets for a range of species and genetic architectures. We have made these methods available within an implementation of the software program TASSEL.
Although genome-wide association studies (GWAS) have the potential to pinpoint genetic polymorphisms underlying human diseases and agriculturally important traits, false discoveries are a major concern1 and can be partially attributed to spurious associations caused by population structure and unequal relatedness among individuals in a given cohort. Population stratification was initially addressed using general linear model (GLM)-based methods such as structured association2, genomic control3 and family-based tests of association4. The introduction of MLM approaches has more recently been demonstrated as an improved method to simultaneously account for population structure and unequal relatedness among individuals5.
In the MLM-based methods, population structure2,6 is fit as a fixed effect, whereas kinship among individuals is incorporated as the variance-covariance structure of the random effect for the individuals. Regardless of the applied statistical method, GWAS require large sample sizes to achieve sufficient statistical power7, especially in order to detect the small effect polymorphisms that underlie most complex traits8. For the MLM approach, datasets with these large sample sizes create a heavy computational burden because the computing time for solving a MLM increases with the cube of the number of individuals fit as a random effect. The earliest effort to reduce the size of the random effect in an MLM can be traced back to the sire model approach used in animal breeding9–12, which replaces an individual’s genetic effect with that of its sire. Consequently, the sire-model approach requires pedigrees, which are not always available, and which in particular are often not available in plant studies. Even with available pedigrees, the use of a marker-based kinship is preferred because of its higher accuracy13,14. The computing time is further increased because iteration is needed to estimate population parameters, such as variance components15, for each tested marker. Even though a number of studies have sought to improve the speed of the iteration process, including development of the recent efficient mixed-model association (EMMA) algorithm16, solving an MLM for a large number of individuals and markers remains computationally intensive. To address this issue, a residual approach was proposed based on a two-step strategy17. The first step optimized a reduced MLM with the genetic marker effect excluded. In the second step, the residual from the reduced MLM was fit as the dependent variable to test each marker in a GLM. Because the random genetic effect was not fit in the second step, iteration was not required when testing markers. This residual approach can be performed much faster than the one-step MLM with full optimization for all unknown parameters, but it has a statistical power equivalent to that of the full optimization approach only for traits with low heritability. We propose here methods to reduce the size of the random genetic effect in the absence of pedigree information and eliminate iterations to re-estimate the population parameters for each marker without compromising statistical power. We show that the joint use of these two methods greatly reduces computing time and maintains or even increases statistical power.
The total computing time for a GWAS with a standard MLM is mpn3, where m is the total number of markers, p is the number of iterations to solve the MLM and n is the total number of individuals assessed. Conducting a GWAS with a large sample size becomes computationally intensive because each iteration takes an amount of time that is proportional to the cube of the number of individuals in the random effect15,18. An approach for reducing this computational burden is to reduce the size of the random effect. We achieve this by substituting n individuals with a smaller number of groups, s (s ≤ n), clustered based on the kinship among individuals. Consequently, the kinship between pairs of groups replaces the kinship between pairs of individuals for the random effect of an MLM. If c = n/s is the average number of individuals per group, referred to hereafter as the compression level, this approach will reduce computing time by a factor of c3. We named this approach compression, referring to how the random effect in a MLM is compressed from individuals to groups. An MLM that uses compression is called a compressed MLM.
Because in this method individuals are clustered into groups based on kinship estimates, we consider the compressed MLM to be an extension of the pedigree-based sire model9–12 with notable advancements. The groups used in the compressed MLM can be clustered from kinship calculated from either markers or pedigrees. In addition, the number of groups in the compressed MLM can vary from n to 1, whereas the number of sires is fixed in the traditional method for a particular pedigree. This flexibility in the number of groups allows the accuracy of the group mean and number of groups to be optimized, which is a method similar to choosing the numbers of sires and progeny per sire to maximize genetic improvement in a breeding program19–21. The ability to optimize the number of groups could lead to increased statistical power in GWAS.
Compressed MLM crosses the boundary between GLM and MLM because GLM and MLM can both be considered extreme cases of compressed MLM (Fig. 1). MLM is equivalent to compressed MLM when each individual is treated as a single group (that is, s = n), whereas GLM is equivalent to compressed MLM when all individuals are in one group (s = 1). The latter causes the random effect to have a single level, thereby preventing the separate estimation of the random effect and residual variance components. In addition, the random effect and the overall mean are linearly dependent and thus cannot be estimated separately.
To further reduce computing time, we developed the P3D algorithm, a two-step approach that does not require iteration to estimate population parameters such as genetic variance and residual variance for each marker. The first step in the algorithm is to optimize the reduced MLM with the marker effect excluded. If compression is incorporated in the model, the population parameters also include the clustering algorithm and compression level. Taken from a similar approach that was applied to marker-assisted breeding22, the second step of the algorithm continues to fit the random genetic effect in the MLM with the previously determined population parameters fixed as empirical Bayesian priors23. Subsequently, the non-population parameters, including the marker effect and the random genetic effect, are estimated for each marker.
P3D is similar to the two-step residual approach17, but it also has notable differences. The residual approach fits the residuals from the reduced MLM as the dependent variable in the second step, whereas the original phenotype is fit as the dependent variable in the second step of P3D. In addition, the residual approach does not fit the random genetic effect and uses a GLM when testing markers, whereas P3D fits the random genetic effect with previously determined population parameters fixed in an MLM framework.
To evaluate compression and P3D relative to the standard MLM with full optimization of all unknown parameters for each marker, we conducted a series of association studies between observed or simulated phenotypes and observed markers in human, dog and maize. For the observed phenotypes, we evaluated the fit of compressed MLMs at different compression levels and with different clustering algorithms. Under the assumption that there is no association between the observed phenotypes and the observed genetic markers, we investigated the distribution of false positives by using the compressed MLM. The simulated phenotypes were used to evaluate statistical power by considering the potential true associations between the observed phenotypes and the observed markers. The simulated phenotypes were generated from the observed SNPs by adding the genetic effects. The SNPs with assigned genetic effects are called quantitative trait nucleotides (QTNs). Total number of QTNs, heritability and dominance and epistatic effects were varied to validate the robustness of P3D for phenotypes with different genetic architectures. We used the distribution of the F statistics from the association tests between the simulated phenotypes and the non-QTN markers to determine an empirical threshold5 at a significance level of 5% (P < 0.05). We then calculated the statistical power as the proportion of QTNs with F values greater than the thresholds.
We examined the fit of the compressed MLM on human height with eight hierarchical clustering algorithms24,25: unweighted pair group method with arithmetic average (UPGMA); unweighted pair-group centroid; complete linkage; Lance-Williams flexible-beta method; McQuitty’s similarity analysis (weighted pair-group method using arithmetic averages); weighted pair-group centroid median; single linkage (nearest neighbor); and Ward’s method. The fit of each model varied considerably with the use of different combinations of clustering algorithms and compression levels. For each clustering algorithm, at least one compression level had a better fit with the data than the standard MLM, with the exception of the unweighted and weighted pair-group centroid median algorithms in the human dataset (Supplementary Fig. 1). The variation in model fit among clustering algorithms suggests that additional research is needed to better understand the relationship between clustering algorithms and compression levels; however, this is beyond the scope of our study. Because UPGMA produced models that were generally equivalent to or better than other clustering algorithms, we chose to use that in the rest of the work presented here, including the examination of model fit for different phenotypes within the same population (Supplementary Fig. 2).
Under the assumption that there is no association between the observed phenotypes and the tested markers, the P values from the association tests should follow a uniform [0,1] distribution. This distribution is shown in the quantile-quantile plot in Figure 1. Notably, compressed MLM controlled the false positive rate better than the standard MLM when the compression levels were within the range of 1.5 to 10 (Fig. 2). At these same compression levels, the compressed MLM had a better model fit than the standard MLM when marker effects were excluded (the top panel in Fig. 3).
To deal with the risk that reducing the number of false positives might affect the ability to detect true positives (that is, statistical power), especially in the case of assuming that no association is violated, we examined the performance of the compressed MLM by simulation studies. After QTN effects were added to the observed phenotypes, tests of association between these simulated phenotypes and markers showed that the statistical power (that is, the ability to detect the simulated QTN) and model fit followed the same trend. The compression level that best fit a model without markers also provided the highest power to detect QTN (middle, Fig. 3). Compared to the standard MLM, equivalent power was achieved using compressed MLM with as much as 5- to 10-fold compression. The compression level with the best-fitting model increased statistical power by 34%, 42% and 20% for human, dog and maize for a QTN that explained 0.12, 0.30 and 0.30 units of the phenotypic standard deviation, respectively.
We compared P values obtained from using P3D to P values from using full optimization for testing the association between observed phenotypes and markers in human, dog and maize. The coefficient of determination (r2; Pearson’s correlation coefficient squared) between corresponding P values obtained from the two approaches were all greater than 0.96. Therefore, we concluded that the association tests obtained from the P3D and full optimization methods were approximately the same.
To evaluate the performance of P3D using phenotypes with different genetic architectures, we performed association tests on simulated phenotypes. Different numbers of QTNs with various levels of heritability, dominance and epistatic effects were simulated. Similarly, strong correlations (r2 > 0.97) between the corresponding P values from the P3D and full optimization approaches were observed for both QTN and non-QTN SNPs (top two panels in Fig. 4 and Supplementary Figs. 3 and 4).
For a simulated phenotype with a heritability of 50% and that is controlled by 20 QTNs randomly sampled from the SNPs in the human dataset, we used four compression levels. At each compression level, association tests were performed using both the P3D and full optimization approach. Strong correlations between the corresponding P values from P3D and the full optimization were also observed (r2 > 0.99) for both QTN and non-QTN SNPs across the different compression levels (top two panels in Supplementary Fig. 5).
We used the distribution of the F statistics for the non-QTN SNPs to derive the empirical threshold for evaluating F values at each compression level. We calculated the empirical statistical power as the proportion of QTNs with F values greater than the threshold corresponding to a significance level of 5% (P < 0.05). The empirical statistical power of the P3D and full optimization approaches were approximately the same in all tested scenarios (bottom panels in Fig. 4 and Supplementary Figs. 3–5).
Compression decreases computing time in proportion to the inverse of the cube of the compression level. For instance, a compression level of 2 will reduce the computing time by about 87%. The standard MLM with each individual as a single group has a compression level of 1 and requires the most computing time. The GLM, equivalent to the highest compression level with all individuals assigned to a single group, requires the least computing time. In our analyses, both model fit and statistical power improved as the compression level increased from one. After reaching the optimum compression level, further compression reduced model fit and statistical power, which eventually became the same as the power with the GLM at the maximum compression.
The fit of the reduced model (that is, the model without markers) under different compression levels followed the same trend as the statistical power of the full model for testing markers. Because the reduced model did not include marker effects, the computing time required to find the compression level with the best-fitting model was independent of the number of markers. For these reasons, the P3D model used an efficient strategy that determined the optimal clustering algorithm and compression level only once.
Similar to the residual approach, P3D eliminates the need to estimate population parameters separately for every marker. The advantage of P3D is that it does not lower statistical power regardless of the genetic architecture of the phenotypes. The P3D method works well for different numbers of QTNs and with various levels of heritability, dominance or epistatic effects.
Compressed MLM and P3D can be applied either separately or jointly and can also be used in combination with other approaches, such as the EMMA algorithm, to speed up the iteration process in the first step of P3D. The compressed MLM improves both computing speed and statistical power, whereas P3D improves computing speed without sacrificing statistical power. In addition, compressed MLMs can be applied at various compression levels. For an analysis in which statistical power is the top priority, the compression level with the best model-fit should be chosen; otherwise, a higher compression level may be chosen to reduce computing time. It should be noted that no trend has been identified to determine the compression level with the best model-fit across different datasets. The compression level that generated the best model-fit varied among phenotypes in the same population when the same kinship was used (Supplementary Fig. 2). Thus, for each new study, the compression level needed to be optimized using the reduced MLM.
The theoretical computing time reduction is faster by a factor of pc3 for the joint use of compressed MLM and P3D, where p is the number of iterations and c is the compression level. When using proc mixed and proc cluster in SAS26 on the three datasets, we showed that the computing time for the human dataset (largest sample size) decreased 19-fold with compressed MLM alone and 877-fold with compression with P3D at the compression level with the greatest statistical power (Fig. 3, bottom). Choosing a compression level that had power equivalent to that of the standard MLM reduced the computing time even more: computing time was 103-fold faster with compression alone and 7,582-fold faster with compression with P3D, respectively. For the human dataset with 1,315 individuals, the standard MLM (no compression, no P3D) took 821 s to screen one marker. (Fig. 3) At this speed, it would take 9,502 d (26 years) to analyze a GWAS with 1 million markers. The current methods (compression with P3D) took 0.34 s to screen a marker at the compression level of 3.8, which showed the highest statistical power, and at this speed, it would take only 2.7 d to screen one million markers. The increased speed is even more important for larger datasets (for example, one containing 5,000 individuals). This suggests that current GWAS datasets on several thousand of individuals at 500,000–1,000,000 markers could be analyzed by our methods within several days. We have made these methods available within an implementation of the software program TASSEL27.
Methods and any associated references are available in the online version of the paper at http://www.nature.com/naturegenetics/.
This study was supported by the US National Science Foundation (NSF)–Plant Genome Program (DBI-0321467, 0703908 and 0820619), NSF–Plant Genome Comparative Sequencing Program (DBI-06638566), US National Institutes of Health (1R21AR055228-01A1), National Heart, Lung, and Blood Institute (U 01 HL72524, HL54776 and 5U01HL072524-06), US Department of Agriculture Research Service (53-K06–5-10 and 58–1950-9–001), USDA–Cooperative State Research, Education and Extension Service National Research Initiative (2006-35300-17155), Morris Animal Foundation (D04CA-135), WALTHAM Centre for Pet Nutrition, Cornell Advanced Technology in Biotechnology and the Collaborative Research Program in the Cornell Veterinary College. The authors would like to thank K. Zhao for providing the source code to compute kinship and L. Rigamer Lirette, A.L. Ingham and S. Myles for editing of the manuscript.
Note: Supplementary information is available on the Nature Genetics website.
AUTHOR CONTRIBUTIONS: Z.Z. conceptualized the study, performed the data analyses and wrote the manuscript. E.E., M.A.G. and J.Y. participated in the data analyses and wrote the manuscript. P.J.B. implemented the two new methods in the TASSEL software package. C.L., H.K.T., D.K.A. and J.M.O. provided the human data and supervised its analyses. R.J.T. provided the dog data and supervised its analyses. E.S.B designed and supervised the project. All authors edited the manuscript.
COMPETING FINANCIAL INTERESTS: The authors declare no competing financial interests.
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.