Genetics. 2016 May; 203(1): 573–581.

Published online 2016 March 4. doi: 10.1534/genetics.116.187013

PMCID: PMC4858800

Ivan Pocrnic,^{*,}^{1} Daniela A. L. Lourenco,^{*} Yutaka Masuda,^{*} Andres Legarra,^{†} and Ignacy Misztal^{*}

Received 2016 January 19; Accepted 2016 February 29.

Copyright © 2016 by the Genetics Society of America

Available freely online through the author-supported open access option.


The genomic relationship matrix (GRM) can be inverted by the algorithm for proven and young (APY) based on recursion on a random subset of animals. While a regular inverse has a cubic cost, the cost of the APY inverse can be close to linear. Theory for the APY assumes that the optimal size of the subset (maximizing accuracy of genomic predictions) is due to a limited dimensionality of the GRM, which is a function of the effective population size (*N*_{e}). The objective of this study was to evaluate these assumptions by simulation. Six populations were simulated with approximate effective population size (*N*_{e}) from 20 to 200. Each population consisted of 10 nonoverlapping generations, with 25,000 animals per generation and phenotypes available for generations 1–9. The last 3 generations were fully genotyped assuming genome length *L* = 30. The GRM was constructed for each population and analyzed for distribution of eigenvalues. Genomic estimated breeding values (GEBV) were computed by single-step GBLUP, using either a direct or an APY inverse of GRM. The sizes of the subset in APY were set to the number of the largest eigenvalues explaining *x*% of variation (EIG*x*, *x* = 90, 95, 98, 99) in GRM. Accuracies of GEBV for the last generation with the APY inverse peaked at EIG98 and were slightly lower with EIG95, EIG99, or the direct inverse. Most information in the GRM is contained in ~*N*_{e}*L* largest eigenvalues, with no information beyond 4*N*_{e}*L*. Genomic predictions with the APY inverse of the GRM are more accurate than by the regular inverse.

When SNP information is available, genomic predictions most commonly use SNP-BLUP (and derivatives) or genomic BLUP (GBLUP) models (Meuwissen *et al.* 2001; VanRaden 2008). In the first model SNP effects are fitted directly, and the second model uses SNPs indirectly via a genomic relationship matrix. While both models are equivalent theoretically, analyses with complex models (multiple traits, several genetic effects, genotype-by-environment interactions) are simpler with GBLUP. For populations where only a small fraction of phenotyped animals are genotyped, there is a modification of GBLUP called single-step GBLUP (ssGBLUP) based on combining genomic and pedigree relationships (Aguilar *et al.* 2010; Christensen and Lund 2010). The ssGBLUP is becoming popular for commercial genetic evaluations because of simplicity of use, as existing models can be reused, and high accuracy due to joint modeling of phenotypes, pedigrees, and genotypes (Legarra *et al.* 2014).

GBLUP-based methods require an inverse of the genomic relationship matrix (GRM). A direct inverse has a cubic cost and can be computed efficiently for perhaps up to 150,000 individuals. Due to the popularity of commercial genotyping, some populations have >1 million genotyped animals (*e.g.*, U.S. Holstein cattle), and computing an inverse would be prohibitively expensive. Additionally, the GRM is not positive definite for larger dimensions and additional steps (*e.g.*, blending with a pedigree-based relationship matrix) are required to make the GRM positive definite (VanRaden 2008).

Misztal *et al.* (2014) postulated that the inverse can be computed efficiently using recursion on a small subset of animals (initially labeled as high-accuracy or “proven” in earlier studies) and named the method the algorithm for proven and young (APY). In this article, we refer to the inverse calculated with this algorithm as the APY inverse and to animals in the subset as core animals. While computing costs of APY are cubic for the subset, they are only linear for animals outside the subset. Fragomeni *et al.* (2015) analyzed Holstein data with 100,000 genotyped animals and found that any subset of animals containing at least 10,000 animals resulted in an accurate inverse. The optimal subset size was estimated as slightly >8000 for Angus cattle (Lourenco *et al.* 2015b). The APY inverse was successfully computed for ~570,000 genotyped Holsteins in <2 hr of computing time on an average server (Masuda *et al.* 2016). More than 10,000 animals in the recursion did not improve genetic predictions. A regular inverse for 570,000 individuals would require weeks of computing and require memory available only in the largest computing clusters.

Misztal (2016) proposed a theory for the APY inverse. Assume that the additive information in a population is contained in a limited number (say, *n*) of independent chromosome segments (*M*_{e}) or effective SNP markers (ESM). If *M*_{e} or ESM completely explain the additive variation, breeding values of *n* animals are linear functions of *M*_{e} or ESM and contain nearly all the information included in *M*_{e} or ESM. Treating any subset of *n* animals as core animals, a recursion on any *n* animals is sufficient, because there is a high redundancy in genomic information. Whereas the number of *M*_{e} is a function of effective population size, the number of ESM could be computed as the number of eigenvalues explaining nearly all the variation in **G**. Assuming that *M*_{e} and ESM describe the same concept, the optimal subset size must be a function of effective population size (*N*_{e}) and can be derived from eigenvalue analysis of the GRM.

The purpose of this article was to test the theory of the APY with simulated data. In particular, we wanted to find whether (1) the optimal size of the recursion is related to effective population size, (2) the optimal size can be derived from eigenvalue analysis of the GRM, and (3) genetic predictions obtained with APY **G**^{−1} are superior to those with a regular inverse.

Data for this study were simulated using the QMSim software (Sargolzaei and Schenkel 2009). The historical population consisted of 1000 generations with a gradual increase in size from 1000 to 100,000 breeding individuals, with equal sex ratio, nonoverlapping generations, random mating, no selection, and no migration to create initial linkage disequilibrium (LD) and establish mutation–drift balance in the population. Six populations with different effective population size were created by selecting different numbers of breeding animals from the last generation of the historical population. Whereas the number of breeding females per generation was kept constant at 12,500, the number of males varied from 5 to 50 (5, 10, 20, 30, 40, and 50), aiming for approximate effective population sizes from 20 to 200 (data sets P20, P40, P80, P120, P160, and P200). In each generation randomly selected male offspring were used as sires for the next generation, while all the females were used as dams for the next generation. Ten recent generations were simulated for each population by random mating and with litter size of 2. All 75,000 individuals in generations 8–10 had genotypic information available. The simulated genome was assumed to have 30 chromosomes of equal length of 100 cM each, with 49,980 evenly allocated biallelic SNP markers and equal allele frequencies in the first generation of the historical population. A total of 4980 biallelic and randomly distributed QTL affected the trait, with allelic effects sampled from a gamma distribution with a shape parameter of 0.4. The recurrent mutation rate of the markers and QTL was assumed to be 2.5 × 10^{−5} per locus per generation (Solberg *et al.* 2008). Phenotypes were simulated with an overall mean as the only fixed effect and assuming heritability of 0.3. All animals in the recent generations had phenotypes available, except for animals in the last generation. The simulation was replicated five times.

The raw genomic relationship matrix was constructed as in VanRaden (2008),

$${G}_{0}=\frac{Z{Z}^{\prime}}{2\mathrm{\Sigma}{p}_{j}\left(1-{p}_{j}\right)},$$

where **Z** is the centered matrix of gene content adjusted for gene frequencies, and *p*_{j} is the gene frequency for SNP *j*. The raw matrix was then blended with pedigree relationships as

$$G=0.95{G}_{0}+0.05{A}_{22}$$

(VanRaden 2008), where **A**_{22} is a pedigree-based numerator relationship matrix for genotyped animals. In preliminary tests, the blending had very little impact on realized accuracies (Aguilar *et al.* 2010).
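The construction of the raw and blended matrices can be sketched as follows. This is a minimal illustration and not the software used in the study: it assumes that gene content is coded 0/1/2 and that the gene frequencies are the observed ones (the text does not say which frequencies were used). The function names `vanraden_g0` and `blend` and the toy data are hypothetical, and an identity matrix stands in for **A**_{22}.

```python
import numpy as np

def vanraden_g0(genotypes):
    """Raw GRM: G0 = ZZ' / (2 * sum_j p_j (1 - p_j)) for 0/1/2 gene content."""
    genotypes = np.asarray(genotypes, dtype=float)
    p = genotypes.mean(axis=0) / 2.0      # observed gene frequency per SNP (assumption)
    Z = genotypes - 2.0 * p               # gene content centered by 2 * p_j
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def blend(G0, A22, weight=0.95):
    """Blended GRM: G = 0.95 * G0 + 0.05 * A22."""
    return weight * G0 + (1.0 - weight) * A22

rng = np.random.default_rng(0)
M = rng.integers(0, 3, size=(6, 40))      # toy data: 6 animals, 40 SNPs
G0 = vanraden_g0(M)
G = blend(G0, np.eye(6))                  # identity as a stand-in for A22
```

With observed frequencies the columns of **Z** sum to zero, so **G**_{0} is singular by construction; the blended matrix is positive definite in this toy case because the stand-in **A**_{22} is.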

The APY for inversion of **G** is based on a recursion on a subset of animals (Misztal *et al.* 2014; Misztal 2016). Split animals arbitrarily into core (*c*) and noncore (*n*) such that the number of core animals is close to the dimensionality of **G** or the number of effective SNPs. Also denote

$$G=\left[\begin{array}{cc}{G}_{cc}& {G}_{cn}\\ {G}_{nc}& {G}_{nn}\end{array}\right].$$

Assume that the breeding values (BV) **u** for noncore animals are linear functions of those for core animals,

$${u}_{n}^{}={P}_{nc}{u}_{c}^{}+{\Phi}_{n},$$

where **P**_{nc} is a matrix relating BV of noncore to core animals and **Φ**_{n} is the vector of residual terms of this recursion. Then

$$\left[\begin{array}{c}{\mathbf{u}}_{c}\\ {\mathbf{u}}_{n}\end{array}\right]=\left[\begin{array}{ll}\mathbf{I}\hfill & \mathbf{0}\hfill \\ {\mathbf{P}}_{nc}\hfill & \mathbf{I}\hfill \end{array}\right]\left[\begin{array}{l}{\mathbf{u}}_{c}\\ {\mathbf{\Phi}}_{n}\end{array}\right]$$

and

$$\mathrm{var}\left(u\right)={G}_{\mathrm{APY}}^{}=\left[\begin{array}{cc}I& 0\\ {P}_{nc}& I\end{array}\right]\left[\begin{array}{cc}{G}_{cc}^{}& 0\\ 0& {M}_{nn}^{}\end{array}\right]\left[\begin{array}{cc}I& {P}_{cn}\\ 0& I\end{array}\right],$$

where ${M}_{nn}^{}=\mathrm{var}\left({\Phi}_{n}\right).$ Subsequently,

$${G}_{\mathrm{APY}}^{-1}=\left[\begin{array}{cc}I& -{P}_{cn}\\ 0& I\end{array}\right]\left[\begin{array}{cc}{G}_{cc}^{-1}& 0\\ 0& {M}_{nn}^{-1}\end{array}\right]\left[\begin{array}{cc}I& 0\\ -{P}_{nc}& I\end{array}\right].$$

Using conditional distributions, ${P}_{nc}={G}_{nc}{G}_{cc}^{-1},$
${M}_{nn}=\mathrm{diag}\left\{{m}_{nn,i}\right\}=\mathrm{diag}\left\{{g}_{ii}-{g}_{ic}{G}_{cc}^{-1}{g}_{ci}\right\},$ and the final formula is as originally defined in Misztal *et al.* (2014):

$${\mathbf{G}}_{\mathrm{APY}}^{-1}=\left[\begin{array}{cc}{\mathbf{G}}_{cc}^{-1}& 0\\ 0& 0\end{array}\right]+\left[\begin{array}{c}-{\mathbf{G}}_{cc}^{-1}{\mathbf{G}}_{cn}\\ \mathbf{I}\end{array}\right]{\mathbf{M}}_{nn}^{-1}\left[-{\mathbf{G}}_{nc}{\mathbf{G}}_{cc}^{-1}\hspace{1em}\mathbf{I}\right].$$

In this algorithm, the direct inversion is only for **G**_{cc}, and computing the inverse of **M**_{nn} is trivial because **M**_{nn} is diagonal.
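The final APY formula can be written out in a few lines of dense linear algebra. The sketch below is for illustration only: the function name `apy_inverse` and the toy matrix are hypothetical, and a production implementation for large data would avoid forming dense blocks in this way.

```python
import numpy as np

def apy_inverse(G, core):
    """APY inverse of G for a given set of core animals (dense toy version)."""
    n = G.shape[0]
    core = np.asarray(core)
    noncore = np.setdiff1d(np.arange(n), core)

    Gcc_inv = np.linalg.inv(G[np.ix_(core, core)])   # the only direct inversion
    Gcn = G[np.ix_(core, noncore)]
    P_cn = Gcc_inv @ Gcn                             # transpose of P_nc = G_nc Gcc^-1

    # Diagonal of M_nn: m_i = g_ii - g_ic Gcc^-1 g_ci for each noncore animal i
    m = G[noncore, noncore] - np.einsum("ij,ij->j", Gcn, P_cn)

    Ginv = np.zeros_like(G, dtype=float)
    Ginv[np.ix_(core, core)] = Gcc_inv + (P_cn / m) @ P_cn.T
    Ginv[np.ix_(core, noncore)] = -P_cn / m
    Ginv[np.ix_(noncore, core)] = (-P_cn / m).T
    Ginv[noncore, noncore] = 1.0 / m                 # M_nn^-1 is diagonal
    return Ginv

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 20))
G = X @ X.T / 20 + 0.05 * np.eye(8)                  # toy positive definite matrix
core = np.array([0, 2, 5])
Ginv = apy_inverse(G, core)
```

When every animal is treated as core, the noncore blocks are empty and the function returns the regular inverse of **G**.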

Phenotypes were analyzed using the ssGBLUP model

$$y=1\mu +Su+e,$$

in which **y** is the observation vector for the first nine of the recent generations, *µ* is an overall mean, **u** is the vector of additive animal effects, **S** is the incidence matrix relating observations in **y** to additive genetic effects in **u**, and **e** is the vector of random residuals. We assumed that the variances were

$$\mathrm{var}\left(u\right)=0.3H\hspace{0.5em}\text{and}\hspace{0.5em}\mathrm{var}\left(e\right)=0.7I,$$

where **H** is a matrix combining pedigree and genomic relationships, with its inverse as in Aguilar *et al.* (2010); *i.e.*,

$${H}^{-1}={A}^{-1}+\left[\begin{array}{cc}0& 0\\ 0& {G}_{}^{-1}-{A}_{22}^{-1}\end{array}\right],$$

where **A**^{−1} is the inverse of a numerator relationship matrix for all animals included in the analysis. The partition in blocks refers to animals with/without genotypes.
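As an illustration of how **H**^{−1} enters the analysis, the sketch below assembles **H**^{−1} for a three-animal toy pedigree and solves the mixed model equations for the model above with variance ratio 0.7/0.3. The function names, the pedigree, the genomic relationships, and the phenotypes are all made up for the example; the study itself used dedicated software on far larger data.

```python
import numpy as np

def h_inverse(A_inv, G_inv, A22_inv, genotyped):
    """H^-1 = A^-1 with (G^-1 - A22^-1) added to the genotyped block."""
    H_inv = np.array(A_inv, dtype=float, copy=True)
    H_inv[np.ix_(genotyped, genotyped)] += G_inv - A22_inv
    return H_inv

def solve_ssgblup(y, S, H_inv, h2=0.3):
    """Solve the mixed model equations for y = 1*mu + S u + e."""
    lam = (1.0 - h2) / h2                        # var(e) / var(u) scale = 0.7 / 0.3
    X = np.hstack([np.ones((S.shape[0], 1)), S]) # columns: overall mean, animals
    lhs = X.T @ X
    lhs[1:, 1:] += lam * H_inv                   # add lambda * H^-1 to the animal block
    sol = np.linalg.solve(lhs, X.T @ y)
    return sol[0], sol[1:]

# Toy pedigree: animal 2 is the offspring of unrelated animals 0 and 1.
A = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
genotyped = [1, 2]                               # animals 1 and 2 have genotypes
A22 = A[np.ix_(genotyped, genotyped)]
G = np.array([[1.02, 0.45],
              [0.45, 0.98]])                     # made-up genomic relationships

H_inv = h_inverse(np.linalg.inv(A), np.linalg.inv(G),
                  np.linalg.inv(A22), genotyped)

S = np.array([[1.0, 0.0, 0.0],                   # records on animals 0 and 1 only
              [0.0, 1.0, 0.0]])
y = np.array([10.0, 12.0])
mu, u = solve_ssgblup(y, S, H_inv)
```

Animal 2 has no record here, mirroring the validation animals of the last generation, and receives its solution through the relationships in **H**.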

Effective population size was calculated using two formulas. Theoretical effective population size $\left({N}_{{\mathrm{e}}_{\mathrm{T}}}\right)$ was calculated using the formula

$${N}_{{\mathrm{e}}_{\mathrm{T}}}=\frac{4{N}_{\mathrm{m}}{N}_{\mathrm{f}}}{{N}_{\mathrm{m}}+{N}_{\mathrm{f}}}$$

(Wright 1931), where *N*_{m} and *N*_{f} are the numbers of breeding males and females per generation, respectively. Inbreeding (or realized) effective population size $\left({N}_{{\mathrm{e}}_{\mathrm{F}}}\right)$ was calculated from the realized increase in inbreeding by generation, using the formula by Falconer and Mackay (1996),

$${N}_{{\mathrm{e}}_{\mathrm{F}}}=\frac{1}{2\mathrm{\Delta}F},$$

where

$$\mathrm{\Delta}F=\frac{{F}_{n}-{F}_{n-1}}{1-{F}_{n-1}}$$

and *F*_{n} is the average inbreeding in the *n*th generation.
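Both formulas are simple enough to check numerically. The sketch below evaluates the theoretical effective population size for the six numbers of breeding males used in the simulation (with 12,500 breeding females) and shows the inbreeding-based formula on a made-up pair of inbreeding coefficients. The function names are hypothetical.

```python
def ne_theoretical(n_males, n_females):
    """Theoretical effective population size: 4 Nm Nf / (Nm + Nf)."""
    return 4.0 * n_males * n_females / (n_males + n_females)

def ne_from_inbreeding(f_prev, f_curr):
    """Realized effective population size: 1 / (2 * delta F)."""
    delta_f = (f_curr - f_prev) / (1.0 - f_prev)
    return 1.0 / (2.0 * delta_f)

for males in (5, 10, 20, 30, 40, 50):
    print(males, round(ne_theoretical(males, 12500), 1))

# Hypothetical average inbreeding in two consecutive generations
print(ne_from_inbreeding(0.0, 0.025))
```

Because the number of females is so much larger than the number of males, the theoretical value is very close to four times the number of breeding males.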

Genomic estimated breeding values (GEBV) were calculated using either an explicit inverse of **G** or the APY inverse. Core animals in the APY were selected randomly and their number corresponded to the number of the largest eigenvalues in **G**_{0} that explained 90% (EIG90), 95% (EIG95), 98% (EIG98), and 99% (EIG99) of the retained variance. Validation accuracies were computed only for animals in the 10th generation (without phenotypes) and defined as correlations between simulated breeding values and GEBV computed with either the regular inversion (GEBV_{REG}) or the APY (GEBV_{APY}) and a different number of core animals. All computations were applied to each of the six data sets.
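The way the number of core animals follows from the eigenvalues can be sketched as below. This is a toy illustration under assumptions: the function name `n_eigenvalues_explaining` is hypothetical, the small tolerance guards against floating-point rounding at the threshold, and a 4 × 4 diagonal matrix stands in for **G**_{0}.

```python
import numpy as np

def n_eigenvalues_explaining(G0, fraction):
    """Number of largest eigenvalues of G0 whose sum reaches `fraction` of the total."""
    eig = np.linalg.eigvalsh(G0)[::-1]           # eigenvalues, largest first
    eig = np.clip(eig, 0.0, None)                # drop tiny negative rounding errors
    cumulative = np.cumsum(eig) / eig.sum()
    return int(np.searchsorted(cumulative, fraction - 1e-12) + 1)

rng = np.random.default_rng(3)
G0 = np.diag([4.0, 3.0, 2.0, 1.0])               # toy stand-in for the raw GRM
n_core = n_eigenvalues_explaining(G0, 0.90)      # analogue of EIG90

# Core animals are then drawn at random, without replacement
core = rng.choice(G0.shape[0], size=n_core, replace=False)
```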

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Figure 1 shows ${N}_{{\mathrm{e}}_{\mathrm{T}}}$ and ${N}_{{\mathrm{e}}_{\mathrm{F}}}$ with the different number of breeding males per generation. Both ${N}_{{\mathrm{e}}_{\mathrm{T}}}$ and ${N}_{{\mathrm{e}}_{\mathrm{F}}}$ were very similar and increased with the number of breeding males, from 20 and 19.3 to 199.2 and 188.1 for P20 and P200, respectively. If we take into account family size and variation in family size (Laporte and Charlesworth 2002), the *N*_{e} values would be in the same range (20–200) as in the simulation. Thus, the simulation scheme was effective in creating populations with close to the desired *N*_{e}. For simplicity, all graphs and discussions use rounded ${N}_{{\mathrm{e}}_{\mathrm{T}}}.$

Figure 1. Theoretical and realized effective population size (*N*_{e}) as a function of breeding males per generation when the number of breeding females was 12,500 per generation.

The number of eigenvalues of **G**_{0} that accounted for 90%, 95%, 98%, and 99% of the original variation is shown in Figure 2 and *Appendix* Table A1. Accounting for 90% of the original variation (EIG90) required 814 ± 14 eigenvalues in population P20 and 5512 ± 19 eigenvalues in population P200. Accounting for 99% of the original variation (EIG99) required 6523 ± 68 eigenvalues in population P20 and 20,786 ± 29 eigenvalues in population P200. Thus, increasing *N*_{e} ~10 times increased the number of selected eigenvalues by a factor of 6.8 for EIG90 and by a factor of 3.2 for EIG99. While the number of eigenvalues increased with *N*_{e}, the increase was less than proportional, especially for higher *N*_{e}. Graphically (Figure 2), the increases in the number of eigenvalues corresponding to 90% and 95% were close to linear, but less so for 98% and especially for 99% past *N*_{e} = 120. The total number of positive eigenvalues in **G** is bounded by the number of SNPs (49,980 in this study) and the number of genotyped individuals (75,000 in this study). Consequently, steeper declines from a linear trend when the number of eigenvalues is very high could be due to the limited number of SNPs and individuals used in the simulation.

Figure 2. Number of largest eigenvalues that explain 90%, 95%, 98%, and 99% of variation in the genomic relationship matrix for populations with different effective population sizes (*N*_{e}). Solid lines show *N*_{e}*L*, 2*N*_{e}*L*, and 4*N*_{e}*L*, where *L* = 30 M.

The number of eigenvalues can also be expressed in terms of *N*_{e} and genome length *L* (*L* = 30). The value of EIG90 varies from ~40*N*_{e} (P20) to 27*N*_{e} (P200), the value of EIG95 varies from 80*N*_{e} (P20) to 46*N*_{e} (P200), and the value of EIG98 varies from ~185*N*_{e} (P20) to 75*N*_{e} (P200). Assuming that the increase in the number of eigenvalues is indeed linear with *N*_{e} but affected by limited number of SNPs and genotyped individuals, the approximate values would be EIG90 ≈ *N*_{e}*L*, EIG95 ≈ 2*N*_{e}*L*, and EIG98 ≈ 4*N*_{e}*L*.
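The multiples of *N*_{e} quoted above follow directly from the mean eigenvalue counts in *Appendix* Table A1. The short sketch below reproduces the arithmetic for P20 and P200 and prints the rounded targets *N*_{e}*L*, 2*N*_{e}*L*, and 4*N*_{e}*L* for *L* = 30; the variable and function names are chosen for the example only.

```python
# Mean eigenvalue counts copied from Appendix Table A1 (P20 and P200 rows only)
table_a1 = {
    20: {"EIG90": 814, "EIG95": 1611, "EIG98": 3701, "EIG99": 6523},
    200: {"EIG90": 5512, "EIG95": 9245, "EIG98": 15483, "EIG99": 20786},
}
GENOME_LENGTH = 30  # genome length L in morgans

def per_ne(ne, label):
    """Eigenvalue count expressed as a multiple of N_e."""
    return table_a1[ne][label] / ne

for ne in table_a1:
    for label in ("EIG90", "EIG95", "EIG98"):
        print(f"N_e = {ne:3d}  {label}: {per_ne(ne, label):6.1f} N_e")

# Simplified approximations proposed in the text
for k, label in ((1, "EIG90"), (2, "EIG95"), (4, "EIG98")):
    print(f"{label} ~ {k} N_e L = {k * GENOME_LENGTH} N_e")
```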

Figure 3 shows the correlation between GEBV_{REG} and GEBV_{APY} for validation animals with variable numbers of core animals (from EIG90 to EIG99). Populations with greater *N*_{e} required a larger number of core animals to reach equivalent correlations. For all populations, the correlations were >0.99 with the number of core animals equal to EIG99 and >0.98 with the number of core animals equal to EIG98. Figure 4 shows the results as in Figure 3 but with the percentage of explained variance on the abscissa. The curves are linear and nearly identical. This means that the correlations between GEBV_{REG} and GEBV_{APY} are nearly a linear function of the percentage of the explained variance, regardless of *N*_{e}. The correlations are slightly higher than the percentage of explained variance, probably because GEBV contain not only contributions due to genomics but also some due to parent average (VanRaden 2008; Lourenco *et al.* 2015a).

Figure 3. Correlations between genomic estimated breeding values obtained with the direct inverse (ssGBLUP) and the inverse with the algorithm for proven and young (APY) of the genomic relationship matrix for six simulated populations as a function of the number **...**

Figure 4. Correlations between genomic estimated breeding values obtained with the direct inverse (ssGBLUP) and the inverse with the algorithm for proven and young (APY) of the genomic relationship matrix for six simulated populations where the number of core animals **...**

Figure 5 shows true accuracies (defined as the correlation between simulated breeding value and GEBV) across the six simulated populations as a function of the number of eigenvalues explaining the given amounts of variance. All SDs of accuracies across replicates were ≤0.01. The accuracy is inversely related to *N*_{e} as it was highest for population P20 (0.89 ± 0.01) and lowest for P200 (0.77 ± 0.01). In simulated populations, Muir (2007) and Goddard (2009) showed that accuracy of GEBV decreases as *N*_{e} increases. Smaller *N*_{e} means fewer *M*_{e} or ESM to estimate and subsequently smaller prediction error variance of these effects. The accuracies were only ~0.03 below the peak level with the number of core animals corresponding to EIG90; the accuracies increase by ~0.02 at EIG95, peaking at EIG98; and they are slightly lower at EIG99. The accuracies with the regular inverse (noted as 100% in the graph) were slightly lower than with EIG99. The results indicate that the majority of the information for GEBV is provided by EIG90 largest eigenvalues. The accuracy provided by eigenvalues present beyond EIG95 in EIG98 was small but required almost doubling the number of core animals. Eigenvalues corresponding to the last 2% variation do not provide any information and in fact slightly reduce the accuracies, which shows that the genomic information may be redundant and in fact overfitted the data. Subsequently we can conclude that the dimensionality of the genomic information (defined as the number of the larger, informative eigenvalues) in this study does not exceed EIG98. For genomic prediction, using the number of core animals corresponding to EIG98 is sufficient, with reduction to EIG95 when computing is expensive.

Figure 5. Accuracies of genomic estimated breeding values across six simulated populations where the number of core animals is defined as the number of eigenvalues that explain 90%, 95%, 98%, and 99% of original variability; values for 100% correspond to the regular **...**

The theory for the APY was developed either based on the dimensionality of **G** as computed from eigenvalues or based on the independent chromosome segments (*M*_{e}) (Misztal 2016). Both concepts may be closely related. In particular, the number of *M*_{e} is similar to the number of core animals beyond which the accuracy of GEBV does not increase. In this study, such a number corresponded to EIG98. Stam (1980) derived a probability density function for a size of an independent chromosome segment, which leads to the expected number of segments *M*_{e} = 4*N*_{e}*L*, where *L* is the size of the genome in morgans. With *L* = 30 in this study, *M*_{e} = 120*N*_{e}, which is close to an estimate of 140*N*_{e} for EIG98. The approximate values in this study, EIG90 ≈ 35*N*_{e}, EIG95 ≈ 70*N*_{e}, and EIG98 ≈ 140*N*_{e}, could be simplified to EIG90 ≈ *N*_{e}*L*, EIG95 ≈ 2*N*_{e}*L*, and EIG98 ≈ 4*N*_{e}*L*, respectively. As the segments are of variable size, Goddard (2009) argued that a more relevant formula is 2*N*_{e}*L*/log(4*N*_{e}*L*), which is equivalent to *M*_{e} = 8*N*_{e} (*N*_{e} = 20) to 6*N*_{e} (*N*_{e} = 200). Both numbers are well below EIG90. Several formulas for *M*_{e} were compared in a meta-analysis by Brard and Ricard (2015), and none were found satisfactory. Such conclusions could be due to several factors. First, their study looked at realized accuracies, and these are strongly affected by selection (Bijma 2012; Lourenco *et al.* 2015a). Second, the implicit assumption was of segments of equal size, while segment sizes were variable (Stam 1980). Third, Brard and Ricard (2015) pointed out that part of the difficulty resided in getting good estimates of *N*_{e}, a parameter that is not always well defined and that changes over time. 
We can posit that in genomic selection we can estimate the effects of the largest chromosome segments well and those of smaller segments not as well but they are still useful for prediction, and the remaining smallest segments provide insufficient accuracy for prediction. Compared to methods reviewed by Brard and Ricard (2015), the possible definition of *M*_{e} by EIG98 does not depend on realized accuracies or trait definition, but does require genotype collection.
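The two expressions for *M*_{e} discussed above can be compared numerically. The sketch assumes that "log" in the second formula is the natural logarithm, which is the reading that reproduces the quoted values of about 8*N*_{e} and 6*N*_{e}; the function names are hypothetical.

```python
import math

def me_stam(ne, genome_length):
    """Expected number of segments, M_e = 4 N_e L."""
    return 4.0 * ne * genome_length

def me_goddard(ne, genome_length):
    """M_e = 2 N_e L / log(4 N_e L), with log taken as natural log (assumption)."""
    return 2.0 * ne * genome_length / math.log(4.0 * ne * genome_length)

L = 30  # genome length in morgans, as in this study
for ne in (20, 200):
    print(ne, me_stam(ne, L) / ne, round(me_goddard(ne, L) / ne, 1))
```

Expressed per unit of *N*_{e}, the first formula gives 120 regardless of *N*_{e}, whereas the second decreases slowly as *N*_{e} grows.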

This study focused on dimensionality of the GRM. In fact, the eigenvalue distribution of a SNP-BLUP matrix (**Z**′**Z**, where **Z** is gene content) is the same as that of the unblended GRM up to a scaling constant, because the nonzero eigenvalues of both **Z**′**Z** and **ZZ**′ are the squared singular values of **Z**. Therefore, the dimensionality of the GRM can be defined as the dimensionality of the SNP genomic information in general. In this study, the eigenvalues were computed from the explicitly constructed GRM. However, for large data sets, it is possible to compute them from the singular value decomposition of matrix **Z**, with a cost quadratic in the number of markers and only linear in the number of individuals (*e.g.*, by subroutine DGESVD in LAPACK).
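The equivalence between the eigenvalues of **ZZ**′, those of **Z**′**Z**, and the singular values of **Z** can be verified on a small random matrix. The sketch uses NumPy's LAPACK-backed `numpy.linalg.svd` rather than calling DGESVD directly, and the matrix dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(5, 12))          # toy matrix: 5 individuals, 12 markers
Z -= Z.mean(axis=0)                   # center each marker column

# Singular values of Z, largest first
s = np.linalg.svd(Z, compute_uv=False)

# Eigenvalues of the individual-by-individual and marker-by-marker matrices
eig_animals = np.linalg.eigvalsh(Z @ Z.T)[::-1]
eig_snps = np.linalg.eigvalsh(Z.T @ Z)[::-1][:5]   # the remaining 7 are zero

print(np.allclose(s**2, eig_animals), np.allclose(s**2, eig_snps))
```

Only the singular values are needed for the eigenvalue analysis, so the singular vectors are not requested (`compute_uv=False`).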

Some results of this study could be influenced by simulation parameters. In particular, a larger number of genotyped animals and the number of SNP markers could have increased the dimensionality especially for higher *N*_{e}. Genotypes by simulation are perfect while in real data they are affected by quality control and possibly imputation. In addition, the simulated population was not selected and the number of genotyped generations was small. Further studies will show applicability of the results of this article to real populations undergoing selection.

Although the largest population size simulated in this study had *N*_{e} = 200, the dimensionality of the GRM can be extrapolated for populations with a larger *N*_{e}. In general, the dimensionality of the genomic information (**G** or **Z′Z**) is ≤ min(*M*_{e}, *N*_{snp}, *N*_{ind}), where *N*_{snp} is the number of SNPs and *N*_{ind} is the number of genotyped individuals. In this study, *N*_{snp} and *N*_{ind} were several times larger than *M*_{e}, although the dimensionality of the GRM was depressed by limited *N*_{snp} and *N*_{ind}, especially for a large *N*_{e}. It appears that the dimensionality of the GRM is close to *M*_{e} when *N*_{snp} and *N*_{ind} are a few times larger than *M*_{e}. In fact, MacLeod *et al.* (2005) found that detection of 90% of junctions between independent chromosome segments required ~12 times as many markers as the number of junctions (≈*M*_{e}). As an example, assume *N*_{e} = 3000 and *M*_{e} = 360,000. If *N*_{ind} or *N*_{snp} is low in comparison to *M*_{e}, dimensionality will be close to min(*N*_{ind}, *N*_{snp}). The dimensionality will reach *M*_{e} only when both *N*_{ind} and *N*_{snp} are >>*M*_{e}. For polygenic traits, *N*_{snp} determines a fraction of the additive variance explained by the genomic information (Jensen *et al.* 2012). Hypothesizing that in fact it is the ratio of *N*_{snp} to *M*_{e} that is important, even a large *N*_{snp} can create a “missing heritability” problem in humans where *N*_{e} and subsequently *M*_{e} are large (Yang *et al.* 2010).
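The upper bound on dimensionality can be illustrated with a few lines of arithmetic, taking *M*_{e} = 4*N*_{e}*L*. The first call uses the largest population and the marker and animal counts of this study; the marker and animal counts in the other calls are hypothetical and serve only to show which term becomes limiting.

```python
def segments(ne, genome_length):
    """Expected number of independent chromosome segments, M_e = 4 N_e L."""
    return 4 * ne * genome_length

def dimensionality_bound(ne, genome_length, n_snp, n_ind):
    """Upper bound on dimensionality: min(M_e, N_snp, N_ind)."""
    return min(segments(ne, genome_length), n_snp, n_ind)

# Population P200 of this study: the bound is set by M_e
print(dimensionality_bound(200, 30, 49_980, 75_000))

# N_e = 3000 with hypothetical data sizes: first limited by SNPs, then by M_e
print(dimensionality_bound(3000, 30, 50_000, 1_000_000))
print(dimensionality_bound(3000, 30, 1_000_000, 2_000_000))
```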

Simulations in this study assumed a polygenic model with an equal variance of each SNP. Heterogeneous SNP variance can be incorporated via weights for each trait separately as discussed in Misztal (2016). In particular, if positions of all causative SNPs were known, the rank of **G** would be equal to the number of causative SNPs (Misztal 2016); with 200 causative SNPs the rank of **G** would be 200. If only a few causative SNPs are known and their position/variance is not known precisely, we can expect the rank of **G** to be lower than that estimated from the effective population size but larger than the number of causative SNPs.

Results of this study may be applicable toward understanding the limits of genome-wide association studies (GWAS) resolution. Wang *et al.* (2012) found in a simulation study that the highest correlation of a simulated QTL value was not with the SNP effect closest to the QTL but with the average of 8–16 adjacent SNP markers. Su *et al.* (2014) investigated individual or block variances on 50,000 SNPs in Holstein cattle and found that slightly higher accuracy was obtained when the same variances were imposed on a block of 30 SNPs, which corresponds to 2 Mb or ~15*N*_{e} segments (assuming *N*_{e} = 100 for Holsteins). In a simulation study, Hassani *et al.* (2015) found that QTL effects were better predicted by averages of ±100 flanking markers than by an average of a smaller number of flanking markers. The resolution of GWAS may be limited to a size of an individual chromosome segment and subsequently by *N*_{e}.

In summary, when the number of SNP markers and genotyped animals is large, the dimensionality of the SNP genomic information defined by the eigenvalues of the GRM is approximately a linear function of effective population size. Consequently, an inverse of the GRM based on limited recursion can be computed inexpensively for a large number of individuals. Such an inverse results in more accurate estimation of GEBV than a direct inverse.

The authors thank Paul VanRaden for helpful suggestions. Editing by Heather L. Bradford and Shogo Tsuruta is gratefully acknowledged. This research was primarily supported by grants from the Holstein Association USA (Brattleboro, VT), the American Angus Association, Zoetis, Cobb-Vantress, Smithfield Premium Genetics, the Pig Improvement Company, and the U.S. Department of Agriculture’s National Institute of Food and Agriculture (Agriculture and Food Research Initiative competitive grant 2015-67015-22936).

Table A1. Number of largest eigenvalues of **G**_{0} that accounted for 90%, 95%, 98%, and 99% of the original variation in each simulated population

Population | 90% | 95% | 98% | 99% |
---|---|---|---|---|
P20 | 814 ± 14 | 1,611 ± 26 | 3,701 ± 48 | 6,523 ± 68 |
P40 | 1,540 ± 8 | 2,954 ± 16 | 6,226 ± 29 | 10,006 ± 37 |
P80 | 2,749 ± 14 | 5,026 ± 25 | 9,622 ± 40 | 14,226 ± 47 |
P120 | 3,844 ± 14 | 6,769 ± 22 | 12,169 ± 31 | 17,163 ± 34 |
P160 | 4,760 ± 7 | 8,151 ± 12 | 14,058 ± 15 | 19,253 ± 15 |
P200 | 5,512 ± 19 | 9,245 ± 25 | 15,483 ± 29 | 20,786 ± 29 |

Communicating editor: D. J. de Koning

- Aguilar I., Misztal I., Johnson D. L., Legarra A., Tsuruta S., et al., 2010. Hot topic: a unified approach to utilize phenotypic, full pedigree, and genomic information for genetic evaluation of Holstein final score. J. Dairy Sci. 93: 743–752.
- Bijma P., 2012. Accuracies of estimated breeding values from ordinary genetic evaluations do not reflect the correlation between true and estimated breeding values in selected populations. J. Anim. Breed. Genet. 129: 345–358.
- Brard S., Ricard A., 2015. Is the use of formulae a reliable way to predict the accuracy of genomic selection? J. Anim. Breed. Genet. 132: 207–217.
- Christensen O. F., Lund M. S., 2010. Genomic prediction when some animals are not genotyped. Genet. Sel. Evol. 42: 2.
- Falconer D. S., Mackay T. F. C., 1996. Introduction to Quantitative Genetics, Ed. 4. Longman, Essex, UK.
- Fragomeni B. O., Lourenco D. A. L., Tsuruta S., Masuda Y., Aguilar I., et al., 2015. Hot topic: use of genomic recursions in single-step genomic best linear unbiased predictor (BLUP) with a large number of genotypes. J. Dairy Sci. 98: 4090–4094.
- Goddard M., 2009. Genomic selection: prediction of accuracy and maximization of long term response. Genetica 136: 245–257.
- Hassani S., Saatchi M., Fernando R. L., Garrick D. J., 2015. Accuracy of prediction of simulated polygenic phenotypes and their underlying quantitative trait loci genotypes using real or imputed whole-genome markers in cattle. Genet. Sel. Evol. 47: 99.
- Jensen J., Su G., Madsen P., 2012. Partitioning additive genetic variance into genomic and remaining polygenic components for complex traits in dairy cattle. BMC Genet. 13: 44.
- Laporte V., Charlesworth B., 2002. Effective population size and population subdivision in demographically structured populations. Genetics 162: 501–519.
- Legarra A., Christensen O. F., Aguilar I., Misztal I., 2014. Single Step, a general approach for genomic selection. Livest. Sci. 166: 54–65.
- Lourenco D. A. L., Fragomeni B. O., Tsuruta S., Aguilar I., Zumbach B., et al., 2015a. Accuracy of estimated breeding values with genomic information on males, females, or both: an example on broiler chicken. Genet. Sel. Evol. 47: 56.
- Lourenco D. A. L., Tsuruta S., Fragomeni B. O., Masuda Y., Aguilar I., et al., 2015b. Genetic evaluation using single-step genomic best linear unbiased predictor in American Angus. J. Anim. Sci. 93: 2653–2662.
- MacLeod A. K., Haley C. S., Woolliams J. A., Stam P., 2005. Marker densities and the mapping of ancestral junctions. Genet. Res. 85: 69–79.
- Masuda Y., Misztal I., Tsuruta S., Legarra A., Aguilar I., et al., 2016. Implementation of genomic recursions in single-step genomic BLUP for US Holsteins with a large number of genotyped animals. J. Dairy Sci. 99: 1968–1974.
- Meuwissen T. H. E., Hayes B. J., Goddard M. E., 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
- Misztal I., 2016. Inexpensive computation of the inverse of the genomic relationship matrix in populations with small effective population size. Genetics 202: 401–409.
- Misztal I., Legarra A., Aguilar I., 2014. Using recursion to compute the inverse of the genomic relationship matrix. J. Dairy Sci. 97: 3943–3952.
- Muir W. M., 2007. Comparison of genomic and traditional BLUP-estimated breeding value accuracy and selection response under alternative trait and genomic parameters. J. Anim. Breed. Genet. 124: 342–355.
- Sargolzaei M., Schenkel F. S., 2009. QMSim: a large-scale genome simulator for livestock. Bioinformatics 25: 680–681.
- Solberg T. R., Sonesson A. K., Woolliams J. A., Meuwissen T. H. E., 2008. Genomic selection using different marker types and densities. J. Anim. Sci. 86: 2447–2454.
- Stam P., 1980. The distribution of the fraction of the genome identical by descent in finite random mating populations. Genet. Res. 35: 131–155.
- Su G., Christensen O. F., Janss L., Lund M. S., 2014. Comparison of genomic predictions using genomic relationship matrices built with different weighting factors to account for locus-specific variances. J. Dairy Sci. 97: 6547–6559.
- VanRaden P. M., 2008. Efficient methods to compute genomic predictions. J. Dairy Sci. 91: 4414–4423.
- Wang H., Misztal I., Aguilar I., Legarra A., Muir W. M., 2012. Genome-wide association mapping including phenotypes from relatives without genotypes. Genet. Res. 94: 73–83.
- Wright S., 1931. Evolution in Mendelian populations. Genetics 16: 97–159.
- Yang J., Benyamin B., McEvoy B. P., Gordon S., Henders A. K., et al., 2010. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42: 565–569.
