|Home | About | Journals | Submit | Contact Us | Français|
The identity-by-descent (IBD) based variance component analysis is an important method for mapping quantitative trait loci (QTL) in outbred populations. The interval-mapping approach and various modified versions of it may have limited use in evaluating the genetic variances of the entire genome because they require evaluation of multiple models and model selection. In this study, we developed a multiple variance component model for genome-wide evaluation using both the maximum likelihood (ML) method and the MCMC implemented Bayesian method. We placed one QTL in every few cM on the entire genome and estimated the QTL variances and positions simultaneously in a single model. Genomic regions that have no QTL usually showed no evidence of QTL while regions with large QTL always showed strong evidence of QTL. While the Bayesian method produced the optimal result, the ML method is computationally more efficient than the Bayesian method. Simulation experiments were conducted to demonstrate the efficacy of the new methods.
The online version of this article (doi:10.1007/s10709-010-9497-1) contains supplementary material, which is available to authorized users.
Identical-by-descent (IBD) based variance component method is often used to map quantitative trait loci (QTL) for outbred populations (Goldgar 1990; Amos 1994). The commonly used method is the interval mapping where two markers are used at a time to infer the IBD matrix for any positions bracketed by the two markers (Fulker and Cardon 1994). The model usually contains one QTL and a polygenic effect so that the variance of the QTL, the polygenic variance and the residual variance are the only variance components to be estimated. If multiple QTL exist, this interval mapping approach will produce biased estimate for the QTL variance. When the entire genome is scanned, the total genetic variance (sum of all variances of detected QTL) is often greater than the total phenotypic variance. This phenomenon always occurs in interval mapping, regardless whether the random model for an outbred population or the fixed model for a line cross is used. The reason for that is that QTL effects or QTL variances of different locations are estimated using different models. To scan the entire genome, multiple analyses are conducted, one for each putative location. None of the single QTL models is correct if multiple QTL exist. Therefore, the optimal method should be a multiple variance component model in which all QTL are included in a single model.
Multiple variance components may be difficult to estimate if the number of QTL included in the model is extremely large. However, the popular MCMC implemented Bayesian method is designed to handle large saturated models and it is the ideal method for multiple variance component estimation (Uimari and Hoeschele 1997). The maximum likelihood method may also be sufficient to handle large saturated models under the random model framework; we just never thought of placing one QTL in every few cM of the genome. Meuwissen et al. (2001) first attempted to evaluate the entire genome using a high dense marker map under the popular Bayesian approach. Their method actually treats the positions of QTL as fixed and only estimates the QTL variances and other parameters. Meuwissen et al. (2001) placed many QTL in the model. As a result QTL positions may not be relevant because the whole genome is already well covered by the proposed QTL. Yi and Xu (2000) used the reversible jump MCMC to infer the number of QTL under the random model framework. Only large QTL were eventually included in the model and the entire genome may not be evaluated thoroughly due to the slow mixing behavior of the reversible jump MCMC.
In this study, we proposed to cover the entire genome by QTL and estimated the QTL variances simultaneously within a single model. As long as the extra QTL placed in regions of the genome that do not contain QTL have estimated QTL variances close to zero, we can put as many QTL as we want to make sure that the entire genome is evaluated fairly. We investigated both the ML method and the Bayesian method and showed the pros and cons of each method.
Consider N independent families and the size of the jth family is nj for j = 1, …, N. For simplicity, let us assume that the family size is constant across families so that nj = n for all j = 1, …, N. Assume that we want to put M quantitative trait loci (QTL) in the linear model described below,
where yj is an n × 1 vector for the phenotypic values of a quantitative trait for family j, μ is the population mean of the trait, 1 is an n × 1 unity vector, γjk is an n × 1 vector for the additive genetic effects (breeding values) for family j at the kth QTL and j is an n × 1 vector for the environmental effects. The number of QTL proposed in the model is not the actual number of QTL but a larger number determined by the investigator based on the size of the genome, the marker density and the population size. If the marker density is not high, say one marker in every 10 centiMorgan (cM), the number of proposed QTL can take the number of markers. If the marker density is high, we may choose to place one QTL among a few markers. If the marker density is low, we may insert virtual markers between two consecutive markers. The bottom line is to put one QTL in every d cM. The number of QTL proposed should be sufficiently large to make sure that the entire genome is well evaluated without any large gaps. If a proposed QTL is nearby a true QTL, the effect of the true QTL will be absorbed by the proposed QTL. If a proposed QTL is further away from a true QTL, its estimated effect will be close to zero.
The expectation and variance–covariance matrix of yj are
respectively, where is an IBD matrix for the jth family at the kth QTL, is the genetic variance for the kth QTL and is the environmental error variance. Assume that and , the log likelihood function for the jth family is
where is the parameter vector. The overall log likelihood function for the entire population is
The challenge for the genome-wide evaluation is that, for a large genome, the number of proposed QTL can be very large and majority of the proposed QTL should have estimated variance components close to zero. This will cause problems in the parameter estimation. The EM algorithm is the first candidate method for the variance component model (Thompson and Shaw 1990). However, it is sensitive to the initial values of the parameters. We cannot choose zero as the initial value for , although most are in fact zero. Other initial values are hard to choose. Therefore, we decide to directly maximize the log likelihood function using a sequential approach by updating one variance component at a time, conditional on the values of all other variance components. When a single variance component is considered, maximizing the log likelihood function is a one-dimension problem; the bisection or any other simple algorithm can be used when one parameter is updated. When all parameters are updated, we go back to the first parameter and update the value again. The sequential algorithm requires iterations nested within other iterations until a certain criterion of convergence is satisfied. The iterations within an iteration are called the inner iterations while the iterations outside are called the outer iterations. This algorithm requires more iterations than an algorithm that updates all parameters simultaneously, but choosing the initial value for the parameter of interest becomes trivial, i.e., can be used as initial for all . The sequential approach of Xu (2007) was adopted here, where and are calculated only once for each outer iteration. For large family sizes, much of the computing burden comes from calculating and . Therefore, the sequential algorithm can save computing time substantially, in addition to ease the choice of initial values.
The fixed position approach described previously requires a full coverage of the genome by the proposed QTL. We now introduce a method that can update the positions of the proposed QTL. When the QTL positions are estimated, we can place a smaller number of QTL but still maintain a high probability that regions containing true QTL are visited frequently by the proposed QTL. Let λk be the position of the kth proposed QTL for k = 1, …, M. The parameter vector is now defined as
The QTL positions can move along the genome, but the order of the QTL remains unchanged, as denoted by λ1 < λ2…< λM The connection between the log likelihood function and the QTL positions is through the IBD matrices. We first calculate the IBD matrix for each putative position of the genome (Gessler and Xu 2000). If a QTL moves to a new position, the IBD matrix for the new position is used to evaluate the log likelihood function. The search for QTL positions is also sequential, i.e., we update one position at a time, given positions of all other QTL. For the kth QTL, we use a grid search between λk−1 and λk+1 with 2 cM increment. When the iterations converge, all parameters, including the QTL positions, will remain unchanged. We now have the MLE of all parameters, including the MLE for the QTL positions.
The maximum likelihood method provides a point estimate for each parameter. The main purpose of the genome evaluation is to examine the entire genome for possible association with a quantitative trait. The Bayesian method is adopted here because it gives a chance to evaluate every putative location of the genome. To compare the Bayesian method with the maximum likelihood method, we choose uniform prior for each parameter, including the population mean, the variance components, QTL positions and the environmental variance. The prior distribution for λk is also uniform but within the range defined by λk−1 < λk < λk+1. The posterior distribution for the population mean is normal with mean
from which a realization of μ is sampled. Other parameters do not have explicit forms of a distribution, and thus they are sampled based on the Metropolis-Hastings rule (Metropolis et al. 1953; Hastings 1970). For each of the parameters sampled using the M-H rule, the proposal distribution is a uniform distribution centered in the parameter value of the previous cycle. For example, the proposed value of λk in cycle t + 1 is
and δ is a small positive number, say δ = 2 cM. If is accepted, , otherwise, . In most situations, the Metropolis algorithm is sufficient, but when the value of a parameter is near the boundary, the Hastings adjustment is required to ensure that the parameter is not trapped to a fixed point. For the sampling of QTL position, the Hastings adjustment can be found in Wang et al. (2005). The variance component for each QTL is bounded between zero and the phenotypic variance present in the data, where zero is a legal value of the variance component. The residual variance is also bounded between zero and the phenotypic variance present in the data, but zero is excluded. The posterior sample consists of the observations after burn—in deletion and chain thinning.
In addition to the uniform prior for each variance component, we also considered the following hierarchical prior distribution for the QTL variances. The following exponential distribution was assigned to ,
where the parameter τ2 was also assigned a Gamma prior,
The values of the hyper parameters a > 0 and b > 0 can be chosen arbitrarily, e.g. (a,b) = (0.5, 0.1).
We designed the following simulation experiment to evaluate the performance of the proposed ML and Bayesian methods. We simulated a single large chromosome of 1,000 cM in length. The genome was covered by 101 evenly spaced markers with 10 cM per marker interval. The population size was N × n = 500 × 3 = 1,500. The parental alleles of markers were randomly sampled from five different alleles with an equal frequency. Eight QTL were placed in the genome with positions and QTL variances shown in Table 1. The parental alleles of the QTL were sampled from an infinite number of alleles, i.e., each parental allele was different from any other parental alleles. The genetic effect of the kth QTL was the sum of the two allelic effects while each allelic effect was sample from . The positions of these simulated QTL varied in terms of distances from the nearest markers, some overlapping with a marker and some residing in the middle of an interval bracketed by two markers. The heritability of an individual QTL (proportion of the phenotypic variance contributed by the QTL) ranged from 0.30 to 40.0%. The overall mean and the residual variance were set at μ = 10 and σ2 = 1.0, respectively. The overall proportion of the phenotypic variance contributed by all the eight QTL was 79.78%.
The simulation experiment with this setup is called the standard setup. Some parameters were eventually altered relative to the standard setup in the extended simulation experiments. For example, the marker density was later decreased from 10 cM per interval to 20 and 40 cM per interval. The sampling strategy was also extended to 750 × 2 = 1,500 and 375 × 4 = 1,500. When one experimental parameter was altered, the remaining parameters were fixed at the values in the standard setup.
Under the standard setup (10 cM per marker interval and 3 siblings per family), we simulated one dataset, which was analyzed under the following three models: (a) 10 proposed QTL, (b) 20 proposed QTL and (c) 40 proposed QTL. The real number of QTL is eight, which is smaller than the proposed number of QTL in all situations. The maximum likelihood method was used to analyze the data. The QTL positions were also treated as parameters and were searched along with other parameters. The estimated QTL variance and the positions are shown in Fig. 1. All three models have correctly identified the three largest QTL, i.e., QTL whose contributions to the phenotypic variance are greater than or equal to 10%. The Akaike’s information criterion (AIC) is a simple and very useful criterion for selecting the best model among alternative models (Wada and Kashiwagi 1990). The AIC values for the three models (10 QTL, 20 QTL and 40 QTL) are 3,851.81, 3,879.45 and 3,922.46, respectively. The minimum AIC occurs for the model with 10 proposed QTL, and thus this model is the best. However, in real data analysis, the actual number is unknown and the proposed number of QTL is often larger than the true number of QTL. The extra QTL placed in the model should be closed to zero for the estimated QTL variances and this has been demonstrated by Fig. 1, where all the superfluous QTL have very small estimated variances unless they are close to a true large QTL. Therefore, it is safe to place more QTL in the model than the actual number of QTL and let the program shrink the superfluous QTL to zero. When the proposed number of QTL is too big, a real large QTL may be split by two or more proposed QTL in the neighborhood of the true QTL, as demonstrated in Fig. 1b ,c. This presents no problem because we can choose more appropriate model among a few different models with the AIC criterion.
We now further extend the analysis by including 100 proposed QTL in the model. Under this analysis, we investigated two situations: (a) the positions of the 100 proposed QTL were estimated along with the QTL variances, labeled “100 moving”; (b) the positions of the 100 proposed QTL were fixed and evenly placed along the genome, labeled “100 fixed”. The estimated QTL variances and their locations are demonstrated in Fig. 2. Both methods work very well regarding the ability to identify large QTL (contribution greater than 10%). Again, a large QTL is often split by a few proposed QTL in the neighborhood of the true QTL. This analysis shows that if a large number of QTL are placed in the model, the positions of the proposed QTL do not have to be estimated. The moving position and fixed position approaches generate almost identical results.
To compare the result with the interval mapping of Xu and Atchley (1995), we also analyzed the same dataset with the interval mapping approach. The result is shown in Fig. 3. The interval mapping only detected the two largest QTL. The third largest QTL (10% contribution to the phenotypic variance) was not detectable while this QTL has been detected by the multiple variance component model.
Under the standard setup (10 cM per interval and 500 family each with three siblings), we replicated the experiment 100 times and each replicated dataset was analyzed with two methods. One method is the multiple variance component model proposed in this study, where the proposed number of QTL included in the model was 20 and the positions of the 20 proposed QTL were also estimated using the maximum likelihood method. The other method is the interval mapping of Xu and Atchley (1995) in which a single QTL and a polygenic effect were included in the model. Since the multiple variance component model has no test for a chromosome location, we simply examined the estimated QTL variance in the neighborhood of a true QTL. When the estimated QTL variance in the neighborhood (within 20 cM) of a true QTL is sufficiently large (larger than any peek appearing in a non-QTL region), the QTL was claimed to be detected. For each simulated true QTL, the mean estimate and the standard deviation across the 100 replicated simulations were calculated. The empirical statistical power for each simulated QTL was also calculated as the proportion of the replicated experiments that the QTL was detected. It appears to be subjective, but the multiple variance component model usually provides very small estimated QTL variances for regions that are not placed for any QTL. Therefore, any region that has a noticeable estimated QTL variance indicates that a true QTL is nearby. For the interval mapping of Xu and Atchley (1995), the likelihood ratio test statistic was used to claim the significance of a QTL. If an estimated QTL variance nearby a true QTL (within 20 cM) is significant, this QTL was claimed to be detected. The estimated QTL variances and QTL positions for the interval mapping are compared with those obtained from the multiple variance component model (see Table 1 for the comparison). Overall the multiple variance component model performs better than the interval mapping. The interval mapping provided biased (upward) estimates for all the QTL variances, especially when the true QTL variance was small. Because of the large biases for the estimated QTL variances, they do not add up, i.e., the sum of all the estimated QTL variances is greater than the total phenotypic variance. Therefore, the multiple variance component model outperforms the interval mapping approach.
We now extended the simulation to examine the effect of different family structure on the result of the multiple variance component model. The two additional family structures were 750 × 2 and 375 × 4. Other parameter settings were the same as the previous experiment, i.e., 10 cM per marker interval and 20 proposed QTL included in the model. The experiment was replicated 100 times. The results are shown in Table 2. Result of structure 375 × 4 appears to be better that structure 750 × 2 in terms of smaller estimation errors and higher statistical power. Therefore, the multiple variance component model performs better with small number of large families.
Further extension was made in terms of varying the marker density given that other experimental parameters were the same as those of the standard setup. We examined two more different marker densities, one is 20 cM per marker interval and the other is 40 cM per marker interval. The family structure was 500 × 3 and the proposed number of QTL was 20. The results are given in Table 3, showing that higher marker density has improved the performance of the method.
Finally, we examined the MCMC implemented Bayesian analysis for the simulated data in the standard setting. Again, we placed 20 QTL on the genome. First, we used the uniform prior for each parameter, including the variances and positions of proposed QTL. The first 2000 iterations were treated as burn-in and thereafter one observation was saved in every 20 iterations to reduce the serial correlation. The posterior sample contained 2,000 observations for post MCMC analysis. The result of the MCMC implemented Bayesian analysis is shown in Fig. 4. The frequency profile is shown in the top panel while the QTL variance profile is shown in the bottom panel. The estimated QTL variances are very close to the true values. From Fig. 4b, we can see that all QTL but the two smallest ones are detectable. This demonstrates the advantage of the Bayesian method over the maximum likelihood method.
We further examined the MCMC implemented Bayesian method under the hierarchical modeling with exponential prior for each QTL variance and the parameter of the exponential prior was further assigned a Gamma prior with parameter a and b, Gamma(a,b). The result of Gamma(0.5, 0.1) is shown in Fig. 5. We also choose Gamma(0.5, 0.01) and the result is given in Fig. 6. The hierarchical models with two different sets of hyper parameters are similar to each other, meaning that the choice of the hyper parameters (a and b) does not have too much influence on the result. These results (hierarchical modeling) are also similar to the result of the uniform prior. Overall, the MCMC implemented Bayesian analysis performs better than the ML analysis. However, the ML method is computationally more efficient that the Bayesian method.
We examined two different methods for genome-wide evaluation of QTL in outbred populations. The ML method is an extension of the interval mapping of Xu and Atchley (1995) to handle multiple QTL. The MCMC implemented Bayesian method is an extension of the Bayesian shrinkage analysis of Wang et al. (2005) for line crosses to outbred populations. Similar random model methodology has been proposed by Yi and Xu (2000) who used the reversible jump MCMC algorithm for model selection. In Yi and Xu (2000), the QTL number was treated as a parameter and sampled along with other parameters. In this study, we emphasize genome evaluation rather than QTL mapping. The difference between genome evaluation and QTL mapping is that the former tries to evaluate the entire genome, including regions that have no QTL, while the latter emphasizes detecting regions of the genome that have QTL. We purposely placed more QTL than necessary to give the method a better chance to evaluate the entire genome. For regions of the genome that contain no QTL, the proposed QTL in those regions often have very small estimated variances. Another advantage of the genome evaluation is that it has avoided model selection, which is still a hot topic for discussion in the literature (Kadane and Lazar 2004).
We used multiple full-sib families as an example to demonstrate the method. Extension to multiple complicated pedigrees is straightforward, at least, theoretically because the method requires only the IBD matrices for each putative location of the genome. Methods to calculate the IBD matrix using marker information are available for arbitrarily complicated pedigrees (Amos et al. 1990; Almasy and Blangero 1998). The programs Lokie (Heath 1997) and SimWalk2 (Sobel et al. 2001) are the most well known software packages for IBD matrix calculation.
Surprisingly, the multiple variance component model has very low false positive rate (also called the Type I error). Although we did not actually calculate the Type I error in our simulation experiments, just by visual inspection on the QTL variance profiles, we can see that regions of the genome that contain no QTL rarely show any noticeable peaks while regions with large QTL always have strong signals. This observation implies that the multiple variance component model has great power and small Type I error. Of course, statistical power and Type I error are concepts of frequentists, not of Bayesians. Another surprising discovery is that the Bayesian method is very robust to the prior choice for the QTL variance components. We examined the uniform prior and hierarchical prior (exponential and Gamma), they all generated similar results.
Finally, genome evaluation has two purposes: identifying the regions of the genome for association with the variance of a trait (similar to QTL mapping) and evaluating the genetic effect for each individual animal or plant (marker assisted selection). This study emphasizes the first purpose. To estimate the genetic effects (breeding values) for all individuals in a particular family, the best linear unbiased prediction (BLUP) technology can be applied. For example, to calculate the BLUP estimate for the kth QTL for all individuals in the jth family, the following BLUP equation can be used,
The variance–covariance matrix of this BLUP estimate is
The overall breeding values for all members of the jth family is
Individuals can be ranked based on the estimated breeding values and selected as candidates for breeding. This is referred to as marker assisted selection.
The SAS/IML programs for the maximum likelihood method and Bayesian method are posted on the journal website along with sample data.
Below is the link to the electronic supplementary material.
We thank two anonymous reviewers for their comments on an early version of the manuscript and suggestions on the improvement of the manuscript. This project was supported by the Agriculture and Food Research Initiative (AFRI) of the USDA National Institute of Food and Agriculture under the Plant Genome, Genetics and Breeding Program 2007-35300-18285.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.