Home | About | Journals | Submit | Contact Us | Français |

**|**Springer Open Choice**|**PMC2871098

Formats

Article sections

- Abstract
- Introduction
- Model and methods
- Hypothesis tests
- Applications
- Simulation studies
- Discussion
- References

Authors

Related links

TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik

Theor Appl Genet. 2010 June; 121(1): 47–63.

Published online 2010 February 24. doi: 10.1007/s00122-010-1290-0

PMCID: PMC2871098

Department of Botany and Plant Sciences, University of California, Riverside, CA 92521 USA

Shizhong Xu, Phone: +1-951-8275898, Fax: +1-951-8274437, Email: ude.rcu@ux.gnohzihs.

Communicated by M. Sillanpaa.

Received 2009 September 2; Accepted 2010 February 1.

Copyright © The Author(s) 2010

This article has been cited by other articles in PMC.

We developed a generalized linear model of QTL mapping for discrete traits in line crossing experiments. Parameter estimation was achieved using two different algorithms, a mixture model-based EM (expectation–maximization) algorithm and a GEE (generalized estimating equation) algorithm under a heterogeneous residual variance model. The methods were developed using ordinal data, binary data, binomial data and Poisson data as examples. Applications of the methods to simulated as well as real data are presented. The two different algorithms were compared in the data analyses. In most situations, the two algorithms were indistinguishable, but when large QTL are located in large marker intervals, the mixture model-based EM algorithm can fail to converge to the correct solutions. Both algorithms were coded in C++ and interfaced with SAS as a user-defined SAS procedure called PROC QTL.

Interval mapping (Lander and Botstein 1989) is the most commonly used method for mapping quantitative trait loci (QTL). The method usually applies to quantitative traits, i.e., traits that have a continuous distribution. In agricultural crops, the phenotypes of some traits are measured as discrete variables. Ordinal traits, e.g., disease resistance in plants and animals, are typical examples of such discrete traits. These traits are usually modeled by a multinomial distribution. Traits measured as counts, such as liter size in pigs and tiller number in rice, are usually modeled by the Poisson distribution. Binomial traits are also common in agricultural experiments, such as the ratio of germinated seeds to the total number of seeds planted. These traits are not normally distributed. Although some transformations, such us the log transformation or the more general Box–Cox transformation (Box and Cox 1964), can be used to improve the normality of the traits, not all traits can be transformed. For example, a binary trait has no appropriate transformation to make it normal.

The generalized linear model approach (McCullagh and Nelder 1989; Wedderburn 1974) is the most appropriate method for analyzing traits with non-normal distribution and it has been widely used in statistics for parameter estimation. Generalized linear model takes advantage of all theory and methods developed in the usual linear model methodology (Searle 1997). It has been applied to QTL mapping for some special traits, e.g., binary traits (Deng et al. 2006; Xu and Atchley 1996; Yi and Xu 1999a, b, 2000), ordinal traits (Hackett and Weller 1995; Rao and Xu 1998) and Poisson traits (Cui et al. 2006; Cui and Yang 2009). Depending on the special characteristics of the traits, distribution-specific generalized linear models have been developed for these traits. These methods are not sufficiently general to extend to all traits that can be modeled by the generalized linear model. For example, the EM algorithm developed by Xu et al. (2003, 2005) only applies to binary and ordinal traits. They treated both the marker genotypes and the latent variable as missing values. Although parameter estimation under the EM algorithm is simple, the information matrix of the estimated parameters is difficult to calculate. A more comprehensive analysis of the generalized linear model applied to QTL mapping is the seminal paper by Lange and Whittaker (2001). They adopted the generalized estimating equations (GEE) approach to analyze multiple traits with arbitrary combination of continuous and discrete trait components. The method replaces the unobserved QTL genotypes by the conditional expectations of the genotype indicator variable given flanking marker information. The uncertainties of the genotype indicator variables are ignored. In addition, detailed formulas for the partial derivatives of the expectation of the data with respect to the parameters are not given.

When there are no missing values, commercial software packages are available to estimate parameters under a wide range of distribution of the traits, e.g., SAS (SAS Institute 1999) and GENESTAT (PHOEBE Biostatistics Group 2007, http://www.genestat.org). Although these programs may handle missing values using the imputation algorithm or the EM algorithm (Dempster et al. 1977), the missing patterns handled by these commercial programs are usually different from that of QTL mapping. In interval mapping, genotypes are missing for every individual at a putative QTL position unless the QTL overlaps with a fully informative marker. Special mixture models are required in QTL mapping (Lander and Botstein 1989). Hackett and Weller (1995) and Xu and Atchley (1996) were the first group of people to use the EM algorithm to estimate QTL parameters for ordinal traits, but they did not investigate the variance–covariance matrix of the estimated QTL effects. Hackett and Weller (1995) took advantage of an existing commercial software (GENESTAT) for generalized linear model analysis by iteratively calling the subroutine for generalized linear model with non-missing genotypes and calculating the weight (posterior probability of QTL genotype). The attractive property of that method is that users do not have to write their own code for the maximization step, which is conducted by the commercial software. They only incorporated the expectation step into the existing program for parameter updating. As a result, no variance–covariance matrix of the estimated parameters was provided. Xu et al. (2003) developed an EM algorithm for binary data and used the Monte Carlo simulation approach to obtain the Louis’ (1982) information matrix. The variance–covariance matrix of the estimated parameters can be approximated by the inverse of the information matrix. This method is computationally intensive due to the use of Monte Carlo simulation for approximate integration.

In the statistics literature, generalized linear model with missing covariates is often handled with the EM algorithm (Horton and Laird 1998). However, other methods are also available, as summarized by Ibrahim et al. (2005), who reviewed four general approaches: maximum likelihood method implemented via the EM algorithm by the method of weights (Horton and Laird 1998), multiple imputation (Rubin 1987), fully Bayesian (Ibrahim et al. 2002) and weighted estimation equation (Robins and Ritov 2001). Ibrahim et al. (2005) concluded that the most accurate method is the fully Bayesian method, although the method is associated with a high cost in terms of computing time. The second best method is the EM algorithm via the method of weights. Application of this method to interval mapping has not been analyzed.

The mixture model-based maximum likelihood estimation of parameters has a slow convergence speed. The computational intensity is high within each round of the iteration. Therefore, an approximate method that improves the computational efficacy with little loss in power may be desirable. Xu (1998) developed a weighted least square approximation for QTL mapping of normally distributed traits. Recently, Han and Xu (2008) made a further improvement of the weighted least square estimation. Their idea can be applied to the generalized linear model. Therefore, we will present an approximate method to improve the computational efficiency over the mixture model maximum likelihood method.

We will use ordinal trait QTL mapping as an example to introduce the generalized linear model. Extension to other traits will be given later. Suppose that a disease phenotype of individual is measured by an ordinal variable denoted by where *p* + 1 is the total number of disease classes and *n* is the sample size. Let , be a vector to indicate the disease status of individual *j*. The *k*th element of *Y*_{j} is defined as

Note that the phenotype of the ordinal trait has been formulated as a multivariate Bernoulli variable (or multinomial variable with sample size one). Different link functions can be used to describe the relationship of the observed ordinal phenotype and the genetic effects of QTL. The most commonly used link functions are the logit and probit link functions (McCullagh and Nelder 1989). Here, we described the probit link function and leave the logit link function in Appendix D for interested readers. Under the probit link function, the expectation of *Y*_{jk} is

where is the intercept, *β* is a *q* × 1 vector for some systematic effects (not related to the effects of QTL), and *γ* is an *r* × 1 vector for the effects of a quantitative trait locus. The symbol Φ(·) is the standardized cumulative normal function. The design matrix *X*_{j} is assumed to be known, but *Z*_{j} may not be fully observed because it is determined by the genotype of individual *j* for the locus of interest. We will defer the definition of *Z*_{j} to the next section. Because the link function is probit, this type of analysis is called probit analysis. Let *μ*_{j} = {*μ*_{jk}} be a vector. The expectation for vector *Y*_{j} is and the variance–covariance matrix of *Y*_{j} is

The method to be developed requires the inverse of matrix *V*_{j}. However, *V*_{j} is not of full rank (see Appendix C for proof). We can use a generalized inverse of *V*_{j}, such as in place of *V*_{j}^{−1} (see Appendix C for the generalized inverse). The weight matrix is

The parameter vector is denoted by *θ* = {*α*, *β*, *γ*} with a dimensionality of Once and are defined, the reweighted least square method of Wedderburn (1974) can be used to estimate the parameters.

When the design matrix *Z*_{j} is fully observed, the maximum likelihood solution of parameters can be solved in a straightforward manner using the reweighted least squares approach (Wedderburn 1974). For the paper to be self contained, the iteration equation and the information matrix in the situation of no missing value is briefly described. The Newton–Raphson iteration is

where

is the iteratively reweighted least squares formula for parameter updating (increment of the parameter from iteration *t* to *t* + 1). Matrix *D*_{j} is the partial derivative matrix of *μ*_{j} with respect to *θ*,

with a dimensionality of Once the iteration process converges, the information matrix is automatically given, as it is the coefficient matrix in the last step of the iteration (Wedderburn 1974),

From this information matrix, the variance–covariance matrix of estimated parameters is calculated because the inverse of the information matrix approximately equals the variance–covariance matrix of the estimated parameters.

In QTL mapping for ordinal traits, the generalized linear model in its original form can be applied if one is interested in individual marker analysis because, at markers, the genotypes are observed and thus matrix *Z*_{j} is known. In interval mapping, however, effects of QTL that are located between markers should also be estimated. In this case, the genotypes of QTL are not observed and must be inferred using flanking marker genotypes. This is a typical missing value problem. The missing value *Z*_{j} can be inferred from linked markers. We now use an F_{2} population as an example to show how to handle the missing value of *Z*_{j}. Let

be the conditional probability of QTL genotype given marker information, where the marker information can be either drawn from two flanking markers (interval mapping, Lander and Botstein 1989) or multiple markers (multipoint analysis, Jiang and Zeng 1997). For an F_{2} population, matrix *H*_{g} is defined as the *g*th row of matrix *H*, where

Corresponding to the definition of *H*, the QTL effect vector *γ* is defined as where *a* and *d* represents the additive and dominance effects, respectively. When *Z*_{j} is missing, the generalized linear model becomes a generalized linear mixture model. Under the mixture model approach, we need to define the genotype-specific expectation, the genotype-specific variance matrix and the genotype-specific derivatives for each individual. Let

be the genotype-specific expectation of *Y*_{jk} when *j* takes the *g*th genotype for *g* = 1, 2, 3. The corresponding genotype specific weight matrix is

Let *D*_{j}(*g*) be the genotype-specific partial derivatives of the expectation with respect to the parameters,

The closed form of matrix *D*_{j}(*g*) is given in Appendix A. The increment of parameters in the iteration is

where *p*_{j}^{*}(*g*) is the posterior probability of QTL genotype after the phenotype information is incorporated and is given by

where

is the multinomial probability. Derivation of the EM algorithm (Eq. 14) is given in Appendix B.

Unfortunately, the information matrix under the EM algorithm is not identical to the coefficient matrix of the reweighted least squares equation; rather, it has to be adjusted for the information loss due to the uncertainty of QTL genotypes. The Louis’ (1982) adjustment of the information matrix is

The first term in the above expression (Eq. 17) is

which is the expected value of the negative Hessian matrix. The second term of Eq. (17) is

which is the variance matrix of the score vector, where

and

The EM implemented mixture model approach described above is computationally intensive due to (1) the mixture model itself and (2) the extra step in calculating the information matrix of parameters. Here, we introduce an approximate method that replaces the mixture model by a heterogeneous residual variance model. Let

be the conditional expectation of *Z*_{j} given marker information and

be the corresponding conditional variance–covariance matrix of *Z*_{j}. If *Z*_{j} were observed, we would have

When *Z*_{j} is missing, we can replace *Z*_{j} by *U*_{j} and adjust for the over dispersion caused by the substitution,

where

is the heterogeneous dispersion (see Xu 1998).

We are now in a position to explain the over dispersion. It makes more sense to introduce the liability model before we explain the overdispersion. Let

be the liability for individual *j*, where is the residual error of the liability. The liability is a latent variable that controls the observed ordinal phenotype through a series of thresholds with *T*_{j} = *k* if The residual error variance is not estimable and thus we set *σ*^{2} = 1. When *Z*_{j} is replaced by *U*_{j}, the variance of the liability becomes

Because *σ*_{j}^{2} ≥ 1, the model is called overdispersion. Since *σ*_{j}^{2} varies from one individual to another, the model is also called the heterogeneous residual variance model. To adjust for the heterogeneous overdispersion, we replace by This adjustment serves as a way to standardize the liability so that the adjusted liability has a mean and a unity variance.

Unlike the mixture model, the expectation *μ*_{j} is no longer a function of *g*. Similarly, the weight matrix is

This modification leads to a change in matrix *D*_{j}, the partial derivatives of *μ*_{j} with respect to the parameters, which is given in Appendix A. The same iteration equation given in Eq. (6) is used here for the heterogeneous residual variance model. The attractive property of this approximation is that as the information matrix is given in Eq. (8) no adjustment is required because EM algorithm is not used here for the approximation.

Ordinal traits are the most commonly observed discrete traits in QTL mapping experiments. Other discrete traits also commonly seen in QTL mapping experiments are binary traits, binomial traits and Poisson traits. This section is dedicated to these commonly observed discrete traits. The mixture model algorithm and the heterogeneous variance approximation apply to all traits as long as the traits can be analyzed under the generalized linear model. To apply the algorithms to any specific trait, we only need to find: (1) the distribution of the trait (probability density of the data point), (2) the expectation of the data point, (3) the weight (inverse of the variance) of the data point and (4) the partial derivative of the expectation with respect to the parameters. We now introduce these discrete traits and leave the details of the formulas in Appendix A for interested readers.

Binary traits can be treated as a special case of ordinal traits with *p* = 1. Without any modification, the method developed for ordinal traits can be applied to binary traits with defined as a 2 × 1 vector. Each of the two components is defined as a binary variable and the two components are perfectly correlated. Here, we simplify the problem by defining *Y*_{j} as a univariate binary trait. This univariate treatment not only saves computing time but also simplifies the notation. We now use the univariate definition to define the binary phenotype,

The expectation and variance of the phenotype given the parameter value are

and

respectively. The probability distribution is

Details for the mixture model and the heterogeneous variance model are given in Appendix A.

Let *n*_{j} be the number of trials observed from individual *j* and *m*_{j} be the number of events happened to individual *j*. The binomial phenotype for individual *j* is defined as (expressed as a fraction so that 0 ≤ *Y*_{j} ≤ 1). Under the probit model, the expectation and the variance of the phenotype are

and

respectively. The weight is

The probability distribution is

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.

Let be the phenotype of a Poisson trait. The expectation and the variance of the phenotype are equivalent, , where

The weight is

The probability density is

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.

Two different hypothesis tests are provided, the likelihood ratio test and the Wald test (1949). The likelihood ratio test requires evaluation of different likelihood functions (full model and reduced model). The Wald test is much simpler than the likelihood ratio test statistic because it requires only and the estimated parameters to form the test statistics. Let be the subset of matrix , the Wald test statistics for the null hypothesis, is

Recall that represents both the additive and dominance effects. Under the null hypothesis, the Wald test follows approximately the Chi-square distribution with two degrees of freedom. Therefore, the Wald test is comparable to the likelihood ratio test statistics (McCullagh and Nelder 1989). When testing a single parameter (either *a* or *d*), the Wald test statistics is equivalent to the Chi-square test with one degree of freedom. Therefore, some people used the Wald test and the Chi-square test interchangeably (e.g., Han and Xu 2008).

This example demonstrates the application of the generalized linear model to QTL mapping for binary traits in wheat. The experiment was conducted by Dou et al. (2009) who made the data available to us for this analysis. A female sterile line XND126 and an elite cultivar Gaocheng 8901 with normal fertility were crossed for genetic analysis of female sterility measured as a binary trait. The parents and their F_{1} and F_{2} progeny were planted at Huaian experimental station in China for the 2006–2007 growing season under the normal autumn sowing condition. The mapping population was an F_{2} family consisting of 234 individual plants. The binary trait was the presence of seed setting of the female plants. About five-sixth of the F_{2} progeny had seeded splikelets (phenotype 1) and the remaining one-sixth plants did not have seeded splikelet (phenotype 0). This is a typical binary trait regarding the presence of seeds. Among the plants that had seeded spikelets, the number of seeded spikelets varied and can be treated as a binomial trait for further QTL analysis (see Sect. “Binomial traits” for binomial trait QTL mapping). A total of 28 SSR markers were used in this experiment. These markers covered 5 chromosomes of the wheat genome with an average genome marker density of 15.5 cM per marker interval. The 5 chromosomes are only part of the wheat genome.

These chromosomes were scanned for QTL of the binary trait using both the mixture model and the heterogeneous variance model. The LOD profiles of the two methods are shown in Fig. 1. Two QTL on chromosome 2 and one QTL on chromosome 5 have been detected with high LOD score. The chromosome-wise empirical threshold values are lower than LOD 3. With the chromosome-wise threshold values, we detected one more QTL on chromosome 1. The estimated QTL parameters are listed in Table 1. The two models (mixture model and heterogeneous variance model) produced very similar results.

The LOD profiles for the female sterility (binary) trait of wheat. **a** The *upper panel* shows the LOD profile for the heterogeneous variance model; **b** the *lower panel* shows the LOD profile for the mixture model. The chromosomes are separated by the *vertical* **...**

The same experiment conducted by Dou et al. (2009) also recorded the number of seeded spikelets and the total number spikelets for each plant. The ratio of the two records is a binomial trait. The same mapping population and the same linkage map were used also for the binomial trait QTL mapping. Again, both the mixture model and the heterogeneous variance model were used for the binomial trait analysis. Unfortunately, the mixture model failed to generate meaningful result. Therefore, the result of the mixture model analysis was not reported. In chromosome regions, where there were no QTL, the mixture model generated result similar to that of the heterogeneous variance model. For regions with large QTL, the mixture model approach failed to converge. The possible reason for the failure will be presented in Sect. “Discussion”. We now focus on the result of the heterogeneous variance model. The LOD profile is shown in Fig. 2. First, the pattern of the LOD profile is similar to that of the binary trait analysis. There is strong evidence that there is one QTL on chromosome 1 and two QTL on chromosome 2. The LOD score here for the binomial trait is not in the same scale as that of the binary trait. The highest LOD is about 1,000, almost a hundred times more than the binary trait. This inflation of the LOD reflects the increased power of the binomial data analysis than the binary data analysis. Regions of other chromosomes also show LOD score higher than the empirical chromosome-wise threshold values. These include chromosomes four and five. The estimated QTL parameters are listed in Table 2.

The LOD score profile of the female sterility (binomial) trait in wheat under the heterogeneous variance model. The mixture model failed to converge to the correct solution and thus are not presented. The chromosomes are separated by the *vertical gray* **...**

This example demonstrates the application of the generalized linear model QTL mapping for traits with a Poisson distribution. The data were provided by Dr. Gangqiang Cao at Zengzhou University, China. The result has not been published in any form. The mapping population was a double haploid family of the rice initiated from the cross between IR64 and Azucena. The trait analyzed was the tiller number with an assumed Poisson distribution. The sample size was *n* = 110 and the number of markers was 175. These markers covered 12 chromosomes (2,031 cM in total length) of the rice genome with an average marker interval of 11.6 cM. This dataset was different from that used by Cui et al. (2006). Both experiments were initialized from the same line cross with the same linkage map, but the experiments were conducted in different times and different locations by different investigators. Interval mappings under both the mixture model and the heterogeneous variance model were applied to the data. The LOD score profiles obtained from the two different methods are depicted in Fig. 3. The LOD score from the heterogeneous variance model is slightly higher than that of the mixture model, but the difference is almost negligible. If LOD = 3 was used as the criterion for significance, two QTL would have been detected on chromosomes 1 and 4. We used the quick method of Piepho (2001) to calculate the empirical threshold for each chromosome and used these thresholds to declare significance of QTL. The LOD thresholds are substantially less than LOD 3. Using the chromosome-specific thresholds, we detected one more QTL at the end of chromosome 12. The estimated QTL parameters are listed in Table 3. The supporting interval for each estimated QTL position was determined by the one-LOD drop approach (Ooijen 1992). The two methods differ slightly in the estimated QTL positions and the supporting intervals. The supporting intervals of QTL positions for the heterogeneous variance model were consistently shorter than those of the mixture model. Overall, the difference between the two methods is very small and can be safely ignored for this data analysis.

LOD profiles for the tiller number of rice (Poisson trait) QTL mapping. **a** The *upper panel* shows the LOD profile for the heterogeneous variance model; **b** the *lower panel* shows LOD profile for the mixture model. The chromosomes are separated by the *vertical* **...**

This simulation experiment was to demonstrate the difference between the mixture model and the heterogeneous variance model for binomial trait QTL mapping. For the female sterility trait of wheat, the mixture model approach failed to converge for the binomial data analysis but succeeded for the binary data. Binomial data were supposed to be more informative than the binary data, but more information turned out to do more harm than good to the mixture model, a typical problem of inconsistency has occurred. The problem happened in the calculation of the posterior probability of QTL genotype. The binomial density is equivalent to the product of multiple independent Bernoulli trials. This density is extremely sensitive to the change of genotype from one form to another, especially when the number of trials is large. The super sensitiveness of binomial density to the QTL genotype change led to degeneracy of the posterior probability of QTL genotype. We conducted a small-scale simulation experiment to demonstrate the problem of the mixture model. We simulated a QTL at position 55 cM of a chromosome with 100 cM in length. Five markers were placed evenly on the chromosome with 20 cM per marker interval. The sample size was 500 for an F_{2} population. The binomial trait phenotype was simulated for each plant with a constant number of trials for all plants. We simulated the following number of trials in four experiments: 20, 40, 80 and 160. The LOD score profiles obtained from the mixture model and the heterogeneous variance model are shown in Fig. 4 for all the four experiments. We can see that when the number of trials were 20 and 40, the LOD profiles of the two methods are very similar. As the number of trials increased to 80 and 160, the differences between the two models are dramatic, with LOD profile of the mixture model drastically deviated from that of the heterogeneous variance model when the putative QTL position off the marker position. The strange leaf-like pattern of the mixture model LOD profile reflects the instability and failure of the mixture method.

This simulation study was to evaluate the differences between the EM implemented mixture model and the heterogeneous residual variance model of interval mapping for Poisson traits. The simulation experiments were much more comprehensive than the previous one. The criteria of evaluation include the statistical powers, the test statistic profiles, the estimation errors of QTL parameters, the biases of the estimation and the computational times. The factors considered are the marker density (A), size of the QTL (B), mean of the trait (C), the sample size (D) and the QTL position (E).

A single chromosome with 100 cM in length was simulated for an F_{2} population derived from the cross of two inbred lines. The chromosome was covered evenly by the following number of markers: 101, 51, 21, 11 and 6. These correspond to 1, 2, 5, 10 and 20 cM per marker interval. The size of the QTL (additive effect *a*) was investigated at the following five levels: 0.142, 0.324, 0.471, 0.707 and 0.926. These five levels of the additive effects correspond to the following levels of the heritability (*h*^{2}): 0.01, 0.05, 0.10, 0.20 and 0.30. For simplicity, dominance effect was not simulated and also not included in the model for data analysis. The mean of the Poisson trait was determined by the non-QTL effect *β* (intercept, a scalar because no other non-QTL effects were simulated). The five levels of the intercept (*β*) were −1.0, −0.5, 0.0, 0.5 and 1.0. The sample size was investigated at the following five levels: 100, 200, 300, 500 and 600. The simulated QTL was located at the following different positions: 10, 25, 50, 75 and 90 cM. The true QTL parameter values and the experimental parameters for this comprehensive simulation experiment are summarized in Table 4.

The total number of combinations for all the five factors is 5^{5} = 3,125. If each combination was simulated 1,000 times, the work load would be very intensive. Therefore, we decided to evaluate a small subset of the treatment combinations to draw conclusions. We chose sample size *n* = 300, marker interval 5 cM, QTL size non-QTL effect *β* = 0.0 and QTL position at 50 cM as the basic experimental setup (central level for each of the five factors). Under the basic setup of the experiment, we then evaluated one factor at a time by expanding to all the five levels for the factor of interest with the remaining factors held at the basic levels. This reduced the total number of experiments from 3,125 to 5 × 4 + 1 = 21. Each of the 21 treatment combinations was replicated 1,000 times to examine the empirical statistical powers and estimation errors of parameters for comparisons of the two models. The critical value for QTL detection was determined by simulating additional 1,000 samples under the null QTL model (*a* = 0.0) for each of the treatment combinations. The empirical power for each experiment was the proportion of the samples in which the QTL was detected out of the 1,000 replicates.

Before we discuss the biases and errors of the estimated QTL parameters, let us look at Fig. 5 which shows the average LOD test statistic profiles from the 1,000 replicates for all the experiments. The solid and dotted curves represent the LOD score profiles for the heterogeneous variance model and the mixture model, respectively. The straight horizontal lines are the critical values used to declare QTL significance for power studies. These critical values were drawn from 1,000 additional simulations under the null model (*a* = 0). Figure 5 provides a rich source of information regarding the behaviors of the two models. We only highlight a few important points here. First, at marker positions, the LOD scores for the heterogeneous variance model and the mixture model are identical, as expected, but off the markers the mixture model consistently produced higher LOD scores than the heterogeneous variance model. The higher LOD scores of the mixture model did not translate into higher powers because the critical values for the mixture model were also higher than the heterogeneous variance model. Second, the leaf-like patterns of the LOD score profiles of the mixture model became more severe when the marker interval was increased. This reflected an intrinsic flaw of the mixture model. In contrast, the heterogeneous variance model produced much more smoothed LOD profiles.

Average LOD score profiles of the simulation experiments of QTL mapping for a Poisson trait. The *solid curve* represents the average LOD score profile of the heterogeneous variance model while the *dotted line* represents the average LOD score profile of **...**

We now compare the empirical statistical powers for the two models (see the top panels in Fig. 6). The solid and open symbols represent the powers for the heterogeneous variance model and the mixture model, respectively. In all cases, the heterogeneous variance model had a slightly higher power than the mixture model, although the difference was almost negligible. Some factors had large effects on the powers and others had small effects on the powers. In general, the power declined as the marker interval increased. Sample size and the size of QTL are most influencing factors on the powers. Intercept and position of the true QTL had little influence on the powers.

Estimated QTL parameters and empirical powers of the simulation experiments of QTL mapping for a Poisson trait. Labels *A*, *B*, *C*, *D* and *E* represent five treatments (factors). Within each treatment, there are five different levels represented by levels 1 **...**

Let us turn into the panels in the second row of Fig. 6 to examine the factors that affect the estimated QTL positions. The solid and open symbols indicate the average estimated QTL positions for the two models obtained from the 1,000 replications. The vertical bars represent the standard deviations of the estimated QTL position calculated from the 1,000 replicated experiments. The red horizontal lines indicate the true positions of the simulated QTL. Both models are unbiased and the standard deviations of the estimated positions are very similar for the two models. The size of the QTL and the sample size had the largest influence on the accuracy of the estimated QTL position.

Panels of the third row of Fig. 6 show the average estimated intercepts and the standard deviations obtained from the 1,000 replicated simulations. The two models had very similar estimated intercepts, but both models were biased upwardly.

The panels at the bottom of Fig. 6 show the results of the estimated QTL effect (*a*) for the two models. The heterogeneous model was unbiased but the mixture model was consistently biased upwardly. The bias for the mixture model was more serious as the marker interval increased.

Overall, the heterogeneous variance model performed consistently better than the mixture model. This observation was not expected. Our original purpose of proposing the heterogeneous variance model was to improve the computational efficiency. We did not anticipate that the mixture model was not consistent and performed poorly. The simulation experiments did show that the heterogeneous variance model took, on average, about one-third of the computing time of the mixture model. To our surprise, the heterogeneous variance model outperformed the mixture model in all cases considered in the experiments.

The most commonly used algorithm for missing values in generalized linear model is the EM algorithm (Horton and Laird 1998). We extended the EM algorithm to interval mapping, where the independent variable (genotype indicator variable) is missing for all individuals. We also developed a heterogeneous variance model to approximate the mixture model. Both the binary data and the Poisson data analyses showed that the two methods generated similar results, but the heterogeneous variance model is computationally faster than the mixture model-based EM algorithm. To our surprise, the mixture model approach is not always better than the heterogeneous variance model in terms of better estimation of QTL parameters. The binomial data analysis showed that the mixture model approach failed to converge to the correct values for some marker intervals while the heterogeneous variance model worked very well. The failure is due to several possible reasons: (1) low information content for large marker intervals, (2) large QTL effects and (3) inconsistency of the EM algorithm. For the first reason, a large marker interval means that QTL position in the middle of the interval has little information regarding the genotype of the putative QTL. The posterior probability of QTL genotype is largely determined by the data and the current parameter values. In the end, the posterior probabilities become degenerated (probability equals unity for one genotype and zero for all other genotypes) for all individuals. This phenomenon was observed only for large intervals. The second reason of the failure is due to large QTL effects. We noticed that the failure of the mixture model approach did not happen in intervals that do not have QTL, even though some of those intervals are very large. It is the combination of large interval with large QTL effects that caused the failure. The third reason of the failure is due the inconsistency (intrinsic flaw) of the mixture model. More extensive simulation experiments using the Poisson data further justified the heterogeneous variance model for interval mapping.

Most QTL mapping experiments in the future will be done with a high marker density. In that case, the mixture model and heterogeneous variance model will have negligible difference in efficiency, with the latter slightly more preferable than the former due to its light computational load. Theory and methods of interval mapping under the mixture model are well developed both for normally distributed traits and for discrete traits. Variance–covariance matrix for the estimated parameters is also available, but only for normally distributed trait interval mapping (Kao and Zeng 1997). This research is the first attempt to develop the covariance matrix of estimated QTL parameters under the generalized linear mixture model. With the availability of the variance–covariance matrix of estimated QTL effects, we have a choice to perform either the Wald test or the likelihood ratio test. The two test statistics are asymptotically the same, with the latter slightly more preferable when the sample size is small (Frank 2001).

The approximate heterogeneous variance model was originally developed by one of us for normal trait QTL mapping (Xu 1998). We demonstrated here that it worked equally well for the generalized linear model. Simulation experiments and real data analysis all demonstrated that the approximation is very close to, and sometimes better than, the mixture model. The advantage of the approximate method is the avoidance of mixture model and thus avoidance of the EM algorithm. As a result, computing the variance–covariance matrix of the estimated parameters becomes straightforward, a by-product of the iteration process. In addition, the heterogeneous variance model appears to be more robust and stable compared with the mixture model-based EM algorithm according to our binomial data analysis and the simulation study.

Both the mixture model maximum likelihood and the heterogeneous variance approximation have been coded in an existing QTL mapping program called PROC QTL (Hu and Xu 2009). This program is a user-defined SAS procedure. Users need to specify the distribution of the data. The default distribution is normal. The current version of PROC QTL can handle binary, binomial, ordinal and Poisson distributions. More distributions will be added in the future, depending on the availability of the data. The program is downloadable from our Web site (http://www.statgen.ucr.edu).

We are grateful to Dr. Gangqiang Cao at Zhengzhou University in China for his generality of sharing the tiller number data with us. We also appreciate the help from two anonymous reviewers for their constructive comments on an early version of the manuscript. This project was supported by the National Research Initiative (NRI) Plant Genome of the USDA Cooperative State Research, Education and Extension Service (CSREES) 2007-02784 to SX.

**Open Access** This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

(1) Ordinal data with observed genotypes

Let *g* = 1, 2, 3 index the three genotypes for an F_{2} population derived from the cross of two inbred lines. The expectation of *Y*_{jk} conditional on the parameters for genotype *g* is

Define as a vector for the expectation of *Y*_{j}. The *D* matrix for genotype *g* is

where

and

(2) Ordinal data under the heterogeneous variance model (approximation)

The expectation of *Y*_{jk} conditional on parameters is

Let be a vector for the expectation of *Y*_{j}. The *D* matrix is defined as

where

and

(3) Binary data with observed genotypes

Let *g* = 1, 2, 3 index the three genotypes and define the genotype-specific expectation of *Y*_{j} by

The *D* matrix for genotype *g* is

where

and

(4) Binary data under the heterogeneous variance model

Define the expectation of *Y*_{j} by

The *D* matrix is

where

and

(5) Poisson data with observed genotypes

The expectation of *Y*_{j} given genotype *g* is

The *D* matrix is

where

(6) Poisson data under the overdispersion model (approximation)

Denote the expectation of *Y*_{j} by

The *D* matrix is

where

Since the ordinal phenotype has been presented as multivariable Bernoulli variable and the parameter vector has been partitioned into three blocks, it is a little bit tedious to derive the EM algorithm for parameter estimation. Therefore, we only used the Poisson data as an example for the derivation. The derivation applies to generalized linear model for any distributions. Let *G*_{j} = 1, 2, 3 be a discrete variable for the genotype of individual *j*. Let be a vector of three indicator variables of the genotypes for individual *j* following a multinomial distribution with sample size one. The relationship between *G*_{j} and *δ*_{j} is

for *g* = 1, 2, 3. When the genotype of individual *j* is known, the complete-data log likelihood function for the parameters is

where

is the Poisson probability. The expectation of the complete-data log likelihood function is

where

is the posterior probability of *δ*(*G*_{j}, *g*) = 1 given *θ* = *θ*^{(t)}. Now *p*_{j}^{*}(*g*) is treated as a constant (not a function of the parameters because the unknown parameter involved in *p*_{j}^{*}(*g*) has been replaced by the parameter value at iteration *t*). The EM algorithm actually maximizes the expectation of the complete-data log likelihood function, not the original observed log likelihood function. The maximum likelihood estimation in the neighborhood of *θ* = *θ*^{(t)} can be obtained through Taylor expansion around *θ* = *θ*^{(t)} (the Newton–Raphson method),

Now we only need to prove that

and

The partial derivative of the complete-data log likelihood function with respect to the unknown parameters is

The second partial derivative is

where

Given that

(see Eqs. 61 and 62 in Appendix A) and

(see Eq. 39 in the main text), we have

Substituting *X*_{j} and *H*_{g} in Eqs. (74) and (76) by Eq. (79), we have

and

Therefore,

and

This concludes the derivation of the EM algorithm.

For ordinal traits, the observed data point *Y*_{j} for individual *j* is multinomial with sample size one. Therefore, the variance–covariance matrix is

where This variance matrix can be rewritten in a general form as

where *c* = −1. The inverse of this general form has an explicit expression (Giri 1996),

The fact that leads to

Therefore, the inverse matrix can be written as

Since 1/(1 + *c*) is not defined when *c* = −1, the inverse matrix *V*_{j}^{−1} does not exist.

We now prove that is a generalized inverse of matrix *V*_{j}. To prove this, we only need to show that because a generalize inverse *V*_{j}^{−} is defined as a matrix such that . Substituting *V*_{j}^{−} by , we get

Keep in mind that the above derivation requires and for simplification. This concludes the proof that is a generalized inverse of matrix *V*_{j}.

In fact, is just one of an infinite number of generalized inverses of matrix *V*_{j}. A general form of the generalized inverse can be expressed by

where *d* is a real number (a scalar not a matrix). The generalized inverse is simply obtained by setting *d* = 0. The following equation serves as a proof of this general form of the generalized inverse,

We now use binary trait QTL mapping as an example to demonstrate the logit link function of the generalized linear model. Recall that the univariate definition of the binary phenotype is

Under the logit link function, the expectation and variance of the phenotype given the parameter values are

and

respectively. The probability density is

The link function is logit because

Let *g* = 1, 2, 3 index the three genotypes and

be the genotype specific expectation. Let

and

The *D* matrix for genotype *g* is defined as

where

and

Let us define

The expectation of *Y*_{j} is

Define

The *D* matrix is

where

and

- Box GEP, Cox DR. An analysis of transformations. J R Stat Soc B. 1964;26:211–252.
- Cui Y, Yang W. Zero-inflated generalized Poisson regression mixture model for mapping quantitative trait loci underlying count trait with many zeros. J Theor Biol. 2009;256:276–285. doi: 10.1016/j.jtbi.2008.10.003. [PubMed] [Cross Ref]
- Cui Y, Kim DY, Zhu J. On the generalized poisson regression mixture model for mapping quantitative trait loci with count data. Genetics. 2006;174:2159–2172. doi: 10.1534/genetics.106.061960. [PubMed] [Cross Ref]
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977;39:1–38.
- Deng W, Chen H, Li Z. A logistic regression mixture model for interval mapping of genetic trait loci affecting binary phenotypes. Genetics. 2006;172:1349–1358. doi: 10.1534/genetics.105.047241. [PubMed] [Cross Ref]
- Dou B, Hou B, Xu H, Lou X, Chi X, Yang J, Wang F, Ni Z, Sun Q. Efficient mapping of a female sterile gene in wheat (
*Triticum aestivum*L.) Genet Res. 2009;91:337–343. doi: 10.1017/S0016672309990218. [PubMed] [Cross Ref] - Frank EH. Regression modeling strategies. New York: Springer; 2001.
- Giri NC. Multivariate statistical analysis. New York: Marcel Dekker, Inc; 1996.
- Hackett CA, Weller JI. Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics. 1995;51:1252–1263. doi: 10.2307/2533257. [PubMed] [Cross Ref]
- Han L, Xu S. A Fisher scoring algorithm for the weighted regression method of QTL mapping. Heredity. 2008;101:453–464. doi: 10.1038/hdy.2008.78. [PubMed] [Cross Ref]
- Horton NJ, Laird NM. Maximum likelihood analysis of generalized linear models with missing covariates. Stat Methods Med Res. 1998;8:37–50. doi: 10.1191/096228099673120862. [PubMed] [Cross Ref]
- Hu Z, Xu S (2009) PROC QTL—A SAS procedure for mapping quantitative trait loci. Int J Plant Genomics 2009:3. doi:10.1155/2009/141234 [PMC free article] [PubMed]
- Ibrahim JG, Chen M-H, Lipsitz SR. Bayesian methods for generalized linear models with missing covariates. Can J Stat. 2002;30:55–78. doi: 10.2307/3315865. [Cross Ref]
- Ibrahim JG, Chen M-H, Lipsitz SR, Herring AH. Missing-data methods for generalized linear models: a comparative review. J Am Stat Assoc. 2005;100:332–346. doi: 10.1198/016214504000001844. [Cross Ref]
- Jiang C, Zeng ZB. Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica. 1997;101:47–58. doi: 10.1023/A:1018394410659. [PubMed] [Cross Ref]
- Kao CH, Zeng ZB. General formulas for obtaining the MLEs and the asymptotic variance–covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics. 1997;53:653–665. doi: 10.2307/2533965. [PubMed] [Cross Ref]
- Lander ES, Botstein D. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121:185–199. [PubMed]
- Lange C, Whittaker JC. Mapping quantitative trait loci using generalized estimating equations. Genetics. 2001;159:1325–1337. [PubMed]
- Louis T. Finding the observed information matrix when using the EM algorithm. J R Stat Soc B. 1982;44:226–233.
- McCullagh P, Nelder JA. Generalized linear models. New York: Chapman & Hall; 1989.
- Ooijen JW. Accuracy of mapping quantitative trait loci in autogamous species. Theor Appl Genet. 1992;84:803–811. doi: 10.1007/BF00227388. [PubMed] [Cross Ref]
- Piepho HP. A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics. 2001;157:425–432. [PubMed]
- Rao SQ, Xu S. Mapping quantitative trait loci for categorical traits in four - way crosses. Heredity. 1998;81:214–224. doi: 10.1038/sj.hdy.6883780. [PubMed] [Cross Ref]
- Robins JM, Ritov Y. On double robustness. Stat Sin. 2001;11:920–936.
- Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
- SAS/STAT users’ guide. Cary: SAS Publishing; 1999.
- Searle SR. Linear models. New York: Wiley; 1997.
- Wald A. Note on the consistency of the maximum likelihood estimate. Ann Math Stat. 1949;20:595–601. doi: 10.1214/aoms/1177729952. [Cross Ref]
- Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61:439–447.
- Xu S. Iteratively reweighted least squares mapping of quantitative trait loci. Behav Genetics. 1998;28:341–355. doi: 10.1023/A:1021617618150. [PubMed] [Cross Ref]
- Xu S, Atchley WR. Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics. 1996;143:1417–1424. [PubMed]
- Xu S, Yi N, Burke D, Galecki A, Miller RA. An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a four-way cross mouse family. Genet Res. 2003;82:127–138. doi: 10.1017/S0016672303006414. [PubMed] [Cross Ref]
- Xu C, Zhang Y-M, Xu S. An EM algorithm for mapping quantitative resistance loci. Heredity. 2005;94:119–128. doi: 10.1038/sj.hdy.6800583. [PubMed] [Cross Ref]
- Yi N, Xu S. Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity. 1999;82:668–676. doi: 10.1046/j.1365-2540.1999.00529.x. [PubMed] [Cross Ref]
- Yi N, Xu S. A random model approach to mapping quantitative trait loci for complex binary traits in outbred populations. Genetics. 1999;153:1029–1040. [PubMed]
- Yi N, Xu S. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics. 2000;155:1391–1403. [PubMed]

Articles from Springer Open Choice are provided here courtesy of **Springer**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |