PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of springeropenLink to Publisher's site
TAG. Theoretical and Applied Genetics. Theoretische Und Angewandte Genetik
 
Theor Appl Genet. 2010 June; 121(1): 47–63.
Published online 2010 February 24. doi:  10.1007/s00122-010-1290-0
PMCID: PMC2871098

Generalized linear model for interval mapping of quantitative trait loci

Abstract

We developed a generalized linear model of QTL mapping for discrete traits in line crossing experiments. Parameter estimation was achieved using two different algorithms, a mixture model-based EM (expectation–maximization) algorithm and a GEE (generalized estimating equation) algorithm under a heterogeneous residual variance model. The methods were developed using ordinal data, binary data, binomial data and Poisson data as examples. Applications of the methods to simulated as well as real data are presented. The two different algorithms were compared in the data analyses. In most situations, the two algorithms were indistinguishable, but when large QTL are located in large marker intervals, the mixture model-based EM algorithm can fail to converge to the correct solutions. Both algorithms were coded in C++ and interfaced with SAS as a user-defined SAS procedure called PROC QTL.

Introduction

Interval mapping (Lander and Botstein 1989) is the most commonly used method for mapping quantitative trait loci (QTL). The method usually applies to quantitative traits, i.e., traits that have a continuous distribution. In agricultural crops, the phenotypes of some traits are measured as discrete variables. Ordinal traits, e.g., disease resistance in plants and animals, are typical examples of such discrete traits. These traits are usually modeled by a multinomial distribution. Traits measured as counts, such as liter size in pigs and tiller number in rice, are usually modeled by the Poisson distribution. Binomial traits are also common in agricultural experiments, such as the ratio of germinated seeds to the total number of seeds planted. These traits are not normally distributed. Although some transformations, such us the log transformation or the more general Box–Cox transformation (Box and Cox 1964), can be used to improve the normality of the traits, not all traits can be transformed. For example, a binary trait has no appropriate transformation to make it normal.

The generalized linear model approach (McCullagh and Nelder 1989; Wedderburn 1974) is the most appropriate method for analyzing traits with non-normal distribution and it has been widely used in statistics for parameter estimation. Generalized linear model takes advantage of all theory and methods developed in the usual linear model methodology (Searle 1997). It has been applied to QTL mapping for some special traits, e.g., binary traits (Deng et al. 2006; Xu and Atchley 1996; Yi and Xu 1999a, b, 2000), ordinal traits (Hackett and Weller 1995; Rao and Xu 1998) and Poisson traits (Cui et al. 2006; Cui and Yang 2009). Depending on the special characteristics of the traits, distribution-specific generalized linear models have been developed for these traits. These methods are not sufficiently general to extend to all traits that can be modeled by the generalized linear model. For example, the EM algorithm developed by Xu et al. (2003, 2005) only applies to binary and ordinal traits. They treated both the marker genotypes and the latent variable as missing values. Although parameter estimation under the EM algorithm is simple, the information matrix of the estimated parameters is difficult to calculate. A more comprehensive analysis of the generalized linear model applied to QTL mapping is the seminal paper by Lange and Whittaker (2001). They adopted the generalized estimating equations (GEE) approach to analyze multiple traits with arbitrary combination of continuous and discrete trait components. The method replaces the unobserved QTL genotypes by the conditional expectations of the genotype indicator variable given flanking marker information. The uncertainties of the genotype indicator variables are ignored. In addition, detailed formulas for the partial derivatives of the expectation of the data with respect to the parameters are not given.

When there are no missing values, commercial software packages are available to estimate parameters under a wide range of distribution of the traits, e.g., SAS (SAS Institute 1999) and GENESTAT (PHOEBE Biostatistics Group 2007, http://www.genestat.org). Although these programs may handle missing values using the imputation algorithm or the EM algorithm (Dempster et al. 1977), the missing patterns handled by these commercial programs are usually different from that of QTL mapping. In interval mapping, genotypes are missing for every individual at a putative QTL position unless the QTL overlaps with a fully informative marker. Special mixture models are required in QTL mapping (Lander and Botstein 1989). Hackett and Weller (1995) and Xu and Atchley (1996) were the first group of people to use the EM algorithm to estimate QTL parameters for ordinal traits, but they did not investigate the variance–covariance matrix of the estimated QTL effects. Hackett and Weller (1995) took advantage of an existing commercial software (GENESTAT) for generalized linear model analysis by iteratively calling the subroutine for generalized linear model with non-missing genotypes and calculating the weight (posterior probability of QTL genotype). The attractive property of that method is that users do not have to write their own code for the maximization step, which is conducted by the commercial software. They only incorporated the expectation step into the existing program for parameter updating. As a result, no variance–covariance matrix of the estimated parameters was provided. Xu et al. (2003) developed an EM algorithm for binary data and used the Monte Carlo simulation approach to obtain the Louis’ (1982) information matrix. The variance–covariance matrix of the estimated parameters can be approximated by the inverse of the information matrix. This method is computationally intensive due to the use of Monte Carlo simulation for approximate integration.

In the statistics literature, generalized linear model with missing covariates is often handled with the EM algorithm (Horton and Laird 1998). However, other methods are also available, as summarized by Ibrahim et al. (2005), who reviewed four general approaches: maximum likelihood method implemented via the EM algorithm by the method of weights (Horton and Laird 1998), multiple imputation (Rubin 1987), fully Bayesian (Ibrahim et al. 2002) and weighted estimation equation (Robins and Ritov 2001). Ibrahim et al. (2005) concluded that the most accurate method is the fully Bayesian method, although the method is associated with a high cost in terms of computing time. The second best method is the EM algorithm via the method of weights. Application of this method to interval mapping has not been analyzed.

The mixture model-based maximum likelihood estimation of parameters has a slow convergence speed. The computational intensity is high within each round of the iteration. Therefore, an approximate method that improves the computational efficacy with little loss in power may be desirable. Xu (1998) developed a weighted least square approximation for QTL mapping of normally distributed traits. Recently, Han and Xu (2008) made a further improvement of the weighted least square estimation. Their idea can be applied to the generalized linear model. Therefore, we will present an approximate method to improve the computational efficiency over the mixture model maximum likelihood method.

Model and methods

Generalized linear model

We will use ordinal trait QTL mapping as an example to introduce the generalized linear model. Extension to other traits will be given later. Suppose that a disease phenotype of individual equation M1 is measured by an ordinal variable denoted by equation M2 where p + 1 is the total number of disease classes and n is the sample size. Let equation M3, be a equation M4 vector to indicate the disease status of individual j. The kth element of Yj is defined as

equation M5

Note that the phenotype of the ordinal trait has been formulated as a multivariate Bernoulli variable (or multinomial variable with sample size one). Different link functions can be used to describe the relationship of the observed ordinal phenotype and the genetic effects of QTL. The most commonly used link functions are the logit and probit link functions (McCullagh and Nelder 1989). Here, we described the probit link function and leave the logit link function in Appendix D for interested readers. Under the probit link function, the expectation of Yjk is

equation M6

where equation M7 is the intercept, β is a q × 1 vector for some systematic effects (not related to the effects of QTL), and γ is an r × 1 vector for the effects of a quantitative trait locus. The symbol Φ(·) is the standardized cumulative normal function. The design matrix Xj is assumed to be known, but Zj may not be fully observed because it is determined by the genotype of individual j for the locus of interest. We will defer the definition of Zj to the next section. Because the link function is probit, this type of analysis is called probit analysis. Let μj = {μjk} be a equation M8 vector. The expectation for vector Yj is equation M9 and the variance–covariance matrix of Yj is

equation M10

The method to be developed requires the inverse of matrix Vj. However, Vj is not of full rank (see Appendix C for proof). We can use a generalized inverse of Vj, such as equation M11 in place of Vj−1 (see Appendix C for the generalized inverse). The weight matrix is

equation M12

The parameter vector is denoted by θ = {α, β, γ} with a dimensionality of equation M13 Once equation M14 and are defined, the reweighted least square method of Wedderburn (1974) can be used to estimate the parameters.

Mixture model maximum likelihood estimation

When the design matrix Zj is fully observed, the maximum likelihood solution of parameters can be solved in a straightforward manner using the reweighted least squares approach (Wedderburn 1974). For the paper to be self contained, the iteration equation and the information matrix in the situation of no missing value is briefly described. The Newton–Raphson iteration is

equation M15

where

equation M16

is the iteratively reweighted least squares formula for parameter updating (increment of the parameter from iteration t to t + 1). Matrix Dj is the partial derivative matrix of μj with respect to θ,

equation M17

with a dimensionality of equation M18 Once the iteration process converges, the information matrix is automatically given, as it is the coefficient matrix in the last step of the iteration (Wedderburn 1974),

equation M19

From this information matrix, the variance–covariance matrix of estimated parameters is calculated because the inverse of the information matrix approximately equals the variance–covariance matrix of the estimated parameters.

In QTL mapping for ordinal traits, the generalized linear model in its original form can be applied if one is interested in individual marker analysis because, at markers, the genotypes are observed and thus matrix Zj is known. In interval mapping, however, effects of QTL that are located between markers should also be estimated. In this case, the genotypes of QTL are not observed and must be inferred using flanking marker genotypes. This is a typical missing value problem. The missing value Zj can be inferred from linked markers. We now use an F2 population as an example to show how to handle the missing value of Zj. Let

equation M20

be the conditional probability of QTL genotype given marker information, where the marker information can be either drawn from two flanking markers (interval mapping, Lander and Botstein 1989) or multiple markers (multipoint analysis, Jiang and Zeng 1997). For an F2 population, matrix Hg is defined as the gth row of matrix H, where

equation M21

Corresponding to the definition of H, the QTL effect vector γ is defined as equation M22 where a and d represents the additive and dominance effects, respectively. When Zj is missing, the generalized linear model becomes a generalized linear mixture model. Under the mixture model approach, we need to define the genotype-specific expectation, the genotype-specific variance matrix and the genotype-specific derivatives for each individual. Let

equation M23

be the genotype-specific expectation of Yjk when j takes the gth genotype for g = 1, 2, 3. The corresponding genotype specific weight matrix is

equation M24

Let Dj(g) be the genotype-specific partial derivatives of the expectation with respect to the parameters,

equation M25

The closed form of matrix Dj(g) is given in Appendix A. The increment of parameters in the iteration is

equation M26

where pj*(g) is the posterior probability of QTL genotype after the phenotype information is incorporated and is given by

equation M27

where

equation M28

is the multinomial probability. Derivation of the EM algorithm (Eq. 14) is given in Appendix B.

Unfortunately, the information matrix under the EM algorithm is not identical to the coefficient matrix of the reweighted least squares equation; rather, it has to be adjusted for the information loss due to the uncertainty of QTL genotypes. The Louis’ (1982) adjustment of the information matrix is

equation M29

The first term in the above expression (Eq. 17) is

equation M30

which is the expected value of the negative Hessian matrix. The second term of Eq. (17) is

equation M31

which is the variance matrix of the score vector, where

equation M32

and

equation M33

Approximation under the heterogeneous variance model

The EM implemented mixture model approach described above is computationally intensive due to (1) the mixture model itself and (2) the extra step in calculating the information matrix of parameters. Here, we introduce an approximate method that replaces the mixture model by a heterogeneous residual variance model. Let

equation M34

be the conditional expectation of Zj given marker information and

equation M35

be the corresponding conditional variance–covariance matrix of Zj. If Zj were observed, we would have

equation M36

When Zj is missing, we can replace Zj by Uj and adjust for the over dispersion caused by the substitution,

equation M37

where

equation M38

is the heterogeneous dispersion (see Xu 1998).

We are now in a position to explain the over dispersion. It makes more sense to introduce the liability model before we explain the overdispersion. Let

equation M39

be the liability for individual j, where equation M40 is the residual error of the liability. The liability is a latent variable that controls the observed ordinal phenotype through a series of thresholds equation M41 with Tj = k if equation M42 The residual error variance is not estimable and thus we set σ2 = 1. When Zj is replaced by Uj, the variance of the liability becomes

equation M43

Because σj2 ≥ 1, the model is called overdispersion. Since σj2 varies from one individual to another, the model is also called the heterogeneous residual variance model. To adjust for the heterogeneous overdispersion, we replace equation M44 by equation M45 This adjustment serves as a way to standardize the liability so that the adjusted liability has a mean equation M46 and a unity variance.

Unlike the mixture model, the expectation μj is no longer a function of g. Similarly, the weight matrix is

equation M47

This modification leads to a change in matrix Dj, the partial derivatives of μj with respect to the parameters, which is given in Appendix A. The same iteration equation given in Eq. (6) is used here for the heterogeneous residual variance model. The attractive property of this approximation is that as the information matrix is given in Eq. (8) no adjustment is required because EM algorithm is not used here for the approximation.

Extension to other traits

Ordinal traits are the most commonly observed discrete traits in QTL mapping experiments. Other discrete traits also commonly seen in QTL mapping experiments are binary traits, binomial traits and Poisson traits. This section is dedicated to these commonly observed discrete traits. The mixture model algorithm and the heterogeneous variance approximation apply to all traits as long as the traits can be analyzed under the generalized linear model. To apply the algorithms to any specific trait, we only need to find: (1) the distribution of the trait (probability density of the data point), (2) the expectation of the data point, (3) the weight (inverse of the variance) of the data point and (4) the partial derivative of the expectation with respect to the parameters. We now introduce these discrete traits and leave the details of the formulas in Appendix A for interested readers.

Binary traits

Binary traits can be treated as a special case of ordinal traits with p = 1. Without any modification, the method developed for ordinal traits can be applied to binary traits with equation M48 defined as a 2 × 1 vector. Each of the two components is defined as a binary variable and the two components are perfectly correlated. Here, we simplify the problem by defining Yj as a univariate binary trait. This univariate treatment not only saves computing time but also simplifies the notation. We now use the univariate definition to define the binary phenotype,

equation M49

The expectation and variance of the phenotype given the parameter value are

equation M50

and

equation M51

respectively. The probability distribution is

equation M52

Details for the mixture model and the heterogeneous variance model are given in Appendix A.

Binomial traits

Let nj be the number of trials observed from individual j and mj be the number of events happened to individual j. The binomial phenotype for individual j is defined as equation M53 (expressed as a fraction so that 0 ≤ Yj ≤ 1). Under the probit model, the expectation and the variance of the phenotype are

equation M54

and

equation M55

respectively. The weight is

equation M56

The probability distribution is

equation M57

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.

Poisson traits

Let equation M58 be the phenotype of a Poisson trait. The expectation and the variance of the phenotype are equivalent, equation M59, where

equation M60

The weight is

equation M61

The probability density is

equation M62

Details regarding the partial derivatives of the expectation with respect to the parameters both in the mixture model and in the heterogeneous variance model are given in Appendix A.

Hypothesis tests

Two different hypothesis tests are provided, the likelihood ratio test and the Wald test (1949). The likelihood ratio test requires evaluation of different likelihood functions (full model and reduced model). The Wald test is much simpler than the likelihood ratio test statistic because it requires only equation M63 and the estimated parameters to form the test statistics. Let equation M64 be the subset of matrix equation M65, the Wald test statistics for the null hypothesis, equation M66 is

equation M67

Recall that equation M68 represents both the additive and dominance effects. Under the null hypothesis, the Wald test follows approximately the Chi-square distribution with two degrees of freedom. Therefore, the Wald test is comparable to the likelihood ratio test statistics (McCullagh and Nelder 1989). When testing a single parameter (either a or d), the Wald test statistics is equivalent to the Chi-square test with one degree of freedom. Therefore, some people used the Wald test and the Chi-square test interchangeably (e.g., Han and Xu 2008).

Applications

Binary traits

This example demonstrates the application of the generalized linear model to QTL mapping for binary traits in wheat. The experiment was conducted by Dou et al. (2009) who made the data available to us for this analysis. A female sterile line XND126 and an elite cultivar Gaocheng 8901 with normal fertility were crossed for genetic analysis of female sterility measured as a binary trait. The parents and their F1 and F2 progeny were planted at Huaian experimental station in China for the 2006–2007 growing season under the normal autumn sowing condition. The mapping population was an F2 family consisting of 234 individual plants. The binary trait was the presence of seed setting of the female plants. About five-sixth of the F2 progeny had seeded splikelets (phenotype 1) and the remaining one-sixth plants did not have seeded splikelet (phenotype 0). This is a typical binary trait regarding the presence of seeds. Among the plants that had seeded spikelets, the number of seeded spikelets varied and can be treated as a binomial trait for further QTL analysis (see Sect. “Binomial traits” for binomial trait QTL mapping). A total of 28 SSR markers were used in this experiment. These markers covered 5 chromosomes of the wheat genome with an average genome marker density of 15.5 cM per marker interval. The 5 chromosomes are only part of the wheat genome.

These chromosomes were scanned for QTL of the binary trait using both the mixture model and the heterogeneous variance model. The LOD profiles of the two methods are shown in Fig. 1. Two QTL on chromosome 2 and one QTL on chromosome 5 have been detected with high LOD score. The chromosome-wise empirical threshold values are lower than LOD 3. With the chromosome-wise threshold values, we detected one more QTL on chromosome 1. The estimated QTL parameters are listed in Table 1. The two models (mixture model and heterogeneous variance model) produced very similar results.

Fig. 1
The LOD profiles for the female sterility (binary) trait of wheat. a The upper panel shows the LOD profile for the heterogeneous variance model; b the lower panel shows the LOD profile for the mixture model. The chromosomes are separated by the vertical ...
Table 1
The estimated QTL parameters of the wheat female sterility (binary) trait

Binomial traits

The same experiment conducted by Dou et al. (2009) also recorded the number of seeded spikelets and the total number spikelets for each plant. The ratio of the two records is a binomial trait. The same mapping population and the same linkage map were used also for the binomial trait QTL mapping. Again, both the mixture model and the heterogeneous variance model were used for the binomial trait analysis. Unfortunately, the mixture model failed to generate meaningful result. Therefore, the result of the mixture model analysis was not reported. In chromosome regions, where there were no QTL, the mixture model generated result similar to that of the heterogeneous variance model. For regions with large QTL, the mixture model approach failed to converge. The possible reason for the failure will be presented in Sect. “Discussion”. We now focus on the result of the heterogeneous variance model. The LOD profile is shown in Fig. 2. First, the pattern of the LOD profile is similar to that of the binary trait analysis. There is strong evidence that there is one QTL on chromosome 1 and two QTL on chromosome 2. The LOD score here for the binomial trait is not in the same scale as that of the binary trait. The highest LOD is about 1,000, almost a hundred times more than the binary trait. This inflation of the LOD reflects the increased power of the binomial data analysis than the binary data analysis. Regions of other chromosomes also show LOD score higher than the empirical chromosome-wise threshold values. These include chromosomes four and five. The estimated QTL parameters are listed in Table 2.

Fig. 2
The LOD score profile of the female sterility (binomial) trait in wheat under the heterogeneous variance model. The mixture model failed to converge to the correct solution and thus are not presented. The chromosomes are separated by the vertical gray ...
Table 2
Estimated QTL parameters for the wheat female sterility (binomial) trait from the heterogeneous variance model analysis

Poisson traits

This example demonstrates the application of the generalized linear model QTL mapping for traits with a Poisson distribution. The data were provided by Dr. Gangqiang Cao at Zengzhou University, China. The result has not been published in any form. The mapping population was a double haploid family of the rice initiated from the cross between IR64 and Azucena. The trait analyzed was the tiller number with an assumed Poisson distribution. The sample size was n = 110 and the number of markers was 175. These markers covered 12 chromosomes (2,031 cM in total length) of the rice genome with an average marker interval of 11.6 cM. This dataset was different from that used by Cui et al. (2006). Both experiments were initialized from the same line cross with the same linkage map, but the experiments were conducted in different times and different locations by different investigators. Interval mappings under both the mixture model and the heterogeneous variance model were applied to the data. The LOD score profiles obtained from the two different methods are depicted in Fig. 3. The LOD score from the heterogeneous variance model is slightly higher than that of the mixture model, but the difference is almost negligible. If LOD = 3 was used as the criterion for significance, two QTL would have been detected on chromosomes 1 and 4. We used the quick method of Piepho (2001) to calculate the empirical threshold for each chromosome and used these thresholds to declare significance of QTL. The LOD thresholds are substantially less than LOD 3. Using the chromosome-specific thresholds, we detected one more QTL at the end of chromosome 12. The estimated QTL parameters are listed in Table 3. The supporting interval for each estimated QTL position was determined by the one-LOD drop approach (Ooijen 1992). The two methods differ slightly in the estimated QTL positions and the supporting intervals. The supporting intervals of QTL positions for the heterogeneous variance model were consistently shorter than those of the mixture model. Overall, the difference between the two methods is very small and can be safely ignored for this data analysis.

Fig. 3
LOD profiles for the tiller number of rice (Poisson trait) QTL mapping. a The upper panel shows the LOD profile for the heterogeneous variance model; b the lower panel shows LOD profile for the mixture model. The chromosomes are separated by the vertical ...
Table 3
Estimated QTL parameters using the mixture model and the heterogeneous variance model for the tiller number (Poisson) trait QTL mapping experiment

Simulation studies

Binomial traits

This simulation experiment was to demonstrate the difference between the mixture model and the heterogeneous variance model for binomial trait QTL mapping. For the female sterility trait of wheat, the mixture model approach failed to converge for the binomial data analysis but succeeded for the binary data. Binomial data were supposed to be more informative than the binary data, but more information turned out to do more harm than good to the mixture model, a typical problem of inconsistency has occurred. The problem happened in the calculation of the posterior probability of QTL genotype. The binomial density is equivalent to the product of multiple independent Bernoulli trials. This density is extremely sensitive to the change of genotype from one form to another, especially when the number of trials is large. The super sensitiveness of binomial density to the QTL genotype change led to degeneracy of the posterior probability of QTL genotype. We conducted a small-scale simulation experiment to demonstrate the problem of the mixture model. We simulated a QTL at position 55 cM of a chromosome with 100 cM in length. Five markers were placed evenly on the chromosome with 20 cM per marker interval. The sample size was 500 for an F2 population. The binomial trait phenotype was simulated for each plant with a constant number of trials for all plants. We simulated the following number of trials in four experiments: 20, 40, 80 and 160. The LOD score profiles obtained from the mixture model and the heterogeneous variance model are shown in Fig. 4 for all the four experiments. We can see that when the number of trials were 20 and 40, the LOD profiles of the two methods are very similar. As the number of trials increased to 80 and 160, the differences between the two models are dramatic, with LOD profile of the mixture model drastically deviated from that of the heterogeneous variance model when the putative QTL position off the marker position. The strange leaf-like pattern of the mixture model LOD profile reflects the instability and failure of the mixture method.

Fig. 4
Comparison of the LOD profiles for the mixture model (dotted line) and the heterogeneous variance model (solid line) for a binomial trait of a simulation experiment. The sample size was 500 and the chromosome size was 100 cM long covered by 6 ...

Poisson traits

This simulation study was to evaluate the differences between the EM implemented mixture model and the heterogeneous residual variance model of interval mapping for Poisson traits. The simulation experiments were much more comprehensive than the previous one. The criteria of evaluation include the statistical powers, the test statistic profiles, the estimation errors of QTL parameters, the biases of the estimation and the computational times. The factors considered are the marker density (A), size of the QTL (B), mean of the trait (C), the sample size (D) and the QTL position (E).

A single chromosome with 100 cM in length was simulated for an F2 population derived from the cross of two inbred lines. The chromosome was covered evenly by the following number of markers: 101, 51, 21, 11 and 6. These correspond to 1, 2, 5, 10 and 20 cM per marker interval. The size of the QTL (additive effect a) was investigated at the following five levels: 0.142, 0.324, 0.471, 0.707 and 0.926. These five levels of the additive effects correspond to the following levels of the heritability (h2): 0.01, 0.05, 0.10, 0.20 and 0.30. For simplicity, dominance effect was not simulated and also not included in the model for data analysis. The mean of the Poisson trait was determined by the non-QTL effect β (intercept, a scalar because no other non-QTL effects were simulated). The five levels of the intercept (β) were −1.0, −0.5, 0.0, 0.5 and 1.0. The sample size was investigated at the following five levels: 100, 200, 300, 500 and 600. The simulated QTL was located at the following different positions: 10, 25, 50, 75 and 90 cM. The true QTL parameter values and the experimental parameters for this comprehensive simulation experiment are summarized in Table 4.

Table 4
Annotation of the simulation experiments

The total number of combinations for all the five factors is 55 = 3,125. If each combination was simulated 1,000 times, the work load would be very intensive. Therefore, we decided to evaluate a small subset of the treatment combinations to draw conclusions. We chose sample size n = 300, marker interval 5 cM, QTL size equation M69 non-QTL effect β = 0.0 and QTL position at 50 cM as the basic experimental setup (central level for each of the five factors). Under the basic setup of the experiment, we then evaluated one factor at a time by expanding to all the five levels for the factor of interest with the remaining factors held at the basic levels. This reduced the total number of experiments from 3,125 to 5 × 4 + 1 = 21. Each of the 21 treatment combinations was replicated 1,000 times to examine the empirical statistical powers and estimation errors of parameters for comparisons of the two models. The critical value for QTL detection was determined by simulating additional 1,000 samples under the null QTL model (a = 0.0) for each of the treatment combinations. The empirical power for each experiment was the proportion of the samples in which the QTL was detected out of the 1,000 replicates.

Before we discuss the biases and errors of the estimated QTL parameters, let us look at Fig. 5 which shows the average LOD test statistic profiles from the 1,000 replicates for all the experiments. The solid and dotted curves represent the LOD score profiles for the heterogeneous variance model and the mixture model, respectively. The straight horizontal lines are the critical values used to declare QTL significance for power studies. These critical values were drawn from 1,000 additional simulations under the null model (a = 0). Figure 5 provides a rich source of information regarding the behaviors of the two models. We only highlight a few important points here. First, at marker positions, the LOD scores for the heterogeneous variance model and the mixture model are identical, as expected, but off the markers the mixture model consistently produced higher LOD scores than the heterogeneous variance model. The higher LOD scores of the mixture model did not translate into higher powers because the critical values for the mixture model were also higher than the heterogeneous variance model. Second, the leaf-like patterns of the LOD score profiles of the mixture model became more severe when the marker interval was increased. This reflected an intrinsic flaw of the mixture model. In contrast, the heterogeneous variance model produced much more smoothed LOD profiles.

Fig. 5
Average LOD score profiles of the simulation experiments of QTL mapping for a Poisson trait. The solid curve represents the average LOD score profile of the heterogeneous variance model while the dotted line represents the average LOD score profile of ...

We now compare the empirical statistical powers for the two models (see the top panels in Fig. 6). The solid and open symbols represent the powers for the heterogeneous variance model and the mixture model, respectively. In all cases, the heterogeneous variance model had a slightly higher power than the mixture model, although the difference was almost negligible. Some factors had large effects on the powers and others had small effects on the powers. In general, the power declined as the marker interval increased. Sample size and the size of QTL are most influencing factors on the powers. Intercept and position of the true QTL had little influence on the powers.

Fig. 6
Estimated QTL parameters and empirical powers of the simulation experiments of QTL mapping for a Poisson trait. Labels A, B, C, D and E represent five treatments (factors). Within each treatment, there are five different levels represented by levels 1 ...

Let us turn into the panels in the second row of Fig. 6 to examine the factors that affect the estimated QTL positions. The solid and open symbols indicate the average estimated QTL positions for the two models obtained from the 1,000 replications. The vertical bars represent the standard deviations of the estimated QTL position calculated from the 1,000 replicated experiments. The red horizontal lines indicate the true positions of the simulated QTL. Both models are unbiased and the standard deviations of the estimated positions are very similar for the two models. The size of the QTL and the sample size had the largest influence on the accuracy of the estimated QTL position.

Panels of the third row of Fig. 6 show the average estimated intercepts and the standard deviations obtained from the 1,000 replicated simulations. The two models had very similar estimated intercepts, but both models were biased upwardly.

The panels at the bottom of Fig. 6 show the results of the estimated QTL effect (a) for the two models. The heterogeneous model was unbiased but the mixture model was consistently biased upwardly. The bias for the mixture model was more serious as the marker interval increased.

Overall, the heterogeneous variance model performed consistently better than the mixture model. This observation was not expected. Our original purpose of proposing the heterogeneous variance model was to improve the computational efficiency. We did not anticipate that the mixture model was not consistent and performed poorly. The simulation experiments did show that the heterogeneous variance model took, on average, about one-third of the computing time of the mixture model. To our surprise, the heterogeneous variance model outperformed the mixture model in all cases considered in the experiments.

Discussion

The most commonly used algorithm for missing values in generalized linear model is the EM algorithm (Horton and Laird 1998). We extended the EM algorithm to interval mapping, where the independent variable (genotype indicator variable) is missing for all individuals. We also developed a heterogeneous variance model to approximate the mixture model. Both the binary data and the Poisson data analyses showed that the two methods generated similar results, but the heterogeneous variance model is computationally faster than the mixture model-based EM algorithm. To our surprise, the mixture model approach is not always better than the heterogeneous variance model in terms of better estimation of QTL parameters. The binomial data analysis showed that the mixture model approach failed to converge to the correct values for some marker intervals while the heterogeneous variance model worked very well. The failure is due to several possible reasons: (1) low information content for large marker intervals, (2) large QTL effects and (3) inconsistency of the EM algorithm. For the first reason, a large marker interval means that QTL position in the middle of the interval has little information regarding the genotype of the putative QTL. The posterior probability of QTL genotype is largely determined by the data and the current parameter values. In the end, the posterior probabilities become degenerated (probability equals unity for one genotype and zero for all other genotypes) for all individuals. This phenomenon was observed only for large intervals. The second reason of the failure is due to large QTL effects. We noticed that the failure of the mixture model approach did not happen in intervals that do not have QTL, even though some of those intervals are very large. It is the combination of large interval with large QTL effects that caused the failure. The third reason of the failure is due the inconsistency (intrinsic flaw) of the mixture model. More extensive simulation experiments using the Poisson data further justified the heterogeneous variance model for interval mapping.

Most QTL mapping experiments in the future will be done with a high marker density. In that case, the mixture model and heterogeneous variance model will have negligible difference in efficiency, with the latter slightly more preferable than the former due to its light computational load. Theory and methods of interval mapping under the mixture model are well developed both for normally distributed traits and for discrete traits. Variance–covariance matrix for the estimated parameters is also available, but only for normally distributed trait interval mapping (Kao and Zeng 1997). This research is the first attempt to develop the covariance matrix of estimated QTL parameters under the generalized linear mixture model. With the availability of the variance–covariance matrix of estimated QTL effects, we have a choice to perform either the Wald test or the likelihood ratio test. The two test statistics are asymptotically the same, with the latter slightly more preferable when the sample size is small (Frank 2001).

The approximate heterogeneous variance model was originally developed by one of us for normal trait QTL mapping (Xu 1998). We demonstrated here that it worked equally well for the generalized linear model. Simulation experiments and real data analysis all demonstrated that the approximation is very close to, and sometimes better than, the mixture model. The advantage of the approximate method is the avoidance of mixture model and thus avoidance of the EM algorithm. As a result, computing the variance–covariance matrix of the estimated parameters becomes straightforward, a by-product of the iteration process. In addition, the heterogeneous variance model appears to be more robust and stable compared with the mixture model-based EM algorithm according to our binomial data analysis and the simulation study.

Both the mixture model maximum likelihood and the heterogeneous variance approximation have been coded in an existing QTL mapping program called PROC QTL (Hu and Xu 2009). This program is a user-defined SAS procedure. Users need to specify the distribution of the data. The default distribution is normal. The current version of PROC QTL can handle binary, binomial, ordinal and Poisson distributions. More distributions will be added in the future, depending on the availability of the data. The program is downloadable from our Web site (http://www.statgen.ucr.edu).

Acknowledgments

We are grateful to Dr. Gangqiang Cao at Zhengzhou University in China for his generality of sharing the tiller number data with us. We also appreciate the help from two anonymous reviewers for their constructive comments on an early version of the manuscript. This project was supported by the National Research Initiative (NRI) Plant Genome of the USDA Cooperative State Research, Education and Extension Service (CSREES) 2007-02784 to SX.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Appendix A: partial derivatives

(1) Ordinal data with observed genotypes

Let g = 1, 2, 3 index the three genotypes for an F2 population derived from the cross of two inbred lines. The expectation of Yjk conditional on the parameters for genotype g is

equation M70

Define equation M71 as a equation M72 vector for the expectation of Yj. The D matrix for genotype g is

equation M73

where

equation M74
equation M75

and

equation M76

(2) Ordinal data under the heterogeneous variance model (approximation)

The expectation of Yjk conditional on parameters is

equation M77

Let equation M78 be a equation M79 vector for the expectation of Yj. The D matrix is defined as

equation M80

where

equation M81
equation M82

and

equation M83

(3) Binary data with observed genotypes

Let g = 1, 2, 3 index the three genotypes and define the genotype-specific expectation of Yj by

equation M84

The D matrix for genotype g is

equation M85

where

equation M86

and

equation M87

(4) Binary data under the heterogeneous variance model

Define the expectation of Yj by

equation M88

The D matrix is

equation M89

where

equation M90

and

equation M91

(5) Poisson data with observed genotypes

The expectation of Yj given genotype g is

equation M92

The D matrix is

equation M93

where

equation M94

(6) Poisson data under the overdispersion model (approximation)

Denote the expectation of Yj by

equation M95

The D matrix is

equation M96

where

equation M97

Appendix B: derivation of the EM algorithm

Since the ordinal phenotype has been presented as multivariable Bernoulli variable and the parameter vector has been partitioned into three blocks, it is a little bit tedious to derive the EM algorithm for parameter estimation. Therefore, we only used the Poisson data as an example for the derivation. The derivation applies to generalized linear model for any distributions. Let Gj = 1, 2, 3 be a discrete variable for the genotype of individual j. Let equation M98 be a vector of three indicator variables of the genotypes for individual j following a multinomial distribution with sample size one. The relationship between Gj and δj is

equation M99

for g = 1, 2, 3. When the genotype of individual j is known, the complete-data log likelihood function for the parameters is

equation M100

where

equation M101

is the Poisson probability. The expectation of the complete-data log likelihood function is

equation M102

where

equation M103

is the posterior probability of δ(Gj, g) = 1 given θ = θ(t). Now pj*(g) is treated as a constant (not a function of the parameters because the unknown parameter involved in pj*(g) has been replaced by the parameter value at iteration t). The EM algorithm actually maximizes the expectation of the complete-data log likelihood function, not the original observed log likelihood function. The maximum likelihood estimation in the neighborhood of θ = θ(t) can be obtained through Taylor expansion around θ = θ(t) (the Newton–Raphson method),

equation M104

Now we only need to prove that

equation M105

and

equation M106

The partial derivative of the complete-data log likelihood function with respect to the unknown parameters is

equation M107

The second partial derivative is

equation M108

where

equation M109

Given that

equation M110

(see Eqs. 61 and 62 in Appendix A) and

equation M111

(see Eq. 39 in the main text), we have

equation M112

Substituting Xj and Hg in Eqs. (74) and (76) by Eq. (79), we have

equation M113

and

equation M114

Therefore,

equation M115

and

equation M116

This concludes the derivation of the EM algorithm.

Appendix C: generalized inverse of variance matrix

For ordinal traits, the observed data point Yj for individual j is multinomial with sample size one. Therefore, the variance–covariance matrix is

equation M117

where equation M118 This variance matrix can be rewritten in a general form as

equation M119

where c = −1. The inverse of this general form has an explicit expression (Giri 1996),

equation M120

The fact that equation M121 leads to

equation M122

Therefore, the inverse matrix can be written as

equation M123

Since 1/(1 + c) is not defined when c = −1, the inverse matrix Vj−1 does not exist.

We now prove that equation M124 is a generalized inverse of matrix Vj. To prove this, we only need to show that equation M125 because a generalize inverse Vj is defined as a matrix such that equation M126. Substituting Vj by equation M127, we get

equation M128

Keep in mind that the above derivation requires equation M129 and equation M130 for simplification. This concludes the proof that equation M131 is a generalized inverse of matrix Vj.

In fact, equation M132 is just one of an infinite number of generalized inverses of matrix Vj. A general form of the generalized inverse can be expressed by

equation M133

where d is a real number (a scalar not a matrix). The generalized inverse equation M134 is simply obtained by setting d = 0. The following equation serves as a proof of this general form of the generalized inverse,

equation M135

Appendix D: logistic regression

We now use binary trait QTL mapping as an example to demonstrate the logit link function of the generalized linear model. Recall that the univariate definition of the binary phenotype is

equation M136

Under the logit link function, the expectation and variance of the phenotype given the parameter values are

equation M137

and

equation M138

respectively. The probability density is

equation M139

The link function is logit because

equation M140

Genotype observed

Let g = 1, 2, 3 index the three genotypes and

equation M141

be the genotype specific expectation. Let

equation M142

and

equation M143

The D matrix for genotype g is defined as

equation M144

where

equation M145

and

equation M146

Heterogeneous variance model

Let us define

equation M147

The expectation of Yj is

equation M148

Define

equation M149

The D matrix is

equation M150

where

equation M151

and

equation M152

References

  • Box GEP, Cox DR. An analysis of transformations. J R Stat Soc B. 1964;26:211–252.
  • Cui Y, Yang W. Zero-inflated generalized Poisson regression mixture model for mapping quantitative trait loci underlying count trait with many zeros. J Theor Biol. 2009;256:276–285. doi: 10.1016/j.jtbi.2008.10.003. [PubMed] [Cross Ref]
  • Cui Y, Kim DY, Zhu J. On the generalized poisson regression mixture model for mapping quantitative trait loci with count data. Genetics. 2006;174:2159–2172. doi: 10.1534/genetics.106.061960. [PubMed] [Cross Ref]
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977;39:1–38.
  • Deng W, Chen H, Li Z. A logistic regression mixture model for interval mapping of genetic trait loci affecting binary phenotypes. Genetics. 2006;172:1349–1358. doi: 10.1534/genetics.105.047241. [PubMed] [Cross Ref]
  • Dou B, Hou B, Xu H, Lou X, Chi X, Yang J, Wang F, Ni Z, Sun Q. Efficient mapping of a female sterile gene in wheat (Triticum aestivum L.) Genet Res. 2009;91:337–343. doi: 10.1017/S0016672309990218. [PubMed] [Cross Ref]
  • Frank EH. Regression modeling strategies. New York: Springer; 2001.
  • Giri NC. Multivariate statistical analysis. New York: Marcel Dekker, Inc; 1996.
  • Hackett CA, Weller JI. Genetic mapping of quantitative trait loci for traits with ordinal distributions. Biometrics. 1995;51:1252–1263. doi: 10.2307/2533257. [PubMed] [Cross Ref]
  • Han L, Xu S. A Fisher scoring algorithm for the weighted regression method of QTL mapping. Heredity. 2008;101:453–464. doi: 10.1038/hdy.2008.78. [PubMed] [Cross Ref]
  • Horton NJ, Laird NM. Maximum likelihood analysis of generalized linear models with missing covariates. Stat Methods Med Res. 1998;8:37–50. doi: 10.1191/096228099673120862. [PubMed] [Cross Ref]
  • Hu Z, Xu S (2009) PROC QTL—A SAS procedure for mapping quantitative trait loci. Int J Plant Genomics 2009:3. doi:10.1155/2009/141234 [PMC free article] [PubMed]
  • Ibrahim JG, Chen M-H, Lipsitz SR. Bayesian methods for generalized linear models with missing covariates. Can J Stat. 2002;30:55–78. doi: 10.2307/3315865. [Cross Ref]
  • Ibrahim JG, Chen M-H, Lipsitz SR, Herring AH. Missing-data methods for generalized linear models: a comparative review. J Am Stat Assoc. 2005;100:332–346. doi: 10.1198/016214504000001844. [Cross Ref]
  • Jiang C, Zeng ZB. Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines. Genetica. 1997;101:47–58. doi: 10.1023/A:1018394410659. [PubMed] [Cross Ref]
  • Kao CH, Zeng ZB. General formulas for obtaining the MLEs and the asymptotic variance–covariance matrix in mapping quantitative trait loci when using the EM algorithm. Biometrics. 1997;53:653–665. doi: 10.2307/2533965. [PubMed] [Cross Ref]
  • Lander ES, Botstein D. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121:185–199. [PubMed]
  • Lange C, Whittaker JC. Mapping quantitative trait loci using generalized estimating equations. Genetics. 2001;159:1325–1337. [PubMed]
  • Louis T. Finding the observed information matrix when using the EM algorithm. J R Stat Soc B. 1982;44:226–233.
  • McCullagh P, Nelder JA. Generalized linear models. New York: Chapman & Hall; 1989.
  • Ooijen JW. Accuracy of mapping quantitative trait loci in autogamous species. Theor Appl Genet. 1992;84:803–811. doi: 10.1007/BF00227388. [PubMed] [Cross Ref]
  • Piepho HP. A quick method for computing approximate thresholds for quantitative trait loci detection. Genetics. 2001;157:425–432. [PubMed]
  • Rao SQ, Xu S. Mapping quantitative trait loci for categorical traits in four - way crosses. Heredity. 1998;81:214–224. doi: 10.1038/sj.hdy.6883780. [PubMed] [Cross Ref]
  • Robins JM, Ritov Y. On double robustness. Stat Sin. 2001;11:920–936.
  • Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
  • SAS/STAT users’ guide. Cary: SAS Publishing; 1999.
  • Searle SR. Linear models. New York: Wiley; 1997.
  • Wald A. Note on the consistency of the maximum likelihood estimate. Ann Math Stat. 1949;20:595–601. doi: 10.1214/aoms/1177729952. [Cross Ref]
  • Wedderburn RWM. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika. 1974;61:439–447.
  • Xu S. Iteratively reweighted least squares mapping of quantitative trait loci. Behav Genetics. 1998;28:341–355. doi: 10.1023/A:1021617618150. [PubMed] [Cross Ref]
  • Xu S, Atchley WR. Mapping quantitative trait loci for complex binary diseases using line crosses. Genetics. 1996;143:1417–1424. [PubMed]
  • Xu S, Yi N, Burke D, Galecki A, Miller RA. An EM algorithm for mapping binary disease loci: application to fibrosarcoma in a four-way cross mouse family. Genet Res. 2003;82:127–138. doi: 10.1017/S0016672303006414. [PubMed] [Cross Ref]
  • Xu C, Zhang Y-M, Xu S. An EM algorithm for mapping quantitative resistance loci. Heredity. 2005;94:119–128. doi: 10.1038/sj.hdy.6800583. [PubMed] [Cross Ref]
  • Yi N, Xu S. Mapping quantitative trait loci for complex binary traits in outbred populations. Heredity. 1999;82:668–676. doi: 10.1046/j.1365-2540.1999.00529.x. [PubMed] [Cross Ref]
  • Yi N, Xu S. A random model approach to mapping quantitative trait loci for complex binary traits in outbred populations. Genetics. 1999;153:1029–1040. [PubMed]
  • Yi N, Xu S. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics. 2000;155:1391–1403. [PubMed]

Articles from Springer Open Choice are provided here courtesy of Springer