Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2875332

Formats

Article sections

- Summary
- 1. Introduction
- 2. Single QTL Mapping
- 3. Multiple QTL Mapping
- 4. Simulations and Real Data Analysis
- 5. Discussion
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 May 25.

Published in final edited form as:

Published online 2009 May 12. doi: 10.1111/j.1541-0420.2009.01268.x

PMCID: PMC2875332

NIHMSID: NIHMS117741

Email: ude.cnu.soib@uozf

The publisher's final edited version of this article is available at Biometrics

In linkage analysis, it is often necessary to include covariates such as age or weight to increase power or avoid spurious false positive findings. However, if a covariate term in the model is specified incorrectly (e.g., a quadratic term misspecified as a linear term), then the inclusion of the covariate may adversely affect power and accuracy of the identification of Quantitative Trait Loci (QTL). Furthermore, some covariates may interact with each other in a complicated fashion. We implement semiparametric models for single and multiple QTL mapping. Both mapping methods include an unspecified function of any covariate found or suspected to have a more complex than linear but unknown relationship with the response variable. They also allow for interactions among different covariates. This analysis is performed in a Bayesian inference framework using Markov chain Monte Carlo. The advantages of our methods are demonstrated via extensive simulations and real data analysis.

Unlike monogenic traits where success in associating genotype to phenotype is assured, complex traits pose significantly greater challenges. Parametric genetic mapping using experimental populations, such as backcrosses, F2 intercross or Recombinant Inbred Lines (RIL) have been well developed during the past 15 years (see Doerge et al., 1997 for an introduction to QTL mapping in inbred line crosses). Many excellent open source software packages, such as QTLCart (Basten et al., 1999), MapManager (Manly and Olson, 1999), MapMaker (Lincoln et al., 1993), and R/qtl (Broman et al., 2003) are freely available on-line. These QTL mapping packages often model the effects of non-genetic covariates linearly, for example,

$$y=\mu +\beta x+\zeta t+e,$$

(1)

where *y* is the measured quantitative trait; *t* is the non-genetic covariate; *x* is the genetic factor, and *e* is the random error term. However, in practice, the QTL position is unknown, resulting in missing *x* (missing for all individuals).

Although most available QTL mapping methods only map one or a few QTL at a time and are not efficient for complex trait mapping, recently multiple QTL have been mapped simultaneously by treating QTL mapping as a large-scale variable selection problem: for example, for a backcross population and with *q* potential QTL positions (selected grid of positions across the genome), where *q* is in the hundreds or thousands and typically (much) larger than the sample size, there are 2* ^{q}* possible main effect models. Variable selection methods are needed that are capable of selecting variables that are not necessarily all individually important but rather together important. By treating multiple quantitative trait locus (QTL) mapping as a model/variable selection problem (Broman and Speed, 2002), forward and step-wise selection procedures have been proposed to search for multiple QTL. Although simple, these methods have their limitations, such as uncertainty about the number of QTL, the sequential model building that makes it unclear how to assess the significance of the associated tests, etc. Bayesian QTL mapping methods (Satagopan, 1996; Sillanpää and Arjas, 1998; Stephens and Fisch, 1998; Yi and Xu, 2000, 2001; Hoeschele, 2007) have been developed, in particular, for the detection of multiple QTL by treating the number of QTL as a random variable and by specifically modeling it using reversible jump Markov chain Monte Carlo (MCMC) (Green, 1995). Due to the variable dimensionality of the parameter spaces associated with different models (different numbers of QTL), care must be taken in determining the acceptance probability for such changes in dimension, which in practice may not be handled correctly (Ven, 2004). To avoid this problem, another leading approach to variable selection in QTL analysis implemented by MCMC is based on the composite model space framework (Godsill, 2001, 2003) and has been introduced to genetic mapping by Yi (2004). Bayesian variable selection methods such as reversible jump MCMC (Green, 1995) and stochastic search variable selection (SSVS) (George and McCulloch, 1993) are special cases of this framework. A modification that treats (variance) hyperparameters as unknown was recently found to produce a better mixing MCMC sampler for multiple QTL mapping (Yi et al., 2007). Recently, Yi and Xu (2008) have developed a Bayesian LASSO (Tibshirani, 1996) for QTL mapping.

In some studies, however, the relationship between *y* and *t* may not be linear. In their study of the metabolic syndrome, McQueen et al (2003) have found a nonlinear effect of alcohol consumption on the quantitative traits they investigated. Incorrect modeling of the covariate effect may adversely affect power and accuracy of QTL identification. Semiparametric regression modeling, where *ζt* in (1) is replaced by an unspecified function *η*(*t*), has attracted considerable attention in the statistical literature. When *x* is observed, model (1) reduces to the semiparametric regression model, which is well investigated in the spline literature as well as in the kernel regression literature. Examples for spline regression include Wahba (1984), Heckman (1986), Chen (1988), Speckman (1988), Cuzick (1992), Hastie and Loader (1993) and Mays (1995) while examples for kernel regression include Härdle (1990), Wand and Jones (1995), and Fan (1992). Spline regression requires a penalty weight to balance between goodness-of-fit and complexity. To account for the non-linear effect of the alcohol consumption, McQueen et al. (2003) categorized the alcohol consumption into five non-overlapping groups in their linear regression analysis, which is essentially a special form of spline regression, so-called local polynomial regression. Kernel regression, on the other hand, needs a bandwidth to determine the degree of localness and smoothness of *η*. The choice of a bandwidth, and not the choice of a kernel function, is critical for the performance of the nonparametric fit (Härdle, 1990). Bayesian approaches to semiparametric regression have also been developed. Bayesian nonparametric methods achieve flexibility by putting priors on distribution spaces, corresponding to infinitely dimensional parameterizations. The leading Bayesian methods include Dirichlet process (Müller, et al. 1996), splines (Smith and Kohn, 1996; Denison et al., 1998; DiMatteo et al., 2001), wavelets (Abramovich et al., 1998) and Gaussian process (Neal, 1997, 1996). Gaussian process priors date back to at least O’Hagan (1978) and have a large support in the space of all smooth functions through an appropriate choice for the covariance kernel. Gaussian process is particularly flexible for curve estimation because of their flexible sample path shapes. Wahba (1978) has shown that for an appropriate choice of the covariance kernel of the Gaussian process, the Bayesian estimator is a smoothing spline. However, Gaussian process better suited for modeling with multiple (even many) covariates than the smoothing spline approach.

Applying spline regression and kernel regression techniques to semiparametric interval QTL mapping is challenging, especially when mapping multiple QTL, due to the missing QTL genotypes. The Bayesian approach, however, is very flexible in handling missing data. In this paper, we propose novel Bayesian methods for interval QTL mapping of a single QTL and of multiple QTL which incorporate an unspecified function of a single covariate or multiple covariates using a Gaussian process prior. The rest of the paper is organized as follows. We first introduce the underlying semiparametric QTL model for single QTL mapping and the general development of the Markov Chain Monte Carlo (MCMC) sampler in Section 2. We then develop semiparametric multiple QTL mapping in Section 3. Simulation results are presented in Sections 4.1 amd 4.2, and the analysis of a real data set is presented in Section 4.3. We end the paper with comments and conclusions in Section 5.

Quantitative traits are usually controlled by both genetic and environmental factors, such as diet and exposure to chemical toxicities. When studying natural populations, like humans, it is difficult to separate environmental and genetic effects. With experimental organisms, uniform genetic backgrounds and controlled breeding schemes can avoid environmental variability which may obscure genetic effects. It is considerably easier to map quantitative traits with experimental populations than with natural populations. For this reason, crosses between completely inbred lines are often used for detecting QTL. QTL line-cross analysis has been widely applied in the plant sciences. It has also been used successfully in a number of animal species, such as mice and rats (Stoehr et al., 2000; Lan et al., 2001). Because of the homology between humans and rodents, animal models are extremely useful in helping us to understand human diseases.

The backcross (BC) and F2 intercross are two of the most popular mapping populations in QTL studies. Suppose two inbred parents (P1 and P2) differ in some quantitative traits. At each locus, we label the allele of parent P1 as *a* and the allele of P2 as *A*. An F1 generation is completely heterozygous with genotype *Aa* at all loci, receiving one allele from each parent. Thus, there is no segregation in F1 individuals. A BC population is generated when F1 is crossed back with one of its parents, for example, P2. At each locus, every BC individual has equal probability of 1/2 to be *Aa* or *AA*, respectively. Thus there is segregation in BC since BC individuals are no longer genetically identical at each locus. Similarly, crossing F1 individuals generates an F2 population in which each individual has probability 1/4, 1/2 and 1/4 of being *aa*, *Aa* and *AA*, respectively. The combination of the two alleles in an individual is a genotype, and the genotypes can be determined at a number of loci, called marker loci (or genetic markers), throughout the genome. These loci are often not the functional loci (QTL) that we wish to identify and whose genotypes can generally not be measured but rather inferred (with some uncertainty) from the genotypes at nearby markers.

The QTL data available include the trait values or phenotypes (the dependent variable) *y _{i}* (

The single QTL model assumes that there is only one QTL affecting the trait according to the linear regression model in Equation (1). If the QTL genotypes are observed, QTL mapping is a simple linear regression problem. However, in practice, the QTL position is rarely known and the genotypes are typically unobserved, resulting in all missing *x _{i}*s. The idea behind interval mapping (Lander and Botstein, 1989) is that at any putative QTL position located in an interval between two marker loci (typically on an evenly spaced, genome-wide grid), we can compute the probabilities of the unobserved QTL genotypes for each individual given its genotypes at the pair of closest flanking marker loci (see chapter 15 of Lynch and Walsh, 1998). The distribution of the quantitative trait given the marker genotypes thus follows a finite mixture model. Under this model, the null (alternative) hypothesis for a putative QTL position is that its genotypes do not (do) affect the phenotype of interest. The null hypothesis is evaluated via the likelihood ratio, and a plot of likelihood ratios versus putative QTL positions is called a log-likelihood (ratio) profile. In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest log-likelihood ratio.

When a parametric model is specified incorrectly, estimation bias can result, and the incorrect or inefficient modeling of the covariate effect may adversely affect the power and precision of the QTL identification. To alleviate this problem, we propose a semiparametric model of the form

$${y}_{i}=\beta {x}_{i}+\eta ({t}_{i})+{e}_{i},\phantom{\rule{0.16667em}{0ex}}i=1,\cdots ,n,$$

(2)

where *x _{i}* is the indicator of the QTL genotype (e.g., with values −1 and 1 depending on whether the QTL genotype is

As all our inference is performed conditional on the marker genotypes, i.e. on ** M** = {

A prior probability density *p*(*η*|*t*) is induced by using a Gaussian process. A Gaussian process is a stochastic process such that each finite dimensional distribution is multivariate normal. Thus any Gaussian process is specified by its mean function and covariance kernel. A wide array of functions can be derived as sample paths of a Gaussian process.

Let *t*_{1}, *t*_{2}, ···, *t*_{k} be the set of distinct covariate values and *d _{j}* the number of occurrences of

To specify a Gaussian process prior with
$\mathit{\mu}={\{\mu ({t}_{j})\}}_{j=1}^{k}$ and **Σ**, one may consider some parametric forms for the functional parameters while putting priors on the hyper-parameters. To build the prior around a parametric family, we consider

$$\mu (t;\mathit{\alpha})={\alpha}_{1}{f}_{1}(t)+\cdots +{\alpha}_{l}{f}_{l}(t),$$

(3)

where *l* is a fixed integer, and {*f*_{1}, ···, *f _{l}*} are known (for example polynomial) functions in

In practice, reasonable choices of conjugate prior distributions for *τ* and ** α** are an inverse Gamma distribution on

$$\begin{array}{l}\tau \sim \mathit{inverse}\phantom{\rule{0.16667em}{0ex}}\mathit{Gamma}(a,b),\\ \mathit{\alpha}\sim N({\mathit{\alpha}}_{0},\mathrm{\Gamma}),\\ \eta \mid \mathit{\alpha},\tau \sim \mathit{Gaussian}\phantom{\rule{0.16667em}{0ex}}\mathit{process}(\mathit{\mu},{\mathit{\sum}}_{0}/\tau )\end{array}$$

(4)

where **Σ**_{0} is the *k* × *k* matrix with element (*j*1, *j*2) equal to *σ*_{0}(*t _{j}*

The prior on the QTL position, *λ*, is non-informative with a uniform distribution over the entire genome. Given *λ*, for each individual *i*, the probability of its QTL genotype is a function of the genotypes of the two markers flanking *λ* and of the locations of the two flanking markers, and it is denoted by *p*(*x _{i}*|

Note that given the nonparametric component *η*, (2) becomes a linear model with *y _{i}* replaced by

First we initialize parameters and hyperparameters
${\sigma}_{e}^{2}$, *τ*, *β*, *λ*, *a* and *b*. Then we perform the following updating steps many times:

Sample *η*: Conditional on *η*, the *y _{i}*’s are independent normal random variables with variance
${\sigma}_{e}^{2}$ and mean

Sample ** α** and

$$\begin{array}{l}\mathit{\alpha}\mid \tau ,\eta ,\mathbf{y}=\mathit{\alpha}\mid \tau ,\eta \sim N({\mathit{\alpha}}_{0}^{\mathrm{\u2605}},{\mathrm{\Gamma}}^{\mathrm{\u2605}}),\\ \tau \mid \mathit{\alpha},\eta ,\mathbf{y}=\tau \mid \mathit{\alpha},\eta \sim \mathit{inverse}\phantom{\rule{0.16667em}{0ex}}\mathit{Gamma}({a}^{\ast},{b}^{\ast}),\end{array}$$

where ${\mathrm{\Gamma}}^{\ast}={(\tau {\mathit{F}}^{T}{\mathit{\sum}}_{0}^{-1}\mathit{F}+{\mathrm{\Gamma}}^{-1})}^{-1},{\mathit{\alpha}}_{0}^{\ast}=\tau {\mathrm{\Gamma}}^{\ast}{\mathit{F}}^{T}{\mathit{\sum}}_{0}^{-1}(\eta -\mathit{F}{\mathit{\alpha}}_{0})+{\mathit{\alpha}}_{0},{a}^{\ast}=a+k/2$ and ${b}^{\ast}=b+{(\eta -\mathit{F}\mathit{\alpha})}^{T}{\mathit{\sum}}_{0}^{-1}(\eta -\mathit{F}\mathit{\alpha})/2$.

Sample *β* from its conditional posterior distribution, which is normal
$N\left({\scriptstyle \frac{{\sum}_{i=1}^{n}{x}_{i}\{{y}_{i}-\eta ({t}_{i};\mathit{\alpha})\}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}},{\scriptstyle \frac{{\sigma}_{e}^{2}}{{\sum}_{i=1}^{n}{x}_{i}^{2}}}\right)$.

Sample
${\sigma}_{e}^{2}$ from the inverse Gamma distribution with parameters *n*/2 and
${\sum}_{i=1}^{n}{\{{y}_{i}-\beta {x}_{i}-\eta ({t}_{i};\mathit{\alpha})\}}^{2}/2$.

Sample QTL position *λ* and QTL genotypes ** x** jointly in a Metropolis-Hastings step detailed below. A new QTL position

$$q({x}_{i}^{\ast}\mid {\lambda}^{\ast})\stackrel{\mathit{def}}{=}\frac{p({y}_{i}\mid {x}_{i}^{\ast},\beta ,\eta ({t}_{i}))\xb7p({x}_{i}^{\ast}\mid {m}_{iL}({\lambda}^{\ast}),{m}_{iR}({\lambda}^{\ast}),{d}_{L}({\lambda}^{\ast}),{d}_{R}({\lambda}^{\ast}),{\lambda}^{\ast})}{{\sum}_{h\in \{-1,+1\}}p({y}_{i}\mid h,\beta ,\eta ({t}_{i}))\xb7p(h\mid {m}_{iL}({\lambda}^{\ast}),{m}_{iR}({\lambda}^{\ast}),{d}_{L}({\lambda}^{\ast}),{d}_{R}({\lambda}^{\ast}),{\lambda}^{\ast}),}$$

where *p*(*y _{i}*|

$$\gamma =\frac{p({\lambda}^{\ast},{\mathit{x}}^{\ast}\mid \mathbf{y},\beta ,\eta )}{p(\lambda ,\mathit{x}\mid \mathbf{y},\beta ,\eta )}\frac{q(\lambda ){\prod}_{i=1}^{n}q({x}_{i}\mid \lambda )}{q({\lambda}^{\ast}){\prod}_{i=1}^{n}q({x}_{i}^{\ast}\mid {\lambda}^{\ast})}$$

The ratio of the joint posteriors of QTL position and genotypes evaluated at the proposed and current values is

$$\frac{p({\lambda}^{\ast},{\mathit{x}}^{\ast}\mid \mathbf{y},\beta ,\eta )}{p(\lambda ,\mathit{x}\mid y,\beta ,\eta )}=\prod _{i=1}^{n}\frac{p({y}_{i}\mid {x}_{i}^{\ast},\beta ,\eta ({t}_{i}))\xb7p({x}_{i}^{\ast}\mid {m}_{iL}({\lambda}^{\ast}),{m}_{iR}({\lambda}^{\ast}),{d}_{L}({\lambda}^{\ast}),{d}_{R}({\lambda}^{\ast}),{\lambda}^{\ast})}{p({y}_{i}\mid {x}_{i},\beta ,\eta ({t}_{i}))\xb7p({x}_{i}\mid {m}_{iL}(\lambda ),{m}_{iR}(\lambda ),{d}_{L}(\lambda ),{d}_{R}(\lambda ),\lambda )}.$$

Consequently, *γ* is simplified as

$$\gamma =\prod _{i=1}^{n}\frac{{\sum}_{h\in \{-1,+1\}}p({y}_{i}\mid h,\beta ,\eta ({t}_{i}))\xb7p(h\mid {m}_{iL}({\lambda}^{\ast}),{m}_{iR}({\lambda}^{\ast}),{d}_{L}({\lambda}^{\ast}),{d}_{R}({\lambda}^{\ast}),{\lambda}^{\ast})}{{\sum}_{h\in \{-1,+1\}}p({y}_{i}\mid h,\beta ,\eta ({t}_{i}))\xb7p(h\mid {m}_{iL}(\lambda ),{m}_{iR}(\lambda ),{d}_{L}(\lambda ),{d}_{R}(\lambda ),\lambda )}\frac{q(\lambda )}{q({\lambda}^{\ast})}.$$

One iteration or cycle of our MCMC sampler consists of steps 1 to 5. When the chain converges to its stationary distribution, the sampled values of all parameters are from their joint posterior distribution. Likewise, the samples of any single parameter represent the marginal posterior distribution of this parameter.

We now extend the single QTL model to a multiple QTL model, or

$${y}_{i}=\sum _{j=1}^{q}{\beta}_{j}{x}_{ij}+\eta ({t}_{i})+{e}_{i},\phantom{\rule{0.16667em}{0ex}}i=1,\cdots ,n,$$

(5)

where *x _{ij}* is the indicator of the genotype of the

For multiple Bayesian QTL interval mapping, we follow Wang et al. (2005) and assume (at most) one QTL within each marker interval. That is, in this paper, we assume *q* in model (5) equals the number of marker intervals. When marker intervals are short, it is reasonable to make this assumption. For intervals that are not short enough to meet this assumption, we can further divide them into sub-intervals with pseudo-markers. Similar to the single QTL mapping, here the QTL positions,
$\mathit{\lambda}={\{{\lambda}_{j}\}}_{j=1}^{q}$, and therefore, the QTL genotypes
$\mathit{X}={\{{\mathit{x}}_{j}\}}_{j=1}^{q}$ (with
${\mathit{x}}_{j}={\{{x}_{ij}\}}_{i=1}^{n}$) are unknown. Therefore, the unobserved variables of model (5) are function *η*, error variance
${\sigma}_{e}^{2}$, QTL positions ** λ**, QTL genotypes

The priors of *η* and
${\sigma}_{e}^{2}$ are unchanged from the single QTL mapping.

The prior specifications of ** λ** and

For QTL selection, we employ the SSVS method (George and McCulloch 1993). The SSVS approach imposes a normal mixture prior on the regression parameters ** β** and uses latent variables to identify subset choices. Specifically, the latent variables

$${\beta}_{j}\mid {\gamma}_{j}\sim (1-{\gamma}_{j})N(0,{\sigma}^{2})+{\gamma}_{j}N(0,{c}_{j}^{2}{\sigma}^{2})$$

with *P*(*γ _{j}* = 1) = 1−

The MCMC steps for sampling *η*, ** α**,

Sample *β _{j}* and

$${\beta}_{j}\mid \mathbf{y},{\theta}_{-{\beta}_{j}}\sim N\left({\widehat{\beta}}_{j},\frac{{\sigma}_{e}^{2}}{{\mathit{x}}_{j}^{T}{\mathit{x}}_{j}+{\sigma}_{e}^{2}/{\sigma}_{j}^{2}}\right),j=1,\cdots ,q,$$

where
${\widehat{\beta}}_{j}={\mathit{x}}_{j}^{T}\mathit{w}/({\mathit{x}}_{j}^{T}{\mathit{x}}_{j}+{\sigma}_{e}^{2}/{\sigma}_{j}^{2}),\mathit{w}={\left\{{w}_{i}\right\}}_{i=1}^{n}$, with *w _{i}* =

$$p({\gamma}_{j}=1\mid {\theta}_{-{\gamma}_{j}})=\frac{{c}_{j}}{{c}_{j}+{\scriptstyle \frac{{p}_{j}}{1-{p}_{j}}}exp\left\{{\scriptstyle \frac{{\beta}_{j}^{2}}{2{\sigma}^{2}}}\left(1-{\scriptstyle \frac{1}{{c}_{j}^{2}}}\right)\right\}}.$$

Sample QTL positions: *λ _{j}* is sampled via Metropolis-Hastings approach since there is no closed form for the conditional posterior probability density of a QTL position. We first sample a new position
${\lambda}_{j}^{\mathrm{\u2605}}$ uniformly from the neighborhood of

$$\alpha =\prod _{i=1}^{n}\frac{p({x}_{ij}\mid {m}_{iL}{({\lambda}_{j})}^{\ast},{m}_{iR}({\lambda}_{j}^{\ast}),{d}_{L}({\lambda}_{j}^{\ast}),{d}_{R}({\lambda}_{j}^{\ast}),{\lambda}_{j}^{\ast})}{p({x}_{ij}\mid {m}_{iL}({\lambda}_{j}),{m}_{iR}({\lambda}_{j}),{d}_{L}({\lambda}_{j}),{d}_{R}({\lambda}_{j}),{\lambda}_{j})}\frac{q({\lambda}_{j})}{q({\lambda}_{j}^{\ast})}.$$

If neither *λ _{j}* nor
${\lambda}_{j}^{\mathrm{\u2605}}$ is within

$$\begin{array}{l}q({\lambda}_{j}^{\mathrm{\u2605}})=1/[\mathit{min}\{{\lambda}_{j}^{\ast}+\delta ,{d}_{R}({\lambda}_{j}^{\ast})\}-\mathit{max}\{{\lambda}_{j}^{\ast}-\delta ,{d}_{L}({\lambda}_{j}^{\ast})\}],\phantom{\rule{0.16667em}{0ex}}\phantom{\rule{0.16667em}{0ex}}\phantom{\rule{0.16667em}{0ex}}\text{or}/\text{and}\\ q({\lambda}_{j})=1/[\mathit{min}\{{\lambda}_{j}+\delta ,{d}_{R}({\lambda}_{j})\}-\mathit{max}\{{\lambda}_{j}-\delta ,{d}_{L}({\lambda}_{j})\}].\end{array}$$

Sample QTL genotypes: The QTL genotype *x _{ij}* (

$$\frac{p({y}_{i}\mid x,\beta ,{\mathit{x}}_{i(-j)},\eta ({t}_{i}))\xb7p(x\mid {m}_{iL}({\lambda}_{j}),{m}_{iR}({\lambda}_{j}),{d}_{L}({\lambda}_{j}),{d}_{R}({\lambda}_{j}),{\lambda}_{j})}{{\sum}_{h\in \{-1,+1\}}p({y}_{i}\mid h,\beta ,{\mathit{x}}_{i(-j)},\eta ({t}_{i}))\xb7p(x\mid {m}_{iL}({\lambda}_{j}),{m}_{iR}({\lambda}_{j}),{d}_{L}({\lambda}_{j}),{d}_{R}({\lambda}_{j}),{\lambda}_{j})},$$

where

$${\mathit{x}}_{i(-j)}=({x}_{i1},\cdots ,{x}_{i(j-1)},{x}_{i(j+1)},\cdots ,{x}_{iq})$$

and

$$p({y}_{i}\mid x,\beta ,{\mathit{x}}_{i(-j)},\eta ({t}_{i}))\propto exp\left[-\frac{1}{2{\sigma}_{e}^{2}}{\{{y}_{i}-\eta ({t}_{i})-{\beta}_{j}x-\sum _{{j}^{\prime}\ne j}^{q}{\beta}_{{j}^{\prime}}{x}_{i{j}^{\prime}}\}}^{2}\right].$$

One iteration or cycle of our MCMC sampler consists of steps 1 to 7 and produces a sample from the joint posterior after completing the burn-in period.

We perform simulations to evaluate the small sample performance of the proposed semiparametric Bayesian interval QTL mapping method. Backcross populations of size 100, 300, 500 and 1000 individuals were simulated, with a single large chromosome of genetic length 10 Morgan. This genome was covered by 101 evenly spaced markers (100 marker intervals, each 10 centi-Morgan (cM) long). A single QTL was located at position *λ* = 155 cM with an effect of *β* = 1. The residual variance was
${\sigma}_{e}^{2}=1$. The covariate values were sampled from the uniform distribution [0, 10), and the unknown function was *η*(*t*) = *sin*(*t*), *η*(*t*) = 3*sin*(*t*/3), or *η*(*t*) = 5*sin*(*t*/5).

For the analysis, we chose the following mean and covariance function for the Gaussian process prior of *η*:

$$\mu (t;\mathit{\alpha})={\alpha}_{1}+{\alpha}_{2}t,\phantom{\rule{0.16667em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{\sigma}_{0}(t,{t}^{\prime})=exp\{-{(t-{t}^{\prime})}^{2}\}.$$

with the priors on *α _{i}*s being improper uniform. That is,

For the analysis of the simulated and the real (see below) data, the MCMC sampler was run for a total of 25,000 cycles. The first 5000 cycles were discarded as burn-in, and the remainder of the chain was thinned by keeping one out of every ten samples, resulting in a total of 2000 samples for post-MCMC analysis. Several convergence checks were performed, including the Convergence Diagnosis and Output Analysis (CODA) by Best, Cowles and Vines (1995) and two convergence diagnostic tests by Geweke (1992) and Gelman and Rubin (1992). All analyses indicated convergence of our MCMC samplers.

The posterior mean estimates of QTL position and effect are summarized in Table 1 for the backcross of size 500 and *η*(*t*) = *sin*(*t*), along with the true parameter values, showing that estimates and true values were in very good agreement. Figure 1 compares the true and the estimated unknown functions of the covariate, where the estimate is evaluated at grid points equally spaced between [0,10] with the increment of 0.05, and the posterior mean is given at each point, for sample sizes *n* of 100, 300, 500 and 1000 and for *η*(*t*) = *sin*(*t*). The figure supports the result in Neal (1996) about the consistency of the semiparametric estimator in the context of neural networks. As we increased the sample size, the estimated function became closer to the true function and is expected to converge to the true function as the sample size approaches infinity.

Convergence of estimator to the true function (*η*(*t*) = sin(*t*)) with increasing sample size (*n* = 100, 300, 500 and 1000). True functions are in bold. The dashed line is the posterior mean of the unknown function. The dotted lines represent the 95% **...**

Posterior mean parameter estimates for the data simulated Data with sample size n = 500 and true covariate function η(t) = sin(t)

Table 2 provides comparisons between the linear regression model and our semiparametric model for the three different *η* functions and for sample size *n*=300. The estimates from the semiparametric analysis agree well with the true values for all three functions, while the results from the linear parametric model are quite different. The estimated residual variances from linear model are much larger than the true values, while the QTL effect is underestimated and its estimated position is rather inaccurate. In contrast, our semiparametric model provides the flexibility to fit the data well.

Comparison between parameter estimates obtained with the linear parametric analysis and the semiparametric analysis (values in parentheses are for the linear parametric model)

For further comparison, we simulated two additional cow weight datasets based on a simple linear growth model and a well-known generalized logistic function for modeling the growth curve (Richards, 1959). The linear growth model is *η*(*t*) = 187.459 + 2.682*t*, while the generalized logistic function is

$$\eta (t)=-243+\frac{968}{{\{1+0.15{e}^{-0.01955(t-20.1)}\}}^{1/0.15}}.$$

We set QTL effect *β* = 3 and
${\sigma}_{e}^{2}=9$ so that the phenotypic variations due to the QTL, the covariate and the random error are roughly balanced. All other simulation parameters (including the marker data and QTL position) were the same as before (sample size *n* = 500).

The simulated responses are plotted against covariate time in Figure 2. Table 3 presents the posterior summaries of the analysis of the two data sets. For the data set simulated with the linear function, the estimates from the semiparametric and linear regression models are very similar. But for the data set simulated with the generalized logistic function, the semiparametric method produces again more accurate results.

(a) Simulated cow weight against weeks based on the generalized logistic growth curve. (b) Simulated cow weight against weeks based on the linear growth curve.

We simulated a backcross population with 10 chromosomes, each containing 12 evenly spaced markers (11 intervals of length 20 centiMorgan). Four QTLs were located at the center of chromosomes 1, 3, 5, and 7 with effects 0.5, −0.5, 0.5, and −0.5 respectively. A total of 500 individuals were sampled. The covariate function was of the form *η*(**t**) = **5**(**1** −**2t**_{2}) sin(**2t**_{1}), therefore two (interacting) variables *t*_{1} and *t*_{2} were simulated, where *t*_{1} was continuous and sampled from the uniform distribution [0, 10), while *t*_{2} was binary 0/1 and sampled with equal probability 0.5.

For the prior on *η*, we have mean *μ*(**t**; ** α**) =

In addition to the simulations, we tested our method on a real mouse study of obesity, a major risk factor for type II diabetes. To genetically dissect a polygenic mouse model of obesity-driven type II diabetes, Reifsnyder et al. (2000) outcrossed the obese, diabetes-prone, NZO (New Zealand Obese)/HlLt strain to the relatively lean NON (Nonobese Nondiabetic)/Lt strain, and then reciprocally backcrossing obese F1 mice to the lean NON/Lt parental strain. They measured the body weights of 203 backcross males. The total number of markers is 85 with average distance between adjacent markers of 20.5 cM. In addition, inguinal, gonadal, retroperitoneal and mesenteric fat pad weights have also been measured (the data set can be downloaded at http://cgd.jax.org/nav/qtlarchive1.htm). Following Stylianou et al. (2006), we first calculated the total fat pad weight as the mesenteric fat pad weight plus twice the sum of the inguinal, gonadal, and retroperitoneal fat pad weights. The lean body weight (LBwt) is defined as the difference between the total body weight and the total weight of the fat pads. In their QTL analysis of the mesenteric fat pad weight (MFPwt), Stylianou et al. (2006) adjusted for the effect of LBwt. We applied our semiparametric single and multiple Bayesian QTL mapping methods to the MFPwt using LBwt as a covariate. Both analyses strongly suggest a single QTL on chromosome 4 (Figure 4, panels (b) and (d)). The estimated function of LBwt on mesenteric fat pad weight is presented in Figure 4(a) and is nearly linear. For comparison, we performed parametric multiple Bayesian QTL mapping, which includes a linear effect of LBwt. This analysis also identified a single QTL on chromosome 4 (Figure 4(c)). Hence, as in our simulations, the semiparametric and parametric mapping methods yield very similar results when the relationship between covariate and trait is (nearly) linear, although the posterior probability for QTL presence was reduced for the semiparametric method, relative to the parametric analysis (Figure 4(c) and (d)).

We have proposed efficient and robust Bayesian semiparametric, single and multiple interval QTL mapping, which allows for an unknown function of non-genetic covariates. The covariates may have a non-linear relationship with the response and/or interact with each other in a complex way. A Gaussian process is used as the prior for the unknown function. This prior does not require any assumptions like monotonicity or additivity. A hierarchical scheme to construct a prior around a parametric family has been described. We specified the Gaussian process prior via the simple, hierarchical model in Equation (4), and we used approximately uninformative default prior for the hyperparameters: Flat priors for the location parameters (*α _{i}*s), and an inverse Gamma(0.1,0.1) prior for the precision parameter

One of our reasons for choosing the Gaussian process is its ability to deal with multiple covariates. With the Gaussian process, we can specify a simple function as in Equation (3), and use the extra layer of randomness represented by Equation (4) to fine tune the curve to fit the data. Alternatively fitting a spline function to *η*(*t*) will not require the extra randomness of Equation (4). However, the drawback of the spline approach is its inflexibility in handling high-dimensional covariates. This is important for a substantial extension of our method. Essentially all parametric multiple QTL mapping method assume additive QTL action, or incorporate at most two-locus interactions, but no higher-order interactions. How to accommodate a large number of potential QTL with possibly higher-order interactions and interactions with environmental covariates in multiple QTL mapping remains a challenging problem. Our current work focuses on extending our semiparametric Gaussian process model to unknown functions including not only non-genetic covariates but also multiple putative QTL, with variable selection.

The authors wish to thank the Editor, Associate editor and two Referees for their helpful comments and suggestions, which have led to a great improvement of this article. This research was partially supported by NIH grants GM074175 (F.Z and H. H.) and CA79949 (H. Z.).

- Abrahamsen P. Technical Report. Vol. 917. Norwegian Computing Center; Oslo: 1997. A review of Gaussian random fields and correlation functions.
- Abramovich F, Sapatinas T, Silverman BW. Wavelet thresholding via a Bayesian approach. Journal of the Royal Statistical Society, Series B. 1998;60:725–749.
- Basten CJ, Weir BS, Zeng ZB. QTL Cartographer: a reference manual and tutorial for QTL mapping. Department of Statistics, North Carolina State University; 1999.
- Best NC, Cowles MK, Vines SK. CODA manual version 0.30. MRC Biostatistics Unit; Cambridge, UK: 1995.
- Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion) Journal of the Royal Statistical Society, Series B. 2002;64:641–656. 731–775.
- Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 2003;19:889–890. [PubMed]
- Chen H. Convergence rates for parametric components in a partly linear model. Ann Statist. 1988;16:136–146.
- Cuzick J. Semiparametric additive regression. Journal of the Royal Statistical Society, Series B. 1992;54:831–843.
- Denison DGT, Mallick BK, Smith AFM. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society Series B. 1998;60:333–350.
- DiMatteo I, Genovese CR, Kass R. Bayesian curve fitting with free-knot splines. Biometrika. 2001;88:1055–1071.
- Doerge RW, Zeng ZB, Weir BS. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science. 1997;12:195–219.
- Fan J. Design-adaptive nonparametric regression. Journal of the American Statistical Association. 1992;87:998–1004.
- Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–72.
- George EI, McCulloch RE. Variable selection via gibbs sampling. Journal of the American Statistical Association. 1993;88:881–889.
- Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Clarendon Press; Oxford, UK: 1992.
- Godsill SJ. On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics. 2001;10:230–248.
- Godsill SJ. Highly Structured Stochastic Systems. Oxford University Press; 2003. Proposal densities, and product space methods.
- Green PJ. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- Härdle W. Applied nonparametric regression. Cambridge University Press; 1990.
- Hastie TJ, Loader C. Local regression: Automatic kernel carpentry (with discussion) Statistical Science. 1993;8:120–143.
- Heckman N. Spline smoothing in a partly linear model. Journal of the Royal Statistical Society, Series B. 1986;48:244–248.
- Hoeschele I. Mapping quantitative trait loci in outbred pedigrees. In: Balding DJ, Bishop M, Cannings C, editors. Handbook of Statistical Genetics. Wiley; New York: 2007. pp. 477–525.
- Lan H, Kendziorski CM, Haag JD, Shepel LA, Newton MA, Gould MN. Genetic loci controlling breast cancer susceptibility in the Wistar-Kyoto rat. Genetics. 2001;157:331–339. [PubMed]
- Lander E, Botstein D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121:185–199. [PubMed]
- Lincoln SE, Daly MJ, Lander ES. A tutorial and reference manual for MAPMAKER/QTL. Whitehead Institute; 1993.
- Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sunderland, Mass; Sinauer: 1998.
- MacKay DJ. Introduction to Gaussian processes. In: CM Bishop., editor. Neural networks and machine learning. Berlin: Springer; 1998.
- Manly KF, Olson JM. Overview of QTL mapping software and introduction to Map Manager QT. Mammalian Genome. 1999;10:327–334. [PubMed]
- Mays JE. Unpublished Ph.D. dissertation. Virginia Polytechnic Institute and State University; Blacksburg, VA: 1995. Model robust regression: combining parametric, nonparametric, and semiparametric methods; p. 200p.
- McQueen MB, Bertram L, Rimm EB, Blacker D, Santangelo SL. A QTL genome scan of the metabolic syndrome and its component traits. BMC Genetics. 2003;4:S96. [PMC free article] [PubMed]
- Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
- Neal RM. Bayesion learning for neural networks. New York: Springer-Verlag; 1996.
- Neal RM. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report No. 9702. Dept. of Statistics, University of Toronto; 1997.
- O’Hagan A. On curve fitting and optimal design for regression. Journal of the Royal Statistical, Society B. 1978;40:1–42.
- Reifsnyder PC, Churchill GA, Leiter EH. Maternal environment and genotype interact to establish diabesity in mice. Genome Research. 2000;10:1568–1578. [PubMed]
- Richards PJ. A flexible growth function for empirical use. Journal of Experimental Botany. 1959;10:290–300.
- Satagopan JM, Yandell BS, Newton MA, Osborn TC. A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics. 1996;144:805–816. [PubMed]
- Sillanpää MJ, Arjas E. Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics. 1998;148:1373–1388. [PubMed]
- Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics. 1996;75:317–343.
- Speckman P. Kernel smoothing in partial linear models. Journal of the Royal Statistical, Society B. 1988;50:413–436.
- Stephens DA, Fisch RD. Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics. 1998;54:1334–1347.
- Stoehr JP, Nadler ST, Schueler KL, Rabaglia ME, Yandell BS, Metz SA, Attie AD. Genetic obesity unmasks nonlinear interactions between murine type 2 diabetes susceptibility loci. Diabetes. 2000;49:1946–1954. [PubMed]
- Stylianou IM, Korstanje R, Li R, Sheehan S, Paigen B, Churchill GA. Quantitative trait locus analysis for obesity reveals multiple networks of interacting loci. Mammalian Genome. 2006;17:22–36. [PubMed]
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical, Society B. 1996;58:267–288.
- Ven RV. Reversible-Jump Markov chain Monte Carlo for quantitative trait loci mapping. Genetics. 2004;167:1033–1035. [PubMed]
- Wahba G. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical, Society B. 1978;40:364–372.
- Wahba G. Cross validated spline methods for the estimation of multivariate functions from data on functionals. In: David HA, David HT, editors. Statistics: An appraisal, proceedings 50th anniversary conference Iowa state statistical laboratory. Iowa State Univ. Press; Ames, Ia.: 1984. pp. 205–235.
- Wand MP, Jones MC. Kernel smoothing. London: Chapman and Hall; 1995.
- Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S. Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics. 2005;170:465–480. [PubMed]
- Yi N. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics. 2004;167:967–975. [PubMed]
- Yi N, Xu S. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics. 2000;155:1391–1403. [PubMed]
- Yi N, Xu S. Bayesian mapping of quantitative trait loci under complicated mating designs. Genetics. 2001;157:1759–1771. [PubMed]
- Yi N, Xu S. Bayesian LASSO for Quantitative Trait Loci Mapping. Genetics. 2008;179:1045–1055. [PubMed]
- Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS. An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics. 2007;176:1865–1877. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |