PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
Biometrics. Author manuscript; available in PMC 2010 May 25.
Published in final edited form as:
PMCID: PMC2875332
NIHMSID: NIHMS117741

Gaussian Process Based Bayesian Semiparametric Quantitative Trait Loci Interval Mapping

Summary

In linkage analysis, it is often necessary to include covariates such as age or weight to increase power or avoid spurious false positive findings. However, if a covariate term in the model is specified incorrectly (e.g., a quadratic term misspecified as a linear term), then the inclusion of the covariate may adversely affect power and accuracy of the identification of Quantitative Trait Loci (QTL). Furthermore, some covariates may interact with each other in a complicated fashion. We implement semiparametric models for single and multiple QTL mapping. Both mapping methods include an unspecified function of any covariate found or suspected to have a more complex than linear but unknown relationship with the response variable. They also allow for interactions among different covariates. This analysis is performed in a Bayesian inference framework using Markov chain Monte Carlo. The advantages of our methods are demonstrated via extensive simulations and real data analysis.

Keywords: Higher order interaction, MCMC, Multiple QTL, Non-linear, Variable selection

1. Introduction

Unlike monogenic traits where success in associating genotype to phenotype is assured, complex traits pose significantly greater challenges. Parametric genetic mapping using experimental populations, such as backcrosses, F2 intercross or Recombinant Inbred Lines (RIL) have been well developed during the past 15 years (see Doerge et al., 1997 for an introduction to QTL mapping in inbred line crosses). Many excellent open source software packages, such as QTLCart (Basten et al., 1999), MapManager (Manly and Olson, 1999), MapMaker (Lincoln et al., 1993), and R/qtl (Broman et al., 2003) are freely available on-line. These QTL mapping packages often model the effects of non-genetic covariates linearly, for example,

y=μ+βx+ζt+e,
(1)

where y is the measured quantitative trait; t is the non-genetic covariate; x is the genetic factor, and e is the random error term. However, in practice, the QTL position is unknown, resulting in missing x (missing for all individuals).

Although most available QTL mapping methods only map one or a few QTL at a time and are not efficient for complex trait mapping, recently multiple QTL have been mapped simultaneously by treating QTL mapping as a large-scale variable selection problem: for example, for a backcross population and with q potential QTL positions (selected grid of positions across the genome), where q is in the hundreds or thousands and typically (much) larger than the sample size, there are 2q possible main effect models. Variable selection methods are needed that are capable of selecting variables that are not necessarily all individually important but rather together important. By treating multiple quantitative trait locus (QTL) mapping as a model/variable selection problem (Broman and Speed, 2002), forward and step-wise selection procedures have been proposed to search for multiple QTL. Although simple, these methods have their limitations, such as uncertainty about the number of QTL, the sequential model building that makes it unclear how to assess the significance of the associated tests, etc. Bayesian QTL mapping methods (Satagopan, 1996; Sillanpää and Arjas, 1998; Stephens and Fisch, 1998; Yi and Xu, 2000, 2001; Hoeschele, 2007) have been developed, in particular, for the detection of multiple QTL by treating the number of QTL as a random variable and by specifically modeling it using reversible jump Markov chain Monte Carlo (MCMC) (Green, 1995). Due to the variable dimensionality of the parameter spaces associated with different models (different numbers of QTL), care must be taken in determining the acceptance probability for such changes in dimension, which in practice may not be handled correctly (Ven, 2004). To avoid this problem, another leading approach to variable selection in QTL analysis implemented by MCMC is based on the composite model space framework (Godsill, 2001, 2003) and has been introduced to genetic mapping by Yi (2004). Bayesian variable selection methods such as reversible jump MCMC (Green, 1995) and stochastic search variable selection (SSVS) (George and McCulloch, 1993) are special cases of this framework. A modification that treats (variance) hyperparameters as unknown was recently found to produce a better mixing MCMC sampler for multiple QTL mapping (Yi et al., 2007). Recently, Yi and Xu (2008) have developed a Bayesian LASSO (Tibshirani, 1996) for QTL mapping.

In some studies, however, the relationship between y and t may not be linear. In their study of the metabolic syndrome, McQueen et al (2003) have found a nonlinear effect of alcohol consumption on the quantitative traits they investigated. Incorrect modeling of the covariate effect may adversely affect power and accuracy of QTL identification. Semiparametric regression modeling, where ζt in (1) is replaced by an unspecified function η(t), has attracted considerable attention in the statistical literature. When x is observed, model (1) reduces to the semiparametric regression model, which is well investigated in the spline literature as well as in the kernel regression literature. Examples for spline regression include Wahba (1984), Heckman (1986), Chen (1988), Speckman (1988), Cuzick (1992), Hastie and Loader (1993) and Mays (1995) while examples for kernel regression include Härdle (1990), Wand and Jones (1995), and Fan (1992). Spline regression requires a penalty weight to balance between goodness-of-fit and complexity. To account for the non-linear effect of the alcohol consumption, McQueen et al. (2003) categorized the alcohol consumption into five non-overlapping groups in their linear regression analysis, which is essentially a special form of spline regression, so-called local polynomial regression. Kernel regression, on the other hand, needs a bandwidth to determine the degree of localness and smoothness of η. The choice of a bandwidth, and not the choice of a kernel function, is critical for the performance of the nonparametric fit (Härdle, 1990). Bayesian approaches to semiparametric regression have also been developed. Bayesian nonparametric methods achieve flexibility by putting priors on distribution spaces, corresponding to infinitely dimensional parameterizations. The leading Bayesian methods include Dirichlet process (Müller, et al. 1996), splines (Smith and Kohn, 1996; Denison et al., 1998; DiMatteo et al., 2001), wavelets (Abramovich et al., 1998) and Gaussian process (Neal, 1997, 1996). Gaussian process priors date back to at least O’Hagan (1978) and have a large support in the space of all smooth functions through an appropriate choice for the covariance kernel. Gaussian process is particularly flexible for curve estimation because of their flexible sample path shapes. Wahba (1978) has shown that for an appropriate choice of the covariance kernel of the Gaussian process, the Bayesian estimator is a smoothing spline. However, Gaussian process better suited for modeling with multiple (even many) covariates than the smoothing spline approach.

Applying spline regression and kernel regression techniques to semiparametric interval QTL mapping is challenging, especially when mapping multiple QTL, due to the missing QTL genotypes. The Bayesian approach, however, is very flexible in handling missing data. In this paper, we propose novel Bayesian methods for interval QTL mapping of a single QTL and of multiple QTL which incorporate an unspecified function of a single covariate or multiple covariates using a Gaussian process prior. The rest of the paper is organized as follows. We first introduce the underlying semiparametric QTL model for single QTL mapping and the general development of the Markov Chain Monte Carlo (MCMC) sampler in Section 2. We then develop semiparametric multiple QTL mapping in Section 3. Simulation results are presented in Sections 4.1 amd 4.2, and the analysis of a real data set is presented in Section 4.3. We end the paper with comments and conclusions in Section 5.

2. Single QTL Mapping

Quantitative traits are usually controlled by both genetic and environmental factors, such as diet and exposure to chemical toxicities. When studying natural populations, like humans, it is difficult to separate environmental and genetic effects. With experimental organisms, uniform genetic backgrounds and controlled breeding schemes can avoid environmental variability which may obscure genetic effects. It is considerably easier to map quantitative traits with experimental populations than with natural populations. For this reason, crosses between completely inbred lines are often used for detecting QTL. QTL line-cross analysis has been widely applied in the plant sciences. It has also been used successfully in a number of animal species, such as mice and rats (Stoehr et al., 2000; Lan et al., 2001). Because of the homology between humans and rodents, animal models are extremely useful in helping us to understand human diseases.

The backcross (BC) and F2 intercross are two of the most popular mapping populations in QTL studies. Suppose two inbred parents (P1 and P2) differ in some quantitative traits. At each locus, we label the allele of parent P1 as a and the allele of P2 as A. An F1 generation is completely heterozygous with genotype Aa at all loci, receiving one allele from each parent. Thus, there is no segregation in F1 individuals. A BC population is generated when F1 is crossed back with one of its parents, for example, P2. At each locus, every BC individual has equal probability of 1/2 to be Aa or AA, respectively. Thus there is segregation in BC since BC individuals are no longer genetically identical at each locus. Similarly, crossing F1 individuals generates an F2 population in which each individual has probability 1/4, 1/2 and 1/4 of being aa, Aa and AA, respectively. The combination of the two alleles in an individual is a genotype, and the genotypes can be determined at a number of loci, called marker loci (or genetic markers), throughout the genome. These loci are often not the functional loci (QTL) that we wish to identify and whose genotypes can generally not be measured but rather inferred (with some uncertainty) from the genotypes at nearby markers.

The QTL data available include the trait values or phenotypes (the dependent variable) yi (i = 1, ···, n), the discrete marker genotypes mij (i = 1, ···, n, j = 1, ···, m), and an additional covariate ti (i = 1, ···, n), where n is sample size (the number of individuals) and m is the number of genetic markers. The genotypes at a putative QTL may be denoted by {bb, Bb, bb} to distinguish the QTL genotypes from the marker genotypes {aa, Aa, AA}.

The single QTL model assumes that there is only one QTL affecting the trait according to the linear regression model in Equation (1). If the QTL genotypes are observed, QTL mapping is a simple linear regression problem. However, in practice, the QTL position is rarely known and the genotypes are typically unobserved, resulting in all missing xis. The idea behind interval mapping (Lander and Botstein, 1989) is that at any putative QTL position located in an interval between two marker loci (typically on an evenly spaced, genome-wide grid), we can compute the probabilities of the unobserved QTL genotypes for each individual given its genotypes at the pair of closest flanking marker loci (see chapter 15 of Lynch and Walsh, 1998). The distribution of the quantitative trait given the marker genotypes thus follows a finite mixture model. Under this model, the null (alternative) hypothesis for a putative QTL position is that its genotypes do not (do) affect the phenotype of interest. The null hypothesis is evaluated via the likelihood ratio, and a plot of likelihood ratios versus putative QTL positions is called a log-likelihood (ratio) profile. In any region where the profile exceeds a (genome-wide) significance threshold, a QTL is declared at the position with the highest log-likelihood ratio.

When a parametric model is specified incorrectly, estimation bias can result, and the incorrect or inefficient modeling of the covariate effect may adversely affect the power and precision of the QTL identification. To alleviate this problem, we propose a semiparametric model of the form

yi=βxi+η(ti)+ei,i=1,,n,
(2)

where xi is the indicator of the QTL genotype (e.g., with values −1 and 1 depending on whether the QTL genotype is Aa or AA in a backcross population); β is the effect of the QTL; η(t) is an arbitrary function with no particular parametric form specified, and ei is the error term with independent distribution N(0,σe2). Note that here ti can be a scalar corresponding to a single covariate or a vector representing multiple covariates.

As all our inference is performed conditional on the marker genotypes, i.e. on M = {mij }, we suppress this conditioning notation for the remainder of the paper. In Bayesian analysis, a prior distribution of the unobserved variables is combined with the likelihood of the observed data to obtain a posterior distribution of the unknown variables. Since the QTL position, λ, and therefore, the QTL genotypes are unknown, the unobserved variables of model (2) are η, λ, x={xi}i=1n, β and σe2. Below we describe in detail some of the prior and conditional posterior distributions.

2.1 Prior Specifications

2.1.1 Gaussian process prior for covariate function

A prior probability density p(η|t) is induced by using a Gaussian process. A Gaussian process is a stochastic process such that each finite dimensional distribution is multivariate normal. Thus any Gaussian process is specified by its mean function and covariance kernel. A wide array of functions can be derived as sample paths of a Gaussian process.

Let t1, t2, ···, tk be the set of distinct covariate values and dj the number of occurrences of tj, j = 1, ···, k. A Gaussian process on domain T is a random, real valued function η (t) such that all possible finite dimensional distributions (η(t1), ···, η(tk))T are multivariate normal, with E(η(tj)) = μ(tj) and cov(η(tj1), η(tj2)) = σ(tj1, tj2) for j1, j2 = 1, ···, k. The fixed real valued function μ(t) is known as the mean function, and the function σ(t, t) is known as the covariance kernel, which must satisfy the condition that the resulting k × k matrix Σ with (i, j)th element equal to σ (ti, tj) is positive definite. Smoothness of the covariance kernel essentially controls the smoothness of the sample paths of η. For an appropriate choice of the covariate kernel, a Gaussian process has a large support in the space of all smooth functions (Abrahamsen, 1997; Mackay, 1998).

To specify a Gaussian process prior with μ={μ(tj)}j=1k and Σ, one may consider some parametric forms for the functional parameters while putting priors on the hyper-parameters. To build the prior around a parametric family, we consider

μ(t;α)=α1f1(t)++αlfl(t),
(3)

where l is a fixed integer, and {f1, ···, fl} are known (for example polynomial) functions in t with the scaled hyper-parameters α={αj}j=1l unknown. Such a class of parametric families covers a wide variety of functions with appropriate choice of fj (t). For the covariance kernel, consider the simplest parametric form σ(t, t) = σ0(t, t)/τ for unknown hyper-parameters τ > 0, and known kernel σ0, such as, σ0(t, t) = exp{−(tt)2}. There are many other types of kernel functions that can be applied (see Mackay, 1998 for details). Note that the posterior mean of η almost interpolates the data as τ → 0, while the posterior distribution is concentrated near the prior mean function as τ → ∞. Clearly τ controls the smoothness of η.

In practice, reasonable choices of conjugate prior distributions for τ and α are an inverse Gamma distribution on τ and an independent l-variate normal distribution on α. Specifically, we consider the following hierarchical model

τinverseGamma(a,b),αN(α0,Γ),ηα,τGaussianprocess(μ,0/τ)
(4)

where Σ0 is the k × k matrix with element (j1, j2) equal to σ0(tj1, tj2).

2.1.2 Priors for the remaining parameters

The prior on the QTL position, λ, is non-informative with a uniform distribution over the entire genome. Given λ, for each individual i, the probability of its QTL genotype is a function of the genotypes of the two markers flanking λ and of the locations of the two flanking markers, and it is denoted by p(xi|miL(λ), miR(λ), dL(λ), dR(λ), λ), where miL(λ) and miR(λ) are the genotypes of the markers to the left and right of locus 3, respectively; dL(λ) and dR(λ) are the locations of the flanking markers (see chapter 15 of Lynch and Walsh, 1998 for detailed calculations). QTL effect and error variance are assumed to be independent a priori with noninformative priors p(β) [proportional, variant] 1 and p(σe2)1/σe2. The choice of these priors is due to their computational simplicity and our lack of knowledge on these parameters. If prior data is available, more informative priors can be employed.

2.2 MCMC algorithm for posterior computation

Note that given the nonparametric component η, (2) becomes a linear model with yi replaced by yiη(ti). Given the parameters of the genetic component (β, x), model (2) reduces to a traditional nonparametric model if yi is replaced by yiβxi. Below we present the MCMC algorithm for the posterior computation, which is largely based on the Gibbs sampling approach.

First we initialize parameters and hyperparameters σe2, τ, β, λ, a and b. Then we perform the following updating steps many times:

Step 1

Sample η: Conditional on η, the yi’s are independent normal random variables with variance σe2 and mean βxi + η(ti). Let Uj be the average of all yiβxi for those individuals whose corresponding covariate value is tj, j = 1 ···, k and U={Uj}j=1k. Further, let D be a diagonal matrix with diagonal element i equal to di/σe2. Then the conditional distribution of η is k-variate normal η|y, α, β, τ ~ Nk(μ*, Σ*), where Σ*= (D + Σ−1)−1 and mean vector μ* = Σ*D(Uμ) + μ

Step 2

Sample α and τ: Let F be the k × l matrix with the (j, r)th element equal to fr(tj). Since y affects the distributions of α and τ only through η, we have the following conditional posterior distributions:

ατ,η,y=ατ,ηN(α0,Γ),τα,η,y=τα,ηinverseGamma(a,b),

where Γ=(τFT01F+Γ1)1,α0=τΓFT01(ηFα0)+α0,a=a+k/2 and b=b+(ηFα)T01(ηFα)/2.

Step 3

Sample β from its conditional posterior distribution, which is normal N(i=1nxi{yiη(ti;α)}i=1nxi2,σe2i=1nxi2).

Step 4

Sample σe2 from the inverse Gamma distribution with parameters n/2 and i=1n{yiβxiη(ti;α)}2/2.

Step 5

Sample QTL position λ and QTL genotypes x jointly in a Metropolis-Hastings step detailed below. A new QTL position λ* and new QTL genotypes xi, i = 1, …, n are proposed jointly by first sampling λ* from a uniform proposal distribution centered on the current λ, U[λδ, λ + δ), where δ is prespecified to yield a desirable acceptance rate (say 20–40%), and q(λ*|λ) is the density of this distribution (it needs a slight modification when λ is located near the end of a chromosome). Given the new position λ*, the QTL genotypes xi are sampled directly from their fully conditional posterior distributions

q(xiλ)=defp(yixi,β,η(ti))·p(ximiL(λ),miR(λ),dL(λ),dR(λ),λ)h{1,+1}p(yih,β,η(ti))·p(hmiL(λ),miR(λ),dL(λ),dR(λ),λ),

where p(yi|xi, β, η(ti)) follows from equation (2), i.e., p(yixi,β,η(ti))exp{(yiβxiη(ti))2/(2σe2)}. The sampled new position and QTL genotypes are accepted jointly with a probability equal to min(1,γ), where

γ=p(λ,xy,β,η)p(λ,xy,β,η)q(λ)i=1nq(xiλ)q(λ)i=1nq(xiλ)

The ratio of the joint posteriors of QTL position and genotypes evaluated at the proposed and current values is

p(λ,xy,β,η)p(λ,xy,β,η)=i=1np(yixi,β,η(ti))·p(ximiL(λ),miR(λ),dL(λ),dR(λ),λ)p(yixi,β,η(ti))·p(ximiL(λ),miR(λ),dL(λ),dR(λ),λ).

Consequently, γ is simplified as

γ=i=1nh{1,+1}p(yih,β,η(ti))·p(hmiL(λ),miR(λ),dL(λ),dR(λ),λ)h{1,+1}p(yih,β,η(ti))·p(hmiL(λ),miR(λ),dL(λ),dR(λ),λ)q(λ)q(λ).

One iteration or cycle of our MCMC sampler consists of steps 1 to 5. When the chain converges to its stationary distribution, the sampled values of all parameters are from their joint posterior distribution. Likewise, the samples of any single parameter represent the marginal posterior distribution of this parameter.

3. Multiple QTL Mapping

We now extend the single QTL model to a multiple QTL model, or

yi=j=1qβjxij+η(ti)+ei,i=1,,n,
(5)

where xij is the indicator of the genotype of the ith individual at the jth putative QTL; q is the total number of QTL; βj is the effect of the jth QTL; η (t) and ei are defined as for model (2) in the previous section.

For multiple Bayesian QTL interval mapping, we follow Wang et al. (2005) and assume (at most) one QTL within each marker interval. That is, in this paper, we assume q in model (5) equals the number of marker intervals. When marker intervals are short, it is reasonable to make this assumption. For intervals that are not short enough to meet this assumption, we can further divide them into sub-intervals with pseudo-markers. Similar to the single QTL mapping, here the QTL positions, λ={λj}j=1q, and therefore, the QTL genotypes X={xj}j=1q (with xj={xij}i=1n) are unknown. Therefore, the unobserved variables of model (5) are function η, error variance σe2, QTL positions λ, QTL genotypes X, and QTL effects, β={βj}j=1q.

3.1 Prior Specifications

The priors of η and σe2 are unchanged from the single QTL mapping.

The prior specifications of λ and X are: For each j [set membership] {1, ···, q}, jth QTL position λj has a uniform prior distribution over the entire jth interval, and the λjs are independent of each other. Conditional on λj, QTL genotype xij of the ith individual at the jth QTL has conditional probability p(xij|miL(λj), miR(λj), dL(λj), dR(λj), λj).

For QTL selection, we employ the SSVS method (George and McCulloch 1993). The SSVS approach imposes a normal mixture prior on the regression parameters β and uses latent variables to identify subset choices. Specifically, the latent variables γj (γj = 0 or 1), are introduced and the normal mixture prior are represented by

βjγj(1γj)N(0,σ2)+γjN(0,cj2σ2)

with P(γj = 1) = 1−P (γj = 0) = pj. The promising subsets of predictors can be identified by applying Bayesian multiple comparison rules that use the posterior probabilities p(γj = 1|y) (e.g., Müller et al., 1996). Setting σ2(> 0) small ensures that if γj = 0, βj could be “safely” claimed to be 0. While setting cj large (cj > 1 always) makes sure that if γj = 1, a nonzero estimate of βj will be included in the model. Such mixture model results in a normal posterior distribution of βj, allowing the use of the Gibbs sampler.

3.2 Posterior computation

The MCMC steps for sampling η, α, τ and σe2 are identical to those in Section 3 if we replace yiβxi (i = 1, ···, n) by yi − Σj βjxij in all corresponding posterior updates. Therefore, below we concentrate on the updates for the remaining parameters. Let θ be the vector containing all the unknown parameters and let θz be vector of θ after removing parameter z.

Sample βj and γj (j = 1, ··· q): Sample βj from its conditional posterior distribution, which is normal

βjy,θβjN(β^j,σe2xjTxj+σe2/σj2),j=1,,q,

where β^j=xjTw/(xjTxj+σe2/σj2),w={wi}i=1n, with wi = yiη(ti) − Σj′≠jβjxij, and σj = σ or cjσ depending on whether γj = 0 or γj = 1. The conditional distribution of γj does not depend on y and is of the form

p(γj=1θγj)=cjcj+pj1pjexp{βj22σ2(11cj2)}.

Sample QTL positions: λj is sampled via Metropolis-Hastings approach since there is no closed form for the conditional posterior probability density of a QTL position. We first sample a new position λj uniformly from the neighborhood of λj, [max{λjδ, dL(λj)}, min{λj + δ, dR(λj)}] with δ being a tuning parameter (set to 2 cM in subsequent simulations). Then, λj will be accepted or rejected according to the probability min(1, α), where

α=i=1np(xijmiL(λj),miR(λj),dL(λj),dR(λj),λj)p(xijmiL(λj),miR(λj),dL(λj),dR(λj),λj)q(λj)q(λj).

If neither λj nor λj is within δ distance away from the ends of the interval, q(λj)=q(λj)=1/(2δ). However, if λj or/and λj are within δ distance from one end of the interval, then

q(λj)=1/[min{λj+δ,dR(λj)}max{λjδ,dL(λj)}],or/andq(λj)=1/[min{λj+δ,dR(λj)}max{λjδ,dL(λj)}].

Sample QTL genotypes: The QTL genotype xij (i = 1, 2, ···, n; j = 1, 2, ···, q) is updated one individual and one locus at a time, based on flanking marker information. Specifically, xij is sampled from the conditional probability distribution pij (x), for x = ±1, which equals

p(yix,β,xi(j),η(ti))·p(xmiL(λj),miR(λj),dL(λj),dR(λj),λj)h{1,+1}p(yih,β,xi(j),η(ti))·p(xmiL(λj),miR(λj),dL(λj),dR(λj),λj),

where

xi(j)=(xi1,,xi(j1),xi(j+1),,xiq)

and

p(yix,β,xi(j),η(ti))exp[12σe2{yiη(ti)βjxjjqβjxij}2].

One iteration or cycle of our MCMC sampler consists of steps 1 to 7 and produces a sample from the joint posterior after completing the burn-in period.

4. Simulations and Real Data Analysis

4.1 Simulation I

We perform simulations to evaluate the small sample performance of the proposed semiparametric Bayesian interval QTL mapping method. Backcross populations of size 100, 300, 500 and 1000 individuals were simulated, with a single large chromosome of genetic length 10 Morgan. This genome was covered by 101 evenly spaced markers (100 marker intervals, each 10 centi-Morgan (cM) long). A single QTL was located at position λ = 155 cM with an effect of β = 1. The residual variance was σe2=1. The covariate values were sampled from the uniform distribution [0, 10), and the unknown function was η(t) = sin(t), η(t) = 3sin(t/3), or η(t) = 5sin(t/5).

For the analysis, we chose the following mean and covariance function for the Gaussian process prior of η:

μ(t;α)=α1+α2t,andσ0(t,t)=exp{(tt)2}.

with the priors on αis being improper uniform. That is, p(αi) [proportional, variant] 1. For initializations, we set σe(0) to some value equal to or less than the overall variance of the trait (response variable), we set τ(0) = 1, β(0) = 0, and λ(0) was chosen by randomly selecting a position in the genome. Conditional on λ(0), the initial genotype of each individual was generated conditional on the genotypes at the two markers flanking the QTL position. Lastly, we set a = 0.1 and b = 0.1 to specify a proper but vague prior for τ. We investigated and ensured that our posterior distribution is proper given the improper priors employed here, in particular for α and σe2.

For the analysis of the simulated and the real (see below) data, the MCMC sampler was run for a total of 25,000 cycles. The first 5000 cycles were discarded as burn-in, and the remainder of the chain was thinned by keeping one out of every ten samples, resulting in a total of 2000 samples for post-MCMC analysis. Several convergence checks were performed, including the Convergence Diagnosis and Output Analysis (CODA) by Best, Cowles and Vines (1995) and two convergence diagnostic tests by Geweke (1992) and Gelman and Rubin (1992). All analyses indicated convergence of our MCMC samplers.

The posterior mean estimates of QTL position and effect are summarized in Table 1 for the backcross of size 500 and η(t) = sin(t), along with the true parameter values, showing that estimates and true values were in very good agreement. Figure 1 compares the true and the estimated unknown functions of the covariate, where the estimate is evaluated at grid points equally spaced between [0,10] with the increment of 0.05, and the posterior mean is given at each point, for sample sizes n of 100, 300, 500 and 1000 and for η(t) = sin(t). The figure supports the result in Neal (1996) about the consistency of the semiparametric estimator in the context of neural networks. As we increased the sample size, the estimated function became closer to the true function and is expected to converge to the true function as the sample size approaches infinity.

Figure 1
Convergence of estimator to the true function (η(t) = sin(t)) with increasing sample size (n = 100, 300, 500 and 1000). True functions are in bold. The dashed line is the posterior mean of the unknown function. The dotted lines represent the 95% ...
Table 1
Posterior mean parameter estimates for the data simulated Data with sample size n = 500 and true covariate function η(t) = sin(t)

Table 2 provides comparisons between the linear regression model and our semiparametric model for the three different η functions and for sample size n=300. The estimates from the semiparametric analysis agree well with the true values for all three functions, while the results from the linear parametric model are quite different. The estimated residual variances from linear model are much larger than the true values, while the QTL effect is underestimated and its estimated position is rather inaccurate. In contrast, our semiparametric model provides the flexibility to fit the data well.

Table 2
Comparison between parameter estimates obtained with the linear parametric analysis and the semiparametric analysis (values in parentheses are for the linear parametric model)

For further comparison, we simulated two additional cow weight datasets based on a simple linear growth model and a well-known generalized logistic function for modeling the growth curve (Richards, 1959). The linear growth model is η(t) = 187.459 + 2.682t, while the generalized logistic function is

η(t)=243+968{1+0.15e0.01955(t20.1)}1/0.15.

We set QTL effect β = 3 and σe2=9 so that the phenotypic variations due to the QTL, the covariate and the random error are roughly balanced. All other simulation parameters (including the marker data and QTL position) were the same as before (sample size n = 500).

The simulated responses are plotted against covariate time in Figure 2. Table 3 presents the posterior summaries of the analysis of the two data sets. For the data set simulated with the linear function, the estimates from the semiparametric and linear regression models are very similar. But for the data set simulated with the generalized logistic function, the semiparametric method produces again more accurate results.

Figure 2
(a) Simulated cow weight against weeks based on the generalized logistic growth curve. (b) Simulated cow weight against weeks based on the linear growth curve.
Table 3
Comparison between the linear parametric model and the semiparametric model (values in parentheses are from the linear parametric model)

4.2 Simulation II

We simulated a backcross population with 10 chromosomes, each containing 12 evenly spaced markers (11 intervals of length 20 centiMorgan). Four QTLs were located at the center of chromosomes 1, 3, 5, and 7 with effects 0.5, −0.5, 0.5, and −0.5 respectively. A total of 500 individuals were sampled. The covariate function was of the form η(t) = 5(12t2) sin(2t1), therefore two (interacting) variables t1 and t2 were simulated, where t1 was continuous and sampled from the uniform distribution An external file that holds a picture, illustration, etc.
Object name is nihms117741ig1.jpg[0, 10), while t2 was binary 0/1 and sampled with equal probability 0.5.

For the prior on η, we have mean μ(t; α) = α1 + α2t1 + α3t2 + α4t1t2, and covariance kernel σ0(t;t)=exp[{(t1t1)2+(t2t2)2}]. As before, the prior distribution for the αs is improper uniform. The prior for τ is again an inverse gamma distribution and is independent of α’s. The lower panel of Figure 3 shows the posterior mean for the probability of each interval containing a QTL. For comparison, we also performed linear regression analysis by setting the covariance kernel of η to 0. The linear model results are given in the upper panel of Figure 3. Clearly the semiparametric model performed better than the parametric model.

Figure 3
Posterior mean for the probability of each interval containing genes. Dashed lines represent true gene positions.

4.3 Real data analysis

In addition to the simulations, we tested our method on a real mouse study of obesity, a major risk factor for type II diabetes. To genetically dissect a polygenic mouse model of obesity-driven type II diabetes, Reifsnyder et al. (2000) outcrossed the obese, diabetes-prone, NZO (New Zealand Obese)/HlLt strain to the relatively lean NON (Nonobese Nondiabetic)/Lt strain, and then reciprocally backcrossing obese F1 mice to the lean NON/Lt parental strain. They measured the body weights of 203 backcross males. The total number of markers is 85 with average distance between adjacent markers of 20.5 cM. In addition, inguinal, gonadal, retroperitoneal and mesenteric fat pad weights have also been measured (the data set can be downloaded at http://cgd.jax.org/nav/qtlarchive1.htm). Following Stylianou et al. (2006), we first calculated the total fat pad weight as the mesenteric fat pad weight plus twice the sum of the inguinal, gonadal, and retroperitoneal fat pad weights. The lean body weight (LBwt) is defined as the difference between the total body weight and the total weight of the fat pads. In their QTL analysis of the mesenteric fat pad weight (MFPwt), Stylianou et al. (2006) adjusted for the effect of LBwt. We applied our semiparametric single and multiple Bayesian QTL mapping methods to the MFPwt using LBwt as a covariate. Both analyses strongly suggest a single QTL on chromosome 4 (Figure 4, panels (b) and (d)). The estimated function of LBwt on mesenteric fat pad weight is presented in Figure 4(a) and is nearly linear. For comparison, we performed parametric multiple Bayesian QTL mapping, which includes a linear effect of LBwt. This analysis also identified a single QTL on chromosome 4 (Figure 4(c)). Hence, as in our simulations, the semiparametric and parametric mapping methods yield very similar results when the relationship between covariate and trait is (nearly) linear, although the posterior probability for QTL presence was reduced for the semiparametric method, relative to the parametric analysis (Figure 4(c) and (d)).

Figure 4
Bayesian analysis of the backcross mesenteric fat pad data. (a) Scatter plot of LBwt vs MFPwt. The solid line represents the estimated curve from the semiparametric single QTL mapping method. (b) Marginal posterior distribution of the QTL position based ...

5. Discussion

We have proposed efficient and robust Bayesian semiparametric, single and multiple interval QTL mapping, which allows for an unknown function of non-genetic covariates. The covariates may have a non-linear relationship with the response and/or interact with each other in a complex way. A Gaussian process is used as the prior for the unknown function. This prior does not require any assumptions like monotonicity or additivity. A hierarchical scheme to construct a prior around a parametric family has been described. We specified the Gaussian process prior via the simple, hierarchical model in Equation (4), and we used approximately uninformative default prior for the hyperparameters: Flat priors for the location parameters (αis), and an inverse Gamma(0.1,0.1) prior for the precision parameter τ, which controls the smoothness of the sample paths of the unknown function. The method performed well on simulated and real datasets. The software and simulated data are available at the website: http://www.bios.unc.edu/~fzou/semiparam/index.html.

One of our reasons for choosing the Gaussian process is its ability to deal with multiple covariates. With the Gaussian process, we can specify a simple function as in Equation (3), and use the extra layer of randomness represented by Equation (4) to fine tune the curve to fit the data. Alternatively fitting a spline function to η(t) will not require the extra randomness of Equation (4). However, the drawback of the spline approach is its inflexibility in handling high-dimensional covariates. This is important for a substantial extension of our method. Essentially all parametric multiple QTL mapping method assume additive QTL action, or incorporate at most two-locus interactions, but no higher-order interactions. How to accommodate a large number of potential QTL with possibly higher-order interactions and interactions with environmental covariates in multiple QTL mapping remains a challenging problem. Our current work focuses on extending our semiparametric Gaussian process model to unknown functions including not only non-genetic covariates but also multiple putative QTL, with variable selection.

Acknowledgments

The authors wish to thank the Editor, Associate editor and two Referees for their helpful comments and suggestions, which have led to a great improvement of this article. This research was partially supported by NIH grants GM074175 (F.Z and H. H.) and CA79949 (H. Z.).

References

  • Abrahamsen P. Technical Report. Vol. 917. Norwegian Computing Center; Oslo: 1997. A review of Gaussian random fields and correlation functions.
  • Abramovich F, Sapatinas T, Silverman BW. Wavelet thresholding via a Bayesian approach. Journal of the Royal Statistical Society, Series B. 1998;60:725–749.
  • Basten CJ, Weir BS, Zeng ZB. QTL Cartographer: a reference manual and tutorial for QTL mapping. Department of Statistics, North Carolina State University; 1999.
  • Best NC, Cowles MK, Vines SK. CODA manual version 0.30. MRC Biostatistics Unit; Cambridge, UK: 1995.
  • Broman KW, Speed TP. A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion) Journal of the Royal Statistical Society, Series B. 2002;64:641–656. 731–775.
  • Broman KW, Wu H, Sen S, Churchill GA. R/qtl: QTL mapping in experimental crosses. Bioinformatics. 2003;19:889–890. [PubMed]
  • Chen H. Convergence rates for parametric components in a partly linear model. Ann Statist. 1988;16:136–146.
  • Cuzick J. Semiparametric additive regression. Journal of the Royal Statistical Society, Series B. 1992;54:831–843.
  • Denison DGT, Mallick BK, Smith AFM. Automatic Bayesian curve fitting. Journal of the Royal Statistical Society Series B. 1998;60:333–350.
  • DiMatteo I, Genovese CR, Kass R. Bayesian curve fitting with free-knot splines. Biometrika. 2001;88:1055–1071.
  • Doerge RW, Zeng ZB, Weir BS. Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science. 1997;12:195–219.
  • Fan J. Design-adaptive nonparametric regression. Journal of the American Statistical Association. 1992;87:998–1004.
  • Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7:457–72.
  • George EI, McCulloch RE. Variable selection via gibbs sampling. Journal of the American Statistical Association. 1993;88:881–889.
  • Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In: Bernardo JM, Berger JO, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Clarendon Press; Oxford, UK: 1992.
  • Godsill SJ. On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics. 2001;10:230–248.
  • Godsill SJ. Highly Structured Stochastic Systems. Oxford University Press; 2003. Proposal densities, and product space methods.
  • Green PJ. Reversible jump Markov Chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
  • Härdle W. Applied nonparametric regression. Cambridge University Press; 1990.
  • Hastie TJ, Loader C. Local regression: Automatic kernel carpentry (with discussion) Statistical Science. 1993;8:120–143.
  • Heckman N. Spline smoothing in a partly linear model. Journal of the Royal Statistical Society, Series B. 1986;48:244–248.
  • Hoeschele I. Mapping quantitative trait loci in outbred pedigrees. In: Balding DJ, Bishop M, Cannings C, editors. Handbook of Statistical Genetics. Wiley; New York: 2007. pp. 477–525.
  • Lan H, Kendziorski CM, Haag JD, Shepel LA, Newton MA, Gould MN. Genetic loci controlling breast cancer susceptibility in the Wistar-Kyoto rat. Genetics. 2001;157:331–339. [PubMed]
  • Lander E, Botstein D. Mapping mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics. 1989;121:185–199. [PubMed]
  • Lincoln SE, Daly MJ, Lander ES. A tutorial and reference manual for MAPMAKER/QTL. Whitehead Institute; 1993.
  • Lynch M, Walsh B. Genetics and analysis of quantitative traits. Sunderland, Mass; Sinauer: 1998.
  • MacKay DJ. Introduction to Gaussian processes. In: CM Bishop., editor. Neural networks and machine learning. Berlin: Springer; 1998.
  • Manly KF, Olson JM. Overview of QTL mapping software and introduction to Map Manager QT. Mammalian Genome. 1999;10:327–334. [PubMed]
  • Mays JE. Unpublished Ph.D. dissertation. Virginia Polytechnic Institute and State University; Blacksburg, VA: 1995. Model robust regression: combining parametric, nonparametric, and semiparametric methods; p. 200p.
  • McQueen MB, Bertram L, Rimm EB, Blacker D, Santangelo SL. A QTL genome scan of the metabolic syndrome and its component traits. BMC Genetics. 2003;4:S96. [PMC free article] [PubMed]
  • Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
  • Neal RM. Bayesion learning for neural networks. New York: Springer-Verlag; 1996.
  • Neal RM. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical report No. 9702. Dept. of Statistics, University of Toronto; 1997.
  • O’Hagan A. On curve fitting and optimal design for regression. Journal of the Royal Statistical, Society B. 1978;40:1–42.
  • Reifsnyder PC, Churchill GA, Leiter EH. Maternal environment and genotype interact to establish diabesity in mice. Genome Research. 2000;10:1568–1578. [PubMed]
  • Richards PJ. A flexible growth function for empirical use. Journal of Experimental Botany. 1959;10:290–300.
  • Satagopan JM, Yandell BS, Newton MA, Osborn TC. A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo. Genetics. 1996;144:805–816. [PubMed]
  • Sillanpää MJ, Arjas E. Bayesian mapping of multiple quantitative trait loci from incomplete inbred line cross data. Genetics. 1998;148:1373–1388. [PubMed]
  • Smith M, Kohn R. Nonparametric regression using Bayesian variable selection. Journal of Econometrics. 1996;75:317–343.
  • Speckman P. Kernel smoothing in partial linear models. Journal of the Royal Statistical, Society B. 1988;50:413–436.
  • Stephens DA, Fisch RD. Bayesian analysis of quantitative trait locus data using reversible jump Markov chain Monte Carlo. Biometrics. 1998;54:1334–1347.
  • Stoehr JP, Nadler ST, Schueler KL, Rabaglia ME, Yandell BS, Metz SA, Attie AD. Genetic obesity unmasks nonlinear interactions between murine type 2 diabetes susceptibility loci. Diabetes. 2000;49:1946–1954. [PubMed]
  • Stylianou IM, Korstanje R, Li R, Sheehan S, Paigen B, Churchill GA. Quantitative trait locus analysis for obesity reveals multiple networks of interacting loci. Mammalian Genome. 2006;17:22–36. [PubMed]
  • Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical, Society B. 1996;58:267–288.
  • Ven RV. Reversible-Jump Markov chain Monte Carlo for quantitative trait loci mapping. Genetics. 2004;167:1033–1035. [PubMed]
  • Wahba G. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical, Society B. 1978;40:364–372.
  • Wahba G. Cross validated spline methods for the estimation of multivariate functions from data on functionals. In: David HA, David HT, editors. Statistics: An appraisal, proceedings 50th anniversary conference Iowa state statistical laboratory. Iowa State Univ. Press; Ames, Ia.: 1984. pp. 205–235.
  • Wand MP, Jones MC. Kernel smoothing. London: Chapman and Hall; 1995.
  • Wang H, Zhang YM, Li X, Masinde GL, Mohan S, Baylink DJ, Xu S. Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics. 2005;170:465–480. [PubMed]
  • Yi N. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics. 2004;167:967–975. [PubMed]
  • Yi N, Xu S. Bayesian mapping of quantitative trait loci for complex binary traits. Genetics. 2000;155:1391–1403. [PubMed]
  • Yi N, Xu S. Bayesian mapping of quantitative trait loci under complicated mating designs. Genetics. 2001;157:1759–1771. [PubMed]
  • Yi N, Xu S. Bayesian LASSO for Quantitative Trait Loci Mapping. Genetics. 2008;179:1045–1055. [PubMed]
  • Yi N, Shriner D, Banerjee S, Mehta T, Pomp D, Yandell BS. An efficient Bayesian model selection approach for interacting quantitative trait loci models with many effects. Genetics. 2007;176:1865–1877. [PubMed]