Suppose that for each individual
i (
i = 1, 2, …,
n) we observe
yi, a quantitative trait;
zi, a
p-dimensional vector of individual-specific covariates, such as age and sex; and
xi = (
xi1,
xi2, …,
xiJ) , genotypes at
J SNPs. Here, we assume an additive genetic model; thus
xij = 0, 1, or 2, representing the number of minor alleles present at SNP
j of individual
i. A regression model on the quantitative trait is given by:
for
i = 1, 2, …,
n, where
γ is a vector of regression coefficients, including the intercept and slopes for individual-specific covariates, the
βj are the SNP-specific regression coefficients, and
εi is the error term. We specify the following prior distributions for the model parameters:

,
βj ~
G,
G ~ DP(
α,
G0), and
σ ~
U(
a,
b). Here,
v2 and
b >
a ≥ 0 are prespecified hyperparameters,

is a (
p + 1)-dimensional normal distribution with mean vector
μ and variance-covariance matrix Σ,
0 is a vector of zeros,
I is an identity matrix,
G denotes a random distribution,
U(
a,
b) denotes a uniform distribution between
a and
b, and DP(
α,
G0) is the Dirichlet process.
Dirichlet process
The Dirichlet process [
6] is a probability model on a space of probability distributions. It has two parameters: the base probability distribution
G0 and the precision parameter
α (>0). If
G ~ DP(
α,
G0), then
G0 is the prior expectation of
G and
α controls the variance of
G. Here, we take

, which is a normal density truncated below at 0, and use
U(
c,
d),
d >
c > 0, as the prior distribution for the precision parameter
α.
Sethuraman [
7] provided a stick-breaking construction of the Dirichlet process, which states that if we have:
is a random probability distribution generated from DP(α, G0), where δϕk denotes a point mass at ϕk. It is clear that G is discrete with probability 1. Because of the discreteness, the βj can take on the same value. That is why the Dirichlet process can be used for clustering analysis.
Ishwaran and James [
8] studied a truncated version of the Dirichlet process by choosing a truncation level
N and setting
VN = 1 in the stick-breaking construction. They used the truncated Dirichlet process to approximate Dirichlet process prior distributions and developed a block Gibbs sampling method for Dirichlet process models.
Clustering
Each iteration of the Gibbs sampler gives a clustering structure of SNP-specific regression coefficients such that coefficients taking the same value are clustered together. The number of clusters and the cluster membership of the coefficients vary across iterations, giving a random sample of clustering structures. Pairwise probabilities of two coefficients being equal are calculated from the posterior samples [
9]. A distance measure is derived as 1 minus these pairwise probabilities and is then used in complete linkage hierarchical clustering to obtain a final clustering structure of the SNPs. We study a range of the number of clusters, from as small as 2–5 clusters to as large as 100 clusters. Optimal cluster numbers are also obtained by striking a balance between sensitivity and specificity. In all cases, SNPs in the largest cluster have relatively low or close to zero estimates of regression coefficients and are considered not associated with the trait. SNPs falling outside the largest cluster have relatively high estimates of regression coefficients and are considered potential risk variants. The proportion of false discovery (FDP), defined as the ratio of the number of false discoveries to the total number of discoveries, is examined.
Application of the method
We illustrate our methods using the data from Genetic Analysis Workshop 17. The analyses were performed with the knowledge of the underlying simulation model [
10]. We studied the first 10 replicates of the quantitative trait Q1. Each replicate contains 697 unrelated individuals from 7 populations. To control for population stratification, we conducted principal components analysis on nonsynonymous common SNPs (
n = 1,379) and included the resulting first two components as covariates, in addition to Age and Smoke. We built our model with 244 nonsynonymous SNPs selected from the vascular endothelial growth factor (VEGF) pathway [
11]. These SNPs include all 39 functional SNPs for Q1, of which 23 are private variants (found in one individual, MAF = 0.000717) and 2 are common SNPs. The model was fitted using WinBUGS [
12] with
v2 = 1,000,
a = 0,
b = 100,
c = 0.5,
d = 20,
µ0 = 0.5, and

. The truncation level for the Dirichlet process was fixed at 50. For each replicate, 10,000 Markov chain Monte Carlo posterior samples were generated after a burn-in period of 2,000 iterations.
We evaluated our results using two thresholds. When the number of clusters was small (2–5), we defined true positives as true associations identified in at least 2 of the 10 replicates. This threshold was selected to balance the reduced power resulting from small cluster numbers. Indeed, requiring at least two replications for each identified association yielded a reasonably low FDP. When we used the optimal cluster numbers, we defined true positives as true associations detected in no less than eight replicates. We carried out sensitivity analyses on the prior specification for the SNP-specific regression coefficients with
µ0 ranging from 0.1 to 0.5 and

ranging from 0.5 to 2. Similar results were obtained.