3.1. Estimation When There Is Prior Knowledge of the Disease Mode of Inheritance

Ideally, if we know the causal loci, their mode of inheritance and the interaction model, we can incorporate this into the model and hence estimate the underlying true optimal ROC curve. Let

*y*_{i} (

*y*_{i} *S*) be the binary response measurement and

*x*_{i} =(

*x*_{i}_{1},

*x*_{i}_{2}, …,

*x*_{ip}) (

*x*_{ij} (

*g*_{j}_{1},

*g*_{j}_{2}, …,

*g*_{jmj})) be the measurement of

*p* disease loci for the

*i-*th individual. The marginal distribution of genotypes given disease status (

*S* = 0, 1) can be calculated as

where

*I*_{A} (

*i*) is the indicator function defined as

. Given the mode of inheritance, we can combine the genotypes that have the same relative risks. For instance, for a single SNP marker

*j* that has three genotypes,

*A A*,

*A Ā* and

*Ā Ā*, if

*A* is dominant we cluster the genotypes

*A A* and

*A Ā* into one group and calculate the corresponding frequency as

. Given the disease prevalence

*ρ*, we calculate the population genotype frequencies:

*P* (

*g*_{jkj}) =

*ρP* (

*g*_{jkj}|

*S* =1)+(1 −

*ρ*)

*P*(

*g*_{jkj}|

*S* =0). If the variants at the

*p* loci cause the disease independently (i.e., no interaction), then, based on a multiplicative model, the probabilities of the multi-locus genotype

*G*_{l} =(

*g*_{1k1},

*g*_{2k2}, …,

*g*_{pkp}) given disease status can be expressed as

where

*ρ* and

*L* respectively denote the disease prevalence and the total number of multi-locus genotypes possible for the

*p* disease loci, and

if all

*p* loci are in linkage equilibrium. We then rank the multi-locus genotypes according to their LRs, defined by

*LR*_{l} =

*P*(

*G*_{l}|

*S* =1)/

*P*(

*G*_{l}|

*S* =0), and plot the ROC curve. This represents the empirical optimal ROC curve, consisting of the set of TPR and FPR pairs:

where

*G*_{(}_{ζ}_{)} is the

*ζ*-

*th* multi-locus genotype when ranked by its LR value. As we have shown previously, it has the highest AUC (

Lu and Elston, 2008) given by:

where

*TPR*_{(0)} =

*FPR*_{(0)} = 0.

This approach provides a simple way to approximate the underlying optimal ROC curve by incorporating the correct disease model (i.e., mode of inheritance) and by making the multiplicative model assumption. Following the same strategy, as described in more detail in

Web Appendix A, we can extend this approach to incorporate interacting loci. This method is ideal for diseases that have been well studied. However, in most cases the disease model is not well understood. The disease susceptibility loci that we have discovered may not even be causal loci. Therefore, we now propose an approach that does not require prior knowledge about the disease model.

3.2. Estimation When There Is No Prior Knowledge of the Disease Model

In this approach, we start with all the data, treating each multi-locus genotype as a separate cluster, and then implement a backward clustering algorithm to group the multi-locus genotypes and so reduce the model complexity. Based on the prediction AUC calculated from *K*-fold cross validation, we choose the most parsimonious model with the appropriate number of multi-locus genotype clusters.

Suppose we have

*L* possible multi-locus genotypes (

) created from

*p* loci. The

*p* loci are disease susceptibility loci detected from previous association studies. We first estimate from the data the distribution of all the

*p* -locus genotypes, separately in the case and control samples,

The

*p* -locus genotypes are then ranked by their likelihood ratios,

, to estimate the optimal ROC curve. Since the ROC curve depends on only the ranks of the test results, we use these ordered

*p* -locus genotypes,

, to represent our full model when constructing the optimal ROC curve. Once we have formed this optimal ROC curve, we use it to estimate the area under the optimal ROC curve and denoted this estimate

.

Since this full model

*G*^{(0)} has the largest number of multi-locus genotype clusters (i.e., each multi-locus genotype represents a separate cluster) and includes all

*p* loci, it most likely over-fits the data, and

is hence biased upward. To reduce the model’s complexity, we gradually combine the multi-locus genotypes together and search for the best model that has a more accurate AUC estimate. At each step of the backward clustering process, we consider all possible multi-locus genotype clusterings that can be formed when two one-locus genotypes are pooled together, and choose the clustering that leads to a minimum decrease in the AUC when applied to the data. Thus, suppose we have

*p*′ disease susceptibility loci at backward clustering step

*t* (i.e

*. p−p*′ loci have been eliminated prior to step

*t*). For a genotype pair

at locus

*j*, we combine the two corresponding multi-locus genotypes that contain

or

and rank all the multi-locus genotypes (some now being combined together) according to their LRs. We then form the optimal ROC curve, which is denoted

, and calculate the corresponding AUC,

. Step

*t* consists of repeating this for all possible pairs of one-locus genotypes

at all

*p*′ loci, and the best candidate model at step

*t*, denoted

*G*^{(}^{t}^{)}, is chosen based on having the highest AUC.

We repeat the clustering process steps and so progressively combine the multi-locus genotypes until all multi-locus genotypes finally cluster into one group. As a result, we obtain a series of candidate models, *G*^{(0)}, *G*^{(1)}, …, *G*^{(}^{T}^{)}, where *G*^{(}^{T}^{)} is the model with only one cluster of multi-locus genotypes, which has an AUC value of 0.5. Note that for *p* diallelic SNPs, the maximum number of backward clustering steps, *T*, is 2*p* and for a real data application it would be fewer if not all multi-locus genotypes are represented in the data. Clearly, nether *G*^{(0)} nor *G*^{(}^{T}^{)} is the appropriate model for building the test. *G*^{(0)}, the most complex model with the largest number of multi-locus genotype clusters and including all *p* loci, has a high but over-fitted AUC, while *G*^{(}^{T}^{)}, the simplest model with only one cluster of multi-locus genotypes and excluding all *p* loci, gives a useless AUC. Letting *nc*^{(0)}, *nc*^{(1)}, …, *nc*^{(}^{T}^{)}, the numbers of multi-locus genotype clusters for candidate models *G*^{(0)}, *G*^{(1)}, …, *G*^{(}^{T}^{)}, describe the model complexity, a parsimonious model with a modest model complexity *nc*^{(}^{m}^{)} should exist between *G*^{(0)} and *G*^{(}^{T}^{)} that has a relatively high and accurate AUC. To find the most appropriate number of multi-locus genotype clusters, *nc*^{(}^{m}^{)}, we use *K*-fold cross-validation.

In

*K*-fold cross validation, we randomly partition the data into

*K* subsets.

*K*−1 subsets are used for the purpose of training and the remaining subset is used for the purpose of validation. We first apply the above clustering algorithm to the training dataset to find the candidate models,

,

*k* =1, 2, ···,

*K*, and then apply these candidate models (i.e., the order of the multi-locus genotypes from the training dataset) to the validation dataset to construct the ROC curve and calculate the AUC. We repeat the cross-validation process

*K* times, with each of the

*K* subsets used exactly once as the validation dataset. The

*K* results are then averaged to provide an estimate of the prediction AUC, which is denoted

,

*t* =0,1, …,

*T. nc*^{(}^{m}^{)} is then chosen to be the value of

*nc*^{(}^{t)} that maximizes the

, and the corresponding model,

*G*^{(}^{m}^{)}, is chosen as the most parsimonious model.

A practical issue in the cross-validation process is that by chance some of the multi-locus genotypes in the validation dataset may not be found in the training dataset, and hence we cannot order these multi-locus genotypes. Instead of treating them as missing, we put them into clusters whose order can be inferred from the multi-locus genotypes present in the training dataset. In other words, we follow the same strategy of gradually clustering the multi-locus genotypes in the training dataset until we arrive at clusters that are present in the validation dataset, in the sense that each cluster in the training dataset corresponds to a cluster in the validation dataset, though the latter may not include all the multi-locus genotypes present in the training dataset cluster. We then calculate the LR statistics for the clusters that are in the training dataset, and use those to infer the multi-locus genotypes’ order.

This backward clustering algorithm applies naturally to both the disease model and marker selection. Ideally, by clustering the multi-locus genotypes on a particular locus, the method automatically approaches the marker’s mode of inheritance or eliminates a “noise” marker. For instance, if at a particular step the multi-locus genotypes with either

*A A* or

*A Ā* genotype at SNP

*j* are grouped together, this would suggest that SNP

*j* follows a model in which

*A* is dominant and

*Ā* is recessive. Further clustering of the

*A A* and

*A Ā* multi-locus genotypes with their corresponding

*Ā Ā* multi-locus genotype implies that SNP

*j* is not associated with the disease, and therefore we should remove it from consideration. In a similar manner, by continuously clustering multi-locus genotypes of more than one locus, the method is able to select an interaction model. The algorithm is also flexible enough to incorporate biological information. For example, if at SNP

*j* clustering two multi-locus genotypes with

*A A* and

*Ā Ā*, but not the corresponding one with

*A Ā*, is not considered biologically plausible, we might exclude this possible clustering from the algorithm, and hence help improve the algorithm’s performance. A simple example of the clustering algorithm is given in

Web Appendix B.