The dataset used in this work was obtained from the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial of the National Cancer Institute (see Moslehi et al. for details). Genotyping for six NAT2 SNPs (C282T, T341C, C481T, G590A, A803G and G857A) was performed using the TaqMan® kit (Applied Biosystems Inc., Carlsbad, CA, USA). The acetylator phenotypes were assigned in our previous study based on the haplotypes determined from the SNP genotyping data for each subject (Moslehi et al.). The dataset consists of 1377 subjects (see Supplementary Table 1 for details and ethnic makeup).

Prediction of the acetylator phenotype from combinations of SNPs, as defined here, is a three-class classification problem that can be addressed using a supervised pattern recognition method. We used the Support Vector Machine (SVM) as the method of choice (Vapnik, 1998). We constructed a three-class SVM predictor using the one-against-one approach, which was shown to perform better than other approaches to multi-class SVMs (Hsu and Lin, 2002). We used the SVM implementation in the LIBSVM package (Chang and Lin, 2003) with the linear kernel.

Each NAT2
SNP was encoded using a set of three mutually orthogonal binary vectors: homozygote for the most frequent allele (1,0,0), heterozygote (0,1,0) and homozygote for the least frequent allele (0,0,1). For a given subject, the vectors describing each of the six observed SNPs were concatenated, resulting in a final binary feature vector of dimension 18. Thus, the SNP combination of each subject was described by 18 binary variables.

We used 7-fold cross-validation to test the SVM predictor of the acetylator phenotype. In this approach, the dataset is randomly partitioned into seven groups, each containing approximately 1/7 of the dataset. At each cross-validation run, one group is removed, and the predictor is trained on the remaining observations and tested on the removed group. The process is repeated seven times, so that each group is used for testing exactly once.

To assess different aspects of classification quality, we used the following performance measures: overall accuracy (ACC), sensitivity (SN) for class i and specificity (SP) for class i (Baldi et al.). These measures are computed from Z, a 3 × 3 confusion (contingency) matrix in which an element z[i,j] represents the number of times objects from class i are predicted to be in class j, and N, the total number of objects (N = 1377 in this work).
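
The encoding, fold partition and performance measures described above can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names, the 0/1/2 genotype codes (number of copies of the less frequent allele) and the random seed are hypothetical, and SVM training itself is omitted. The sensitivity is taken as the fraction of class i objects that are predicted correctly (row-normalized). The column-normalized form of SP (the fraction of predictions of class i that are correct) is an assumption; a one-vs-rest true negative rate would be another common reading.

```python
import random

# Hypothetical genotype codes: copies of the less frequent allele (0, 1 or 2).
GENOTYPE_VECTORS = {
    0: (1, 0, 0),  # homozygote for the most frequent allele
    1: (0, 1, 0),  # heterozygote
    2: (0, 0, 1),  # homozygote for the least frequent allele
}


def encode_subject(genotypes):
    """Concatenate the three-element vectors of six SNPs into 18 binary features."""
    if len(genotypes) != 6:
        raise ValueError("expected genotypes for six SNPs")
    features = []
    for code in genotypes:
        features.extend(GENOTYPE_VECTORS[code])
    return features


def k_fold_indices(n_subjects, k=7, seed=0):
    """Randomly partition subject indices into k groups of near-equal size."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    # Taking every k-th shuffled index gives groups whose sizes differ by at most one.
    return [indices[fold::k] for fold in range(k)]


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def performance(z):
    """Return (ACC, SN per class, SP per class) from a square confusion matrix.

    z[i][j] counts objects of true class i predicted as class j.
    SN_i = z[i][i] / (sum of row i).
    SP_i = z[i][i] / (sum of column i); this form is an assumption.
    """
    n_classes = len(z)
    total = sum(sum(row) for row in z)
    acc = _ratio(sum(z[i][i] for i in range(n_classes)), total)
    sn = [_ratio(z[i][i], sum(z[i])) for i in range(n_classes)]
    sp = [
        _ratio(z[i][i], sum(z[j][i] for j in range(n_classes)))
        for i in range(n_classes)
    ]
    return acc, sn, sp
```

For example, `encode_subject([0, 1, 2, 0, 0, 1])` returns an 18-element list with exactly one 1 per SNP, and `k_fold_indices(1377)` yields five groups of 197 subjects and two of 196, since 1377 is not divisible by 7. In a full run, each group in turn would serve as the test set, and the predictions accumulated over the seven runs would fill the confusion matrix passed to `performance`.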