We developed a novel statistical approach, DF-SNPs, and used it to study the association between SNP type data and case/control status in a study of esophageal cancer. Using DF-SNPs, we identified a list of SNPs, SNP types and SNP patterns that might be associated with esophageal squamous cell carcinoma. This approach could be useful for identifying potential biomarkers based on SNP data.
We have successfully developed and used the DF method for various applications, including structure-activity relationship studies, microarray data analyses and proteomics data analyses [15]. Unlike previous applications of DF, the independent variables in the SNP data set are categorical rather than continuous. Moreover, each categorical variable (SNP variable) has only three categories (three genotypes), which is a difficult problem for most classification methods. DF-SNPs is a variant of the DF method that is specifically designed to analyze SNP-disease associations. In DF-SNPs, as in previous DF applications, multiple individual trees are combined to produce a better model. As shown in Figure , the DF model accuracy increases with the number of independent trees within the forest. The 10-tree forest that was developed has high concordance, specificity and sensitivity for the fitted data. Such a model could be used to assess the cancer potential of unknown samples based solely on their SNP profiles.
There are two important considerations for the use of DF-SNPs when compared with alternative classifier methods. First, combining identical or similar trees will not improve the quality of the resulting forest; the benefit of combination can only be realized when the individual trees are different, or heterogeneous. Thus, each tree in DF-SNPs uses a distinct SNP type for splitting the root node, ensuring that each tree is different and encodes a different aspect of the disease-SNP association. Second, combining individual trees of similar quality (i.e., having similar misclassification rates) may cancel some of the random noise inherent in SNP type and case-control data.
The Masscode mass spectrometry-based genotyping method resulted in 3–5% missing genotypes. Appropriately imputing these missing values is important for subsequent analysis of data generated with this technology. Accordingly, a two-step imputation method was embedded in DF-SNPs. First, we removed individuals for whom most genotype data were missing, as well as SNP variables that were not detected in many individuals. Then we imputed the missing SNP genotypes for each remaining individual using a 10-nearest-neighbor method. This approach proved to be efficient for preprocessing the SNP data set.
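The two-step preprocessing described above can be sketched as follows. This is a minimal illustration and not the DF-SNPs implementation: the genotype coding (0, 1, 2, with -1 for missing), the missingness thresholds, the mismatch-fraction distance and the majority vote among neighbors are all assumptions introduced here, since the text specifies only that filtering precedes a 10-nearest-neighbor imputation.

```python
import numpy as np

MISSING = -1  # assumed code for a missing call; 0, 1, 2 stand for the three genotypes

def filter_by_missingness(G, max_sample_missing=0.5, max_snp_missing=0.2):
    """Step 1: drop individuals (rows), then SNPs (columns), with too many missing calls.
    Both thresholds are illustrative assumptions."""
    keep_rows = (G == MISSING).mean(axis=1) <= max_sample_missing
    G = G[keep_rows]
    keep_cols = (G == MISSING).mean(axis=0) <= max_snp_missing
    return G[:, keep_cols], keep_rows, keep_cols

def knn_impute(G, k=10):
    """Step 2: fill each missing genotype with the most common genotype among
    the k nearest individuals that have that SNP observed."""
    observed = G != MISSING
    out = G.copy()
    for i in range(G.shape[0]):
        miss_cols = np.where(~observed[i])[0]
        if miss_cols.size == 0:
            continue
        # Distance: fraction of mismatching genotypes over SNPs observed in both individuals.
        both = observed & observed[i]
        n_both = both.sum(axis=1)
        mismatches = ((G != G[i]) & both).sum(axis=1)
        dist = np.where(n_both > 0, mismatches / np.maximum(n_both, 1), np.inf)
        dist[i] = np.inf  # an individual is never its own neighbor
        order = np.argsort(dist, kind="stable")
        for j in miss_cols:
            donors = [r for r in order if observed[r, j] and np.isfinite(dist[r])][:k]
            if donors:
                # Majority vote; ties resolve to the lowest genotype code.
                out[i, j] = np.bincount(G[donors, j], minlength=3).argmax()
    return out

# Toy data (made up for illustration, not study data).
G = np.array([
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [0, 1, MISSING, 0],
    [2, 2, 0, 1],
    [MISSING, MISSING, MISSING, 1],
])
filtered, kept_rows, kept_cols = filter_by_missingness(G, 0.5, 0.5)
imputed = knn_impute(filtered, k=2)
```

In this toy example the last individual is removed in step 1, and the single remaining gap is filled from the two individuals whose observed genotypes match exactly.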
In DF-SNPs, the potential cancer-related SNPs, SNP types and SNP patterns were identified on the basis of their frequencies of occurrence in decision tree splitting across all trees during 10-fold cross-validation. A randomization test was also done with cross-validation to provide a random distribution of frequencies for comparison with the fitted model. Comparison of the fitted and random frequencies provided estimates of the statistical significance of SNPs, SNP types and SNP patterns in distinguishing cases from controls.
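The logic of comparing fitted and random frequencies can be illustrated with a generic label-permutation test. The sketch below is an assumption-laden stand-in, not the published procedure: `toy_snp_score` is a hypothetical placeholder for the split-frequency statistic (which in DF-SNPs would come from refitting the forest under cross-validation), and the empirical p-value formula is one common convention.

```python
import numpy as np

def toy_snp_score(X, y):
    """Hypothetical stand-in statistic: per-SNP absolute difference in mean
    genotype code between cases (y == 1) and controls (y == 0)."""
    return np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

def randomization_pvalues(score_fn, X, y, n_perm=200, seed=0):
    """Compare the observed per-SNP statistic with its distribution under
    randomly permuted case/control labels."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(score_fn(X, y), dtype=float)
    exceed = np.zeros(observed.shape)
    for _ in range(n_perm):
        null = np.asarray(score_fn(X, rng.permutation(y)), dtype=float)
        exceed += null >= observed
    # Add-one correction keeps the empirical p-value strictly above zero.
    return observed, (exceed + 1.0) / (n_perm + 1.0)

# Toy data (made up): 20 controls, 20 cases, 5 SNPs; SNP 0 tracks the label exactly.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
X = rng.integers(0, 3, size=(40, 5))
X[:, 0] = 2 * y
observed, pvals = randomization_pvalues(toy_snp_score, X, y)
```

Any statistic computed per SNP, SNP type or SNP pattern could be passed as `score_fn`, including a weighted split frequency.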
To investigate which SNPs are relevant to esophageal squamous cell carcinoma, we employed a weighted approach to calculate the frequency of each SNP. The SNPs used for splitting the root node are applied to the entire data set, those used for the next split at the second level are applied to a much smaller portion of the data set (normally around half), and subsequent splits are applied to even smaller numbers of samples. The relevance of a SNP to cancer should therefore decrease with the depth of the tree level at which it was selected. We compared several weighting factors that take the tree level into account when calculating the frequency of SNPs, including 1, 1.25, 1.5 and 2. Since the other weighting factors potentially eliminated the SNPs used in the root node (results not shown), the weighting factor of 2 was selected, indicating that the relevance (or importance) of a SNP is reduced by half as it moves to each subsequent lower level.
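One way to read the weighting scheme is that a split at tree level l (root = level 1) contributes 1 / w^(l - 1) to its SNP's frequency, so that w = 2 halves the contribution at each lower level and w = 1 reduces to a plain count. The sketch below assumes this reading; the exact formula is not given in the text, and the SNP names are hypothetical.

```python
from collections import defaultdict

def weighted_snp_frequency(splits, factor=2.0):
    """Sum level-discounted split counts per SNP.

    splits: iterable of (snp_name, level) pairs collected over all trees,
            with level 1 denoting the root node.
    factor: weighting factor w; each lower level divides the contribution by w.
    """
    freq = defaultdict(float)
    for snp, level in splits:
        freq[snp] += 1.0 / factor ** (level - 1)
    return dict(freq)

# Hypothetical splits gathered from a small forest.
splits = [("snpA", 1), ("snpB", 2), ("snpB", 2), ("snpC", 3)]
weighted = weighted_snp_frequency(splits, factor=2.0)
unweighted = weighted_snp_frequency(splits, factor=1.0)
```

With a factor of 2, one root split of `snpA` counts as much as two second-level splits of `snpB`, whereas with a factor of 1 `snpB` would rank above the root SNP.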
The odds ratios and corresponding confidence intervals were used to identify 14 SNP types that distinguish cases from controls at 95% confidence (Table ). Of these, five had confidence intervals that lay entirely above or entirely below 1 and thus are likely to be more significant. Of the five, two had confidence intervals <1, indicating their possible association with reduced cancer potential. The three with confidence intervals >1 may be associated with increased cancer risk. We further found that two GADD45B E1122 genotypes (numbers 1 and 4 in Table ) appear to modify cancer risk differently, with the homozygous common genotype possibly increasing esophageal cancer risk and the heterozygous genotype possibly decreasing cancer risk. These data suggest a potentially important role for polymorphisms of GADD45B E1122 as a biomarker of esophageal cancer risk.
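For readers unfamiliar with the criterion, the following sketch shows how an odds ratio and its 95% confidence interval can be computed from a 2x2 table and checked against 1. It assumes the standard log (Woolf) approximation, which may differ from the method actually used in this study, and the counts are invented for illustration.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate confidence interval (Woolf's log method).

    a: cases with the genotype      b: controls with the genotype
    c: cases without the genotype   d: controls without the genotype
    z: normal quantile (1.96 for a 95% interval). All counts must be > 0.
    """
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(OR)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

def direction(lower, upper):
    """Classify an interval relative to the null value of 1."""
    if lower > 1:
        return "increased risk"
    if upper < 1:
        return "decreased risk"
    return "interval includes 1"

# Invented counts, not data from this study.
or_, lo, hi = odds_ratio_ci(a=30, b=10, c=20, d=40)
label = direction(lo, hi)
```

An interval lying entirely above 1 points to increased risk, one entirely below 1 to decreased risk, and one that spans 1 is inconclusive at that confidence level.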
Prospectively, given appropriate and sufficient data, DF-SNPs provides a methodology that could identify possible SNP-SNP associations, that is, SNP patterns involved in genetic-based variation in cancer risk. Table illustrates how such predictions would appear for the case of patterns of two SNPs. Of the 15 2-SNP patterns in Table , it is interesting that the data suggest that 12 are associated with decreased risk and two are associated with increased risk. Also notable is that odds ratios are substantially larger for the patterns of two SNPs than for individual SNPs (compare Table with Table ), possibly indicating that patterns of SNPs are more predictive of cancer risk than individual SNPs. Not surprisingly, analysis showed that odds ratios vary in direct proportion to the length of SNP patterns (results not shown).