Molecular epidemiology is entering a new era owing to advancements in genotyping technology and annotation of variation in the human genome. Few established statistical and bioinformatics tools exist for studying complex interactions underlying common diseases such as cancer and cardiovascular disease. There is no a priori way to determine which method would be best to identify complex interactions in experimental datasets with variable minor allele frequencies and unknown heritability. Therefore, we developed a method to explore complex interactions. PIA v. 2.0 incorporates several different scoring functions for cross-validation. PIA v. 2.0 also deals with imbalanced case:control ratios, a common design feature of case-control studies. Moreover, PIA v. 2.0 allows stratification to determine if particular combinations of SNPs are associated with a phenotype (i.e. case vs. control) only in a particular subgroup of individuals. Finally, PIA v. 2.0 allows examination of the association of user-defined pathways (based on SNP or gene) with a particular phenotype.
In this study, we applied PIA v. 2.0 to simulated data sets with 2 interacting SNPs in the context of non-interacting SNPs for a total of 10 (imbalanced) or 20 (balanced) SNPs and examined the power of PIA v. 2.0 to detect the interacting SNPs. The genetic model for these data sets assumed a modest genetic association with phenotype (MAF of 0.2 and heritability of 0.01). PIA v. 2.0 detected the interacting SNPs as the highest-ranking model in 77% of the data sets, and using some models, detected the SNPs in over 90% of the data sets. These results indicate that PIA is a powerful tool for detecting interactions.
A variety of approaches for studying complex interactions exist [5], but clearly additional methods are needed [11]. Multiple approaches should be implemented when examining complex data to reduce the likelihood of false positive associations [11]. PIA v. 2.0 was designed to incorporate several scoring functions so that SNP-SNP interactions may be validated over several functions and ranked according to a total score. Using the simulated data, the total scoring function in PIA v. 2.0 performed better than all of the scoring functions other than the Gini Index. In addition, the interacting SNPs were observed in the 1st model using PIA v. 2.0 more frequently than when using MDR [14]. As shown with the example colon cancer data, CASP8_03 was the highest scoring 2-SNP combination overall, while IL1B_01 was the highest-ranking 2-SNP combination using only the Gini Index scoring function, indicating that each function may have strengths in different contexts or datasets. Of note, IL1B_01 are in linkage disequilibrium [5].
The interacting SNPs in the simulated data sets were detected more frequently using the Gini Index scoring function than using the total score. The reason for this is unclear, but it may be because the Gini Index uses the entire data set in scoring, instead of dividing the population into testing and training data. It is also possible that, with the balanced data, the total score is biased because the same function is effectively counted twice: in the balanced data sets the numbers of cases and controls were the same, and in this situation the % Correct and Sensitivity scoring functions reduce to the same formula. However, even with the imbalanced data, where fractional occupations are used and the % Correct function is excluded from the scoring, the Gini Index performed better than the total score.
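To make the "entire data set" point concrete, the following is a minimal sketch of a Gini-type score computed over all individuals at once, with no training/testing split. It uses the standard CART Gini impurity; the exact formulation used inside PIA v. 2.0 is not given in this section, so the function name `gini_index` and the 0/1 phenotype coding are illustrative assumptions.

```python
from collections import Counter


def gini_index(genotypes, phenotypes):
    """Weighted Gini impurity of a genotype-combination table.

    genotypes:  one hashable genotype combination per individual,
                e.g. a tuple such as (0, 2) for a 2-SNP model.
    phenotypes: 0 (control) or 1 (case) per individual, same order.
    Lower values mean the genotype cells separate cases from controls
    more cleanly. Assumption: standard CART impurity, not PIA's code.
    """
    n = len(genotypes)
    cell_sizes = Counter(genotypes)
    cell_cases = Counter(g for g, p in zip(genotypes, phenotypes) if p == 1)

    impurity = 0.0
    for cell, size in cell_sizes.items():
        p_case = cell_cases[cell] / size
        p_control = 1.0 - p_case
        # Each cell contributes its impurity weighted by its share of the data.
        impurity += (size / n) * (1.0 - p_case**2 - p_control**2)
    return impurity
```

Because every individual contributes to the score, a function of this kind needs only a single pass over the data, which is consistent with the single-run behaviour described for the Gini Index elsewhere in this paper.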
In addition to examining the highest scoring SNP models, to increase sensitivity, we suggest that several of the top scoring models should be investigated [5]. The difference in score between the 1st ranking and 2nd ranking SNP models may be modest. Only studying the highest scoring model may result in missing relevant associations [5]. For example, when reviewing the results of the 2-SNP interactions, the evidence for the IL1B_01 scoring 2-SNP model) and the GSTT1_02 highest scoring model, data not shown) interactions are similar; both were high scoring models using multiple scoring functions. In this report, using the simulated genetic data, the interacting SNPs were frequently observed in the 2nd ranking SNP models (~10% of data sets).
In addition to exploring the 1–4 SNP combinations most strongly associated with outcome, PIA v. 2.0 allows the user to examine the most commonly occurring SNP pairs among the top scoring 3-SNP models. The interacting SNPs were frequently detected as pairs when using PIA to examine the 3-SNP models using the simulated genetic data. Therefore, using PIA v. 2.0, there are multiple approaches to exploring 2-SNP interactions. It is unclear at present which approach is more sensitive. However, it should be noted that the data sets used in this paper, while appropriate for testing the ability of PIA v. 2.0 to detect interacting SNPs, might not accurately represent the situation in complex diseases. In complex diseases, multiple genes interact in complex pathways, as opposed to only 2 interacting SNPs in the context of non-interacting SNPs; that is, in real data there are likely many competing interactions and alternative pathways. In this context, if an investigator is interested in 2-SNP interactions, it may be more appropriate to study the commonly occurring pairs in the 3-SNP models. Until the better approach is determined, we suggest examining both the 2-SNP models and the pairs observed in the 3-SNP models.
Dealing with imbalances in case:control ratios is a challenge when using multi-locus approaches for examining gene*gene interactions [14]. Using fractional occupations, PIA v. 2.0 detected the interacting SNPs in 70–80% of the imbalanced data sets (1:2 and 1:4 ratios) when considering only the 1st ranking 2-SNP models. In contrast, if cell counts were used for imbalanced data, PIA v. 2.0 performed poorly. As a result, using the option of fractional occupations, PIA v. 2.0 may be applied to imbalanced case-control data to explore interactions.
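The contrast between cell counts and fractional occupations can be illustrated with a short sketch. It assumes that the fractional occupation of a genotype cell is the fraction of all cases (or of all controls) falling in that cell; this definition, the function name `label_cells`, and the "high"/"low" labels are assumptions made for illustration rather than a description of the PIA v. 2.0 code.

```python
def label_cells(case_counts, control_counts, fractional=True):
    """Label each genotype cell as 'high' or 'low' risk.

    case_counts / control_counts: dicts mapping a genotype cell to the
    number of cases / controls observed in it.
    fractional=True compares the fraction of all cases in the cell with
    the fraction of all controls in the cell (assumed definition of
    fractional occupation); fractional=False compares raw counts.
    """
    n_cases = sum(case_counts.values())
    n_controls = sum(control_counts.values())
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")

    labels = {}
    for cell in set(case_counts) | set(control_counts):
        cases = case_counts.get(cell, 0)
        controls = control_counts.get(cell, 0)
        if fractional:
            cases = cases / n_cases
            controls = controls / n_controls
        labels[cell] = "high" if cases > controls else "low"
    return labels


# Hypothetical 1:4 data set: 100 cases and 400 controls.
case_counts = {"AA/BB": 40, "other": 60}
control_counts = {"AA/BB": 80, "other": 320}
by_count = label_cells(case_counts, control_counts, fractional=False)
by_fraction = label_cells(case_counts, control_counts, fractional=True)
```

With raw counts every cell is dominated by controls and is labelled low risk, so no signal can be detected. With fractions, the "AA/BB" cell holds 40% of cases but only 20% of controls and is labelled high risk.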
Overall, PIA v. 2.0 is a robust method for detecting SNP-SNP interactions, as exemplified by the ability of PIA to detect the two interacting SNPs in the context of non-interacting SNPs using the simulated genetic data. PIA v. 2.0 detected the interacting SNPs more frequently than MDR using both the balanced and imbalanced data sets [14]. Moreover, PIA was more efficient at detecting interactions than exploring interactions (one at a time) using traditional methods. Different methods may be stronger in different contexts or datasets. It should be noted that PIA v. 2.0 was designed to be highly sensitive at detecting modest interactions. Therefore, even in the absence of interacting SNPs, PIA v. 2.0 will identify top scoring models. To reduce false positive associations, users should carefully examine the output produced by PIA v. 2.0 for consistent associations, by checking whether interactions are detected using multiple scoring functions and among the top pairs in 3-SNP or 4-SNP models, and should replicate observed associations using alternative methods (including traditional approaches). Furthermore, the results generated using simulated data should be interpreted with caution. PIA v. 2.0, and other methods for examining complex interactions, are likely less powerful in the context of competing interactions or alternative pathways. Further research is needed to evaluate PIA v. 2.0 in this context.
Several studies have examined the association of genetic variation with disease using large-scale multi-locus approaches. Previously, PIA was implemented in a study of colon cancer to examine complex interactions using 94 SNPs in 67 genes [5]. CART decision trees were used in a study of 16 SNPs in breast cancer [20], and 44 SNPs in bladder cancer [21]. Multifactor Dimensionality Reduction (MDR) was used to investigate 51 SNPs in 36 genes in multiple sclerosis [22], 36 gene variants in a nested case-control study within the EPIC cohort to study bladder cancer, lung cancer and myeloid leukemia [23], and seven DNA repair SNPs in bladder cancer [24]. Another study explored the association of 16 genetic variants in 11 genes with Crohn's Disease using regularized least squares [19]. All of these studies observed complex genetic interactions associated with disease.
PIA v. 2.0 incorporates some aspects of the more common approaches implemented in other studies of complex genetic interaction, namely CART decision trees and MDR. The Gini Index, one of the PIA v. 2.0 scoring functions, is used for splitting branches in CART decision trees [25]. PIA v. 2.0 uses a form of dimensionality reduction, as in MDR, in the assignment of the genotype-phenotype table in Figure . The % Correct scoring function is similar to the scoring function used in MDR [10]. A reason for using a percent score (as in PIA v. 2.0), compared to the number correctly classified (as in MDR) for scoring of genotype combinations, is that it accounts for different sized populations when scoring cells; i.e. if a phenotype has a smaller population size but 50% of samples are still misclassified, the score is the same as if a phenotype had a larger population size with 50% misclassification.
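The difference between scoring by number correctly classified and scoring by percent can be seen in a small worked example. The sketch assumes that the percent score is the mean of the per-phenotype percent correct; the exact PIA v. 2.0 formula is not given in this section, and both function names are hypothetical.

```python
def score_by_count(correct_cases, correct_controls):
    """Number of individuals classified correctly (count-based score)."""
    return correct_cases + correct_controls


def score_by_percent(correct_cases, n_cases, correct_controls, n_controls):
    """Mean of the per-phenotype percent correct (assumed percent score).

    Each phenotype contributes equally, whatever its population size.
    """
    pct_cases = 100.0 * correct_cases / n_cases
    pct_controls = 100.0 * correct_controls / n_controls
    return (pct_cases + pct_controls) / 2.0


# Hypothetical imbalanced study: 100 cases, 400 controls.
# Model X calls everyone a control; model Y separates the groups better.
x_count = score_by_count(0, 400)
y_count = score_by_count(80, 280)
x_pct = score_by_percent(0, 100, 400, 400)
y_pct = score_by_percent(80, 100, 280, 400)
```

By count, the uninformative model X (400 correct) outranks model Y (360 correct) simply because controls are more numerous. By percent, model Y scores higher than model X, since misclassifying all cases is penalized as heavily as misclassifying all controls would be.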
With the advent of genome-wide association studies (GWAS), it is possible to genotype over 500,000 SNPs in a single individual. In complex diseases, there are likely many genes that interact in pathways that are related to disease susceptibility. As a result, in GWAS, there is an interest in exploring complex gene*gene interactions. Investigating complex gene*gene interactions is a challenge due to the computational time required with such a large amount of genotyping data. We observed, using PIA v. 2.0, that a single run of cross-validation was similarly powerful at detecting the 2-SNP interactions as 10 rounds of 10-fold cross-validation. Further, the Gini Index and the Absolute Probability Difference functions, which both only implement a single run of the data, were robust at detecting the 2-SNP interactions. PIA also allows for the incorporation of user-defined pathways in the analysis of SNP interactions, which may be used to explore the association of global pathways, or gene ontologies, with disease outcome. Therefore, while PIA v. 2.0 currently can only be used for up to 1400 SNPs, using a single run of the data or scoring functions 5 and 6 is a possible approach to reducing computational time, and PIA may eventually be applied to the analysis of GWAS.
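The scale of the computational problem can be made concrete by counting the SNP combinations that an exhaustive search has to score. The sketch below assumes that every unordered combination of SNPs of a given order is evaluated once; the function name `n_models` is illustrative, and the SNP totals are the 1400-SNP limit and the 500,000-SNP GWAS figure mentioned above.

```python
from math import comb


def n_models(n_snps, order):
    """Number of distinct SNP combinations of a given order.

    Assumes an exhaustive search over unordered combinations,
    i.e. n_snps choose order.
    """
    return comb(n_snps, order)


# Print the search-space size for the two SNP totals discussed in the text.
for n_snps in (1400, 500_000):
    for order in (2, 3):
        print(f"{n_snps} SNPs, {order}-SNP models: {n_models(n_snps, order):,}")
```

Each additional round or fold of cross-validation multiplies the work per model, which is why scoring functions that need only a single pass over the data are attractive at this scale.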
In this paper, we describe a new method for exploring genetic interactions, but some of its limitations should be considered. In classifying genotype combinations associated with disease, PIA, like other dimensionality reduction methods, effectively dichotomizes exposure as "low" or "high" risk. Such a simplification of genotype combinations results in a loss of information, because in reality each SNP combination may be associated with a different level of risk. In addition, PIA v. 2.0 is not equipped for continuous variables, such as age or years of smoking exposure. These types of variables may only be analyzed using PIA if split into a maximum of five categories. While PIA v. 2.0 is more powerful than traditional methods, associations become less stable when studying higher order interactions because of the reduced number of individuals in each cell. Therefore, PIA does not eliminate the need to conduct studies of large sample sizes and to confirm findings with more traditional statistical methodologies.