We have presented a simple method for estimating the sample size required in case-control studies to detect a group of genetic variants under multiplicative models. We also applied the same approach to additive risk models; however, we could not establish the asymptotic normality of the joint distribution of exposure for cases (Appendix A2).

In the multiplicative model, when the genetic variants are found to be jointly significant, subsequent multiple tests can be conducted to determine which R_{i}s are significantly different from 1. For example, if the null hypothesis is rejected for a group of five genetic variants, and R_{1}, R_{2} and R_{5} are significantly different from 1, we can conclude that the joint effect of G_{1}, G_{2} and G_{5} differs significantly between cases and controls.

Consider k hypothesis tests. Under the null hypothesis, by the Bonferroni inequality, the probability that at least one of the k tests is significant at level α_{0} is less than or equal to α_{0}k. In order to maintain an overall level of significance α, we would use the significance level α_{0} = α/k for each of the k separate tests of significance. Several less conservative adjustments for multiple tests of significance have been proposed, such as the procedures of Holm [12] and Hochberg [13]. These procedures order the test statistics from largest to smallest and apply progressively less restrictive significance levels to the second, third, and so on, test conducted. When any one test is not significant, the procedure stops and all further tests are declared non-significant. Benjamini [14] suggested that the False Discovery Rate (FDR) may be the appropriate error rate to control in many applied multiple testing problems. The FDR is the expected proportion of erroneous rejections among all rejections. A simple FDR-controlling procedure for independent test statistics was given there and was shown to be much more powerful than comparable procedures that control the traditional family-wise error rate (the probability of erroneously rejecting even one true null hypothesis).
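The corrections discussed above can be sketched in a few lines of code. The following is an illustrative implementation of the Bonferroni adjustment, the step-down Holm procedure, and the Benjamini-Hochberg FDR procedure; the p-values used at the end are hypothetical.

```python
# Illustrative implementations of three multiple-testing corrections:
# Bonferroni, Holm (step-down), and Benjamini-Hochberg (FDR).

def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / k."""
    k = len(pvals)
    return [p <= alpha / k for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm procedure: compare the (i+1)-th smallest p-value
    with alpha / (k - i); stop at the first non-significant test."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    reject = [False] * k
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (k - step):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are non-significant
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up BH procedure controlling the FDR for independent tests:
    find the largest rank i with p_(i) <= (i / k) * alpha and reject
    the hypotheses with the i smallest p-values."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / k * alpha:
            cutoff = rank
    reject = [False] * k
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff:
            reject[i] = True
    return reject

pvals = [0.001, 0.011, 0.02, 0.04, 0.30]  # hypothetical p-values, k = 5
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))
```

On these hypothetical p-values the three procedures reject progressively more hypotheses (Bonferroni fewest, BH most), illustrating the conservatism ordering described above.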

As an alternative to our test, one could conduct a simultaneous test of the k-parameter joint null hypothesis using the multiple-testing procedures discussed above. However, all of these procedures are conservative compared with the multivariate test presented here. On the other hand, multiple-comparison tests can be applied when the k-statistic vector is not normally distributed, making them suitable for the additive model given in Appendix A2.

Garcia-Closas [15] evaluated the influence of common genetic variation in the NER pathway on bladder cancer risk by analyzing 22 single nucleotide polymorphisms (SNPs) in seven NER genes (XPC, RAD23B, ERCC1, ERCC2, ERCC4, ERCC5, and ERCC6). They estimated odds ratios for each individual polymorphism using logistic regression, and then performed a global test for the association between genetic variation in the NER pathway as a whole, based on the maximum of the trend statistics of all the individual polymorphisms. The P-value for the global test was computed by the permutation method described in Westfall [16]. Using 1150 cases and an almost equal number of controls, they found significant associations with SNPs in four of the seven NER genes; the P-value for the global test for pathway effects was 0.04. Their minor allele frequencies ranged from 0.01 to 0.33, and the odds ratios ranged from 0.8 to 1.4, with an average odds ratio of 1.2. If the odds ratios and SNP frequencies were known (assuming an average odds ratio of 1.2 and a dominant model), the sample size required by our method to achieve 80% power at the 5% level of significance in detecting the overall effect of the 22 SNPs is 212 cases. In situations in which none of the genetic variants turn out to be individually significant, the method described in this paper could reduce the cost of the experiment by first screening the group of genetic variants for overall significance.
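To give a sense of the scale of such calculations, the sketch below works through a simplified single-variant version of the problem. The paper's multivariate formula (equation (1)) is not reproduced in this section; instead the code shows (a) how an odds ratio and a control carrier frequency determine the implied carrier frequency in cases, and (b) the classical normal-approximation sample-size formula for comparing two proportions. The control frequency of 0.20 is an illustrative assumption, not a value from the study.

```python
# Simplified single-variant sample-size sketch (NOT the paper's
# equation (1)). Assumes an illustrative control carrier frequency
# and the study's average odds ratio of 1.2.
from statistics import NormalDist

def case_frequency(p0, odds_ratio):
    """Carrier frequency among cases implied by control frequency p0
    and the given odds ratio."""
    odds = odds_ratio * p0 / (1.0 - p0)
    return odds / (1.0 + odds)

def n_per_group(p0, p1, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for detecting a
    difference between proportions p0 (controls) and p1 (cases)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    pbar = (p0 + p1) / 2
    num = (z_a * (2 * pbar * (1 - pbar)) ** 0.5
           + z_b * (p0 * (1 - p0) + p1 * (1 - p1)) ** 0.5) ** 2
    return num / (p1 - p0) ** 2

p0 = 0.20                      # hypothetical carrier frequency in controls
p1 = case_frequency(p0, 1.2)   # implied frequency in cases for OR = 1.2
print(round(p1, 3))
print(round(n_per_group(p0, p1)))
```

Under these illustrative numbers, detecting a single variant with OR = 1.2 requires on the order of thousands of subjects per group, which dwarfs the 212 cases quoted above for the joint test of all 22 SNPs and illustrates why screening the group first can reduce cost.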

The results obtained here can easily be extended to a group of k genetic variants and l environmental factors, when exposure to the i^{th} environmental factor can be specified as E_{i} = 1 (present) or E_{i} = 0 (absent) and the E_{i}s are independent of one another and of the genetic variants.

Our approach is limited by its inability to examine higher-order interactions and by the assumption of independence among all loci. Covariance terms in the variance-covariance matrix could increase the sample size required to detect the group of genetic variants. It is possible that individual effects go undetected while joint effects arise through interactions; our method cannot detect these interactions. Our sample size calculation is also constrained by its reliance on the normal approximation to the binomial distribution. Another limitation is the assumption of multiplicative effects of genetic variants. True biologic interactions could be more complex, with epistasis and/or other genetic phenomena; furthermore, joint genetic effects and gene-environment interactions on risk may be neither additive nor multiplicative. For statistical modeling, however, epidemiologic analyses have generally had to rely on multiplicative or additive models. The rare disease assumption in case-control studies has been discussed in many papers [17,18]. Since most diseases are infrequent, ORs are good estimators of relative risks under this "rare disease assumption"; even for a disease with a frequency of 10%, which is high, the difference between the OR and the RR is only about 10%. The only requirement of our genetic model is the ability to express exposure due to genotype as 1 (presence of genotype) or 0 (absence of genotype); therefore, either dominant or recessive models can be used in our analysis.
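The rare disease assumption is easy to verify numerically. The short sketch below compares the relative risk and the odds ratio for a rare and a common disease; the baseline risks are illustrative.

```python
# Numerical illustration of the rare disease assumption: the odds
# ratio (OR) approximates the relative risk (RR) when the disease is
# infrequent. Baseline risks below are illustrative.

def rr_and_or(p_exposed, p_unexposed):
    rr = p_exposed / p_unexposed
    odds_ratio = ((p_exposed / (1 - p_exposed))
                  / (p_unexposed / (1 - p_unexposed)))
    return rr, odds_ratio

# Rare disease: 1% baseline risk, doubled by exposure.
print(rr_and_or(0.02, 0.01))   # OR barely exceeds RR = 2
# Common disease: 10% baseline risk, doubled by exposure.
print(rr_and_or(0.20, 0.10))   # OR overstates RR = 2 noticeably
```

For the rare disease the OR is about 2.02 against an RR of 2 (a discrepancy of roughly 1%), while for the 10% baseline the OR rises to 2.25, consistent with the roughly 10% discrepancy noted above.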

A non-parametric approach to this problem is the method of multifactor dimensionality reduction (MDR), introduced by Ritchie [5] as a method of reducing the dimensionality of multilocus information to improve the identification of polymorphism combinations associated with disease risk. This data reduction approach seeks to identify combinations of multilocus genotypes and discrete environmental factors that are associated with either high or low risk of disease, and defines a single variable that can be divided into high-risk and low-risk combinations. When applied to a sporadic breast cancer case-control data set, in the absence of statistically significant independent main effects, MDR identified a statistically significant higher-order interaction among four polymorphisms from three different estrogen-metabolism genes. Limitations of MDR include its applicability only to balanced case-control studies and the difficulty of interpreting MDR models. Three strategies for improving the power of MDR to detect epistasis in imbalanced datasets have been evaluated in a recent paper [19].

Another recent approach that holds great promise is logic regression, introduced by Ruczinski [20] as a tool to detect interactions between binary predictors that are associated with a response variable. Logic regression is an adaptive regression methodology that constructs predictors as Boolean combinations of binary covariates. According to the authors, logic regression is the only methodology that searches the entire space of such Boolean combinations while remaining completely embedded in a regression framework, where the quality of the model is determined by the objective function of the respective regression class.

Suppose a group contains k genetic variants and only r of them are associated with the disease. The prevalence of each of the (k - r) genetic variants that are not associated with the disease (each with relative risk equal to 1) is identical in cases and controls. Therefore, from equation (1), the sample size required to detect the k genetic variants is identical to the sample size required to detect the r genetic variants associated with the disease. Since our sample size is a function of the squared differences between the prevalences of genetic variants in cases and controls, our method remains valid even for a mixture of positively and negatively associated genetic variants.
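The invariance argument above can be made concrete with a stylized sketch. Equation (1) itself is not reproduced in this section; the code assumes only the stated property that the sample size depends on the variants through the squared differences between case and control prevalences, so variants with equal prevalences in both groups contribute nothing. All prevalences below are hypothetical.

```python
# Stylized sketch: variants with relative risk 1 have identical
# prevalences in cases and controls, so they contribute zero to a
# sum of squared case-control prevalence differences and leave the
# implied sample size unchanged. Prevalences are hypothetical.

def sum_squared_differences(case_prev, control_prev):
    return sum((p1 - p0) ** 2 for p1, p0 in zip(case_prev, control_prev))

# Two associated variants (r = 2):
cases    = [0.25, 0.15]
controls = [0.20, 0.10]
base = sum_squared_differences(cases, controls)

# Add three null variants (k = 5) with equal prevalence in both groups:
cases_k    = cases + [0.30, 0.05, 0.12]
controls_k = controls + [0.30, 0.05, 0.12]
assert sum_squared_differences(cases_k, controls_k) == base
print(base)
```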

One advantage of our method is that it performs a simultaneous test of the difference in mean exposure instead of multiple testing. Thus, for a range of reasonable numbers of genetic variants, the sample size requirement declines as the number of genetic variants increases. The sample size required to detect a group of genetic variants could increase when a genetic variant is added to the group; however, the sample size required to detect the group including this variant is still less than that required to detect the variant alone, or to detect a subset of the variants containing it. When testing for an effect size in a group of genetic variants, the global test described in this paper can be used as a screening tool, because the sample size required to detect an effect size in the group is comparatively small. Note that we are comparing the ability to detect at least one of many genetic variants (global test) with the power to detect just one; these are different null hypotheses. If the global test is non-significant, testing for individual genetic variants, which would require a large sample size, is not necessary.

More methodological work is needed in this area to detect joint effects of multiple genetic variants. Our method can be viewed as a screening tool for assessing groups of genetic variants involved in the pathogenesis and etiology of common complex human diseases.