We start with two papers that are similar in several aspects. Both
Agne et al. [2011] and
Xu and George [2011] analyze the simulated affection status, and both methods are based on the number of rare variants in predefined genomic regions. Although both methods can accommodate different definitions of rare variants,
Xu and George [2011] define a rare variant as one having a minor allele frequency (MAF) < 0.01 and
Agne et al. [2011] count only private variants, that is, variants that occur exactly once in the set of 697 unrelated individuals. One major difference between the two work groups is that
Agne et al. [2011] use nonoverlapping bins of a fixed number of SNPs (30 and 100 SNPs) as the predefined genomic regions, and
Xu and George [2011] use genes as the predefined genomic regions. Note that the number of SNPs per gene ranges from 1 to 231 with more than one-third of the genes having only one SNP. The two methods also differ in how the test statistics are constructed.
Xu and George [2011] count the number of rare variants for each individual and then use the common two-sample t-test statistic with pooled variance to compare the average number of rare variants per individual between the case subjects and control subjects. They use the statistic’s normal approximation to obtain the p-values. For each fixed bin,
Agne et al. [2011] compare the number of private variants in case subjects to the number of SNPs in both case subjects and control subjects. The statistical significance of this discrepancy is then evaluated by permuting the individual affection status a large number of times.
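The two ways of assessing significance described above can be sketched in a few lines. This is a minimal illustration, not the code of either work group: the function names, the toy counts, the two-sided normal p-value, and the one-sided permutation statistic (the total number of counted variants carried by case subjects) are all assumptions made for the example.

```python
import math
import random

def pooled_t_statistic(case_counts, control_counts):
    """Two-sample t statistic with pooled variance for per-individual counts."""
    n1, n0 = len(case_counts), len(control_counts)
    m1 = sum(case_counts) / n1
    m0 = sum(control_counts) / n0
    ss1 = sum((x - m1) ** 2 for x in case_counts)
    ss0 = sum((x - m0) ** 2 for x in control_counts)
    pooled_var = (ss1 + ss0) / (n1 + n0 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))

def normal_two_sided_p(t):
    """Two-sided p-value from the standard normal approximation (assumed two-sided)."""
    return math.erfc(abs(t) / math.sqrt(2))

def permutation_p(counts, is_case, n_perm=2000, seed=0):
    """One-sided permutation p-value obtained by shuffling affection status.

    The statistic (sum of counts over case subjects) is an assumed stand-in.
    """
    rng = random.Random(seed)
    observed = sum(c for c, a in zip(counts, is_case) if a)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        permuted = sum(c for c, a in zip(counts, labels) if a)
        if permuted >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): rare-variant counts in one region for 6 cases and 6 controls.
case_counts = [2, 1, 3, 2, 0, 2]
control_counts = [0, 1, 0, 1, 0, 0]
t_stat = pooled_t_statistic(case_counts, control_counts)
p_normal = normal_two_sided_p(t_stat)
p_perm = permutation_p(case_counts + control_counts, [True] * 6 + [False] * 6)
```

With samples this small the normal approximation is rough, which is one reason a permutation test can be preferable when the counts are sparse.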
These two work groups present their results in different ways.
Xu and George [2011] estimate type I error and power of their method in a straightforward way using associated genes and nonassociated genes. Their results are promising; the type I error was well controlled, and the average power over 200 replicates was 0.16 at the 0.01 significance level and 0.10 at the 0.001 significance level. They also reported the top 10 genes that were most frequently identified in the 200 replicates. Two Q1-associated genes,
FLT1 and
PIK3C2B, were at the top of the list, appearing 136 times and 87 times, respectively, at the 0.001 significance level; these were true positives. However, the next eight genes, appearing between 33 and 79 times out of 200 replicates at the 0.001 significance level, were all false positives.
Agne et al. [2011] set the significance level for each permutation test at 0.05. To combine these simulation-by-simulation reports of significant regions, they use the concept of return frequency, a count of the number of times a region is significant out of 200 simulations. If a region had no causal variants, the chance of observing a certain return frequency was evaluated using a Poisson approximation. For both fixed-bin sizes of 30 and 100 SNPs, all the top 10 return regions had return frequencies greater than 0.25 (50 out of 200), which corresponded to a chance of $10^{-19}$ under the null hypothesis. Note that such a high statistical significance was obtained by aggregating the statistics for each of the 200 replicates. With a bin size of 100 SNPs, 3 of the top 10 regions contained 7 causal private SNPs, and with a bin size of 30 SNPs, 2 of the top 10 regions contained 3 causal private SNPs; both analyses exceeded the expected number of causal private variants discovered at random: 3.56 (100-SNP bin) and 1.10 (30-SNP bin). The false-positive rate was rather high because the rest of the top regions contained no private causal variants.
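The order of magnitude quoted for the return frequency can be checked with a short calculation. The sketch assumes that, under the null hypothesis, a region is declared significant in each replicate with probability 0.05, so that the return count over 200 replicates is approximately Poisson with mean 200 × 0.05 = 10. This mean and the function name are our assumptions and are not taken from Agne et al. [2011].

```python
import math

def poisson_upper_tail(k_min, mean, n_terms=500):
    """P(X >= k_min) for X ~ Poisson(mean).

    Terms are summed directly in log space; computing 1 - CDF would lose
    all precision for a tail this small.
    """
    total = 0.0
    for k in range(k_min, k_min + n_terms):
        total += math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
    return total

n_replicates = 200
alpha = 0.05                         # per-replicate significance level
null_mean = n_replicates * alpha     # assumed Poisson mean under the null: 10
p_tail = poisson_upper_tail(50, null_mean)   # return count of at least 50
```

Under these assumptions the tail probability is on the order of $10^{-19}$, in line with the figure reported above.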
Xing et al. [2011] apply a weighted group-wise association test to analyze the association of quantitative trait Q1 at the biological pathway level. They first create a genetic summary score by summing the number of minor alleles over all SNPs residing in a predefined biological pathway, either with equal weight or with a weighting scheme that extends the idea of
Madsen and Browning [2009] to quantitative traits. The weight for each SNP $j$ is defined in terms of $I_{ij}$, the number of minor alleles of individual $i$ at the SNP, and an indicator $\delta(Y_i)$, where $\delta(Y_i) = 1$ if the phenotype $Y_i$ is within one standard deviation of the mean and $\delta(Y_i) = 0$ otherwise. This weighting scheme assigns a higher weight to SNPs that are rare among individuals with nonextreme phenotypes.
Xing et al. [2011] then use a simple linear regression model to test the association between the phenotype and the genetic summary score. When the weighting scheme is used, permutation tests are used to obtain the p-values because the weight assignment uses phenotypes. For the unweighted summary score, standard methods for linear regression are used.
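The following sketch illustrates why the weighted score calls for a permutation test: the weights must be recomputed from the phenotypes in every permutation. The exact weight used by Xing et al. [2011] is not reproduced here. As an assumption, the sketch borrows the form of Madsen and Browning [2009], $w_j = \sqrt{n q_j (1 - q_j)}$ with $q_j$ estimated from individuals for whom $\delta(Y_i) = 1$, and divides each genotype by $w_j$. All function names and the toy data are hypothetical.

```python
import math
import random

def nonextreme_indicator(y):
    """delta(Y_i): 1 if Y_i is within one standard deviation of the mean, else 0."""
    mean = sum(y) / len(y)
    sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (len(y) - 1))
    return [1 if abs(v - mean) <= sd else 0 for v in y]

def snp_weights(genotypes, delta):
    """Assumed Madsen-Browning-style weights; genotypes[i][j] is I_ij."""
    n = len(genotypes)
    n_ref = sum(delta)
    weights = []
    for j in range(len(genotypes[0])):
        m_ref = sum(g[j] * d for g, d in zip(genotypes, delta))
        q = (m_ref + 1) / (2 * n_ref + 2)   # smoothed minor allele frequency
        weights.append(math.sqrt(n * q * (1 - q)))
    return weights

def summary_scores(genotypes, weights):
    """Per-individual score: minor-allele counts divided by the SNP weights."""
    return [sum(g_j / w for g_j, w in zip(g, weights)) for g in genotypes]

def slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def weighted_permutation_p(genotypes, y, n_perm=500, seed=0):
    """Permutation p-value; weights are recomputed for each permuted phenotype."""
    rng = random.Random(seed)

    def stat(pheno):
        w = snp_weights(genotypes, nonextreme_indicator(pheno))
        return abs(slope(summary_scores(genotypes, w), pheno))

    observed = stat(y)
    perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if stat(perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy pathway (hypothetical): 6 individuals, 2 SNPs; SNP 1 is carried only by
# the individual with the extreme phenotype.
genotypes = [[1, 0], [1, 0], [0, 0], [1, 0], [0, 1], [1, 0]]
y = [0.0, 0.1, -0.1, 0.05, 3.0, -0.05]
delta = nonextreme_indicator(y)
weights = snp_weights(genotypes, delta)
p_value = weighted_permutation_p(genotypes, y)
```

In the toy data, SNP 1 is absent among the nonextreme individuals, so its $w_j$ is smaller and its contribution to the score is larger, which matches the behavior described in the text.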
Xing et al. [2011] use this method to evaluate 809 pathways defined by PharmGKB and 3,009 pathways defined by GO (Gene Ontology). Bonferroni correction was used to adjust for multiple testing. Among 12 significant PharmGKB pathways, most were somewhat related to the vascular endothelial growth factor (VEGF) pathway, which includes most of the nine Q1-associated genes. Although the relationship between the significant GO processes and the VEGF pathway was less clear, the GO pathways contained as many Q1-associated genes as the significant PharmGKB pathways. One interesting observation from the results is that the unweighted score appears to work as well as, if not better than, the weighted summary score. This can be explained by the fact that in the simulation the effect size of the causal variants was not related to the MAF, and the weighting scheme improves the power only if rare variants tend to have a larger effect.
Unlike the three contributions just discussed, the approach used by
Pradhan et al. [2011] considers the effects of multiple genes simultaneously in a multiple linear regression setting, and unlike the other four papers discussed here, the genotypes are not summarized before their associations with the phenotypes are assessed. Pradhan and colleagues’ approach is based on a Bayesian framework of model selection as proposed by
Yoon [2006]. They first consider a model space in which each model under consideration contains a fixed number of factors (usually small and set by the users). The factors are then ranked according to their marginal posterior probabilities.
Pradhan and colleagues analyze the data at two different levels, first by treating individual SNPs as factors and then by treating genes as factors. When a gene is treated as a factor, all variants in the gene enter the model as a group. Because each gene contains a different number of SNPs, models with larger genes tend to fit the data better, so large genes are favored if flat prior probabilities are assigned to each model. In an attempt to correct this bias, the prior probability of a model is set to be proportional to $e^{-k/2}$, where $k$ is the total number of model parameters, following the same reasoning as the Bayesian information criterion (BIC). By using standard conjugate prior probabilities for the regression model, Pradhan and colleagues could compute the posterior probability of a model efficiently with a closed-form expression. Because there was a large number of candidate models, they used a straightforward Metropolis-Hastings algorithm to estimate the marginal posterior probabilities.
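A search over fixed-size models of this kind can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Pradhan et al. [2011]: the swap proposal (exchange one factor in the model for one outside it) is our choice, and the closed-form log marginal likelihood is replaced by a caller-supplied function, here a hypothetical toy score. Only the prior proportional to $e^{-k/2}$ comes from the text.

```python
import math
import random

def log_prior(model, n_params):
    """Log prior up to a constant: -k/2, with k the total number of parameters."""
    k = sum(n_params[f] for f in model)
    return -k / 2.0

def mh_inclusion_probabilities(n_factors, model_size, log_marginal_likelihood,
                               n_params, n_iter=5000, seed=0):
    """Estimate marginal posterior inclusion probabilities by Metropolis-Hastings.

    The swap proposal keeps the model size fixed and is symmetric, so the
    acceptance ratio reduces to the ratio of unnormalized posteriors.
    """
    rng = random.Random(seed)

    def log_post(model):
        return log_marginal_likelihood(model) + log_prior(model, n_params)

    current = set(rng.sample(range(n_factors), model_size))
    current_lp = log_post(current)
    visits = [0] * n_factors
    for _ in range(n_iter):
        out_f = rng.choice(sorted(current))
        in_f = rng.choice([f for f in range(n_factors) if f not in current])
        proposal = (current - {out_f}) | {in_f}
        proposal_lp = log_post(proposal)
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        for f in current:
            visits[f] += 1
    return [v / n_iter for v in visits]

# Toy setup (hypothetical): 6 genes; gene 0 carries a strong signal, and gene 5
# is a large gene (8 parameters) with a weak signal.
signal = [5.0, 0.0, 0.0, 0.0, 0.0, 1.0]
n_params = [2, 2, 2, 2, 2, 8]

def toy_log_marginal_likelihood(model):
    # Stand-in for the closed-form expression under conjugate priors.
    return sum(signal[f] for f in model)

probs = mh_inclusion_probabilities(6, 2, toy_log_marginal_likelihood, n_params)
```

In the toy setup the size penalty outweighs the weak signal of the large gene, so its estimated inclusion probability stays low while gene 0 is included in nearly every visited model.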
Pradhan et al. [2011] applied their method to the quantitative trait Q1. The results were reported in two ways. First, they looked at the top 10 genes (or top 20 SNPs) ranked by the marginal posterior probabilities. Second, they looked at the genes (or SNPs) whose marginal posterior probability exceeded a threshold of 0.5 or 0.1. Using gene-level analysis with the model size fixed at three factors and a marginal posterior probability threshold of 0.5, Pradhan and colleagues’ method detected
FLT1, a Q1-associated gene, in each of the 10 replicates;
KDR, another Q1-associated gene, in 4 out of 10 replicates; and some non-Q1-related genes in 5 out of 10 replicates. Relaxing the threshold to 0.1 did not detect Q1-associated genes much more often but greatly increased the number of false positives. Besides
FLT1 and
KDR, three other Q1-associated genes were among the top 10 genes at least once out of 10 replicates. Their SNP-level analysis consistently identified several Q1-associated SNPs in
FLT1 and
KDR. In general, the identified variants had higher MAFs than nonidentified causal variants.
The work by
Pungpapong et al. [2011] is unique in that it uses a two-stage approach in which the phenotype information is used at each stage. In the first stage, Pungpapong and colleagues use penalized orthogonal-components regression (POCRE) [
Zhang et al., 2009] to build the model for each genetic region using a training data set, and then they calculate the predicted values in the test data set. They use these values as gene-level markers. POCRE itself has a function of variable selection so that the “markers” of a gene might not be constructed from all SNPs in the gene. In the second stage, all gene-level markers, together with some covariates, are entered into an empirical Bayesian variable selection process developed by
Johnstone and Silverman [2004,
2005]. Unlike the full Bayesian methods used by
Pradhan et al. [2011], in which the users specified all prior parameters, empirical Bayesian methods determine some prior parameters from the data.
Pungpapong et al. [2011] apply their approach to Q1 and use one replicate as the training set to obtain the gene-level markers. They then report the number of times a gene shows a nonzero effect from the empirical Bayes variable selection process. Only 4 genes appeared more than 5 times out of 200 replicates. Among them,
FLT1 appeared in all the replicates,
KDR in 26%,
ARNT in 6%, and
RIPK3 in 3%. All but
RIPK3 are Q1-associated genes. In addition, 98 noncausal genes were identified out of 200 replicates. Pungpapong and colleagues’ results are, in general, comparable to those of
Pradhan et al. [2011] in terms of power and false positives.