Four pathway analysis methods were compared by using each to test the association of GO level 4 pathways with lung cancer risk in two lung cancer GWAS data sets. The methods comprised three gene set enrichment approaches (EASE, GenGen and mSUMSTAT) and a p-value combination approach (SLAT). After adjustment for multiple comparisons, using an FDR of ≤0.05 as the criterion for a significant association, EASE and mSUMSTAT identified more pathways associated with lung cancer risk across the two data sets (10 and 8, respectively) than did GenGen (no pathways) or SLAT (5 pathways). EASE and mSUMSTAT also identified pathways that were significantly associated with risk in both data sets: transmission of nerve impulse and Ras guanyl nucleotide exchange factor by EASE, and the acetylcholine receptor activity pathway by mSUMSTAT. There was limited agreement among the different methods in the identification of top ranked pathways. Comparing genes among the top pathways chosen by each method showed only a modest degree of overlap.
In comparing pathway analysis methods, we examined whether the number of SNPs per gene in pathways influenced the selection of top pathways. The results indicated that EASE identified top pathways with a significantly greater median number of SNPs per gene than the other methods. This result is not unexpected. For all gene set enrichment methods we used the common approach of assigning the most significant SNP to represent each gene. Genes with more SNPs, generally large genes, are more likely to be assigned a SNP with a high association statistic, which can lead to overestimation of the significance of pathways with large genes (gene size bias) [8], [9]. We acknowledge that large genes might be more likely to harbour multiple variants that are truly associated with outcome, but our comments focus on the statistical properties of the methods, specifically the potential for false positives resulting from gene size bias. EASE, which uses a relatively simple approach based on Fisher's exact test, is susceptible to this bias. Normalization routines and phenotype permutations incorporated into GenGen and mSUMSTAT protect against this bias [6], [22]. SLAT is also protected against this bias, as it uses all SNPs in a pathway for analysis and incorporates a phenotype shuffling routine [12]. The more robust design of GenGen, mSUMSTAT and SLAT provides an additional benefit, as these methods account for correlation among SNPs within pathways.
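The gene size bias described above can be illustrated with a small null simulation. The following is a minimal sketch, not part of the study's analysis: it assumes that SNP p-values within a gene are independent and uniform under the null (real SNPs are correlated through LD, which weakens but does not remove the effect), and the function names are hypothetical. It shows how often a gene with no true association is represented by a SNP with p ≤ 0.05 as the number of SNPs per gene grows.

```python
import random


def min_p_hit_rate(n_snps, alpha=0.05, n_genes=5000, seed=0):
    """Fraction of null genes whose best (minimum) SNP p-value is <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Under the null, each SNP p-value is uniform on [0, 1).
        # Independence between SNPs is a simplifying assumption (no LD).
        best_p = min(rng.random() for _ in range(n_snps))
        if best_p <= alpha:
            hits += 1
    return hits / n_genes


def expected_hit_rate(n_snps, alpha=0.05):
    """Closed-form value for independent SNPs: 1 - (1 - alpha)^n_snps."""
    return 1.0 - (1.0 - alpha) ** n_snps


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(min_p_hit_rate(n), 3), round(expected_hit_rate(n), 3))
```

Under these assumptions the chance that a null gene appears "significant" rises steadily with the number of SNPs it contains, which is why a method that counts such genes without normalization or phenotype permutation can favour pathways rich in large genes.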
A critical aspect of this comparison was the use of replication of top pathways across CETO and GRMD to help evaluate the relative performance of these methods. However, based on an FDR of ≤0.05, few replicated associations were found. Lack of study power may in part account for the small number of replicated associations. In particular, GRMD (cases = 1639, controls = 1618) may have had insufficient sample size to detect associations found in CETO (cases = 2258, controls = 3027). Heterogeneity between data sets might also have contributed to the small number of replicated associations, as the German sample was restricted to subjects under age 50, and the MD Anderson GWAS included only ever smokers. Therefore, GRMD subjects were younger and had a higher proportion of ever smokers compared to CETO subjects.
Among the three methods (GenGen, mSUMSTAT and SLAT) that are robust against gene size bias, only mSUMSTAT identified a replicated association. This was for the acetylcholine receptor activity pathway. The association of this pathway with risk is not unexpected, as several SNPs at or near the CHRNA3-CHRNA5-CHRNB4 gene cluster are associated with both lung cancer risk [1], [2], [5] and nicotine addiction [5], [23], [24]. It is of interest that the GenGen method also identified acetylcholine receptor activity as the top ranked pathway in CETO and one of the most highly ranked pathways in GRMD, although the result was not significant in either data set after correcting for multiple comparisons using the FDR. We note that the associations found for this pathway were driven by the CHRNA3-CHRNA5-CHRNB4 gene cluster, as demonstrated by the dramatic reduction in strength of association (according to the FDR) found for both the mSUMSTAT and GenGen methods when data were reanalyzed with these three genes removed from the pathway. This may complicate the interpretation of the observed association since, ideally, significant pathways should not be identified from a signal that might ultimately represent a single gene or variant [20], [21]. We point out, however, that there are two independent risk associated loci in this region [25] and it is currently not clear which genes in the region are causally related to disease risk. It is preferable, then, that the analysis method identifies pathways such as these as associated with outcome, so that the researcher can follow up with additional exploratory analyses. Further investigation of this pathway did suggest that allowing the same SNP to represent both CHRNA5 and CHRNA3 in the analysis overestimated significance in the GRMD data set for mSUMSTAT and in the CETO data set for GenGen. Results from analyses that excluded CHRNA5 are likely the most appropriate for this pathway.
For the purpose of further comparing pathway associations across data sets, we used a less restrictive criterion for a replicated pathway association: a significant FDR in one data set and a nominally significant association (P ≤ 0.05) in the second. This permitted additional associations to be identified, although with less confidence than those identified using the original criterion. The mSUMSTAT method found four potential risk associated pathways with a significant FDR in CETO and nominally significant P-values in GRMD: heme metabolic process, porphyrin metabolic process, pigment biosynthesis and 4 iron, 4 sulfur cluster binding. The heme metabolic and porphyrin metabolic pathways show a high degree of overlap. All four of these pathways include IREB2, which is in the same region of strong LD that includes the CHRNA3-CHRNA5-CHRNB4 cluster. SLAT identified one pathway, regulation of cell migration, using this same criterion.
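The two replication criteria used in this discussion can be written out explicitly. The sketch below is illustrative only: it assumes the FDR is obtained with the Benjamini-Hochberg adjustment (the specific FDR procedure is not stated here), the function names are hypothetical, and the p-values in the usage example are made up rather than taken from CETO or GRMD.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_replication(names, p_a, p_b, fdr=0.05, nominal=0.05):
    """Label each pathway under the strict and relaxed replication criteria."""
    q_a, q_b = bh_adjust(p_a), bh_adjust(p_b)
    labels = {}
    for name, pa, pb, qa, qb in zip(names, p_a, p_b, q_a, q_b):
        if qa <= fdr and qb <= fdr:
            # Original criterion: FDR-significant in both data sets.
            labels[name] = "replicated (strict)"
        elif (qa <= fdr and pb <= nominal) or (qb <= fdr and pa <= nominal):
            # Relaxed criterion: FDR-significant in one, nominal P in the other.
            labels[name] = "replicated (relaxed)"
        else:
            labels[name] = "not replicated"
    return labels


if __name__ == "__main__":
    # Made-up p-values for four hypothetical pathways in two data sets.
    names = ["pathway_1", "pathway_2", "pathway_3", "pathway_4"]
    p_first = [0.0001, 0.001, 0.5, 0.9]
    p_second = [0.0002, 0.04, 0.6, 0.01]
    for name, label in classify_replication(names, p_first, p_second).items():
        print(name, label)
```

The relaxed label carries less confidence than the strict one because the second data set is not corrected for multiple comparisons.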
Overall, our results (along with insights from other comparisons discussed below) suggest that mSUMSTAT should be considered when choosing a method for pathway analysis. Lack of strong replication of pathway associations makes it difficult to evaluate GenGen and SLAT against one another. However, the GenGen approach appears to have some advantages. GenGen results provided some support for an association of the acetylcholine receptor pathway with risk and, like mSUMSTAT, this method allows for the incorporation of covariates, whereas the SLAT program does not have this capability. Finally, GenGen is commonly used and has provided other plausible associations in pathway analyses of GWAS data sets [10]. On the other hand, the utility of SLAT is difficult to assess given our results, and further evaluation of this method is needed. The rest of the discussion focuses on mSUMSTAT and GenGen.
Our mSUMSTAT method differs from that of Tintle et al. [11] in that it calculates a normalized test statistic and uses phenotype permutations, instead of randomly selected gene sets, to determine the null distribution. These changes were introduced to address gene size bias and to maintain the correlation structure among SNPs in a pathway.
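The general idea of a summed pathway statistic with a phenotype-permutation null can be sketched as follows. This is not the mSUMSTAT implementation: the per-SNP statistic (N times the squared genotype-phenotype correlation), the z-score style normalization, the toy data generator and all function names are assumptions made for illustration. The point it demonstrates is that shuffling case/control labels leaves every genotype column untouched, so the correlation among SNPs is preserved in each permuted data set.

```python
import random
import statistics


def snp_stat(genotypes, phenotype):
    """N * r^2 between allele dosage and case status (assumed 1-df statistic)."""
    n = len(phenotype)
    mg = sum(genotypes) / n
    mp = sum(phenotype) / n
    sgg = sum((g - mg) ** 2 for g in genotypes)
    spp = sum((p - mp) ** 2 for p in phenotype)
    if sgg == 0 or spp == 0:
        return 0.0
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotype))
    return n * sgp * sgp / (sgg * spp)


def pathway_sum(snps_by_gene, phenotype):
    """Sum over genes of the largest SNP statistic within each gene."""
    return sum(
        max(snp_stat(snp, phenotype) for snp in snps)
        for snps in snps_by_gene.values()
    )


def permutation_test(snps_by_gene, phenotype, n_perm=200, seed=0):
    """Return (observed sum, normalized score, empirical p-value)."""
    rng = random.Random(seed)
    observed = pathway_sum(snps_by_gene, phenotype)
    shuffled = list(phenotype)
    null = []
    for _ in range(n_perm):
        # Only the labels move; genotype columns (and their LD) are unchanged.
        rng.shuffle(shuffled)
        null.append(pathway_sum(snps_by_gene, shuffled))
    sd = statistics.stdev(null)
    z = (observed - statistics.mean(null)) / sd if sd > 0 else 0.0
    p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return observed, z, p


def make_toy_data(n=200, seed=1):
    """Hypothetical data: GENE_A carries one associated SNP, GENE_B is null."""
    rng = random.Random(seed)
    phenotype = [rng.randint(0, 1) for _ in range(n)]

    def dosage(freq):
        return (rng.random() < freq) + (rng.random() < freq)

    gene_a = [[dosage(0.3 + 0.2 * y) for y in phenotype],
              [dosage(0.2) for _ in phenotype]]
    gene_b = [[dosage(0.4) for _ in phenotype] for _ in range(5)]
    return {"GENE_A": gene_a, "GENE_B": gene_b}, phenotype


if __name__ == "__main__":
    pathway, status = make_toy_data()
    print(permutation_test(pathway, status))
```

Because the permuted statistics are computed with the same genes and the same SNP counts as the observed one, a gene with many SNPs inflates the null distribution as much as it inflates the observed value, which is how this kind of design guards against gene size bias.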
Some simulation results suggest that approaches that use the sum or average of the χ² as a pathway test statistic will be more powerful than those that use the weighted Kolmogorov-Smirnov-like running sum statistic incorporated into GenGen and related GSEA approaches. Tintle et al. found that the original SUMSTAT test statistic was more powerful than a GSEA approach in a comparison where random gene sets were used to construct the null distribution for both methods [11]. Efron and Tibshirani found generally lower p-values using mean test statistics when compared to GSEA in simulated gene expression analyses [18]. Their analysis used a t-test instead of a χ² statistic, allowing for gene expression comparisons of two groups. Permutation and normalization approaches were the same as used here, except that normalization for GSEA also incorporated means and standard deviations calculated from permutations with random gene sets. Our results are consistent with these studies in that mSUMSTAT identified several significant associations in CETO and GRMD (with one of these replicated in both data sets), while GenGen did not, suggesting that mSUMSTAT may have greater power to detect associations.
Since the strongest association found by GenGen and mSUMSTAT was for the acetylcholine receptor pathway we graphed odds ratios and confidence limits to further explore the pathway association. Despite weak association signals found for these regions when the CHRNA3-CHRNA5-CHRNB4 cluster was removed from analyses, the graphical presentation of results suggests that SNPs outside of this gene cluster may contribute to the association, as suggested by replicated associations across the two data sets. This association appeared more convincing when comparing the most significant SNPs representing each gene across the two data sets (gene based comparison) as opposed to comparing the most significant SNPs at each gene in CETO to the same SNPs in GRMD (variant based comparison). Better evidence for replication could result from a gene based approach versus a SNP based approach if multiple SNPs capture the causal variant(s) more completely than single SNPs for some pathway genes. This can be advantageous to pathway analysis approaches which can rely on gene based association signals to better replicate pathway associations.
In summary, this study compared several different pathway analysis approaches in two lung cancer GWAS data sets comprising four studies. Difficulties in replicating associations across studies hindered our comparison and we cannot clearly establish one pathway analysis method as superior to the others. However, the mSUMSTAT approach did demonstrate several strengths such as a highly plausible association with the acetylcholine receptor pathway and several additional suggestive associations, while accounting for correlation among SNPs and gene size bias. Since different pathway analysis methods can produce different results using the same data set (as was seen here), it is best to use more than one method when examining pathway associations with disease risk [26]. We suggest that the mSUMSTAT method could be used in combination with other methods, such as the better known GenGen approach, in pathway analysis investigations.