|Home | About | Journals | Submit | Contact Us | Français|
Recent genome-wide association studies have resulted in a dramatic increase in our knowledge of the genetic loci involved in type 2 diabetes. In a complementary approach to these single-marker studies, we attempted to identify biological pathways associated with type 2 diabetes. This approach could allow us to identify additional risk loci.
We used individual level genotype data generated from the Wellcome Trust Case Control Consortium (WTCCC) type 2 diabetes study, consisting of 393,143 autosomal SNPs, genotyped across 1,924 case subjects and 2,938 control subjects. We sought additional evidence from summary level data available from the Diabetes Genetics Initiative (DGI) and the Finland-United States Investigation of NIDDM Genetics (FUSION) studies. Statistical analysis of pathways was performed using a modification of the Gene Set Enrichment Algorithm (GSEA). A total of 439 pathways were analyzed from the Kyoto Encyclopedia of Genes and Genomes, Gene Ontology, and BioCarta databases.
After correcting for the number of pathways tested, we found no strong evidence for any pathway showing association with type 2 diabetes (top Padj = 0.31). The candidate WNT-signaling pathway ranked top (nominal P = 0.0007, excluding TCF7L2; P = 0.002), containing a number of promising single gene associations. These include CCND2 (rs11833537; P = 0.003), SMAD3 (rs7178347; P = 0.0006), and PRICKLE1 (rs1796390; P = 0.001), all expressed in the pancreas.
Common variants involved in type 2 diabetes risk are likely to occur in or near genes in multiple pathways. Pathway-based approaches to genome-wide association data may be more successful for some complex traits than others, depending on the nature of the underlying disease physiology.
Recent genome-wide association (GWA) studies have resulted in a dramatic increase in our knowledge of the genetic loci involved in type 2 diabetes. There are now 18 common variants with robust evidence of association with the disease (1–7). Individually, these variants only have a small effect on disease risk, and in combination they account for only ~3% of the heritability of type 2 diabetes (6). GWA studies have focused on only the most statistically strongly associated individual variants, with robust association being declared only if a variant meets a stringent “genome-wide” level of statistical significance. However, GWA studies of a few thousand individuals are underpowered to detect variants with odds ratios of <1.2 at P < 1 × 10−5 —the typical P values used to take SNPs forward into replication studies. This means many real type 2 diabetes predisposing variants may not be detected in GWA data and are not yet followed up. The most robust way of testing for these variants is to increase sample sizes, and such efforts are ongoing. However, ever larger numbers of case and control subjects may be difficult to obtain. We therefore sought alternative approaches to investigating GWA data.
For the initial analysis, we used individual-level genotype data generated from the Wellcome Trust Case Control Consortium (WTCCC) type 2 diabetes study. Quality-control criteria and study design are described in detail in Zeggini et al. (7). We used a total of 393,143 autosomal single nucleotide polymorphisms (SNPs), in 1,924 case subjects and 2,938 population-based control subjects.
We used summary association statistics from two type 2 diabetes GWA studies for validation. These are the Diabetes Genetics Initiative (DGI) and the Finland-United States Investigation of NIDDM Genetics (FUSION) study, which together consist of 2,625 case subjects and 2,641 control subjects (2,3,6). Imputed SNP association statistics were used for any FUSION or DGI SNPs that were missing or not directly genotyped. Only those with high imputation confidence (r2hat > 0.4) were included in the analysis.
The Gene Set Enrichment Algorithm (GSEA) modified algorithm was provided as part of the GenGen suite (http://openbioinformatics.org/gengen), complete with formatted data files using the SNP/Gene mappings from the UCSC Genome Browser annotation in the snp126, refGene, and refLink tables (genome build:hg18). The software uses human biological pathways defined from Gene Ontology (http://www.geneontology.org), BioCarta (http://www.biocarta.com), and the Kyoto Encyclopedia of Genes and Genomes (KEGG;http://www.genome.jp/kegg) databases (8,9). All human KEGG and BioCarta pathways were used in the analysis, but only level 4 annotations in Biological Process and Molecular Function from Gene Ontology. These were the latest version when downloaded from the Gengen website (http://openbioinformatics.org/gengen, accessed December 2007).
We also designed three artificial positive control pathways. The first contained 17 known type 2 diabetes loci (18 confirmed variants excluding SLC30A8 not captured on the Affymetrix chip), and the second and third contained the same genes, diluted with 17 and 34 genes with median GWA χ2 association statistics.
Analysis of pathways was performed using a recently published algorithm (10). This method implements the existing GSEA algorithm (11). Briefly, χ2disease association statistics from all quality-controlled SNPs were mapped onto genes. The mapping process assigns the SNP with the highest test statistic, within a 200-kb flanking window on either side of the gene, to represent the overall test statistic for an individual gene. SNPs that are not mapped to any defined gene are discarded from further analysis, whereas SNPs within the region of multiple genes are mapped to all. This produced a list of 303,861 SNPs in 15,114 genes. Each of these 15,114 genes is then ranked from highest to lowest according to their test statistic (represented by the highest associated SNP). From an initial set of 2,077 KEGG, Gene Ontology, and BioCarta pathways, 439 pathways met the criteria of containing between 20 and 200 GWA-captured genes. This cutoff was selected as the optimal range for gene sets based on the power of the algorithm. However, cutoffs of between 10 and 400 produced similar results, with WNT-signaling ranking top.
The significance of each pathway is then judged using a Kolmogorov-Smirnov–like running sum statistic. This test reflects the overrepresentation of pathway genes toward the top of the overall ranked gene list. An enrichment score (ES) is produced, measuring the deviation of the association statistics in a given gene set, compared with a set of randomly picked genes. As larger genes will by chance harbor SNPs with higher test statistics, a normalized enrichment score (NES) is also produced, adjusting for gene size through permutation-based label swapping. By subtracting the mean permuted ES score from the observed ES score and dividing by the SD of the permuted ES scores, different-sized genes and gene sets are directly comparable. A total of 10,000 permutations were performed, where the case/control ratio was maintained during label swapping. The permutation approach gives an empirical P value for an individual pathway based on label swapping, alongside a false discovery rate calculated using NES scores from all permuted values in all pathways examined in an experiment. Full details of formulae used in this algorithm are provided by Wang et al. (10).
Biological pathway-based analysis is a complementary approach to single-point GWA studies. The approach allows us to test the hypothesis that a set of polymorphisms that occur in genes from a particular biological pathway or process are associated with type 2 diabetes. Using GWA data to test for association of a pathway rather than individual genes may provide important novel biological insights (i.e., a pathway associated with type 2 diabetes), which have been missed by the focus on only the most associated individual variants. A pathway-based approach may also help prioritize individual variants, which are at lower levels of statistical confidence, for follow-up in large replication efforts.
Pathway-based approaches have been used extensively for analyzing gene expression data. For example, Mootha et al. (12) used the GSEA approach to identify a set of genes involved in oxidative phosphorylation, whose expression was coordinately decreased in muscle from diabetic patients compared with control subjects. Wang et al. (10) recently published a modified version of the GSEA, which is designed to analyze genome-wide SNP association statistic data rather than expression data. No GWA pathway–based analyses have yet been performed for type 2 diabetes. In this study, we attempted to identify pathways associated with type 2 diabetes by applying the modified GSEA method to a type 2 diabetes GWA dataset.
We used individual level GWA data from the WTCCC type 2 diabetes study (1,7). We used a total of 393,143 autosomal SNPs genotyped across 1,924 case subjects and 2,938 population-based control subjects. We analyzed 439 pathways defined by the Gene Ontology (8), BioCarta, and KEGG (9) databases. Details of the algorithm used are provided inresearch design and methods, but briefly, the most strongly associated SNP from a gene ±200 kb of flanking sequence either side is used to represent each gene in the genome. Overlap is possible, where the same SNP can be mapped to multiple genes. An individual gene set is then assigned a statistic based on the trend for its genes to appear toward the top of the ranked association list of all genes in the genome.
A total of 26 pathways reached a nominal P < 0.05, slightly higher than the number expected by chance (0.05 × 439 = 22) (Table 1, Fig. 1). No pathways were associated with type 2 diabetes at P < 0.05 after Bonferroni adjustment (top P = 0.31). A quantile quantile (QQ) plot of the 439 pathways is shown in Fig. 1. The WNT-signaling pathway was the most strongly associated and was of interest because it contains TCF7L2, a common variation that shows the strongest association with type 2 diabetes in all type 2 diabetes GWA studies (7). Excluding the TCF7L2 gene from analysis ranked the WNT-signaling pathway second out of the 439 pathways (nominal P = 0.002). WNT signaling has been proposed as an important candidate pathway for type 2 diabetes, not only because of the TCF7L2 association but because of its potential importance in β-cell development and function (13–15). We therefore investigated the WNT-signaling pathway association using summary-level data from an additional 2,625 case subjects and 2,641 control subjects from the DGI (2) and the FUSION (3) studies. We used the type 2 diabetes association statistics available for each of the 126 WNT-signaling gene SNPs. A QQ plot of the meta-analyzed DGI and FUSION results is shown in Fig. 2. There is modest deviation away from the null distribution, but this is within the 95% concentration intervals. Details of the WNT-signaling SNPs reaching P < 0.01 are given in Table 2 and for all 126 SNPs in Supplementary Table 1 (found in an online-only appendix at http://diabetes.diabetesjournals.org/cgi/content/full/db08-1378/DC1). Expanded results are also available for the next best 12 pathways in Supplementary Table 2.
We next assessed in which pathways the genes in the 18 confirmed type 2 diabetes loci occurred (Supplementary Table 3). Based on regions defined by recombination hotspots, these 18 loci contain 23 genes, although it is important to note that a gene outside of a recombination hotspot may be the gene on which an associated SNP acts. Seven pathways contained more than one gene from the 23 implicated in the 18 loci but none more than three. The highest-ranked of these was the metallopeptidase activity pathway, but this pathway did not reach significance (P = 0.19). We also designed three artificial positive control pathways to assess what proportion of real loci need to be present in a pathway for it to be detected. The first contained the most strongly associated SNP from each of 17 known type 2 diabetes loci (6) (excluding SLC30A8 not captured on the Affymetrix chip); the second and third contained the same SNPs, diluted with 17 and 34 SNPs with median GWA χ2 association statistics. Analysis of these 17, 34, and 51 gene pathways, together with the 439 database-defined pathways, ranked each positive control pathway at the top, with nominal P values <0.0001. The 17 and 34 gene pathways reached significance after multiple testing (false discovery rate statistics of 0.006 and 0.02, respectively), but the pathway consisting of 51 genes did not (false discovery rate 0.09).
The main implication of our results is that type 2 diabetes genes fall in multiple pathways. This is consistent with the observation that, of the 23 genes implicated from the current 18 confirmed loci, a maximum of three fall in the same pathway. There are two main limitations to our study. First, we are limited by the accuracy and completeness of publicly available pathway databases. While defined pathways were used from three separate data sources, there is clearly an incomplete knowledge of biological and disease pathways. Second, results from our artificially created positive control pathways indicate that pathways need to have a large proportion of genes associated before they reach experiment-wise significance (for example, a pathway containing variants representing 17 confirmed loci and 34 nonassociated loci does not reach P < 0.05 after correcting for the number of pathways tested).
A further limitation is that the best approach to generate an association statistic for a set of genes is unclear. Large linkage disequilibrium (LD) blocks around strongly associated genes can hugely inflate the significance of an individual pathway and can introduce the problem of nonindependence across genes and pathways. Although this problem was overcome in the GSEA algorithm by selecting a single SNP to represent a gene, this in itself is limiting. A true disease-associated gene may have multiple functional variants. By representing each gene locus with a single SNP, we are therefore missing out on the potential effect of multiple association signals for a given gene. A better gene statistic would evaluate all the variation across a locus, rather than just the SNP with the best association statistic, while accounting for any LD.
Despite our conclusion that diabetes genes are likely to reside in multiple pathways, it is still interesting to note that the second highest–ranked pathway was the WNT-signaling pathway, after the exclusion of TCF7L2. Multiple studies have suggested that tight regulation of WNT signaling is required for normal pancreatic development during embryonic growth (13–15). Several WNT-signaling genes, with SNPs associated at nominal P < 0.01, are worth noting. These include the CCND2 gene, a cyclin upregulated in islet compared with whole pancreas (rs11833537 P = 0.003), and the SMAD3 (rs7178347 P = 0.0006) and PRICKLE1 (rs1796390 P = 0.001) genes, which are both expressed in the fetal pancreas (data fromhttp://www.cbil.upenn.edu/epcondb42).
It was previously shown by Mootha et al. (12), using expression data, that the oxidative phosphorylation pathway was associated with type 2 diabetes. In our analysis, using GWA data, the oxidative phosphorylation pathway ranked 431 of the 439 pathways (P = 0.98). The most likely reason for this difference is that changes in oxidative phosphorylation gene expression are caused by the disease, rather than the gene expression changes causing type 2 diabetes.
In conclusion, using type 2 diabetes GWA data and a pathway-based approach, we found that the WNT-signaling pathway ranked highest; however, SNPs from this pathway did not show convincing association in additional studies. The failure to find a highly associated individual pathway could be explained by the genetic architecture of type 2 diabetes, the power of the analysis, or the strength and completeness of the algorithms and pathway datasets used. Pathway-based approaches may be more successful for some complex traits than others, depending on the nature of the underlying disease physiology.
Funding for the project was provided by the Wellcome Trust under award 076113. M.N.W. is a Vandervell Foundation Research Fellow.
No potential conflicts of interest relevant to this article were reported.
This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data are available fromwww.wtccc.org.uk.
The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.