4.1 Lung cancer
We illustrate the GSCA approach using the three lung cancer microarray datasets considered in Parmigiani
et al. (
2004) and Subramanian
et al. (
2005) and described in detail in Garber
et al. (
2001), Bhattacharjee
et al. (
2001) and Beer
et al. (
2002). Briefly, the three studies referred to here as the Stanford (Garber
et al.,
2001), Harvard (Bhattacharjee
et al.,
2001) and Michigan studies (Beer
et al.,
2002), were aimed at characterizing lung tumor gene expression profiles relative to that of normal lung tissue. The Stanford and Harvard studies include many subtypes of lung cancer, while the Michigan study focuses on lung adenocarcinomas, a tumor subtype included in the other two studies. The Harvard, Michigan and Stanford studies contain 17, 10, 5 normals and 139, 86, 41 tumor samples, respectively.
We considered the Entrez Gene ID, Unigene ID and Gene Symbol for gene matching. The Entrez Gene IDs were used as this ID gave the biggest inter-study gene coverage overlap for the lung cancer data. Following gene matching, the 3924 genes that appeared in all three studies were annotated into 3649 gene sets including 3471 GO categories and 178 KEGG pathways of at least size 3. We note that GSCA conducted within each study would not require any gene matching; however, gene matching was done here prior to all analyses to facilitate comparison of the GSCA results with the meta-GSCA results that follow.
The GSCA approach was applied to each of the three studies in isolation. shows the total number of DC gene sets identified within each study at varying levels of FDR. Given the dispersion index specified in (
1), the identification of a given set as DC could be due to a small number of gene pairs showing large differences in correlation between conditions, to many gene pairs showing moderate difference, or both. As a result it is useful to further investigate the identified gene sets to gain insight into specific sources of DC.
| Table 1.Number of significant DC gene sets (total 3649) |
Consider a particular gene set, the immune response gene set GO:0006955, which was identified as DC at FDR 4.2% using the Harvard study data. To focus on a subset of the 211 genes in this set, genes were rank ordered by
P-values derived from the hypergeometric test described in (
3). The 30 genes with smallest
P-values are shown in the upper panel of . A striking feature concerns the presence of relatively stronger co-expressions in the normal condition. A closer look at a subset of the network (lower panel) highlights a few specific differences. For example, the co-expression between CFP and SKAP1 increases in cancer compared with normal; the opposite holds for CFP and RBM4. CFP, complement factor properdin, is a member of the properdin family which is known to play an important role in the immune system (Ivanovska
et al.,
2008; Stover
et al.,
2008) and has been associated with numerous types of cancer (Rottino
et al.,
2006).
Similar results are observed in the Michigan and Stanford studies (see
Supplementary Fig. S3), although this gene set did not reach the same level of statistical significance. Using the Michigan data, the gene set is identified as DC at FDR 8.3%. With the Stanford data, the estimated FDR for this gene set is 49%, which is clearly quite high. However, we note that 0.49 was the smallest
q-value observed in the Stanford DC analysis, largely due to the relatively small sample size.
When the data are combined in a meta-GSCA, the immune response gene set GO:0006955 as well as a number of others (48 at 5% FDR, shown in
Supplementary Table S1) is identified as significantly DC. As in the case of GO:0006955, the sets identified in the meta-GSCA are largely those showing moderate, but not necessarily statistically significant evidence of DC within each study. This is shown in , where the study-specific GSCA
q-values are plotted for each study, and color-coded according to the meta-GSCA
q-values. As shown, most of the gene sets identified as statistically significant in the meta-GSCA are those having relatively small (although not necessarily significant) study-specific
q-values.
displays the gene-specific average DCs for 50 of the 211 genes in GO category GO:0006955, rank ordered by the hub-gene test described in
Section 2.3. The most significant hub gene identified in this set is LCK, lymphocyte-specific protein tyrosine kinase, a much studied gene that is associated with lung and other kinds of cancer (Harashima
et al.,
2001; Imai
et al.,
2001; Krystal
et al.,
1998; Naito
et al.,
2007). Slow decay of the average DCs as shown suggests that there is not a single gene, or a few genes, driving the DC call for this dataset, but rather many genes showing a similar amount of DC overall. Investigation of such plots can be useful when identifying DC gene sets for which there are a few genes giving rise to a majority of the observed DC.
A similar calculation was carried out for each of the 48 gene sets identified in the meta-GSCA. The top 10 hub genes (10 genes with smallest hub-gene test
P-values) were recorded for each set and the 12 genes showing up at least five times in the top 10 across the 48 sets are shown in . There we give the gene name and the number of sets (out of 48) for which that gene is in the top 10 hub-gene list. As genes present in many gene sets are favored for over representation, we also report the number of sets (out of 48) that the gene appears in, the number of genes that appear in that many sets and the number of genes that appear in at least that many sets. For example, MXI1 appears in the top 10 genes in 10 of 48 gene sets. It is present in 11 of the 48 gene sets; 92 other genes are present in exactly 11 of the 48 gene sets; and 314 genes are present in 11 or more of the 48 gene sets. MXI1 is a well-known tumor suppressor gene (Ariyanayagam-Baksh
et al.,
2003; Eagle
et al.,
1995; Kim
et al.,
1998; Petersen
et al.,
1998), and has recently been studied with respect to its interactions with other genes (Corn and El-Deiry,
2007; Dang
et al.,
2008; Dooley
et al.,
1995; Tsao
et al.,
2008). A number of other interesting genes made the list, including TGFB1 which has been associated with lung cancer risk (Park
et al.,
2006), and BMP7, a gene recently identified as a potential therapeutic target for breast cancer (Yan and Chen,
2007) and metastatic bone disease (Buijs
et al.,
2007).
| Table 2.Common hub genes identified in 48 gene sets |
We note that the results discussed here are largely distinct from those obtained from traditional enrichment methods (for details on the enrichment methods employed here, see
Section 2.4). GSEA applied to the Harvard data found no gene sets to be enriched for DE genes at FDR 5% (or 10% FDR; the smallest
q-value from the GSEA analysis was 0.18). The upper left panel of
Supplementary Figure S4 suggests that most of the gene sets show little DE between tumor and normal. That is not the case for DC (upper right panel of Fig. S4). Tests for enrichment among DC gene pairs gave analogous results with only eight gene sets identified at 5% FDR, compared with 312 gene sets identified by GSCA (). Two of the eight are represented in the 312; all eight are represented in the 1663 sets identified by GSCA at FDR 10%. A similar finding was observed in the meta-analysis. Rhodes
et al. (
2002) and Choi
et al. (
2003) identified 111 and 1534 of the 3924 genes to be DE at FDR 5%, with each of the 111 genes contained in 1534. The GO category GO:0006955 highlighted in contained five genes identified as DE using the method of Rhodes
et al. (
2002); the method of Choi
et al. (
2003) identified 86 DE genes. Given these totals, GO:0006955 was not found to be enriched for DE genes using the hypergeometric approach described in Falcon and Gentleman (
2007) at FDR 5%. Indeed, neither the DE list derived from Rhodes
et al. (
2002) nor Choi
et al. (
2003) showed enrichment for
any of the 3649 gene sets at FDR 5%; and, as shown in
Supplementary Figure S5, the discrepancy between the meta-GSCA and enrichment tests is not due to the FDR threshold.
4.2 Diabetes
We performed a second meta-analysis with diabetes data obtained by searching the public repository NCBI GEO (Gene Expression Omnibus) and DGAP (Diabetes Genome Anatomy Project). As of February 29, 2008, NCBI GEO returned 79 GEO Series (GSE) for the search term ‘diabetes’. After removing series only peripherally related with diabetes, series without Entrez Gene ID annotation, series with fewer than three biological replicates and series without raw data files uploaded, 16 datasets from human, mouse and rat remained eligible for analysis. DGAP provided an additional six datasets for which the diabetic and normal conditions were clearly described. A brief summary of the 22 datasets is given in
Supplementary Table S2. After gene matching by NCBI HomoloGene build 61, the 22 datasets represent 2349 common genes and 2253 common gene sets defined by GO and KEGG of size at least 3. We further reduced the collection to include eight experiments with sample size larger than 5, based on the shape of the distribution of correlation coefficients obtained from simulations (
Supplementary Fig. S6). The final eight sets are marked in
Supplementary Table S2.
Meta-GSCA on the eight datasets identified 47 gene sets significantly preserved across studies at 5% FDR (sets are shown in
Supplementary Table S3). The approach again identifies what are likely biologically meaningful gene sets. For example, the KEGG pathway for mitogen-activated protein kinase (MAPK) signaling was identified. This pathway is known to play a key role in both types I and II diabetes (Evans
et al.,
2003; Wellen and Hotamisligil,
2005). NR4A1, nuclear receptor subfamily 4, group A, member 1 in particular, included in this and many other significant sets, has been reported to be a regulator of hepatic glucose metabolism (Pei
et al.,
2006). PDK4, pyruvate dehydrogenase kinase, isozyme 4a, well known to be associated with diabetes (Cadoudal
et al.,
2008; Kim
et al.,
2006), is another gene that also appears in many significant gene sets.
A particularly interesting gene set among the 47 is GO:0007169. shows a subset of six important genes selected from GO:0007169 as done for the lung cancer results shown in the lower panel of . The gene set contains ERBB4, a gene known to be involved in pancreatic islet cell development (Huotari
et al.,
2002; Kritzik
et al.,
2000; Miettinen
et al.,
2000), which our group has recently shown to be predictive of type 2 diabetes (Keller
et al.,
2008). In the normal condition, ERBB4 is non-negatively correlated with CRYAB and PRKCA in seven of the eight studies; the correlations are largely negative in the diabetic condition.