|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWASs) have primarily focused on marginal effects for individual markers and have incorporated external functional information only after identifying robust statistical associations. We applied a new approach combining the genetics of gene expression and functional classification of genes to the GWAS of basal cell carcinoma (BCC) to identify potential biological pathways associated with BCC. We first identified 322,324 expression-associated single-nucleotide polymorphisms (eSNPs) from two existing GWASs of global gene expression in lymphoblastoid cell lines (n=995), and evaluated the association of these functionally annotated SNPs with BCC among 2,045 BCC cases and 6,013 controls in Caucasians. We then grouped them into 99 KEGG pathways for pathway analysis and identified two pathways associated with BCC with p-value < 0.05 and false discovery rate (FDR) < 0.5: the autoimmune thyroid disease pathway (mainly HLA class I and II antigens, p < 0.001, FDR = 0.24) and JAK-STAT signaling pathway (p = 0.02, FDR = 0.49). Seventy nine (25.7%) out of 307 eSNPs in the JAK-STAT pathway were associated with BCC risk (p < 0.05) in an independent replication set of 278 BCC cases and 1,262 controls. In addition, the association of JAK-STAT signaling pathway was marginally validated by using 16,691 eSNPs identified from 110 normal skin samples (p = 0.08). Based on the evidence of biological functions of the JAK-STAT pathway on oncogenesis, it is plausible that this pathway is involved in BCC pathogenesis.
Basal cell carcinoma (BCC) is the most common cancer in Caucasian populations, accounting for approximately 80 percent of all non-melanoma skin cancers, and its incidence is increasing (de Vries et al. 2004). Although the mortality attributable to BCC is not high, the disease is responsible for considerable morbidity, imposing a growing burden on healthcare services. Besides the known risk factors for BCC, including exposure to ultraviolet radiation and host factors such as fair complexion, red or blond hair, and light eye color (Gallagher et al. 1995; Lear et al. 1997; Rubin et al. 2005), germline polymorphic variants have been implicated as playing a role in disease development (Lear et al. 2005). More recently, genome-wide association studies (GWAS) have identified several new loci as associated with BCC (Stacey et al. 2008; Stacey et al. 2009). However, the GWAS studies have focused only on the most significant SNPs (those that fulfilled a stringent statistical “genome-wide” significance criterion), while paying little attention to the rest. Thus, new methods are emerging to utilize the remaining genetic information in the GWAS data.
Given that gene-gene interactions may contribute to complex diseases, combining the modest association signals in the GWAS data with information on biological pathways and networks can help to detect the joint effects of multiple genes (Pedroso 2010). Some have argued that a large proportion of disease susceptibility genes will be found to be functionally related and/or interact with one another in biological pathways, and that only a limited number of biological pathways contribute to the etiology of complex traits (Carlborg and Haley 2004). Thus, pathway-based approaches have been applied to the GWAS of several complex diseases in determining whether the cumulative contribution of genes with a common biological denominator is greater than expected by chance (Baranzini et al. 2009; Elbers et al. 2009; Menashe et al. ; Wang et al. 2007; Wang et al. 2009). In these studies, the representative SNPs for the genes were chosen from all the SNPs physically located in the gene region; however, SNPs in a gene region might not represent functional variants of the gene, and a true disease-associated gene may have multiple independent functional variants. Furthermore, a gene may be regulated in trans by genetic variants that are physically distant from the structural gene (Schadt et al. 2008). It has been demonstrated that SNPs associated with gene expression in liver and adipose tissues are enriched for association with type 2 diabetes (Zhong et al. 2010a). By applying this method to the GWAS of type 2 diabetes, several well-known pathways as well as some novel ones have been identified in the etiology of this disease (Zhong et al. 2010b). Therefore, a new approach integrating the genetics of gene expression and GWAS pathway analysis is appealing for its ability to improve the power of detecting the association (Zhong et al. 2010b).
In this study, we adapted this approach to integrate pathway information and two existing databases of gene expression quantitative trait loci (eQTL) for the GWAS on BCC. We used the expression-associated SNPs (eSNPs) discovered in two GWAS of global gene expression in Epstein-Barr virus transformed lymphoblastoid cell lines (LCL) (Dixon et al., 2007 and unpublished data, for details see method), and sought to uncover the important biological pathways associated with BCC risk that might help provide insight into the etiology of BCC.
Global gene expression data were analyzed by two techniques in two independent samples. The first sample (referred to as MRCA) contains 405 children of British descent (Dixon et al. 2007), organized into 206 sibships, including 297 sib pairs and 11 half-sib pairs. The families were identified through a proband with childhood asthma, and siblings were included regardless of disease status. Global gene expression in LCLs was measured using Affymetrix HG-U133 Plus 2.0 chips. All 405 children and their parents were genotyped using the Illumina Sentrix Human-1 Genotyping BeadChip (ILMN100K, including 105,713 autosomal SNPs), and 378 children were also genotyped using Illumina Sentrix HumanHap300 BeadChip (ILMN300K, including 307,981 autosomal SNPs) according to the manufacturer's instructions (Dixon et al. 2007; Moffatt et al. 2007).
The second sample (referred to as MRCE, data unpublished) of 950 individuals from 320 families of British descent was genotyped using the Illumina Sentrix HumanHap300 Genotyping Beadchip. The genotyped sample consisted of 347 subjects with asthma and 487 subjects with atopic dermatitis (259 subjects with both diseases). Expression arrays using Illumina Human 6 BeadChips were available on 550 children. Expression values were estimated using BeadStudio (Illumina, San Diego), and bead summary data were used for downstream analysis. From a total of 47,293 probes, we excluded 30,806 probes deemed “absent” (detection score <0.95) in more than 80% arrays to eliminate noise. We retained 16,487 probes representing 15,576 genes for analysis. The data were then normalized using quantile normalization (Bolstad et al. 2003).
We performed genotype imputation in the MRCA and MRCE samples independently. The ILMN300K genotypes were used to impute the polymorphic SNPs in the Phase II HapMap. Imputation was carried out using a hidden Markov model programmed in MACH (Li et al. 2009).
We estimated non-genetic contributions in gene expression measures using principal component analysis (Leek and Storey 2007; Stegle et al. 2008). Top principal components were included in the eQTL regression model as covariates. The number of principal components used was chosen to maximize the number of genome-wide significant cis eQTLs. Association analysis was applied with the FASTASSOC option implemented in MERLIN (Abecasis et al. 2002; Chen and Abecasis 2007).
Finally, we obtained 367,232 eSNPs associated with at least one gene in cis with LOD > 4 (corresponding to false discovery rate, FDR < 0.3%, accounting for all cis SNP-gene pairs within 1Mb) or one gene in trans with FDR < 5%. In total, 18,988 genes were found to be associated with at least one of these eSNPs.
The skin eQTL analysis was conducted among 110 normal skin tissues (Ellinghaus et al. ; Nair et al. 2009). We identified 16,697 cis SNPs as skin eQTLs, which were associated with the expression of at least one gene in the skin with LOD > 4.
The BCC GWAS comprised 2,045 BCC cases and 6,013 controls from five GWAS sets (Hunter et al. 2007; Qi et al. 2010; Rimm et al. 1991), which reflects the postmenopausal invasive breast cancer case-control study (controls only) nested within the Nurses’ Health Study (BC_NHS, BCC cases=248, BCC controls=816), the type 2 diabetes case-control study nested within the NHS (T2D_NHS, BCC cases=665, BCC controls=2,162), the type 2 diabetes case-control study nested within the Health Professionals Follow-up Study (T2D_HPFS, BCC cases=597, BCC controls=1,555), the coronary heart disease case-control study nested within the NHS (CHD_NHS, BCC cases=253, BCC controls=765), and the coronary heart disease case-control study nested within the HPFS (CHD_HPFS, BCC cases=282, BCC controls=715). All the cases are self-reported BCC patients, and we excluded those who carried diagnosis of other common cancers before diagnosis of BCC based on common cancers reported by National Cancer Institute and American Cancer Society. For BCC controls, we also excluded those individuals who carried diagnosis of BCC or other common cancers. Those common cancers included melanoma, SCC, breast cancer, endometrial cancer, ovarian cancer, colorectal cancer, bladder cancer, lung cancer, pancreatic cancer, kidney (renal cell) carcinoma, leukemia, Non-Hodgkin Lymphoma, thyroid cancer, and oral cancer. The overlapping samples among the five GWAS sets were removed. All the individuals in our study are of European ancestry and live in the United States. The study protocol was approved by the Institutional Review Board of Brigham and Women's Hospital and Harvard School of Public Health. Detailed descriptions of the study populations and genotyping platforms for each study have been reported previously (Hunter et al. 2007; Qi et al. 2010; Rimm et al. 1991).
We used the Illumina HumanHap550 array for the genotyping in BC_NHS (Hunter et al. 2007) and used the Affymetrix 6.0 array for the other four GWAS sets (T2D_NHS, T2D_HPFS, CHD_NHS, and CHD_HPFS) (Qi et al. ; Rimm et al. 1991). Within each of the five studies, we used the MACH program (Li et al. 2009) to impute genotypes for more than 2.5 million SNPs based on the genotyped SNPs and haplotype information in the HapMap phase II data build 36 (CEU) (Li et al. 2009; Marchini et al. 2005). eSNPs with imputation R2 > 0.4 and minor allele frequency (MAF) > 2.0% in all of the five sets were included in further analysis. Finally, 322,324 out of 367,232 eSNPs in the LCL eQTL database were used for the pathway enrichment analysis. In the replication stage, 16,691 out of the 16,697 skin eSNPs with imputation R2 > 0.4 and MAF > 2.0% in the BC_NHS set were used. Due to the very limited number of the identified skin eQTLs, we conducted the replication analysis in the BC_NHS set, which had the largest number of eQTLs passing quality control.
Associations between each SNP and BCC risk were determined using a multivariable logistic regression model; the first three principal components of genetic variation were adjusted in the model within each of the five sets. These principal components were calculated for all individuals on the basis of ca. 10,000 unlinked markers using the EIGENSTRAT software (Price et al. 2006). For each SNP, within-cohort association results were combined in an inverse variance–weighted meta-analysis by using software METAL (http://www.sph.umich.edu/csg/abecasis/Metal/index.html).
We collected pathway data from the Kyoto Encyclopedia of Genes and Genomes (KEGG). Ninety-nine pathways that contain at least 20 but at most 200 genes represented by eSNPs were tested. The cutoff of 20–200 genes was selected as a good balance for gene sets to reduce the multiple-testing issue and to avoid testing overly narrow or broad functional categories, as suggested by Wang et. al.(Wang et al. 2007)
We applied the method of Zhong et al. (Zhong et al. 2010b) to integrate the eSNP information into the pathway-based GWAS analysis. Specifically, for each gene, we evaluated the association between BCC risk and all the eSNPs which were associated with the expression of this gene, and then assigned the eSNP which was most significantly associated with BCC risk into each gene as its representative. Then we assigned these genes into the pathways. We used the Kolmogorov-Smirnov test statistic to evaluate the association between each pathway and BCC risk. For each pathway, we calculated the Enrichment Score (ES), which reflects the overrepresentation of genes within this pathway at the top of the entire ranked list of genes in the genome. A permutation procedure (1,000 permutations, permuting the case-control status and re-calculating the statistic value) was used to assess the significance of the ES. A normalized enrichment score (NES) was calculated for each pathway in the observed and permutated data to allow direct comparison of pathways of different sizes. A false discovery rate (FDR) was calculated to estimate the proportion of false positive findings by using NES (Reiner et al. 2003). The detailed algorithm was described by Zhong et al. (Zhong et al. 2010b) and Wang et al. (Wang et al. 2007). We set the significance level for the pathway analysis as p value < 0.05 and FDR < 0.5.
We included the individuals in the kidney stone case-control study (KS) nested in the NHS, the Nurses’ Health Study II (NHS2) and the HPFS as the replication set. NHS2 was established in 1989, when 116,678 female registered nurses aged 25-42 and residing in the United States at the time of enrollment responded to an initial questionnaire on their medical histories and baseline health-related exposures. Using the same criteria of case identification as used for the BCC GWAS set, we identified 278 BCC cases and 1,262 controls in the KS set. Illumina Human610 array was used for genotyping and the same imputation procedure was conducted as for the BCC GWAS set.
From two existing eQTL databases based on LCL (Dixon et al., 2007 and unpublished data, for details see method), we identified 367,232 eSNPs associated with gene expression in LCLs. Among them, 322,324 eSNPs had MAF > 2.0% and imputation R2 > 0.4 in our BCC GWAS, and were used for further analysis. A total of 11,478 genes with eSNPs significantly associated with their abundance of transcripts in LCL were assigned to the KEGG pathways. Finally, 99 KEGG pathways that contained between 20 and 200 eligible genes with qualified eSNPs associated with their expression were tested for association with BCC risk in our GWAS data.
From the pathway enrichment analysis, eight (8.08%) out of the 99 KEGG pathways reached a nominal p value < 0.05, which was 1.63 times as high as the number expected by chance (0.05 × 99 = 4.95, a conservative estimate, for that pathways may be correlated due to overlapping genes and the effective number should be smaller than 99). Two of these eight pathways reached a FDR of less than 0.5 (Table 1; complete association results for all the 99 KEGG pathways are listed in Table S1). The two pathways with significant enrichment in the BCC GWAS were the autoimmune thyroid disease pathway (p < 0.001, FDR = 0.24) and the Janus kinase-signal transducer and activator of transcription (JAK-STAT) signaling pathway (p = 0.02, FDR = 0.49). In addition, interleukin 6 signal transducer (IL6ST, smallest p-value for BCC association, PBCC = 0.01) and interleukin 6 receptor (IL-6R, smallest PBCC = 0.03) in the JAK-STAT signaling pathway were associated with BCC risk. According to the pathway definition in the KEGG database, there is some overlap between the two pathways: 32.7% of the genes in the autoimmune thyroid disease pathway are also in the JAK-STAT signaling pathway.
There were 25 genes with eSNPs in the autoimmune thyroid disease pathway, among which 21 (84.0%) genes had eSNPs with PBCC less than 0.05. Twenty-five (41.7%) out of 60 genes with eSNPs in the JAK-STAT signaling pathway had eSNPs with PBCC less than 0.05. The representative eSNPs of the genes in the two pathways that had PBCC < 0.01 are listed in Table 1, and all the representative eSNPs in the two pathways are listed in Table S2. The quantile quantile (QQ) plots of all the representative eSNPs for the two pathways are shown in Figure 1.
We attempted to validate the findings of the two pathways with BCC using the skin eQTLs generated from 110 normal skin tissues (Ding et al. 2010). A total of 16,697 skin eQTLs were identified and 16,691 of them with MAF > 2.0% and imputation R2 > 0.4 in the BCC GWAS were used in the final analysis. There were 619 (77%) out of the 803 genes in skin eQTL data overlapping with those from the LCL eQTL data. As a result, the association for JAK-STAT pathway was marginal significant (p-value = 0.08). The association between the autoimmune thyroid disease pathway and BCC risk was not significant using the skin eQTL data (p-value = 0.56).
We further conducted a replication analysis in an additional set of 278 BCC cases and 1,262 controls for the JAK-STAT pathway. We found that 79 (25.7%) out of 307 LCL eSNPs within this pathway were associated with BCC in the replication set (p < 0.05), which were 5.3-folder more than by chance (307×0.05=15). This finding validated the enrichment of association signals for the eSNPs within this pathway.
GWASs have primarily focused on marginal effects for individual markers and functional pathways were investigated only after robust statistical associations were identified. In this study, we integrated pathway analysis and functional annotation of SNPs from existing eQTL databases for a GWAS on BCC risk. This approach improved power to detect relevant functional pathways that are directly regulated by BCC-associated genetic variants. This method has recently shown its potential strength in the context of type 2 diabetes GWAS (Zhong et al. 2010b); however, application to cancer has not been reported. Using this method, we integrated the LCL eQTL information and identified two pathways that might be implicated in the etiology of BCC: the autoimmune thyroid disease pathway and the JAK-STAT signaling pathway. The latter was also found to be marginal significant using the skin eQTL information based on a small dataset.
Ideally, the skin eQTLs are more relevant to the study on BCC. However, the available eQTLs from skin tissues were based on a small sample (n=110 for skin tissue samples vs. n= 995 for LCL samples), which limits the power to identify eQTLs that are specific to the skin. After quality control, there were only 16,691 skin eQTLs identified for the association with the expressions of 803 genes compared to 322,324 LCL eQTLs identified for 11,487 genes, which would substantially reduce the statistical power of the pathway analysis and limit the number of genes within each pathway. Although patterns of gene expression may vary across different tissue types, eQTLs could be substantially shared between certain tissues. Hence, eQTLs identified from one tissue type could be a useful surrogate to study the genetic of gene expression in another tissue (Cookson et al. 2009; Ding et al. 2010). The eQTLs identified in the LCL (include the eQTLs used in our paper and other LCL data such as the HapMap samples) have helped interpret findings from GWAS where the LCL is not the directly relevant tissue, including human height, body mass index, waist-hip ratio, osteoporosis-related traits, childhood asthma and Crohn disease among many others (Czarnecki et al. ; Dixon et al. 2007; Heid et al. ; Hsu et al. ; Lango Allen et al. ; Libioulle et al. 2007). In a recent study, it was estimated that ~70% of cis-eQTLs in LCLs are shared with skin (Ding et al. 2010). In addition, the study on psoriasis reported a substantial overlap (>95%) of cis-eQTLs among three types of skin tissues (normal skin in healthy controls, lesional psoriatic skin and uninvolved skin in psoriasis patients) (Ding et al. 2010). Therefore, we used the LCL eQTLs in the discovery stage, and the skin eQTLs for validation. As a result, we marginally validated the finding of one pathway but not the other. It is possible that some important eSNPs for genes in the other pathway were missed using the current skin eQTL data due to insufficient power (n=110 skin expression data vs n=995 in LCL expression data). Nevertheless, we do not rule out the possibility that some eQTLs were specific to lymphocytes and not observed in the skin eQTL dataset.
Signal transducer and activator of transcription (STAT) has been implicated in programming gene expression in biological events such as embryonic development, programmed cell death, organogenesis, innate immunity, adaptive immunity and cell growth regulation in many organisms. The JAK-STAT pathway is the principal signaling mechanism for a wide array of cytokines and growth factors including IL-6, -21, -23 and IFN, and is well known to contribute to oncogenesis (Boudny and Kovarik 2002). In particular, IL-6 signaling via STAT3 promotes the Th-17 phenotype and IFNgamma signaling via STAT4 promoted Th-1 phenotype, both important T cell subtypes for cancer surveillance. Changes with signaling via JAK/STAT, either aberrant function resulting in impaired surveillance or enhanced activation can result in susceptibility to cancer. Persistent activation of STAT signaling has been shown in an increasing number of human cancers, including a wide variety of solid tumors such as breast and prostate cancer, as well as many leukemias and lymphomas (Boudny and Kovarik 2002; Buettner et al. 2002; Yu and Jove 2004). As for BCC, it has been reported that the microenvironment of BCC is dominated by several cytokines including IL-6 (Elamin et al. 2008), which enhances tumorigenic potential and anti-apoptotic activity in human BCC (Jee et al. 2001). The JAK-STAT pathway has been reported to transmit IL-6 signals and induce cell survival and angiogenic activity in BCC cell lines (Jee et al. 2002; Jee et al. 2004), and over-expression of IL-6 has been reported in the tissue of BCC (Gambichler et al. 2006). In addition, an IL-6 polymorphism has been associated with BCC risk in a population study (Wilkening et al. 2006). Consistent with these reports, we found that IL6ST (p = 0.01) and IL-6R (p = 0.03) were associated with BCC risk in the JAK-STAT signaling pathway, which further suggests the involvement of IL-6 in disease development via JAK-STAT signaling.
As for the autoimmune thyroid disease pathway, it includes HLA class I and class II antigens, which play an important role in tumor immune surveillance by presenting tumor antigens to T lymphocytes (Streilein 1991). The association between various HLA antigens and BCC risk has been reported by a number of publications (Bavinck et al. 2000; Bouwes Bavinck et al. 1997; Bouwes Bavinck et al. 1991; Cerimele et al. 1988; Czarnecki et al. 1991; Myskowski et al. 1985). However, the association of this pathway was not validated using the skin eQTLs and is upon further replication.
Although none of the eSNPs of the genes in the identified pathway reached genome-wide significance (0.05/2,500,000 = 2×10-8) in our BCC GWAS, we tested for the enrichment of BCC risk in predefined pathways by incorporating prior functional information available from existing eQTL databases. This approach could improve power to detect functionally relevant groups of genes. There are still some limitations in this study. First, we used the LCL eQTLs to surrogate those of the skin and it is possible that some important genes that are specifically expressed in skin tissue and play critical roles in BCC etiology may be missed. Second, we focused only on the eQTLs that affect gene expression. Hence, other potential functional SNPs (e.g. methylation SNPs, splicing SNPs and missense/nonsense SNPs) may be missed. This approach also highly depends on the accuracy and completeness of the existing eQTL database. As the sample size increases and newer technology covers more depth of gene expression, more eSNPs are likely to be identified. Third, the FDR of the identified pathway is not particularly low. Further replication using additional samples and functional investigation would be required to illustrate the true effect of this pathway.
In this study, we identified the JAK-STAT pathway with known biological functions in oncogenesis that may be involved in BCC development. The role of the JAK/STAT pathway in the host Th1/Th17 response may be associated with risk for BCC. Further studies are warranted to verify our findings and additional functional characterization of this pathway is needed to elucidate its biological mechanism in the etiology of BCC. It is noteworthy that this approach, which integrates the pathways and eSNPs, may help identify potential biological mechanisms underlying other diseases using GWAS data.
We thank Dr. David Hunter, Dr. Frank B. Hu, and Dr. Eric B. Rimm for their work on the five GWAS sets (BC_NHS, T2D_NHS, T2D_HPFS CHD_NHS and CHD_HPFS). We thank Dr. Hongmei Nan for her work in BCC GWAS dataset construction and Dr. Mousheng Xu for his technical support in programming. We thank the participants in the Nurses’ Health Study and the Health Professionals Follow-Up Study for their dedication and commitment. We also thank the following state cancer registries for their help: AL, AZ, AR, CA, CO, CT, DE, FL, GA, ID, IL, IN, IA, KY, LA, ME, MD, MA, MI, NE, NH, NJ, NY, NC, ND, OH, OK, OR, PA, RI, SC, TN, TX, VA, WA, WY.
The NHS and HPFS cohorts are supported by NIH grants CA87969, CA055075, and CA49449.
Conflict of Interest
The authors state no conflict of interest.