|Home | About | Journals | Submit | Contact Us | Français|
The majority of established prostate cancer risk-associated Single Nucleotide Polymorphisms (SNPs) identified from genome-wide association studies do not fall into protein coding regions. Therefore, the mechanisms by which these SNPs affect prostate cancer risk remain unclear. Here, we used a series of bioinformatic tools and databases to provide possible molecular insights into the actions of risk SNPs.
We performed a comprehensive assessment of the potential functional impact of 33 SNPs that were identified and confirmed as associated with PCa risk in previous studies. For these 33 SNPs and additional SNPs in Linkage Disequilibrium (LD) (r2 ≥ 0.5), we first mapped them to genomic functional annotation databases, including the Encyclopedia of DNA Elements (ENCODE), eleven genomic regulatory elements databases defined by the University of California Santa Cruz (UCSC) table browser, and Androgen Receptor (AR) binding sites defined by a ChIP-chip technique. Enrichment analysis was then carried out to assess whether the risk SNP blocks were enriched in the various annotation sets. Risk SNP blocks were significantly enriched over that expected by chance in two annotation sets, including AR binding sites (p=0.003), and FoxA1 binding sites (p=0.05). About one third of the 33 risk SNP blocks are located within AR binding regions.
The significant enrichment of risk SNPs in AR binding sites may suggest a potential molecular mechanism for these SNPs in prostate cancer initiation, and provide guidance for future functional studies.
Genome-wide association studies have identified ~33 established prostate cancer (PCa) risk-associated Single Nucleotide Polymorphisms (SNPs) (1–9). SNPs are DNA sequence variations occurring when a single nucleotide in the genome differs between individuals. Although these SNP associations have been consistently replicated in multiple studies, the functional roles of these risk SNPs remain uncharacterized, largely due to their location in DNA regions that do not encode proteins. Additionally, almost half of the SNP loci are located in regions which do not harbor any known genes. For example, multiple SNPs at 8q24 reside within a 1.3Mb gene-desert region (10), with the closest gene, c-Myc, located ~300kb downstream to rs1447295. Because the knowledge base for non-coding DNA is generally limited, few studies have been performed to evaluate the functional impact of the risk SNPs on PCa etiology (11–12). Indeed, the ability to understand the functional impact of the risk SNPs will most likely require additional emphasis on potential transcriptional regulatory mechanisms on non-coding DNA sequence.
The Encyclopedia of DNA Elements (ENCODE) project aims to provide a comprehensive catalog of biological information for functionally important elements. These elements include non-protein coding DNA sequences which regulate gene transcription, gene expression, and DNA replication. The ENCODE pilot study rigorously analyzed a small proportion (1%) of the human genome using computational and experimental methods. Results of the pilot study highlighted the complexity of transcriptional regulation and demonstrated the knowledge gap in this area (13). Based on the initial success, National Human Genome Research Institute (NHGRI) expanded the human ENCODE project to the whole genome (www.genome/gov/ENCODE). At this time, a primary focus of ENCODE is on the characterization of binding of transcriptional factors (TF) and chromatin structure, which represent two of the major factors involved in transcriptional regulation, using the ChIP-seq technique. ChIP-seq is a state-of-the-art high-throughput approach that involves chromatin immunoprecipitation and high-throughput sequencing of immunoprecipitated DNA. Compared with other methods of characterizing regulatory elements, a key advantage of ChIP-seq is a systematic and nonbiased approach, which does not depend on previous knowledge of canonical promoter regions and allows for evaluation of binding complexes of transcription factors and regulatory elements in a more natural state (14). The availability of this comprehensive catalog may facilitate an improved understanding of the functional role of risk SNPs located in non-coding regions.
In addition to the ENCODE project, several recent studies have utilized the high-throughput ChIP-chip method to identify genome-wide binding sites for transcription factors, including Androgen receptor (AR), Forkhead box A1 (FoxA1), and Estrogen receptor (ER) (15,16). Similar to ChIP-seq, ChIP-chip is another high-throughput, global approach for mapping transcription factors. ChIP-chip analysis involves isolating target DNA through chromatin immunoprecipitation, followed by analysis on DNA microarrays that tile the human genome (14). AR is a well-known transcription factor that plays an important role in prostate cancer initiation and progression. Risk SNPs located in putative AR binding sites might change the affinity of androgen-AR complex to binding sequences, thus providing a mechanism leading to modification of PCa risk.
In our study, we performed a comprehensive assessment of the potential functional impact of SNPs that were associated with PCa risk by GWAS studies, utilizing ENCODE genomic annotation databases, as well as ~20 annotation databases from the University of California Santa Cruz table browser (UCSC table browser) (http://genome.ucsc.edu/) and TF binding sites defined by previous studies (15,16). Enrichment analysis was then performed to evaluate whether the risk SNPs were over-represented in any of the annotation sets. To our knowledge, our study is among the first attempts to comprehensively characterize the potential function of risk SNPs using existing bioinformatic databases. These results assist in the interpretation of the molecular mechanisms of the risk SNPs on PCa etiology and provide guidance for future functional studies.
The causative SNP may be the risk SNP itself or the SNPs in LD with them. Therefore, we identified all SNPs in LD (r2 ≥ 0.5) with the 33 risk SNPs discovered by GWAS (SNPs that reached a genome-wide significance level with a p value equal or less than 10−7 in previous studies (1–9)) based on the CEU genotype data from the HapMap release #27 (Phase II+PhaseIII) (http://hapmap.ncbi.nlm.nih.gov/). We consider each risk SNP and SNPs that are in LD (r2 ≥ 0.5) with it as one risk SNP block.
We mapped SNPs in each risk SNP block to the ENCODE genomic annotation databases (release #2), as well as eleven annotation databases from UCSC (http://genome.ucsc.edu/) and transcription factors defined by previous studies. We defined a risk SNP block as located within a given annotated region if the risk SNP itself, or at least one of the SNPs in LD with the risk SNP, mapped to the annotated region.
We counted the number of risk SNP blocks that mapped to each annotated genomic region. Each risk SNP block was counted only once, even if more than one SNP within the same block mapped to the annotated region. A simulation analysis was used to assess the statistical significance of any potential enrichment for risk SNP blocks within annotated genomic regions, under a null hypothesis that none of these blocks were truly associated with PCa risk. We began the simulation analysis by randomly generating 1,000 sets of 33 SNPs (1,000 replicates) from the ~2.5 million SNPs in the genome with minor allele frequency (MAF)>=0.05 (Hapmap Phase II). We then identified all SNPs in LD with the randomly selected 33 SNPs, and performed the same analysis as for the true risk SNPs, including overlapping the SNP blocks with functionally annotated genomic regions and then counting the number of the SNP blocks that mapped to each annotated genomic region. Next, the mean number of risk SNP blocks that mapped to each annotated region was calculated based on the average counts of the 1,000 replicates. Finally, empirical p-values were calculated based on the number of replicates in which the number of counts was equal or larger than the observed number, divided by the total number of replicates. To reduce the concern of multiple testing, we limited the enrichment analysis to annotation sets with 5 or more mapped risk SNP blocks.
We identified a total of 972 SNPs in LD with the 33 risk SNPs. A list of these SNPs and pair-wise r2 for each risk SNP is provided in Supplementary Table 1.
We further grouped the genomic annotation databases into six categories (Table 1), majorly based on the potential functionality and techniques used to define the annotation sets: 1) Yale ENCODE (Yale Transcription Factor Binding Sites (TFBS)) characterizes the binding sites for a series of transcription factors including c-Myc, GATA-2, SIRT6, TCF7L2, STAT1, NK-kB, c-Fox, c-Jun, E2F6, Max and SIRT6; 2) Broad ENCODE (Broad histone) defines genomic regions with chromatic accessibility and histone modifications, including regions that are enriched with histone markers (H3K4m1, H3K4m2, H3K4m3, H3K27ac, and H3K9ac); 3) regulatory elements defined by UCSC table browser (http://genome.ucsc.edu/), which includes 11 genomic regulatory annotation sets; 4) a conserved region annotation set was also retrieved from UCSC phastConsElements28way and phaseConsElements17way table with conservation scores >500 (a conservation score is a measurement of the degree of conservation of a genomic region) ; 5) coding regions and splice sites that include annotation sets for the protein coding regions, and non-protein-coding RNAs (including transfer RNAs, ribosomal RNAs, small nuclear RNAs, and micro (mi) RNAs); 6) annotation sets including AR, ER and FoxA1 binding sites as defined by the ChIP-on-chip technique were obtained from previous studies (14,15).
Detailed annotation for each of the 33 risk SNP blocks are shown in Table 2. Only annotation sets with more than one mapped risk SNP block are listed in Table 2. Detailed information about the mapped SNPs that are in LD with the risk SNPs are presented in Supplementary Table 2. Briefly, 10 risk SNP blocks fall into conserved regions. No risk SNP blocks map to coding regions, non-synonymous changes, splice sites, non-protein-coding RNAs, miRNAs, miRNA target regions or methylation sites (data not shown). Based on category #6 genomic annotation databases (defined in preceding paragraph), 11, 4 and 9 risk SNP blocks were found to map to AR, ER and FoxA1 binding sites, respectively.
Risk SNP blocks were significantly enriched in genomic regions containing AR binding sites, with 11 (34.4%) risk SNP blocks mapping to these sites, whereas only 4.0 (12.5%) blocks randomly generated from the genome are located at such sites (p=0.003). Similarly, risk SNP blocks were significantly enriched in regions containing FoxA1 binding sites (7 (21.88%) vs 3.47 (10.84%); p=0.05) (Table 2). Risk SNP blocks were not significantly enriched in any of the other annotation genomic sets that were tested.
PCa risk-associated SNPs identified from GWA studies have been consistently replicated and confirmed in a large number of studies (1–9). The clinical utility of predicting an individual’s risk for PCa and identification of high risk men for PCa using these SNPs have been extensively discussed and explored (17–19). In contrast, the biological mechanisms by which the risk SNPs affect PCa initiation are poorly understood. In this study, we used bioinformatics tools to provide insight into the potential functional impact of these SNPs. Importantly, risk SNPs were found to be significantly enriched over that expected by chance in two functional annotation sets, consisting of AR binding sites (p=0.003) and FoxA1 binding sites (p=0.05).
AR, a member of the nuclear hormone receptor family, is a well-known transcription factor which plays an important role in prostate cancer initiation, although the precise mechanisms by which androgens promote prostate carcinogenesis remain ill-defined despite years of investigation. It is clear, however, that healthy men receiving preventive drugs that block the conversion of testosterone to dihydrotestosterone, the more potent androgen, experience a significant reduction (~25%) in PCa risk (20,21). Upon androgen binding, the AR-androgen dimer translocates from the cytosol into the cell nucleus. The AR-androgen dimer complex then binds to specific DNA sequences known as AR binding sites, recruiting coactivators and other factors which direct and regulate target gene expression. In our study, we demonstrated that almost one third of the 33 known risk SNP blocks are located in AR binding regions identified by the ChIP-chip method (15). A total of ~22,000 AR binding regions have been mapped across the prostate cell genome, using the Model-based Analysis of Tiling-arrays (MAT) algorithm (22) and based on a false discovery rate of 15% (p<10−4) (15). The average length of the AR binding regions is 911 base pairs (bp), with a range from 299 to 5,554 bp. The significant enrichment of known risk SNP blocks in regions that harbor AR binding regions suggests a molecular mechanism that may explain the associations between the risk SNPs and PCa risk. The statistical significance (P=0.003) remains even after a stringent Bonferroni correction (13 independent tests were performed in enrichment analysis, p=0.05/13=0.0037). Known risk SNPs that are located within the putative AR binding sites may change the binding affinity of the AR-androgen complex for the binding sequence, leading to altered expression of AR target genes that presumably play rate limiting roles in PCa formation.
Risk SNP blocks were also enriched in FoxA1 binding sites with a nominal p value of 0.05. FoxA1 acts as an AR collaborating cofactor and assists nuclear receptor binding in certain genomic regions (23,24). The coupling of FoxA1 to binding sites is required for AR binding to enhancers in multiple AR-targeted genes (15). SNPs that reside in FoxA1 binding sites may affect the binding affinity of FoxA1 protein and lead to increased risk for PCa. However, the importance of risk SNPs and FoxA1 binding need to be interpreted with caution since the enrichment in FoxA1 did not reach statistical significance after Bonferroni correction (P = 0.65 after Bonferroni correction).
One advantage of our study is the use of a comprehensive and unbiased approach to evaluate TF binding regions defined by ChIP-seq and ChIP-chip techniques, which allowed us to evaluate the regulatory elements on a genome-wide level. Regulation of gene expression is a complicated process and current knowledge in this area is very limited. The pilot study of the ENCODE project revealed that only about 25% of the regulatory elements are located near previously identified transcription starting sites (TSS). This suggests the ChIP-on-chip method is able to identify a large number of promoter regions and regulatory elements that were previously unknown and distant from the classical TSS (13). Although we did not observe any significant enrichment of risk SNP blocksin the ENCODE annotation sets, this might be due to tissue-specific patterns of TF and TF regulation. Currently, a variety of cell lines, mainly Hela and GM06990 have been used for the identification of regulatory elements in the ENCODE project. The LNCaP cell line is currently proposed for study by ENCODE consortium in Tier 3 (http://genome.ucsc.edu/ENCODE/cellTypes.html). Prostate cancer tissue-specific TF binding may provide valuable information for evaluating the molecular mechanisms of risk SNPs on PCa risk.
The fact that no risk SNP blocks are located in regions that code proteins may be due to two reasons. First, we only evaluated SNPs that are characterized by the Hapmap project, in which unknown SNPs that are located in protein coding regions are not evaluated. Secondly, the current resources that are commonly used for gene annotation, RefSeq and ENSEMBLE, likely represents an incomplete catalogue of human genes. SNPs may be located within exons that are not yet identified. In fact, about 60% of GENCODE exons (GENCODE is a sub-project of ENCODE, which aimed to provide a reference annotation of all protein-coding genes within the pilot study of the ENCODE project) are not annotated in RefSeq and ENSEMBL (25). This fact indicates that a high number of alternative slice forms with unique exons exist across the genome (25). With the completion of ENCODE in the near future, a richer and more complete annotation of human genes should provide more insight.
We also did not observe significant enrichment of risk SNP blocks in the regulatory annotation sets defined by UCSC table browsers. However, the null results for these annotation sets need to be interpreted with caution. The majority of the regulatory elements annotation sets are defined by computational algorithms, rather than by biological experiments. In addition, the regulatory elements predicted in those annotation sets are not tissue specific.
In summary, our study is among the first to comprehensively evaluate the potential functional impact of risk loci identified for PCa through GWAS studies. The fact that about one third of 33 SNP blocks fall within AR binding regions, and that the risk SNPs were statistically enriched in AR regions, may suggest a potential molecular mechanism by which risk SNPs contribute to PCa initiation. These results also provide a guidance for future functional studies. In addition, the databases used for bioinformatics annotation could also be used to annotate and prioritize variants identified through GWAS and whole-genome sequencing.
Grants: National Cancer Institute (CA129684, CA140262, CA148463 to J.X.) and Department of Defense (W81XWH-09-1-0488 to J.S.)