|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWAS) of prostate cancer have identified single nucleotide polymorphism (SNP) markers in a region of chromosome 11q13.3 in men of European decent. A fine-mapping analysis with tag SNPs in the Cancer Genetic Markers of Susceptibility (CGEMS) study identified three independent loci, marked by rs10896438, rs12793759, and rs10896449. This study further annotates common and uncommon variation across this region.
A next generation resequence analysis of a 122.9kb region of 11q13.3 (68,642,755-68,765,690) was conducted in 78 unrelated individuals of European background, 1 CEPH trio, and 1 YRI trio.
In total, 644 polymorphic loci were identified by our sequence analysis. Of these, 166 variants – 118 SNPs and 48 insertion-deletion polymorphisms (indels) – were novel, namely not present in the 1000 Genomes or International HapMap Projects. We identified 22, 25, 6, and 4 variants strongly correlated (r2 ≥ 0.8) with rs10896438, rs10896449, rs12793759, and rs11228565, respectively. HapMap SNPs were in linkage disequilibrium (r2 ≥ 0.8) with 48%, 69%, 14%, and 60% of SNPs marking bins by rs10896438, rs10896449, rs12793759, and rs11228565, respectively.
Our next generation resequence analysis compliments publicly available datasets of European descent (HapMap, build 28 and 1000 Genome, Pilot 1, Oct 2010), underscoring the value of targeted resequence analysis prior to initiating functional studies based on public databases alone. Increasing the number of common variants enables investigators to better prioritize variants for functional studies designed to uncover the biological basis of the direct association(s) in the region.
Prostate cancer is the most common non-cutaneous cancer in men (1). Prior to the age of genome-wide association studies (GWAS), established risk factors included age, family history, and ethnic background. Previous studies have estimated that genetic risk factors overall could account for up to 42% of risk, a figure which may be higher in men of African-American background (2). GWAS of prostate cancer have identified at least 35 common genetic variants associated with prostate cancer to date (3-9). Two prostate cancer GWAS identified a pair of highly correlated common single nucleotide polymorphisms (SNP), rs10896449 and rs7931342 (r2 = 0.966, D’=1.000, HapMap 3 release 28 CEU), in the human chromosome 11q13.3 region (4,6). A subsequent fine-mapping study identified a second locus, rs12418451, 60kb centromeric to the previously identified loci, which is independently associated with prostate cancer (10). This independent association is corroborated by an existence of a recombination hotspot separating rs12418451 from the others. A second study reported rs11228565 as a refinement SNP which remained significant after adjustment for rs10896450 or rs7931342 (5). A recent fine-mapping study by the Cancer Genetic Markers of Susceptibility (CGEMS) study confirmed the second locus with rs10896438 (r2 = 0.958, D’=1.000 with rs12418451, HapMap 3 release 28 CEU) and also identified a third independent locus, rs12793759, using ~10,000 case/control pairs (11).
The markers for the prostate cancer susceptibility loci reported within the 11q13.3 region map to an intergenic region flanked by TPCN2 and MYEOV. None of the common SNPs are in high linkage disequilibrium (LD) with common genetic variations in known or putative functional places in either gene. Interestingly, two coding SNPs within TPCN2 (two-pore segment channel 2) were reported to be associated with blond versus brown hair color (12). The nearest gene flanking the three prostate susceptibility loci on the telomeric side, MYEOV, is frequently over-expressed in different cancers, such as multiple myeloma, squamous cell carcinoma, breast cancer, and oral cancer (13-14). Often co-amplified and overexpressed with MYEOV is CCND1 (cyclin D1), a cell cycle regulator gene, which is ~340kb telomeric to prostate cancer susceptibility loci. Recently, two independent risk loci for kidney and breast cancer have also been identified by GWAS between MYEOV and CCND1 but these are not in appreciable LD with prostate susceptibility loci (15-16).
While GWAS have successfully identified multiple prostate cancer susceptibility loci, these only capture a fraction of the total heritability (17-18), and additional loci are estimated to be identified by larger GWAS (19). It was recently proposed that multiple rare variants may create ‘synthetic association’ in GWAS in association with one of the alleles of a common surrogate marker (20). A comprehensive assessment of common and rare genetic variations and prioritization of variant set represent important next steps following GWAS discovery, especially if the region has a complex genomic architecture. In this study, we used next-generation sequencing technology to re-sequence a region of 11q13.3 (68,642,755-68,765,690; UCSC genome build hg18) defined by the linkage disequilibrium pattern, in 80 unrelated individuals of European background drawn from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) (21) cohort used in the initial GWAS for CGEMS.
Eighty individuals, all of European ancestry, were selected from the NCI Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO) (21). Six additional individuals were included, a Yoruba trio Y005 (NA18503, NA18504, and NA18505) and trio from a CEPH pedigree 1350 (NA10855, NA10856, and NA11824). All but NA11824 were directly genotyped in The International HapMap Project (http://hapmap.org).
A 122.9 kb region of chromosome 11q13, spanning 68,642,755-68,765,690 (UCSC genome build hg18), was selected for next generation sequence analysis based on the observed LD pattern flanking rs10896449, the most notable marker in the CGEMS prostate cancer GWAS using HapMap CEU data (release 22, phase II) (6). The boundaries of this region within 11q13 include the six previously reported markers, rs10896438, rs12418451, rs12793759, rs11228565, rs7931342, and rs10896449. The most telomeric side of the region is approximately 51 kb from the MYEOV gene; the centromeric side extends ~28 kb beyond the TPCN2 gene.
A Nimblegen capture probe pool was designed to cover the 122.9 kb targeted region. The capture probes were approximately 60 bp in size, and the probe pool was designed for amplicon overlap (100 bp in average). Primers were designed using Nimblegen Proprietary Capture probe design software followed by in silico quality assessment for uniqueness, possible sequence paralogy, and DNA repeat sequences using the BLAT of the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgBlat). For primer secondary structure and PCR efficiency check, NetPrimer (http://www.premierbiosoft.com/netprimer/index.html) was used. Primers were ordered from Integrated DNA Technologies (Coralville, IA, USA; http://www.idtdan.com). The BED file is available upon request for this region. After performing the Nimblegen solution-based sequence capture method, sequence analysis was performed on the 454 Genome Sequencer FLX system (http://www.454.com/products-solutions/product-list.asp).
An in-house automated computational pipeline was developed to process sequence reads generated by 454 FLX Genome Sequencers. Whenever applicable, sequence reads from the same sample were pooled based on barcodes provided by Roche/454. Quality check (QC) was performed using vendor-supplied software; sequence reads that passed QC were aligned to the target genomic region (11q13: 68,642,755-68,765,690, UCSC genome build hg18) by MOSAIK aligner (http://bioinformatics.bc.edu/marthlab/Mosaik). The resulting assembly was analyzed column-by-column and both putative polymorphic sites and most likely genotypes were called based on a set of heuristic rules. All possible indels were checked individually. The minimal sequence coverage depth was set to 20 reads for each nucleotide position and the ratio (r) of forward and reverse reads was determined. To avoid directional bias, an optimal range of r was set between 10 and 90%. Homozygous genotype calls were made when the most frequent allele was present in at least 85% of the reads. Heterozygous genotype calls were made when the two most frequent alleles were represented in 30-70% of reads. No genotype calls were made if the above criteria were not met. For quality assurance (QA), NextGENe software (http://www.softgenetics.com) and Consed (http://bozeman.mbt.washington.edu/consed/consed.html) were used to resolve ambiguous calls.
Genotype completion, concordance, minor allele frequency (MAF) estimations, deviations from fitness for Hardy-Weinberg proportion (HWP), pair-wise LD, and tag SNP analysis were performed using the GLU software package (Genotype Library and Utilities; http://code.google.com/p/glu-genetics/). To check concordance, variant calls of the five individuals genotyped in HapMap were compared with HapMap data (release 28). Matched variants were identified in the 1000 Genome Project data (pilot 1 low-coverage CEU data, October 2010 release) and allele frequencies were compared using a two-group χ2 test of equal proportions (22), followed by a correction for multiple testing using an R-based software package, QVALUE (23).
Putative functional elements within the resequenced region were assessed using the UCSC genome browser (http://genome.ucsc.edu/), a publically available bioinformatics website. Specifically, human mRNA and spliced EST tracks for any known transcripts, ENCODE Integrated Regulation tracks for putative enhancer/promoter regions, and conservation tracks were assessed by scanning 500 bp-window over the entire 122.9 kb resequenced region.
Sequence coverage and depth averaged over all samples in the targeted genomic region (chr11: 68,642,755-68,765,690, hg18) are shown in Figure 1. No gaps in coverage were observed. The average read depth was approximately 50-fold (range from 2- to 470-fold, median 44). A cumulative length of approximately 4.2 kb was observed with an average coverage depth of less than 20-fold (Supplementary Table 1).
Genotypes were called for 888 possible segregating sites in 80 samples from the National Cancer Institute’s PLCO Cancer Screening Trial (21), one trio from the CEPH pedigree 1350, and one trio from a Yoruba pedigree Y005. The concordance between the sequence data and HapMap data for the five samples was 99.66%. During the data QC assessment, 244 loci were excluded due to monomorphism (n=214), no reads (n=19), or substantial violation of fitness for HWP for called genotypes (P < 0.001, n=11). Two unexpected duplicate samples from PLCO were excluded in the formal analysis. No loci were dropped due to low per locus completion rates. The final genotype dataset contained 84 individuals (78 from PLCO, 1 CEPH trio, and 1 Yoruba trio) and 644 polymorphic loci detected by 454 analysis (Supplementary Table 2). These include 118 novel SNPs, 48 novel indels and 478 variant loci previously described in NCBI’s dbSNP (build 132) database (http://www.ncbi.nlm.nih.gov/projects/SNP/) and/or reported in HapMap data (release 28) and 1000 Genome data (pilot 1 low-coverage CEU of 10-2010 release). From the analysis including only 80 unrelated individuals of European origin (78 from PLCO + 2 founders of a CEPH trio), 559 loci were polymorphic, and when compared with 1000 Genome CEU data and HapMap CEU data (Figure 2), 231 polymorphic loci (41.3%) were uniquely identified by our study. For subsequent analyses, loci with completion rates ≥ 40%, which included 469 polymorphic loci (103 novel SNPs, 20 novel indels, 346 loci previously described in NCBI’s dbSNP build 132, 1000 Genome data, or HapMap data),were considered (Table 1, Figure Figure33 and and4).4). The average genotype call rate was 74.7% (range 40%-100%, median 76.3%) (Supplementary Table 2); the average of computed MAF estimates was 13.7% (range 0.6-48.2%, median 5.4%) (Table 1). Allele frequencies were compared in 308 SNPs detected both in this study and in the 1000 Genome CEU data. No significant difference in allele frequency was observed except for one locus (rs1542335, q-value = 0.0001), which was excluded in subsequent analyses. Since our insertion/deletion calling algorithm is under refinement, indels detected in this study should be considered preliminary.
Based on our data (call rate ≥ 0.4, MAF ≥ 0.5, n=253), the linkage disequilibrium pattern across the sequenced region indicates 4 complex block structure defined by 3 inferred recombination hotspots (Figure 5). The two telomeric blocks, defined by two recombination hotspots (chr11:68,659,036-68,662,036 and chr11:68,727,036-68,729,036), include the previously reported prostate cancer susceptibility loci. Notably, rs10896438 and rs12418451 (r2 = 0.897, D’=0.999) and their surrogates are separated by a recombination hotspot from other susceptibility loci, corroborating their independent contribution to prostate cancer susceptibility (10-11).
A tagging analysis was performed with 80 unrelated samples of European origin for the 122.9kb sequenced region (r2 ≥ 0.8-1.0, MAF ≥ 0.01 and 0.05, minimum call rate > 0.40) using the TagZilla program implemented in GLU (Genotype Library and Utilities). Based on the loci with MAF ≥ 0.05 (n=253), at an r2 ≥ 0.8, 65 tags are required to tag 100%, at an r2 ≥ 0.9, 84 tags, and at an r2 = 1.0, 175 tags are required (Table 2, Supplementary Table 3). Within the region, 90 SNPs with MAF ≥ 0.05 are common in our resequencing data, HapMap III CEU (release 28), and 1000 Genome CEU (pilot 1 low-coverage data, 10-2010 release). In analysis restricted to the SNPs at an r2 ≥ 0.8, 29 bins monitoring 206 loci were covered (81.4%), while using variants reported in the 1000 Genome CEU data (n=233), 60 bins monitoring 248 loci (98.0%) were covered, whereas 5 singleton bins were exclusively covered by resequencing data (Table 2). For loci with MAF ≥ 0.01 (n=358), at an r2 ≥ 0.8, 117 tags are required to tag 100% (r2 ≥ 0.9, 136 tags, and r2 = 1.0, 259 tags are required), of which 41 tags representing 52 loci (14.5%) were not covered by HapMap or 1000 Genome data (Table 2).
To date, six prostate cancer susceptibility loci (rs10896449, rs7931342, rs12418451, rs10896438, rs12793759, and rs11228565) have been previously reported in independent GWAS and fine-mapping studies (4-6,10-11). Based on the high degree of LD across this region, it is a priority to catalogue highly correlated common variants in the region prior to conducting the bioinformatic analysis in search of putative functional elements across this non-coding region. We performed tag analysis in the 122.9kb resequenced region with all 6 prostate cancer susceptibility loci within 11q13.3 as obligate includes using our data, HapMap CEU (release 28), and the 1000 Genome CEU (pilot 1 low-coverage data, 10-2010 release), then catalogued all possible surrogates (r2 ≥ 0.8) (Table 3, Supplementary Table 3, 4). Of the six loci, high correlation between pairs existed: rs10896449 and rs7931342 (r2=0.967, D’=1.000), rs10896438 and rs12418451 (r2=0.895, D’=1.000), and rs12793759 and rs11228565 (r2=0.728, D’=1.000). Our resequencing data catalogued 24, 21, 6, and 4 highly correlated surrogates (r2 ≥ 0.8) with rs7931342/rs10896449 (bin1), rs10896438/rs12418451 (bin2), rs12793759 (bin3), and rs11228565 (bin4), respectively, which included 6 indel polymorphisms (Table 3, Supplementary Table 4). The 1000 Genome CEU data (pilot 1 low-coverage data, October 2010 release) catalogued 100% of all possible surrogates for rs7931342/rs10896449, rs12793759, and rs11228565, while cataloging 57.7% of all possible surrogates of rs10896438 (Table 3). HapMap CEU data (release 28) catalogued only 59.3% of rs7931342/rs10896449 surrogates, 34.6% of rs12418451/rs10896438 surrogates, 50% of rs11228565 surrogates, and none of rs12793759 surrogates, with no indel polymorphisms included (Table 3).
The previously reported 6 prostate cancer risk loci and all possible surrogates at r2 ≥ 0.8 were primarily assessed for existence of potential regulatory elements using the UCSC genome browser ENCODE Integrated Regulation track. On the centromeric side of the recombination hotspot (chr11:68,727,036-68,729,036), rs10896438 localizes to an alternative TPCN2 transcript (RefSeq accession: NM_139075) as well as a spliced EST AL137479, whereas rs12418451 maps to two known spliced ESTs – BC843531 and BI826779 (Supplementary Figure 1). On the telomeric side of the recombination hotspot, rs7931342 and rs10896449 are located within 5kb centromeric to the spliced EST DB036467. None of the 6 risk loci directly overlap with transcription factor binding sites reported by ENCODE Transcription Factor ChIP-seq data, but 12 surrogate markers (8 surrogate markers of rs10896438, 2 surrogates of rs7931342/rs10896449, and 2 surrogates of rs12793759) overlap transcription factor binding sites of interest meriting further follow-up (Supplementary Table 5).
In this study, we have characterized common genetic variants, namely, SNPs and indels, across a 122.9kb region (11q13: 68,642,755-68,765,690, UCSC genome build hg18) by next-generation resequencing technology and catalogued a comprehensive set of surrogates of previously reported prostate cancer susceptibility loci. Comparison of our resequence results with the current public datasets (1000 Genome CEU and HapMap CEU) revealed a substantial number of common and uncommon variants (with MAF between 1% and 10%). In total we called 664 polymorphic sites where 107 SNPs were identified by all three datasets with a median MAF of 0.295 (range 0.007-0.5), whereas resequence analysis determined 218 variants previously not included in HapMap but with a lower median MAF of 0.118. When we examined the 332 variants exclusively reported by sequence analysis, 231 variants (MAF median=0.013, average=0.066) were unique to our resequencing analysis, as compared to 101 variants (MAF median=0.046, average=0.093) observed uniquely in the 1000 Genome CEU data. This difference can be attributed to the number of chromosomes analyzed and the depth of coverage per base.
Indel polymorphisms represent an important type of genetic variant that are, thus far, not well annotated in large data sets, mainly because consensus calling methods for indels are not as robust as for single base pair substitutions. Moreover, they appear to contribute to the genetic architecture of human diseases by altering functional elements (24-25). Overall, we observed that 13.4% (n=89) of the 664 reported variants are indels, 58.5% (n=52) of which were uniquely identified by our resequencing study. Twenty indel polymorphisms (MAF median=0.134, average=0.225) were identified by our study and the 1000 Genome CEU, while 17 indel polymorphisms (MAF median=0.125, average=0.169) were unique to the 1000 Genome CEU data. In an in silico assessment, one indel polymorphism, rs11357679 (GT/T), a surrogate of rs7931342/10896449 at an r2 ≥ 0.8, maps to a transcription factor glucocorticoid receptor (GR) binding site according to the ENCODE Transcription Factor ChIP-seq data from the UCSC genome browser (26). Although this study extended the list of indel polymorphisms by reporting 72 indels, which involve 1 to 9 bases insertions or deletions, further validation is needed to confirm the current analytical algorithm for detection.
Using the three available data sets, we conducted an analysis of tagging SNPs to determine the extent of coverage for each data set. Restricting the analysis to all SNPs with MAF ≥ 5% and a threshold for binning of r2 ≥ 0.8 for variants, we note that 65 tags are required; an increased number of tags is needed for higher r2 thresholds (r2 ≥ 0.9, 84 tags, and r2 ≥ 1.0, 175 tags). When we only looked at the content of HapMap reported SNPs, 18.6% of the variants with MAF ≥ 5% within the region cannot be monitored at an r2 ≥ 0.8, whereas the 1000 Genome coverage approximates our re-sequence analysis (98%). As we lower the filter for tagging to SNPs with MAF between 1% and 5%, the resequence analysis provides approximately one third more coverage than HapMap and 14.5% more than the 1000 Genome data. We also note that as the 1000 Genome Project expands and more subjects are analyzed with deeper coverage these estimates will shift slightly.
Our study provides important insights into the next steps required to map GWAS regions, especially since the majority of reported SNP markers have MAFs well above 10%, while a small proportion have MAFs between 5 and 10% due to inadequate power to detect small effects and the limited number of low MAF SNPs with current data sets (19,27). In the case of 11q13, so far, all of the known SNP markers have MAFs that exceed 15% (4-6,10-11). Pursuing the recent hypothesis of ‘synthetic association’ will be particularly difficult in this region because the notable variants appear to map to a non-genic region (20). On the other hand, others have argued that this is probably less common than suggested (28). Nonetheless, mapping and functional studies should provide insights into the specific underpinnings of GWAS signals.
A bioinformatic analysis of the variants suggests interesting sites to pursue for functional analysis, such as the set of variants that cluster near an alternative transcript of TPCN2 (RefSeq accession: NM_139075, chr11:68,596,959–68,686,483, 89.525kb, 15 exons, UCSC genome browser) that extends 72kb telomeric of the protein-coding TPCN2 transcript (RefSeq accession: NM_139075.3). Two spliced ESTs (BC043531, chr11:68,671,272–68,695,606; BI826779, chr11:68,671,430–68,695,608), both detected in brain tissue, localize to the telomeric side of NM_139075, but in the opposite direction (negative strand). More than a half of rs10896438/rs12418451 surrogates (19 out of 28) reside in the vicinity of these transcripts; 6 of the 19 reside in transcription factor binding sites, but further work is needed to demonstrate that these are functionally active. rs3019748 maps to multiple transcription factor binding sites, including p300, notable for its binding to putative enhancers (29). The local region is also enriched for H3K4Me1 sites in the HMEC (human mammary epithelial cell) cell line. rs12275055 and rs11228580, two of the eight rs12793759 surrogates at an r2 ≥ 0.8, are located on transcription factor NFkB binding sites; rs11228580 is also located within DB036467, a spliced EST.
The LD across this region is quite interesting, particularly as it relates to the signals detected for breast and renal cancers: in recent GWAS, rs7105934 (chr11:68,948,922, ~198kb telomeric to rs10896449) was recently identified in renal cancer (p=7.8×10−14) (16), while rs614367 (chr11:69,037,945, ~287kb telomeric to rs10896449) was associated with breast cancer risk (p=3.2×10−15) (15). Though it was suggested that one of the previously reported prostate cancer risk loci, rs7931342, might be associated with breast cancer (OR, 0.95 with 95% CI 0.91-0.99, p=0.028) in a candidate gene analysis prior to the GWAS, (30), this signal was not confirmed conclusively in the GWAS. The complex LD across the region could account for the above suggestion, as there is minimal correlation between rs7931342 and rs614367 (r2=0.001 in HapMap CEU).
In this study, we have conducted a resequence analysis of a 122.9kb region that harbors three distinct loci for prostate cancer risk and identified a large annotated set of variants that should be considered for follow-up studies. We have shown that additional resequence analysis supplements the public databases and affords investigators the opportunity to discover and characterize variants that directly account for the association signals observed in large-scale GWAS studies of cancer.
Supplementary Figure 1 A snapshot of 11q13.3 resequenced region on the UCSC genome browser and location of 6 prostate cancer susceptibility loci
This study was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health (NIH). The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services nor does mention of trade names, commercial products, or organization indicate endorsement by the U.S. Government. The authors thank Drs. Christine Berg and Philip Prorok, Division of Cancer Prevention, NCI, the screening center investigators and staff of the PLCO Cancer Screening Trial, Mr. Thomas Riley and staff at Information Management Services, Inc., and Ms. Barbara O’Brien and staff at Westat, Inc. for their contributions to the PLCO Cancer Screening Trial. We thank Marie-Josephe Horner for editorial support. Finally, we acknowledge the study participants for donating their time and making this study possible.
Disclosure Statement None of the authors listed have any significant or perceived conflicts of interest relating to the publishing of this manuscript.