In this study, we characterized common genetic variants, namely SNPs and indels, across a 122.9kb region (11q13: 68,642,755-68,765,690, UCSC genome build hg18) by next-generation resequencing technology and catalogued a comprehensive set of surrogates of previously reported prostate cancer susceptibility loci. Comparison of our resequencing results with the current public datasets (1000 Genome CEU and HapMap CEU) revealed a substantial number of common and uncommon variants (the latter with MAF between 1% and 10%). In total, we called 664 polymorphic sites, of which 107 SNPs were identified in all three datasets with a median MAF of 0.295 (range 0.007-0.5), whereas resequencing identified 218 variants not previously included in HapMap, with a lower median MAF of 0.118. Among the 332 variants reported exclusively by sequence analysis, 231 variants (MAF median=0.013, average=0.066) were unique to our resequencing analysis, as compared to 101 variants (MAF median=0.046, average=0.093) observed uniquely in the 1000 Genome CEU data. This difference can be attributed to the number of chromosomes analyzed and the depth of coverage per base.
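The comparison across call sets described above amounts to partitioning variants by the exact combination of data sets that report them. The following is a minimal sketch of that bookkeeping, not the pipeline used in this study; the function name, data set labels, and variant identifiers (`v1` to `v6`) are hypothetical toy values.

```python
from collections import Counter


def partition_by_dataset(call_sets):
    """Count variants by the exact combination of data sets reporting them.

    call_sets maps a data set label to the set of variant identifiers
    called in that data set.
    """
    membership = {}
    for name, variants in call_sets.items():
        for variant in variants:
            membership.setdefault(variant, set()).add(name)
    return Counter(frozenset(names) for names in membership.values())


# Hypothetical toy call sets (not data from this study).
call_sets = {
    "resequencing": {"v1", "v2", "v3", "v4"},
    "1000G_CEU": {"v1", "v2", "v5"},
    "HapMap_CEU": {"v1", "v6"},
}

counts = partition_by_dataset(call_sets)
for combo, n in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(combo), n)
```

Each key of `counts` is one region of the three-way Venn diagram, so the entry for `frozenset({"resequencing"})` corresponds to variants unique to the resequencing analysis.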
Indel polymorphisms represent an important type of genetic variant that is, thus far, not well annotated in large data sets, mainly because consensus calling methods for indels are not as robust as those for single base pair substitutions. Moreover, they appear to contribute to the genetic architecture of human diseases by altering functional elements (24). Overall, we observed that 13.4% (n=89) of the 664 reported variants are indels, 58.5% (n=52) of which were uniquely identified by our resequencing study. Twenty indel polymorphisms (MAF median=0.134, average=0.225) were identified by both our study and the 1000 Genome CEU, while 17 indel polymorphisms (MAF median=0.125, average=0.169) were unique to the 1000 Genome CEU data. In an in silico assessment, one indel polymorphism, rs11357679 (GT/T), a surrogate of rs7931342/rs10896449 at an r² ≥ 0.8, maps to a glucocorticoid receptor (GR) transcription factor binding site according to the ENCODE Transcription Factor ChIP-seq data from the UCSC genome browser (26). Although this study extended the list of indel polymorphisms by reporting 72 indels, which involve insertions or deletions of 1 to 9 bases, further validation is needed to confirm the current analytical algorithm for detection.
Using the three available data sets, we conducted an analysis of tagging SNPs to determine the extent of coverage for each data set. Restricting the analysis to SNPs with MAF ≥ 5% and binning variants at a threshold of r² ≥ 0.8, we note that 65 tags are required; more tags are needed at higher r² thresholds (84 tags at r² ≥ 0.9 and 175 tags at r² = 1.0). Considering only the SNPs reported in HapMap, 18.6% of the variants with MAF ≥ 5% within the region cannot be monitored at an r² ≥ 0.8, whereas the 1000 Genome coverage approximates that of our resequencing analysis (98%). When we lower the filter for tagging to SNPs with MAF between 1% and 5%, the resequencing analysis provides approximately one third more coverage than HapMap and 14.5% more than the 1000 Genome data. We also note that as the 1000 Genome Project expands and more subjects are analyzed with deeper coverage, these estimates will shift slightly.
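To illustrate why the number of tags rises with the r² threshold, the sketch below applies a simple greedy LD-binning procedure to a toy set of SNPs. This is an assumption made for illustration: the text does not name the tagging algorithm or software used, and the SNP labels, pairwise r² values, and function names here are hypothetical.

```python
def greedy_tags(snps, r2, threshold=0.8):
    """Greedy LD binning: repeatedly pick the untagged SNP that captures
    the most remaining untagged SNPs at r2 >= threshold.

    r2 maps frozenset({snp_a, snp_b}) to the pairwise r2; missing pairs
    are treated as r2 = 0.
    """
    def captured_by(snp, pool):
        return {t for t in pool
                if t == snp or r2.get(frozenset((snp, t)), 0.0) >= threshold}

    untagged = set(snps)
    tags = []
    while untagged:
        # sorted() makes tie-breaking deterministic (first label wins).
        best = max(sorted(untagged), key=lambda s: len(captured_by(s, untagged)))
        tags.append(best)
        untagged -= captured_by(best, untagged)
    return tags


def coverage(all_snps, panel, r2, threshold=0.8):
    """Fraction of all_snps that are in the panel or in LD with a panel SNP."""
    covered = sum(
        1 for s in all_snps
        if any(s == p or r2.get(frozenset((s, p)), 0.0) >= threshold
               for p in panel)
    )
    return covered / len(all_snps)


# Hypothetical toy data (not from this study).
snps = ["A", "B", "C", "D", "E"]
r2 = {
    frozenset(("A", "B")): 0.95,
    frozenset(("B", "C")): 0.85,
    frozenset(("A", "C")): 0.60,
    frozenset(("D", "E")): 1.00,
}

for thr in (0.8, 0.9, 1.0):
    print(thr, greedy_tags(snps, r2, thr))
print(coverage(snps, {"A", "D"}, r2, 0.8))
```

In this toy example the bins split as the threshold tightens, so the tag count grows from two to four, mirroring the trend reported above. The `coverage` helper shows how a sparser panel (here, SNPs A and D only) leaves some variants unmonitored at a given threshold.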
Our study provides important insights into the next steps required to map GWAS regions, especially since the majority of reported SNP markers have MAFs well above 10%, while only a small proportion have MAFs between 5% and 10%, owing to inadequate power to detect small effects and the limited number of low-MAF SNPs in current data sets (19). In the case of 11q13, all of the SNP markers known so far have MAFs that exceed 15% (4). Pursuing the recent hypothesis of ‘synthetic association’ will be particularly difficult in this region because the notable variants appear to map to a non-genic region (20). On the other hand, others have argued that synthetic associations are probably less common than suggested (28). Nonetheless, mapping and functional studies should provide insights into the specific underpinnings of GWAS signals.
A bioinformatic analysis of the variants suggests interesting sites to pursue for functional analysis, such as the set of variants that cluster near an alternative transcript of TPCN2 (RefSeq accession: NM_139075, chr11:68,596,959–68,686,483, 89.525kb, 15 exons, UCSC genome browser) that extends 72kb telomeric of the protein-coding TPCN2 transcript (RefSeq accession: NM_139075.3). Two spliced ESTs (BC043531, chr11:68,671,272–68,695,606; BI826779, chr11:68,671,430–68,695,608), both detected in brain tissue, localize to the telomeric side of NM_139075, but in the opposite direction (negative strand). More than half of the rs10896438/rs12418451 surrogates (19 out of 28) reside in the vicinity of these transcripts; 6 of the 19 reside in transcription factor binding sites, but further work is needed to demonstrate that these are functionally active. rs3019748 maps to multiple transcription factor binding sites, including that of p300, which is notable for its binding to putative enhancers (29). The local region is also enriched for H3K4Me1 sites in the HMEC (human mammary epithelial cell) cell line. rs12275055 and rs11228580, two of the eight rs12793759 surrogates at an r² ≥ 0.8, are located in NFkB transcription factor binding sites; rs11228580 is also located within DB036467, a spliced EST.
The LD across this region is quite interesting, particularly as it relates to the signals detected for breast and renal cancers: in recent GWAS, rs7105934 (chr11:68,948,922, ~198kb telomeric to rs10896449) was identified in renal cancer (p), while rs614367 (chr11:69,037,945, ~287kb telomeric to rs10896449) was associated with breast cancer risk (p). Though it was suggested that one of the previously reported prostate cancer risk loci, rs7931342, might be associated with breast cancer (OR, 0.95 with 95% CI 0.91-0.99, p=0.028) in a candidate gene analysis prior to the GWAS (30), this signal was not confirmed conclusively in the GWAS. The complex LD across the region could account for the above suggestion, as there is minimal correlation between rs7931342 and rs614367 (r²=0.001 in HapMap CEU).
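For readers less familiar with the r² statistic quoted throughout, the sketch below computes it for two biallelic SNPs from phased haplotypes, using the standard definition r² = D² / (pA(1−pA)pB(1−pB)) with D = pAB − pA·pB. It is only an illustration: the function name and the toy haplotypes are hypothetical, and the HapMap CEU value cited above was not derived with this code.

```python
def ld_r2(haplotypes):
    """Pairwise r2 between two biallelic SNPs from phased haplotypes.

    haplotypes is a list of (a, b) tuples, where a and b are 0/1 allele
    codes at the first and second SNP on the same chromosome.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # allele 1 frequency, SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # allele 1 frequency, SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Hypothetical toy haplotypes (not data from this study).
perfect_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r2(perfect_ld))
print(ld_r2(no_ld))
```

An r² near zero, as reported for rs7931342 and rs614367, means that genotypes at one marker carry almost no information about the other, so the two markers cannot serve as proxies for the same signal.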