|Home | About | Journals | Submit | Contact Us | Français|
MicroRNAs regulate diverse cellular processes and play an integral role in cancer pathogenesis. Genomic variation within miRNA target sites may therefore be important sources for genetic differences in cancer risk. To investigate this possibility, we mapped HapMap SNPs to putative miRNA recognition sites within genes dysregulated in estrogen receptor stratified breast tumors and used local linkage disequilibirum (LD) patterns to identify high-ranking SNPs in the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer genome wide association study (GWAS) for further testing. Two SNPs, rs1970801 and rs11097457, scoring in the top 100 from the CGEMS study, were in strong LD with rs1434536 – a SNP that resides within a miR-125b target site in the 3'UTR of the Bone Morphogenic Receptor Type 1B (BMPR1B) gene encoding a transmembrane serine/threonine kinase. We validated the CGEMS association findings for rs1970801 in an independent cohort of admixture corrected cases identified from families with multiple case histories. Subsequent association testing of rs1434536 for these cases and CGEMS controls with imputed genotypes supported the association. Furthermore, luciferase reporter assays and overexpression of miR-125b-mimics combined with quantitative RT-PCR showed that BMPR1B transcript is a direct target of miR-125b and that miR-125b differentially regulates the C and T alleles of rs1434536. These results suggest that allele-specific regulation of BMPR1B by miR-125b explains the observed disease risk. Our approach is general and can help identify and explain the mechanisms behind disease-association for alleles that affect miRNA regulation.
MicroRNAs (miRNAs) are a recently discovered class of short non-coding RNA genes that act post-transcriptionally as negative regulators of gene expression and play fundamental roles in cell growth, apoptosis, hematopoietic lineage differentiation, and differentiation  . Functional studies indicate that changes in miRNA expression patterns might underlie human pathologies, including malignancies  . In addition, variations in miRNA target sites mediated by single nucleotide polymorphisms (SNPs) may be associated with human cancers [5, 6].
Gene expression profiling studies have identified specific signatures for breast cancer (BrCa) and are employed to guide patient treatment with both the Oncotype Dx and Mammaprint tests in use clinically [7, 8]. We previously described a meta-analysis of multiple independent BrCa RNA expression studies whereby a unified set of dysregulated genes was identified in ER+ and ER− tumors. The identification of germline variations in elements controlling RNA expression (i.e. transcription factor or miRNA recognition sites) may provide clues as to the mechanistic basis for the observed variations in gene expression patterns.
Genome wide association studies have been employed in many common diseases to identify SNPs associated with disease [9, 10]. To date, four independent studies examining BrCa patients have identified multiple SNPs associated with disease [10–13]. While some association signals appear universal in multiple studies (ie. several SNPs within FGFR2) often these studies also yield vastly differing collections of SNPs associated with disease perhaps owing to differences in study design. While many disease-associated SNPs are nongenic, and thus their contribution to disease pathogenesis is unclear, many are likely to reside in gene regulatory elements that may influence gene expression patterns observed in tumors.
We describe an integrative genomic approach leveraging gene expression patterns, miRNA targeting, BrCa GWAS data, and biological testing to identify a disease-associated SNP in the 3' untranslated region of BMPR1B gene. To identify this SNP we mapped a set of reference SNPs from the HapMap project to prospective miRNA target sites located in the 3'UTRs of a previously identified set of dysregulated ER+ and ER− genes . An analysis of local linkage disequilibrium (LD) patterns surrounding these SNPs identified one SNP (rs1434536) in strong LD with two SNPs showing a high degree of association in the CGEMS study. We replicated this association in an independent set of cases identified from families with multiple case histories and common CGEMS controls after controlling for population stratification with ancestry informative markers (AIMs). We provide strong support that allelic variation at rs1434536 influences interactions with miR-125b leading to differences in BMPR1B expression levels. The approach described is generally applicable and provides clues to the role cis-acting allelic variation plays in tumor gene expression patterns via interactions with the miRNA machinery in disease pathogenesis.
Our input data consisted of 275 candidate genes previously identified as constituting the Top 1% of genes dysregulated in ER+ and ER− (130 and 145 genes respectively) BrCa tumors  and their annotated 3'UTR sequences from the UCSC Table Browser (NCBI Build 36.1); mature human miRNA sequences from miRBase1 and SNPs from HapMap2. Using custom python scripts, we (1) identified all unique 7mer seeds (nucleotides 2–8) within the mature miRNA sequences; (2) identified all seed sites - that is, locations with perfect reverse complementarity to a 7mer seed – within the candidate genes' 3'UTRs; (3) identified all HapMap SNPs that mapped to one of the seed site locations; and (4) removed all SNPs that had no reported minor allele in any HapMap population.
Four hundred fifty-nine probands from a BrCa affected sibling pair cohort were recruited from a multi-institutional study (Eastern Cooperative Oncology Group Cancer in Sibling Study, E1Y97) under protocols approved by the respective Institutional Review Boards at each institution. The mean age of diagnosis for probands was 55 years (range 16–87) and disease status was verified by pathology reports for 96.5% of cases (443 of 459) (Supplementary Methods). We collected self-reported ethnicity data for both maternal and paternal grandparents from 78% (356) of our cases. CGEMS patients consisted of 1,142 controls and 1,145 cases of post-menopausal BrCa and were gathered from the Nurses Health Study as described previously [10, 15]. Self-reported ethnicity information was unavailable for these individuals.
DNA samples were prepared as previously described from peripheral leukocytes . SNP genotyping was performed using Sequenom MassARRAY genotyping technology and iPLEX chemistry according to manufacturer's instructions . Ancestry informative markers (AIM) were developed into 2 multiplex assays (Supplementary Table S1) as defined by the 64 In4 AIMs described by Kosoy et al.. Genotyping success ranged from 95.9 to 97.8% for the 3 association SNP in our cases. Patient samples were genotyped and samples demonstrating <80% completion rate (46/459) were subjected to a second round of genotyping. Quality control metrics for our cases included a minimum of 80% genotyping success while SNPs with completion rates <90% were discarded. After 2 rounds of genotyping, four cases and nine AIMs were discarded from further analysis, having not met quality control metrics.
For admixture analysis we utilized 45 AIMs and combined our cases (455), the CGEMS controls (1,142) and seeded the analysis with a training set of 270 HapMap reference samples (CEU, YRI and CHB+JTP) to perform STRUCTURE analysis with k=3 populations (Supplementary Figure S1). We observed general agreement between our patient's self-reported ethnicity and genetic ancestry as determined by our AIMs although rarely a patient's self-identified ancestry was at odds with the calculated CEU ancestral component. In these instances we relied on STRUCTURE results to determine genetic ancestry. For association testing, each SNP was analyzed using a logistic regression model where odds ratios are estimated for homozygous and heterozygous states of the indicated cases and CGEMS controls. For the causative SNPs rs1434536 we directly genotyped our cases and imputed genotypes from CGEMS controls using HapMap CEU reference individuals (Supplementary Methods). IMPUTE and SNPTEST were used for genotype determination and association testing of rs1434536 as described .
Cell lines were maintained in F12/DMEM respectively supplemented with 10% FBS, and 1% Pen/Strep. Luciferase reporter targets were generated for the miR-125b target region of BMPR1B by cloning PCR products from HapMap NA18505 (rs1434536-C/T) into the 3'-untranslated region of the Renilla luciferase gene in the psiCheck2.2 dual reporter vector (Promega). Clones containing T or the C alleles at rs1434536 were verified by ABI fluorescent dideoxy sequencing and transiently transfected into MCF-7 and MD-MBA-231 cell lines. Renilla luciferase (hRluc) activity was measured 48hr post-transfection. Cells were lysed with 120 μl Passive Lysis Buffer (Promega), and luciferase levels were analyzed from 10 μl lysates using the dual luciferase reporter assay (50 μl of each substrate reagent, Promega) on a Veritas Microplate Luminometer (Turner Biosystems, Sunnyvale, CA). Changes in expression of Renilla luciferase (target) were normalized relative to Firefly luciferase.
siRNAs (IDT) were transfected into MCF-7 or MDA-MB-231 cells using RNAiMax (Invitrogen) utilizing the manufacturers recommendations. Twenty-five pmol of each strand of the siRNA target were annealed by heating to 94°C for 2 min to form duplexes in buffer supplied by the manufacturer then allowed to cool to room temperature. Transfection efficiencies were monitored by transfecting in parallel a Cy3 labeled DS scrambled control siRNA duplex (IDT). Cells were harvested 24 hr after transfection and RNAs purified. cDNA was synthesized from 25ng RNA using random hexamers and M-MLV reverse transcriptase and was subsequently amplified with BMPR1B specific primers (Supplementary Methods). We calculated the SQ values and normalized BMPR1B transcript to GAPDH. RNA quantitation experiments were performed in triplicate from two independent transfection experiments.
Since allelic variations in miRNA binding sites have been shown to influence transcript levels  we examined if commonly occurring SNPs present in miRNA binding sites could be identified from the HapMap Consortium. Using a previously described set of genes dysregulated in ER+ and ER− breast tumors , we identified all HapMap SNPs residing within putative miRNA target sites in the genes' 3'UTRs (see Materials and Methods). We focused our search on the miRNA seed region, as the seed nucleates the miRNA to the complementary mRNA target region and is the main determinant for miRNA targeting . More specifically, we based our miRNA target site predictions on 7mer seed sites as we expected these would give an acceptable tradeoff between the number of false negative and false positive predictions . Our search identified 63 unique SNPs. Thirty-seven and 26 SNPs mapped to genes dysregulated in ER+ and ER− tumors respectively (Supplementary Table S2). This collection of SNPs was considered for further analysis.
To prioritize the 63 SNPs for further biological testing, we mapped each to the publically available CGEMS BrCa GWAS3 dataset looking for SNPs that had signals of association. However, only seven mapped directly to this dataset – none of which demonstrated a statistically significant association. Twenty of the 63 target SNPs were either monomorphic (14 SNPs) in CEU samples or exhibited minor allele frequencies <0.05 (6 SNPs) and were therefore not expected to be represented on the GWAS array as rare SNPs (typically <5% minor allele frequency in CEU samples) are often excluded from these arrays. Moreover, the arrays typically contain only subsets of SNPs within haplotype blocks, but these SNPs can be used as proxies for the missing SNPs within blocks. To prioritize the remaining 43 SNPs, we therefore first used local linkage disequilibrium (LD) structure from HapMap to identify proxy SNPs in the CGEMS dataset and second observed such proxies' genome-wide association rank in the CGEMS set.
One SNP, rs1434536, demonstrated high LD to rs1970801 and rs11097457 (r2=0.81) in the HapMap CEU reference samples (Fig. 1). rs1970801 and rs11097457 ranked 79th and 67th in the CGEMS GWAS association data (p=0.00017 and p=0.00014 respectively, unadjusted score test). These SNPs exhibit extensive pairwise LD (r2=0.93) in the CEU HapMap reference samples. We conclude that they likely represent the same association signal. The target site SNP rs1434536 lies 5.4kb downstream of rs1970801 and 0.85kb upstream of rs11097457 in the 3'UTR of the Bone Morphogenetic Protein Receptor 1B (BMPR1B) gene. The SNP's C→T change alters a 7mer seed site for miR-125b to a 6mer site – a change expected to reduce miR-125b's binding affinity to the site (Fig. 2). Moreover, miR-125b is differentially expressed in normal vs. breast cancer in general, and in ER+ vs. ER− tumors in particular [22–24]. The combined observations that miR-125b and BMPR1B are differentially expressed in breast cancer, that allelic variation of rs1434536 likely disrupts miR-125b's regulation of BMPR1B, and that the SNP is in LD with two BrCa-associated SNPs, suggest that rs1434536 has a pathogenic role in breast cancer.
Although the CGEMS results did not reach genome-wide significance for either rs1970801 or rs11097457 we elected to replicate the CGEMS results by screening rs1970801 in an independent cohort of genetically enriched BrCa cases. In parallel, we screened two additional SNPs for association with disease: rs1219648 and rs6831418 which ranked 2 and 52 respectively in the unadjusted CGEMS genomewide rankings (Supplementary Table S3). SNP rs6831418 resides within an intron of BMPR1B, approximately 320kb upstream of rs1970801 (r2=0.118 with rs1970801) and a regional association plot of the CGEMS data (Fig. 1A) also indicated potential disease association. SNP rs1219648, present in intron 2 of FGFR2, was previously shown to be the most strongly associated SNP with BrCa in multiple GWAS studies including the CGEMS, Wellcome Trust (rs2981582, r2=1.0 with rs1219648), and MSKCC Ashkenazi Jewish (rs1078806, r2=0.96 with rs1219648) studies [10, 11, 13]. We employed rs1219648/FGFR2 as a positive control for association in our cases, as the three previous studies indicated this SNP is a universal marker for disease. Our BrCa cases consisted of probands ascertained by virtue of a living, affected full sibling with disease while we employed admixture-corrected, shared disease free controls from the CGEMS study. The use of cases ascertained by virtue of family history served to enrich for alleles with a strong genetic etiology. In addition, the use of shared controls has recently been described for multiple common disease scenarios [9, 25, 26].
Prior to comparing allele frequencies between our cases and CGEMS controls for the 3 SNPs, we sought to eliminate two potential biases: population differences between cases and controls and technical artifacts (e.g. errors in genotype scoring). To reduce the likelihood that any observed associations could be mediated by differences in the genetic ancestry of our cases and the controls we elected to employ ancestry informative markers (AIMs) and only analyze cases and controls with a high percentage (>90%) of Caucasian ancestry as defined by HapMap CEU reference samples. Recently AIMs useful for determining intercontinental admixture have been described to facilitate structured association testing in case-control studies . We selected 59 AIMs (Supplementary Table S2) based upon the 64 most informative In4 markers as described by Kosoy, et al. for population structure analysis . These markers have a high discriminatory power to distinguish CEU, YRI, CHB+JPT, AMI (American Indian) continental populations. After STRUCTURE analysis (Supplementary Methods) we observed 94.1% (428/455) of our cases exhibited >90% CEU ancestry while CGEMS controls showed 93.3% (1,064/1,142) CEU ancestry (Supplementary Fig. S1).
We next utilized logistic regression and calculated odds ratios (OR) testing independently for both heterozygotes and homozygotes carrier states omitting both ECOG cases and CGEMS controls not exhibiting >90% CEU class membership (Table 1). rs1219648/FGFR2 showed association in the ECOG cases exhibiting a heterozygote and homozygote OR of 1.29 and 1.72 (p=0.0052) for the minor allele (G). These ORs are similar to those observed for the original CGEMS findings of 1.25 and 1.81 for heterozygotes and homozygotes. rs1970801-T also exhibited association in the ECOG cases with an OR of 1.24 and 1.84 for heterozygotes and homozygotes (p=0.00048). These OR's are comparable to those previously observed in the CGEMS study (1.21 for T/G and 1.65 for T/T). In contrast, rs6831418 did not exhibit significant association (p=0.177) in our cases. One explanation for the lack of confirmatory association with rs6831418 may stem from departure from Hardy Weinberg equilibrium (HWE) in both CGEMS cases and controls while all 3 SNPs were in HWE for our ECOG cases (Supplementary Table S3). As final verification of association we genotyped rs1434536 in our cases and, utilizing the CEU HapMap LD structure, imputed genotypes for rs1434536 in admixture-corrected CGEMS controls. We observed association in our cases with OR = 1.29 for T/C heterozygotes and OR = 1.94 for major allele homozygote T/T (Table 1). Based on the replication of association in our BrCa cases for rs1219648/FGFR2, rs1970801-T and rs1434536-T we concluded that rs1434536 was indeed associated with disease risk.
Next we tested a biological model where miR-125b differentially regulates the C/T allelic variants of rs1434536 in BMPR1B. In this model, rs1434536-T results in increased BMPR1B transcript levels which gives an increased BrCa risk as demonstrated by the association testing. Computational models of miRNA target interactions predicted that miR-125b would downregulate the C allele more strongly than the T allele, as the T allele forms a weaker 6mer seed site for miR-125b binding (Fig. 2) . The PITA thermodynamic model of miRNA binding supports this allelic difference. The algorithm models miRNA targeting as a competition between the free energy gained by miRNA binding and the energetic cost of displacing existing RNA secondary structure at the target site . PITA summarizes this model in the ΔΔG score, where smaller values indicate stronger miRNA binding. Inputting the 200 nucleotides centered on rs1434536-C/T alleles to PITA gave ΔΔG values of −0.53 and 3.09, which suggested reduced binding of miR-125b to BMPR1B for the T allele.
To test our model, we cloned partial BMPR1B 3'UTR fragments from a rs1434536 heterozygote into the luciferase 3'UTR reporter vector psiCHECK-2 to compare the luciferase activities between the two alleles at rs1434536. Vectors containing either C or T alleles were transiently transfected into ER+ and ER− cell lines and Renilla luciferase activity was measured. When transfected into MCF-7 (ER+) cells the C-allele gave a 38% reduced luciferase activity relative to the T allele consistent with our model (Fig. 3b). However, when we tested luciferase activity in MD-MBA-231 (ER−) cells; we observed no difference between the C and T alleles. Additionally, the overall luciferase activities observed were lower in MDA-MB-231 cells relative to MCF-7 cells which may reflect the higher levels of miR-125b in this cell line .
As an additional test of our model, we transiently transfected synthetic miR-125b oligos into MCF-7 and MDA-MB-231 cells, and quantitated endogenous BMPR1B transcript levels by qRT-PCR. Prior genotyping MCF-7 and MDA-MB-231 cells revealed homozygous T and C genotypes at rs1434536 respectively. The oligonucleotides, which mimicked the annotated hsa-miR-125b:hsa-miR-125b-1* duplex, only weakly downregulated BMPR1B in MCF-7 (Fig. 3c), which is consistent with our model. In contrast, transfection with a miRNA mimic (siR), not targeting the miR-125b site, resulted in an 80% reduction in BMPR1B transcript levels. When we tested these duplexes in MDA-MB-231 cells, BMPR1B levels were less than 1/50 of the levels in MCF-7 and below the assay's detection limit (data not shown). The low levels of BMPR1B levels in ER− MDA-MB-231 cells were consistent with our prior meta-analysis from ER+ and ER− tumors and with increased levels of endogenous miR-125b in 231 cells .
Both rs1434536-T and rs1970801-T were shown to be associated with increased risk in an independent cohort of admixture-corrected cases and controls. We have demonstrated that miR-125b negatively regulates BMPR1B and that C/T allelic variation (rs1434536) within the target site disrupts miR-125b's regulation. The presence of rs1434536-T leads to loss of miR-125b regulation, increased BMPR1B expression and ultimately elevated disease risk. Consistent with this, increased BMPRIB expression is associated with high tumor grade, proliferation, cytogenetic instability, and a poor prognosis in ER+ breast carcinomas . Moreover, breast cancers in general and ER+ tumors in particular seem to have reduced levels of miR-125b [22–24], which in the light of these results, at least partly explain why ER+ tumors have increased BMPR1B expression .
BMPR1B binds bone morphogenetic proteins (BMP) and are multifunctional signaling molecules that belong to the transforming growth factor beta (TGF-β) superfamily and were first identified based on their ability to form bone at extraskeletal sites . Once activated, BMP/receptor complexes phosphorylate cytosolic SMAD proteins which translocate to nucleus and regulate target genes . Our findings indicate that ER+ patients harboring elevated BMPR1B transcript levels may have poorer outcomes when carrying the risk-associated rs1434536-T allele. While not only identifying a new target for further study, these results demonstrate the importance of combining tumor expression profiles and genotype data in determining a patients' clinical prognosis.
More generally, our methodology has identified a set of allelic variants present in miRNA recognition sites within a set of dysregulated ER responsive genes. Independent of our efforts, Adams et al identified rs9341070 in a miR-206 site in the ESR 3'UTR. Allelic variation at this SNP was shown to influence ESR expression over 3-fold in HeLa cells . This SNP resides in a domain upstream of the miRNA seed targeting sequence (nucleotides 2–8) yet we identified this same SNP by virtue of it presence in a miR-122 seed region (Supplementary Table S1). However, due to the low frequency of rs9341070 in CEU samples (<0.01) this SNP is not represented in any GWAS array. This illustrates a common deficiency of GWAS datasets: the absence of low frequency/rare SNPs which may also play a role in disease risk . One would anticipate that appropriately powered future association studies of these potential miRNA interacting rare variants may support their role in risk.
We found that T/T homozygotes at rs1970801 had slightly higher odds ratios in our ECOG cases (1.84) compared with the CGEMS cases (1.65) and this could be explained by differences in case ascertainment. CGEMS cases were recruited from a prospective cohort study where only 22% (274/1145) reported first degree family history as opposed to our cases whereby all cases exhibited first-degree family history, namely an affected sibling. Second, all CGEMS cases were of post-menopausal disease while only half of the ECOG cases indicated an age of diagnosis <50. These differences indicate that the genetic contribution to risk may have been underestimated for rs1970801-T in the CGEMS study reinforcing the importance of family history in confirmatory replication studies as this may be valuable for later risk-assessment predictions. We observed a higher OR for TT homozygotes at rs1434536 when we tested for association with imputed genotypes in the CGEMS controls compared to rs1970801 TT homozygotes (1.94 versus 1.84; Table 1). These results also highlight both the merits of the tagging SNPs utilized in the GWAS studies and the utility of imputation for deriving missing genotypes.
Our replication of prior disease associations for two SNPs (Table 1) relied on using a set of AIMs to correct for differences in genetic admixture between our cases and CGEMS controls. Approximately 6–7% of CGEMS controls and CGEMS cases (data not shown) demonstrated <90% CEU ancestry as defined by HapMap reference samples. This indicates that population substructure introduced by intercontinental admixture may have contributed to potential false positives or missed associations in the original CGEMS data. To rectify this it has been proposed that AIM panels should be employed prior to GWA tests . More subtle levels of admixture within both European and Chinese populations have recently been described which will necessitate the continued use of extended AIM panels to discern finer levels of population substructure as a prelude to association testing and biological testing [33–35].
The usefulness of GWAS data for identifying BrCa susceptibility alleles is premised on the common disease-common variant hypothesis whereby SNPs (>5% frequency) may act as surrogates to identify causal variants. Replication studies of the very top tier signals in breast cancer have firmly established some associations; however, modest signals in first round GWAS screens may not be selected for rescreening . Thus we feel it is likely that future meta-analyses of multiple GWAS datasets will provide additional candidates for examination .
These findings have implicated a germline variant in BrCa susceptibility and provided a strong model for biological causality via miRNAs. Our approach relies on integrating association data, expression profiles and testable biological models to evaluate potential disease alleles in pathogenesis . As GWAS have identified only common SNPs as genetic risk factors, it is likely that many rare alleles present within motifs for miRNAs and additional trans-acting regulators (i.e. transcription factors) remain to be identified. In addition, approaches such as whole genome sequencing and the identification of common recurrent somatic mutations in breast tumors may provide a large collection of potential disease alleles for exploration [39, 40].
Grant Support This study was supported by Susan G. Komen for the Cure Basic, Clinical and Translational Research Grant BCTR0504486 (GPL). PS received support from the Norwegian Functional Genomics Program (FUGE) of the Norwegian Research Council. LFT received support from the Norwegian Research Council. JB was a participant in the Southern California Bioinformatics Institute Summer Academy (NSF-NIH EEC-0609454). JJR is funded by (HL074704). We thank the ECOG study subjects in E1Y97 for their participation.