|Home | About | Journals | Submit | Contact Us | Français|
Previously we have shown that nonsyndromic cleft lip with or without cleft palate (NSCL/P)1, is strongly associated with SNPs in Interferon Regulatory Factor 6 (IRF6)2. Here, multispecies sequence comparisons identify a common SNP (rs642961, G>A) in a novel IRF6 enhancer. The A allele is significantly overtransmitted (P=1×10−11) in families with NSCL/P, in particular with cleft lip (CL) but not cleft palate. Further, there is a dosage effect of the A allele, with the relative risk for CL 1.68 for the AG genotype and 2.40 for the AA genotype. EMSA and ChIP assays demonstrate that the risk allele disrupts the binding site of transcription factor AP-2α and expression analysis in the mouse localizes the enhancer activity to craniofacial and limb structures. Our findings place IRF6 and AP-2α in the same developmental pathway and identify a high frequency variant in a regulatory element contributing substantially to a common, complex disorder.
Mutations in IRF6 cause Van der Woude syndrome (VWS), a rare Mendelian clefting disorder with lower lip pits in ~85% of the cases3. The remaining 15% of the VWS cases have CL/P with no lip pits and are clinically indistinguishable from the common, isolated or NSCL/P with a birth prevalence of ~1/700 livebirths. Significant association between single nucleotide polymorphisms (SNPs) in and around IRF6 and NSCL/P was previously shown in multiple populations2 and independently replicated4–7. One particular SNP (rs2235371, G>A) that changes valine to isoleucine at amino acid position 274 (V274I) was found to be significantly associated with NSCL/P in Asian and Amerindian populations. The associated V allele is evolutionarily conserved and its frequency is very high in European and African populations (>97%). This SNP may be a surrogate for a true etiologic variant as it is located in an approximately 140Kb-wide linkage disequilibrium (LD) block. Sequencing of the protein coding and splice site regions of IRF6 in 160 NSCL/P cases did not detect any obvious causative variants2. Based on these observations, we postulated that an etiological variant was in strong LD with the V allele and would reside in a regulatory element of IRF6 within the LD block.
To identify potential cis-regulatory elements for IRF6 we obtained and aligned genomic sequences orthologous to a 500Kb region encompassing human IRF6 from 17 vertebrate species. Sequences were first aligned to the human reference sequence and then searched for multispecies conserved sequences (MCSs). A total of 407 MCSs, with an average size of 61bp, were identified within the 500Kb examined with their distribution in protein coding and untranslated regions shown in Supplementary Table 1 online.
We next selected 41 non-coding MCSs contained within the 140Kb haplotype block of strong LD as most likely to contain one or more SNPs contributing to our previous association with NSCL/P. These sequences were located in the introns, 5′ and 3′ flanking sequences of IRF6 (Supplementary Table 1). The 41 MCSs ranged in size from 25bp to 168bp and were screened for potential causative variants in 184 NSCL/P cases from Iowa and the Philippines by direct sequencing (~7.5Kb of sequence in total). Overall, 18 variants were detected, of which 12 were previously identified (in dbSNP), and 6 were novel. To determine whether the frequencies of the detected variants were different in cases versus controls we sequenced an equal number of unaffected individuals from matched populations. Among 18 variants, only three SNPs (all located within a 50bp segment in MCS-9.7) (Fig. 1a) showed differences between cases and controls with P<0.05 (uncorrected for multiple comparisons as this was hypothesis generating to identify likely mutation sites). The common ancestral alleles of two of these SNPs, -14474A>G and -14523G>A, were overrepresented in cases compared with the controls from the Iowa population (χ2=6.12, df=1, P=0.01) and were in complete LD (r2=1) with each other and with the V allele of V274I. The remaining SNP rs642961, located between these two SNPs, also showed significant differences in allelic frequencies (χ2=4.8, df=1, P<0.02) and genotypic distribution (χ2=6.1, df=2, P<0.04) between cases and controls from the Iowa population. In contrast to -14474A>G and -14523G>A, the associated allele of rs642961 is the derived allele A, while its ancestral G allele is strongly conserved across 12 different vertebrates (Fig. 1a). Transcription factor binding site analysis predicted that the risk allele would alter a binding site for AP-2α, a transcription factor involved in craniofacial development8. Moreover, mutations in AP-2α cause branchio-oculo-facial syndrome9 that has overlapping features with VWS such as orofacial clefting and occasional lip pits, thus making rs642961 a good candidate for an etiological variant.
We then assessed association between NS clefts and SNPs rs642961 (G>A) and rs2235371 (V274I) using family-based transmission disequilibrium tests (TDT) in 432 Norwegian, 479 Danish, 606 other European (Netherlands, UK, Italy) nuclear families from the EUROCRAN Project , 196 large multiplex Filipino families and 490 Filipino trios (Table 1). The CL subset consists of those families in which one or more of the affected family members have cleft lip alone, while all affected family members in the CLP subset have cleft lip and cleft palate. The CL/P subgroup is a combination of the CL and CLP subgroups. The families in the PALATE subgroup have at least one affected individual with cleft palate alone. Parent-to-offspring observed transmission values were compared with the expected transmission values using the family based association test (FBAT)10 for each SNP and haplotypes of the two SNPs. Table 1 presents the results for the CL/P group by population. We found statistically significant overtransmission of the A allele at rs642961 to affected individuals in all populations separately and combined: Norwegian (P=0.005), Danish (P=0.0001), EUROCRAN (P=0.003), All European populations (P=3×10−8), Filipino (P=6×10−6) and all populations combined (P=1×10−11).
Haplotype TDT analysis showed that rs642961 splits the V allele of V274I into two distinct haplotypes, V-G and V-A. Haplotype V-A showed strong evidence of overtransmission (P= 8×10−13) in the CL/P group, whereas haplotypes V-G, I-G, and I-A were significantly undertransmitted, i.e. negatively associated (P=0.005, P=4×10−9 and P=0.04 respectively) (Table 1). These haplotype results demonstrate a strong association with the haplotype containing the rs642961 A allele and suggest that there is not an independent association with the V allele of rs2235371 (V274I). Furthermore, the association with haplotype V-A was more strongly associated in the CL subset (P=5×10−11) than in the CLP subset (P=0.0004), and not associated in the PALATE subset (P=0.79). These patterns are consistent across populations.
We next utilized conditional haplotype analyses in the proband nuclear families derived from each extended kindred, by cleft subgroups (Table 2). First we contrasted the risk of the V-A haplotype to that of the V-G haplotype to test further whether rs2235371 has an association with clefting independent of the rs642961 association. In every population, plus combined Europe and total, V-A had a significantly increased risk over V-G in the CL/P and CL groups (i.e., OR>1.0), lending further support to the notion that the V allele of rs2245371 does not have an association with clefting independent of the rs642961 A allele association with CL/P and CL. In contrast, there were no significant differences in V-A vs V-G risk in the PALATE subgroup for any population. The CLP subgroup showed significant results in Europe but not in the Filipinos.
Then, we contrasted haplotype V-G versus I-G to determine whether the association with rs642961 completely accounted for the IRF6 association (Table 2). For these comparisons, the only significant findings were in the Filipino population, suggesting that rs642961 is etiologic in the European populations but there may be additional alleles leading to clefting in the Filipinos. Interestingly, this finding in the Filipinos was most significant in the CLP and CL/P groups (P=0.0009 and P=0.0001 respectively), and borderline in the CL group (P=0.02). Overall, the haplotype results suggest there may be one or more additional alleles present on the V haplotype background also contributing to clefting and with perhaps a greater effect on CLP than CL. Multiple functional SNPs have previously been reported in the IRF5 gene, encoding another member of the IRF family of transcription factors, associated with systemic lupus erythematosus11. Thus, additional risk variants in IRF6 might increase the risk of clefting independently or synergistically with rs642961.
In order to assess possible dosage effects of the A allele, we used log-linear modeling to determine the Relative Risks (RR) for rs642961 genotypes within each phenotype and population (Table 3), in the proband triads. FBAT association analyses in the proband trios (Table 3) had the same patterns of significance as in the entire extended kindred dataset (Table 2), with CL subset showing highly significant association and the PALATE subgroup showing no association. The genotypic RR results suggest a dosage effect of allele A, for example, in the TOTAL combined population CL subset, the relative risk of the AG genotype is 1.68 versus 2.40 for the AA genotype, in Europe CL 1.91 and 2.29, in Filipinos 1.36 and 2.45; while the GG genotype was either not associated or strongly negatively associated with clefting (Table 3). This dosage effect trend is also seen in most of the individual populations for the CL, CLP and CL/P phenotypic subgroups, while there is no association and no allele dosage effect in the PALATE subgroup.
To determine the population impact of the risk allele between cases and controls we genotyped rs642961 in two cohorts of unbiased case collections born within defined time periods from Denmark (107 cases and 495 controls,1997–200312) and Norway (406 cases and 750 controls, 1996–200113). The A allele was more common in all cases combined than in controls from both groups. Consistent with the transmission results in the family data, the frequency of the risk allele was significantly higher in cases with cleft lip only (CLO) versus other cleft phenotypes and the CLO odds ratio was 1.99 (95% CI 1.54–2.57, P=3×10−7) (Table 4). The fraction of NSCLO cases attributable to the A allele of rs642961 is 18% in these two populations combined. Similar to the transmission results in the family data, although the risk allele showed a non-significant trend towards association with CLP (OR=1.23, 95% CI 0.97–1.57, p=0.08), it showed no association with cleft palate only (CPO) (P=0.1).
Previously, most studies have combined CLO and CLP into common etiologic and recurrence risk groups14. Recent epidemiologic evidence suggests that CLO might be separable15. In the current study there was a clear separation of risk and transmission patterns between the CLO and CLP groups based on rs642961 genotype. Further, we recently updated our large genome-wide linkage analysis16 of NSCL/P to double the sample size (now 861 multiplex families) and found genome-wide significant linkage (LOD score = 3.34) with the IRF6 region attributable to the CLO subset. We also determined the worldwide distribution of rs642961 by genotyping the samples from the CEPH diversity panel. African populations have the lowest derived allele frequency (0.11), and Native Americans have the highest (0.27) while European and Asian frequencies differ based on geographic origins (Supplementary Table 2). These frequency differences broadly mirror the observed prevalence differences in orofacial clefting across these populations and, coupled to the high frequency of this risk allele in all populations, suggest the possibility of a selective advantage for this or a linked variant in Asian and European groups17
To determine whether rs642961 affects DNA binding by transcription factor AP-2α we performed electrophoretic mobility shift assays (EMSA) using human recombinant AP-2α protein and fluorescently labeled oligonucleotide probes containing rs642961 (Fig. 1b). The oligonucleotide probe containing the G allele robustly bound the AP-2α protein, whereas an oligonucleotide probe containing the A allele did not bind at all (Fig. 1c). Increasing the amount of unlabeled G probe efficiently competed with the binding of the labeled G probe. In contrast, increasing amounts of unlabeled A probe had no effect on the binding activity of the labeled G probe, indicating a specific interaction between the G allele and AP-2α.
Further analysis of the region surrounding rs642961 revealed three additional highly conserved AP-2α binding motifs (Supplementary Fig. 1). A chromatin immunoprecipitation (ChIP) assay, coupled with quantitative real-time PCR, showed 2.6-fold enrichment for the MCS-9.7 region in AP-2α-Ab immunoprecipitated chromatin relative to IgG-precipitated chromatin from HaCaT keratinocyte cells homozygous for rs642961-G (Fig. 1d, plot I). In cells overexpressing AP-2α the MCS-9.7 chromatin enrichment was 10-fold (Fig. 1d, plot III). No difference in relative enrichment was observed for a control region devoid of AP-2α binding sites in either cells (Fig. 1d, plots II–IV). Taken together, these results demonstrate that the MCS-9.7 region is a direct target of AP-2α in vivo.
Next, we assessed the regulatory activity of MCS-9.7 in a transgenic mouse enhancer assay. A 775bp (chr1:208055673–208056447, UCSC hg18) of human genomic sequence that contains the entire MCS-9.7 region was inserted into a previously described expression vector18 in which the LacZ reporter gene is driven by a minimal mouse heat shock promoter. Eight out of nine F0 embryos reproducibly expressed LacZ in the ectoderm covering facial prominences, predominantly at the fusion sites between the prominences, and developing limbs, and in the branchial arch (Fig. 2), consistent with known sites of endogenous craniofacial Irf6 expression at this developmental stage19 and suggesting that the MCS-9.7 region functions as an enhancer element for IRF6. Three-dimensional digital imagery and virtual sections of LacZ-stained whole-mount embryos captured using optical projection tomography20 are available in Supplementary videos online.
In order to asses the effects of rs642961 on IRF6 expression we performed luciferase reporter assays in human foreskin keratinocyte cell line. The risk haplotype consistently increased the luciferase expression greater than the non-associated haplotypes, but the difference did not reach statistical significance (Supplementary Fig. 2). Previous studies demonstrated that some regulatory polymorphisms do not always reflect their in vivo effects in cell culture-based assays21, particularly for developmental genes that show temporal and tissue-specific expression pattern. Although rs642961 shows minor effects on reporter gene expression in cell culture, it is possible that it has significantly larger or even opposing effects on IRF6 expression in vivo in its native chromosomal context and/or in the presence of trans-acting variants.
Although linkage and association studies are increasingly effective in identifying loci and putative genes involved in common complex traits, the progression to specific etiologic variant identification has proven difficult. Demonstrating that a specific variant is etiologic in what may be a large haplotype block with many strongly correlated SNPs, requires compelling statistical evidence coupled with convincing functional data for a specific variant. Previous sequencing of candidate genes in NSCL/P cases has disclosed rare mutations in individual families22,23 but no common causative variants have been described. In the present study we report a common single point mutation in a gene regulatory element that confers an 18% attributable risk for isolated cleft lip.
Written informed consent was obtained from all participants in compliance with The University of Iowa Institutional Review Board (approval nos. 199804080 and 199804081). The Danish family data consisted of: 362 NSCL/P nuclear pedigrees, with 23 affected (step) and 845 unaffected (step) siblings for a total of 1624 individuals. The Danish case control data consisted of 107 cases and 495 controls unrelated to the family data. The Norwegian family data consisted of 314 NSCL/P nuclear pedigrees, with 1 affected sibling, 1 unaffected sibling and 237 siblings with unknown affection status for a total of 1181 individuals. The Norwegian case-control data consisted of 298 cases who were the probands from the families plus 108 cases with CPO and 750 controls unrelated to the family data. The Filipino data consisted of 203 multiplex NSCL/P extended pedigrees containing 932 nuclear families for a total of 2797 individuals. For some analyses CLO and CLP subsets were analyzed from the total Filipino families where the CLO subset consists of those families in which one or more affected individual has CLO (number of CLO families=121), versus the CLP subset in which all affected family members have CL plus CP (number of CLP families=70). The EuroCran data consisted of 432 nuclear trios (263 from the Netherlands, 105 from the UK, and 64 from Italy) for a total of 1296 individuals.
Sequences of the genomic region encompassing the IRF6 gene for human and organisms listed in Supplementary Table 3 were compiled from publicly available whole-genome datasets. Orthologous sequences from rabbit, pig, bat, armadillo, and elephant were generated specifically for this study using targeted bacterial artificial chromosome (BAC)-based mapping and sequencing strategy24 (Supplementary Table 3). MultiPipMaker25 was used to align these sequences with the single coverage option that eliminates some matches caused by duplications and the search both strand option. MCSs were identified using the WebMCS program, which calculates a conservation score for each base in the reference sequence by analyzing windows of 25nt across MultiPipMaker-generated multispecies sequence alignment based in the algorithm developed by Margulies et al26.
Transcription Element Search System (TESS) that is linked to TRANSFAC, JASPAR, IMD, and CBIL-GibbsMat databases was used to scan MCSs for previously reported transcription factor binding sites.
Primers used to amplify MCSs were designed with Primer 3 (v.0.3.0) program and are available in Supplementary Table 4 online. Cycle sequencing was performed on 1µl of 1/10 diluted PCR product in 10µl reaction by using 0.25µl of ABI Big Dye Terminator sequencing reagent (v. 1.1), 0.5µl of 5μM sequencing primer, 0.5µl of DMSO, 1µl of 5X sequencing buffer and 6.75µl of ddH2O. Following a denaturation step at 96°C for 30 sec, reactions were cycle sequenced at 96°C for 10 sec, at primer Tm for 5 sec, and at 60°C for 4 min for 40 cycles. Unincorporated dyes and reaction buffers were removed by Sephadex™ G-50 columns (GE Healthcare) or magnetic beads (Agencourt) and the sequencing reaction was then injected on an ABI 3730 capillary sequencer (Applied Biosystems). Chromatograms were transferred to a Unix workstation, base called with PHRED (v.0.961028), assembled with PHRAP (v. 0.960731) scanned by POLYPHRED (v. 0.970312) and the results viewed with the CONSED program (v. 4.0)27
Genotyping was carried out by using TaqMan® SNP Genotyping Assays on the ABI Prism 7900HT machine and analyzed with SDS 2.2 software (Applied Biosystems). The genotyping success rate was extremely high (>98%) and the Mendelian error rate was less than 2% (some of which may be due to potential deletions in this region). For the family data, SNP and haplotype TDT analyses were performed with the FBAT program (v.1.7.3), using the –e option of the program to account for multiple sibs in a family or multiple nuclear families in a pedigree. With a Bonferroni correction, the alpha level for significance of the association analyses was calculated as 0.05/32=0.0016. In order to assess possible dosage effects of the associated allele, log-linear models with genotype and imprinting effects were fit to the proband-parent triads using the SAS version of the LRT program28. Likelihood ratio tests of the model parameter effects were then used to estimate the RR (with 95% CI) for over-transmission of 1 versus 2 associated alleles (A allele) at SNP rs642961. In these same proband-parent triads plus proband siblings if available (ie proband nuclear families), we compared the risk associated with haplotype V-A to that of haplotype V-G, as well as the risks associated with haplotypes I-G vs. V-G. We also estimated the OR of the effects due to a particular haplotype with respect to a specific reference haplotype. The 95% CI associated with the OR were also estimated. The haplotype RR and OR were calculated using LRT methods, as implemented in the software Unphased29. In the case-control data, OR with Fisher’s exact P-values were calculated using SAS v.9.1.3 (Cary, NC) software. The population attributable risk percentage (PAR%) was calculated based on the following formula, PAR%= [Pe (RR-1)/RR] ×100, where Pe is the prevalence of the allele in cases, and RR is the relative risk. The OR was used as a proxy for the RR. The Danish case-control data was from the singleton cases and controls that were ascertained separately from the cleft family data. The Norwegian case-control data consisted of the probands from the cleft family data and the probands from a control family data.
Infrared dye (IRDye-700) end-labeled and unlabeled oligonucleotide probes (Fig. 1b) were purchased from LI-COR Biosciences (Lincoln, NE) and IDT (Coralville, IA), respectively. We performed EMSAs by using the LI-COR EMSA Kit following the protocol described in Supplementary methods online.
Protein cross-linked chromatin from ~2×107 Ad-AP-2α-infected and uninfected HaCaT keratinocyte cells was isolated as detailed previously30. The amount of immunoprecipitated target region was then determined by SYBR Green (Applied Biosystems) quantitative real-time PCR with primers specific for the target sequence in MCS-9.7 and control region (Supplementary Methods and Table 4). Real-time PCR was carried out in triplicate and amplification of the target amplicon was monitored as a function of increased SYBR Green fluorescence. An analysis threshold was set and the cycle threshold (Ct) computed for each sample. Fold enrichment of target sequence was calculated using the following formula (Fold enrichment = 2(Ct AP-2α-Ab IPed)−(Ct IgG IPed)).
For luciferase reporter assay we generated reporter constructs by inserting 540bp genomic segment (chr1:208055787–208056326, UCSC hg18) containing the entire MCS-9.7Kb region upstream of firefly luciferase ORF driven by the SV40 promoter. DNA samples from individuals homozygous for -14474A>G, -14523G>A, and rs642961 variants were PCR amplified and cloned into the pGL3-Basic and pGL3-Promoter vectors (Promega) in both orientations. Luciferase assays were performed in HFK cells as described in Supplementary methods. The assays were performed in triplicates in seven independent experiments on separate days. The relative luciferase activity was calculated by dividing the luminescence of firefly luciferase activity by that of the cotransfected Renilla luciferase, and pairwise comparisons of luciferase expression level from different constructs were done using 2-tailed Student’s t-test.
We would like to thank Akira Kinoshita, Kathy Frees, Adela Mansilla, Jamie L'Heureux, Marla Johnson, Harris Morrison, George Wehby, Nicholas Rorick, Kurt Bedell, and Linda Powers for technical assistance and Susie McConnell, Dan Benton and Melanie DeVore for their administrative assistance. We would also like to thank Al Klingelhutz for kindly providing us with HFK cell line. This work was supported by grants from the National Institutes of Health (NIH): P50 DE16215 (JCM, MLM, BCS), P30 ES05605 (JCM), R37 DE08559 (JCM, MLM), R01-DE13513 (BCS), 1 UL1 RR024979-01 (JCM, BCM), R01-CA73612 (FED), R01-HG003988 administered under Department of Energy Contract DE-AC02-05CH11231 (LAP) as well as by the Intramural Research Program of the National Human Genome Research Institute (EDG), in part by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (AJW), and European Commission FP5: EUROCRAN Project (Contract No. QLG1-CT-2000-01019) (MR, PAM, JL, RPS). AV was supported by the American Heart Association