|Home | About | Journals | Submit | Contact Us | Français|
We genotyped 2,861 cases from the UK PBC consortium and 8,514 UK population controls across 196,524 variants within 186 known autoimmune risk loci. We identified three loci newly associated with primary biliary cirrhosis (PBC) (with P<5×10−8), increasing the number of known susceptibility loci to 25. The most associated variant at 19p12 is a low-frequency non-synonymous SNP in TYK2, further implicating JAK/STAT and cytokine signalling in disease pathogenesis. A further five loci contained non-synonymous variants in high linkage disequilibrium (LD) (r2>0.8) with the most associated variant at the locus. We found multiple independent common, low-frequency and rare variant association signals at five loci. Of the 26 independent non-HLA signals tagged on Immunochip, 15 have SNPs in B-lymphoblastoid open-chromatin regions in high LD (r2>0.8) with the most associated variant. This study demonstrates how dense fine-mapping arrays coupled with functional genomic data can be utilized to identify candidate causal variants for functional follow-up.
Primary biliary cirrhosis (PBC) is characterized by the immune-mediated destruction of intra-hepatic bile ducts, resulting in chronic cholangitis, liver fibrosis and ultimately cirrhosis1. With a UK prevalence of 35:100,000, rising to 94:100,000 women over 40 years of age, it is the most common autoimmune (AI) liver disorder1,2. Family-based studies indicate a substantial genetic component to PBC susceptibility, with a sibling relative risk of ~10.5 in the UK3. Genome-wide association studies (GWAS) have identified 22 PBC risk loci, and highlighted the role of NFkB signaling, T-cell differentiation, Toll-like receptor and tumor necrosis factor signaling in disease pathogenesis4-6. Sixteen of these loci are also associated with other immune-mediated diseases such as multiple sclerosis, celiac disease and type 1 diabetes (T1D), shedding light on the involvement of common genes and pathways across these diseases7. Despite these advances, the specific causal variant at many of these loci remains unknown.
To better define risk variants and identify additional susceptibility loci, we performed a fine-mapping and association study using a cohort of 2,861 cases from the UK PBC Consortium and 8,514 UK population controls from the 1958 British Birth Cohort and National Blood Service. All samples were genotyped on the Immunochip, an Illumina Infinium array containing 196,524 variants (718 small insertions/deletions and 195,806 SNPs) across 186 known AI risk loci. SNPs were derived from population-based sequencing projects such as the 1000 Genomes project and autoimmune disease resequencing efforts8,9. Compared to GWAS arrays, Immunochip’s increased marker density within known AI loci increases power to detect PBC associations within these selected key candidate genes, and provides a powerful means of fine-mapping known PBC loci as causal variants are more likely to be directly genotyped.
Following quality control (Online Methods), 143,020 polymorphic SNPs were available across 2,861 cases and 8,514 controls. (Supplementary Tables 1-2, Supplementary figures 1-6). A further 94,559 SNPs in the Immunochip fine-mapping regions were imputed using genotypes from the 1000 Genomes Project June 2011 release (Online Methods). The inflation factor inferred from 2,258 SNPs not associated with autoimmune disease showed only a modest inflation (λ=1.096, Online Methods), similar to that reported in our previous GWAS for PBC 6.
Of the 22 known PBC risk loci, 16 reached genome-wide significance (P<5×10−8) (Figure 1) and four showed nominal evidence of association (5×10−8<P<5×10−4) (Supplementary Table 3). Two PBC loci, 14q32 and 19q13, were not included on Immunochip as the array was designed before the publication of the most recent PBC GWAS6. At 12 of the genome-wide significant loci, the most associated SNP was different to that previously reported (Supplementary Table 3). There was little difference in the effect-size estimates between the GWAS tagging SNP and the most strongly associated Immunochip SNP (Supplementary Figure 7), although this may be due to a large proportion of overlapping samples between the two studies (Online Methods).
Stepwise conditional regression10 revealed multiple independent signals at five loci, with 16p13 harboring three, and 3q25 four such associations (Table 1). At the 16p13 locus, the third independent signal, rs80073729, is a rare SNP (MAF<0.5%) recently associated with celiac disease9. In the same study, Trynka et al.9 also identified multiple independent signals at 3q25, though rs80014155, a rare SNP that best tags the fourth independent PBC association at this locus, was not among them. These results suggest that resequencing hundreds or thousands of cases across known GWAS loci will be a powerful means of identifying additional independent risk alleles. It is likely that these two rare SNP associations would have been missed using standard GWAS arrays due to poor tagging, unless they were directly genotyped. As these are rare SNPs, further replication in large independent cohorts will be required to confirm their associations. Haplotype association analysis at loci with multiple independent signals identified similar effect-size estimates suggesting that the causal variant is among, or is highly correlated with, genotyped SNPs (Supplementary Table 4). These additional independent association signals thus yield a more complete understanding of the genetic architecture of PBC and enable more informative genotype-based recall studies to be conducted.
Variants at three loci not previously reported as associated with PBC reached genome-wide significance threshold (Table 1). The most significant association on 19p12, rs34536443 (OR=1.91, P=1.24×10−12), is a low-frequency (1%≤MAF<5%) non-synonymous SNP in the tyrosine kinase 2 gene (TYK2), previously associated with multiple sclerosis11. The locus has also been associated with T1D12, psoriasis13 and Crohn’s disease14, although rs34536443 was not genotyped as part of these studies. For T1D and psoriasis, the strongest associations were to common SNPs that reside on the same haplotype (rs2304256 (r2=0.06, D’=0.9) and rs280519 (r2=0.03, D’=1)). The most associated SNP in Crohn’s disease and the second psoriasis signal (rs12720356) is independent of rs34536443 (r2 = 0, D’ = 0.003). The 12q24 locus has been associated with celiac disease9,15, rheumatoid arthritis16 and T1D17, though it was a non-synonymous SNP in SH2B3, rs3184504 (OR=1.19, P=1.11×10−8), rather than the most significant SNP in this study, rs11065979 (OR=1.2, P=2.87×10−9), that was most strongly associated; The two SNPs are in high LD (r2 = 0.81) and further studies are required to identify the causal variant underlying the PBC association signal at this locus. The most associated SNP in the 17q21 region, rs17564829 (OR=1.25, P=1=2.15×10−9), is located in MAPT, a gene that has been associated with cognitive symptoms in Parkinson’s disease. While cognitive symptoms are a major part of the symptom complex associated with PBC, it remains to be seen if a) the true causal variant at the locus has its functional effect through MAPT, and b) if this functional effect then results in cognitive changes in individuals with PBC.
Both TYK2 and SH2B3 are involved in the production of cytokines, adding to the evidence that cytokine imbalances play a role in PBC and other autoimmune diseases18,19. TYK2 is a member of the Janus kinase family, which transduce cytokine signals by phosphorylating STAT transcription factors. Couturier et al.20 showed that heterozygotes for rs34536443 have significantly reduced TYK2 activity, which promotes the secretion of Th2 cytokines20. For SH2B3, carriers of the A risk allele of rs3184504 show a moderate increase in production of cytokines and stronger activation of the NOD2 recognition pathway compared to carriers of the G allele21, suggesting a possible role in helping prevent bacterial infection.
Candidate genes studies have implicated several HLA-DR alleles in PBC susceptibility, particularly the DRB1*08 allele22-25. To further characterize HLA risk variants, the classical HLA alleles (A, B, C, DQA1, DQB1 and DRB1) were imputed from genotyped SNPs in the MHC26,27 (Online Methods). Fourteen HLA-alleles reached genome-wide significance and conditional analysis clustered these associations into four independent signals (Supplementary Table 5, Supplementary Figure 8). The most significant association was the HLA-DQA1*0401 allele (OR=3.06, P =5.9×10−45), which forms a haplotype with two other HLA class II alleles (DQB1*0402 and DRB1*0801) and is an established PBC risk locus22-25. The second and third most significant clusters, DQB1*0602 (OR=0.64, P=2.32×10−15) and DQB1*0301 (OR=0.70, P =6.48×10−14) both have protective effects, confirming previous studies showing suggestive associations between these loci and PBC susceptibility 22,23. The fourth most associated cluster, DRB1*0404 (OR=1.57, P=1.22×10−9) has not been previously associated with PBC. The variance in liability explained by the 26 independent SNPs and four HLA-types are 4.9% and 1.4% respectively, which together account for 16.2% of the total PBC heritability of liability of 0.39 (Online Methods).
To identify candidate causal variants we searched for non-synonymous variants in high LD (r2>0.8) with the most associated variants at each PBC risk locus. We found 39 such variants (of which 13 were directly genotyped) within seven risk loci (Table 1 and Supplementary table 6), including two variants at two of the loci newly associated with PBC in this study, TYK2 and SH2B3. Functional follow-up studies are needed before these non-synonymous variants can be confirmed as the causal variants at these loci. As variation in gene expression is also likely to influence PBC risk, we evaluated the extent to which the most associated SNP at each locus tags expression quantitative trait loci (eQTLs) or regions of open chromatin. Regions of open chromatin are associated with gene regulatory elements including promoters, enhancers, silencers, insulators and locus control regions. Known eQTLs were collated from the University of Chicago eQTL (see URLs) Browser and Gaffney et al. (2012)28. Open chromatin regions in a range of cell lines were identified as part of the Encyclopedia of DNA Elements (ENCODE) project29,30 using DNase I hypersensitive sites sequencing (DNase-seq) and formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq).
Of the 26 independent non-HLA genome-wide significant SNPs identified in this study, 15 have an r 2>0.8 with SNPs in DNase-seq or FAIRE-seq peaks in a B-lymphoblastoid cell line, and seven are also significant eQTLs in the same cell line (Table 1 and Supplementary Table 7-8). To test if the enrichment of open chromatin within the B-lymphoblastoid cell line was significantly greater than that for all other cell lines we began by grouping SNPs into independent loci. We sequentially identified the most associated SNP not already assigned to a locus and assigned this SNP, and others in weak LD (r2>0.1) with it, to a new locus. We then calculated an enrichment score, E (Online Methods), using only candidate causal variants (r2>0.8 to the most associated SNP in each locus) across all currently assigned loci. Considering only loci where the most associated SNP achieved genome-wide significance (N=21, excluding the HLA locus, SNPs outside Immunochip fine-mapping regions and SNPs with MAF<5%), Gm12878 had the highest enrichment score compared with the other cell lines, though the difference in enrichment just failed to reach significance (P = 0.068 Online Methods) (Figure 2). Failure to correctly account for LD between associated SNPs can bias the calculated degree of enrichment (Supplementary Figure 9). Our enrichment analysis protocol can be applied to other functional annotations and other disease phenotypes and will be well powered for traits with many genome-wide significant associations.
In summary, we have used dense genotyping across autoimmune disease associated loci to better define the genetic architecture of known PBC risk loci. We have identified additional independent genome-wide significant associations at five loci, and have identified potentially causal protein-coding and regulatory variants within many disease associated loci. We also identified three new PBC risk loci, bringing the total number of associated loci to 25, and confirmed HLA-allele associations by imputing HLA-types. Furthermore, we have combined our SNP data with large-scale functional genomics annotations to identify the cell types in which the PBC associated variants are likely to be acting.
This study was approved by the Research and Development Departments of all National Health Service (NHS) Trusts participating in this study and by the Oxford Research Ethics Committee C (Oxford REC C reference 07/H0606/96).
All subjects were of self-declared British or Irish ancestry. Cases were collected by the UK PBC Consortium, which consists of 142 NHS trusts including all UK liver transplant centers. All individuals were over 18 years of age with probable or certain PBC. Three criteria were applied to diagnose the condition: a) a positive test for the presence of anitmitochondrial antibodies (titer 1:40 or higher), b) liver biopsy histology consistent with PBC, and c) liver biochemistry consistent with PBC (i.e. a higher level of bilirubin, aspartate transaminase, alanine transaminase, alkaline phosphatase or gamma-glutamyl transferase compared to the upper reference level). Diagnosis was documented as probable when two criteria were satisfied and certain if all three criteria were satisfied. A total of 2,981 cases were supplied by the UK PBC Consortium. 8,970 control samples were ascertained from the 1958 British Birth Cohort and the National Blood. This study contains 1,838 cases and 2,356 controls included in our recent PBC GWAS6.
DNA was extracted from blood or saliva. Blood samples from PBC patients were extracted by the East Anglian Medical Genetics Service, while saliva samples were collected using an Oragene kit and DNA extracted at Source BioScience Healthcare. DNA samples were plated, normalized and shipped to the Wellcome Trust Sanger Institute for sample quality control.
Samples were genotyped on an Illumina iSelect HD custom genotyping array (Immunochip). All 2,981 cases and 4537 controls were genotyped at the Wellcome Trust Sanger Institute. A further 4433 control samples were genotyped at the Center for Public Health Genomics at the University of Virginia. Genotyping of control samples was coordinated by the Immunochip consortium for use in several Immunochip projects. The NCBI build 36 (hg18) map was used (Illumina manifest file Immuno_BeadChip_11419691_B.bpm). Normalized probe intensities were extracted for all samples passing standard laboratory QC thresholds and genotypes were called using optiCall31. Genotypes with an individual posterior probability lower than 0.7 were defined as unknown. optiCall was chosen because we found it to be more accurate in calling common and low-frequency variants on Immunochip compared to other established algorithms such as Illuminus32 and GenoSNP33.
Sample quality control (QC) was performed for each sample set separately. All monomorphic SNPs were removed prior to QC. Samples with a call rate lower than 98% and heterozygosity more than three standard deviations from the mean were excluded. A set of LD-pruned SNPs with MAF>20% were used to estimate identity by descent (IBD) and ancestry. For each pair of individuals with an estimated IBD>18.75%, the sample with the lower call rate was removed. Principal component analysis was used to exclude samples of non-European ancestry34 (Supplementary Figures 1-3). Following sample QC 2,861 cases and 8,514 controls remained (Supplementary Table 1). SNPs with a minor allele frequency less than 0.1%, Hardy-Weinberg equilibrium P < 10−6, call rate lower than 98%, or significantly different (P<10−5) call rate in cases vs. controls (or between the two control sets) were excluded. After marker QC 143,020 polymorphic SNPs were available for analysis (Supplementary Table 2).
The Immunochip contains 2,258 SNPs that lie in regions associated with bipolar disease. These were used as null markers to estimate the overall inflation of the distribution of association test statistics35.
Using the 90,977 SNPs from the cleaned Immunochip set that were in fine-mapped regions, additional genotypes were imputed using the 1000 Genomes Phase I (interim) June 2011 release reference panel and IMPUTE236. Imputation was performed separately in three batches of 3792, 3792 and 3791 individuals, with the case:control ratio constant across batches. SNPs with a posterior probability less than 0.9 and those with differential missingness (P<10−5) between the three batches were removed, as were those SNPs that failed the same exclusion thresholds used for the original Immunochip QC. After imputation, a total of 237,619 SNPs were available for analysis.
Case-control association tests were implemented using a standard one-degree of freedom Cochran-Armitage test for trend in PLINK v1.0737. Secondary associations were identified using step-wise logistic regression analysis conditioning on the allelic dosage of the primary signal in each significant locus. The process was repeated, conditioning on all independent genome-wide significant SNPs, until all genome-wide significant signals were accounted for10. Haplotype association was performed in PLINK using logistic regression. Cluster plots for all SNPs P<5×10−6 were manually checked using Evoker38, and poorly called SNPs were removed from further study (Supplementary Figure 11).
Imputation of six classic HLA alleles (class I: HLA-A, HLA-B and HLA-C, class II: HLA-DQA1, HLA-DQB1 and HLA-DRB1) was performed using the prediction algorithm proposed by Leslie et al. and implemented in the program HLA*IMP26,27. Case-control association was performed on HLA allele posterior probabilities generated from HLA*IMP using logistic regression to account for genotype uncertainty following imputation. Pairwise conditional logistic regression was used to identify independent association signals among the 21 HLA-alleles that reached P < 0.0001.
The heritability explained by the 26 independent genome-wide significant SNPs and four HLA-alleles was estimated using a liability threshold model39,40 assuming a disease prevalence of 40/100,000, log-additive risk and a sibling relative risk ratio of 10.53.
eQTLs within genome-wide significant loci were collated from the University of Chicago eQTL Browser (see URLs) and a study by Gaffney et al., (2012)28. The eQTL Browser contains significant eQTLs that were identified in recent studies across multiple cell lines and populations, while Gaffney et al., reanalysed gene expression data from 210 lymphoblastoid cell lines using a total of 13.6M SNPs from the 1000 Genomes project. For more details, see Gaffney, et al., (2012)28 and references listed in the Chicago eQTL Browser (see URLs).
The ENCODE project annotated regions of open chromatin using two techniques, the direct sequencing of DNaseI cleavage sites (DNase-seq: sixteen different cell lines) and formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq: ten different cell lines)29,30. Both methods isolate nucleosome-depleted regions of DNA and map reads from next-generation sequencing to determine their location. The overlap of peaks between the two assays ranges from 30-40% depending on the cell type, and regions identified uniquely by DNase-seq or FAIRE-seq often represent relevant biological processes29. Positions of discrete DNase-seq and FAIRE-seq peaks were estimated from the base overlap signal (BOS) at each base-pair29. We quantified the evidence for the open chromatin peaks using a Poisson distribution where lambda equals the mean BOS across all Immunochip SNPs. Supplementary Figure 12 shows the relative position of open chromatin peaks and associated SNPs within significantly associated loci.
For both DNase-seq and FAIRE-seq data, we estimated the amount of enrichment for open chromatin peaks among significant loci across the ENCODE cell lines. SNPs were first grouped into independent loci; we sequentially identified the most associated SNP not already assigned to a locus and assigned this SNP, and others in weak LD with it (r2>0.1), to a new locus. After the addition of each new locus, we calculated E,
where, for a given cell line, OCloci and Nloci are the number of candidate causal SNPs (r2>0.8 to the most associated SNP) that lie within open chromatin peaks across the selected loci and the total number of SNPs within the loci, respectively. OCichip and Nichip are the equivalent measures across all SNPs within Immunochip fine mapping regions We only included the fine-mapping regions to increase the likelihood that the causal variant was assayed, and excluded SNPs in the HLA and those with MAF < 0.05 to avoid possible biases due to LD structure. The OCichip values for each of the cell lines are given in Supplementary Table 9. To compare E between cell lines, the number of candidate causal SNPs in open chromatin (OCloci:allcells) and the total number SNPs in open chromatin (OCichip:allcells) were first calculated for the union of open chromatin peaks across all cell lines other than that being evaluated. We then tested the alternative hypothesis that, for a given cell line, the proportion OCloci/OCichip > OCloci:allcells/OCichip:allcells using a chi-square test for the difference in proportions.
To ensure that our test was well calibrated under the null hypothesis we undertook 1000 permutations, repeating the association and enrichment analyses for each permutation. Comparing the observed level of enrichment at our top 21 loci to the equivalent from the permutations we obtained a similar, non-significant empirical P-value of 0.073 indicating that our proposed enrichment analysis is well calibrated under the null. A 95% confidence interval for E was estimated using the permutations.
The PBC sample collection was funded by the Isaac Newton Trust, the PBC Foundation, The Addenborooke’s Charitable Trust and the Wellcome Trust (085925/Z/08/Z). The PBC Genetics Study is a portfolio study of the National Institute for Health Research Comprehensive Clinical Research Network (NIHR CRN, portfolio reference 5630). The project is also supported by the Wellcome Trust (WT090355/A/09/Z, WT090355/B/09/Z, 098051). Genotyping of samples at the University of Virginia utilized resources provided by the Type 1 Diabetes Genetics Consortium, a collaborative clinical study sponsored by the National Institute of Diabetes and Digestive and Kidney Diseases, the National Institute of Allergy and Infectious Diseases, the National Human Genome Research Institute, the National Institute of Child Health and Human Development, and the Juvenile Diabetes Research Foundation International and is supported by U01-DK-062418. We would like to thank the UK Medical Research Council and Wellcome Trust for funding the collection of DNA for the British 1958 Birth Cohort (MRC grant G0000934, WT grant 068545/Z/02). We acknowledge use of DNA from The UK Blood Services collection of Common Controls (UKBS collection), funded by the Wellcome Trust grant 076113/C/04/Z, by the Wellcome Trust/Juvenile Diabetes Research Foundation grant 061858, and by the National Institute of Health Research of England. The collection was established as part of the Wellcome Trust Case-Control Consortium. G.F.M. is a Clinical Research Training Fellow of the Medical Research Council (G0800460). G.F.M. is also supported by a Raymond and Beverly Sackler Studentship.
We are grateful to the PBC Foundation for helping us to establish the PBC Genetics Study, for endorsing it, and for encouraging members of the Foundation to contribute samples. We thank all of the research nurses who assisted with participant recruitment in collaborating centers. We thank the staff in the NIHR CRN and Clinical Research Collaboration (CRC) Cymru for providing invaluable support. We are grateful to K. Chittock and his colleagues at Source Bioscience for performing DNA extraction. We thank O. Burren for designing the participant database and for providing information technology support. We are grateful to Alexander Dilthey for providing support regarding HLA*IMP. We thank J. Stone for coordinating the Immunochip design and production at Illumina. We also thank the members of each disease consortium who initiated and sustained the cross-disease Immunochip project and shared control genotypes. Finally, we thank the individuals who contributed the DNA samples used in this study.
URLs University of Chicago eQTL Browser, http://eqtl.uchicago.edu/cgi-bin/gbrowse/eqtl/
Author contributions Study concept and design: D.J.G., G.F.M., L.J., H.J.C., M.A.H., J.M.N., P.T.D., D.E.J., G.J.A., A.J.B., A.B., M.H.D., J.C.B., The WTCCC3 management committee (see Supplementary Note), R.N.S., C.A.A.
Case ascertainment and phenotyping: S.J.D., D.B.D. and The UK PBC Consortium (see Supplementary Note)
Genotyping: The WTCCC3 DNA, Genotyping and Informatics group (see Supplementary Note).
Statistical analysis: J.Z.L., M.A.A., D.J.G., L.J., The WTCCC3 data analysis group (see Supplementary Note), C.A.A.
Manuscript preparation: J.Z.L., M.A.A., D.J.G., C.A.A. All authors reviewed the final manuscript.