|Home | About | Journals | Submit | Contact Us | Français|
To identify susceptibility alleles associated with rheumatoid arthritis, we genotyped 397 individuals with rheumatoid arthritis for 116,204 SNPs and carried out an association analysis in comparison to publicly available genotype data for 1,211 related individuals from the Framingham Heart Study1. After evaluating and adjusting for technical and population biases, we identified a SNP at 6q23 (rs10499194, ~150 kb from TNFAIP3 and OLIG3) that was reproducibly associated with rheumatoid arthritis both in the genome-wide association (GWA) scan and in 5,541 additional case-control samples (P = 10−3, GWA scan; P < 10−6, replication; P = 10−9, combined). In a concurrent study, the Wellcome Trust Case Control Consortium (WTCCC) has reported strong association of rheumatoid arthritis susceptibility to a different SNP located 3.8 kb from rs10499194 (rs6920220; P = 5 × 10−6 in WTCCC)2. We show that these two SNP associations are statistically independent, are each reproducible in the comparison of our data and WTCCC data, and define risk and protective haplotypes for rheumatoid arthritis at 6q23.
Rheumatoid arthritis is the most common inflammatory arthritis, affecting up to 1% of the adult population3. Two loci (HLA-DRB14 and PTPN225) have previously been associated with rheumatoid arthritis susceptibility in individuals with circulating antibodies to cyclic citrullinated peptides (CCP). Most of the inheritance of rheumatoid arthritis remains unexplained.
To identify additional common variants associated with risk of CCP antibody–associated (CCP+) rheumatoid arthritis, we conducted a GWA study using the Affymetrix 100K GeneChip microarray in a longitudinal case series of individuals with CCP+ rheumatoid arthritis (the Brigham Rheumatoid Arthritis Sequential Study (BRASS) cohort). As we lacked epidemiologically matched controls, we compared case data to publicly available genotype data collected using the same platform from 1,211 related Framingham Heart Study (FHS) participants1, drawn from the same geographical region as the individuals in our study (near Boston, Massachusetts, USA).
Before comparing allele frequencies between cases and controls, we considered biases that may be introduced by the use of shared controls. Such biases, whether due to nonrandom distribution of technical artifacts6 or to population differences between case and control data7,8, would result in a non-null distribution of test statistics with excess false-positive associations. In an initial analysis of unrelated case-control samples, we assessed the median distribution of test statistics with the genomic-control parameter λGC9 (where 1.0 indicates no inflation) and examined the tail of the distribution of association statistics in a comparison of observed and expected P values (Q-Q plot; Fig. 1).
Using published data quality control parameters from early studies on this genotyping platform (genotype call rates > 90%, minor allele frequency (MAF) >5%)1, we observed λGC = 1.19 and an excess of associations in the extreme tail of the −log10(P) distribution (Fig. 1a). To disentangle the contribution of genotyping bias from that due to population stratification, we examined the χ2 distribution for a subset of 40,562 SNPs with nearly complete genotype data (call rate >99%). This stringent filtering of SNPs reduced λGC to 1.12, and fewer SNPs had extreme P values (Fig. 1b and Supplementary Table 1 online), indicating that SNPs with low call rates were disproportionately inflating the association statistics. The presence of residual inflation in the χ2 distribution, however, suggested that bias in missing genotype data was not the only source of inflation in this study.
We next used two statistical methods to adjust for inflation due to population stratification: structured association by genetically matching cases and controls using identity-by-state similarity as implemented in PLINK10 and a principal components approach (EIGENSTRAT)11. After these adjustments, λGC was nearly completely normalized, falling from 1.12 to 1.04 (PLINK Cochran-Mantel-Haenszel; Fig. 1c) and 1.03 (EIGENSTRAT; Supplementary Table 1), with both methods giving very similar results (Supplementary Fig. 1 online). Thus, using a set of SNPs with complete genotype data and controlling for stratification in either of two ways, we found that an essentially null distribution of association statistics could be obtained despite the use of shared controls and a first-generation genotyping platform with substantial missing data.
Although this approach accounted for observed biases, it did so at the cost of reduced genome coverage due to stringent SNP filtering: from 30% of common HapMap CEU SNPs captured (at r2 > 0.8) by the 87,962 SNPs with call rates >90% to just 18% captured with the subset of 40,562 with call rates >99%. In a two-parameter linear model with call rate and minor-allele frequency as variables, we found that λGC was considerably associated with call rate and with an interaction between call rate and MAF (Supplementary Fig. 2 online). Thus, instead of a standard correction of uniformly dividing all test statistics by λGC, we used linear regression to correct the test statistics of 79,853 SNPs with >95% call rates as a function of call rate and MAF–call rate interaction (Supplementary Fig. 3 online). This dynamic genomic-control correction resulted in a null −log10(P) distribution (Fig. 1d) and maintained genome coverage at 29% of HapMap CEU SNPs.
Finally, as the available control genotypes were drawn from related individuals from multigenerational pedigrees, we evaluated whether power was improved by including genotypes from multiple related individuals (adjusting for the inflation in the χ2 distribution) or by using only the unrelated individuals from each pedigree (Supplementary Methods and Supplementary Fig. 4 online). Specifically, we evaluated significance for the two known true-positive associations (HLA-DRB1 and PTPN22) in each design. Inclusion of related individuals predictably inflated the χ2 distribution, with λGC increasing from 1.04 to 1.34 (Supplementary Table 2 online) because of overestimation of the number of control chromosomes (as some are not independent). However, even after correction for this inflation, we observed a net increase in ability to detect the effect of HLA-DRB1 and PTPN22 (Supplementary Table 2). Intuitively, this is not surprising, as inclusion of additional family members increases the number of independent chromosomes with which to estimate control-allele frequencies.
On the basis of these evaluations, we carried out association analysis of 397 CCP+ rheumatoid arthritis cases and 1,211 related FHS controls over 79,853 SNPs, using PLINK CMH to correct for stratification, two-parameter linear modeling to correct for genotype artifact, and residual λGC to correct for relatedness. This analysis resulted in an overall null distribution of results, with only slight enrichment in the tail, where an excess of spurious results may have occurred (Fig. 1e). Such enrichment could be due to true-positive results, or it could be due to bias that we failed to account for in our study. We report all SNPs with P < 0.001 from this final analysis in Supplementary Table 3 online to facilitate future attempts to replicate our findings.
From this analysis, we attempted to replicate 90 of the most significant common non–major histocompatibility complex (non-MHC) SNPs in 875 CCP+ incident rheumatoid arthritis cases and 832 controls drawn from a population-based study in Sweden (Epidemiological Investigation of Rheumatoid Arthritis (EIRA))12 and in 535 CCP+ family-based rheumatoid arthritis cases and 1,013 controls (North American Rheumatoid Arthritis Consortium (NARAC) family samples)13. In an interim analysis of genotypes for a subset of these SNPs, we identified a single SNP (rs10499194) that was associated with rheumatoid arthritis susceptibility in combined analysis of EIRA and NARAC data (Table 1). We advanced this SNP to genotyping in a third group of rheumatoid arthritis samples (NARAC sporadic samples, n = 873 CCP+ cases, n = 1,413 controls) to confirm the finding. We also genotyped additional SNPs from the region to fine map the locus in all available samples. In Supplementary Table 3, we list the complete association statistics for all SNPs genotyped in our replication samples.
As shown in Table 1, the single SNP we identified from this interim analysis (rs10499194) was strongly associated with risk of rheumatoid arthritis in our study: P = 4 × 10−7 in the 2,283 unrelated CCP+ rheumatoid arthritis cases and 3,258 unrelated control samples used for replication; P ≤ 10−9 including the original scan of the BRASS cohort and related FHS controls. The minor allele was associated with protection against rheumatoid arthritis, with a frequency ~0.24 in cases and ~0.30 in controls (odds ratio = 0.75 across all samples tested). The SNP resides in a 63-kb region of linkage disequilibrium that falls outside of any coding sequence—the nearest genes, TNFAIP3 and OLIG3, are ~185 kb away (Fig. 2).
After initial submission of our manuscript, genome-wide association data became available from the Wellcome Trust Case Control Consortium (WTCCC) on ~2,000 rheumatoid arthritis cases (CCP status unknown) and ~3,000 controls2. Because the full association results for this study were available online, we sought to examine the association of our replicated finding (rs10499194) in this independent study. The WTCCC data showed association to rs13207033, a perfect proxy (r2 = 1.0) of our replicated SNP (rs10499194) with P = 0.01. Notably, a second SNP less than 4 kb away (rs6920220; r2 = 0.05 to rs10499194) had much stronger association in WTCCC data, with P = 5 × 10−6. For the WTCCC SNP rs13207033, the minor allele is increased in frequency in controls compared to cases, as is the minor allele of rs10499494 in our study (Fig. 3).
Before learning of the WTCCC results, in an attempt to fine map our association, we had genotyped in our replication samples an additional 17 SNPs chosen on the basis of imperfect linkage disequilibrium (LD) to rs10499194 (r2 = 0.20–0.95). In light of the WTCCC results, we carried out stepwise regression analysis to determine whether the two signals were independent or simply due to linkage disequilibrium with each other or another SNP in the region. Specifically, we used these 17 SNPs to predict SNPs in CEU HapMap individuals that were not directly genotyped in our study but that could be well predicted using single SNPs or multi-marker haplotypes14. In this analysis, the SNP we originally observed (rs10499194) provided a strong signal of association (Fig. 2) but alone did not explain the entire association signal: the SNP with the stronger association in WTCCC (rs6920220, imputed with r2 = 1 using a two-marker predictor) remained significant after analysis conditional on rs10499194 (P = 0.0005 for rs6920220; MAF = 0.241 for cases and 0.196 for controls). Analysis of rs6920220 alone was also highly significant (P = 1 × 10−7) in our replication samples. Similarly to the WTCCC study, the rs6920220 minor allele was increased in rheumatoid arthritis cases compared with controls.
We next carried out haplotype analysis on the basis of these two SNPs and found that a two-allele model of risk provided the strongest predictor of risk, which was highly significant (P = 2.8 × 10−12). Addition of other SNPs to the haplotype analysis did not increase the significance of the model, and the two SNPs together did not predict any known HapMap SNP. These two SNPs reside on distinct phylogenetic branches of the haplotype tree constructed with genotype data from our study and define three categories of risk: a ‘protective’ haplotype tagged by rs10499194; a ‘risk’ haplotype tagged by rs6920220; and the remaining haplotypes, which have risks equal to one another (Fig. 3). Although these data strongly suggest the existence of two independent susceptibility alleles, exhaustive resequencing is required to rule out the possibility that these two SNPs form a haplotype in LD with a single, as-yet-unidentified causal allele. If multiple independent association signals are confirmed, the finding of multiple common risk alleles at 6q23 would be similar to other recent examples of multiple alleles such as the associations of IRF5 and risk of systemic lupus erythematosis15, IL23R and risk of Crohn's disease16, 8q24 and risk of prostate cancer17–19 and CFH and risk of age-related macular degeneration20.
These two SNPs (rs10499194 and rs6920220) are located within 3.8 kb of each other but are >150 kb from the nearest genes, which are those encoding tumor necrosis factor, alpha-induced protein 3 (TNFAIP3, ~185 kb telomeric), and oligodendrocyte transcription factor 3 (OLIG3, ~185 kb centromeric; Fig. 2). TNFAIP3, also known as A20, is a potent inhibitor of NF-κB signaling and is required for termination of tumor necrosis factor (TNF)-induced signals21. TNF-α levels are increased in individuals with rheumatoid arthritis, and inhibition of TNF-α is a potent treatment of severe rheumatoid arthritis22. Furthermore, mice lacking Tnfaip3 show chronic inflammation23, consistent with loss of function of this gene playing a role in autoimmunity. Far less is known about OLIG3. Mutant Olig3 mice have abnormalities in neuronal development but no reported abnormalities of the immune or musculoskeletal systems24. Finally, two other immune-related genes lie within 1 Mb of the associated region (IL22RA and IFNGR1). Additional genetic and functional studies will be required to determine which of these genes, or others not yet recognized, explain the two SNP associations observed consistently and significantly across our study and the WTCCC results.
Samples from patients with rheumatoid arthritis (n = 435) were collected at Brigham and Women's Hospital in Boston, Massachusetts (USA), as part of the BRASS Registry25. A total of 1,343 Framingham Heart Study samples from 303 multiplex families were available for analysis. Because the population prevalence of rheumatoid arthritis is <1% in the adult population, and because only limited data on the rheumatoid arthritis status of FHS samples were available, all FHS samples were considered as possible controls. Informed consent was obtained by the institutions overseeing the BRASS and FHS studies.
Genotyping of the rheumatoid arthritis samples was carried out at the Broad Institute using the Affymetrix GeneChip 100K Mapping Array containing 116,204 SNPs. FHS samples were genotyped at Boston University1 and obtained through a formal application process. Genotypes were called using the dynamic-modeling algorithm. (BRLMM data were available for the rheumatoid arthritis samples, but we did not use them because we only had access to FHS genotypes called using the dynamic-modeling algorithm.) Both datasets were filtered individually and then merged; individuals with >10% missing genotypes and SNPs with >10% missing data or Hardy-Weinberg equilibrium (HWE) P values <0.0001 were excluded. After applying these filters, 405 rheumatoid arthritis cases and 1,305 FHS controls remained. We removed FHS individuals with two genotyped parents (n = 66), as these samples contribute no independent genetic information. The average call rate of the 87,962 SNPs across these samples was 98.3%. The rheumatoid arthritis–associated SNP (rs10499194) had a call rate of 98.03% in the rheumatoid arthritis cases and 99.24% in FHS controls, with a HWE P value >0.05. Additional details are available in Supplementary Methods. The Massachusetts Institute of Technology Institutional Review Board approved the study.
We compared SNP allele frequency in unrelated rheumatoid arthritis samples to either unrelated (n = 393) or related (n = 1,211) FHS controls. In analysis without correction for population stratification, significance was determined using standard Pearson's χ2 test for contingency tables. To correct for population stratification, we first removed genetic outliers (see Supplementary Methods) and then applied two distinct methods: Cochran-Mantel-Haenszel (CMH) meta-analysis implemented in PLINK10 and a principal-components method implemented in EIGENSTRAT11. We used PLINK CMH for our primary analysis and EIGENSTRAT for a secondary analysis (Supplementary Methods).
We first normalized the distribution of association statistics by taking the square root and arbitrarily changing sign for SNPs whose odds ratios were >1. This resulted in an essentially normal distribution of values, to which we fit a linear model with two parameters: missing data proportion and minor allele frequency, including their interaction. Corrected test statistics were recovered by inverting the normalization process for residuals of the model.
Our overall strategy was to replicate our top SNPs in two sample collections: population-based case-control samples from Sweden (EIRA12) and familial case-control samples from North America (NARAC family collection13). We analyzed one CCP+ case from each NARAC family, for a total of 1,548 samples (n = 535, CCP+ rheumatoid arthritis cases; n = 1,013, unrelated controls). The NARAC controls were selected from 20,000 individuals who are part of the New York Cancer Project (NYCP)26. Approximately two controls were matched to each affected sibling proband case on the basis of sex, age (birth decade) and ethnicity (grandparental country or region of origin). A third set of samples (NARAC ‘sporadic collection’) was used to test rs10499194 and carry out fine mapping across the 6q23 locus (Supplementary Methods). Informed consent was obtained by the institutions overseeing the EIRA and NARAC studies.
Genotyping was carried out at the Broad Institute using the Sequenom iPLEX platform. We removed samples with call rates <95% and SNPs with call rates <97% and/or HWE P < 0.01. A final set of 2,283 unrelated CCP+ rheumatoid arthritis cases and 3,258 unrelated control samples were available for analysis. We received permission from FHS to genotype a single SNP, rs10499194, in the same set of FHS samples. The Affymetrix-Sequenom concordance for rs10499194 was 100% for the BRASS and unrelated FHS samples and 99.8% for the related FHS samples. Additional genotype data of 704 European ancestry informative markers (AIMs) had been previously carried out using the Illumina GoldenGate custom assay27 and were available in all NARAC samples.
Our primary analysis in EIRA was based on 2 × 2 contingency tables of allele frequencies and a χ2 test. For NARAC, our primary analysis was EIGENSTRAT11 applied to a set of 704 European substructure AIMs27 and correcting along the first principal component. As a secondary analysis in NARAC, we used the 704 AIMs to generate identity-by-state case-control clusters (for CMH analysis in PLINK; see Supplementary Methods).
We combined replication genotype data for all 2,283 unrelated CCP+ rheumatoid arthritis cases and 3,258 unrelated controls. We imputed three SNPs with an r2 = 1 using two-marker SNP predictors generated by the 17 SNPs genotyped in these samples14: rs6920220 (predicted by rs1167224 and rs812845), rs566097 (predicted by rs9321624 and rs9376293) and rs507779 (predicted by rs6921233 and rs4896295). The statistical software package WHAP28 was used to conduct logistic regression analysis conditional on each SNP and to conduct an omnibus (or global) test of haplotypes. Additional details are available in Supplementary Methods
Note: Supplementary information is available on the Nature Genetics website.
The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University or NHLBI. We appreciate the comments provided by B. Voight and J. Hirschhorn during the preparation of the manuscript. We appreciate the release of genome-wide association results by the WTCCC, which was of great value to our analysis. The BRASS Registry is supported by a grant from Millennium Pharmaceuticals and Biogen-Idec. R.M.P. is supported by a K08 grant from the US National Institutes of Health (AI55314-3). The NARAC is supported by US National Institutes of Health grants RO1-AR44422 and NO1-AR-2-2263 (P.K.G.). This work was also supported in part by the Intramural Research Program of the National Institute of Arthritis and Musculoskeletal and Skin Diseases of the National Institutes of Health. The EIRA study is supported by grants from the Swedish Medical Research Council, the Swedish Council for Working Life and Social Research, King Gustaf V's 80-Year Foundation, the Swedish Rheumatic Foundation, the Stockholm County Council, the insurance company Arbetsmarknadens Försäkringsaktiebolag and the County of Sörmland Research and Development Center. D.A. is a Burroughs Wellcome Fund Clinical Scholar in Translational Research and a Distinguished Clinical Scholar of the Doris Duke Charitable Foundation.
Author Contributions Clinical samples were collected and prepared by R.M.P., E.W.K., N.M., D.M.L., E.F.R., A.T.L., L.P., L.A., J.C., M.E.W., L.K., P.K.G. and N.A.S. Genotyping was contributed by R.M.P., L.D., N.P.B., B.B., M.D., M.P., R.B., W.W., C.H., D.A.H., S.B.G., M.F.S., E.I., R.R. and A.N.P. Statistical analysis was carried out and interpreted by R.M.P., C.C., L.D., A.L.P., P.I.W.D., J.M., I.P., R.R.G., R.G., S.P., M.J.D. and D.A. The manuscript was written by R.M.P., C.C., M.J.D. and D.A.
Competing Interests Statement: The authors declare competing financial interests: details accompany the full-text HTML version of the paper at http://www.nature.com/naturegenetics/.
Published online at http://www.nature.com/naturegenetics
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions