Rheumatoid arthritis is the most common inflammatory arthritis, affecting up to 1% of the adult population3
. Two loci (HLA-DRB14
) have previously been associated with rheumatoid arthritis susceptibility in individuals with circulating antibodies to cyclic citrullinated peptides (CCP). Most of the inheritance of rheumatoid arthritis remains unexplained.
To identify additional common variants associated with risk of CCP antibody–associated (CCP+
) rheumatoid arthritis, we conducted a GWA study using the Affymetrix 100K GeneChip microarray in a longitudinal case series of individuals with CCP+
rheumatoid arthritis (the Brigham Rheumatoid Arthritis Sequential Study (BRASS) cohort). As we lacked epidemiologically matched controls, we compared case data to publicly available genotype data collected using the same platform from 1,211 related Framingham Heart Study (FHS) participants1
, drawn from the same geographical region as the individuals in our study (near Boston, Massachusetts, USA).
Before comparing allele frequencies between cases and controls, we considered biases that may be introduced by the use of shared controls. Such biases, whether due to nonrandom distribution of technical artifacts6
or to population differences between case and control data7,8
, would result in a non-null distribution of test statistics with excess false-positive associations. In an initial analysis of unrelated case-control samples, we assessed the median distribution of test statistics with the genomic-control parameter λGC9
(where 1.0 indicates no inflation) and examined the tail of the distribution of association statistics in a comparison of observed and expected P
values (Q-Q plot; ).
Figure 1 Q-Q plots of GWA analyses in unrelated individuals: influence of missing genotype data and population stratification. We conducted GWA analysis of BRASS rheumatoid arthritis cases compared to unrelated FHS controls. Light blue diamonds indicate SNPs within (more ...)
Using published data quality control parameters from early studies on this genotyping platform (genotype call rates > 90%, minor allele frequency (MAF) >5%)1
, we observed λGC
= 1.19 and an excess of associations in the extreme tail of the −log10
) distribution (). To disentangle the contribution of genotyping bias from that due to population stratification, we examined the χ2
distribution for a subset of 40,562 SNPs with nearly complete genotype data (call rate >99%). This stringent filtering of SNPs reduced λGC
to 1.12, and fewer SNPs had extreme P
values ( and Supplementary Table 1
online), indicating that SNPs with low call rates were disproportionately inflating the association statistics. The presence of residual inflation in the χ2
distribution, however, suggested that bias in missing genotype data was not the only source of inflation in this study.
We next used two statistical methods to adjust for inflation due to population stratification: structured association by genetically matching cases and controls using identity-by-state similarity as implemented in PLINK10
and a principal components approach (EIGENSTRAT)11
. After these adjustments, λGC
was nearly completely normalized, falling from 1.12 to 1.04 (PLINK Cochran-Mantel-Haenszel; ) and 1.03 (EIGENSTRAT; Supplementary Table 1
), with both methods giving very similar results (Supplementary Fig. 1
online). Thus, using a set of SNPs with complete genotype data and controlling for stratification in either of two ways, we found that an essentially null distribution of association statistics could be obtained despite the use of shared controls and a first-generation genotyping platform with substantial missing data.
Although this approach accounted for observed biases, it did so at the cost of reduced genome coverage due to stringent SNP filtering: from 30% of common HapMap CEU SNPs captured (at r2
> 0.8) by the 87,962 SNPs with call rates >90% to just 18% captured with the subset of 40,562 with call rates >99%. In a two-parameter linear model with call rate and minor-allele frequency as variables, we found that λGC
was considerably associated with call rate and with an interaction between call rate and MAF (Supplementary Fig. 2
online). Thus, instead of a standard correction of uniformly dividing all test statistics by λGC
, we used linear regression to correct the test statistics of 79,853 SNPs with >95% call rates as a function of call rate and MAF–call rate interaction (Supplementary Fig. 3
online). This dynamic genomic-control correction resulted in a null −log10
) distribution () and maintained genome coverage at 29% of HapMap CEU SNPs.
Finally, as the available control genotypes were drawn from related individuals from multigenerational pedigrees, we evaluated whether power was improved by including genotypes from multiple related individuals (adjusting for the inflation in the χ2
distribution) or by using only the unrelated individuals from each pedigree (Supplementary Methods and Supplementary Fig. 4
online). Specifically, we evaluated significance for the two known true-positive associations (HLA-DRB1
) in each design. Inclusion of related individuals predictably inflated the χ2
distribution, with λGC
increasing from 1.04 to 1.34 (Supplementary Table 2
online) because of overestimation of the number of control chromosomes (as some are not independent). However, even after correction for this inflation, we observed a net increase in ability to detect the effect of HLA-DRB1
(Supplementary Table 2
). Intuitively, this is not surprising, as inclusion of additional family members increases the number of independent chromosomes with which to estimate control-allele frequencies.
On the basis of these evaluations, we carried out association analysis of 397 CCP+
rheumatoid arthritis cases and 1,211 related FHS controls over 79,853 SNPs, using PLINK CMH to correct for stratification, two-parameter linear modeling to correct for genotype artifact, and residual λGC
to correct for relatedness. This analysis resulted in an overall null distribution of results, with only slight enrichment in the tail, where an excess of spurious results may have occurred (). Such enrichment could be due to true-positive results, or it could be due to bias that we failed to account for in our study. We report all SNPs with P
< 0.001 from this final analysis in Supplementary Table 3
online to facilitate future attempts to replicate our findings.
From this analysis, we attempted to replicate 90 of the most significant common non–major histocompatibility complex (non-MHC) SNPs in 875 CCP+
incident rheumatoid arthritis cases and 832 controls drawn from a population-based study in Sweden (Epidemiological Investigation of Rheumatoid Arthritis (EIRA))12
and in 535 CCP+
family-based rheumatoid arthritis cases and 1,013 controls (North American Rheumatoid Arthritis Consortium (NARAC) family samples)13
. In an interim analysis of genotypes for a subset of these SNPs, we identified a single SNP (rs10499194) that was associated with rheumatoid arthritis susceptibility in combined analysis of EIRA and NARAC data (). We advanced this SNP to genotyping in a third group of rheumatoid arthritis samples (NARAC sporadic samples, n
= 873 CCP+
= 1,413 controls) to confirm the finding. We also genotyped additional SNPs from the region to fine map the locus in all available samples. In Supplementary Table 3
, we list the complete association statistics for all SNPs genotyped in our replication samples.
Summary of results for rs10499194 across 2,680 CCP+ rheumatoid arthritis cases and 4,469 controls
As shown in , the single SNP we identified from this interim analysis (rs10499194) was strongly associated with risk of rheumatoid arthritis in our study: P = 4 × 10−7 in the 2,283 unrelated CCP+ rheumatoid arthritis cases and 3,258 unrelated control samples used for replication; P ≤ 10−9 including the original scan of the BRASS cohort and related FHS controls. The minor allele was associated with protection against rheumatoid arthritis, with a frequency ~0.24 in cases and ~0.30 in controls (odds ratio = 0.75 across all samples tested). The SNP resides in a 63-kb region of linkage disequilibrium that falls outside of any coding sequence—the nearest genes, TNFAIP3 and OLIG3, are ~185 kb away ().
Figure 2 Case-control association results and linkage disequilibrium (LD) structure at 6q23. Results for SNPs genotyped across 1 Mb as part of the original GWA scan in 397 CCP+ rheumatoid arthritis cases and 1,211 related controls (gray diamonds), as well 17 SNPs (more ...)
After initial submission of our manuscript, genome-wide association data became available from the Wellcome Trust Case Control Consortium (WTCCC) on ~2,000 rheumatoid arthritis cases (CCP status unknown) and ~3,000 controls2
. Because the full association results for this study were available online, we sought to examine the association of our replicated finding (rs10499194) in this independent study. The WTCCC data showed association to rs13207033, a perfect proxy (r2
= 1.0) of our replicated SNP (rs10499194) with P
= 0.01. Notably, a second SNP less than 4 kb away (rs6920220; r2
= 0.05 to rs10499194) had much stronger association in WTCCC data, with P
. For the WTCCC SNP rs13207033, the minor allele is increased in frequency in controls compared to cases, as is the minor allele of rs10499494 in our study ().
Figure 3 Haplotype analysis in our replication samples and in the WTCCC study of ~2,000 individuals with rheumatoid arthritis and ~3,000 controls. Haplotype analysis with 17 genotyped SNPs and 3 imputed SNPs across a 63-kb region of strong LD in (more ...)
Before learning of the WTCCC results, in an attempt to fine map our association, we had genotyped in our replication samples an additional 17 SNPs chosen on the basis of imperfect linkage disequilibrium (LD) to rs10499194 (r2
= 0.20–0.95). In light of the WTCCC results, we carried out stepwise regression analysis to determine whether the two signals were independent or simply due to linkage disequilibrium with each other or another SNP in the region. Specifically, we used these 17 SNPs to predict SNPs in CEU HapMap individuals that were not directly genotyped in our study but that could be well predicted using single SNPs or multi-marker haplotypes14
. In this analysis, the SNP we originally observed (rs10499194) provided a strong signal of association () but alone did not explain the entire association signal: the SNP with the stronger association in WTCCC (rs6920220, imputed with r2
= 1 using a two-marker predictor) remained significant after analysis conditional on rs10499194 (P
= 0.0005 for rs6920220; MAF = 0.241 for cases and 0.196 for controls). Analysis of rs6920220 alone was also highly significant (P
= 1 × 10−7
) in our replication samples. Similarly to the WTCCC study, the rs6920220 minor allele was increased in rheumatoid arthritis cases compared with controls.
We next carried out haplotype analysis on the basis of these two SNPs and found that a two-allele model of risk provided the strongest predictor of risk, which was highly significant (P
= 2.8 × 10−12
). Addition of other SNPs to the haplotype analysis did not increase the significance of the model, and the two SNPs together did not predict any known HapMap SNP. These two SNPs reside on distinct phylogenetic branches of the haplotype tree constructed with genotype data from our study and define three categories of risk: a ‘protective’ haplotype tagged by rs10499194; a ‘risk’ haplotype tagged by rs6920220; and the remaining haplotypes, which have risks equal to one another (). Although these data strongly suggest the existence of two independent susceptibility alleles, exhaustive resequencing is required to rule out the possibility that these two SNPs form a haplotype in LD with a single, as-yet-unidentified causal allele. If multiple independent association signals are confirmed, the finding of multiple common risk alleles at 6q23 would be similar to other recent examples of multiple alleles such as the associations of IRF5
and risk of systemic lupus erythematosis15
and risk of Crohn's disease16
, 8q24 and risk of prostate cancer17–19
and risk of age-related macular degeneration20
These two SNPs (rs10499194 and rs6920220) are located within 3.8 kb of each other but are >150 kb from the nearest genes, which are those encoding tumor necrosis factor, alpha-induced protein 3 (TNFAIP3
, ~185 kb telomeric), and oligodendrocyte transcription factor 3 (OLIG3
, ~185 kb centromeric; ). TNFAIP3
, also known as A20
, is a potent inhibitor of NF-κB signaling and is required for termination of tumor necrosis factor (TNF)-induced signals21
. TNF-α levels are increased in individuals with rheumatoid arthritis, and inhibition of TNF-α is a potent treatment of severe rheumatoid arthritis22
. Furthermore, mice lacking Tnfaip3
show chronic inflammation23
, consistent with loss of function of this gene playing a role in autoimmunity. Far less is known about OLIG3
. Mutant Olig3
mice have abnormalities in neuronal development but no reported abnormalities of the immune or musculoskeletal systems24
. Finally, two other immune-related genes lie within 1 Mb of the associated region (IL22RA
). Additional genetic and functional studies will be required to determine which of these genes, or others not yet recognized, explain the two SNP associations observed consistently and significantly across our study and the WTCCC results.