Rheumatoid arthritis is a chronic autoimmune disease characterized by an inflammatory polyarthritis12
. Genetic studies have now identified multiple risk alleles for autoantibody positive RA within the MHC
region, a PTPN22
missense allele, and risk alleles in 14 other loci (see )1,3-11
. Most RA risk loci contain multiple genes, and currently the causal genes are unknown. However, most contain at least one plausible biological candidate gene involved in immune regulation, and these genes suggest an important set of processes involved in RA pathogenesis. For example, risk alleles highlight genes involved in T-cell activation by antigen presenting cells (class II MHC
, and CTLA4
), the NF-κB
signaling pathway (CD40
, and TNFAIP3
, and the recent report of REL13
), citrullination (PADI4
), natural killer cells (CD244
), and chemotaxis (CCL21
Validated RA loci used in functional analyses
Based on these observations, we hypothesized that as yet undiscovered autoantibody positive RA risk loci might also contain genes with functions similar to those of genes in known risk loci. Therefore, known RA risk loci can be used to prioritize SNPs for replication from GWAS, especially SNPs with modest statistical support, in independent samples ().
Using Gene Relationships Across Implicated Loci (GRAIL) to prioritize candidate RA SNPs
To objectively quantify the degree of functional similarity between genes within candidate loci (identified from GWAS) and genes within validated RA risk loci, we used a published functional genomics method, GRAIL
(Gene Relationships Across Implicated Loci)2
. GRAIL quantifies functional similarity between genes by applying established statistical text mining methods14
to text from a reference database of 250,000 published scientific abstracts about human and model organism genes. For each locus, GRAIL identifies the gene with the greatest number of observed relationships. GRAIL estimates the statistical significance of the number of observed relationships with a null model where relationships between genes near SNPs occur by random chance. This significance score, ptext
, represents the output GRAIL
score. GRAIL is already able to effectively identify functional inter-connectivity between genes within the previously known RA loci (); it might also be able to establish connections between these 16 loci and as yet undiscovered RA risk loci.
GRAIL identifies inter-connectivity among genes in RA loci
Since GRAIL might demonstrate variable performance across different phenotypes, we wanted to carefully quantify its predictive ability in RA before using it to prioritize SNPs for replication. To estimate GRAIL’s ability to distinguish true RA loci from spurious associations, we examined 12 RA risk loci discovered since 2006 (see , Supplementary Table 1
online). The current GRAIL implementation is based on PubMed abstracts published prior to December 2006. As these 12 risk loci were subsequently discovered, they constitute a representative set to evaluate GRAIL’s performance. In a leave-one-out analysis, we used GRAIL to score each of these loci for functional relationships to the other 15 validated RA risk loci. A total of 10 of the 12 loci obtain GRAIL scores of ptext
<0.01. This analysis suggests that at that ptext
threshold, GRAIL has an ~83% true positive rate (or sensitivity). To assess the false positive rate of this same ptext
threshold, we modeled spurious loci by sampling 10,000 random SNPs from the Affymetrix 500K array; we scored these SNPs against all 16 RA loci. Of the random SNPs, 5.4% scored ptext
<0.01; this corresponds to a specificity of ~95%. Assessment of true and false positive rates at different cutoffs revealed an area under the curve (or C statistic) of 0.97 (see Supplementary Figure 1
). We note that if a large number of candidate SNPs are screened in a study, this might still result in a large number of false positives.
Next, we attempted to identify novel RA risk loci from a set of SNPs with modest evidence of association from our recent GWAS meta-analysis of 3,392 cases and 12,462 controls1
. In our original study, we genotyped SNPs with p
in the meta-analysis, and found that 6 out of 31 replicated in our independent samples. However, many RA risk alleles have modest effects (e.g., OR <1.2) and will be missed at that significance threshold. We therefore expected that some SNPs at p
<0.001 may be risk alleles. After excluding SNPs that were known validated RA risk loci, we identified a total of 370 SNPs from 179 independent regions that obtained p
<0.001 (see Methods
and Supplementary Note
online). The total number of SNPs observed at this threshold is consistent with the approximate number of SNPs expected by chance, suggesting that the majority represent spurious associations and should not be reproducible in an independent case-control study.
For each of the 179 candidate loci we selected the single SNP with the strongest evidence from the GWAS meta-analysis, and then scored it against the 16 validated RA risk loci with GRAIL. If all 179 SNPs were spurious, then ~10 should score ptext
<0.01 based on the estimated 5.4% false positive rate. However 22 of the 179 (12.3%) scored ptext
<0.01 (see , Supplementary Table 2
online). This represented a significant enrichment compared to random sets of 179 SNPs (p
=3.3 × 10−4
by simulation). We therefore expected that of this select subset of 22 SNPs, as many as half might represent true RA risk alleles.
Figure 3 A. GRAIL identifies 22 SNPs among the 179 candidate SNPs with p<0.001 in a GWAS meta-analysis. We plot a histogram of the 179 SNPs as a function of their GWAS meta-analysis p-value. Gray bars represent the 157 SNPs that were not selected, while (more ...)
In order to identify which of these 22 SNPs represent true RA risk loci, we genotyped them in an independent validation study of 7,957 cases and 11,958 controls from 11 collections from Europe and North America (see Supplementary Table 3
). All cases met 1987 American College of Rheumatology classification criteria15
or were diagnosed by a board-certified rheumatologist, and were seropositive for disease-specific autoantibodies (anti-cyclic citrullinated peptide [CCP] antibody or rheumatoid factor [RF]). All individuals were self-described white and of European ancestry. We assessed association with a Cochran-Mantel-Haenszel (CMH) stratified association statistic16
. For each SNP we calculated a z
-score, where a z
>0 indicates the same allele confers risk in both the replication and the meta-analysis samples. To interpret statistical significance, we used a Bonferroni-corrected one-tailed p
-value of 0.0023 (=0.05/22, z>2.83). Additionally, we calculated the overall association p
-value across all samples (GWAS meta-analysis plus replication).
Strikingly, of the 22 SNPs examined, 19 (86%) obtained z>0 (see ). If these SNPs represented spurious associations then only about half should have z>0; the probability of such a positive skew by chance alone is pskew=0.0005 − suggesting the likelihood of a large number of true RA risk loci within this set of 22 SNPs.
Of the 22 SNPs selected by GRAIL, 13 obtained nominal levels of association at p<0.05 (corresponding to z>1.65); whereas no more than 2 might be expected by chance alone. More compellingly, 7 SNPs achieved a Bonferroni-corrected level of significance in replication (p<0.0023, z>2.83).
When we aggregated both GWAS meta-analysis and replication genotype data (see , Supplementary Table 4
online), we observed the strongest associations at rs11586238 on 1p13.1
near the CD2
overall), at rs1980422 on 2q33.2
overall), and at rs548234 on 6q21
overall). Based on conservative estimates of genome-wide significance these represent confirmed RA risk alleles.
SNPs tested for RA susceptibility
Four additional loci replicated; however, aggregate analysis of GWAS meta-analysis and replication genotype data did not exceed a conservative estimate of significance. We observed evidence of association at rs394581 on 6q25.3 near TAGAP (1.5×10−4 replication, 3.8×10−7 overall), rs10919563 on 1q31.3 within a PTPRC intron (2.6×10−4 replication, 6.7×10−7 overall), rs540386 on 11p12 within a TRAF6 intron (8.3×10−4 replication, 3.9×10−6 overall), and rs12746613 on 1q23.3 near FCGR2A (2.2×10−3 replication, 1.5×10−5 overall). These SNP associations likely represent true RA loci, but additional genotyping will be necessary for definitive confirmation.
Interestingly, many of the SNPs picked by GRAIL that validated in independent genotyping were not those with strongest evidence of association in the initial GWAS meta-analysis (see ). That is, prioritization based purely on meta-analysis p-values might have missed many of these associations. For example, rs12746613 (FCGR2A) was ranked 163rd of 179 and rs540386 (RAG1/TRAF6) was ranked 110th. Of the 5 SNPs that we genotyped with the most significant GRAIL ptext scores, 3 replicated and 1 demonstrated nominal evidence of association; only rs2614394 (IRAK4) demonstrated no evidence of association.
Many of these seven alleles further implicate genomic regions already associated with autoimmune diseases (see ). At this point none of these RA risk alleles correspond perfectly to any previously established autoimmune allele; but in some cases fine mapping of the region in multiple diseases could clarify the relationship between them. The rs12746613 FCGR2A
SNP is 13 kb away from a missense SNP in FCGR2A
that has been associated with systemic lupus erythematosus17,18
; they are in the same LD block (r2
=0.19, D’=1.0). The rs394581 SNP is located in the 5′ untranslated region of TAGAP
and is 17 kb away from a SNP associated with celiac disease and with type I diabetes19,20
; they are in partial LD (r2
=0.32, D’=.73). The rs10919563 PTPRC
SNP is 35 kb away from a rare (~1% allele frequency) non-synonymous SNP that alters splicing of PTPRC21
; there have been inconsistent reports that it is associated with multiple sclerosis22-24
. We also note that the rs7234029, a PTPN2
intronic SNP, is 41 kb away from a SNP associated with both type I diabetes and celiac disease20
; they are in the same LD block (r2
=0.14, D’=1.0). The rs548234 SNP is located 10 kb downstream from the PRDM1
transcript and is 133 kb away from a SNP previously associated with Crohn’s disease25
. The rs11586238 SNP is 50 kb upstream of the CD2
start site, but is also near multiple other key immunological genes including CD58
, and IGSF2
. It is 159 kb away from a multiple sclerosis associated SNP within a CD58
Tested SNPs near other alleles associated with autoimmune diseases
The rs1980422 SNP is located about 10 kb away from the 3′ UTR of CD28
and is 129 kb away from a known RA and type I diabetes risk allele in the CTLA4
. There is minimal LD between these two SNPs (r2
=0.04, D’=0.40); conditional analysis confirmed that the two SNPs independently confer RA risk (see Supplementary Table 5
These SNP associations continue to clarify critical biological processes involved in RA pathogenesis, including T-cell activation, NF-κB
signaling, and B-cell activation and differentiation. The CD2
protein is a co-stimulatory molecule on the surface of natural killer cells and T-cells; CD2
signaling is mediated by binding PTPRC
. Association to CD28
, contributes additional evidence of the role of T-cell activation in disease pathogenesis. TRAF6
is involved in downstream NF-κB
activation; it binds CD40
directly and is a key component of B-cell activation29
. Our study has also implicated novel processes represented by PRDM1
), a transcription factor that regulates terminal differentiation of B-cells into immunoglobulin secreting plasma cells30
. Functional studies and re-sequencing will be required to confirm that these genes are indeed the truly causal genes.
We examined all 7 replicated RA SNPs along with known RA risk alleles for epistatic interactions (see Supplementary Note
online). Despite the functional relationships between these genes, we found no evidence of significant interactions.
Population stratification could result in spurious associations. However, we were careful for each collection either to use (1) epidemiologically matched samples or (2) ancestry informative markers to match cases and controls. We further note that our seven replicated SNP associations demonstrate consistent effects across all 14 collections without evidence of heterogeneity (p>0.05 by Breslow-Day test of Heterogeneity, see ).
In this study we demonstrate the utility of functional information to prioritize SNPs for replication. We did not pre-define pathways, but rather we used GRAIL to look for genes that had relationships to other validated RA genes. We note that GRAIL is limited in its ability to identify disease genes in entirely novel pathways (i.e., pathways not suggested by the 16 previously known RA risk loci). Arguably, those loci could point to truly novel pathogenic mechanisms. Additionally, successful application of GRAIL is contingent on the scientific literature’s comprehensive description of relevant gene relationships. The general application of GRAIL to other diseases will depend critically on the completeness of the validated loci list and the documentation about relevant processes in the literature. Despite these limitations, our study has identified at least three novel RA risk loci, with strong evidence for additional risk loci.