|Home | About | Journals | Submit | Contact Us | Français|
If trait-associated variants alter regulatory regions, then they should fall within chromatin marks in relevant cell types. However, it is unclear which of the many marks are most useful in defining cell types associated with disease and fine mapping variants. We hypothesized that informative marks are phenotypically cell type specific; that is, SNPs associated with the same trait likely overlap marks in the same cell type. We examined 15 chromatin marks and found that those highlighting active gene regulation were phenotypically cell type specific. Trimethylation of histone H3 at lysine 4 (H3K4me3) was the most phenotypically cell type specific (P < 1 × 10−6), driven by colocalization of variants and marks rather than gene proximity (P < 0.001). H3K4me3 peaks overlapped with 37 SNPs for plasma low-density lipoprotein concentration in the liver (P < 7 × 10−5), 31 SNPs for rheumatoid arthritis within CD4+ regulatory T cells (P = 1 × 10−4), 67 SNPs for type 2 diabetes in pancreatic islet cells (P = 0.003) and the liver (P = 0.003), and 14 SNPs for neuropsychiatric disease in neuronal tissues (P = 0.007). We show how cell type–specific H3K4me3 peaks can inform the fine mapping of associated SNPs to identify causal variation.
Recent work showing that common phenotypically associated SNPs are enriched for expression quantitative trait loci (eQTLs)1–6 suggests that they might act by altering gene regulatory regions. One example is a common non-coding variant associated with plasma low-density lipoprotein (LDL) concentration. This variant modifies a CEBPB transcription factor–binding site in an enhancer and, in doing so, alters the expression of SORT1, a gene that affects plasma LDL concentration7. Another similar example is an intergenic risk allele for systemic lupus erythematosus (SLE) that decreases TNFAIP3 transcription by modifying the nuclear factor (NF)-κb–binding site within a promoter8. Whereas many eQTLs and regulatory variants act universally, the ones most relevant to disease might have tissue specific activity6. The cell type specificity of regulatory elements is one of the major limitations in pursuing functional studies to investigate the regulatory potential of common alleles9–13.
One approach to identify regulatory elements influenced by common variants involves assaying epigenetic chromatin marks14–16. For example, H3K4me3 and monomethylation at H3K4 (H3K4me1) highlight active promoters and enhancers. But, a practical challenge of this approach is that dozens of chromatin marks might potentially be assayed17, and it is prohibitive to conduct studies on all of them in large numbers of different tissues or in samples collected from many individuals. However, because chromatin marks colocalize18, the status of a small subset of the most informative marks might be characterized, allowing for more focused assays in tissue libraries and populations to link variants to regulatory mechanisms. Additionally, it is challenging for a given phenotype to know which cell type(s) are most useful to assay chromatin marks in order to fine map risk alleles. If the critical cell types were known, then it might be possible to identify the biologically important cell type–specific eQTLs.
Here, we hypothesize that a proportion of alleles for a given phenotype influence gene regulation by altering regulatory elements that control expression within the cell types most relevant to the phenotype. If this is the case, then variants associated with the same phenotype should overlap marks preferentially occurring within the same cell type. Therefore, to identify the most informative chromatin marks, we quantify the degree to which their activity in specific cell types near phenotypically associated variants tracks with phenotype. We then show how those chromatin marks that are most phenotypically cell type specific can identify causal cell types, asserting that cell type–specific marks might be used to fine map and identify the plausible causal variant at a particular locus.
We first sought to define a score that corresponds to the possibility that a phenotypically associated SNP or a variant in tight linkage disequilibrium (LD) with it can alter cell type–specific gene regulation, as highlighted by a specific chromatin mark. We define chromatin marks as precise positions in the genome where there is a significant excess of reads from chromatin immunoprecipitation and sequencing (ChIP-seq) data over control sequencing data. We assume that variants close to or directly under tall chromatin mark peaks in specific cell types might be involved in cell type–specific gene regulation; on the other hand, variants that are far from chromatin mark peaks are much less likely to have a direct role in gene regulation. First, for each phenotypically associated SNP, we identified each SNP or insertion and/or deletion (indel) in tight LD (r2 > 0.8 in 1000 Genomes Project data19; Fig. 1a). Next, for each cell type, we assigned each variant in LD a score proportional to the height of the nearest chromatin mark peak (referred to as h; Online Methods) divided by the physical distance to the summit (h/d in Fig. 1b; referred to as s; Online Methods). If the physical distance to the nearest peak is more than 2.5 kb, then the score is set to 0 to obviate any confounding distal effects. Thus, a variant in LD directly under a strong peak will receive a very high score. For each cell type, we assigned the phenotypically associated SNP the maximum score achieved by any of its variants in LD. To quantify the specificity of signals across cell types (as opposed to the absolute magnitude), we normalized the h/d scores so that the Euclidean metric across cell types was one (normalized h/d scores (sn; Online Methods). Thus, a SNP within a chromatin mark that is active in only one cell type will have a high score of 1 in that cell type and 0 in others. In contrast, a SNP close to chromatin marks that are not cell type specific will have similarly modest scores across cell types.
Then, we wanted to quantify the phenotypic cell type specificity of the overlap between SNPs and chromatin marks. To do this, we identified sets of SNPs associated with different phenotypes and then assessed the phenotypic cell type specificity of different marks (Fig. 1c). For informative marks, one or few cell types should consistently score highly across many of the SNPs for a given phenotype. For an uninformative chromatin mark, the cell types with the greatest scores vary from SNP to SNP within the same phenotype. Therefore, for informative marks, there should be minimal deviation of scores within a phenotype across multiple cell types. To quantify the phenotypic cell type specificity of a chromatin mark, we defined a metric representing the variation of signal seen within a cell type within a specific phenotype (referred to as d; Online Methods). We evaluated the statistical significance of this metric with permutations with which we randomly reassigned SNPs to phenotypes (Fig. 1d). This permutation strategy restricts analysis to only phenotypically associated SNPs and, in doing so, avoids biases that might result from known differences between phenotypically associated SNPs and non–phenotypically associated SNPs in local LD structure, gene density and epigenetic activity. We note that this approach accurately estimates type I error (Supplementary Fig. 1a).
To test the phenotypic cell type specificity of individual marks, we identified a set of SNPs associated with any one of many complex traits20. We selected only SNPs associated in European populations to facilitate LD calculations. To ensure adequate power, we selected only those traits that had at least 15 reported associations in European populations. Then, we pruned SNPs by LD so that they were all independent (r2 < 0.1 and > 100 kb away from other associated SNPs in the genome; Online Methods). This resulted in a set of 510 independent SNPs associated with 31 complex traits. After defining the genomic locations and heights of peaks for 15 chromatin marks assayed in 14 Encyclopedia of DNA Elements (ENCODE) cell types15 (Supplementary Table 1), we observed statistically significant phenotypic cell type specificity for 4 marks (P < 0.0033 = 0.05/15; Fig. 2). The most strongly associated chromatin marks were H3K4me3 and acetylation of histone H3 at lysine 9 (H3K9ac) (P < 1 × 10−6), which are known to highlight active gene promoters16,21. In fact, all four most significant modifications are known to occur at regions of the genome involved in active gene transcription; DNase I hypersensitivity sites (DHSs; P < 1 × 10−3) and dimethylation of histone H3 at lysine 79 (H3K79me2; P < 1 × 10−5) identify active promoter, enhancer or transcribed regions. Because some chromatin marks colocalize (Supplementary Fig. 2), we performed conditional analyses to assess whether chromatin marks contributed to phenotypic cell type specificity independently (Supplementary Fig. 3). We observed that the highly significant associations of H3K4me3, DHSs and H3K9ac were generally not independent. In contrast, we found that chromatin marks that did not correspond to active gene regulation were not phenotypically cell type specific. In particular, H3K9me1, H3K9me3, CTCF-binding sites and trimethylation at histone H3 lysine 27 (H3K27me3), highlighting transcriptionally repressed heterochromatic insulator and polycomb-repressed regions, respectively, showed no evidence of being phenotypically cell type specific (P > 0.40).
To assess the reproducibility of these results, we conducted a similar analysis of data from the US National Institutes of Health (NIH) Epigenomics Project, consisting of assays for 6 different chromatin marks in 38 different cell types22 (Supplementary Table 2). We again observed that the most informative mark was H3K4me3 (P < 1 × 10−6), along with H3K4me1 (Fig. 2). H3K9ac was more nominally significant (P = 0.03), perhaps owing to the fewer cell types assayed in this experiment. The concordance of the results from these two data sets was reassuring when considering that the data from the ENCODE Project were obtained on cell lines, whereas most of the NIH Epigenomics Project data were obtained using primary cell types.
Our approach benefits from taking advantage of 1000 Genomes Project data to identify variants in LD (Fig. 1a). Repeating our analysis using only the reported lead SNPs and not examining SNPs in LD resulted in considerably less significant results (Supplementary Fig. 1b). We note that some of the variation in phenotypic cell type specificity could be related to the variable number of assayed cell types for different chromatin marks; power to detect phenotypic cell type specificity correlates with the number of assayed cell types (Supplementary Fig. 4).
Because chromatin marks tend to concentrate in and around genes, we considered the possibility that the observed overlap between H3K4me3 peaks and variants might be an artifact of proximity to gene transcript sequences with phenotypically cell type specific expression. To assess the role of the specific peak locations versus proximity to specifically expressed genes, we repeated our analyses after randomly shifting the specific location of peaks locally (± 10 kb, s.d. of 2.5 kb) within phenotypically associated loci. While these small shifts would maintain the proximity of peaks to genes, they would disrupt the specific colocalization of variants and H3K4me3 peaks. Indeed, in 1,000 such experiments, we found that shifting peak locations lowered the significance of phenotypic cell type specificity (median P = 0.03), and we did not observe any instance where the phenotypic cell type specificity was more significant than it was in the actual data (Supplementary Fig. 5). This result strongly suggests that the specific colocalization of variants in LD with phenotypically associated SNPs and H3K4me3 peaks rather than proximity to gene structures is driving the phenotypic cell type specificity signal (P < 0.001 by permutation).
To understand whether the phenotypic cell type specificity that we observed was driven by the activity of promoters or enhancers, we divided chromatin peaks into those falling within proximal promoter regions (including the transcriptional start site (TSS) ± 2 kb) and those falling outside of promoter regions and repeated our analysis. Whereas phenotypic cell type specificity was seen both within and outside of the immediate promoter regions, H3K4me3, H3K79me2 and DHSs were more significantly phenotypically cell type specific outside of promoter regions than within (Supplementary Fig. 6). We note that, although H3K4me3 marks are not generally thought of as being enriched in enhancers, there was evidence that they can be enriched in strong and disease-associated enhancers9,23,24. Alternatively, H3K4me3 enrichment outside of promoter sites might also represent unannotated sites.
We further assessed the phenotypic cell type specificity of previously published functional annotations on the basis of hidden Markov model states capturing information on nine separate chromatin marks9. We observed that hidden states 4 and 5, corresponding to active proximal enhancers and active distal enhancers, respectively, were most significantly phenotypically cell type specific (Supplementary Fig. 7). State 4 is highly enriched for H3K4me3 peaks, the mark that we observed to be the most phenotypically cell type specific.
We identified the cell types within which common variants likely influence gene regulation using published SNPs for 4 distinct phenotypes (Fig. 3 and Supplementary Table 3) and H3K4me3 data from the Epigenomics Project for a panel of 34 cell-types22. We selected these phenotypes because there is a reasonable sense of what the critical cell types might be and because a sufficient number of associated SNPs had been identified. For each phenotype, we assigned a cell type specificity score to each of its associated variants (Fig. 1a,b and Online Methods) and compared to scores from equal-sized sets of matched SNP sets sampled from 45,950 LD-pruned SNPs3. Because phenotypically associated SNPs have more epigenetic activity than other SNPs, we were careful to match sampled SNPs so that they had similar total numbers of H3K4me3 peaks across all 34 cell types as associated SNPs. Results were generally consistent in a more stringent analysis when we sampled instead from only phenotypically associated SNPs from the National Human Genome Research Institute (NHGRI) genome-wide association study (GWAS) catalog20 (Supplementary Fig. 8). In addition to these phenotypes, we present separately the results for four additional phenotypes, B-cell–specific cis eQTL associations, SLE, type 1 diabetes (T1D) and body mass index (BMI) (Supplementary Fig. 9); in all of those instances, except BMI, we were able to identify highly significant cell types.
As a positive control, we tested 37 SNPs associated with LDL concentration25 for overlap with H3K4me3 marks in different tissues. These variants should implicate regulatory activity within the liver, according to previous work7,26,27. In aggregate, we observed that the 37 SNPs implicated a total of 1,501 H3K4me3 peaks in 34 different cell types. The most significant cell type was adult liver tissue (P = 7.2 × 10−5; Fig. 3a). We observed overlap with liver-specific peaks using other phenotypically cell type–specific marks, including H3K9ac (P = 0.003) and H3K4me1 (P = 0.002). In contrast, we observed little association with liver for the H3K27me3 or H3K9me3 marks (Fig. 2 and Supplementary Table 4). Examining the relative proximity and specificity of the SNPs within 10,000 sets of matched SNP sets used to calculate statistical significance, we identified the 95th-percentile threshold at a score of 0.58 (Fig. 4a). Of the 37 SNPs associated with LDL concentration, 7 (19%) were near to a highly liver-specific chromatin mark at this threshold. These seven SNPs are generally in tight LD with a variant that is very close to cell type– specific H3K4me3 peaks (median of 132 bp away; see Supplementary Table 3 for details on the specific SNPs).
For rheumatoid arthritis and other autoimmune diseases, the critical immune cell types are often not clearly defined in the literature and can be controversial28–30. When we tested the 31 SNPs associated with rheumatoid arthritis31, we observed that they implicated 1,328 H3K4me3 peaks in 34 tissues, with the most significant association to CD4+ T cells and, in particular, CD4+ regulatory T (Treg) cells (P = 1.3 × 10−4; Fig. 3b). The phenotypically similar CD4+ memory T cells were also highly significantly associated (P = 7.0 × 10−4)32. Of the 31 SNPs associated with rheumatoid arthritis, we found that 6 (19.3%) were close to chromatin marks that were highly specific to CD4+ Treg cells, with relative specificity of 0.53 or greater (permuted 95th-percentile threshold; Fig. 4b). These 6 SNPs are generally in tight LD with a variant that is very close to cell type–specific H3K4me3 peaks (median of 37 bp away; see Supplementary Table 3 for details on the specific SNPs).
In instances where dense genotyping has been applied to localize the association signal, we speculate that cell type–specific overlap might become more apparent. Indeed, for the 31 loci associated with rheumatoid arthritis, we examined recent results from a fine-mapping study using the dense genotyping platform the Immunochip33. Indeed, when repeating the analysis with the newly defined index SNPs from each locus using dense genotyping data, we found that the significance of the enrichment for CD4+ Treg cells increased (5.1 × 10−5; Supplementary Fig. 10) and that the median specificity score for each locus increased from 0.13 to 0.16.
The 14 independent SNPs from neuropsychiatric disorders34,35 mapped within 874 H3K4me3 peaks. Despite the limited power of this analysis, we were encouraged to see that these SNP associations implicated multiple neuronal tissues, including the anterior caudate nucleus (P = 0.0076) and the mid-frontal lobe of the brain (P = 0.044) (Fig. 3c); we also observed a likely spurious association with colonic smooth muscle (P = 0.026). The role of the frontal lobe in neuropsychiatric disease in particular has long been appreciated36–38. Although none of these results reached a conservative level of significance after correcting for multiple-hypothesis testing, we are hopeful that additional SNP discoveries will help clarify this result further. Of the 14 SNPs associated with neuropsychiatric disorders, 3 (21%) had a tissue-specific chromatin mark within the anterior caudate, with a relative specificity of 0.28 or greater (permuted 95th percentile; Fig. 4c).
In certain instances, it might be plausible that multiple tissues could be implicated in a disease. When we examined 67 SNPs for type 2 diabetes (T2D)39–50, implicating a total of 2,776 H3K4me3 peaks within 34 different cell types, we observed the most significant enrichment in pancreatic islets (P = 0.0061) and the liver (P = 0.0079) (Fig. 3d). In particular, of the 67 SNPs associated with type 2 diabetes, 14 (20.1%) were either highly specific for chromatin marks within the liver (at a 0.57 permuted 95th-percentile threshold) or pancreatic islets (at a 0.65 permuted 95th-percentile threshold); these SNPs are in tight LD with a marker that has a median distance of 46 bp from the summit of a cell type–specific peak. When we tested the pancreatic islet and liver tissues together, we found that the combination of liver and pancreatic islets was even more significant than the tissues individually (P = 2.0 × 10−4; Online Methods) and was more significant than all other possible tissue pairs. We found that the SNPs driving the overlap in the two tissues were distinct (Fig. 4d). When we removed the SNPs most specific for pancreatic islet marks (score > 0.3), we observed that enrichment in liver was even more apparent (P = 0.0032); similarly, when we removed the SNPs most specific for overlap with liver marks (score > 0.3), we observed that the enrichment in pancreatic islets was also more apparent (P = 0.0026). Both islet cells and the liver have long been known to have a key role in mediating glucose synthesis, insulin secretion and diabetes51.
One of the major challenges in understanding complex trait associations is to identify the causal variants and the mechanisms through which they affect genes. Associated variants can be fine mapped to variants in tight LD within cell type–specific chromatin marks in the appropriate cell type. Here, we present examples where cell type– specific H3K4me3 peaks can potentially be used to localize associated variants to causal variants.
First, we considered the rs629301 SNP that is associated with plasma LDL concentration in the region including the SORT1 gene (Fig. 5a). A liver-specific H3K4me3 peak, not seen as prominently in other cell types, overlapped with this SNP and three other variants in tight LD with it. This H3K4me3 peak is located far from the TSS region and corresponds to a hepatocyte enhancer region7. The closest SNP to the summit of the peak (87 bp away) is the rs12740374 functional variant. This variant is known to alter a CEBPB-binding site within the enhancer region controlling SORT1.
Another example is provided by the locus for T2D represented by the rs704184 reported SNP association. rs10814915, tightly in LD with the reported GWAS SNP (r2 = 0.93), scored highly for pancreatic islets but showed no tissue specificity for the liver (Fig. 5b). This SNP located only 84 bp away from the summit of a highly pancreatic islet–specific peak. rs10814915 is predicted to be present within a sequence bound by the glucocorticoid receptor (GR)52, which is known to have a role in pancreatic islets and glucose regulation. The SNP resides within an intron of the GLIS3 gene, which is involved in the development of pancreatic islets.
Finally, we examined the locus for rheumatoid arthritis defined by a reported association with the rs13119723 SNP in the intron of a gene with unknown function, KIAA1109. This SNP is in LD with other variants spanning over 500 kb within this locus, rendering fine-mapping efforts particularly challenging. We identified a SNP, rs13140464, in tight LD with rs13119723 (r2 = 0.9) (Fig. 5c), which maps only 116 bp from the summit of the H3K4me3 peak, which is highly specific to CD4+ Treg cells with a score of 0.63. This SNP is located between the IL2 and IL21 genes, 122 kb downstream of IL2 and 34 kb upstream of IL21, and is 280 kb away from the published SNP. It is tempting to speculate that rs13140464 might act by altering a highly cell type–specific regulatory sequence controlling IL2 expression, which has a key role in CD4+ Treg maturation53.
In this study, we demonstrated that chromatin marks highlighting active regulatory regions, such as H3K4me3, H3K9ac and DHSs, overlap phenotypically associated variants; furthermore, this overlap is phenotypically cell type specific. These results strongly support the hypothesis that many complex disease and trait alleles might act by influencing gene regulation in a cell type–specific manner. In addition, we quantified the degree to which different marks are cell type specific in their overlap with phenotypically associated SNPs. These cell type–specific marks might not only be used to connect phenotypes to specific cell types, but they might also be useful in mapping phenotype-associated SNPs to potential regulatory variants. In particular, we consistently observed that H3K4me3 marks could be used to effectively identify specific cell types that are enriched among specific phenotypes. We note that this statistical approach can be applied to assess the significance of other chromatin marks or other cell type–specific gene annotations as they become available.
In the phenotypes that we examined, we found that about onefourth of associated variants could be connected to a highly cell type– specific mark within a critical cell type (Fig. 5). In instances where we do not observe a SNP in tight LD within a highly cell type–specific H3K4me3 peak, it is possible that a regulatory region that is not cell type specific might be altered. Alternatively, in some instances the reported SNP association will need to be further refined with dense genotyping, or undiscovered variants in tight LD will need to be ascertained through sequencing, before the effect of a cell type–specific peak can be identified. Finally, for many phenotypes, multiple cell types could be involved, in which case this approach might have limited efficacy. We demonstrated one example of this type of scenario in T2D, where we detected effects both in liver and pancreatic islet cell types.
We acknowledge that our approach is potentially sensitive to the diversity and number of cell types assayed. For instance, a limited application to a set of hematopoietic cell types might not be particularly informative if a set of purely neurological phenotypes is assayed. We note that our approach depends critically on technical factors—for instance, the quality of antibody reagents, experimental protocols or other technical factors that might introduce noise into specific chromatin mark assays could mitigate true signals. Our approach may perform better on the chromatin marks with higher quality assays.
Once variants and cell types are identified, they will likely be excellent candidates for cell type–specific functional investigations, including allelic imbalance assays to define cis-eQTL activity54, cell type–specific DHS quantitative trait locus (dsQTL) analyses55 and identification of active transcription factor–binding sites. These cell type–specific investigations in appropriately chosen cell types might ultimately help to lead investigators from common disease variation to causal variants and molecular mechanisms.
We obtained two publicly available data sets for chromatin mark assays on different sets of tissues. We use the term chromatin mark broadly to include histone modifications and DHSs, as well as common epigenetic features, such as CTCF-binding sites.
First, we used data from the ENCODE Project, which included sequence reads from ChIP-seq assays and controls in up to 14 different cell types from a diverse set of 15 chromatin marks: CTCF-binding sites, the variant H2A histone (H2A.Z), H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me1, H3K9me3, H4K20me1, Pol2b-binding sites and DHSs15 (Supplementary Tables 1 and 2). We separately obtained hidden chromatin state annotations for 8 of the 14 cell types defined by clustering chromatin marks9.
Second, we used data from the NIH Roadmap Epigenomics Mapping Consortium that assayed only six chromatin marks on a large number of cell types22. This data set included sequence reads from ChIP-seq assays and controls for 6 histone modifications—H3K27me3, H3K4me3, H3K36me3, H3K9ac, H3K4me1 and H3K9me3—assayed in 38 adult and fetal tissues (Supplementary Table 2).
For both of these data sets, we downloaded data comprising hg19-mapped sequence reads. In instances where there were multiple replicates of a given ChIP-seq assay for the same tissue, we aggregated sequence reads for the individual assays. We also obtained mapped reads from control data comprising sequenced genomic DNA. We ran MACS (v1.4) software to identify significant peaks (P < 1 × 10−5), specific locations within the genome with enrichment of tag sequences, setting the bandwidth parameter to 300 bp56. For each chromatin mark, we located its summit, which represents the position with the highest pileup of sequence tags.
Once we identified peaks, we used MACS to determine the fold enrichment of tags compared to controls, using the equation
where λpeak and λlocal are parameters for a Poisson distribution determined by fitting the local sequence tag distributions in the peak region from ChIP-seq data and control data, respectively. We considered f as the height of peak instead of the raw number tags, as this approach leverages control data to account for local biases in the genome (due to sequencing bias, mapping bias, chromatin structure and genome copy-number variations) and improves the robustness and specificity of the estimation.
We then corrected for global variation in multiple experiments for the same chromatin mark in different cell types, using the equation
where fi,j corresponds to fold enrichment for the peak j in the cell type i before normalization, and hi,j,norm is the fold enrichment after normalization (or the height of the peak).
To estimate the phenotypic cell type specificity of each chromatin mark, we identified a comprehensive set of independent SNPs associated with unique phenotypes. We used data from a catalog summarizing results from recent GWAS20 (downloaded January 2012). We selected only the phenotype-associated SNPs with highly statistically significant associations (P < 5 × 10−8). To ensure the applicability of the 1000 Genomes Project resource, we used only those SNPs associated in populations of European descent. To limit the analysis to phenotypes that have an adequate number of SNP associations, we selected only phenotypes with at least 15 such SNP associations. To ensure the independence of the associated SNPs, we removed SNPs with r2 > 0.1 and those that were <100 kb from a more strongly associated variant in the genome. To preserve a priori specific phenotypes for independent testing, we removed SNPs associated with rheumatoid arthritis, BMI and LDL plasma cholesterol concentration as well as height. For variants associated with multiple phenotypes, we selected a single phenotype association and discarded others; we selected the SNP associated with the phenotype with the fewest SNPs. Our final data set consisted of 510 risk variants associated with 31 diseases or traits.
To test our approach, we also separately identified in the literature 37 SNPs associated with LDL plasma concentration25, 31 SNPs associated with rheumatoid arthritis risk31, 67 SNPs associated with T2D risk39–50 and 14 risk loci for neuropsychiatric disorders34,35.
We recognized that the observed phenotype association of a given variant might be the consequence of other variants tightly linked to the associated variant (Fig. 1a). We therefore comprehensively ascertained variants from the 1000 Genomes Project to identify all variants (SNPs and indels) in LD19 (r2 > 0.8) on the basis of haplotypes reconstructed with Beagle from the subset of 379 individuals of European descent.
Next, we examined chromatin marks in the different cell types located near associated SNPs (Fig. 1b). We assumed that the closer an associated SNP (or variant in LD) was to a tall peak, the greater the chance that it might influence a regulatory element highlighted by that peak. We scored each associated SNP k within each cell type by identifying a SNP k′ (or indel) in tight LD that was closest to a chromatin mark peak j in tissue i. We then assigned a score sj,k equal to the height of peak j in the tissue i, hi,j,norm (referred to as h in the main text) divided by the distance d between the SNP k′ and the summit of the peak j. If there was no peak within 2.5 kb of each SNP in LD with SNP k, then si,k was set to zero.
For each associated SNP k and chromatin mark, we obtained a vector of scores for multiple cell types i. To compare the cell type specificity score across risk variants and phenotypes, we applied Euclidean normalization in the following equation:
This ensured that sni,k emphasized cell type specificity instead of the magnitude of the signal. For associated risk variants not near any peak, where si,k is zero for all i, we replaced values with the average of values of other associated SNPs with at least one nonzero si,k value over all cell types.
If a chromatin mark is informative for phenotypic cell type specificity, then the deviance of chromatin mark overlap for associated SNPs (sni,k) should be minimal for a given phenotype and tissue. If a chromatin mark is not informative, then the deviance of chromatin mark overlap for associated SNPs will be high for a phenotype and tissue.
Therefore, we defined a deviance-based metric of phenotypic cell type specificity for a mark, which was the aggregate sum of the squared differences between sni,k values and mean values for fixed phenotypes p and cell types i,
where meani,p(sn) is the mean of the normalized cell specificity scores in the cell type i for SNPs associated with phenotype p. If a mark is informative, then sn scores are dependent on the phenotype and cell type, and this sum of squares should be relatively small.
To evaluate the statistical significance of phenotypic cell type specificity for particular marks, we conducted up to 1 million permutations reassigning SNPs to phenotypes randomly. This ensures that the properties of associated SNPs in the analysis are maintained, only disrupting their phenotypic associations. We recalculated d after each permutation. To compute P values, we calculated the proportion of d scores from permutations (these correspond to the null hypothesis) that were greater than the observed d score.
After identifying SNPs associated with a selected phenotype, we compute a cell type specificity score ci,p for a phenotype p by summing the normalized sni,k scores for a cell type i and associated SNPs k in the following equation:
To evaluate the statistical significance of cell type specificity scores ci,p, we defined matched sets of SNPs not associated with phenotype p and used them to calculate cell type specificity scores. Statistical significance was calculated as the proportion of SNP sets with cell type specificity scores exceeding the observed scores for actual phenotypic SNPs.
To define the matched SNP sets, we required that the sampled SNPs had the same total number of chromatin mark peaks in the region in LD across all cell types as associated SNPs. This ensures that randomly selected SNPs have similar nearby regulatory activity. For the primary analysis, we drew random SNPs from 45,950 independent HapMap SNPs that were clustered to ensure minimal independence3. In a secondary analysis, we drew SNPs from phenotypically associated SNPs from the NIH GWAS catalog20.
To test possible pairs of n cell types for association, we constructed (n – 1) × n/2 artificial ChIP-seq profiles for each tissue pair. Each artificial profile consisted of all of the peaks defined in both tissues, where the peak heights were reduced to half of their original heights. We then tested for association with cell type pairs in the same way as for single cell types, except that we replaced individual cell type scores with scores for cell type pairs.
We thank M. Daly, M. Kellis, D. Diogo, X. Hu, Y. Okada, R. Plenge, S. Ripke, G. Srivastava, E. Stahl and S. Sunyaev for critical feedback and discussion. G.T. is supported by the Rubicon grant from The Netherlands Organization for Scientific Research (NWO). B.E.S. and S.R. are supported by the Harvard University Milton Fund, and Brigham and Women’s Hospital. S.R. is also supported by funds from the US NIH (K08AR055688 and U01HG0070033) and the Arthritis Foundation. X.S.L. is also supported by funds from the US NIH (R01 HG004069). We thank the ENCODE Project, supported by the NHGRI, and the NIH Roadmap Epigenomics Mapping Consortium for making data available.
Note: Supplementary information is available in the online version of the paper.
AUTHOR CONTRIBUTIONSS.R. led the study. G.T., C.S., S.R., B.H. and H.X. performed the analysis. G.T., C.S., S.R., B.E.S. and X.S.L. wrote the manuscript. All authors reviewed the final manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.