Numerous studies have demonstrated association between HLA alleles and disease susceptibility (a partial list is provided in Table 1 and Supplementary Table 1), but the interpretation of these results is confounded by the strong correlation between alleles at neighboring HLA and non-HLA genes. Major efforts have therefore been directed at cataloguing the gene and variation content of the entire MHC [4-6]. In addition, previous studies in European-derived populations have examined the distribution of LD across the region and have suggested that SNPs could help dissect causal variation within the MHC [2,3,7-10]. Here, we have created a resource to guide future association studies by genotyping genetic variants across the 7.5-Mb extended MHC region at a higher density and in more DNA samples than previously reported. In 361 individuals of African (YRI), European (CEU), Chinese (CHB) and Japanese (JPT) ancestry, the inferred haplotype structure across the region shows that LD is systematically higher in the CEU, CHB and JPT samples than in the YRI sample (Fig. 1). Alleles at the different classical HLA loci are strongly correlated (Supplementary Table 2). These high levels of LD among SNPs and deletion-insertion polymorphisms (DIPs) and among HLA alleles suggest that SNPs outside the HLA genes are informative about HLA types (Fig. 2), and that a few well-chosen SNPs may capture common classical HLA variation at several loci.
Table 1 Examples of HLA alleles associated with common disease and their tag SNPs
Figure 1 The relationship between recombination rates and haplotype structure spanning the 7.5-Mb extended MHC region (defined by the SLC17A2 gene at the telomeric end to the DAXX gene at the centromeric end of chromosome 6).
Figure 2 Allelic association between SNPs across the 7.5-Mb extended MHC region and HLA types at each gene for the combined population data, using the 5,754 SNPs that were typed in all populations and are polymorphic across the combined population samples.
We examined the association between HLA types and single SNPs across the entire region. Results are shown for HLA-C (see Supplementary Fig. 1 for the other HLA genes). In the four populations studied, 34-44% of the HLA alleles present are strongly associated with one or more individual SNPs (maximum r² > 0.8), sometimes located at a considerable distance from the HLA allele. There are noticeable differences between the four populations. For example, in YRI and CEU, many SNPs extending over several Mb are in moderate to strong LD with allele HLA-C*0702, whereas in CHB and JPT strong association is found only with SNPs within 50 kb of the gene. In contrast, some alleles, such as HLA-C*0304, are not strongly associated with any single SNP in any of the population samples in which they were found. These results suggest that although tagging of certain common HLA alleles in some populations may be relatively straightforward, tag SNPs are likely to differ between populations, and tagging of some HLA alleles may prove difficult if based solely on pairwise association with single SNPs.
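The pairwise measure discussed above can be illustrated with a minimal Python sketch that computes r² between a SNP allele and an HLA allele from phased haplotypes coded as 0/1 indicators. The function names, SNP identifiers and toy data are hypothetical and are not taken from the analysis pipeline of this study.

```python
def r_squared(snp, hla):
    """r^2 between two 0/1 indicator vectors scored on the same phased haplotypes."""
    n = len(snp)
    if n == 0 or n != len(hla):
        raise ValueError("indicator vectors must be non-empty and of equal length")
    p_a = sum(snp) / n                      # frequency of the SNP allele
    p_b = sum(hla) / n                      # frequency of the HLA allele
    p_ab = sum(1 for s, h in zip(snp, hla) if s == 1 and h == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                          # one of the two sites is monomorphic
        return 0.0
    d = p_ab - p_a * p_b                    # coefficient of linkage disequilibrium D
    return d * d / denom


def best_single_snp(snps, hla):
    """Return (snp_name, r2) for the single SNP with the highest r^2 to the HLA allele."""
    scores = {name: r_squared(vector, hla) for name, vector in snps.items()}
    best = max(sorted(scores), key=lambda name: scores[name])
    return best, scores[best]


# Toy example: four haplotypes, two hypothetical SNPs, one HLA allele indicator.
toy_snps = {"snpA": [1, 1, 0, 0], "snpB": [1, 0, 0, 0]}
toy_hla = [1, 0, 0, 0]
print(best_single_snp(toy_snps, toy_hla))
```

An allele would count as strongly tagged by a single SNP, in the sense used here, when the returned maximum r² exceeds 0.8.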
To assess the extent to which allelic variation at HLA loci can be captured by nearby SNPs, we used the “Tagger” algorithm [11] to identify allelic tests using single SNPs or haplotypes of combinations of up to three SNPs as predictors of HLA alleles. Following this tagging approach, the majority of common HLA alleles could be captured effectively and efficiently (Table 1 and Supplementary Table 3). We observed differences in tagging performance between HLA genes: common (≥5%) alleles of HLA-A, -B and -C were captured, on average, with a maximum r² = 0.97 in all four population samples, compared with a maximum r² = 0.90 for all common HLA-DRB1 alleles. Of the less common (<5%) HLA alleles, 75% in YRI and CEU, and 100% in CHB and JPT, are captured with a maximum r² > 0.8, but these results should be interpreted with caution given the small sample size and inaccuracies in allele frequency estimates.
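The idea of using haplotypes of up to three SNPs as predictors can be sketched as a brute-force search in Python. This is not the Tagger algorithm itself, whose selection procedure is not described here; it is an exhaustive illustration under the assumption that haplotypes are phased and coded as 0/1, and all names and data are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """r^2 between two 0/1 indicator vectors scored on the same phased haplotypes."""
    n = len(x)
    p_a, p_b = sum(x) / n, sum(y) / n
    p_ab = sum(1 for a, b in zip(x, y) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


def best_haplotype_test(snps, hla, max_snps=3):
    """Search all combinations of up to max_snps SNPs and, for each, every observed
    allele pattern; return (r2, snp_names, pattern) of the best predictor found."""
    names = sorted(snps)
    best = (0.0, (), ())
    for k in range(1, max_snps + 1):
        for combo in combinations(names, k):
            # Per-haplotype allele pattern across the chosen SNPs.
            haps = list(zip(*(snps[name] for name in combo)))
            for pattern in sorted(set(haps)):
                indicator = [1 if h == pattern else 0 for h in haps]
                score = r_squared(indicator, hla)
                if score > best[0]:
                    best = (score, combo, pattern)
    return best


# Toy example: the HLA allele sits only on haplotypes carrying allele 1 at both SNPs,
# so neither SNP alone tags it well but the two-SNP haplotype does.
toy_snps = {"s1": [1, 1, 1, 1, 0, 0, 0, 0], "s2": [1, 1, 0, 0, 1, 1, 0, 0]}
toy_hla = [1, 1, 0, 0, 0, 0, 0, 0]
print(best_haplotype_test(toy_snps, toy_hla))
```

The exhaustive search grows rapidly with the number of SNPs, so a practical tagging tool would restrict the candidate SNPs, for example to those near the gene.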
The majority (~70%) of the HLA alleles are captured with higher r² by specified haplotypes of multiple SNPs than by single SNPs. Generally, a tag (test) selected to capture an HLA allele in one reference panel captured that allele with lower r² in the other population samples (Supplementary Table 4). This is broadly consistent with observed patterns of tag SNP transferability in population samples across the major continents [12]. Additional empirical data from other samples are required to better understand the extent of transferability of tags selected across the MHC.
To this end, we performed an empirical validation of the tags for four different HLA alleles in two independent samples from ongoing disease studies. Specifically, we had access to 330 Dutch samples from a celiac disease study [13], for which we had HLA typing data for DQA1, and 332 trio samples from a UK systemic lupus erythematosus (SLE) study [14], for which we had HLA typing data for DRB1. The haplotype formed by the DQA1*0501 and DQB1*0201 alleles (also known as haplotype DQ2.5) is a known risk factor for celiac disease, with the highest risk for individuals who are homozygous for the DQ2.5 haplotype or who carry one copy of this haplotype and one haplotype formed by DQA1*0201 and DQB1*0202 (haplotype DQ2.2) [15]. In SLE, significant association has been observed for both DRB1*1501 and DRB1*0301 [16]. We directly evaluated the predictive power of the SNPs/haplotypes for these alleles (Table 1) in these samples, and found that the sensitivity and specificity of these tags were significant and of practical use, for example for pre-screening large samples to select individuals for further study (Table 2).
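Sensitivity and specificity of a tag, as evaluated above, can be computed by comparing tag-based predictions with laboratory HLA typing. The following Python sketch assumes one 0/1 call per chromosome (or per individual) for both the prediction and the typing; the function name and toy data are hypothetical.

```python
def sensitivity_specificity(predicted, typed):
    """Compare tag-based predictions with HLA typing results (both coded 0/1).

    Sensitivity = TP / (TP + FN): fraction of true allele carriers that the tag detects.
    Specificity = TN / (TN + FP): fraction of non-carriers that the tag correctly excludes.
    """
    if len(predicted) != len(typed):
        raise ValueError("predicted and typed must have the same length")
    tp = sum(1 for p, t in zip(predicted, typed) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(predicted, typed) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(predicted, typed) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(predicted, typed) if p == 1 and t == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one carrier and one non-carrier in the typed data")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: six chromosomes, tag prediction versus HLA typing.
toy_predicted = [1, 1, 0, 0, 1, 0]
toy_typed = [1, 1, 0, 0, 0, 1]
print(sensitivity_specificity(toy_predicted, toy_typed))
```

For pre-screening, a tag with high sensitivity would miss few true carriers, so that only tag-positive samples need to be sent for classical HLA typing.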
Table 2 Empirical validation of SNP-based tags of associated HLA alleles in large patient collections
In general, two features make HLA allele tagging more difficult than tagging of SNPs. First, HLA alleles are themselves multilocus haplotypes, identified by unique combinations of sequence motifs generated by mutation, recombination and gene conversion [17]. Second, the unique evolutionary history of the MHC means that patterns of association are influenced not only by recombination, gene conversion, demography and genetic drift, but also by natural selection. In particular, HLA class I and class II alleles are often maintained in the population by balancing selection [18-20] (for example, heterozygote advantage or frequency-dependent selection). Certain forms of balancing selection, such as host-pathogen frequency-dependent selection [21], will favor novel combinations of alleles across multiple HLA genes, hence actively selecting for recombinants [20]. However, as favored HLA combinations increase in frequency, so will the haplotype background on which they occurred. The direct consequence of such a dynamic is that a given HLA allele might be found on one, two or several different haplotype backgrounds, depending on where in the cycle of fluctuating selection it currently lies. In addition, balancing selection has resulted in the existence of hundreds of HLA alleles and haplotypes in populations, the vast majority of which are not common (less than 2% frequency) and yet collectively account for a significant proportion of the genetic variation. Given the limitations of this MHC variation resource (in terms of density and sample size), it remains to be seen how well the less common variants, haplotypes or recombinants can be captured via a tagging approach.
To illustrate the link between tagging efficiency and evolutionary dynamics, we mapped the distribution of common alleles onto the evolutionary tree relating haplotypes around the HLA-C gene (Fig. 3). Certain common alleles, such as C*0702, are associated with a single clade in the tree that is defined by multiple SNPs. Such an allele can therefore be tagged with high efficiency, and its tags will likely be transferable between populations. The evolutionary implication is that this allele has a recent origin, or that a recent recombinant haplotype carrying this allele has been favored by natural selection, coupled with the loss of the allele on any other haplotype background from all populations through random drift and bottleneck effects. In contrast, allele C*0701 occurs on two quite distinct clades of the tree that differ considerably in frequency between populations, an observation supported by analysis of long-range haplotype structure. Both of these clades are of recent origin (as indicated by their extensive haplotype backgrounds), such that they can be tagged, though only through combinations of multiple SNPs. Other alleles are yet further dispersed across the evolutionary tree and consequently harder to tag; for example, C*0303 requires two tags in CEU and CHB and three in JPT (it is absent from YRI), and no single tag SNP was selected in more than one population. Identifying differences in the evolutionary history of disease-associated HLA alleles is informative for association mapping experiments, as a specific HLA allele distributed across different clades will carry different sets of linked HLA and non-HLA alleles. Identification of the causal allele(s) will depend on the ability to distinguish the clade associated with disease, followed by direct re-sequencing of the corresponding chromosomes to identify candidate variants.
Figure 3 The evolutionary history of HLA-C. (a) Estimated evolutionary tree showing relationships among haplotypes at the HLA-C locus (defined as position 31,341,277 in build 34 or between SNPs rs2853950 and rs2001181) with mutations (blue circles) ...
Identification of alleles in the MHC that have likely undergone positive selection can also provide candidates to test for association with immune-mediated diseases. Preliminary searches of the CEU population indeed suggested that such alleles were also associated with autoimmunity [3]. One approach for identifying recent positive selection is to identify alleles that are prevalent but that are associated with long-range haplotypes, unbroken by recombination over time (suggesting that they are young) [22]. We applied this ‘long-range haplotype’ approach to the current data set together with the matched genome-wide data available from HapMap, and identified several alleles within the MHC region that show evidence for recent selective sweeps (Fig. 4 and Supplementary Table 5). One striking example is a haplotype of 25% frequency in YRI in the region containing the BAK1 and HLA-DPA1 genes (Supplementary Fig. 2). Further study of this and other haplotypes that have putatively undergone selection may point to key functional changes in the MHC that have influenced human disease past and present.
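The long-range haplotype idea can be illustrated with the extended haplotype homozygosity (EHH) statistic: among chromosomes carrying a given core allele, EHH at a marker is the probability that two randomly chosen chromosomes are identical from the core out to that marker. The Python sketch below computes EHH and the distance reached before it decays below 0.8. It assumes phased haplotypes with markers ordered outward from the core; the function names, positions and toy data are hypothetical, and the sketch does not reproduce the matching against genome-wide data described above.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, upto):
    """EHH over the first `upto` markers (ordered outward from the core allele).

    haplotypes: sequences of alleles, one per chromosome carrying the core allele.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes carrying the core allele")
    groups = Counter(tuple(h[:upto]) for h in haplotypes)
    # Identical pairs within each haplotype class, divided by all possible pairs.
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)


def extent_at_threshold(haplotypes, positions, threshold=0.8):
    """Position of the farthest marker at which EHH is still >= threshold, else None."""
    reach = None
    for i, pos in enumerate(positions, start=1):
        if ehh(haplotypes, i) < threshold:
            break
        reach = pos
    return reach


# Toy example: six core-carrying chromosomes typed at three markers located at
# hypothetical distances of 10, 50 and 200 (arbitrary units) from the core.
toy_haps = ["000", "000", "000", "000", "000", "001"]
toy_positions = [10, 50, 200]
print(extent_at_threshold(toy_haps, toy_positions))
```

A common allele whose haplotype extends unusually far before EHH decays, relative to other alleles of similar frequency, would be a candidate for recent positive selection.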
Figure 4 The genetic distance over which the long-range haplotype associated with each allele for each SNP on chromosome 6 extends (before decaying to an EHH [22] of 0.8) in each of the four populations. (See Methods for details.) The blue dot represents the average ...
We set out to create a dense haplotype map across the extended MHC in four population samples. This resource will facilitate the selection of informative tag SNPs to capture HLA and non-HLA variation, providing a cost-effective means of conducting association studies in large patient samples and thus a complementary approach to classical HLA typing. We anticipate that future integration with the efforts of the International HapMap Project, the International Histocompatibility Working Group and the MHC Haplotype Project, combined with targeted functional studies, will help identify the causal alleles that predispose to immune-mediated diseases and those that have been under selection [23,24].