We have sequenced 4.25 Mb of a common MHC haplotype that is associated with certain autoimmune diseases, including type 1 diabetes. The full variation content has been defined in relation to two complete MHC haplotypes that we have previously sequenced [18]. The cell lines used in this study were invaluable in obtaining homozygous DNA from the classical MHC and allowed the determination of a complete inventory of polymorphism and sequence across the whole MHC in the context of HLA-defined haplotypes, including gene, pseudogene, promoter, intergenic and complex repeat sequences.
Our approach of cloning and shotgun sequencing, instead of direct sequencing of PCR products [57], avoided the potential pitfalls inherent in PCR amplification within the MHC region, such as mispriming in under-characterised, highly polymorphic areas. In addition, many sequences are difficult to amplify by PCR for more general reasons, for example, the GC-rich 5' UTRs of genes. The validity of our experimental strategy is reflected by the completeness of the polymorphism map, from SNPs to large DIPs. The identification of significantly altered coding sequences in different haplotypes stresses the value of careful and thorough annotation. The results of these efforts will ensure that the MHC research community has comprehensive genomic information for medical research. The study design will serve as a model for future sequencing projects of other complex, polymorphic immune gene clusters of the human genome that are associated with disease, such as the leukocyte receptor complex (LRC).
The HLA haplotypes in the QBL and COX cell lines share identical alleles at the HLA-DRB1 (*0301) and -DQB1 (*0201) genes and, therefore, some commonality in their origin, composition and function. We were able to delimit the shared portion of DNA that is identical by descent to a small 158-kb segment telomeric of HLA-DRB3 and centromeric of HLA-DQB3. Between these two cell lines, this segment shows exceptionally low divergence relative to other regions within the MHC. Outside this segment, the divergence between these haplotypes is as extensive as that which we found previously between two HLA-disparate haplotypes: we identified about 15,000 SNPs, of which approximately 40% were novel to the newly sequenced haplotype. Approximately 2,000 DIPs were also identified. The nucleotide heterozygosity between the two haplotypes was 3-fold higher than typical genome-wide diversity. In contrast, the extreme conservation of the 158-kb segment points to a relatively recent common ancestor fewer than 3,400 generations ago.
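The contrast described above, between a low-divergence segment and highly divergent flanking sequence, can be illustrated with a sliding-window comparison of two aligned haplotype sequences. The following is a minimal sketch, not the analysis pipeline used in this study: the function names, the window and step sizes, and the threshold are hypothetical, and the input is assumed to be a pairwise alignment in which gaps are written as `-`.

```python
from typing import List, Tuple


def window_divergence(hap_a: str, hap_b: str,
                      window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, mismatch fraction) for two aligned haplotypes.

    Alignment columns containing a gap ('-') or an 'N' in either sequence
    are skipped, so the value reflects substitutions only, not DIPs.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(hap_a) - window + 1, step):
        seg_a = hap_a[start:start + window].upper()
        seg_b = hap_b[start:start + window].upper()
        compared = 0
        mismatches = 0
        for x, y in zip(seg_a, seg_b):
            if x in "-N" or y in "-N":
                continue  # not a comparable site
            compared += 1
            if x != y:
                mismatches += 1
        value = mismatches / compared if compared else float("nan")
        profile.append((start, value))
    return profile


def low_divergence_windows(profile: List[Tuple[int, float]],
                           threshold: float) -> List[int]:
    """Start coordinates of windows whose divergence is at or below threshold."""
    return [start for start, value in profile if value <= threshold]
```

On a toy alignment of three 10-bp windows in which only the middle window is identical, `window_divergence(a, b, 10, 10)` reports divergence 0.2, 0.0 and 0.2, and `low_divergence_windows(profile, 0.05)` returns the start of the middle window. For real data the window would be on the order of kilobases, and runs of adjacent low-divergence windows would be merged to delimit a candidate shared segment.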
A number of factors contribute to the variation within the MHC and could potentially be responsible for the existence of the shared 158-kb segment, including conventional and gene-conversion–mediated recombination [1,58]. We propose that this segment originated by conventional recombination, possibly involving recombination hotspots 1 and 2, giving rise to an original region of about 450 kb. This is supported by extremely low sequence divergence (π = 8.47 × 10⁻⁷) within the 158-kb segment, continued by lower than expected sequence divergence within the remaining approximately 290 kb up to the recombination hotspot between NOTCH4 and C6orf10. At both ends, the low divergence ends at LD breaks coinciding with confirmed recombination hotspot 2 [52] at the centromeric end and predicted hotspot 1 at the telomeric end. To our knowledge, no gene conversion–mediated recombination event involving more than 10 kb of sequence has ever been described. According to the HLA allele frequencies, the MHC can be divided into only a few blocks that contain non-randomly associated alleles at different loci [59]. The HLA-DRB1*0301–DQB1*0201 (DR3–DQ2) block is present in a number of populations, including Caucasians (Whites of northern and western European ancestry), ethnic Africans and Filipinos [59–61], and is often associated with type 1 diabetes, coeliac disease, autoimmune thyroid disease, and multiple sclerosis incidence [25,60,62]. This shared DR3–DQ2 identical-by-descent segment or “frozen block” [63] is the most commonly observed DR/DQ haplotype in different European populations, in which the ancestral MHC haplotypes A1-B8-Cw7-DR3-DQ2 (e.g., COX) and A30-B18-Cw7-DR3-DQ2 account for by far the largest proportion of its frequency. The high frequency of these extended haplotypes is generally believed to reflect their rapid expansion across Europe, driven by selection pressure on the function of a single locus or of multiple functional loci of the haplotypes [64]. However, DR3–DQ2 has also been observed as part of other, much less frequent extended haplotypes [65–67]. The wide distribution of the conserved block in Old World haplotypes deserves further investigation. Because this segment has not been split by recent recombination events, the small number of minor variants distributed over it presumably arose by mutation. By scoring them in DR3–DQ2 blocks in different populations, we will be able to reconstruct an accurate clade structure that can be used to time the association with different flanking regions in relation to population structure and disease association.
Our model is, therefore, consistent with the idea that a DNA segment derived from an ancestral haplotype has been transferred into a number of diverse and widely distributed haplotypes by recombination [63,68], and that certain recombinant haplotypes have subsequently expanded in frequency across European populations. The data suggest that ancestral DR–DQ blocks have been shuffled into different MHC haplotypes. The expansion of the resultant novel haplotypes could relate either to selection for resistance to disease, for example through an evolutionary advantage in HLA class II functions and peptide-binding specificities, or to neutral genetic drift, perhaps in an ancestral population with a small effective population size. Although it is not proven, recent data support the long-held view that sequence variation within HLA genes is driven by resistance to infection [69,70]. The spread of the DR3–DQ2 ancestral segment by inter-haplotype exchange may also have been driven by selection. This interesting hypothesis might be tested in further studies by, for example, haplotype-based tests for positive selection [71]. It is not trivial, however, to explain the contrasting genealogy of ancestral haplotype segments in a chromosome. If ancestral DR/DQ haplotypes (i.e., DR3–DQ2) have exchanged the discrete segment of the MHC that appears identical by descent between COX and QBL so recently, it is possible that similar exchange of different HLA sequences between haplotypes carrying this defined DR/DQ segment is responsible for contrasting disease risk related to non-DR/DQ loci [25] and to different ethnic backgrounds [72]. An alternative explanation would require the action of purifying selection on this fragment, keeping substitution rates low. However, this argument requires the majority of the DNA to be functional and therefore intolerant of substitutions. The differential contributions of selection and recombination in shaping the contrasting evolutionary history of ancestral haplotype segments containing classical HLA class II genes might be distinguished in further studies that expand the population range and increase SNP density.
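As an illustration of what a haplotype-based test for positive selection measures, the sketch below computes extended haplotype homozygosity (EHH), one widely used statistic of this kind. This is an assumption made for illustration only: the text above does not specify which test is meant, and the function name and the encoding of phased haplotypes as equal-length strings of allele characters are hypothetical. EHH at a marker is the probability that two randomly chosen chromosomes carrying the core allele are identical across the whole interval from the core to that marker.

```python
from collections import Counter
from math import comb
from typing import Dict, Sequence


def ehh(haplotypes: Sequence[str], core: int, core_allele: str) -> Dict[int, float]:
    """EHH from the core marker out to every marker index.

    haplotypes  -- phased haplotypes, one equal-length string per chromosome
    core        -- index of the core marker
    core_allele -- allele at the core that defines the carrier chromosomes
    """
    carriers = [h for h in haplotypes if h[core] == core_allele]
    n = len(carriers)
    if n < 2:
        raise ValueError("need at least two chromosomes carrying the core allele")
    total_pairs = comb(n, 2)
    result = {}
    for marker in range(len(carriers[0])):
        lo, hi = min(core, marker), max(core, marker)
        # Group carriers by their sequence over the interval core..marker.
        groups = Counter(h[lo:hi + 1] for h in carriers)
        identical_pairs = sum(comb(count, 2) for count in groups.values())
        result[marker] = identical_pairs / total_pairs
    return result
```

A core allele that has risen in frequency rapidly is expected to show EHH that decays slowly with distance compared with other alleles of similar frequency. A real test would compare this decay against other core haplotypes or against simulations, which this sketch does not attempt.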
The “modular” or block structure of the MHC is well known to the HLA community [59,63,68]. Whether the maintenance of polymorphic, conserved and common blocks such as the DR–DQ segment is due to suppression of recombination or to selection has never been satisfactorily resolved [73]. It has been argued that it is advantageous to maintain clusters of polymorphic genes whose products interact [1]. The DQA and DQB loci are good examples because these polymorphic genes encode a heterodimeric molecule with constraints on pairing of the α and β protein chains [74]. Similarly, different DR/DQ allelic pairs could be advantageous since they perform interrelated functions. These considerations may lie behind a characteristic of the MHC: ancient, highly diverged haplotypes that appear to be evolving independently, except for sequence in the peptide-binding grooves and rare “block shuffling” of the kind we describe here. They also lie behind the difficulties in locating single gene contributions to disease in which multiple linked interacting genes are at work [25,75]. Our results point to a particular selective advantage of the 158-kb–segment allelic variation in the history of Europeans.
The data generated from the MHC haplotype project provide a major resource for the construction of informative and high-resolution genetic maps in a region that has been more refractory to certain whole-genome analysis methods than less complex regions of the genome. Characterisation of fine segmental LD structure is an essential part of disease mapping, because it provides guidance for the selection of markers [76]. To date, over 40,000 variations from the project have been submitted to dbSNP. Over 60% of the mapped variations in this region were novel submissions from this study. These maps will provide a guide to the fine-scale patterns of LD and recombination within the MHC and will aid methods used to identify optimal sets of tag SNPs that allow association studies to be conducted more efficiently [77]. These methods take advantage of high-resolution maps and can show increasing efficiency at higher marker density [78]. Eventual elucidation of the specific disease-predisposing variants will require detailed association analysis of all genetic differences in tag SNP–defined intervals in a large number of affected individuals and controls, along with functional analysis of associated variants, to verify a biological function consistent with the disease phenotype. The annotation of disease-associated MHC haplotypes in the context of complete information, encompassing all described splice variants of expressed genes and UTR sequences, will provide an initial basis for the subsequent experimental verification of candidate MHC loci and structural variants in disease and in gene expression analyses. The sequences of the remaining haplotypes will not only reveal further polymorphisms for genetic dissection of the MHC in disease, but also define the genealogical relationships between haplotypes. There appear to be differential associations between some immune-mediated diseases and the two B18-DR3 and B8-DR3 haplotype groups studied here. Our data indicate that this variability is probably not determined by sequence variation within the class II gene–containing 158-kb chromosome segment.
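The idea of choosing tag SNPs from an LD map can be sketched as follows. This is a generic greedy approach based on pairwise r², shown under stated assumptions rather than as the method of the cited studies: the function names and the 0.8 threshold are hypothetical, SNPs are assumed to be biallelic, and phased haplotypes are assumed to be encoded as equal-length strings of `0` and `1`.

```python
from typing import List, Sequence


def r_squared(haplotypes: Sequence[str], i: int, j: int) -> float:
    """Pairwise LD (r^2) between biallelic SNPs i and j from phased haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    d = p_ij - p_i * p_j
    return d * d / denom


def greedy_tag_snps(haplotypes: Sequence[str], threshold: float = 0.8) -> List[int]:
    """Pick tag SNPs until every SNP is a tag or has r^2 >= threshold with one."""
    m = len(haplotypes[0])
    captures = {
        i: {j for j in range(m)
            if j == i or r_squared(haplotypes, i, j) >= threshold}
        for i in range(m)
    }
    untagged = set(range(m))
    tags = []
    while untagged:
        # Choose the untagged SNP that captures the most untagged SNPs;
        # ties go to the lowest index because candidates are sorted.
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags
```

For the four toy haplotypes `["1100", "1101", "0010", "0011"]`, the first three SNPs are in complete LD with each other and the fourth is independent, so `greedy_tag_snps` returns `[0, 3]`. Denser maps give such a procedure more candidate tags per LD bin, in line with the observation above that efficiency can increase with marker density.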
Taken together, our finding of the conserved sequence block between the DR3-DQ2 COX and QBL sequences and our observations of a similar level of sequence conservation in the DR–DQ region for DR15–DQ6 haplotypes suggest a recent inter-haplotype exchange of this discrete portion of the MHC. The DR–DQ segment is one of the most variable in the genome, yet it is apparently “fixed” in some haplotypes. The precise explanation for this interesting situation needs further investigation, particularly the relative contributions of recombination suppression, selection, and population expansion.