For the DNA methylation profiling of the human MHC, the following regions of interest (ROIs) were chosen: (i) a potential regulatory region for each gene and (ii) the most CpG-dense region of each gene. It is well established that epigenetic modifications at regulatory regions, in particular promoters, correlate with the transcriptional state of the cognate gene (reviewed in Bird 2002). Because the precise locations of promoters within the human MHC were unknown at the time this study was initiated, we surmised that analysing a region from 2 kb upstream to 500 bp downstream of the annotated start codon would, in many cases, include the promoter region. Such regions were designated as “upstream” ROIs. ROIs representing the most CpG-dense region within each gene were defined for the region from 500 bp downstream of the annotated start codon to the end of the gene and did not exceed a total length of 2.5 kb. These ROIs were named “intragenic”. For longer genes, more than one intragenic ROI was chosen. Within each ROI, we used the amplicon with the highest CpG density that could be successfully amplified. Other amplicons, if used, were chosen based on the ranking of their CpG density. Wherever possible, CpG islands associated with genes were included. (CpG islands have been defined by Bird as a contiguous window of DNA of at least 200 bp in which the G + C content is at least 50% and the ratio of observed over expected CpG frequency is greater than 0.6. We used a slightly stricter definition: regions of at least 400 bp in which the G + C content is at least 50% and the ratio of observed over expected CpG frequency is greater than 0.6.) All known repeat sequences were avoided during amplicon design. Methylation was analysed in seven human tissues (adipose, brain, breast, liver, lung, muscle, and prostate), with multiple samples from different individuals for all tissues except adipose (see Table S1).
The accompanying map shows the locations and the coverage provided by the bisulphite PCR amplicons across the 3.8-Mb human MHC in the context of annotated genes, CpG content, CpG islands, and SNPs extracted from the SNP database (http://www.ncbi.nlm.nih.gov/SNP). A total of 253 unique amplicons were successfully analysed. On average, the amplicons were 438 bp in length (which is close to the optimum amplicon length for the bisulphite PCR), were relatively GC-rich (average G + C > 50%), and had a high density of CpGs (approximately 1 CpG/31 bp). Ninety genes (i.e., more than 70% of all expressed genes within the MHC) were represented by at least one amplicon. Of the analysed CpG sites, 80% displayed methylation levels that varied by more than 20% between individuals, between tissues, or both, suggesting that the potential information content of the selected amplicons was relatively high.
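The CpG island criteria described above (minimum length, G + C content, and observed/expected CpG ratio) can be illustrated with a short sketch. The function names are hypothetical, and the observed/expected formula used here (number of CpGs multiplied by window length, divided by the product of the C and G counts) is a commonly used convention assumed for illustration; the text does not spell out the exact formula applied in the study.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G).

    Assumed formula; returns 0.0 when the window has no C or no G.
    """
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


def is_cpg_island(seq, min_len=400, min_gc=0.5, min_ratio=0.6):
    """Apply the three criteria; defaults follow the stricter 400-bp definition."""
    return (len(seq) >= min_len
            and gc_fraction(seq) >= min_gc
            and cpg_obs_exp(seq) > min_ratio)


# A 350-bp toy window: fails the 400-bp definition, passes the 200-bp one.
window = "CG" * 150 + "AT" * 25
print(is_cpg_island(window))
print(is_cpg_island(window, min_len=200))
```

Passing `min_len=200` reproduces the less strict definition quoted in the text, so the same function can be used to compare the two thresholds on one window.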
Map of the Human MHC Showing Coverage and Locations of the Bisulphite PCR Amplicons for Which Methylation Data Have Been Generated
Quantification of DNA Methylation by Direct Sequencing of Bisulphite PCR Products
We analysed DNA methylation using bisulphite sequencing (Olek et al. 1996). In the presence of sodium bisulphite, unmethylated cytosines are converted to uracil, whereas methylated cytosines are unreactive under the same conditions. After bisulphite treatment the DNA is subjected to PCR and sequencing. Methylated cytosines are detected as cytosines in the sequencing reaction, whereas all unmethylated cytosines appear as thymidines. Traditionally, bisulphite PCR analysis involves sequencing multiple sub-clones of the bisulphite PCR product. This approach is time-consuming, and there have also been reports of bias (Grunau et al. 2001) and hetero-duplex amplification (Sandovici et al. 2003) during sub-cloning of bisulphite PCR products. We sequenced the bisulphite PCR products directly with the same primers used in the PCR and developed software, called ESME (Lewin et al. 2004), to determine the DNA methylation levels from the sequence trace files. Briefly, ESME performs quality control, normalises signals, corrects for incomplete bisulphite conversion, and maps positions in the trace file to CpGs in the reference sequence. The program calculates methylation levels by comparing the C to T peaks at CpG sites, with the ability to discriminate levels of methylation that differ by as little as 20%. Methylation estimation by ESME at any given CpG site is the average from all the copies generated during PCR and is therefore a more accurate representation of the methylation level than that obtained by sub-cloning. Furthermore, we reanalysed the methylation levels of 77 amplicons by MALDI-MS, which allows for discrimination of methylation levels that differ by as little as 5% (Tost et al. 2003). A comparison of the two methods demonstrated a concordance rate of 88% between ESME and MALDI-MS.
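The principle behind bisulphite sequencing, and the idea of reading a methylation level from the relative C and T signals at a CpG site, can be sketched as follows. This is a deliberately simplified illustration and not the ESME algorithm: the function names are hypothetical, and the normalisation, quality control, and conversion-rate correction that ESME performs are omitted.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate the read obtained after bisulphite treatment and PCR.

    Unmethylated cytosines end up as T; cytosines whose 0-based index is in
    methylated_positions stay as C. All other bases are unchanged.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def methylation_level(c_signal, t_signal):
    """Percent methylation at one CpG from the C and T peak signals.

    Assumes already-normalised, non-negative signals (a simplification).
    """
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this position")
    return 100.0 * c_signal / total


# The cytosine at index 1 is methylated; those at indices 4 and 5 are not.
print(bisulphite_convert("ACGTCCG", {1}))
print(methylation_level(30.0, 70.0))
```

Because direct sequencing reads a mixed population of PCR products, the C/T ratio at a site reflects the average over all template molecules, which is the property the text highlights.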
Comparison of Methylation Measurements Obtained Using MALDI-MS with Those from ESME Analysis of Directly Sequenced Bisulphite PCR Products
The HEP Database
To make the data generated in this study a publicly available resource, we have designed a Web-based, ENSEMBL-like genome browser (http://www.epigenome.org) that allows easy access to the data from the pilot HEP study (A and B). Methylation levels calculated by ESME are displayed in a colour-coded matrix. Rows represent the averages of forward and reverse sequences for various tissues while columns represent individual CpG sites. Each matrix square therefore represents the average methylation level at a given CpG site for a given tissue. Multiple data rows are available for all tissues (except adipose). Clicking on a square in the genome browser reveals the level of methylation observed at that particular CpG site (the average of the forward and reverse sequence) and information about the tissue source. Additional annotation includes chromosome coordinates, CpG islands, SNPs, ENSEMBL and high-quality, manually curated Vertebrate Genome Annotation database transcripts, the ROIs, and amplicon and primer sequences. The browser provides a zoom function to view the genomic sequence (B), and a link to ENSEMBL facilitates access to additional information and the ENSEMBL search engines. The data from the full-scale HEP will be made available via the same browser, providing a novel public resource for the research community.
Methylation Profile Characteristics of the MHC
The methylation profile of the human MHC region appears to be strongly bimodal, with over 90% of the amplicons being either relatively hypomethylated (i.e., median methylation of amplicon 30% or less) or relatively hypermethylated (i.e., median methylation of amplicon 70% or greater) (A). Re-analysis of a subset of the data by MALDI-MS confirmed the bimodality of the methylation profile. Extensive bimodality of genomic methylation profiles has been observed by several authors (reviewed in Bird 2002). Furthermore, the experiments of Lorincz et al. (2002) suggest that the extremes of methylation profiles may in fact be the most stable states within the genome. Lorincz et al. showed that a high density of methylation at a proviral construct is stably propagated in vivo, whereas a low density of proviral methylation is inherently unstable, with daughter cells harbouring proviral cassettes that are demethylated or de novo methylated. It must be noted that even though the amplicons displayed hypo- or hypermethylated profiles, small variations in the levels of methylation at individual CpG sites within an amplicon were also frequently observed. Although there may be technical reasons for this heterogeneity, numerous studies (using a variety of techniques) have shown that the methylation profile of a given region in vivo is rarely homogeneous (Costello et al. 2000; Kondo et al. 2000; Grunau et al. 2001; Cui et al. 2003). The functional outcome of these small variations, particularly when they exist between tissues or individuals, remains to be elucidated.
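The thresholds used above to describe bimodality (median methylation of 30% or less, or 70% or greater) translate directly into a small classification routine. The sketch below is illustrative only; the function names and the toy values are hypothetical and are not taken from the study data.

```python
from statistics import median


def classify_amplicon(cpg_methylation, low=30.0, high=70.0):
    """Label an amplicon from the median of its per-CpG methylation (%)."""
    m = median(cpg_methylation)
    if m <= low:
        return "hypomethylated"
    if m >= high:
        return "hypermethylated"
    return "heterogeneous"


def bimodal_fraction(amplicons):
    """Fraction of amplicons falling into either extreme category."""
    labels = [classify_amplicon(values) for values in amplicons]
    extreme = sum(label != "heterogeneous" for label in labels)
    return extreme / len(labels)


# Toy per-CpG methylation values (%) for four made-up amplicons.
toy = [[5, 10, 20], [80, 90, 95], [40, 50, 60], [0, 2, 8]]
print([classify_amplicon(values) for values in toy])
print(bimodal_fraction(toy))
```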
Bimodal Distribution of DNA Methylation within the Human MHC
Comparison of the methylation values for upstream amplicons (median methylation of 10%) versus intragenic amplicons (median methylation of 86%) revealed that upstream amplicons were more likely to be hypomethylated (p < 0.0001). Interestingly, within the upstream category we found that CpG sites located within the 5′ UTR were less likely to be methylated (median methylation of 7%) than the CpG sites located within 2 kb of the first start codon but not within the 5′ UTR (median methylation of 14%) (p < 0.0001). Within the intragenic category, we found that CpG sites located within introns (median methylation of 84%) were less likely to be methylated than CpG sites located within exons (median methylation of 89%) (p < 0.0001). Whether these significant but small differences reflect any bias for the presence of regulatory elements close to the transcriptional start site or within introns, or some other functional consequence, is currently hard to assess.
Analysis of Heterogeneously Methylated Regions
Fourteen amplicons displayed significant heterogeneous methylation profiles (i.e., median methylation between 30% and 70%) (see A). These might represent differentially methylated regions at which parental alleles display reciprocal methylation profiles that are determined by the parent-of-origin of the allele, or regions that were heterogeneously methylated on both alleles. Our sequencing method could not discriminate between these two possibilities, and none of these regions corresponded to known imprinted sites within the human genome. We therefore sub-cloned the PCR products and sequenced individual sub-clones of ten different heterogeneously methylated amplicons and used polymorphisms to discriminate between the parental alleles. The overall methylation profiles determined by sequencing individual sub-clones were consistent with those obtained by direct sequencing of the bisulphite PCR products. None of the six amplicons for which polymorphisms were found showed allele-specific methylation (data not shown). This is consistent with the fact that so far there have been no reports of imprinted regions within the human MHC. Since these amplicons were heterogeneously methylated to a similar extent in samples from various tissues and individuals, they might represent regions where maintenance of a specific epigenetic state is not essential. It is also possible that these regions are located at the boundaries of hypermethylated regions, and, consequently, the methylation levels are “trailing off”. However, both these possibilities contradict models that suggest that the genome prefers to maintain methylation profiles in bimodal states.
Interestingly, a few regions were heterogeneously methylated in some tissues only, suggesting that the tissue sampled was a mosaic of several sub-types among which the methylation profile at certain genes varied, or that the region displays tissue-specific parental imprinting similar to the insulin-like growth factor 2 (IGF2) gene, which is imprinted in all tissues except brain (Pham et al. 1997).
Direct sequencing of the heterogeneously methylated amplicons was unsuccessful in a small proportion of cases. Possible reasons include incomplete bisulphite conversion and genetic polymorphisms within the primer binding site. We also noticed a mobility shift of the sequence in a few cases. This occurs because a population of bisulphite PCR products generated from a heterogeneously methylated region contains a mixture of molecules, some with cytosines at certain CpG sites (i.e., initially methylated) and others with thymidines at those CpG sites (i.e., initially unmethylated). When these PCR products are sequenced directly, the cumulative effect of the molecular weight difference between cytosines and thymidines is that some molecules migrate faster than others during capillary electrophoresis. The sequence trace therefore contains two traces that do not perfectly overlap, resulting in erroneous estimation of methylation levels. Such sequences were excluded from further analyses.
Analysis of CpG Islands
CpG islands are GC-rich regions that contain a high density of CpGs and are positioned at the 5′ ends of many human genes (reviewed in Bird 2002). Although most CpG islands remain hypomethylated throughout development in all tissues (Antequera and Bird 1993), regardless of expression state, a small proportion become hypermethylated during development (reviewed in Bird 2002), and this correlates with transcriptional silencing of the associated gene. In our study, 27 amplicons overlapped CpG islands, and 22 of these (i.e., 80%) were hypomethylated in all tissues examined. Interestingly, this proportion of hypomethylated CpG islands is similar to that reported by Yamada et al. (2004), who analysed the methylation status of CpG islands on human Chromosome 21q and found that 103 out of 149 CpG islands (i.e., 70%) were hypomethylated.
In our study, CpG island amplicons situated in the upstream ROIs were always hypomethylated, whereas hypermethylated CpG island amplicons were found only in the intragenic regions. Among the intragenic CpG island amplicons, those situated at the 5′ end of the gene (i.e., overlapping exon 1, intron 1, or exon 2) were always hypomethylated. A tissue-specific methylation profile was observed for the CpG island situated within exon 3 of the tenascin-XB (TNXB) gene, which was hypomethylated in muscle samples only. This hypomethylation correlates with the temporally regulated and tissue-specific expression of TNXB, which is abundantly expressed in connective tissues. It has been suggested that TNXB has a role in limb, muscle, and heart development (Burch et al. 1995), and, therefore, epigenetic modifications at the TNXB CpG island may have an important regulatory role (tissue specificity of methylation profiles is discussed in more detail below). Interestingly, the CpG island amplicon located within exon 3 of the HLA-G gene spanned a methylation boundary, being hypomethylated at the 5′ end with a sharp transition to a hypermethylated profile at the 3′ end. Overall, the results are consistent with the prevailing model of CpG islands being regions of the genome that are hypomethylated, especially when they occur upstream or within the 5′ end of the gene.
Tissue Specificity of the Methylation Profiles
DNA methylation profiles are complex and dynamic, and can vary with developmental stage, tissue type, age, the alleles' parent-of-origin, and also phenotype or disease state (reviewed in Bird 2002). In particular, the role of DNA methylation in setting up and maintaining tissue-specific expression patterns has received a lot of attention. However, the extent of tissue specificity of DNA methylation profiles is relatively unknown. The HEP pilot study involved the analysis of 32 samples (from different individuals) comprising seven tissues: adipose, brain, breast, liver, lung, muscle, and prostate.
Upon comparison of the amplicon profiles, we found that 10% of all amplicons displayed differential methylation between the tissue types (examples are shown in the METHANE output figure). Of these amplicons, 31% were located in the upstream regions, a proportion that is in the same range as the total number of upstream amplicons relative to intragenic amplicons analysed in this study. We scanned the literature and publicly available gene expression databases to determine whether the cognate genes displayed tissue-specific expression. An example is the complement protein C2 mRNA, which normally has a long 5′ upstream region; in the liver, an additional transcript with a much shorter 5′ upstream region is expressed (Horiuchi et al. 1990). In our study we found that a region that overlaps intron 2 and exon 2 of the C2 gene was hypomethylated in liver samples only (however, this region is downstream of the transcriptional start sites of both forms of C2 mRNA). Another example is DOM3Z, which is ubiquitously expressed but occurs only at very low levels in the lung (Yang et al. 1998), and this correlates with a region overlapping exons 4 and 5 of DOM3Z that is hypermethylated in lung (and brain) but hypomethylated in the other tissues examined. It has also been demonstrated that the murine complement factor B utilises differential tissue-specific start sites (Garnier et al. 1995), and in our analysis the human homologue is hypomethylated at a region overlapping exons 3 and 4 only in liver. However, the majority of the genes that were associated with tissue-specific methylation profiles in our study did not show corresponding tissue-specific expression profiles in a previously reported whole human genome expression microarray analysis (Su et al. 2002). Some of these genes are known to be associated with various mRNA isoforms, but detection of such alternative transcripts is quite difficult with conventional microarray analysis and usually requires more detailed analysis. It is also possible that the tissue-specific methylation profiles we observed in adult tissue may hint at tissue-specific expression profiles that existed during early development, or they may be associated with as yet unknown transcripts, e.g., non-coding RNAs. Alternatively, there may be only a modest proportion of genes in which tissue specificity of gene expression is affected by methylation.
Example of METHANE Output Showing Regions That Display Tissue-Specific Methylation Profiles
Inter-Individual Variation of Methylation Profiles
There is increasing evidence that an individual's epigenetic profile can influence phenotype and susceptibility to various diseases such as cancer, an example of such evidence being a recent report linking the loss of imprinting at the IGF2 locus with an increased risk of developing colorectal cancer (Cui et al. 2003). In our study, nearly all loci displayed some degree of heterogeneity, which probably has no bearing on the differences in genome function among individuals. However, considerable differences in methylation profiles between individual samples within a tissue were observed for a number of amplicons. We calculated a median methylation value for each individual sample and then compared these values within each tissue type for each amplicon. A total of 118 amplicons displayed a difference of greater than 50% between the lowest and highest median methylation values in at least one tissue. Of these amplicons, 76% were intragenic, which is a similar proportion to the overall number of intragenic amplicons (71%; 181 out of 253 amplicons) analysed in the study. This proportion is also similar to the overall proportion of amplicons that showed tissue-specific methylation profiles and were classified as intragenic (69%). Although inter-individual variation for a given amplicon was not observed in every tissue, there was no apparent tissue-specific enrichment for inter-individual variability of methylation profiles.
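The screening step described above (one median per individual sample, then the spread between the lowest and highest median within each tissue) can be sketched as follows. The nested-dictionary layout, the function name, and the toy values are assumptions made for illustration and do not reflect how the HEP data are actually stored.

```python
from statistics import median


def variable_amplicons(data, threshold=50.0):
    """Return {amplicon: [tissues]} where sample medians differ by > threshold.

    data is assumed to look like
    {amplicon: {tissue: {sample_id: [per-CpG methylation %, ...]}}}.
    A tissue is only considered if at least two samples have data.
    """
    flagged = {}
    for amplicon, tissues in data.items():
        for tissue, samples in tissues.items():
            medians = [median(values) for values in samples.values() if values]
            if len(medians) >= 2 and max(medians) - min(medians) > threshold:
                flagged.setdefault(amplicon, []).append(tissue)
    return flagged


# Toy data: only "amp1" in liver exceeds the 50% spread between individuals.
toy = {
    "amp1": {
        "liver": {"s1": [10, 20, 30], "s2": [80, 90, 100]},
        "lung": {"s1": [10, 12], "s2": [15, 20]},
    },
    "amp2": {"liver": {"s1": [50, 60], "s2": [55, 65]}},
}
print(variable_amplicons(toy))
```

An amplicon is reported as soon as one tissue exceeds the threshold, matching the "in at least one tissue" criterion in the text.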
Examples of amplicons that displayed significant inter-individual variation in methylation profiles include a region overlapping the last exon in CYP21A2 that showed considerable inter-individual variation in prostate (A), and a 5′ upstream region of tumour necrosis factor (LocusID 7124) that varied significantly between individuals in liver (B). Although the differences could be attributable to the technical variability inherent in our approach or the fact that we did not control for age or sex of the tissue donors, it is also possible that certain genotypes are associated with unique epigenotypes. In a recent study, Van Laere et al. (2003) mapped a porcine quantitative trait locus that affects muscle growth, fat deposition, and heart size to an evolutionarily conserved CpG island within the imprinted Igf2 gene. Pigs inheriting the mutation from their sire had a 3-fold increase in Igf2 expression in postnatal muscle (i.e., the quantitative trait locus is paternally expressed). Furthermore, the mutation abrogated in vitro interaction with a nuclear factor, and this effect was phenocopied following in vitro DNA methylation of the region. Evidence for an interaction between genotype and epigenotype at the IGF2 gene in humans has also recently been reported (Murrell et al. 2004). Of the 3,273 unique CpG sites we analysed, 101 overlapped with known SNPs (relatively evenly distributed over all amplicons), all representing sites at which the CpG was lost. The SNPs were extracted from dbSNP (http://www.ncbi.nlm.nih.gov/SNP) and are annotated in the HEP database. One could postulate that the gain or loss of one or more critical CpG sites may affect the overall methylation profile of a locus and, consequently, promoter activity. Alternatively, non-CpG SNPs located within an epigenetically sensitive regulatory element could also influence the epigenetic makeup of that region. Therefore, mutations in regulatory sequences could influence epigenetic profiles, resulting in altered phenotypes.
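Whether a SNP removes a CpG site, as discussed above, can be checked by comparing the CpG positions in the reference sequence with those in the sequence carrying the alternative allele. The sketch below uses hypothetical function names, 0-based coordinates, and a toy sequence; it is not the procedure used to annotate the HEP database.

```python
def cpg_sites(seq):
    """0-based start positions of every CG dinucleotide in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def snp_destroys_cpg(seq, pos, alt):
    """True if substituting alt at 0-based pos removes a CpG covering pos."""
    seq = seq.upper()
    mutated = seq[:pos] + alt.upper() + seq[pos + 1:]
    # A CpG starting at i covers positions i and i + 1.
    before = {i for i in cpg_sites(seq) if i in (pos - 1, pos)}
    after = set(cpg_sites(mutated))
    return bool(before - after)


reference = "ATCGAT"  # single CpG starting at index 2
print(snp_destroys_cpg(reference, 2, "T"))  # C>T within the CpG
print(snp_destroys_cpg(reference, 0, "G"))  # SNP outside the CpG
```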
Example of METHANE Output Showing Regions That Display Inter-Individual Variation of Methylation Profiles
Analysis of Methylation Variable Positions by MALDI-MS
A major aim of the HEP is to identify genomic regions at which DNA methylation profiles display statistically significant variation due to biological or environmental influences. Therefore, based on the tissue-specific and inter-individual variation in methylation profiles discussed above, we were interested in establishing high-throughput assays for epigenotyping. This involved the identification (manually or using the METHylation ANalysis Engine [METHANE]) of methylation variable positions (MVPs), which we define as CpG sites that have statistical power to discriminate between different biological samples or states. In other words, by assaying the methylation state of just a few select CpG sites within a given region, information can be inferred about the tissue source or disease state. Such a high-throughput MVP epigenotyping method was recently developed based on the GOOD assay (Tost et al. 2003). This assay allows for accurate discrimination of methylation levels that differ by 5% or more. Furthermore, MALDI-MS is a relatively inexpensive method that offers a high degree of automation and integration and that has no requirement for sample purification. Assays for 231 MVPs in 77 amplicons, including all those that displayed differential methylation profiles between different tissue types or inter-individual variability, were designed and analysed in a triplex format (i.e., methylation levels at three independent CpG sites are analysed in one assay). A subset of 11 MALDI-MS assays is compared with the corresponding ESME measurements in the accompanying figure.
Comparison of Methylation Values Measured in Five Tissues and Eleven Amplicons Using MALDI-MS and ESME Analysis of Directly Sequenced PCR Products
Comparison of Methylation Profiles with Independent Gene Expression Data
The primary function of epigenetic modifications is to modulate gene expression: a specific combination of epigenetic modifications at regulatory elements, notably promoters and enhancers, influences the transcriptional state of a gene (reviewed in Bird 2002). In many cancers, aberrant epigenetic modifications occur within CpG islands that overlap promoters (some of which are candidate tumour suppressors), which is thought to result in aberrant transcription of the cognate gene, thus contributing to tumour progression.
We compared the amplicon methylation profiles with the human genome expression patterns available from the Genomics Institute of the Novartis Research Foundation Gene Expression Atlas database (http://expression.gnf.org). This publicly available database contains whole-genome mRNA expression data obtained by Su et al. (2002) using human U95A Affymetrix microarray chips. We calculated a median methylation value for each amplicon (see Materials and Methods). As mentioned above, the methylation profiles displayed a bimodal distribution, with more than 90% of the amplicons being either hypomethylated (median methylation of 30% or less) or hypermethylated (median methylation of 70% or greater). Therefore, to perform the analyses we divided the amplicons into two categories: hypomethylated (methylation less than 50%) and hypermethylated (methylation greater than 50%) (see Materials and Methods). We then compared the range of expression values associated with hypomethylated amplicons with those of hypermethylated amplicons. Most genes on the U95 microarray are represented by multiple probes, and, in a few cases, contradictory expression values were obtained for the same gene, in which case the gene was excluded from our analyses. Analyses were performed for liver, lung, and prostate samples only, since appropriate Gene Expression Atlas data were unavailable for the other tissues. For prostate and liver, a significant difference was found between expression levels associated with hypomethylated versus hypermethylated upstream amplicons: hypomethylated upstream amplicons correlated with a wide range of expression levels whereas hypermethylated upstream amplicons correlated with a lack of expression (p < 0.0001 for prostate and p < 0.01 for liver). The intragenic amplicons did not show any correlation between methylation and expression levels (p > 0.3 for both prostate and liver). A list of all upstream amplicons included in the analysis is given in Table S2.
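The grouping used in this comparison (amplicons split at 50% median methylation, followed by a comparison of the associated expression values) can be sketched as below. The statistical test is not named in this section, so the rank-based Mann-Whitney U statistic shown here is an assumption chosen purely for illustration, as are the function names and the toy numbers.

```python
from statistics import median


def split_by_methylation(pairs, cutoff=50.0):
    """Split (median_methylation_percent, expression) pairs at the cutoff.

    Values exactly at the cutoff fall into neither group, mirroring the
    'less than 50%' and 'greater than 50%' wording of the text.
    """
    hypo = [expr for meth, expr in pairs if meth < cutoff]
    hyper = [expr for meth, expr in pairs if meth > cutoff]
    return hypo, hyper


def mann_whitney_u(a, b):
    """U statistic for group a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Toy (methylation %, expression) pairs for upstream amplicons in one tissue.
toy = [(5, 900.0), (10, 150.0), (20, 2400.0), (85, 20.0), (90, 35.0), (95, 10.0)]
hypo, hyper = split_by_methylation(toy)
print(median(hypo), median(hyper))
print(mann_whitney_u(hypo, hyper), "of", len(hypo) * len(hyper), "pairs")
```

A U value close to the total number of pairs indicates that expression in the hypomethylated group is almost always higher; converting U to a p-value would require a statistics library or a permutation procedure, which is left out of this sketch.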
Comparison of DNA Methylation with Gene Expression
For the lung samples there was no significant correlation between expression and methylation state for amplicons within the upstream or intragenic categories (p > 0.3 for both categories). Although the lung data show the same trend as the prostate and liver data, the lung hypomethylated data contained a number of outlier data points representing very high expression values (shown as unfilled circles). The overall trend of the data suggests that these outliers may be artefactual, but nothing indicates that they are not real. They were sufficient to influence the analysis such that we could not find a significant difference in expression between hypo- and hypermethylated lung genes. If the data points are real, the lack of correlation for the lung samples may be due to inconsistencies within the expression or methylation datasets for lung. Alternatively, there may be additional regulatory elements that influence the expression state of the analysed genes in the lung.
Overall, the findings are consistent with a model in which the DNA methylation profile of the upstream region of the gene is an informative indicator of the expression of the cognate gene, specifically, in which hypermethylation within the upstream region is associated with transcriptional silencing. Furthermore, the data also suggest that epigenetic modifications within the upstream regions influence the transcriptional state of a significant number of the genes within the MHC. This is supported by the study of Jackson-Grusby et al. (2001), in which they employed homogeneous cultures of primary mouse embryonic fibroblasts and used the Cre-loxP system to conditionally inactivate Dnmt1, an enzyme that methylates DNA. They found that in the absence of Dnmt1, several mouse MHC class I genes showed altered expression profiles.
One of the principal challenges in the post-genomic era is to provide a holistic view of genome function, a challenge which is currently being addressed by several large-scale studies of the transcriptome, proteome, metabolic networks, and haplotype maps. The HEP is therefore timely, since DNA methylation is an indispensable part of the genome's regulatory mechanisms. Here we have described the pilot study for the HEP, DNA methylation profiling of the MHC region, which is the first systematic large-scale study of methylation profiles at the sequence level within a multi-megabase region of the human genome. For this project, we developed an integrated pipeline for high-throughput methylation analysis using bisulphite DNA sequencing, MVP discovery, and epigenotyping by MALDI-MS, and created an integrated database (http://www.epigenome.org) for public access to the data generated by the study. The results from the pilot study demonstrate that a significant proportion of the analysed loci within the MHC show tissue-specific methylation profiles, and inter-individual methylation differences are common. Furthermore, the tissue-specific differences in DNA methylation suggest that epigenetic mechanisms are involved in the use of alternative transcriptional start sites. We have also shown that the generated methylation data allow the identification of MVPs that can be typed with high quantitative resolution and sensitivity using MALDI-MS, providing a tool for large population-based studies and for diagnosing diseases in the future.
The study reported here lays the foundation for the HEP, which aims to analyse the methylation state of the regulatory regions of all annotated genes in most major cell types and their diseased variants. In the first phase, which is well underway, we are analysing the DNA methylation profiles of over 5,000 amplicons (representing a 20-fold scale-up relative to the pilot HEP study reported here) associated with nearly all the annotated genes (approximately 3,000) on human Chromosomes 6, 13, 20, and 22. The excellent genomic annotation available for these four chromosomes, e.g., high-quality transcript information and location of SNPs, will enable us to perform comprehensive analyses linking the epigenetic information gained from the HEP with the underlying genetic information. Samples from over 40 different individuals representing 20 tissues will be used in the study.
The resulting data will generate a map that complements other large-scale efforts that are linking our knowledge of gene sequence and cellular phenotypes: studies involving DNA sequencing, SNPs, histone modifications, and transcriptome and proteomic analyses. The epigenome map will be invaluable for understanding gene regulation and the interactions between genes in normal and disease states. It will offer new explanations in well-studied areas such as cancer research, and will also provide a basis for novel approaches to research on environmental effects, nutrition, and ageing (Eckhardt et al. 2004). The HEP also promises to provide DNA methylation markers for disease states, and new targets for drug development and diagnostic applications based on DNA methylation research are already emerging (Cairns et al. 2001). Current efforts to target the epigenomic machinery of cells with drugs have global effects (Besterman and McLeod 2000; Lubbert 2000; Munster et al. 2001), and more refined approaches will become possible with accumulating knowledge in the new field of epigenomics.