Allele-specific DNA methylation discovery by RRBS is reproducible
To discover differential methylation on homologous chromosomes, we used reduced representation bisulfite sequencing (RRBS) 
, which can be used to measure the DNA methylation state in a subset of the genome in many samples. Because RRBS uses bisulfite treatment, it detects both 5-methylcytosine and 5-hydroxymethylcytosine 
. Thus, the allelic differences identified by the method can be differences in 5-methylcytosine or 5-hydroxymethylcytosine. However, alleles that differ in the ratio between levels of 5-methylcytosine and 5-hydroxymethylcytosine are not detectable. We refer to the combination of these marks as DNA methylation.
We used the Illumina Genome Analyzer IIx (GAIIx) to sequence 36 base pair ends of bisulfite-treated MspI digestion fragments ranging in length from 40 to 120 bp. For each sample, we assayed approximately 1 million CpG dinucleotides with at least 10 sequencing reads, and calculated the percent of reads that are methylated at each CpG. This type of deep sequencing also identifies DNA sequence variants in the fragments for which we are measuring methylation (see Materials and Methods
). Detecting these SNPs in the bisulfite sequence reads allows direct determination of whether a particular allele is found in cis
with methylated or unmethylated CpGs (). Bisulfite treatment obscures C to T SNPs in forward strand reads and G to A SNPs in reverse strand reads, because a C to T mismatch in forward strand reads could represent an unmethylated cytosine and cannot be unambiguously called a SNP. We do not analyze possible C to T SNPs in our sequence reads because they cannot be identified. These SNPs account for approximately 30% of human SNPs; thus, most SNPs remain observable.
Discovery of ASM by reduced representation bisulfite sequencing.
We first measured DNA methylation on different alleles in the human embryonic stem cell line H1 
. The majority of alleles show no cis
association with DNA methylation. At a 5% false discovery rate (FDR), 5.4% (1,340 of 24,979) of autosomal heterozygous SNPs are associated with allele-specific methylation (ASM). In total, 1,340 SNPs are associated with ASM at 1,574 CpGs. Because a SNP can be linked to multiple CpGs and multiple SNPs can be linked to the same CpG, we were able to identify 1,937 ASM events in this cell line (all ASM data can be found in Dataset S1
; Table S1
provides a summary of ASM events in each sample, including the number of heterozygous SNPs called). Of the 1,937 ASM events, 573 (29.6%) involve a SNP that mutates a CpG. These cases represent a trivial mechanism for generating ASM because there is no longer a CpG to methylate when the CpG-disrupting allele is present; however, they are functional variants in that they result in a change in the methylation status of a locus.
The average difference in percent methylation between alleles is 59.8% for ASM events that do not involve CpG disrupting SNPs (). Less than 8% of ASM events exhibit greater than a 95% difference in percent methylation, which suggests that most changes in genotype induce more subtle changes in the frequency of DNA methylation than complete reversal of CpG methylation. These quantitative differences in percent methylation between alleles are highly reproducible. In two separate growths of the H1 cell line, the correlation of the difference in percent methylation between the reference and variant allele for all ASM events in both replicates is 0.98 (). These results show that measurements of the quantitative differences in percent methylation associated with SNPs are highly reproducible.
ASM associates with X chromosome inactivation
One of the two X chromosomes in women are randomly silenced in each cell during development, and DNA methylation at many sites has been shown to exist specifically on alleles from the inactive X 
. To assess our ability to observe inactivation of the X chromosome and thus validate the ASM approach taken here, we analyzed ASM in four clonal cell lines derived from single cells of the EBV-transformed lymphoblastoid line GM12878. Gene expression patterns indicate that two of these cell lines silence the paternal X chromosome, while the other two cell lines silence the maternal X chromosome (Table S2
; K.S.K. and H.F.W., manuscript in preparation). We observed reproducible patterns of ASM on the X chromosome (Figure S1
). Clones with an inactivate X chromosome from the same parental lineage exhibited the same allelic bias for every ASM event on the X chromosome (e.g., either the variant allele was more methylated in each clonal line or the reference allele was more methylated in each clonal line). When comparing any maternal X-inactive clone with any paternal X-inactive clone, an average of 59.7% of ASM events on the X chromosome switch the allele that is most often methylated, which is consistent with the pattern of inactivation. Of the genes nearby the ASM events that are consistent with inactivation, six were previously assayed for allelic expression 
and all six were identified as being inactivated. Because inactivation is not complete across the chromosome, we do not expect all ASM to switch the most often methylated allele when the inactive X chromosome is switched. These results demonstrate that our method is sensitive to X inactivation in single cell clones; however, many ASM events on the X chromosome are driven by allele status rather than by parental origin.
ASM is prevalent in a three-generation family
To determine the prevalence of ASM throughout the human genome and to deconvolute whether it is dependent on DNA sequence variants or parental origin imprinting, we applied RRBS to DNA extracted from leukocytes in fresh blood from six members of a three-generation family as well as two unrelated individuals (). We collected data for more than 950,000 CpGs that had at least ten sequencing reads in each family member. Overall, 859,531 CpGs had at least 10 sequencing reads in all samples and 668,545 had at least 20 sequencing reads in all samples. Leukocytes from fresh blood are comprised of a heterogeneous population of cells, consisting primarily of neutrophils, lymphocytes and monocytes. It has been shown that the relative proportion of each cell type does not considerably affect DNA methylation levels 
. Consistent with this observation, we found that the patterns of DNA methylation are strikingly similar between the family members. The average correlation in percent methylation of all autosomal CpGs assayed between any two family members is 0.985. All of the samples assayed were highly similar, indicating that there were no systematic biases in DNA isolation, library preparation or DNA sequencing.
Methylation patterns recapitulate family relationships.
Analysis of the CpGs that vary within the eight individuals reveals that DNA methylation patterns recapitulate relatedness. We clustered individuals based on the methylation status of the top 237 most varying autosomal CpGs across all samples. The resulting tree, based on hierarchical clustering, is shown in . The two unrelated individuals are separated from the family members and serve as an out-group. Genetically unrelated pairs of family members (e.g., maternal grandmother and father, and father and mother in the middle generation) are also separated in the tree. The closest relationships in DNA methylation patterns exist between the grandmothers and their children and the methylation patterns in the grandchildren are mixtures of their parents' methylation patterns. These results indicate that DNA methylation levels can capture the relatedness of individuals, suggesting that there is a strong genetic component to DNA methylation levels.
We identified an average of 1,702 ASM events for each member of the family. When individuals are clustered on CpGs that exhibit ASM in at least one individual (), the same relationships are observed as those shown in . To look at features of CpGs exhibiting ASM, we focused on 2,391 autosomal ASM events that we identified in at least two of the six family members. 42.6% (1,018) of these ASM events are SNPs that mutate a CpG. The high prevalence of CpG-disrupting SNPs is not due to higher sequence coverage (Table S3
). Our analysis showed that SNPs that disrupt CpGs tend to lie in intergenic regions that are not CpG islands. Only 21.9% of CpGs that are disrupted by SNPs are found in CpG islands, while 65% of all CpGs queried by RRBS reside in CpG islands (P
; Fisher's exact test). CpGs that are disrupted by SNPs are over-represented in intergenic regions, as 44.1% are at least 2 kilobase pairs (kb) away from the nearest gene, compared to 22% of all CpGs assayed by RRBS that are at least 2 kb away from the nearest gene (P
; Fisher's exact test). SNPs that mutate a CpG are often linked to the methylation status of other CpGs in the immediate vicinity. More than 45% of CpG-mutating ASM SNPs exhibit ASM associated with at least one other CpG within 36 base pairs, which represents a 5.5-fold increase in the chance that a CpG with a CpG-disrupting SNP in the same read shows allele specificity (P<2.2×10–16; Fisher's exact test). These data demonstrate that, while CpG-disrupting SNPs alter DNA methylation at their particular position due to a DNA sequence difference, they also influence the methylation of nearby CpGs. This is further evidence that there is a strong genetic effect on the regulation of DNA methylation.
ASM events that do not involve a mutated CpG are under-represented in CpG islands and are most often located in intergenic regions, as were CpGs disrupted by SNPs. Only 49.4% of CpGs that exhibit ASM are present in CpG islands, even though 65% of CpGs with nearby SNPs assayed by RRBS are in islands (; P
: Fisher's exact test). shows the distribution of gene features for CpGs that exhibit ASM. The majority of ASM CpGs are present in intergenic regions, even though most CpGs with nearby SNPs assayed by RRBS are present in promoters, first introns or first exons. There is a 3.5-fold increase in the fraction of CpGs that reside in intergenic regions (P
; Fisher's exact test). The overall trends of where ASM events are located in the genome are not the result of differences in coverage between these categories of SNPs (Table S3
). The tendency for ASM CpGs to be located outside of CpG islands and further away from genes may indicate that ASM events occur in regions under less selective pressure.
ASM is often found in regions that are outside of CpG islands, away from genes, and that exhibit low evolutionary conservation.
Because ASM tends to occur in regions without clear regulatory function, we looked at evolutionary constraint at these sites by analyzing mammalian alignments. ASM events involving SNPs that disrupt CpGs and SNPs that do not disrupt CpGs were found more often outside of CpG islands and in intergenic regions. We therefore assayed both sets and determined the level of evolutionary conservation in each group. We found significantly less conservation in CpG-disrupting ASM events (P
0.008; Wilcoxon test) and non-CpG-disrupting ASM events (P
0.002; Wilcoxon test) compared to all assayed regions that contained SNPs. To determine if this reduction in conservation could be explained by the overabundance of ASM events in intergenic regions, we repeated the conservation analysis using only SNPs in intergenic regions. We found that even within intergenic regions, SNPs associated with ASM were significantly less conserved than all intergenic SNPs assayed (). This was true for both CpG-disrupting ASM events (P
0.009; Wilcoxon test) and non-CpG-disrupting ASM events (P
0.007; Wilcoxon test). These data are consistent with the hypothesis that there is evolutionary constraint on DNA methylation levels, as genetic variants that affect DNA methylation tend to lie in regions under less selective pressure.
Genotype influences allelic methylation more often than parental origin
By carrying out RRBS in a family, we were able to follow SNPs through each generation and determine whether ASM is associated with the identity of the SNP or with parental origin. To distinguish these modes of transmission, we first had to identify SNPs with unambiguous inheritance in the pedigree that switched parental origin (e.g., those SNPs that have a maternal origin in one generation and a paternal origin in the next). We then identified those SNPs that were associated with ASM in both parent and child. We identified 432 autosomal ASM events in which the variant allele switched parental origin in the family. We then eliminated the 341 of these ASM events that were due to SNPs that mutated a CpG, as these are necessarily associated with the identity of the SNP. Only 7 (7.7%) of the remaining 91 ASM events displayed a methylation pattern that followed parental origin. The methylation patterns of the other 84 ASM events depended on the DNA sequence and an example is shown in . These results indicate that, in most cases, an allele's sequence plays a larger role in determining DNA methylation than an allele's parental origin.
Inheritance of ASM in the family.
While the majority of ASM events depend on the underlying allelic sequence, we found seven autosomal ASM events that are consistent with a genetic imprinting mode of transmission. These events occur in five distinct loci, most of which are intergenic. The introns of OVOS1 and TRAPPC9 both contain parent-of-origin ASM events, while the other three loci are at least 20 kb from the nearest gene. shows ASM of TRAPPC9 (also know as NIBP), which is involved in neuronal NF-κB signaling 
and is thought to have a maternal effect on height 
. The TRAPPC9 SNP mutates a CpG, which exhibits ASM that follows the sequence of the allele as expected. However, the CpG located 32 bp away is unmethylated on the variant allele in the mother (which came from her father) and is methylated on the variant allele in both sons. The RRBS data suggests that parent-of-origin ASM may extend further in this region, as CpGs extending 65 bp upstream and 125 bp downstream of the SNP are all close to 50% methylated in every member of the family. The transcription start site of KCNK9, a known maternally imprinted gene 
, resides downstream of TRAPPC9 more than 350 kb away from the SNP. It is possible that methylation of a TRAPPC9 intron is involved in silencing KCNK9. The TRAPPC9 region represents a candidate for maternal imprinting, which provides supporting evidence for the locus' role in maternal effect on height.
Because genotype-dependent differences in DNA methylation were far more prevalent than parental origin-dependent differences, we built a general model of DNA methylation and genotype that was able to incorporate information from homozygous individuals. We built a linear model of the relationship between DNA methylation and genotype that assumes that a CpG's methylation level in a heterozygous individual is halfway between the level of methylation in an individual homozygous for one allele and an individual homozygous for the other allele. When the model is constructed for the top 2,900 most variable CpGs in the family that have detectable SNPs nearby, 602 (20.8%) showed significant genotype contributions and 370 (12.8%) were still significant after multiple hypothesis correction. Even with a small sample size of six, genetic association with DNA methylation is detectable for a large number of loci. These results, together with ASM data, suggest that DNA sequence plays a major role in determining inter-individual differences in DNA methylation levels.
ASM is also observed in unrelated individuals
Some of the genetically linked ASM events found in the family may be dependent on genetic background and specific to the particular family studied, while other ASM events may be common to many individuals. To determine the persistence of ASM in different genetic backgrounds, we analyzed two unrelated individuals as well as the lymphoblastoid cell line GM12878. We could test 40 of the 84 sequence-dependent autosomal regions that we observed in the family because at least one unrelated individual was heterozygous at the same SNP as that observed in the family. 30 of those 40 regions (75%) were found to be allele-specific in the unrelated individuals as well. In every case, the more often methylated allele in the family was also more often methylated in the unrelated individuals. We also found that the CpGs in the TRAPPC9 locus discussed above were also methylated between 39% and 57% in the unrelated individuals, which is consistent with parental origin imprinting at this locus. These results indicate that many ASM events that we observed in the family are present in other genetic backgrounds.
For a broader view of the influence of genotype on DNA methylation, we used the linear model of DNA methylation and genotype discussed above to predicted DNA methylation levels in the two unrelated individuals. In both individuals, the model was able to predict methylation levels with high accuracy based on each individual's genotype (R2
0.82 and 0.83). The ability of the linear model, which was trained on the family, to predict DNA methylation in unrelated individuals accurately shows that the relationship between DNA sequence and methylation levels is consistent among individuals.
While the influence of genotype could be observed across individuals in the same tissue, we sought to determine whether the impact of DNA sequence on DNA methylation could be observed in different tissues from the same individual. We analyzed DNA extracted from kidney and skeletal muscle of a third unrelated individual. There were 1,521 ASM events in the kidney sample and 1,332 ASM events observed in the skeletal muscle sample. Of the 984 ASM events observed in the kidney with sufficient coverage in both samples, 834 (84.8%) were also allele-specific in skeletal muscle. For all 834 overlapping ASM events, the more often methylated allele was identical in the kidney and skeletal muscle samples. If CpG disrupting SNPs are removed from the analysis, there were 931 ASM events in the kidney sample and 790 events observed in the skeletal muscle sample. Of the 614 ASM events observed in the kidney with sufficient coverage in both samples, 445 (72.5%) were also allele-specific in skeletal muscle. These results indicate that allelic differences in DNA methylation can be observed across different tissues in the same individual at a high frequency.
ASM associates with gene expression differences
To determine whether ASM events are indicative of functional gene regulatory differences, we compared ASM with allele-specific gene expression (ASE; T.E.R. et al., manuscript in preparation) in the original lymphoblastoid line GM12878. Of the autosomal genes that have an ASM event within 5 kb of the transcription start site, we found that 21.7% (18 out of 83) of the autosomal genes with sufficient RNA-seq read depth (greater than 25 reads that cover SNPs) show ASE (Table S4
). This represents more than a two-fold enrichment (P
; hypergeometric test) over an overall rate of 9.2% (594 out of 6464) of ASE for all autosomal genes with sufficient read depth. Data for the gene encoding acid alpha-glucosidase (GAA) are summarized in . GAA, which is involved in the degradation of glycogen to glucose and is associated with Pompe's disease 
, showed 3% methylation on the maternally inherited copy of chromosome 17 and 89% methylation on the paternally inherited copy; RNA-seq showed that 76.7% of the transcripts came from the maternal copy of chromosome 17. TRAPPC9 and OVOS1 did not harbor SNPs with sufficient read depth in the RNA-seq data to determine allele-specificity. These results indicate that genes exhibiting ASM are enriched for gene expression differences between alleles.
Allele-specific gene expression overlaps with ASM.
Independent validation of ASM and ASE
We sought to validate the next generation sequencing results by performing Sanger sequencing on a small number of loci. To investigate ASM, we performed bisulfite PCR, cloning and Sanger sequencing of four loci from family member C (). All four loci exhibited ASM in the Sanger sequencing data. We also observed that the allele-specificity of DNA methylation extended beyond the region assayed by RRBS. In two cases, (chr21:4415835 and chr8:141109575) allele-specificity was found across the entire region. In the other two examples, half of the assayed region exhibited allele-specificity, while the other half of the region was similar between the two alleles. These results indicate that ASM identified in the RRBS data is also observed with Sanger sequencing and that patterns of ASM are complex.
ASM extends past regions assayed by RRBS.
To look further at extended ASM, we analyzed SNPs within 200 bp of each other and found that adjacent SNPs were often concurrently associated with ASM. For example, in family member A 655 ASM SNPs have another SNP detected within 200 bp and 455 (71.4%) of these SNPs also exhibit ASM. On average across the family members, 76.3% of SNPs adjacent to ASM SNPs were also associated with ASM themselves. These results show that ASM can be seen across extended regions.
To validate the next generation ASE findings, we cloned and Sanger sequenced genomic DNA and cDNA from GM12878 for six loci. Using the ratio of alleles observed in the genomic DNA as a background, we found that five (GAA, KCNQ10T1, HLA-DPB2, LOC654433 and LOC253039) loci exhibited significant allele biased expression with the bias coming from the expected parental chromosome. ZNF132 showed a bias in the expected direction (69% from the allele predicted to be higher expressed); however, the number of reads (16), was too low to call the bias significant. Both ASE and ASM validation results show that the allele-specificity observed with our next generation sequencing approaches are replicated with an independent technique.