In order to properly comprehend the epigenetic dysregulation that occurs during the course of disease, there is a need to characterize the epigenetic variability in healthy individuals that arises in response to aging and exposures, and to understand such variation within the biological context of the DNA sequence. We analyzed the methylation of 26,486 autosomal CpG loci in blood from 205 healthy subjects, using three complementary approaches to assess the association between methylation, age or exposures and local sequence features, such as CpG island status, repeat sequences, location within a polycomb target gene or proximity to a transcription factor binding site. We clustered CpGs using (1) unsupervised recursively partitioned mixture modeling (RPMM) and (2) bioinformatically-informed methods, and (3) also employed a marginal model-based (non-clustering) approach. We observed associations of methylation with age and with hair dye use, where the direction and magnitude were contingent on the local sequence features of the CpGs. Our results demonstrate that CpGs are differentially methylated dependent upon the genomic features of the sequence in which they are embedded, and that CpG methylation is associated with age and hair dye use in a CpG context-dependent manner in healthy individuals.
Epigenetics can be defined as stable and heritable changes that either alter or have the potential to alter gene expression without changing the DNA sequence.1 DNA methylation is the most commonly studied epigenetic modification in humans due to its stability and amenability to measurement. The covalent attachment of a methyl group to cytosine at the 5-carbon of the pyrimidine ring occurs primarily in the context of CpG dinucleotides.2 CpGs are disproportionately concentrated in enriched regions referred to as CpG islands (CGI), which tend to be differentially located in the promoter regions of genes. Methylation of CGIs in gene promoter regions is typically associated with transcriptional repression. CGIs are generally not methylated in non-pathologic cells,2 although exceptions exist, as in the case of X-inactivation, imprinting3 or tissue differentiation.4–8 However, 70–90% of all CpGs in the human genome are not situated within CGIs and are typically methylated under normal conditions, helping to maintain genomic stability9,10 and suppress expression of transposable elements.11
Although the term epigenetics was originally coined in 1942,12 this discipline has burgeoned over the last three decades with the major advances initially found in cancer biology. However, in recent years, the study of the epigenetics of aging has emerged as a novel field, seeking to discern the epigenetic contribution to the highly complex process of aging in the context of the environment, broadly conceived.1 Alterations in DNA methylation have been associated with aging-related diseases, including insulin-resistant diabetes mellitus (type 2),13 Alzheimer disease,14 cardiovascular disease15,16 and cancer.3 Part of the key to understanding how altered DNA methylation patterns associate with aging is to determine whether they occur in response to endogenous or exogenous environmental exposures, are preprogrammed as a course of life, are primarily a stochastic process, or if the patterns of DNA methylation that develop over time in a tissue reflect an amalgamation of all of these inputs.
Gaining a better understanding of the potential for local sequence features and genomic context to influence the methylation state of CpGs will enhance our comprehension of how DNA methylation is regulated under normal conditions and becomes altered by aging and exposures and will provide important clues to the role of epigenetics in pathogenesis. While the methylome of the peripheral blood mononuclear cell has been described in reference 17, enhancing our ability to understand the normal state of blood cells, we have focused upon the role of aging and the environment in explaining inter-individual differences in DNA methylation. Considerable epidemiologic and basic research is currently being conducted investigating the patterns of DNA methylation in peripheral blood for a plethora of pathological conditions thought to be related to altered epigenetic states.18 Hence, it is imperative that we further define the intrinsic factors, such as sequence context, affecting DNA methylation in the non-pathologic state. We have previously demonstrated that CpG loci can be clustered in DNA extracted from the blood of healthy individuals according to their methylation patterns and that the extent and direction of correlation between CpG methylation and age is dependent upon CGI context.19 Here, we have extended this research to the evaluation of 26,486 autosomal CpG sites for methylation in blood DNA from 205 healthy subjects to investigate the relationships among patterns of DNA methylation and age, gender, environmental exposures and sequence features in healthy individuals. We demonstrate that intrinsic biological characteristics, such as local sequence features surrounding CpGs, may interact with aging and the environment to influence DNA methylation, and suggest an appropriate approach and methodology for assessing associations of methylation with aging and environmental exposures in healthy people.
To assess the complex relationship of DNA methylation with age, gender and various exposures, we used DNA from blood samples of 205 healthy individuals (a description of the study population is provided in Sup. Table S1) and high-density methylation array technology to analyze CpG methylation via three complementary approaches. We first used a data-driven approach, clustering CpGs (as opposed to subjects) by relative methylation across all subjects using an unsupervised model-based hierarchical clustering algorithm. Our second approach applied classes derived externally from bioinformatic considerations, fitting the methylation data into these bioinformatic classes. Finally, to supplement the CpG cluster-based analyses, we developed marginal models to further assess the interactions between aging/exposures and local DNA sequence features with regard to methylation without needing to cluster the CpGs. An overview of our analytic strategy is presented in Figure 1.
Previous work by our group identified a correlation of CpG methylation with age in peripheral blood DNA from healthy individuals, the magnitude and direction of which depended upon the CGI-status of the CpGs.19 We sought to expand on these findings with a larger pool of healthy study subjects using a denser methylation array, and to examine relationships of CpG methylation with several well-characterized exposures and potential cancer risk factors, while taking into account variability in propensity for methylation among CpGs. CpG loci were clustered by unsupervised recursively partitioned mixture model (RPMM),20 based on methylation (β) Z-scores, into 32 methylation classes. RPMM was chosen for its efficient and effective performance in clustering high-dimensional methylation array data (Sup. Analysis S1). The resultant CpG classes demonstrated interclass variability in their degree of methylation (Fig. 2; see Sup. Fig. 1 for additional detail).
To evaluate the associations between methylation and aging/exposures, Spearman's rank correlation coefficient with each respective exposure was calculated for the average methylation within each class (32 such class-specific averages per subject). This approach was taken based on the notion that CpGs within classes should possess similarities that influence the direction and magnitude of methylation. Overall association of class methylation and aging/exposures was assessed using two separate omnibus tests (Table 1), described in detail in the Methods section. CpG class methylation was significantly correlated with age (P1st difference = 0.007; Psupremum < 0.001), cigarette pack-years restricted to ever-smokers (P1st difference = 0.004; Psupremum = 0.006) and lifetime ever-use of hair dye (P1st difference = 0.003; Psupremum = 0.01). After controlling for potential confounding factors using multiple linear regression (Table 1), there was still an overall association of CpG class methylation with age, adjusted for gender (P1st difference = 0.006; Psupremum < 0.001), and with hair dye use, adjusted for age and gender (P1st difference = 0.006; Psupremum = 0.06). After adjusting for age and gender, the association between CpG methylation class and pack-years was borderline significant by the “1st difference” test (P1st difference = 0.06), which is designed to account for structure among the class methylation-exposure relationships, but was non-significant by the omnibus test based on maximum absolute value (Psupremum = 0.41).
When considering individual CpG classes (Fig. 2), after adjusting for potential confounding, classes with relatively greater extents of methylation (denoted by blue dots) were observed to have an inverse relationship between methylation and aging, although most were non-significant (controlling for multiple comparisons); while in the relatively unmethylated CpG classes (denoted by yellow dots), methylation tended to be positively associated with age, although none were individually significant (controlling for multiple comparisons). Hair dye use, arsenic and tanning lamp use also displayed similar patterns to that of age and methylation by class but no significant associations were observed for any individual class.
Methylation array data were available for peripheral blood from a second population of 92 healthy subjects (validation subjects). To assess the robustness of the unsupervised classes, CpG loci were again clustered into 32 classes by RPMM using the methylation data from the validation subjects, and class membership (CpG loci) was compared by cross-tabulation of the CpG classes derived from the primary study subjects and validation subjects. There was substantial concordance of CpG loci between the two sets of classes, indicating that the unsupervised clustering of CpGs by RPMM has a high level of reproducibility in blood from healthy subjects (Sup. Fig. S2).
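The cross-tabulation of class membership described above can be illustrated with a minimal sketch. The function names, the toy labels and the summary measure (the fraction of CpGs falling in the most common validation class of their primary class) are assumptions made for illustration; the study reports the comparison graphically (Sup. Fig. S2) and does not specify a single concordance statistic.

```python
from collections import Counter


def cross_tabulate(primary_labels, validation_labels):
    """Count CpGs for every (primary class, validation class) pair.

    Each argument maps a CpG identifier to its class label.
    Only CpGs present in both mappings are counted.
    """
    shared = primary_labels.keys() & validation_labels.keys()
    return Counter((primary_labels[c], validation_labels[c]) for c in shared)


def modal_concordance(table):
    """Hypothetical summary: share of CpGs that land in the most common
    validation class of their primary class."""
    best = {}
    total = 0
    for (primary_class, _validation_class), n in table.items():
        best[primary_class] = max(best.get(primary_class, 0), n)
        total += n
    return sum(best.values()) / total if total else 0.0


# Toy labels for illustration only (not study data).
primary = {"cg1": "LLLLL", "cg2": "LLLLL", "cg3": "RRRRR", "cg4": "RRRRR"}
validation = {"cg1": "LLLLR", "cg2": "LLLLR", "cg3": "RRRRL", "cg4": "LLLLR"}
table = cross_tabulate(primary, validation)
```

Because the class labels produced by two independent clusterings need not correspond by name, the sketch compares membership rather than labels.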
Next, we sought to validate the observed methylation-exposure associations. The two sets of CpG classes, one derived via RPMM from the primary study subjects (primary classes) and one from the validation subjects (validation classes) were applied (i.e., used to define class-specific methylation averages) to each study population, resulting in four comparisons: primary classes x primary subjects, primary classes x validation subjects, validation classes x primary subjects and validation classes x validation subjects. For each of the 32 classes in each comparison, Spearman's correlation and multiple regression were used to evaluate the association between methylation and exposures that were available in both data sets (age, gender, smoking and alcoholic drinks per week). Finally, omnibus tests of overall association of each exposure with class methylation were performed using a supremum test statistic, described in detail in the Methods section. Of the aforementioned exposures, only alcohol consumption (p = 0.001) and race/ethnicity (p < 0.001) differed by study population (Sup. Table S1) with the validation subjects less likely to be non-drinkers and more likely to consume >6.5 drinks per week and more likely to be of a racial/ethnic background other than Caucasian, although the vast majority identified as non-Hispanic Caucasians.
When applied to the same study population, the two sets of CpG classes, derived respectively from the primary study subjects and validation subjects, were similar with regard to correlation of class methylation and exposures, which was sustained after adjusting for potential confounders in the multiple regression models (Sup. Table S2). However, a higher degree of variation was observed for each set of CpG classes when applied across populations, possibly indicative of inherent unaccounted differences between the populations.
To investigate how the genomic context around specific CpG sites may impact the associations between exposures and methylation, we examined variability of the individual unsupervised (RPMM-derived) classes with regard to specific local sequence features. Interclass variability by sequence feature for the CpG loci was observed (Fig. 3). The classes with relatively high levels of methylation had higher proportions of CpGs within non-long terminal repeat (non-LTR) transposable elements, including LINE-1, LINE-2, Alu and mammalian wide-interspersed repeat (MIR) elements. Conversely, unmethylated CpG classes predominately contained loci residing within CGIs and had a higher proportion of CpGs located within 1,000 bases (1 kb) of at least one putative transcription factor binding site (TFBS). There was also variability among classes with respect to percent of CpGs located within a polycomb group (PcG) target gene,21–24 with the frequency of CpG loci associated with PcG targets within classes ranging from 4.0% to 32.8%; five classes had more than 20% of member loci that were associated with PcG targets.
Motivated by the interclass variability by sequence features observed in the unsupervised RPMM-based clustering, we next utilized a bioinformatically-informed classification scheme, subdividing CpG sites by their sequence features to account for intricate interactions between them. Taking into consideration presence in a CGI, PcG target gene, LINE-1, LINE-2, Alu and MIR elements and proximity (≤1kb) to a TFBS, we obtained 41 classes containing at least one CpG based on various combinations of the aforementioned bioinformatic attributes. Classes are denoted by the applicable attributes separated by a “|” (e.g., a class of CpGs located in CGI and LINE-1 element would be symbolized as CGI|LINE1). The distribution of CpG loci by bioinformatically-derived class is presented in Supplemental Table S3. There was an overall significant relationship between age and bioinformatically-derived CpG class methylation (Table 2) by omnibus tests (supremum) of per class Spearman's rank correlation (p = 0.001) and multivariable regression (p = 0.001), adjusting for gender. Dye-use (p = 0.04) and female gender (p = 0.05) were also significantly correlated with class using an omnibus test of Spearman's coefficient but lost significance after adjusting for age and gender and age and dye-use, respectively.
There was a significant inverse association of age and methylation for several individual bioinformatically-derived classes (Fig. 4). These 8 classes (each class is shown in brackets) included [CGI|MIR], [PcG target|TFBS], [PcG target], [TFBS], [PcG target|MIR|TFBS], [MIR|TFBS], [CGI|PcG target|LINE2] and [LINE2|TFBS]; all of which had relatively higher degrees of average methylation. No other exposures were associated with methylation of individual bioinformatically-derived classes (after controlling for multiple comparisons).
In response to recent literature suggesting a role in transcriptional control and differentiation,25,26 we conducted a subanalysis of CpG island shores (defined as sequences within 2 kb distance of CGI). However, we found no association between their methylation and exposures (Sup. Table S4) and thus have not further included them in our analyses.
Building on the bioinformatically-derived classes, a model-based approach (independent of CpG clustering) was employed to further assess the relationship between exposures, sequence features and methylation. To do this, we developed separate marginal models for each exposure of interest, adjusted for potential confounders and examined the main effect and interactions of each exposure and sequence feature with respect to DNA methylation. The models substantiate the inclusion of the sequence features used in the bioinformatically-derived classification of CpG loci, showing them to each be independently associated with methylation at p < 0.00001, with the exception of PcG targets, which is non-significant in all but one of the models (Sup. Table S5–S14).
A summary of the results from the marginal models for the association of exposures and methylation, overall and by sequence features of the CpG loci, is presented in Table 3 (the individual models are presented in their entirety in Sup. Table S5–S14). Age was inversely associated with overall average methylation (p = 0.002). When considering CpG loci by sequence feature, there was no significant effect of age (adjusted for gender) on methylation of CpGs located in CGIs, although there was a significant inverse interaction (Pinteraction = 0.02; Sup. Table S5); however, methylation significantly decreased with age (adjusted for gender) for CpGs associated with all other sequence features, with significant interactions with LINE-2 (Pinteraction = 0.0003), MIR elements (Pinteraction < 0.0001) or PcG target genes (Pinteraction = 0.0003). There was no significant effect of any other exposures assessed, overall or by sequence feature, although there was an interaction between ever-use of hair dye (adjusted for age and gender) and methylation of CpGs in LINE-1 elements (Pinteraction = 0.04; Sup. Table S6) and an interaction between ever-use of tanning lamps (adjusted for age and gender) and methylation of CpG loci located within a PcG target gene (Pinteraction = 0.04; Sup. Table S13).
Epigenetic research in human subjects has been ongoing for decades but has primarily focused on alterations related to cancer. However, in order to properly understand aberrant epigenetic regulation that occurs during the course of disease, the normal methylome must be described. More specifically, there is a need to characterize the epigenetic state in non-pathologic tissues from healthy individuals to identify the variability in the overall profile of DNA methylation across individuals, and to clarify the relationship of that variability with aging or environmental exposures. Elucidation of the methylation patterns of CpG loci embedded in different genomic sequences or proximal to different features will critically inform our comprehension of alterations in epigenetic regulation that occur through pathologic processes and the mechanisms by which these alterations arise.
The study of the epigenetics of aging in healthy individuals is emerging as a novel discipline, seeking to discern the epigenomic changes that occur during the course of life. Early studies of this phenomenon examined candidate loci, such as individual gene promoters and “global” methylation markers, finding increased methylation of many of these specific gene promoters with aging,27–32 while methylation of the “global” markers (e.g., LINE-1, Alu, LUMA, CCGG) decreased,33–36 giving rise to the notion that we lose global methylation with age, while we gain localized promoter methylation.37 In accordance with these earlier reports, we present here evidence, based on a genome-wide approach, of an association between aging and DNA methylation, the magnitude and direction of which is dependent upon the genomic context of the sequence in which the CpG is embedded. This is demonstrated by our marginal model results, which show no effect of age on CGI methylation but a decrease in methylation, overall and for all other sequence features considered, including several repeat sequences, with varying effects. This is additionally corroborated by our previous results in reference 19, which clustered 1,413 CpG loci into methylation classes using blood samples from 30 healthy adult subjects and found the association of methylation with age to be CGI-context dependent. Furthermore, our present results indicate that inter- and intra-genomic differences in methylation acquired with age are more complex than just CGI vs. non-CGI, and instead vary according to biological differences in DNA sequence, as exemplified by the complex interactions observed in our bioinformatically-derived clustering approach.
We also undertook a more thorough examination of the relationship between methylation and environmental exposures experienced by the subjects studied. In doing so, we identified an association between hair dye use and methylation, where ever-use of hair dye was inversely associated with methylation among the more highly methylated unsupervised (RPMM-based) classes and positively associated with methylation in the classes with low methylation and higher CpG island contents. This finding was further supported by our marginal model estimates, which indicate an interaction between use of hair dye and methylation of CpGs in LINE-1 elements. However, using the bioinformatically-informed classification scheme, after adjusting for age and gender, we no longer observe a significant association between methylation and ever-use of hair dye by class. This may suggest that while the bioinformatically-derived classes are meaningful, they either do not fully explain the genomic context which accounts for differences in methylation between CpG loci or are over-parsing CpGs based on bioinformatic features, sacrificing statistical power for detection of associations. While several varieties of hair dyes exist, oxidative (permanent) dyes comprise 80% of the market share in the US.38 The main components of oxidative dyes include primary intermediates and couplers, composed of various forms of arylamines, oxidants and alkalinizing agents.39 A recent review concluded that there is no consistent evidence of genotoxicity from biomonitoring studies of hair dye exposure40 but there are some epidemiologic reports of increased risk of bladder41,42 and hematopoietic cancers43–45 among hair dye users, albeit the literature is conflicting.46 In light of our findings, further studies are indicated to examine the effect of hair dye use on epigenetic endpoints and the impact of these alterations on disease susceptibility.
We found no overall association of class methylation with ever-smoking but there was a borderline association among ever-smokers of pack-years with methylation using the RPMM-based approach, although the direction was contrary to what would be expected and thus further research is required to determine whether this effect is real or spurious. No association was observed for alcohol consumption, arsenic or selenium exposure (measured via toenail clippings). However, it is important to note that although the measured exposures may not be significantly associated with methylation in peripheral blood, they may be affecting methylation in other tissue types not measured in this study. Additionally, in response to evidence that ultraviolet (UV) exposure modifies the immune system,47,48 which could potentially result in altered methylation signatures in peripheral blood, we assessed measures of UV exposure, including ever-use of tanning lamps and lifetime number of painful sunburns. CpG class methylation was not associated with either UV measure, although we did observe an interaction between ever-use of tanning lamps and methylation of CpGs located within PcG target genes, the significance of which is unknown.
A key strength of this study is the employment of three complementary analytic strategies for evaluating the impact of aging and exposures on DNA methylation: (1) unsupervised clustering by recursively partitioned mixture modeling (RPMM), (2) a bioinformatically-informed clustering approach and (3) a marginal-model based analysis. Each of the 3 methodologies used bears its own set of strengths and weaknesses, with each making a positive contribution to the analysis and filling in for potential shortcomings of the others. The bioinformatically-derived clustering approach takes into account intricate interactions between DNA sequence features of the CpGs but is limited in scope to the sequence features that we considered and could potentially over-partition the data. Conversely, the unsupervised (RPMM-based) clustering approach has the capacity to capture variation in methylation due to unknown or poorly-understood features and interactions that would otherwise be unaccounted for since it clusters based on like-methylation patterns rather than specific DNA sequence attributes, although the source of variation may not be as easily interpreted. Additionally, the data-driven RPMM approach suffers from the weaknesses of all 2-stage latent variable approaches, i.e., “double-dipping” where the data are used twice (once to predict the latent variables and once again to assess their associations with other variables/phenotypes). In general, 2-stage approaches provide reasonably unbiased point estimates but can often underestimate standard errors.49 Finally, the addition of the marginal model-based (non-clustering) approach allows us to specifically analyze the interaction of each exposure of interest with each sequence feature. However, this assessment is limited to evaluation of 1st order interactions, whereas the cluster analyses may better capture more complex relationships between aging/exposures, variation in the DNA sequence and methylation.
Our results clearly demonstrate that the genomic context of CpGs is important when assessing associations of methylation with aging or exposures. They also indicate that simple consideration of CpG island status is not sufficient with respect to methylation, but rather that other variations in DNA sequence should be taken into account. Moreover, we have provided additional evidence that DNA methylation is associated with age and novel evidence for an association with hair dye use, each operating in a CpG context-dependent manner. Careful analysis of CpG loci with respect to methylation patterns in response to aging and exposures in healthy individuals, such as we have described here, will help us to gain insight into the mechanics of DNA methylation and epigenetic control. Ultimately, such an understanding of normal epigenetic variation will help to guide future research on the aberrant methylation that occurs during the course of disease, enhancing our understanding of pathologic processes.
The primary study population was composed of 205 healthy subjects with no prior history of cancer who served as shared controls for two case-control studies on bladder and skin cancer, and for whom peripheral blood (buffy coat) was available. Briefly, controls were population-based New Hampshire residents, ages 28–74 years.50 Upon enrollment, consenting subjects underwent personal interviews furnishing sociodemographic and exposure information, and provided a toenail sample used to assess the burden of arsenic and selenium in the body via inductively coupled plasma mass spectrometry.
A second study population of 92 healthy control subjects from a case-control study of head and neck squamous cell carcinoma (HNSCC) was used for cross-validation of the unsupervised CpG clustering (validation subjects), and has also been previously described in reference 51. Population-based control subjects were randomly selected from a larger pool recruited from the greater Boston area, ages 32–86 years. All subjects completed a self-administered questionnaire, providing sociodemographic and exposure information.
Institutional Review Board approval was obtained for sample collection and use of patient data for all subjects included in this study. All subjects provided written informed consent for participation in this study.
DNA was extracted from peripheral blood buffy coats using the QIAamp DNA mini kit (Qiagen, Valencia, CA) according to the manufacturer's recommendation and was subsequently sodium bisulfite converted using the EZ DNA methylation kit (Zymo Research, Orange, CA). The bisulfite-converted DNA was analyzed using the Infinium HumanMethylation27 BeadChip array (Illumina, San Diego, CA) according to the manufacturer's recommendations at the Genomics Core Facility at the UCSF Institute for Human Genetics (San Francisco, CA). Analysis was conducted in 2 batches across 42 BeadChips. Outliers were detected using array control probes supplied by Illumina to diagnose problems such as poor bisulfite conversion, batch or BeadChip effect or color-specific problems. Specifically, Mahalanobis distances were determined based on fitted mean vector and variance-covariance matrix, and arrays with large distances (inconsistent with multivariate normality52) were discarded. The methylation status for each individual CpG locus was calculated as the ratio of fluorescent signals (β = Max(M,0)/[Max(M,0) + Max(U,0) + 100]), ranging from 0–1, using the average probe intensity for the methylated (M) and unmethylated (U) alleles. Beta (β) = 1 indicates complete methylation; β = 0 represents no methylation. Only the 26,486 autosomal CpGs were considered in the statistical analyses. We and others have previously demonstrated that methylation of CpG loci detected through BeadArray platforms can be replicated using alternative detection techniques including pyrosequencing, MassARRAY analysis and quantitative methylation-specific PCR.53–58
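As a small worked illustration of the β calculation given above, the following sketch evaluates the stated ratio for a single CpG locus. The function and parameter names are hypothetical; only the formula itself comes from the text.

```python
def beta_value(m, u, offset=100.0):
    """Methylation beta for one CpG locus.

    m, u   : average probe intensities for the methylated (M) and
             unmethylated (U) alleles (hypothetical parameter names)
    offset : the constant 100 in the denominator of the stated formula

    Implements beta = max(M, 0) / (max(M, 0) + max(U, 0) + offset).
    """
    m_pos = max(m, 0.0)
    u_pos = max(u, 0.0)
    return m_pos / (m_pos + u_pos + offset)


# Toy intensities for illustration only (not study data).
mostly_methylated = beta_value(900.0, 0.0)
negative_signal = beta_value(-5.0, 300.0)
```

Because of the constant in the denominator, the computed value stays strictly below 1 for any finite intensities, and negative background-corrected intensities are truncated at zero before the ratio is taken.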
To capture relative, CpG-specific heterogeneity across specimens, methylation β values $B_{ij}$ (subject i, CpG j) were transformed to Z-scores (conferring robustness to biochemical range) by calculating the mean $\bar{B}_j$ and standard deviation $S_j$ for each individual CpG j and subsequently computing $Z_{ij} = (B_{ij} - \bar{B}_j)/S_j$. CpG loci were clustered into methylation classes based on Z-scores using a recursively partitioned mixture model (RPMM)20 adapted for Gaussian distributions. This likelihood-based hierarchical clustering algorithm has processing and memory requirements that are less burdensome than commonly used metric-based hierarchical clustering procedures, thereby granting computational feasibility to the clustering of 26,486 CpGs. In fact, by comparing the consistency of RPMM clustering to that of metric clustering (using Euclidean distance with Ward's linkage) by pairwise analysis of 100 resampling experiments, we have demonstrated that RPMM provides more consistent clustering than metric hierarchical clustering for this dataset (Sup. Analysis S1). In addition, its hierarchical presentation of classes confers robustness, compared with other mixture model algorithms, in the selection of the number of classes. The model was arbitrarily pruned after 5 splits (Q = 5), yielding 32 CpG methylation classes. For each of the 205 control subjects, 32 corresponding aggregate methylation values were obtained by averaging the average β values of all CpGs within each class. RPMM classes are labeled by 5-letter combinations of L (left) and R (right), denoting the direction of each of the 5 splits in the dendrogram.
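The preprocessing and aggregation steps around the clustering can be sketched as follows. This is not an implementation of RPMM itself; it only illustrates the per-CpG standardization, the enumeration of the 32 five-letter class labels and the per-subject class averages. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator was used.

```python
from itertools import product
from statistics import mean, stdev


def z_scores(betas):
    """Standardize each CpG (column) across subjects (rows).

    betas[i][j] is the beta value of subject i at CpG j.
    Assumes every CpG varies across subjects (non-zero SD) and that
    the sample SD is the intended estimator.
    """
    n_cpg = len(betas[0])
    columns = [[row[j] for row in betas] for j in range(n_cpg)]
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(row[j] - means[j]) / sds[j] for j in range(n_cpg)] for row in betas]


def rpmm_labels(splits=5):
    """All L/R label strings for a tree pruned after `splits` splits."""
    return ["".join(path) for path in product("LR", repeat=splits)]


def class_averages(betas, cpg_classes):
    """Per-subject average beta within each class.

    cpg_classes[j] is the class label assigned to CpG j.
    Returns one {class label: average beta} dict per subject.
    """
    labels = sorted(set(cpg_classes))
    return [
        {k: mean(b for b, lab in zip(row, cpg_classes) if lab == k) for k in labels}
        for row in betas
    ]
```

Note that clustering operates on the Z-scores, whereas the aggregate values used in the association analyses are averages of the untransformed β values.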
For each of these 32 aggregate measures, Spearman's rank correlation coefficient was used to measure the correlation between subject-specific exposure and subject- and class-specific aggregate methylation. Multiple linear regression models were used to assess the association of exposures and aggregate methylation, while adjusting for potential confounding variables. The model for the association of aggregate methylation and age (continuous, centered at the median) was adjusted for gender; the model for the association of aggregate methylation and gender was adjusted for age and hair dye use (ever/never); the model for the association with pack-years of smoking (continuous) was restricted to ever-smokers and was adjusted for age and gender; the respective models for ≤ and >6.5 alcoholic drinks per week (median) were compared to non-drinkers and adjusted for age and gender; and models for smoking (ever/never), hair dye use (ever/never), arsenic exposure (measured from toenail clippings as µg/g), selenium exposure (measured from toenail clippings as µg/g), tanning lamp use (ever/never) and number of lifetime painful sunburns (continuous) were all adjusted for age and gender.
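For readers who want to see the class-level correlation step concretely, the following is a minimal standard-library sketch of Spearman's rank correlation between an exposure and one class's aggregate methylation. The function names and toy numbers are illustrative assumptions; the handling of ties by average ranks is the conventional definition, but the text does not describe how ties were treated.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(exposure, class_methylation):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(exposure), average_ranks(class_methylation))


# Toy values for illustration only (not study data).
ages = [30, 40, 50, 60]
class_average_beta = [0.9, 0.8, 0.7, 0.5]
rho = spearman(ages, class_average_beta)
```

In the analysis described above, one such coefficient would be obtained for each of the 32 classes and each exposure.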
Omnibus tests for overall association between exposure and aggregate CpG class methylation were obtained by permutation test. Two types of tests were used. The first is a supremum test, analogous to a Kolmogorov-Smirnov test: for each hypothesized association, the test statistic was either the maximum absolute correlation or the maximum absolute t-statistic for the appropriate coefficient from the regression model, where the maximum was taken over the 32 individual correlations or regression models. The corresponding null distribution was obtained by randomly permuting the individual exposure or phenotype variable with respect to aggregate methylation values and potential confounders and recomputing the test statistic. A total of 10,000 permutations were used, and a hypothesized association was considered significant when p ≤ 0.05. Because the supremum test is inefficient for detecting structural dependencies between classes that are adjacent with respect to a natural ordering (e.g., CpG classes ordered alphabetically by RPMM label or numerically by mean methylation), we employed a second type of test statistic, a “1st-difference” test: the sum of the squares of the first-order differences in the smoothed correlation or t-statistic, where the smoothing was obtained by fitting a generalized additive model (GAM) to the statistics (with respect to the assumed order of the classes) and extracting the predicted smooth. GAMs were fit using the mgcv library in R. Since the classes must have a natural ordering for the “1st-difference” test to be meaningful, this test was not applied to the bioinformatically-derived classes.
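The supremum test can be sketched as follows for the correlation-based statistic. This is a simplified illustration rather than the study's implementation: it covers only the unadjusted Spearman variant (no confounders or regression t-statistics), the function names are hypothetical, and the add-one correction in the permutation p-value is an assumption, since the text does not state how the p-value was computed.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    r = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        for k in range(start, end + 1):
            r[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return r


def pearson(x, y):
    # Not defined for constant vectors (zero variance).
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def supremum_statistic(exposure, class_values):
    """Maximum absolute Spearman correlation over all classes.

    class_values[c][i] is the aggregate methylation of class c, subject i.
    """
    return max(abs(spearman(exposure, v)) for v in class_values)


def supremum_permutation_test(exposure, class_values, n_perm=10000, seed=0):
    """Permute the exposure across subjects to build the null distribution."""
    rng = random.Random(seed)
    observed = supremum_statistic(exposure, class_values)
    shuffled = list(exposure)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if supremum_statistic(shuffled, class_values) >= observed:
            hits += 1
    # Assumption: add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Because the maximum is taken over all classes within each permutation, the resulting p-value accounts for testing all 32 classes at once.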
Additionally, an alternative clustering of CpGs was obtained by considering epigenetically relevant bioinformatic attributes of each CpG, including CpG island status,59 PcG target status of the associated gene (i.e., the gene was described as a PcG target in at least one of refs. 21–24), presence within 1 kb of at least one of 258 computationally predicted TFBS sequences obtained from the tfbsConsSites track of the UCSC Genome Browser (TFBS Z-score >2) and location within each of the following classes of repetitive elements as defined by the RepeatMasker track of the Genome Browser: Alu, LINE-1, LINE-2 and MIR. This bioinformatic classification resulted in 41 distinct CpG classes containing at least one CpG, summarized in Supplemental Table S3.
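One way to picture this classification is as a cross-tabulation of binary attributes in which only the combinations that actually occur are kept. The Python sketch below assumes that each of the seven attributes is coded as present/absent and that a class is one observed combination; this reading, the attribute key names and the function name are assumptions for illustration, not details confirmed by the text.

```python
from collections import defaultdict

# Hypothetical attribute keys, one per feature named in the text.
ATTRIBUTES = (
    "cpg_island",
    "pcg_target",
    "tfbs_within_1kb",
    "alu",
    "line1",
    "line2",
    "mir",
)


def bioinformatic_classes(annotations):
    """Group CpGs by their combination of binary attributes.

    annotations maps a CpG identifier to a dict holding a truthy/falsy
    value for every key in ATTRIBUTES. Returns a dict mapping each
    observed attribute combination (a tuple of booleans, in ATTRIBUTES
    order) to the list of CpG identifiers that share it.
    """
    classes = defaultdict(list)
    for cpg_id, attrs in annotations.items():
        key = tuple(bool(attrs[a]) for a in ATTRIBUTES)
        classes[key].append(cpg_id)
    return dict(classes)
```

Under this assumption, seven binary attributes allow up to 128 combinations, and only those containing at least one CpG appear in the result.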
Finally, to further analyze the interaction of each exposure of interest with each bioinformatic attribute with respect to CpG methylation, we fit attribute × exposure/confounder interaction regression models (marginal models). For each regression, we assumed the following data-generating model:
$$Y_{ij} = x_i^T \alpha + z_j^T \gamma + (x_i \otimes z_j)^T \delta + m_j + x_i^T a_j + \varepsilon_{ij}$$
where $Y_{ij} = \sin^{-1}(B_{ij}^{1/2})$ is the variance-stabilized methylation value obtained by arcsine transformation of the average β value, $B_{ij}$, for subject i and CpG j; $x_i$ is a vector of exposures/phenotypes and confounding variables; $z_j$ is a vector of CpG-specific attributes; $\otimes$ denotes the Kronecker product; $m_j$ and $a_j$ are zero-mean CpG-specific effects; $\varepsilon_{ij}$ is a zero-mean error term; T symbolizes a transpose operation; and the remaining coefficients are the focus of biological interest. Specifically, the vector α represents the overall effect of exposure or phenotype on DNA methylation, γ represents the effects of individual CpG attributes on DNA methylation and δ represents the extent to which various CpG-specific attributes modify the effect of exposure or phenotype. Estimates were obtained in a two-stage approach: first, CpG-specific regression coefficients were computed by regressing $Y_{ij}$ on $x_i$ separately for each CpG j; these coefficients were then regressed on the CpG attributes $z_j$ to obtain estimates of α and γ together with an estimated coefficient matrix of attribute-by-exposure effects, which was finally vectorized to obtain the estimate of δ. This marginal-models approach is similar in spirit to the generalized estimating equation (GEE) approach popular in longitudinal data analysis. Statistical inference was obtained by bootstrap, i.e., by obtaining 500 representatives of the sampling distribution from 500 bootstrap data sets, each constructed by sampling, with replacement, 205 data vectors consisting of methylation data concatenated with exposure/phenotype covariate data. We acknowledge that small biases in estimates will arise to the extent that quantities computed over all the autosomal CpGs on the 27K array differ from the corresponding values obtained from all CpGs in the human genome, but we conjecture that the bias is small for dense arrays and that the resulting regression estimates will be representative of all human genome CpGs that conform to the selection criteria used by Illumina for inclusion on the 27K array.
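The arcsine transformation and the subject-level bootstrap described above can be sketched in Python as follows. This is an illustrative simplification, not the study's code: the function names are hypothetical, each subject record is reduced to a single (covariate, β value) pair for one CpG, and the example estimator is a simple least-squares slope rather than the full two-stage marginal model.

```python
import math
import random


def arcsine_sqrt(beta):
    """Variance-stabilizing transform Y = arcsin(sqrt(B)), B in [0, 1]."""
    return math.asin(math.sqrt(beta))


def bootstrap_subjects(records, n_boot=500, seed=0):
    """Yield n_boot data sets, each drawn by sampling whole subject
    records with replacement, keeping each subject's methylation and
    covariate data together."""
    rng = random.Random(seed)
    n = len(records)
    for _ in range(n_boot):
        yield [records[rng.randrange(n)] for _ in range(n)]


def slope(sample):
    """Least-squares slope of transformed methylation on the covariate.

    Each record is a (covariate, beta) pair; this stands in for whatever
    estimator is being bootstrapped. Fails if all covariates are equal.
    """
    xs = [x for x, _ in sample]
    ys = [arcsine_sqrt(b) for _, b in sample]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_se(records, estimator, n_boot=500, seed=0):
    """Standard deviation of the estimator across bootstrap data sets."""
    estimates = [estimator(s) for s in bootstrap_subjects(records, n_boot, seed)]
    m = sum(estimates) / len(estimates)
    var = sum((e - m) ** 2 for e in estimates) / (len(estimates) - 1)
    return math.sqrt(var)
```

Resampling whole subjects, rather than individual CpG measurements, preserves the correlation among CpGs measured in the same person, which is why each bootstrap draw takes the full data vector.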
All statistical analyses were performed using the R statistical package (v. 2.11.1).
This work was supported by the Flight Attendant Medical Research Institute grant YCSA 052341 to C.M.; and the National Institutes of Health (R01CA121147 to K.K., R01CA100679 to K.K., R01CA078609 to K.K., R01CA126939 to K.K., R01CA057494 to M.K., P42ES007373 to M.K., R01CA082354 to H.N.).