PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of narLink to Publisher's site
 
Nucleic Acids Res. 2014 April; 42(6): 3515–3528.
Published online 2014 January 20. doi:  10.1093/nar/gkt1380
PMCID: PMC3973306

Predicting DNA methylation level across human tissues

Abstract

Differences in methylation across tissues are critical to cell differentiation and are key to understanding the role of epigenetics in complex diseases. In this investigation, we found that locus-specific methylation differences between tissues are highly consistent across individuals. We developed a novel statistical model to predict locus-specific methylation in target tissue based on methylation in surrogate tissue. The method was evaluated in publicly available data and in two studies using the latest IlluminaBeadChips: a childhood asthma study with methylation measured in both peripheral blood leukocytes (PBL) and lymphoblastoid cell lines; and a study of postoperative atrial fibrillation with methylation in PBL, atrium and artery. We found that our method can greatly improve accuracy of cross-tissue prediction at CpG sites that are variable in the target tissue [R2 increases from 0.38 (original R2 between tissues) to 0.89 for PBL-to-artery prediction; from 0.39 to 0.95 for PBL-to-atrium; and from 0.81 to 0.98 for lymphoblastoid cell line-to-PBL based on cross-validation, and confirmed using cross-study prediction]. An extended model with multiple CpGs further improved performance. Our results suggest that large-scale epidemiology studies using easy-to-access surrogate tissues (e.g. blood) could be recalibrated to improve understanding of epigenetics in hard-to-access tissues (e.g. atrium) and might enable non-invasive disease screening using epigenetic profiles.

INTRODUCTION

Tissue-specific gene expression patterns that determine cell types and functions are regulated in part by tissue-specific methylation at CpG sequences (1). It has been shown that epigenetic variation in the methylation of DNA is related to transcription regulation, cell differentiation, diseases and cancers (2). Recent advances in genome-wide technologies make it possible to study the impact of epigenetics on health outcomes in areas such as cardiovascular epigenetics (3), environmental epigenomics (4), and the role of early life social environment and associations with long-term disorders (5). To understand the variation of methylation in the human genome and its relation to common disease, large-scale population-based studies are needed. However, the target tissues directly relevant to the outcome of interest are often impossible or extremely difficult to collect in a substantial number of samples, which often makes large human studies based on target tissues infeasible. Also, use of DNA methylation data from biospecimens that can be easily and non-invasively collected from human individuals, such as blood, would be critical to develop novel epigenetic biomarkers for clinical diagnosis and prevention. For example, Barault et al. (6) showed that leukocyte DNA methylation levels of selected imprinted genes may serve as surrogate markers of DNA methylation in mammary tissue in the study of breast cancer, and Ursini et al. (7) showed that methylation in prefrontal cortex target tissue can also be well correlated with methylation in blood lymphocytes.

Previous studies have shown that DNA methylation patterns are largely conserved across tissues, although intra-individual variation exceeds inter-individual variation (2,8,9). For example, Byun et al. (2009) reported that the intra-individual correlations for 11 tissues were 0.852 (range: 0.738–0.941) using the IlluminaGoldenGate Bead Array, which integrates 1505 CpG sites in 807 genes. This suggests that it is possible to develop a model to study DNA methylation in target tissues for population-based epidemiological studies using easily accessible tissues. To determine whether methylation markers from surrogate tissues can be used as a proxy for methylation in target tissues to study an outcome of interest, it is necessary to determine whether methylation in surrogate tissues can adequately predict target tissue methylation.

In this study, we systematically addressed this question using the latest high-throughput technologies (Illumina HumanMethylation27 and HumanMethylation450 arrays) to collect data from multiple tissues in the same individuals participating in two independent studies as well as data from public databases. Specifically, first we investigated related tissues including Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines (LCL) and blood. LCLs have been used to increase the amount of DNA that can be obtained from peripheral blood leukocytes (PBLs) for genetic studies, such as the HapMap (10,11) and the 1000 Genomes Project (12). They have been used to study genetic and epigenetic determinants of gene expression (13–18) and are found to recapitulate the naturally occurring gene expression and methylation variation in primary B- and T cells (8). LCLs are particularly attractive in epigenetic studies because they could potentially allow for DNA methylation analyses even when the amount of blood that can be collected is limited. In the next stage, we examined methylation across differentiated tissues including PBLs, right atrial appendage and left internal mammary artery (subsequently abbreviated as ‘atrium’ and ‘artery’, respectively). We then examined the prediction accuracy using cross-validation and systematically evaluated the performance by predicting methylation status in independent data obtained from a public database repository.

MATERIALS AND METHODS

Subject and tissue collection

PBL and LCL samples

DNA methylation data were collected from 195 siblings and their parents in 95 nuclear pedigrees identified through a proband with asthma. These data were derived from a previous family study of childhood asthma (19) and gene expression quantitative trait loci mapping exercise for global expression in LCL from a subset of individuals (13). PBL samples were collected from 39 children (18 male) derived from 20 nuclear families collected through a proband with asthma. Among these 39 samples, 22 were asthmatic (see Moffatt et al., 2007, for criteria). DNA from PBLs and paired EBV-transfected LCLs were available from each individual. The transformation of peripheral blood lymphocytes in all 39 samples was carried out by the European Collection of Cell Cultures (ECACC, http://www.hpacultures.org.uk/collections/ecacc.jsp). Previously transformed cryopreserved EBV cell lines were grown as 500-ml roller cultures. Once log phase was reached, cells were pelleted, medium was discarded and a mixture of RLT buffer (RNeasy Lysis Buffer, Qiagen, Valencia, CA, USA) and β-mercaptoethanol was added. Pellets were vortexed to ensure thorough re-suspension, after which they were frozen at −70°C and stored at −80°C (13). DNA was extracted from PBL and LCL using the Promega Wizard Kit.

PBL, atrium and artery samples

Patients undergoing coronary artery bypass graft surgery at Beth Israel Deaconess Medical Center were recruited to participate in a study of DNA methylation and atrial fibrillation. PBL, atrium and artery tissue were collected from 18 participants using PaxGene tubes for blood and PaxGene (Qiagen, Valencia, CA, USA) tissue containers. After surgery, six participants developed atrial fibrillation. Blood DNA was extracted using the PAXgene Blood DNA Kit (Qiagen, Valencia, CA, USA); atrial and artery tissue DNA was extracted using the PAXgene Tissue DNA Kit (Qiagen, Valencia, CA, USA) according to the manufacturer’s protocol. Among the four samples with duplicates, correlation R2 between duplicates was >99%. For 18 participants, one or more tissue types were available. There were 14 individuals with methylation measures in all three tissues that met quality control standards and were included in downstream analysis.

DNA methylation profiling

Illumina Infinium HumanMethylation27 array

DNA samples were quantified using a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and bisulfite converted using the Zymo EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) with an input of 1000 ng. The assay was carried out as per the IlluminaInfinium Methylation instructions. Each conversion assay included a commercially available positive control (Universal Methylation DNA Standard, Zymo Research) and in-house–generated negative control (whole-genome amplified genomic DNA). Bisulfite-converted samples were eluted in a volume of 8 μl and re-quantified on the NanoDrop spectrophotometer using the RNA settings (because recovered DNA is single stranded and exhibits similar absorption properties to RNA at 260 nm). Dilution plates were constructed from these bisulfite-converted samples at a concentration of 60 ng/μl in a total volume of 6 μl. These plates (from which 4 μl was ultimately taken) formed the input for the IlluminaInfinium Methylation assay using the HumanMethylation27 BeadChips (IlluminaInc, San Diego, CA, USA). This assay interrogates 27 578 CpG sites for the extent of DNA methylation. The plates were processed as per the manufacturer’s instructions, including the positive and negative controls from each bisulfite conversion assay. Data were visualized using the BeadStudio software, and examined using both sample-dependent and sample-independent quality control criteria. Samples that failed quality control were repeated. Signal intensities of methylated and unmethylated probes were exported from the BeadStudio interface, along with detection of P-values representing the likelihood of detection relative to background.

Illumina Infinium HumanMethylation450 array

DNA was quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and PicoGreenQuant-iT TM PicoGreen dsDNA Kit (Invitrogen, Carlsbad, CA, USA). DNA was bisulfite-converted using the Zymo EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) with an input of 1000 ng using the EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) according to the manufacturer’s protocol. Final elution was performed with 30 μl M-elution buffer. Bisulfite-treated DNA was aliquoted and stored at −80°C until ready for use. HumanMethylation450 BeadChips (IlluminaInc, San Diego, CA, USA) were used to interrogate ~450 000 DNA methylation sites covering 14 000 genes including CpG islands and shores, non-coding regions, microRNA promoter, and disease-associated regions plates were processed as per the manufacturer’s instructions, including the positive and negative controls from each bisulfite conversion assay. Data were visualized using the BeadStudio software and were examined using both sample-dependent and sample-independent quality control criteria. Samples that failed quality control were repeated. Signal intensities were exported from the BeadStudio interface both before and after background correction, along with detection P-values representing the likelihood of detection relative to background.

Methylation normalization

Methylation β values normalization

For PBL and LCL samples using the HumanMethylation27 array, we normalized the probe intensity by applying quantile normalization to all methylated and unmethylated probes together across all samples, similar to the approach used in the lumi package (R, Bioconductor) (20). The methylation β values were recalculated as the ratio of methylated probe signal/(total signal + 100). Individual data points with detection P > 0.05 were treated as missing data.

For the PBL, atrium and artery samples using HumanMethylation450 array, we used the pipeline developed by Touleimat and Tost (21). Individual data points with detection P > 0.01 or number of beads <3 were treated as missing data. Samples with >20% missing probes were treated as missing data. Probe overlaps with any common single-nucleotide polymorphisms (MAF > 0.05) in the HapMap CEU population and single-nucleotide polymorphisms within 10 bp of query sites were removed. The lumi package (20) was used for background and color bias correction. Quantile normalization across samples was then applied to probes within each functional category (CpG island, shelf, shore, etc.) separately to correct the shift of methylation β value between Infinium I and Infinium II probes by aligning the distribution of Infinium II probes to the reference distribution built on the Infinium probes (21).

Statistical models for prediction

Methylation pattern across tissues

We first examined correlations between PBL and LCL from the asthma study and between PBL, atrium and artery in the atrial fibrillation study. We then removed CpGs with extreme high or low methylation in all samples to assess the correlation between tissues at intermediate level of methylation, as many CpG sites are either completely methylated or unmethylated across individuals and tissues. The first correlations evaluated could be potentially inflated by these CpG sites at both extremes of methylation distribution. This would artificially increase the between-tissue correlation coefficient and mask the relationship at CpGs that shows more tissue specificity.

Cross-tissue prediction of methylation level

Linear prediction model

The linear model was developed using a training data set to predict methylation in a testing data set. Suppose that methylation values in the training data set are organized into two n × m matrices, X and Y, where X is the surrogate tissue and Y is the target tissue. There are n samples and m CpG probes. Each sample is a row and each probe is a column in the matrix. Let xij and yij, i = 1,2, … ,n and j = 1,2, … ,m be the element of matrices X and Y, respectively. The linear regression model for prediction of the j-th probe is An external file that holds a picture, illustration, etc.
Object name is gkt1380i1.jpg, i = 1,2, … ,n. Let An external file that holds a picture, illustration, etc.
Object name is gkt1380i2.jpg and An external file that holds a picture, illustration, etc.
Object name is gkt1380i3.jpg be the estimates of this model. For a particular sample in the testing data set, the predicted methylation value at the target tissue is An external file that holds a picture, illustration, etc.
Object name is gkt1380i4.jpg, j = 1,2, … ,m, where An external file that holds a picture, illustration, etc.
Object name is gkt1380i5.jpg is the methylation value at the surrogate tissue from the sample being predicted.

Support vector machine prediction model

Support vector machine (SVM) is one of the most popular supervised learning methods used to analyze data and recognize patterns (22). SVM represents a powerful technique for general (nonlinear) classification, regression and outlier detection and has been widely used in many bioinformatics applications. The SVM function in R package e1071 was used to build the statistical prediction model (23). Default parameters for eps-regression were used with radial basis kernel and ε = 0.1. The prediction using the SVM model is constructed in a similar manner as the linear regression method. For a given CpG site j, we used xij and yij, i = 1,2, … ,n as the training data set to build an SVM model, denoted as f(x). For a new sample, the predicted value An external file that holds a picture, illustration, etc.
Object name is gkt1380i6.jpg is obtained by applying the SVM model to An external file that holds a picture, illustration, etc.
Object name is gkt1380i7.jpg, i.e. An external file that holds a picture, illustration, etc.
Object name is gkt1380i8.jpg, where An external file that holds a picture, illustration, etc.
Object name is gkt1380i9.jpg is the methylation value at the surrogate tissue from the sample being predicted. To allow other users to apply our method to their data set, we have prepared an R package to do cross-tissue methylation prediction. This R package is available to download from our Web site (http://www.hsph.harvard.edu/liming-liang/cross-tissue-methylation/).

Assessment of prediction accuracy using cross-validation

Cross-validation was used to estimate prediction accuracy without overfitting. It consists of the following steps:

(i)
Leave-one-family out or leave-one-sample out One sample was removed, and the remaining samples were used as training data set. The left-out sample was used as testing data set. Because the PBL–LCL asthma data consist of family members, we removed the entire family in each iteration so that the training data set and the testing data set are completely independent.
(ii)
Statistical prediction model (linear regression model or SVM) was estimated using the training data set.
(iii)
Applying the prediction model (linear regression model or SVM) to the left-out families or samples, we obtained the predicted value for the samples in testing data set.
(iv)
Repeat (i)–(iii) for all families or samples.
(v)
After we obtained the predicted value for all n samples and all m CpG sites, An external file that holds a picture, illustration, etc.
Object name is gkt1380i10.jpg. The prediction accuracy was measured by correlation coefficient R2 and mean absolute error MAE for specific sample (An external file that holds a picture, illustration, etc.
Object name is gkt1380i11.jpg, An external file that holds a picture, illustration, etc.
Object name is gkt1380i12.jpg) or specific CpG site (An external file that holds a picture, illustration, etc.
Object name is gkt1380i13.jpg, An external file that holds a picture, illustration, etc.
Object name is gkt1380i14.jpg), where An external file that holds a picture, illustration, etc.
Object name is gkt1380i15.jpg is the ith row and An external file that holds a picture, illustration, etc.
Object name is gkt1380i16.jpg is the jth column of the experimentally obtained methylation matrix Y. Similar definition for An external file that holds a picture, illustration, etc.
Object name is gkt1380i17.jpg is the i-th row and An external file that holds a picture, illustration, etc.
Object name is gkt1380i18.jpg is the j-th column of the predicted methylation matrix.

Single probe versus multiple probes

Methylation at a particular CpG site may be correlated with other CpG sites either at nearby regions or elsewhere on the genome. Including these correlated CpG sites in the prediction model might further improve performance. We examined the utility of including multiple CpG sites in the prediction of atrium methylation based on PBL for 1000 target CpG sites with substantial variation but relatively poor prediction performance (randomly selected from probes with standard deviation in atrium between 0.1 and 0.2, and R2 between PBL and atrium <0.3).

Prediction generalizability across studies

We further evaluated the prediction performance by applying our prediction model to an independent data set described in Caliskan et al. (2011) (GEO accession ID: GSE26211). This data set contains six individuals. Each individual has two T cell samples and 12 LCL samples (24 LCL-T cell pairs for each individual). For each LCL-T cell pair, we applied the SVM and linear model built using our 39 LCL-PBL samples to the LCL sample and compared the predicted value with the T cell methylation value.

Prediction performance in other tissue settings

To evaluate our model performance in other tissues, we have obtained data from Byun et al. (2009) (2), where there are six cases and each has 11 tissues: brain, bladder, colon, esophagus, heart, kidney, liver, lung, pancreas, spleen and stomach. We examined all tissue pairs (55 pairs). For each pair, we applied our SVM model to predict methylation in one tissue using the other tissue in the pair.

Cross-tissue prediction and utility of surrogate tissue

To examine the utility of the predicted methylation value, we carried out association analysis between postoperative atrial fibrillation (PostOpAF, the outcome of primary interest) with the following linear regression model:

equation image

where outcome is the methylation of individual probe and PostOpAF is the postoperative atrial fibrillation. We then performed a clustering analyses using PBL, atrium and predicted atrium methylation to evaluate hierarchical clustering by AF status based on peripheral blood leukocytes and tissue methylation as well as predicted methylation levels.

Sample size effect on prediction accuracy

We hypothesized that a training data set with a larger sample size would increase the precision of model parameters and reduce the effect of outliers. We evaluated the sample size effect by randomly choosing a subset of our 39 PBL-LCL samples as training data set and predicted the methylation in the GSE26211 data set. We varied the size of the training data set from 3, 4, 5, 6, 7, 8, 10, 20, 30 to 39. For each size of the training data set (except n = 39), we randomly selected a different training data set 10 times and reported the average performance across the 10 replicates.

RESULTS

Methylation pattern across tissues

Consistent with previous studies (2,8), we also observed that DNA methylation values were largely conserved across tissues. The correlation R2 between PBL and LCL for the 39 samples ranged from 0.81 to 0.95 with mean 0.92 (Supplementary Figure S1 and W-S1). In the 14 atrial fibrillation study (AF) samples, the correlation between tissues was substantially high (for PBL-artery, R2 ranged from 0.76 to 0.89 with mean 0.81, for PBL-atrium, R2 ranged from 0.81 to 0.87 with mean 0.83, for artery-atrium, R2 ranged from 0.91 to 0.97 with mean 0.94, Supplementary Figure S2 and W-S2). After removing CpG sites with minimum methylation β value > 0.9 or maximum β value < 0.1 among all subjects and tissues, correlations were reduced. They were further reduced after removing CpG sites with minimum methylation β values >0.8 or maximum β values <0.2 among all subjects and tissues (Table1).

Table 1.
Correlation R2 between raw data across tissues in asthma study and AF study

Despite the high level of overall correlation R2, there were many CpG sites that exhibited differences in methylation across tissues (14% CpGs have methylation difference >0.1 in PBL-LCL data set, 26% in PBL-artery, 24% in PBL-atrium and 14% in artery–atrium data sets).These CpG sites determined tissue-specific methylation patterns. We first examined how cross-tissue differences in methylation at each locus were distributed across individuals and found that they were largely consistent. For example, a CpG site that had higher methylation level in target tissue than surrogate tissue in one individual generally had higher methylation level in target tissue than surrogate tissue in other individuals. In addition, the magnitude of difference was similar across individuals (Supplementary Figure S3, S4, W-S3 and W-S4).

Cross tissue prediction of methylation level

We explored the utility of methylation prediction by using two statistical models based on linear regression model (LM) and SVM and two independent data sets with five tissues. Leave-one-out cross-validation procedure is used to estimate the prediction accuracy and avoid overfitting.

PBL and LCL

After iterating through all 20 families, we had a vector of 39 predicted methylation levels in PBL and a vector of their observed methylation levels as measured by the Illumina array. Correlation R2 between these two vectors was used to evaluate the prediction performance for the CpG site (CpG-specific or probe-specific accuracy). We applied this leave-one-out procedure on all probes of the Illumina array and obtained a predicted methylation vector for all CpG sites for each sample. Results showed that the predicted PBL methylation level was much closer to its experimental counterpart (methylation measured directly in PBL). The improvement was illustrated through the scatter plots of LCL versus PBL and predicted PBL versus PBL for sample #1 in Figure 1a (based on SVM prediction) and Supplementary Figure S8a (based on LM prediction). The difference between predicted PBL and PBL is greatly reduced and consistent across samples (Table 2, Figure 1b and Supplementary Figure S8b for sample #1 and #2). Similar improvement was observed for all 39 samples (Supplementary Figure S5 and W-S5) and the overall correlation R2 between true and predicted PBL methylation from the same individual increased from 0.92 to 0.99 (average across all 39 samples for both linear regression model and SVM model). After eliminating CpG sites that were completely methylated or unmethylated, we still observed substantial increases in the R2. The mean absolute error is mostly below one standard deviation of methylation in PBL. Smaller error was observed for probes with large variation.

Figure 1.
Methylation pattern across tissues and between-tissue difference across individuals. (a) Scatter plots for sample #1. Red circles for PBL versus LCL (x = PBL, y = LCL) and purple circles for PBL versus SVM-predicted PBL (x = PBL, y = predicted PBL based ...
Table 2.
Mean correlation R2 between true and predicted methylation in asthma study and AF study

PBL, artery and atrium

The relatively close relationship between PBL and LCL methylation might contribute to the good prediction accuracy. We next extended the same prediction procedures (linear regression model and SVM model) to the second data set where three tissues (atrium, internal mammary artery and PBL) using IlluminaInfinium HumanMethylation450 array were collected from 14 individuals. We treated PBL as the surrogate tissue and artery and atrium as the target tissues. The correlations between artery-raw PBL (R2 = 0.81) and atrium-raw PBL (R2 = 0.83) are less than PBL-LCL but are similar to correlations reported in other studies (2). Again, we observed that the predicted artery or atrium methylation level is much closer to its experimental counterpart (Figure 1c–f and Supplementary Figures S6, S7, S8c–S8f, W-S6, W-S7) and the overall correlation R2 increased from 0.81 (raw PBL-artery) to 0.97 (calibrated PBL-artery) or from 0.83 (raw PBL-atrium) to 0.99 (calibrated PBL-atrium). After removing CpG sites with minimum methylation β value > 0.9 or maximum β value < 0.1 among all subjects and tissues, our prediction model substantially increased the overall R2 (Table 2). At individual CpG sites, we observed that when there is substantial variation [SD > 0.35 (LM)], SD > 0.33 (SVM) for artery or SD > 0.27 (LM), SD > 0.30 (SVM) for atrium in the target tissue (artery or atrium), the prediction accuracy is close to 1, and the mean absolute error is mostly below 1 SD of methylation in the target tissues (Figure 2c–f and Supplementary Figure S9c–S9f).

Figure 2.
Probe-specific prediction accuracy based on SVM model by methylation variation within target tissues. (a) Standard deviation (SD) of methylation in PBL versus R2 between PBL and predicted PBL based on SVM. For each dot, x = standard deviation (SD) of ...

For some individuals, the scatter plots in Figure 1c and e and Supplementary Figure S8c, S8e become less informative due to the large spread of data points. Therefore, we examined the density of predicted values by different intervals of experimentally obtained methylation values. For example, we categorized the probes according to artery methylation values from one participant into 10 equally spaced bins and plotted the density of predicted values for probes within each bin (Figure 3). Our analysis shows that the predicted values by either model were more likely to fall within the window of the experimentally obtained methylation level than uncalibrated methylation level measured in PBL. The density plots from another individual show similar pattern (Supplementary Figure S10). These results suggested that the predicted artery, atrium or calibrated PBL methylation is a better surrogate model than the raw PBL methylation level to study methylation variation in artery/atrium. Although the original correlation is >0.8 based on all probes, the calibrated values still gave higher correlation R2 (0.97 for PBL-artery prediction and 0.99 for PBL-atrium prediction, both linear regression model and SVM model).

Figure 3.
Density of predicted methylation level by true methylation in artery for sample 177. Asterisk: red line represents the density of methylation in PBL. Green line represents the density of the predicted artery methylation by using linear regression model. ...

Linear regression versus SVM

We have used both linear regression model and SVM model for our prediction engines. These two models gave similar overall performance (Figure 1 and Supplementary Figure S8). We expected that the linear regression model would be more vulnerable to outliers for small sample size. After examining the range of predicted values, we found that the linear regression model can sometimes produce methylation values out of the 0–1 range due to the effect of outliers and extrapolation, whereas SVM regression always gives prediction within the range spanned by the training data set (Supplementary Figures S11, W-S8, S12, W-S9, S14 and S15). For the larger PBL–LCL sample, the influence of outliers is reduced (Supplementary Figures S13, W-S10, S16).

Prediction generalizability across studies

In the independent data set of six individuals and 24 LCL T cell pairs, the correlation R2 between predicted value and T cell methylation was 0.95 for both SVM and LM models (average across 144 LCL-T cell pairs) compared with 0.92 for LCL-T cell correlation (Supplementary Figure S17). After removing CpG sites with minimum methylation β values > 0.9 or maximum β values < 0.1 among all subjects in both LCL and T cells, our prediction model increased the overall R2 from 0.88 to 0.92. When the cut-points for minimum and maximum β values were changed to 0.8 and 0.2, respectively, the overall correlation R2 increased from 0.80 to 0.87. The magnitude of improvement was smaller than the cross-validation estimate in our 39 samples (Supplementary Figure S17). This is likely because the target tissue in the training data set is PBLs, whereas the target tissue in testing data set is T cells. The significant improvement, especially for LCL-T cell pairs with lower correlation (Supplementary Figure S17), suggests that the prediction model built in a training data set is applicable to future studies and improves the utility of surrogate tissues.

Prediction performance in other tissue settings

To evaluate our model performance in other tissues, we obtained data from Byun et al. 2009 (2), where there were six cases with 11 tissues: brain, bladder, colon, esophagus, heart, kidney, liver, lung, pancreas, spleen and stomach. We examined all tissue pairs (55 pairs). For each pair, we applied our SVM model to predict methylation in one tissue using the other tissue in the pair. Figure 4 compares the R2 based on raw data and predicted data. R2 of raw data is the R2 between raw methylation of tissue pair by individual and average across all six subjects, R2 of predicted data is the R2 between predicted and true methylation in the target tissue by individual and average across all six subjects. Our results demonstrate that cross-tissue methylation prediction is feasible and its performance depends on actual tissue pairs and sample size. In the future, collection of additional paired tissue data could be used to re-train the prediction model, and would be useful for other large-scale study based on blood, which was not included in this study.

Figure 4.
Predicting performance across multiple tissues. Data obtained from Byun et al. (2009) Hum Mol Genet (PMID:19776032), where there are six cases and each has 11 tissues: brain, bladder, colon, esophagus, heart, kidney, liver, lung, pancreas, spleen and ...

Cross-tissue prediction improves utility of surrogate tissue

Using effect size for AF derived from atrium methylation as the gold standard, we found that predicted atrium methylation (or calibrated PBL methylation) gave less bias in the effect size than uncalibrated PBL methylation in 60% of loci based on SVM (59% based on LM). We anticipate that improvement will increase with larger sample size. We performed a clustering analyses using PBL, atrium and predicted atrium methylation by SVM (see Figure 5). When we cluster by atrium methylation, the controls are clustered in two groups (Figure 5a). The clustering by PBL methylation shows a different pattern and cases and controls cluster together. When we use the predicted atrium methylation, we find that it better represents the patterns showing from the true atrium methylation data, thus improving the utility of the PBL tissue as a surrogate of the atrium tissue. This result suggests the predicted methylation can be particularly useful for analyses that involve multiple CpGs, such as the clustering analysis here and network analyses that show to be useful for gene expression data (24,25).

Figure 5.
Clustering using atrium, PBL and PBL calibrated (SVM) methylation. (a) There are 14 samples: two female (white) and 12 male (red). The PostOpAF contains four cases (red) and 10 controls (white). The controls are grouped into two groups indicated by red ...

Sample size effect on prediction accuracy

For each of the six individuals in the GSE26211 data set, there are 24 LCL-T cell pairs. For each LCL-T cell pair, we computed the average absolute difference across all probes as prediction accuracy for the pair. In Supplementary Figure S18, each line represents a particular LCL-T cell pair in one of the 10 replicates. The trend of the lines shows that increasing sample size increases overall prediction accuracy for the sample. SVM has a generally better performance than the linear model and is less subjective to influential outliers (the pairs with poor prediction accuracy were all based on one LCL sample’s prediction of T cells from the same individual, purple lines at left panel). We also examined probe-specific prediction accuracy. Increasing sample size can greatly reduce prediction errors, especially when the methylation has substantial variation (Figure 6).

Figure 6.
Effect of training sample size on cross-study probe-specific prediction error. Asterisk: for a given sample size, we randomly chose samples from our family data set of 39 individuals to construct the training sample and predict the T cell methylation ...

We observed that prediction error rate was higher with methylation variation at the beginning and then decreased dramatically. We speculate that a relatively constant technical measurement error in methylation across probes could possibly explain the increasing of prediction error when total variance is small, which is dominated by technical measurement errors. But this hypothesis needs to be tested with additional experiments.

When the training data set was too small (n = 3 or 4), the training samples were not able to represent the majority of the population. Consequently, prediction error increased along with the methylation variance, particularly for SVM, which ensures that the prediction value falls within the range of the training samples (Figure 6). We recommend that at least 10 samples are used in the training data set, and this could vary by tissues and the technology used to measure methylation.

DISCUSSION

In this study, we developed and systematically evaluated statistical models to predict methylation level at target tissues using surrogate tissues. Through both cross-validation and the application to an independent data set, we showed that methylation value at the target tissue can be well predicted, especially for CpG sites with substantial variation in the target tissue. It suggests that one can improve the utility of surrogate tissues by learning the relationship between target tissues and surrogate tissues in an independent data set. We expect that the prediction accuracy could vary across tissue type and different populations. Predictions may be further improved by incorporating more information, such as from correlated CpG sites and additional samples as discussed below.

Prediction accuracy on independent samples

The cross-validation procedure estimates the prediction accuracy for target tissue of an independent sample that was drawn from the same study population and was processed and normalized together with the surrogate tissue. When the training data set and study data set were from different cell populations, we expect the prediction accuracy will be lower than the estimate from cross-validation as was seen in the prediction exercise using the GSE26211 samples (target tissue is PBL in training data versus target tissue is T cell in testing data). To maximize prediction accuracy, it is ideal to collect the training sample from the cell type as close to the target cell type as possible and use the same technology platform for the methylation data.

Potential limitations of cross-tissue prediction

We expect there are limitations in applying cross-tissue prediction in epidemiology studies. It is possible that correlation between tissues might not necessarily apply to the CpG sites that are informative for a specific phenotype or disease. In situations when an important exposure only affects the methylation in the target tissue but has little impact on the surrogate tissue, the cross-tissue prediction might not work well. Genotypes that destroy, introduce or shift a CpG would lead to correlation between tissues. The accurate prediction of methylation would reflect the prediction of genotype in this case. Environmental exposures applied to both tissues, especially at the developmental stage, would also lead to high correlation between tissues. In general, processes unique to diseased tissues would make the prediction become difficult.

In the atrial fibrillation study, we have four cases and 10 controls. For the four cases, the average R2 between raw PBL and atrium methylation is 0.83, whereas average R2 between atrium and predicted atrium methylation by the SVM model is 0.98. For the 10 control samples, the average R2 between raw PBL and atrium methylation is 0.83, whereas the average R2 between atrium and predicted atrium methylation by the SVM model is 0.99. In our data set, the prediction performance was similar in cases and controls. If the relationship across tissues differs in cases and controls, it would be required to include both case and control subjects in the training data set. We recommend that it is important for the training data set to cover as many important conditions as possible. In future work, we will extend our model to explicitly take into account such information, e.g. model the relationship separately in cases and controls and incorporate environmental exposure information.

Implications to epidemiology study and clinical utility

This study confirmed the finding from previous studies that methylation level is largely conserved across tissues and showed that methylation status measured in surrogate tissues can be further recalibrated to better represent the true methylation status in target tissues, which would greatly enhance the potential utility of the surrogate model. The utility of this method will depend on the actual tissue pairs and sample size of the training data set, as demonstrated in Figures 4 and and6.6. We note that our method can be used to evaluate the usefulness of a proposed surrogate tissue for a specific target tissue. It would be important for a pilot study to evaluate the feasibility for a large-scale study to use the surrogate tissue. If the surrogate tissue would be representative of the target tissue, our method provides a way to greatly improve the utility of the surrogate tissue, as we have shown the predicted value is a much better surrogate than the original raw methylation value. With the proposed methylation recalibration or prediction, large-scale epidemiological studies could become feasible if surrogate tissue, such as blood, is the only available data.

The clinical utility of methylation markers identified in surrogate tissues could be improved by using our method to calibrate the methylation level to better represent the status in target tissues. For example, in the study for postoperative atrial fibrillation, our results suggest that atrium epigenome might be informative to predict postoperative atrial fibrillation (which presents in ~30% of the patients). The lower accuracy in artery compared with atrium is likely because the artery tissues collected is in fact a mixture of endothelium, blood and smooth muscle cells, thus increasing noise in the target methylation level. If we could identify patients with at-risk epigenomes, we could treat them with prophylactic therapy or intensive mornitoring. Also, if prediction could be done with blood rather than atrium, clinical strategy could be developed in advance of the surgery. We have summarized potential applications of our approach in Table 3 along with strengths and limitations. This list is not meant to be complete but could provide some guideline as how to better use surrogate tissue in large-scale epidemiology studies.

Table 3.
Potential applications of the cross-tissue recalibration approach

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

HSPH-NIEHS center [ES000002], NIH [T32-HL007374], NIH [R01GM104411], NIH [R01ES021733] postdoctoral research fellow scholarship of the China Scholarship Council [2011821109], Fundamental Research Funds for the Central Universities and Young Foundation of Dalian Maritime University [2011QN119]. Funding for open access charge: NIH [R01GM104411] and NIH [R01ES021733].

Conflict of interest statement. None declared.

Supplementary Material

Supplementary Data:

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for their valuable comments on this article.

REFERENCES

1. Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16:6–21. [PubMed]
2. Byun HM, Siegmund KD, Pan F, Weisenberger DJ, Kanel G, Laird PW, Yang AS. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum. Mol. Genet. 2009;18:4808–4817. [PubMed]
3. Baccarelli A, Rienstra M, Benjamin EJ. Cardiovascular epigenetics: basic concepts and results from animal and human studies. Circ. Cardiovasc. Genet. 2010;3:567–573. [PMC free article] [PubMed]
4. Fleisch AF, Wright RO, Baccarelli AA. Environmental epigenetics: a role in endocrine disease? J. Mol. Endocrinol. 2012;49:R61–R67. [PMC free article] [PubMed]
5. Provencal N, Suderman MJ, Guillemin C, Massart R, Ruggiero A, Wang D, Bennett AJ, Pierre PJ, Friedman DP, Cote SM, et al. The signature of maternal rearing in the methylome in rhesus macaque prefrontal cortex and T cells. J. Neurosci. 2012;32:15626–15642. [PMC free article] [PubMed]
6. Barault L, Ellsworth RE, Harris HR, Valente AL, Shriver CD, Michels KB. Leukocyte DNA as surrogate for the evaluation of imprinted Loci methylation in mammary tissue DNA. PLoS One. 2013;8:e55896. [PMC free article] [PubMed]
7. Ursini G, Bollati V, Fazio L, Porcelli A, Iacovelli L, Catalani A, Sinibaldi L, Gelao B, Romano R, Rampino A, et al. Stress-related methylation of the catechol-O-methyltransferase Val 158 allele predicts human prefrontal cognition and activity. J. Neurosci. 2011;31:6692–6698. [PubMed]
8. Caliskan M, Cusanovich DA, Ober C, Gilad Y. The effects of EBV transformation on gene expression levels and methylation profiles. Hum. Mol. Genet. 2011;20:1643–1652. [PMC free article] [PubMed]
9. Fan S, Zhang X. CpG island methylation pattern in different human tissues and its correlation with gene expression. Biochem. Biophys. Res. Commun. 2009;383:421–425. [PubMed]
10. Consortium TIH. The international HapMap project. Nature. 2003;426:789–796. [PubMed]
11. Consortium TIH. A haplotype map of the human genome. Nature. 2005;437:1299–1320. [PMC free article] [PubMed]
12. Consortium TGP. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. [PMC free article] [PubMed]
13. Dixon AL, Liang L, Moffatt MF, Chen W, Heath S, Wong KC, Taylor J, Burnett E, Gut I, Farrall M, et al. A genome-wide association study of global gene expression. Nat. Genet. 2007;39:1202–1207. [PubMed]
14. Stranger BE, Dermitzakis ET. From DNA to RNA to disease and back: the ‘central dogma' of regulatory disease variation. Hum. Genomics. 2006;2:383–390. [PMC free article] [PubMed]
15. Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavare S, et al. Genome-wide associations of gene expression variation in humans. PLoS Genet. 2005;1:e78. [PubMed]
16. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007;315:848–853. [PMC free article] [PubMed]
17. Stranger BE, Montgomery SB, Dimas AS, Parts L, Stegle O, Ingle CE, Sekowska M, Smith GD, Evans D, Gutierrez-Arcelus M, et al. Patterns of cis regulatory variation in diverse human populations. PLoS Genet. 2012;8:e1002639. [PMC free article] [PubMed]
18. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, et al. Population genomics of human gene expression. Nat. Genet. 2007;39:1217–1224. [PMC free article] [PubMed]
19. Moffatt MF, Kabesch M, Liang L, Dixon AL, Strachan D, Heath S, Depner M, von Berg A, Bufe A, Rietschel E, et al. Genetic variants regulating ORMDL3 expression contribute to the risk of childhood asthma. Nature. 2007;448: 470–473. [PubMed]
20. Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008;24:1547–1548. [PubMed]
21. Touleimat N, Tost J. Complete pipeline for infinium((R)) human methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics. 2012;4:325–341. [PubMed]
22. Vapnik V. Statistical Learning Theory. New York: Wiley; 1998.
23. Lexandros KMD. Support vector machines in R. J. Stat. Softw. 2006;15:1–28.
24. Langfelder P, Horvath S. Eigengene networks for studying the relationships between co-expression modules. BMC Syst. Biol. 2007;1:54. [PMC free article] [PubMed]
25. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9:559. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press