In this paper, we employ the concept of DMRs as markers of immune cell identity using a high-density methylation platform, and propose a set of analytical tools for estimating the proportions of immune cells in unfractionated whole blood. The backbone of the approach is the DNA methylation signature of each of the principal immune components of whole blood (B cells, granulocytes, monocytes, NK cells, and T cell subsets). The examples provided above illustrate that our proposed methodology produces parameter estimates consistent with the literature, thus validating its utility.

Our proposed method resembles regression calibration, where we consider a methylation signature to be a high-dimensional multivariate surrogate for the distribution of white blood cells. In turn, this distribution is of interest for predicting or modeling disease states. As a surrogate, the DNA methylation signature is assumed to be a highly correlated, yet imperfect, measure of leukocyte distribution, and thus fits into the framework of measurement error models. In that framework, using a noisy surrogate marker to investigate an association with a disease outcome of interest results in biased estimates unless internal or external validation data can be obtained to “calibrate” the model and correct the bias [12]. However, in this case, the problem is complicated by the extremely high dimension of the surrogate. Measurement error problems are typically formulated as a set of relationships between **z**, the disease outcome (e.g. case/control status), *ω*, the gold standard (e.g. leukocyte distribution), and **Y**, the surrogate (e.g. DNA methylation). Of interest is E(**z** | *ω*), which may be difficult to estimate due to the cost or logistical complications involved in obtaining *ω* in a large number of samples. Typically, it is possible to collect sufficient data for modeling E(**z** | **Y**), which provides information about E(**z** | *ω*) through the (often imperfect) association E(**Y** | *ω*), which is inferred from an external validation sample [12,34].

Unfortunately, the high-dimensional nature of **Y** renders E(**z** | **Y**) difficult to formulate. While multivariate methods of measurement error correction exist, even in a high-dimensional context [35], they require an explicit specification of E(**z** | **Y**), which demands a large number of parameters even for a main effects regression model, and many more in order to account for interactions. This becomes unwieldy when each component of **Y** contributes a small amount of information about **z**, and both dimension-reduction strategies and constrained regression strategies entail substantial loss of information and may be extremely computationally intensive. Existing measurement error formulations [34,35] would have required us to specify a logistic regression model for case/control status, conditional on DNA methylation signature, a computationally difficult task that would have been extremely vulnerable to model mis-specification. On the other hand, our method requires specification of E(**Y** | **z**), which is natural and straightforward. Note that in some treatments of regression calibration, E(*ω* | **Y**) is used as a surrogate for *ω* in regression models for **z** [12]; our treatment essentially assumes a linear form for E(**Y** | *ω*) and effectively obtains E(*ω* | **Y**) by projecting **Y** onto the column space of the resulting matrix.

We note that it is possible, using existing methods, to qualitatively describe immune response contributions to DNA methylation. This is typically done by conducting a pathway analysis along the lines of one of the methods described in [36], the best option of which is Gene Set Enrichment Analysis (GSEA) [37]. For example, Teschendorff et al. (2009) [22] use GSEA to qualitatively motivate an immunological explanation. However, these methods do not directly quantify the immunological contribution.
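The projection step mentioned above, in which E(*ω* | **Y**) is obtained by projecting **Y** onto the column space of a matrix derived from the linear form of E(**Y** | *ω*), can be illustrated with a small numerical sketch. The code below is not the implementation used in this paper: it uses plain unconstrained least squares, whereas a practical analysis may impose constraints (for example, non-negativity) that are not shown here. The function name `estimate_mixture`, the toy signature matrix, and the mixing fractions are all hypothetical values chosen for illustration.

```python
import numpy as np


def estimate_mixture(signature, profile):
    """Project a methylation profile onto the column space of a signature matrix.

    signature: (n_cpgs, n_cell_types) array of mean methylation values,
               one column per purified cell type (hypothetical input).
    profile:   (n_cpgs,) array of methylation values from a mixed sample.
    Returns the unconstrained least-squares mixing coefficients.
    """
    coefficients, *_ = np.linalg.lstsq(signature, profile, rcond=None)
    return coefficients


# Hypothetical toy signature: 6 CpGs x 3 cell types (illustrative values only).
signature = np.array([
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.5],
    [0.3, 0.7, 0.2],
    [0.5, 0.5, 0.9],
])

# Assumed "true" cell fractions used to build a noise-free mixed profile.
true_fractions = np.array([0.6, 0.3, 0.1])
profile = signature @ true_fractions

estimated = estimate_mixture(signature, profile)
print(estimated)
```

In this noise-free setting the projection recovers the assumed fractions up to floating-point error; with measurement noise the estimates would only approximate them.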

An important consideration in the measurement error literature is that of transportability of model parameters [38]. In our setting, the question is whether the methylation profiles obtained from the purified blood cells used to assemble *S*_{0} would be representative of the white blood cells measured within *S*_{1}. Because of the biological assumptions inherent in the DMR literature and underlying the current understanding of hematopoiesis and lineage commitment, this assumption is reasonable, provided our method is used to detect abnormal mixtures of normal white blood cells. However, methylation abnormalities in the white blood cells themselves constitute a form of non-cell mediated alteration (in the sense of the term we have been using), and contribute to bias in our methods, as described briefly above and in detail in Additional file 1.

Note that our formulation respects the study design (DNA methylation assay data collected after sampling from phenotype groups). An alternative strategy outside the measurement error literature but within the larger missing-data literature might have been the use of an Expectation-Maximization (EM) algorithm to integrate over the missing data *ω* [39]. However, by design, the distribution of *ω* varied substantially between the data sets *S*_{0} and *S*_{1}, severely complicating the approach; notably, it would introduce feedback from *S*_{1} to *S*_{0}, contaminating the gold-standard status of *S*_{0}. Another alternative might be the use of an empirical Bayes procedure, reminiscent of existing mixture-model approaches [40]. However, difficulty in specifying the distribution of “remainder terms” (denoted as *ξ* above) renders this approach untenable, and in simulations (not presented), attempts to impute *ω* among *S*_{1} samples using parameters obtained from *S*_{0} samples resulted in extremely biased estimates of *ω*.

The most significant aspect of the current study is our development of a method for inferring changes in the distribution of white blood cell types between different human populations (e.g. cases and controls) using DNA methylation signatures, an approach guided by an external validation set consisting of methylation profiles from purified white blood cell components. DNA methylation in peripheral blood is a potentially powerful new biomarker for clinical and epidemiological investigation. For example, numerous studies have now attempted to distinguish cancer cases from controls using whole peripheral blood assayed via DNA methylation arrays, including ovarian [22], bladder [41], and pancreatic [42] cancers. While these studies have demonstrated good to excellent discrimination of cases from controls, sound evidence for a biological mechanism has been elusive. Presumably, disease-associated alterations in blood methylation have several etiological components driven by inherent genetic, environmental and disease-specific factors. Given the known developmentally associated differences in DNA methylation among specific blood cell types, changes in the distributions of blood cell types alone could account for disease-associated DNA methylation. While numerous authors provide a qualitative discussion that includes the possibility of immune-related DNA methylation differences (e.g. [22]), none to date has specifically quantified the contribution from immune response. On the other hand, the many diverse types of immune cells in blood make this issue highly complex and problematic to tackle using single cell type assays. Therefore, it is crucial to the development of this new avenue of biomarker research to delineate effects due to the immune cell distribution itself from other “non cell type” alterations in DNA methylation. We term the differences among human populations attributed to cell distributions “immunologically mediated”. Our solution for partitioning this component of variation in methylation from other determinants is a set of multivariate analytic tools, including regression coefficients and associated inference, as well as coefficient of determination measures. Taken together, these provide a means for evaluating whether the observed DNA methylation differences are due to an immunologically mediated response.

In our Additional file 1 we provide a detailed analysis of potential sources of bias in our analysis. One obvious biological source of bias is the age of the subjects contributing cells for validation. At certain CpG loci, DNA methylation is known to change with age [43], especially in T cells [44]. In Additional file 1 we demonstrate that any age-related associations with DNA methylation in our top 100 CpGs were too weak to be detected with the current validation sample, and thus unlikely to bias the results of our analyses (notably age coefficients provided for the HNSCC example). However, we remark that with larger sample sizes, adjustments for age can be incorporated with an appropriate additional term in the linear model (1) for **Y**_{0h}.

Similar methods based on mRNA have been employed [13-15]. The statistical principles described in this article would apply, wholesale, to mRNA expression profiles, but with two cautionary statements. The first is mathematical: mRNA is typically analyzed on a logarithmic scale, yet the assumptions of the proposed methodology involve linearity on an arithmetic scale, since the mixing coefficients are assumed to act linearly on absolute numbers of nucleic acid molecules; thus, the proposed methods would require analysis of untransformed fluorescence intensities, whose skewed distributions would result in numerical instabilities. The second is biological: there is not necessarily a linear relationship between cell number and mRNA copies, since proteins may be translated as a consequence of an initial burst of mRNA transcription upon cellular development, after which significant mRNA degradation is possible. In contrast, one would expect the average beta value provided by Illumina bead-array products (and similar quantities) to scale in proportion to the actual fraction of methylated nucleic acids; in addition, an assumption of two DNA molecules per cell seems biologically reasonable. In Additional file 1 we provide an example of an application of our methods using mRNA data.
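The mathematical caution above, that mixing acts linearly on the arithmetic scale but not on the logarithmic scale, can be seen in a two-cell-type toy calculation. The sketch below is purely illustrative: the expression levels, the mixing fraction, and the helper `recover_fraction` are hypothetical and do not come from the data analyzed in this paper.

```python
import numpy as np

# Hypothetical abundance of one transcript in two purified cell types.
pure_a, pure_b = 1000.0, 10.0
fraction_a = 0.5  # assumed true proportion of cell type A in the mixture

# Mixing acts on molecule counts, so the mixture is linear on the arithmetic scale.
mixed = fraction_a * pure_a + (1.0 - fraction_a) * pure_b


def recover_fraction(mix, a, b):
    """Solve mix = f * a + (1 - f) * b for the mixing fraction f."""
    return (mix - b) / (a - b)


# Arithmetic scale: the linear model matches the mixing process.
frac_linear = recover_fraction(mixed, pure_a, pure_b)

# Log scale: applying the same linear model to log2 values misstates the fraction.
frac_log = recover_fraction(np.log2(mixed), np.log2(pure_a), np.log2(pure_b))

print(frac_linear, frac_log)
```

Under these assumed values, the arithmetic-scale calculation returns the true fraction of 0.5, while the log-scale calculation returns roughly 0.85, overstating the contribution of the highly expressing cell type.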

Going forward, there are two issues that require further experimental and analytical refinement. First, although the current studies suggest group-level comparisons of blood cell DNA methylation can reveal important immune alterations, it will be important to provide methods for individual-level immune cell profiling, since clinical and detailed analytical epidemiologic applications that examine individual risk factor information will be the subject of future studies. As we have demonstrated above, individual immune profiles are theoretically achievable but will require extensive validation, with a wide array of mixture combinations, before gaining widespread acceptance. Second, there is intense interest in minor immune cell fractions and their role in disease, though the signal strength of cell types comprising < 5% of the total white cell compartment may make them difficult to quantitate. Examples of such cell types include the regulatory T-cell or NK cell fractions, which are implicated in autoimmune and malignant diseases. Optimization of platforms for technical sensitivity to minor subtypes, combined with statistical optimization of signature recognition, is needed to enhance the approach for testing highly targeted immune hypotheses.
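One way to see why minor fractions are hard to quantitate is to compare the precision of a least-squares mixing coefficient with the size of the fraction it estimates. The sketch below is a simplified illustration under strong assumptions that are not taken from this paper: a randomly generated signature matrix, ordinary least squares, and independent errors with a common standard deviation. The function name `coefficient_standard_errors` and all numerical values are hypothetical.

```python
import numpy as np


def coefficient_standard_errors(signature, noise_sd):
    """Standard errors of ordinary least-squares mixing coefficients,
    assuming independent errors with common standard deviation noise_sd."""
    gram_inverse = np.linalg.inv(signature.T @ signature)
    return noise_sd * np.sqrt(np.diag(gram_inverse))


rng = np.random.default_rng(0)

# Hypothetical signature: 100 CpGs x 4 cell types, values in (0.05, 0.95).
signature = rng.uniform(0.05, 0.95, size=(100, 4))

# Assumed cell fractions; the last cell type is a minor (< 5%) component.
fractions = np.array([0.60, 0.25, 0.12, 0.03])

standard_errors = coefficient_standard_errors(signature, noise_sd=0.05)

# The absolute standard error does not depend on the true fraction,
# so the relative error grows as the fraction shrinks.
relative_errors = standard_errors / fractions
print(relative_errors)
```

Under these assumptions the absolute standard errors are similar across cell types, so the relative error for the 3% component is many times larger than for the 60% component.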