|Home | About | Journals | Submit | Contact Us | Français|
The computational prediction of DNA methylation has become an important topic in the recent years due to its role in the epigenetic control of normal and cancer-related processes. While previous prediction approaches focused merely on differences between methylated and unmethylated DNA sequences, recent experimental results have shown the presence of much more complex patterns of methylation across tissues and time in the human genome. These patterns are only partially described by a binary model of DNA methylation. In this work we propose a novel approach, based on profile analysis of tissue-specific methylation that uncovers significant differences in the sequences of CpG islands (CGIs) that predispose them to a tissue- specific methylation pattern.
We defined CGI methylation profiles that separate not only between constitutively methylated and unmethylated CGIs, but also identify CGIs showing a differential degree of methylation across tissues and cell-types or a lack of methylation exclusively in sperm. These profiles are clearly distinguished by a number of CGI attributes including their evolutionary conservation, their significance, as well as the evolutionary evidence of prior methylation. Additionally, we assess profile functionality with respect to the different compartments of protein coding genes and their possible use in the prediction of DNA methylation.
Our approach provides new insights into the biological features that determine if a CGI has a functional role in the epigenetic control of gene expression and the features associated with CGI methylation susceptibility. Moreover, we show that the ability to predict CGI methylation is based primarily on the quality of the biological information used and the relationships uncovered between different sources of knowledge. The strategy presented here is able to predict, besides the constitutively methylated and unmethylated classes, two more tissue specific methylation classes conserving the accuracy provided by leading binary methylation classification methods.
The most important epigenetic modification of vertebrate DNA involves the addition of a methyl group to the carbon-5 of the pyrimidine ring of the cytosine in CpG dinucleotides (CpGs) [1-3]. The methylation of DNA provokes a localized restriction of transcription that can be used for the selective silencing of genes. This form of transcriptional control is mediated by regulatory regions termed CpG islands (CGIs) which overlap the promoter of all human housekeeping genes and over half of all tissue-specific genes [4-7]. CGIs are the only regions in the human genome that are rich in unmethylated CpGs  and therefore represent a notable exception to the almost "global" methylation that affects the bulk of the genome and has, over the time, resulted in the depletion of CpG dinucleotides from it.
The methylation of CGIs is associated with a host of normal and cancer-related processes [8-19], making them an important target for large-scale studies of DNA methylation that aim to shed light on their role in the epigenetic control of gene expression [20,21]. Nevertheless, measuring DNA methylation experimentally involves procedures that are time-consuming, expensive and have only recently been scaled to genome-wide approaches that maintain a high degree of resolution [3,22,23]. Computational solutions to the genome-wide prediction of CGI methylation would therefore be a great aid . However, the characteristics that make a sequence susceptible or resistant to methylation are not completely understood.
Recent studies employing supervised machine learning methods account for differences between methylated and unmethylated DNA sequences [25-28]. However, recent experimental results have shown the presence of much more complex patterns of methylation in the human genome . Since these patterns may vary across tissues and developmental stages, they are partially described by a binary methylation model . Current computational methods therefore distinguish well between constitutively methylated and unmethylated CGIs, but do not take tissue-specific CGI methylation into account. This is a significant source of uncertainty, since their prediction models were trained on a heterogeneous mixture of constitutive and tissue-specific methylation. This could mask the characteristics that truly discriminate between CGIs that are methylated or unmethylated in all tissues. One of the primary causes for this situation was the lack of high-resolution methylation data from multiple healthy human tissues . This impeded the discovery of tissue-specific CGI methylation classes and the key characteristics that predispose certain CGI sequences to either constitutive or tissue-specific methylation.
The data of Human Epigenome Project (HEP)  gives us the opportunity to try and resolve this issue. They specify the methylation status of more than 30000 individual CpGs from the human chromosomes 6, 20 and 22 in twelve healthy tissues and cell types, therefore representing the highest-quality source of experimental methylation data currently available  and have been used before to gain insights into the epigenetic variability of the human genome . In this paper we use this information to identify novel profiles that map DNA sequence, structural, physicochemical and evolutionary attributes of CGIs into methylation profiles. They clearly distinguish CGIs that are constitutively methylated or unmethylated from CGIs that show a tissue-specific degree of methylation. At the same time, these profiles provide important insights into the key attributes that determine if a CGI has a functional role in the epigenetic control of gene expression and is predisposed to become methylated during normal cellular differentiation.
In recent years, there have been several successful efforts at predicting the methylation status of CGIs. These methods use different combinations of DNA patterns and attributes to classify them as methylated or unmethylated [25-27,32], but do not take tissue-specific CGI methylation into account. In this work we elucidate intermediate subclasses of CGIs with a differential degree of methylation that better reflect the complex methylation patterns of the human genome.
Our approach is primarily a mining process carried out in the absence of supervised data [33-37] (Fig. (Fig.1)1) that identifies associations among two different data domains and uses these associations to label the database. First, we independently clusterer two different data-domains: CGI methylation and sequence-related attributes. Then, both domain-clusters are related and evaluated based on the probability of intersection using the hypergeometric measurement. This measure helps to uncover cohesive relational clusters while avoiding conditional decisions -often derived from the use of conditional probabilities–of clustering first one domain and then the other or viceversa .
Once the database is labeled, after a summarization process, we proceed with the corresponding feature selection of the new methylation classes and inference steps with the creation of a classifier with the newfound knowledge.
We selected CGIs that were covered to over 70% by the CpGs in the HEP dataset, requiring their methylation status to be defined in at least 2 tissues. This measure ensured a balanced dataset, where more than two-thirds of the CGIs were defined in all tissues and 493 (95%, of the CGIs) were defined in at least 10 tissues (Table (Table1,1, Additional file 1).
The CGIs were identified by the CpGcluster algorithm , which does not rely on the traditional three parameters of length, GC-content and O/E-ratio . Instead, it searches for closely spaced CpGs and computes the probability of finding a cluster with the same length and number of CpGs in the genome (p-value). CpGcluster has a high degree of sensitivity for detecting known, functional CGIs, while at the same time, excludes spurious repetitive elements . Over 97% of the CGIs in our dataset were covered by a CGI predicted using the traditional parameters , while the inverse ratio is 71%. This lower degree of coverage is due the high specificity of CpGcluster that excludes more false-positives, such as Alu-repeats, than methods based on the traditional thresholds.
We then characterized each of the CGIs in our dataset using 38 attributes belonging to four distinct categories: (1) CGI-specific attributes (e.g. their G+C content, Observed/Expected ratio and CpGcluster p-value), (2) Repetitive sequences (number and type of repetitive elements); (3) Evolutionary conservation (e.g. PhastCon content), as well as (4) Structural and physicochemical properties of the DNA itself (e.g. twist, tilt, roll, shift, slide and rise) . The attribute global linear analysis (PCA analysis) (Additional file 2) showed that all of them contributed significantly to the overall variability of the dataset. Therefore, all raw data were used in further steps because they provide a more interpretable characterization.
Then attribute and methylation data were independently clustered by both hierarchical and k-means clustering methods. The validity indices that define the appropriate number of clusters were not conclusive, as shown in Table Table2.2. Thus we selected cluster partitions yielding more than two clusters using the best two partition scores (see Methods).
In order to determine a combination of biological CGI attributes that naturally intersected with a specific pattern of methylation, we linked the two pairs of clusters by calculating the probability of intersection (PI) and employing a significance p-value < 0.05. This approach optimizes the cluster partitions based on the coincidence between independent clusters  instead of intrinsic intra/inter clustering measurements . The application of this unsupervised process to our dataset identified 55 significant intersections (profiles) where two independent clusters had more CGIs in common than would be expected by chance (Additional file 3). These 55 profiles are redundant due to the fact that partitions from distinct numbers of clusters were allowed in the former step. Therefore, a cluster from one domain might be related to more than one cluster from the other domain and vice versa. We removed this redundancy (Figure (Figure2a)2a) by grouping the 55 profiles and selecting a representative prototype from those that recognize similar observations. The process resulted in 9 non-redundant profiles (PBC) (Figure (Figure2),2), which demonstrate clear patterns of tissue-specific methylation (Table (Table3)3) associated with distinct biological characteristics (Table (Table4).4). The attribute values in Table Table44 were normalized between 0 and 1. This normalization is performed before the clustering process in order to prevent bias clusters caused by attributes with high absolute values. The significance at a p-value < 0.05 is relative to these normalized values. The non-normalized values are available in the supplementary information. The number of CGIs recovered with each profile is registered in Table Table55.
Finally, we labeled the observations using the corresponding methylation patterns (Figure (Figure3a,3a, Figure Figure3b)3b) into four methylation classes: Constitutively methylated, containing CGIs that are highly methylated in all tissues (Profile 1); Unmethylated in sperm contained CGIs that only lacked methylation in sperm (Profile 2); Differentially methylated contained CGIs that showed a distinct degree of methylation for each tissue (Profile 3); and Constitutively unmethylated, comprising CGIs that are uniformly unmethylated across all tissues and cell types (Profiles 4 through 9).
The initial PCA analysis suggested that all attributes were informative. However, we considered the data labeled in the previous step and applied a local feature selection analysis based on the entropy of the attributes (i.e. decision tree) for dissecting each new methylation class. The most relevant attributes were the novel ones included in this study: PhastCon content, the p-value computed by the CpGcluster algorithm, and structural characteristics of the CGIs describing their three-dimensional flexibility such as: twist, tilt, roll, shift, slide and rise of a DNA sequence from a novel model of dinucleotide stiffness. Moreover, a correlation analysis of all attributes shows that each relevant attribute was not replaceable by any other (Additional file 6). Here we describe the most relevant attributes for each class.
CGIs in this class had a high average degree of methylation in all tissues (avg_meth > 0.8), and reflect PBC1. This class is described by a higher average content of CA and TG dinucleotides, which can be seen as the "footprint" of methylation, since they are often the result of the deamination of methylated cytosine. The most notable difference between the constitutively methylated class and the other classes, specially the constitutively unmethylated, is its higher overlap with PhastCon elements. This attribute never reaches values greater than 0.1 in the former class, while it never falls below 0.5 in the differentially methylated classes. A high PhastCon overlap was originally seen as a sign of a potential conserved functional regulatory element [38,42] however their high degree of methylation poses limits on their functionality by restricting access to the DNA.
The CGIs in this class have a low avg_meth (≤ 0.2) in all tissues; it summarizes PBCs from 4 through 9. Previous results have shown a negative correlation between the concentration of CpGs and a high degree of methylation, and this idea has been used as the starting point for methylation prediction. However, our findings agree with the new experimental work of Raykan et al. . The authors find that DNA methylation can occur at high-, medium- and, contrary to previous notions, at even some low-CpG density promoters. It has been also found that certain promoters with few CpGs were active and methylated, whereas other promoters of that group can be unmethylated when active . These data suggest that DNA methylation is involved in regulating activity over a broad range of CpG O/E-ratios, including CpG-poor promoters, located in tissue-specific differentially methylated regions (tDMRs).
All constitutively unmethylated PBCs are very similar, with exception of the unexpectedly high content in GC-rich repetitive elements of PBC 9. Only 6% of the CGIs supported this PBC indicating that the combination of attributes learned from the few remaining repetitive elements is not representative of a large subset of CGIs. This is a consequence of the HEP selection process, which excluded most of these repeats . However, in our results, 36 out of the 46 CpG islands overlapping with a repetitive element, overlap either with the extended promoter region (14 CpG islands) or with the TSS of a gene (22 CGIs). Of these CGIs, 32 (70%) are unmethylated despite the presence of a repeat. In fact, all TSS-overlapping CGIs are unmethylated regardless of the repeat in their vicinity. The presence of a promoter seems to be incompatible with the methylation of these repeats. This small set agrees with the recent experimental findings of Meissner et al. , where it was proven that not all transposable elements are equally affected concerning methylation. Long interspersed nuclear elements (LINEs) and long terminal repeats (LTR) are generally methylated independently of CpG density. However, CpG density influences whether non-autonomous short interspersed nuclear elements (SINEs) and low complexity regions remain unmethylated, which may be a reason for the low degree of methylation of the repetitive elements of profile 9. These results show that CGIs with distinct biological characteristics can share the same methylation status (Figure (Figure2a)2a) and that the degree of CpG enrichment or the presence of a repetitive element alone does not determine if a sequence is protected against methylation.
These results show that CGIs with distinct biological characteristics can share the same methylation status (Figure (Figure2a)2a) and that the degree of CpG enrichment or the presence of a repetitive element alone does not determine if a sequence is protected against methylation.
The CpGcluster p-value is a key attribute for distinguishing between constitutively methylated and unmethylated CGIs. This measure distinguishes true CGIs from repetitive Alu-elements, which are the main source of false-positive CGI predictions, and is not linked to either G+C content or the O/E ratio. This is important because it can determine the significance of a CGI independently of changes in the G+C content of the genomic sequence and is therefore not affected by fluctuations in the sequence composition. G and C-containing dinucleotides (CpC, CpG, GpG) in conjunction with a reduction in A and T-containing dinucleotides (ApA, TpT) and the curvature are also important attributes for this class. The high contents of G and C dinucleotides leads to an increasing degree of sequence bending and reduces both the macroscopic curvature of the DNA as well as the amount of energy needed to separate the strands (stacking energy). Despite the increasing G+C content, the O/E ratio decreases continuously (Figure (Figure2c),2c), indicates that there is a balance between CpG enrichment on one hand and the overall G+C content on the other.
In addition to the previously known classes we found two tissue-specific methylation classes:
These CGIs lack methylation in sperm, and have a slightly lower degree of methylation in CD4 T lymphocytes, dermal fibroblasts, dermal keratinocytes and the muscle tissue of the heart. Nevertheless, sperm is the only tissue that shows a level of methylation below 0.2 making it the defining characteristic of this class. This agrees with the known normal development and control of sperm-specific gene expression of germ cells , as well as with their epigenetic reprogramming during gametogenesis.
In contrast to the classes constitutively methylated and unmethylated in sperm, this class showed a heterogeneous degree of methylation ranging across all tissues. It presented both highly methylated and unmethylated CGIs in the same tissue yielding an intermediate degree of methylation after averaging. This average was significantly lower than that of the CGIs unmethylated in Sperm. This class also presented the lowest average distance between CpGs, the highest CpG O/E value, G+C content, and CpG and GpC dinucleotide content than the constitutively methylated and unmethylated in sperm classes. This class presented unique structural characteristics, such as low degrees of bending and curvature, as well as a high degree of solvent-accessibility of the DNA backbone (DNA cleavage), which may indicate a higher degree of permissiveness for DNA binding. DNA sequence bending is generally higher if the sequence contains phased GGGCCC sequences, and therefore the bending should be higher for sequences rich in G+C. This does not apply to this class where alternating CpGs and GpCs limit bending, curvature, twist and the number of helix turns, but tends to increase the flexibility of the sequence.
In order to assess the functionality of the four newly defined methylation classes, we measured the coincidence between them and defined gene association classes (Table (Table6)6) using probability of intersection (PI) (Table (Table7).7). The PI is usually employed to perform coincidence analysis because it is a context-sensitive metric that takes into account the domain where the intersection is calculated. It measures the degree of inclusion of one set into another, considering both the number of instances intersected between two sets as well as those instances not belonging to the intersection. Formally, the PI determinates the p-value, which is the statistical significance of observing over represented intersected instances (i.e. occur more frequently than could be expected by pure chance). The results obtained show that the new classes represent unique, functional CGI profiles associated with distinct CGI methylation patterns and gene compartments (Table (Table88).
The CGIs in our dataset were associated with 497 protein-coding gene loci (Table (Table6).6). While the vast majority of the CGIs (> 68%) were located in the vicinity of a promoter, less than 4% of the CGIs were outside the genic environment, indicating that the dataset is skewed towards genic regions. This is a consequence of a bias in the HEP data, which includes few intergenic regions . Normally a higher percentage of CGIs would be found outside of the genic region in an unbiased dataset . Approximately 20% of the CGIs are located in the gene-body and we found that these CGIs were unequally distributed, since there were approximately 30% more CDS-overlapping CGIs than those located in introns.
Moreover, it is known that CGIs associated with the gene-body are susceptible to both constitutive as well as tissue-specific methylation [2,43,46]. However, by separating between coding and non-coding regions, we were able to distinguish between highly methylated and differentially methylated CGIs.
It coincided significantly with the CDS of the genes in our dataset. They are highly conserved and may be the result of GC-rich codons simulating the presence of a CGI. The methylation of a CDS region itself, in contrast to a TSS region, does not impede the progression of transcription, making this region permissive for both methylation and compact chromatin conformation.
This class coincided significantly with CGIs located in introns, indicating the presence of functional methylation-dependent sequence elements. Though the majority of the differentially methylated CGIs that were conserved overlapped with the CDS (Table (Table9),9), we found that they were the only class of CGI that was significantly enriched in highly conserved non-coding elements (HCNEs)  (Additional file 5). Since the differential methylation of HCNEs has recently been shown in a comparison of embryonic stem cells (ES) and ES-derived differentiated cells in mouse  these differentially methylated CGIs may represent examples of enhancers that are controlled by methylation [48,49].
Over 40% of the CGIs belonging to this class coincided with promoter-overlapping CGIs, including genes known to be testis-specific, such as DDX43 , HIST1H2BA , PIWIL3  and ECAT1 . Although this supports the view that germline-specific genes are preferentially methylated in somatic tissues , the only significant intersection with a gene-class was with the pseudogene-proximal CGIs, 22% of which were unmethylated in sperm. Only 12% of all CGIs were associated with pseudogenes and the majority of them represent "processed" pseudogenes (> 64%). This may still include parts of the core promoter region, including the promoter-overlapping CGI . Their lack of methylation in sperm was thought to be a by-product of the global changes in methylation that occur during spermatogenesis . Although it may also permit them to be transcriptionally active in sperm  they are normally targeted for silencing through methylation during differentiation and therefore show a high degree of methylation in somatic tissues . This complicates the identification of CGIs that are involved in controlling the sperm-specific expression of protein-coding genes via promoter-CGI methylation such as the MAGE and HAGE-genes [9,54] because it may lead to false positives in genome-wide studies of promoter methylation. For example, DPPA5 a functional testis-specific gene , is active in pluripotent cells and down-regulated during the differentiation process , but we found that it contains a CGI and a pseudogene within its 5'UTR. Therefore it is not clear if the lack of methylation of this CGI is necessary to maintain tissue-specific activity or simply a by-product of the pseudogene in its vicinity.
This class showed significant coincidence with the CGIs overlapping the TSS. However, they showed neither the highest G+C-content nor the highest O/E ratio of the whole unmethylated group. Instead, they showed the lowest average CpGcluster p-value, further supporting the use of this attribute as a better measure of functionality than the CpG enrichment or G+C content alone .
This categorization of the methylation classes was used to re-classify the dataset into the four functional methylation classes shown in Table Table8.8. The promoter/TSS proximal CGIs represent the vast majority of all CGIs and they are predominantly unmethylated. It has been estimated that about 18% of the CGIs in the human genome are subject to tissue-specific methylation  and we found our data to support this estimate since just over 17% of the CGIs were either unmethylated in sperm or differentially methylated. It is noteworthy to mention that neither of these two classes overlapped significantly with the promoter-proximal CGIs.
Our four functional CGI methylation classes were then employed in the development of a supervised classifier. Our hypothesis was that if these classes were biologically and computationally significant, they would be useful in predicting new observations. To do so we first labeled each observation using the classes assigned by the profiles. We then used a simple classifier (i.e. decision tree) that employs 23 of the 38 attributes to predict the four classes of methylation. The methods were tested via 10-fold cross-validation where imbalanced classes were compensated  (see Methods).
These results show that the decision tree can encode rules that predict CGIs with distinct methylation patterns at a high level of accuracy (Table (Table9,9, Additional file 4). It is difficult to asses the performance of our method compared to previous computational approaches due to the fact that all constrain the prediction to only two methylation classes, where a sequence is either "methylated" or "unmethylated" across all possible tissues or cell types [26,32] and they do not take into account tissue-specific methylation.
The ability of our approach to predict methylated and unmethylated CGIs, was then directly compared to the results obtained from EpiGRAPH , which were tested on our HEP-based CGI methylation data and the methylation data used by the EpiGRAPH system (Table (Table10).10). In order to compare the methods we used a binary methylation classification system as described in the Materials and Methods section.
Both datasets were classified using two methods, SVM and decision tree, the former one with two different implementations, EpiGRAPH C4.5 and the Matlab decision tree (CART, version R2007a). In all cases we used default parameters.
The results obtained using both datasets are very similar, independently of the classifier used. Thus, we suggest that our attribute set is capable of predicting methylated and nonmethylated CGIs with a high degree of accuracy even without sampling and method specific parameter optimization (Table (Table1010).
Unexpectedly, a different implementation of a simpler method such as a decision tree, obtained a better accuracy and CC than the more complex SVM with the default setting parameters. In addition, this method produces interpretable rules that can be used for a better understanding of the data and easily extended to the use of multiple classes.
As shown in table table1010 we are able to predict four different classes with accuracy close to that of the binary methylation prediction. The comparison with a more sophisticated classifier, suggests that the new information in terms of CGIs attributes and tissue-specific methylation classes are the key factors that improve the CGI classification instead of the classifier themselves (i.e. method bias) .
The analysis of DNA methylation has been based primarily on the use of binary models which predict DNA sequences to be methylated or unmethylated. We have presented a profile-based approach that is able to define novel CGI methylation data relationships which not only separated between constitutively methylated and unmethylated CGIs but also identified CGIs showing a differential degree of methylation across tissues and cell-types or a lack of methylation exclusively in sperm. Our approach differs fundamentally from previous studies since it does not specify CGI classes a priori. Instead, it employs unsupervised data clustering methods for the detection of groups of CGIs sharing a common tissue-specific degree of methylation as well as similar attributes. These types of clustering methods avoid the potential biases of the limited CGI dataset available here since they do not require pre-determined classes in order to detect homogeneous groups within the data.
The functional CGI profiles discovered in this work bring new insights into the features associated with CGI methylation susceptibility, which included their evolutionary conservation, their significance, as well as the evolutionary evidence of prior methylation. Moreover, the usefulness of this information in building a simple classifier demonstrated that the ability to predict CGI methylation is mostly based on the biological information and the relationships uncovered between different sources of knowledge. This information can be exploited for the improvement and development of new tools able to detect not only constitutive or tissue-specific CGI methylation with equally high degrees of accuracy, but CGI functionality across the genome as well.
Contrary to previous studies, our method does not rely on ad hoc thresholds in order to determine if a CGI is constitutively methylated, unmethylated or shows a tissue-specific degree of methylation [25,31,32,43].
This yielded a series of novel, functional CGI profiles that allowed us to measure the extent of tissue-specific CGI methylation within the genic environment. We found that the different functional regions of genes were not equally affected by methylation. Furthermore, we were able to determine biological attributes that influence both the functionality and the methylation status of the CGIs, allowing us to use this knowledge for the computational prediction of their methylation.
In addition to the insights provided by our approach we demonstrate that the attribute set used is able to predict four methylation classes conserving the accuracy provided by leading binary methylation classification methods.
The methylation data of the Human Epigenome Project (HEP) were used for the analysis of tissue-specific CGI methylation . They specify 1.9 million CpG methylation values, from 2,524 sequences ("amplicons") across human chromosomes 6, 20 and 22. Methylation levels were measured in 12 different healthy tissues and cell types. Methylation values stemming from the same tissue and CpG were averaged and only unique CpGs, whose methylation status has been measured in at least one of the twelve tissues, were retained. The CGIs selected for this study were determined via the CpGcluster algorithm  and the degree of CGI methylation was then calculated by averaging all methylation values per CGI. This was done separately for each of the twelve tissues and only CGIs that had at least two tissues where over 70% of their CpGs were defined by a methylation value were then included in the database. In order to minimize the impact of missing methylation values during the detection of the CGI methylation profiles, a CGI was determined to have a degree of methylation of 0.5 if it was not defined in a particular tissue. This value indicates neither a high nor low degree of methylation and introduces the least amount of bias without having to limit the database to CGIs whose methylation status was known in all 12 tissues.
Biological attributes belonging to the following categories were then used to characterize each of the CGIs in the database: (1) CGI-specific attributes: This category included the p-value calculated using the CpGcluster algorithm as a measure of CGI significance, the CpG Observed/Expected ratio (O/E ratio), the sequence content in Guanine and Cytosine (G+C content)  as well as the average distance between the CpGs of each CGI (CpG-distance), which is a measure of CpG spacing. Furthermore it included the standard deviation of the CpG distances (SD), calculated as:
where N is the number of CpGs in the CGI, χi the distance between two consecutive CpGs and is the average distance between neighboring CpGs of a CGI. Furthermore, the frequency of the 16 possible dinucleotides was measured (Dinucleotide content).
(2) Repetitive sequences: This category included both the number of repetitive elements intersecting with a CGI (Repetitive elements) and the fraction of a CGI covered by a repetitive element (Repetitive content). The human repetitive elements were identified using the RepeatMasker program .
(3) Evolutionary conservation: Conservation was measured via the fraction of each CGI overlapping with a PhastCon and the number of PhastCon elements per CGI. The PhastCon elements we used were highly conserved across 17 vertebrate genomes . We obtained the "most conserved"PhastCons that demonstrate a log-odds conservation score of 100 or better via the UCSC Genome Browser .
(4) Structural and physicochemical properties: This category included the local sequence bending (Bending) and the macroscopic sequence curvature (Curvature), calculated using the banana algorithm from EMBOSS . Furthermore it included the four attributes quantifying the number of DNA helix turns (Turns), the number of bases per turn (Bases per turn), the degree of DNA sequence twist (Degree of twist)and the base-pair stacking energy (Stacking energy), measured in kilocalories per mol, were calculated via the btwisted algorithm of the EMBOSS toolkit  and averaged over the length of the sequence. The stacking energy is measured in negative kcal/mol and the normalization was performed to between 0 and 1, values close to zero indicate that higher energy is needed to separate a region of double-stranded DNA.
The amount of DNA cleavage (DNA cleavage) indicates the solvent-accessible surface area of the DNA [63,64]. This information and was provided for each individual CGI by Thomas D. Tullius of the Department of Chemistry and Eric Bishop of the Program in Bioinformatics at the University of Boston. DNA cleavage was computed by averaging the single nucleotide cleavage values over the length of the CGI. The remaining attributes are based on a recent method described in Goni et al.  for calculating the six helical force-constants used to measure the average deformability of the CGI sequence: rotational parameters twist, tilt and roll (measured in kcal/mol degree2), translation-related parameters shift, slide and rise (measured in kcal/mol Å2) .
Prior to the profile searching, we performed a filtering of potentially uninformative CGI attributes via the Principal Components Analysis (PCA) method  using the Spotfire® DecisionSite® system . The principal components that included 90% of the cumulative Eigenvalue were then chosen for further analysis. The contribution of each attribute to the principal components was analyzed via the eigenvector plots shown in Table Table22 of the Supplementary data. CGI attributes that did not have a coefficient value greater than 0.1 or smaller than -0.1 in any of the selected principal components  were determined to be uninformative and removed from the database. Though the principal components themselves represent a reduction of the dimensionality of the data, meaning that they capture the same variability of the data but with fewer attributes, they were not used for any of the cluster analyses since neither the CGIs nor the actual values of the attributes that form part of the principle components are known.
The CGIs were assigned to 7 classes based on their association with pseudogenes acquired from http://www.pseudogene.org/ and protein coding genes annotated in the AceDB  which summarizes all curated cDNA data from GenBank , dbEST  and the RefSeq . This database was chosen because it provides a richer view of the human transcriptome, with three to five times more transcripts than the UCSC Known Genes, RefSeq or Ensembl annotations . CGIs overlapping with a TSS or the promoter proximal region were defined as two separate classes (TSS, Promoter), since the TSS-overlapping CGIs are generally thought to have a higher G+C content and higher degree of CpG enrichment than non-TSS overlapping CGIs even if they are in the vicinity of the promoter . In addition, we determined if a CGI was part of the 3'UTR (3'UTR) and separated purely intronic CGIs (Intron) from those overlapping partially or completely with a CDS on either strand (CDS).
HCNEs were downloaded from the Ancora database , a web resource that provides data and tools for exploring the genomic organization of HCNEs in multiple genomes http://ancora.genereg.net/. The HCNEs available from Ancora were identified from BLASTZ net whole-genome alignments of the human (hg18) and mouse (mm8) genomes and correspond to regions that have at least 70% identity over 50 alignment columns.
The two datasets containing the CGI methylation data and the CGI attribute data were analyzed separately using unsupervised cluster learning methods. The resulting clusters were later (section 2.2) used to define CGI profiles that connect coincident methylation and attribute clusters. Two distinct clustering methods were employed in this step: hierarchical clustering [27,28,73,74] and k-means clustering [73,75,76]. All functions are part of the Matlab Statistics Toolbox®  if not indicated otherwise. The hierarchical clustering was performed using the Euclidean distance and the complete linkage approach. The number of clusters was calculated using inconsistency threshold (ICT) and coefficient  as validity indices . The k-means clustering was performed using the Euclidean distance. To reduce the sensitivity of the algorithm to the initial random cluster centroids, each of the k-means runs was repeated ten times and the best solution was chosen. We used the silhouette method  to estimate the number of clusters. The potentially optimal number of k-means clusters was then chosen in order to maximize the average distance between silhouette means (ADSM).
The probability of intersection (PI) was used to determine the most significant intersections between the attribute and methylation data clusters. It is based on the hypergeometric distribution and is an adaptive measure that is sensitive to small sets of examples while retaining specificity with large datasets. The PI is more sensitive to relationships between smaller but highly similar groups than other measures that are based solely on the number of instances in the intersection [37,80]. This measure gives the chance probability of observing at least p candidates from a profile Vi within another profile Vj  as:
where Vi represents a cluster of CGIs defining profile of size h, Vj is a cluster of CGIs defining a profile of size n, p is the number of CGIs in the intersection between two clusters and g is the total number of CGIs in the database. The PI was computed using custom Matlab® scripts.
Identifying the right number of clusters is an unsolved problem [78,81]. As expected, different validity indices  as well as distinct clustering methods provide inconclusive results. Therefore, we selected more than one option (2.1) and instead of optimizing typicality inter and intra clustering measurements we optimized the posterior probability of matching between two different sources of clusters (2.3). This process generated several redundant profiles originating from redundant clusters. To remove this redundancy, we re-clustered the centroids  of the profile methylation classes and obtained a reduced set of classes including constitutive unmethylated, constitutive methylated, unmethylated in sperm, and differentially methylated. Then, we replaced the original classes in the profiles by the reduced ones and used them to label each instance in the CGI sequence attribute database. In other words, we transformed an unsupervised problem into a supervised one .
This transformation into a supervised problem (i.e., labeled data) allowed us to apply typical feature selection strategies to identify the most relevant attributes for each profile. We use the entropy as a discriminative measurement  implementing a decision tree [73,82] (CART, Matlab version R2007a) with default parameters. For each profile we use the labeled data covered by it. This process was locally carried out for each profile to identify which are the relevant features for a particular methylation pattern. This process could also be performed by a global decision tree including all profiles but with less interpretable results (i.e. very long rules).
The functional CGI methylation profiles were used to predict based on a classification tree with default parameters as described above, and labeled observations (2.4). The classifier performance was evaluated using 10-fold cross-validation (crossvalind Matlab, version R2007a), the accuracy (Acc), which represents the fraction of CGIs whose methylation profile was predicted correctly (equation 3) and the correlation coefficient (CC) on the test subset which combines both sensitivity and specificity (equation 4):
To compare results with other data sources we used another implementation of the decision tree in EpiGRAPH  and the support vector machine classifier from the same source. The CC was only used in the binary classification, where a CGI is classified as "methylated" if it was not constitutively unmethylated. Finally, we compensated the unbalanced number of observations per class in the non-binary experiments by oversampling the unmethylated in sperm and differentially methylated classes .
(ADSM): Average distance between silhouette means; (CART): Classification and regression tree; (CC): Correlation coefficient; (CD4): CD4 Tlymphocytes; (CD8): CD8 T lymphocytes; (CDS): Protein codingsequence; (CGI): CpG island; (DF): Dermal fibroblasts; (DK): Dermalkeratinocytes; (DM): Dermal melanocytes; (FL): Fetal liver; (FSM): Fetal skeletal muscle; (HM): Heart muscle; (ICT): Inconsistency threshold; (HCNE): Highly conserved non-coding element; (HEP): Human Epigenome Project; (LINE): Long interspersed nuclear Element; (MWW): Mann-Whitney-Wilcoxon test; (O/E ratio):Observed/Expected ratio; (PBC): Profile based class; (PCA): Principlecomponent analysis; (PI): Probability of intersection; (SD): Standard deviation of CpGdistances; (SINE): Short interspaced nuclear element; (SM): Skeletal muscle; (T-DRM): Tissue-specific differentially methylated region; (TSS): Transcription start site; (UTR): Untranslated region.
CP created the database. CP and OH performed the experiments. CP, IZ and CV analyzed the data. CP and CV wrote the manuscript. IZ and CV conceived, designed the study and financed the experiments. CV supervised the project. All authors read and approved the final manuscript.
CGI dataset. This Excel table contains the CGI dataset that formed the foundation for our data-mining approach, as well as the classification of each CGI in the binary and tissue-specific methylation classes.
Results of the Principle Component Analysis. This Excel table contains the Eigenvector plots of the PCA that were used to measure the contribution of each attribute to the principal components as well as the principle components themselves.
Table of significant cluster intersections. This table shows the significant cluster intersections, ordered by size (#CGIs) and significance of each intersecting cluster (PI). Each cluster is identified by the data used in the clustering (Attribute or Methylation data) followed by the overall number of clusters, the clustering method (Hierarchical or K-means clustering) and the particular cluster used.
Decision tree. This Figure shows the decision tree used to predict the four CGI methylation classes.
Distribution of HCNE-overlapping PBCs over the gene association classes. Absolute number and percentage of CGIs in each PBC and gene-association class that overlap with a HCNE. A total of 3 conflicting CGIs that were determined to contain CDS but were part of a HCNE were excluded. Significant enrichment (p-value < 0.05) of methylation classes with HCNEs is marked bold and was determined via the Fisher exact test in conjunction with Bonferroni correction for multiple testing.
Table of variable correlation. Correlation analysis of all attributes shows that each relevant attribute was not replaceable by any other.
We would like to thank C. Bock for facilitating the analysis with Epigraph, T. D. Tullius and E. Bishop for providing DNA cleavage data. B. Lenhard, J.L. Oliver and M. Hackenberg for their helpful comments and discussions. This work was supported in part by the Spanish Ministry of Science and Technology (MEC) under project TIN-2006-12879 and the Consejeria de Innovacion, Investigacion y Ciencia de la Junta de Andalucia under project TIC-02788. C. Previti was supported by a grant from the German Academic Exchange Service (DAAD). O. Harari acknowledges the doctoral MAEC- AECI fellowship. I. Zwir is a senior research scientist supported by the Howard Hughes Medical Institute and the "Ramon y Cajal" program of the MEC, C. del Val was supported by the "Programa de Retorno de Investigadores" from the Junta de Andalucia.