Once the database is labeled, after a summarization process, we proceed with the corresponding feature selection of the new methylation classes and inference steps with the creation of a classifier with the newfound knowledge.
1. Profile identification
We selected CGIs that were covered to over 70% by the CpGs in the HEP dataset, requiring their methylation status to be defined in at least 2 tissues. This measure ensured a balanced dataset, where more than two-thirds of the CGIs were defined in all tissues and 493 (95%, of the CGIs) were defined in at least 10 tissues (Table , Additional file 1
Unique CpGs and CGIs defined by the HEP data
The CGIs were identified by the CpGcluster
], which does not rely on the traditional three parameters of length, GC-content and O/E-ratio [39
]. Instead, it searches for closely spaced CpGs and computes the probability of finding a cluster with the same length and number of CpGs in the genome (p-value
has a high degree of sensitivity for detecting known, functional CGIs, while at the same time, excludes spurious repetitive elements [38
]. Over 97% of the CGIs in our dataset were covered by a CGI predicted using the traditional parameters [39
], while the inverse ratio is 71%. This lower degree of coverage is due the high specificity of CpGcluster that excludes more false-positives, such as Alu
-repeats, than methods based on the traditional thresholds.
We then characterized each of the CGIs in our dataset using 38 attributes belonging to four distinct categories: (1) CGI-specific attributes (e.g. their G+C content, Observed/Expected ratio and CpGcluster p
-value), (2) Repetitive sequences (number and type of repetitive elements); (3) Evolutionary conservation (e.g. PhastCon
content), as well as (4) Structural and physicochemical properties of the DNA itself (e.g. twist, tilt, roll, shift, slide and rise) [40
]. The attribute global linear analysis (PCA analysis) (Additional file 2
) showed that all of them contributed significantly to the overall variability of the dataset. Therefore, all raw data were used in further steps because they provide a more interpretable characterization.
Then attribute and methylation data were independently clustered by both hierarchical and k-means clustering methods. The validity indices that define the appropriate number of clusters were not conclusive, as shown in Table . Thus we selected cluster partitions yielding more than two clusters using the best two partition scores (see Methods).
Validity indices used to estimate the optimum number of data clusters
In order to determine a combination of biological CGI attributes that naturally intersected with a specific pattern of methylation, we linked the two pairs of clusters by calculating the probability of intersection (PI
) and employing a significance p
-value < 0.05. This approach optimizes the cluster partitions based on the coincidence between independent clusters [37
] instead of intrinsic intra/inter clustering measurements [41
]. The application of this unsupervised process to our dataset identified 55 significant intersections (profiles) where two independent clusters had more CGIs in common than would be expected by chance (Additional file 3
). These 55 profiles are redundant due to the fact that partitions from distinct numbers of clusters were allowed in the former step. Therefore, a cluster from one domain might be related to more than one cluster from the other domain and vice versa. We removed this redundancy (Figure ) by grouping the 55 profiles and selecting a representative prototype from those that recognize similar observations. The process resulted in 9 non-redundant profiles (PBC) (Figure ), which demonstrate clear patterns of tissue-specific methylation (Table ) associated with distinct biological characteristics (Table ). The attribute values in Table were normalized between 0 and 1. This normalization is performed before the clustering process in order to prevent bias clusters caused by attributes with high absolute values. The significance at a p
-value < 0.05 is relative to these normalized values. The non-normalized values are available in the supplementary information. The number of CGIs recovered with each profile is registered in Table .
Figure 2 Determining non-redundant CGI profiles. Elimination of redundant CGI profiles. Initially, 55 profiles (relations between CGI sequence attributes and methylation classes linked by the probability of intersection) were identified. We grouped all profiles (more ...)
CGI profiles: Methylation values of each non-redundant CGI profile.
CGI profiles: Attribute values of each non-redundant CGI profile.
Finally, we labeled the observations using the corresponding methylation patterns (Figure , Figure ) into four methylation classes: Constitutively methylated, containing CGIs that are highly methylated in all tissues (Profile 1); Unmethylated in sperm contained CGIs that only lacked methylation in sperm (Profile 2); Differentially methylated contained CGIs that showed a distinct degree of methylation for each tissue (Profile 3); and Constitutively unmethylated, comprising CGIs that are uniformly unmethylated across all tissues and cell types (Profiles 4 through 9).
Figure 3 Linking clusters and Feature selection of new methylation classes. Summarization and feature selection of CGI profiles. A) Identification of 9 CGI profiles by linking CGI sequence attribute clusters (lower left corner) and methylation clusters (upper (more ...)
The initial PCA analysis suggested that all attributes were informative. However, we considered the data labeled in the previous step and applied a local feature selection analysis based on the entropy of the attributes (i.e. decision tree) for dissecting each new methylation class. The most relevant attributes were the novel ones included in this study: PhastCon
content, the p
-value computed by the CpGcluster
algorithm, and structural characteristics of the CGIs describing their three-dimensional flexibility such as: twist, tilt, roll, shift, slide and rise of a DNA sequence from a novel model of dinucleotide stiffness. Moreover, a correlation analysis of all attributes shows that each relevant attribute was not replaceable by any other (Additional file 6
). Here we describe the most relevant attributes for each class.
Constitutively methylated class
CGIs in this class had a high average degree of methylation in all tissues (avg_meth > 0.8), and reflect PBC1. This class is described by a higher average content of CA and TG dinucleotides, which can be seen as the "footprint" of methylation, since they are often the result of the deamination of methylated cytosine. The most notable difference between the constitutively methylated class and the other classes, specially the constitutively unmethylated, is its higher overlap with PhastCon
elements. This attribute never reaches values greater than 0.1 in the former class, while it never falls below 0.5 in the differentially methylated classes. A high PhastCon
overlap was originally seen as a sign of a potential conserved functional regulatory element [38
] however their high degree of methylation poses limits on their functionality by restricting access to the DNA.
Constitutively unmethylated class
The CGIs in this class have a low avg_meth (≤ 0.2) in all tissues; it summarizes PBCs
. Previous results have shown a negative correlation between the concentration of CpGs and a high degree of methylation, and this idea has been used as the starting point for methylation prediction. However, our findings agree with the new experimental work of Raykan et al. [43
]. The authors find that DNA methylation can occur at high-, medium- and, contrary to previous notions, at even some low-CpG density promoters. It has been also found that certain promoters with few CpGs were active and methylated, whereas other promoters of that group can be unmethylated when active [44
]. These data suggest that DNA methylation is involved in regulating activity over a broad range of CpG O/E-ratios, including CpG-poor promoters, located in tissue-specific differentially methylated regions (tDMRs).
All constitutively unmethylated PBCs are very similar, with exception of the unexpectedly high content in GC-rich repetitive elements of PBC 9. Only 6% of the CGIs supported this PBC indicating that the combination of attributes learned from the few remaining repetitive elements is not representative of a large subset of CGIs. This is a consequence of the HEP selection process, which excluded most of these repeats [31
]. However, in our results, 36 out of the 46 CpG islands overlapping with a repetitive element, overlap either with the extended promoter region (14 CpG islands) or with the TSS of a gene (22 CGIs). Of these CGIs, 32 (70%) are unmethylated despite the presence of a repeat. In fact, all TSS-overlapping CGIs are unmethylated regardless of the repeat in their vicinity. The presence of a promoter seems to be incompatible with the methylation of these repeats. This small set agrees with the recent experimental findings of Meissner et al. [30
], where it was proven that not all transposable elements are equally affected concerning methylation. Long interspersed nuclear elements (LINEs) and long terminal repeats (LTR) are generally methylated independently of CpG density. However, CpG density influences whether non-autonomous short interspersed nuclear elements (SINEs) and low complexity regions remain unmethylated, which may be a reason for the low degree of methylation of the repetitive elements of profile 9. These results show that CGIs with distinct biological characteristics can share the same methylation status (Figure ) and that the degree of CpG enrichment or the presence of a repetitive element alone does not determine if a sequence is protected against methylation.
These results show that CGIs with distinct biological characteristics can share the same methylation status (Figure ) and that the degree of CpG enrichment or the presence of a repetitive element alone does not determine if a sequence is protected against methylation.
The CpGcluster p-value is a key attribute for distinguishing between constitutively methylated and unmethylated CGIs. This measure distinguishes true CGIs from repetitive Alu-elements, which are the main source of false-positive CGI predictions, and is not linked to either G+C content or the O/E ratio. This is important because it can determine the significance of a CGI independently of changes in the G+C content of the genomic sequence and is therefore not affected by fluctuations in the sequence composition. G and C-containing dinucleotides (CpC, CpG, GpG) in conjunction with a reduction in A and T-containing dinucleotides (ApA, TpT) and the curvature are also important attributes for this class. The high contents of G and C dinucleotides leads to an increasing degree of sequence bending and reduces both the macroscopic curvature of the DNA as well as the amount of energy needed to separate the strands (stacking energy). Despite the increasing G+C content, the O/E ratio decreases continuously (Figure ), indicates that there is a balance between CpG enrichment on one hand and the overall G+C content on the other.
In addition to the previously known classes we found two tissue-specific methylation classes:
Unmethylated in sperm class
These CGIs lack methylation in sperm, and have a slightly lower degree of methylation in CD4 T lymphocytes, dermal fibroblasts, dermal keratinocytes and the muscle tissue of the heart. Nevertheless, sperm is the only tissue that shows a level of methylation below 0.2 making it the defining characteristic of this class. This agrees with the known normal development and control of sperm-specific gene expression of germ cells [45
], as well as with their epigenetic reprogramming during gametogenesis.
Differentially methylated class
In contrast to the classes constitutively methylated and unmethylated in sperm, this class showed a heterogeneous degree of methylation ranging across all tissues. It presented both highly methylated and unmethylated CGIs in the same tissue yielding an intermediate degree of methylation after averaging. This average was significantly lower than that of the CGIs unmethylated in Sperm. This class also presented the lowest average distance between CpGs, the highest CpG O/E value, G+C content, and CpG and GpC dinucleotide content than the constitutively methylated and unmethylated in sperm classes. This class presented unique structural characteristics, such as low degrees of bending and curvature, as well as a high degree of solvent-accessibility of the DNA backbone (DNA cleavage), which may indicate a higher degree of permissiveness for DNA binding. DNA sequence bending is generally higher if the sequence contains phased GGGCCC sequences, and therefore the bending should be higher for sequences rich in G+C. This does not apply to this class where alternating CpGs and GpCs limit bending, curvature, twist and the number of helix turns, but tends to increase the flexibility of the sequence.
2. Functional CGI categorization by gene association
In order to assess the functionality of the four newly defined methylation classes, we measured the coincidence between them and defined gene association classes (Table ) using probability of intersection (PI) (Table ). The PI is usually employed to perform coincidence analysis because it is a context-sensitive metric that takes into account the domain where the intersection is calculated. It measures the degree of inclusion of one set into another, considering both the number of instances intersected between two sets as well as those instances not belonging to the intersection. Formally, the PI determinates the p-value, which is the statistical significance of observing over represented intersected instances (i.e. occur more frequently than could be expected by pure chance). The results obtained show that the new classes represent unique, functional CGI profiles associated with distinct CGI methylation patterns and gene compartments (Table ).
Distribution of CGIs over the gene association classes
Coincidence between gene association classes and PBCs
Re-classification using functional CGI profiles
The CGIs in our dataset were associated with 497 protein-coding gene loci (Table ). While the vast majority of the CGIs (> 68%) were located in the vicinity of a promoter, less than 4% of the CGIs were outside the genic environment, indicating that the dataset is skewed towards genic regions. This is a consequence of a bias in the HEP data, which includes few intergenic regions [31
]. Normally a higher percentage of CGIs would be found outside of the genic region in an unbiased dataset [38
]. Approximately 20% of the CGIs are located in the gene-body and we found that these CGIs were unequally distributed, since there were approximately 30% more CDS-overlapping CGIs than those located in introns.
Moreover, it is known that CGIs associated with the gene-body are susceptible to both constitutive as well as tissue-specific methylation [2
]. However, by separating between coding and non-coding regions, we were able to distinguish between highly methylated and differentially methylated CGIs.
Constitutively methylated class
It coincided significantly with the CDS of the genes in our dataset. They are highly conserved and may be the result of GC-rich codons simulating the presence of a CGI. The methylation of a CDS region itself, in contrast to a TSS region, does not impede the progression of transcription, making this region permissive for both methylation and compact chromatin conformation.
Differentially methylated class
This class coincided significantly with CGIs located in introns, indicating the presence of functional methylation-dependent sequence elements. Though the majority of the differentially methylated CGIs that were conserved overlapped with the CDS (Table ), we found that they were the only class of CGI that was significantly enriched in highly conserved non-coding elements (HCNEs) [47
] (Additional file 5
). Since the differential methylation of HCNEs has recently been shown in a comparison of embryonic stem cells (ES) and ES-derived differentiated cells in mouse [30
] these differentially methylated CGIs may represent examples of enhancers that are controlled by methylation [48
Distribution of conserved and not conserved CGIs over PBCs and gene association classes
Unmethylated in sperm class
Over 40% of the CGIs belonging to this class coincided with promoter-overlapping CGIs, including genes known to be testis-specific, such as DDX43 [16
], HIST1H2BA [50
], PIWIL3 [51
] and ECAT1 [52
]. Although this supports the view that germline-specific genes are preferentially methylated in somatic tissues [44
], the only significant intersection with a gene-class was with the pseudogene-proximal CGIs, 22% of which were unmethylated in sperm. Only 12% of all CGIs were associated with pseudogenes and the majority of them represent "processed" pseudogenes (> 64%). This may still include parts of the core promoter region, including the promoter-overlapping CGI [53
]. Their lack of methylation in sperm was thought to be a by-product of the global changes in methylation that occur during spermatogenesis [45
]. Although it may also permit them to be transcriptionally active in sperm [44
] they are normally targeted for silencing through methylation during differentiation and therefore show a high degree of methylation in somatic tissues [45
]. This complicates the identification of CGIs that are involved in controlling the sperm-specific expression of protein-coding genes via promoter-CGI methylation such as the MAGE and HAGE-genes [9
] because it may lead to false positives in genome-wide studies of promoter methylation. For example, DPPA5 a functional testis-specific gene [55
], is active in pluripotent cells and down-regulated during the differentiation process [56
], but we found that it contains a CGI and a pseudogene within its 5'UTR. Therefore it is not clear if the lack of methylation of this CGI is necessary to maintain tissue-specific activity or simply a by-product of the pseudogene in its vicinity.
Constitutively unmethylated class
This class showed significant coincidence with the CGIs overlapping the TSS. However, they showed neither the highest G+C-content nor the highest O/E ratio of the whole unmethylated group. Instead, they showed the lowest average CpGcluster p
-value, further supporting the use of this attribute as a better measure of functionality than the CpG enrichment or G+C content alone [38
This categorization of the methylation classes was used to re-classify the dataset into the four functional methylation classes shown in Table . The promoter/TSS proximal CGIs represent the vast majority of all CGIs and they are predominantly unmethylated. It has been estimated that about 18% of the CGIs in the human genome are subject to tissue-specific methylation [43
] and we found our data to support this estimate since just over 17% of the CGIs were either unmethylated in sperm or differentially methylated. It is noteworthy to mention that neither of these two classes overlapped significantly with the promoter-proximal CGIs.
3. Prediction of CGI methylation
Our four functional CGI methylation classes were then employed in the development of a supervised classifier. Our hypothesis was that if these classes were biologically and computationally significant, they would be useful in predicting new observations. To do so we first labeled each observation using the classes assigned by the profiles. We then used a simple classifier (i.e. decision tree) that employs 23 of the 38 attributes to predict the four classes of methylation. The methods were tested via 10-fold cross-validation where imbalanced classes were compensated [57
] (see Methods).
These results show that the decision tree can encode rules that predict CGIs with distinct methylation patterns at a high level of accuracy (Table , Additional file 4
). It is difficult to asses the performance of our method compared to previous computational approaches due to the fact that all constrain the prediction to only two methylation classes, where a sequence is either "methylated" or "unmethylated" across all possible tissues or cell types [26
] and they do not take into account tissue-specific methylation.
The ability of our approach to predict methylated and unmethylated CGIs, was then directly compared to the results obtained from EpiGRAPH
], which were tested on our HEP-based CGI methylation data and the methylation data used by the EpiGRAPH
system (Table ). In order to compare the methods we used a binary methylation classification system as described in the Materials and Methods section.
Comparison of accuracy using binary methylation classification.
Both datasets were classified using two methods, SVM and decision tree, the former one with two different implementations, EpiGRAPH C4.5 and the Matlab decision tree (CART, version R2007a). In all cases we used default parameters.
The results obtained using both datasets are very similar, independently of the classifier used. Thus, we suggest that our attribute set is capable of predicting methylated and nonmethylated CGIs with a high degree of accuracy even without sampling and method specific parameter optimization (Table ).
Unexpectedly, a different implementation of a simpler method such as a decision tree, obtained a better accuracy and CC than the more complex SVM with the default setting parameters. In addition, this method produces interpretable rules that can be used for a better understanding of the data and easily extended to the use of multiple classes.
As shown in table we are able to predict four different classes with accuracy close to that of the binary methylation prediction. The comparison with a more sophisticated classifier, suggests that the new information in terms of CGIs attributes and tissue-specific methylation classes are the key factors that improve the CGI classification instead of the classifier themselves (i.e. method bias) [57