"Stretches of DNA with a high G+C content, and a frequency of CpG dinucleotides close to the expected value, appear as CpG clusters within the CpG-depleted bulk DNA, and are now generally known as CpG islands"
. This original description of CpG islands by Gardiner-Garden and Frommer in 1987 [20] formulates the basic idea underlying the present work: CpG dinucleotides appear clustered within the CpG-depleted bulk DNA, and it should be possible to associate these clusters with CpG islands. In the same work [20], the authors also proposed a threshold-based criterion for CpG islands, which later became the basic principle of practically all existing CpG island finders. They justified these criteria by assuming that CpG-rich regions over 200 bp in length are unlikely to have occurred by chance alone, which points to another important property of CpG islands implemented in this work: statistical significance. Some years earlier, McClelland and Ivarie [3] had introduced a Chi-square test to assign statistical significance to CpG islands. Our approach is therefore probably closer to the original perception of CpG islands as statistically significant CpG clusters within CpG-depleted regions.
Both our distance approach (which directly predicts CpG clusters) and the threshold approach derive from the same original idea, namely that CpGs form clusters in the genome. However, the main disadvantage of any threshold approach is that, in general, valid CpG islands may be discarded as well, an effect that is aggravated as the dimension of the parameter space grows. In our distance approach, we reduced the parameter space considerably and, furthermore, linked the distance parameter to intrinsic statistical properties of the sequence. The chosen parameter, the median distance between neighboring CpGs in a given chromosome, separates the intra-cluster distances fairly well from the inter-cluster distances (see Fig. 1) and therefore lends a certain objectivity to the choice of this parameter. Note furthermore that the median distance is correlated with the G+C content of the chromosome sequence: the higher the G+C content of the chromosome, the higher the probability that a CpG appears and, consequently, the lower the median distance. In this way, the median distance adjusts itself to the global conditions dictated by the given input sequence. This can hardly be achieved with the conventional high-dimensional threshold parameter space; in previous work, therefore, the same threshold values were used indiscriminately for all chromosomes.
Figure 1. Probability density function of distances between neighboring CpGs. Distribution of distances between neighboring CpG dinucleotides in human chromosome 1. The observed distribution is represented in symbols, while the random expectation corresponding ...
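The distance-based idea described above can be illustrated with a minimal Python sketch. It is not the CpGcluster implementation: the function names are hypothetical, distances are measured here simply as the difference between the start coordinates of neighboring CpGs (the convention used by the actual tool may differ), and clusters are required to contain at least two CpGs purely for illustration.

```python
import re
from statistics import median


def cpg_positions(seq):
    """Return the 0-based start position of every CpG dinucleotide in seq."""
    return [m.start() for m in re.finditer("CG", seq.upper())]


def neighbor_distances(positions):
    """Distances between neighboring CpGs (difference of start coordinates)."""
    return [b - a for a, b in zip(positions, positions[1:])]


def cluster_cpgs(positions, max_dist):
    """Group consecutive CpGs whose neighbor distance is <= max_dist.

    Returns a list of (start, end, n_cpgs) tuples, where start is the position
    of the first C and end is one past the last G (half-open interval).
    Only groups with at least two CpGs are reported (illustrative choice).
    """
    clusters = []
    if not positions:
        return clusters
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev <= max_dist:
            count += 1
        else:
            if count >= 2:
                clusters.append((start, prev + 2, count))
            start, count = pos, 1
        prev = pos
    if count >= 2:
        clusters.append((start, prev + 2, count))
    return clusters


if __name__ == "__main__":
    # Toy sequence; in practice the input would be a whole chromosome.
    seq = "CGACG" + "T" * 10 + "CGCG"
    positions = cpg_positions(seq)
    # The median neighbor distance serves as the single distance parameter.
    threshold = median(neighbor_distances(positions))
    print(threshold, cluster_cpgs(positions, threshold))
```

Because the threshold is derived from the input itself, a sequence with higher G+C content (and thus more closely spaced CpGs) automatically yields a smaller threshold, with no manual tuning.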
The first consequence of the difference between the distance and threshold approaches is that, on average, CpGcluster islands are shorter. However, they show higher mean G+C content, CpG density, and CpG fraction than the islands predicted by any of the other five tested algorithms (Table ). The lower values shown by these threshold-based algorithms may be an inherited consequence of the general approach shared by most of them. To some extent, the chosen thresholds predetermine the statistical properties of the islands, since islands are usually extended for as long as the thresholds are not violated. This threshold-dependent extension during the search may also lead to the observed over-prediction of CpG islands and the high Alu overlap shown by most threshold-based algorithms. In contrast, CpGcluster overcomes this drawback because statistical properties of the CGIs, such as G+C content or CpG fraction, are not used as search parameters. Note furthermore that the p-value is a crucial filter parameter for sorting out spurious Alu elements. Young Alus have p-values around 10^-7 (with slight variations among chromosomes), and the high substitution rates at Alu CpG sites then produce a fast loss of statistical significance, which explains the low overlap with spurious Alu elements shown by the islands predicted by CpGcluster.
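For readers who wish to compare island properties such as those discussed above, the following Python sketch computes three common descriptors for a candidate island. The function name is hypothetical, and the observed/expected ratio uses the widely used definition (number of CpGs times sequence length, divided by the product of the C and G counts); whether this coincides exactly with the "CpG fraction" reported here is an assumption.

```python
def island_stats(seq):
    """Return G+C content, CpG density, and CpG observed/expected ratio.

    Illustrative definitions (assumptions, not taken from the text):
      gc_content  = (#C + #G) / length
      cpg_density = #CpG / length
      cpg_oe      = #CpG * length / (#C * #G)
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")
    return {
        "gc_content": (c + g) / n,
        "cpg_density": cpg / n,
        # Guard against division by zero when C or G is absent.
        "cpg_oe": cpg * n / (c * g) if c and g else 0.0,
    }


if __name__ == "__main__":
    print(island_stats("CGCGAT"))
```

In the distance approach these quantities are computed only after prediction, as descriptors of the islands, rather than serving as search thresholds.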
Finally, we wish to discuss briefly the lack of any length filter in CpGcluster, which allows the prediction of extremely short islands and which, at first glance, could be interpreted as a disadvantage. It should be noted that in all of the previous algorithms the length is not used for prediction purposes and is considered only in the final filtering step. In fact, the original purpose of the length threshold was to guarantee that the predicted islands are not a mere product of chance. We instead replace the length filter with a statistically stricter criterion: the p-value. In this way, all predicted CGIs are statistically significant CpG clusters. We are aware that putative functional CGIs are, on average, very long (for example, the L1 class in Table ). However, it is important to stress the conceptual difference between the detection of CpG clusters and the subsequent filtering for a particular subset (e.g. promoter-overlapping CGIs). These two steps should be clearly distinguished.
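To make the idea of a p-value filter concrete, the sketch below assigns a significance to a cluster with a given number of CpGs and length. This section does not state the formula used by CpGcluster, so the code substitutes a simple binomial upper tail as a stand-in: it assumes that each of the length - 1 dinucleotide positions is a CpG independently with probability p_cpg (for example, the chromosome-wide CpG frequency). The function name and this independence assumption are illustrative only.

```python
import math


def cluster_pvalue(n_cpg, length, p_cpg):
    """Probability of seeing at least n_cpg CpGs in a sequence of this length.

    Simplified model (an assumption, not the published method): the
    length - 1 dinucleotide positions are independent Bernoulli trials
    with success probability p_cpg, where 0 < p_cpg < 1.
    """
    trials = length - 1
    if n_cpg <= 0:
        return 1.0
    if n_cpg > trials:
        return 0.0
    log_p = math.log(p_cpg)
    log_q = math.log1p(-p_cpg)
    total = 0.0
    for k in range(n_cpg, trials + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_term = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * log_p
            + (trials - k) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


if __name__ == "__main__":
    # A short but dense cluster can be far more significant than a long sparse one.
    print(cluster_pvalue(10, 60, 0.01))
    print(cluster_pvalue(10, 1000, 0.01))
```

Under such a model, a short cluster is retained whenever its CpG density is improbable enough under chance, which is the sense in which a p-value can replace a fixed length cutoff.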