Search tips
Search criteria 


Logo of ijbiostatThe International Journal of BiostatisticsThe International Journal of BiostatisticsSubmit to The International Journal of BiostatisticsSubscribe
Int J Biostat. 2009 January 1; 5(1): 14.
Published online 2009 May 7. doi:  10.2202/1557-4679.1158
PMCID: PMC2818740

A Nearly Exhaustive Search for CpG Islands on Whole Chromosomes*


CpG islands are genome subsequences with an unexpectedly high number of CG di-nucleotides. They are typically identified using filtering criteria (e.g., G+C% expected vs. observed CpG ratio and length) and are computed using sliding window methods. Most such studies illusively assume an exhaustive search of CpG islands are achieved on the genome sequence of interest. We devise a Lexis diagram and explicitly show that filtering criteria-based definitions of CpG islands are mathematically incomplete and non-operational. These facts imply that the sliding window methods frequently fail to identify a large percentage of subsequences that meet the filtering criteria. We also demonstrate that an exhaustive search is computationally expensive. We develop the Hierarchical Factor Segmentation (HFS) algorithm, a pattern recognition technique with an adaptive model selection device to overcome the incompleteness and non-operational drawbacks, and to achieve effective computations for identifying CpG-islands. The concept of a CpG island “core” is introduced and computed using the HFS algorithm, which is independent from any specific filtering criteria. Upon such a CpG island “core,” a CpG-island is constructed using a Lexis diagram. This two-step computational approach provides a nearly exhaustive search for CpG islands that can be practically implemented on whole chromosomes. In a simulation study realistically mimicking CpG-island dynamics through a Hidden Markov Model we demonstrate that this approach retains very high sensitivity and specificity, that is, very low rates of false positives and false negatives. Finally, we apply the HFS algorithm to identify CpG island cores on human chromosome 21.

1. Introduction

In the human genome, the CpG di-nucleotide (a cytosine followed by a guanine) is distinctive in the sense that its observed frequency is less than one fifth of its expected overall frequency. Furthermore, CpGs are characterized by unusual distributional patterns of sparsity and aggregation along the chromosomes. Importantly, CpGs are associated with methylation status (Bird, 1986). In particular, segments of the genome with aggregated CpG di-nucleotides, socalled CpG islands, receive very intensive attention from biological, biomedical and evolutionary researchers because of their functional properties in relation to genomics and cancer development. Though the underlying generating mechanism for CpG islands by and large still remains mysterious, the spatial patterns of CpGs in the genome are currently attributed to methylation at the 5′-end of the cytosine, followed by conversion of cytosine (C) to thymine (T) via deamination. Thus the majority of CpG islands are clusters of non-methylated CpGs that are protected from this process.

Researchers have established in the last decade that CpG islands indeed play many important functional roles. They are confirmed to be frequently located close to transcription start sites, and are involved in the regulation of gene expression (Saxonov et al, 2006). Based on these associations, many researchers have used CpG islands to predict gene promoters (Bajic et al, 2004), and have linked changes in the methylation status of CpG islands to cancer development (Ushijima, 2005; Opavsky et al, 2007).

Theoretically speaking, all association studies of CpG islands rely on a complete set of well-defined CpG islands on the genome sequence under study. This is because statistical inference about associations and promoter predictions may be seriously biased when based on an incomplete set of CpG islands. This requirement in fact has promoted the development of many computational algorithms that attempt to achieve an exhaustive search for CpG islands by using machine learning and other statistical methodologies (Bock et al, 2006, 2007). Many of the CpG island finder algorithms in the literature are available online.

Typically these algorithms employ a sliding window computation to scan through a target genome sequence for segments that satisfy some subjective choice of filtering criteria based on information about G+C%, expected vs. observed CpG ratio, and length. Among the many versions, the most popular one was proposed by Gardiner-Garden and Frommer (1987), and later modified by Takai and Jones (2002). The simplicity of these filtering criteria seems to mislead many researchers explicitly or implicitly to assume or anticipate the exhaustive search capability of their CpG island finder algorithms.

However this exhaustive search issue has not yet been rigorously addressed in the literature. Unfortunately, counter to their assumptions and anticipations, most CpG island finder algorithms are in fact not equipped with such a capability. This fact can be perceived from the following incoherence: Different CpG finders, based on the same set of filtering criteria and genome sequences, very commonly produce very different CpG island predictions. These incoherent identifications with significant non-overlapping proportions indicate that these finders may miss a significant proportion of true CpG islands. For instance, we empirically estimated that the missing proportion is likely to be as high as 50% for the algorithm used in Bock et al (2006, 2007), when applied to human chromosome 21. Similarly, the algorithm used in Yamada et al (2004), can have very low sensitivity as well.

On the other hand, some CpG island finders have reasonably good sensitivity, such as the CpGIE algorithm proposed in Wong and Leung (2004). But algorithms in this class are computationally practical only for relatively short genome sequences. This shows another essential aspect of the exhaustive search issue.

In this paper we point out that the incoherence among CpG island finders and their high percentages of missed islands are all caused by the fundamental issue: no operational definition of a CpG island is available, or equivalently, CpG island definitions are mathematically incomplete. By mathematically incomplete definition of a CpG island we mean that it is not possible to ascertain precisely whether a genome location is either inside or outside of a CpG island. This incompleteness crucially undermines the computational feasibility of any exhaustive search directly based on filtering criteria. We explicitly demonstrate the mathematical incompleteness through a technique we developed here, called a Lexis diagram. Further, since a lengthier genome sequence is subject to more severe effects of mathematical incompleteness, the Lexis diagram reveals that an exhaustive search using sliding window computations for all subsequences satisfying a particular set of filtering criteria is computationally infeasible.

To avoid the effects of mathematical incompleteness of CpG island definitions, and at the same time to resolve the computational infeasibility issue, we propose a CpG island identification approach in two steps. In the first step, a pattern recognition technique with an adaptive model selection device, called Hierarchical Factor Segmentation (HFS), is developed to identify a collection of genome segments with aggregating CpG di-nucleotides, called “cores” of CpG islands. In the second step, by centering at each identified CpG island core, we then compute and identify a CpG island using the Lexis diagram pertaining to any specific choice of filtering criteria.

The reasons why our approach is free from the issue of mathematical in-completeness are the follows. Within Step 1, no filtering criteria are involved in the computations of the HFS algorithm when it is applied on the whole genome sequence. After the HFS algorithm identifies CpG island cores along the genome sequence, the Lexis diagram is used in Step 2 to overcome potential “local” mathematical incompleteness issues within the vicinity of each CpG island core. Furthermore, the reason why our approach can achieve a nearly exhaustive search for CpG islands is that the HFS algorithm can efficiently compute and identify segments which have a higher density of CpG di-nucleotides in contrast to its neighboring segments. The adaptive model selection device built into the HFS algorithm is designed to balance the total number of cores (more cores being closer to an exhaustive search) against the computational complexity of the algorithm.

This paper is organized as follows. The mathematical incompleteness of the filtering criteria and construction of the Lexis diagram are discussed in Section 2. The HFS algorithm and its adaptive model selection device are developed in Section 3. In Section 4, our approach is demonstrated and evaluated through real data analysis of an illustrative genome sequence and human chromosome 21. In Section 5, our approach is shown to have very high sensitivity and specificity in identifying CpG islands under simulated genome dynamics based on the Hidden Markov Model. Some related issues and problems are collected and addressed in the discussion section. The resultant collection of CpG island cores on the whole chromosome 21 is provided in a supplementary file.

2. Incompleteness of filtering criteria via Lexis diagram

In this section, we first show how to construct a Lexis diagram to accommodate a set of filtering criteria. Then, based on the constructed Lexis diagram, we explicitly point out the mathematical incompleteness pertaining to this filtering criteria and discuss its implications.

Denote the genome sequence of interest Xn = {Xi}, with Xi [set membership] {A, G, T, C} at the ith base pair (bp), i = 0,..., n. A CpG dinucleotide on Xn is an occurrence of a C followed by G. The filtering criteria is given as follows (see Gardiner-Garden and Frommer (1987) and Takai and Jones (2002)):

  • [CpG filtering criteria]
  • [GF-1 ] GC content ≥ 50%;
  • [GF-2 ] Ratio O/E of observed (O) to expected (E) number of CpG dinucleotides in Xn is ≥ 0.6;
  • [GF-3 ] Length ≥ 200 bp (or 500 bp).

The threshold values 50%, 0.6 and 200 (or 500) used in the criteria are subjective choices. Different choices are used in different studies. The expected number (E) of CpGs is computed as n times the product of the frequencies of G and C in Xn (i.e., assuming independence). Heuristically, identifying CpG islands means to screen through a genome sequence for all segments satisfying all three conditions.

The Lexis diagram is a technique originally used to represent lifetime data with staggered entries. In the algorithm below we adapt this diagram to represent runs of genomic sites satisfying the filtering criteria.

  • [Left-to-Right] version of the Lexis diagram
  • [Initialize ] Let i = 0.
  • [LR-1 ] Sequentially compute the two characteristic values [GF-1] (GC content) and [GF-2] (O/E ratio) on segments {Xj : j [set membership] [i, k]}for k = 200,..., n;
  • [LR-2 ] Consider the 45° line crossing the horizontal axis at i. Mark the point (i + k, k) along this line whenever both [GF-1] and [GF-2] are satisfied on the subsegment [i, i + k] of Xn;
  • [LR-3 ] Repeat steps [LR-1] and [LR-2] starting from Xi with i = 1,..., n – 200.

A [Right-to-Left] Lexis diagram can be similarly constructed with Xn as the starting point and performing all computations in the opposite direction.

Figure 1(a) shows a genome sequence marked with CpG dinucleotides. [Left-to-Right] Lexis diagrams for the two subsegments marked with green brackets are given in Figure 1(b) (right-most subsegment) and Figure 1(c) (left-most subsegment). These diagrams illustrate the sparsity and clustering of CpG di-nucleotides. Specifically, a marked point with coordinate (i + k, k), (k > 200), in the n×n lattice coordinate system indicates that the subsegment [i, i + k] of Xn satisfies [GF-1] and [GF-2] with length k. For example, marked points b1 = (407,406), b2 = (440, 423) and b3 = (699, 547) in Figure 1(b) correspond to the subsegments [1, 407], [17, 440] and [152, 699], all satisfying the first two filtering conditions with lengths being equal to 406, 423 and 547, respectively. Here we note that both GC content and O/E are sensitive to starting location and length. For example, marked points b4 = (500, 347) in Figure 1(b) and c1 = (358, 305) in Figure 1(c) indicate that subsegment [153, 500] of genome segment #1 and segment [53, 358] of genome subsegment #2 do not satisfy [GF-1] and [GF-2]. However if we either extend both subsegments a little bit in length or move the starting point slightly to the right then the two conditions are satisfied.

Fig. 1.
Lexis diagram, (a) Actogram representation of a genome sequence with each CpG occurrence marked by a “bar”; (b) Lexis diagram for first segment marked by green square brackets; (c) Lexis diagram for second marked segment.

In addition, we observe wiggle patterns on both the lower-left and upper-right ends of the Lexis diagram, as shown in Figure 1(b,c). We see many subsegments satisfying the filtering criteria. Some are contained by the others, such as the marked subsegment corresponding to c3 = (699, 613) which contains the one marked by c4 = (690, 577). In contrast, the two subsegments marked by b2 and b3 are overlapping, but neither contains the other. We remark that there will be many similar examples in the [Right-to-Left] Lexis diagram, though here we focus on the [Left-to-Right] diagram.

These observations together point out that many genome locations have an undetermined status with regard to being inside or outside a CpG island. This is an important observation of the filtering criteria. Here we call it mathematical incompleteness. An immediate implication of this incompleteness is that the above filtering criteria is not operational for CpG island identification in the sense that there does not exist an algorithmic prescription that can be used to unambiguously define CpG islands using the filtering criteria. Therefore ad hoc human judgement is needed in order to define CpG islands. Alternatively, one can define a site as being in a CpG island if it is in any subsegment satisfying the criteria. In this case, an exhaustive search like that illustrated in the Lexis diagram is necessary.

The exhaustive search requires O(n2) calculations. For a relatively short genome sequence with n ≈ 105, the computation is of order 1010. Typical mammalian chromosomes are on the order of a million bp long. Thus, an exhaustive search is infeasible on a single processor. We note that the calculations can be done in parallel, with steps [[LR-1]] and [[LR-2]] computed for each i on a separate processor. With order n processors, the search could theoretically be reduced to O(n). However, most currently available computer clusters typically have only tens or hundreds of nodes.

The Lexis diagram also reveals the reason why CpG island finder algorithms based on sliding window computations have low sensitivity. We illustrate two cases in Figure 1(b). If 500 bp is used as the threshold value in [GF-3], then the subsegment marked by b2 would not be identified by a sliding window technique with window size 500. If the window size is 200 instead, then the moving window technique would pick up the subsegment [1, 407] as marked by b1. Then if 408 bp is taken as the starting point, the second cluster on genome segment #1 is likely to be missed. If the two clusters were a little bit further apart, then it is likely that the sliding window technique would miss both of them. This is not an uncommon case.

Hence, from the perspectives of incompleteness and implied computability, an indirect method for finding CpG islands is clearly needed. One indirect approach proposed in this paper consists of two steps. The first step is based on a pattern recognition technique, called the Hierarchical Factor Segmentation (HFS) algorithm, as discussed in the next section. This HFS algorithm identifies a collection of CpG island “cores,” where the CpG dinucleotide is aggregated, without involving any filtering criteria. In the second step, upon each computed CpG island core, a CpG island is computed using a chosen filtering criteria on the target genome sequence.

3. HFS approach: identifying CpG cores

The Hierarchical Factor Segmentation (HFS) algorithm was originally proposed in Fushing et al (2006), for analyzing animal behavioral time series. It was used there as a pattern recognition platform generically called “non-parametric decoding.” The version of the HFS algorithm employed here consists of three coding levels, so it is denoted by 3HFS throughout this paper. A schematic illustration of this algorithm is given in Figure 2. Intuitively, the HFS algorithm performs a CpG di-nucleotide intensity change-point analysis without prior knowledge of the number of changes along the genome sequence. The algorithm is described as follows.

Fig. 2.
Schematic illustration of HFS algorithm.
  • [3HFS algorithm:]
  • [3HFS-1. ] The genome sequence Xn is transformed into a base 2 digital string using the following coding scheme: code 1 replaces the value of Xi when a CpG ends at time i, otherwise code 0 replaces Xi. This first level 0-1 digital string is denoted by the code sequence C0.
  • [3HFS-2. ] Construct a histogram, say H*, of the event recurrence time distribution from C0, and denote the sequence of recurrence times (inter-event spacing) by Rm.
  • [3HFS-3. ] Choose the first threshold value ħ0 on H* to transform the R˜m sequence into a 0*-l* digital string via the second level coding scheme: (1) a recurrence time less than ħ0 is coded by a 0* and (2) a recurrence time greater than or equal to ħ0 is coded by a 1*. The resultant base 2 digital string is denoted by the code sequence C*.
  • [3HFS-4. ] Upon code sequence C*, we take code word 1* as another “new” event and construct its corresponding recurrence time histogram H@ and the sequence of inter-l*-event spacing as R˜@.
  • [3HFS-5. ] Choose the second threshold value ħ* on H@ to transform R˜@ into another digital string using the top-level coding scheme: (1) an inter-l*-event spacing less than ħ* is coded by 0@; (2) an inter-l*-event spacing greater than ħ* is coded by a 1@. The third base 2 digital string is denoted by C@.
  • [3HFS-6. ] The resultant code sequence C@ is mapped back onto the time series C0 or Xn as a partition of | R@ |(= m(3)) segments on [1, n]. Denote this partition by N(3)(Xn)={[NLi(3),NRi(3))}i=1m.

The 3HFS segmentation N(3)( Xn) will separate the high CpG-intensity segments from the low CpG-intensity ones. A segment, say [NLi(3),NRi(3)), corresponding to the l@-code, is a period of time points falling in between two widely separated l*-codes. The wide separation of two successive l*-codes implies that there are many 1-codes on the segment of code sequence C0, or equivalently, that many CpG events are observed on the segment of Xn. Thus this segment is a cluster of CpGs. We call a segment corresponding to 1@-odes a “core” of a potential CpG island. In contrast, a neighboring segment, say [NRi(3),NLi+1(3)), corresponding to 0@-codes (including segments corresponding to the two l*-codes on both of its ends) would present an extended period consisting of many 0-codes, but very few 1-codes. Such a segment contains only sparse CpGs.

Thus, given (ħ0, ħ*), the 3HFS algorithm can very efficiently locate all cores of CpG islands on Xn, even for whole chromosomes. Since the total length of this collection of cores is much smaller than the length of Xn, the construction of a collections of Lexis diagrams on short segments centered at each identified core become computationally feasible. In the next two subsections we discuss an adaptive model selection device for choosing thresholds values (ħ0, ħ*).

3.1. Extracting information of CpG dynamics via mixture analysis

¿From the point of view of pattern recognition, the segmentation N(3)( Xn) of Xn can also be viewed as a solution to the CpG intensity change-point analysis problem that makes no assumptions about prior knowledge of the number of changes. Since this solution critically depends on the choice of threshold values (ħ0, ħ*), we propose to choose such threshold values by employing an adaptive model selection technique, which will be developed in the next subsection. By adaptive here we mean to choose a suitable penalty function in a likelihood-based model selection procedure such that the resultant thresholds can coherently reflect information about the CpG dynamics embedded in the original genome sequence Xn. An apparent fact which is very relevant to the unknown CpG dynamics is the CpG-intensities inside and outside the CpG islands contained in Xn. For simplicity, we assume that all CpG islands share one common intensity λ1, while non-CpG island segments share another common intensity λ0. We expect that λ1 > λ0. We extract this intensity information using the maximum entropy principle as follows.

With a given Xn, we know the number m of CpG di-nucleotides it contains. We have the average recurrence time n/m as an estimate of the mean recurrence time of CpG di-nucleotides. Given this average recurrence time and under the maximum entropy principle, the recurrence time of CpG di-nucleotides along Xn are independently distributed and follow a geometric distribution G(·; λ1) inside CpG islands and a geometric distribution G(·; λ0) for the remaining bulk of DNA.

Based on these geometric distributions and the independence among recurrence times, we derive the likelihood for geometric mixture analysis of the collection of all recurrence times by collapsing their order. Denote the sequence of recurrence times of CpGs observed in Xn by R˜m={R1,,Rm}. The state-space sequence pertaining to R˜m is denoted by S˜m={S1R,,SmR}. That is, if Ri is from a CpG island SiR=1 then Ri 1 is distributed according to G(·; λ1). Otherwise, SiR=0 (Ri is not in a CpG island), and Ri 1 is distributed according to G(·; λ0). The mixture likelihood of ϕ = 1, λ1, λ0) is computed as:


Denote M1*=ΣSiR=mM0*. Then the log-likelihood of ϕ is:


To find the MLE for ϕ, we apply the EM-algorithm with S˜m as the missing data. The expected log-likelihood in the E-step at the jth iteration is:


where we have the so-called responsibility of the state s = 1 for the ith datum


The M-step at the jth iteration involves the following computations:

  1. Estimate the mixture proportion
  2. Estimate the geometric distribution parameters, for s = 0, 1,
    where μs(j) is the mean recurrence time in the s-state.

Let (π^s(M),λ^s(M),μ^s(M)) denote the MLE estimates of the mixture proportion, geometric intensity and mean recurrence time parameters, respectively. We can compute the sample Fisher information matrix Î0 of ψ = 1, λ1, λ0) using a variant of the missing information matrix proposed in Louis (1982). See McLachlan and Krishnan (1997) and Meilijson (1989) specifically for the geometric mixture case.

3.2. Finding the optimal threshold parameters via adaptive model selection criterion

Next we develop our adaptive model selection methodology as follows. Assuming that the segmentation N(3)( Xn) using the HFS algorithm achieves the separation of the CpG island and non-CpG island segments. Then the recurrence time within all [NLi(3),NRi(3)) segments of 1@ is G(· : λl) distributed, while recurrence time in the remaining [NRi(3),NLi+1(3)) segments is distributed as G(· : λ0). Therefore, any point of {[NLi(3)} and {NRi(3))} separating 0@ and 1@ segments can be defined as an intensity change-point. Denote the number of change-points by ω(ħ0, ħ*). Let {R1j'}j'=1m1 and {R0j}j=1m0 denote the observed recurrence times within identified core and non-core segments, respectively, where m0 + m1 = m is the length of R˜m in step [3HFS-2.] of the algorithm.

Next, we compute the likelihood function of (λ0, λl) based on the observed recurrence times within the two classes of segments as:


where P0s) = (1 – e−λs), s = 1, 0. Note that this estimation of (λ0, λ1) is independent of the estimation in the mixture geometric analysis in Section 3.1. With L(λ0, λ1; Xn), the maximum likelihood estimates of (λ0, λ1) are derived as follows:


where c¯1=ΣR1j'm1 and c¯0=ΣR0jm0 are the average recurrence times on core and non-core segments, respectively.

In the null case when no segmentation is imposed on Xn, the maximum entropy principle implies that the the CpG recurrence time distribution is again geometric. Denote this geometric distribution G(· : λ) with maximum likelihood estimator G(:λ^)=L(λ^|σ)=P0m(λ^)eλ^(nm) with λ^=logc¯c¯1, c = n/m, and m = m0 + m1. Since the number of events on Xn is a fixed constant, the estimator L([lambda with circumflex]|σ) is the maximum entropy distribution.

Consider the null hypothesis that there is no aggregation of CpGs in Xn. A reasonable and effective test statistic is the log-likelihood ratio


where l0 + l1, = n, and l0 =k0j and l1 =k1j′ are the total lengths of the two kinds of segments.

To make use of Δ00, ħ*) for selecting an optimal threshold parameter 0, ħ*) in the 3HFS algorithm, we perform the following optimization:


where k is the penalty for adding one extra parameter. For instance, k = 1 is the AIC model selection criterion, and k = log m corresponds to the BIC criterion. Empirically, we know that k = 1 would be likely to give rise to too many cores (0@), while k = log m tends to produce too few cores. Hence, we propose the adaptation step as the choice of k for which the 3HFS segmentation based on threshold values ( ħ˜0o(k), ħ˜*o(k)) provides a pair of intensity estimates, say ( λ^1(k), λ^0(k)), that is closest to ( λ^1(M), λ^0(M)) derived in the geometric mixture analysis. Specifically, we choose the integer k minimizing the following sum of signal-to-noise (S-N) ratios:


Denote the corresponding optimal threshold parameters by ( ħ˜0o(ko), ħ˜*o(ko)) and let the segmentation N(3)(Xn|(ħ˜0o(ko),ħ˜*o(ko))) refer to the output of the 3HFS algorithm performed using these thresholds.

3.3. Local exhaustive search: identifying CpG islands

We suggest the following heuristic for identifying CpG islands based on a set of identified cores. We first define an extended core as a union of all cores which are < 250 bp apart, where 250 bp is an empirically chosen ad hoc threshold. Consider a genome subsegment centered at an extended core and extending 250 bp to the left from the left endpoint of the extended core. Construct the Left-to-Right Lexis diagram on this genome subsegment, using a chosen filtering criteriona. The Lexis diagram produces a set of predicted CpG islands.

4. Applications of HFS algorithm on real genome sequences

In this section, we demonstrate how the 3HFS algorithm (with optimal penalty and threshold parameter values) can be applied to identify CpG island cores in genome sequences NT_000874.1 and human chromosome 21.

4.1. An illustrative example

We first apply the 3HFS algorithm to the human genome sequence NT_000874.1, shown in Figure 3. This sequence was analyzed by Wang and Leung (2004) using their CpG island finder CpGIE with filtering criteria: (i) GC content ≥ 50%, (ii) O/E ≥ 0.6, and (iii) length ≥ 500 bp. Although the CpGIE algorithm does not quite resemble the Lexis diagram, it provides an intensive search for CpG islands in relatively short genome sequences. Hence, CpGIE is a good benchmark for the performance of our 3HFS algorithm. To quantify the relative performance of 3HFS, we use the percentage of CpG islands identified by CpGIE that overlap with cores identified by the 3HFS algorithm.

Fig. 3.
Human genome sequence NT-00874.1 and its CpG recurrence time distribution, (a) Actogram; (b) Frequencies of di-nucleotides; (c) Histogram of CpG recurrence time with the geometric density function with same empirical mean shown in red; (d) Double logarithmic ...

Geometric mixture analysis of NT_000874.1 results in MLE estimates λ^0(M)=0.02178, λ^1(M)=0.07839, and [pi]1 = 0.6802. The optimal penalty is k* = 4 and the corresponding optimal threshold values are (ħ˜0o(ko),ħ˜*o(ko))=(56,15). Heuristically, this pair of threshold values indicate that the optimal CpG island “core” in this genome sequence contains at least 15 CpG di-nucleotides separated by no more than 56 bp.

In Figure 4(a), the collection of CpG island cores identified by 3HFS are marked on the sequence NT_000874.1 with green brackets, and the CpGIE CpG island predictions are shown in red curly brackets. The proportion of 3HFS cores overlapping CpGIE hits is 11/12. The Lexis diagram of the one CpGIE-predicted CpG island that does not contain a 3HFS core is a single point diagram. This can also be seen through the rather uniformly distributed CpGs in this CpG island. Figure 4(b) is a plot of all identified cores on the two-dimensional coordinate system “GC-content × O/E ratio”. All cores satisfy the [GF-1] and [GF-2] conditions, except one. This core is contained in a long genome subsegment that does satisfy the conditions.

Fig. 4.
Comparison of CpGIE and cores identified by HFS on sequence NT-000874.1. (a) CpG-islands identified by CpGIE are marked with red curly-brackets. Cores identified by HFS are in green square brackets; (b) Joint distribution of GC-content, O/E-ratio and ...

In Figure 5, we illustrate how CpG islands can be identified from 3HFS cores. We found a genome subsegment of length 385 bp satisfying [GF-1] and [GF-2]. Similarly, all 4 cores that do not overlap a CpGIE CpG island produce subsegments of varying length less than 500 bp that satisfy [GF-1] and [GF-2].

Fig. 5.
Construction a CpG island from a core, (a) Extended core segment; (b) Lexis diagram for identifying CpG islands satisfying the filtering criteria.

In summary, this example has shown that CpG islands identified by the CpGIE algorithm are very likely to overlap 3HFS cores. Hence, we can empirically conclude that our 3HFS algorithm should provide a nearly exhaustive search for CpG islands on whole chromosomes. We confirm this performance through a computer experiment in the next section. It is also noted that CpGIE is among the most computationally expensive CpG island finders, and hence can only be applied to short genome sequences. 3HFS is a competitive alternative that can be run on much longer sequences. Furthermore, the set of CpG islands predicted by CpGIE is somewhat different from the set identified by Lexis diagrams in regions near 3HFS cores (with the same filtering criteria). This indicates that our approach may provide complementary information, making it useful even on short sequences.

4.2. CpG island cores on human chromosome 21

Next we apply the 3HFS algorithm to 15 sequences that make up human chromosome 21. The resulting cores are reported in a supplementary file containing the genome locations, GC content, and O/E ratio for each core. The descriptive statistics for the latter two are summarized in Figure 6(a). Figure 6(b) shows that the rate of CpGs varies considerably among the 15 sequences, rising from left to right on the right arm of the chromosome in both predicted CpG islands (high intensity regions) and the non-islands (low intensity regions). This variation suggests that using one constant triplet of threshold values for the filtering criteria [GF-l]-[GF-3] throughout the whole chromosome may not be ideal biologically.

Fig. 6.
Chromosome 21 cores, (a) Joint distribution of GC-content (> 0.5), O/E-ratio (> 0.6) and length; (b) CpG-intensities in core and non-core segments along the chromosome.

5. Numerical evaluations of HFS algorithm

In this section, we examine the performance of the 3HFS algorithm on simulated data. It is not clear what model should be used to generate CpG island dynamics along genome sequences. Here we tentatively employ a hidden Markov model (HMM) with two states: CpG island and non-CpG island. The emission variables are the recurrence times of CpGs in the two states, which are assumed to be geometrically distributed as G(;λ^0(M)) in the non-CpG island state and G(;λ^1(M)) in the CpG island state. We use the values λ^0(M)=0.022 and λ^1(M)=0.078 from the illustrative example in the previous section.

The unobservable Markov random variable is specified by the 2 × 2 transition matrix: p00 = p0 = 1 – p01 and p11 = p1 = 1 – P10 We further specify p0 and p1 through the following equations. Let K0 be equal to the threshold value of CpG island length in [GF-3], and set


where υ1 is the expected number of consecutive sites in the CpG island state and [mu]1 is the expected length of a random variable distributed as G(;λ^1(M)). Hence K0 is close to the expected length (in bp) of a CpG island. This specification ensures that this HMM can give rise to CpG islands with expected length K0. In our computer experiment, we take K0 = 500 and calculate μ^1=11eλ^1(M)=13.26, υ1 = K0/[mu]1 = 37.70.

The estimated mixture proportion vector ([pi]0, [pi]1), with [pi]0 + [pi]1 = 1, is the stationary probability distribution corresponding to the HMM transition matrix. We define


and set [pi]1 = 0.6802, based again on the illustrative example. Subsequently, we solve for p1 = 0.9735 from the equation υ1=2p1p12(1p1). Then p0 = 0.9435 is computed using the equation π^0π^1=1p11p0.

For each replication, this HMM is employed to generate a 0-1 state-space vector of length 2000, which in turn is used to generate an event-time series around 45,000 bp long using the emission variables. The 3HFS algorithm is applied to each event-time series to decode the state-space vector. Performance is summarized in terms of the rate of CpG islands (of varying length) overlapping with computed cores (true positives) and the rate of computed cores falling in non-CpG island segments (false positives).

We performed 100 replications in our computer experiment. Performance of 3HFS across the 100 data sets is summarized in Figure 7. The overall sensitivity is quite high (Fig. 7(a)), even for relatively short CpG islands: conditional sensitivity is 100% for islands over 350 bp and 95% for islands 150 to 350 bp. The specificity of the 3HFS algorithm is also very high, as shown in Figure 7(b). Out of 4,046 cores computed in the 100 replications, only 315 cores (< 7%) fall in non-CpG island segments. On the average, in one replication there are only 3 cores falling in non-CpG island segments (constituting about 35,000 bp of the 45,000 bp simulated sequence). The conditional false positive rate is as low as 1% for cores longer than 400 bp.

Fig. 7.
Performance of HFS cores on simulated data with CpG cores of different lengths, (a) Sensitivity; (b) Specificity.

An illustrative simulation is shown in Figure 8. We see that almost all simulated CpG islands (in red) overlap cores computed using 3HFS, except for two rather short islands. There are 3 cores falling within non-CpG island genome segments about 35,000 bp long. It should be noted that these are not false positive CpG island predictions, as one would conduct an exhaustive search in each extended core to make CpG island predictions. The optimal penalty is 5 times the AICs penalty and the threshold values are (59,10).

Fig. 8.
Actogram of a simulated genome sequence with CpG-islands marked in red. HFS cores are marked by green square brackets.

In summary, the 3HFS algorithm has very high sensitivity and specificity, especially for CpG islands longer than 350 bp. This analysis confirms that our approach has the capacity to produce a nearly exhaustive search for CpG islands.

6. Discussion

In this paper we propose a nearly exhaustive search for CpG islands on genome sequences. This two-step methodology is computationally feasible even for whole chromosomes. Globally, the steps of the 3HFS algorithm pinpoint nearly all CpG di-nucleotide-rich segments, or CpG island cores, along the genome sequence under study. This pattern recognition technique primarily computes the contrasting sparsity vs. aggregation of CpG di-nucleotides, and is independent of any filtering criteria. Its coding threshold parameter values are chosen to match the mixture information of high and low CpG di-nucleotide intensities by means of an adaptive model selection device. Locally, the step of the Lexis diagram is implemented on every extended genome segment centered around one or several CpG island cores to exhaustively extract all potential CpG islands satisfying a chosen filtering criteria. Hence, from the perspectives of computational feasibility and capability of a nearly exhaustive search, our CpG island search algorithm provides a promising and realistic foundation for unbiased inferences. These inferences may then better facilitate more effective promoter predictions and differential methylation studies.

It is worth emphasizing again that filtering criteria based on a CpG island definition is neither operational, nor mathematically complete. We further illustrate this on the first 7000 bp of the genome segment of NT_000874.1, shown in Figure 4(a). One CpG island located at [2109,4220] is identified by the CpGIE algorithm within the segment. Interestingly, the Lexis diagram identifies five different subsegments satisfying the same filtering criteria. They are located at [168, 4771], [599, 4922], [1296, 5151], [1299, 6013] and [1487, 6211], and all contain the CpG island [2109, 4220] identified by CpGIE. Hence the question becomes, which one of these is the CpG island to be used? It is very likely that different researchers might choose differently among the six genome subsegments for distinct functional purposes. This example not only spotlights the problems facing filtering criteria, but more importantly it reveals the merit of Lexis diagrams for a CpG island search.

Next we remark that our computational approach for finding CpG islands, like any filtering criteria-based algorithm, does not bear information about any functional status of CpG islands. Therefore it cannot be used for prediction purposes until proper information is incorporated into the algorithmic computations. For instance, with the increasing availability of methylation data in the literature, a study of the association between genome sequence and methylation can be performed on a collection of CpG island cores extracted from one particular genome sequence. And then, by incorporating these experimental and empirical patterns into our two-step approach, especially the Lexis diagram construction, we may potentially attempt to predict methylation status on other genome sequences. This kind of learning approach could be a promising area for future interdisciplinary research.

Finally we conclude with some comments on modeling CpG island dynamics. A comparison of the genome sequence NT_000874.1 in Figure 4(a) with the HMM-generated sequence in Figure 6, suggests that the HMM seems to be a realistic model for CpG island recurrence. However, our analyses on these two sequences results in two different sets of optimal threshold values. Thus we speculate that some dynamic characteristics underlying the real genome sequence still have not been completely captured by the HMM mechanism. More research effort is needed for better understanding of the CpG island dynamics.

Supplementary Material


A supplemental file of our CpG island cores on Chromosome 21


  • Bajic VB, Tan SL, Suzuki Y, Sugano S. Promoter prediction analysis on the whole human genome. Nature Biotechnology. 2004;22:1467–1473. doi: 10.1038/nbt1032. [PubMed] [Cross Ref]
  • Bird AP. CpG island and the function of DNA methylation. Nature. 1986;321:209–213. doi: 10.1038/321209a0. [PubMed] [Cross Ref]
  • Bock C, Paulsen M, Tierling S, Mikeska T, Lengauer T, Walter J. 2006. CpG island methylation in human lymphocytes is highly correlated with DNA sequence, Repeats, and predicted DNA structure PLoS Genet 23e26.http://www.plosgenetics.Org/article/info:doi/10.1371/journal.pgen.0020026Accessed 20 January 2006.10.1371/journal.pgen.0020026 [PubMed] [Cross Ref]
  • Bock C, Walter J, Paulsen M, Lengauer T. CpG island mapping by Epifenome prediction. PLoS Computational Biology. 2007;6:1055–1069. [PMC free article] [PubMed]
  • Fushing H, Hwang CR, Lee HC, Lan YC, Horng SB. Testing and mapping non-stationarity in animal behavioural processes: ace case study on an individual female bean weevil. Jour Theor Biol. 2006;238:805–816. doi: 10.1016/j.jtbi.2005.06.031. [PubMed] [Cross Ref]
  • Gardiner-Garden M, Frommer M. CpG island in vertebrate genomes. J Mol Biol. 1987;196:261–282. doi: 10.1016/0022-2836(87)90689-9. [PubMed] [Cross Ref]
  • Laird PW. Cancer epigenetics. Hum Mol Genet. 2005;14:R65–R76. doi: 10.1093/hmg/ddi113. [PubMed] [Cross Ref]
  • Saxonov S, Berg P, Brutlag D. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci U S A. 2006;103:1412–1417. doi: 10.1073/pnas.0510310103. [PubMed] [Cross Ref]
  • Takai D, Jones PA. Comprehensive analysis of CpG island in human chromosomes 21 and 22. Proc Natl Acad Sci U S A. 2002;99:3740–3745. doi: 10.1073/pnas.052410099. [PubMed] [Cross Ref]
  • Ushijima T. Detection and interpretation of altered methylation patterns in cancer cells. Nature Review:Cancer. 2005;5:223–231. doi: 10.1038/nrc1571. [PubMed] [Cross Ref]
  • Wong Y, Leung FCC. An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics. 2004;20:1170–1177. doi: 10.1093/bioinformatics/bth059. [PubMed] [Cross Ref]
  • Yamada Y, Watanabe H, Miura F, Soejima H, Uchiyama M, Iwasaka T, Mukai T, Sakaki Y, Ito T. A comprehensive analysis of allelic methylation status of CpG islands in human chromosome 21q. Genome Res. 2004;14:247–266. doi: 10.1101/gr.1351604. [PubMed] [Cross Ref]

Articles from The International Journal of Biostatistics are provided here courtesy of Berkeley Electronic Press