We pursued the hypothesis that there is a subpopulation of sequences in the genome defined solely by their clustering of CG dinucleotides. This clustering is a result of the genome-wide decay of CG dinucleotide content, with preservation of CG density at certain regions. By measuring the distance spanned by a fixed number of CG dinucleotides for every such group genome-wide, we observed that there are two populations of loci with distinctive CG clustering densities (Figure 1a and b). Using the first local minimum in the distribution of spanned sequence fragment lengths as the boundary of the short, CG-dense population, we identified the maximum fragment length corresponding to each fixed number of CGs. In analyzing these cutoffs, we defined a linear relationship between CG dinucleotide number and the associated maximum fragment length (Figure 1c).
Figure 1. The analytical technique used to define CG clusters. First, a fixed number of CG dinucleotides is chosen, as illustrated in the example using six CGs (a). The first CG is identified in a chromosome, then the sixth, allowing the number of nucleotides between ...
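The measurement described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the choice to measure a span from the C of the first CG to the G of the last, and the fixed histogram bin width are all assumptions made for the example.

```python
import re
from collections import Counter


def cg_positions(seq):
    """Return the 0-based start position of every CG dinucleotide in seq."""
    return [m.start() for m in re.finditer("CG", seq.upper())]


def fragment_lengths(positions, n_cgs):
    """Length in bp spanned by each run of n_cgs consecutive CGs.

    Assumed convention: the span runs from the C of the first CG
    to the G of the last CG in the run.
    """
    k = n_cgs - 1
    return [positions[i + k] + 2 - positions[i]
            for i in range(len(positions) - k)]


def first_local_minimum(lengths, bin_width=10):
    """Lower edge of the first histogram bin lower than its left neighbour
    and no higher than its right neighbour, or None if there is none.

    bin_width is an illustrative parameter, not a value from the paper.
    """
    counts = Counter(length // bin_width for length in lengths)
    if not counts:
        return None
    bins = range(min(counts), max(counts) + 1)
    hist = [counts.get(b, 0) for b in bins]
    for i in range(1, len(hist) - 1):
        if hist[i] < hist[i - 1] and hist[i] <= hist[i + 1]:
            return bins[i] * bin_width
    return None
```

On real chromosome-scale data the histogram is noisy, so some smoothing would probably be needed before searching for the minimum; the sketch omits that step.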
The clear differentiation of CG-dense fragments from the rest of the genome provides a means of mathematically defining CG-dense regions and can therefore be used as a robust foundation for computational genomic annotation. Given a fixed number of CGs, the CG-dense fragments below the maximum fragment length could be identified and mapped back onto the genome. However, as Figure 2a and b show, each choice of CG number generates a different annotation. Using fewer CGs and correspondingly smaller fragments, many small CG clusters are identified, whereas using a greater number of CGs and correspondingly larger fragments identifies fewer clusters, each of which extends into large flanking regions of lower CG density.
Figure 2. Creating a CG cluster annotation for the human genome. For a given number of CGs, significantly CG-dense fragments are defined as being shorter than the maximum fragment length. When these fragments are mapped back to the genome, some loci have multiple ...
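Mapping the short fragments back to the genome and collapsing overlapping ones into loci might look like the following sketch. The function names and the half-open coordinate convention are assumptions for illustration; the input is a sorted list of CG start positions on one chromosome.

```python
def dense_fragments(positions, n_cgs, max_length):
    """Return (start, end) for every run of n_cgs consecutive CGs that
    spans no more than max_length bp. Coordinates are half-open."""
    k = n_cgs - 1
    fragments = []
    for i in range(len(positions) - k):
        start, end = positions[i], positions[i + k] + 2
        if end - start <= max_length:
            fragments.append((start, end))
    return fragments


def merge_into_clusters(fragments):
    """Merge overlapping fragments into clusters.

    Returns (start, end, n_fragments) per cluster, where n_fragments is
    the number of overlapping fragments that were merged into it.
    """
    clusters = []
    for start, end in sorted(fragments):
        if clusters and start < clusters[-1][1]:
            prev_start, prev_end, count = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            clusters.append((start, end, 1))
    return clusters
```

Keeping the per-cluster fragment count alongside the coordinates is a convenience here, since a locus represented by many overlapping fragments is the denser one.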
We were able to optimize the criteria when we recognized that at any individual CG-dense locus, a given number of CGs generates multiple overlapping fragments. More CG-dense clusters require a greater number of fragments to span all of the CGs they contain. Accordingly, the more overlapping fragments that represent a given locus, the more likely it is to be significantly CG-dense. For each number of CGs, we calculated the number of overlapping fragments per cluster. We obtained a representation of information content for each CG number by summing this total across all loci in the genome and dividing by maximum fragment length. We then determined the optimal number of CGs per fragment using the maximum value obtained (Figure 2c). For the human genome, this optimum corresponds to 27 or more CG dinucleotides in a sequence of no more than 531 bp in length. This new means of identifying CG clusters is constrained neither by (G+C) content nor by the associated observed/expected CG dinucleotide ratio. In Figure 3, we show that the thresholds imposed by even the least stringent original base compositional criteria (2) cause many CG-dense loci in the genome to be missed. However, even though we are annotating the entire sequenced genome, including repetitive DNA, we identify only a small fraction of the ~350 000 CpG islands predicted by these old criteria (2).
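The optimization step can be written as a small scoring function. The sketch below assumes the per-cluster overlapping-fragment counts and the maximum fragment length have already been computed for each candidate CG number; the names `information_score` and `optimal_cg_number` are hypothetical.

```python
def information_score(fragments_per_cluster, max_length):
    """Total overlapping fragments across all clusters, divided by the
    maximum fragment length for this CG number."""
    return sum(fragments_per_cluster) / max_length


def optimal_cg_number(candidates):
    """Pick the CG number with the highest information score.

    candidates maps n_cgs -> (fragments_per_cluster, max_length).
    Returns (best_n_cgs, scores).
    """
    scores = {
        n_cgs: information_score(counts, max_length)
        for n_cgs, (counts, max_length) in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

For example, `optimal_cg_number({6: ([1, 1, 2], 100), 27: ([10, 20, 30], 500)})` selects 27 for these made-up inputs, because 60/500 exceeds 4/100. The numbers are illustrative only and are not taken from the data.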
Figure 3. The base compositional characteristics of CG clusters (black) are shown in terms of observed to expected CG dinucleotide densities (O/E CG ratio) on the x-axis with (G+C) content on the y-axis. The dashed lines illustrate the relatively non-stringent ...
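For readers unfamiliar with the two base compositional measures plotted in Figure 3, the sketch below computes them for a single sequence. It assumes the commonly used definition of the O/E CG ratio, (number of CGs × sequence length) / (number of Cs × number of Gs); the text does not state which exact formula was applied, so treat this as an assumption.

```python
def oe_cg_ratio(seq):
    """Observed/expected CG ratio: (#CG * N) / (#C * #G).

    Returns 0.0 when the sequence has no C or no G, where the ratio
    is undefined.
    """
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (n_c * n_g)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("C") + seq.count("G")) / len(seq)
```

A CG cluster defined purely by CG spacing can fall anywhere on these two axes, which is the point the figure makes about threshold-based criteria.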
We compared the functional significance of CpG islands and CG clusters in two ways: by testing how frequently each co-localizes with promoters and with hypomethylated loci. A major use of the CpG island annotation has been to predict the location of transcription start sites in the genome. Approximately 40% (18) to 50% (6) of human promoters have been found to co-localize with CpG islands, while promoters of housekeeping genes have been described as having a near-universal association with CpG islands (18). We cross-correlated our CG clusters and the CpG island locations annotated at the UCSC Genome Browser with transcription start sites of refSeq genes from the same database, finding that CpG islands overlap 57% of refSeq transcription start sites, 79% of a published list of housekeeping genes (19) and 38% of a published list of tissue-specific genes (20). In contrast, the proportion of refSeq transcription start sites associated with CG clusters is substantially higher (68%, an additional 11% or 2701 refSeq transcription start sites), with 45% of the tissue-specific genes and 91% of housekeeping genes co-localizing with these CG clusters (Figure 4a). As the UCSC CpG island annotation is of non-repetitive sequence and our CG cluster annotation was generated without this filter, we were concerned that the comparison was unfairly penalizing the UCSC CpG island annotation. We therefore tested the relative proportions of refSeq promoter overlaps for two other annotations: the 350 201 CpG islands that occur in the genome as a whole, and the 31 225 CG clusters that do not substantially overlap transposable elements. In Figure 4b, we show that the CG cluster annotation performs better in identifying refSeq promoters for both unfiltered sequence (positive predictive values of 0.381 compared with 0.049) and non-transposon sequence (0.535 compared with 0.508). Similar patterns are found for the mouse genome. Given that CpG islands have been used as a component of algorithms for predicting promoters in the genome (6), CG clusters should offer a more powerful resource for this and comparable purposes.
Figure 4. (a) CG clusters (white bars) overlap more refSeq transcription start sites than the CpG island annotations of the UCSC genome browser (black bars). CG clusters overlap the presumed promoters of a substantially higher proportion of genes overall (left), ...
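The two quantities compared above, the proportion of transcription start sites overlapped and the positive predictive value of an annotation, can be illustrated as follows. The text does not spell out how the positive predictive value was computed, so this sketch assumes it is the fraction of annotated regions that contain at least one transcription start site. It also assumes non-overlapping, half-open regions on a single chromosome, and `overlap_stats` is a hypothetical name.

```python
from bisect import bisect_right


def overlap_stats(regions, tss_positions):
    """Return (fraction of TSSs inside a region,
               fraction of regions containing at least one TSS).

    regions: non-overlapping (start, end) half-open intervals.
    tss_positions: transcription start site coordinates.
    """
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    hit_regions = set()
    tss_hits = 0
    for tss in tss_positions:
        # Index of the last region starting at or before this TSS.
        i = bisect_right(starts, tss) - 1
        if i >= 0 and tss < regions[i][1]:
            tss_hits += 1
            hit_regions.add(i)
    tss_fraction = tss_hits / len(tss_positions)
    ppv = len(hit_regions) / len(regions)
    return tss_fraction, ppv
```

A real comparison would run this per chromosome and might allow a window around each start site; both refinements are left out to keep the example short.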
We tested the relative ability of CG clusters to detect hypomethylated sites by performing the HELP assay (12) on human embryonic stem cells and CD34+ hematopoietic stem and progenitor cells. A microarray representing HpaII-amplifiable fragments located near transcription start sites in the human genome was used for two biological replicates of each cell type. While similar proportions of loci at CpG islands and at CG clusters demonstrated hypomethylation (Figure 5a), the absolute number of hypomethylated loci differed (Figure 5b), as the hypomethylated CpG islands represent a subset of the larger group of hypomethylated CG clusters. The CG clusters identify ~50% more hypomethylated loci than do CpG islands. We conclude that the CG cluster annotation not only identifies more transcription start sites but also defines loci with comparable epigenetic characteristics.
Figure 5. The HELP assay (13) was used on a custom promoter microarray to test cytosine methylation patterns in two samples each of CD34+ hematopoietic stem and progenitor cells (CD34) and human embryonic stem cells (ES). In panel (a) it is apparent that similar ...
We next addressed the question of why CpG islands are often not conserved between human and mouse (16). This is a puzzling issue if the CG-rich nature of the promoter is of functional importance, for example conferring the ubiquitous expression patterns that define housekeeping genes (18). It would be expected that such functional promoter characteristics would be conserved between species despite differences in overall CG dinucleotide content [observed/expected (O/E) CG ratios of 0.19 and 0.24 for mouse and human, respectively (22)]. It is therefore surprising that the total number of CpG islands in mouse is only ~58% of the number annotated for the human genome.
When we performed the CG clustering analysis of the mouse genome, we found that it also generates two populations with distinct CG density characteristics, but that the optimal CG cluster definition for the mouse genome differs from that of the human, corresponding to 24 or more CG dinucleotides in a sequence of no more than 585 bp in length (Figure 6). By comparison, human CG clusters consist of 27 or more CGs in no more than 531 bp. When we calculated the total number of CG clusters for the mouse genome, it was strikingly similar to that for the human (42 971 and 44 165, respectively). In addition, when we re-analyzed a sample of 23 loci originally published to demonstrate the failure of CpG island conservation between these species (15), we found that while only 18 conserve CpG islands, 22 out of 23 conserve CG clusters, the single exception in this limited sample being the alpha globin orthologs (HBA1/Hba-a1). We extended this study to test conservation of each annotation genome-wide. Of the 27 801 CpG islands annotated at the UCSC Genome Browser, 14 452 have orthologous sequences with CpG islands in the mouse genome, while there exist 19 410 sites of conserved CG clustering. When studied using our genome-specific annotations, clustered CG dinucleotides are demonstrably much more conserved between species than previously appreciated.
Figure 6. The mouse genome has different CG clustering characteristics than those of the human genome. The optimization curve characteristics for mouse are clearly different from those for human (a). The optimal mouse annotation contains fragments no longer than ...
We extended the CG clustering histogram analysis to eight more genomes, including other organisms that are known to methylate their genomes, those that do so only transiently (Drosophila melanogaster), and those that do not methylate at all. The surprising result of these analyses is that the fugu (Tiger Blowfish, Takifugu rubripes) genome, which has been described to methylate its DNA (24), does not exhibit uniquely CG-dense regions. What may explain this difference is that the degree of decay of CG dinucleotide content in the fugu genome is less than that of most genomes in which unique CG-dense regions emerge. The zebrafish (Danio rerio) genome, on the other hand, does display uniquely CG-dense regions with only marginally greater CG dinucleotide decay (O/E CG 0.53 as opposed to 0.57 in fugu). The remaining major difference between these genomes is that of size, the fugu genome being substantially smaller than the other methylating genomes at only 365 Mb total (25), a variable already suggested to be related to the evolution of cytosine methylation (26). Our data demonstrate that while cytosine methylation appears to be necessary for CG decay, it is not sufficient to cause local preservation of clustered CG dinucleotides. Furthermore, we can conclude that any annotation of the fugu genome to indicate the presence of CpG islands or CG clusters is inappropriate.
Figure 7. CG cluster analysis of 10 different species. These CG fragment length frequency plots were generated using 30 CGs per fragment for each species. Genomes containing CG clusters are defined by the distinct peak of short, uniquely CG-dense fragments. While ...