In the decade since the human genome sequence was published [1], [2], annotation of its landmarks and functional domains has been a priority. Protein-coding genes have been quite comprehensively identified and mapped, but full annotation of the genome is far from complete. In addition to genes, there are DNA sequence categories of likely functional importance, including non-coding transcription units, conserved elements and regions of variant base composition, whose biological significance is not well understood. Into the last category fall CpG islands (CGIs), which comprise about 1% of the genome and display an elevated G+C base composition spanning approximately 1000 base pairs. Their distinguishing feature is a high frequency of the dinucleotide CpG, but beyond this they do not share long-range sequence similarity [3]. In the human genome, CGIs have approximately 1 CpG every 10 base pairs, which is about 10 times the frequency found in the surrounding DNA. The high density of CpG shared by CGIs is partly explained by a G+C-rich base composition, but also depends critically on the lack of the CpG deficiency that is typical of the bulk genome. These dense CpG clusters are usually devoid of CpG methylation, whereas the bulk genome is methylated at 70–80% of CpGs. The lack of methylation in the germline [4] means that CGIs do not suffer the accelerated mutational loss of CpGs caused by deamination of 5-methylcytosine [5], [6]. Over evolutionary time, this has given rise to the observed contrast between a CpG-deficient bulk genome and relatively CpG-rich CGIs. Clustering of unmethylated CpGs has allowed the CGIs to be biochemically isolated as a relatively homogeneous fraction of DNA [3], [7] or chromatin [8].
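The sequence properties described above (G+C content, CpG density, and the absence of CpG deficiency) can be made concrete with a short calculation. The sketch below is illustrative only and is not the method used in this study, which relies on biochemical isolation of CGIs. It assumes the widely used observed/expected CpG ratio, in which the expected CpG count is the product of the C and G counts divided by the sequence length. The function name `cpg_stats` and the keys of the returned dictionary are hypothetical.

```python
def cpg_stats(seq: str) -> dict:
    """Summarise the base composition features that distinguish CGIs.

    Illustrative sketch only: the observed/expected ratio is a commonly
    used sequence-based measure, assumed here for demonstration.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return {"gc_fraction": 0.0, "cpg_per_10bp": 0.0, "cpg_obs_exp": 0.0}

    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact

    # Expected CpG count if C and G were independently distributed
    expected = (c * g) / n

    return {
        "gc_fraction": (c + g) / n,          # G+C base composition
        "cpg_per_10bp": 10 * cpg / n,        # CGIs: roughly 1 per 10 bp
        "cpg_obs_exp": cpg / expected if expected else 0.0,
    }


if __name__ == "__main__":
    # Toy sequence, not real genomic data
    print(cpg_stats("ACGTACGTAC"))
```

On this measure, a ratio well below 1 reflects the CpG deficiency of the bulk genome, whereas CGIs give values much closer to 1.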
CGIs encompass the transcription start site (TSS) of approximately 60% of human protein-coding genes. Extensive genome-wide mapping of histone modifications by chromatin immunoprecipitation (ChIP) has established that trimethylation of lysine 4 of histone H3 (H3K4me3) is a signature mark coinciding with most promoter CGIs, even when the associated gene is not expressed [9]–[11]. A potential biological rationalisation for the maintenance of unmethylated CpGs at many promoters has recently emerged from studies of proteins that interact preferentially with CGIs. The protein Cfp1 contains a CXXC domain that binds specifically to CpG only when it is unmethylated, and Cfp1 co-localises with almost all CGIs in the mouse genome. Cfp1 is a component of the Set1 complex, which trimethylates histone H3 lysine 4, and its depletion drastically affects levels of this modification at CGIs [12]–[14]. Importantly, insertion of a promoterless stretch of CpG-rich DNA into the mouse genome is sufficient to recruit Cfp1 and create a novel peak of H3K4me3 [14]. Complementing this predisposition to form H3K4me3 chromatin is the intrinsic reluctance of CGIs to assemble nucleosomes [15]. Both of these features appear to pre-adapt CGIs for active promoter function.
The notion that CGIs facilitate promoter function fits well with their presence at TSSs, but is challenged by two observations that appear to weaken the link with genes. Firstly, genomic analysis has indicated that the number of CGIs in humans and mice is very different, with mice apparently possessing little more than half the number present in humans [16], [17]. Lack of evolutionary conservation would argue against a central role in promoter function. Secondly, the use of CXXC Affinity Purification (CAP) to identify a large fraction of CGIs showed that many CGIs in the human genome are not coincident with annotated promoters, but are either intergenic or within the body of coding regions (intragenic) [7]. To clarify these issues we have compiled a comprehensive CGI map for three developmentally distinct human and mouse tissues (sperm, whole blood and cerebellum). The results show that, contrary to previous conclusions, the numbers of CGIs in human and mouse are very similar. Moreover, in both organisms approximately half of all CGIs are remote from annotated promoters. These “orphan” CGIs co-localise with peaks of H3K4me3, and evidence suggests that a large proportion recruit RNA polymerase II (RNAPII) and give rise to novel transcripts. We find that de novo methylation during development predominantly affects orphan CGIs in both humans and mice, with few protein-coding gene promoters being methylated. This contrasts with the situation in colorectal tumors, where cancer-specific de novo methylation affects both CGI categories equally, with a strong preference for those marked in embryonic stem (ES) cells by H3K27me3 – the chromatin modification that is associated with polycomb-mediated repression [18]–[21]. Our findings support the notion that all CGIs correspond with promoters and that many orphan CGIs are associated with novel transcripts that may have regulatory significance.
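The distinction drawn above between promoter CGIs and intragenic or intergenic "orphan" CGIs can be expressed as a simple interval comparison. The following is a minimal sketch under stated assumptions, not the classification procedure used in this study: the function name `classify_cgi`, the half-open coordinate convention, and the `window` parameter (extra bases allowed on either side of a CGI when testing for an annotated TSS) are all hypothetical, and the coordinates are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open
Site = Tuple[str, int]           # (chromosome, position)


def classify_cgi(cgi: Interval, tss_list: List[Site],
                 genes: List[Interval], window: int = 0) -> str:
    """Label a CGI by its relationship to annotated genes.

    Assumption: a CGI counts as a promoter CGI if an annotated TSS lies
    within the CGI, optionally extended by `window` bases on each side.
    """
    chrom, start, end = cgi

    # Promoter CGI: overlaps an annotated transcription start site
    for tss_chrom, tss_pos in tss_list:
        if tss_chrom == chrom and start - window <= tss_pos < end + window:
            return "promoter"

    # Orphan CGI inside an annotated gene body
    for g_chrom, g_start, g_end in genes:
        if g_chrom == chrom and start < g_end and g_start < end:
            return "intragenic orphan"

    # Orphan CGI between annotated genes
    return "intergenic orphan"


if __name__ == "__main__":
    # Invented annotation: one gene on chr1 with its TSS at position 1000
    tss = [("chr1", 1000)]
    genes = [("chr1", 1000, 5000)]
    for cgi in [("chr1", 900, 1200), ("chr1", 3000, 3500), ("chr1", 8000, 8500)]:
        print(cgi, classify_cgi(cgi, tss, genes))
```

Any real analysis would depend on the choice of gene annotation and on how far from a TSS a CGI may lie before it is treated as remote, neither of which this sketch attempts to fix.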