The DNA of most vertebrates is depleted in CpG dinucleotides: a C followed by a G in the 5′ to 3′ direction. CpGs are the target for DNA methylation, a chemical modification of cytosine (C) that is heritable during cell division and is the most well-characterized epigenetic mechanism. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). Knowing CGI locations is important because they mark functionally relevant epigenetic loci in development and disease. For various mammals, including human, a widely used list of CGI is readily available from the UCSC Genome Browser. This list was derived using algorithms that search for regions satisfying a definition of CGI proposed by Gardiner-Garden and Frommer more than 20 years ago. Recent findings, enabled by advances in technology that permit direct measurement of epigenetic endpoints at a whole-genome scale, motivate the need to adapt the current CGI definition. In this paper, we propose a procedure, guided by hidden Markov models, that permits an extensible approach to detecting CGI. The main advantage of our approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. The utility of this approach is demonstrated by generating the first CGI lists for invertebrates, and by the fact that we can create CGI lists that substantially increase overlap with recently discovered epigenetic marks. A CGI list and the probability scores, as a function of genome location, for each species are available at http://www.rafalab.org.
CpG island; Epigenetics; Hidden Markov model; Sequence analysis
The DNA of most vertebrates is depleted in CpG dinucleotides, the target for DNA methylation. The remaining CpGs tend to cluster in regions referred to as CpG islands (CGI). CGI have been useful for marking functionally relevant epigenetic loci in genome studies. For example, CGI are enriched in the promoters of vertebrate genes and are thought to play an important role in regulation. Currently, CGI are defined algorithmically as regions with an observed-to-expected ratio (O/E) of CpG greater than 0.6, G+C content greater than 0.5, and usually, but not necessarily, a length greater than a certain threshold. Here we find that the current definition leaves out important CpG clusters associated with epigenetic marks, relevant to development and disease, and does not apply at all to nonvertebrate genomes. We propose an alternative hidden Markov model-based approach that solves these problems. We fit our model to genomes from 30 species, and the results support a new epigenomic view toward the development of DNA methylation in species diversity and evolution. The O/E of CpG in islands and nonislands segregated closely phylogenetically and showed substantial loss in both groups in animals of greater complexity, while maintaining a nearly constant difference in CpG O/E between islands and nonisland compartments. Lists of CGI for some species are available at http://www.rafalab.org.
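The threshold-based definition quoted above can be illustrated with a minimal sketch. It assumes the commonly used estimate of the expected CpG count, (number of C × number of G) / sequence length, which the abstract does not spell out. The function names and the 200 bp minimum length are illustrative assumptions, since the abstract only says "a certain length". This shows the conventional sequence criteria, not the hidden Markov model procedure the authors propose.

```python
def cpg_stats(seq):
    """Return (G+C fraction, CpG observed-to-expected ratio) for a DNA string.

    Assumption: expected CpG count = (#C * #G) / length, a common convention
    that is not stated in the abstract itself.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c = s.count("C")
    g = s.count("G")
    observed = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n
    expected = c * g / n
    oe = observed / expected if expected > 0 else 0.0
    return gc_content, oe


def meets_sequence_criteria(seq, min_oe=0.6, min_gc=0.5, min_len=200):
    """Check the thresholds quoted in the abstract (O/E > 0.6, G+C > 0.5).

    min_len=200 is a hypothetical default; the abstract gives no value.
    """
    gc_content, oe = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and oe > min_oe
```

A real scan would apply such a check to sliding windows along a chromosome. The hard thresholds are exactly what the abstract argues against, so this sketch is best read as a baseline for comparison.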
The Human Papillomavirus (HPV) genome is divided into early and late coding sequences, including 8 open reading frames (ORFs) and a regulatory region (LCR). Viral gene expression may be regulated through epigenetic mechanisms, including cytosine methylation at CpG dinucleotides. We have analyzed the distribution of CpG sites and CpG islands/clusters (CGI) among 92 different HPV genomes grouped in function of their preferential tropism: cutaneous or mucosal. We calculated the proportion of CpG sites (PCS) for each ORF and calculated the expected CpG values for each viral type.
CpGs are underrepresented in viral genomes. We found a positive correlation between CpG observed and expected values, with mucosal high-risk (HR) virus types showing the smallest O/E ratios. The ranges of the PCS were similar for most genomic regions except E4, where the majority of CpGs are found within islands/clusters. At least one CGI belongs to each E2/E4 region. We found positive correlations between PCS for each viral ORF when compared with the others, except for the LCR against four ORFs and E6 against three other ORFs. The distribution of CpG islands/clusters among HPV groups is heterogeneous and mucosal HR-HPV types exhibit both lower number and shorter island sizes compared to cutaneous and mucosal Low-risk (LR) HPVs (all of them significantly different).
There is a difference between viral and cellular CpG underrepresentation. There are significant correlations between complete genome PCS and a lack of correlations between several genomic region pairs, especially those involving LCR and E6. L2 and L1 ORF behavior is opposite to that of oncogenes E6 and E7. The first pair possesses relatively low numbers of CpG sites clustered in CGIs, while the oncogenes possess a relatively high number of CpG sites not associated with CGIs. In all HPVs, E2/E4 is the only region with at least one CGI and shows a higher content of CpG sites in every HPV type with an identified E4. The mucosal HR-HPVs show both the shortest CGI size, followed by the mucosal LR-HPVs and lastly by the cutaneous viral subgroup, and a trend toward the lowest CGI number, followed by the cutaneous viral subgroup and lastly by the mucosal LR-HPVs.
The majority of mammalian gene promoters are encompassed within regions of the genome called CpG islands that have an elevated level of non-methylated CpG dinucleotides. Despite over 20 years of study, the precise mechanisms by which CpG islands contribute to regulatory element function remain poorly understood. Recently it has been demonstrated that specific histone modifying enzymes are recruited directly to CpG islands through recognition of non-methylated CpG dinucleotide sequence. These enzymes then impose unique chromatin architecture on CpG islands that distinguish them from the surrounding genome. In the context of this work we discuss how CpG island elements may contribute to the function of gene regulatory elements through the utilization of chromatin and epigenetic processes.
CpG island; chromatin; histone; methylation; promoter; ZF-CxxC domain; transcription; histone lysine demethylase
CpG islands (CGIs) are CpG-rich regions compared to CpG-depleted bulk DNA of mammalian genomes and are generally regarded as the epigenetic regulatory regions in association with unmethylation, promoter activity and histone modifications. Accurate identification of CpG islands with epigenetic regulatory function in bulk genomes is of wide interest. Here, the common features of functional CGIs are identified using an average mutual information method to differentiate functional CGIs from the remaining CGIs. A new approach (CpG mutual information, CpG_MI) was further explored to identify functional CGIs based on the cumulative mutual information of physical distances between two neighboring CpGs. Compared to current approaches, CpG_MI achieved the highest prediction accuracy. This approach also identified new functional CGIs overlapping with gene promoter regions which were missed by other algorithms. Nearly all CGIs identified by CpG_MI overlapped with histone modification marks. CpG_MI could also be used to identify potential functional CGIs in other mammalian genomes, as the CpG dinucleotide contents and cumulative mutual information distributions are almost the same among six mammalian genomes in our analysis. It is a reliable quantitative tool for the identification of functional CGIs from bulk genomes and helps in understanding the relationships between genomic functional elements and epigenomic modifications.
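The quantity CpG_MI is described as building on, the physical distance between two neighboring CpGs, can be extracted with a short sketch like the one below. It covers only that input step; the cumulative mutual information calculation is not specified in the abstract and is not attempted here. The function names are illustrative assumptions.

```python
def cpg_positions(seq):
    """Return the 0-based start index of every CG dinucleotide in seq."""
    s = seq.upper()
    positions = []
    i = s.find("CG")
    while i != -1:
        positions.append(i)
        i = s.find("CG", i + 1)
    return positions


def neighbor_distances(seq):
    """Distances (in bp) between consecutive CpG start positions."""
    pos = cpg_positions(seq)
    return [b - a for a, b in zip(pos, pos[1:])]


# Example: CpGs start at indices 1, 5 and 7
print(neighbor_distances("ACGTTCGCGA"))  # [4, 2]
```

Short distances between neighbors indicate locally dense CpGs, which is the kind of signal a clustering or information-based method could then summarize.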
CpG islands were originally identified by epigenetic and functional properties, namely, absence of DNA methylation and frequent promoter association. However, this concept was quickly replaced by simple DNA sequence criteria, which allowed for genome-wide annotation of CpG islands in the absence of large-scale epigenetic datasets. Although widely used, the current CpG island criteria incur significant disadvantages: (1) reliance on arbitrary threshold parameters that bear little biological justification, (2) failure to account for widespread heterogeneity among CpG islands, and (3) apparent lack of specificity when applied to the human genome. This study is driven by the idea that a quantitative score of “CpG island strength” that incorporates epigenetic and functional aspects can help resolve these issues. We construct an epigenome prediction pipeline that links the DNA sequence of CpG islands to their epigenetic states, including DNA methylation, histone modifications, and chromatin accessibility. By training support vector machines on epigenetic data for CpG islands on human Chromosomes 21 and 22, we identify informative DNA attributes that correlate with open versus compact chromatin structures. These DNA attributes are used to predict the epigenetic states of all CpG islands genome-wide. Combining predictions for multiple epigenetic features, we estimate the inherent CpG island strength for each CpG island in the human genome, i.e., its inherent tendency to exhibit an open and transcriptionally competent chromatin structure. We extensively validate our results on independent datasets, showing that the CpG island strength predictions are applicable and informative across different tissues and cell types, and we derive improved maps of predicted “bona fide” CpG islands. 
The mapping of CpG islands by epigenome prediction is conceptually superior to identifying CpG islands by widely used sequence criteria since it links CpG island detection to their characteristic epigenetic and functional states. And it is superior to purely experimental epigenome mapping for CpG island detection since it abstracts from specific properties that are limited to a single cell type or tissue. In addition, using computational epigenetics methods we could identify high correlation between the epigenome and characteristics of the DNA sequence, a finding which emphasizes the need for a better understanding of the mechanistic links between genome and epigenome.
A key challenge for bioinformatic research is the identification of regulatory regions in the human genome. Regulatory regions are DNA elements that control gene expression and thereby contribute to the organism's phenotype. An important class of regulatory regions consists of so-called CpG islands, which are characterized by frequent occurrence of the CG sequence pattern. CpG islands are strongly associated with open and transcriptionally competent chromatin structure, they play a critical role in gene regulation, and they are involved in the epigenetic causes of cancer. In this article we make several conceptual improvements to the definition and mapping of CpG islands. First, we show that the traditional distinction between CpG islands and non-CpG islands is too harsh, and instead we propose a quantitative measure of CpG island strength to gradually distinguish between stronger and weaker regulatory regions. Second, by genome-wide comparison of multiple epigenome datasets we identify high correlation between features of the genome's DNA sequence and the epigenome, indicating strong functional interdependence. Third, we develop and apply a novel method for predicting the strength of all CpG islands in the human genome, giving rise to an improved and more accurate CpG island mapping.
The majority of CpG dinucleotides in mammalian genomes tend to undergo DNA methylation, but most CpG islands are resistant to such epigenetic modification. The mechanisms that may lead to the methylation resistance of CpG islands are still very poorly understood.
Using the genome-scale in vivo DNA methylation data from human brain, we investigated the flanking sequence features of methylation-resistant CpG islands, and discovered that there are several over-represented putative Transcription Factor Binding Sites (TFBSs) in methylation-resistant CpG islands, and a specific group of zinc finger protein binding sites are over-represented in boundary regions (∼400 bp) flanking such CpG islands. About 77% of the over-represented putative TFBSs are conserved among human, mouse and rat. We also observed the enrichment of 4 histone methylations in methylation-resistant CpG islands or their boundaries.
Our results suggest a possible mechanism that certain putative zinc finger protein binding sites over-represented in the boundary regions of the methylation-resistant CpG islands may block the spreading of methylation into these islands, and those TFBSs over-represented within the islands may both reinforce the methylation blocking and promote transcription. Some histone modifications may also enhance the immunity of the CpG islands against DNA methylation by augmenting these TFs' binding. We speculate that the dynamical equilibrium between methylation spreading and blocking is likely to be responsible for the establishment and maintenance of the relatively stable DNA methylation pattern in human somatic cells.
DNA methylation, the only known covalent modification of mammalian DNA, occurs primarily in CpG dinucleotides. 51% of CpGs in the human genome reside within repeats, and 25% within Alu elements. Despite that, no method has been reported for large-scale ascertainment of CpG methylation in repeats. Here we describe a sequencing-based strategy for parallel determination of the CpG-methylation status of thousands of Alu repeats, and a computation algorithm to design primers that enable their specific amplification from bisulfite converted genomic DNA. Using a single primer pair, we generated amplicons of high sequence complexity, and derived CpG-methylation data from 31 178 Alu elements and their 5′ flanking sequences, altogether representing over 4 Mb of a human cerebellum epigenome. The analysis of the Alu methylome revealed that the methylation level of Alu elements is high in the intronic and intergenic regions, but low in the regions close to transcription start sites. Several hypomethylated Alu elements were identified and their hypomethylated status verified by pyrosequencing. Interestingly, some Alu elements exhibited a strikingly tissue-specific pattern of methylation. We anticipate the amplicons herein described to prove invaluable as epigenome representations, to monitor epigenomic alterations during normal development, in aging and in diseases such as cancer.
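Primer design against bisulfite-converted DNA depends on what the converted sequence looks like. As a minimal in silico sketch: bisulfite treatment deaminates unmethylated cytosines, which read as T after amplification, while methylated cytosines are protected and still read as C. The function name and the representation of methylation as a set of 0-based positions are illustrative assumptions. This is not the primer-design algorithm the abstract refers to.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate the top-strand read-out after bisulfite conversion and PCR.

    Unmethylated C -> T; C at an index listed in methylated_positions stays C.
    methylated_positions is a hypothetical input format (0-based indices).
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# The C at index 1 is methylated and survives; the C at index 4 is converted
print(bisulfite_convert("ACGTCG", {1}))  # ACGTTG
```

Comparing the converted read-out with the original sequence at each CpG is what lets sequencing distinguish methylated from unmethylated sites.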
A systematic analysis of CpG islands in ten mammalian genomes suggests that an increase in chromosome number elevates GC content and prevents loss of CpG islands.
CpG islands, which are clusters of CpG dinucleotides in GC-rich regions, are considered gene markers and represent an important feature of mammalian genomes. Previous studies of CpG islands have largely been on specific loci or within one genome. To date, there seems to be no comparative analysis of CpG islands and their density at the DNA sequence level among mammalian genomes and of their correlations with other genome features.
In this study, we performed a systematic analysis of CpG islands in ten mammalian genomes. We found that both the number of CpG islands and their density vary greatly among genomes, though many of these genomes encode similar numbers of genes. We observed significant correlations between CpG island density and genomic features such as number of chromosomes, chromosome size, and recombination rate. We also observed a trend of higher CpG island density in telomeric regions. Furthermore, we evaluated the performance of three computational algorithms for CpG island identifications. Finally, we compared our observations in mammals to other non-mammal vertebrates.
Our study revealed that CpG islands vary greatly among mammalian genomes. Some factors such as recombination rate and chromosome size might have influenced the evolution of CpG islands in the course of mammalian evolution. Our results suggest a scenario in which an increase in chromosome number increases the rate of recombination, which in turn elevates GC content to help prevent loss of CpG islands and maintain their density. These findings should be useful for studying mammalian genomes, the role of CpG islands in gene function, and molecular evolution.
Astrocytomas are common and lethal human brain tumors. We have analyzed the methylation status of over 28,000 CpG islands and 18,000 promoters in normal human brain and in astrocytomas of various grades using the methylated-CpG island recovery assay (MIRA). We identified six to seven thousand methylated CpG islands in normal human brain. Approximately 5% of the promoter-associated CpG islands in normal brain are methylated. Promoter CpG island methylation is inversely correlated, and intragenic methylation is directly correlated, with gene expression levels in brain tissue. In astrocytomas, several hundred CpG islands undergo specific hypermethylation relative to normal brain, with 428 methylation peaks common to more than 25% of the tumors. Genes involved in brain development and neuronal differentiation, such as BMP4, POU4F3, GDNF, OTX2, NEFM, CNTN4, OTP, SIM1, FYN, EN1, CHAT, GSX2, NKX6-1, PAX6, RAX, and DLX2, were strongly enriched among genes frequently methylated in tumors. There was an overrepresentation of homeobox genes, and 31% of the most commonly methylated genes represent targets of the Polycomb complex. We identified several chromosomal loci in which many (sometimes more than 20) consecutive CpG islands were hypermethylated in tumors. Seven such loci were near homeobox genes, including the HOXC and HOXD clusters, and the BARHL2, DLX1, and PITX2 genes. Two other clusters of hypermethylated islands were at sequences of recent gene duplication events. Our analysis offers mechanistic insights into brain neoplasia, suggesting that methylation of genes involved in neuronal differentiation, in cooperation with other oncogenic events, may shift the balance from regulated differentiation towards gliomagenesis.
Methylation occurs frequently at 5′-cytosine of the CpG dinucleotides in vertebrate genomes; however, this epigenetic feature is rarely observed in CpG islands (CGIs) or CpG clusters in the promoter regions of genes. Aberrant methylation of the promoter-associated CGIs might influence gene expression and cause carcinogenesis. Because of the functional importance, multiple algorithms have been available for identifying CGIs in a genome or a sequence. They can be categorized into the traditional algorithms (e.g., Gardiner-Garden and Frommer (1987), Takai and Jones (2002), and CpGPRoD (2002)) or statistical property based algorithms (CpGcluster (2006) and CG cluster (2007)). We reviewed the features of these algorithms and evaluated their performance on identifying functional CGIs using genome-wide methylation data. Moreover, identification of CGIs is an initial step in many recent studies for predicting methylation status as well as in the design of methylation detection platforms. We reviewed the benchmarks and features used in these studies.
CpG island; CpG cluster; CG clusters; Methylation; Epigenetics; Promoter; Prediction algorithm
Methylation at CpG sites is a critical epigenetic modification in mammals. Altered DNA methylation has been suggested to be a central mechanism in development, some disease processes and cellular senescence. Quantifying the extent and identity of epigenetic changes in the aging process is therefore potentially important for understanding longevity and age-related diseases. In the current study, we have examined DNA methylation at >27 000 CpG sites throughout the human genome, in frontal cortex, temporal cortex, pons and cerebellum from 387 human donors between the ages of 1 and 102 years. We identify CpG loci that show a highly significant, consistent correlation between DNA methylation and chronological age. The majority of these loci are within CpG islands and there is a positive correlation between age and DNA methylation level. Lastly, we show that the CpG sites where the DNA methylation level is significantly associated with age are physically close to genes involved in DNA binding and regulation of transcription. This suggests that specific age-related DNA methylation changes may have quite a broad impact on gene expression in the human brain.
Cytosine-5 methylation within CpG dinucleotides is a potentially important mechanism of epigenetic influence on human traits and disease. In addition to influences of age and gender, genetic control of DNA methylation levels has recently been described. We used whole blood genomic DNA in a twin set (23 MZ twin-pairs and 23 DZ twin-pairs, N = 92) as well as healthy controls (N = 96) to investigate heritability and relationship with age and gender of selected DNA methylation profiles using readily commercially available GoldenGate bead array technology. Despite the inability to detect meaningful methylation differences in the majority of CpG loci due to tissue type and locus selection issues, we found replicable significant associations of DNA methylation with age and gender. We identified associations of genetically heritable single nucleotide polymorphisms with large differences in DNA methylation levels near the polymorphism (cis effects) as well as associations with much smaller differences in DNA methylation levels elsewhere in the human genome (trans effects). Our results demonstrate the feasibility of array-based approaches in studies of DNA methylation and highlight the vast differences between individual loci. The identification of CpG loci of which DNA methylation levels are under genetic control or are related to age or gender will facilitate further studies into the role of DNA methylation and disease.
DNA methylation occurs at CpG dinucleotide sites within the genome and is recognised as one of the mechanisms involved in regulation of gene expression. CpG sites are relatively underrepresented in the mammalian genome, but occur densely in regions called CpG islands (CGIs). CGIs located in the promoters of genes inhibit transcription when methylated by impeding transcription factor binding. Due to the malleable nature of DNA methylation, environmental factors are able to influence promoter CGI methylation patterns and thus influence gene expression. Recent studies have provided evidence that nutrition (and other environmental exposures) can cause altered CGI methylation but, with a few exceptions, the genes influenced by these exposures remain largely unknown. Here we describe a novel bioinformatics approach for the analysis of gene expression microarray data designed to identify regulatory sites within promoters of differentially expressed genes that may be influenced by changes in DNA methylation.
Bioinformatics; CpG islands; DNA methylation; Gene expression; In silico promoter analysis; Transcription factor binding sites
Across vertebrate genomes methylation of cytosine residues within the context of CpG dinucleotides is a pervasive epigenetic mark that can impact gene expression and has been implicated in various developmental and disease-associated processes. Several biochemical approaches exist to profile DNA methylation, but recently an alternative approach based on profiling non-methylated CpGs was developed. This technique, called CxxC affinity purification (CAP), uses a ZF-CxxC (CxxC) domain to specifically capture DNA containing clusters of non-methylated CpGs. Here we describe a new CAP approach, called biotinylated CAP (Bio-CAP), which eliminates the requirement for specialized equipment while dramatically improving and simplifying the CxxC-based DNA affinity purification. Importantly, this approach isolates non-methylated DNA in a manner that is directly proportional to the density of non-methylated CpGs, and discriminates non-methylated CpGs from both methylated and hydroxymethylated CpGs. Unlike conventional CAP, Bio-CAP can be applied to nanogram quantities of genomic DNA and in a magnetic format is amenable to efficient parallel processing of samples. Furthermore, Bio-CAP can be applied to genome-wide profiling of non-methylated DNA with relatively small amounts of input material. Therefore, Bio-CAP is a simple and streamlined approach for characterizing regions of the non-methylated DNA, whether at specific target regions or genome wide.
Epigenetic silencing of tumor suppressor genes in human cancers is associated with aberrant methylation of promoter region CpG islands and local alterations in histone modifications. However, the mechanisms that drive these events remain unclear. Here, we establish an important role for histone H4 lysine 16 acetylation (H4K16Ac) and the histone acetyltransferase hMOF in the regulation of TMS1/ASC, a proapoptotic gene that undergoes epigenetic silencing in human cancers. In the unmethylated and active state, the TMS1 CpG island is spanned by positioned nucleosomes and marked by histone H3K4 methylation. H4K16Ac was uniquely localized to two sharp peaks that flanked the unmethylated CpG island and corresponded to strongly positioned nucleosomes. Aberrant methylation and silencing of TMS1 was accompanied by loss of the H4K16Ac peaks, loss of nucleosome positioning, hypomethylation of H3K4 and hypermethylation of H3K9. In addition, a single peak of histone H4 lysine 20 trimethylation was observed near the transcription start site. Downregulation of hMOF or another component of the MSL complex resulted in a gene-specific decrease in H4K16Ac, loss of nucleosome positioning and silencing of TMS1. Gene silencing induced by H4K16 deacetylation occurred independently of changes in histone methylation and DNA methylation and was reversed upon hMOF re-expression. These results indicate that the selective marking of nucleosomes flanking the CpG island by hMOF is required to maintain TMS1 gene activity, and suggest that the loss of H4K16Ac, mobilization of nucleosomes and transcriptional downregulation may be important events in the epigenetic silencing of certain tumor suppressor genes in cancer.
DNA methylation; gene regulation; histone modifications; chromatin; cancer
At least half of the human genome is derived from repetitive elements, which are often lineage specific and silenced by a variety of genetic and epigenetic mechanisms. Using a transchromosomic mouse strain that transmits an almost complete single copy of human chromosome 21 via the female germline, we show that a heterologous regulatory environment can transcriptionally activate transposon-derived human regulatory regions. In the mouse nucleus, hundreds of locations on human chromosome 21 newly associate with activating histone modifications in both somatic and germline tissues, and influence the gene expression of nearby transcripts. These regions are enriched with primate and human lineage-specific transposable elements, and their activation corresponds to changes in DNA methylation at CpG dinucleotides. This study reveals the latent regulatory potential of the repetitive human genome and illustrates the species specificity of mechanisms that control it.
► A mouse carrying human chromosome 21 fails to repress primate-specific repeats ► The lack of repression was revealed by H3K4me3 and transcription factor binding ► Activation corresponded to a decrease in CpG methylation ► Primate-specific repeats activated in human testes were activated in the Tc1 mouse
The beginning of this millennium has seen dramatic advances in genomic research. Milestones such as the complete sequencing of the human genome and of many other species were achieved and complemented by the systematic discovery of variation at the single nucleotide (SNP) and whole segment (copy number polymorphism) level. Currently most genomics research efforts are concentrated on the production of whole genome functional annotations, as well as on mapping the epigenome by identifying the methylation status of CpGs, mainly in CpG islands, in different tissues. These recent advances have a major impact on the way genetic research is conducted and have accelerated the discovery of genetic factors contributing to disease. Technology was the critical driving force behind genomics projects: both the combination of Sanger sequencing with high-throughput capillary electrophoresis and the rapid advances in microarray technologies were keys to success. MALDI-TOF MS–based genome analysis represents a relative newcomer in this field. Can it establish itself as a long-term contributor to genetics research, or is it only suitable for niche areas and for laboratories with a passion for mass spectrometry? In this review, we will highlight the potential of MALDI-TOF MS–based tools for resequencing and for epigenetics research applications, as well as for classical complex genetic studies, allele quantification, and quantitative gene expression analysis. We will also identify the current limitations of this approach and attempt to place it in the context of other genome analysis technologies.
An effective tool for the global analysis of both DNA methylation status and protein–chromatin interactions is a microarray constructed with sequences containing regulatory elements. One type of array suited for this purpose takes advantage of the strong association between CpG Islands (CGIs) and gene regulatory regions. We have obtained 20 736 clones from a CGI Library and used these to construct CGI arrays. The utility of this library requires proper annotation and assessment of the clones, including CpG content, genomic origin and proximity to neighboring genes. Alignment of clone sequences to the human genome (UCSC hg17) identified 9595 distinct genomic loci; 64% were defined by a single clone while the remaining 36% were represented by multiple, redundant clones. Approximately 68% of the loci were located near a transcription start site. The distribution of these loci covered all 23 chromosomes, with 63% overlapping a bioinformatically identified CGI. The high representation of genomic CGI in this rich collection of clones supports the utilization of microarrays produced with this library for the study of global epigenetic mechanisms and protein–chromatin interactions. A browsable database is available on-line to facilitate exploration of the CGIs in this library and their association with annotated genes or promoter elements.
Transitions at CpG dinucleotides, referred to as “CpG substitutions”, are a major mutational input into vertebrate genomes and a leading cause of human genetic disease. The prevalence of CpG substitutions is due to their mutational origin, which is dependent on DNA methylation. In comparison, other single nucleotide substitutions (for example those occurring at GpC dinucleotides) mainly arise from errors during DNA replication. Here we analyzed high quality BAC-based data from human, chimpanzee, and baboon to investigate regional variation of CpG substitution rates.
We show that CpG substitutions occur approximately 15 times more frequently than other single nucleotide substitutions in primate genomes, and that they exhibit substantial regional variation. Patterns of CpG rate variation are consistent with differences in methylation level and susceptibility to subsequent deamination. In particular, we propose a “distance-decaying” hypothesis, positing that due to the molecular mechanism of a CpG substitution, rates are correlated with the stability of double-stranded DNA surrounding each CpG dinucleotide, and the effect of local DNA stability may decrease with distance from the CpG dinucleotide.
Consistent with our “distance-decaying” hypothesis, rates of CpG substitution are strongly (negatively) correlated with regional G+C content. The influence of G+C content decays as the distance from the target CpG site increases. We estimate that the influence of local G+C content extends up to 1,500 to 2,000 bp centered on each CpG site. We also show that the distance-decaying relationship persisted when we controlled for the effect of long-range homogeneity of nucleotide composition. GpC sites, in contrast, do not exhibit such a “distance-decaying” relationship. Our results highlight an example of the distinctive properties of methylation-dependent substitutions versus substitutions mostly arising from errors during DNA replication. Furthermore, the negative relationship between G+C content and CpG rates may provide an explanation for the observation that GC-rich SINEs show lower CpG rates than other repetitive elements.
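The local quantity discussed here, G+C content in a window centered on each CpG site, can be computed with a sketch like the following. The function name, the `half_window` parameter and the choice to include the CpG itself in the window are illustrative assumptions; the abstract does not describe how the authors defined their windows or estimated substitution rates.

```python
def local_gc_around_cpgs(seq, half_window=1000):
    """Return (cpg_start, gc_fraction) for a window around each CpG.

    The window spans half_window bases on each side of the CG dinucleotide,
    truncated at the sequence ends, and includes the CG itself
    (an assumption made for simplicity).
    """
    s = seq.upper()
    results = []
    i = s.find("CG")
    while i != -1:
        start = max(0, i - half_window)
        end = min(len(s), i + 2 + half_window)
        window = s[start:end]
        gc = (window.count("G") + window.count("C")) / len(window)
        results.append((i, gc))
        i = s.find("CG", i + 1)
    return results
```

Repeating the calculation for several values of `half_window` would give the series of window sizes needed to see whether an association with G+C content weakens with distance.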
Mutations are raw materials of evolution. Earlier studies have shown that mutations occur at different frequencies in different genomic regions. By investigating the patterns and causes of such “regional” variation of mutations, we can better understand the mechanisms underlying mutagenesis. In the human and other mammalian genomes, the most common type of mutation is caused by DNA methylation, which targets cytosines followed by guanine (CpG dinucleotides). Methylated cytosines are then subject to spontaneous deamination, which causes a C to T (or G to A) transition (CpG substitution). Because this mutational process is unique to CpG substitutions, we reasoned that they might show different patterns of variability from other substitutions. Using high quality genomic sequences from primates and separately analyzing the variability of CpG substitutions and other substitutions, we demonstrate that CpG substitutions occur approximately 15 times more frequently than other substitutions and show a distinctive pattern of regional variability. In particular, we propose and provide evidence that, because the deamination step requires temporary strand separation, G+C composition within 1,500–2,000 bp in each direction from a target CpG affects the probability of a CpG substitution. Incorporating the difference between CpG and other substitutions discovered in this study will help build more realistic evolutionary models.
DNA methylation, consisting of the addition of a methyl group at the fifth position of cytosine in a CpG dinucleotide, is one of the most well-studied epigenetic mechanisms in mammals with important functions in normal and disease biology. Disease-specific aberrant DNA methylation is a well-recognized hallmark of many complex diseases. Accordingly, various studies have focused on characterizing unique DNA methylation marks associated with distinct stages of disease development as they may serve as useful biomarkers for diagnosis, prognosis, prediction of response to therapy, or disease monitoring. Recently, novel CpG dinucleotide modifications with potential regulatory roles such as 5-hydroxymethylcytosine, 5-formylcytosine, and 5-carboxylcytosine have been described. These potential epigenetic marks cannot be distinguished from 5-methylcytosine by many current strategies and may potentially compromise assessment and interpretation of methylation data. A large number of strategies have been described for the discovery and validation of DNA methylation-based biomarkers, each with its own advantages and limitations. These strategies can be classified into three main categories: restriction enzyme digestion, affinity-based analysis, and bisulfite modification. In general, candidate biomarkers are discovered using large-scale, genome-wide, methylation sequencing, and/or microarray-based profiling strategies. Following discovery, biomarker performance is validated in large independent cohorts using highly targeted locus-specific assays. There are still many challenges to the effective implementation of DNA methylation-based biomarkers. Emerging innovative methylation and hydroxymethylation detection strategies are focused on addressing these gaps in the field of epigenetics.
The development of DNA methylation- and hydroxymethylation-based biomarkers is an exciting and rapidly evolving area of research that holds promise for potential applications in diverse clinical settings.
Affinity-based methylation analysis; bisulfite modification; hydroxymethylation; methylation-sensitive restriction enzymes; microarrays; next-generation sequencing
CpG dinucleotide clusters, also referred to as CpG islands (CGIs), are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation; as such, they are normally used as gene markers. Earlier CGI identification methods used the high CpG dinucleotide content of CGIs as a characteristic measure to locate them. Some more recent methods exploit the fact that the probability of nucleotide G following nucleotide C is greater in a CGI than in a non-CGI. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish a CGI from a non-CGI. These transition probabilities vary with the data being analyzed, and several sets of them have been reported in the literature, sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic, devoid of any ambiguities, to reliably and efficiently identify CGIs in a given DNA sequence. Our proposed scheme combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22 and proves to be highly reliable as well as efficient in identifying CGIs.
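The transition-probability idea used by the methods mentioned above can be sketched in Python as follows. This is not the null-filter scheme proposed in the article. It only shows how first-order transition probabilities, such as the probability that G follows C, might be estimated from a sequence. The function name `transition_probabilities` and the decision to skip dinucleotides containing symbols other than A, C, G, and T are assumptions of this sketch.

```python
from collections import Counter


def transition_probabilities(seq):
    """Estimate first-order transition probabilities P(next | current).

    Returns a dict keyed by (current, next) nucleotide pairs.
    Pairs containing anything other than A, C, G, T are ignored
    (an assumption of this sketch, e.g. to skip N runs).
    """
    seq = seq.upper()
    pair_counts = Counter(
        (a, b) for a, b in zip(seq, seq[1:]) if a in "ACGT" and b in "ACGT"
    )
    # Total number of counted transitions leaving each nucleotide.
    from_counts = Counter()
    for (a, _), n in pair_counts.items():
        from_counts[a] += n
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}
```

Comparing `transition_probabilities(region)[("C", "G")]` between a candidate region and flanking sequence gives a crude view of the contrast these methods rely on. As the abstract notes, estimates of this kind depend on the data from which they are derived.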
Epigenetic marks (eg, DNA 5-methylcytosine [5mC] content or CpG methylation) within specific gene regulatory regions have been demonstrated to play diverse roles in stress adaptation and resulting health trajectories following early adversity. Yet the developmental programming of the vast majority of the epigenome has not yet been characterized, and its role in the impact of early stress is largely unknown. In the present study, we investigated the relationships among early life stress, whole-epigenome and candidate stress pathway gene (serotonin transporter, 5-HTT) methylation patterns, and adult behavioral stress adaptation in a non-human primate model. Early in life, experimental variable foraging demand (VFD) stress or control conditions were administered to two groups of 10 female bonnet macaques (Macaca radiata) each and their mothers. As adults (3–13 years of age), these females were assessed for behavioral adaptation to stress across four conditions of increasing intensity. Blood DNA 5-HTT 5mC status was determined using sodium bisulfite pyrosequencing, and total 5mC content was determined using ELISA. Neither stress reactivity nor DNA methylation differed based on early life stress. However, we found that both greater 5-HTT and greater whole-genome 5mC were associated with enhanced behavioral stress reactivity following early life stress, but not control conditions. Therefore, regardless of developmental origin, greater DNA methylation conferred a genomic background of “risk” in the context of early stress. We suggest that this may arise from constrained plasticity in gene expression needed for stress adaptation early in development. This risk may have wider implications for psychological and physical stress adaptation and health.
Serotonin transporter; DNA methylation; genotype; variable foraging demand stress; development; Bonnet macaque
We applied a solution hybrid selection approach to the enrichment of CpG islands (CGIs) and promoter sequences from the human genome for targeted high-throughput bisulfite sequencing. A single lane of Illumina sequences allowed accurate and quantitative analysis of ~1 million CpGs in more than 21 408 CGIs and more than 15 946 transcriptional regulatory regions. Of the CpGs analyzed, 77–84% fell on or near capture probe sequences; 69–75% fell within CGIs. More than 85% of capture probes successfully yielded quantitative DNA methylation information for the targeted regions. Differentially methylated regions (DMRs) were identified in the 5′-end regulatory regions, as well as in the intra- and intergenic regions, particularly on the X chromosome, among the three breast cancer cell lines analyzed. We chose 46 candidate loci (762 CpGs) for confirmation with PCR-based bisulfite sequencing and demonstrated excellent correlation between the two data sets. Targeted bisulfite sequencing of three DNA methyltransferase (DNMT) knockout cell lines and the wild-type HCT116 colon cancer cell line revealed a significant decrease in CpG methylation for the DNMT1 knockout and DNMT1, 3B double knockout cell lines, but not in the DNMT3B knockout cell line. We demonstrated the targeted bisulfite sequencing approach to be a powerful method to uncover novel aberrant methylation in the cancer epigenome. Since all targets were captured and sequenced as a pool through a series of single-tube reactions, this method can be easily scaled up to deal with a large number of samples.
Epigenetic modification of DNA via CpG methylation is essential for the proper regulation of gene expression during embryonic development. Methylation of CpG motifs results in gene repression, while CpG island-containing genes are maintained in an unmethylated state and are transcriptionally active. The molecular mechanisms involved in maintaining the hypomethylation of CpG islands remain unclear. The transcriptional activator CpG binding protein (CGBP) exhibits a unique binding specificity for DNA elements that contain unmethylated CpG motifs, which makes it a potential candidate for the regulation of CpG island-containing genes. In order to assess the global function of this protein, mice lacking CGBP were generated via homologous recombination. No viable mutant mice were identified, indicating that CGBP is required for murine development. Mutant embryos were also absent between 6.5 and 12.5 days postcoitum (dpc). Approximately one-fourth of all implantation sites at 6.5 dpc appeared empty, with no intact embryos present. However, histological examination of 6.5-dpc implantation sites revealed the presence of embryo remnants, indicating that CGBP mutant embryos die very early in development. In vitro blastocyst outgrowth assays revealed that CGBP-null blastocysts are viable and capable of hatching and forming both an inner cell mass and a trophectoderm. Therefore, CGBP plays a crucial role in embryo viability and peri-implantation development.