A variety of DNA sequence motifs including inverted repeats, minisatellites, and the χ recombination hotspot, have been reported in association with gene conversion in human genes causing inherited disease. However, no methodical statistically-based analysis has been performed to formalize these observations. We have performed an in silico analysis of the DNA sequence tracts involved in 27 non-overlapping gene conversion events in 19 different genes reported in the context of inherited disease. We found that gene conversion events tend to occur within (C+G)- and CpG-rich regions and that sequences with the potential to form non-B-DNA structures, and which may be involved in the generation of double-strand breaks that could in turn serve to promote gene conversion, occur disproportionately within maximal converted tracts and/or short flanking regions. Maximal converted tracts were also found to be enriched (p<0.01) in a truncated version of the χ-element (a TGGTGG motif), immunoglobulin heavy chain class switch repeats, translin target sites and several novel motifs including (or overlapping) the classical meiotic recombination hotspot, CCTCCCCT. Finally, gene conversions tend to occur in genomic regions that have the potential to fold into stable hairpin conformations. These findings support the concept that recombination-inducing motifs, in association with alternative DNA conformations, can promote recombination in the human genome.
Homologous recombination is generally thought to be initiated by double-strand breaks (DSBs), resulting in either gene conversion or crossovers (Chen et al. 2007). Gene conversion occurs frequently between tandemly linked homologous DNA sequences and involves the non-reciprocal transfer of DNA from a ‘donor’ to an ‘acceptor’ sequence. When such transfers inactivate functional human genes, pathological consequences can ensue. To date, a large number of homologous recombination ‘hotspots’ have been described in yeast, mice and humans (Jeffreys et al. 2004; Nishant and Rao 2006). In addition, a variety of DNA sequences, including direct repeats, inverted repeats (sometimes incorrectly termed palindromes), minisatellite repeats, the χ recombination hotspot, and alternating purine-pyrimidine tracts with Z-DNA-forming potential, have been noted in association with gene conversion in human genes (Kilpatrick et al. 1984; Flanagan et al. 1984; Giordano et al. 1997; Chen and Férec 2000; Lopez-Correa et al. 2001; Lee et al. 2002; Rozen et al. 2003; Hallast et al. 2005; Wolf et al. 2009). However, the failure to identify any unique DNA sequence feature associated with ‘hotspot’ activity has fuelled speculation that the determinants of homologous recombination might not only be numerous but also ‘fuzzy’ (Jeffreys et al. 2004; Nishant and Rao 2006). Moreover, the reported gene conversion ‘hotspots’ could be remote from the DSB-initiating sites, since it is the sites of recombinational resolution rather than recombinational initiation that have almost invariably been investigated (Jeffreys et al. 2004).
Recent in vitro studies have revealed that simple repetitive DNA sequences known to be capable of adopting non-B DNA conformations (such as slipped structures, triplexes, tetraplexes, etc.) are highly mutagenic and prone to breakage (Bacolla et al. 2004; Bacolla and Wells 2004; Wang and Vasquez 2006). These findings have been augmented by evidence for a hairpin processing activity possessed by the Artemis/DNA-PKcs complex and Holliday-junction resolvase, both in the context of V(D)J recombination (Ma et al. 2002; Raghavan et al. 2006; Lieber et al. 2006) and in the delivery of recombinant adeno-associated virus used for gene therapy (Inagaki et al. 2007). Thus, it may be that it is the ability of a given DNA sequence to adopt a non-B DNA conformation, rather than the DNA sequence per se in the orthodox right-handed Watson-Crick B-form, that induces chromosomal DSBs. Recently, this postulate has received broad support from the convergence of biochemical, genetic and genomic studies in the context of gross genomic deletions, inversions, duplications and translocations (reviewed in Wells 2007).
We speculated that DNA sequences with the potential to form non-B structures might also play an important role in homologous recombination leading to gene conversion. To test this postulate, we employed as a model system a series of known examples of human pathological mutations resulting from gene conversion events that involve the transfer of a short sequence tract from a donor to an acceptor gene. The advantage of this approach was that the extents of the maximal converted tracts (MaxCTs) and minimal converted tracts (MinCTs) (see Figure 1 for term definitions) associated with such pathological events can usually be accurately determined and annotated. To verify the prevailing view that the homologous recombination sites of DSBs that initiate gene conversion reside within MaxCTs (Chen et al. 2007) and to explore their precise location, a series of regions including or spanning the MinCTs and MaxCTs were screened for the presence of specific DNA sequence motifs and for sequences capable of adopting non-B DNA conformations. The basic assumption made in this analysis was that a marked overrepresentation of such DNA sequences either within the gene conversion tracts themselves, or within their short flanking regions, could imply their involvement in DSB formation.
Four different sequence datasets were employed comprising minimal converted tracts (MinCTs), maximal converted tracts (MaxCTs) [see Figure 1 for term definitions], and regions spanning MaxCTs but including either ±15 or ±150 bp flanking sequences (henceforth termed ShortFlank and LongFlank datasets respectively). DNA sequence data were derived from 27 non-overlapping interlocus gene conversion events in 19 different genes reported to cause human inherited disease (Table 1, Supp. Figure S1). Whereas the majority of gene conversion events were listed in the collation of Chen et al. (2007), an additional eight cases associated respectively with congenital adrenal hyperplasia (Globerman et al. 1988), increased CYP3A7 gene expression in adult liver and intestine (Kuehl et al. 2001), agammaglobulinemia (Conley et al. 1999) and conversion events in the GYPA (Huang et al. 2000), HBA1 (Law et al. 2006), Sec1 (Soejima et al. 2008), CD46 (Fremeaux-Bacchi et al. 2006) and KRT17 (Hashiguchi et al. 2002) genes were included.
Clearly, if all reported pathological gene conversion events had simply been collated without regard to their frequency of occurrence, the dataset would have included a number of identical DNA sequences which would have been represented multiple times owing to the existence of gene conversion hotspots. This multiple inclusion of specific DNA sequences would then have introduced considerable bias into any subsequent search for sequence motifs involved in promoting gene conversion. Therefore, to avoid this problem we adopted a highly conservative strategy in which no overlapping DNA sequences were allowed within any of the datasets to be analyzed. Thus, in cases where a number of different gene conversion events in the same gene overlapped (e.g. in the GBA, NCF1 and VWF genes), only non-overlapping events with MaxCT≤520 bp were included. The gene conversion events identified in the HBD and OPN1LW genes were also excluded from the analysis since the MinCTs and MaxCTs could not be accurately and unambiguously ascertained.
The locations of MaxCTs and MinCTs were broadly assigned to the following genomic regions: open reading frame (ORF) if the gene conversion region resided within the genomic sequence of a known gene or ORF; ORF/3′ if the gene conversion region started within the genomic sequence of a known gene or ORF but ended within a 3′UTR; 5′/ORF if the gene conversion region began within a 5′UTR but ended within the genomic sequence of a known gene or ORF; 5′ or 3′ if the gene conversion region lay within a 5′ or 3′ flanking region respectively.
An ‘artificial chromosome’ of ~7 Mbp was constructed by concatenating the genomic sequences (sense strand) of 100 randomly selected human genes (Supp. Table S1), taken from the complete set of ~23,000 annotated genes (UCSC hg18 Human Genome Assembly, March 2006), and including the exons and introns plus 1000 bp 5′ flanking region (before the transcriptional start site) and 1000 bp 3′ flanking region (beyond the transcriptional termination site) in each case. Although most of these 100 randomly selected genes have been validated, several of them still have provisional status. For genes characterized by multiple transcripts, the longest transcripts were almost invariably selected. The structure of the artificial chromosome therefore reflects as closely as possible the gene/ORF composition of an actual chromosome albeit without the intergenic regions. The R/Y composition of the transcribed strand of the artificially created chromosome (49%/51%) was found to correlate well with the corresponding nucleotide composition calculated for the human genome as a whole (50%/50%).
The DNA sequences from the MaxCT, MinCT and ShortFlank datasets were screened for the presence of 37 DNA sequence motifs of length ≥5 nucleotides (nt) [plus their complements] known to be associated with site-specific cleavage/recombination, high frequency mutation and gene rearrangement (Abeysinghe et al. 2003) as well as various ‘super-hotspot motifs’ found in the vicinity of micro-deletions, micro-insertions and indels (Ball et al. 2005). These DNA sequence datasets were also screened for the presence of 15 additional recombination-associated motifs (plus their complements) identified by Cullen et al. (2002) and a CCTCCCT motif associated with a classical meiotic recombination hotspot (Myers et al. 2005; Myers et al. 2006; Frazer et al., 2007).
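A motif screen of this kind can be sketched as follows. This is an illustration rather than the software used in the study: it assumes that motifs are written in IUPAC notation, that overlapping occurrences are counted, and that the "complement" of a motif is its reverse complement read 5′ to 3′ (consistent with the pairs quoted in the text, such as GCTGGTGG and CCACCAGC). The function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
# Complement of each IUPAC code (R<->Y, K<->M; W, S and N are self-complementary).
COMPLEMENT = str.maketrans("ACGTRYWSKMN", "TGCAYRWSMKN")

def reverse_complement(motif):
    """Return the reverse complement of an IUPAC motif."""
    return motif.translate(COMPLEMENT)[::-1]

def count_motif(seq, motif):
    """Count occurrences of an IUPAC motif in seq, including overlapping ones."""
    # A zero-width lookahead lets successive matches overlap.
    pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
    return len(re.findall(pattern, seq.upper()))

def count_both_orientations(seqs, motif):
    """Return (direct, complementary) motif counts summed over a dataset."""
    rc = reverse_complement(motif)
    direct = sum(count_motif(s, motif) for s in seqs)
    complementary = sum(count_motif(s, rc) for s in seqs)
    return direct, complementary
```

Counting the two orientations separately mirrors the way the results report the direct and complementary motifs as distinct entries.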
In addition, complexity analysis (Gusev et al. 1999) was used to identify perfect direct repeats, inverted repeats and symmetric elements that would be capable of forming non-B DNA structures which are believed to induce DNA breakage (Wells 2007). To this end, the following sequences were sought: direct repeats of length ≥7 nt that were less than 20 nt apart from each other and which could form slipped structures; tetraplexes formed by four GGG, GGGG or GGGGG repeats (termed G-quartets) and separated from each other by up to 5 nt; cruciforms formed by two inverted repeats with the minimum stem size and maximum loop length being set to 7 nt and 20 nt respectively; triplexes formed by two symmetric elements of length ≥7 nt comprising at least 75% R (or Y) bases and separated by up to 20 nt (overlapping direct repeats, inverted repeats or self-symmetrical elements with combined lengths shorter than 14 nt were excluded from the analysis); left-handed Z-DNA formed by at least 6 consecutive RY motifs; triplet repeat sequences of the form GAA•TTC, CGG•CCG and CTG•CAG (the dot separates the complementary strands), of total length ≥9 bp, which are known to induce genetic instability (Napierala et al. 2005; Wells and Ashizawa 2006; Wells et al. 2005; Mirkin 2007). These assumptions about the lengths of the repeats capable of non-B DNA structure formation and their relative locations were derived from parameters observed in in vitro studies and from empirical biochemical data on non-B DNA conformations (Sinden 1994; Bacolla and Wells 2004; Wells 2007; Bacolla et al. 2008).
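Three of the search criteria listed above (direct repeats, inverted repeats and Z-DNA-forming RY tracts) can be illustrated with a brute-force sketch. It is not the complexity-analysis method of Gusev et al. (1999) used in the study; it simply applies the stated thresholds to fixed-length seeds. The function names are hypothetical, and the measurement of "apart" as the gap between the end of one copy and the start of the next is an assumption.

```python
import re

_COMP = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(_COMP)[::-1]

def find_direct_repeats(seq, min_len=7, max_gap=20):
    """Perfect direct repeats: identical min_len-mers separated by fewer than max_gap nt."""
    seq = seq.upper()
    hits, last_seen = [], {}
    for i in range(len(seq) - min_len + 1):
        unit = seq[i:i + min_len]
        j = last_seen.get(unit)
        if j is not None:
            gap = i - (j + min_len)      # nt between the two copies
            if 0 <= gap < max_gap:       # non-overlapping and close together
                hits.append((j, i, unit))
        last_seen[unit] = i
    return hits

def find_inverted_repeats(seq, min_stem=7, max_loop=20):
    """Inverted repeats: a min_stem-mer followed within max_loop nt by its reverse complement."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * min_stem + 1):
        arm = seq[i:i + min_stem]
        target = reverse_complement(arm)
        first = i + min_stem             # loop length 0
        last = min(first + max_loop, len(seq) - min_stem)
        for j in range(first, last + 1):
            if seq[j:j + min_stem] == target:
                hits.append((i, j, arm))
    return hits

def find_z_dna(seq, min_units=6):
    """Spans of at least min_units consecutive purine-pyrimidine (RY) dinucleotides."""
    pattern = "(?:[AG][CT]){%d,}" % min_units
    return [m.span() for m in re.finditer(pattern, seq.upper())]
```

Because only seeds of the minimum length are compared, a longer repeat is reported as several adjacent hits; merging these into maximal repeats is left out of the sketch.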
For each type of analysis, the statistical significance was assessed by comparison with two distinct types of control dataset, generated as follows. Firstly, 1000 control datasets comprising the same number of random sequences as the original dataset, and matching the original dataset in terms of their length and mononucleotide composition, were simulated by reshuffling the original sequences. Secondly, for each of the original (case) sequence datasets, 1000 control datasets comprising the same number of sequences as the original dataset, and matching the original dataset in terms of their length and location, were randomly selected from the artificial chromosome. z-scores (Marino-Ramirez et al. 2004) were then calculated for the collection of sequence motifs (plus their complements) and the above-described non-B DNA-forming sequences as follows: $z(R) = \frac{N(R) - \mu(R)}{\sqrt{\sigma^{2}(R)}}$, where N(R) is the frequency of a specific non-B DNA-forming sequence or sequence motif either in the case dataset or in its matching control dataset generated as described above, and μ(R) and σ²(R) are the mean frequency and its variance estimated from 1000 control datasets. Any sequence/motif with a z-score, calculated for all DNA sequences comprising the case dataset, that exceeded the 99th (99.9th) or 95th (99.5th) percentile of the maximum z-scores found for the corresponding 1000 control datasets, generated either by reshuffling or instead selected at random from the artificial chromosome, was deemed to be statistically significant at the 1% (0.1%) or 5% (0.5%) level, respectively. All results were corrected for multiple testing using the Bonferroni correction.
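The reshuffling control and the z-score can be sketched as below. This is a simplified illustration under stated assumptions: the population variance is used, a zero variance is mapped to a z-score of 0, and the comparison against percentiles of the maximum control z-scores and the Bonferroni correction are omitted. All function names are hypothetical.

```python
import random
import statistics

def shuffled_control(seqs, rng):
    """Reshuffle each sequence, preserving its length and mononucleotide composition."""
    controls = []
    for s in seqs:
        bases = list(s)
        rng.shuffle(bases)
        controls.append("".join(bases))
    return controls

def count_occurrences(seqs, motif):
    """Total number of (possibly overlapping) exact motif occurrences in a dataset."""
    return sum(
        1
        for s in seqs
        for i in range(len(s) - len(motif) + 1)
        if s.startswith(motif, i)
    )

def z_score(observed, control_counts):
    """z = (N - mean) / sqrt(variance), with mean and variance taken over the controls."""
    mu = statistics.fmean(control_counts)
    sd = statistics.pstdev(control_counts)   # population variance: an assumption
    return (observed - mu) / sd if sd > 0 else 0.0

def motif_z_score(case_seqs, motif, n_controls=1000, seed=0):
    """z-score of a motif count in the case dataset against reshuffled controls."""
    rng = random.Random(seed)                # fixed seed for reproducibility
    observed = count_occurrences(case_seqs, motif)
    controls = [
        count_occurrences(shuffled_control(case_seqs, rng), motif)
        for _ in range(n_controls)
    ]
    return z_score(observed, controls)
```

The same `z_score` function would apply to controls drawn from the artificial chromosome; only the way the control sequences are generated differs.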
To further validate the reshuffling procedure and to correct for possible asymmetries in nucleotide composition between the case and control datasets, a randomly selected matching dataset of non-gene conversion prone sequences from the artificial chromosome was treated as a mock ‘case dataset’.
The AlignACE3.0 software based on Gibbs sampling (Hughes et al. 2000) available at http://atlas.med.harvard.edu/cgi-bin/alignace.pl was used to search for novel sequence motifs in the MaxCT and MinCT datasets.
The MFold software (Zuker 2003), available at http://mfold.bioinfo.rpi.edu/cgi-bin/dna-form1.cgi, was used to predict the secondary structure of single stranded DNA sequences from the LongFlank dataset. ANOVA tests were performed to assess whether or not there was a significant difference in the mean free energy required to attain the adopted DNA conformations, between the case and control datasets. For each of the single stranded DNA sequences from the LongFlank dataset, 10 matching sequences were generated from the artificial chromosome and their mean free energy was calculated across these 10 generated sequences.
In the case of 21 of the 27 gene conversion events analyzed (Table 1), the MinCT and MaxCT regions were located within the ORF portions of the acceptor gene sequences. In one case (the CFHR1-CFH gene pair), however, the MinCT was confined to within the ORF whereas the MaxCT extended as far as the ORF/3′ portion of the acceptor CFH gene. In 3 other cases (one CYP21A1P-CYP21A2, the CYP3A4-CYP3A7 and GH2-GH1 gene pairs), both the MinCT and MaxCT were located within the promoter (5′) portion of the genes. Hence, >80% of the pathological gene conversion events occurred within the ORF segments of genes, whereas no cases were found within the 5′/ORF or downstream (3′) of the genes.
The basic premise underlying this analysis has been that a marked overrepresentation of certain DNA motifs located within the gene conversion tracts (or within ShortFlank regions spanning MaxCTs) in comparison to 1000 control datasets compiled from the artificial chromosome, could imply their involvement in DSB formation. Although this assumption would appear to have a sound statistical basis, comparison of MinCT, MaxCT and ShortFlank datasets with 1000 matching datasets generated from the artificial chromosome revealed striking asymmetries in their nucleotide compositions. Thus, the relative frequencies of nucleotides T, A, G and C in the MaxCTs from the case datasets, for example, were found to be 23%, 23%, 27% and 27% respectively, whilst the corresponding relative frequencies in the control dataset generated from the artificial chromosome were 31%, 28%, 21% and 20%. These differences were found to be highly significant by means of the χ²-test, with the corresponding p-values for the MinCT, MaxCT and ShortFlank sequences being 5.1×10⁻¹¹, 6.44×10⁻⁴⁷ and 9.2×10⁻⁵⁴ respectively.
To determine whether these differences were simply due to the sense strand orientation of the artificial chromosome sequences in contrast to both the sense and antisense strand orientations of the case sequences, the nucleotide frequencies were re-categorized into two groups: (C+G) and (A+T). This analysis revealed that the (C+G) to (A+T) ratios in the MinCT, MaxCT and ShortFlank regions were >1.09 and <0.72 for the case and control sequences, respectively, with the Pearson χ²-test p-values for MinCT, MaxCT and ShortFlank regions being 1.39×10⁻¹², 1.24×10⁻⁴⁸ and 1.12×10⁻⁵⁵ respectively. Therefore, because the (C+G) and (A+T) frequencies are independent of strand orientation, we conclude that the pathological gene conversion events have tended to occur disproportionately within (C+G)-rich regions.
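A comparison of this type can be sketched as a Pearson χ²-test on a 2×2 table of (C+G) and (A+T) counts in the case and control datasets. The sketch assumes that no continuity correction is applied, which the text does not specify; the function names and the counts in the usage note are hypothetical.

```python
import math

def gc_at_counts(seqs):
    """Return the total (C+G) and (A+T) counts over a dataset of sequences."""
    gc = at = 0
    for s in seqs:
        s = s.upper()
        gc += s.count("G") + s.count("C")
        at += s.count("A") + s.count("T")
    return gc, at

def chi2_2x2(case_gc, case_at, ctrl_gc, ctrl_at):
    """Pearson chi-square statistic and p-value (1 d.f.) for a 2x2 table.

    No continuity correction is applied (an assumption of this sketch).
    """
    a, b, c, d = case_gc, case_at, ctrl_gc, ctrl_at
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, `chi2_2x2(20, 10, 10, 20)` gives a statistic of 20/3 (about 6.67), which lies just beyond the 1% critical value for one degree of freedom.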
In the context of assessing the overrepresentation of motifs/repeats in the vicinity of the gene conversion tracts, the observed asymmetries in nucleotide composition render the use of reshuffled sequence controls [which preserve the (C+G)-richness of the original sequences] preferable to those controls generated from the artificial chromosome. Indeed, the use of the artificial chromosome controls may actually increase the number of false positives; (C+G)- or (A+T)-rich motifs could be found to be respectively over-represented and underrepresented as a result of the paucity of (C+G) and relative abundance of (A+T) in the control dataset generated from the artificial chromosome. On the other hand, since analysis of the nucleotide composition of specific motifs used in this study indicated that there are no overall asymmetries in the occurrence of specific nucleotides that were previously observed between the case dataset and controls generated from the artificial chromosome (all nucleotides were found to be equally probable), both types of control were used to assess the overrepresentation of motifs.
Several motifs known to be associated with site-specific cleavage/recombination, high frequency mutation and gene rearrangement (Abeysinghe et al. 2003) were found to be over-represented (p≤0.05) within the MinCTs, MaxCTs and their flanking regions, ShortFlank (Table 2). Specifically, a truncated version of the χ-element, TGGTGG, previously reported as a mutational ‘super-hotspot’ common to micro-deletions, micro-insertions and indels (Ball et al. 2005; Table 2), was found to be significantly over-represented with respect to both control datasets (p<0.01) within MinCTs (6 occurrences), MaxCTs (11 occurrences) and ShortFlank (12 occurrences) regions. [NB. This TGGTGG motif self-evidently comprises two short direct repeats]. The human minisatellite conserved sequence/χ-like element CCWCCWGC (and/or its complement, GCWGGWGG) was found in all three datasets. However, the complementary motif was found to be over-represented in the MaxCTs only at the 5% level using the reshuffling control and at the 1% level for the ShortFlank dataset using both types of controls. Surprisingly, the full-length χ-element, GCTGGTGG, often noted in the vicinity of gene conversion tracts and reported to promote gene conversion (Giordano et al. 1997; Lopez-Correa et al. 2001; Rozen et al. 2003), was observed only once (and only as its complement CCACCAGC) within all three regions; it was found to be over-represented in the MinCTs for both types of control, but only at the 5% level of significance.
Two immunoglobulin heavy chain class switch repeats, TGGGG and GGGCT, were found to be over-represented (p<0.01) as direct copies within the MinCTs & MaxCTs and MaxCTs & ShortFlank regions respectively. Motifs complementary to TGGGG and GAGCT were over-represented within the MaxCTs and MaxCTs & ShortFlank regions respectively (p<0.01) in comparison with both types of control (note that TGGGG and GGGGT, and GAGCT and TGAGC represent the same sequence but with different reading frames). Interestingly, the actual number of occurrences of motifs GAGCTᶜ and TGAGCᶜ [the superscript c denotes the complementary sequence] is unchanged when longer sequences (ShortFlank) are analyzed. Analysis of the spatial distributions of the TGGGG motif and its complement, CCCCA, revealed that whereas the majority of the CCCCA motifs occurred in the regions flanking the MinCTs, the majority of the TGGGG motifs were located within the MinCT region (Supp. Figure S2).
One of the translin target sites, ATGCAG, was found to be over-represented (at the 5% level) in all three regions with respect to both types of control; translin is a DNA-binding protein that specifically recognises consensus sequences at the breakpoint junctions of chromosomal translocations, albeit usually involving immunoglobulin/T-cell receptor genes (Abeysinghe et al. 2003; Gajecka et al. 2006). Overall, statistical significance was independent of strand orientation for motifs occurring more frequently (≥12 occurrences), whereas it was frequently associated with strand bias for the more rarely occurring motifs (<12 occurrences). Hence, such strand bias might simply have resulted from the stochastic orientation of repeats along the chromosomes (Bacolla et al., 2008).
Thus, in summary, three types of DNA sequence motif, 1) a truncated version of the χ-element and its relative, the human minisatellite conserved sequence/χ-like element, 2) two immunoglobulin heavy chain class switch repeats and 3) translin target sites were found to be consistently over-represented in either MinCTs, MaxCTs or ShortFlank datasets analysed in our study. These findings were independent of the type of control dataset used.
Several motifs including the ‘hamster deletion hotspot’, TGGAG, and the DNA polymerase arrest site WGGAG and their complements, were found to be over-represented (p≤0.01) in both MaxCTs and ShortFlank regions (Table 2). Analysis of their relative spatial distributions (Supp. Figure S2) indicated that the majority of CTCCA (complement of TGGAG) and CTCCW (complement of WGGAG) motifs occurred in the regions flanking the MinCTs.
Motifs complementary to the DNA polymerase α frameshift hotspot sequence, CTGGCG, found in 2 and 3 copies in MaxCTs and ShortFlank regions, were also over-represented at the 5% and 1% levels respectively but only in comparison with the artificial chromosome. By contrast, the DNA polymerase β frameshift hotspot, ACCCWR, was found to be over-represented in MaxCTs and ShortFlank regions at the 5% level by comparison with the reshuffled controls. MaxCT and ShortFlank regions were enriched (p<0.01) in long R- or Y-tracts but only by comparison with the reshuffled controls.
Several motifs, such as the alternating purine-pyrimidine tract RYRYR, the human Fra(X) breakpoint cluster CGGCGG, and the murine parvovirus recombination hotspot CTWTTY, were found to be underrepresented in MaxCT and ShortFlank regions either in comparison with reshuffled controls and/or an artificial chromosome. On the other hand, a motif complementary to the human Fra(X) breakpoint cluster was overrepresented in ShortFlank regions but only in a comparison made with the artificial chromosome.
Sequences capable of non-B DNA slipped structure formation (Table 2; Supp. Figure S1) were found to be significantly over-represented (p<0.001) within MinCTs, MaxCTs and ShortFlank regions when compared with the reshuffled control datasets whereas sequences capable of cruciform structure formation were found to be over-represented (p<0.001) within MaxCTs and ShortFlank regions irrespective of the type of control used. Inspection of these non-B DNA forming sequences indicated that they were also (C+G)-rich, suggesting that (C+G)-richness of the non-B DNA forming sequences may be an important additional feature in rendering such sequences susceptible to gene conversion.
To further validate the reshuffling procedure and to account for (C+G) content, a mock ‘case dataset’ was randomly selected from the artificial chromosome (see Materials and Methods); none of the non-B DNA forming repeats found in the ‘mock case dataset’ were over-represented by comparison with controls generated by reshuffling of the ‘mock’ dataset. In addition, the sequences (GAA)≥3, (CGG)≥3 and (CTG)≥3, present in one copy in at least one of the datasets analyzed and already known to induce genetic instability (Napierala et al. 2005; Wells and Ashizawa 2006; Wells et al. 2005; Mirkin 2007), were found to be over-represented (p<0.01) in MinCTs (GAA) and MaxCTs (CGG) and in both MaxCTs and ShortFlank regions (CTG). By contrast, no triplex-forming (R•Y symmetric elements), tetraplex-forming (G-tetrads) or Z-DNA-forming [(RY)≥6] motifs were found to be over-represented in any of the gene conversion regions studied.
The AlignACE3.0 software was also used to search for novel DNA sequence motifs recurring within the MinCTs and MaxCTs. Motifs sharing at least 5, 6 or 7 positions, with ≤2 positions in the consensus sequence occupied by N (any nucleotide), were considered. The background C+G content was set to 41%, corresponding to that observed in the human genome at large. Sixteen and 23 novel sequence motifs (length ≥5bp) were identified as being overrepresented within the MinCTs and MaxCTs respectively; all were found to be over-represented (in one or both orientations) by comparison with the control dataset generated by reshuffling and with the control dataset derived from the artificial chromosome (p<0.01) (Supp. Tables S2 and S3).
Of the 39 types of motif identified in MinCTs and MaxCTs, 27 contained fragments of known ‘super-hotspots’ for mutation, 21 contained fragments of the immunoglobulin class switch region (Rabbitts 1994; Ohno 1981), 14 included fragments of translin-binding sites (Aoki et al. 1995), 13 contained fragments of the human hypervariable minisatellite core sequence/recombination hotspot (Wahls et al. 1990a) or fragments of the hamster APRT deletion hotspot (Smith and Adair 1996), 10 corresponded to known DNA polymerase frameshift/arrest sites, 5 contained fragments of the PUR protein binding site (Bergemann and Johnson 1992; Smith et al. 1998) or fragments of the classical recombination hotspot CCTCCCCT (Myers et al. 2005; Myers et al. 2006; Frazer et al., 2007), 4 contained fragments of the mariner transposon-like element (Reiter et al. 1996) or fragments of χ/χ-like elements, 2 consisted of (CGG)2 repeats of the human Fra(X) breakpoint cluster, and finally one sequence motif was part of the XY32 R•Y H-palindrome (Rooney and Moore 1995).
Hence, the search for motifs known to occur recurrently at recombination sites (Table 2) and the search for novel motifs (Supp. Tables S2 and S3) were concordant in that they identified a number of similar types of DNA sequence. In addition, the search for novel motifs revealed aptamer-like sequences used during class switch recombination, as well as translin pseudo-binding sites. It is certainly possible that, in these pathological gene conversion events, RAG-associated and translin proteins could have played key roles during the recombination process. By contrast, χ/χ-like elements are likely only to have played a modest role.
Several novel motifs were found to include either a fragment or the entire sequence of the human minisatellite conserved sequence/χ-like element. Specifically, motif CWGSWG was found to be over-represented in both orientations within the MinCTs whereas motifs GGWGGᶜ, CTGGNSᶜ and KGGWGGᶜ were found to be over-represented within the MaxCTs.
One motif, GGSAG, found to be over-represented within MinCTs, and four motifs, GGWGG, SNSWGG, KGGWGG and SNWGNRRSS, found to be over-represented in MaxCTs, include either a fragment or the entire sequence of the classical meiotic recombination hotspot CCTCCCT (Myers et al. 2005; Myers et al. 2006).
The second type of information uncovered by the search for novel motifs was the overrepresentation of CpG dinucleotides within the case dataset as compared to the corresponding control generated from the artificial chromosome (χ²-test; corresponding p-values for the MinCT, MaxCT and ShortFlank sequences are 7.5×10⁻⁵, 5.1×10⁻¹¹ and 6.2×10⁻¹⁵ respectively). An association between CpG dinucleotide richness and recombination has been previously reported in different contexts (Kong et al., 2002; Jensen-Seaman et al., 2004; Han et al., 2008; Tsai et al., 2008) as has an association between CpG dinucleotide richness and gene conversion (Högstrand and Böhme 1999). Because CpG dinucleotides may be methylated, these composite observations raise the intriguing possibility of a relationship between CpG methylation and gene conversion.
The lengths of the MinCTs, MaxCTs and ShortFlank regions vary between 4 bp & 289 bp, 56 bp & 520 bp, and 86 bp & 550 bp, respectively. To determine whether the numbers of the various motifs/non-B DNA forming sequences found to be overrepresented in these regions correlated directly with the lengths of the tracts in which they were found, Pearson's correlation was calculated for each of the three datasets. For the majority of motifs found to be overrepresented within the MinCT regions, a strong correlation (r>0.8, p<0.001) was noted between the number of motifs observed and the length of the corresponding tract. One simple explanation is that >30% of the MinCTs have lengths shorter than the motifs or repeats sought. By contrast, motifs/repeats found to be overrepresented in MaxCT and ShortFlank regions did not exhibit any correlation >0.7 between their numbers and tract lengths. For example, the Pearson's correlation coefficient observed between the number of non-B DNA forming sequences and their corresponding tract lengths did not exceed 0.4 for either the MaxCT or the ShortFlank regions. This indicates that, at least for the MaxCT and ShortFlank regions, enrichment in a given motif/repeat was not dependent upon sequence length.
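The length-dependence check described above amounts to computing Pearson's correlation coefficient between per-tract motif counts and tract lengths. A minimal sketch follows; the function name is hypothetical, and the tract lengths and counts in the usage note are invented for illustration, not taken from the study.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical tract lengths (bp) and motif counts per tract.
tract_lengths = [56, 120, 240, 400, 520]
motif_counts = [1, 0, 2, 1, 1]
r = pearson_r(tract_lengths, motif_counts)
```

A coefficient close to 1 would indicate that the motif count simply scales with tract length, whereas a low value indicates that enrichment is not explained by length alone.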
As described above, inverted repeats, which may form hairpin structures, were found to be both abundant and over-represented in the MaxCT and ShortFlank regions (Table 2). To investigate whether these possible secondary structures might have been part of larger and relatively stable DNA conformations encompassing the regions of pathological gene conversion, a comparison was performed of the free energies (–ΔG) required for folding the most thermodynamically stable hairpin structures of the sequences from the LongFlank dataset with matching controls from the artificial chromosome (see Materials and Methods). The mean –ΔG value for the most stable folded-back hairpin structures from the 27 LongFlank gene conversion cases was -75.7 kcal/mole, whereas the mean free energy for the 270 controls was -53.6 kcal/mole (ANOVA, p=9.5×10⁻⁴). This implies that gene conversion has tended to occur within genomic regions that have the potential to fold into stable hairpin conformations (Supp. Figure S3 shows an example). The increased stability of the gene conversion-associated non-B DNA structures may be due to the extended hairpin-stems and/or a greater number of Watson-Crick C•G pairs within the stems, as would be expected from the (G+C)-rich nature of the relevant genomic regions. In summary, our composite analyses revealed that sites of gene conversion frequently comprise recombination hotspots (Hellmann et al. 2005; Kong et al. 2002) associated with non-B DNA-forming repeats.
Our findings have for the first time placed the DNA sequence analysis of gene conversion tracts on a sound statistical footing. This study has provided firm evidence that motifs associated with recombination activity and sequences with the potential to form non-B DNA structures are both over-represented within MaxCTs and ShortFlank sequences. These results strongly support the postulate that the gene conversion events were initiated by DSBs at sites of non-B DNA structure formation, which then activated the proximal recombination-promoting motifs to serve as substrate for the subsequent mutagenic repair process.
Two types of non-B DNA-forming short repeat were found to be consistently over-represented within the MaxCT and ShortFlank regions; in particular, direct repeats which may form slipped structures, and inverted repeats which may form hairpin/cruciform structures (Table 2; Supp. Figure S1). Both types of non-B DNA conformations may be expected to be acted upon by DNA repair proteins and therefore represent an intrinsic source of induced DSBs (Wang and Vasquez 2006). Since the number of direct and inverted repeats in the analysed regions was quite substantial (respectively 16 and 9 in the ShortFlank regions), these results provide strong support for the notion that occasional non-B DNA conformations formed by these repeats can contribute to the initial events (including DSBs) that trigger recombination leading to gene conversion (Wang and Vasquez 2006; Bacolla et al. 2004). Further, our results also support the view that DNA breakage tends to occur within MaxCTs, or at least within MaxCTs with short flanking sequences included.
Maximal converted tracts were found to be enriched (p<0.01) in a truncated version of the χ-element, TGGTGG, previously noted to be over-represented in the vicinity of microdeletions, microinsertions and indels (Ball et al. 2005). Several novel motifs, including either a fragment or the entire sequence of a classical meiotic recombination hotspot, CCTCCCCT, were also found to be over-represented (p<0.01) within MaxCTs. We therefore propose that a) non-B DNA-mediated DSBs occurring within the narrow regions immediately flanking and including the MaxCTs may serve to promote gene conversion and b) a high local density of recombination-promoting motifs may act in combination with these DNA conformations to potentiate gene conversion. This reinforces the early, isolated and purely anecdotal observations of an association between Z-DNA-forming sequences and gene conversion (Kilpatrick et al. 1984; Wahls et al. 1990b) since it would appear that other types of non-B DNA conformations such as slipped structures and cruciform-forming repeats are also likely to be involved in promoting DSBs.
Although most of the biochemical steps underlying the mutational mechanism(s) responsible for gene conversion still remain to be elucidated, our present findings provide additional support to a recently proposed model (Wang and Vasquez 2006; Bacolla et al. 2006; Kurahashi et al. 2006; Wells 2007; Raghavan et al. 2007; Rooms et al. 2007) in which it is the resolution of non-B DNA conformations that induces chromosomal DSBs which can then give rise to gross genomic rearrangements via processes that involve DNA recombination-repair. Most notably, we show that recombination-associated motifs play an integral part in this non-B DNA structure-induced mutational process. Specifically, we postulate that the high density of recombination-related motifs serves as target binding sites for protein complexes, such as translin and RAG-associated proteins, or arrest sites for DNA polymerases, which may assist, induce or indeed be required for the recombination-repair process.
This work was partially supported by the INSERM (Institut National de la Santé et de la Recherche Médicale), France (to J.-M.C. and C.F.), BIOBASE GmbH (through financial support to D.N.C.) and by grants from the National Institutes of Health (NS37554 and ES11347) and the Robert A. Welch Foundation to R. D. W.