An understanding of the contextual patterns of nucleotide substitutions in the vertebrate genome is important for several reasons. The spectrum of mutational events reflects how the genome has been shaped during evolution and the mechanism of substitution mutagenesis, and it can also shed light on fundamental cellular processes such as genome stability, DNA replication and repair.
In this study, we have inferred a large set of point mutations originating within segmental duplications in the human genome. These point mutations were compared with a genome-wide collection of high-quality SNPs to assess whether the two datasets of mutational events show similar patterns in terms of distribution and surrounding sequence context. We initially recognized that regions of the genome covered by segmental duplications had a higher GC content than the average for non-duplicated regions. A previous study also reported a positive correlation between GC content and segmental duplications [34]. However, the biological interpretation of the strong association between GC content and segmental duplications is not obvious. Part of it may be attributed to the increased gene density in duplications [41], as regions containing genes are known to be GC-rich. Biased gene conversion, a process in which repair of mismatches in heteroduplex recombination intermediates favours the fixation of G and C alleles, may also play a role [42]. In addition, duplications are particularly enriched in subtelomeric regions of the chromosomes, which are directly linked with GC-rich isochores [44].
The distribution of nucleotide substitutions observed in segmental duplications displays a pattern that is, in general, similar to that of SNPs. Both sets of mutations display an excess of transitional substitutions, a common phenomenon in vertebrate genomes. Among the four different transversions, the greatest difference between SNPs and DIMs was found for C/G substitutions. This finding suggests a potential association between the nucleotide composition of duplications and the frequency of substitutions, given the high GC content found in segmental duplications. Moreover, we observed a notable difference in the overall ratio of transitions to transversions between DIMs and SNPs, and an increased ratio in recently occurring DIMs. These results may reflect the evolutionary time window in which the two sets of substitutions were sampled, as well as differences in nucleotide composition between duplicated and non-duplicated DNA. Substitutions within recent segmental duplications comprise mutational events potentially originating from 35–40 million years ago (≥ 90% sequence identity) up until today (100% sequence identity), and will thus include a substitution spectrum that extends beyond the human lineage. SNPs, on the other hand, should represent point mutational events within the human lineage only, as they represent genetic variation between humans. If one assumes that the rates of transversions and transitions vary over time [45], one would therefore expect to see stronger long-term effects within the DIM dataset than in the SNP set. Previous studies have shown that the rate of 5mC deamination is limited by local GC content [46]. Thus, the GC richness of segmental duplications may be partly responsible for the fewer observed transitions relative to transversions.
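The transition-to-transversion ratio discussed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names `substitution_class` and `ts_tv_ratio` are hypothetical, and substitutions are assumed to be given as undirected pairs of bases, since the direction of a DIM is not inferred.

```python
from collections import Counter

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def substitution_class(base_a, base_b):
    """Classify an undirected base pair as 'transition' or 'transversion'.

    Hypothetical helper: a transition stays within purines (A/G) or within
    pyrimidines (C/T); every other pair of distinct bases is a transversion.
    """
    pair = {base_a.upper(), base_b.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a substitution between two bases: {base_a}/{base_b}")
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def ts_tv_ratio(pairs):
    """Return the ratio of transitions to transversions for (base, base) pairs."""
    counts = Counter(substitution_class(a, b) for a, b in pairs)
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed; the ratio is undefined")
    return counts["transition"] / counts["transversion"]
```

Computing the ratio separately for subsets of substitutions, for example binned by the sequence identity of the duplication they come from, would be one way to look at the trend towards recent DIMs described above.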
The majority of DNA oligomers at DIM and SNP sites displayed similar levels of abundance. This observation implies, in essence, that the majority of SNPs and DIMs appear to be generated by similar mutational mechanisms. We confirmed the latter in oligomers drawn from reference regions, that is, intergenic regions of segmental duplications and intergenic, non-duplicated regions. However, we also discovered that many oligomers that contain substitutions at the CpG dinucleotide are overrepresented at SNPs while underrepresented at DIMs. In the reference regions, these oligomers were less underrepresented in duplications than in non-duplicated regions. As mentioned above, different effects at the CpG dinucleotide may be caused by differences in GC content, which in turn lead to different 5mC deamination rates. Furthermore, when looking at the total mutational spectrum at the CpG dinucleotide, we observed that the frequency of methylation-related transitions differed significantly between CpG islands and non-island regions (Table ). Our results therefore imply that mutational events drawn from paralogous sequences exhibit the same suppression of methylation-dependent deamination in CpG islands as SNPs have been shown to do [15].
During large-scale computational identification of SNPs, many single nucleotide differences between genomic clones are taken as evidence of allelic variation and submitted to dbSNP. Without proper validation by other means, this form of SNP discovery will inevitably lead to spurious results caused by the duplication content of the human genome [26]. To address this issue, we systematically examined predicted SNP alleles in segmental duplications and mutations inferred from duplication alignments. Our approach revealed that nearly one out of five SNPs in duplications resembles paralogous sequence variation. Whether these SNPs behave like ordinary SNPs, MSVs or fixed PSVs is yet to be determined. Nonetheless, we suspect that traditional genotyping of the majority of these SNPs will produce misleading allele frequencies and genotype patterns, since they will receive additional signals from paralogous sequences. Further, we discovered that SNPs that mirror mutational events in duplications are most prominent in duplications of high (≈97–100%) sequence identity, an observation for which we have no obvious explanation at present. In a comparative analysis with a small set of previously experimentally verified PSVs, we found all designated paralogous sequence variants among our computationally inferred mutations. In addition, we observed an overlap between computationally inferred DIMs and sites that were determined to be MSVs. The type of polymorphism represented by MSVs involves variation in duplication copy number, and the overlap presumably indicates that much multisite variation may have originated from point mutational events in paralogous sequences.
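The comparison of SNP alleles with duplication-inferred mutations can be sketched as follows. This is an illustrative assumption about how such a comparison could be set up, not the procedure used in this study. The function name `snps_resembling_psvs` and the record layouts are hypothetical, and both bases of a DIM are assumed to be expressed on the forward strand at the site in question.

```python
def snps_resembling_psvs(snps, dim_sites):
    """Return SNPs whose two alleles equal the two paralogous bases of a DIM.

    snps:      iterable of (chrom, pos, allele_1, allele_2)
    dim_sites: iterable of (chrom, pos, base_here, base_in_paralog), with both
               bases given on the forward strand at (chrom, pos) (assumption)
    """
    # Map each duplicated site to the unordered base pairs observed there.
    dim_pairs = {}
    for chrom, pos, base_here, base_in_paralog in dim_sites:
        pair = frozenset((base_here.upper(), base_in_paralog.upper()))
        dim_pairs.setdefault((chrom, pos), set()).add(pair)

    hits = []
    for chrom, pos, allele_1, allele_2 in snps:
        alleles = frozenset((allele_1.upper(), allele_2.upper()))
        if alleles in dim_pairs.get((chrom, pos), set()):
            hits.append((chrom, pos, allele_1, allele_2))
    return hits
```

Dividing the number of hits by the number of SNPs located in duplications gives the kind of fraction reported above. A SNP flagged this way is only a candidate: the sketch cannot tell an ordinary SNP from an MSV or a fixed PSV.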
Our approach does have some inherent limitations that could affect the reliability of the results obtained. These limitations involve the data source (i.e. detection of segmental duplications and reliability of DNA sequence alignments), the approach for inference of mutational events, and the sample effect. With respect to the source of segmental duplications, we relied on data provided by HGSDB [36]. The detection scheme employed by HGSDB uses BLAST for pairwise comparisons of all assembled chromosomes. Detected duplications will thus depend on the overall quality of the genome assembly, and inferred mutations will rely on correctly determined consensus sequences in the assembly. We reduced some potential assembly (and sequencing) errors by excluding high-copy repeats from the analysis, as assembly programs may fail to distinguish single base differences between repeat copies from erroneous base calls [49]. Since the sequence divergence between duplications in HGSDB is always less than 10%, the resulting alignments are highly significant. Also, we placed restrictions on the alignment window around candidate DIMs to exclude potential alignment artefacts. Altering the alignment restrictions for DIM calling in two other DIM sets did not change the distribution of DIM substitutions to a large extent. In error-prone DNA sequencing contexts we observed a small increase of DIMs relative to high-quality SNPs, suggesting a minor impact of random noise in the DIM set. Altogether, we believe that the sequence alignments did not cause any serious errors.
Computational inference of mutational events leading to DIMs also has limitations. First of all, the directionality of the mutations was not inferred with our approach, i.e. an A→T mutation could not be distinguished from a T→A mutation. Thus, an observed (C/T)G substitution may not necessarily reflect the deamination of a methylated cytosine, but may instead correspond to a thymine-to-cytosine transition. A recent study of the directionality of SNPs indicated that most substitution types in intergenic regions occur at roughly the same frequency in either direction [11]. Whether DIMs display the same characteristics is unknown. Secondly, when the same mutational events were found propagated in several duplications (Figure ), we excluded them as individual events under the assumption of no multiple substitutions at a single site. This assumption is not likely to be violated in DNA sequences that show as low a degree of sequence divergence as recent segmental duplications.
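The directionality limitation can be made concrete with a short Python sketch. It is illustrative only, and the names `undirected_class` and `is_cpg_transition_site` are hypothetical. Because the ancestral base is unknown, C→T and T→C collapse into the same undirected class, so a site can only be flagged as compatible with a CpG transition, not as a confirmed deamination event.

```python
def undirected_class(base_a, base_b):
    """Return a direction-free label such as 'C/T' for a pair of bases."""
    return "/".join(sorted((base_a.upper(), base_b.upper())))


def is_cpg_transition_site(left_flank, pair, right_flank):
    """Check whether an undirected substitution is compatible with a CpG transition.

    pair is a label from undirected_class. On the forward strand the pattern
    is (C/T)G. A deamination on the reverse strand appears on the forward
    strand as C(A/G).
    """
    forward = pair == "C/T" and right_flank.upper() == "G"
    reverse = pair == "A/G" and left_flank.upper() == "C"
    return forward or reverse
```

Both checks are symmetric in the two bases of the pair, which is the point made above: a (C/T)G site returns the same answer whether the underlying event was C→T or T→C.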
The sample of inferred DIMs was, as mentioned above, retrieved from all human chromosomes in regions where duplications have been found to exist. The total number of DIMs sampled was so large (≈344,000) that we believe they can provide a general pattern of substitutions in segmental duplications. In contrast to unique DNA sequences, duplicated sequences frequently undergo homology-driven mutation when involved in either non-allelic homologous recombination or gene conversion [28]. In the latter process, DNA repair of nucleotide mismatches in heteroduplex DNA intermediates has been shown to be GC-biased, providing a direct link to the GC-richness of duplications [51]. Investigating the relationship between biased repair and the observed distribution of DIMs requires further work, considering that base mispairs are corrected with different efficiencies and specificities in mammals [52]. The inferred point mutational spectrum was restricted to intergenic regions, excluding all DIMs located within RefSeq transcripts. Among all DIMs inferred, we thus omitted about 31.5% from our analyzed sample, as they originated within UTRs, exons and introns residing in segmental duplications. As shown in early studies of molecular evolution, regions under functional constraints (i.e. human transcripts) show different patterns and rates of substitutions from selectively neutral sequences such as pseudogenes [53]. In order to establish a neutral pattern of point mutations in segmental duplications, with minimal confounding effects of natural selection, we excluded any mutational event in which either of the nucleotides was found inside a RefSeq transcript. Since the point mutational spectrum in coding regions of segmental duplications may display characteristics different from those we found in intergenic regions, we suggest that these nucleotide substitutions should be explored in further work.
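The exclusion rule described above, dropping a DIM when either of its two sites lies inside a transcript, can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used here: the function names are hypothetical, transcripts are assumed to be half-open intervals `(chrom, start, end)`, and a DIM is assumed to be a pair of `(chrom, pos)` sites.

```python
from bisect import bisect_right


def build_transcript_index(transcripts):
    """Merge overlapping transcript intervals per chromosome for fast lookup."""
    by_chrom = {}
    for chrom, start, end in transcripts:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def in_transcript(index, chrom, pos):
    """Return True if pos falls inside a merged interval [start, end)."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < merged[i][1]


def intergenic_dims(dims, index):
    """Keep only DIMs for which neither of the two sites is in a transcript."""
    return [
        (site_a, site_b)
        for site_a, site_b in dims
        if not in_transcript(index, *site_a) and not in_transcript(index, *site_b)
    ]
```

Merging the intervals first means each lookup is a single binary search, which matters when several hundred thousand sites are tested.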
Most importantly, our computational analysis of segmental duplications in the human genome suggests that they can be utilized as a novel data source for the analysis of vertebrate point mutagenesis. Two observations support this claim. First, the distribution and context of computationally inferred DIMs and of a set of high-quality SNPs in intergenic regions of the genome were largely similar (Figures , and ). Second, we found that a large fraction of the inferred DIMs overlaps with verified SNPs, which provides evidence that our inference strategy is able to retrieve actual mutational events that lead to genetic variation. Moreover, our inferred set of nucleotide substitutions originates from regions in all human chromosomes, as segmental duplications are not restricted to any particular chromosome but are distributed genome-wide. We believe that the inferred dataset of point mutations may be a valuable complement to SNPs for the analysis of human genetic variation.