Deciphering the mechanisms of mutagenesis is central to our understanding of evolution and critical for studies of human genetic diseases. The availability of a multitude of sequenced genomes and their alignments provides an opportunity to study mutations on a genome-wide scale in many species, including humans. There is now substantial evidence for within-genome variation in mutation rates; in particular, regional variation in nucleotide substitution rates, insertion and deletion (indel) rates, and microsatellite mutability has been documented across the human genome [1]. However, despite the attention it has received in the literature, the causative mechanisms underlying regional mutation rate variation remain elusive. Biochemical processes, including replication and recombination, have been suggested as potential contributors to mutation rate variation. For instance, replication likely determines the differences in nucleotide substitution rates among chromosomal types: nucleotide substitution rates are highest on chromosome Y, intermediate on autosomes, and lowest on chromosome X (for example, [10]), consistent with the relative number of germline cell divisions, and thus DNA replication rounds, for each of these chromosome types [12]. Local male recombination rate has been shown to be a significant determinant of regional nucleotide substitution rate variation [10], supporting the potential mutagenic nature of recombination and/or biased gene conversion [1]. Rates of small deletions have been found to be associated with replication-related genomic features, and rates of small insertions with recombination-related features [8]. Finally, the role of replication slippage in determining variation in mutability among microsatellite loci has recently been corroborated [9]. Other factors, for example the predominance of aberrant DNA repair mechanisms like non-homologous end-joining at subtelomeric regions [14] and as yet unexplored mutagenic mechanisms potentially acting at telomeres [10], might influence regional variation in mutation rates as well.
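The ordering of substitution rates among chromosome types (Y highest, autosomes intermediate, X lowest) can be made explicit with a short worked calculation. This is only a sketch of the standard male-driven mutation argument, assuming that the male germline undergoes more cell divisions per generation than the female germline and therefore has the higher mutation rate. The symbols are introduced here for illustration and do not come from the text above: \mu_m and \mu_f are the male and female germline rates, \alpha is their ratio, and \mu_Y, \mu_A and \mu_X are the resulting rates for the Y chromosome, autosomes and the X chromosome.

```latex
% Fraction of generations each chromosome type spends in the male germline:
%   Y chromosome: 1,  autosomes: 1/2,  X chromosome: 1/3
\begin{align*}
\mu_Y &= \mu_m, \\
\mu_A &= \tfrac{1}{2}\,\mu_m + \tfrac{1}{2}\,\mu_f, \\
\mu_X &= \tfrac{1}{3}\,\mu_m + \tfrac{2}{3}\,\mu_f, \\[4pt]
\frac{\mu_Y}{\mu_A} &= \frac{2\alpha}{1+\alpha}, \qquad
\frac{\mu_X}{\mu_A} = \frac{2\,(2+\alpha)}{3\,(1+\alpha)}, \qquad
\alpha = \frac{\mu_m}{\mu_f}.
\end{align*}
```

For any \alpha > 1 the first ratio exceeds 1 and the second falls below 1, which reproduces the Y > autosomes > X ordering. With a purely illustrative value of \alpha = 3, the ratios are 1.5 and about 0.83.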
Genome-wide information on three additional genomic features has recently become available. Nuclear lamina binding regions are thought to represent a repressive chromatin environment and are concentrated in the proximity of centromeres [15]; their impact on local mutation rates has not been investigated to date. An abundance of methylated sites at non-CpG DNA locations in human embryonic stem cells was revealed by a recent study [16], suggesting alternative roles for DNA methylation in CpG and non-CpG contexts. Although the function of methylation in generating mutations at CpG locations has been extensively researched [2], no study to date has looked at the potential impact of the non-CpG methylome on the genome and its mutagenesis; in particular, methylated non-CpG cytosines may also elevate mutation rates. Finally, recent predictions of the density of nucleosome-free regions based on MNase digestion [17] can be used to understand the influence of local chromatin structure on mutation rates. Assessing the contribution of these three novel genomic features to mutation rate variation is of obvious and immediate interest.
In addition to varying regionally, rates of different mutations frequently co-vary with each other. Co-variation was observed between rates of nucleotide substitutions (estimated at ancestral repeats and four-fold degenerate sites), large deletions, and insertions of transposable elements [2]. In a separate study, co-variation was observed between rates of nucleotide substitutions and both small insertions and small deletions [8]. What causes regional co-variation in the rates of different mutation types? While explanations based on selection have been considered [18], they are not satisfactory because mutation rates also co-vary in presumably neutrally evolving portions of the genome [2]. Shared local genomic landscapes might be responsible for the co-variation of these rates and, on a purely mechanistic basis, one mutation type might be physically associated with another (for example, indel-induced nucleotide substitutions) [19], causing the corresponding rates to co-vary. However, these hypotheses have never been extensively explored. Notably, while a number of studies have documented regional variation and co-variation of rates of mutations of several types, they have mostly relied on correlation and univariate regression analyses, which relate mutation rates to each other only in a pair-wise fashion and attempt to explain the variation of each rate (as a function of genomic features) one at a time [2]. A better understanding of the structure and causes of mutation rate co-variation, which is crucial for studies of mutagenesis, can be achieved only through more sophisticated data analysis approaches.
This is what we pursued in the current study, where we jointly investigated multiple mutation rates alongside several plausible explanatory genomic features, shedding light on the interplay between mutagenesis and the genomic landscape in which it occurs. In more detail, we used multivariate analysis techniques to characterize the co-variation structure of four rates (nucleotide substitutions, insertions, deletions, and microsatellite repeat number alterations) and to explore their joint relationship with several genomic landscape variables. First, we applied principal component analysis (PCA) to mutation rates computed along the genome. Next, we linked rates to genomic landscape variables using canonical correlation analysis (CCA). Finally, we applied non-linear versions of these multivariate techniques, kernel-PCA (kPCA) and kernel-CCA (kCCA), to investigate the presence of non-linear associations. We conducted our analyses on two mutually exclusive neutral subgenomes, one repetitive (ancestral repeats (ARs)) and one unique (non-coding non-repetitive (NCNR) sequences), and at three genomic scales (1-Mb, 0.5-Mb, and 0.1-Mb) using human-orangutan comparisons. We then repeated them for two additional phylogenetic distances, using human-macaque and mouse-rat comparisons, to understand if and how the structure of mutation rate co-variation and the contribution of various genomic features may differ among them.
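To make the first step concrete, the following minimal sketch shows what PCA on a windows-by-rates table does. It is not the implementation used in this study: the data are synthetic, all names (`RATE_NAMES`, `windows`, `first_principal_component`) are hypothetical, and power iteration on the correlation matrix stands in for a full eigendecomposition so that the example needs only the Python standard library. Three of the four simulated rates share a latent "landscape" effect and the fourth is independent, so the first component should load mainly on the first three.

```python
import math
import random

# Hypothetical labels for the four rates; the values below are simulated, not real data.
RATE_NAMES = ["substitution", "insertion", "deletion", "microsatellite"]


def standardize(rows):
    """Center each column and scale it to unit sample variance (z-scores)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized columns."""
    n, p = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_principal_component(corr, iterations=500):
    """Power iteration: return (unit-length loadings, share of variance explained)."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    # The trace of a correlation matrix equals the number of variables.
    return v, eigenvalue / p


# Simulate 500 genomic windows: three rates share one latent effect,
# the microsatellite rate is drawn independently (an assumption for illustration).
random.seed(0)
windows = []
for _ in range(500):
    shared = random.gauss(0, 1)
    windows.append([
        shared + random.gauss(0, 0.5),  # substitution
        shared + random.gauss(0, 0.7),  # insertion
        shared + random.gauss(0, 0.7),  # deletion
        random.gauss(0, 1),             # microsatellite
    ])

loadings, explained = first_principal_component(
    correlation_matrix(standardize(windows))
)
for name, loading in zip(RATE_NAMES, loadings):
    print(f"{name:15s} loading on PC1: {loading:+.2f}")
print(f"share of total variance on PC1: {explained:.2f}")
```

In practice one would more likely rely on an established linear algebra or statistics library. CCA extends the same idea by seeking paired linear combinations, one of the rates and one of the genomic features, that are maximally correlated, and the kernel variants replace the inner products with kernel evaluations.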
Importantly, we have made the suite of software tools implemented for this research publicly available, with the aim of improving reproducibility and facilitating future studies of mutation rates and other genome-wide data. We integrated our software into a modular tool set in Galaxy [23], a free and easy-to-use web-based genomics portal that has already established a substantial community of users.