The extent to which adaptive evolution has shaped the recent evolutionary history of humans is much debated. While polymorphism at certain genes, such as beta-globin or Duffy, is known to be associated with functional variation of selective importance, the functional importance of most DNA variation or substitution since the human-chimpanzee split is unknown. However, adaptive evolution is also expected to leave its footprint in patterns of genetic variation. In particular, selective sweeps that accompany the fixation of adaptive mutations will eliminate nearby genetic variation [1]. In regions of high recombination, the footprint is expected to be smaller because recombination moves the beneficial mutation onto different genetic backgrounds, allowing linked diversity to persist. The observed positive correlation between recombination rate and genetic diversity [2] therefore suggests that many loci have been the target of recent adaptive evolution.
However, genetic diversity is influenced by many factors, not just adaptive evolution. The rate at which new mutations arise varies across the genome [5] and is influenced by base composition [6] (particularly the density of methylated CpG dinucleotides [7]), which in turn is correlated with the recombination rate [8]. Such indirect correlation may explain why the recombination rate also correlates with rates of substitution between human and chimpanzee [6] and between human and mouse [11]. Selection against deleterious mutations can also reduce genetic diversity indirectly through background selection [12], the effect of which is stronger in regions of low recombination. Gene density varies across the genome [13] and recombination hotspots typically occur outside genes [14]; therefore, direct selection against deleterious mutations in genes could also potentially lead to a correlation between diversity and recombination. There is also some evidence that recombination may itself be directly mutagenic [9].
There are two critical limitations in determining the nature of the association between recombination and diversity. First, previous analyses have relied on genetic maps estimated from pedigree studies [17], which typically have a resolution at the centiMorgan scale (approximately 1 to 2 Mb). However, recombination rates are known to vary at the kilobase scale, with much recombination occurring in short hotspots of 1 to 2 kb in length [18]. We would therefore expect direct (e.g., mutagenic) effects of recombination to be localised to recombination hotspots, yet this resolution is simply not available from existing genetic maps. The second major limitation is that different factors may have different (even conflicting) effects on diversity at different scales. For example, gene density could be positively correlated with mutation rate at broad scales because genes typically lie in GC-rich regions that have elevated mutation rates, yet at the very fine scale selective constraint will mean that genes themselves will tend to have lower diversity and divergence. Inference about the causal nature of the relationship between recombination and diversity requires analysis of large contiguous stretches of sequence from which it is possible to separate out the influence of different factors acting at different scales.
Here we introduce three innovations to analyse the relationship between recombination and diversity in humans. The first is the use of fine-scale genetic maps estimated from patterns of genetic variation, which provide a kilobase-scale resolution to the location of recombination hotspots [14]. The second is the analysis of a large contiguous region of the genome, Chromosome 20, which allows assessment of both the scale over which factors influence diversity and comparison of genic and nongenic regions [22]. Finally, we use discrete wavelet analysis [23] to assess scale-specific interactions between factors.
Informally, wavelet analysis transforms a sequence of observations (such as the GC content or recombination rate along a chromosome) into a series of coefficients that describe variation in the signal at successively broader scales. Under the simplest discrete wavelet decomposition, using the Haar wavelet function, a series of observations is essentially transformed into (1) a series of detail coefficients representing the difference between pairs of neighbouring observations and (2) a smoothed version of the original signal (note that it is conventional to rescale both the differenced and smoothed signals to preserve the variance across levels). Differencing and smoothing are repeated at successively broader scales, such that for a series of 2^n observations there are n iterations. If multiple signals have been measured, for example, base composition, gene content, recombination rate, etc., each signal can be transformed. Correlations between signals can subsequently be assessed through linear model analysis of the detail coefficients at each level [24]. Linear model analysis of the smoothed coefficients is equivalent to assessing correlations between factors measured in windows of increasing size.
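The differencing and smoothing steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used for the analyses: the function name `haar_dwt`, the sign convention for the differences, and the GC-content values are all assumptions made for the example. Dividing by the square root of 2 at each step is the rescaling that keeps the sum of squares constant across levels.

```python
import math

def haar_dwt(signal):
    """Orthonormal Haar decomposition of a signal of length 2**n.

    Returns (details, smooth): details[j] holds the detail coefficients
    at level j + 1 (finest scale first); smooth is the single smoothed
    coefficient left after n iterations.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    smooth = [float(v) for v in signal]
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        # Rescaled difference and rescaled sum of each neighbouring pair.
        details.append([(b - a) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, smooth[0]

# Hypothetical GC content in 8 = 2**3 adjacent windows (made-up values).
gc_content = [0.40, 0.42, 0.38, 0.44, 0.50, 0.52, 0.47, 0.49]
details, smooth = haar_dwt(gc_content)

# Sum of squares before and after the transform, for comparison.
energy_in = sum(v * v for v in gc_content)
energy_out = sum(d * d for level in details for d in level) + smooth ** 2
```

For the eight observations here there are three levels of detail coefficients, containing four, two, and one coefficient respectively, and `energy_in` and `energy_out` agree up to floating-point error.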
Although the transformed signal has no more or less information than the original, there are several benefits of analysing wavelet-transformed data in the analysis of genomic correlations. First, analysis of correlations at multiple scales removes the need to choose an arbitrary window size over which to search for correlations. Second, because of the way in which the transformation is constructed, the detail coefficients represent variation in the signal at a particular scale that cannot be attributed to variation at other scales (i.e., they are orthogonal to each other). Consequently, linear model analysis of the detail coefficients enables the detection of scale-specific correlations between factors. To give an illustration of why scale-specific effects can be important, note that different explanations for the link between recombination and diversity predict very different patterns with respect to the scale of the effect. If recombination is directly mutagenic we would expect to see a very local effect of recombination hotspots on diversity. In contrast, hitch-hiking explanations predict that the correlation will be over much broader scales. Finally, one useful way of thinking about linear model analysis of detail coefficients is that it measures how a change in one factor at a given scale influences change in another factor at the same scale. In effect, the analysis compares a series of paired observations and so implicitly controls for the background rate and autocorrelation of the signals. Consequently, linear model analysis of the detail coefficients is likely to be more robust to confounding factors that have not been measured. Of course, robustness may also be associated with reduced power relative to analysis of the smoothed coefficients.
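A small synthetic example may help to show how correlations between detail coefficients can differ in sign across scales. The sketch below is purely illustrative: the two signals are simulated (they share a broad-scale component but have opposite fine-scale components), the names `haar_details` and `pearson` are hypothetical helpers, and a plain Pearson correlation at each level stands in for the linear model analysis described in the text.

```python
import math
import random

def haar_details(signal):
    """Detail coefficients per level (finest first), orthonormal Haar."""
    smooth = [float(v) for v in signal]
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        details.append([(b - a) / math.sqrt(2) for a, b in pairs])
        smooth = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Simulated signals (assumption): 256 observations, with a broad component
# that is constant within blocks of 8 and a fine component that varies
# between neighbouring observations.
rng = random.Random(0)
n_obs, block = 256, 8
fine = [rng.uniform(-1, 1) for _ in range(n_obs)]
block_values = [rng.uniform(-5, 5) for _ in range(n_obs // block)]
broad = [block_values[i // block] for i in range(n_obs)]

x = [f + b for f, b in zip(fine, broad)]   # fine component enters positively
y = [-f + b for f, b in zip(fine, broad)]  # fine component enters negatively

dx, dy = haar_details(x), haar_details(y)
# One correlation per level, skipping levels with fewer than 4 coefficients.
scale_corr = [pearson(a, b) for a, b in zip(dx, dy) if len(a) >= 4]
raw_corr = pearson(x, y)
```

In this construction the correlation of the untransformed signals (`raw_corr`) is positive, because the broad component dominates, while the finest-scale entries of `scale_corr` are strongly negative. A single-scale summary would miss the fine-scale relationship entirely.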
To illustrate these points, consider the relationship between gene content and divergence. Panel A shows the original signals and their wavelet decompositions over a 2-Mb region of the short arm (here a continuous wavelet decomposition is used merely for visual clarity; all analyses are carried out on discrete wavelet transformations). There is clearly both fine-scale and broad-scale variation in both signals. Correlation of the signals smoothed over successively broader scales over the long arm of Chromosome 20 (panel B) shows that gene content and diversity are positively correlated when calculated in windows of 1 to 16 Mb but negatively correlated if calculated in smaller windows. Indeed, if the signals are computed in windows of 1 Mb there is no apparent correlation. Analysis of the detail coefficients explains this unusual behaviour. Over fine scales the detail coefficients show negative correlation, while at broad scales there are weak, but positive correlations. The correlation between the smoothed coefficients at any scale can be decomposed into a weighted sum of the correlations between the detail coefficients at broader scales (see Text S2) [23]. Consequently, the detail coefficient correlations predict the behaviour of the smoothed coefficient correlations but critically also enable the separation of factors acting at different scales.
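For the Haar case, the link between smoothed and detail coefficients can be written out explicitly. The following is a sketch in notation assumed here for illustration (it is not reproduced from Text S2): x and y are two signals of length 2^n, and s and d denote their orthonormal Haar smoothed and detail coefficients at level j, with level 0 being the original observations.

```latex
% Assumed notation: s^{x}_{j,k}, d^{x}_{j,k} are the smoothed and detail
% coefficients of signal x at level j, position k (k = 0, 1, ...).
% One step of the transform:
\[
  s_{j+1,k} = \frac{s_{j,2k} + s_{j,2k+1}}{\sqrt{2}}, \qquad
  d_{j+1,k} = \frac{s_{j,2k+1} - s_{j,2k}}{\sqrt{2}}
\]
% Each step preserves cross-products between the two signals:
\[
  s^{x}_{j,2k}\, s^{y}_{j,2k} + s^{x}_{j,2k+1}\, s^{y}_{j,2k+1}
  = s^{x}_{j+1,k}\, s^{y}_{j+1,k} + d^{x}_{j+1,k}\, d^{y}_{j+1,k}
\]
% Iterating up to level n, and noting that the final smoothed coefficients
% carry only the means, gives the sample covariance (normalised by the
% number of coefficients, 2^{n-j}) of the level-j smoothed coefficients:
\[
  \widehat{\mathrm{Cov}}\bigl(s^{x}_{j}, s^{y}_{j}\bigr)
  = \frac{1}{2^{\,n-j}} \sum_{i=j+1}^{n} \sum_{k} d^{x}_{i,k}\, d^{y}_{i,k}
\]
```

Under these assumptions the covariance between the smoothed signals at level j is built entirely from cross-products of detail coefficients at the broader levels j + 1 to n. Dividing by the corresponding standard deviations then expresses the smoothed correlation as a weighted combination of the per-level detail relationships, which is the sense in which fine-scale negative and broad-scale positive contributions can cancel at intermediate window sizes.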
Wavelet Transformation of Genome Annotations
We have used wavelet analysis to assess the influences on genetic diversity along human Chromosome 20, chosen for its high degree of functional annotation [25] and availability of high-density single nucleotide polymorphism (SNP) genotype data. By combining information on patterns of diversity and divergence with information on recombination rate, base composition, and functional annotation, we show that previously reported broad-scale correlations between recombination and diversity are likely to result from indirect correlation of the neutral mutation rate with other features of genome organisation, particularly base composition. However, we also show a direct and local effect of recombination hotspots on local patterns of diversity and allele frequency, suggestive of a role for base composition biases in heteroduplex mismatch repair or double-strand break (DSB) formation. Finally, while we demonstrate highly local correlations between recombination hotspots, diversity, and GC content, we find no local correlation between recombination and divergence. These results are consistent with recent observations that while the fine-scale structure of recombination appears to evolve rapidly [26], rates over broader scales may be constrained [14