We have scanned the whole human genome and identified the most extreme examples of recent, rapid molecular evolution. After careful screening to remove alignment and assembly errors, we found 202 significantly accelerated elements. In this work, we have extensively characterized the bioinformatic properties of the HARs, paying particular attention to the most extremely accelerated elements. The ranked list of HARs is a rich source of genomic regions for further study and functional characterization. Some of this work has been undertaken for HAR1 [7], showing that it is a small structural RNA expressed during development of the neocortex.
The 202 HARs resemble the full set of conserved regions. The majority are located in conserved non-coding regions. Many are found in the introns of, and adjacent to, genes annotated with GO terms related to transcription and DNA binding. These findings are in agreement with the hypothesis, first proposed by King and Wilson in 1975, that the majority of chimp-human phenotypic differences can be explained by differential control of transcriptional networks [8], which may be expected to occur primarily in the non-coding DNA.
Changes in the human lineage could represent either the loss of a functional element [53] or a change in its function. By comparing estimates of the human substitution rate (genome-wide and in the final bands of chromosomes), we found that all of the HARs have been evolving at faster than neutral rates. The two most accelerated regions, named HAR1 and HAR2, have exceedingly high substitution rates in the human lineage, implying an approximately 4-fold increase in selective coefficient if positive selection were the only explanation for the acceleration (Text S1). However, detailed examination of these data indicates that forces other than selection for random mutations that increase fitness in specific functional elements may be at play in the most rapidly evolving regions. Careful analysis is needed to tease apart these disparate forces. We observed a strong correlation between acceleration and bias toward AT→GC nucleotide pair changes in regions of size from 100–1,000 bp. This bias occurs equally in intronic, intergenic, and coding elements. Acceleration and bias are more frequent in regions in the final band of their chromosome arm. Interestingly, the orthologous regions of HAR1 and HAR5 are also in final bands in other mammals. Since the sequence of these elements is highly conserved across the vertebrates, they appear to have been very stable for an extended evolutionary period despite their location near chromosome ends, before being radically reworked during the last ~5 million y of human evolution. The general association between increased divergence rates and location near chromosome ends is consistent with a recent whole-genome comparison of chimp and human [1] that found increased divergence (15% greater than the rest of the chromosome on average) in the terminal 10 Mb of each chromosome. Our results go further, indicating that regions at the ends of chromosome arms are not uniformly or constantly changing more rapidly than other regions, but rather, acceleration can be a sudden, extreme and uneven process, with clusters of rapid, biased changes occurring in local W→S regions of ~1 kb, even in elements that are otherwise usually highly resistant to change.
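The W→S (weak-to-strong, i.e., AT→GC) bias discussed above can be made concrete with a small tally. The following is a minimal sketch, not the procedure used in this study: it assumes that an inferred ancestral sequence and the derived (human) sequence are already aligned to equal length, and the function names and the example sequences are hypothetical.

```python
WEAK = {"A", "T"}      # bases forming weak (two hydrogen bond) pairs
STRONG = {"G", "C"}    # bases forming strong (three hydrogen bond) pairs
NUCLEOTIDES = WEAK | STRONG


def count_biased_substitutions(ancestral, derived):
    """Tally W->S, S->W, and other differences between two aligned sequences."""
    if len(ancestral) != len(derived):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"W->S": 0, "S->W": 0, "other": 0}
    for a, d in zip(ancestral.upper(), derived.upper()):
        # Skip unchanged sites, gaps, and ambiguous bases such as N.
        if a == d or a not in NUCLEOTIDES or d not in NUCLEOTIDES:
            continue
        if a in WEAK and d in STRONG:
            counts["W->S"] += 1
        elif a in STRONG and d in WEAK:
            counts["S->W"] += 1
        else:
            counts["other"] += 1  # W->W or S->S, e.g. A<->T or G<->C
    return counts


def weak_to_strong_fraction(counts):
    """Fraction of all tallied substitutions that are W->S."""
    total = sum(counts.values())
    return counts["W->S"] / total if total else float("nan")


# Invented toy alignment, for illustration only.
example_counts = count_biased_substitutions("ATATGC-A", "GCATGA-T")
example_fraction = weak_to_strong_fraction(example_counts)
```

In practice the ancestral state would have to be inferred from outgroup species, and a fraction of this kind would be compared against the genome-wide expectation rather than interpreted on its own.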
BGC is one possible factor in this process. There is more recombination at the distal ends of chromosome arms, and the location of recombination hotspots is known to change rapidly during evolution. In particular, it differs widely between human and chimp [48]. Hence, we do not necessarily expect there to be an association between HARs and current recombination rates. Nonetheless, we do find more HARs than expected based on genome-wide data in regions with elevated recombination rates. Recombination can also be mutagenic [18]. Recombination hotspots appearing some time in the last 5–6 million y could thus provide a mechanism for both the biased fixation of G and C nucleotides in the pre-human population and the polymorphic sites needed to start this process. In particular, the error-prone repair of recombination-associated double-stranded breaks in the DNA could produce clusters of mutations over a relatively short period of evolutionary time, either together during a single recombination event or as independent mutations. BGC could then drive the rapid fixation of the derived GC alleles in the population. Note also that there is a marked increase in the number of segmental duplications and rearrangements created by non-homologous end-joining and interlocus gene conversion in human subtelomeric regions [56]. This also implies an increased number of double-stranded breaks, which in combination with BGC could have contributed to the effects we see. A similar hypothesis was recently put forth by Spencer et al. [57] to explain a fine-scale (2–4 kb) association between recombination and diversity observed on human Chromosome 20.
Increased positive selection in these regions is an alternative explanation if, rather than (or in addition to) selection for random fitness-increasing changes in specific functional elements, there is selection for increased G + C content in larger isochores, as proposed by Bernardi and colleagues [19]. In this theory, neutral and weakly deleterious changes drive a large region (>100 kb) to a critical point, below which the G + C content cannot fall without significantly deleterious effect. At that point, W→S substitutions in the region suddenly gain a selective advantage, and may sweep through the population. The effects of the sweep on polymorphism and divergence would be similar to those that result from selection for specific, non-isochore-related advantageous alleles in genes. With the data at hand it would be difficult to distinguish this from selection for specific changes in functional elements. However, we may still hope to distinguish selection in general from BGC.
BGC mimics selection in many ways [58], so that most tests cannot distinguish them. However, the size of a gene conversion event (i.e., tract length during DNA repair) is thought to be geometrically distributed with a mean of several hundred bp in humans [59], whereas the domain of selection in a sweep can be tremendous [43]. Under the selected-isochore model, selective constraints are shared over larger regions (hundreds of kb). Thus, we do expect quite a different sweep signature for selection versus BGC. The regions around HAR1 and HAR2 that have significantly reduced polymorphism relative to divergence are ~5 kb, which is more consistent with selection than with BGC. This does not rule out the possibility that large transient mutational hotspots created short-lived increases in mutation rate in these regions, increasing divergence without affecting current levels of polymorphism and thereby simulating a selective sweep [18], or that there was an unusually extended BGC event. However, it does at least suggest that selective forces were at work in driving the changes in these regions, albeit not on the scale of hundreds of kb. In cases like HAR1, where the DNA that exhibits the W→S substitution bias is transcribed, another possibility is selection for increased gene expression [21].
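The contrast between a ~5-kb footprint and gene conversion tracts averaging several hundred bp can be illustrated with the tail of a geometric distribution. This is a rough sketch under stated assumptions: the mean tract length of 500 bp is a hypothetical stand-in for "several hundred bp", and the calculation treats the footprint as if it had to be spanned by a single conversion tract, which is a simplification.

```python
def prob_tract_longer_than(length_bp, mean_tract_bp):
    """P(L > length_bp) for L geometric on {1, 2, ...} with mean mean_tract_bp."""
    p = 1.0 / mean_tract_bp          # per-base probability that the tract ends
    return (1.0 - p) ** length_bp    # tract survives length_bp consecutive bases


ASSUMED_MEAN_TRACT_BP = 500  # hypothetical value, not taken from the text
FOOTPRINT_BP = 5000          # approximate footprint reported around HAR1 and HAR2

tail = prob_tract_longer_than(FOOTPRINT_BP, ASSUMED_MEAN_TRACT_BP)
print(f"P(single tract > {FOOTPRINT_BP} bp) = {tail:.2e}")
```

Under these assumptions the tail probability is on the order of 10^-5, which is one way to see why a single tract of that length would be an unusually extended event. The calculation says nothing about repeated conversion events at a persistent hotspot, which could produce a wider footprint.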
Although we found a reduced ratio of polymorphism to divergence suggestive of positive selection around HAR1 and HAR2, directed resequencing of 6.5 kb around HAR1 produced a folded site-frequency spectrum that is consistent with the neutral model [7] and does not suggest a recent selective sweep. It is important to note that these two analyses of selection at HAR1 use different data (1 Mb of publicly available SNPs here versus 6.5 kb of resequencing in single populations in [7]) and hence different methods. The HKA and coalescent-based tests that we performed with publicly available SNPs were not feasible with the resequencing data, which lack a suitable control region sequenced in the same populations. Hence, allele frequencies in the observed resequencing data were compared to theoretical expectations under the neutral model. In contrast, we perform a more nonparametric, empirical analysis here, in which each focal locus (centered on a HAR element) is compared directly to the surrounding genomic environment. In addition, the use of divergence data as a benchmark for levels of polymorphism may improve our ability to detect a sweep when both diversity and skew in allele frequencies have mostly recovered (after ~Ne generations, where Ne is the effective population size). One interpretation of these results is that selection most likely occurred, but that it appears to have acted long enough ago (>250,000 y) or been weak enough (as suggested by the ~5-kb footprint) that it could not be detected in the site-frequency spectrum observed in the resequencing data analysis. The presence of compensatory substitutions in the RNA structure of HAR1 [7] supports this hypothesis. Unfortunately, however, our ability to confidently reject the neutral model in the HARs is reduced by the likely ascertainment biases in the publicly available data used here.
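The idea of comparing a focal locus directly to its surrounding genomic environment can be sketched as an empirical rank of polymorphism-to-divergence ratios. This is a deliberately simplified stand-in, not the HKA or coalescent-based tests referred to above: the window counts are invented for illustration, and the function names are hypothetical.

```python
def polymorphism_to_divergence(snps, fixed_differences):
    """Ratio of polymorphic sites to fixed differences in one window."""
    return snps / fixed_differences if fixed_differences else float("nan")


def empirical_lower_tail(focal_ratio, background_ratios):
    """Fraction of background windows with a ratio at or below the focal ratio."""
    usable = [r for r in background_ratios if r == r]  # drop NaN windows
    if not usable:
        raise ValueError("no usable background windows")
    return sum(r <= focal_ratio for r in usable) / len(usable)


# Invented (SNP count, fixed difference count) pairs, for illustration only.
focal_window = (2, 20)
background_windows = [(10, 20), (8, 16), (6, 30), (12, 20), (1, 25)]

focal_ratio = polymorphism_to_divergence(*focal_window)
background_ratios = [polymorphism_to_divergence(s, d) for s, d in background_windows]
rank = empirical_lower_tail(focal_ratio, background_ratios)
```

A small value of `rank` indicates that the focal window has unusually little polymorphism relative to divergence compared with its neighbors. A comparison of this kind inherits any ascertainment bias in the SNP data, since the focal and background counts come from the same source.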
Thus, while we can pinpoint the locations of the most rapidly accelerated elements in the human genome, we cannot determine the exact cause of this acceleration with present data. Since we searched the entire genome for the most extreme cases, there is the distinct possibility that changes in the regions we observe result from a combination of multiple evolutionary processes, perhaps including BGC and a selection-based process. In particular, the intensity of the increased selective coefficient in the most dramatically accelerated elements supports the hypothesis that multiple evolutionary forces have contributed to these fastest evolving elements in the human genome.