We began by developing a segmental duplication map for each of the four primate genomes (macaque, orangutan, chimpanzee and human) (Fig. S1). The approach is based on the alignment of whole-genome shotgun (WGS) sequence data against the human reference genome and predicts high-identity segmental duplications (SDs) based on excess depth of coverage and sequence divergence [11] (Methods). Previous analyses have suggested excellent sensitivity and specificity for computational detection of duplications larger than 20 kbp in length [11] (Table S1 and Supplementary Note Table 2). By this criterion, we characterized 73 Mbp corresponding to the duplications identified in at least one of the four primate species, correcting for copy number in each primate (Methods). We furthermore characterized each duplication as “lineage-specific” or “shared”, depending on whether it was seen in only one or multiple genomes. This comparative map (Fig. S3) is available as an interactive UCSC mirror browser, http://humanparalogy.gs.washington.edu, allowing researchers for the first time to interrogate the evolutionary history of any duplicated region of interest.
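The depth-of-coverage idea can be illustrated with a minimal sketch. This is not the authors' pipeline: the window size, the mean-plus-three-standard-deviations threshold, and the function name `call_duplications` are all assumptions made for illustration. Only the 20 kbp minimum length comes from the text. The sketch takes per-window read depths, flags windows whose depth exceeds a threshold derived from unique (control) sequence, and keeps runs of flagged windows that span at least 20 kbp.

```python
from statistics import mean, stdev


def call_duplications(depths, control_depths, window_bp=5_000,
                      n_sd=3.0, min_len_bp=20_000):
    """Return (start_bp, end_bp) intervals with excess read depth.

    depths          per-window read depth along one chromosome
    control_depths  read depths from windows assumed to be single copy
    window_bp, n_sd are illustrative assumptions, not values from the paper
    """
    # Threshold derived from the depth distribution of unique sequence.
    threshold = mean(control_depths) + n_sd * stdev(control_depths)

    runs = []          # runs of consecutive windows above threshold
    run_start = None
    for i, depth in enumerate(depths):
        if depth > threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i))
            run_start = None
    if run_start is not None:
        runs.append((run_start, len(depths)))

    # Keep only runs that meet the minimum length criterion.
    return [(s * window_bp, e * window_bp)
            for s, e in runs
            if (e - s) * window_bp >= min_len_bp]


control = [28, 30, 32, 30, 29, 31]
depths = [30, 31, 60, 62, 58, 61, 29, 30, 70, 30]
calls = call_duplications(depths, control)
print(calls)
```

In this toy input, four consecutive high-depth windows form one 20 kbp call, while the isolated high-depth window is discarded because it is shorter than the length cutoff.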
Classes of primate segmental duplication
We validated our primate genomic duplication map using two different experimental approaches and, wherever possible, using DNA from the same individuals from which the computational predictions were generated. Using fluorescence in situ hybridization (FISH), we found that 86.5% of SDs were concordant with computational predictions when categorized as either lineage-specific (50/58) or shared duplications (40/46) (Fig. S1) (see below, and Fig. S2 and Tables S2, S3 and S4). As a second approach, we designed a specialized oligonucleotide microarray (1 probe/585 bp) targeted to primate SDs and performed array comparative genomic hybridization (arrayCGH) between species (S2). Among the great-ape genomes, we confirmed 89-99% of the lineage-specific duplications by interspecific arrayCGH, with a very good correlation between computationally predicted and experimentally validated copy-number differences. Since only 45% of macaque-specific duplications could be confirmed by interspecific arrayCGH, we performed an independent assessment of the macaque genome assembly and conservatively validated ~85% of macaque-specific duplications [9,12].
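The 86.5% FISH concordance follows from pooling the two categories reported above. The short sketch below reproduces that arithmetic. The function name `pooled_concordance` is hypothetical, and the counts are taken directly from the text.

```python
def pooled_concordance(counts):
    """Pool (concordant, tested) pairs across categories."""
    concordant = sum(c for c, _ in counts.values())
    tested = sum(t for _, t in counts.values())
    return concordant, tested, 100.0 * concordant / tested


# FISH results as reported in the text.
fish = {"lineage-specific": (50, 58), "shared": (40, 46)}
concordant, tested, pct = pooled_concordance(fish)
print(f"{concordant}/{tested} = {pct:.1f}%")  # prints "90/104 = 86.5%"
```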
Experimental validation of duplication map
The comparative duplication map reveals several important features of primate SDs. As expected, most (80% or ~55 Mbp) high-identity human segmental duplications arose after the divergence of the Old World monkey and hominoid lineages. Humans and chimpanzees show significantly more duplications than either macaque or orangutan, with a large fraction being shared between chimpanzee and human. Based on our four-way primate genome analysis and leveraging arrayCGH data from gorilla and bonobo, we classify only ~10 Mbp of duplication content as human-specific (210 duplication intervals with an average length of 53.1 kbp). The genomic distribution of great-ape segmental duplications is highly nonrandom (Fig. S5), with the presence of ancestral duplications being a strong predictor of “new”, lineage-specific events (P-value<0.001, randomization test, Supplementary Note, Table S5a,b). For example, 45% of human-chimpanzee shared duplications map within 5 kbp of SDs shared among human-chimpanzee-orangutan, while 31% of human-chimpanzee-orangutan duplications map adjacent to human-chimpanzee-orangutan-macaque duplications. These observations emphasize that unique sequences flanking more ancient duplications have a much higher probability of segmental duplication [11,13] and that the duplication process itself is not random.
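A randomization test of this kind can be sketched as follows. The details of the authors' test are in the Supplementary Note and are not reproduced here. This version makes simplifying assumptions: a single linear genome, new duplications represented as points, uniform random placement as the null model, and invented toy coordinates. The function names are hypothetical. Only the 5 kbp flank comes from the text.

```python
import random


def count_near(positions, ancestral, flank_bp):
    """Count positions lying within flank_bp of any ancestral interval."""
    return sum(
        any(start - flank_bp <= p <= end + flank_bp for start, end in ancestral)
        for p in positions
    )


def randomization_p_value(new_positions, ancestral, genome_len,
                          flank_bp=5_000, n_perm=1_000, seed=0):
    """One-sided empirical P-value for clustering near ancestral SDs."""
    rng = random.Random(seed)
    observed = count_near(new_positions, ancestral, flank_bp)
    as_extreme = 0
    for _ in range(n_perm):
        # Null model (an assumption): new events land uniformly at random.
        shuffled = [rng.randrange(genome_len) for _ in new_positions]
        if count_near(shuffled, ancestral, flank_bp) >= observed:
            as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (as_extreme + 1) / (n_perm + 1)


# Toy data (invented for illustration only).
ancestral = [(1_000_000, 1_050_000), (5_000_000, 5_020_000)]
new_events = [1_052_000, 1_054_000, 4_997_000, 5_021_000, 999_000, 8_000_000]
observed, p_value = randomization_p_value(new_events, ancestral, 10_000_000)
print(observed, p_value)
```

With 1,000 permutations the smallest attainable P-value is about 0.001, so the number of permutations sets the resolution of the test.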
Shared vs. lineage-specific duplications and great-ape polymorphism
Within the human-specific set of duplications, we identify 39 partial and 17 complete human genes (Table S7). As expected, we find that full-length hominid genes show greater evidence of positive selection when compared to similarly analyzed unique genes (Supplementary Note). Our analysis indicates that several genes associated with human adaptation (amylase (AMY1), aquaporin 7) are shared with chimpanzee, but humans show a general increase in copy number. Gene models associated with signal transduction, neuronal activities (e.g. neurotransmitter release, synaptic transmission) and muscle contraction are significantly enriched in human, chimpanzee and orangutan lineage-specific duplications (Table S7). In contrast, duplications shared between human and great apes, or shared with macaque, are enriched for biological processes associated with amino acid metabolism (P-value=1.69e-2; great-ape shared SDs) or oncogenesis (P-value=5.80e-13; ape SDs shared with macaque). Although the number of such duplication events is small, these data suggest a shift in the types of genes that have been duplicated most recently during great-ape and human evolution.
There are two important caveats to the above analysis. First, we have analyzed a single individual in each case, and it is unclear to what extent that single genome represents the duplication pattern of the species. Second, duplicated sequences shared by two or more species may have been subject to recurrent mutation (homoplasy), leading to an overestimate of the proportion of ancestral duplications. Both copy-number polymorphism and evolutionary homoplasy will, in principle, complicate classification of segmental duplications as “ancestral” or “lineage-specific”. We therefore performed a number of additional analyses to address the impact of polymorphism and recurrent events on our assignments.
First, we investigated the extent of copy-number variation for both shared and lineage-specific duplications. Using arrayCGH targeted to primate SDs, we assessed the extent of copy-number variation in a set of unrelated DNA samples (Methods). As expected [14,15], lineage-specific SDs are highly copy-number variant, with humans showing 1.5- to 2-fold less diversity in copy number when compared to chimpanzees and orangutans (Supplementary Note Table S9). Surprisingly, we find that shared SDs are as copy-number variant as lineage-specific duplications and that humans show slightly greater copy-number variation for these (42% versus 34%) when compared to apes.
It is, however, important to distinguish between duplication copy-number variation and duplication status. A segmental duplication may show a high level of copy-number variation while its status as duplicated remains relatively constant among different individuals within a species. To address this, we performed a series of 3-way arrayCGH comparisons (Supplementary Note Fig. 7; Methods) in which we investigated how duplication status (human-specific, chimpanzee-specific and orangutan-specific SDs) varied as a function of copy-number polymorphism within a species. The results from these triangulations indicate that only 1-8% of the SDs change duplication status even though 18-32% of the duplications are copy-number polymorphic between two individuals within a species (Supplementary Note Fig. 8). As a second independent test, we compared the duplication maps of two human genomes (the Venter, or HuRef, and Watson genomes) [16,17] and found that 89% (595/666) of the regions are shared duplications between the HuRef and Watson genomes. Although we predict copy-number differences between these shared duplications, the boundaries of the duplication intervals remain remarkably consistent (Fig. S7), suggesting again that duplication status is a relatively constant character state within a species.
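The distinction between copy-number polymorphism and duplication status can be made concrete with a small sketch. The thresholds below (`dup_threshold` and `cnv_delta`) and the function name are hypothetical choices for illustration and are not taken from the Methods. A region counts as duplicated when its estimated diploid copy number exceeds the threshold, and as polymorphic when two individuals differ by at least the chosen delta.

```python
def compare_individuals(cn_a, cn_b, dup_threshold=2.5, cnv_delta=1.0):
    """Compare one region between two individuals of the same species.

    cn_a, cn_b     estimated diploid copy number in each individual
    dup_threshold  copy number above which the region counts as duplicated
                   (illustrative assumption)
    cnv_delta      minimum difference called copy-number polymorphic
                   (illustrative assumption)
    Returns (polymorphic, status_changed).
    """
    polymorphic = abs(cn_a - cn_b) >= cnv_delta
    status_changed = (cn_a > dup_threshold) != (cn_b > dup_threshold)
    return polymorphic, status_changed


# Duplicated in both individuals, but at different copy number:
print(compare_individuals(4, 6))    # (True, False)
# Duplicated in only one individual:
print(compare_individuals(2, 4))    # (True, True)
```

The first case is the common one described in the text: copy number varies, yet the region remains duplicated in both individuals.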
To assess the potential impact of recurrent mutations leading to misclassification of ancestral events, we focused on shared duplications between human and chimpanzee that were not identified as duplicated in either orangutan or macaque. We examined 103 sets of chimpanzee-human shared duplications that mapped to two or more distinct locations in the human genome (Supplementary Note) and determined what fraction of these mapped to two or more orthologous positions between chimpanzee and human. Using a paired end-sequence mapping approach [18,19] (Supplementary Note, Figure 9), we find that 85% (88/103) of the chimpanzee-human shared duplications have two or more copies mapping to the same orthologous position in the two genomes. This implies that the majority of shared duplications were already duplicated in the human-chimpanzee common ancestor (Supplementary Note Tables 6 and 7).
As part of our comparative analyses, we identified regions whose duplication patterns were inconsistent with the generally accepted human/great-ape phylogeny (Fig. S4, Table 2, S5 and S6). For example, we identified 43 intervals that are duplicated in human and gorilla but not chimpanzee (H+ duplications). Such a scenario may arise as a result of a deletion event in the chimpanzee lineage, incomplete lineage sorting or, less likely, recurrent duplication events in the human and gorilla lineages. Only the latter possibility would potentially lead to an overestimation of ancestral duplication events. We estimated the frequency of such events by mapping the location of the duplications in each species using paired end-sequence data [19] (see Supplementary Note). If the duplicated sequence mapped to the same location in gorilla and human, we classified it as a chimpanzee-specific deletion event or incomplete lineage sorting. If it mapped to different locations in the two genomes, we categorized it as a recurrent event. As expected, most of the informative H+ duplications (80% or 12/15) were the result of chimpanzee-specific deletions.
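The decision rule described above can be written as a short sketch. It assumes that each duplicate copy has already been placed on human reference coordinates by paired end-sequence mapping, and that two placements within a tolerance count as the same location. The tolerance value, the labels, and the function name are hypothetical.

```python
def classify_discordant(human_locus, gorilla_locus, tolerance_bp=50_000):
    """Classify a duplication seen in human and gorilla but not chimpanzee.

    Each locus is (chromosome, position) in human reference coordinates.
    tolerance_bp is an illustrative assumption, not a value from the paper.
    """
    (h_chrom, h_pos), (g_chrom, g_pos) = human_locus, gorilla_locus
    same_site = h_chrom == g_chrom and abs(h_pos - g_pos) <= tolerance_bp
    # Same site: chimpanzee-specific deletion or incomplete lineage sorting.
    # Different site: independent (recurrent) duplication events.
    return "deletion_or_ILS" if same_site else "recurrent"


print(classify_discordant(("chr7", 1_200_000), ("chr7", 1_210_000)))
print(classify_discordant(("chr7", 1_200_000), ("chr12", 9_000_000)))
```

Only the second outcome would inflate the estimate of ancestral duplications, which is why the text treats it separately.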
We investigated the most extreme example of recurrent African ape duplications in more detail. We identified a region (~150 kbp in length) mapping to human chromosome 10 that had expanded in the chimpanzee genome but was largely single copy in human and orangutan. It consists of two distinct duplication blocks (~86 and 66 kbp in length). Both arrayCGH and FISH confirm that the segments have been duplicated multiple times (~5-100 copies depending on the block and species) in the chimpanzee, bonobo and gorilla genomes but are single copy in all humans tested. Notably, the duplication boundaries (as delimited by arrayCGH) differ between the gorilla and chimpanzee lineages. With the exception of the chromosome 10 locus, we find that the map locations between gorilla and chimpanzee are non-orthologous (Supplementary Note and Methods), suggesting that this duplication expansion has occurred independently in both lineages.
Convergent gene duplication expansion in African great apes but not humans
Based on the large number of interstitial sites on gorilla chromosomes, we compared chromosome 1 from four unrelated gorillas for variation in copy number and location of this segmental duplication. Remarkably, we find that both copy number (10-14 copies per homologous chromosome) and map location for this segmental duplication vary among these eight gorilla homologues, with as many as 50% of the map locations being unoccupied by a duplication in another homologue (Supplementary Fig. 13). We conclude that this ancestral region of chromosome 10 has served as a preferred donor of chimpanzee/great-ape duplications and that the chimpanzee and gorilla genomes have been restructured by independent bursts of duplication activity. Interestingly, we detect and confirm by RT-PCR (reverse transcription PCR) at least one previously uncharacterized gene (14 exons, 141 kbp of genomic sequence, 1311 nt of CDS and 437 a.a.) mapping to duplication block 1, which shows significant similarity to endosomal glycoprotein genes (Supplementary Note, Fig. 14-17). Thus, these duplications, in principle, may have led to African ape gene family expansions while remaining conspicuously single copy in the human lineage. Although the mechanism by which such events have occurred is unclear, our data highlight the rapidity with which segmental duplications have restructured hominid genomes and emphasize their nonrandom nature both temporally and spatially.
Based on our genome-wide assessment of segmental duplications in each of four primate species and our estimate of 20% homoplasy (see above), we calculated rates of segmental duplication both in events [20] and basepairs along each lineage and ancestral node (Supplementary Note Tables 13-16). We developed a maximum likelihood model to test whether the rate of accumulation of segmental duplication has remained constant during the course of human/great-ape evolution. We compared the likelihood that the rate of segmental duplication has been uniform versus the likelihood of differential rates within specific lineages. We find a significant increase (Likelihood Ratio Test (LRT), P-value<1e-10) in both the number of events and basepairs in the human/African great-ape lineage when compared to the macaque/Old World monkey lineage. While terminal hominid lineages show an excess of duplications, the most significant burst of activity (4-10-fold, LRT P-value=1e-10) occurs in the common ancestor of human/chimpanzee and gorilla and after divergence of gorilla from the human-chimpanzee lineage (Supplementary Note Table 17). Our prediction is in strong agreement with the degree of sequence divergence among human intrachromosomal segmental duplications, which shows a mode at 97-99% sequence identity. We note that this burst of duplication activity corresponds to a time when other mutational processes, such as point substitutions and retrotransposon activity, were slowing along the hominoid lineage. This apparent burst of activity may result from changes in effective population size or generation time, or may imply genomic destabilization at a period prior to, and perhaps during, hominid speciation. In light of the importance of segmental duplications in contributing to copy-number changes associated with neurocognitive disease [21-24] and disease susceptibility [25-27], we predict that this apparent acceleration has had a profound impact on the reproductive success, adaptability and evolution of ancestral hominid populations.
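The logic of comparing a uniform-rate model against a differential-rate model can be illustrated with a minimal likelihood ratio test. The authors' maximum likelihood model is described in the Supplementary Note and is not reproduced here. This sketch assumes, purely for illustration, that duplication events on a branch follow a Poisson distribution with mean equal to rate times branch length, and it compares only two branches, so the test statistic has one degree of freedom. The event counts and branch lengths are invented toy values, and the function names are hypothetical.

```python
import math


def poisson_loglik(events, branch_len, rate):
    """Log-likelihood of observing `events` on a branch under a Poisson model."""
    mu = rate * branch_len
    return events * math.log(mu) - mu - math.lgamma(events + 1)


def rate_lrt(k1, t1, k2, t2):
    """LRT of one shared rate (null) versus two branch-specific rates.

    k1, k2  event counts on each branch (must be > 0)
    t1, t2  branch lengths, e.g. in millions of years
    Returns (test statistic, P-value from chi-square with 1 d.f.).
    """
    pooled = (k1 + k2) / (t1 + t2)                 # MLE under the null
    ll_null = poisson_loglik(k1, t1, pooled) + poisson_loglik(k2, t2, pooled)
    ll_alt = poisson_loglik(k1, t1, k1 / t1) + poisson_loglik(k2, t2, k2 / t2)
    stat = max(2.0 * (ll_alt - ll_null), 0.0)      # guard against rounding
    # Survival function of chi-square with 1 d.f.: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy values (invented): a fivefold rate difference over equal branch lengths.
stat, p_value = rate_lrt(100, 5.0, 20, 5.0)
print(stat, p_value)
```

Extending the comparison to more than two branches would increase the degrees of freedom, so the closed-form `erfc` shortcut used here would no longer apply.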
Rates of segmental duplication