Duplicated sequences are substrates for the emergence of new genes and are an important source of genetic instability associated with rare and common diseases. Analyses of primate genomes have shown an increase in the proportion of interspersed segmental duplications (SDs) within the genomes of humans and great apes. This contrasts with other mammalian genomes that seem to have their recently duplicated sequences organized in a tandem configuration. In this review, we focus on the mechanistic origin and impact of this difference with respect to evolution, genetic diversity and primate phenotype. Although many genomes will be sequenced in the future, resolution of this aspect of genomic architecture still requires high quality sequences and detailed analyses.
Over 40 years ago, Ohno and colleagues postulated the importance of duplications in the evolution of new gene functions . Since then, our knowledge and understanding of the evolution of genes and genomes has increased enormously. Both computational and experimental approaches indicate that gene loss and gain have been common within the primate lineage [2,3,5–10] and that much of this occurs within or is mediated by duplicated sequences. The dynamism and complexity of these changes has complicated molecular comparative genomic studies. Nevertheless, the available data clearly indicate that this variation is critical for understanding the evolution and phenotypic variation of our species. The study of SDs, defined as highly identical duplicated DNA fragments greater than 1 Kb, is relevant for two basic reasons. First, SDs are hotbeds for genome structural change between and within species. Regions of SDs are preferred sites of copy number polymorphism [13–15], disease-causing rearrangement [16,17] and evolutionary breakpoints during primate genome evolution [18,19]. Duplicated sequences – by nature of the sequence homology that promotes unequal crossover during meiosis – are mutagenic with both beneficial (genome flexibility) and damaging (human disease) consequences. Second, because primate SDs are particularly enriched for transcripts, there is the potential for gene innovation and rapid adaptation as a result of the accelerated tempo of mutation within these regions [20–22].
To date, the sequences of three primate genomes have been published: human , chimpanzee  and macaque . Sequencing of the orangutan and marmoset genomes is near completion, with additional genomes from other branch points of the primate phylogeny being targeted. More than 20 other mammalian genomes have been earmarked for whole-genome sequencing at various levels of coverage. Although many draft mammalian genomes have been published, only one (mouse)  is comparable in quality to the human genome because it was completed using a hierarchical bacterial artificial chromosome (BAC)-based approach . The resources  developed as part of these projects have begun to provide a new framework for researchers to understand the evolution of genomes. However, most available genome assemblies have been resolved from a whole-genome shotgun (WGS) approach, rendering the mapping and interrogation of SDs a more challenging task. Computational methods have been developed to identify duplicated sequences independently from the genome assembly, and experimental methods (FISH and array comparative genomic hybridization (array-CGH)) have been used to validate and explore the distribution and organization of these sequences . In this review, we will address the various methods of detection, organization and impact of SDs toward primate evolution and human disease.
Characterization of the human genome revealed that SDs are large, highly identical and interspersed [2,8,29], usually separated by >1Mb of unique sequences. Their distribution is largely nonrandom, with peculiar clustering observed near the subtelomeric and pericentromeric regions in addition to enrichments within the euchromatic portions of specific human chromosomes. The majority of human SDs map to ~400 distinct regions of the genome (termed “duplication blocks”) . Within these duplication blocks, SDs are organized into complex mosaics where individual ancestral duplicated segments (termed “duplicons”) are juxtaposed adjacent to other duplicons of diverse origin. Evolutionary studies of these duplication blocks suggest additional rounds of duplication among the duplication blocks creating a complex pattern of duplications-within-duplications and a bedazzling complexity of interspersed duplications – a hallmark feature of hominoid duplication organization .
Building on this shared evolutionary history among the duplication blocks, Jiang and colleagues applied a de Bruijn graph approach (a model applied to resolve a complicated set of sequence alignments using a data structure from combinatorial mathematics) to construct an evolutionary framework for the origin and relationship of duplication blocks . There were three important conclusions: (i) most human duplication blocks can be grouped into 24 distinct clades based on the sharing of at least one duplicon; (ii) pericentromeric and subtelomeric duplications evolved independently from intrachromosomal duplications; and (iii) the analysis pinpointed 14 specific gene-rich sequences called “core duplicons” ranging in length from 5–30 Kb that were found to be associated with most of the intrachromosomal duplications within specific human chromosomes. It was proposed that these core duplicons were the focal point for duplication expansion, driving the duplication of other segments in a stepwise process during hominid evolution (“core duplicon hypothesis”). A detailed study of one of these “core” duplicons LCR16a in humans and African apes supports the notion that these sequences were the catalyst for independent and recent expansions of duplications within different ape lineages .
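The grouping criterion in conclusion (i), that blocks belong to the same clade when they share at least one duplicon, can be illustrated with a small sketch. This is not the de Bruijn graph method used by Jiang and colleagues; it is a simplified connected-components grouping (union-find) that assumes sharing is applied transitively. The function name, block labels and duplicon labels are hypothetical.

```python
def group_blocks_into_clades(block_duplicons):
    """Group duplication blocks that share at least one duplicon.

    block_duplicons: dict mapping a block id to the set of duplicon ids it contains.
    Returns a list of sets of block ids (one set per clade).
    Assumption: sharing is transitive (A~B and B~D puts A, B, D together).
    """
    parent = {block: block for block in block_duplicons}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first block seen for each duplicon and merge later blocks into it.
    first_block_with = {}
    for block, duplicons in block_duplicons.items():
        for duplicon in duplicons:
            if duplicon in first_block_with:
                union(first_block_with[duplicon], block)
            else:
                first_block_with[duplicon] = block

    clades = {}
    for block in block_duplicons:
        clades.setdefault(find(block), set()).add(block)
    return sorted(clades.values(), key=lambda members: sorted(members))


# Hypothetical toy input: four blocks, four duplicons.
blocks = {
    "A": {"d1", "d2"},
    "B": {"d2", "d3"},
    "C": {"d4"},
    "D": {"d3"},
}
clades = group_blocks_into_clades(blocks)
```

In this toy input, blocks A, B and D end up in one clade because they are chained by shared duplicons, whereas C stands alone. A duplicon present in most blocks of a clade would be a candidate "core" in this simplified picture.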
Despite the draft nature of non-human primate genomes, it has been possible to estimate and compare SD content using both experimental and computational methods. Using an assembly-independent method (Box 1), it was found that both human and chimpanzee show an overall similar content of SDs (~5%). Although the majority are shared between the two species (~66%), differences in duplication content and copy number account for more genetic differences between the two species (~2.5%) than single basepair substitutions (~1%) . Analysis of orangutan duplications performed by mapping orangutan WGS reads to the human assembly found ~40% less duplication in the Asian ape than either human or chimpanzee. The draft sequence of the macaque genome shows a substantial reduction of SDs (~2.4% of the genome) using three different methods: two based on the genome assembly and one independent of the genome assembly . A small proportion of SDs is shared between human and macaque. These data support a model where SD activity increased after the divergence of African great apes (chimpanzee, gorilla and human) from the Asian great ape (orangutan)  (Figure 1). Numerous studies tracking the history of specific duplicons by comparative FISH analyses among primates provide additional support for this burst of activity (e.g. SLC6A18, NF1, CHEK2 [31,32]). Interestingly, this two- to fourfold acceleration seems to have occurred at a time when other mutational processes were slowing down (such as point mutations or retrotransposon activity). This finding is also consistent with the statistical mode of sequence divergence among highly identical human intrachromosomal SDs (97–99%) .
Despite extensive progress in genome sequencing, only the human genome can be considered reliable in terms of the assembly of high identity duplications among the primates. SDs are complicated to detect for several reasons. First, by definition, they are highly identical (>90% similarity), making paralogous copies challenging to distinguish from one another. As a result, whole-genome assembly-based methods tend to either under- or over-represent SDs. Second, these analyses are frequently confounded by common repeats dispersed throughout the genome. Third, the mosaic architecture of duplications, which derive from various chromosomal locations by incomplete or partial duplicative transposition, complicates SD detection.
Two main methods are broadly used to detect SDs: an assembly-based method (WGAC or whole-genome assembly comparison) and an assembly-independent method (WSSD or whole-genome shotgun sequence detection). WGAC was first used to describe global duplication content in the preliminary version of the human genome assembly. Briefly, the genome assembly is partitioned into shorter segments (400 Kb). Because common repeats complicate self-alignment, they are masked and removed, leaving “unique” genomic segments that are compared to identify large regions of high identity. Once seed duplication alignments are identified, local pairwise alignments are computed with their common repeats reinserted and the endpoints defined using a reiterative heuristic method that maps within common repeats. We initially reported alignments longer than 1000 bases aligned with >90% identity (hence the formal definition of SD). Assuming neutrality and a molecular clock, this level of sequence identity within the human genome should detect all duplications since the split of New and Old World monkeys (35 million years ago (mya)). This method has the advantage of providing an absolute estimate of copy number, the structural details of duplicated regions and the location of all duplications. However, it depends on the quality of the assembly: if sequences are missing, collapsed or not correctly assembled, duplication content can be over- or underestimated.
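The final reporting step of WGAC, keeping alignments longer than 1000 bases with >90% identity, can be sketched as a simple filter. This is only an illustration of the thresholds stated above, not the published pipeline; the `PairwiseAlignment` record, its field names and the treatment of the exact boundary values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class PairwiseAlignment:
    """Hypothetical record for one self-alignment of the assembly."""
    aligned_bases: int   # alignment length in bases
    matching_bases: int  # identical positions within the alignment

    @property
    def identity(self) -> float:
        return self.matching_bases / self.aligned_bases


def meets_sd_definition(aln, min_length=1000, min_identity=0.90):
    # Assumption: length uses >= and identity uses a strict >.
    # The length test runs first, so identity is never computed for empty alignments.
    return aln.aligned_bases >= min_length and aln.identity > min_identity


def filter_sd_candidates(alignments, min_length=1000, min_identity=0.90):
    """Keep only alignments that satisfy the SD length and identity thresholds."""
    return [a for a in alignments if meets_sd_definition(a, min_length, min_identity)]


# Hypothetical alignments: only the first is long enough and identical enough.
alignments = [
    PairwiseAlignment(aligned_bases=5000, matching_bases=4900),  # 98% identity
    PairwiseAlignment(aligned_bases=800, matching_bases=800),    # too short
    PairwiseAlignment(aligned_bases=2000, matching_bases=1700),  # 85% identity
]
sd_candidates = filter_sd_candidates(alignments)
```

Because the filter operates on assembled sequence, it inherits the limitation noted above: collapsed or missing paralogs never produce an alignment to test.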
A more versatile method, one not directly dependent on a finished genome assembly, is WSSD . This method aligns whole-genome shotgun (WGS) reads against a reference assembly (with a defined identity threshold, usually 94%). The idea is straightforward. Because duplicated reads will map to both the paralogous and the orthologous location in the assembly, duplications will be detected as an excess of read-depth even if that duplication has not been resolved. The number of reads and average sequence identity are calculated across window intervals, and the boundaries of SDs are then determined by defining transitions in read-depth across smaller window intervals. This method requires that the WGS reads be randomly distributed and that all sequences are represented at least once within the reference assembly. The method does not provide information on the location of paralogous copies, details about their structure or the sequence identity between paralogs. However, it is a method with high sensitivity for high identity duplications larger than 20 Kb and can be used to detect duplications in other species that are genetically similar to the assembly reference without that genome being fully resolved [8,20–22]. Of course, the greater the evolutionary distance, the more difficult the mapping, but it has performed well in all the great ape WGS sequences mapped to the human assembly. As expected, duplication copy number correlates well with depth of read coverage, allowing copy number differences between primate genomes to be predicted (Figure I).
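The read-depth idea behind WSSD can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published implementation: reads are reduced to their mapped start coordinates, the threshold (mean plus three standard deviations of depth in known unique "control" windows) and the minimum run of consecutive windows are hypothetical parameter choices, and all function names are invented for the example.

```python
from statistics import mean, stdev


def read_depth_windows(read_starts, region_length, window):
    """Count mapped reads whose start coordinate falls in each fixed-size window."""
    n_windows = (region_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def call_duplicated_intervals(counts, window, control_counts, n_sd=3.0, min_windows=2):
    """Report runs of consecutive windows with excess read depth.

    The threshold is mean + n_sd * stdev of the control (unique) windows;
    both n_sd and min_windows are assumed values, not published settings.
    """
    threshold = mean(control_counts) + n_sd * stdev(control_counts)
    intervals = []
    run_start = None
    for i, depth in enumerate(counts):
        if depth > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_windows:
                intervals.append((run_start * window, i * window))
            run_start = None
    # Close a run that extends to the end of the region.
    if run_start is not None and len(counts) - run_start >= min_windows:
        intervals.append((run_start * window, len(counts) * window))
    return intervals


def relative_copy_number(depth, unique_depth):
    """Depth relative to unique sequence (1.0 corresponds to single-copy sequence)."""
    return depth / unique_depth


# Hypothetical depth profile: windows 4-6 carry about three times the unique depth.
window_size = 1000
depths = [10, 11, 9, 10, 30, 31, 29, 10, 10, 10]
controls = [10, 11, 9, 10, 10, 10]
duplicated = call_duplicated_intervals(depths, window_size, controls)
```

Note that the sketch reports only where depth is elevated in the reference; consistent with the limitations described above, it says nothing about where the paralogous copies reside or how they are structured.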
The duplication content in non-primate mammalian genomes is much less clear owing to the draft nature of most current genome assemblies and the greater evolutionary distance from humans, which complicates the mapping of WGS sequence reads to the human reference. Estimates of duplication content based on the extant assemblies range widely among the different mammalian genomes (Table 1). The observation that an appreciable fraction of the assembly-detected duplications are not supported by assembly-independent methods and that 30–40% of validated duplications cannot be assigned to a chromosome clearly indicate that too much stock should not be put into current assembly-based estimates [25,26,33–36]. Only the mouse genome (C57BL/6J) is comparable in quality to that of the human genome . As the mouse genome assembly progressed from a WGS assembly to a clone-ordered assembly, the duplication content more than doubled. Both human and mouse C57BL/6J genomes show similar levels of recent duplication (~5%); however, the two genomes differ radically in the organization of these sequences. In mouse, 88% of the larger duplications (>20Kb) are organized in tandem (as opposed to just 33% in the human genome) . Experimental analyses of duplications in dog  and cow suggest that an abundance of tandem duplications represent the mammalian archetype. In total, these data imply a fundamental shift in the organization and evolution of primate SDs in which the mosaic architecture and expansion of high identity interspersed intrachromosomal duplications seem to be most pronounced in human and great ape genomes .
Unequal crossovers between directly orientated duplicated sequences can predispose to disease in two distinct ways. First, they can directly increase or decrease the copy number of a particular gene or parts of a gene embedded within SDs. This local expansion or contraction leads to dosage changes or the altered functional properties of a gene (Table 2). Most gene copy number polymorphisms associated with human diseases belong to this category. Second, duplicated sequences can predispose particular unique regions of the genome to duplication or deletion because these regions are bracketed by interspersed duplications. Dosage imbalance or gene disruption of one or more genes leads to a highly penetrant rare allele. Most of the recurrent, large copy number variants associated with neurocognitive disease belong to this second category. The characterization of human and great ape SDs (http://humanparalogy.gs.washington.edu) allows researchers to reconstruct the evolutionary framework for any duplicated region of interest. For instance, a comparison of the duplication maps of the spinal muscular atrophy (SMA) region (5q13.2) among primates shows that the SMN2 (survival of motor neuron 2) gene is duplicated in both human and chimpanzee. Humans have had the most dramatic expansion of SMN2. Interestingly, when SMN1 is deleted the severity of SMA is determined by the number of remaining copies of SMN2. Similarly, the lipoprotein Lp(a) gene (6q25) partially overlaps with a SD that is shared by human, chimpanzee and macaque and, to a lesser extent, orangutan (in which there seems to be a reduction of tandem repeats). This particular duplication is a tandem expansion of two exons within a 4.5 Kb segment and has increased in copy among humans compared with other primates (Table 2). Expansion of this cassette reduces serum levels of lipoprotein A, which is protective for coronary heart disease.
From the perspective of the evolution of genomic disorders, we can now determine the most probable age of appearance of disease-predisposing duplications (Figure 2). For example, lineage-specific amplification of the 24 Kb SDs flanking the Charcot–Marie–Tooth disease region (17p12) CMT1A-REP occurred more recently in the hominoid common ancestor after the divergence of chimpanzees and humans . The LCR22 duplications flanking the DiGeorge syndrome region (22q11.2) expanded after the divergence of hominoids from Old World monkeys . SDs flanking the Angelman/Prader–Willi region (15q11–q13) began to expand before the divergence of the Old World monkeys, whereas the Smith-Magenis syndrome-specific SD blocks (17p11.2) SMS-REPS date back to after the separation of the New World monkeys . Interestingly, these comparative analyses suggest that the predisposing genomic architecture for most genomic disorders emerged during the past 25 million years. In the few cases where detailed large-scale clone-based sequencing has been performed, humans and chimpanzees show the greatest complexity of structure . A corollary of this research is that these specific molecular causes of complex diseases such as schizophrenia, epilepsy, intellectual disability and developmental delay are, in part, the result of relatively hominid-specific duplication architectures that emerged during the evolution of our species.
Genome duplication is a classically accepted mechanism for the birth of new genes and the functional diversification and expansion of gene families. The outcome of a gene duplication event is contingent on the nature of the duplication and lineage-specific selection. Natural selection operates independently on the new copy of the duplicated genes such that the new duplicate can acquire a novel (neofunctionalization) or modified (subfunctionalization) function [44,45]. The latter frequently results in tissue-specificity or partitioning of the function from the ancestral single-copy genes. Because the process of duplication is no respecter of gene structure, partial or incomplete gene duplications are more common and these segments are, by definition, born “dead” and decay naturally within the genome as unprocessed pseudogenes. The duplication blocks of the human genome, thus, can be regarded as graveyards of exon-rich DNA from which evolutionary innovations occasionally arise.
Certain SDs undergo lineage-specific expansions in copy number before fixation by positive selection, although in some lineages fixation never occurs and the gene continues to vary in copy number. Interspecific variation in copy number is potentially essential for the evolution of species-specific adaptive traits. Trichromatic color vision, a trait essential for distinguishing red, blue and green, arose by X-linked gene duplication after the divergence of New World monkeys [46,47]. Variation in copy number between populations within species has resulted in dosage-sensitive effects for certain diseases (Table 2). For example, an increased CCL3L1 (chemokine (CC motif) ligand 3-like1) copy number is associated with a significant reduction in susceptibility to HIV infection and the progression of AIDS in humans, and other gene duplications are associated with basic biological phenomena such as lactation in mammals or the presence of venom in monotremes. Copy number differences within macaque populations also affect the rate of progression of simian AIDS; Indian-origin macaques with fewer CCL3L1 copies showed shorter post-infection survival than Chinese-origin macaques carrying more CCL3L1 copies. Copy number variation of human FCGR3 (Fc receptor, IgG, low affinity III) also seems to determine susceptibility to immune-system-mediated glomerulonephritis. Notably, an increased copy number of beta-defensins is associated with a significant risk of psoriasis, whereas a decreased copy number predisposes to Crohn’s disease of the colon. It is, therefore, evident that genes within SDs contribute to human morbidity, in addition to providing the raw material for evolutionary novelties.
In principle, the interspersed architecture of the human and great ape genomes offers tremendous evolutionary potential. The mosaic architecture of the duplication blocks in these genomes means that disparate segments can be juxtaposed, essentially shuffling different functional segments of the genome in combinations that are not found in ancestral species. Although most of these juxtapositions are non-functional, occasionally an evolutionary novel “fusion gene” can arise with functional importance (TRE2/USP6) (Box 2). Numerous, exon–intron structures with unknown functions have also been identified within SDs [4,32,55–61]. It is noteworthy that many of the core duplicons (see above) that seem to be central in the great ape expansion of SDs also harbor rapidly evolving genes and gene families. Several show evidence of positive selection, changes in gene structure or radical differences in gene expression compared with their ancestor genes [4,56,57,59,61], including NPIP (nuclear pore complex interacting protein)/morpheus (Box 2), RANBP2 (RAN binding protein 2) and the DUF1220 domain containing NBPF11 (neuroblastoma breakpoint family 11) gene. These “genes” have expanded in the human–great ape lineage to show variation in copy numbers and content within and between primate populations, and are the source of recurrent rearrangements associated with disease.
More than a dozen genes within SDs have been identified to have undergone rapid diversification and apparent neofunctionalization. Intrachromosomal remodeling by SDs usually involves euchromatic regions of chromosomes. The transposed blocks of SDs are 96–98% identical and deciphering the ancestral “duplication core” is nontrivial. Interestingly, many of these genes map to paralogous regions that are responsible for recurrent disease-associated rearrangement events (16p11.2 autism deletion, 16p13.11 deletion/duplication, Williams, Prader–Willi, velocardiofacial/DiGeorge, neurofibromatosis, spinal muscular atrophy and Smith-Magenis syndromes). A classic example of intrachromosomal remodeling is the formation of the morpheus gene. A 20 Kb segment of chromosome 16 (LCR16a) proliferated from 1–2 copies in Old World monkeys to 15–20 copies in humans and chimpanzees. The morpheus gene family was identified within these LCR16a duplication blocks. Furthermore, evolutionary analysis of the protein coding segments of morpheus showed an enhanced rate of adaptive evolution, with an excess of non-synonymous substitutions compared with synonymous substitutions for certain exons (Ka/Ks = 35) after the separation of human and great ape lineages from the Old World monkey. Notably, this exquisite pattern of evolutionary dynamics has given rise to the diversification of the expression profile for this gene family, from testis-specific mRNA expression in baboons to ubiquitous expression in humans and closely related primates (Johnson and Eichler, unpublished results).
Interchromosomal remodeling involves the coalescing of duplicated segments from disparate chromosomal regions, occasionally leading to distinct functional roles. For example, the trypsinogen IV gene on chromosome 9p13 is formed by a fusion of PRSS3 (encoding mesotrypsinogen) from 7q35 and LOC120224 from 11q24 [11,12]. The first exon of trypsinogen IV is derived from the non-coding first exon of LOC120224, whereas exons 2–5 are derived from PRSS3. This interchromosomal juxtaposition of SDs from chromosomes 7 and 11 occurred after the divergence of hominids from Old World monkeys. Furthermore, the two variants of PRSS3 mesotrypsinogen and trypsinogen IV exhibit tissue-specific expression differences, suggesting different selective constraints on functionality.
How are these SDs fixed in the genomes? One possibility is that the negative effect of these core duplicon expansions is offset by the advantage of newly minted genes located within these regions (“core duplicon hypothesis”) [12,30]. An alternative explanation that would help fix SDs even if they are slightly deleterious is a reduction in the effective population size of primate hominid populations. This hypothesis has already been proposed to account for the burst of nuclear mitochondrial insertion sequences at the prosimian–anthropoid divergence . If we assume that most large SDs are weakly deleterious, such variants might be disproportionately fixed because of the whims of genetic drift as opposed to being eliminated by purifying selection in a large effective population size. Such an excess of deleterious mutations has been seen in certain cases, such as gene control regions in comparisons between humans and chimpanzees  or, at a smaller scale, in human populations experiencing a bottleneck .
The role of SDs in evolutionary rearrangements has supported a nonrandom “fragile–breakage” model for chromosomal rearrangements in mammals [19,65,66]. The association between clusters of SDs and evolutionary chromosomal breakpoints is strong and has been observed in most mammalian genomes [67,68]. Overall, about half (51%) of human–mouse breakpoints of conserved synteny are associated with SDs, significantly more than by random chance (2%) . An important outcome of this non-random model is the propensity for evolutionary “re-use” of chromosomal breakpoints; supporting this, approximately 20% of evolutionary breakpoints from eight mammalian genomes showed evolutionary re-use . In primates, lineage-specific hyperexpansion of SDs might be the consequence of the intrinsic fragility of certain chromosomal sites for rearrangements or, alternatively, this instability could lead to SD hyperexpansion (see below). Unsurprisingly, six of the nine breakpoints of the large cytological pericentric inversions that distinguish the karyotype of humans and chimpanzees map within SD duplication blocks. Furthermore, some of the species-specific SD-mediated inversions (chromosomes 4, 5, 9, 12, 15, 16 and 17 for chimpanzee and chromosomes 1 and 18 for human) also map within species-specific SDs (chr12 in chimpanzee and chr1 and chr18 in human). The breakpoints of the inversions that do not map to SDs are enriched for common repeats (SINE, LINEs), among which non-allelic homologous recombination (NAHR) events might also have occurred (for a review on the genomic comparison between humans and chimpanzees see ).
Notably, the great apes and the lesser apes (gibbons) show apparent contrasting trends in terms of chromosomal evolution, with a slow rate of rearrangement in the African great apes and a rapid karyotypic evolution in the gibbon lineage giving rise to four species and 12 sub-species. In contrast to humans and great apes, a smaller fraction (~46%) of Nomascus leucogenys (NLE) gibbon rearrangement breakpoints map to SDs in the human lineage. If SDs are more common in humans and great apes and they associate with rearrangement, one might expect the African great ape lineage to show more rearrangements as opposed to the fourfold excess of rearrangements in the gibbon lineage. One possible explanation for this paradox is that the paucity of SDs in ancestral gibbon genomes diverted rearrangement pathways away from homology-mediated events, favoring alternative replication-based mechanisms (e.g. MMIR, FoSTeS, break-induced replication; for a review of specific non-homology-mediated, replication-based mechanisms see ). If we assume that the rate of rearrangement is uniform among all ape genomes, but that fewer SDs drive fewer homology-mediated events, we would expect non-homology-based mechanisms to contribute more significantly, manifesting as larger chromosomal rearrangements in gibbons. The abundance of duplication blocks dispersed through great ape/human chromosomes might have promoted many more regional and smaller structural rearrangement events (<1Mb) that are transparent at cytogenetic resolution. Moreover, given that NAHR events are often associated with breakpoint re-use [18,36,68], at a constant rearrangement rate the great apes would show apparently fewer structural changes because of the recurrent rearrangements involving “local” chromosomal segments. Therefore, with the same effective number of events, gibbons with fewer SDs would tend to have more distinct, cytogenetically visible “global” structural changes.
In support of this model, no excess of smaller regional structural rearrangements has been reported in gibbons despite a genome-wide survey for such events using BAC-end sequence pairs .
The origin and mechanism of the dispersion of SDs are still unclear. Different models of SD formation have been suggested for pericentromeric, subtelomeric or general interstitial SDs. Within subtelomeric regions, a translocation-based model was proposed wherein recurrent unequal non-homologous end-joining (NHEJ)-mediated translocations followed by the serial transfer of sequences generated the complex blocks of subtelomeric duplication. A common observation is that SD breakpoints are enriched for SINE repeats (especially Alus) [75–77]. This has opened the possibility that the expansion of Alu elements within the primate lineage might have shaped the ancestral human genome, making it particularly susceptible to Alu–Alu-mediated rearrangement events, which, in turn, promoted the expansion of SDs and their subsequent role in NAHR. Notably, the burst of Alu repeats (~35 mya) is dated earlier than the expansion of SDs in the human and great ape ancestry (10–20 mya). High resolution sequencing of primate genomes for some of these complex regions has suggested the possibility that specific sequences might be apt to duplicate themselves and flanking sequences to new locations. For example, the LCR16a core duplicon has moved independently in both orangutan and human lineages to new locations, acquiring its own suite of lineage-specific duplications on its flanks. The independent expansion of the gorilla and chimpanzee chr10 duplicon (Figure 3) might represent another manifestation of this core duplicon-flanking transposition model. Interestingly, many core duplicons, such as LCR16a, are particularly Alu-repeat-rich and also the source of primate gene innovations (see above).
Two studies, one on replication-based mechanisms in yeast and one on high quality sequencing of the human–NLE gibbon breakpoints of synteny, have provided additional insights into the nature of the formation of SDs [72,78]. An experiment designed to study single gene amplification and gene dosage in Saccharomyces cerevisiae led to the serendipitous observation of spontaneous duplication of multiple large inter- and intrachromosomal DNA segments encompassing several dozen genes. Furthermore, SD formation was observed even when all potential DNA repair pathways (homologous recombination and the NHEJ and microhomology-mediated end joining (MMEJ) pathways) were suppressed, suggesting that these duplicated blocks are essentially formed by replication accidents rather than by recombination-based repair. The proposed model suggested that following a double-strand break (DSB) originating from a collapsed replication fork, the free end of the DNA spontaneously invades a suitable template strand with low complexity (polyA/T) sequences or microhomology, followed by reassembly of a new replication fork. This template switching mechanism can be favored by the presence of microhomology or microsatellite sequences (MMIR).
Sequencing analysis of human–NLE gibbon rearrangements (regions specifically selected because they did not carry SDs) identified mosaic new insertions in ~40% of NLE precisely at the breakpoint of synteny  (Figure 4). Similar to the duplication blocks, these mosaic segments originated as small duplications from disparate locations that were both intra- and interchromosomal. The presence of sequence microhomology, topoisomerase binding sites and mosaic architecture at the larger breakpoint intervals suggested a replication-based mechanism for these rearrangements. A subset of these mosaic insertions were, in fact, SDs that had amplified specifically within the gibbon lineages (Figure 4). A notable example is the presence of a 4.2 Kb gibbon-specific SD mapping precisely at the translocation fusion point between chromosomes 3 and 12. Sequence analysis revealed that this SD actually consisted of duplicatively transposed sequences mapping 72 Kb and 64.5 Kb further upstream of the point of fusion on chromosome 3  (Figure 4). Thus, regions of rearrangement are indeed a source of new duplications [80,81]. These data support an alternative model that associates SDs and rearrangements and reinforces that DSBs can generate SDs [79,82,83]. Consequently, regions of genome rearrangement might, in effect, promote the formation of SDs in other regions of the genome as opposed to SDs being the cause of evolutionary rearrangements.
Regions of SDs are doomed to endless cycles of rearrangement. If duplication events are not eliminated by selection, they can promote additional rounds of inversions, duplications and deletions with an increased probability of further rearrangements as a direct function of the complexity and homology of the flanking duplications. Not surprisingly, unique genes mapping adjacent to ancestral duplications have a 10-fold higher likelihood of being duplicated – a phenomenon described as “duplication shadowing” [8,21]. Given the high dynamism of these regions, it is common to find recurrent events at nearly identical locations within the genome. A 150 Kb human polymorphic inversion on chromosome Xq28, for example, has been shown to be recurrently inverted in eutherian evolution at least a dozen times  owing, in part, to the presence of a diverse array of duplicated sequences located at the inversion breakpoint in almost every mammalian species. Similarly, a 970 Kb inversion polymorphism on human chromosome 17q21.31 is predicted to have inverted at least three times independently in the orangutan, human and chimpanzee lineages [36,85]. In humans, the inverted haplotype (referred to as H2) enriched in European populations, is associated with increased fecundity and is a predisposing factor to recurrent deletions found in handicapped children with the 17q21.31 deletion syndrome [17,86,87]. Both the evolutionarily recurrent inversion and predisposition to recurrent microdeletions in European populations are consequences of the recent duplication architecture that evolved within the human–great ape lineage (Figure 5). This example highlights the complexity of these regions and the importance of high quality final sequences for understanding the role of SDs in human disease, evolution and diversity.
Gene duplication is considered the primary means by which new genes and gene families evolve. Until recently, considerations of the birth–death process of gene duplications uncoupled these events from the underlying genomic duplication events. Recently published data suggest that dynamic structural changes mediated by duplication are intricately intertwined with the emergence of functional novelty. Primates provide a unique opportunity to study this aspect of biology. First, there has been an excess of interspersed SDs relatively recently in evolution, which provides ample substrate for novel juxtapositions and selection. Studies of these duplications have also suggested a nonuniform rate of duplication throughout primate evolution, with an elevated rate around the time of the hominoid common ancestor. Second, the human genome sequence is arguably the best functionally annotated and assembled reference sequence. Finally, genomic resources (BAC libraries, cDNAs, etc.) and sequences are available to characterize these complex regions of dynamism with precision.
Primate genomes, therefore, provide an opportunity to understand the evolutionary history and mechanism of SDs and how these events precipitated the emergence of novel genes. Such analyses, we believe, are beginning to have far-reaching implications. Recent research is revealing more genetic dissimilarity between humans and the great apes than previously anticipated, leading to the identification of novel human genes, many of which lack antecedents in other mammalian species, and suggesting mechanisms of evolutionary plasticity. Finally, it is apparent that SDs mediate genomic instability associated with disease. Understanding the dynamics of this process is, therefore, critical in assessing its impact on human health.
In this era of massively parallel sequencing, there is the promise that the genomes of most extant primate taxa will ultimately be sequenced. Simply sequencing greater diversity without a focus on the complex duplicated regions of our genome is shortsighted because it will limit our understanding of disease and the origin of our species. Without high quality sequences, it will be difficult to provide a comprehensive and functional understanding of lineage-specific duplicated genes that have been important, if not critical, in our adaptation. Both the sequence and the diversity of these regions must be systematically understood in order to genotype them accurately and determine their phenotypic consequences within our species, which requires accurately predicting the copy number, content and structure of these duplicated regions. Comparative high quality sequences of these regions among primates will provide insight into the mechanisms of their dispersion in different lineages (primates vs. other mammalian species) and the mode of selection acting on these regions. Focused efforts on these complex duplicated regions will enhance our understanding of the structure of primate genomes and their dynamic integration within the full spectrum of evolutionary change. Such studies will bring to light the potential impact of these regions on evolution, variation and disease.
We thank Jeff Kidd, Lin Chen, Ze Cheng, Heather Mefford, Leslie Emery, and Tonia Brown for valuable comments and help in the preparation of this manuscript. This work was supported, in part, by NIH grants GM058815 and HG002385 to E.E.E. T.M.-B. is supported by a Marie Curie fellowship. E.E.E. is an investigator of the Howard Hughes Medical Institute. The authors declare no conflicts of interest.