In 2004, two pioneering studies showing that copy number variants (CNVs) are abundant in healthy human individuals
[1],
[2] accelerated research on this class of variation. The focus on these variants was well motivated because duplications and deletions of DNA regions have long been known to underlie a variety of genomic disorders
[3],
[4]. The discovery of the abundance of CNVs in otherwise healthy individuals made them good candidates to underlie common and rare diseases as well as other physiological traits. In just a few years, CNVs were implicated in a variety of diseases such as autism
[5], schizophrenia
[6], Crohn's disease
[7], psoriasis
[8] and other traits such as body weight
[9] and starch consumption
[10]. Duplications and deletions also have a long history of being implicated in adaptation and of being a major source of genetic innovation
[11]–
[14]. In domesticated animals, for example, they are responsible for white coat color in horses (duplication within an intron leading to
cis-regulatory changes
[15]), reduced comb and wattle size in chickens (duplication within an intron leading to expression changes
[16]) and short-legged dogs (new retrogene
[17]). Although much has been learned about CNVs, recent research has raised more questions than it has answered. Two independent avenues of research focus on the roles that mutation and selection play in shaping copy number variation.
Understanding the mutational processes underlying the formation of CNVs is important from both a medical and an evolutionary perspective. Duplications and deletions can result from the imperfect repair of DNA double-strand breaks, which are generated by both exogenous agents (e.g. ionizing radiation) and endogenous agents (e.g. reactive oxygen species produced by normal cellular metabolism)
[18],
[19]. DNA replication errors can also generate CNVs, with or without the formation of DNA double-strand breaks
[4],
[19]. Replication-based repair processes have been proposed to explain complex CNVs (i.e. CNVs with multiple breakpoints)
[20]–
[22], but evidence suggests that they underlie the formation of simple CNVs as well
[23]–
[25]. Several lines of evidence suggest that CNV mutation rates vary throughout the genome
[26],
[27] and CNV hotspots have been identified in the human
[27]–
[29], chimpanzee
[28],
[30], mouse
[31] and fly
[32]–
[34] genomes. Mammalian CNV hotspots are significantly enriched with segmental duplications, which have been proposed to promote the occurrence of CNVs by facilitating non-allelic homologous recombination (NAHR)
[3],
[4]. Following this observation, Sharp and colleagues specifically targeted genomic regions associated with segmental duplications in the human genome and were able to identify CNVs associated with previously unidentified genomic disorders
[35]. However, not all mammalian hotspots are associated with segmental duplications
[28],
[30] and
Drosophila hotspots are likely not associated with them at all
[33]. As such, a priority of the field is to identify the genomic feature(s), other than segmental duplications, that are associated with regions with increased numbers of CNVs.
Understanding the evolutionary forces that shape CNVs is also important from both a medical and an evolutionary perspective. Despite the pervasiveness of CNVs, analyses of their genomic distribution among different functional regions clearly indicate that a large fraction is under purifying selection. Population genetic models that address both demographic and selection processes have been used to estimate the strength of selection acting on different classes of CNVs. In both flies
[36] and humans
[37], coding CNVs are under the strongest purifying selection, followed by intronic CNVs and finally intergenic CNVs. Evidence for positive selection has been less clear. There are examples of CNVs under positive selection in humans, such as the copy number variation of the amylase
[10] and CCL3L1
[38] genes, and in flies (e.g. duplication of the Cyp6G1 locus)
[36],
[39]. However, on a genome-wide scale, the over-representation of certain classes of genes in CNVs, namely “environmental” genes, is better explained by reduced purifying selection acting on these variants than by positive selection
[40]. Although genome-scale studies of CNVs have only recently become technically feasible
[41], the study of gene duplication can be traced back to as early as 1911
[12],
[42]. An important problem is to determine the relative roles of positive selection and genetic drift in the fixation of new gene duplicates
[43]. Most population genetic models assume that gene duplicates are fixed by genetic drift and that their subsequent fate in genomes (being retained or lost) is determined by ensuing mutations in one or both copies
[43],
[44]. An alternative hypothesis is that gene duplications are fixed by positive selection. Assessing the roles of drift and selection requires the study of young duplications that still bear the hallmarks of the evolutionary process responsible for their fixation
[11],
[14],
[43].
The aim of this work is to investigate the roles that mutation and selection play in shaping duplication polymorphisms. We take advantage of the genetic model system formed by the sibling species
D. melanogaster and
D. simulans, which have been used extensively to conduct population and evolutionary genetic studies
[45]. While they share a recent common ancestor and are morphologically very similar, at an average of 4% DNA sequence divergence, they are sufficiently diverged to provide many evolutionary insights
[46]. Hence, the structural differences (and similarities) of their genomes can be leveraged to dissect the genomic features responsible for the variation in CNV density along the genome and elucidate the existence of duplication hotspots. For example, while the
D. melanogaster genome is rich in inversion polymorphisms, these are rare in
D. simulans [47]. Similarly, the fraction of repetitive sequence is considerably larger in the
D. melanogaster genome
[46],
[48] and transposable elements are differentially distributed in the two species
[46]. Another useful distinction between the two species, and one that can be used to investigate the role of selection, is the difference in their effective population sizes.
D. simulans has a ten-fold larger effective population size than
D. melanogaster, which is predicted to translate into a greater effectiveness of selection in
D. simulans [49],
[50]. Thus, selection in this species is expected to be more efficient both at purging deleterious mutations and at fixing beneficial ones
[51]. The differences in population size and genome structure between
D. melanogaster and
D. simulans provide us with a powerful genetic model in which to study how mutation and selection processes shape patterns of copy number variation.
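The expectation that a larger effective population size increases the effectiveness of selection can be illustrated with Kimura's diffusion approximation for the fixation probability of a new mutation. What follows is a sketch under standard textbook assumptions that are not part of the analysis described here: a diploid population of census size $N$ and effective size $N_e$, and a new mutation present as a single copy with an additive effect on fitness (selection coefficient $s$ in heterozygotes and $2s$ in homozygotes).

```latex
% Fixation probability of a new additive mutation (initial frequency 1/(2N))
u(s) \;=\; \frac{1 - e^{-2 (N_e/N)\, s}}{1 - e^{-4 N_e s}}

% Limiting cases
u(s) \;\to\; \frac{1}{2N}
  \quad \text{as } s \to 0 \quad \text{(neutral; fixation by drift alone)}

u(s) \;\approx\; 2 s \,\frac{N_e}{N}
  \quad \text{for } 0 < s \ll 1,\; 4 N_e s \gg 1 \quad \text{(beneficial)}

u(s) \;\approx\; 2 |s| \,\frac{N_e}{N}\; e^{-4 N_e |s|}
  \quad \text{for } s < 0,\; |s| \ll 1,\; 4 N_e |s| \gg 1 \quad \text{(deleterious)}
```

Under these assumptions, a mutation behaves as effectively neutral when $|4 N_e s| \ll 1$. A ten-fold larger $N_e$ therefore narrows the range of selection coefficients over which drift dominates by roughly ten-fold, so more weakly deleterious mutations are purged and more weakly beneficial ones are fixed by selection rather than by chance.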
Duplication and deletion polymorphisms have previously been surveyed in 15 lines of
D. melanogaster using tiling microarrays
[36]. Here, we use the same approach to identify and characterize duplication polymorphisms in 14 lines of
D. simulans. By integrating this new dataset of duplications in
D. simulans with the previous dataset of duplications in
D. melanogaster, we identified duplication hotspots shared between the two species. Significantly, we found that these hotspots are not associated with segmental duplications or transposable elements but are instead associated with regions of the genome that are late-replicating. We also show a higher effectiveness of selection acting on
D. simulans duplications than on
D. melanogaster duplications, and suggest an important role for positive selection in driving a sizeable fraction of
D. simulans duplications to fixation.