Over the past decade, the ubiquity of copy number variants (CNVs, the gain or loss of genomic material) in the genomes of healthy humans has become apparent. Although some of these variants are associated with disorders, a handful of studies documented an adaptive advantage conferred by CNVs. In this review, we propose that CNVs are substrates for human evolution and adaptation. We discuss the possible mechanisms and evolutionary processes in which CNVs are selected, outline the current challenges in identifying these loci, and highlight that copy number variable regions allow for the creation of novel genes that may diversify the repertoire of such genes in response to rapidly changing environments. We expect that many more adaptive CNVs will be discovered in the coming years, and we believe that these new findings will contribute to our understanding of human-specific phenotypes.
Since diverging from the lineage leading to chimpanzees 6 million years ago, humans have evolved larger brains, gained an upright posture, lost body hair and developed complex spoken language, among other characteristics [1,2]. Even now, human evolution continues [3–9]. For example, humans are currently evolving to digest sugars commonly found in modern diets more efficiently [5,7], have higher birth rates [3,8] and are developing increased protection against infectious diseases, such as malaria [6,9]. Such evolutionary changes are adaptive because they affect an individual's likelihood of survival and reproduction. By studying the genetic variation influencing such characteristics, it is possible to understand adaptation in modern humans. Single nucleotide polymorphisms (SNPs; see Glossary), restriction fragment length polymorphisms (RFLPs) and protein polymorphisms have long been archetypes of human genetic variation, and many current tools for understanding evolution are based upon these variants (Box 1). However, little is known about the role of other forms of genetic variation in human adaptation. CNVs are a form of structural variation defined by the gain or loss of genomic material [10,11] and are common among healthy individuals [12–14]. In fact, any two persons will differ in copy number across 0.78% of their genomes on average . When combining three of the most comprehensive CNV maps to date [12–14], approximately 7.6% of the human reference genome contains CNVs. CNVs can impact human phenotypic variation and contribute to diseases such as autism, psoriasis, schizophrenia, obesity and Crohn's disease [15–22]. However, the evolutionary impact of CNVs has yet to be comprehensively assessed. In this review, we explore the impact of fixed copy number differences (CNDs) between humans and other primate species on human evolution, and discuss the extent to which CNVs contribute to adaptive evolution in humans.
We argue that genomic copy number variation is a mechanism through which phenotypes are altered and upon which selective forces can act. With the recent advances in high-resolution array and second-generation sequencing technologies, CNVs can be detected with unprecedented sensitivity and vastly improved breakpoint resolution (Box 2). It is now possible to explore the sequence context and population dynamics of CNVs, permitting the study of their evolution. Herein, we describe the ways in which CNVs can be acted upon by selection and their unique features that should be considered when studying their evolutionary history.
CNVs can have a phenotypic impact and, consequently, can alter the fitness of an allele through mechanisms including: (i) changing the coding sequence of a gene [23,24]; (ii) creating paralogs that can diverge from each other and take on new or specialized function (neofunctionalization) [25,26]; and (iii) altering the expression level of a gene [24,27]. Such scenarios are not necessarily mutually exclusive. For example, in theory, a CNV within a gene encoding a transcription factor can affect both the coding sequence of the gene itself as well as the expression of downstream targets of that gene. When a variant alters the function of a gene, whether through changing its expression levels or the encoded product, the variant then has the potential to alter the fitness of the organism and be acted upon by selection. Thus, it seems plausible that CNVs overlapping genes are more likely to be adaptive and affect fitness than are CNVs in intergenic regions . Most functional regions in the human genome appear to be under purifying (negative) selection [29,30], but there are several examples of positive selection on SNPs . The same is probably true for CNVs; that is, most CNVs overlapping genes are under purifying selection , but a handful of CNVs are thought to be under positive selection [28,32]. However, it is currently unclear to what extent positive selection acts upon CNVs and whether these few examples are exceptions, or indicative of a more general phenomenon. Recently, major resequencing projects have been accumulating massive amounts of genomic data at the population level, including data on human and non-human primates [12,33]. When these data are analyzed for the adaptive effect of copy number variation, the number of known CNVs that are under positive selection will probably increase and help researchers to better delineate the role of these variants in human evolution.
As described above, CNVs can be subject to either purifying or positive selection. However, such scenarios usually lead to fixation (either by the removal of a harmful CNV or the rapid rise in frequency of a beneficial CNV). Thus, such selective forces cannot account for the breadth of copy number variation currently present among human genomes. One explanation for the persistence and extent of CNVs among human genomes (especially for intergenic CNVs) is that most of these CNVs have evolved under neutral evolutionary pressures. Therefore, the frequency and sequence context of these neutral variants have been shaped entirely by demographic events, mutation rate and genetic drift. CNVs can be formed by a variety of mechanisms, including nonallelic homologous recombination (NAHR), nonhomologous end-joining and retrotransposition (for a detailed review on CNV formation mechanisms, see ). Once a CNV is formed by such mechanisms, it may be maintained in a population even in the absence of selection. In fact, it appears that many CNVs are probably neutral because the genomes of healthy individuals are peppered with hundreds, if not thousands of such variants .
Purifying selection acts on detrimental variants that reduce fitness and, in general, eliminates those variants from the population. The effect of purifying selection acting on CNVs is indirectly visible in the genomic distribution of these variants. There is an obvious depletion of CNVs that overlap with functional regions, such as genes and ultra-conserved elements [14,28,36]. Furthermore, there is a significant depletion of larger CNVs (>500 kb), which are more likely to overlap with functional regions and potentially disrupt epigenetic patterns, indicating strong purifying selection. It is possible that purifying selection may act more potently on CNVs than on SNPs owing to the size of the variants. As such, variants that are under purifying selection (unless the selection is mild) are exceedingly rare, limiting the opportunity to study them directly.
Positive selection acts on either a new variant or a previously neutral variant that has become advantageous owing to changes in the environment. Under positive selection, a variant can be carried to fixation in a population (i.e. selective sweep; Figure 1); however, only a fraction of the thousands of currently known CNVs exhibit empirical evidence of positive selection in humans (Table 1) [7,38–41]. As more genomes are sequenced and new analytical tools are developed to detect positive selection, we expect to observe more evidence supporting the role of adaptive CNVs.
Accurate integer copy number counts and nucleotide breakpoint resolution for CNVs are important to determine because they will help resolve the evolutionary history of individual loci. The primary challenge in studying the evolution of CNVs is the difficulty in obtaining accurate genotypes, especially for those CNVs with a large range of copy number. For instance, unlike SNPs, which are typically biallelic, the CNV gene encoding salivary amylase, AMY1, is multi-allelic and ranges in diploid copy number among humans from 2 to 15. Any given person can have many possible combinations of alleles, making CNVs overlapping this gene difficult to genotype. In fact, of 11,700 CNVs discovered recently, only approximately 5000 could be accurately genotyped using arrays for several reasons [14,42]. First, high sequence identity between paralogs or repeated sequences (such as retrotransposons) can cause cross-hybridization on the array, resulting in array signals that are complex to interpret. Second, DNA segments with high copy numbers are difficult to accurately quantify and distinguish from one another (e.g. if the array reference has a copy number of 12 for AMY1 and the test has a copy number of 14, the expected log2 ratio of intensities would be 0.22, which is unlikely to meet the cut-off for some CNV calling algorithms). Finally, the precise boundaries of CNVs often cannot be determined because of the resolution of array-based technology and, consequently, overlapping CNVs with different breakpoints are difficult to distinguish from one another. Likewise, sequencing-based methods also have difficulty genotyping CNVs for several reasons including: (i) challenges in determining the precise breakpoints of tandem duplications; (ii) the location of the distant duplicated sequence is hard to identify with short reads; and (iii) duplications can have many possible integer copy number states. 
For these reasons, among others, it has been extremely difficult to genotype any duplications by sequencing-based methods alone. Sequencing-based approaches, like array-based ones, rely on algorithms to estimate breakpoints, and the breakpoints of CNVs cannot always be resolved to the nucleotide. Nevertheless, CNV analysis is improving, and two recent analytical methods have jump-started the effort to create accurate, integer copy number genotypes for CNVs using sequencing data [43,44]. In addition, the development of technologies that produce longer sequence reads will undoubtedly improve the detection and resolution of CNVs.
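The array example above (a reference with 12 copies of AMY1 and a test sample with 14) can be checked with a few lines of arithmetic. The following is a minimal sketch, not any array vendor's calling algorithm; the function name is hypothetical, and the 4-versus-2 comparison is added only to show how the same two-copy gain produces a much larger signal on a low-copy background.

```python
import math


def expected_log2_ratio(test_copies: int, reference_copies: int) -> float:
    """Expected log2 intensity ratio of a test sample against a reference sample."""
    if test_copies <= 0 or reference_copies <= 0:
        raise ValueError("copy numbers must be positive to form a log ratio")
    return math.log2(test_copies / reference_copies)


# AMY1 example from the text: reference with 12 copies, test with 14 copies.
amy1_ratio = expected_log2_ratio(14, 12)

# The same gain of two copies on a low-copy background (illustrative only).
low_copy_ratio = expected_log2_ratio(4, 2)

print(f"14 vs 12 copies: log2 ratio = {amy1_ratio:.2f}")
print(f" 4 vs  2 copies: log2 ratio = {low_copy_ratio:.2f}")
```

Because the ratio depends on the relative rather than the absolute change in copy number, high-copy loci compress into a narrow range of log2 values, which is why they are hard to separate on arrays.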
A second challenge in studying CNV evolution in humans is that traditional signatures of selection (or, more precisely, statistics that determine the likelihood of neutral evolution; Box 1), such as amino acid substitution based tests, allele frequency based tests and linkage based tests, are not applicable to all CNVs. As CNVs are more complex forms of genetic variation than are SNPs, tests of neutrality for CNVs have to be chosen carefully based on the specific context. In addition to having a large range of copy number, duplications of AMY1 and many other CNVs were probably recurrent during evolutionary history [45,46]. Because similar CNVs can occur on different haplotype backgrounds, they are less likely to be in linkage disequilibrium (LD) with neighboring SNPs than are biallelic CNVs (Figure 1) [47–49]. Even some non-recurrent, biallelic CNVs, such as the common deletion in the cytosine deaminase-encoding APOBEC3B gene, are not in LD with surrounding SNPs . In addition, duplications can initiate gene conversion events, which can then decrease the LD surrounding such variants . For AMY1 (and other complex CNVs that may have been under recent positive selection within human populations), tests of neutrality that examine LD, extended haplotype blocks and stretches of homozygosity will have diluted signals owing to multiple haplotypes containing the CNV that may be under selection (Figure 1B).
Another confounding factor in the identification of CNVs under positive selection is that high identity duplicates, such as the many copies of AMY1, are difficult to assemble unambiguously and so are underrepresented in annotated genomes . Without the sequences of such duplications, sequence-dependent signatures of selection (such as Ka/Ks, Box 1) cannot be determined. Hence, the study of positive selection acting upon CNVs often relies on other signatures of positive selection, such as population differentiation (e.g. ). AMY1 copy number exhibits high population differentiation. An increased copy number of AMY1 permits more efficient breakdown of starch . Populations consuming high quantities of starch, such as Japanese and Europeans, have significantly more copies of this gene than do those consuming lower quantities, such as Yakut and Biaka pygmy, indicating that copy number variation at AMY1 is under positive selection in response to cultural change .
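Population differentiation in copy number can be summarized with a variance-based statistic. The sketch below uses V_ST, a measure that has been applied to CNVs in the literature but is not named in this review, so its use here is an assumption for illustration. It is computed as (V_T - V_S) / V_T, where V_T is the variance across all individuals and V_S is the size-weighted average of the within-population variances. The copy numbers are hypothetical and are not the published AMY1 data; population (rather than sample) variances are used for simplicity.

```python
from statistics import pvariance


def v_st(populations):
    """V_ST = (V_T - V_S) / V_T for a dict of population name -> copy numbers."""
    all_values = [value for values in populations.values() for value in values]
    total_variance = pvariance(all_values)
    if total_variance == 0:
        # No variation at all, hence no differentiation.
        return 0.0
    # Average within-population variance, weighted by population size.
    within_variance = sum(
        len(values) * pvariance(values) for values in populations.values()
    ) / len(all_values)
    return (total_variance - within_variance) / total_variance


# Hypothetical diploid copy numbers for two populations (illustration only).
copy_numbers = {
    "high_starch": [6, 7, 8, 6, 7, 8],
    "low_starch": [4, 5, 6, 4, 5, 6],
}

print(f"V_ST = {v_st(copy_numbers):.2f}")
```

Values near 0 indicate similar copy number distributions across populations, and values near 1 indicate strong differentiation. As the text notes, differentiation alone can also arise from demography, so it is one line of evidence rather than proof of selection.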
It is important to remember that the application of SNP-based tests of selection to CNVs assumes that the CNVs in question behave in a similar manner to SNPs. Duplications, as well as recurrent and multi-allelic CNVs, are more difficult to test for positive selection and require a novel set of analytical tools [39,40]. Thus, studying CNVs for signs of selection is complicated twofold: first, the CNVs need to be genotyped accurately; and, second, tests of neutrality need to be developed that are appropriate for CNVs that are not in LD with surrounding variants.
Whereas CNVs exhibit variation within species, CNDs are segments of genomic DNA that differ in copy number between species. Studying CNDs can help uncover signatures of ancient positive selection, which potentially led to the genetic and phenotypic divergence of closely related species. A surge in genomic duplication events is thought to have occurred in the ancestor of the great apes [51,52]. The ubiquity of the resulting segmental duplications present in the genomes of great apes set the stage for recurrent CNV formation by NAHR [46,53]. Thus, segmental duplications among apes have allowed for gene family expansions that can be acted upon by positive selection and, in turn, potentially contribute to the phenotypic divergence of species. Lineage-specific duplications are not necessarily copy number variable. Many of these events have reached fixation and can be more accurately classified as segmental duplications or CNDs. Recent studies have documented human-specific gene family expansions for which the gene duplicates appear to be under positive selection [51,54–58]. Such gene families can be examined for high proportions of nonsynonymous substitutions (Box 1), which indicate that these genes are diversifying at the sequence level from each other and from their non-human primate orthologs. One of the most extreme examples of an amplified gene family with excessive nonsynonymous substitutions is the nuclear pore complex interacting protein (NPIP) family, also known as morpheus . The genes in this family appear to be under positive selection in multiple primate species, including in humans . Although the precise function of the morpheus genes is unknown, they are expressed in many tissues and their protein products appear to interact with the nuclear pore complex and may help shuttle mRNAs across the nuclear membrane . 
Likewise, other comparative studies between primates have led to the identification of lineage-specific CNDs potentially associated with innate immunity, endurance running and reproduction [60,61].
Interestingly, many duplicated genes in the human lineage appear to be involved in brain function. Another example of a highly amplified gene family in the human lineage is the DUF1220 domain-encoding neuroblastoma breakpoint family (NBPF) [44,51,54,62]. The members of this gene family are expressed primarily in the brain and, similar to morpheus, exhibit high proportions of nonsynonymous substitutions. Several other human-specific gene duplications thought to have evolved under positive selection are also involved in brain function and development [44,63–65]. For instance, the ancestral hydrocephalus inducing homolog (mouse) HYDIN gene on chromosome 16 has an additional copy on chromosome 1 in humans that appears to be involved in regulating brain size. Individuals with deletions overlapping the human-specific copy of HYDIN on chromosome 1 tend to have microcephaly, whereas duplications tend to give rise to macrocephaly. Furthermore, specific Gene Ontology (GO) categories appear to be enriched for positive selection among the lineage-specific CNDs, including inflammatory response and functions related to cell division. Taken together, gene families related to brain function and cell growth have rapidly expanded and diversified in the human lineage. It could be postulated that these gene families have diverged in the human lineage as an adaptive change that is partially responsible for cognitive differences among primates.
Many of the copy number variable gene families in humans also include immune system-related genes that may have been under positive selection. Specifically, these genes seem to be subject to lineage-specific gene expansions and divergence [53,68]. One theoretical framework to explain this observation is that the duplication of immune system-related genes allows for a diverse `reservoir' of genes from which to draw when encountering a new pathogen . Such regions allow for the rapid adaptation to new environments and challenges; however, they also tend to be unstable and can predispose to disease. For example, β-defensins, major histocompatibility complex, class I (HLA) and killer cell immunoglobulin-like receptors (KIR) are copy number variable immune-related gene families that show evidence of positive selection in humans [39,68,70]. Such highly variable gene families have evolved in response to changing pathogens and environments. However, these gene families have been associated with certain disorders, such as psoriasis [16,71]. It is possible that the features of a locus that make it more `adaptable' also, in some cases, enhance its predisposition to genomic instability and disease (reviewed in ).
In addition to gene duplications, human-specific deletions can also be acted upon by positive selection and increase in allelic frequency until fixation. Recently, a systematic study found over 500 human-specific deletions by identifying conserved sequences that were present in the chimpanzee and rhesus macaque reference genomes, but were missing in the human reference genome . In a similar fashion to lineage-specific duplications, human-specific deletions are preferentially located near genes involved with brain function. For example, one such deletion removes an enhancer near the growth-limiting gene, GADD45G, reducing the expression of this gene in the developing forebrain and so potentially allowing more growth of this brain region in humans . Another one of the human-specific deletions identified appears to be involved in human reproduction. Humans have lost an enhancer element that regulates the expression of the androgen receptor (AR) gene, resulting in the prevention of growth of penile spines. The loss of this morphological characteristic in humans is coincident with increased copulation time in humans relative to chimpanzees, presumably owing to humans' tendency towards monogamy versus the tendency of chimpanzees towards promiscuity . Thus, the deletion of the AR enhancer and its subsequent fixation may have been an adaptive response to changing reproductive behavior . It is unknown how many of the other approximately 500 human-specific deletions are functional and under positive selection because fixed deletions are difficult to test formally for selection as they cannot be examined for substitution rates, linkage or allele frequencies.
Ancient positive selection acting on CNDs can also have modern-day medical relevance. For example, a higher copy number of the CYP2D6 gene, which encodes an enzyme important for metabolism of xenobiotics, causes increased metabolism of many commonly prescribed drugs. Indeed, patients with a high copy number of CYP2D6 can metabolize several commonly used drugs so rapidly that the medicine does not remain in the bloodstream long enough to benefit the patient. Conversely, patients with a low copy number are hypersensitive to the same medications. This gene varies widely in copy number among primates, including humans. It is thought that the copy number variation of this gene and other members of its larger gene family (the cytochrome P450 genes) initially allowed for their expansion and neofunctionalization to combat various plant toxins. One hypothesis to explain the large range in copy number in modern human genomes is that, after the initial increase in gene copy number, CYP2D6 has since been released from tight selective pressure. This gene, and other cytochrome P450 gene family members, may currently be evolving neutrally and, hence, randomly increasing or decreasing in copy number. As humans started engaging in agriculture, the need for protection against toxic plants decreased. Thus, the initial gene family expansion was driven by selection to prevent toxicity from the environment, but the selective pressures upon this gene family subsequently lessened, leading to extensive copy number variation of CYP2D6 in modern humans owing to genetic drift.
Unlike signatures of ancient positive selection on CNDs between species, signatures of recent positive selection can be detected on CNVs within species. Because CNVs vary within species, they can be examined using traditional tests of positive selection initially designed for use on SNPs. UGT2B17, which encodes an enzyme that breaks down steroids and has an important role in regulating the levels of androgens, has recently been described as a CNV under positive selection. UGT2B17 ranges in copy number from 0 to 2, and the deletion allele causes differences in the levels of excreted testosterone in urine and serum [79,80], increased risk of graft-versus-host disease and decreased bone density. Despite these phenotypes, the UGT2B17 gene deletion was found to be under positive selection based on population differentiation, the allele frequency distribution of linked SNPs and the unusual haplotype structure. The deletion is common among Asian individuals who also have lower testosterone levels in general [41,83]. Thus, the deletion may help conserve testosterone in individuals whose overall level of testosterone is relatively low.
The UGT2B17 gene deletion was one of the first CNVs to be described under recent positive selection within human populations not necessarily because of its associated phenotypes or strength of selection, but rather because it is a `well-behaved' variant (i.e. the deletion overlapping this gene behaves like a SNP). Unlike many other CNVs, the UGT2B17 CNV is common, biallelic, inherited in a Mendelian fashion and is in LD with nearby variants . Positively selected variants generally rise in frequency quickly and there is not enough time for recombination to break up linkage between the selected variant and nearby variants (Figure 1). As such, the entire haplotype, and not just the selected variant, is indicative of a selective event. As described previously for the AMY1 gene, owing to recurrence and multi-allelism, many CNVs are not in LD with nearby variants, making the study of their evolutionary histories difficult [47–49,84]. However, because the UGT2B17 CNV is in LD with neighboring SNPs, established tests of neutrality on the nearby SNPs can be used as a proxy to study the haplotype as a whole, including the CNV.
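The contrast between a `well-behaved' deletion and a recurrent CNV can be made concrete by computing LD between a biallelic CNV and a neighboring SNP. The following is a minimal sketch using the standard r² measure on phased haplotypes; the function name and both haplotype sets are hypothetical and are not UGT2B17 or AMY1 data. Each haplotype is a pair (cnv_allele, snp_allele) coded 0 or 1.

```python
def r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes coded as (0/1, 0/1)."""
    n = len(haplotypes)
    p_cnv = sum(cnv for cnv, _ in haplotypes) / n       # frequency of CNV allele 1
    p_snp = sum(snp for _, snp in haplotypes) / n       # frequency of SNP allele 1
    p_both = sum(1 for cnv, snp in haplotypes if cnv == 1 and snp == 1) / n
    d = p_both - p_cnv * p_snp                          # disequilibrium coefficient D
    denominator = p_cnv * (1 - p_cnv) * p_snp * (1 - p_snp)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator


# Single-origin deletion: always found with SNP allele 1 (hypothetical data).
single_origin = [(1, 1)] * 4 + [(0, 0)] * 6

# Recurrent CNV: arose on both SNP backgrounds (hypothetical data).
recurrent = [(1, 1)] * 2 + [(1, 0)] * 2 + [(0, 1)] * 3 + [(0, 0)] * 3

print(f"single-origin deletion: r^2 = {r_squared(single_origin):.2f}")
print(f"recurrent CNV:          r^2 = {r_squared(recurrent):.2f}")
```

In the first case the SNP is a perfect proxy for the CNV, so SNP-based tests of neutrality also report on the CNV. In the second case the SNP carries no information about the CNV, which is the situation described for recurrent, multi-allelic loci.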
Adaptation is not limited to positive selection; balancing selection can also explain some adaptive CNVs in humans. For example, the copy number of α-globin genes affects malarial morbidity and is a classic example of balancing selection in some human populations . Most humans have four identical diploid copies of α-globin (represented as αα/αα for each of two α-globin genes on two homologous chromosomes), but deletion polymorphisms and, in rare cases, duplication polymorphisms, exist, so that the diploid copy number can range from 0 to 6 . Deletions of two copies of α-globin, whether in cis (−−/αα) or in trans (−α/−α) cause thalassemia, a mild form of anemia. Loss of three copies of α-globin causes a more serious form of anemia, and having no copies is embryonically lethal. Despite this phenotype, in Southeast Asia, up to 5% of the population is heterozygous for the α-globin cis deletions (−−/αα) . These heterozygotes have reduced malarial morbidity compared with individuals with four copies of α-globin [9,87]. Together, the selective pressure to protect against malaria and the need for at least one intact α-globin gene to produce hemoglobin for survival has maintained the cis deletion allele in the population. A similar phenomenon has been observed in many other malaria-endemic populations, indicative of positive selection acting on the trans deletion genotype (−α/−α), despite its association with anemia . There are only a few other examples of balancing selection in the human genome acting on CNVs. This is probably in part because of the difficulty of detecting such selection.
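The α-globin genotype notation above can be tabulated to show why different genotypes share the same diploid copy number. The sketch below is illustrative only: haplotypes are written in ASCII ("--", "-a", "aa"), and the three-copy haplotype "aaa" is an assumption inferred from the statement that duplications extend the diploid range to 6. The consequences listed are only those stated in the text; other copy numbers are reported as not described.

```python
from itertools import combinations_with_replacement

# Copies of alpha-globin carried on one chromosome, keyed by haplotype notation.
HAPLOTYPE_COPIES = {"--": 0, "-a": 1, "aa": 2, "aaa": 3}

# Consequences stated in the text, keyed by diploid copy number.
DESCRIBED = {
    0: "embryonically lethal",
    1: "more serious form of anemia",
    2: "thalassemia, a mild form of anemia",
    4: "most common state",
}


def diploid_copies(hap1, hap2):
    """Total alpha-globin copies for a pair of haplotypes."""
    return HAPLOTYPE_COPIES[hap1] + HAPLOTYPE_COPIES[hap2]


def describe(hap1, hap2):
    """Return (copy number, consequence as stated in the text)."""
    copies = diploid_copies(hap1, hap2)
    return copies, DESCRIBED.get(copies, "not described in the text")


# Every unordered pair of haplotypes and the copy number it produces.
all_copy_numbers = sorted(
    {diploid_copies(a, b) for a, b in combinations_with_replacement(HAPLOTYPE_COPIES, 2)}
)

print("cis   --/aa:", describe("--", "aa"))
print("trans -a/-a:", describe("-a", "-a"))
print("possible diploid copy numbers:", all_copy_numbers)
```

The cis and trans deletion genotypes give the same copy number but differ in which haplotypes they can transmit, which is one reason integer copy number alone does not fully describe a CNV genotype.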
One of the most publicized and controversial examples of a CNV that is potentially adaptive is the copy number of the CCL3L1 gene, encoding a ligand for the CCR5 receptor, which is used by HIV to enter its target cells. CCL3L1 varies in copy number among humans from 0 to 14. It was suggested that an increased copy number of this gene would increase the amount of ligand, which could then compete with HIV for access to the cell receptors and prevent the entry of HIV into the cell. This was supported by an association between CCL3L1 copy number and the risk of acquiring HIV, where those with higher copy number appeared less susceptible to HIV infection. Although intriguing, this finding should be interpreted with caution: it may have been subject to `batch effects' because cases and controls were genotyped separately, and two groups have since been unable to replicate the original findings [88,89]. What remains undisputed is that this region exhibits both extensive variation in copy number and significant population differentiation. This indicates that the CCL3L1 region has evolved under special circumstances, owing to either demography (such as population expansions or bottlenecks, which can alter the frequency of variants even under neutral conditions) or selection.
Gene duplications in humans that evolved under positive selection can have related but varied functions allowing for diverse sensory perception of the environment, including smell, taste and sight. Diverse gene families, such as the olfactory receptors (ORs), may create a reservoir of genes that can respond to varying environments. There are approximately 800 OR genes and pseudogenes in the human reference genome, a third of which are copy number variable . After the teleost transition to land, there was a massive increase in the number of OR genes, presumably so that more airborne odorants could be detected . This could be an indication of positive selection toward a population with a higher copy number and diversity of OR genes. However, apes and old world monkeys have three different receptors for color vision (trichromatism), whereas most mammals only have two (dichromatism). Having trichromatic vision is thought to lessen the need to sense the environment through smell, an idea known as the `vision priority hypothesis' . Therefore, OR genes in some primates may be under less functional constraint than in other organisms and could decay. This is reflected in humans by a reduction of intact ORs compared with mouse, cow and other placental mammals that have dichromatic vision . Thus, the extreme variation of ORs (similar to the CYP2D6-containing gene family) may be primarily the result of an initial expansion under positive selection, followed by genomic drift , or the stochastic changes in gene copy number resulting from recurrent formation and deletion events. The highly variable nature of the OR genes may also affect the way in which humans perceive their environments. Each person has, on average, 25 copy number variable OR genes and pseudogenes, and a quarter of people have a homozygous deletion of at least one OR gene (some have as many as four) . 
Because individuals perceive smells with different sensitivities from one another, it is possible that the varied number and repertoire of these genes create individualized perception of odors.
In a similar fashion, the varying copy number of AMY1 and, consequently, the level of the salivary enzyme, amylase, may alter one's perception of starchy foods. Adding amylase or an amylase inhibitor to starchy food changes the perception of the texture of the food [93,94]. In fact, individuals with a low copy number of AMY1 perceive starchy foods as more viscous than do those with higher copy numbers. Such differences in perception may alter one's preference for certain foods (e.g. one might consider `creamy' foods as more desirable, but having a higher copy number of AMY1 causes creamy foods, such as custards, to digest faster in the mouth, making them seem watery and, subsequently, less desirable).
Trichromatic vision, which arose from a duplication of an opsin gene on the X chromosome, is another example of a CNV that has ramifications on the perception of environment. Deletions and gene conversion events between the opsin gene paralogs can result in color blindness (reviewed in [28,95]). As CNVs have the potential to affect the senses of smell, taste and sight, it is tempting to speculate that certain multi-allelic copy number variable genes and gene families, whether under selection or neutral conditions, contribute to the way in which humans perceive their environments.
It is an exciting time for evolutionary geneticists to decipher the potential impact of CNVs on human evolution and adaptation. Gains and losses of genomic segments, which sometimes involve the creation of novel genes, can substantially impact phenotypes. Initial studies of the evolution of human CNVs and CNDs have now identified candidate regions under positive selection (Table 1) and have also demonstrated that most functionally conserved regions lack CNVs (presumably because of negative selection) [14,36]. Regions that differ in copy number between humans and non-human primates can be informative about ancient adaptations that may have led to species-specific phenotypes. Such CNDs have altered sexual development and, possibly, brain development in the human lineage compared with chimpanzees. The copy number of variable regions within humans can inform about recent adaptations that may lead to genetic and phenotypic differences between individuals and populations. Recent positive selection acting on CNVs within human populations affects traits such as steroid metabolism, malarial morbidity and starch digestion. Notably, CNVs and CNDs exhibiting signatures of adaptation can have consequences regarding drug metabolism, immune response and sensory perception. Thus, it is warranted to study the copy number of medically relevant genes within an evolutionary framework.
In the coming years, the scientific community should anticipate the increasingly accurate discovery and analysis of CNVs, which, in turn, will highlight new regions of the human genome affecting adaptation. Identifying such CNVs will enhance our understanding of the evolutionary history of our species, delineate the genetic factors underlying our phenotypic variation and pinpoint the molecular reasons that predispose us to certain diseases.
Generally, variants are described as being neutral, under negative selection, under positive selection or, occasionally, under balancing selection. Most empirical tests for selection examine potential deviations from neutrality. Positive selection causes the allele frequency of a variant to rise, and it can drive sequence divergence and alter the relationship of the selected variant with nearby variants (Figure 1). These signatures can be used to detect positive selection. Often, more than one line of evidence is needed to convincingly state that a variant or gene is under positive selection . The types of signatures that are examined can broadly be classified as frequency based (e.g. population differentiation), linkage based (e.g. extended homozygosity and LD), or substitution based (e.g. high ratio of nonsynonymous substitutions to synonymous substitutions) (Table I). Some tests of selection work better for detecting recent positive selection (i.e. within human populations), whereas others are better at detecting ancient positive selection (i.e. between human and non-human primates), as the signals of selection tend to decay over time. Such tests of selection need to be compared to a null model of neutrality. This null model is usually created from one of two sources: first, primarily neutral regions of the genome, such as introns or processed pseudogenes, are compared to the variant(s) of interest. Second, a genome-wide distribution of statistics can be created, and this is used to determine whether the variant(s) of interest are outliers in the distribution.
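The second way of building a null model, comparing a variant against a genome-wide distribution of a statistic, amounts to asking how far into the tail the variant falls. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the background values are synthetic placeholders rather than real genome-wide data, and "more extreme" is taken to mean "at least as large", which suits statistics such as population differentiation.

```python
def empirical_tail_fraction(observed, genome_wide_values):
    """Fraction of genome-wide values that are at least as large as the observed one."""
    if not genome_wide_values:
        raise ValueError("the genome-wide distribution must not be empty")
    at_least_as_large = sum(1 for value in genome_wide_values if value >= observed)
    return at_least_as_large / len(genome_wide_values)


# Synthetic background: 100 evenly spaced values from 0.00 to 0.99 (placeholder data).
background = [i / 100 for i in range(100)]

# Hypothetical statistic for the variant of interest.
candidate_statistic = 0.95

fraction = empirical_tail_fraction(candidate_statistic, background)
print(f"fraction of genome-wide values >= candidate: {fraction:.2f}")
```

A small tail fraction marks the variant as an outlier, but it does not by itself distinguish selection from demography, which is why more than one line of evidence is usually required.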
Perhaps the most convincing evidence of positive selection results from substitution based tests. Substitution based tests of positive selection (such as the Ka/Ks ratio) utilize the ratio of nonsynonymous and synonymous mutations to infer positive selection. The Ka/Ks ratio is a powerful test for species-specific population selection when cross-species data are available. Under neutral conditions, the ratio should be similar to 1, and high proportions of nonsynonymous mutations can indicate positive selection acting on the gene. Also, a related test for cross-species comparisons is the McDonald-Kreitman test, which measures the genic nonsynonymous and synonymous mutations within and between species. If the proportion of between-species nonsynonymous differences is high, but the within-species variation is very low, this indicates positive selection. There are several studies that successfully used these measures for SNPs, short tandem repeats (STRs), RFLPs, and more recently, CNVs (Table 1, main text). In special circumstances, individual studies have successfully applied tests of neutrality that were originally designed for other types of variants to understand selective pressures acting on CNVs [7,41,45].
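The arithmetic behind these two tests can be sketched as follows. The sketch makes several simplifying assumptions: the Ka/Ks function uses raw per-site proportions with no correction for multiple substitutions, the McDonald-Kreitman comparison is summarized with the neutrality index (a common summary that is not named in the text above), and all function names and counts are hypothetical.

```python
def ka_ks(nonsyn_subs: int, nonsyn_sites: float,
          syn_subs: int, syn_sites: float) -> float:
    """Uncorrected Ka/Ks: nonsynonymous substitutions per nonsynonymous site
    divided by synonymous substitutions per synonymous site."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


def neutrality_index(dn: int, ds: int, pn: int, ps: int) -> float:
    """Neutrality index from McDonald-Kreitman counts.

    dn, ds: fixed nonsynonymous / synonymous differences between species.
    pn, ps: nonsynonymous / synonymous polymorphisms within a species.
    Values below 1 indicate an excess of nonsynonymous divergence.
    """
    if dn == 0 or ds == 0 or ps == 0:
        raise ValueError("dn, ds and ps must be non-zero for this summary")
    return (pn / ps) / (dn / ds)


# Hypothetical counts for illustration only.
neutral_like = ka_ks(nonsyn_subs=12, nonsyn_sites=300, syn_subs=4, syn_sites=100)
elevated = ka_ks(nonsyn_subs=30, nonsyn_sites=300, syn_subs=4, syn_sites=100)
ni = neutrality_index(dn=8, ds=16, pn=2, ps=40)

print(f"Ka/Ks (neutral-like): {neutral_like:.2f}")
print(f"Ka/Ks (elevated):     {elevated:.2f}")
print(f"neutrality index:     {ni:.2f}")
```

In practice, a significance test on the two-by-two table of counts would accompany the index, and Ka and Ks would be estimated with a substitution model rather than raw proportions.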
As genomics technologies, namely second-generation sequencing and array comparative genomic hybridization (aCGH), have advanced, so has the discovery of CNVs. CNVs were initially discovered using large insert clone aCGH, a technique based on hybridizing differentially labeled reference samples and test samples to large clone-derived DNAs that had been spotted on an array. Subsequently, oligo-based aCGH allowed for high-throughput discovery of CNVs with unprecedented precision (Figure I). In addition, SNP arrays, in which a single fluorescently labeled sample is hybridized to millions of oligos, have been used to identify CNVs. Two more recent studies used millions of oligo probes by aCGH to identify CNVs that are 500 base pairs (bp) or larger with a 50-bp resolution to detect the breakpoints [13,14]. Once CNVs are discovered, some can then be genotyped by either aCGH platforms with probes targeting the known CNVs or SNP-based arrays.
DNA sequencing, in addition to array-based technologies, has also been crucial in the accurate characterization of CNVs. Initially, first-generation (capillary) paired-end sequencing was used to determine whether two paired sequences from the same clone mapped to the reference genome further away or closer together than expected, indicating a loss or gain of genomic DNA in the sample, respectively [107–109]. In a similar fashion, split-reads, which map to separate locations in the reference genome, leaving a gap in between the two fragments of the read, have been utilized to discover CNVs. These methodologies have recently been adapted to study CNVs using second-generation sequencing (Figure I). Sequencing technologies are now being used to resolve even complex CNVs. In addition to arrays and sequencing, visual-based technologies, such as optical mapping, hold promise for genome-wide methods to detect CNVs.
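The paired-end logic described above can be illustrated with a short sketch. It is a simplified example, not the procedure used in the cited studies: the function name `classify_read_pair`, the fixed tolerance, and the clone-size values are all assumptions, and real pipelines typically require several supporting pairs before calling a variant.

```python
def classify_read_pair(mapped_span: int, expected_span: int, tolerance: int) -> str:
    """Classify one read pair by the distance between its ends on the reference.

    Ends mapping further apart than expected suggest the sample has lost
    sequence relative to the reference; ends mapping closer together than
    expected suggest the sample has gained sequence.
    """
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    if mapped_span > expected_span + tolerance:
        return "loss"
    if mapped_span < expected_span - tolerance:
        return "gain"
    return "concordant"


# Hypothetical values: a 40 kb expected clone insert with a 3 kb tolerance.
EXPECTED_SPAN = 40_000
TOLERANCE = 3_000

calls = [classify_read_pair(span, EXPECTED_SPAN, TOLERANCE)
         for span in (40_500, 52_000, 31_000)]
print(calls)
```

The tolerance stands in for the natural spread of insert sizes in a library; in a real analysis it would be derived from the observed size distribution rather than fixed by hand.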
We would like to thank Qihui Zhu, Sunita Setlur, George Perry, Upeka Samarakoon and Ryan E. Mills for helpful discussions and comments on a previous version of this manuscript. This work was funded by R01 GM0851533-04 and F32 AG 039979.