Following recent technological advances there has been an increasing interest in genome structural variation, in particular copy-number variants (CNVs) – large-scale duplications and deletions – in the human genome. Although not immediately evident, CNV surveys make a conceptual connection between the fields of population genetics and protein families, in particular with regard to the stability and expandability of families. The mechanisms giving rise to CNVs can be considered as fundamental processes underlying gene duplication and loss; duplicated genes being the results of “successful” copies, fixed and maintained in the population. Conversely, many “unsuccessful” duplicates remain in the genome as pseudogenes. Here, we survey studies on CNVs, highlighting issues related to protein families. In particular, CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and are depleted in genes related to basic cellular processes. Furthermore, CNVs occur more often at the periphery of the protein interaction network. Thereby, functional categories associated with successful duplicates and unsuccessful duplicates are clearly distinguishable. These trends are likely reflective of CNV formation biases and natural selection, both of which differentially influence distinct protein families.
Gene duplication is a major process leading to novel genes and proteins, and it may naively be assumed to be relatively slow in evolutionary terms. However, recent results from the field of genetics argue that gene duplication has occurred frequently during the recent history of the human population and that gene duplicates occur in humans in variable numbers and may be constantly generated de novo. Studies measuring human genome variation are currently receiving much attention, as novel genomics approaches have revealed an unanticipated level of genetic variation in the human population (e.g., [2–7]). A type of variation that was recently found to be abundant in the human genome is genome structural variation [4,5,7–13]. Genome structural variants are generally defined as kilobase- to megabase-sized deletions, insertions, duplications, and inversions (e.g., [14,15]). Furthermore, structural variants cause a greater amount of sequence divergence between humans than the widely studied single nucleotide polymorphisms (SNPs) [4–7,14,15], if one considers the total number of nucleotides spanned by both forms of variation. Even though genome structural variants can alter the intron/exon-structure of genes by disrupting exons or fusing genes together, they frequently span entire genes, leading to different gene copy-numbers between individuals. Following the first genome-wide mapping in humans [8,16], genome structural variants have been identified in several mammalian genomes (e.g., [17–22]) and at varying levels of resolution (see Box 1). Here, we use the term copy-number variant (CNV) to refer to a genome structural variant leading to changes in gene copy number (rather than an inversion or a variant not encompassing genes, both of which may also influence protein function, e.g., by influencing gene regulation).
While there have been early insights on the origin of small indels (<1 kb) [23,24], we are just now beginning to understand mechanisms behind the formation of CNVs. Recent advances have been fuelled by the development of approaches for mapping genome structural variation at the resolution of base-pairs [5,25].
CNVs are of significance in relation to the human proteome in various ways. First, copy-numbers of protein-coding genes can be strikingly different between apparently healthy (“normal”) individuals (e.g., [4,28,29]), with instances of up to 10 additional copies reported for several protein-coding gene loci (e.g., [28–31]). In line with this, CNV de novo-formation is thought to be constantly ongoing in mammalian genomes [21,26,27], affecting recent and current protein evolution; in fact, CNV genesis may occur at rates higher than point mutations with impact on gene function. Finally, results from several studies point to a tight relationship of gene copy-number with messenger RNA and protein expression-level (e.g., [29,32]). Variation at the level of gene expression may represent the underlying basis for several phenotypic traits associated with CNVs, such as dietary preferences across distinct populations or the susceptibility to diseases including HIV, breast cancer, autism, and several auto-immune diseases [30,31,34]. Furthermore, through this “gene-dosage” effect, CNVs are likely to influence protein complex formation and tightly regulated cellular systems. Since some of these require their individual components to be expressed at stoichiometrically precise levels, a CNV may have potentially harmful (or beneficial) effects. Many additional phenotypic relationships are likely to be discovered in the near future with the ongoing application and improvement of approaches for ascertaining CNVs (Box 1) and for associating CNVs and phenotypes [35–37]. Thus, CNVs are not only relevant to population genetics, but should also be considered in systems biology and proteomics studies. CNVs may constitute a source of redundancy and thus evolvability or robustness, i.e., provide ‘replacement proteins’. Often, CNVs will behave in a selectively neutral manner (similar to most SNPs).
Nevertheless, they represent a genomic pool of evolving transcripts, genes, and proteins that in longer evolutionary terms may become fixed in the population as novel genes. Here, we summarize recent findings in the field of genome structural variation and discuss implications for the systems biology and proteomics fields.
Knowledge of CNVs has dramatically increased following recent technological advances (see Box 1). For instance, a CNV map generated from data of over two hundred individuals has revealed that 12% or more of the human genome is prone to copy-number variation. Recent studies at considerably higher resolution, sufficient to map small CNVs (<50 kb) and to identify the precise boundaries (or breakpoints) of CNVs, have revealed that the number of genome structural variants (>1 kb) that distinguish genomes of different individuals is at least on the order of 600–900 per individual [5,6]. Of these, approximately 150 genome structural variants per individual presumably directly affect protein-coding genes by intersecting with them. Moreover, recent surveys have led to a re-estimation of the total amount of sequence divergence between individuals; while it was initially assumed that the genomes of two unrelated individuals differ by ~0.1% (mainly due to SNPs), it has recently been estimated that at least 0.5% of our genomes differ, with the majority of variation being due to CNVs.
Recent findings concerning the abundance of CNVs in the human genome add to current perspectives on gene duplication and loss – essential processes in genome and proteome evolution. For nearly a hundred years, duplication of genetic material has been regarded as an important factor in the evolution of higher organisms (see  and references therein) – and protein birth by duplication is widely considered to be more common than formation of proteins ‘from scratch’. Following gene duplication, one of the newly generated paralogs may escape selective constraints (purifying selection) and become free to acquire a new function (neo-functionalization). Furthermore, both paralogous sequences may experience decreased selective pressure after duplication, which may reflect partitioning between paralogs into different functions which had been combined in the multifunctional ancestral gene [40,41] (sub-functionalization). Gene duplication is also thought to be a major contributor to the evolution of protein networks, even though it may not account for the evolution of complex molecular machines. Duplications may evolve in an effectively neutral fashion over extended evolutionary time scales. They further may be advantageous to the cell by increasing the robustness against mutations (e.g., ). Moreover, at short evolutionary time scales the potential to modify gene/protein expression levels through gene dosage change may promote gene duplications and losses. In this regard, a genome-wide study has recently reported relationships between CNVs and mRNA levels. Furthermore, Perry et al. found that increased copy-numbers of the amylase gene reflect higher levels of protein expression and are correlated with dietary preferences for starch.
Note that a single CNV formation event – a type of mutation that for some genomic loci appears to occur more frequently than nucleotide substitutions (see below) – may be sufficient to specifically promote gene expression modification; thus gene copy-number changes may facilitate evolutionary adaptation involving protein abundance change. Nevertheless, nucleotide substitutions having an effect on the regulation of gene expression are likely to eventually supersede gene copy-number increase (or decrease), i.e., take over in the long run; in particular, maintaining a large number of identical genes per genome over longer evolutionary time scales likely incurs significantly increased ‘costs’ related to genome stability and repair.
The abundance of CNVs in the genome indicates that gene duplication (and loss) probably occurs at a constant and high rate in humans. For a number of loci in the genome involved in commonly recurring genomic disorders – regions in which CNVs may recur frequently – this rate has recently been estimated to be 1e-4 to 1e-6 per generation, which is considerably higher than the rate at which point mutations are thought to occur (2e-8; see refs. in ). Furthermore, in a recent analysis involving inbred mice, CNV formation rates as high as 1e-2 to 1e-3 have been inferred for loci encoding genes. Note that in order to properly compare these rates we have to take into account the fact that the rate at which CNVs arise has been determined for large loci and large CNVs, e.g., of 100 kb in size, whereas the point mutation rate is given per nucleotide. If we consider that ~1% of the genome comprises coding sequence, then the rate at which protein coding sequence will experience a new point mutation within a given 100 kb locus is approximately 2e-8 * 1e5 * 0.01 = 2e-5. Conversely, any given novel CNV of 100 kb would affect protein coding sequence in the given locus. Thus, for several gene loci, CNVs formed de novo may be significantly more likely to affect coding sequence than point mutations. Frequently, point mutations will remain silent (e.g., if they fall into synonymous sites) and may have little or no effect on protein function. On the other hand, protein duplicates may not always be expressed, and expression differences may sometimes have little or no functional consequence.
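The rate comparison above can be retraced with a few lines of arithmetic. The following Python sketch simply re-computes the per-locus point mutation rate from the figures quoted in the text and sets it against the quoted CNV formation rates; the variable and function names are illustrative assumptions, and the calculation inherits the text's simplification that a 100 kb CNV always affects coding sequence in the locus.

```python
# Figures quoted in the text (all rates are per generation).
POINT_MUTATION_RATE_PER_BP = 2e-8   # point mutation rate per nucleotide
LOCUS_SIZE_BP = 100_000             # the 100 kb example locus
CODING_FRACTION = 0.01              # ~1% of the genome is coding sequence


def coding_point_mutation_rate(per_bp_rate, locus_size_bp, coding_fraction):
    """Rate at which coding sequence in one locus acquires a new point mutation."""
    return per_bp_rate * locus_size_bp * coding_fraction


point_rate = coding_point_mutation_rate(
    POINT_MUTATION_RATE_PER_BP, LOCUS_SIZE_BP, CODING_FRACTION
)

# CNV formation rates quoted for loci involved in recurrent genomic disorders.
cnv_rate_low, cnv_rate_high = 1e-6, 1e-4

print(f"coding point mutations per 100 kb locus: {point_rate:.1e}")
print(f"CNV / point-mutation ratio (low CNV rate):  {cnv_rate_low / point_rate:.2f}")
print(f"CNV / point-mutation ratio (high CNV rate): {cnv_rate_high / point_rate:.2f}")
```

Under these assumptions the ratio ranges from about 0.05 to about 5, so CNVs outpace coding point mutations only toward the upper end of the quoted human range, in keeping with the statement that this holds for several gene loci rather than for all of them.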
It is evident from genome-wide surveys that CNVs exhibit a highly non-uniform distribution along chromosomes. This distribution may have different causes: First, it may be due to biases in the ascertainment of CNVs. Second, locus-specific differences in the rate at which CNVs are formed may cause this disparity. Finally, the distribution may be due to natural selection acting differentially throughout the genome, i.e., relative to phenotypic changes caused by different genomic regions that are affected by CNVs.
Several complementary technologies have detected CNVs at overlapping genomic loci, which quickly becomes apparent when browsing the Database of Genomic Variants (DGV); we believe this indicates that technological biases are unlikely to be responsible for the trend.
However, discerning the remaining two potential causes is not straightforward. Mutation, population-variation and fixation by natural selection or random drift have been studied extensively in relation to SNPs, but much less so for CNVs. The existence of genomic loci undergoing recurrent de novo structural rearrangements in relation to disease suggests that genomic CNV formation biases exist. For instance, subtelomeric regions represent hot spots for interchromosomal recombination and segmental duplication sequence [45,46]. In line with this, results from Redon et al. indicate an enrichment of CNVs in subtelomeric regions (within 500 kb of the ends of chromosome arms). Consequently, breakage or fusion of chromosomes during the evolution of mammalian genomes may have influenced the rate of duplication (and loss) of gene families across species.
Natural selection can be analyzed by studying the overlap of CNVs with various functional elements. For instance, recent studies have revealed that protein-coding genes, and also other genomic elements including highly conserved non-coding regions, tend to be depleted among CNVs, indicating purifying selection [4,5,47]. In particular, deletions appear to be under stronger selection than duplications. Furthermore, certain functional categories of protein-coding genes are more prone to be affected by CNVs than others. For instance, Table 1 shows a strong enrichment among CNVs for several protein domains. Our survey presented in Figure 1 extends this analysis by assessing which protein functional categories are most strikingly enriched or depleted amongst CNVs: consistent with earlier surveys we find that proteins involved in processes related to environmental response tend to be enriched in CNVs [4,5,8,9,14,48,49] and among duplicated genes retained in the genome, whereas proteins involved in fundamental cellular functions, such as cellular physiological processes, tend to be depleted. While the latter trend is presumably due to purifying selection owing to constraints, some of the former enrichment may be due to positive selection. Such effects should also be observable in fixed variants. Hence, we extended our survey by comparing “successful duplicates” (i.e., recent segmental duplications) with “unsuccessful duplicates” (nonprocessed pseudogenes, i.e., duplicated genes that were recently inactivated by mutation; e.g., [51,52]). Whereas distributions for successful duplicates reveal trends similar to the ones observed for CNVs (Figure 1b), we note distinct trends for unsuccessful duplicates.
Namely, protein-coding genes acting in metabolism and cellular physiological processes, that is, dosage-sensitive genes, appear significantly enriched among pseudogenes (Figure 1c), although genes putatively involved in environmental response (such as genes mediating locomotion in response to stimuli) were also observed to be significantly enriched, consistent with an earlier survey. Overall, the results are consistent with constraint (purifying selection) acting on dosage-sensitive genes, leading to the removal of extra gene copies causing dosage imbalance.
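Statements about enrichment or depletion of a functional category among CNV-affected genes rest on a statistical comparison against the genomic background. The text does not state which test underlies Figure 1; the Python sketch below shows one commonly used option, a one-sided hypergeometric test, purely as an illustration. The function name and the toy counts are assumptions and do not come from the survey.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, category_genes, cnv_genes, overlap):
    """Upper-tail hypergeometric p-value for over-representation.

    Returns P(X >= overlap), where X is the number of category members
    among `cnv_genes` genes drawn without replacement from a background
    of `total_genes` genes, `category_genes` of which are in the category.
    """
    max_overlap = min(category_genes, cnv_genes)
    denom = comb(total_genes, cnv_genes)
    tail = sum(
        comb(category_genes, k) * comb(total_genes - category_genes, cnv_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denom


# Toy counts (hypothetical): 20 background genes, 5 in the category,
# 4 genes overlapping CNVs, 3 of which belong to the category.
p_value = hypergeom_enrichment_p(
    total_genes=20, category_genes=5, cnv_genes=4, overlap=3
)
print(f"enrichment p-value: {p_value:.4f}")
```

Depletion would be assessed analogously with the lower tail, P(X <= overlap), and a real analysis over many functional categories would additionally need a correction for multiple testing.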
Additionally, our survey shows that unsuccessful duplicates tend to be longer than successful duplicates (Table 2), both at the gene and at the protein level. Although this trend may partially be influenced by the way successful and unsuccessful duplicates have been ascertained, the observations are in line with previous findings that complex genes, such as alternatively spliced ones that are on average longer than non-alternatively spliced genes, tend to be less prone to duplication than genes with few exons and no or few additional splice forms [54–56].
Selection rarely acts on functions carried out by a protein ‘in isolation’. Most proteins, rather than working as a single entity, act in concert as members of a tightly regulated pathway or as a large multi-protein complex. Consequently, the level at which proteins tend to be affected by CNVs is partially reflected in the protein’s role in the protein interaction network, i.e., the entirety of proteins thought to interact in the cell: recently, it was shown that CNVs are more likely to affect proteins at the periphery of the network (with few interaction partners), whereas proteins at the network center (many interaction partners) are less likely to be variable in copy number [57,58]. These observations are consistent with an over-representation of small protein families (having few or no paralogs) in the center of protein networks and the observation that members of large protein families tend not to be involved in protein complexes. It is plausible that proteins at the network periphery are under less evolutionary constraint and are thus freer to evolve. In contrast, duplicates affecting the network center may be detrimental and thus more likely to be selectively removed. The latter is strongly supported by the fact that unsuccessful gene duplicates are observed at the network center at a significantly higher frequency than successful duplicates (Figure 2).
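The notion of network periphery versus center used above can be illustrated with the simplest centrality measure, the number of interaction partners (degree). The Python sketch below is a minimal illustration under that assumption; the cited studies may rely on additional or different centrality measures, and the toy network, the protein labels, and the set of CNV-affected proteins are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def degrees(edges):
    """Count interaction partners per protein in an undirected edge list."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Hypothetical toy network: hub protein H with four partners, plus a short chain.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("D", "E"), ("E", "F")]

# Hypothetical set of proteins whose genes overlap CNVs.
cnv_proteins = {"A", "F"}

deg = degrees(edges)
cnv_mean = mean(deg[p] for p in cnv_proteins)
other_mean = mean(d for p, d in deg.items() if p not in cnv_proteins)

print(f"mean degree, CNV-affected proteins: {cnv_mean}")
print(f"mean degree, remaining proteins:    {other_mean}")
```

In this toy example the CNV-affected proteins have a lower mean degree than the remaining proteins, which is the pattern described in the text; with real data the difference would have to be tested for significance rather than read off two means.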
Besides purifying selection, positive or directional selection has been implicated in influencing the distribution of CNVs and successful duplicates in the human genome. For instance, genes frequently affected by CNVs were reported to exhibit elevated rates of amino acid change in evolution, which may be an indicator for positive selection. Moreover, a recent case study focusing on the salivary amylase protein Amy1 has concluded that AMY1 gene copy number in human populations likely underlies diet-related positive selection pressures. Furthermore, duplications are, similar to positively selected nucleotide changes, biased to the protein interaction network periphery; this indicates that adaptive evolution – involving SNPs or CNVs – tends to act at the periphery of the network rather than the center. Concerning successful gene duplicates, several groups have reported signs of positive selection (at the level of amino acid replacements) for recently generated gene duplicates in primates (see e.g., [61,62]) and rodents. Finally, a recent computational analysis has presented evidence for substantial positive selection in hotspots of recently formed segmental duplications in humans; these hotspots are presumably subject to recurrent de novo CNV formation.
At least for some genes it appears that gene copy-number may evolve in a neutral fashion: for instance, Nozawa and coworkers reported that no significant difference exists in the amount of CNVs between functional and nonfunctional (i.e., pseudogenic) sensory receptor genes, a gene family particularly prone to structural variation (e.g., [4,65]). On the other hand, the positive effect of gene duplication or loss in the case of CNVs spanning more than one gene may in some instances balance or overshadow the potentially negative impact of protein dosage imbalances and may drive the fixation of CNVs in particular regions of the genome.
Nevertheless, negative effects of commonly occurring CNVs are also visible in current CNV datasets (Figure 3). In particular, a survey in which we linked protein domains present in CNVs to the Online Mendelian Inheritance in Man (OMIM) and the Cancer Gene Census (CGC) databases revealed an enrichment of copy-number variable genes amongst disease-related genes; indeed, positive effects of CNVs need to be balanced against potential harmful influences of genome structural variation.
CNVs should be considered in systems biology and proteome evolution-related studies due to their effect on protein expression, function and the phenotype, and their likely contribution to protein family evolution. After formation and subsequent fixation following selection or random drift, CNVs may give rise to gene duplicates or losses; thus they represent important genomic intermediates in genome and proteome evolution.
Our understanding of CNVs was considerably enhanced by novel high-resolution genomics technologies (Figure 4). Genome-wide microarray technologies based on Bacterial Artificial Chromosomes or representational oligonucleotide microarray analysis (ROMA), which uses short oligonucleotides probing genomic loci at a density of one oligonucleotide per 30 kilobases, enabled generation of a first record of CNVs in the human genome [8,16]. Subsequently, mapping-resolutions have considerably increased following the development of computational approaches for mapping fosmid clone-ends to the reference genome, the mining and statistical analysis of SNP genotyping data [10,11], and the development of high-resolution oligonucleotide microarray technology [12,25,69,70]. For instance, high-resolution comparative genome hybridization (HR-CGH) based on oligonucleotide tiling arrays enables the generation of CNV maps at a resolution below 300 bp. Novel sophisticated computational approaches [65,71,72] have facilitated scoring and interpreting the data, and allowed mapping the actual physical boundaries, or breakpoints, of CNVs systematically [25,65,70]. Other recent surveys provided records of small and medium-sized indels based on comparing raw DNA sequence reads and alternative human genome assemblies [6,74] to the human reference genome. Furthermore, a recent survey based on next-generation DNA sequencing provided a genome-wide account at sub-kilobase resolution of genome structural variants – i.e., deletions, insertions and inversions – in two human genomes by high-resolution and massive paired-end mapping (PEM).
**Redon et al., 2006
In this paper two complementary microarray technologies were used to catalogue CNVs genome-wide in over two hundred healthy individuals. Nearly 12% of the human genome was found to be prone to variation in copy-number, and copy-number variable regions mapped encompassed hundreds of genes. The generated data enabled the authors to examine the genomic impact of CNVs.
**Perry et al., 2007
The copy-number of the salivary amylase gene (AMY1), which shows a strong correlation with protein expression level, differs markedly across human populations. This study reports that AMY1 copy-numbers are high in populations with starch-rich diets and that the distribution of AMY1 copy-numbers was likely influenced by positive selection.
*Egan et al., 2007
The authors systematically analyzed the de novo formation of CNVs in the genomes of inbred mice. Surprisingly, the analyzed genomes contained a large number of recently formed CNVs, distributed in a non-random fashion across the genome and frequently encompassing genes. These findings may have implications for future studies involving model organisms.
*Jiang et al., 2007
The authors devised an algorithmic framework to reconstruct the evolutionary history of recent segmental duplications in the human genome. Many recent duplications occurred in a small subset of genomic hotspots (i.e., core duplicons), which may be centers for human transcript, gene, and protein birth. These centers are enriched for protein-coding genes, many of which show signs of positive selection.
*Kim et al., 2007
This work analyses the relationships of adaptive evolution and genetic variation with proteomic properties, namely topological positioning within the protein interaction network (see also work by Dopman and Hartl, 2007). The network periphery is enriched both for signatures of recent adaptation and genetic variation (SNPs and CNVs). The trends are rationalized in terms of constraints imposed by protein structure, and explained by the approximate mapping of the network to cellular organization.
*Dopman and Hartl, 2007
The authors characterize CNVs across genomic regions in a model organism, Drosophila melanogaster, and report a surprising amount of CNVs in flies. A comprehensive analysis of evolutionary processes revealed various evolutionary trends that are paralleled by findings in relation to CNVs in humans, with negative selection and presumably local biases in mutational mechanisms being main factors shaping genome-wide CNV occurrence patterns. Similar to observations that have been made in the human proteome (by Kim et al., 2007), the authors find that fly CNVs are depleted amongst proteins that are central in the protein interaction network.
*Urban et al., 2006
The authors report the first chromosome-wide tiling microarray experiment enabling the mapping of CNVs at ~300 bp resolution (i.e., below the resolution of most exons), paving the way for future high-resolution analyses of CNVs in the human genome using cost-efficient microarray technology. Using a novel approach, HR-CGH, the authors resolved breakpoints of commonly occurring CNVs as well as large genomic aberrations associated with congenital diseases.
Funding was provided by a Marie Curie Fellowship (J.O.K.) and the NIH (Yale Center of Excellence in Genomic Science grant). The authors thank Pedro Alves and Jeroen Raes for comments on the manuscript.