Gene duplication is a major process leading to novel genes and proteins, which may naively be assumed to be a relatively slow process in evolutionary terms. However, recent results from the field of genetics argue that gene duplication has occurred frequently during the recent history of the human population and that gene duplicates occur in humans in variable numbers and may be constantly generated de novo
. Studies measuring human genome variation are receiving much attention currently [1
], as novel genomics approaches have revealed an unanticipated level of genetic variation in the human population (e.g., [2
]). A type of variation that was recently found to be abundant in the human genome is genome structural variation [4
]. Genome structural variants are generally defined (e.g., [14
]) as kilobase- to megabase-sized deletions, insertions, duplications, and inversions. Furthermore, structural variants cause a greater amount of sequence divergence between humans than the widely studied single nucleotide polymorphisms (SNPs) [4
], if one considers the total number of nucleotides spanned by both forms of variation. Even though genome structural variants can alter the intron/exon-structure of genes by disrupting exons or fusing genes together [5
], they frequently span entire genes leading to different gene copy-numbers between individuals. Following the first genome-wide mapping in humans [8
], genome structural variants have been identified in several mammalian genomes (e.g., [17
]) and at varying levels of resolution (see Box 1). Here, we use the term copy-number variant (CNV) to refer to a genome structural variant leading to changes in gene copy number (rather than an inversion or a variant not encompassing genes, both of which may also influence protein function, e.g., by influencing gene regulation). While there have been early insights on the origin of small indels (
], we are just now beginning to understand mechanisms behind the formation of CNVs. Recent advances have been fuelled by the development of approaches for mapping genome structural variation at the resolution of base-pairs [5
CNVs are of significance in relation to the human proteome in various ways. First, copy-numbers of protein-coding genes can be strikingly different between apparently healthy (“normal”) individuals (e.g., [4
]), with instances of up to 10 additional copies reported for several protein-coding gene loci (e.g., [28
]). In line with this, CNV de novo
-formation is thought to be constantly ongoing in mammalian genomes [21
], affecting recent and current protein evolution; in fact, CNV genesis may occur at rates higher than point mutations with impact on gene function. Finally, results from several studies point to a tight relationship of gene copy-number with messenger RNA and protein expression-level (e.g., [29
]). Variation at the level of gene expression may represent the underlying basis for several phenotypic traits associated with CNVs, such as dietary preferences across distinct populations [29
] or the susceptibility to diseases including HIV [28
], breast cancer [33
], autism [26
], and several auto-immune diseases [30
]. Furthermore, through this “gene-dosage” effect, CNVs are likely to influence protein complex formation and tightly regulated cellular systems. Since some of these require their individual components to be expressed at stoichiometrically precise levels, a CNV may have potentially harmful (or beneficial) effects. Many additional phenotypic relationships are likely to be discovered in the near future with the ongoing application and improvement of approaches for ascertaining CNVs (Box 1) and for associating CNVs and phenotypes [35
]. Thus, CNVs are not only relevant to population genetics, but should also be considered in systems biology and proteomics studies. CNVs may constitute a source of redundancy and thus evolvability or robustness, i.e., provide ‘replacement proteins’. Often, CNVs will behave in a selectively neutral manner (similar to most SNPs). Nevertheless, they represent a genomic pool of evolving transcripts, genes, and proteins that over longer evolutionary time scales may become fixed in the population as novel genes. Here, we summarize recent findings in the field of genome structural variation and discuss implications for the systems biology and proteomics fields.
An abundance of copy-number variants in the genome
Knowledge of CNVs has increased dramatically following recent technological advances (see Box 1). For instance, a CNV map generated from data of over two hundred individuals has revealed that 12% or more of the human genome is prone to copy-number variation [4
]. Recent studies at considerably higher resolution sufficient to map small CNVs (<50 kb) and to identify the precise boundaries (or breakpoints) of CNVs have revealed that the number of genome structural variants (>1 kb) that distinguish genomes of different individuals is at least on the order of 600–900 per individual [5
]. Of these, approximately 150 genome structural variants per individual presumably directly affect protein-coding genes by intersecting with them [5
]. Moreover, recent surveys have led to a re-estimation of the total amount of sequence divergence between individuals; while it was initially assumed that the genomes of two unrelated individuals differ by ~0.1% (mainly due to SNPs), it has recently been estimated that at least 0.5% of our genomes differ [6
], with the majority of variation being due to CNVs.
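The notion of a structural variant "intersecting" a protein-coding gene, used above to arrive at approximately 150 gene-affecting variants per individual, amounts to an interval-overlap test on genomic coordinates. The following is a minimal sketch of such a test, not the pipeline of the cited studies; the function names, the half-open coordinate convention, and all coordinates are assumptions made for illustration.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def variants_intersecting_genes(variants, genes):
    """Return the variants that overlap at least one gene on the same chromosome.

    variants: iterable of (chrom, start, end)
    genes:    iterable of (chrom, start, end, name)
    """
    genes = list(genes)  # allow repeated scans even if an iterator is passed
    hits = []
    for v_chrom, v_start, v_end in variants:
        if any(v_chrom == g_chrom and overlaps(v_start, v_end, g_start, g_end)
               for g_chrom, g_start, g_end, _name in genes):
            hits.append((v_chrom, v_start, v_end))
    return hits


# Hypothetical coordinates, for illustration only.
genes = [("chr1", 1_000, 5_000, "GENE_A"), ("chr2", 10_000, 12_000, "GENE_B")]
variants = [
    ("chr1", 4_500, 9_000),    # overlaps the end of GENE_A
    ("chr1", 6_000, 7_000),    # intergenic
    ("chr2", 9_000, 13_000),   # spans GENE_B entirely
    ("chr3", 1_000, 5_000),    # no gene annotated on this chromosome
]
hits = variants_intersecting_genes(variants, genes)
print(len(hits), "of", len(variants), "variants intersect a gene")
```

The scan compares every variant with every gene, which is adequate for a toy example; for genome-scale data a sorted sweep or an interval tree would likely be preferable.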
Considerations for our understanding of protein evolution
Recent findings concerning the abundance of CNVs in the human genome add to current perspectives on gene duplication and loss – essential processes in genome and proteome evolution. For nearly a hundred years, duplication of genetic material has been regarded as an important factor in the evolution of higher organisms (see [38
] and references therein) – and protein birth by duplication is widely considered to be more common than formation of proteins ‘from scratch’ [39
]. Following gene duplication, one of the newly generated paralogs may escape selective constraints (purifying selection) and become free to acquire a new function (neo-functionalization). Furthermore, both paralogous sequences may experience decreased selective pressure after duplication, which may reflect partitioning between paralogs into different functions which had been combined in the multifunctional ancestral gene [40
] (sub-functionalization). Gene duplication is also thought to be a major contributor to the evolution of protein networks [42
], even though it may not account for the evolution of complex molecular machines [43
]. Duplications may evolve in an effectively neutral fashion over extended evolutionary time scales [41
]. They further may be advantageous to the cell by increasing the robustness against mutations (e.g., [37
]). Moreover, at short evolutionary time scales the potential to modify gene/protein expression levels through gene dosage change may promote gene duplications and losses. In this regard, a genome-wide study [32
] has recently reported relationships between CNVs and mRNA levels. Furthermore, Perry et al.
found that increased copy-numbers of the amylase gene reflect higher levels of protein expression and are correlated with dietary preferences for starch [29
]. Note that a single CNV formation event (a type of mutation that for some genomic loci appears to occur more frequently than nucleotide substitutions; see below) may be sufficient to specifically promote gene expression modification; thus gene copy-number changes may facilitate evolutionary adaptation involving protein abundance change. Nevertheless, nucleotide substitutions having an effect on the regulation of gene expression are likely to eventually supersede gene copy-number increase (or decrease), i.e., take over in the long run; in particular, maintaining a large number of identical genes per genome over longer evolutionary time scales is likely to cause significantly increased ‘costs’ related to genome stability and repair.
De novo CNV formation
The abundance of CNVs in the genome indicates that gene duplication (and loss) probably occurs at a constant and high rate in humans. For a number of loci in the genome involved in commonly recurring genomic disorders – regions in which CNVs may recur frequently – this rate has recently been estimated to be 1e-4 to 1e-6 per generation [44
], which is considerably higher than the rate at which point mutations are thought to occur (2e-8; see refs. in [44
]). Furthermore, in a recent analysis involving inbred mice, CNV formation rates as high as 1e-2 to 1e-3 have been inferred for loci encoding genes [21
]. Note that in order to properly compare these rates we have to take into account the fact that the rate at which CNVs arise has been determined for large loci and large CNVs, e.g., of 100 kb in size, whereas the point mutation rate is given per nucleotide. If we consider that ~1% of the genome comprises coding sequence, then the rate at which protein coding sequence will experience a new point mutation within a given 100 kb locus is approximately 2e-8 * 1e5 * 0.01 = 2e-5. Conversely, any given novel CNV of 100 kb would affect protein coding sequence in the given locus. Thus, for several gene loci, CNVs formed de novo
may be significantly more likely to affect coding sequence than point mutations. Frequently, point mutations will remain silent (e.g., if they fall into synonymous sites) and may have little or no effect on protein function. On the other hand, protein duplicates may not always be expressed, and expression differences may sometimes have little or no functional consequence.
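The rate comparison above can be reproduced as a short calculation. The following is a minimal sketch that uses only the figures quoted in the text (a point mutation rate of 2e-8 per nucleotide per generation, a 100 kb locus, ~1% coding sequence, and locus-level CNV formation rates of 1e-6 to 1e-4 per generation); the function and constant names are illustrative assumptions and do not come from the cited studies.

```python
# Rough per-locus comparison of point mutation and CNV formation rates.
# All numbers are the ones quoted in the text; names are illustrative.

POINT_MUTATION_RATE = 2e-8   # per nucleotide per generation
LOCUS_SIZE_BP = 100_000      # a 100 kb locus
CODING_FRACTION = 0.01       # ~1% of the genome is coding sequence


def coding_point_mutation_rate(mu=POINT_MUTATION_RATE,
                               locus_bp=LOCUS_SIZE_BP,
                               coding_fraction=CODING_FRACTION):
    """Expected new coding point mutations per locus per generation."""
    return mu * locus_bp * coding_fraction


def fold_difference(cnv_rate, point_rate):
    """Ratio of the CNV formation rate to the coding point mutation rate."""
    return cnv_rate / point_rate


point_rate = coding_point_mutation_rate()
for cnv_rate in (1e-6, 1e-4):  # range quoted for recurrent disorder loci
    ratio = fold_difference(cnv_rate, point_rate)
    print(f"CNV rate {cnv_rate:.0e} per generation: {ratio:.2f}x the coding point mutation rate")
```

Under these assumptions the per-locus coding point mutation rate is 2e-5, so a CNV formation rate of 1e-4 exceeds it about fivefold, whereas a rate of 1e-6 falls below it. This is consistent with the statement that the conclusion holds for several gene loci rather than for all of them.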
CNVs, gene duplicates and formation bias
It is evident from genome-wide surveys that CNVs exhibit a highly non-uniform distribution along chromosomes. This distribution may have different causes: First, it may be due to biases in the ascertainment of CNVs. Second, locus-specific differences in the rate at which CNVs are formed may cause this disparity. Finally, the distribution may be due to natural selection acting differentially throughout the genome, i.e., relative to phenotypic changes caused by different genomic regions that are affected by CNVs.
We believe that the fact that several complementary technologies have detected CNVs at overlapping genomic loci (which becomes quickly apparent when browsing the Database of Genomic Variants (DGV) [16
]) indicates that technological biases are unlikely to be responsible for the trend.
However, discerning the remaining two potential causes is not straightforward. Mutation, population-variation and fixation by natural selection or random drift have been studied extensively in relation to SNPs, but much less so for CNVs. The existence of genomic loci undergoing recurrent de novo
structural rearrangements in relation to disease [44
] suggests that genomic CNV formation biases exist. In this regard, for instance, subtelomeric regions represent hot spots for interchromosomal recombination [45
] and segmental duplication sequence [45
]. In line with this, results from Redon et al.
] indicate an enrichment of CNVs in subtelomeric regions (within 500 kb of the ends of chromosome arms). Consequently, breakage or fusion of chromosomes during the evolution of mammalian genomes may have influenced the rate of duplication (and loss) of gene families across species.
Natural selection: enrichment and depletion in biological processes
Natural selection can be analyzed by studying the overlap of CNVs with various functional elements. For instance, recent studies have revealed that protein-coding genes, and also other genomic elements including highly conserved non-coding regions, tend to be depleted among CNVs, indicating purifying selection [4
]. In particular, deletions appear to be under stronger selection than duplications [4
]. Furthermore, certain functional categories of protein-coding genes are more prone to be affected by CNVs than others. For instance, several protein domains show a strong enrichment among CNVs. Our survey presented in Figure 1 extends this analysis by assessing which protein functional categories are most strikingly enriched or depleted amongst CNVs: consistent with earlier surveys, we find that proteins involved in processes related to environmental response tend to be enriched in CNVs [4
] and duplicated genes retained in the genome [50
], whereas proteins involved in fundamental cellular functions, such as cellular physiological processes, tend to be depleted. While the latter trend is presumably due to purifying selection owing to constraints, some of the former enrichment may be due to positive selection. Such effects should be observable also in fixed variants. Hence, we extended our survey by comparing “successful duplicates” (i.e., recent segmental duplications) with “unsuccessful duplicates” (nonprocessed pseudogenes
, i.e., duplicated genes that were recently inactivated by mutation; e.g., [51
]). Whereas distributions for successful duplicates reveal trends similar to the ones observed for CNVs, we note distinct trends for unsuccessful duplicates. Namely, protein-coding genes acting in metabolism and cellular physiological processes, that is, dosage-sensitive genes, appear significantly enriched among pseudogenes, although genes putatively involved in environmental response (such as genes mediating locomotion in response to stimuli) were also observed to be significantly enriched, consistent with an earlier survey [51
]. Overall, the results are consistent with constraint (purifying selection) acting on dosage-sensitive genes, leading to the removal of extra gene copies that cause dosage imbalance.
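Enrichment or depletion of a functional category among a set of genes, as discussed above, is commonly assessed with a hypergeometric test. The text here does not state which test was used for the survey, so the following is only a sketch of one standard approach; the function names and the toy counts are assumptions for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k category genes among n sampled
    genes, when K of the N genes in the background belong to the category."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def hypergeom_lower_tail(k, K, n, N):
    """P(X <= k): the corresponding probability used to assess depletion."""
    lower = max(0, n - (N - K))
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(lower, k + 1))
    return favourable / comb(N, n)


# Hypothetical toy counts: 10 genes in total, 4 in the category,
# 3 genes affected by CNVs, 2 of which are in the category.
p_enriched = hypergeom_upper_tail(k=2, K=4, n=3, N=10)
p_depleted = hypergeom_lower_tail(k=0, K=4, n=3, N=10)
print(p_enriched, p_depleted)
```

When many categories are tested at once, as in a survey across Gene Ontology terms, the resulting p-values would normally need a multiple-testing correction before being interpreted.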
Most significantly enriched protein domains in CNVs
Figure 1 Enrichment and depletion of gene functional categories (Gene Ontology (GO) annotation, GO biological process, level 3) among genes affected by CNVs. Significant enrichment (red shading) and depletion (blue shading) of protein-coding genes were determined ...
Additionally, our survey shows that unsuccessful duplicates tend to be longer than successful duplicates, both at the gene and at the protein level. Although this trend may partially be influenced by the way successful and unsuccessful duplicates have been ascertained, the observations are in line with previous findings that complex genes, such as alternatively spliced ones that are on average longer than non-alternatively spliced genes [53
], tend to be less prone to duplication than genes with few exons and no or few additional splice forms [54
Influence of the lengths of protein coding genes
Natural selection: relationship of duplications and protein interaction networks
Selection rarely acts on functions carried out by a protein ‘in isolation’. Most proteins, rather than working as a single entity, act in concert
as members of a tightly regulated pathway or as a large multi-protein complex. Consequently, the level at which proteins tend to be affected by CNVs is partially reflected in the protein’s role in the protein interaction network, i.e., the entirety of proteins thought to interact in the cell: Recently, it was shown that CNVs are more likely to affect proteins at the periphery of the network (with few interaction partners), whereas proteins at the network center (many interaction partners) are less likely to be variable in copy number [57
]. These observations are consistent with an over-representation of small protein families (having few or no paralogs) in the center of protein networks [59
] and the observation that members of large protein families tend not to be involved in protein complexes [60
]. It is plausible that proteins at the network periphery are under less evolutionary constraint and are thus freer to evolve. In contrast, duplicates affecting the network center may be detrimental and thus more likely to be selectively removed. The latter is strongly supported by the fact that unsuccessful gene duplicates are observed at the network center at a significantly higher frequency than successful duplicates.
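The distinction between network periphery and network center can be made concrete by counting interaction partners (the degree) of each protein and comparing groups of proteins. The following is a minimal sketch under that assumption; the toy network, the protein labels, and the function names are hypothetical and do not reproduce the analyses of the cited studies.

```python
from collections import defaultdict
from statistics import mean


def degrees(edges):
    """Count distinct interaction partners per protein from an undirected edge list."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}


def mean_degree(proteins, degree_by_protein):
    """Mean number of partners; proteins absent from the network count as 0."""
    return mean(degree_by_protein.get(p, 0) for p in proteins)


# Hypothetical toy network: H is a hub, P1 and P2 are peripheral.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "P1"), ("A", "B"), ("C", "P2")]
deg = degrees(edges)
cnv_proteins = ["P1", "P2"]            # hypothetical copy-number-variable genes
other_proteins = ["H", "A", "B", "C"]  # hypothetical copy-number-invariant genes
print(mean_degree(cnv_proteins, deg), mean_degree(other_proteins, deg))
```

In this toy example the copy-number-variable proteins have a lower mean degree than the others, which is the pattern described above. A real analysis would additionally need a statistical test and a control for biases in how interactions were ascertained.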
Gene duplicates and the human protein interaction network.
Natural selection: other influences on copy number variation
Besides purifying selection, positive or directional selection has been implicated in influencing the distribution of CNVs and successful duplicates in the human genome. For instance, genes frequently affected by CNVs were reported to exhibit elevated rates of amino acid change in evolution [48
], which may be an indicator for positive selection. Moreover, a recent case study focusing on the salivary amylase protein Amy1 has concluded that AMY1
gene copy number in human populations is likely subject to diet-related positive selection pressures [29
]. Furthermore, duplications are, similar to positively selected nucleotide changes, biased to the protein interaction network periphery [57
]; this indicates that adaptive evolution – involving SNPs or CNVs – tends to act at the periphery of the network rather than the center. Concerning successful gene duplicates, several groups have reported signs of positive selection (at the level of amino acid replacements) for recently generated gene duplicates in primates (see e.g., [61
]) and rodents [63
]. Finally, a recent computational analysis has presented evidence for substantial positive selection in hotspots of recently formed segmental duplications in humans [64
]; these hotspots are presumably subject to recurrent de novo
At least for some genes it appears that gene copy-number may evolve in a neutral fashion: for instance, Nozawa and coworkers [56
] reported that no significant difference exists in the amount of CNVs between functional and nonfunctional (i.e., pseudogenic) sensory receptor genes, a gene family particularly prone to structural variation (e.g., [4
]). On the other hand, the positive effect of gene duplication or loss in the case of CNVs spanning more than one gene may in some instances balance or overshadow the potentially negative impact of protein dosage imbalances and may drive the fixation of CNVs in particular regions of the genome.
Nevertheless, negative effects of commonly occurring CNVs are also visible in current CNV datasets. In particular, a survey in which we linked protein domains present in CNVs to the Online Mendelian Inheritance in Man (OMIM) and the Cancer Gene Census (CGC) [66
] databases revealed an enrichment of copy-number variable genes amongst disease-related genes; indeed, positive effects of CNVs need to be balanced against potential harmful influences of genome structural variation.
Disease associations of protein domains in genes affected by copy-number variation