It has been estimated that at least 2% of the human genome is affected by structural variation [1], such as inversions, small insertions/deletions or large copy-number variants (CNVs) [2]. These sometimes large rearrangements can be seen as an important driving force of genome evolution [3]. As a consequence, theories of gene evolution have to be re-evaluated in the context of such rapid and widespread large-scale variation. Previous studies have already shown that the locations and functional annotations of genes in CNV regions are strongly biased [1], [4]. CNVs are found more often in pericentromeric and subtelomeric regions, and they overlap significantly with regions of segmental duplication. Genes within CNV regions are frequently involved in sensory perception and immune system activity, to a lesser extent in cell adhesion and, in a number of cases, in signal transduction [1]. Furthermore, copy-number variability has been observed to be negatively correlated with protein interaction network metrics such as connectivity and centrality [5]. Two theories have been postulated to explain this non-random distribution of CNVs. The mutational hypothesis states that most CNVs are in effect phenotypically neutral, but that their formation is driven by flanking genomic elements such as segmental duplications or Alu repeats, which cause the bias in CNV distribution. The opposing theory could be called the selection hypothesis; it states that negative and positive selection shape the distribution of CNVs through the functional elements they encompass.
Gene duplication and loss are key mechanisms in evolution [6]. Historically, it was assumed in this context that most genes can be duplicated without substantial negative fitness effects. Similarly, the established hypothesis explaining gene dominance, formulated by Wright [7], states that dominance is caused by “bottlenecks” in metabolic pathways and is generally rare [8]. This is in stark contrast to the observation that at least 20% of the entries in the OMIM database of human diseases with a Mendelian pattern of inheritance are described as heterozygous mutations [9]. It has also been shown that genes differ distinctly in their duplicability [10], [11] and that duplicated genes are in many cases still under negative selection [12], [13]. Birchler et al. [14] reported widespread dosage compensation upon polyploidization of several large chromosomal regions in maize. For all these reasons, it is now widely accepted that some genes are dosage sensitive.
What are the underlying causes of dosage sensitivity? Papp et al. [15] postulated that multi-protein complexes need to maintain the stoichiometry of their subunits to perform their biological function (the balance hypothesis). A range of experiments lend support to the balance hypothesis. It has been noted that expression levels of interacting proteins are highly co-ordinated [16], hinting that proportionality of subunit abundances is important. In a previous study, we also reported an enrichment for dominant disease mutations amongst interacting proteins [17]. Within the conceptual framework of the balance hypothesis, this can be explained by the impact of even small stoichiometric changes (a single mutated allele) on the function of the entire protein complex. It has also been argued that the tolerance towards polyploidization, compared to the sometimes severe effects of smaller duplications, can be explained by conservation of stoichiometry [18]. Finally, it has been noted that highly interacting proteins in higher organisms belong to small gene families [10], which could be conveniently explained by a bias against duplication acting on multi-protein complexes.
There have been, however, several conflicting reports. Deutschbauer et al. [19] performed an exhaustive heterozygous deletion screen in yeast and reported only 3% of genes to be haploinsufficient. While these genes were enriched for members of protein complexes, their overexpression did not cause a phenotype similar to that of their deletion. Subsequently, Sopko et al. [20] systematically induced overexpression of all ORFs in yeast. The genes found to be toxic when overexpressed did not overlap with the haploinsufficient genes described by Deutschbauer et al., and were not significantly enriched for members of protein complexes.
These findings point towards a more complex relationship between haploinsufficiency and duplication sensitivity [21]. A limited number of enzymes are sensitive to low dosage because they are the rate-limiting factor in a biochemical reaction. A range of proteins are likely to cause non-physiological binding or even agglomeration when overexpressed, as exemplified by susceptibility to early-onset Alzheimer's disease as a result of duplication of the APP locus [22]. Finally, haploinsufficiency as well as duplication sensitivity are likely to affect those master regulators that control the balanced expression of a range of other proteins [23], [24]. These regulators are in fact often complexes [25].
The newly developed CORUM database [26] contains mammalian protein complexes that were manually annotated by expert curators. It contains a large number of genes involved in gene regulation and transcription, as listed in Table 1. In this work, we use gene expression and copy-number variation data to assess the relationship between protein complexes from CORUM, dosage sensitivity and recent gene evolution in the human population. We show that changes in gene copy number have a weak but measurable effect on gene expression. We find that protein complex genes are enriched for known dosage-sensitive genes and exhibit substantially lower expression noise than other genes. Consequently, we observe that dosage-sensitive genes are underrepresented in CNV regions.
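Statements such as "protein complex genes are enriched for known dosage-sensitive genes" are commonly quantified with a one-sided hypergeometric test. The following minimal sketch illustrates that general idea only; it is not claimed to be the procedure used in this work. The function name `enrichment_p_value` and all gene counts are hypothetical and chosen purely for illustration.

```python
from math import comb


def enrichment_p_value(n_genes, n_sensitive, n_complex, n_overlap):
    """One-sided hypergeometric tail probability P(X >= n_overlap).

    Hypothetical helper (not from the paper). X is the number of
    dosage-sensitive genes found when n_complex genes are drawn at random,
    without replacement, from n_genes genes of which n_sensitive are
    dosage sensitive.
    """
    max_overlap = min(n_sensitive, n_complex)
    total = comb(n_genes, n_complex)
    tail = sum(
        comb(n_sensitive, k) * comb(n_genes - n_sensitive, n_complex - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers, invented for illustration only (not data from this study):
# 20000 genes, 600 dosage sensitive, 2000 in complexes, 120 in both.
p = enrichment_p_value(20000, 600, 2000, 120)
print(f"P(overlap >= 120) = {p:.3g}")
```

A small tail probability indicates that the observed overlap would be unlikely if complex membership and dosage sensitivity were independent. In practice a multiple-testing correction would be needed if many gene sets were tested.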
| Table 1. Composition of the CORUM database. |