Our study provides a systematic, semi-quantitative assessment of essentiality and gene duplication across eleven prokaryotic and eukaryotic organisms, revealing a heterogeneous picture. To the best of our knowledge, this is the first such organism-wide comparison.
Chances of survival upon gene deletion are very high in most organisms (>80%), i.e. there are only a few essential genes (Figure ). We observe some variation in survival that cannot be explained by experimental differences alone. The bacteria in our dataset come from different experimental backgrounds (i.e. insertion vs. deletion, population vs. clonal study, Table ). For example, screens of mixed populations with random gene insertions identify more essential genes than clonal studies, e.g. H. pylori, H. influenzae, and M. tuberculosis vs. P. aeruginosa, B. subtilis and E. coli (Table ); however, there is no general trend.
The extremely high chances of survival in fly (Figure ) can be attributed, in part, to the use of a cell line rather than the whole organism and of RNAi knockdowns instead of full gene deletions [35], and may be an underestimate due to current technical limitations. In worm, however, the same technique (RNAi knockdown) applied to the whole organism also produced high survival rates, but with a much higher contribution of duplicates to survival (see below).
The low chances of survival in mouse are likely due to the mouse dataset originating not from a large-scale screen, but from individual experiments that may have preferentially targeted and reported essential genes. For example, the gene targets in the mouse dataset are strongly enriched for orthologs of human disease genes (OMIM data, not shown); thus the dataset is biased. The lack of buffering by duplicate genes in mouse has been demonstrated recently [18]; however, once an unbiased large-scale essentiality screen in mouse becomes available, these results may be refined.
The degree of gene essentiality (or degree of survival) can be influenced by the experimental technique and by the definition of essentiality that is used. In contrast, if duplicates contribute to survival upon gene loss, then this effect should be detectable irrespective of the number of essential and non-essential genes identified (provided that the selection is unbiased). In other words, we expect buffering by duplicates to be less dependent on technical differences than essentiality alone. We introduced statistical tests to assess the significance of buffering by duplicates (Figure ). A small P-value implies that duplicates are significantly enriched amongst non-essential genes compared to random expectation, and vice versa. Thus, for example, H. pylori has only a few genes with duplicates (Figure ), but these duplicates exhibit a significant contribution to survival upon gene knockout (Figure ). Likewise, B. subtilis and E. coli have similar degrees of gene essentiality (one examined by insertion, the other by knockout experiments) and similar fractions of duplicate genes, but very different contributions of these duplicates to survival.
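An enrichment of duplicates amongst non-essential genes can be assessed, for instance, with a one-sided hypergeometric tail probability. The sketch below uses invented gene counts and illustrates one common choice of test; it is not necessarily the exact procedure or data of our study.

```python
from math import comb

def enrichment_p(dup_nonessential, duplicates, nonessential, total):
    """One-sided hypergeometric P-value: probability of drawing at least
    `dup_nonessential` non-essential genes when `duplicates` genes are
    sampled at random from `total` genes, of which `nonessential` are
    non-essential. A small value indicates that duplicates are enriched
    amongst non-essential genes."""
    p = 0.0
    for k in range(dup_nonessential, min(duplicates, nonessential) + 1):
        p += (comb(nonessential, k)
              * comb(total - nonessential, duplicates - k)
              / comb(total, duplicates))
    return p

# Hypothetical genome: 1200 genes, 1000 non-essential, 320 with duplicates,
# 300 of which are non-essential (more than the ~267 expected by chance).
p = enrichment_p(300, 320, 1000, 1200)
print(f"P = {p:.3g}")  # small P: duplicates enriched amongst non-essential genes
```

Under these invented counts the observed overlap lies well above the chance expectation, so the tail probability comes out very small.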
Duplicates significantly and positively contribute to survival in nine of the eleven organisms, but have noticeable effects (>5%) in only six: H. pylori, M. tuberculosis, P. aeruginosa, E. coli, yeast and worm (Figure ). Given that duplicates make up as much as 80% of eukaryotic genomes (Figure ), this small contribution is surprising and points to dominant roles of other buffering processes, such as the rerouting of metabolic flux (see ref. [9] for an example).
Buffering by duplicates is uncorrelated with organismal complexity. Buffering capacity varies widely amongst bacteria and eukaryotes, even when accounting for differences in experimental approaches (Table ). M. genitalium, H. influenzae, B. subtilis, fly and mouse show low or even negative contributions of duplicates to buffering; H. pylori, yeast and worm show the highest. M. genitalium is a parasite with a small range of host- or tissue-specific living conditions [36] and a very small genome [37] (Figure ). Its low rate of survival upon gene knockout could be explained by the low number of duplicate genes and the lack of condition-specific dispensability of genes, which boosts survival rates under normal conditions [12]. However, the same reasoning could apply to H. pylori and H. influenzae, which have genome sizes similar to M. genitalium and restricted living conditions, but much higher survival rates and different buffering capacities of duplicates. Mouse represents an exception in the analysis, with relatively low survival rates (Figure ), a higher ratio of duplicates vs. singletons than other organisms (Figure ), but a negative contribution of duplicates to survival (Figure ). As explained above, conclusions in mouse may be refined later.
Next we examined gene characteristics that have been suggested to influence buffering capacity. For example, we would expect duplicates of high sequence proximity (measured by E-value) to be more likely to buffer for loss of function than duplicates that have diverged in sequence. Similarly, we would expect genes with many duplicates (large gene families) to be more likely to be buffered for loss of function than genes of small families. Both expectations are fulfilled in only some of the organisms (Table ), e.g. in the two most thoroughly studied organisms, yeast and worm, but not in others.
Related to sequence similarity is function, which is more dissimilar amongst buffering duplicates than naively expected when measured in terms of expression regulation [20] and genetic interactions [13]. When evaluating functional similarity in terms of verbal descriptions, of shortest path length in a network of functional relationships, and of the similarity of KO-phenotype and physical interaction vectors, buffering genes were slightly (but not significantly) more similar to each other in function than non-buffering genes (Table ). Thus, functional similarity is also only a weak indicator of the buffering capacity of duplicates.
The single best correlate of buffering capacity by gene duplicates identified in our study is expression level. Highly expressed genes tend to have more duplicates, and these duplicates are also more likely to buffer for loss of the gene's function. (Note the subtle difference between the two observations.) The trend holds true for all organisms with positive buffering capacity (except M. tuberculosis) and for different measures of expression levels (Additional file 1). For example, for highly expressed genes in E. coli, C increases to 23%. Likewise, buffering two-gene families in yeast have higher mRNA and protein abundance than non-buffering two-gene families, as well as higher transcription and translation rates and lower protein degradation rates (Table ).
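A comparison of this kind, between expression values of buffering and non-buffering gene families, can be sketched with a one-sided permutation test. The abundance values below are invented for illustration and are not data from our study.

```python
import random

def perm_test_mean_diff(a, b, n_perm=10000, seed=0):
    """One-sided permutation test: fraction of random relabelings of the
    pooled values in which mean(a) - mean(b) is at least as large as the
    observed difference."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (sum(pooled[:len(a)]) / len(a)
                - sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Invented mRNA abundances (arbitrary units) for buffering vs.
# non-buffering two-gene families:
buffering = [12.1, 15.3, 9.8, 14.0, 13.2, 11.7, 16.5, 10.9]
non_buffering = [7.4, 9.1, 6.2, 8.8, 10.0, 7.9, 6.5, 9.4]
print(f"one-sided P = {perm_test_mean_diff(buffering, non_buffering):.4f}")
```

A permutation test makes no distributional assumptions, which suits abundance measurements whose error structure is poorly characterized; a rank-based test such as Mann-Whitney U would serve the same purpose.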
In sum, buffering by gene duplicates plays a significant and visible role in robustness against gene loss in only some organisms. Factors influencing such buffering are, in decreasing order of approximate importance: gene expression level, sequence distance between duplicates, the number of duplicates available per gene, the gene's function, and the type of organism and its lifestyle. This ranking holds true despite differences in experimental approaches. The lack of consistency across organisms, the lack of strong correlates and the low extent of buffering by duplicates suggest that buffering by duplicates is merely a by-product of other processes. Genes with high expression levels are more likely to be essential [38] and have increased duplicate retention rates [12]. These duplicates thus likely function to amplify gene dosage [22], which is supported by their tendency to be co-expressed [13]. Our analysis shows that in only relatively few cases do these duplicates serve as backup for the loss of gene function.