This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
One mechanism to account for robustness against gene knockouts or knockdowns is through buffering by gene duplicates, but the extent and general correlates of this process in organisms is still a matter of debate. To reveal general trends of this process, we provide a comprehensive comparison of gene essentiality, duplication and buffering by duplicates across seven bacteria (Mycoplasma genitalium, Bacillus subtilis, Helicobacter pylori, Haemophilus influenzae, Mycobacterium tuberculosis, Pseudomonas aeruginosa, Escherichia coli), and four eukaryotes (Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Mus musculus (mouse)).
In nine of the eleven organisms, duplicates significantly increase chances of survival upon gene deletion (P-value ≤ 0.05), but only by up to 13%. Given that duplicates make up to 80% of eukaryotic genomes, the small contribution is surprising and points to dominant roles of other buffering processes, such as alternative metabolic pathways. The buffering capacity of duplicates appears to be independent of the degree of gene essentiality and tends to be higher for genes with high expression levels. For example, buffering capacity increases to 23% amongst highly expressed genes in E. coli. Sequence similarity and the number of duplicates per gene are weak predictors of the duplicate's buffering capacity. In a case study we show that buffering gene duplicates in yeast and worm are somewhat more similar in their functions than non-buffering duplicates and have increased transcriptional and translational activity.
In sum, the extent of gene essentiality and buffering by duplicates is not conserved across organisms and does not correlate with the organisms' apparent complexity. This heterogeneity goes beyond what would be expected from differences in experimental approaches alone. Buffering by duplicates contributes to robustness in several organisms, but to a small extent – and the relatively large amount of buffering by duplicates observed in yeast and worm may be largely specific to these organisms. Thus, the only common factor of buffering by duplicates between different organisms may be the by-product of duplicate retention due to demands of high dosage.
Cells and organisms show a remarkable robustness against loss of one or more genes, which has triggered an ongoing discussion on the factors promoting such robustness [1,2]. One of the simplest and most obvious mechanisms for buffering is redundancy produced by gene duplicates [3,4]. Indeed, gene duplication is a major factor shaping prokaryotic and eukaryotic genomes [5-7]. Duplicate genes diverge in their sequence and function and may or may not have the ability to buffer for loss of the respective homolog. While processes other than buffering by duplicates play important roles in robustness against gene loss, e.g. use of alternative pathways [8,9], the relationship between essentiality and the existence of gene duplicates has attracted much attention, and previous work revealed an intricate picture.
For example, estimates of the role of duplicates as backups for gene loss vary widely within and across organisms. Most yeast genes are non-essential, i.e. dispensable, in rich medium or under standard laboratory conditions (>80%, ref. ). A study by Gu et al. attributes 23–59% of the dispensability (or survival) to buffering by gene duplicates , whereas other studies quote a much lower range (15–28%) [8,12-15]. Only 2% of gene pairs with a synthetic sick or lethal (SSL) mutant phenotype in yeast show detectable similarity [16,17], and amongst the ~20% of mouse genes examined to-date no buffering by duplicates has been observed [18,19].
Several molecular causes may underlie buffering by duplicates, and their relative contributions are still debated. For example, buffering duplicates lack functional redundancy that would be expected from their backup role. Buffering duplicates in yeast have only partially overlapping expression  and genetic interaction profiles , suggesting their functions have diverged. Alternative explanations for the bias against duplicates amongst essential genes have been suggested. For example, it may be disadvantageous for the cell to retain duplicates for genes with severe (lethal) knockout phenotypes because this may disrupt their finely balanced expression dosage . Further, the correlation between gene expression levels and existence of duplicates suggests buffering for gene loss may only be a by-product of processes that retain duplicates for dosage amplification [12,13,22,23].
Despite the availability of several large-scale datasets on single gene knockouts (KO) or knock-downs (KD) as well as double gene-KOs for all of these organisms, previous studies mainly focused on single organisms like yeast [8,11-14], worm  and mouse [18,19]. Major hindrances of a cross-organism comparison are differences in experimental approaches and the specific definition of essentiality used. The types and numbers of essential genes per organism are influenced by several factors: the mutational strategy (insertion, knockout (deletion) or knockdown), growth of the organism in clonal or mixed populations, life cycle stage of the organism, and, for multi-cellular organisms, whether the whole organism or simply a cell line was targeted. Selection pressure is more stringent in mixed than in clonal populations, and we expect lower survival rates in the former. For example, a mutant bacterium of decreased fitness may be selected against in a mixed population, but still be able to form an isolated colony. Insertion experiments may result in leaky expression compared to knockout or deletion experiments, and thus identify fewer essential genes. Finally, while RNAi experiments in worm have reasonably low false-positive and false-negative rates [25,26], we would still expect lower degrees of gene essentiality from this knockdown technique than from gene deletions.
To gain further insights into general principles of buffering by gene duplicates, we conducted a comprehensive cross-organism comparison of essentiality and its relationship to gene duplication, analyzing eleven prokaryotic and eukaryotic organisms: M. genitalium, H. pylori, H. influenzae, M. tuberculosis, P. aeruginosa, B. subtilis, E. coli, S. cerevisiae (yeast), C. elegans (worm), D. melanogaster (fly), and M. musculus (mouse). To do so, we addressed the above-mentioned challenges in several ways. When selecting essentiality datasets, we aimed to minimize variation in experimental approaches, and, whenever possible, sampled several organisms for a specific technique (Table 1). We tested different definitions of gene duplication, measures of expression levels, and (for yeast) robustness of the results against removal of genes of the whole-genome duplication [27,28] and ribosomal genes (Additional file 1). When assessing the contribution of duplicates to survival upon gene-KO/KD, we normalized by the number of essential genes. Differences in technical approaches certainly influence the extent of essentiality detected amongst organisms; however, if duplicates have a buffering role against loss of gene function then this effect should be observable regardless of the exact number of genes identified to be essential.
Our study reveals heterogeneity of essentiality and the contribution of duplicates to survival that goes beyond what is accountable for by technical differences. We show that organismal complexity and lifestyle, gene function, function similarity, sequence similarity or the number of duplicates per gene are only weak predictors of the buffering capacity – gene expression levels and related measures are the strongest correlates. Simple relationships with respect to essentiality and gene duplication hold true for some organisms, but not for others. Buffering by duplicates plays a significant but small and heterogeneous role.
If duplicate genes play a significant role in buffering against mutations, then genes with one or more paralogs should have higher chances of survival upon deletion than singletons. This simple relationship has been demonstrated for yeast and C. elegans, but not yet for other organisms. To test the generality of this prediction, we estimated families of homologous genes for eleven bacterial and eukaryotic organisms based on a BLAST sequence similarity search (E-value < 1.0e-10), and compared survival upon knockout (KO) or knockdown (KD) of genes from these gene families to survival upon KO/KD of singletons (Table 1). We estimate gene expression levels by use of the Codon Bias Index (Methods).
We define the effective family size D of a target gene as the number of duplicates remaining after KO or KD. D = 0 denotes singleton genes; D ≥ 1 denotes genes with paralogs. The probability P(D ≥ 1) is derived from the fraction of genes in a genome which have one or more duplicates (paralogs). We also use the probability P(S), which describes an organism's chances of survival upon gene deletion; P(S) is derived from the fraction of genes identified as dispensable (non-essential) in large-scale screens. When discussing 'buffering by duplicates' we mean the enrichment of duplicates amongst non-essential genes as inferred from statistical analysis. 'Essentiality/non-essentiality (survival)' is purely based on outcomes of experiments.
Table 1 and Figures 1 and 2 summarize our results with respect to survival and gene duplication across whole genomes. Most genomes in our dataset have relatively few essential genes; chances for survival upon loss of a single gene are high in both prokaryotes and eukaryotes (P(S) > 0.80), except for M. genitalium, H. influenzae and mouse (Figure 1A). Genes of high expression levels are more likely to be essential than genes of low expression levels (smaller P(S)); in about half (six) of the organisms the difference is significant (P-value ≤ 0.01).
In accordance with the expectation that more complex organisms tend to have more duplicate genes, the fraction of genes with duplicates (D ≥ 1) increases from M. genitalium and the other bacteria, to yeast and the three animals (Figure 1B). Compared to other organisms, mouse has a noticeable depletion of singleton genes (D = 0) relative to genes with duplicates. In five organisms, there is a significant increase in the fraction of duplicates (D ≥ 1) amongst highly expressed genes compared to other genes (P-value ≤ 0.01); an exception is B. subtilis in which the trend is inverted. When using Codon Adaptation Index or experimental expression data we obtain similar results (Additional file 1).
To assess the contribution of duplicates to survival following gene-KO/KD we define the buffering capacity C as C = P(S|D ≥ 1)/P(S|D = 0) – 1, where P(S|D = 0) is the probability of survival given the gene does not have additional duplicates, i.e. is a singleton. P(S|D ≥ 1) is the probability of survival given the gene has one or more additional duplicates. C is calculated for each organism and quantifies the increase in probability of survival upon gene-KO/KD for genes which have a duplicate in the genome.
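The definition of C can be made concrete with a small calculation. The following is a minimal sketch, assuming that the conditional probabilities are estimated as simple fractions of tested genes; the function name and the counts are hypothetical and are not taken from Table 1.

```python
def buffering_capacity(surv_dup, total_dup, surv_single, total_single):
    """Return C = P(S | D >= 1) / P(S | D = 0) - 1 from raw gene counts.

    surv_dup / total_dup: non-essential and all tested genes with D >= 1.
    surv_single / total_single: non-essential and all tested singletons (D = 0).
    """
    if total_dup == 0 or total_single == 0 or surv_single == 0:
        raise ValueError("need non-empty groups and at least one surviving singleton")
    p_s_given_dup = surv_dup / total_dup            # P(S | D >= 1)
    p_s_given_single = surv_single / total_single   # P(S | D = 0)
    return p_s_given_dup / p_s_given_single - 1.0


# Hypothetical counts for illustration only.
c = buffering_capacity(surv_dup=900, total_dup=1000,
                       surv_single=800, total_single=1000)
print(f"C = {c:.3f}")
```

A positive C means that genes with duplicates survive KO/KD more often than singletons; a negative C, as reported for M. genitalium and mouse, means the opposite.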
In nine of the eleven organisms, duplicates contribute significantly and positively to survival (P-value ≤ 0.05), with contributions ranging from 1 to 13% (Table 1, Figure 2). The exceptions are M. genitalium and mouse in which duplicates appear to decrease chances of KO survival. The extent of buffering by duplicates, i.e. the value of C, does not correlate with the organisms' complexity or genome size. Total C is largest in yeast, worm and H. pylori and smallest in H. influenzae, B. subtilis and fly. While the total number and fraction of genes with duplicates increases from simpler to more complex organisms (Figure 1B), the propensity of duplicates to buffer against gene loss varies independently.
Next we ask whether amongst genes with duplicates chances for buffering upon gene loss increase with high expression levels compared to low expression levels. In most of the organisms, there are significant differences in buffering capacity C amongst genes of low and high expression levels (P-value ≤ 0.05). However, only in five organisms (H. pylori, P. aeruginosa, E. coli, yeast, and worm), genes of high expression levels and with duplicates have significantly improved chances of survival; with C reaching 23% in E. coli. In M. genitalium and M. tuberculosis, C is positive amongst highly expressed genes when examining experimental expression data (Additional file 1); in B. subtilis and fly survival is generally very high and a distinction between genes of high or low expression does not have any effect.
These results are robust to various methods of paralog estimation, although exact numbers change depending on parameter settings. We tested, for example, different E-value cutoffs, different length requirements on the match region, and methods of homology estimation that are completely independent of particular E-value thresholds (Additional file 1).
Assuming that paralogs can take over the function of a deleted gene, one may hypothesize that chances of doing so increase i) with the number of paralogs present, and ii) their similarity to the mutant protein. We tested these predictions in the eleven organisms.
In only three organisms, P. aeruginosa, E. coli, and worm, do chances of survival correlate significantly (P-value ≤ 0.05) with both the number of duplicates available per gene and with the distance of the gene to the nearest homolog (R² ≥ 0.64 and R² ≥ 0.80, respectively; Table 1). These correlations have been observed previously in worm, but are not common amongst the organisms of our study. Yeast has a decent correlation with distance to the nearest homolog (R² = 0.72), but not with the number of duplicates per gene. These results do not change even when removing ribosomal genes or gene pairs originating from the whole-genome duplication, or when focusing on highly expressed genes (Additional file 1). Yeast is particularly enriched in two-gene families (D = 1) which buffer for each other (Additional file 1). Figure 3A shows these distributions for E. coli, yeast and worm.
We further tested C for genes in different groups of gene function, without finding strong biases (Additional file 1).
To better understand buffering by duplicates, we compared the properties of a subset of duplicates which are likely to buffer for each other's function to those which do not buffer for each other. In particular, we analyzed two-gene families which had been tested for both single- and double gene-KOs. Of course, members of larger gene families can also buffer for each other – however, it is more difficult to distinguish buffering genes from those with other functions. For two-gene families, if the double-KO of two non-essential genes is lethal, the two genes are likely to buffer for each other's function in single-KOs, i.e. we call them buffering duplicates. Despite the generally low contribution of duplicates to survival upon gene knockout, these two-gene families are paramount candidates for buffering. If a double-KO is viable, reasons other than the presence of a duplicate should explain their viable single-KO phenotype. We call these pairs non-buffering duplicates.
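The classification rule for two-gene families can be sketched as a small decision function. This is an illustrative sketch only: the phenotype labels, the function name and the extra categories for untested or inapplicable pairs are assumptions, and "lethal" for the double-KO stands in for a synthetic sick or lethal phenotype.

```python
def classify_two_gene_family(single_ko_a, single_ko_b, double_ko):
    """Classify a two-gene family from its single- and double-KO phenotypes.

    Phenotypes are the strings "viable" or "lethal" (hypothetical labels);
    double_ko may be None when the pair has not been tested.
    """
    # The rule only applies when both genes are individually non-essential.
    if single_ko_a != "viable" or single_ko_b != "viable":
        return "not applicable"
    if double_ko is None:
        return "untested"
    # Lethal double-KO of two dispensable genes: they likely back each other up.
    if double_ko == "lethal":
        return "buffering"
    # Viable double-KO: something other than the duplicate explains survival.
    return "non-buffering"


print(classify_two_gene_family("viable", "viable", "lethal"))
print(classify_two_gene_family("viable", "viable", "viable"))
```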
Amongst the ~300,000 yeast gene pairs tested for double-KO phenotypes in large- and small-scale screens, we identified 50 two-gene families with genetic interactions (buffering) and eight two-gene families with a viable double-KO phenotype (non-buffering). These two-gene families represent prime candidates for comparing characteristics of buffering and non-buffering duplicates, respectively. Table 2 and Additional file 1 describe their properties tested across and between the genes. There are also another 551 two-gene families in yeast which have not been tested in double-KO experiments; Additional file 1 describes their characteristics.
Both buffering and non-buffering two-gene families are defined by the same E-value threshold (1.0e-10, Methods); however, buffering genes have significantly higher sequence identity between the members (P-value < 0.05; Table 2). Buffering genes are also more conserved than non-buffering genes, i.e. have slower rates of evolution and more orthologs across organisms.
We examined the functional similarity between genes in the sets of pairs, testing whether buffering duplicates are more similar in their function than non-buffering duplicates. We find that genes in buffering two-gene families have mostly identical function descriptions, whereas descriptions for non-buffering genes are similar but not identical (Tables 3 and 4); however, this finding is only qualitative. To quantify functional distance, we measured the average shortest path between the genes in a network of functional relationships: buffering genes had slightly shorter paths between each other than non-buffering genes (not significant, Table 2), i.e. their functions are closer to each other. Other quantitative measures of gene function can be derived from the number and types of physical protein-protein interactions, functional interactions, genetic interactions or gene-KO phenotypes under various conditions. Buffering genes are more similar to each other than non-buffering genes in all these measures except for genetic interactions, although the trends are not significant (Table 2). The lack of similarity of genetic interaction profiles between buffering genes is consistent with recent findings by Ihmels et al., although these authors included epistatic interactions other than lethal double-KO phenotypes in their analysis.
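The shortest-path measure of functional distance can be illustrated with a breadth-first search. This is a minimal sketch, assuming an unweighted, undirected network stored as an adjacency dictionary; the toy network and gene labels are hypothetical, and skipping unconnected pairs is a choice made here for illustration.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def mean_pair_distance(graph, pairs):
    """Average shortest-path length over gene pairs; unconnected pairs are skipped."""
    lengths = []
    for a, b in pairs:
        d = shortest_path_length(graph, a, b)
        if d is not None:
            lengths.append(d)
    return sum(lengths) / len(lengths) if lengths else None


# Hypothetical toy network: a chain A-B-C-D plus an isolated gene E.
toy = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": []}
print(mean_pair_distance(toy, [("A", "D"), ("A", "B")]))
```

Comparing this average between the set of buffering pairs and the set of non-buffering pairs gives the kind of contrast summarized in the text.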
Buffering and non-buffering genes show clear differences in terms of transcriptional and translational regulation (Table 2). Buffering genes have higher mRNA and protein expression levels. Measures of translation efficiency, e.g. protein length, molecular weight, Codon Adaptation Index (CAI), or protein production rate, are significantly elevated in buffering genes compared to non-buffering ones (P-value ≤ 0.05); protein degradation is slightly decreased. Interestingly, some of these measures (e.g. length, CAI) are significantly more different between members of a buffering gene pair than between members of a non-buffering gene pair (Additional file 1).
We also extracted orthologs of the buffering and non-buffering yeast two-gene families in fly, worm and mouse using InParanoid. (None of the yeast genes had orthologs in E. coli.) If a buffering gene pair in yeast has a single-gene ortholog in another organism (without additional duplicates), we expect this ortholog to be essential more often than single-gene orthologs of non-buffering gene pairs. If an ortholog of a buffering two-gene family has paralogs, we do not expect it to be essential. Indeed, buffering gene pairs are enriched for essential single orthologs compared to non-buffering gene pairs, although the trend is very weak and not significant due to small numbers in the dataset (Table 5, P-value = 0.19; Additional file 1, P-value = 0.07). There are several examples of essential single orthologs of buffering gene pairs: HMG1 and HMG2 are isozymes of HMG-CoA reductase in yeast (Table 3) and their double KO phenotype is lethal. The genes have one ortholog in worm (F08F8.2) and one in mouse (HMG-CoAR, MGI96159) which both have embryonic lethal KO/KD phenotypes. SSF1 and SSF2 are yeast proteins required for ribosomal large subunit maturation (Table 3), and they have single essential orthologs in worm (K09H9.6, lpd-6) and fly (CG5786, Peter Pan).
For further validation, we extracted the 143 worm two-gene families tested in double-RNAi knockdowns, which consist of 16 pairs of synthetic sick or lethal (SSL) phenotypes, i.e. buffering duplicates, and 127 non-buffering duplicate gene pairs. Unfortunately, there are no experimental data available for worm genes to test for measures of transcriptional and translational efficiency. When calculating CAI for the worm sequences, we found a significant bias confirming the trend in yeast (Table 2). Buffering genes are more efficiently translated than non-buffering genes.
Noticeably, yeast is enriched for buffering gene pairs (50) vs. non-buffering gene pairs (eight) compared to worm (16 and 143-16 = 127, respectively). This bias holds true even if only regarding the yeast gene pairs identified in large-scale screens: ten buffering and eight non-buffering pairs. Previous work has shown that yeast is enriched for buffering gene pairs which originate from the whole genome duplication . In addition, RNAi-based screens in worms may miss synthetically lethal interactions and thus have a high false-negative rate amongst gene pairs found to be non-buffering.
Our study provides a systematic and semi-quantitative assessment of essentiality and gene duplication across eleven prokaryotic and eukaryotic organisms revealing a heterogeneous picture. To the best of our knowledge, this is the first such organism-wide comparison.
Chances of survival upon gene deletion are very high in most organisms (>80%), i.e. there are only few essential genes (Figure 1A). We observe some variation in survival that cannot be explained by experimental differences alone. The bacteria in our dataset have been analyzed with different experimental approaches (i.e. insertion vs. deletion, population vs. clonal study, Table 1). For example, screens of mixed populations with random gene insertions identify more essential genes than clonal studies, e.g. H. pylori, H. influenzae, and M. tuberculosis vs. P. aeruginosa, B. subtilis and E. coli (Table 1); however, there is no general trend.
The extremely high chances of survival in fly (Figure 1A) can be (in part) attributed to the use of a cell line rather than the whole organism and of RNAi knockdowns instead of full gene deletion, and may be an underestimate due to current technical limitations. However, in worm, the same technique, RNAi-KDs, on the whole organism also produced high survival rates, but a much higher contribution of duplicates to survival (see below).
The low chances of survival in mouse are likely due to the mouse dataset not originating from a large-scale screen, but from individual experiments that may have preferentially targeted and reported essential genes. For example, the gene targets in the mouse dataset are strongly enriched for orthologs of human disease genes (OMIM data, not shown); thus the dataset is biased. The lack of buffering by duplicate genes in mouse has been demonstrated recently [18,19]; however, with the availability of an unbiased large-scale essentiality screen in mouse these results may be refined.
The degree of gene essentiality (or degree of survival) can be influenced by the experimental technique and the definition of essentiality that is used. In contrast, if duplicates contribute to survival upon gene loss, then this effect should be detectable irrespective of the number of essential and non-essential genes identified (provided that the selection is unbiased). In other words, we expect buffering by duplicates to be less dependent on technical differences than essentiality alone. We introduced statistical tests to assess the significance of buffering by duplicates (Figure 2). A small P-value implies that duplicates are significantly enriched amongst non-essential genes compared to random and vice versa. Thus, for example, H. pylori has only few genes with duplicates (Figure 1B), but these duplicates exhibit a significant contribution to survival upon gene knockout (Figure 2). Likewise, B. subtilis and E. coli have similar degrees of gene essentiality (one examined by insertion, the other by knockout experiments), and similar fractions of duplicate genes, but very different contributions of these duplicates to survival.
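One way to test for enrichment of duplicates amongst non-essential genes is a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The text does not specify which test was used, so the following is only a sketch under that assumption; the function name and the toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_genes, n_dup, n_nonessential, n_dup_nonessential):
    """One-sided hypergeometric tail probability P(X >= n_dup_nonessential).

    n_genes: all tested genes; n_dup: genes with D >= 1;
    n_nonessential: genes that survive KO/KD;
    n_dup_nonessential: genes that have duplicates AND survive.
    Assumed test; the source does not name the statistic it used.
    """
    upper = min(n_dup, n_nonessential)
    total = comb(n_genes, n_nonessential)
    tail = sum(
        comb(n_dup, k) * comb(n_genes - n_dup, n_nonessential - k)
        for k in range(n_dup_nonessential, upper + 1)
    )
    return tail / total


# Toy example: 10 genes, 5 with duplicates, 5 non-essential,
# and all 5 non-essential genes have duplicates.
print(enrichment_p_value(10, 5, 5, 5))
```

Because the test conditions on the observed number of non-essential genes, it asks whether duplicates are over-represented amongst survivors regardless of how many genes a given screen calls essential, which matches the reasoning in the paragraph above.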
Duplicates significantly and positively contribute to survival in nine of the eleven organisms, but have noticeable effects only in six (>5%; H. pylori, M. tuberculosis, P. aeruginosa, E. coli, yeast, worm; Figure 2). Given that duplicates make up to 80% of eukaryotic genomes (Figure 1B), the small contribution is surprising and points to dominant roles of other buffering processes, such as rerouting metabolic flux (see ref.  for an example).
Buffering by duplicates is uncorrelated with organismal complexity. Buffering capacity varies widely amongst bacteria and eukaryotes, even when accounting for differences in experimental approaches (Table 1). M. genitalium, H. influenzae, B. subtilis, fly and mouse show low or even negative contributions of duplicates to buffering; H. pylori, yeast and worm show the highest. M. genitalium is a parasite with a small range of host- or tissue-specific living conditions and a very small genome (Figure 1). Its low rate of survival upon gene-KO could be explained by the low number of duplicate genes and the lack of condition-specific dispensability of genes which boost survival rates under normal conditions. However, the same reasoning could apply to H. pylori and H. influenzae which have genome sizes similar to M. genitalium and restricted living conditions, but have much higher survival rates and different buffering capacities of duplicates. Mouse represents an exception in the analysis by having relatively low survival rates (Figure 1A), a higher ratio of duplicates vs. singletons than other organisms (Figure 1B), but a negative contribution of duplicates to survival (Figure 2). As explained above, conclusions in mouse may be refined later.
Next we examined gene characteristics which have been suggested to influence buffering capacity. For example, we would expect duplicates of high sequence proximity (measured by E-value) to be more likely to buffer for loss of function than duplicates that diverged in their sequence. Similarly, we would expect genes with many duplicates (large gene families) to be more likely to be buffered for loss of function than genes of small families. Both expectations are fulfilled in only some of the organisms (Table 1), e.g. in the two most thoroughly studied organisms yeast and worm, but not in others.
Related to sequence similarity is function, which is more dissimilar amongst buffering duplicates than naively expected, when measured in terms of expression regulation and genetic interactions. When evaluating function similarity in terms of verbal descriptions, shortest path length in a network of functional relationships, and in terms of similarity of their KO-phenotype and physical interaction vectors, buffering genes were slightly (but not significantly) more similar to each other in function than non-buffering genes (Table 2). Thus, function similarity is also only a weak indicator of buffering capacity of duplicates.
The single best correlate of buffering capacity by gene duplicates (identified in our study) is expression level. Genes of high expression levels tend to have more duplicates, but these duplicates are also more likely to buffer for loss of the gene's function. (Note the subtle difference between the two observations.) The trend holds true for all organisms with positive buffering capacity (except for M. tuberculosis) and for different measures of expression levels (Additional file 1). For example, in highly expressed genes in E. coli, C increases to 23%. Likewise, buffering two-gene families in yeast have higher mRNA and protein abundance than non-buffering two-gene families, higher transcription and translation rates and smaller protein degradation rates (Table 2).
In sum, buffering by gene duplicates only plays a significant and visible role in robustness against gene loss in some organisms but not in others. Factors influencing such buffering are, in decreasing order of approximate importance, gene expression levels, sequence distance between duplicates, the number of duplicates available per gene, the gene's function and the type of organism and its lifestyle. Such ranking holds true despite differences in experimental approaches. The lack of consistency across organisms, lack of strong correlates and low extent of buffering by duplicates suggests that buffering by duplicates is indeed merely a by-product of other processes. Genes with high expression levels are more likely to be essential  and have increased duplicate retention rates [12,23]. These duplicates thus likely function to amplify gene dosage , which is supported by their tendency to be co-expressed . Our analysis shows that only in relatively few cases these duplicates serve as backup for the loss of gene function.
We obtained the amino acid sequences for eleven genomes (Mycoplasma genitalium; Bacillus subtilis; Helicobacter pylori; Haemophilus influenzae; Mycobacterium tuberculosis; Pseudomonas aeruginosa; Escherichia coli; Saccharomyces cerevisiae (yeast); Caenorhabditis elegans (worm); Drosophila melanogaster (fly); Mus musculus (mouse)) from a collection in the SUPERFAMILY database. Information on gene essentiality (lethal phenotypes upon single gene-KO or KD) was taken from publications [25,35,36,40-46]. Table 1 provides an overview of the number of genes tested in each organism (background set) and the number of genes identified to be essential. The table briefly describes the experimental strategy, as described in the publications and in the SEED database (http://theseed.uchicago.edu). All screens were conducted in rich medium and on whole organisms except for fly (cell line). For mouse, data of ~4,000 individual knockout experiments were obtained from the Mouse Genome Database.
To-date, large-scale double-KO/KD data is only available for yeast and worm. For yeast we compiled in addition to the original data published by Tong et al. [16,48] 13 datasets identified as 'systematic screens' in the BioGRID database [30,49-60]. In a parsimonious approach, we only included data on lethal phenotypes of double-KOs in our study and no other epistatic interactions. To calculate the background set of tested gene pairs, we paired the 204 bait genes identified in the 14 analyses with all non-essential yeast genes , resulting in ~300,000 tested pairs.
For worm we extracted data from two large-scale double KD screens [26,61], which comprise 52781 tested gene-pairs and 3927 genetic interactions. Another study in worm specifically targeted two-gene families with a single ortholog in yeast , and we used these pairs to investigate properties of two-gene families.
We measured similarity between all sequences using a BLAST all-against-all search, and required an E-value < 1.0e-10 for two genes to be predicted homologs. This E-value threshold was established in yeast and adjusted accordingly in organisms of very different genome size, e.g. in M. genitalium (1.0e-9) and worm (3.0e-10). This threshold identified 609 two-gene families in yeast. We tested several other methods of homology prediction including different E-value thresholds, E-value-independent methods and use of InParanoid, all with results qualitatively identical to those discussed here (Additional file 1).
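Grouping genes into families from pairwise BLAST hits can be sketched with a union-find structure. The text does not state how hits were merged into families, so single-linkage clustering is an assumption here, as are the function names and the toy hit list of (gene, gene, E-value) tuples.

```python
def gene_families(genes, hits, max_evalue=1e-10):
    """Single-linkage families from pairwise hits (assumed clustering rule).

    genes: iterable of gene identifiers; every gene named in hits must be listed.
    hits: iterable of (gene_a, gene_b, evalue) tuples from an all-against-all search.
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the family representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, evalue in hits:
        # Ignore self-hits and hits that do not pass the E-value threshold.
        if a != b and evalue < max_evalue:
            parent[find(a)] = find(b)

    families = {}
    for g in parent:
        families.setdefault(find(g), set()).add(g)
    return list(families.values())


def effective_family_size(families):
    """D for each gene: the number of duplicates left after removing that gene."""
    return {g: len(fam) - 1 for fam in families for g in fam}


# Hypothetical hits for illustration only.
toy_hits = [("a", "b", 1e-50), ("b", "a", 1e-48), ("c", "d", 1e-5), ("a", "a", 0.0)]
fams = gene_families(["a", "b", "c", "d"], toy_hits)
print(effective_family_size(fams))
```

Under this rule, a two-gene family is any family of size two, so both of its members have D = 1 and all singletons have D = 0.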
As a surrogate for gene expression levels, we calculated the Codon Bias Index (CBI) for each gene using the CodonW server, with standard settings and parameters for the respective organism. We also calculated the Codon Adaptation Index (CAI). However, since it requires a reference dataset of expressed genes (which was not always available), we consider CAI a less appropriate measure than CBI. Both measures are expected to work less well in multi-cellular organisms due to tissue-specific expression which may not be captured by these sequence features. For further validation, we extracted from literature experimental expression data for all organisms except H. pylori. Results for CAI and experimental expression data are in Additional file 1. For the results in Figures 1 and 2, we rank-ordered the CBI values within each genome and selected subsets of genes with the highest or lowest CBI; the sizes of the subsets varied according to the organism's genome size. See Additional file 1 for details.
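The rank-ordering step can be sketched as follows. The subset sizes used in the study are given only in Additional file 1, so the `fraction` parameter, the function name and the CBI values below are hypothetical.

```python
def expression_subsets(cbi_by_gene, fraction=0.1):
    """Return (lowest, highest) subsets of genes after ranking by CBI.

    cbi_by_gene: dict mapping gene identifier to its Codon Bias Index.
    fraction: share of the genome placed in each subset (hypothetical value).
    """
    ranked = sorted(cbi_by_gene, key=cbi_by_gene.get)  # ascending CBI
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


# Made-up CBI values for five genes.
cbi = {"g1": 0.1, "g2": 0.9, "g3": 0.5, "g4": 0.3, "g5": 0.7}
low, high = expression_subsets(cbi, fraction=0.2)
print(low, high)
```

Survival and buffering capacity can then be computed separately for the two subsets to compare genes of low and high predicted expression.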
In yeast, 50 two-gene families were identified as buffering (SSL phenotype) and eight two-gene families as non-buffering (viable phenotype). The buffering pairs consist of nine pairs identified in the 14 large-scale double-KO screens (see above), and 42 additional pairs identified in small-scale experiments and listed in BioGRID. The non-buffering pairs originate from pairs tested in the 14 large-scale screens and found to have viable phenotypes. Table 2 describes characteristics compared between the two members of a gene family and characteristics of individual genes, averaged across the whole set. For vector comparisons, we constructed binary vectors (1 = observation, 0 = no observation) based on networks of functional interactions, genetic interactions (see description of datasets above), physical interactions (extracted from BioGRID), and single gene-KO phenotypes. The similarity between two vectors is measured as the percentage of shared positive interactions (Jaccard Index). More results are in Additional file 1.
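For concreteness, the Jaccard Index on two binary vectors can be computed as in the sketch below. The function name and the toy vectors are illustrative. Returning 0.0 when neither vector has any positive entry is an assumption, since the text does not say how that case was handled.

```python
def jaccard_index(u, v):
    """Jaccard Index of two equal-length binary vectors.

    Counts positions where both vectors are 1 and divides by the
    number of positions where at least one vector is 1.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    shared = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(u, v) if a == 1 or b == 1)
    # Assumption: two all-zero vectors get similarity 0.0
    return shared / either if either else 0.0


# Toy interaction profiles for the two members of a hypothetical gene pair
gene_a = [1, 1, 0, 0, 1]
gene_b = [1, 0, 0, 1, 1]
similarity = jaccard_index(gene_a, gene_b)  # 2 shared / 4 in either = 0.5
```

The function returns a fraction between 0 and 1; multiplying by 100 gives the percentage form used in the text.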
As a control for the effects of WGD genes, we also compared some characteristics in all 609 yeast two-gene families, split into 108 and 501 two-gene families with and without evidence for their origin in the WGD, respectively (Additional file 1). As another control, we extracted the 143 worm two-gene families that were identified and tested by Tischler et al. and calculated codon adaptation indices (Additional file 1). Results from these controls are consistent with those from the yeast analysis.
We used the FunSpec server and SGD for yeast protein function annotation. The SUPERFAMILY database was used for annotation of ribosomal proteins in yeast. Genes originating from the whole-genome duplication were taken directly from the published paper. Characteristics described in Table 2 are obtained from the sources quoted in the table and in Additional file 1. For the ortholog analysis described in Table 5, we extracted information from InParanoid and mapped it against the gene essentiality data described above. Information on yeast two-gene families is presented in Additional file 2.
CAI: Codon Adaptation Index; CBI: Codon Bias Index; D: effective gene family size (number of additional gene duplicates); E-value: expectation value; KD: knockdown; KO: knockout; MIPS: Munich Information Center for Protein Sequences; P(S): probability of survival upon single- or double-gene KO or KD; R^2: squared Pearson correlation coefficient; SGA: Synthetic Genetic Array; SSL: synthetic sick or lethal (mutant); SGD: Saccharomyces Genome Database; WGD: whole-genome duplication.
Organisms: M. genitalium: Mycoplasma genitalium; H. pylori: Helicobacter pylori; H. influenzae: Haemophilus influenzae; M. tuberculosis: Mycobacterium tuberculosis; Paer: Pseudomonas aeruginosa; B. subtilis: Bacillus subtilis; E. coli: Escherichia coli; S. cerevisiae: Saccharomyces cerevisiae (yeast); C. elegans: Caenorhabditis elegans (worm); D. melanogaster: Drosophila melanogaster (fly); M. musculus: Mus musculus (mouse).
KH conducted the experiments, analyzed results and wrote the paper. EMM provided valuable input and support at all stages of the project. CV initiated and guided the project, conducted some of the experiments, analyzed results and wrote the paper. All authors read and approved the final manuscript.
Supplementary Notes. Additional figures and comments on the analyses.
Supplementary Data. Data on yeast gene pairs collected during the analyses.
We are most grateful to E Levy for several useful discussions. We also thank J Pereira-Leal, M Tsechansky, and SL Wong for their help at various stages of the project. CV acknowledges support by the International Human Frontier Science Program. EMM acknowledges support by NSF, NIH, Welch (F15-15) and the Packard Foundation.