RNAi feeding assay
Query-target gene pairs were tested for interaction by feeding target gene RNAi to worms with a mutation in the query gene. RNAi cultures were grown in 100 μg/ml LB Amp overnight at 37°C. 40 μl of culture was placed on each well of 12-well plates containing 3.5 ml NGM [62
] supplemented with 105.6 μg/ml carbenicillin and 1 mM isopropyl-beta-D-thiogalactopyranoside (IPTG). Plates seeded with bacteria were dried overnight at room temperature and for 40 min in a flow hood. Two stage L3–L4 worms (N2, egl-15
)) were placed in each well of a 12-well plate using a COPAS BioSort worm sorter (Union Biometrica, Holliston, MA). Worms were grown at 20°C (egl-15
)) or at 16°C (glp-1
)). The following controls were grown in each experiment. As a positive control for RNAi efficiency, wild-type (N2) worms and the query mutants were fed pop-1
). As negative controls for background growth levels, N2 worms were fed target RNAi and query mutants were fed L4440 mock-RNAi.
Typically, one person can prepare and process experiments with four worm strains fed 384 RNAi-inducing bacterial strains in triplicate over the course of two weeks. Overlapping sets of experiments of similar size can be prepared while the worms in the first experiment are growing, resulting in an average throughput of 1,920 genetic tests per week per person.
Analysis of the distribution of functional categories within the LGIII set
Within the LGIII set of genes, there are 203 genes annotated with at least one GO biological process. These genes represent 280 unique GO Process 1000 categories. One thousand samples from the C. elegans genome of 203 genes with at least one GO biological process were then chosen randomly. The random set has a mean of 322.5 unique GO Process 1000 categories with a standard of deviation of 32.8. Compared to the random set, there is no significant difference in the number of unique GO processes in the LGIII set (z-score = -1.298; p = 0.097 after Bonferroni correction). Furthermore, of the 280 unique GO biological processes in the LGIII set, only 18 are significantly enriched (p > 0.01) in the LGIII set, and all of these are represented by only one (12 processes), two (four processes) or three (two processes) genes (see Additional data file 2).
Scoring query-target interactions
The number of progeny counted in a well that resulted from each query-target pair and control combination was counted and recorded as growth scores. A well with no progeny was given a growth score of zero, whereas a well overgrown with progeny was given a growth score of six. Growth scores of 1 to 5 were assigned to wells with increasing numbers of worms (1, 1–10 progeny; 2, 11–50 progeny; 3, 51–100 progeny; 4, 101–200 progeny; 5, 200+ progeny). From pilot experiments performed by two independent investigators, we found that worm populations can be quickly and reliably binned into these categories. We took several counts of the same maturing population over the course of several days. Each query-target pair and its two controls were tested in at least three rounds. Experiments suspected of contamination were flagged as suspect and repeated. Counts obtained in a round were annotated with confidence scores of 0, 1, or 2, reflecting whether they were suspect, not suspect, or resulted from a second attempt, respectively. A large fraction of all experiments was digitally archived using a high-throughput digital imager [63
Determination of interactions from growth scores
Let G(Q, T,i,j) be the growth score for the (Q,T) query-target pair on the jth day of round i. For each query-target pair, two growth score differences were calculated: 1, Dnull(i,j) = G(Q,null,i,j) - G(Q,T,i,j), the difference between the experimental population (query mutant; target RNAi) and the mock RNAi vector control (query mutant; L4440 RNAi); and 2, Dwt(i,j) = G(wt,T,i,j) - G(Q,T,i,j), the difference between the experimental population and the wild-type control (N2; target RNAi). The following sequential rules were used to call a (Q,T) pair an interaction:
For round i, its jth day's counts were called 'deviant' if both Dwt(i,j) and Dnull(i,j) were at least d.
A round's set of counts was labeled 'positive' if at least e of its days were found to be deviant (e = 1 or 2) or a majority of its days were deviant (e = 0).
A (Q,T) pair was then called an interaction if at least s of its rounds were positive (s = 1 or 2) or a majority of its rounds were positive (s = 0).
Three additional criteria were used to determine how counts from suspect rounds were treated:
Suspect rounds were excluded from the analysis if the confidence score was less than a threshold c (c = 0, 1, or 2).
Counts derived from suspect rounds were removed if a second attempt was conducted as long as the parameter r was set; if r was not set, all counts were retained.
Suspect rounds were included to bring the total number of rounds to a minimum of m (m = 1 or 2).
Generation and comparison of network variants
We applied all combinations of the above criteria to generate 51 unique network variants. All interacting pairs within a network variant were query-target pairs that had satisfied all of the criteria imposed by the variant. For example, in a variant with the following criteria: d = 3, e = 1, s = 2, r = 1, c = 0, and m = 2, all query-target pairs that were called interacting were found in at least two (s = 2) positive rounds that had at least one deviant day (e = 1), for which the difference between the growth scores of the experimental population and the control populations was at least three (d = 3). If any round was considered suspect and the experiment for that round had been repeated, only growth scores from the second attempt were used (r = 1). Otherwise, rounds with all levels of confidence were used (c = 0). If fewer than two rounds of data were available for a specific query-target pair, data from additional rounds were included, so that at least two rounds of data were available, starting from the most confident rounds (m = 2).
To compare network variants, we identified pairs of genes within each variant that share a GO biological process classification [26
]. Only categories with fewer than 1,000 genes were considered. We calculated 'recall' and 'precision' for each variant, V, as:
Recall (V) = (number of co-classified interacting pairs in V)/(number of possible co-classified pairs) and
Precision (V) = (number of co-classified interacting pairs in V)/(number of interacting pairs in V)
We estimated the significance of the degree to which each network linked genes in the same GO biological process category using the hypergeometric distribution. The hyper-geometric distribution takes into account the number of co-classified interacting pairs in each variant relative to the size of the variant, the total number of all possible co-classified gene pairs, and the total number of gene pairs tested, and is thus a measure of the significance of both the recall and precision of a variant.
Clustering of interaction strengths
An interaction strength (IS) was calculated so that target and query genes could be clustered on the basis of their interaction profiles. The IS measures the average difference between the experimental and control populations of worms. For interacting pairs, we averaged Dwt(i,j) and Dnull(i,j) using only days and rounds passing criteria 3 to 6. For pairs considered non-interacting, all rounds that passed criteria 4 to 6 were included in the computation. The final interaction strength for a particular query-target pair was calculated as:
where 1(i) was 1 if round i passed the above criteria and was 0 otherwise, h is the total number of rounds that passed the criteria, and ni is the number of days in round i. IS represents the average growth score for a query-target pair calculated over its valid data.
Target and query genes were clustered on the basis of their interaction strengths. Hierarchical agglomerative clustering was run using Cluster 3.0 [65
] on both the target and query dimensions using average linkage as the cluster similarity metric and uncentered Pearson correlation as the IS
profile similarity metric, respectively. Individual target gene clusters were defined by cutting the hierarchical tree at a height of 0.4. The degree to which each cluster contained genes assigned to the same gene functional category was measured using the hypergeometric distribution and a significance cutoff of P
Gene functional categories
We searched for common functional annotation present in clusters of genes generated by the hierachical clustering. To do so, we collected several datasets of gene functional categories described for C. elegans
genes specifically as well as for predicted C. elegans
orthologs from other organisms. We collected C. elegans
gene categories from GO [26
] (downloaded from [67
] on 17 January, 2007) and KEGG [68
] (downloaded from [69
] on 13 June, 2005). We restricted GO process categories to those containing 1,000 genes or fewer. Annotations implied by the 'is-a' or 'part-of' subsumption GO hierarchies were automatically added. We also collected S. cerevisae
gene pathways from MIPS [70
] (downloaded on 12 May, 2002) and H. sapiens
gene pathways from BioCarta [71
] (downloaded on 13 June, 2005). For the MIPS and BioCarta datasets, we found the predicted C. elegans
ortholog for each gene in a pathway by identifying the reciprocal best match protein using the BLASTP program [72
]. All of the categories with their associated genes can be found in Additional data file 12.
Construction of the query network
Pairs of query genes found to interact with a significantly similar set of target genes were connected by 'congruent links' as defined by Tong et al
] and Ye et al
]. The P
value of the overlap of k
target genes of a query gene pair (A,B) was determined using the hypergeometic distribution:
where K is the number of target genes linked to query gene A, n is the number of target genes linked to query gene B, and N is the number of tested target genes. A P value cutoff of p < 10-9 yielded a total of 16 congruent links.
Testing the correlation of target hubs with RNAi phenotype
We tested whether targets with high degree (those linked to many query genes) have an increased tendency to produce a strong phenotype when targeted by RNAi compared to targets with low degree (those linked to few query genes). The phenotype data of Kamath et al
] were used. We define a strong phenotype as any of the following: Emb (embryonic lethal), Ste (sterile), Let (lethal), Lva (larval arrest), Lvl (larval lethal), or Adl (adult lethal). Our null hypothesis is that the degree of a target gene is not correlated with strong RNAi phenotypes. Under the null hypothesis, we expect to find an equal proportion of strong RNAi phenotypes among targets with any degree. We quantified the difference between the observed and expected number of target genes with a strong RNAi phenotype for each degree using a chi-square test with 10 degrees of freedom (one less than the number of query genes).
Comparing the network properties of the SGI and SGA genetic networks
To measure topological network properties of the SGI and yeast SGA genetic-interaction networks, we used the program tYNA [73
] to analyze the variance of the SGI and yeast SGA network properties. The resulting standard errors of the mean for the SGI network parameters are reported in the text.
Construction of the co-phenotype network
A co-phenotype network was created by linking genes with similar loss-of-function phenotypes detected in recently published high-throughput RNAi screens [3
]. An RNAi phenotype compendium was assembled by compiling the results of three genome-wide RNAi studies: 31 phenotypes scored for 1,472 RNAi from the Kamath et al
] dataset; 25 phenotypes scored for 1,486 RNAi from the Simmer et al
] dataset; and 26 phenotypes scored for 1,066 RNAi from the Rual et al
] dataset. Several phenotypic annotations in the datasets were converted to provide a uniform terminology that allowed the three datasets to be integrated. These conversions included labeling brood counts scored as '1–5' and '6–10' as 'Ste'; relabeling 'Prz' as 'Prl'; relabeling 'Lvl' as 'Let'; and labeling any embryonic lethal percentages over 10% as 'Emb.' In total, 37 phenotypes scored across 2,327 unique RNAi experiments were collected from the three studies and recorded in a 2,327 × 37 RNAi phenotype matrix, K
. Each entry in the matrix, Kiv
was set to 1 if RNAi against gene i
produced phenotype v
in one of the three studies and was set to 0 otherwise. Each row in the matrix is referred to as a gene's RNAi phenotype profile.
We devised a measure of phenotypic similarity motivated by the uncentered Pearson correlation coefficient (phenotypic PCC) approach of Gunsalus et al
]. However, we chose not to use the phenotypic PCC as it can produce false-positive links between genes with a high correlation that is based on a single (or even a few) shared common phenotype(s) when the two genes fail to produce phenotypes in all (or many) of the other phenotypes. Inspection of the compiled RNAi phenotype dataset reveals thousands of gene pairs that result in such spurious, yet perfect, correlation. In addition, phenotypic PCC will result in false negatives due to low correlations between genes that share several rare phenotypes but that differ in only a few others.
We suggest that a good measure of similarity should give more weight to rare phenotypes as opposed to common phenotypes shared between genes because infrequent phenotypes will co-occur less often in two genes by chance. Furthermore, the similarity between two genes should increase if both do not produce a very common phenotype when genes are targeted by RNAi. We calculated a loss-of-function agreement score, LOFA, for two genes i and j, that captures these intuitions, defined as:
where fv is the frequency of phenotype v across the genome and Kiv is the (i,v)th entry from the RNAi phenotype compendium matrix as described above. If RNAi produces phenotype v in two genes, the LOFA score is increased by -log(fv). The boost is larger for more infrequent phenotypes. For example, a phenotype that occurs in 1 out of 100 genes will increase the score by 2 units, whereas a phenotype that occurs in 1 out of 10 genes will contribute only 1 unit of score. The LOFA's second term gives a bonus to two genes if they both do not share a common phenotype in an analogous fashion.
The LOFA and phenotypic PCC measures of similarity were compared by measuring their ability to predict genes of related function. For each score, we constructed networks induced by using a cutoff above which genes were considered to be functionally related. We first varied the LOFA score cutoff from high to low, producing 51 networks of increasing size. Similarly, 51 networks of increasing size were produced for phenotypic PCC by lowering the phenotypic PCC cutoff. The precision of each network was measured by calculating the fraction of linked genes found to be annotated with a common GO category. Precision levels were then plotted against the network size. LOFA was found to be superior to phenotypic PCC for connecting genes of related function as it produced substantially higher precision levels than phenotypic PCC for every network size (Additional data file 13).
A final co-phenotype network was constructed by linking genes exhibiting significant levels of agreement. The significance of the LOFA score was assessed by generating 3 million random LOFA values. We first constructed a random dataset in which the genes associated with loss-of-function phenotype v in the RNAi phenotype compendium were permuted. This was repeated for each phenotype to produce one permuted dataset from which 100,000 random pairs were then picked and LOFA was calculated. We repeated this procedure for 30 different permuted datasets. We found that a cutoff of 7.0 was equivalent to an estimated significance level of 0.001, as approximately 100 LOFAs computed from random datasets exceeded this value on average in each of the 30 permuted trials.
Construction of the transposed SGA network and the interolog network
We constructed the transposed SGA network of synthetic genetic interactions from those interactions described in [12
] by mapping each yeast gene to its predicted worm ortholog(s). Maps were created containing all gene pairs with BLASTP significance values of p
or better [72
]. For interactions between yeast genes with multiple predicted worm orthologs, transposed interactions were created for all combinations of predicted orthologs.
The interolog network was created from eukaryotic protein-protein interactions reported in BioGRID [41
]. All interactions assembled from organisms other than C. elegans
were mapped to predicted worm ortholog pairs using BLASTP with a significance cutoff of p
Construction of permuted networks
To gauge the significance of various network properties, 1,000 randomly permuted networks were constructed for each data type. Permuted SGI networks were created by combining permuted signaling and LGIII networks. A link in each of these networks associates one query gene with one RNAi target gene. The permuted SGI networks link each query gene to a random set of target genes by randomly picking genes from the entire set of target genes tested in the screen. The number of target genes linked to each query was held fixed in the permuted networks to preserve the degree distribution across query genes. We also created permuted Lehner et al
] networks, yeast SGA networks, and protein-interaction networks using this method. Permuted coexpression, co-phenotype, and fine genetic networks were created by randomly linking genes present in each network. Random superimposed networks were created by taking the union of all links from the permuted networks obtained from the separate data types.
Determination of the significance of the number of supported links
The significance of the number of supported links (gene pairs linked by more than one data type) in the superimposed network was estimated by comparing the observed number of supported links to the number of supported links in 1,000 randomly permuted superimposed networks. Significance was calculated with a standard Z-score transformation using the mean and standard deviation of the number of supported links across the random networks. The significance of the overlap of two data types was estimated in a similar manner.
Identification of gene subnetworks
We identified subnetworks, defined as small- to medium-sized groups of possibly overlapping genes, by searching for densely connected sets of genes in individual networks and in the superimposed network using MODES [74
]. We used MODES parameter settings such that a subnetwork must have at least 50% connectivity, cannot overlap any other subnetwork by more than half of its genes, and must contain a minimum of four genes.
A connectivity significance score was assigned to each subnetwork based on the number of links connecting each of its members. The connectivity significance score for a subnetwork containing n genes was calculated as a standard Z-score (l - m)/s where l is the observed number of links in the subnetwork, and m and s are the mean and standard deviation of the number of links across 1,000 random collections of n genes.
As a post-processing step, any gene that was not grouped into a subnetwork by MODES was iteratively considered for addition to each subnetwork. To achieve this, a hierarchical clustering merge step was performed on all such genes across all subnetworks, using the connectivity score as the basis for a similarity metric. At each step in the clustering, the gene/subnetwork pair with the largest increase in connectivity score was combined. The connectivity score increase was calculated as the subnetwork's connectivity score upon addition of the gene minus its connectivity score before the addition of the gene.
Broad subnetworks were identified in single-data-type networks using the VxOrd algorithm [40
]. VxOrd clusters a network of genes on a two-dimensional surface using multidimensional scaling [75
]. The links between genes are treated as spring constants and a configuration of the springs is sought that minimizes the total free energy of the system. The result is a collection of genes arranged on the X
plane. We partitioned the genes into clusters using the dense subregions obtained from two-dimensional density estimation over a grid superimposed on the X
plane. We formed clusters of genes in contiguous regions whose densities were at least 10% of the maximum density and matched a minimum area cutoff.
Characterization of multiply supported subnetworks
Each subnetwork identified in the superimposed network was inspected to determine which types of data significantly link its gene members. For each subnetwork, the significance of the number of links of a specific data type that connected two genes within the subnetwork was calculated using the connectivity significance score (see previous section). Subnetworks were annotated as enriched for a data source if the connectivity score had an associated P value of 0.01 or less.
The bar-1 module was identified in a search for multiply-supported subnetworks within an earlier version of the superimposed network. The links within the subnetwork were updated using the same data as reported in the current subnetwork. This resulted in the addition of two links to the module: an interolog interaction between efl-1 and lin-35 and a Lehner interaction between ubc-18 and lin-35.
Nile Red analysis
L4 parental worms were placed on NGM plates seeded with RNAi or mock-RNAi bacteria and 0.015 μg/ml Nile Red. L4 F1 and F2 progeny were analyzed by fluorescence microscopy for Nile Red intensity. To quantify Nile Red intensity, Openlab software (Improvision, Lexington, MA) was used to calculate mean fluorescence within a measured area as well as the length of the worm. Nile Red intensity was calculated as: mean fluorescence × area/length of worm.
Identification of significantly bridged subnetwork pairs
All pairs of subnetworks derived from the coexpression, co-phenotype, and interolog networks were inspected for significant bridging by SGI links. An SGI link is considered to bridge a pair of subnetworks if it connects a gene in one subnetwork to a gene in another subnetwork. The total number of bridges was counted for each pair of subnetworks. The significance of the number of bridges for each subnetwork pair was then determined with a standard Z-score transformation using the mean and standard deviation of the number of bridges between that subnetwork pair in 1,000 randomly permuted SGI networks (see Additional data file 14 for evidence that a normal approximation in the Z-score transformation is valid). In addition to a cutoff of P < 0.01, a subnetwork pair was required to have at least three bridges to be considered significantly bridged.
Estimation of the significance of the number of bridged subnetwork pairs
We estimated the significance of the number of significantly bridged subnetwork pairs by comparing to the number of pairs significantly bridged by permuted SGI networks. Each of the 1,000 randomly permuted SGI networks was used to search for significantly bridged subnetwork pairs using the same method described above for the true SGI network. The mean and standard deviation of the number of significantly bridged subnetwork pairs were then calculated across all permuted networks. The number of subnetwork pairs significantly bridged by the SGI network was then compared to these values using a standard Z-score transformation to obtain a single significance value.
Determination of bridging propensities
To measure the propensity for a given data type to bridge subnetworks more than expected by chance, we restricted our analysis to all subnetwork-to-subnetwork links (SSLs). We defined an SSL as a linked gene pair (A,B) in which both A and B were included in at least one broad subnetwork of any data type. Over all SSLs we counted the number of 'supports', those links in which genes A and B occurred in the same subnetwork, as well as 'bridges', those links in which A and B occurred in separate subnetworks. Links that both bridge and support were counted as supports. The 'bridging fraction' was then calculated as the total number of bridges divided by the total number of SSLs. The observed bridging fraction was calculated using all SSLs in the network. The expected bridging fraction was calculated using all SSLs tested in the dataset. To measure the tendency for a given data type to link across versus within broad subnetworks, we calculated the 'bridging propensity' as the observed bridging fraction divided by the expected bridging fraction, minus 1. Positive bridging propensities are indicative of a link type tending to bridge (as opposed to fall within) broad subnetworks more than expected by chance.
Determination of the degree of subnetwork bridging conservation
To determine if the same subnetwork pairs were bridged in worm and yeast, we identified significantly bridged subnetwork pairs separately in each species. We used a compendium of SGI and Lehner et al
] interactions for worm, and transposed SGA links for yeast. We examined all pairs of subnetworks and broad subnetworks separately. We calculated the expected number of bridges as the number of possible (tested) gene pairs between the subnetworks times the probability of linking a gene pair for that data type. An estimate of the probability of a data type linking a gene pair was calculated as the number of links in its network divided by the number of possible (tested) links. This yielded an estimated background probability of 0.039 for worm, and 0.034 for yeast.
To determine the degree of subnetwork bridging conservation among all possible pairs of subnetworks, we created contingency tables containing the observed and expected number of subnetwork pairs significantly bridged only in worm, only in yeast, in both, and in neither. The expected number of pairs for each of these four categories was then calculated, assuming independence of worm and yeast bridging. We first calculated the worm bridging probability, Pw (Py for yeast), as the number of bridged subnetwork pairs divided by the total number of pairs, N. The expected number of subnetwork pairs bridged only in worm was then calculated as NPw (1 - Py). Likewise, the expected number of bridged pairs only in yeast was calculated as N (1 - Pw) Py. The expected number of bridged pairs in both species was calculated as NPw Py. Finally, the expected number of pairs bridged by neither was N (1 - Pw)(1 - Py). We used a chi-square test with 3 degrees of freedom to determine if the observed and expected counts for each of these categories were significantly different.