The SinkSource algorithm can be understood via the following physical analogy. We consider the PPI network to be a flow network in which each edge is a pipe whose weight denotes the amount of fluid that can flow through the pipe per unit time. Each node has a reservoir of fluid. We maintain the level of the reservoir at each HDF at 1 unit and at each non-HDF at 0 units, and we let fluid flow through this network. At equilibrium (when the amount of fluid flowing into each node equals the amount flowing out), the reservoir height at each node denotes our confidence that the node is an HDF. Our approach is reminiscent of the FunctionalFlow algorithm developed for predicting gene functions, with one crucial difference. The FunctionalFlow algorithm does not use negative examples, permitting the reservoir level at a node to increase without bound. Hence, the algorithm stops after a user-specified number of phases. In contrast, our algorithm converges to a unique solution.
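The reservoir analogy can be made concrete with a small fixed-point iteration. The sketch below is not the authors' implementation. It assumes an undirected edge list with positive weights, and it assumes that the net flow along a pipe is proportional to its weight times the difference in reservoir levels, so that equilibrium at an unlabeled node means its level equals the weighted average of its neighbors' levels. The function name `sink_source`, its parameters, and the stopping tolerance are illustrative choices.

```python
def sink_source(edges, positives, negatives, tol=1e-9, max_iter=10000):
    """Illustrative fixed-point sketch of the reservoir analogy.

    edges: iterable of (u, v, weight) for an undirected network
           (weights assumed positive).
    positives / negatives: sets of nodes clamped at 1.0 / 0.0.
    Returns a dict mapping each node to its equilibrium level.
    """
    # Build a weighted adjacency map.
    neighbors = {}
    for u, v, w in edges:
        neighbors.setdefault(u, {})[v] = w
        neighbors.setdefault(v, {})[u] = w

    # Clamp labeled nodes; unlabeled nodes start at 0.
    level = {n: 0.0 for n in neighbors}
    for n in positives:
        level[n] = 1.0
    for n in negatives:
        level[n] = 0.0
    free = [n for n in neighbors if n not in positives and n not in negatives]

    for _ in range(max_iter):
        largest_change = 0.0
        for n in free:
            total_weight = sum(neighbors[n].values())
            # Zero net flow at n: its level is the weighted average
            # of the levels of its neighbors.
            new = sum(w * level[m] for m, w in neighbors[n].items()) / total_weight
            largest_change = max(largest_change, abs(new - level[n]))
            level[n] = new
        if largest_change < tol:
            break
    return level
```

On a four-node path with an HDF at one end and a non-HDF at the other, the two interior nodes settle at levels of 2/3 and 1/3. In this sketch, nodes with no path to any labeled node simply keep their starting level of 0.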
We applied seven prediction algorithms to the HDF data in the context of a human PPI network integrated from seven public databases (see “Data and Algorithms”). The algorithms were the SinkSource algorithm; a variant called SinkSource+ that does not need negative examples; the commonly used guilt-by-association approach, both with and without negative examples (called Local and Local+ in this work); a method based on Hopfield networks; the FunctionalFlow algorithm; and another flow-based approach called PRINCE. Guilt-by-association, Hopfield networks, and FunctionalFlow have been proposed to address the problem of gene function prediction. PRINCE is an approach to prioritize disease-related genes; we selected PRINCE since it outperformed many other methods for predicting disease-related genes, including cluster- and neighborhood-based algorithms. We applied the algorithms to four sets of positive examples: the HDFs in the Brass et al. study (B), the HDFs in the Konig et al. study (K), the HDFs in the Zhou et al. study (Z), and the union of these three sets (BKZ). We restricted these sets to those proteins that participated in at least one interaction in the human PPI network. We used an unweighted version of the network for all results below.
Combining the Brass, Konig, and Zhou datasets improves cross-validation results
The cross-validation figure displays the results of two-fold cross validation for the six algorithms tested on four datasets. Two-fold cross validation involves splitting the positive and negative examples into two halves, and using each half to make predictions for the genes in the other half. We used two-fold cross validation since we felt it better mimics our state of knowledge of HDFs than the more commonly used five-fold or 10-fold cross validations. We averaged the results over 10 independent runs for each algorithm-dataset combination. For each algorithm, it is evident from the figure that the area under the precision-recall curve (AUPRC) value for the BKZ dataset is larger than the values for the B, K, or Z datasets. It is also clear that these results are robust to the randomization inherent in cross validation: the largest standard deviation in the AUPRC values is 0.033 (as indicated by the error bars in the figure and the data in Table S1). Separate panels display the precision-recall curves for SinkSource and for SinkSource+ on the four datasets. The results for SinkSource+ were obtained with an internal parameter λ set to a value of 1 (see “Other Algorithms” for the role played by this parameter in the SinkSource+ algorithm). In each of these panels, we observed that the curve for the BKZ dataset dominated the other three curves at most values of recall. This result is consistent with the expectation that the Brass, Konig, and Zhou studies did not discover all true HDFs, and that combining the three sets provides a better coverage of the true HDF universe. We also noted that the variation in precision (indicated by the error bars in these panels) decreases with increasing recall, suggesting that high-confidence predictions are more subject to variation than low-confidence predictions.
Finally, we compared the performance of all seven algorithms on the BKZ dataset. Three of the algorithms that do not use negative examples (Local+, SinkSource+, and FunctionalFlow with 1 and with 7 phases) achieved higher precision values than the other algorithms for values of recall less than 20%. However, SinkSource has the best performance for values of recall greater than 20%. PRINCE, the fourth algorithm that did not use negative examples, had uniformly lower precision than SinkSource+. Its precision was superior to that of SinkSource for values of recall less than 10%. To obtain the results for PRINCE, we used 0.8 for the value of an internal parameter α, since PRINCE achieved the highest precision values for this setting of α (see “Other Algorithms” for the role played by this parameter in the PRINCE algorithm). Furthermore, the precisions of the algorithms that do not use negative examples dropped considerably beyond a recall of 20% (beyond 10% in the case of PRINCE). We believe that this performance drop is caused by an undue influence of positive examples, resulting in many false positives.
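The two-fold protocol and the precision-recall bookkeeping described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names, the `seed` parameter, and the choice to rank only the held-out positives and negatives are all hypothetical, and ties in confidence are broken arbitrarily.

```python
import random

def two_fold_splits(positives, negatives, seed=0):
    """Split positives and negatives into two halves.

    Returns two (train, held_out) pairs; each half is used once to
    make predictions for the genes in the other half.
    """
    rng = random.Random(seed)
    pos, neg = sorted(positives), sorted(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    half_p, half_n = len(pos) // 2, len(neg) // 2
    fold_a = (set(pos[:half_p]), set(neg[:half_n]))
    fold_b = (set(pos[half_p:]), set(neg[half_n:]))
    return [(fold_a, fold_b), (fold_b, fold_a)]

def precision_recall_points(scores, held_out_pos, held_out_neg):
    """(recall, precision) after each held-out gene, by decreasing score."""
    ranked = sorted(held_out_pos | held_out_neg,
                    key=lambda g: scores[g], reverse=True)
    points, true_pos = [], 0
    for rank, gene in enumerate(ranked, start=1):
        if gene in held_out_pos:
            true_pos += 1
        points.append((true_pos / len(held_out_pos), true_pos / rank))
    return points
```

Repeating the split with different seeds and averaging the resulting curves would mirror the 10 independent runs mentioned above.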
The performance of FunctionalFlow did not vary much with an increase in the number of phases (see Figure S1). The performance of SinkSource+ was independent of the parameter λ (see Figure S2), as was the performance of PRINCE with respect to the parameter α (see Figure S3). We also noted that the AUPRC values for the BKZ dataset were 0.67 for Local, Local+, and for FunctionalFlow with 7 phases, 0.65 for PRINCE, 0.69 for SinkSource+, 0.73 for SinkSource, and 0.74 for Hopfield. There is a difference of 11% between the AUPRCs of the worst performing algorithms (0.67) and the best performing algorithm (0.74). The results for weighted versions of the network did not substantially differ from those for the unweighted network (see Figure S4 and Table S2).
Cross validation results on the unweighted human PPI network.
The SinkSource algorithm achieved a precision of 81% at 20% recall. The precision dropped only to 70% at a recall of 60%. The corresponding precisions for SinkSource+ were 85% and 60%. Although the Hopfield network algorithm achieved an AUPRC of 0.74, we observed that the smallest recall value attained by the algorithm was 60%, since the algorithm assigned a confidence of either 1 or −1 to a large number of predictions. We concluded that the Hopfield network algorithm was not a good choice for prioritizing predictions for further experimental analysis.
It is surprising that the very simple guilt-by-association algorithms (Local+ and FunctionalFlow with one phase) perform nearly as well as more sophisticated methods (FunctionalFlow with 7 phases, Hopfield, PRINCE, and SinkSource) that attempt to optimize predictions by taking into account constraints imposed by the entire protein interaction network. However, across 10 runs of cross validation, both Local+ and FunctionalFlow with one phase showed higher variation in precision and recall than the other algorithms (see Figure S5). Therefore, these two algorithms are likely to be more susceptible to missing or erroneous information.
Based on these results, we concluded that SinkSource+ and SinkSource were the two best algorithms for predicting HDFs. When high precision is required, SinkSource+ is superior to SinkSource. Thus, the predictions made by SinkSource+ might be the most suitable as the basis for detailed experimental studies of candidate HDFs. In the rest of the paper, we focus on the results obtained by the SinkSource+ and SinkSource algorithms.
SinkSource+ and SinkSource make overlapping predictions
We compared how many predictions SinkSource+ and SinkSource made at confidence values that correspond to approximately 80% precision after cross validation. SinkSource+ achieved a precision of 85% (and a recall of 20%) at a confidence of 0.5. The corresponding numbers for SinkSource were a precision of 81% (and a recall of 20%) at a confidence of 0.71. To further compare the two algorithms, we computed the overlaps in their predictions for different cutoffs on the confidence values. Specifically, we computed the k highest-confidence genes predicted by SinkSource+ and the k highest-confidence genes predicted by SinkSource, and measured the Jaccard coefficient of the pair of gene sets, for different values of k in increments of 100. Figure S6 demonstrates that the overlap between the predictions of the two algorithms is at least 0.34 up to the first 2000 predictions, with peaks at around 300 and 1000 predictions. These results are consistent with the relatively low recall (20–40%) predicted for the two algorithms at this level of precision. The data suggest that approximately half of the predictions may be ranked differently by the two algorithms. Predictions made by SinkSource+ for different values of the parameter λ did not vary much in their ranking (see Figures S7).
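The top-k overlap measurement can be sketched in a few lines. The function name `top_k_jaccard` and the representation of each algorithm's output as a gene-to-confidence dictionary are assumptions made for illustration; ties at the cutoff are broken arbitrarily here.

```python
def top_k_jaccard(conf_a, conf_b, k):
    """Jaccard coefficient of the k highest-confidence genes of two rankings.

    conf_a, conf_b: dicts mapping gene -> confidence (hypothetical format).
    """
    top_a = set(sorted(conf_a, key=conf_a.get, reverse=True)[:k])
    top_b = set(sorted(conf_b, key=conf_b.get, reverse=True)[:k])
    union = top_a | top_b
    # Size of the intersection divided by size of the union.
    return len(top_a & top_b) / len(union) if union else 0.0
```

Evaluating this function for k = 100, 200, and so on yields one overlap value per cutoff. As a point of reference, two sets of 1000 genes that share 606 members have a union of 1394 genes and therefore a Jaccard coefficient of 606/1394, or about 0.43.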
On the basis of these comparisons, we identified a set of high confidence predictions composed of the 1000 top-ranked predictions from SinkSource+ and from SinkSource respectively. These two sets contained 606 predictions in common and comprised a total of 1394 proteins in addition to the 908 BKZ HDFs. At the confidence levels of the 1000 SinkSource and SinkSource+ predictions, the precisions with two-fold cross validation are 88% and 81% respectively, suggesting that these predictions are relatively reliable. The corresponding recalls with two-fold validation are roughly 17% and 15% respectively, suggesting that these predictions are quite conservative.
In the rest of the paper, we use the phrases “BKZ HDFs”, “SS+ predicted HDFs”, and “SS predicted HDFs” to distinguish between the HDFs identified by one or more of the three siRNA screens, the HDFs predicted by SinkSource+, and the HDFs predicted by SinkSource, respectively. We extensively evaluated the predicted HDFs by comparing them to each other and to BKZ HDFs in terms of their functional annotations, interactions with HIV proteins, clustering within the PPI network, and role in disease pathogenesis. We based these evaluations on additional datasets that we did not use for predicting HDFs. Specifically, the new datasets we used were (i) Gene Ontology (GO) annotations for human proteins, (ii) interactions between HIV and human proteins, and (iii) gene expression data from two non-human primate species following infection with SIV. Hence, the analyses described below constitute an independent evaluation of the relevance of our predictions to HIV infection and disease progression.
Predicted HDFs are enriched in HIV-related GO terms
We summarized the functional roles of predicted HDFs by asking which GO terms were enriched in the HDFs, and whether any terms were considerably enriched in predicted HDFs but not in BKZ HDFs. We used the FuncAssociate software for this purpose, since it can take ordered lists of genes as input, in which case it finds and utilizes the set of top-ranked genes displaying the greatest enrichment. FuncAssociate adjusts for multiple hypothesis testing by computing an experiment-wise p-value. Note that FuncAssociate operates solely on the ranked list of genes and the GO annotations; it does not utilize a network (see “Methods” for details). We invoked FuncAssociate with three inputs: (a) the unordered set of BKZ HDFs, (b) the SS+ predicted HDFs, ordered by confidence, and (c) the SS predicted HDFs, also ordered by confidence. We used default values for all other FuncAssociate parameters. FuncAssociate reported 52 GO terms as being enriched in BKZ HDFs with an adjusted p-value of 0.05 or less and 462 GO terms as enriched in SS+ predicted HDFs. We identified three classes of terms (see Table S3). We note that FuncAssociate may report many related terms as enriched, due to the hierarchical nature of GO. Therefore, we also manually inspected the directed acyclic graph connecting the enriched terms in order to make the observations below.
- 49 GO terms enriched in both BKZ HDFs and SS+ predicted HDFs: For the most part, these terms corresponded to the biological processes or complexes that were also identified by Bushman et al. These terms included the proteasome, transcription/RNA polymerase, the mediator complex, transcriptional elongation, and RNA binding and splicing. This recapitulation is not surprising since Bushman et al. identified these GO terms by searching for dense PPI subnetworks connecting BKZ HDFs and other HIV-related proteins. Proteins in such dense subgraphs are likely to be adjacent in the PPI network to proteins that are predicted to be HDFs with high confidence by our algorithms.
- 3 GO terms enriched in BKZ HDFs but not in SS+ predicted HDFs: The three terms enriched only in BKZ HDFs were “nucleocytoplasmic transporter activity”; “proteasome core complex, alpha-subunit complex”; and “Golgi apparatus”. Except for Golgi apparatus, closely related terms were enriched in predicted HDFs.
- 413 GO terms enriched only in SS+ predicted HDFs: Many GO terms were enriched only in SS+ predicted HDFs. Examples are GO terms corresponding to two protein complexes, the Ndc80 complex (GO:0031262) and MIS12/MIND type complex (GO:0000444). Both terms were enriched only in predicted HDFs with a p-value of 0.002. All four components of the Ndc80 complex (NDC80, NUF2, SPC24, and SPC25) and all four components of MIS12/MIND type complex (DSN1, MIS12, NSL1, and PMF1) occurred within the top 275 predictions made by SinkSource+. Both complexes are part of the kinetochore and play important roles in forming stable kinetochore-microtubule attachments. Retroviruses such as HIV hijack microtubules in order to cross the cytoplasm into the nucleus and to allow HIV gene products to return to the cell surface. Although the Ndc80 and MIS12/MIND type complexes have not been directly implicated in the HIV life cycle, they represent new candidates for involvement in HIV movement through the host cell cytoplasm.
The trends were similar for the HDFs predicted by SinkSource (data not shown). Therefore, we compared the FuncAssociate results for SS+ predicted HDFs and for SS predicted HDFs in a similar manner. We only considered GO terms enriched with an adjusted p-value of 0.05 or less. As shown in Table S4, 280 GO terms were enriched in both sets of predictions, 182 GO terms were enriched only in SinkSource+ predictions, and 25 GO terms were enriched only in SinkSource predictions. The 280 common terms were related to processes such as RNA splicing (GO:0008380), translation initiation (GO:0003743), and oxidative phosphorylation (GO:0006119) and complexes such as the proteasome (GO:0000502), the kinetochore (GO:0000776), and the nuclear pore (GO:0005643); we discuss their relevance to HIV when we discuss clusters in the PPI network below (see “PPI Clusters Spanned by BKZ HDFs and Predicted HDFs Are Exploited by HIV”). The 182 GO terms enriched only in SinkSource+ predictions included the Ndc80 complex and MIS12/MIND type complex (mentioned above), apoptosis (including its induction and regulation) (GO:0006915, GO:0006917, and GO:0042981), and specializations of terms enriched in both sets of predictions. Among the 25 GO terms enriched only in SinkSource predictions, there were 12 GO terms whose specializations or near neighbors (in the GO directed acyclic graph) were enriched in SinkSource+ predictions. Each of the remaining 13 GO terms enriched only in SinkSource predictions was closely related to the assembly of glycosylphosphatidylinositol (GPI) anchors (GO:0006506). Based on these results, we concluded that, for the most part, similar functions were enriched in HDFs predicted by SinkSource+ and by SinkSource.
SS+ predicted HDFs interact with HIV proteins to a statistically-significant extent
Bushman et al. observed that each of the Brass, Konig, and Zhou HDF sets was statistically significantly enriched with human proteins that interact with HIV proteins (as reported in the NCBI HIV interaction database). We hypothesized that predicted HDFs might also be significantly enriched with HIV interactors. Accordingly, for each algorithm, we selected the k top-ranking predictions made by that algorithm, for different values of k starting at 100 and in increments of 100, computed the overlap of each set of predictions with the human proteins that interact with HIV, estimated the statistical significance of the overlap using the one-sided version of Fisher's exact test, and adjusted the p-values to account for testing multiple hypotheses. The overlap fraction for SS+ predicted HDFs peaked at 26% (79 of the top 300 predicted HDFs interact with HIV proteins, p), better than the BKZ HDFs of which 20% (109 proteins, p) interacted with HIV proteins. The trend for SS predicted HDFs was mixed: the overlap ratio was as high as 17.5% (70 of the top 400 predictions interact with HIV proteins), slightly less than the BKZ HDFs, but in no case was the enrichment statistically significant. These results suggest that SinkSource+ HDF predictions are dominated by proteins that lie close to BKZ and HIV proteins in the joint HIV-human PPI network, whereas the SinkSource predictions are dispersed further away. We discuss specific SS+ predicted HDFs that interact with HIV in the context of MCODE clusters below.
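The one-sided Fisher's exact test used for these overlaps is equivalent to an upper-tail hypergeometric probability. The sketch below computes that tail directly; the function name and argument names are illustrative, and the choice of background universe (for example, all proteins in the PPI network) is an assumption that this section does not specify.

```python
from math import comb

def fisher_one_sided(overlap, n_selected, n_annotated, n_universe):
    """Probability of an overlap at least this large by chance.

    overlap:     selected genes that are also annotated (e.g. HIV interactors)
    n_selected:  size of the selected set (e.g. the top-k predictions)
    n_annotated: number of annotated genes in the universe
    n_universe:  size of the background universe (an assumption here)
    """
    total = comb(n_universe, n_selected)
    upper = min(n_selected, n_annotated)
    # Sum hypergeometric probabilities for overlaps >= the observed one.
    tail = sum(
        comb(n_annotated, x) * comb(n_universe - n_annotated, n_selected - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total
```

For a toy universe of 10 genes with 5 annotated, selecting 5 genes that are all annotated gives a p-value of 1/252. A call for the peak reported above would pass 79 and 300 as the first two arguments, together with the interactor and universe counts, which are not given in this section.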
PPI clusters spanned by BKZ HDFs and SS+ predicted HDFs are exploited by HIV
The cross validation analysis suggested that HDFs are not randomly located in the human PPI network. Rather, HDFs are closer to each other within the PPI network than to the negative examples. Therefore, in order to better understand how BKZ HDFs and SS+ predicted HDFs are related to each other, we computed the subnetwork of PPIs spanned by these two sets of genes. We applied a modified version of the well-known MCODE graph clustering algorithm to this subnetwork (see “Modifying MCODE to Compute PPI Clusters”). The network contained 1,562 proteins and 30,855 PPIs. MCODE identified 41 clusters of varying sizes containing a total of 829 proteins and 16,721 PPIs. The cluster statistics table contains statistics on the 10 clusters with the largest number of PPIs computed by MCODE. Using the one-sided version of Fisher's exact test, we checked the overlap of each of these clusters with BKZ HDFs. Only eight clusters had overlaps that were statistically significant, as shown in Table S5. Table S6 contains a list of BKZ HDFs and HDFs predicted by SinkSource+, annotated with MCODE cluster membership and information on interaction with HIV proteins. Table S7 lists the human PPIs in each MCODE cluster.
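Extracting the subnetwork that serves as input to the clustering step can be sketched as follows. This assumes that “spanned by” means the induced subgraph, that is, every PPI whose two endpoints both belong to the union of BKZ HDFs and SS+ predicted HDFs; the function name and the edge-list representation are hypothetical, and the modified MCODE step itself is not reproduced here.

```python
def induced_subnetwork(edges, proteins):
    """Keep only the interactions whose two endpoints are both in `proteins`.

    edges:    iterable of (u, v) pairs from the full PPI network
    proteins: the proteins of interest (e.g. BKZ HDFs plus predicted HDFs)
    """
    proteins = set(proteins)
    return [(u, v) for u, v in edges if u in proteins and v in proteins]
```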
Statistics on the 10 clusters with the largest number of PPIs reported by MCODE.
We computed GO terms enriched in all clusters. The cluster functions table contains statistics on highly enriched GO terms in the 10 most highly connected clusters discovered by MCODE. Among the top 10 clusters, only clusters #1, #4, #7, #8, and #9 have statistically significant overlaps with BKZ HDFs (see Table S5). The fraction of BKZ HDFs is small in clusters #1, #4, and #9, so we reasoned that any functions enriched in these clusters would not be overly influenced by annotations of BKZ HDFs. In contrast, more than half the proteins in clusters #7 and #8 are BKZ HDFs; the functions enriched in these clusters are likely to annotate a number of BKZ HDFs. We now discuss the enriched functions in all clusters in that table. We focus our discussion on selected predicted HDFs contained within these clusters and present the support in the literature for the relevance of these HDFs to HIV pathogenesis.
The ten clusters with the largest number of PPIs reported by MCODE and the functions that each is the most enriched in.
The most enriched function in cluster #1 is the biological process “RNA metabolic process” (p). As many as 52 proteins in this cluster are members of the spliceosome (p), which is a complex of specialized RNA and protein subunits that removes introns from a transcribed pre-mRNA segment. HIV interacts with several components of the spliceosome in order to stimulate transcription and viral production via the long terminal repeat (LTR). HIV has also been shown to inhibit the production of spliceosomal proteins as a mechanism to block downstream immune responses. In this cluster, 22 predicted HDFs and 14 BKZ HDFs are known to interact with HIV. For example, the HIV VPR protein has been shown to hinder spliceosome assembly by interfering with the function of the SF3B2–SF3B4 host complex; SinkSource+ predicts SF3B4 as an HDF with confidence 0.87 (rank 55). This disruption inhibits the correct splicing of several cellular pre-mRNAs, including β-globin and immunoglobulin M (IgM). IgM has an important role both as a regulator of the immune system and as an inhibitor of apoptosis. Blocking IgM production may allow the virus to inhibit an immune response and to activate cell death, phenomena that have been linked to the progression of HIV infection. High-ranking predicted HDFs with known HIV interactions that are members of the spliceosomal complex include the small nuclear ribonucleoproteins SNRPB, SNRPB2, SNRPD1, and SNRPD2. The HIV TAT protein interacts with SNRPD2 (predicted with a confidence of 0.87 and rank of 59 by SinkSource+) in order to stimulate transcription from the LTR, which acts as a switch to control the production of new viruses.
Cluster #2 is enriched in the ribosome and in the biological process “translational elongation” with 75 of the 108 proteins in the cluster annotated with each of these terms (p, respectively). Bushman et al. also identified a complex of 13 proteins involved in translation elongation. Our results substantially expand this complex. Among the proteins predicted by SinkSource+ that belong to this cluster, EIF2S1, EIF2S2, EIF2S3, EIF4E, EIF4G1, and EIF5B are known to interact with HIV molecules, supporting these predictions. TAR is a 5′-terminal hairpin in HIV-1 mRNA that binds viral Tat and several cellular proteins. Eukaryotic translation initiation factor 2 (EIF2) binds the TAR secondary structure in HIV-1 RNA, suggesting that TAR may be involved in the translation of viral mRNA. Another facet of HIV interaction with host translation elongation occurs in human CD4+ cells, where HIV-1 protease cleaves eukaryotic translation initiation factor EIF4G, thereby inhibiting host protein synthesis that is directed by capped mRNAs.
Cluster #3 is highly enriched in the kinetochore (p). Other highly enriched GO terms include the MIS12/MIND type complex, the centromeric region of the chromosome, and the M phase of the mitotic cell cycle. The kinetochore is a multi-subunit protein complex that is located at the centromeric region of DNA. Microtubules connected to spindle poles attach themselves to the kinetochore. No BKZ HDFs are members of this cluster. However, five proteins in the cluster, KIF2C, BIRC5, PAFAH1B1, PPP1CC, and CDC20, are known to interact with HIV, supporting the validity of these HDF predictions. PAFAH1B1 (also known as LIS1), a subunit of the platelet-activating factor acetylhydrolase, is a member of the kinetochore and the microtubule. The interaction of HIV-1 Tat protein with PAFAH1B1 may contribute to the effect of Tat on the distortion of microtubule formation, which in turn may induce apoptosis of T cells. In addition, this cluster may be related to HIV's utilization of the host cell cytoskeletal machinery to traffic from the cell membrane to the nucleus and vice versa.
The most enriched GO term in cluster #4 is “respiratory chain” (p), with 47 of the 57 proteins in this cluster annotated with this term. Many of these genes are members of the NADH dehydrogenase complex (p), are involved in oxidative phosphorylation (p), and are localized to the mitochondrial membrane (p). Both the Brass and the Konig screens uncovered members of the NADH dehydrogenase complex, suggesting that HIV replication may involve the mitochondrial respiratory chain and the modulation of oxidative phosphorylation. The role played by host mitochondrial proteins in HIV-induced T-cell apoptosis has been extensively studied. Recently, it has been shown that components of the mitochondrial oxidative phosphorylation system are differentially regulated in apoptotic T-cells that have been infected by HIV. In eukaryotes, oxidative phosphorylation occurs in the electron transport chain in the mitochondrion. NADH dehydrogenase, a multi-subunit protein complex, is the first enzyme in this chain. The down-regulation of NDUFA6, a unit of the NADH dehydrogenase complex reported by both the Brass and Zhou screens, has been implicated in the induction of apoptosis in T cells by HIV. SinkSource+ predicts NDUFS1, one of the units of this complex, as an HDF with confidence 0.82 (rank 185). Caspase cleavage of NDUFS1 has been shown to mediate disruption of mitochondrial function during apoptosis, suggesting that NDUFS1 may play a role in the induction of T cell apoptosis by HIV.
GTPase mediated signal transduction
Cluster #5 contains 24 proteins, of which three are BKZ HDFs. Twenty-one proteins in the cluster are involved in small GTPase mediated signal transduction, with a p-value of 1.6×10⁻⁹. Many proteins in the cluster belong to the RAS family of proteins. Six proteins in the cluster, RHOB, RHOG, RAC2, RHOA, CDC42, and RAC1, are known to interact with HIV. Interactions of the small GTPases CDC42 and RAC1 with the HIV protein Nef activate the p21-activated kinase 1 (PAK1), a factor that is critical for efficient viral replication and pathogenesis.
DNA replication initiation
Of 29 proteins in cluster #6, 13 are annotated with the biological process “DNA replication initiation” (p). There are no BKZ HDFs in this cluster. However, four proteins in the cluster, CDC6, CDK2, PCNA, and RPA4, are known to interact with HIV proteins, supporting the validity of these HDF predictions. Cyclin-dependent kinase 2 (CDK2) is a catalytic subunit of the cyclin-dependent protein kinase complex, whose activity is restricted to the G1-S phase, and which is essential for transition of the cell cycle from G1 to S phase. CDK2 phosphorylates HIV Tat protein, a step that is important for HIV-1 transcription.
Cluster #7 contains 20 proteins that are significantly annotated with the GO terms “Transcription factor binding” (3.4×10⁻¹⁰) and “Transcription initiation” (5.3×10⁻⁹). As many as 11 BKZ HDFs are members of this cluster. Almost all proteins in this cluster are subunits of the mediator complex. This complex enables transcription by connecting transcriptional activators to the RNA polymerase II transcriptional machinery. Bushman et al. also identified this complex. They proposed that “changes in dosage in the mediator complex are not toxic to cells, but that Tat-activated transcription is extremely sensitive to mediator dosage.”
The proteasome is a large protein complex in the cell that is responsible for the degradation of unnecessary or damaged proteins and for post-translational regulation of the levels of many proteins via the ubiquitinylation pathway. 18 of the 60 proteins in cluster #8 are members of the proteasome (p), as are 22 of the 37 proteins in cluster #9 (p). 20 BKZ HDFs belong to cluster #8 and 9 to cluster #9. In the case of HIV infection, an active proteasome has been shown to be involved in HIV replication and is necessary for the release and maturation of infectious HIV particles. For example, the HIV VIF protein binds to the host APOBEC3G protein and targets it for degradation through an interaction with the proteasome. This process inhibits the APOBEC3G-mediated restriction of HIV replication.
MHC protein complex
Of the 56 proteins in cluster #10, 10 are annotated with “MHC protein complex” (p). 11 predicted HDFs in the cluster are known to interact with HIV. Many of these proteins are members of the class II major histocompatibility complex; HIV protein Tat down-regulates the expression of MHC class II genes in antigen-presenting cells.
Anaphase promoting complex
“Cell cycle process” is enriched in cluster #10 with a p-value of 5.2×10⁻⁷. Of the 13 proteins annotated with this process that are members of cluster #10, six proteins (ANAPC1, ANAPC4, ANAPC5, ANAPC7, ANAPC10, and ANAPC11) are subunits of the anaphase promoting complex (APC). HIV protein VPR induces G2/M arrest in order to facilitate the entry of the viral pre-integration complex into the nucleus. Studies with adenovirus and chicken anemia virus have suggested that proteins in these viruses target the APC in order to induce G2/M arrest. Thus, although none of the APC proteins in this cluster are known to interact with HIV, it is possible that VPR-induced G2/M arrest may result from inhibition of the APC.
Nuclear pore complex
The “nuclear pore complex” is the GO term most enriched in cluster #12 (not displayed in the two cluster tables); 14 of the 18 proteins are members of this complex (p). Seven predicted HDFs in cluster #12, BANF1, HMGA1, NUPL2, NUP54, PSIP1, RAN, and RANBP1, interact with HIV proteins. Bushman et al. also identified the nuclear pore, although proteins annotated to this term did not appear in a dense cluster in their analysis. The nuclear envelope is a lipid bilayer that serves as a physical barrier between the contents of the nucleus and cytoplasm. This barrier contains pores through which materials can be exchanged between the two cellular compartments. Large macromolecules require the assistance of karyopherins to pass through nuclear pores. Karyopherins bind to their cargo; after they cross the nuclear envelope, an interaction with the human RAN protein releases the bound partner. HIV has evolved to manipulate this cellular process. NUPL2 interacts with HIV VPR to mediate the docking of VPR at the nuclear envelope, a step that contributes to the nuclear import of viral DNA. RAN bound with GTP is known to bind to a complex of HIV protein REV and exportin 1 (CRM1) to mediate nuclear export of HIV mRNA. The barrier-to-autointegration factor BANF1 is localized both to the nucleus and to the cytoplasm. It is known to be exploited by retroviruses for promoting integration of viral DNA into the host chromosome.
BKZ and predicted HDF genes are differentially expressed during AIDS development in non-human primates
Since HDFs play a critical role in HIV replication, we hypothesized that some of them may have value as prognostic markers of HIV pathogenesis and of AIDS development and progression. We anticipated that both experimentally detected (BKZ) and predicted HDFs would satisfy this hypothesis. To explore this question, we combined BKZ HDFs and predicted HDFs with DNA microarray data from a study detailing the host response to simian immunodeficiency virus (SIV) infection in African green monkeys (AGMs) and pigtailed macaques (PTMs). AGMs are natural reservoirs of SIV that do not develop AIDS, while PTMs are non-natural hosts that develop AIDS when infected with SIV. The virus replicates to the same viral load in both of these hosts. Lederer et al. performed a longitudinal transcriptomic analysis comparing AGMs to PTMs. They analyzed the host response in the setting of acute SIV infection with the same primary isolate (SIVagm.sab92018). They studied three different tissues: blood, colon, and lymph nodes. They collected samples at 10 days and 45 days post-viral inoculation and compared each sample to a sample from the same animal pre-inoculation. For each day-tissue combination, they performed an analysis of three AGMs and three PTMs using rhesus macaque (Macaca mulatta) oligonucleotide microarrays. The probes in this microarray were based on the human Reference Sequence (RefSeq) collection. Thus, there is a direct mapping from these probes to human gene identifiers.
For each tissue (blood, colon, lymph node) and day (10 and 45 post infection) combination, we performed a separate ANOVA analysis, using the host system as factor, to identify genes that are differentially expressed between AGMs and PTMs. Such differentially expressed genes could potentially serve as diagnostic markers of AIDS development and progression. We constructed six lists (three tissues × two time points) of genes that were differentially expressed between AGMs and PTMs to a statistically-significant extent (p ≤ 0.05). We used the one-sided version of Fisher's exact test to determine if BKZ HDFs had a significant intersection with each of these six lists. We repeated this test with the top k predicted HDFs, for values of k starting at 100 and in increments of 100. We used the method of Benjamini and Hochberg to correct for testing multiple hypotheses.
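The overlap test described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names, the choice of gene universe, and the numbers in the usage example are all assumptions made for the sketch. The one-sided Fisher's exact test is computed as the upper tail of the hypergeometric distribution, and the Benjamini-Hochberg adjustment is the standard step-up procedure.

```python
from math import comb


def fisher_one_sided(overlap, set_size, list_size, universe):
    """Upper-tail hypergeometric probability P(X >= overlap).

    overlap   : genes in both the HDF set and the differentially expressed list
    set_size  : size of the HDF set (BKZ HDFs or the top-k predictions)
    list_size : size of the differentially expressed gene list
    universe  : number of genes that could appear in either set (an assumption
                of this sketch; the text does not state the universe used)
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe, set_size)
    tail = sum(
        comb(list_size, x) * comb(universe - list_size, set_size - x)
        for x in range(overlap, max_overlap + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    raw = [
        fisher_one_sided(overlap=30, set_size=100, list_size=200, universe=1000),
        fisher_one_sided(overlap=22, set_size=100, list_size=200, universe=1000),
        fisher_one_sided(overlap=18, set_size=100, list_size=200, universe=1000),
    ]
    print(benjamini_hochberg(raw))
```

In this sketch, one raw p-value would be computed per gene list and per value of k, and the whole collection passed to `benjamini_hochberg` at once; how the tests were grouped for correction in the study is not specified in the text.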
The accompanying figure displays plots of the fraction of BKZ HDFs or of predicted HDFs that are also differentially-expressed to a significant extent in the AGM-PTM comparison; Figures S9 plot the corresponding p-values. Note that the plot for BKZ HDFs is a horizontal line since changing the score cutoff for predictions has no effect on BKZ HDFs. Three notable trends emerged from this analysis. First, for many tissue-day combinations, the overlap fraction for predicted HDFs was larger than the overlap fraction for BKZ HDFs. These trends were most noteworthy in day 10 lymph nodes, where the overlap fraction for predicted HDFs was larger than that for BKZ HDFs over the entire range of prediction confidence values. In particular, in day 10 lymph nodes, the overlap fraction of SS+ predicted HDFs peaked at 0.26 (53 of the top 203 predicted HDFs were also differentially-expressed in day 10 lymph nodes, p-value 0.01). The largest overlap for SS predicted HDFs was also 0.26 (26 of the top 100 predicted HDFs, a non-significant p-value of 0.07). In contrast, the overlap fraction for BKZ HDFs with genes differentially expressed in day 10 lymph nodes was 0.19 (p-value 0.59). Second, none of the overlaps of BKZ HDFs with differentially-expressed genes were statistically significant, for any tissue-day combination. In contrast, p-values for HDFs predicted by each algorithm were statistically significant (red points in the accompanying figure and Figures S9) in day 10 lymph nodes, across a wide range of prediction confidences. Third, no statistically significant overlaps appeared for predicted HDFs in blood or colon samples at any time point or in day 45 samples from lymph nodes.
Plots of the fraction of BKZ or of predicted HDFs that are also differentially expressed in the AGM-PTM comparison: (a) SinkSource+ and (b) SinkSource.
We re-estimated the significance of these results after randomizing the gene expression data, by permuting each gene's p-values independently. This process retained the distribution of p-values for each gene, but randomized the associations between p-values and tissue-day combinations. We repeated the overlap analysis for predicted HDFs with each of 10,000 randomized gene expression data sets, for a total of 60,000 randomized tissue-day combinations. We observed only one randomized dataset for which any overlap ratio was at least as large as 0.26, the largest overlap ratio between HDFs predicted by SinkSource+ and genes differentially expressed in day 10 lymph nodes. Thus, the p-value of the observed overlap ratio was 1.7×10⁻⁵. For predictions made by SinkSource, we obtained a p-value of 8.3×10⁻⁵, for the largest observed overlap of 0.26.
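The randomization procedure can be illustrated with a short Python sketch. It assumes a hypothetical data layout (a dictionary mapping each gene to its list of six tissue-day p-values) and evaluates a single fixed set of predicted HDFs rather than sweeping over the top k; both are simplifications made for illustration. The empirical p-value is the fraction of randomized tissue-day combinations whose overlap is at least the observed one, which matches the arithmetic in the text (1 of 60,000 gives about 1.7×10⁻⁵).

```python
import random


def overlap_fraction(predicted, pvals_by_gene, column, alpha=0.05):
    """Fraction of predicted genes with p <= alpha in one tissue-day column."""
    hits = sum(1 for gene in predicted if pvals_by_gene[gene][column] <= alpha)
    return hits / len(predicted)


def empirical_pvalue(predicted, pvals_by_gene, observed,
                     n_random=1000, alpha=0.05, seed=0):
    """Estimate how often a randomized overlap reaches the observed overlap.

    Each gene's p-values are shuffled independently across tissue-day
    columns, which keeps every gene's p-value distribution intact but
    breaks the link between p-values and tissue-day combinations.
    """
    rng = random.Random(seed)
    n_cols = len(next(iter(pvals_by_gene.values())))
    at_least = 0
    for _ in range(n_random):
        # Only the predicted genes enter the overlap, so only they are shuffled.
        shuffled = {
            gene: rng.sample(pvals_by_gene[gene], n_cols) for gene in predicted
        }
        for column in range(n_cols):
            if overlap_fraction(predicted, shuffled, column, alpha) >= observed:
                at_least += 1
    return at_least / (n_random * n_cols)


if __name__ == "__main__":
    # Toy data, invented for illustration: 50 genes, 6 tissue-day columns.
    rng = random.Random(1)
    toy = {f"gene{i}": [rng.random() for _ in range(6)] for i in range(50)}
    predicted = list(toy)[:20]
    observed = overlap_fraction(predicted, toy, column=0)
    print(empirical_pvalue(predicted, toy, observed, n_random=200))
```

The sketch divides the count by the total number of randomized combinations, as in the text; a common alternative adds one to both numerator and denominator so that the estimate is never exactly zero.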
Thus, we concluded that the predicted HDFs have a significant overlap with genes that are differentially expressed between AGMs and PTMs in day 10 lymph nodes, indicating that many predicted HDFs show considerably different programs of expression in the two species in response to SIV infection, especially in early time points. These data suggest that the algorithms have identified a highly responsive subset of potential HDFs, and provide strong experimental support for the prediction that these proteins are in fact HDFs. This result further suggests that viral manipulation of these host factors in lymph nodes soon after infection may have an effect on long-term pathological outcome. We used FuncAssociate to perform GO enrichment analysis on predicted HDFs that were also differentially expressed between AGMs and PTMs in day 10 lymph nodes. The terms we found were almost identical to those reported in the PPI clusters (data not shown). In summary, these results suggest that not only are HDFs critical for viral replication and infection, they may have potential value as prognostic markers to determine pathological outcome and the likelihood of AIDS development.
We have used network-based approaches to predict HIV dependency factors (HDFs). Upon two-fold cross-validation, we found that combining the three experimental data sets yielded much higher precision and recall than using each data set on its own. A number of the algorithms we compared achieved both high precision and recall on cross validation. Our results suggest that global optimization techniques such as SinkSource and SinkSource+ perform slightly better than the simple guilt-by-association rule. Furthermore, SinkSource+ and SinkSource had the most consistent and reliable performance. Software implementing the function prediction algorithms is available at http://bioinformatics.cs.vt.edu/~murali/software/gain. We also observed that estimating the reliability of PPIs did not confer an advantage; in fact, the cross validation results worsened slightly with edge weights (Table S2). The decrease in performance is likely due to a combination of the close proximity of HDFs within the PPI network and the high reliability of PPIs that HDFs are involved in, since the corresponding biological processes are well studied.
We found that the HDFs predicted by SinkSource+ were significantly enriched in proteins that interact with HIV proteins. On the other hand, SinkSource predicted a set of HDFs that were not significantly enriched in HIV-interacting proteins. We computed clusters within the subgraph of the PPI network that encompassed the BKZ HDFs and HDFs predicted by SinkSource+. These clusters were enriched in host cellular complexes and pathways that are known to be manipulated by HIV and perturbed during HIV infection, such as the spliceosome, the microtubule network, the proteasome, the mitochondrion, and nuclear import and export.
Finally, we integrated BKZ HDFs and predicted HDFs with gene expression data from a study detailing the host response to SIV infection in non-human primates that do not develop AIDS (African green monkeys) and those that do (pigtailed macaques). We found that up to 26% of predicted HDFs are differentially expressed when we compared their gene expression profiles in macaques to their profiles in African green monkeys. This differential expression of HDFs was time- and tissue-specific, being strongest in lymph nodes 10 days post-inoculation. These HDFs are excellent candidates for studying transcriptional programs relevant to AIDS progression in humans.
Our results support three conclusions. First, existing genomic screens are incomplete and many HDFs are yet to be discovered. The HDFs predicted by SinkSource+ may include many proteins required for HIV replication that could not have been uncovered experimentally because the predictions were not constrained to non-essential human proteins. Second, HDFs are clustered in the human PPI network and belong to cellular pathways or protein complexes that play a critical role in HIV pathogenesis and AIDS progression. Third, many HDF genes show differential expression during AIDS development in non-human primates. Thus, HDFs may play an important role in the control of initial infection and eventual pathological outcome.
It will be valuable to integrate other HIV-relevant functional genomic data with PPI networks to improve the quality and robustness of HDF prediction. Modeling the impact of off-target effects of siRNAs on false positive HDFs is also important. To date, experiments that have detected HDFs have been performed in cell lines. Approaches such as ours may help to prioritize HDFs for further experimental study in more disease-relevant models such as non-human primates. Ultimately, we anticipate that future extensions of our work may provide multiple new targets and strategies for combating HIV in humans.
Our approach is general purpose and can be applied to interpret other genome-wide gene-level studies. In particular, if independent labs have conducted multiple studies of the same biological system or phenomenon, we provide a methodology to interpret them simultaneously within the context of molecular interaction networks. Our approach can be used to ask if the studies reinforce or contradict each other and to prioritize new genes for further experimental analysis.