Most of the current disease candidate gene prioritization methods [1
] rely on functional annotations. However, the coverage of the gene functional annotations is a limiting factor. Although more than 1,500 human disease genes have been documented, most of them continue to be functionally uncharacterized. Currently, only a fraction of the genome is annotated with pathways and phenotypes [6
]. While two thirds of all the genes are annotated by at least one functional annotation, the remaining one third is yet to be annotated.
Analysis of protein-protein interaction networks (PPINs) is important for inferring the function of uncharacterized proteins. Protein-protein interactions refer to the association among the protein molecules and the study of these associations from the perspective of biochemistry, signal transduction and biomolecular networks. Recent biotechnological advances like the high-throughput yeast two-hybrid screen facilitated building proteome-wide PPINs or "interactome" maps in humans [7
]. The shift in focus to systems biology in the post-genomic era has generated further interest in PPINs and biological pathways. Network-based analyses have been developed with a number of goals [9
], including protein function prediction [10
], identification of functional modules [11
], interaction prediction [12
], identification of disease candidate genes [14
] and drug targets [16
], and the study of network structure and evolution [18
]. While there is a wealth of protein-disease relationships in the published literature and a number of PPIN resources, relatively few studies have actually used PPIN analyses for prioritizing disease genes. Thus, making use of these networks in the context of disease is a relatively new challenge [14
]. One of the earliest efforts [23
] uses a classifier based on several topological features, including degree (the number of links to the protein), 1N index (proportion of links to disease-related proteins), 2N index (average 1N index in the neighbors), average distance to disease genes, and positive topology coefficient (average neighborhood overlapping with disease genes). Xu et al., built a KNN-based classifier with all disease genes from OMIM and concluded that hereditary disease genes from OMIM in the literature-curated protein-protein interaction network are characterized by a larger degree, a tendency to interact with other disease genes, more common neighbors, and quick communication to each other [23
]. A more recent application, Genes2Networks [24
], identifies important genes based on a list of "seed" genes. It generates a Z-score for each "intermediate" gene from a binomial proportions test to represent its specificity or significance to the "seed" genes. The former method, independent of known disease-related genes, is used for disease candidate gene identification, especially in cases where there is little or no prior knowledge about the disease. The latter application, on the other hand, uses a "seed" list as training to score the neighboring genes. It avoids bias toward highly connected "hub" genes, but the candidate gene is searched in a local network region, and the user has to provide the size of the neighborhood region in the network.
Recent technological advances in genomic sequencing, gene expression analysis, and other massively parallel techniques, while opening new opportunities, continue to pose a formidable challenge in deriving meaningful information from the large data silos. Typically, such data can be represented as networks in which the nodes (e.g., genes, mRNA, microRNA, proteins or metabolites) are linked by edges (e.g., DNA-protein or protein-protein or miRNA-mRNA interactions or correlations). Structural analysis of these networks can lead to new insights into biological systems and is a helpful method for proposing new and testable hypotheses. Biological networks have in fact been found to be comparable to communication and social networks [25
]. For instance, PPINs and communication networks share several common characters such as scale-freeness and small-world properties, suggesting that the algorithms used for social and Web networks are equally applicable to biological networks. Although PPIs have been used widely to identify novel disease candidate genes [14
], besides Kohler et al. [30
] and Wu et al. [34
], there have been no reports on using PPIs for disease gene prioritization. Additionally, to the best of our knowledge, this is the first study that uses social- and Web- network analysis-based algorithms to prioritize disease candidate genes.
In the analysis of social networks, Web graphs and telecommunication networks, a common question frequently asked is: Which entities are most important in the network? Although visualization-centered approaches such as graph drawing are useful to gain qualitative intuition about the structure, especially in small graphs, it is not practical to use these approaches for large and more complex networks. As an alternative, a number of other approaches have therefore been developed. For instance, a variety of measures (degree centrality [36
], closeness centrality [37
] and betweenness centrality [38
]) have been proposed by sociologists to determine the "centrality" of a node in a social network. Likewise, in the area of Web graphs, computer scientists have proposed and used several algorithms such as HITS [39
] and PageRank [40
] for automatically determining the "importance" of Web pages.
In the current study, for the first time, we utilize the above methods to prioritize disease candidate genes by estimating their relative importance in the PPIN to the disease-related genes. Specifically, we determine the optimal parameter values in the methods and record the corresponding performance. The first algorithm that we use is based on White and Smyth's PageRank algorithm. White and Smyth [41
] proposed a general framework and a set of algorithms under the framework to measure the relative importance in networks. The first method is an extension of the original PageRank algorithm and is called PageRank with Priors. It mimics the random surfer model wherein a random Internet surfer starts from one of a set of root nodes, R, and follows one of the links randomly in each step. In this process, the surfer jumps back to the root nodes at probability β, thus restarting the whole process. Intuitively, the PageRank with Priors algorithm generates a score that is proportional to the probability of reaching any node in the Web surfing process. This score indicates or measures the relative "closeness" or importance to the root nodes. The second algorithm, named HITS with Priors, is an extension of HITS (Hyperlink-Induced Topic Search), which is a link analysis algorithm developed by Jon Kleinberg to rate Web pages. It determines two values for a page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages [42
]. In the Web surfing model, the surfer still starts from one of the root nodes. In the odd steps he/she can either follow a random "out-link" or jump back to a root node, and in the even steps he/she can instead follow an "in-link" or jump back to a root node. Similar to the PageRank with Priors, HITS with Priors also estimates the relative probability of reaching a node in the network. The third algorithm we use is the K-Step Markov method. In a similar Web surfing scenario, this method mimics a surfer who starts with one of the root nodes. The surfer follows a random link in each step, but he/she return to the root node after K steps and restarts surfing.