The identification of essential proteins is crucial for understanding of the minimal requirements for cellular life [
1], which is also very important for the discovery of human disease genes and defending against human pathogens [
2-
4]. For example, the identification of essential genes and non-essential genes is valuable for rational drug design [
5]. Essential proteins in pathogenic organisms can be taken as the potential targets for new antibiotics [
6].
Essential proteins are those proteins necessary for growth in a rich medium where all the required nutrients are available [
1]. The deletion of such proteins will result in lethality or infertility, i.e., the organism cannot survive without them [
7,
8]. Different experimental methods, such as single gene knockouts [
9], RNA interference [
10] and conditional knockouts [
11], have been implemented for the discovery of essential proteins. However, these experimental methods generally require large amounts of resources and are very time consuming.
To break through these experimental constraints, some researchers have proposed various computational approaches. With the accumulation of data derived from experimental small-scale studies and high-throughput techniques, there is a growing awareness that the topological properties of biological networks would be useful for the identification of essential proteins. It has been observed in several species, such as
Saccharomyces cerevisiae, Caenorhabditis elegans, and
Drosophila melanogaster [
12,
13], that proteins in the network highly connecting with other proteins are more likely to be essential than those selected by chance [
14]. This is called the "centrality-lethality rule" [
14]. Although there exist some controversies about whether, why and how the highly connected proteins tend to be essential in biological networks [
15-
18], most researchers have confirmed the correlation between topological centrality and protein essentiality [
13,
19-
21].
Specifically, some global network characteristics, such as betweenness centrality [
22] and closeness centrality [
23], and local network features, such as maximum neighborhood component [
24] and local average connectivity [
25], have already been used to determine a protein's essentiality. Recently, Park and Kim [
26] investigated the localized network centrality and essentiality in the yeast protein-protein interaction network. They made a comprehensive examination and comparison among different types of centrality measures, which included shortest path betweenness, shortest path closeness, eigenvector centrality, harary graph centrality, information centrality, stress centrality, random walk betweenness, random walk closeness, degree centrality, clustering coefficient, subgraph centrality, complexity measure, sub-network maximum degree, and assortative mixing (ASS) centralities. In our previous studies [
25,
27,
28], we have also shown the feasibility of using network topological features to detect essential proteins from the yeast protein-protein interaction networks. Moreover, several recent centrality measures, such as Range-Limited Centrality [
29], L-Index [
30], LeaderRank [
31], Normalized
α-Centrality [
32], and Moduland-Centrality [
33], used in complex networks can also be used to analyze the protein-protein interaction networks.
Though a great progress has been made on the computational methods for the identification of essential proteins based on network topologies, there are still several challenges that researchers have to meet. First, the protein-protein interaction dataset for each species is not complete up to now. Second, a high proportion of false positives has been found in protein-protein interaction networks, especially for those obtained by high-throughput technologies. In addition, as reported by Zotenko et al. [
17], essential proteins tend to form highly connected clusters rather than function independently. It is well known that both false negatives and false positives in protein-protein interaction networks are hard to be cleaned out. For false positives, a general approach is to evaluate the interactions by using different weighting methods. More recently, there is a new trend that improves the precision of essential protein discovery method by integration of network topology and other information. For example, Acencio et al [
1] explored essential proteins based on the integration of network topological features and two types of GO annotations: cellular localization and biological process. Recently, several researchers began to pay attention to the relationship between protein essentiality and their cluster property [
27,
34].
With respect to these various difficulties and progresses, we propose a new centrality measure, named PeC, by integrating protein-protein interaction data and gene expression data. Different from other centrality measures, PeC determines a protein's essentiality not only based on its connectivity, but also whether it has a high probability to be co-clustered and co-expressed with its neighbors. The performance of PeC was tested on the well studied species of
Saccharomyces cerevisiae. Compared to other fifteen previous centrality measures: Degree Centrality (DC) [
14], Betweenness Centrality (BC) [
22], Closeness Centrality (CC) [
23], Subgraph Centrality(SC) [
35], Eigenvector Centrality(EC) [
36], Information Centrality(IC) [
37], Bottle Neck (BN) [
38,
39], Density of Maximum Neighborhood Component (DMNC) [
24], Local Average Connectivity-based method (LAC) [
25], Sum of ECC (SoECC) [
27], Range-Limited Centrality (RL) [
29], L-Index (LI) [
30], LeaderRank (LR) [
31], Normalized
α-Centrality (NC) [
32], and Moduland-Centrality (MC) [
33], PeC achieves higher precision for the identification of essential proteins. The experimental results show that the integration of network topology and gene expression increased the predictability of essential proteins in comparison with those centrality measures only based on network topological features.
New centrality measure: PeC
In this study, a new centrality measure, PeC, is proposed based on the integration of protein-protein interaction data and gene expression data. The basic ideas behind PeC are as follows: (1) A highly connected protein is more likely to be essential than a low connected one; (2) Essential proteins tend to form densely connected clusters; (3) Essential proteins in the same cluster have a more chance to be co-expressed. In PeC, a protein's essentiality is determined by the number of the protein's neighbors and the probability that the protein is co-clustered and co-expressed with its neighbors.
To describe PeC simply and clearly, we provide the following definitions and descriptions. The protein-protein interaction network is represented by an undirected graph
G(
V, E), where a node
v
V represents a protein and an edge
e(
u, v)
E denotes an interaction between two proteins
u and
v. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These gene products are often proteins. Of course, there may exist some functional RNAs from non-protein coding genes. Here, we only consider the gene expressions for proteins. For a protein
v, its gene expressions with
s different times are denoted as
Ge(v) = {
g(
v, 1),
g(
v, 2), ...,
g(
v, s)}. The probability that two proteins are co-clustered and co-expressed is evaluated based on the edge clustering coefficient (ECC) and pearson correlation coefficient (PCC).
Edge clustering coefficient (ECC)
Clustering coefficient was first proposed to describe the property of a vertex in a network, which has been used as an effective tool to analyze the topology of protein-protein interaction networks [
40]. Radicchi
et al. [
41] generalized the clustering coefficient of a vertex to an edge, and defined it as the number of triangles to which a given edge belonged, divided by the number of triangles that might potentially include the triangles. In our previous studies [
25,
42], we have proposed a modified definition of edge clustering coefficient (ECC) to overcome the fact that the definition of ECC in [
41] is not feasible when the network has few triangles. For an edge (
u, v) connecting node
u and node
v, we calculate its ECC by using the common neighbors instead of triangles. The ECC of an edge (
u, v) is defined as:
where Nu (or Nv) is the set of neighbors of vertex u (or v) and du (or dv) denotes the degree of vertex u (or v), i.e., the number of nodes which u (or v) directly connects in graph G.
ECC(u, v) is a local variable which characterizes the closeness of two proteins u and v. Obviously, two proteins u and v with a larger value of ECC(u, v) are more likely to be in the same cluster.
The advantage of ECC is that it describes effectively the probability of two proteins being in a cluster from the topology view. However, it also has disadvantage. The effectiveness of ECC heavily depends on the reliability of the protein-protein interaction networks. Thus, in this paper we will introduce another metric, pearson correlation coefficient, which is independent of the reliability of the protein-protein interaction networks, to evaluate how likely two proteins are in the same cluster from another view.
Pearson correlation coefficient (PCC)
To evaluate how strong two interacting proteins are co-expressed, we calculate their pearson's correlation coefficient(PCC). The PCC [
43] of a pair of genes (
X and
Y), which encode the corresponding paired proteins (
u and
v) interacting in the protein-protein interaction network, is defined as:
where
s is the number of samples of the gene expression data;
g(
X, i) (or (
g(
Y, i))) is the expression level of gene
X (or
Y) in the sample
i under a specific condition;

(
X) (or
(
Y)) represents the mean expression level of gene
X (or
Y) and
σ(
X) (or
σ(
Y)) represents the standard deviation of expression level of gene
X (or
Y). Here, we defined the pearson's correlation coefficient of a pair of proteins (
u and
v) as equal to the PCC of their corresponding paired genes (
X and
Y), that is
PCC(
u, v) =
PCC(
X, Y). The value of
PCC ranges from -1 to 1. If
PCC(
u, v) has a positive value, there is a positive linear correlation between
u and
v.
New centrality measure PeC by integration of PCC and ECC
It has been proved that there exist a number of protein complexes which play a key role in carrying out biological functionality [
44] and the essentiality tends to be a product of a protein complex rather than an individual protein [
45]. Based on the definitions of edge clustering coefficient (ECC) and pearson's correlation coefficient (PCC), we propose a new centrality measure which is named as PeC. The probability that two proteins are co-clustered is described from a topological view and the probability that two proteins are co-clustered is characterized from a biological view. Thus, we defined the probability of paired proteins
u and
v to be in the same cluster as following:
For a protein v, its PeC(v) is defined as the sum of the probabilities that the protein and its neighbors belong to a same cluster:
Where Nv denotes the set of all neighbors of node v.
The value of
PeC(v) is determined by not only how many neighbors the protein has but also how likely it is co-clustered with its neighbors. In our previous studies [
25], we have found that in the cases of non-essential proteins, which have a high degree, there are generally few interactions between their neighbors. When predicting essential proteins, PeC can discriminate these different types of highly connected proteins by the computation of sum of
pc.