Essential proteins (also known as lethal proteins) are indispensable to life: without them, an organism dies or becomes infertile. Biologists have pursued the identification of essential proteins for two main reasons. From a theoretical perspective, identifying essential proteins provides insight into the minimal requirements for cellular survival and development. It also plays a significant role in the emerging field of synthetic biology, which aims to create a cell with a minimal genome. From a practical perspective, essential proteins are drug targets for new antibiotics because they are indispensable for bacterial cell survival. Moreover, research suggests that essential proteins (or genes) are associated with human disease genes, so studying essential proteins also facilitates the identification of disease genes. In biology, several experimental methods can predict and discover essential proteins, such as single gene knockouts, RNA interference and conditional knockouts. However, these experiments are expensive and inefficient, and they are limited to a few species. A highly accurate computational method is therefore an important alternative for identifying essential proteins.
Recently, many computational methods have been proposed to identify essential proteins based on their characteristic features. One of the most important of these features is conservation. Previous studies have shown that essential proteins evolve much more slowly than other proteins and are more evolutionarily conserved than nonessential proteins. This is because essential genes are more likely to be involved in basic cellular processes, so the negative selection acting on essential genes is more stringent than that acting on nonessential genes. In several studies, the term ‘phyletic retention’ is used in place of ‘conservation’ to describe the homology mapping of a protein in other organisms, obtained by using BLAST or reciprocal best hits. Moreover, Gustafson et al. point out that phyletic retention is the feature most predictive of essentiality. Besides phyletic retention, other genomic features, such as GC content, protein length, ORF length and cellular localization, have also been used to predict essential proteins with supervised machine learning methods. Because these methods train a classifier on the traits of essential genes in one organism and then predict essential genes in another organism or in a test dataset from the same organism, a set of essential proteins and their related properties must be known in advance. Consequently, the performance of these methods depends closely on the classifier and on the distance between the training and test organisms.
Another important feature of essential proteins is their topological properties in Protein-Protein Interaction (PPI) networks. Proteins in a cell interact with each other and form a PPI network. A number of researchers have studied the relationship between the essentiality of proteins and their topological properties in PPI networks. Studies have shown that there is a positive correlation between lethality and centrality in PPI networks. Thus, the most highly connected proteins are more likely to be indispensable. As a consequence, a series of centrality measures based on network topological features have been used for identifying essential proteins, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), Information Centrality (IC) and Edge Clustering Coefficient Centrality (NC). These methods rank proteins by their centrality in the PPI network, and the resulting ranking scores are used to judge whether a protein is essential. The merit of these methods is that they identify essential proteins directly and do not need to train a classifier on a set of known essential proteins.
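To make the ranking idea concrete, the following is a minimal sketch of the simplest measure listed above, Degree Centrality, on a toy undirected network. The protein names and the helper names `degree_centrality` and `rank_by_score` are hypothetical, and the sketch is only an illustration, not the implementation used in any of the cited studies.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {protein: number of distinct interaction partners}."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {protein: len(partners) for protein, partners in neighbors.items()}


def rank_by_score(scores):
    """Sort proteins by descending score; ties are broken by name."""
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy PPI network with hypothetical protein names.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
dc = degree_centrality(toy_edges)
ranking = rank_by_score(dc)
print(ranking)
```

The proteins at the top of `ranking` would be the candidates judged most likely to be essential under this measure. The other centrality measures differ only in how the score of each protein is computed.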
However, these centrality methods have some limitations. Firstly, the available PPI data are incomplete and contain many false positives and false negatives, which reduces the accuracy of essential protein discovery. Secondly, most of these methods use only the topological properties of the network and seldom consider other intrinsic properties of known essential proteins. To overcome these limitations, many research groups have recently focused on identifying essential proteins by integrating PPI networks with other biological information. Li et al. construct a weighted PPI network by taking gene annotations into consideration. By integrating network topology with gene expression, the same group of researchers proposes a new method called PeC, which predicts essential proteins more accurately than centrality measures based only on network topological features. On the other hand, some researchers use supervised machine learning methods to combine network topological properties with genomic features, such as cellular localization, to identify essential proteins.
Additionally, Pereira-Leal et al. have reported that essential proteins are, on average, more frequently connected to other essential proteins than nonessential proteins are. By analyzing the topological properties of interactions between essential proteins, they detected an almost fully connected exponential network, which implies a strong correlation between the essentiality of a protein and that of its neighbors.
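One simple way to inspect this kind of neighborhood effect is to compare, for essential and nonessential proteins separately, the average fraction of a protein's neighbors that are essential. The sketch below does this on a toy network. The function name `essential_neighbor_fraction`, the protein names and the essentiality labels are hypothetical, and this is not the analysis performed by Pereira-Leal et al.

```python
def essential_neighbor_fraction(edges, essential):
    """Return (mean fraction of essential neighbors for essential proteins,
    mean fraction of essential neighbors for nonessential proteins)."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def mean_fraction(group):
        fractions = [len(neighbors[p] & essential) / len(neighbors[p]) for p in group]
        return sum(fractions) / len(fractions) if fractions else 0.0

    essential_group = [p for p in neighbors if p in essential]
    nonessential_group = [p for p in neighbors if p not in essential]
    return mean_fraction(essential_group), mean_fraction(nonessential_group)


# Toy network: A, B and C are labeled essential (hypothetical labels).
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
toy_essential = {"A", "B", "C"}
ess_mean, non_mean = essential_neighbor_fraction(toy_edges, toy_essential)
print(ess_mean, non_mean)
```

In this toy example the essential proteins have a higher average fraction of essential neighbors than the nonessential ones, which is the pattern described above.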
Based on the facts mentioned above, we propose an iterative method for predicting essential proteins by integrating orthology with PPI networks, named ION. In ION, the conservative property of proteins is also taken into account. To measure the conservation of proteins, we find their orthologous proteins in other species instead of performing sequence alignment with BLAST. Orthologs are homologous proteins derived from a common ancestor. They usually have highly similar amino acid sequences and retain the same or very similar functions, which allows biological information to be inferred between these proteins. Many studies use orthologous information to identify evolutionary signals of PPI networks, discover the rate of protein evolution and infer protein conservation. Recently, more and more algorithms, such as IsoRank, have been used to detect orthologs. Furthermore, many databases and public resources of orthologs are now available, for instance COG and Inparanoid, which facilitate ortholog-based research.
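As an illustration of how ortholog information could be turned into a per-protein conservation score, the sketch below computes the fraction of reference organisms in which a protein has at least one ortholog. This scoring rule, the function name `orthology_scores`, and the organism and protein names are assumptions made for illustration; the exact definition used by ION is not given in this section.

```python
def orthology_scores(orthologs, reference_organisms):
    """Return {protein: fraction of reference organisms containing an ortholog}.

    orthologs maps each protein to the set of organisms in which it has at
    least one ortholog (for example, as read from an ortholog database).
    """
    refs = set(reference_organisms)
    if not refs:
        raise ValueError("at least one reference organism is required")
    return {
        protein: len(set(organisms) & refs) / len(refs)
        for protein, organisms in orthologs.items()
    }


# Hypothetical reference organisms and ortholog assignments.
reference = ["org1", "org2", "org3", "org4"]
toy_orthologs = {
    "P1": {"org1", "org2", "org3", "org4"},  # ortholog found in every reference
    "P2": {"org1"},
    "P3": set(),  # no ortholog found
}
scores = orthology_scores(toy_orthologs, reference)
print(scores)
```

Under this assumed rule, a protein with orthologs in every reference organism receives the maximum score of 1.0, and a protein with no detected ortholog receives 0.0.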
In addition to the orthologous properties of proteins, their connectivity and the features of their neighbors are also considered in ION. In contrast to supervised machine learning methods, ION combines these three features to give each protein a ranking score through an iterative procedure, without requiring a known set of essential proteins. Unlike other centrality methods, ION identifies essential proteins based not only on the connections between proteins but also on their orthologous properties and the features of their neighbors. To evaluate the performance of ION, we predict essential proteins using yeast data sets. Experimental results show that ION, which integrates the orthologous property of proteins with the properties of their neighbors in the PPI network, predicts better than either property used alone. Moreover, ION achieves better performance in essentiality prediction than the eight existing centrality methods mentioned above (DC, BC, CC, SC, EC, IC, NC and PeC) in terms of their precision-recall (PR) curves and jackknife curves. Among the top 100 ranked proteins, ION identifies 78 essential proteins whereas NC identifies only 55, a 42% improvement over NC, which performs best among the seven existing centrality methods based only on network topology (DC, BC, CC, SC, EC, IC and NC). ION also outperforms PeC, which identifies essential proteins by integrating gene expression data with PPI networks. In particular, the advantage of ION in predicting essential proteins becomes increasingly obvious as more candidate proteins are selected. Moreover, compared with PeC, NC and DC, more of the top 100 proteins ranked by ION belong to complexes with specific biological functions. We also carry out experiments to investigate whether the number of reference organisms influences the performance of ION.
The experimental results show that using all available reference organisms improves the performance of ION. In the last part of the paper, we compare the prediction performance of ION with that of the seven other centrality methods (DC, BC, CC, SC, EC, IC and NC) on proteins from E. coli K-12 (E. coli). The results confirm that ION predicts essential proteins in E. coli better than these seven centrality methods.
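The top-ranked comparison reported above can be reproduced with a few lines of code. The sketch below counts true essential proteins among the top k of a ranking, builds a cumulative count curve (assumed here to be what the jackknife curve plots), and checks the 42% figure from the reported counts of 78 and 55. The function names and the toy ranking are hypothetical.

```python
def essential_in_top_k(ranking, essential, k):
    """Count how many of the top-k ranked proteins are known to be essential."""
    return sum(1 for protein in ranking[:k] if protein in essential)


def cumulative_essential_curve(ranking, essential):
    """Cumulative number of essential proteins at each rank position."""
    curve, hits = [], 0
    for protein in ranking:
        if protein in essential:
            hits += 1
        curve.append(hits)
    return curve


def relative_improvement(new_count, baseline_count):
    """Improvement of new_count over baseline_count, as a fraction of the baseline."""
    return (new_count - baseline_count) / baseline_count


# Toy ranking with hypothetical proteins and essentiality labels.
toy_ranking = ["P1", "P2", "P3", "P4", "P5"]
toy_essential = {"P1", "P3", "P5"}
top3 = essential_in_top_k(toy_ranking, toy_essential, 3)
curve = cumulative_essential_curve(toy_ranking, toy_essential)

# Counts reported in the text for the top 100 proteins: ION 78, NC 55.
improvement = relative_improvement(78, 55)
print(top3, curve, round(improvement * 100))
```

With the reported counts, (78 - 55) / 55 is approximately 0.418, which rounds to the 42% improvement stated above.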