Defining the functional relationships between proteins is essential to understand many aspects of biology. A classical approach of understanding gene functional relationships is to produce phenotype of combination mutant in two genes [
1]; such relationships are called genetic interactions. Recently, high throughput methods [
2-
4] have been developed to generate large scale genetic interactions in model organisms, such as yeast [
5],
Schizosaccharomyces pombe [
6] and
E. coli. [
7]. The large scale genetic interactions have attracted much attention as they are capable of defining the genome-wide functional relationships among proteins and are fundamental to comprehensive understanding of the organization of biological systems [
5,
8,
9]. However, even with high throughput methods [
2-
4], experimental mapping of genetic interactions is still extremely labor intensive and one cannot screen genome-wide combinations in multiple cell organisms with ten thousands of genes as of now [
10]. Thus, it is imperative to develop computational approaches to predict genome-wide genetic interactions and help complement and enhance wet-lab studies.
In extreme cases, mutation of two nonessential genes can lead to lethal phenotype; this kind of genetic interaction is called synthetic lethal genetic interaction (SLGI). Figure illustrates one such synthetic lethal genetic interaction. The SLGIs are of interest because they are able to reveal functional relationships between proteins, pathways and complexes [
11-
13]. Two synthetic lethal genes have high probability of occurrence in compensatory pathways [
14] or compensatory complexes [
15]. Furthermore, the SLGIs are potentially useful in finding drug targets or drug combinations [
16].
Prediction of SLGIs is impeded by the limit of understanding of genetic interactions. Unlike protein-protein interactions that are known as physical dockings among proteins, the molecular mechanism under genetic interactions has not been fully understood. Thus, it is difficult to select features and understand how features are related to SLGIs. Several computational approaches have been proposed for prediction of SLGIs, and many features, such as protein interactions, gene expression, functional annotation, gene location, protein network characteristics, and genetic phenotype, have been used by these approaches [
10,
17-
20]. However, those methods depend on other genome-wide experimental results.
It is known as a virtual axiom in biology that the "sequence specifies structure and structure determines functionality" [
21]. We hypothesize that it is possible to predict the SLGIs using the characteristics of protein sequence alone. Recently, we demonstrated that the yeast synthetic lethal genetic interactions can be explained by the genetic interactions between domains of those proteins [
22]. Representing the structures and function of proteins, protein domains are usually regarded as building blocks of proteins and are conserved during evolution. Our studies showed that the domain genetic interactions are new type of relationship between protein domains. Moreover, we found that different domains in multi-domain yeast proteins contribute to their genetic interactions differently. The domain genetic interactions help define more precisely the function related to the synthetic lethal genetic interactions, and then help understand how domains contribute to different functionalities of multi-domain proteins. Using the probabilities of domain genetic interactions, we were able to predict novel yeast synthetic lethal genetic interactions.
However, the feasibility of domain based prediction is limited by the coverage of protein domains. For example, only 4480 of more than 6700 yeast proteins contain PfamA domains. In this study, we used the short polypeptide sequences, which are typically shorter than the classically defined protein domains, to characterize the functionalities of proteins. We demonstrated that the genetic interaction between a pair of proteins can be determined by the genetic interactions between the short polypeptide clusters of those proteins. Using short polypeptide clusters as features, we can not only increase the prediction coverage, but also improve the prediction performance.