We first compared AAC against the evolutionarily-rich protein domain features for predicting interactions in the three yeast interaction datasets. We then compared AAC against the tuple and signature product features, which, like AAC, do not require protein domain information, on yeast, worm and fly datasets. We then performed a post-hoc feature analysis to identify the AAC features that were most beneficial for predicting interactions. Finally, we used classifiers combining AAC and domains to predict the complete yeast interactome and validated novel interactions using Gene Ontology.
Comparison with existing features
The goal of the comparative analysis was: (a) to determine how well a simple feature like AAC performed against well-known features such as domains, (b) to assess whether AAC features can improve performance when used in combination with domains, and (c) to compare AAC to other non-domain sequence features such as the tuple feature and the signature product feature.
AAC performs at par with domains
We trained and tested classifiers on the three yeast datasets (TWOHYB, AFFMS, PCA), selecting only protein pairs for which domain information was available for both proteins. This restriction allowed a fair and direct comparison against a classifier that relies only on domains for interaction prediction. We compared a classifier using domains against classifiers using either AAC monomers or AAC dimers as features. Surprisingly, with the exception of Naive Bayes using AAC dimers (AFFMS, PCA), there was no statistically significant difference in performance between classifiers using AAC features and classifiers using domains. Overall, both AAC features performed at par with domains in the majority of cases across datasets and classifiers, indicating that AAC alone captures a substantial amount of the information required for identifying interacting proteins.
Performance comparison of AAC features (AAC monomer, AAC dimer) against domains.
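To make the two AAC feature types concrete, the following minimal sketch computes monomer and dimer compositions for a single sequence. It assumes that monomer features are the relative frequencies of the 20 standard amino acids and that dimer features are the relative frequencies of the 400 ordered pairs of adjacent residues; the function names are illustrative, and the discretization of frequencies into levels is not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIMERS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def aac_monomers(seq):
    """Relative frequency of each standard amino acid in seq."""
    counts = Counter(c for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    return {a: (counts[a] / total if total else 0.0) for a in AMINO_ACIDS}


def aac_dimers(seq):
    """Relative frequency of each ordered pair of adjacent residues in seq.

    Assumption: overlapping windows of length 2; windows containing a
    non-standard character are skipped.
    """
    valid = set(DIMERS)
    windows = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(w for w in windows if w in valid)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DIMERS}
```

For a protein pair, one plausible encoding is to compute these vectors for each protein and combine them into one feature vector; the exact pair encoding used here is not specified in this section.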
Combining AAC with domains results in no significant improvement in performance
To assess the value of combining AAC with evolutionarily-rich domain features, we combined domains with AAC monomer and AAC dimer features and compared the performance of classifiers using the combined feature set against classifiers using either feature type alone. We estimated performance on the protein pairs for which we had domains to allow comparison against a classifier that used only domains as features. In all three datasets, combining AAC with domain features did not significantly change performance. This is not surprising, because these protein pairs have domains and should therefore be highly predictable using the domain-based classifier. This result suggests that we can safely combine simple sequence-based features with evolutionarily-conserved features such as domains without suffering any performance loss due to excessive features.
Performance comparison of AAC features in combination with domains.
AAC performs at par with non-domain features
We compared the performance of AAC with other non-domain features, which can also predict interactions between proteins lacking domain information. The two features that we evaluated were the tuple feature from Gomez et al. and the signature products (Sigprod) from Martin et al.
We compared AAC against the tuple and Sigprod features on three yeast datasets (TWOHYB, AFFMS, PCA). The three datasets were each split into two parts: protein pairs with domains and protein pairs without domains. We report the performance on protein pairs with domains (With domains), on protein pairs without domains (No Domains) and on the complete dataset (All Protein pairs). The AUC scores on protein pairs without domains measure how well non-domain features, including AAC, predict interactions (or non-interactions) among proteins for which no domain information is available. The AUC scores on the complete datasets measure the overall performance of the different features on protein pairs irrespective of domain information availability. These results are for the SVM classifier, because it provides performance numbers for all features (Sigprod is specific to an SVM classifier). Results for the Maximum entropy classifier are similar (Supporting Text S1, Fig. S4).
On protein pairs without domains (), AAC dimers were significantly better (
) than tuples for AFFMS. AAC monomer was also better than tuples for AFFMS (
). Tuples were never significantly better than the AAC features. AAC features also performed at par with Sigprod with the exception of AAC dimers for AFFMS. This indicates that for these protein pairs, AAC dimers capture the majority of the information captured in Sigprod.
Performance of AAC features against other non-domain features.
On protein pairs with domains, Sigprod outperformed the AAC features in fewer cases (PCA for dimers; PCA and TWOHYB for monomers) than it outperformed tuples. Both AAC features were much closer to Sigprod than tuples were, especially on the largest dataset (AFFMS). Finally, on the complete datasets, AAC dimers were at par with Sigprod in all three datasets, whereas AAC monomers were at par with Sigprod in two datasets. Overall, AAC features were better than tuples, and AAC dimers were at par with the Sigprod features in the majority of cases.
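The AUC scores used in these comparisons can be read as the probability that a randomly chosen interacting pair receives a higher classifier score than a randomly chosen non-interacting pair. The sketch below computes AUC directly from that definition; it is an illustration with hypothetical function and argument names, not the evaluation code used for the reported results.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for interacting pairs, 0 for non-interacting pairs.
    scores: classifier outputs, higher meaning more likely to interact.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large datasets.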
Performance comparison on fly and worm datasets
In addition to comparing AAC features on the three yeast datasets described above, we also compared AAC features on two-hybrid datasets from worm and fly. We compared AAC features against the Sigprod features and found that AAC features performed at par with Sigprod. Overall, these results suggest that features based on AAC can perform as well as existing sequence-based features that do not require domains. This level of performance held across different organisms and different datasets.
Comparison of AAC features against signature product features (Sigprod) on protein interaction datasets from worm and fly.
Identification of important features
It is very surprising that features as simple as AAC monomers and dimers can perform at par with more complex features such as tuples, domains or signature products. To investigate what makes AAC a good feature for protein interaction prediction, we considered a classifier using both domains and AAC features and obtained the AAC features that occurred among the
features most important for interaction prediction. We then asked if there were any AAC monomers and dimers that were statistically over-represented in known protein interaction domains 
We considered the true positives among the top
predicted protein pairs and obtained the
most important features for each set of true positives. These predictions were obtained from classifiers using both AAC and domains features. AAC features comprised 10–40% of the top
features (). The highest proportion of AAC features were from
, decreasing with larger
, suggesting that AAC contributes to the highest confidence predictions.
Figure 5 Percentage of the AAC features among the top features.
We found that several of the AAC monomers and dimers were statistically over-represented in regions representing protein-protein interaction domains. Features differing only in discretization level were considered the same. For example, A_1 and A_2 were both considered as Alanine, where 1 and 2 represent the discretization level. We assessed the statistical significance of observing this proportion of over-represented AAC monomers and dimers by comparing it to the proportion of all possible AAC monomers and dimers that are enriched in the interaction domains. Of the 420 possible monomers and dimers, 175 are statistically over-represented in protein interaction domains. We found that the proportion of over-represented features was statistically significant for some (
) but not all cases. The AAC monomers and dimers that are over-represented are likely capturing crucial information in domains and therefore helping interaction prediction. However, the proportion of AAC monomers and dimers over-represented in domain regions is not always significant, suggesting that overall AAC composition may be capturing additional interaction-sensitive information outside of protein domains.
Percentage of the top AAC monomers and dimers that were significantly enriched in domain regions involved in protein interactions.
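One way to compare the observed proportion against the background of 175 over-represented features out of 420 is a hypergeometric tail probability. The sketch below assumes that test; the section does not name the test that was actually used, and the function and argument names are illustrative.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population of size `population` that contains `successes` marked items.

    Assumed mapping (not confirmed by the text): population = 420 possible
    monomers and dimers, successes = 175 over-represented ones, draws = number
    of distinct top AAC features, k = how many of those are over-represented.
    """
    k = max(k, 0)
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total
```

For instance, `hypergeom_tail(8, 420, 175, 10)` would give the chance of seeing at least 8 over-represented features among 10 distinct top features if the top features were drawn at random; the numbers 8 and 10 are hypothetical.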
To illustrate visually that the dimers were capturing meaningful information about interactions, we considered two proteins: EFT2, from the high-confidence interacting pairs, and RNR1, from the high-confidence non-interacting pairs. We selected these proteins from the AFFMS dataset such that the proteins had roughly the same length and had an associated structure in the Protein Data Bank covering
of the protein. We then displayed a select set of dimers that had high scores. Several of the dimers (KA, EQ) differed in concentration between the two proteins, indicating that the dimers were capturing information that discriminates between interacting and non-interacting proteins. Although this is a single example among all the interacting and non-interacting proteins, we found this visual differentiation between the two protein types encouraging; it opens up directions for future research relating the 3D structure of proteins to dimer concentrations.
Three-dimensional structures of EFT2 and RNR1 proteins obtained from the Protein Data Bank.
We examined the overlap between the statistically over-represented AAC monomers and dimers from the three datasets using
. We selected
because the percentage of AAC features among the top
features was maximal for
. We selected
to include a large number of features for comparison. There was little overlap, suggesting that these datasets capture different sets of protein interactions. The small overlap set included both hydrophilic (Tyrosine (Y), Tryptophan (W)) and hydrophobic amino acids (Alanine (A), Isoleucine (I)). We found slightly more overlap between TWOHYB and PCA than between either of them and AFFMS, which is not surprising because both PCA and TWOHYB are pairwise interaction sets whereas AFFMS is a co-complex interaction set. Features common to AFFMS and TWOHYB included charged amino acids (Aspartic acid (D), Glutamic acid (E)) and mostly non-polar amino acids (Valine (V), Phenylalanine (F), Alanine (A)). The features common to PCA and AFFMS included one charged amino acid (Arginine (R)), but mostly non-polar amino acids (Glycine (G), Leucine (L), Tryptophan (W)). Finally, the features shared by PCA and TWOHYB involved mostly non-polar amino acids, with the exception of one polar amino acid (Tyrosine (T)). The identification of primarily non-polar amino acids is somewhat surprising, since it is the polar amino acids that are on the surface of proteins and are thought to participate in protein interactions. Thus, the features identified here must be related to some other characteristic of the interacting proteins.
Overlap of AAC monomers and dimers from different datasets.
AAC monomers and dimers over-represented in protein interaction domains.
Features that were exclusive to AFFMS included all the charged amino acids (D, E, K, R, H) and one polar amino acid (Q), with the remainder being non-polar amino acids (A, G, M, V, I). In contrast, PCA and TWOHYB had very few charged amino acids: only Aspartic acid (D) in TWOHYB, and Aspartic acid (D) and Arginine (R) in PCA. The presence of all charged amino acids in AFFMS suggests that charge may be important for forming large protein complexes. Features exclusive to TWOHYB had only one polar amino acid (Glutamine, Q); the remaining were all non-polar (F, G, I, V, M). Finally, features exclusive to PCA included polar (Q, T, Y), charged (D, R) and non-polar amino acids (A, F, C, M, V, W). Overall, PCA had the widest range of amino acids, even though it was the smallest dataset.
Our post-hoc analysis of important features led us to conclude that several AAC monomers and dimers were significantly enriched in domains involved in protein interactions, but the specific features that were deemed important depended on the dataset: features involving charged amino acids in AFFMS, and non-polar amino acids in TWOHYB and a mixture of polar and non-polar amino acids in PCA.
Whole yeast proteome analysis: Identification of novel interactions
To predict interactions in the entire yeast genome, we trained three classifiers on the AFFMS, PCA and TWOHYB datasets. The predicted interactome was created from the intersection of the interaction sets predicted by each classifier. We considered intersections at different confidence levels, ranging from 80% to 95%, and identified the number of known interactions at each confidence level. A large proportion of the predicted interaction set consisted of novel interactions.
Number of predicted and true interactions at different confidence levels.
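The intersection step can be sketched as follows. This is a minimal illustration that assumes each classifier outputs a confidence between 0 and 1 for every candidate protein pair and that pairs are unordered; the function name and data layout are hypothetical.

```python
def consensus_interactions(predictions, threshold):
    """Pairs predicted to interact by every classifier at the given confidence.

    predictions: dict mapping a training-set name (e.g. "AFFMS") to a dict of
                 {(protein_a, protein_b): confidence}.
    threshold:   minimum confidence, e.g. 0.95 for the 95% level.
    Pairs are stored as frozensets so that (a, b) and (b, a) are the same pair.
    """
    per_classifier = [
        {frozenset(pair) for pair, conf in scores.items() if conf >= threshold}
        for scores in predictions.values()
    ]
    return set.intersection(*per_classifier) if per_classifier else set()


def count_known(consensus, known_pairs):
    """Number of consensus pairs that are already in a set of known interactions."""
    known = {frozenset(pair) for pair in known_pairs}
    return len(consensus & known)
```

Raising the threshold shrinks each per-classifier set and therefore the intersection, which is why the number of predicted interactions falls as the confidence level increases.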
Because many of our interactions were novel, we carried out preliminary validation using expression data and Gene Ontology categories. We expected that interacting proteins would tend to be co-expressed and to share similar processes or locations. For co-expression analysis we computed the correlation coefficient between the two proteins of a predicted interaction (or non-interaction) using expression data from Gasch et al., which profiled the transcriptomic response of yeast cells under different stress conditions. We found that the average correlation for the interactions (
), while low, is higher than that for the non-interacting proteins (
, Kolmogorov-Smirnov
). This low correlation has been seen before and suggests that protein stability, maintained via post-translational modifications, may play a significant role in complex formation and function 
, or may be due to proteins interacting under conditions not captured in the expression dataset. However, compared to non-interacting proteins, the interacting proteins exhibit a significant bias in the distribution towards positive correlation.
Distribution of co-expression of predicted interactions and non-interactions at different confidence levels.
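The co-expression comparison has two parts: a correlation coefficient per protein pair, and a comparison of the resulting distributions for predicted interactions and non-interactions. The sketch below assumes Pearson correlation over expression profiles and computes only the two-sample Kolmogorov-Smirnov statistic (the maximum gap between empirical distribution functions), not its p-value; the function names are illustrative.

```python
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

In this setting `sample_a` would hold the correlations of predicted interacting pairs and `sample_b` those of predicted non-interacting pairs; a significance level for the statistic would come from a statistics library or from the Kolmogorov distribution.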
We further analyzed these interactions for co-localization, co-function, and co-process using GO Slim terms and found that proteins predicted to interact tended to co-localize, or participate in the same processes more than the proteins predicted to not interact (). In particular, interacting proteins were statistically enriched for co-localization (
) and co-process (
) whereas predicted non-interactions were statistically depleted for co-localization (
) and co-process (
). For function, even though predicted interactions had a higher fraction of interactions participating in the same function, both interacting and non-interacting proteins were enriched for co-function. This suggests that GO slim functional categories may not be as predictive of interacting versus non-interacting proteins as process and location. This is consistent with low sensitivity of protein interaction identification using all GO molecular functions versus sensitivity using a filtered set of functions 
. To investigate this further we considered the enrichment on a per functional category basis and found that both interacting and non-interacting proteins were enriched in hydrolase activity, and non-interacting proteins were enriched in transferase activity. Further, on excluding these two categories, the non-interacting proteins were no longer enriched in co-function while the interacting proteins remained enriched in co-function (
). This suggests that proteins that are hydrolases may be further grouped into other categories, some of which interact and some of which do not. It also suggests that proteins that are transferases tend not to interact with each other. This gives us an interesting direction for future research: investigating the propensity of different proteins to interact based on their functional roles. The high enrichment of co-localization and co-process among the predicted interactions supports these predictions, and suggests that future experimental validation of the high-confidence predictions is likely to yield true positive interactions.
Co-annotation of predicted interactions and non-interactions at different confidence levels of interaction.
Analysis of novel interactions: Identification of new function
We identified 1412 high confidence (95%) interactions, including 197 existing interactions. We examined more closely the most highly connected nodes (hub nodes) of this high confidence network, where a hub was a node with
interaction partners. The largest hub was the protein LAS17, an actin assembly protein and the yeast homolog of the human Wiskott-Aldrich syndrome protein
. This protein has 12 known interactions in the existing interaction databases and we found 176 more interactions, most of which were among proteins involved in actin cytoskeleton organization, consistent with the known function of LAS17.
Gene ontology enrichment of the hubs identified cell budding, cytokinesis and mRNA stability and catabolism as additional enriched processes. Other protein hubs were also involved in a variety of processes including nuclear transport (KAP95, SRP1), transcription (NOT3, NAB3) and telomere maintenance (GAL11, STO1). Because hubs captured the majority of the interactions, we concluded that interactions in the high confidence network were involved in cell-budding, actin assembly, nuclear pore transport and mRNA stability.
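Hub identification amounts to counting distinct interaction partners per protein. The sketch below is an illustration only: the partner cutoff is passed in as `min_partners` because the cutoff used in the analysis is not reproduced here, treating the cutoff as inclusive is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def partner_counts(edges):
    """Number of distinct interaction partners for each protein.

    edges: iterable of (protein_a, protein_b) pairs. Self-interactions are
    ignored (an assumption made for this sketch).
    """
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}


def find_hubs(edges, min_partners):
    """Proteins with at least min_partners partners, largest hubs first."""
    counts = partner_counts(edges)
    selected = [p for p, c in counts.items() if c >= min_partners]
    return sorted(selected, key=lambda p: (-counts[p], p))
```

Ties in partner count are broken alphabetically so that the ordering is reproducible.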
One of our goals, using sequence-based interaction classifiers, was to capture and analyze interactions among proteins that cannot be analyzed using domain-based methods. This is especially useful for uncharacterized
proteins for which roles may be inferred based on interacting proteins. Therefore we focused on predicted protein pairs where one of the proteins did not have any known domains. There were a total of 169 such interactions including 75 interactions involving 13 uncharacterized proteins. One of the uncharacterized proteins (YJR151W-A) was also a hub with 37 interaction partners (). Using the “guilt by association” approach we predict that this protein has a role in transcription, because of its predicted interactions with several universal transcription initiation factors (TIF and TAF), and also in mRNA processing and metabolism, because of its interactions with splicing factors, P-body, and translation-initiation proteins 
. Interestingly, YJR151W-A may not have been studied carefully because it was not thought to be a gene. We assigned putative roles to other uncharacterized proteins based on their interactions with other characterized proteins (). The ability to assign new putative function to uncharacterized proteins, for which domains are also not available, highlights the usefulness of predicting protein interactions using non-domain features such as AAC. Overall our interaction set had both known and uncharacterized proteins, allowing us to validate existing knowledge and predict new function for uncharacterized proteins.
Protein interaction sub network with the uncharacterized ORFs.
Predicted function of uncharacterized ORFs.
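The "guilt by association" step can be sketched as a tally of the annotations carried by a protein's predicted interaction partners. This is a simplified illustration with hypothetical names and toy inputs; it ranks annotations by raw counts and does not test whether a count is statistically enriched.

```python
from collections import Counter


def guilt_by_association(protein, edges, annotations, top=3):
    """Rank annotations by how many predicted partners of `protein` carry them.

    edges:       iterable of (protein_a, protein_b) predicted interactions.
    annotations: dict mapping a characterized protein to a collection of terms
                 (for example GO Slim process terms).
    Returns up to `top` (term, partner_count) tuples, most frequent first.
    """
    partners = set()
    for a, b in edges:
        if a == protein and b != protein:
            partners.add(b)
        elif b == protein and a != protein:
            partners.add(a)
    votes = Counter()
    for partner in partners:
        votes.update(set(annotations.get(partner, ())))
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]
```

A call such as `guilt_by_association("ORF_X", edges, annotations)` returns the most common partner annotations for the hypothetical protein `ORF_X`, which can then be treated as candidate roles for it.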