Protein–protein interactions (PPIs) underlie many processes essential to living organisms. Years of small-scale experimental work, along with genome-wide studies powered by high-throughput techniques (e.g. Gavin et al., 2006; Ito et al., 2001; Krogan et al., 2006; Tarassov et al., 2008; Uetz et al., 2000; Yu et al., 2008), have generated significant numbers of known PPIs, which provide a good foundation on which to learn protein sequence features that distinguish interacting protein pairs from non-interacting ones. In general, this has been a difficult and largely unsolved computational problem, exacerbated by strong biases in available datasets, including redundant interactions and skewed amino acid compositions in well-represented protein complexes (e.g. the ribosome). Nonetheless, diverse computational methods have been developed that predict PPIs using protein sequence features (Ben-Hur and Noble, 2005; Bock and Gough, 2001; Chou and Cai, 2006; Gomez et al., 2003; Guo et al., 2008; Martin et al., 2005; Nanni and Lumini, 2006; Pitre et al., 2008; Roy et al., 2009; Shen et al., 2007; Sprinzak and Margalit, 2001; Yu et al., 2010a).
As with all computational prediction methods, improvements to the datasets used for testing and training can strongly affect the quality of the predictions. It is thus critical that protein sequence feature-based PPI prediction methods be validated with appropriate positive and negative datasets. Since the numbers of high-confidence positive PPIs are still relatively modest, especially in comparison to the numbers of potential negative examples, most studies have used as much of the positive PPI data as possible. High-quality negative PPI data are equally important for learning and validation. Unfortunately, such data are not widely available, although a new database has begun to archive them (Smialowski et al., 2010). Therefore, a typical strategy has been to employ protein pairs not previously known to interact as the set of negative PPIs. This is generally a reasonable assumption, given that negative PPIs outnumber positive ones by a factor of hundreds to thousands.
More specifically, let us say that we have P protein pairs known to interact and that these P pairs involve K different proteins. Then there are K(K−1)/2 possible protein pairs. The P pairs known to interact serve as positive examples for predicting new interactions. The remaining N protein pairs, dominated by true negative interactions, are assumed not to interact (in general) and serve as negative examples, where N = K(K−1)/2 − P. Usually, P is of a manageable magnitude whereas N is not. For many algorithms, cross-validating predictions on the complete set of K(K−1)/2 protein pairs (consisting of P positive PPIs and N negative ones) is not feasible simply because the set is too immense. Thus, sampling subsets, especially of negative PPIs, is routine practice.
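The scale of this asymmetry is easy to appreciate with round numbers. The sketch below uses hypothetical values of K and P (they are assumptions for illustration, not figures from this study) to show how quickly N dwarfs P.

```python
# Illustration of the scale of the negative PPI space.
# K and P below are hypothetical round numbers, not values from this study.
K = 6_000    # distinct proteins appearing in known interactions (assumed)
P = 50_000   # known interacting pairs, i.e. positive examples (assumed)

total_pairs = K * (K - 1) // 2   # all K(K-1)/2 possible unordered pairs
N = total_pairs - P              # remaining pairs, presumed non-interacting

print(total_pairs)   # 17997000
print(N)             # 17947000
print(N / P)         # roughly 359 negatives per positive
```

Even with these modest assumed inputs, the full pair space runs to tens of millions, which is why exhaustive cross-validation over all K(K−1)/2 pairs is impractical.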
Typically, one might want an unbiased subset of negative PPIs of size n, where n is of a manageable magnitude. Cross-validated test results on the set combining the P positive PPIs and n negative ones are then assumed to generalize to the whole set of K(K−1)/2 protein pairs, because the negative subset used for cross-validation is an unbiased representative of the N negative PPIs. This type of subset sampling, for cross-validated testing, aims to create unbiased subsets of negative PPIs such that predictive performance estimated with them can safely be assumed to generalize to the population level. Importantly, biased subsets of negative PPIs for cross-validation will likely fail to generalize to population levels, as has been previously demonstrated (Ben-Hur and Noble, 2006).
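An unbiased negative subset of this kind is conventionally drawn by uniform rejection sampling over the pair space. A minimal sketch, assuming n is much smaller than the number of available negatives (the function name and inputs are illustrative, not from the text):

```python
import random

def sample_negative_pairs(proteins, positives, n, seed=0):
    """Uniformly sample n distinct protein pairs not in the positive set.

    proteins:  list of protein identifiers
    positives: iterable of known interacting pairs
    Assumes n << number of available negatives, so rejection sampling
    terminates quickly.
    """
    positive_set = {frozenset(p) for p in positives}
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(proteins, 2))   # uniform random pair
        if pair not in positive_set:                # reject known positives
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```

Because every non-positive pair is equally likely to be drawn, the resulting subset is an unbiased representative of the N negatives, which is exactly the property cross-validated testing relies on.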
Negative PPI subsampling also arises in a second, distinct context: let us assume that our unbiased negative subset is 100 times larger than the set of positive PPIs (i.e. n = 100P). (Note that this ratio of positive to negative PPIs is conservative for most organisms, but employing negative subsets significantly larger than 100P is also not routinely feasible because of practical computational issues such as memory requirements and computation time.) In a typical cross-validated training approach, the training data would then be highly skewed, with negative examples represented 100 times more often than positive ones. Such high skew often adversely affects predictive algorithms, for example leading to the trivial prediction of all negatives, which achieves 99% prediction accuracy. As algorithms such as support vector machines (SVMs) scale poorly with the amount of training data (Park, 2009), subset sampling of negative PPIs for the purposes of training, a context distinct from cross-validated testing, offers substantial advantages.
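The 99% figure follows directly from the class ratio; a quick check with hypothetical counts (the value of P below is an assumption for illustration):

```python
# With a 100:1 negative-to-positive skew, a classifier that predicts
# "no interaction" for every pair already reaches ~99% accuracy.
P = 1_000        # hypothetical number of positive pairs
n = 100 * P      # negative subset 100 times larger, i.e. n = 100P

correct = n                 # trivial all-negative predictor gets all negatives
total = P + n
accuracy = correct / total
print(round(accuracy, 4))   # 0.9901
```

This is why raw accuracy is a misleading metric under heavy class skew, and why rebalancing the training set can matter so much.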
Clearly, the purpose of subset sampling for training differs from that of subset sampling for cross-validated testing. In subset sampling for cross-validation, one desires unbiased subsets representative of the overall population, such that predictive performance estimated with them can safely be assumed to generalize to that population. In subset sampling for training, however, one is concerned with obtaining the subsets most suitable for effective training of prediction algorithms. A lack of bias is key for subset sampling for cross-validation, whereas it may not be for subset sampling for training.
Traditionally, random sampling has been used for subset sampling for cross-validation, in which one randomly and uniformly samples negative PPIs. Recently, Yu et al. (2010b) argued against the suitability of random sampling. Instead, they proposed balanced sampling, in which the number of occurrences of a protein in the negative PPI subset is forced to match that in the positive PPI set. Using this approach, they concluded that PPI prediction algorithms employing protein sequence features perform hardly better than chance.
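The balanced constraint can be sketched as a randomized greedy matching: give each protein a quota equal to its occurrence count in the positive set, then pair off proteins with remaining quota while avoiding known positives. This is only one possible implementation of the idea, under assumed inputs; it is not the procedure of Yu et al. (2010b) verbatim.

```python
import random
from collections import Counter

def balanced_negative_sample(positives, seed=0):
    """Sketch of balanced sampling: each protein appears in the negative
    subset as often as in the positive set. Randomized greedy pairing;
    for small or awkward inputs it may return fewer pairs than ideal."""
    rng = random.Random(seed)
    quota = Counter(p for pair in positives for p in pair)  # target counts
    positive_set = {frozenset(p) for p in positives}
    negatives = set()
    for _ in range(100_000):                     # bounded retry budget
        pool = [p for p, q in quota.items() if q > 0]
        if len(pool) < 2:
            break                                # all quotas exhausted
        a, b = rng.sample(pool, 2)
        pair = frozenset((a, b))
        if pair not in positive_set and pair not in negatives:
            negatives.add(pair)
            quota[a] -= 1
            quota[b] -= 1
    return [tuple(sorted(p)) for p in negatives]
```

By construction, frequently interacting (hub) proteins appear just as often among the sampled negatives as among the positives, which removes the occurrence-count signal that random sampling leaves intact.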
In this study, we demonstrate that balanced sampling for cross-validation generates highly biased negative subsets and that the predictive performance estimated with them does not generalize to populations. We demonstrate that, as a result, the predictive performance for PPIs estimated using balanced sampling is considerably underestimated, and the conclusions of Yu et al. (2010b) regarding PPI algorithm performance are invalid. Nonetheless, balanced sampling does offer advantages for the purposes of training, as we show with tests that isolate these distinct applications of sampling. In particular, we measure the relative contribution to training of the representational bias of hub proteins in PPI sets, versus the contribution of protein sequence features themselves, and we show using balanced sampling that both contribute to effective PPI predictive performance. We conclude that protein sequence features are indeed informative for predicting PPIs, and we provide some guidance for implementing random versus balanced sampling in this context.