Five datasets were used in this paper:
• Nr6505 - For analyzing the protein interface conservation.
• Oblig94 and Trans135 - For comparing the degree of conservation of protein interfaces in transient/obligate binding proteins.
• Benchmark180 - For evaluating the prediction performance of HomPPI.
• S1 and S2 - For evaluating the performance of NPS-HomPPI on interfaces of disordered proteins.
• nr_pdbaa_s2c - For BLASTP searches for close sequence homologs.
We extracted a maximal non-redundant set of known protein-protein interacting chains from the Protein Data Bank (PDB) [71] available on 2/4/2010. To eliminate the influence of over-represented protein families in PDB, we built Nr6505 using the following steps:
1. Extract all the X-ray derived protein structures with resolution 3.5 Å or better in PDB. Remove proteins with fewer than 40 residues. We obtained 102,853 protein chains.
2. Remove redundancy from the dataset resulting from step 1 using PISCES [103]. All the remaining sequences have less than or equal to 30% sequence similarity. We obtained 6505 chains.
Oblig94 and Trans135
The dataset of 94 obligate protein-protein dimer complexes and the dataset of 135 transient dimer complexes were obtained from a large non-redundant dataset of 115 obligate complexes and 212 transient complexes (3.25 Å or better resolution, determined using X-ray crystallography) previously generated by Mintseris and Weng [76] to study the conservation of protein-protein interfaces. In order to exclude the influence of other types of interfaces, we extracted 94 obligate dimers and 135 transient dimers from the original dataset to obtain Oblig94 and Trans135. In Oblig94, 1QLA has been superseded by 2BS2. In Trans135, 1DN1 and 1IIS have been superseded by 3C98 and 1T83, respectively, and 1F83, 1DF9, 4CPA and 1JCH have since been deemed obsolete and hence removed from PDB.
We tested NPS-HomPPI on a benchmark dataset manually collected and used as an evaluation dataset by Bradford and Westhead [79]. This dataset consists of 180 protein chains taken from 149 complexes; 36 of these are involved in enzyme-inhibitor interactions, 27 in hetero-obligate interactions, 87 in homo-obligate interactions, and 30 in non-enzyme-inhibitor transient (NEIT) interactions.
Disordered protein datasets S1 and S2
We evaluated the performance of NPS-HomPPI on a non-redundant disordered dataset recently collected by Meszaros et al. [87]. S1 consists of 46 complexes of short disordered and long globular proteins. S2 consists of 28 complexes of long disordered and long globular proteins. Note that a protein complex (e.g., 1fv1 C:AB) formed by a disordered protein C with two ordered proteins A and B yields two sets of interface residues for C (corresponding to the interfaces between C and A, and between C and B). As a result, the 46 complexes in S1 and the 28 complexes in S2 correspond to 56 and 40 interfaces of disordered proteins, respectively. We focused on cases in which NPS-HomPPI is able to identify Safe/Twilight/Dark zone homologs for the query proteins, resulting in NPS-HomPPI interface predictions for 28 out of 56 and 31 out of 40 interfaces of disordered proteins in S1 and S2, respectively.
This dataset is used for BLASTP searches. We used the FASTA files from the S2C database [104] to generate our BLAST database nr_pdbaa_s2c. We removed proteins with resolution worse than 3.5 Å from the S2C FASTA-formatted database and then built a non-redundant database for BLAST queries by grouping proteins with identical sequences into one entry. We used the resulting database to search for homologs of a query sequence using BLASTP 2.2.22+ [77]. There are 36,352 sequences and 9,549,671 total residues in nr_pdbaa_s2c.
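The grouping of identical sequences into one entry can be sketched as follows. This is a minimal illustration, not the pipeline used to build nr_pdbaa_s2c: the function name `collapse_identical`, the chain identifiers, the example sequences, and the convention of joining identifiers with ";" are all assumptions made for the example.

```python
def collapse_identical(records):
    """Group (chain_id, sequence) pairs so that each distinct sequence appears once.

    Returns a list of (merged_header, sequence) tuples in first-seen order.
    The merged header joins the chain ids with ';' (an illustrative convention).
    """
    groups = {}  # sequence -> list of chain ids; dicts keep insertion order
    for chain_id, seq in records:
        groups.setdefault(seq.upper(), []).append(chain_id)
    return [(";".join(ids), seq) for seq, ids in groups.items()]


# Hypothetical input: two chains share one sequence, a third is distinct.
records = [
    ("1abcA", "MKTAYIAK"),
    ("2xyzB", "MKTAYIAK"),
    ("3defA", "GSHMLEDP"),
]
nr_entries = collapse_identical(records)
for header, seq in nr_entries:
    print(f">{header}\n{seq}")
```

Keeping every chain identifier in the merged header means a BLAST hit on the single entry can still be traced back to all structures that carry that sequence.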
This paper adopts a stringent definition of protein-protein interfaces. Surface residues are defined as residues that have a relative solvent accessible area (RASA) of at least 5% [84]. Interface residues are defined as surface residues with at least one atom that is within a distance of 4 Å from any of the atoms of residues in another chain. The ratios of interface residues versus the total number of residues for the datasets used in this work are summarized in Table . Interface information was extracted from the ProtInDB server (http://protInDB.cs.iastate.edu).
The Proportion of Interface Residues in Datasets used in this Study.
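The interface definition given above (a surface residue with at least one atom within 4 Å of any atom in another chain) can be sketched as below. This is a minimal illustration under stated assumptions: atoms are supplied as (residue id, coordinate) pairs already parsed from a structure file, the set of surface residues (RASA of at least 5%) has been computed elsewhere, and the function name `interface_residues` and the toy coordinates are hypothetical.

```python
import math


def interface_residues(chain_atoms, partner_atoms, surface_residues, cutoff=4.0):
    """Return the ids of surface residues in a chain that contact a partner chain.

    chain_atoms, partner_atoms: lists of (residue_id, (x, y, z)), one item per atom.
    surface_residues: set of residue ids with RASA >= 5% (assumed precomputed).
    A residue is an interface residue if any of its atoms lies within `cutoff`
    angstroms of any atom of the partner chain.
    """
    interface = set()
    for res_id, xyz in chain_atoms:
        if res_id not in surface_residues or res_id in interface:
            continue
        if any(math.dist(xyz, pxyz) <= cutoff for _, pxyz in partner_atoms):
            interface.add(res_id)
    return interface


# Toy coordinates (hypothetical): residue 1 is 3.5 angstroms from the partner atom,
# residue 2 is far away, residue 3 is not a surface residue.
chain_a = [(1, (0.0, 0.0, 0.0)), (2, (10.0, 0.0, 0.0)), (3, (0.0, 3.0, 0.0))]
chain_b = [(101, (0.0, 0.0, 3.5))]
result = interface_residues(chain_a, chain_b, surface_residues={1, 2})
print(sorted(result))
```

The all-pairs distance check is quadratic in the number of atoms; for large complexes a spatial index would be the usual replacement, but the labelling rule stays the same.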
Mapping Interfaces in Structures to Sequences
We label the protein sequences as interface or non-interface residues (according to the definition of interface residues given above) as follows: We first calculate the relevant distances between atoms using the atom coordinates in ATOM section in PDB files. Then, by associating the ATOM section to residues in the SEQRES section, we can map the corresponding residues to protein sequences. However, various errors in PDB files make this a non-trivial task. Hence, we used the mapping files from S2C database, which offers corrected mapping information from ATOM section to residues in the SEQRES section of PDB files, to map interfaces determined in structures to full sequences.
NCBI BLAST Parameters
The amino acid substitution matrix and gap cost are essential parameters that need to be specified in BLAST searches. In this study, we used the substitution matrices and gap costs recommended for the different query lengths [105] (see Table ).
BLAST Substitution Matrices and Gap Costs used for BLASTP searches in this paper.
To evaluate the extent to which protein interfaces are conserved in query-homolog pairs, and to estimate the performance of HomPPI and of the other predictors we compare against in predicting the interface residues of a novel protein (i.e., one not used to train the predictor), we consider several standard performance measures: sensitivity (recall), specificity (precision), accuracy and the Matthews correlation coefficient (CC) [106]. Specifically, for each test protein i, we calculate the corresponding performance measures:
\[
\mathrm{Sensitivity}_i = \frac{TP_i}{TP_i + FN_i}, \qquad
\mathrm{Specificity}_i = \frac{TP_i}{TP_i + FP_i}, \qquad
\mathrm{Accuracy}_i = \frac{TP_i + TN_i}{TP_i + FP_i + TN_i + FN_i},
\]
\[
CC_i = \frac{TP_i \times TN_i - FP_i \times FN_i}{\sqrt{(TP_i + FN_i)(TP_i + FP_i)(TN_i + FP_i)(TN_i + FN_i)}}
\]
where TP_i, FP_i, TN_i and FN_i are respectively the number of interface residues of protein i that are correctly predicted to be interface residues, the number of residues of protein i that are incorrectly predicted to be interface residues, the number of residues of protein i that are correctly predicted to be non-interface residues, and the number of residues of protein i that are incorrectly predicted to be non-interface residues.
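We calculate the protein-based overall performance measures as follows:
\[
\mathrm{Sensitivity}^{P} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Sensitivity}_i, \qquad
\mathrm{Specificity}^{P} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Specificity}_i,
\]
\[
\mathrm{Accuracy}^{P} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Accuracy}_i, \qquad
CC^{P} = \frac{1}{N}\sum_{i=1}^{N} CC_i
\]
where N is the total number of test proteins.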
These measures describe different aspects of predictor performance. The overall sensitivity is the probability, on average, of correctly predicting the interface residues of a given protein. The overall specificity is the probability, on average, that a predicted interface residue in any given protein is in fact an interface residue. The overall accuracy corresponds to the fraction of residues in any given protein, on average, that are correctly predicted. The overall Matthews correlation coefficient measures how well predictions correlate, on average, with true interfaces and non-interfaces.
Often it is possible to trade off one performance measure (e.g., specificity) against another (e.g., sensitivity) by varying the threshold that is applied to the prediction score to generate the binary (interface versus non-interface) predictions. Hence, we include plots of the overall sensitivity against the overall specificity for different choices of the threshold. The resulting specificity-sensitivity plots (or precision-recall plots) show the trade-off between sensitivity and specificity and hence provide a much more complete picture of predictive performance.
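The performance measures described above provide an estimate of the reliability of the predictor in predicting interface residues of a novel protein. It is worth noting that most papers in the literature on interface residue prediction report performance measures averaged over residues (as opposed to proteins). The residue-based overall performance measures are calculated from the counts pooled over all test proteins, with TP = \sum_i TP_i, FP = \sum_i FP_i, TN = \sum_i TN_i and FN = \sum_i FN_i, as follows:
\[
\mathrm{Sensitivity}^{R} = \frac{TP}{TP + FN}, \qquad
\mathrm{Specificity}^{R} = \frac{TP}{TP + FP}, \qquad
\mathrm{Accuracy}^{R} = \frac{TP + TN}{TP + FP + TN + FN},
\]
\[
CC^{R} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}
\]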
Residue-based specificity-sensitivity plots in this case show how the trade-off between sensitivity^R and specificity^R changes as the threshold applied to the prediction score is varied. The residue-based performance measures provide an estimate of the reliability of the predictor in correctly labelling a given residue. However, in practice, it is more useful to know how well a predictor can be expected to perform on a given protein sequence than on a given residue, so sensitivity^P, specificity^P, accuracy^P, and CC^P are more informative than their residue-based counterparts. Hence, in this paper, we report results based on the protein-based measures although, for the purpose of comparison with other published methods, we include the results based on the residue-based measures in the Supplementary Materials on the HomPPI website.
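The difference between protein-based and residue-based averaging can be illustrated with a short sketch. It assumes the standard definitions of the four measures (with specificity taken as precision, as in this paper) and returns 0.0 when a denominator is zero, which is a convention chosen for this example rather than one stated in the text. The function names and the per-protein counts are hypothetical.

```python
import math


def measures(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy, CC) from confusion counts.

    Specificity here means precision, TP / (TP + FP), following the paper's usage.
    Zero denominators yield 0.0 (an assumption made for this sketch).
    """
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tp / (tp + fp) if tp + fp else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, cc


def protein_based(counts):
    """Average each measure over proteins; counts is a list of (TP, FP, TN, FN)."""
    per_protein = [measures(*c) for c in counts]
    n = len(per_protein)
    return tuple(sum(m[k] for m in per_protein) / n for k in range(4))


def residue_based(counts):
    """Pool the counts over all proteins, then compute each measure once."""
    tp, fp, tn, fn = (sum(c[k] for c in counts) for k in range(4))
    return measures(tp, fp, tn, fn)


# Hypothetical counts for a large protein and a small protein.
counts = [(8, 2, 88, 2), (1, 3, 15, 1)]
sens_p, spec_p, acc_p, cc_p = protein_based(counts)
sens_r, spec_r, acc_r, cc_r = residue_based(counts)
print(f"protein-based sensitivity {sens_p:.3f}, specificity {spec_p:.3f}")
print(f"residue-based sensitivity {sens_r:.3f}, specificity {spec_r:.3f}")
```

In this toy case the large protein dominates the pooled counts, so the residue-based values sit closer to that protein's own scores, while the protein-based values weight both proteins equally.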
Interface Conservation (IC) Scores
In protein interface conservation analysis, we used the CC (defined above) as a measure of the extent to which the interface residues in a query protein are similar to those in a putative homolog. For clarity, we refer to this measure as the Interface Conservation (IC) score.
NPS-HomPPI is a Non-Partner-Specific Homologous Sequence-Based Protein-Protein Interface Prediction algorithm. NPS-HomPPI is based on the conclusion from statistical analysis of protein interface conservation on Nr6505, Trans135 and Oblig94, i.e., that protein interfaces are conserved across close sequence homologs.
As illustrated in Figure , NPS-HomPPI predicts interface residues in a query protein based on the known interface residues of a selected subset of homologs in a sequence alignment. Homologs of the query protein sequence are identified by searching the nr_pdbaa_s2c database using BLASTP. Note that, in our experiments, in order to allow unbiased evaluation of the performance of NPS-HomPPI, the query sequence itself and any sequences that share a high degree (≥95%) of amino acid sequence identity with the query sequence and are from the same species are deleted from the set of putative homologs.
Figure 11. An example of interface residue prediction using NPS-HomPPI. The sequence of the query protein 1byf chain A is BLASTed against the nr_pdbaa_s2c database. In this case, 3 sequences meet the thresholds set by NPS-HomPPI for "close homolog" in Safe Zone or Twilight ...
If at least one homolog in the Safe Zone is found by the BLASTP search, NPS-HomPPI uses the Safe Zone homolog(s) to infer the interfaces of the query protein. Otherwise, the search is repeated for homologs in the Twilight Zone and then in the Dark Zone. If NPS-HomPPI cannot find homologs in any of the three zones, it does not provide any predictions. The default zone boundaries used by NPS-HomPPI (and hence the parameters used in its search for homologs of a query sequence) are based on our interface conservation analysis on the dataset of transient dimers, Trans135 (Table ). These default thresholds are intentionally conservative; they can be relaxed if additional information is available (e.g., if the query protein is known to be an obligate binding protein).
The IC score of each homolog in the alignment returned by BLASTP is predicted using the regression model for the IC score (see eq. 1) from the BLASTP statistics for the alignment of that homolog with the query sequence. The homologs are then ranked in decreasing order of their predicted IC scores, and at most K of the top-ranked homologs from the highest available zone (Safe Zone homologs; Twilight Zone homologs if no Safe Zone homologs exist; Dark Zone homologs if neither Safe nor Twilight Zone homologs exist) are used to infer the interface residues of the query sequence. In our experiments, K, the maximum number of homologs used in the prediction, was set to 10.
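The zone fallback and top-K selection described above can be sketched as follows. The sketch assumes each BLASTP hit has already been assigned to a zone and given a predicted IC score; the zone thresholds and the regression model themselves are not reproduced here. The function name `select_homologs`, the dictionary keys, and the example hits are hypothetical.

```python
ZONE_ORDER = ("safe", "twilight", "dark")  # searched from highest to lowest homology


def select_homologs(hits, k=10):
    """Pick at most k homologs from the first non-empty zone.

    hits: list of dicts with keys 'id', 'zone' (one of ZONE_ORDER) and
    'predicted_ic' (IC score predicted from BLASTP alignment statistics).
    Returns (zone, selected_hits); (None, []) means no prediction is made.
    """
    for zone in ZONE_ORDER:
        in_zone = [h for h in hits if h["zone"] == zone]
        if in_zone:
            ranked = sorted(in_zone, key=lambda h: h["predicted_ic"], reverse=True)
            return zone, ranked[:k]
    return None, []


# Hypothetical hits: no Safe Zone homolog, so Twilight Zone homologs are used
# and the Dark Zone hit is ignored even though its predicted IC score is higher.
hits = [
    {"id": "h1", "zone": "twilight", "predicted_ic": 0.41},
    {"id": "h2", "zone": "dark", "predicted_ic": 0.55},
    {"id": "h3", "zone": "twilight", "predicted_ic": 0.62},
]
zone, chosen = select_homologs(hits, k=10)
print(zone, [h["id"] for h in chosen])
```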
Once the (at most) K closest homologs to be used for predicting the interface residues of the query sequence are chosen, each residue in the query sequence is labelled as an interface or non-interface residue based on the majority (over the set of at most K closest homologs of the query sequence) of the labels associated with the corresponding position in the alignment. More specifically, each of the at most K homologs provides a positive vote for a given position in the query sequence if the corresponding residue of the homolog is an interface residue, and a negative vote if it is a non-interface residue. The prediction score of NPS-HomPPI for that position in the query sequence is simply the number of positive votes divided by the total number of votes. A query sequence residue with an NPS-HomPPI score ≥0.5 is predicted to be an interface residue (see Figure for an example); otherwise, it is predicted to be a non-interface residue. This procedure can be seen as an application of the (at most) K nearest neighbor classifier at each residue of the query sequence.
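The per-position voting can be sketched in a few lines. The sketch assumes the homolog labels have already been projected onto query positions through the alignment. How alignment gaps are handled is not specified in the text, so two choices here are assumptions: a homolog with no aligned residue at a position (marked `None`) casts no vote, and a position with no votes scores 0. The function name `vote` and the example labels are hypothetical.

```python
def vote(query_len, homolog_labels, threshold=0.5):
    """Score and label each query position from homolog interface labels.

    homolog_labels: one list per selected homolog, each of length query_len,
    with 1 = interface, 0 = non-interface, None = no aligned residue
    (assumed here to cast no vote).
    Returns (scores, predictions); a position is predicted as interface when
    positive votes / total votes >= threshold.
    """
    scores, predictions = [], []
    for pos in range(query_len):
        votes = [labels[pos] for labels in homolog_labels if labels[pos] is not None]
        score = sum(votes) / len(votes) if votes else 0.0
        scores.append(score)
        predictions.append(1 if score >= threshold else 0)
    return scores, predictions


# Three hypothetical homologs aligned to a query of length 4.
labels = [
    [1, 1, 0, None],
    [1, 0, 0, 0],
    [0, 1, 0, None],
]
scores, predictions = vote(4, labels)
print(predictions)
```

Because the threshold comparison is inclusive, a tie between positive and negative votes (score exactly 0.5) is labelled as interface, matching the ≥0.5 rule.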
NPS-Interface Conservation As a Function of Sequence Alignment
We built a linear model for NPS-interface conservation based on the most important sequence alignment statistics identified in the PCA analysis: logEVal, Positive Score, logLAL.
Variables, parameter estimates and coefficients are shown in Table . All the coefficients are significant.
Variables, Parameter Estimates and Significance Values for the Linear Model for NPS-Interface Conservation.
PS-HomPPI predicts the interface residues in a protein chain based on the known interface residues of its closest homo-interologs. Given a query protein A and its interaction partner B, PS-HomPPI first identifies the set of homo-interologs of A-B by using BLASTP to identify the homologs of A and the homologs of B. From the BLASTP results, we identify a set of homo-interologs that meet sequence similarity thresholds (determined based on the results of our partner-specific interface conservation analysis, as described in the Results Section). We discard the whole PDB complex that contains A-B, to ensure an objective assessment of the reliability of our prediction procedure. For query A-B and its homologous interacting pair A'-B', we also discard the interacting protein pair A'-B' if A and A' or B and B' share ≥95% sequence identity and belong to the same species.
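The two exclusion rules for candidate homo-interologs can be sketched as a single filter. The reading of the second rule used here is an assumption: a candidate pair A'-B' is discarded if either side (A with A', or B with B') both shares ≥95% sequence identity and comes from the same species. The function name `keep_interolog`, the dictionary keys, and the example records are hypothetical.

```python
def keep_interolog(query, candidate, identity_cutoff=95.0):
    """Decide whether a candidate homo-interolog A'-B' may be used for query A-B.

    query: dict with 'pdb_id', 'species_a', 'species_b'.
    candidate: dict with 'pdb_id', 'species_a', 'species_b', and
    'identity_a' / 'identity_b' (percent identity of A-A' and B-B').
    """
    # Rule 1: drop anything from the PDB complex that contains the query pair.
    if candidate["pdb_id"] == query["pdb_id"]:
        return False
    # Rule 2 (assumed reading): drop near-identical, same-species pairs.
    too_close_a = (candidate["identity_a"] >= identity_cutoff
                   and candidate["species_a"] == query["species_a"])
    too_close_b = (candidate["identity_b"] >= identity_cutoff
                   and candidate["species_b"] == query["species_b"])
    return not (too_close_a or too_close_b)


# Hypothetical records for illustration only.
query = {"pdb_id": "1xyz", "species_a": "human", "species_b": "human"}
candidates = [
    {"pdb_id": "1xyz", "species_a": "human", "species_b": "human",
     "identity_a": 100.0, "identity_b": 100.0},   # same complex
    {"pdb_id": "2abc", "species_a": "human", "species_b": "mouse",
     "identity_a": 98.0, "identity_b": 70.0},     # A' too close, same species
    {"pdb_id": "3def", "species_a": "mouse", "species_b": "mouse",
     "identity_a": 98.0, "identity_b": 96.0},     # close, but different species
]
kept = [c["pdb_id"] for c in candidates if keep_interolog(query, c)]
print(kept)
```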
PS-HomPPI uses homo-interologs in the Safe and Twilight Zones to make predictions. The zone boundaries were determined using Trans135 and are shown in Table . The PS-HomPPI prediction process is similar to that of NPS-HomPPI in that it progressively searches for homo-interologs from higher, then lower, homology zones: i.e., if PS-HomPPI cannot find at least one homo-interolog in the Safe Zone, it next looks for homo-interologs in the Twilight Zone.
Boundaries of Safe, Twilight and Dark Zones used by PS-HomPPI.
PS-HomPPI predicts whether an amino acid in query sequence A is an interface residue or not based on the corresponding position in its alignment with (at most) K of the closest homo-interologs of A-B (based on their predicted IC scores). In our experiments, K was set equal to 10. Given a query-partner pair A-B, we label each position in the amino acid sequence of protein A as interface or non-interface based on whether or not a majority of the corresponding positions of the homologs of A within the homo-interologs of A-B are interface residues. More specifically, each of the at most K homo-interologs provides a positive vote for a given position in the query protein sequence A if the corresponding residue of its homolog A' in the homo-interolog is an interface residue, and a negative vote if it is a non-interface residue. The prediction score of PS-HomPPI for that position in the query sequence is simply the number of positive votes divided by the total number of votes. A residue in the query protein A with a prediction score ≥0.5 is predicted to be an interface residue; otherwise, it is predicted to be a non-interface residue.
PS-Interface Conservation As a Function of Sequence Alignment
We built a linear model for PS-interface conservation based on the important sequence alignment statistics identified in the PCA analysis: logEVal and Positive Score. Here A-B is the query protein pair and A'-B' is a homo-interolog of A-B. EVal_{AA'} and EVal_{BB'} are the EVal values between A and A', and between B and B'. positiveS_{AA'} and positiveS_{BB'} are the BLAST Positive Scores between A and A', and between B and B'. The model is
Variables, parameter estimates and coefficients are shown in Table . All the coefficients are significant.
Variables, Parameter Estimates and Significance Values for the Linear Model for PS-Interface Conservation.