Protein–protein interactions are not limited to direct physical binding. Proteins may also interact indirectly: by sharing a substrate in a metabolic pathway, by regulating each other transcriptionally, or by participating in larger multi-protein assemblies. For predicting such functional associations (including direct binding), the current growth in the number of completed genomes offers unique opportunities through so-called ‘genomic context’ or ‘nonhomology-based’ inference methods (1). These methods are based on the observation that functionally associated proteins are encoded by genes that share similar selection pressures: the genes need to be maintained together, and regulated together, so that the encoded proteins can interact at the same time and place in the cell. This leaves signals in the genome, which become detectable above the noise of random genomic events when multiple species are analyzed. For example, the need to maintain functionally associated genes together can become visible as an agreement in occurrence patterns across several genomes (4): the genes tend to be either present together or absent together, that is, they have the same ‘phylogenetic profile’. This is particularly informative when the profile is not in agreement with organismal phylogeny, as is the case when horizontal transfers or gene losses are involved (6). Likewise, the need for similar regulation is often reflected in a tendency of functionally associated genes to be close neighbors in prokaryotic genomes (8), where they generally have the same transcriptional orientation and little or no sequence between them. This suggests that they form single transcription units (operons), recurring in similar but not identical composition across several genomes (10). Finally, genes whose protein products need to interact closely in the cell have a noticeable tendency to be fused into a single gene, encoding a combined polypeptide (11) in which the proteins have a higher chance of interacting productively.
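The phylogenetic-profile idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene names (`geneA` to `geneD`), the genome labels, and the presence/absence data are hypothetical, and the Jaccard index is used only as one simple similarity measure, not as the measure used by any particular database. It also ignores organismal phylogeny, which a real method would need to account for.

```python
from itertools import combinations

# Hypothetical toy data: gene -> set of genomes in which an ortholog occurs.
genomes = ["g1", "g2", "g3", "g4", "g5", "g6"]
profiles = {
    "geneA": {"g1", "g2", "g4", "g6"},
    "geneB": {"g1", "g2", "g4", "g6"},
    "geneC": {"g1", "g2", "g4"},
    "geneD": {"g1", "g2", "g3", "g4", "g5", "g6"},  # present everywhere
}


def profile_vector(present, genomes):
    """Return the presence/absence profile as a tuple of 0/1 values."""
    return tuple(int(g in present) for g in genomes)


def jaccard(a, b):
    """Jaccard similarity of two occurrence sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_pairs(profiles):
    """Score every gene pair and return (score, gene1, gene2), best first."""
    scored = [
        (jaccard(profiles[x], profiles[y]), x, y)
        for x, y in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, x, y in rank_pairs(profiles):
        print(f"{x}-{y}: {score:.2f}")
```

In this toy data set, `geneA` and `geneB` share an identical profile and therefore rank first, whereas the universally present `geneD` scores lower against every other gene because its presence carries little information about co-occurrence.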
Optimal, user-friendly exploitation of genomic context for the prediction of functional interactions requires: (i) a benchmarked scoring scheme that integrates the three types of context and gives a confidence value for each prediction, (ii) automatic implementation and orthology assignment of the genes in newly published genomes, and (iii) easy navigation between various displays so that not only the pairwise interactions, but also the network of interactions and the presence of potential (sub)modules in the network become visible. Previous genomic context databases such as Indigo (13), the first version of STRING (14), the Clusters of Orthologous Groups (COG) database (15), Predictome (16), and SNAPper (17) either rely on only a single form of genomic context or, where they do include multiple forms (Predictome and COG), do not integrate them; nor does any of these databases indicate the reliability of its predictions. Such an indication of reliability is necessary: with the ever-increasing number of genomes, the number of predictions can become quite large and, depending on the parameters, may include many false positives. We took the opportunity of a complete redesign of STRING to introduce such a scoring scheme, derived by integrating all three types of genomic context. Additionally, STRING is now continuously updated and the predictions are fully precomputed. Particular emphasis has been placed on fast and easy navigation, coupled to integrated visual outputs (see Fig. 1 for an example output of STRING).
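To make concrete what integrating several types of evidence into one confidence value could look like, the following sketch combines per-method scores under a naive independence assumption (a 'noisy-OR'). This is an illustration only: the text above does not specify how the scores are integrated, the function name `combine_independent` is hypothetical, and the sketch assumes that each input is already a benchmarked probability between 0 and 1.

```python
from math import prod


def combine_independent(scores):
    """Combine per-method confidence values, assuming independent evidence.

    The association is considered unsupported only if every method fails,
    so the combined confidence is 1 minus the product of (1 - score).
    """
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    return 1.0 - prod(1.0 - s for s in scores)


if __name__ == "__main__":
    # Hypothetical scores for neighborhood, fusion and co-occurrence evidence.
    print(combine_independent([0.6, 0.0, 0.5]))
```

Under this assumption, weak evidence from several context types reinforces itself, while a method with no signal (score 0) leaves the combined value unchanged.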
Figure 1. An example of a functional module detected by STRING. The module encompasses the prototypic phosphate-regulon, an active uptake-system for inorganic phosphate found in most, but not all, prokaryotes (24). (A) Network view. Green lines connect proteins ...