Known interactions in STRING are primarily imported from existing excellent interaction databases (1), and are complemented by automated text mining of PubMed abstracts and several other bodies of scientific text [such as from Ref. (6)]. As is the case for all interactions in STRING, imported interactions are mapped onto a consistent set of proteins and identifiers, thereby facilitating comparison between datasets. STRING does not store specific details regarding splicing isoforms or post-translational modifications, but instead reduces protein isoforms to a single protein per locus (usually as defined by the longest known protein-coding transcript). This level of resolution enables efficient storage and is compatible with most prediction/transfer algorithms, which usually operate only at the level of the gene locus.
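The reduction of isoforms to one protein per locus can be illustrated with a minimal Python sketch. It is not STRING's actual pipeline: the record layout `(locus_id, isoform_id, protein_sequence)`, the function name and the example identifiers are all hypothetical, and protein sequence length is used here as a simple stand-in for "longest known protein-coding transcript".

```python
from typing import Dict, Iterable, Tuple


def longest_isoform_per_locus(
    isoforms: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Map each locus to (isoform_id, sequence) of its longest isoform.

    Input records are (locus_id, isoform_id, protein_sequence); this
    layout is an assumption made for illustration only.
    """
    best: Dict[str, Tuple[str, str]] = {}
    for locus, isoform_id, seq in isoforms:
        current = best.get(locus)
        # On ties, keep the isoform seen first so the result is deterministic.
        if current is None or len(seq) > len(current[1]):
            best[locus] = (isoform_id, seq)
    return best


# Hypothetical toy input: two isoforms for locusA, one for locusB.
records = [
    ("locusA", "A-1", "MKT"),
    ("locusA", "A-2", "MKTLLV"),
    ("locusB", "B-1", "MSG"),
]
representatives = longest_isoform_per_locus(records)
```

With this toy input, `locusA` is represented by isoform `A-2`, and all interaction evidence for either isoform would be attached to that single entry.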
Known interactions are further complemented by de novo interaction predictions derived from several comparative genomics prediction algorithms that are mainly applicable to prokaryotes (13). These algorithms systematically compare genomes, searching for frequently observed gene neighborhoods, gene fusion events and similarities in gene occurrence across genomes. For each prediction algorithm, dedicated viewers of the genomic evidence are available in STRING.
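Of the three signals, similarity in gene occurrence across genomes is the easiest to sketch. The Python example below compares presence/absence profiles with a Jaccard index. This is an assumption chosen for brevity, not the scoring scheme STRING uses, and the gene and genome names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def cooccurrence_scores(
    profiles: Dict[str, Set[str]]
) -> List[Tuple[str, str, float]]:
    """Score every gene pair by the Jaccard similarity of their profiles.

    `profiles` maps a gene family to the set of genomes in which it occurs.
    """
    scores = []
    for a, b in combinations(sorted(profiles), 2):
        union = profiles[a] | profiles[b]
        if not union:
            continue  # neither gene occurs anywhere; no evidence either way
        jaccard = len(profiles[a] & profiles[b]) / len(union)
        scores.append((a, b, jaccard))
    # Highest-scoring pairs first.
    return sorted(scores, key=lambda s: s[2], reverse=True)


# Hypothetical profiles: geneX and geneY always occur together.
profiles = {
    "geneX": {"genome1", "genome2", "genome3"},
    "geneY": {"genome1", "genome2", "genome3"},
    "geneZ": {"genome4"},
}
ranked = cooccurrence_scores(profiles)
```

Gene pairs that are consistently present or absent together rank at the top and become candidate functional partners. A realistic implementation would also have to correct for the phylogenetic relatedness of the genomes being compared.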
Interaction evidence from model organisms is often useful for other organisms as well, especially when orthologs of interacting proteins can be clearly identified in the second organism. STRING systematically executes such orthology transfers, using both precomputed orthologs from the COG database (20) and a homology-based orthology scheme computed de novo. STRING can thus immediately predict a large number of interactions for any newly sequenced genome, as soon as it is included in the system. The combination of known, predicted and transferred interactions is unique, making STRING the most comprehensive interaction resource available to date, especially for organisms not addressed experimentally.
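The idea behind orthology transfer can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the ortholog map is taken as given, confidence scores are ignored entirely, and all protein names are hypothetical. It does not reproduce STRING's actual transfer procedure.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def transfer_interactions(
    source_interactions: Iterable[Tuple[str, str]],
    orthologs: Dict[str, Set[str]],
) -> Set[FrozenSet[str]]:
    """Project interactions from a source organism onto a target organism.

    `orthologs` maps a source protein to its orthologs in the target.
    An interaction is transferred only if both partners have orthologs.
    """
    transferred: Set[FrozenSet[str]] = set()
    for a, b in source_interactions:
        for target_a in orthologs.get(a, ()):
            for target_b in orthologs.get(b, ()):
                if target_a != target_b:  # skip self-interactions
                    transferred.add(frozenset((target_a, target_b)))
    return transferred


# Hypothetical example: only the first interaction has orthologs for both partners.
source = [("yeastP1", "yeastP2"), ("yeastP2", "yeastP3")]
ortholog_map = {"yeastP1": {"newA"}, "yeastP2": {"newB"}}
predicted = transfer_interactions(source, ortholog_map)
```

Because the lookup depends only on the ortholog map, predictions for a newly added genome are available as soon as its orthologs have been assigned.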
The homology data stored in STRING form the basis for the interaction transfers, and are the result of more than 7 × 10¹¹ pairwise protein comparisons using the sensitive Smith–Waterman dynamic programming algorithm. This dataset is a very useful asset in itself [see also (21)], and can be accessed independently of the protein interaction networks by locally installing the STRING database files. Users of the website can also browse all of the homologs detected for any protein of interest, and can inspect alignments with very fast response times (Figure 2).
Figure 2. Precomputed homology relations and alignments. For most genomes contained in STRING, sensitive all-against-all homology searches using the Smith–Waterman algorithm are included. These form the basis for assigning orthologs and transferring interaction ...
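For readers unfamiliar with the Smith–Waterman algorithm, the following Python sketch computes the optimal local alignment score of two sequences. It is a minimal illustration. The match, mismatch and linear gap scores are arbitrary assumptions, whereas production protein searches typically use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(
    a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2
) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses the standard recurrence with a floor of zero, keeping only two
    rows of the dynamic programming matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
embedded = smith_waterman_score("XXACGTXX", "ACGT")
```

The cost of a single comparison grows with the product of the two sequence lengths, which is why precomputing and storing the all-against-all results is what makes fast browsing of homologs possible.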