Functional partnerships between proteins are at the core of complex cellular phenotypes, and the networks formed by interacting proteins provide researchers with crucial scaffolds for modeling, data reduction and annotation. STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins. The most important new developments in STRING 8 over previous releases include a URL-based programming interface, which can be used to query STRING from other resources, improved interaction prediction via genomic neighborhood in prokaryotes, and the inclusion of protein structures. Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms, providing the most comprehensive view on protein–protein interactions currently available. STRING can be reached at http://string-db.org/.
In contrast to genome sequences, which are quickly becoming a commodity, determining the functional connectivity within a proteome remains a much more challenging problem. The various protein complexes, transient interactions and functional pathways are all context-dependent, and the experimental techniques for their elucidation are diverse, often not directly comparable, and less reliable than genome sequencing. Nevertheless, protein–protein interaction networks (also called ‘association networks’ when functional associations are included) are a crucial ingredient for any system-level understanding of cellular machineries (1–5). Furthermore, protein networks can serve very concrete, practical purposes such as filtering and assessing high-throughput functional genomics data, and providing intuitive visual scaffolds for annotating the structural, functional and evolutionary properties of proteins.
The database and web-tool STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a meta-resource that aggregates most of the available information on protein–protein associations, scores and weights it, and augments it with predicted interactions, as well as with the results of automatic literature-mining searches. Since its first release in 2000 (6), it has grown into the most comprehensive resource of its type. It builds upon and extends the excellent, manual annotation efforts undertaken at primary protein interaction databases (7–12) and at databases of curated pathway knowledge (13–15). Here, we describe new features that have been added since our report on the previous release, STRING 7 (16).
The basic interaction unit in STRING is the ‘functional association’, which is defined in this database as the specific and meaningful interaction between two proteins that jointly contribute to the same functional process. With respect to the interacting proteins, STRING does not consider any specific splicing isoforms or posttranslational modifications, but instead represents each protein-coding locus in a genome by a single protein (the longest isoform). Thus, and because STRING aggregates data and predictions stemming from a wide spectrum of cell types and environmental conditions, it aims to represent the union of all possible protein–protein links. From this union, the actual network for any given spatio-temporal snapshot of the cell can in principle be deduced by projection, for example by removing proteins known not to be expressed or active under the conditions studied (17).
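The projection step described above can be sketched as a simple filter over an edge list. This is a minimal illustration only, not STRING's implementation: the protein names, the scores and the `project_network` helper are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# (protein_a, protein_b, confidence score); all values below are made up.
Edge = Tuple[str, str, float]


def project_network(edges: Iterable[Edge], expressed: Set[str]) -> List[Edge]:
    """Keep only the edges whose two endpoints are both in `expressed`."""
    return [(a, b, s) for a, b, s in edges if a in expressed and b in expressed]


# Hypothetical union network covering all conditions.
union_network = [
    ("protA", "protB", 0.92),
    ("protB", "protC", 0.71),
    ("protC", "protD", 0.45),
]

# Hypothetical set of proteins expressed under the condition of interest.
expressed_in_condition = {"protA", "protB", "protC"}

condition_network = project_network(union_network, expressed_in_condition)
```

Here `protD` is absent from the expression set, so the `protC`/`protD` link is dropped while the other two links are retained.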
In keeping with the above definitions, STRING imports protein association knowledge not only from databases of physical interactions, but also from databases of curated biological pathway knowledge. Apart from the resources already included in the previous release [MINT (10), HPRD (9), BIND (12), DIP (11), BioGRID (8), KEGG (13) and Reactome (14)], a number of resources have been newly included [IntAct (7), EcoCyc (15), NCI-Nature Pathway Interaction Database and Gene Ontology (GO) protein complexes]. For the full STRING release, this set of previously known and well-described interactions is then complemented by interactions that are predicted computationally, specifically for STRING, using a number of prediction algorithms (18,19). First, we conduct systematic searches for genes that are found in close proximity within prokaryotic chromosomes, which is a good indicator of functional linkage. Second, we search for instances where genes have joined to encode a single fusion protein, which is indicative of functional linkage even in organisms where the two proteins have not fused. Third, we search for gene families that share above-random similarities in their evolutionary histories (i.e. they have similar ‘phylogenetic profiles’). This, again, predicts that they contribute to similar functional processes in the cell. Fourth, we conduct searches for genes that display a similar transcriptional response across a variety of conditions (co-expression). Individually, the above predictors may not always have the specificity of direct experimental interaction assays; however, when used in concert and integrated probabilistically, the performance of even relatively weak predictors can rival that of experimental data (20).
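One common way to integrate independent evidence channels probabilistically is a noisy-OR combination: the combined confidence is one minus the probability that every channel is wrong. The sketch below assumes that the channels are independent and ignores any prior correction. The text above does not give the exact formula used, so the `combine_scores` function and the example scores are illustrative assumptions.

```python
from math import prod
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> float:
    """Noisy-OR combination of per-channel confidence scores in [0, 1].

    Assumes the channels are independent (an assumption of this sketch).
    """
    values = list(scores)
    if any(not 0.0 <= s <= 1.0 for s in values):
        raise ValueError("scores must lie between 0 and 1")
    # Probability that at least one channel is right.
    return 1.0 - prod(1.0 - s for s in values)


# Hypothetical scores for one protein pair from three weak predictors,
# e.g. neighborhood, phylogenetic profile and co-expression.
combined = combine_scores([0.4, 0.5, 0.3])
```

With these made-up inputs the combined score is 1 - (0.6 × 0.5 × 0.7) = 0.79, higher than any single channel, which illustrates how several weak predictors can reinforce each other.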
Lastly, two further sources of interactions in STRING actually provide the majority of associations: text-mining and interaction transfer between organisms. For the former, we parse a large body of scientific texts [SGD (21), OMIM (22), The Interactive Fly, and all abstracts from PubMed]. We search for statistically relevant co-occurrences of gene names, and also extract a subset of semantically specified interactions using Natural Language Processing (23). For the transfer of interactions between organisms, we estimate whether a pair of interacting proteins found conserved in another organism justifies the transfer of the interaction to that other organism (24). The transferred interactions, as well as all predicted or imported interactions, are benchmarked and scored against a common reference of functional partnership [we currently use the joint membership of proteins in biological pathways, as annotated at KEGG (13), as our gold-standard].
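The co-occurrence idea can be illustrated by counting how often two gene names are mentioned in the same abstract. The sketch below only counts raw co-mentions in a toy corpus; it does not assess statistical relevance and does not perform the Natural Language Processing step. The gene names, the abstracts and the `cooccurrence_counts` helper are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def cooccurrence_counts(abstracts: Iterable[str], gene_names: Set[str]) -> Counter:
    """Count, per gene pair, the number of abstracts mentioning both genes."""
    counts: Counter = Counter()
    for text in abstracts:
        # Crude tokenization into alphanumeric words.
        tokens = set(re.findall(r"[A-Za-z0-9]+", text))
        mentioned = sorted(g for g in gene_names if g in tokens)
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return counts


# Toy corpus with hypothetical gene names.
abstracts = [
    "geneA binds geneB in vitro.",
    "geneA and geneB are co-regulated with geneC.",
    "geneC is essential.",
]
counts = cooccurrence_counts(abstracts, {"geneA", "geneB", "geneC"})
```

A real pipeline would additionally have to handle gene-name synonyms and ambiguity, and compare the observed counts with those expected by chance.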
Together, the above sources of interactions, including predictions and transfers, result in a uniquely high coverage of the interaction networks stored in STRING (Figure 1), particularly for well-studied model organisms. Since the previous release, STRING has almost doubled the number of supported organisms, which now stands at 630. The number of stored interactions has increased as well, to a total of more than 50 million. Since the various subtypes of the interaction evidence are stored separately in the database, they can be disabled at will—giving users the ability to adjust the scope and specificity of STRING towards their particular application.
When working with prokaryotes, scientists have long used conserved genomic neighborhood arrangements of genes to infer functional linkage, assuming that such arrangements reflect polycistronic transcription units (operons). STRING has followed this principle, compiling and benchmarking protein–protein associations based on close, co-directional neighborhood of genes on the genome. As of version 8, this has been extended to cover also neighboring genes that are counter-directional in a head-to-head orientation (‘divergent transcription’). Such divergently oriented gene pairs have been shown to be indicative of functional linkage as well (25), albeit with somewhat lower confidence. Often, one of the two genes is a transcriptional regulator, targeting the neighboring gene (25). STRING now uses this type of arrangement in its neighborhood algorithm as well (benchmarked separately, Figure 2). In addition, STRING is now more error tolerant when assembling conserved neighborhoods, ignoring short, partially overlapping genes on the antisense strand that are likely to be spurious predictions.
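The distinction between co-directional and divergent (head-to-head) neighbors can be made from gene coordinates and strands alone. The following sketch classifies a pair of adjacent genes; the `Gene` record, the `classify_pair` function and the 300 bp gap threshold are assumptions made for illustration and are not taken from STRING.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate on the chromosome
    end: int     # rightmost coordinate on the chromosome
    strand: str  # "+" or "-"


def classify_pair(first: Gene, second: Gene, max_gap: int = 300) -> str:
    """Classify two neighboring genes by their relative orientation.

    `max_gap` (in base pairs) is an arbitrary illustrative threshold.
    """
    left, right = sorted((first, second), key=lambda g: g.start)
    if right.start - left.end > max_gap:
        return "distant"
    if left.strand == right.strand:
        return "co-directional"
    if left.strand == "-" and right.strand == "+":
        # The 5' ends face each other: head-to-head, divergent transcription.
        return "divergent"
    return "convergent"


# Hypothetical genes.
same = classify_pair(Gene("geneA", 100, 900, "+"), Gene("geneB", 950, 1800, "+"))
head_to_head = classify_pair(Gene("geneC", 100, 900, "-"), Gene("geneD", 1000, 1900, "+"))
```

In this sketch only the "co-directional" and "divergent" classes would feed into a neighborhood score; convergent or distant pairs would be ignored.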
For each update, STRING now parses all entries of the PDB database of protein structures (26). The use of protein structures is two-fold: first, to inform the user that a given protein—or a close homolog thereof—indeed has 3D structure information. In this case, a small preview of a representative structure is shown in the network, and the user can follow it to view the full structure and to proceed to the PDB website. Second, protein structures serve as interaction evidence themselves, when more than one distinct peptide chain is found in the structure. In this case, a stable and reliable protein–protein interaction is assumed.
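The second use, deriving interaction evidence from structures with more than one distinct chain, can be sketched as follows. The input is assumed to be a mapping from structure entries to their chains, with each chain already mapped to a protein identifier. The entry names, the protein names and the `structure_interactions` helper are hypothetical, and treating "distinct" as "mapped to different proteins" is an interpretation made for this sketch.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def structure_interactions(
    entry_chains: Dict[str, Dict[str, str]]
) -> Set[Tuple[str, str, str]]:
    """Return (protein_a, protein_b, entry) for entries with >1 distinct protein.

    `entry_chains` maps an entry id to {chain id: protein id}.
    """
    pairs: Set[Tuple[str, str, str]] = set()
    for entry_id, chains in entry_chains.items():
        proteins = sorted(set(chains.values()))
        # Entries with a single distinct protein yield no pairs here.
        for a, b in combinations(proteins, 2):
            pairs.add((a, b, entry_id))
    return pairs


# Hypothetical structure entries.
entries = {
    "entry1": {"A": "protX", "B": "protY"},  # two different proteins
    "entry2": {"A": "protZ", "B": "protZ"},  # two copies of one protein
    "entry3": {"A": "protX"},                # single chain
}
evidence = structure_interactions(entries)
```

Only `entry1` contributes a pair in this example; whether homo-oligomeric structures such as `entry2` should count as evidence is left open here.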
To facilitate the integration of STRING into network tools like Cytoscape (27) and workflow engines like Taverna (28), we have created an application programming interface (API) that allows access to the interaction network in computer-readable formats (Figure 3). Additionally, specific API functions allow retrieval of individual records from our database, for example to map a protein via its name onto a STRING entry. We further envision that the STRING API will be useful to developers of web services who plan to make use of the STRING interaction network. If a particular web service needs access to the complete set of interactions, it may still be advisable to maintain a local copy of our data distribution. However, if the service requires access to many different subsets (depending on user input), querying STRING via its API could reduce administrative load.
The API is called by constructing a URL that contains the type of the request, the desired output format and the input items. The STRING server then returns the result of the computation in the desired format. Further documentation can be accessed via the STRING homepage.
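As a rough illustration of the three components named above, the following sketch assembles a request URL from an output format, a request type and an input item. The path layout (`/api/<format>/<request>`), the parameter name `identifier`, and the example values `tsv`, `interactions` and `exampleProtein` are assumptions made for this sketch; the STRING documentation is the authoritative source for the actual scheme. No request is sent.

```python
from urllib.parse import urlencode

# Host taken from the text; the "/api" path segment is an assumption.
BASE_URL = "http://string-db.org/api"


def build_api_url(output_format: str, request_type: str, **params: str) -> str:
    """Assemble a URL from output format, request type and input parameters."""
    # Sort the parameters so that the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{output_format}/{request_type}?{query}"


# Hypothetical format, request type and identifier.
url = build_api_url("tsv", "interactions", identifier="exampleProtein")
```

The resulting string could then be passed to any HTTP client, for example `urllib.request.urlopen`, once the format and request names have been checked against the official documentation.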
Apart from the ad hoc and barrier-free access through the website, STRING can be downloaded and used locally, either in the form of concise flat-files or as a mirror installation of the complete relational database back-end (some of the downloads do require a free, nonredistribution license applicable to academic nonprofit users). The interacting entities in STRING can be set to be either proteins, or groups of orthologs spanning multiple organisms (‘COG-mode’). For the latter, STRING relies on an updated and extended version of the COGs [‘Clusters of Orthologous Groups’ (29)], which is being maintained at the eggNOG database (30). A variety of other databases use STRING networks as a basis for further computations/annotations, for example by augmenting the networks with small molecules [STITCH, (31)], or by using the network to increase the power of kinase–substrate predictions [NetworKIN, (32)]. STRING has also been integrated into third-party tools such as NeAT [Network Analysis Tools, (33)], which provides various ways to analyze the interaction network, or Gaggle (34), which enables automated data transfer into other tools via a browser add-on.
Swiss Institute of Bioinformatics; University of Zurich through its Research Priority Program ‘Systems Biology and Functional Genomics’; European Commission's FP6 Programme through the ADIT Integrated Project (LSHB-CT-2005-511065); BioSapiens Network of Excellence (LSHG-CT-2003-503265). Funding for open access charge: University of Zurich.
The authors wish to thank Dianna Fisk from the Saccharomyces Genome Database, and Thomas B. Brody from The Interactive Fly, for access to gene summary paragraphs. Code development was partially conducted at the ‘WebService BioHackathon 2008’ in Tokyo, Japan.