ProteinDBS v2.0 is a web server designed for efficient and accurate comparison and search of structurally similar proteins in a large-scale database. It provides two comparison methods, global-to-global and local-to-local, to facilitate searches of protein structures or substructures. ProteinDBS v2.0 applies advanced feature extraction algorithms and scalable indexing techniques to achieve high running speed while preserving reasonably high precision of structural comparison. The experimental results show that our system is able to return results of global comparisons in seconds from a complete Protein Data Bank (PDB) database of 152 959 protein chains, and that it takes much less time than other accurate comparison methods to complete local comparisons against a non-redundant database of 3276 proteins. ProteinDBS v2.0 supports query by PDB protein ID and by new structures uploaded by users. To our knowledge, this is the only search engine that can simultaneously support global and local comparisons. ProteinDBS v2.0 is a useful tool to investigate functional or evolutionary relationships among proteins. Moreover, the common substructures identified by local comparison can potentially be used to assist the human curation process in discovering new domains or folds from the ever-growing protein structure databases. The system is hosted at http://ProteinDBS.rnet.missouri.edu.
Demand for an efficient and accurate search engine for 3D protein structures has continued to rise due to the dramatic increase in protein structural data and the role of protein structures in biological findings (1). The number of known protein structures in the primary structural database, the Protein Data Bank (PDB), had reached 63 271 (~152 959 protein chain structures) as of 16 February 2010, and is expected to continue growing at a high rate. The most important and difficult task in handling such a large number of protein structures is to compare a new structure quickly against all existing ones in the database, so as to discover potential biological connections. A high-throughput and accurate structural comparison method is essential for this task. Traditional comparison methods, such as DALI (2) and CE (3), are based on the calculation of a distance matrix of residues; they can provide accurate alignments but are usually computationally expensive.
In recent years, several approaches have been developed to improve the performance of structural comparison and search. Fast web servers, including TOPSCAN (4), YAKUSA (5), 3D-Blast (6) and iSARST (7), map protein structures into 1D sequences and then use various sequence alignment methods to align two structures. These approaches are efficient; however, 1D representations of 3D structures can lose details of structural topology, which may lead to lower accuracy than that of the more rigorous structural comparison methods (6).
To meet the strict efficiency and accuracy requirements imposed by a large-scale protein structure database, ProteinDBS was initially developed in 2003 to provide the community with a real-time web server for searching globally similar protein structures (8). The first generation of ProteinDBS has been widely used by the community and was recognized in the 3 September 2004 issue of Science (9). Since then, ProteinDBS has been continuously improved in performance and service so that it remains a useful resource for studying protein structures.
In the new version of ProteinDBS, major advancements include local structural comparison as well as biologist-friendly query interfaces and visualization tools. In contrast to the global comparison, which tries to superimpose most of the corresponding backbone atoms from two proteins, the local comparison seeks to find all common substructures between two proteins. As an example, gap-free common substructures are usually linked by coils of different lengths in a protein structure family. These multiple common substructures, which differ from canonical secondary structures, might be used not only for the discovery of new domains or folds in the structure database but also for the identification of functional and evolutionary relationships of protein structures, since they are more conserved than other regions (10–12). However, aligning protein substructures is known to be an NP-hard (non-deterministic polynomial-time hard) problem, and existing methods are rarely designed to handle this kind of problem. The problem becomes more challenging given the growing rate at which new protein structures are being added to the database. Hence, the main goal of ProteinDBS v2.0 is to tackle these issues and equip the community with user-friendly tools that can deliver efficient and accurate results for protein structure comparison and search.
ProteinDBS v2.0 has been optimized in the following aspects:
To our knowledge, ProteinDBS v2.0 is the only server that can simultaneously support large-scale comparison and search of globally and locally similar protein structures.
The system architecture of ProteinDBS v2.0, as shown in Figure 1, contains five modules: (i) protein structure database management; (ii) data pre-processing; (iii) query interface; (iv) distributed search engine; and (v) retrieval results visualization. A system tutorial can be viewed at the ProteinDBS web site.
ProteinDBS v2.0 maintains two independent databases of protein structures for global and local comparisons. The database of global comparisons is updated weekly and newly added protein structures are automatically downloaded from the PDB ftp site (ftp://ftp.wwpdb.org/pub/pdb/). For each new protein structure, a 2D distance matrix is generated from 3D coordinates of the protein chains. The distance matrices are empirically proven capable of representing global protein structural topologies. From the distance matrices, 33 features are then extracted, and a tree structure, an M-Tree (17), is utilized to index the multi-dimensional data.
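The mapping from 3D coordinates to a 2D distance matrix can be illustrated with a minimal sketch. It assumes that one representative atom per residue (for example Cα) has already been parsed from the PDB file. The function names and the three summary statistics are hypothetical and are not the 33 features extracted by the server, which are not listed here.

```python
import math

def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix

def toy_features(matrix):
    """Hypothetical summary features (mean, std, max of the upper triangle).

    These stand in for the server's 33 features purely for illustration.
    """
    n = len(matrix)
    upper = [matrix[i][j] for i in range(n) for j in range(i + 1, n)]
    mean = sum(upper) / len(upper)
    variance = sum((d - mean) ** 2 for d in upper) / len(upper)
    return [mean, math.sqrt(variance), max(upper)]

# Four made-up atom positions on a square with 3.8 A sides.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
dm = distance_matrix(ca)
features = toy_features(dm)
```

Because the matrix depends only on inter-atomic distances, it is unchanged by rotating or translating the chain, which is what makes it a convenient starting point for a fixed-length feature vector.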
The database of local comparisons is a non-redundant data set of protein chains selected from PDBSelect (16) and SCOP v1.75 (15). When a new data set is released, substructure units, defined as continuous backbone fragments of fixed length, are first identified from each protein in the data set using a sliding window. Our assumption is that a protein containing similar substructure units should be further investigated to find long common substructures. To search efficiently for proteins with similar substructures, the system first organizes structurally similar substructure units into clusters and selects a representative for each cluster. The representative is assigned a label called a ‘term’ in our server. The system then maps each protein structure into a series of terms by comparing its substructure units with the pre-defined representative of each cluster. Finally, the system utilizes an M-Tree to index the terms of the entire database of proteins to facilitate fast searches using information retrieval techniques.
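The sliding-window and term-mapping steps can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: each unit is compared to the cluster representatives through its intra-fragment distances, and the window width, the descriptor, and the two term labels are all hypothetical.

```python
import math

def sliding_units(coords, width):
    """Return every contiguous backbone fragment of `width` residues."""
    return [coords[i:i + width] for i in range(len(coords) - width + 1)]

def descriptor(unit):
    """Intra-fragment distances; unaffected by rotation or translation."""
    size = len(unit)
    return [math.dist(unit[i], unit[j])
            for i in range(size) for j in range(i + 1, size)]

def nearest_term(unit, representatives):
    """Return the label of the representative closest to this unit."""
    desc = descriptor(unit)
    return min(representatives,
               key=lambda term: math.dist(desc, representatives[term]))

# Hypothetical cluster representatives for 3-residue units.
representatives = {
    "extended": descriptor([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]),
    "bent": descriptor([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]),
}

# A made-up five-residue chain with one right-angle turn.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
         (7.6, 3.8, 0.0), (7.6, 7.6, 0.0)]
terms = [nearest_term(u, representatives) for u in sliding_units(chain, 3)]
```

Once every protein is reduced to such a term series, finding candidates that share substructure units becomes a text-retrieval-style lookup rather than a pairwise structural comparison.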
There are two types of query methods, as shown in the top block of Figure 1: local-to-local and global-to-global search. Both methods feature ‘query by ID’ and ‘query by structure example’. Using an internet browser, a user can upload a new protein chain structure in PDB format or provide a PDB ID contained in the protein database to find similar protein structures.
The global-to-global search, as mentioned previously, first maps the query protein structures into a 2D distance matrix and extracts features from the distance matrix. In this way, the query protein can be represented by a data point in the feature space populated by the entire protein database. Thus, one-against-all global comparison is analogous to searching nearest neighbors in feature space, and such a search can be completed in real time.
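To illustrate why one-against-all comparison reduces to a nearest-neighbor query, the sketch below ranks database chains by Euclidean distance to the query's feature vector. It deliberately uses a plain linear scan rather than the M-Tree index described above, and the chain identifiers and three-dimensional vectors are made up for the example.

```python
import heapq
import math

def nearest_neighbors(query, database, k):
    """Return the k closest (distance, chain_id) pairs by linear scan."""
    scored = ((math.dist(query, vector), chain_id)
              for chain_id, vector in database.items())
    return heapq.nsmallest(k, scored)

# Hypothetical feature vectors; the real system uses 33 features per chain.
database = {
    "chainA": [1.0, 2.0, 3.5],
    "chainB": [4.0, 6.0, 3.0],
    "chainC": [1.0, 2.0, 3.0],
}
hits = nearest_neighbors([1.0, 2.0, 3.0], database, k=2)
```

A linear scan visits every chain for each query. A metric index such as an M-Tree aims to return the same ranking while pruning most of the database, which is what makes searches over the full PDB practical.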
For the local-to-local search, the system first extracts substructure units of the query protein and then clusters them into groups. From the index of database terms, the system finds candidate proteins for comparison and filters out those proteins without common substructures. To achieve accurate substructure comparison, the system deploys a coarse-to-fine strategy to align the query protein and a database protein. Specifically, the system first finds relevant matches at the substructure level with a customized dynamic programming algorithm (14) and then refines the substructure alignment at the atomic level. This two-level alignment framework is a tradeoff allowing the system to achieve high efficiency without sacrificing accuracy of results.
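The coarse, substructure-level stage can be pictured as a local alignment over two term series. The sketch below uses a standard Smith-Waterman-style recurrence as a stand-in. The customized dynamic programming algorithm of reference (14) is not reproduced here, and the scoring values and term labels are assumptions chosen for illustration.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman-style local alignment score over two term series.

    Returns (best_score, (end_i, end_j)), where the end indices are
    1-based positions in `a` and `b` at which the best local match ends.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(0,
                              table[i - 1][j - 1] + step,  # align the two terms
                              table[i - 1][j] + gap,       # skip a term in a
                              table[i][j - 1] + gap)       # skip a term in b
            if table[i][j] > best:
                best, best_end = table[i][j], (i, j)
    return best, best_end

# Hypothetical term series for a query and a database protein.
query_terms = ["t1", "t2", "t3", "t4"]
target_terms = ["t9", "t2", "t3", "t7"]
score, end = local_align(query_terms, target_terms)
```

In a coarse-to-fine scheme, only the regions flagged by such a term-level match would then be passed to the more expensive atomic-level refinement.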
The global comparison retrieval results for a query chain 1o7j_A are shown in Figure 2. A set of top-ranked structures is returned to the user, eight at a time. To visualize the quality of the search results, a 3D superimposition of the query and the top retrieval result is displayed to the user. The user can select any of the ranked results from the top-right panel. Figure 2 presents a new interface for the superimposition view of the query protein chain and the top-ranked result, 1jsr_B, which is generated by clicking on the thumbnail image on the top-right panel. The sequence alignment result is also displayed to the user with root mean square deviation (RMSD) and alignment length values.
For a global structure search, users can expect real-time results. Local structure searches, however, usually take minutes, depending on the size of the query protein. Our system provides two options for retrieving the results of a local search: (i) after the query protein structure has been uploaded, the system returns a session ID for the query along with an estimated execution time. The user can then bookmark the link for the session ID and check the results page a few minutes later. (ii) If the user provides an email address when the query protein structure is uploaded, the system will send the ranked results to that address.
In the query page for local-to-local search, users can perform various search options by specifying (i) the view mode of the results; (ii) the session ID that was assigned after the query was submitted; and (iii) a threshold for substructure sizes. The local comparison method supports three types of result browsing modes: M1, in which the top 10 SCOP folds with the best matched protein structure are shown; M2, in which the top 100 matched protein structures from different SCOP folds are displayed; and M3, in which the top 10 SCOP folds are presented with all matched proteins from the same fold.
Figure 3 shows an example of M3, which organizes the retrieval results using a tree-view on the top-right panel. The top-left panel presents the superimposition of the query protein, 1o7j_A, and one of the top matched database proteins, 1gve_B, with a substructure size threshold of >3 residues. The common substructures are highlighted with different colors. The lower panel shows the sequence alignment result with RMSD and alignment length values. The users can use the ‘residue checkbox’ and the ‘substructure bar’ under the residues to interact with the 3D superimposition view. The superimposition is shown with all the qualified substructures at the beginning. Each substructure pair is differentiated with different colors in the 3D view and sequence alignment. When investigating a specific substructure, the users first use the hyperlink ‘Clear Display’ to hide all the substructures and then click on the ‘substructure bar’ to show the substructures in the 3D view. Similarly, clicking on the ‘residue checkbox’ will highlight one designated residue.
In addition, users can specify different display themes, such as backbone, cartoon, strand and dots, by clicking on the corresponding label. All aligned protein structures can be downloaded from the result pages.
Two performance evaluations have been conducted for ProteinDBS: retrieval accuracy and efficiency. A top-ranked result is counted as a good result if it comes from the same structural family as the query. With the new features introduced in the global comparison method, our system’s global search exhibits, on average, 97.04% precision at the 10% recall rate and 87.82% precision at the 100% recall rate. A query using the longest protein chain in the testing set (566 Cα atoms) takes 3.37 s to return the ranked search results. These tests were conducted on a Linux distributed system consisting of five servers (13).
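For readers unfamiliar with the metric, precision at a given recall level can be computed from a ranked result list as in the sketch below. It assumes the common definition (precision measured at the first rank where the target recall is reached); the exact evaluation protocol is described in reference (13), and the ranked list here is made up.

```python
def precision_at_recall(ranked_relevance, total_relevant, recall_level):
    """Precision at the first rank where recall reaches `recall_level`.

    `ranked_relevance` lists True/False per ranked result, best first.
    Returns 0.0 if the recall level is never reached.
    """
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += bool(is_relevant)
        if hits / total_relevant >= recall_level:
            return hits / rank
    return 0.0

# Made-up ranking: three same-family chains exist, found at ranks 1, 2 and 4.
ranking = [True, True, False, True, False]
p_low = precision_at_recall(ranking, total_relevant=3, recall_level=0.1)
p_full = precision_at_recall(ranking, total_relevant=3, recall_level=1.0)
```

In this toy ranking the precision is 1.0 at 10% recall and 0.75 at 100% recall; averaging such values over many queries gives figures of the kind reported above.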
We applied the local comparison method to SCOP fold classification and compared its performance with known algorithms, such as DALI (2), CE (3), MultiProt (18), SSM (19) and MAMMOTH (20), on a non-redundant database of 3276 protein chains selected from PDBSelect and SCOP v1.73. Our system was able to return ranked results in 182 s for query protein structures with an average length of 167 residues, which is 53.10, 10.87, 3.60 and 1.64 times faster than DALI, CE, MultiProt and MAMMOTH, respectively. Evaluated on three different data sets of non-redundant proteins from SCOP, the average accuracy of our system is approximately equal to that of DALI, better than that of MAMMOTH and significantly better than those of CE, MultiProt and SSM. These tests were conducted on a Linux Fedora server with AMD Opteron dual-core 1000 series processors and 2 GB RAM (14).
Over the past decade, we have witnessed a rapidly increasing number of protein structures, which poses a great challenge to search engines that retrieve structurally similar proteins. The ProteinDBS v2.0 web server presented in this article comes equipped with an efficient and accurate search engine, a large-scale protein structure database and a more user-friendly interface. ProteinDBS can return accurate results in seconds for global structure search and takes much less time for local structural searches compared to other accurate comparison methods while preserving higher or similar accuracy. It is expected that this web server will be beneficial to the life sciences community for comparative structural analysis, automatic fold classification and the discovery of functional and evolutionary connections between protein structures.
Shumaker Endowment in Bioinformatics. Funding for open access charge: University of Missouri Shumaker Endowment in Bioinformatics.
Conflict of interest statement. None declared.
The authors are grateful to the researchers and groups who made the following software packages and databases available for us to use in ProteinDBS: the Jmol package (http://jmol.sourceforge.net/), which generates views of aligned proteins; the SCOP database for ground truth testing (15); and the PDB (http://www.pdb.org) for maintaining the tertiary structures.