The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
We have developed the following web servers for protein structural modeling and analysis at http://theory.med.buffalo.edu: THUMBUP, UMDHMMTMHP and TUPS, predictors of transmembrane helical protein topology based on a mean-burial-propensity scale of amino acid residues (THUMBUP), a hidden Markov model (UMDHMMTMHP) and their combination (TUPS); SPARKS 2.0 and SP3, two profile–profile alignment methods that match input query sequence(s) to structural templates by integrating the sequence profile with a knowledge-based structural score (SPARKS 2.0) or a structure-derived profile (SP3); DFIRE, a knowledge-based potential for scoring the free energy of monomers (DMONOMER), loop conformations (DLOOP), mutant stability (DMUTANT) and binding affinity of protein–protein/peptide/DNA complexes (DCOMPLEX & DDNA); TCD, a program for protein-folding rate and transition-state analysis of small globular proteins; and DOGMA, a web server that allows comparative analysis of domain combinations between plant and 55 other organisms. These servers provide tools for the prediction and/or analysis of proteins at the secondary structure, tertiary structure and interaction levels.
In the post-genomics era, attention is squarely focused on the interconnections between the sequences, structures and functions of proteins. As more sequences from genome-sequencing projects and more structures from structural genomics projects become available, tools are urgently needed to extract the maximum amount of information from them in order to analyze and predict unknown structures and functions. We present a number of web-based servers available at http://theory.med.buffalo.edu, as shown in Table 1. They are THUMBUP, UMDHMMTMHP and TUPS for topology prediction of transmembrane helical proteins (1); SPARKS 2.0 (2) and SP3 (3) for sequence-to-structure fold recognition and alignment; the DFIRE energy function (4) for scoring structural monomer (DMONOMER) and loop conformations (DLOOP) (5), and for predicting mutant stability (DMUTANT) (4) and the binding affinity of protein–protein/peptide complexes (DCOMPLEX) (6) and protein–DNA complexes (DDNA) (7); TCD for analysis of folding kinetics (8,9); and DOGMA for comparative analysis of the plant domain graph (10). These servers can be classified as tools for the prediction and analysis of the secondary structures, tertiary structures and interactions of proteins, as shown in Figure 1. Details are described below.
Communication between the inside and the outside of cell membranes, and the regulation of that communication, is controlled mostly by transmembrane (TM) proteins. Most TM proteins are helical (TMH) proteins. Many different methods have been developed to predict the topology of TMH proteins (11–13). Determining the topology of a TMH protein is useful for the annotation of its function.
THUMBUP uses a simple scale of burial propensity and a sliding window-based algorithm to predict TM helical segments, and a positive-inside rule (14) to predict the N-terminal orientation. The use of burial propensity is based on the fact that helical membrane proteins are packed more tightly than helical soluble proteins (15). THUMBUP was found to give an excellent prediction for TM proteins with known structures (the 3D_helix database), but a relatively poorer prediction for a 1D_helix database, in which topology information was obtained by gene fusion and other experimental techniques (1). The latter was attributed in part to the high inaccuracy of the 1D_helix database employed (16–18).
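The two ingredients named above, a sliding window over a per-residue scale and the positive-inside rule, can be illustrated with a minimal Python sketch. It is not the THUMBUP implementation: the window size, the threshold, the minimum segment length and the scale passed in are all hypothetical placeholders (the published burial-propensity values are not reproduced here), and the orientation step simply assumes that the side of the membrane whose loops carry more Lys/Arg residues is the inside.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # 0-based, inclusive residue range


def predict_segments(seq: str, scale: Dict[str, float], window: int = 5,
                     threshold: float = 0.5, min_len: int = 3) -> List[Segment]:
    """Mark residues whose windowed mean score exceeds `threshold`
    and merge consecutive marked residues into segments.
    All numeric defaults are illustrative assumptions."""
    half = window // 2
    above = []
    for i in range(len(seq)):
        lo, hi = max(0, i - half), min(len(seq), i + half + 1)
        mean = sum(scale.get(aa, 0.0) for aa in seq[lo:hi]) / (hi - lo)
        above.append(mean > threshold)

    segments, start = [], None
    for i, flag in enumerate(above + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    return segments


def n_terminus_side(seq: str, segments: List[Segment]) -> str:
    """Positive-inside rule: loops alternate sides of the membrane, and the
    side richer in K/R is taken to be the inside (ties default to inside)."""
    bounds = [0] + [x for s, e in segments for x in (s, e + 1)] + [len(seq)]
    loops = [seq[bounds[k]:bounds[k + 1]] for k in range(0, len(bounds), 2)]

    def positives(loop: str) -> int:
        return loop.count("K") + loop.count("R")

    same_side = sum(positives(l) for l in loops[0::2])   # side of the N-terminus
    other_side = sum(positives(l) for l in loops[1::2])
    return "inside" if same_side >= other_side else "outside"
```

With a toy scale such as `{"L": 1.0}`, the sequence `"KKRR" + "L" * 8 + "DDEE"` yields one segment covering the leucine stretch and an N-terminus predicted on the inside.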
UMDHMMTMHP uses a modified version of hidden Markov model software developed at University of Maryland (version 1.02, http://www.cfar.umd.edu/~kanungo/software/software.html) for transmembrane-helical-topology prediction. The program differs from typical HMM-based methods for TMH proteins in that the parameters in UMDHMMTMHP were trained by the 3D_helix database only.
TUPS combines the prediction of THUMBUP and UMDHMMTMHP for TM segments and PHOBIUS (19) for the identification of signal peptides. More specifically, TUPS first takes the results from UMDHMMTMHP. Then, if a TM segment predicted by THUMBUP does not overlap with any TM segments predicted by UMDHMMTMHP, the segment is included in the TUPS prediction. Finally, signal peptides identified by PHOBIUS are removed from the TUPS prediction. There is no additional parameter introduced in TUPS other than the parameters determined in THUMBUP and UMDHMMTMHP.
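The combination rule described above can be written down in a few lines. The following Python sketch follows the three steps as stated; the function names are hypothetical, segments are represented as inclusive residue ranges, and removing a signal peptide is interpreted here as dropping any predicted TM segment that overlaps it, which is an assumption about a detail the text does not spell out.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # inclusive residue range (start, end)


def overlaps(a: Segment, b: Segment) -> bool:
    """True if the two inclusive ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def combine_tups(hmm_segments: List[Segment],
                 thumbup_segments: List[Segment],
                 signal_peptide: Optional[Segment] = None) -> List[Segment]:
    # Step 1: start from the UMDHMMTMHP prediction.
    combined = list(hmm_segments)
    # Step 2: add THUMBUP segments that overlap no UMDHMMTMHP segment.
    for seg in thumbup_segments:
        if not any(overlaps(seg, h) for h in hmm_segments):
            combined.append(seg)
    # Step 3: drop segments overlapping a signal peptide reported by an
    # external predictor (assumed interpretation of "removed").
    if signal_peptide is not None:
        combined = [s for s in combined if not overlaps(s, signal_peptide)]
    return sorted(combined)
```

No new parameter appears in the sketch, which mirrors the statement that TUPS introduces none beyond those of the two underlying predictors.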
In addition to the 3D and 1D helix datasets tested in the original paper (1), we tested THUMBUP and UMDHMMTMHP in the static benchmark established by Kernytsky and Rost (20). UMDHMMTMHP and THUMBUP, without any modification, provide 86 and 80% per-segment accuracy for the high-resolution dataset, respectively. They were ranked #1 and #3, respectively, among the methods compared in the static benchmark. Their performance on the low-resolution dataset was only about average, as expected. The new TUPS server provides 88% per-segment accuracy for the high-resolution dataset in this benchmark, with a significantly lower rate of misidentifying signal peptides as TM helices (3 versus 70 in UMDHMMTMHP and 28 in THUMBUP). TUPS also provides substantially better per-topology accuracy on our 3D_helix test set (1) (86% versus 75% by THUMBUP and 78% by UMDHMMTMHP).
The input is a protein sequence in FASTA format. Multiple sequences can also be submitted. The output provides information on the residue ranges of TM helices (if any) and the N-terminal orientation (inside or outside of the membrane, if the protein is a TMH protein) for every protein submitted. The output is now reported in a table format for easy understanding. A graphical interface will be built in the near future for visualizing the TM region. Sample input and output with detailed line-by-line explanations are available online.
Fold recognition refers to recognition of structural similarity of two proteins with or without significant sequence identity. One way to detect structural similarity is to identify remote sequence homology via sequence comparison. Advances have been made from the pairwise to multiple sequence comparison, from sequence-to-sequence, sequence-to-profile to profile-to-profile comparison. Another way to detect structural similarity is via sequence-to-structure threading. More recent works attempt to optimally combine the sequence and structure information for a more accurate/sensitive fold recognition. For a recent review, see Ref. (21).
Both fold recognition servers, SPARKS 2.0 (2) and SP3 (3), belong to the profile-based methods that provide sequence-to-structure alignment based on the sequence as well as the structure information of templates. SPARKS 2.0 and SP3 differ in how structural information is integrated with the sequence profile of templates. The former uses a sophisticated knowledge-based, single-body score that includes torsion, contact energy and surface-accessible potentials. The structure score is calculated by threading the query sequence into the template structure. The latter builds two separate sequence profiles from the sequence and the structure of a template. The structure-derived sequence profile is derived from depth-dependent structural alignment of the fragments in the template structure with the fragments in a fragment library. SPARKS 2.0, an upgraded version of SPARKS (2), takes the methods for parameter optimization, dynamic programming and template ranking from SP3 (3). Both SPARKS 2.0 and SP3 automatically update their sequence and template libraries every week, based on new releases from the NCBI (sequences) and the PDB (structures), respectively.
Testing on various benchmarks including LiveBench (22) indicates that SP3 is slightly more accurate than SPARKS 2.0. SPARKS 2.0 and SP3 are the two best servers for comparative modeling targets and are among the top single-method servers for all targets in the CASP 6 meeting that assessed 49 automatic webservers (http://predictioncenter.llnl.gov/casp6/meeting/presentations/talks.html).
The input for both SPARKS 2.0 and SP3 is the query sequence in FASTA format and the number of structure models to be built from the top-ranked templates. The structure models are built by MODELLER (23). It usually takes 30 min to a few hours to complete the fold recognition of a sequence (depending on the size of the query protein and the load of the server computer). The output (in html format) contains links to the PSI-BLAST output for the sequence profile, the PSIPRED output for the secondary structure prediction, the top 10 sequence-to-structure alignments and the structure models (in PDB format) built from the alignments. The significance of each sequence-to-structure alignment is indicated by its Z-score. An alignment is significant if its Z-score is >5.6 for SPARKS 2.0 or >6.3 for SP3. The thresholds were based on LiveBench 8 (22) for predicted models with MaxSub score (24) >0.01 when compared to their respective native structures. The output is now reported in a table format for easy understanding. Sample input and output with detailed line-by-line explanations are available online.
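For users who post-process the reported alignments, the significance criterion above amounts to a simple filter. The Python sketch below applies the two thresholds quoted in the text; the function name, the (template, Z-score) data layout and the template identifiers in the usage note are hypothetical and do not reflect the servers' actual output format.

```python
from typing import List, Tuple

# Z-score thresholds quoted in the text for each server.
Z_THRESHOLDS = {"SPARKS 2.0": 5.6, "SP3": 6.3}


def significant_alignments(server: str,
                           alignments: List[Tuple[str, float]]
                           ) -> List[Tuple[str, float]]:
    """Keep (template, z_score) pairs whose Z-score is strictly above
    the threshold for the given server."""
    cutoff = Z_THRESHOLDS[server]
    return [(template, z) for template, z in alignments if z > cutoff]
```

For example, given the made-up hits `[("tmplA", 7.1), ("tmplB", 6.3), ("tmplC", 5.9)]`, only `tmplA` passes for SP3, whereas all three pass for SPARKS 2.0.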
One bottleneck to the solution of the problems of how proteins fold, bind and function is the lack of an accurate energy function. The energy functions that are currently used by the computational biology community are obtained through either a physical-based (25) or a ‘bioinformatics-based’ statistical approach (26). Statistical energy functions are easy to produce and have been proven effective in many applications.
Our group developed an all-atom statistical potential based on a new reference state named Distance-scaled, Finite, Ideal-gas REference (DFIRE). The DFIRE-based energy function has been successfully applied to structure (4) and docking selections (6), loop scoring (5), prediction of mutation-induced change in stability (4), and binding affinity of protein–protein (peptide) (6), protein–ligand (7) and protein–DNA complexes (7). These applications resulted in several servers: DMONOMER and DLOOP for scoring protein monomer and loop conformations, respectively; DMUTANT for predicting mutant stability; DCOMPLEX and DDNA for predicting binding affinities of protein–protein/peptide complexes and those of protein–DNA complexes, respectively.
Comparisons between the DFIRE energy function and other knowledge-based or physical-based energy functions were made. For example, the DFIRE energy function was found to be comparable in accuracy to some physical-based energy functions equipped with various state-of-the-art solvation models [illustrated in loop selection (5)] or empirical energy functions with many adjustable terms [illustrated in docking (6) and prediction of protein–ligand binding affinities (7)]. The usefulness of the DFIRE energy-based servers was also independently verified in predicting protein stability of arc repressor mutants by using our webserver (27).
The input for DMONOMER, DCOMPLEX and DDNA is the atomic coordinates file in PDB format and the chain ID, while DLOOP needs additional input for the loop location. The outputs of these four servers are the corresponding DFIRE energy scores and/or binding affinities. DCOMPLEX also indicates whether the input complex is a genuine homodimer or a crystal artifact. The inputs for DMUTANT are the structure file, the chain ID and the residue position. The output is the stability change due to the mutation of the specified residue into each of the 19 other residues. Note that the binding affinities predicted by DCOMPLEX and DDNA were shifted and/or scaled based on the test sets used in the publications. Sample input and output with detailed explanations are available online for each server.
Our group developed a parameter called total contact distance (TCD) to predict folding rates of small two-state proteins (8). This parameter was built on the observation that either contact order (CO) or long-range order (LRO) parameter has a significant correlation with the logarithms of folding rates (28,29).
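To make the underlying idea concrete, the Python sketch below computes the relative contact order, i.e. the mean sequence separation of contacting residue pairs divided by the chain length. It illustrates the CO parameter only and is not the TCD definition, which is given in ref. (8). The one-point-per-residue representation, the 8 Å distance cutoff and the minimum sequence separation are simplifying assumptions made for this example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def relative_contact_order(coords: List[Point], cutoff: float = 8.0,
                           min_sep: int = 2) -> float:
    """Relative contact order: sum of |i - j| over contacting residue
    pairs, divided by (chain length * number of contacts).
    `coords` holds one representative point per residue (an assumption);
    `cutoff` (in angstroms) and `min_sep` are illustrative defaults."""
    n_res = len(coords)
    total_sep = 0
    n_contacts = 0
    for i in range(n_res):
        for j in range(i + min_sep, n_res):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_sep += j - i
                n_contacts += 1
    if n_contacts == 0:
        return 0.0
    return total_sep / (n_res * n_contacts)
```

Larger values indicate that contacts are, on average, formed between residues far apart in sequence, which is the property that correlates with slower folding in the studies cited above.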
The TCD web-server takes the inputs of the structure file, chain ID and residue range of interest for a specific protein. Its output is the calculated value of TCD as well as the predicted folding rate. The auxiliary TCD transition-state server presents the predicted TCD, the approximate size of the folding transition state of a given protein (9).
Proteins are made of functional domains. One effective method to uncover the function of proteins on a genomic scale is by analyzing the network graph of domain–domain interactions (30). A domain graph consists of all domains found in a given proteome. Each vertex (node) represents a distinct domain and two vertices are linked by an edge if they occur together in at least one protein.
DOGMA is an online server that implements the CADO (Comparative Analysis of Protein Domain Organization) algorithms (31) and applies them to the comparative analysis of domain graphs between plant and 55 other organisms (9 eukaryotes, 30 bacteria and 16 archaea) (10). The input includes the name(s) of Pfam domain(s) (32) and the organism(s) to be compared with plant (taking Arabidopsis as the representative). Depending on the option chosen, the output can be a domain graph, the shortest path between two given domains, a phylogenetic profile, and others, in both comparative and graphical format. Although the original paper is about the comparison between plant and other proteomes, DOGMA can be used to analyze any one proteome against the other 55.
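The domain graph defined above, and a shortest-path query on it, can be sketched in Python as follows. This is an illustration of the graph definition rather than the CADO or DOGMA code; the function names and the proteome in the usage note are hypothetical, and the breadth-first search is simply the standard way to find a shortest path in an unweighted graph.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional, Set


def build_domain_graph(proteome: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Each distinct domain is a vertex; two domains share an edge if
    they occur together in at least one protein."""
    graph: Dict[str, Set[str]] = {}
    for domains in proteome.values():
        unique = set(domains)
        for d in unique:
            graph.setdefault(d, set())
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def shortest_path(graph: Dict[str, Set[str]], start: str,
                  goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the list of domains on one shortest
    path, or None if the two domains are not connected."""
    if start not in graph or goal not in graph:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

For a toy proteome `{"P1": ["SH2", "SH3"], "P2": ["SH3", "Pkinase"], "P3": ["WD40"]}`, the path from `SH2` to `Pkinase` passes through `SH3`, while `WD40` is an isolated vertex with no path to the others.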
This work was supported by NIH (R01 GM 966049 and R01 GM 068530), a grant from HHMI to SUNY Buffalo and by the Center for Computational Research and the Keck Center for Computational Biology at SUNY Buffalo. Y.Z. is also in part supported by a two-base fund (No. 203240420391) from National Science Foundation of China. Funding to pay the Open Access publication charges for this article was provided by NIH.
Conflict of interest statement. None declared.